We're helping The New York Times (hereafter NYT) build a content-based recommendation system, and you can treat it as a very simple example of recommender development. Based on an article the user has recently read, we'll recommend new articles worth reading, and all that takes is recommending similar content to the user, using the text of that article as the basis.
Examining the data
Below is an excerpt from the first NYT article in the dataset; the text has already been processed.
'TOKYO — State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]'
The first problem to solve is how to vectorize this text, and how to engineer new features such as Parts-of-Speech, N-grams, sentiment scores, or Named Entities.
Clearly the NLP rabbit hole is worth exploring in depth, and you could spend a great deal of time experimenting with existing techniques. But real science often starts by trying the simplest workable solution, so that later iterations can improve on it.
In this article, that simple, workable solution is exactly what we will build.
Splitting the data
We need to prepare the data by selecting the relevant features from the database, shuffling them, and then splitting them into a training set and a test set.
# shuffle comes from scikit-learn; df is assumed to be the pandas DataFrame of NYT articles
from sklearn.utils import shuffle

# move articles to an array
articles = df.body.values
# move article section names to an array
sections = df.section_name.values
# move article web_urls to an array
web_url = df.web_url.values
# shuffle these three arrays
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)
# split the shuffled articles into two arrays
n = 10
# one will have all but the last 10 articles -- think of this as your training set/corpus
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]
# the other will have those last 10 articles -- think of this as your test set/corpus
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]
Choosing a text vectorizer
We can choose among several different text-vectorization schemes, such as Bag-of-Words (BoW), Tf-Idf, and Word2Vec.
One reason we choose Tf-Idf is that, unlike BoW, it judges a word's importance not only by term frequency but also by inverse document frequency.
For example, a word like "Obama" that appears only a few times in an article (unlike words such as "a" or "the", which convey little information) but shows up in relatively few articles across the corpus should receive a higher weight.
That is because "Obama" is neither a stopword nor everyday language, which indicates the word is highly relevant to the article's topic.
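As a concrete illustration, here is a minimal vectorization sketch assuming scikit-learn's TfidfVectorizer (the parameter choices and the variable name X_test_tfidf are illustrative assumptions, not the original author's exact settings):
from sklearn.feature_extraction.text import TfidfVectorizer

# fit the vocabulary and Idf weights on the training corpus only,
# then reuse the fitted vectorizer on the test articles
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)  # sparse matrix: (n_train_articles, n_tokens)
X_test_tfidf = vectorizer.transform(X_test)        # sparse matrix: (n_test_articles, n_tokens)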
Choosing a similarity metric
There are several options when picking a similarity metric, for example comparing Jaccard with Cosine.
Jaccard similarity works by comparing two sets and the overlap of their elements. Given that we've already chosen Tf-Idf as the vectorizer, Jaccard similarity makes no sense as an option; it would only come into play if we had chosen BoW vectorization.
So instead we'll try cosine similarity as our metric.
Since Tf-Idf assigns a weight to every token in every article, we can take the dot product between the token weights of different articles. If tokens such as "Obama" or "White House" carry high weights in article A and also in article B, the product will be larger than if those same tokens carried low weights in article B, meaning the two articles are more similar.
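To make the dot-product intuition concrete, here is a minimal sketch assuming scikit-learn's cosine_similarity; the two articles picked out of X_train_tfidf are arbitrary examples:
from sklearn.metrics.pairwise import cosine_similarity

# cosine similarity = dot product of the two weight vectors divided by the product of their norms
doc_a = X_train_tfidf[0]  # Tf-Idf vector of one article
doc_b = X_train_tfidf[1]  # Tf-Idf vector of another article
print(cosine_similarity(doc_a, doc_b))  # value in [0, 1]; higher means more similar
Note that TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the plain dot product used in the function below already equals the cosine similarity.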
Building the recommender
Using the similarity scores between the article a user has read and every other article in the corpus (i.e. the training data), you can now build a function that returns the top N most similar articles and start recommending them to the user.
import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5):
    '''Calculate similarity scores between a user's article and the training corpus.
    INPUT:  vectorized document corpus (2D array), text document corpus (1D array),
            user article (vectorized), article section names (1D array),
            article URLs (1D array), number of articles to recommend (int)
    OUTPUT: top n recommendations, top n corresponding section names,
            top n corresponding URLs, and similarity scores between the
            user article and the entire corpus (all 1D arrays)
    '''
    # calculate similarity between the training corpus and the user's article
    similarity_scores = np.asarray(X_train_tfidf.dot(test_article.toarray().T)).flatten()
    # get similarity score indices, sorted from most to least similar
    sorted_indices = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores
The function performs the following steps:
1. Compute the similarity between the user's article and the corpus;
2. Sort the similarity scores from high to low;
3. Take the top N most similar articles;
4. Fetch the section names and URLs corresponding to those top N articles;
5. Return the top N articles, their section names, their URLs, and the similarity scores.
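For instance, a minimal usage sketch (k is assumed to be the index of the user's article in the test set, and X_test_tfidf is the Tf-Idf matrix from the vectorization sketch above; the same k appears in the outputs below):
# recommend 5 training articles similar to one "user" article from the test set
k = 0  # hypothetical index of the article the user is reading
top_n_recs, rec_sections, rec_urls, sorted_sim_scores = get_top_n_rec_articles(
    X_train_tfidf, X_train, X_test_tfidf[k], X_train_sections, X_train_urls, n=5)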
Validating the results
We can now check whether the results hold up by recommending articles to users based on what they are currently reading.
Let's compare the user's article and its section name against the recommended articles and their section names.
First, a look at the similarity scores.
# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# .
Cosine similarity ranges from 0 to 1, so these scores are not high. How could we improve them? We could switch to a different vectorizer such as Doc2Vec, or choose a different similarity metric. Even so, let's take a look at the results.
# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'
# corresponding section names for top n recs
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'
As the results show, the recommended section names match what we need.
# user's article
X_test[k]
# OUTPUT:
'LOS ANGELES — The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard. If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'
The top five recommended articles are all related to the article the reader is currently viewing, which shows the recommender works as expected.
A note on validation
The ad-hoc validation above, comparing the recommended articles' text and section names with the user's article, shows that our recommender behaves as required.
Manual inspection works well enough, but what we ultimately want is a fully automated system that can be dropped into a model and validate itself.
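One possible automated check, sketched here as an idea rather than the author's method: since every article carries a section name, you could measure how often the recommendations share the user's article's section (reusing rec_sections, X_test_sections, and k from above).
# fraction of recommendations whose section matches the user's article's section --
# a rough, fully automatic proxy for whether the recommendations are on-topic
section_match_rate = np.mean(rec_sections == X_test_sections[k])
print(section_match_rate)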
How to put this recommender into such a model is not the topic of this article; the goal here is to show how to prototype a recommender of this kind on a real-world dataset.
The original article was written by data scientist Alexander Barriga. It was translated, with some edits and omissions, by 先薦, a personalized-recommendation platform; please credit the source when reposting.