A previous post covered how to build a recommender system and briefly introduced how collaborative filtering is implemented. In this post, I will walk through applying collaborative filtering in a recommender system. Recommender systems are among the most common and most intuitive applications of big data and machine learning. In fact, we run into recommendation scenarios all the time in daily life: buying goods on an e-commerce site, watching videos in a video app, downloading games on a phone. All of these use recommendation techniques to personalize the content and items shown to you.
This post proceeds by building collaborative filtering models and using order data to recommend items that users are likely to want.
Everything in this post is implemented with Python and the Turicreate machine learning library; the required Python dependencies appear in the import block below.
The demo uses two CSV data sources: customer_id.csv (customer IDs) and customer_data.csv (order data).
Load the Python dependencies:

```python
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split
```

Inspect the datasets:

```python
customers = pd.read_csv('customer_id.csv')
transactions = pd.read_csv('customer_data.csv')
print(customers.head())
print(transactions.head())
```

A preview of the results:
From the CSV data above, explode each item list in the products column into separate rows, and count how many of each product every user purchased.

The implementation:

```python
transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])
data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(),
               id_vars=['customerId'], value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)
print(data.shape)
print(data.head())
```

A preview of the results:
Create a dummy column marking whether a user purchased an item at all:

```python
def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

data_dummy = create_data_dummy(data)
print(data_dummy.head())
```

A preview of the results:
Build the user-item matrix:

```python
df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
print(df_matrix.head())
```

A preview of the results:
Normalize the matrix:

```python
df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
print(df_matrix_norm.head())
```

A preview of the results:
Create a table to serve as the model input:

```python
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
print(data_norm.shape)
print(data_norm.head())
```

A preview of the results:
The steps above can be combined into the function defined below:

```python
def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
```

Above, we normalized each user's purchase history to the range 0 to 1, where 1 corresponds to the maximum purchase count of an item and 0 to no purchases of that item.
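To make the scaling concrete, here is a minimal sketch of the same column-wise min-max normalization on a made-up count matrix (toy data, not the real order data):

```python
import pandas as pd

# Toy user-item purchase-count matrix (rows: users, columns: products).
df_matrix = pd.DataFrame({'p1': [1, 3, 5], 'p2': [0, 2, 4]},
                         index=['u1', 'u2', 'u3'])

# Min-max scale each product column to [0, 1]:
# the heaviest buyer of a product maps to 1, a non-buyer to 0.
df_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
print(df_norm)
```

Each column is scaled independently, so "1" always means "this user has the highest purchase count for this product".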
The split function:

```python
def split_data(data):
    '''
    Splits dataset into training and test set.

    Args:
        data (pandas.DataFrame)

    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size=.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data
```
We now have three datasets: purchase counts, purchase dummies, and scaled purchase counts. We split each dataset separately for modeling:

```python
train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)

print(train_data)
```

Printing the training data gives a preview like this:
Before running a more sophisticated method such as collaborative filtering, we should run a baseline model for comparison and evaluation. Since a baseline usually uses a very simple approach, any technique that goes beyond it with relatively good accuracy and reasonable complexity is worth choosing.

A baseline model is a machine-learning term: in short, it predicts using the most common case. In a coin-guessing game, for example, the simplest strategy is to always pick heads (or always tails); judged as a predictive model, that already gives you 50% accuracy.
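To make the idea concrete, a popularity baseline can be sketched in plain pandas on a toy purchase log (this is only an illustration; the Turicreate popularity recommender used later does this bookkeeping for us):

```python
import pandas as pd

# Toy purchase log: a popularity baseline recommends whatever is
# globally popular, regardless of the individual user.
log = pd.DataFrame({'customerId': [1, 1, 1, 2, 2, 3],
                    'productId':  [10, 20, 30, 10, 20, 10]})

# Rank products by total number of purchases, most popular first.
popularity = log['productId'].value_counts()
top_items = list(popularity.index)
print(top_items)
```

Every user receives the same ranked list, which is exactly why popularity makes a good yardstick: a personalized model should beat it.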
A more sophisticated but more common way to predict purchases is collaborative filtering. First, we define the variables to be used in the models:

```python
# constant variables to define field names include:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10      # number of items to recommend
n_display = 30  # to display the first few rows in an output dataset
```
Turicreate makes it very easy to call these modeling techniques, so we define a single function that builds all of the models:

```python
def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data,
                                                 user_id=user_id,
                                                 item_id=item_id,
                                                 target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='pearson')
    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model
```
Using purchase counts:

```python
name = 'popularity'
target = 'purchase_count'
popularity = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(popularity)
```

Screenshot of the output:
Using the purchase dummy:

```python
name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pop_dummy)
```

Screenshot of the output:
Using scaled purchase counts:

```python
name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pop_norm)
```

Screenshot of the output:
This approach recommends similar items based on how users purchase items together. For example, if user 1 purchased items X, Y, and Z, and user 2 purchased X and Y, then we can recommend item Z to user 2.

The formula is as follows:
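Cosine similarity between two purchase vectors A and B is their dot product divided by the product of their norms, cos(A, B) = A·B / (‖A‖·‖B‖). A minimal numpy sketch on toy vectors (illustration only, not the Turicreate internals):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing in the same direction score close to 1.0.
print(cosine_similarity([1, 2, 0], [1, 2, 0]))
# Orthogonal vectors (no overlap in purchases) score 0.
print(cosine_similarity([1, 0], [0, 1]))
```

The similarity depends only on the angle between the vectors, not their magnitude, so heavy and light buyers with the same taste still look similar.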
Using purchase counts:

```python
name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos)
```

Screenshot of the output:
Using the purchase dummy:

```python
name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos_dummy)
```

Screenshot of the output:
Using scaled purchase counts:

```python
name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(cos_norm)
```

Screenshot of the output:
Here, similarity is the Pearson correlation coefficient between the two vectors.
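Pearson correlation is equivalent to cosine similarity computed after centering each vector on its own mean. A minimal sketch on toy vectors (illustration only):

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation: cosine similarity of mean-centered vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Perfectly linearly related vectors correlate close to 1.0.
print(pearson_similarity([1, 2, 3], [2, 4, 6]))
```

The mean-centering makes Pearson insensitive to each vector's baseline level, which can matter when users rate or purchase on very different scales.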
Using purchase counts:

```python
name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear)
```

Screenshot of the output:
Using the purchase dummy:

```python
name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear_dummy)
```

Screenshot of the output:
Using scaled purchase counts:

```python
name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
print(pear_norm)
```

Screenshot of the output:
When evaluating a recommendation engine, we can use RMSE together with the precision-recall metrics.

Why are recall and precision so important?
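Recall measures what fraction of the items a user actually bought made it into the recommendations; precision measures what fraction of the recommended items were actually bought. A minimal sketch of precision@k and recall@k for a single user (toy data; Turicreate's compare_models reports these metrics across cutoffs for us):

```python
def precision_recall_at_k(recommended, actual, k):
    """Precision@k and recall@k for one user, given ranked recommendations."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(actual))
    precision = hits / k           # how much of what we showed was right
    recall = hits / len(actual)    # how much of what was right we showed
    return precision, recall

# The user actually bought items 10 and 30; we recommended [10, 20, 30, 40].
p, r = precision_recall_at_k([10, 20, 30, 40], [10, 30], k=4)
print(p, r)  # 0.5 1.0
```

A model that recommends everything trivially maximizes recall at the cost of precision, which is why the two must be read together.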
Below, we create the initial variables used for model evaluation:

```python
models_w_counts = [popularity, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]
names_w_counts = ['Popularity Model on Purchase Counts',
                  'Cosine Similarity on Purchase Counts',
                  'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy',
                 'Cosine Similarity on Purchase Dummy',
                 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts',
                'Cosine Similarity on Scaled Purchase Counts',
                'Pearson Similarity on Scaled Purchase Counts']
```
Then let's compare all the models we built, based on RMSE and precision-recall:

```python
eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)
```
The evaluation output looks like this:
The complete example code:
```python
import pandas as pd
import numpy as np
import time
import turicreate as tc
from sklearn.model_selection import train_test_split

customers = pd.read_csv('customer_id.csv')
transactions = pd.read_csv('customer_data.csv')
# print(customers.head())
# print(transactions.head())

transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])
data = pd.melt(transactions.set_index('customerId')['products'].apply(pd.Series).reset_index(),
               id_vars=['customerId'], value_name='products') \
    .dropna().drop(['variable'], axis=1) \
    .groupby(['customerId', 'products']) \
    .agg({'products': 'count'}) \
    .rename(columns={'products': 'purchase_count'}) \
    .reset_index() \
    .rename(columns={'products': 'productId'})
data['productId'] = data['productId'].astype(np.int64)
# print(data.shape)
# print(data.head())

def create_data_dummy(data):
    data_dummy = data.copy()
    data_dummy['purchase_dummy'] = 1
    return data_dummy

data_dummy = create_data_dummy(data)
# print(data_dummy.head())

df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
# print(df_matrix.head())
df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
# print(df_matrix_norm.head())

# create a table for input to the modeling
d = df_matrix_norm.reset_index()
d.index.names = ['scaled_purchase_freq']
data_norm = pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()
# print(data_norm.shape)
# print(data_norm.head())

def normalize_data(data):
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()

def split_data(data):
    '''
    Splits dataset into training and test set.

    Args:
        data (pandas.DataFrame)

    Returns:
        train_data (tc.SFrame)
        test_data (tc.SFrame)
    '''
    train, test = train_test_split(data, test_size=.2)
    train_data = tc.SFrame(train)
    test_data = tc.SFrame(test)
    return train_data, test_data

train_data, test_data = split_data(data)
train_data_dummy, test_data_dummy = split_data(data_dummy)
train_data_norm, test_data_norm = split_data(data_norm)
# print(train_data)

# constant variables to define field names include:
user_id = 'customerId'
item_id = 'productId'
users_to_recommend = list(customers[user_id])
n_rec = 10      # number of items to recommend
n_display = 30  # to display the first few rows in an output dataset

def model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display):
    if name == 'popularity':
        model = tc.popularity_recommender.create(train_data,
                                                 user_id=user_id,
                                                 item_id=item_id,
                                                 target=target)
    elif name == 'cosine':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='cosine')
    elif name == 'pearson':
        model = tc.item_similarity_recommender.create(train_data,
                                                      user_id=user_id,
                                                      item_id=item_id,
                                                      target=target,
                                                      similarity_type='pearson')
    recom = model.recommend(users=users_to_recommend, k=n_rec)
    recom.print_rows(n_display)
    return model

name = 'popularity'
target = 'purchase_count'
popularity = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(popularity)

name = 'popularity'
target = 'purchase_dummy'
pop_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pop_dummy)

name = 'popularity'
target = 'scaled_purchase_freq'
pop_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pop_norm)

name = 'cosine'
target = 'purchase_count'
cos = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos)

name = 'cosine'
target = 'purchase_dummy'
cos_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos_dummy)

name = 'cosine'
target = 'scaled_purchase_freq'
cos_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(cos_norm)

name = 'pearson'
target = 'purchase_count'
pear = model(train_data, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear)

name = 'pearson'
target = 'purchase_dummy'
pear_dummy = model(train_data_dummy, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear_dummy)

name = 'pearson'
target = 'scaled_purchase_freq'
pear_norm = model(train_data_norm, name, user_id, item_id, target, users_to_recommend, n_rec, n_display)
# print(pear_norm)

models_w_counts = [popularity, cos, pear]
models_w_dummy = [pop_dummy, cos_dummy, pear_dummy]
models_w_norm = [pop_norm, cos_norm, pear_norm]
names_w_counts = ['Popularity Model on Purchase Counts',
                  'Cosine Similarity on Purchase Counts',
                  'Pearson Similarity on Purchase Counts']
names_w_dummy = ['Popularity Model on Purchase Dummy',
                 'Cosine Similarity on Purchase Dummy',
                 'Pearson Similarity on Purchase Dummy']
names_w_norm = ['Popularity Model on Scaled Purchase Counts',
                'Cosine Similarity on Scaled Purchase Counts',
                'Pearson Similarity on Scaled Purchase Counts']

eval_counts = tc.recommender.util.compare_models(test_data, models_w_counts, model_names=names_w_counts)
eval_dummy = tc.recommender.util.compare_models(test_data_dummy, models_w_dummy, model_names=names_w_dummy)
eval_norm = tc.recommender.util.compare_models(test_data_norm, models_w_norm, model_names=names_w_norm)

# Final Output Result
# final_model = tc.item_similarity_recommender.create(tc.SFrame(data_dummy),
#                                                     user_id=user_id,
#                                                     item_id=item_id,
#                                                     target='purchase_dummy',
#                                                     similarity_type='cosine')
# recom = final_model.recommend(users=users_to_recommend, k=n_rec)
# recom.print_rows(n_display)
# df_rec = recom.to_dataframe()
# print(df_rec.shape)
# print(df_rec.head())
```
That's all for this post. If you run into any problems while studying this material, you can join the discussion group or send me an email; I will do my best to answer. Let's keep learning together!
In addition, I have published the books 《Kafka並不難學》 and 《Hadoop大數據挖掘從入門到進階實戰》. Friends and readers who are interested can purchase them through the links in the announcement bar; thank you all for your support. Follow the official account below and, following the prompts, you can get the books' companion video tutorials for free.