Part 1: Getting Started
1. Basic usage
(1) Automatic cross-validation
Surprise ships with a set of built-in algorithms and datasets. In its simplest form, running a cross-validation procedure takes only a few lines of code:
```python
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
```
Output:

```
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE (testset)  0.9398  0.9321  0.9413  0.9349  0.9329  0.9362  0.0037
MAE (testset)   0.7400  0.7351  0.7400  0.7364  0.7370  0.7377  0.0020
Fit time        5.66    5.47    5.46    5.60    5.77    5.59    0.12
Test time       0.24    0.14    0.18    0.15    0.15    0.17    0.04
```
The `load_builtin()` method downloads the movielens-100k dataset if it is not already present, and saves it in the `.surprise_data` folder of your home directory (you can also choose to save it somewhere else).

We use the well-known `SVD` algorithm here, but many other algorithms are available.

The `cross_validate()` function runs a cross-validation procedure according to the `cv` argument and computes some accuracy measures. We use classic 5-fold cross-validation here, but more advanced iterators can be used as well.
(2) Train-test split and the fit() method
If you don't want to run a full cross-validation procedure, you can use `train_test_split()` to sample a trainset and a testset of given sizes, and use the accuracy metric of your choice. You'll need the `fit()` method, which trains the algorithm on the trainset, and the `test()` method, which returns the predictions made on the testset:
```python
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE.
accuracy.rmse(predictions)
```
Output:

```
RMSE: 0.9461
```
(3) Training on the whole trainset and the predict() method

Obviously, we could also simply fit our algorithm to the whole dataset rather than run cross-validation. This can be done with the `build_full_trainset()` method, which builds a `trainset` object:
```python
from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# Get a prediction for this specific user and item.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
```
Output:

```
user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}
```

Here `est` is the estimated rating.
So far we have only used built-in datasets.
2. Using a custom dataset
Surprise has a set of built-in datasets, but you can of course use a custom one. Loading a rating dataset can be done either from a file (e.g. a csv file) or from a pandas dataframe. Either way, you will need to define a `Reader` object for Surprise to be able to parse the file or the dataframe.
To load a dataset from a file (e.g. a csv file), use the `load_from_file()` method:

```python
import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Path to the dataset file.
file_path = os.path.expanduser(r'C:/Users/FELIX/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(BaselineOnly(), data, verbose=True)
```
To load a dataset from a pandas dataframe, use the `load_from_df()` method. You will also need a `Reader` object, but only the `rating_scale` parameter needs to be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in that order. This is not restrictive, since you can easily reorder the columns of a dataframe:

```python
import pandas as pd

from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(NormalPredictor(), data, cv=2)
```
3. Using cross-validation iterators
For cross-validation, we can let the `cross_validate()` function do all the hard work for us. But for finer control, we can also instantiate a cross-validation iterator ourselves and make predictions over each split, using the iterator's `split()` method and the algorithm's `test()` method. Here is an example using the classic K-fold cross-validation procedure with 3 splits:
```python
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Define a cross-validation iterator.
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
```
Other cross-validation iterators can be used, such as `LeaveOneOut` or `ShuffleSplit`; see the documentation for all the available iterators. The design of Surprise's cross-validation tools is inspired by the excellent scikit-learn API.
A special case of cross-validation is when the folds are already predefined by some files. For instance, the movielens-100K dataset already provides 5 train and test files (u1.base, u1.test ... u5.base, u5.test). Surprise can handle this case by using a `surprise.model_selection.split.PredefinedKFold` object:
```python
import os

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold

# Path to the dataset folder.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# This time, we'll use the built-in reader.
reader = Reader('ml-100k')

# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()

for trainset, testset in pkf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
```
Of course, it is also possible to train and test on a single pair of files. The `folds_files` argument must still be a list, though.
4. Tuning algorithm parameters with GridSearchCV
The `cross_validate()` function reports the accuracy measures of a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the `GridSearchCV` class comes to the rescue. Given a `dict` of parameters, this class exhaustively tries all combinations and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn's `GridSearchCV`.
```python
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use the movielens-100K dataset.
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# Best RMSE score.
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score.
print(gs.best_params['rmse'])
```
經過上面操做獲得最佳參數後就能夠使用該參數的算法:
```python
# We can now use the algorithm that yields the best rmse.
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
```