Surprise Library Documentation (Translated Notes)

Date: 2019-11-06


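The formatting here has not been cleaned up much; the OneNote notes are the reference version.

Because OneNote no longer supports sharing a single page, leave your email address if you need the notes and I will send a PDF version. I will look for a better solution later.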

Installing the Surprise recommender library

pip install surprise

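Basic usage
• Automatic cross-validation framework

    from surprise import SVD, Dataset
    from surprise.model_selection import cross_validate

    # Load the movielens-100k dataset (download it if needed).
    data = Dataset.load_builtin('ml-100k')
    # We'll use the famous SVD algorithm.
    algo = SVD()
    # Run 5-fold cross-validation and print results.
    cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

    If the "movielens-100k" dataset has not been downloaded yet, load_builtin offers to download it and stores it under the .surprise_data directory.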
• Using a custom dataset

    import os

    from surprise import BaselineOnly, Dataset, Reader
    from surprise.model_selection import cross_validate

    # Path to the dataset file.
    file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
    # As we're loading a custom dataset, we need to define a reader. In the
    # movielens-100k dataset, each line has the following format:
    # 'user item rating timestamp', separated by '\t' characters.
    reader = Reader(line_format='user item rating timestamp', sep='\t')
    data = Dataset.load_from_file(file_path, reader=reader)
    # We can now use this dataset as we please, e.g. calling cross_validate.
    cross_validate(BaselineOnly(), data, verbose=True)

Cross-validation

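○ cross_validate(algorithm, dataset, measures=[accuracy measures], cv=number of folds)
    ○ For finer control over the dataset, combine the algorithm's test method with a cross-validation iterator such as KFold; LeaveOneOut or ShuffleSplit can be used as well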
    from surprise import SVD
    from surprise import Dataset
    from surprise import accuracy
    from surprise.model_selection import KFold
    
    # Load the movielens-100k dataset
    data = Dataset.load_builtin('ml-100k')
    
    # define a cross-validation iterator
    kf = KFold(n_splits=3)
    
    algo = SVD()
    for trainset, testset in kf.split(data):
        # train and test algorithm.
        algo.fit(trainset)
        predictions = algo.test(testset)
        # Compute and print Root Mean Squared Error
        accuracy.rmse(predictions, verbose=True)
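
To make the splitting step concrete, here is a minimal plain-Python sketch of what a k-fold iterator does to a set of ratings. It is an illustration only, not Surprise's implementation: the helper name kfold_indices and the way folds are cut (every n_splits-th shuffled index) are assumptions made for this example.

```python
import random


def kfold_indices(n_ratings, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs; each rating is tested exactly once.

    Hypothetical helper for illustration, not part of Surprise.
    """
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index goes into this fold's test set.
        test_idx = indices[fold::n_splits]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


folds = list(kfold_indices(n_ratings=10, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), 'training ratings,', len(test_idx), 'test ratings')
```

Each pass trains on the ratings outside the fold and evaluates on the held-out ones, which is the same pattern as the fit/test loop above.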

Tuning algorithm parameters with GridSearchCV

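When you need to compare different parameter values for an algorithm, the GridSearchCV class provides a solution.

For example, trying different values for the parameters of SVD: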

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])
# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
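
To see how much work the grid above implies, the following sketch enumerates the parameter combinations with itertools.product. The helper expand_grid is a hypothetical name used for illustration; it only mirrors the idea of an exhaustive grid and is not Surprise's code.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}


def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combos = expand_grid(param_grid)
print(len(combos), 'parameter combinations')
for combo in combos:
    print(combo)
```

With cv=3, each of these combinations is fitted and evaluated on three splits, so the search cost grows quickly as values are added to the grid.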

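Using prediction algorithms

○ Baseline estimates configuration
        § When using Alternating Least Squares (ALS), the following parameters can be passed:
            1) reg_i: the regularization parameter for items; default is 10
            2) reg_u: the regularization parameter for users; default is 15
            3) n_epochs: the number of iterations of the ALS procedure; default is 10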
            print('Using ALS')
            bsl_options = {'method': 'als',
                           'n_epochs': 5,
                           'reg_u': 12,
                           'reg_i': 5
                           }
            algo = BaselineOnly(bsl_options=bsl_options)
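        § When using Stochastic Gradient Descent (SGD), the following parameters can be passed:
            1) reg: the regularization parameter of the cost function being optimized; default is 0.02
            2) learning_rate: the learning rate of SGD; default is 0.005
            3) n_epochs: the number of iterations of the SGD procedure; default is 20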
            print('Using SGD')
            bsl_options = {'method': 'sgd',
                           'learning_rate': .00005,
                           }
            algo = BaselineOnly(bsl_options=bsl_options)
        § Passing baseline options when creating a KNN algorithm
            bsl_options = {'method': 'als',
                           'n_epochs': 20,
                           }
            sim_options = {'name': 'pearson_baseline'}
            algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)
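    ○ Similarity measure configuration
        § name: the name of the similarity measure to use; default is MSD
        § user_based: whether similarities are computed between users (True) or between items (False); default is True
        § min_support: the minimum number of common items (for user-based similarity) or common users (for item-based similarity); when the number in common is below min_support, the similarity is 0
        § shrinkage: the shrinkage parameter; default is 100
        i. Cosine similarity between items:
        sim_options = {'name': 'cosine',
                       'user_based': False  # compute similarities between items
                       }
        algo = KNNBasic(sim_options=sim_options)
        ii. pearson_baseline similarity without shrinkage:
        sim_options = {'name': 'pearson_baseline',
                       'shrinkage': 0  # no shrinkage
                       }
        algo = KNNBasic(sim_options=sim_options)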
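
As a rough illustration of how name and min_support interact, here is a plain-Python sketch of a mean-squared-difference (MSD) similarity between two users. It assumes the common definition 1 / (msd + 1) and a hypothetical helper name msd_similarity; it is a teaching sketch, not the library's optimized implementation.

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """MSD-style similarity between two users' {item: rating} dicts.

    Returns 0.0 when fewer than min_support items are rated by both users.
    Hypothetical helper for illustration only.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    # Adding 1 avoids division by zero when the ratings agree exactly.
    return 1.0 / (msd + 1.0)


alice = {'i1': 4, 'i2': 2, 'i3': 5}
bob = {'i1': 5, 'i2': 2}

print(msd_similarity(alice, bob))                 # two items in common
print(msd_similarity(alice, bob, min_support=3))  # not enough common items
```

Setting user_based to False corresponds to running the same comparison on {user: rating} dicts of two items instead.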
• Other questions
    ○ How to get the top-N recommendations

            from collections import defaultdict
            
            from surprise import SVD
            from surprise import Dataset
            
            
            def get_top_n(predictions, n=10):
                '''Return the top-N recommendation for each user from a set of predictions.
            
                Args:
                    predictions(list of Prediction objects): The list of predictions, as
                        returned by the test method of an algorithm.
                    n(int): The number of recommendation to output for each user. Default
                        is 10.
            
                Returns:
                A dict where keys are user (raw) ids and values are lists of tuples:
                    [(raw item id, rating estimation), ...] of size n.
                '''
            
                # First map the predictions to each user.
                top_n = defaultdict(list)
                for uid, iid, true_r, est, _ in predictions:
                    top_n[uid].append((iid, est))
            
                # Then sort the predictions for each user and retrieve the k highest ones.
                for uid, user_ratings in top_n.items():
                    user_ratings.sort(key=lambda x: x[1], reverse=True)
                    top_n[uid] = user_ratings[:n]
            
                return top_n
            
            
            # First train an SVD algorithm on the movielens dataset.
            data = Dataset.load_builtin('ml-100k')
            trainset = data.build_full_trainset()
            algo = SVD()
            algo.fit(trainset)
            
            # Then predict ratings for all pairs (u, i) that are NOT in the training set.
            testset = trainset.build_anti_testset()
            predictions = algo.test(testset)
            
            top_n = get_top_n(predictions, n=10)
            
            # Print the recommended items for each user
            for uid, user_ratings in top_n.items():
                print(uid, [iid for (iid, _) in user_ratings])
    ○ How to compute precision@k and recall@k

    from collections import defaultdict

    from surprise import Dataset
    from surprise import SVD
    from surprise.model_selection import KFold
    
    
    def precision_recall_at_k(predictions, k=10, threshold=3.5):
        '''Return precision and recall at k metrics for each user.'''
    
        # First map the predictions to each user.
        user_est_true = defaultdict(list)
        for uid, _, true_r, est, _ in predictions:
            user_est_true[uid].append((est, true_r))
    
        precisions = dict()
        recalls = dict()
        for uid, user_ratings in user_est_true.items():
    
            # Sort user ratings by estimated value
            user_ratings.sort(key=lambda x: x[0], reverse=True)
    
            # Number of relevant items
            n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
    
            # Number of recommended items in top k
            n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
    
            # Number of relevant and recommended items in top k
            n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                                  for (est, true_r) in user_ratings[:k])
    
            # Precision@K: Proportion of recommended items that are relevant
            precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
    
            # Recall@K: Proportion of relevant items that are recommended
            recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1
    
        return precisions, recalls
    
    
    data = Dataset.load_builtin('ml-100k')
    kf = KFold(n_splits=5)
    algo = SVD()
    
    for trainset, testset in kf.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)
    
        # Precision and recall can then be averaged over all users
        print(sum(prec for prec in precisions.values()) / len(precisions))
        print(sum(rec for rec in recalls.values()) / len(recalls))
    ○ How to get the k nearest neighbors of a user (or item)

    import io  # needed because of weird encoding of u.item file

    from surprise import KNNBaseline
    from surprise import Dataset
    from surprise import get_dataset_dir
    
    
    def read_item_names():
        """Read the u.item file from MovieLens 100-k dataset and return two
        mappings to convert raw ids into movie names and movie names into raw ids.
        """
    
        file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
        rid_to_name = {}
        name_to_rid = {}
        with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
            for line in f:
                line = line.split('|')
                rid_to_name[line[0]] = line[1]
                name_to_rid[line[1]] = line[0]
    
        return rid_to_name, name_to_rid
    
    
    # First, train the algorithm to compute the similarities between items
    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()
    sim_options = {'name': 'pearson_baseline', 'user_based': False}
    algo = KNNBaseline(sim_options=sim_options)
    algo.fit(trainset)
    
    # Read the mappings raw id <-> movie name
    rid_to_name, name_to_rid = read_item_names()
    
    # Retrieve inner id of the movie Toy Story
    toy_story_raw_id = name_to_rid['Toy Story (1995)']
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
    
    # Retrieve inner ids of the nearest neighbors of Toy Story.
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
    
    # Convert inner ids of the neighbors into names.
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)
    
    print()
    print('The 10 nearest neighbors of Toy Story are:')
    for movie in toy_story_neighbors:
        print(movie)
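    ○ What are raw ids and inner ids?
        i. Users and items each have a raw id and an inner id. Raw ids are the ids defined in the rating file or the pandas dataframe. The key point is that predict() and similar methods take raw ids
        ii. When a trainset is created, each raw id is mapped to an inner id (a unique integer that is easier for Surprise to work with). Conversion between raw and inner ids is done with the trainset methods to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid()
    ○ Where are datasets downloaded by default, and how can the location be changed?
        i. By default, datasets are downloaded to "~/.surprise_data"
        ii. To change the location, set the "SURPRISE_DATA_FOLDER" environment variable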
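
The raw id to inner id mapping can be pictured with a small plain-Python sketch. The variable names and the toy ratings below are made up for illustration; the point is only that inner ids are consecutive integers assigned as raw ids are first seen, which is an assumption about the general idea rather than a statement about Surprise internals.

```python
# Toy raw ratings: (raw user id, raw item id, rating).
raw_ratings = [('alice', 'toy_story', 4.0),
               ('bob', 'toy_story', 5.0),
               ('alice', 'heat', 3.0)]

raw2inner_users, raw2inner_items = {}, {}
for raw_uid, raw_iid, _ in raw_ratings:
    # Assign the next free integer the first time a raw id is seen.
    raw2inner_users.setdefault(raw_uid, len(raw2inner_users))
    raw2inner_items.setdefault(raw_iid, len(raw2inner_items))

# The reverse mapping plays the role of to_raw_iid().
inner2raw_items = {inner: raw for raw, inner in raw2inner_items.items()}

print(raw2inner_users)
print(raw2inner_items)
print(inner2raw_items[1])
```

This is why code such as the nearest-neighbors example converts a raw id to an inner id before calling get_neighbors and converts the results back afterwards.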
• API reference
    ○ Prediction algorithms package
        random_pred.NormalPredictor    Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
        baseline_only.BaselineOnly    Algorithm predicting the baseline estimate for given user and item.
        knns.KNNBasic    A basic collaborative filtering algorithm.
        knns.KNNWithMeans    A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
        knns.KNNWithZScore    A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
        knns.KNNBaseline    A basic collaborative filtering algorithm taking into account a baseline rating.
        matrix_factorization.SVD    The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
        matrix_factorization.SVDpp    The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
        matrix_factorization.NMF    A collaborative filtering algorithm based on Non-negative Matrix Factorization.
        slope_one.SlopeOne    A simple yet accurate collaborative filtering algorithm.
        co_clustering.CoClustering    A collaborative filtering algorithm based on co-clustering.
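    ○ Base class for prediction algorithms
        § class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)
        § If the algorithm needs to compute baseline estimates, the baseline options (the bsl_options dictionary used in the examples above) configure how they are computed; similarity computation is configured through sim_options
        § Methods:
            1) compute_baselines(): computes the user and item baselines. Only relevant for algorithms using the pearson_baseline similarity or for the BaselineOnly algorithm. Returns a tuple containing the user baselines and the item baselines
            2) compute_similarities(): builds the similarity matrix. How it is computed depends on the sim_options passed when the algorithm was created. Returns the similarity matrix
            3) default_prediction(): the default prediction, used when the prediction turns out to be impossible during estimation. By default it is the mean of all ratings (it can be overridden in subclasses to change this value). Returns a float
            4) fit(trainset): trains the algorithm on the given trainset. Every derived class calls this method as the first basic step of training; it initializes some internal structures and sets the self.trainset attribute. Returns self
            5) get_neighbors(iid, k): returns the k nearest neighbors of the given inner id. Whether iid refers to a user or an item depends on whether user_based in sim_options is True or False. Returns the list of the k nearest neighbors' inner ids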
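            6) predict(uid, iid, r_ui=None, clip=True, verbose=False): computes the rating prediction for the given user and item. The method converts raw ids to inner ids and then calls the estimate method defined in each derived class. If the prediction is impossible, the estimate is set according to default_prediction()
            clip decides whether the estimate is clipped to the rating scale. For example, if the estimate is 5.5 and the rating scale is [1, 5], it is set to 5; if the estimate is below 1, it is set to 1. Default is True
            verbose decides whether the details of each prediction are printed. Default is False
            Returns a Prediction object containing:
                a) the raw user id
                b) the raw item id
                c) the true rating
                d) the estimated rating
                e) additional details that may be useful for later analysis
            7) test(testset, verbose=False): tests the algorithm on the given testset, i.e. estimates all the ratings in it. Returns a list of Prediction objects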
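
The clipping behaviour described for predict() amounts to clamping the estimate to the rating scale. A minimal sketch, assuming a hypothetical helper name clip_estimate and the (1, 5) scale from the example:

```python
def clip_estimate(est, rating_scale=(1, 5)):
    """Clamp an estimated rating to the [lower, upper] rating scale."""
    lower, upper = rating_scale
    return min(max(est, lower), upper)


for raw_estimate in (5.5, 0.3, 3.7):
    print(raw_estimate, '->', clip_estimate(raw_estimate))
```

Passing clip=False to predict() leaves the estimate untouched, which can be useful when inspecting what the model produces before clamping.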
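    ○ Predictions module
        § The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception
        § Prediction
            □ A named tuple for storing the result of a prediction
            □ It is wrapped in a class only for documentation and printing purposes
            □ Fields:
                uid    the raw user id
                iid    the raw item id
                r_ui    the true rating (float)
                est    the estimated rating (float)
                details    additional details about the prediction
        § surprise.prediction_algorithms.predictions.PredictionImpossible
            □ Exception raised when a prediction is impossible
            □ When it is raised, the estimated rating is set to the default value (the global mean)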
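
The relationship between the two can be sketched with stand-in definitions. Everything below is illustrative: the namedtuple and the exception are local stand-ins that mirror the field names listed above, and safe_predict, estimate_known_users_only and the was_impossible key are hypothetical names chosen for this example.

```python
from collections import namedtuple

# Local stand-ins mirroring the fields described above (not the library classes).
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])


class PredictionImpossible(Exception):
    """Raised by an estimator that cannot produce a rating."""


def safe_predict(uid, iid, estimate, global_mean, r_ui=None):
    """Call estimate(); fall back to the global mean if it is impossible."""
    details = {'was_impossible': False}
    try:
        est = estimate(uid, iid)
    except PredictionImpossible as exc:
        est = global_mean
        details = {'was_impossible': True, 'reason': str(exc)}
    return Prediction(uid, iid, r_ui, est, details)


def estimate_known_users_only(uid, iid):
    # Toy estimator: it only knows the user 'alice'.
    if uid != 'alice':
        raise PredictionImpossible('User is unknown.')
    return 4.2


known = safe_predict('alice', 'heat', estimate_known_users_only, global_mean=3.5)
unknown = safe_predict('zoe', 'heat', estimate_known_users_only, global_mean=3.5)
print(known)
print(unknown)
```

Because the result is a tuple of five fields, it can be unpacked as `uid, iid, true_r, est, _`, which is the pattern used in the top-N and precision/recall examples.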
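    ○ The model_selection package
        § Cross-validation iterators
            □ The module contains several cross-validation iterators:
                KFold    A basic cross-validation iterator
                RepeatedKFold    Repeated KFold cross-validation iterator
                ShuffleSplit    A basic cross-validation iterator with random trainsets and testsets
                LeaveOneOut    Cross-validation iterator where each user has exactly one rating in the testset
                PredefinedKFold    Cross-validation iterator for datasets loaded with the load_from_folds method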
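            □ The module also contains a function for splitting a dataset into a trainset and a testset
                train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)
                    - data: the dataset to split
                    - test_size: if a float, the proportion of ratings to include in the testset; if an int, the absolute number of ratings in the testset; if None, the complement of the trainset size; default is 0.2
                    - train_size: if a float, the proportion of ratings to include in the trainset; if an int, the absolute number of ratings in the trainset; if None, the complement of the testset size; default is None
                    - random_state: an int used as the random seed; set it to obtain the same trainset and testset across repeated calls
                    - shuffle: boolean, whether to shuffle the ratings of the dataset before splitting; default is True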
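
The effect of test_size, random_state and shuffle can be sketched on a plain list of ratings. The function split_ratings below is a hypothetical stand-in written for this note; it handles only the float and int forms of test_size and is not the library's train_test_split.

```python
import random


def split_ratings(ratings, test_size=0.2, random_state=None, shuffle=True):
    """Split a list of ratings into (train, test) lists.

    Illustrative stand-in only; a float test_size is a proportion,
    an int test_size is an absolute number of ratings.
    """
    ratings = list(ratings)  # copy, so the caller's list is not reordered
    if shuffle:
        random.Random(random_state).shuffle(ratings)
    if isinstance(test_size, float):
        n_test = int(round(test_size * len(ratings)))
    else:
        n_test = test_size
    return ratings[n_test:], ratings[:n_test]


ratings = [('u%d' % (i % 3), 'i%d' % i, float(1 + i % 5)) for i in range(10)]

train_a, test_a = split_ratings(ratings, test_size=0.2, random_state=42)
train_b, test_b = split_ratings(ratings, test_size=0.2, random_state=42)
print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same split
```

Fixing random_state is what makes experiments comparable: two algorithms evaluated with the same seed see exactly the same held-out ratings.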
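        § Cross-validation
            surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', verbose=False)
                ® algo: the algorithm to evaluate
                ® data: the dataset
                ® measures: list of strings naming the accuracy measures to compute
                ® cv: a cross-validation iterator, an int, or None. An iterator is used as given; an int selects KFold with that number of folds; None selects KFold with the default of 5 folds
                ® return_train_measures: whether to also compute the accuracy measures on the trainsets; default is False
                ® n_jobs: int, the maximum number of folds evaluated in parallel. If -1, all CPUs are used; if 1, no parallel computing is used (useful for debugging); for values below -1, (number of CPUs + n_jobs + 1) CPUs are used; default is 1
                ® pre_dispatch: int or string, controls the number of jobs dispatched during parallel execution (reducing it helps avoid excessive memory consumption when more jobs are dispatched than the CPUs can process). It can be:
                    - None: all jobs are created and spawned immediately
                    - an int: the exact number of total jobs that are spawned
                    - a string: an expression as a function of n_jobs, e.g. '2*n_jobs'
                Default is '2*n_jobs'
            Returns a dict with the following keys:
                ® test_*: where * is an accuracy measure, e.g. 'test_rmse'; array of the accuracy values for each testset
                ® train_*: where * is an accuracy measure, e.g. 'train_rmse'; array of the accuracy values for each trainset. Only available when return_train_measures is True
                ® fit_time: array, the training time of each split, in seconds
                ® test_time: array, the testing time of each split, in seconds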
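
The n_jobs rule for negative values is easy to get wrong, so here is a small arithmetic sketch. The helper effective_n_jobs is a hypothetical name for this note, and the guard for n_jobs == 0 and the floor of one worker are assumptions of the sketch; it simply encodes the rule stated above.

```python
def effective_n_jobs(n_jobs, n_cpus):
    """Number of workers implied by n_jobs on a machine with n_cpus CPUs."""
    if n_jobs == 0:
        raise ValueError('n_jobs == 0 has no meaning')
    if n_jobs < 0:
        # -1 means all CPUs, -2 all but one, and so on.
        return max(n_cpus + n_jobs + 1, 1)
    return n_jobs


for n_jobs in (1, -1, -2):
    print(n_jobs, '->', effective_n_jobs(n_jobs, n_cpus=8))
```

On an 8-CPU machine, n_jobs=-2 therefore leaves one CPU free, which is a common choice when the machine is also used interactively.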
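        § Parameter search
            □ class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', joblib_verbose=0)
                ® The parameters are analogous to those of cross_validate above
                ® refit: bool or string. If True, the algorithm is refitted on the whole dataset using the parameters that gave the best average performance for the first measure in measures; another measure can be selected by passing its name as a string; default is False
                ® joblib_verbose: controls the verbosity of joblib; the higher the integer, the more messages
            □ Attributes and methods:
                a) best_estimator: dict keyed by measure; the algorithm that gave the best results for that measure, averaged over all splits
                b) best_score: dict of floats keyed by measure; the best average score achieved for that measure
                c) best_params: dict keyed by measure; the parameter combination that gave the best results for that measure
                d) best_index: dict of ints keyed by measure; the index into cv_results of the combination with the highest (average) accuracy for that measure
                e) cv_results: dict of arrays; the accuracy measures over all splits, together with the train and test times, for every parameter combination
                f) fit(data): runs the algorithm's fit for every parameter combination, over the splits given by the cv parameter
                g) predict(): only available when refit is not False; calls predict() on the estimator with the best parameters found, with the arguments described above
                h) test(): only available when refit is not False; calls test() on the estimator with the best parameters found, with the arguments described above
            □ class surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', random_state=None, joblib_verbose=0)
            Samples parameter combinations at random instead of enumerating all of them exhaustively as GridSearchCV does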
    ○ Similarities module
        § The similarities module contains tools for computing similarities between users or between items:
            1) cosine
            2) msd
            3) pearson
            4) pearson_baseline
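    ○ Accuracy module
        § The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions:
            1) rmse (Root Mean Squared Error)
            2) mae (Mean Absolute Error)
            3) fcp (Fraction of Concordant Pairs)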
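
For reference, the first two metrics can be written out in a few lines of plain Python. The helpers below take (true rating, estimated rating) pairs and are named rmse_of and mae_of to make clear they are illustrative stand-ins, not the accuracy module's functions (which take Prediction lists).

```python
from math import sqrt


def rmse_of(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return sqrt(sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs))


def mae_of(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(true_r - est) for true_r, est in pairs) / len(pairs)


pairs = [(4.0, 3.0), (3.0, 5.0)]  # errors of 1 and 2
print(rmse_of(pairs))
print(mae_of(pairs))
```

RMSE penalizes the larger error more heavily than MAE does, which is why the two values differ on the same pairs.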
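    ○ Dataset module
        § The dataset module defines the Dataset class and other subclasses used for managing datasets
        § class surprise.dataset.Dataset(reader)
        § Methods:
            1) load_builtin(name=u'ml-100k'): loads a built-in dataset and returns a Dataset object
            2) load_from_df(df, reader): df is a dataframe that must have three columns, in this order: raw user id, raw item id, rating; reader is the Reader describing the data
            3) load_from_file(file_path, reader): loads data from a file; the arguments are the file path and the reader
            4) load_from_folds(folds_files, reader): covers the special case where the trainsets and testsets are already defined in files, as in the movielens-100k dataset, so that they can be loaded directly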
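    ○ Trainset class
        § class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)
        § Attributes:
            1) ur: the user ratings, a dict of lists of (item_inner_id, rating) tuples; keys are user inner ids
            2) ir: the item ratings, a dict of lists of (user_inner_id, rating) tuples; keys are item inner ids
            3) n_users: the number of users
            4) n_items: the number of items
            5) n_ratings: the total number of ratings
            6) rating_scale: tuple of the lowest and highest possible ratings
            7) global_mean: the mean of all ratings
        § Methods:
            1) all_items(): generator that iterates over all items and yields their inner ids
            2) all_ratings(): generator that iterates over all ratings and yields (uid, iid, rating) tuples
            3) all_users(): generator that iterates over all users and yields their inner ids
            4) build_anti_testset(fill=None): returns a list of ratings that can be used as a testset in the test() method, made of the user/item pairs that are not in the trainset. fill is the value used for the unknown ratings; if None, global_mean is used
            5) knows_item(iid): indicates whether the item is part of the trainset
            6) knows_user(uid): indicates whether the user is part of the trainset
            7) to_inner_iid(riid): converts an item raw id to an inner id
            8) to_inner_uid(ruid): converts a user raw id to an inner id
            9) to_raw_iid(iiid): converts an item inner id to a raw id
            10) to_raw_uid(iuid): converts a user inner id to a raw id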
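
The idea behind build_anti_testset can be sketched on a toy ur dictionary shaped as described above. The function anti_testset is a hypothetical stand-in: it works purely with inner ids and a hand-built ur, whereas the real method operates on a Trainset object.

```python
def anti_testset(ur, fill=None):
    """List (user, item, fill) for every user/item pair absent from ur.

    ur maps user inner id -> list of (item inner id, rating).
    Illustrative stand-in only.
    """
    all_ratings = [r for user_ratings in ur.values() for _, r in user_ratings]
    global_mean = sum(all_ratings) / len(all_ratings)
    fill_value = global_mean if fill is None else fill
    all_items = sorted({i for user_ratings in ur.values() for i, _ in user_ratings})

    pairs = []
    for user, user_ratings in ur.items():
        rated = {i for i, _ in user_ratings}
        pairs.extend((user, i, fill_value) for i in all_items if i not in rated)
    return pairs


# User 0 rated items 0 and 1; user 1 rated only item 0.
ur = {0: [(0, 4.0), (1, 3.0)], 1: [(0, 5.0)]}
print(anti_testset(ur))
print(anti_testset(ur, fill=0.0))
```

The list grows with (number of users x number of items), so on large datasets it is usually worth building it for a subset of users only.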
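    ○ Reader class
        § class surprise.reader.Reader(name=None, line_format=u'user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)
        The Reader class is used to parse a file containing ratings. Such a file must specify exactly one rating per line, and each line must follow the structure: user ; item ; rating ; [timestamp]. The field order is arbitrary but must be specified
        § Parameters:
            1) name: if specified, a Reader for one of the built-in datasets is returned and the other parameters are ignored. Accepted values are 'ml-100k', 'ml-1m' and 'jester'. Default is None
            2) line_format: string, the field names separated by spaces; default is 'user item rating'
            3) sep: char, the separator between fields
            4) rating_scale: tuple, the rating scale; default is (1, 5)
            5) skip_lines: int, the number of lines to skip at the beginning of the file; default is 0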
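
How line_format and sep work together can be shown with a short parsing sketch. The function parse_line is a hypothetical stand-in written for this note, not the Reader's own parsing code; it only demonstrates that the field order is taken from line_format.

```python
def parse_line(line, line_format='user item rating', sep=None):
    """Parse one rating line into (user, item, rating, timestamp).

    Illustrative stand-in: field order comes from line_format, and
    timestamp is None when the format does not include it.
    """
    fields = line_format.split()
    values = line.strip().split(sep)  # sep=None splits on any whitespace
    record = dict(zip(fields, values))
    return (record['user'], record['item'], float(record['rating']),
            record.get('timestamp'))


# A movielens-100k style line: tab-separated, with a timestamp.
print(parse_line('196\t242\t3\t881250949\n',
                 line_format='user item rating timestamp', sep='\t'))
# A file that lists the item first and uses ';' as separator.
print(parse_line('242;196;3', line_format='item user rating', sep=';'))
```

Note that the ids come out as strings, which is worth remembering when passing raw ids to predict() for data loaded from a file.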
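    ○ Dump module
        § surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)
            □ A basic wrapper around pickle that serializes a list of predictions and/or an algorithm
            □ Parameters:
                a) file_name: str, where to write the dump
                b) predictions: list of Prediction, the predictions to dump
                c) algo: the algorithm to dump
                d) verbose: level of verbosity, 0 or 1
        § surprise.dump.load(file_name)
            □ Reads a dump file
            □ Returns a tuple (predictions, algo), either of which may be None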
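
The dump/load round trip can be pictured with the standard pickle module. The functions dump_sketch and load_sketch are hypothetical stand-ins for this note; storing both objects in one dict is an assumption of the sketch, chosen so that either value can be None.

```python
import os
import pickle
import tempfile


def dump_sketch(file_name, predictions=None, algo=None):
    """Pickle predictions and/or an algorithm into a single file."""
    with open(file_name, 'wb') as f:
        pickle.dump({'predictions': predictions, 'algo': algo}, f)


def load_sketch(file_name):
    """Return the (predictions, algo) tuple stored by dump_sketch."""
    with open(file_name, 'rb') as f:
        stored = pickle.load(f)
    return stored['predictions'], stored['algo']


toy_predictions = [('alice', 'heat', 3.0, 3.4, {})]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dump_file')
    dump_sketch(path, predictions=toy_predictions)
    loaded_predictions, loaded_algo = load_sketch(path)

print(loaded_predictions)
print(loaded_algo)
```

As with any pickle-based format, dump files should only be loaded from sources you trust.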
