public static void main(String[] args) throws Exception {
    int userId = 1;
    int rankNum = 2;
    QingRS qingRS = new QingRS();
    for (int neighberNum = 2; neighberNum < 10; neighberNum++) {
        System.out.println("neigherNum=" + neighberNum);
        qingRS.initRecommenderIntro(filename, neighberNum);
        String resultStr = qingRS.getRecommender(userId, rankNum);
        System.out.println(resultStr);
    }
}
Recommend=313 4.5
neigherNum=3
Recommend=286 5.0
neigherNum=4
Recommend=286 5.0
neigherNum=5
Recommend=990 5.0
neigherNum=6
Recommend=990 5.0
neigherNum=7
Recommend=990 5.0
neigherNum=8
Recommend=990 5.0
neigherNum=9
Recommend=990 5.0
Explanation: when the neighborhood first grows, more people are consulted -- as the saying goes, three cobblers together beat one Zhuge Liang -- so the recommendation changes. But as the neighborhood keeps growing, the extra people added are just making up the numbers and have little influence on the result.
List<RecommendedItem> recommendations = recommender.recommend(userId, rankNum);
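For context, here is a minimal sketch of what a helper such as initRecommenderIntro might look like when built from the standard Mahout Taste classes. QingRS is the author's own class, so the fields and method body below are assumptions, and the usual org.apache.mahout.cf.taste imports are omitted:

// Hypothetical sketch of QingRS.initRecommenderIntro -- not the author's actual code.
private DataModel dataModel;
private Recommender recommender;

public void initRecommenderIntro(String filename, int neighborNum) throws Exception {
    dataModel = new FileDataModel(new File(filename));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    // The neighborhood size is the parameter being swept in the loop above.
    UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(neighborNum, similarity, dataModel);
    recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
}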
However: we find that except for neigherNum = 2, the recommendation results all changed, and the output starts to oscillate; even if neigherNum is raised to 30, the recommendations keep oscillating.
Recommend=313 4.5
neigherNum=3
Recommend=323 5.0
neigherNum=4
Recommend=898 5.0
neigherNum=5
Recommend=323 5.0
neigherNum=6
Recommend=323 5.0
neigherNum=7
Recommend=898 5.0
neigherNum=8
Recommend=326 5.0
neigherNum=9
Recommend=326 5.0
Explanation???: The problem should lie in the sorting algorithm -- to save memory Mahout supposedly uses quicksort, which is not a stable sort. But when I looked at the Mahout source code I found that GenericUserBasedRecommender uses Collections.sort(), which defaults to merge sort, so the sort should be stable. The question remains open.
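One way to probe the oscillation is to inspect the estimated scores directly: if several candidate items are all estimated at the ceiling of 5.0, the ordering among them carries little signal, and the candidate set itself changes with the neighborhood size. A hedged diagnostic sketch using the standard Recommender.estimatePreference call (the item IDs are simply those seen in the output above, the neighborhood size of 5 is arbitrary, and the usual imports are omitted):

// Hedged diagnostic fragment, intended to run inside a main() that throws Exception:
// compare estimated preferences of the items that appear in the oscillating output.
// Identical scores mean the ranking among them is essentially arbitrary.
DataModel dataModel = new FileDataModel(new File(filename));
UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(5, similarity, dataModel);
Recommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);

long userId = 1;
long[] candidates = {286L, 313L, 323L, 326L, 898L, 990L};  // item IDs seen in the logs above
for (long itemId : candidates) {
    System.out.println(itemId + " -> " + recommender.estimatePreference(userId, itemId));
}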
The code is as follows:
package com.qingfeng.rs.test;

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class QingDataModelTest {

    private final static String filename = "data/u.data";

    public static void main(String[] args) throws IOException, TasteException {
        DataModel dataModel = new FileDataModel(new File(filename));

        // compute the max and min preference values
        float maxValue = dataModel.getMaxPreference();
        float minValue = dataModel.getMinPreference();

        // compute the number of users and items
        int usersNum = dataModel.getNumUsers();
        int itemsNum = dataModel.getNumItems();

        int[] itemsNumForUsers = new int[usersNum];
        int[] usersNumForItems = new int[itemsNum];

        LongPrimitiveIterator userIDs = dataModel.getUserIDs();
        int i = 0;
        while (userIDs.hasNext()) {
            itemsNumForUsers[i++] = dataModel.getPreferencesFromUser(userIDs.next()).length();
        }
        assert (i == usersNum);

        LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
        i = 0;
        while (itemIDs.hasNext()) {
            usersNumForItems[i++] = dataModel.getPreferencesForItem(itemIDs.next()).length();
        }
        assert (i == itemsNum);

        // compute mean and standard deviation of ratings per user
        double usersMean;
        double usersVar;
        int sum = 0;
        int sqSum = 0;
        for (int num : itemsNumForUsers) {
            sum += num;
            sqSum += num * num;
        }
        usersMean = (double) sum / usersNum;
        double userSqMean = (double) sqSum / usersNum;
        usersVar = Math.sqrt(userSqMean - usersMean * usersMean);

        // compute mean and standard deviation of ratings per item
        double itemsMean;
        double itemsVar;
        sum = 0;
        sqSum = 0;
        for (int num : usersNumForItems) {
            sum += num;
            sqSum += num * num;
        }
        itemsMean = (double) sum / itemsNum;
        double itemsSqMean = (double) sqSum / itemsNum;
        itemsVar = Math.sqrt(itemsSqMean - itemsMean * itemsMean);

        System.out.println("Preference=(" + minValue + ", " + maxValue + ")");
        System.out.println("usersNum=" + usersNum + ", userMean=" + usersMean + ", userVar=" + usersVar);
        System.out.println("itemsNum=" + itemsNum + ", itemsMean=" + itemsMean + ", itemsVar=" + itemsVar);
    }
}
Set a threshold to filter the data
for (int num : itemsNumForUsers) {
    sum += num;
    if (num < 20) {
        countLower++;
        // System.out.println("user warning(" + countLower + ")=" + num);
    }
    sqSum += num * num;
}
System.out.println("user warning(" + countLower + ")");

for (int num : usersNumForItems) {
    sum += num;
    if (num < 20) {
        countLower++;
        // System.out.println("item warning(" + countLower + ")=" + num);
    }
    sqSum += num * num;
}
System.out.println("item warning(" + countLower + ")");
The sparsity of item ratings goes hand in hand with the total number of items. Using the user-based approach means working with the smaller number of users, which saves memory, and that matrix is denser.
With the threshold set to 20, the item matrix being sparse with a large variance is consistent with the filter's statistic item warning(743) being large. For now the data is not filtered; that can be revisited later.
Note: of course, a good filter requires repeatedly adjusting the threshold value and re-testing.
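As an illustration, here is a hedged sketch of such a filter: it rewrites the ratings file, keeping only items that received at least `threshold` ratings. The tab-separated MovieLens layout (userID, itemID, rating, timestamp) and the output file name filtered.data are assumptions; the usual java.io and java.util imports are omitted.

// Hedged sketch: write a filtered copy of the ratings file, keeping only
// items that have at least `threshold` ratings.
// Assumes tab-separated lines: userID \t itemID \t rating \t timestamp.
int threshold = 20;
Map<String, Integer> itemCounts = new HashMap<String, Integer>();
BufferedReader reader = new BufferedReader(new FileReader("data/u.data"));
String line;
while ((line = reader.readLine()) != null) {
    String itemId = line.split("\t")[1];
    Integer c = itemCounts.get(itemId);
    itemCounts.put(itemId, c == null ? 1 : c + 1);
}
reader.close();

PrintWriter writer = new PrintWriter(new FileWriter("data/filtered.data"));
reader = new BufferedReader(new FileReader("data/u.data"));
while ((line = reader.readLine()) != null) {
    if (itemCounts.get(line.split("\t")[1]) >= threshold) {
        writer.println(line);
    }
}
reader.close();
writer.close();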
public class QingMemoryTest {

    private static final String filename = "data/u.data";

    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File(filename));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(5, similarity, dataModel);
        Recommender recommender = new GenericUserBasedRecommender(dataModel, neighborhood, similarity);

        System.out.println("1: jvm free-memory= " + Runtime.getRuntime().freeMemory() + "Bytes");
        System.gc();
        System.out.println("2: jvm free-memory= " + Runtime.getRuntime().freeMemory() + "Bytes");

        // dataModel has been reclaimed, so the recommendation result is wrong.
        System.out.println(recommender.recommend(1, 2).get(1).getValue());
    }
}
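Note that the log below reports "used-memory" while the code above prints freeMemory(); the figures were presumably produced by a small helper along these lines (a hedged sketch, not necessarily the author's exact code):

// Hedged sketch of a used-memory helper: used = total - free, reported in MB.
private static double usedMemoryMB() {
    Runtime rt = Runtime.getRuntime();
    return (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0);
}

// e.g. System.out.println("after dataModel: jvm used-memory= " + usedMemoryMB() + "MB");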
after dataModel: jvm used-memory= 19.2872314453125MB
after similarity: jvm used-memory= 19.2872314453125MB
after neighborhood: jvm used-memory= 19.58240509033203MB
after recommender: jvm used-memory= 19.58240509033203MB
recommend=340
after recommend first: jvm used-memory= 19.877883911132812MB
after gc: jvm used-memory= 9.829483032226562MB
recommend=340
after recommend second: jvm used-memory= 9.829483032226562MB
Analysis: from the data above, after gc runs the JVM reclaims about 10 MB of memory, which is consistent with the guess.
Question: after that memory is reclaimed, why can the recommender still produce recommendations, and with no extra memory overhead??? (A plausible answer: the recommender still holds a strong reference to the DataModel, so gc can only reclaim temporary objects, e.g. from file parsing and the first recommend call, not the model itself.)
Scale the data up 10x, i.e., test with the 1M data set
Simple statistical analysis results:
user warning(0)
item warning(663)
Preference=(1.0, 5.0)
usersNum=6040, userMean=165.5975165562914, userVar=192.73107252940773
itemsNum=3706, itemsMean=269.88909875876953, itemsVar=383.9960197430679
Estimated memory consumption: usersNum and itemsNum grew by 3 to 6 times, and the similarity matrix's memory consumption is quadratic, so its upper bound grows by 9 to 36 times. In addition the data grew 10x, and the DataModel's memory consumption grows linearly, i.e. 10x. So the estimated memory consumption = 2.8 MB * 10 + (9~36) * 8 MB = between 100 MB and 316 MB. If the similarity matrix is not stored, the memory consumption is around 28 MB.
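For reference, the back-of-envelope arithmetic works out as follows (2.8 MB for the DataModel and 8 MB for the similarity matrix are the author's estimates from the 100K run):

// Back-of-envelope check of the estimate above (all figures in MB, taken from the text).
double dataModel100k = 2.8;   // DataModel memory for the 100K data set (author's estimate)
double simMatrix100k = 8.0;   // similarity-matrix memory for the 100K data set (author's estimate)
double low  = dataModel100k * 10 + 9 * simMatrix100k;   // 28 + 72  = 100 MB
double high = dataModel100k * 10 + 36 * simMatrix100k;  // 28 + 288 = 316 MB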
Since this data set uses "::" as the delimiter, preprocess it with a simple Python script to replace the delimiter with \t:
f = open("result.dat", "w")
for line in open("ratings.dat", "r"):
    newLine = line.replace("::", "\t")
    f.write(newLine)
f.close()
The run results are as follows:
start: jvm used-memory= 0.5967178344726562MB
after dataModel: jvm used-memory= 204.9770050048828MB
after similarity: jvm used-memory= 204.9770050048828MB
after neighborhood: jvm used-memory= 204.9770050048828MB
after recommender: jvm used-memory= 204.9770050048828MB
recommend=2908
after recommend first: jvm used-memory= 208.10643768310547MB
after gc: jvm used-memory= 76.12030029296875MB
recommend=2908
after recommend second: jvm used-memory= 76.12030029296875MB
Analysis: from the data above, 132 MB was reclaimed and 76 MB is the running overhead, which is consistent with the estimated memory consumption. The DataModel grows linearly and the similarity matrix grows quadratically.
Conclusion: if the number of ratings grows to the 10M level and the number of users or items grows by 3 to 10 times, then 4 GB to 40 GB of memory would be needed to compute recommendations quickly, which calls for more RAM, JVM tuning, and Hadoop. Moreover, for real data where the user count reaches the GB level and the total reaches the TB level, the amount of memory and computation required is frightening. Traditional single-machine algorithms can no longer meet the requirements, and distributed computation such as Hadoop is needed.
Of course, when memory is limited but disk is plentiful, MySQLJDBCDataModel can be used for storage when handling recommendation data beyond the 10M level.
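A minimal hedged sketch of wiring up MySQLJDBCDataModel; the table name taste_preferences, the column names, and the connection settings are assumptions, not the author's actual setup:

import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class QingJdbcDataModelSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings -- replace with real ones.
        MysqlDataSource dataSource = new MysqlDataSource();
        dataSource.setServerName("localhost");
        dataSource.setDatabaseName("recommender");
        dataSource.setUser("mahout");
        dataSource.setPassword("secret");

        // Preferences are read from MySQL instead of being loaded from a flat file.
        DataModel dataModel = new MySQLJDBCDataModel(
                dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");
        System.out.println("users=" + dataModel.getNumUsers() + ", items=" + dataModel.getNumItems());
    }
}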
Also: according to a friend in the 數盟 data community QQ group, "Taobao has around 80 million products (my memory may be off) and 200 million users -- what a huge matrix". Whenever I think of this, I quietly close my eyes and contemplate the distant universe and how vast the data is. In God's eyes, perhaps we are still just children playing house and learning 1+1.
In addition, a later comparison of the user-based, item-based, and slope-one algorithms is planned, taking running time into account as well.
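For reference, here is a minimal hedged sketch of the item-based counterpart to the user-based code above, reusing the file name and classes from the earlier examples:

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class QingItemBasedSketch {
    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("data/u.data"));
        // PearsonCorrelationSimilarity also implements ItemSimilarity, so it can be reused here.
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        // Item-based recommendation needs no neighborhood object.
        Recommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
        System.out.println(recommender.recommend(1, 2));
    }
}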
For the similarity measure, the following four are chosen:
PearsonCorrelationSimilarity, EuclideanDistanceSimilarity, TanimotoCoefficientSimilarity, LogLikelihoodSimilarity
[Note: EuclideanDistanceSimilarity is somewhat special -- the version I first found does not implement the UserSimilarity interface, so it cannot be placed in a Collection<UserSimilarity> container.]
[Note: be careful not to look at the classes under org.apache.mahout.math.hadoop.similarity.cooccurrence.measures by mistake.]
For parameter tuning, only the neighbor count N and the threshold are adjusted.
A prototype of the code is given here, but running even the 100K data set on an ordinary PC is too slow, so the toy data intro.csv is used instead.
N is chosen from [2, 4, 8, ... 64], and the threshold from [0.9, 0.85, ... 0.7] (the prototype below actually only sweeps 0.75 and 0.70);
The code is as follows:
public class QingParaTest {

    private final String filename = "data/intro.csv";
    private double threshold = 0.95;
    private int neighborNum = 2;
    private ArrayList<UserSimilarity> userSims;

    private final int SIM_NUM = 4;
    private final int NEIGHBOR_NUM = 64;
    private final double THRESHOLD_LOW = 0.7;

    public static void main(String[] args) throws IOException, TasteException {
        new QingParaTest().valuate();
    }

    public QingParaTest() {
        super();
        this.userSims = new ArrayList<UserSimilarity>();
    }

    private void valuate() throws IOException, TasteException {
        DataModel dataModel = new FileDataModel(new File(filename));
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

        // populate similarity measures
        populateUserSims(dataModel);

        int simBest = -1;
        double scoreBest = 5.0;
        int neighborBest = -1;
        double thresholdBest = -1;

        System.out.println("SIM\tNeighborNum\t\tThreshold\tscore");
        for (int i = 0; i < SIM_NUM; i++) {
            for (neighborNum = 2; neighborNum <= NEIGHBOR_NUM; neighborNum *= 2) {
                for (threshold = 0.75; threshold >= THRESHOLD_LOW; threshold -= 0.05) {
                    double score = 5.0;
                    QingRecommenderBuilder qRcommenderBuilder = new QingRecommenderBuilder(
                            userSims.get(i), neighborNum, threshold);
                    // Use 70% of the data to train; test using the other 30%.
                    score = evaluator.evaluate(qRcommenderBuilder, null, dataModel, 0.7, 1.0);
                    System.out.println((i + 1) + "\t" + neighborNum + "\t" + threshold + "\t" + score);
                    if (score < scoreBest) {
                        scoreBest = score;
                        simBest = i + 1;
                        neighborBest = neighborNum;
                        thresholdBest = threshold;
                    }
                }
            }
        }

        System.out.println("The best parameter");
        System.out.println(simBest + "\t" + neighborBest + "\t" + thresholdBest + "\t" + scoreBest);
    }

    private void populateUserSims(DataModel dataModel) throws TasteException {
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        userSims.add(userSimilarity);
        userSimilarity = new TanimotoCoefficientSimilarity(dataModel);
        userSims.add(userSimilarity);
        userSimilarity = new LogLikelihoodSimilarity(dataModel);
        userSims.add(userSimilarity);
        userSimilarity = new EuclideanDistanceSimilarity(dataModel);
        userSims.add(userSimilarity);
    }
}

class QingRecommenderBuilder implements RecommenderBuilder {

    private UserSimilarity userSimilarity;
    private int neighborNum;
    private double threshold;

    public QingRecommenderBuilder(UserSimilarity userSimilarity, int neighborNum, double threshold) {
        super();
        this.userSimilarity = userSimilarity;
        this.neighborNum = neighborNum;
        this.threshold = threshold;
    }

    @Override
    public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(
                neighborNum, threshold, userSimilarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
    }
}
The run results are as follows:
SIM NeighborNum Threshold score
1 2 0.75 0.4858379364013672
1 2 0.7 NaN
1 4 0.75 0.4676065444946289
1 4 0.7 NaN
1 8 0.75 0.8704338073730469
1 8 0.7 0.014162302017211914
1 16 0.75 NaN
1 16 0.7 0.7338032722473145
1 32 0.75 0.7338032722473145
1 32 0.7 0.4858379364013672
1 64 0.75 NaN
1 64 0.7 1.0
The best parameter
1 8 0.7 0.014162302017211914
Analysis: the best run was N = 8, Threshold = 0.7. Of course this method is very crude, but it still shows how much the parameters matter; after all, a recommender system in production must deliver good A/B test results, otherwise plain discounts and coupons would be simpler and more effective.
By the way, here is a screenshot of the data from a real case in Mahout in Action, as shown in the figure below.
public class SlopeOne {
    public static void main(String[] args) throws IOException, TasteException {
        DataModel dataModel = new FileDataModel(new File("data/intro.csv"));
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double score = evaluator.evaluate(new SlopeOneNoWeighting(), null, dataModel, 0.7, 1.0);
        System.out.println(score);
    }
}

class SlopeOneNoWeighting implements RecommenderBuilder {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        DiffStorage diffStorage = new MemoryDiffStorage(model, Weighting.UNWEIGHTED, Long.MAX_VALUE);
        return new SlopeOneRecommender(model, Weighting.UNWEIGHTED, Weighting.UNWEIGHTED, diffStorage);
    }
}
| Similarity measure (smaller distance means a larger value) | Advantages | Disadvantages | Value range |
| --- | --- | --- | --- |
| PearsonCorrelation: roughly the covariance of the two rating vectors | Not affected by a user's habit of rating consistently high or low | 1. Cannot be computed when two items have fewer than 2 co-ratings [can be handled with a co-rating count threshold]; the size of the overlap between two users is not taken into account [can be handled with the weighting parameter]. 2. Cannot be computed when the two items' ratings are completely identical | [-1, 1] |
| EuclideanDistanceSimilarity: computes the Euclidean distance d and uses 1/(1+d) | Suited to cases where the rating magnitude matters | If ratings are not important, normalization is needed; computation is heavy; updating the data each time is troublesome | [-1, 1] |
| CosineMeasureSimilarity: computes the angle between the vectors | Consistent with PearsonCorrelation | | [-1, 1] |
| SpearmanCorrelationSimilarity: PearsonCorrelation with rankings substituted for ratings | A balance between relying fully on ratings and discarding them entirely | Computing the ranks takes too long; not friendly to data updates | [-1, 1] |
| CachingUserSimilarity: wraps another similarity and keeps cached references to computed values | Caches frequently queried user similarities | Extra memory overhead | |
| TanimotoCoefficientSimilarity: the ratio of the intersection of the two vectors to their union; the more items they share, the more similar | Suited to cases with only associations and no ratings | Ratings are ignored, so information is lost | [-1, 1] |
| LogLikelihoodSimilarity: a probability-theory-based refinement of TanimotoCoefficientSimilarity | Measures how unlikely the overlap is to be coincidental; takes the distinctiveness of the shared items into account | Computation is complex | [-1, 1] |
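To get a feel for how these measures differ on the same data, here is a small hedged sketch that prints the similarity between two users (IDs 1 and 2 are arbitrary examples) under each of the four measures used earlier:

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class QingSimilarityCompare {
    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("data/intro.csv"));
        UserSimilarity[] sims = {
                new PearsonCorrelationSimilarity(dataModel),
                new EuclideanDistanceSimilarity(dataModel),
                new TanimotoCoefficientSimilarity(dataModel),
                new LogLikelihoodSimilarity(dataModel)
        };
        // Print the similarity of users 1 and 2 under each measure.
        for (UserSimilarity sim : sims) {
            System.out.println(sim.getClass().getSimpleName() + " = " + sim.userSimilarity(1, 2));
        }
    }
}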