Source: http://blog.fens.me/mahout-recommendation-apijava
Foreword
Building a recommender system with Mahout is both simple and difficult. It is simple because Mahout fully encapsulates the collaborative filtering algorithms, parallelizes them, and exposes very simple APIs; it is difficult because, without understanding the algorithms' internals, it is hard to configure and tune them for a given business scenario.
This article digs into the algorithm APIs to explain what goes on underneath Mahout's recommendation algorithms.
In terms of data-processing capability, Mahout's recommendation algorithms can be divided into two categories:
1). Single-machine in-memory implementations
Single-machine in-memory implementations run on a single machine and come from the cf.taste project. Familiar algorithms such as UserCF and ItemCF support single-machine in-memory execution, with flexibly configurable parameters. For a basic example of the single-machine algorithms, see the article: Building a Mahout Project with Maven (用Maven構建Mahout項目).
The problem with single-machine in-memory algorithms is that they are bounded by the resources of one machine. Medium-scale data of, say, 1GB or 10GB can be handled, but a dataset above 100GB is an impossible task for a single machine.
2). Distributed implementations on Hadoop
Hadoop-based distributed implementations parallelize the single-machine algorithms and spread the work across multiple machines. Mahout provides a Hadoop-parallelized implementation of ItemCF. For the Hadoop-based distributed implementation, see the article:
Developing Distributed Mahout Programs: Item-Based Collaborative Filtering ItemCF (Mahout分步式程序開發 基於物品的協同過濾ItemCF)
The difficulty with distributed parallel algorithms is how to parallelize the single-machine algorithm in the first place. On one machine we only need to think about the algorithm, data structures, memory, and CPU; a distributed algorithm must additionally deal with many more concerns, such as merging data across nodes, sorting, network-communication efficiency, recomputation when a node goes down, distributed data storage, and so on.
Mahout provides two metrics for evaluating a recommender: precision and recall. Both are classic measures from search engines.
              Relevant   Not relevant
Retrieved         A            C
Not retrieved     B            D
Retrieving as many of the relevant items as possible is the goal of recall, A/(A+B): the larger, the better.
Having the retrieved items be as relevant as possible, with as few irrelevant ones as possible, is the goal of precision, A/(A+C): the larger, the better.
On a large dataset the two metrics constrain each other: when you want to retrieve more of the data, precision drops; when you want retrieval to be more accurate, less data is retrieved.
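To make the two formulas concrete, here is a minimal, self-contained sketch; the counts in main are hypothetical (they merely echo the Precision:0.5 / Recall:0.5 shape of the evaluator output shown later).

public class IrMetrics {
    // a = retrieved and relevant, b = relevant but not retrieved, c = retrieved but irrelevant
    static double precision(int a, int c) { return (double) a / (a + c); }
    static double recall(int a, int b) { return (double) a / (a + b); }

    public static void main(String[] args) {
        System.out.println("precision = " + precision(2, 2)); // 2/(2+2) = 0.5
        System.out.println("recall    = " + recall(2, 2));    // 2/(2+2) = 0.5
    }
}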
1). System environment:
2). The Recommender interface:
org.apache.mahout.cf.taste.recommender.Recommender.java
Explanation of the methods in the interface:
From the Recommender interface we can guess that the core of each algorithm is implemented in the subclass's estimatePreference() method.
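The method listing itself is not reproduced above; for reference, the interface essentially looks like this in the Mahout 0.8 source (the comments are mine, paraphrasing the javadoc):

package org.apache.mahout.cf.taste.recommender;

import java.util.List;
import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;

public interface Recommender extends Refreshable {
  // top-N recommendations for a user
  List<RecommendedItem> recommend(long userID, int howMany) throws TasteException;
  // same, with a rescoring hook for filtering or boosting candidate items
  List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException;
  // predict one user's preference for one item -- where each algorithm does its real work
  float estimatePreference(long userID, long itemID) throws TasteException;
  // write updates through to the underlying DataModel
  void setPreference(long userID, long itemID, float value) throws TasteException;
  void removePreference(long userID, long itemID) throws TasteException;
  DataModel getDataModel();
}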
3). Subclasses of the Recommender interface, via the inheritance hierarchy:
Recommendation algorithm implementation classes:
Each of these algorithm implementations is covered below.
Test dataset: item.csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
Test program: org.conan.mymahout.recommendation.job.RecommenderTest.java
package org.conan.mymahout.recommendation.job;
import java.io.IOException;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.common.RandomUtils;
public class RecommenderTest {
final static int NEIGHBORHOOD_NUM = 2;
final static int RECOMMENDER_NUM = 3;
public static void main(String[] args) throws TasteException, IOException {
RandomUtils.useTestSeed(); // fix the random seed so that evaluation results are repeatable
String file = "datafile/item.csv";
DataModel dataModel = RecommendFactory.buildDataModel(file);
slopeOne(dataModel);
}
public static void userCF(DataModel dataModel) throws TasteException{}
public static void itemCF(DataModel dataModel) throws TasteException{}
public static void slopeOne(DataModel dataModel) throws TasteException{}
...
Each algorithm has its own test method, e.g. userCF(), itemCF(), slopeOne()….
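All of these test methods go through the author's RecommendFactory helper class, whose source is not shown in this excerpt. Judging from the call sites, buildDataModel() presumably just wraps Mahout's FileDataModel, which parses exactly the userID,itemID,preference lines of item.csv; a minimal sketch (the class name DataModelSketch is hypothetical):

import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class DataModelSketch {
    // hypothetical stand-in for RecommendFactory.buildDataModel(String)
    public static DataModel buildDataModel(String path) throws IOException {
        return new FileDataModel(new File(path));
    }
}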
User-based collaborative filtering (UserCF) measures similarity between users from their ratings of items, and makes recommendations based on that user-to-user similarity. In short: recommend to a user the items that other users with similar tastes like.
An example:
The basic idea of user-based CF is quite simple: from users' preferences for items, find the neighboring users, then recommend to the current user the items those neighbors like. Computationally, each user's preferences over all items form a vector, and user-to-user similarity is computed on these vectors. After finding the K nearest neighbors, the neighbors' similarity weights together with their item preferences are used to predict scores for the items the current user has not yet rated, yielding a ranked item list as the recommendation. Figure 2 gives an example: for user A, the historical preferences yield just one neighbor, user C, so item D, which user C likes, is recommended to user A.
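Before looking at the internals, here is a minimal sketch of assembling a user-based recommender directly against the taste API, using the same Euclidean similarity and nearest-2 neighborhood that the test program below configures through RecommendFactory:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCFSketch {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("datafile/item.csv"));
        UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
        // at most 2 nearest neighbors, matching NEIGHBORHOOD_NUM in the test program
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        for (RecommendedItem item : recommender.recommend(1, 3)) { // top 3 items for user 1
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}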
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
  DataModel model = getDataModel();
  Float actualPref = model.getPreferenceValue(userID, itemID);
  if (actualPref != null) {
    return actualPref;
  }
  long[] theNeighborhood = neighborhood.getUserNeighborhood(userID);
  return doEstimatePreference(userID, theNeighborhood, itemID);
}

protected float doEstimatePreference(long theUserID, long[] theNeighborhood, long itemID) throws TasteException {
  if (theNeighborhood.length == 0) {
    return Float.NaN;
  }
  DataModel dataModel = getDataModel();
  double preference = 0.0;
  double totalSimilarity = 0.0;
  int count = 0;
  for (long userID : theNeighborhood) {
    if (userID != theUserID) {
      // See GenericItemBasedRecommender.doEstimatePreference() too
      Float pref = dataModel.getPreferenceValue(userID, itemID);
      if (pref != null) {
        double theSimilarity = similarity.userSimilarity(theUserID, userID);
        if (!Double.isNaN(theSimilarity)) {
          preference += theSimilarity * pref;
          totalSimilarity += theSimilarity;
          count++;
        }
      }
    }
  }
  // Throw out the estimate if it was based on no data points, of course, but also if based on
  // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
  // The reason is that in this case the estimate is, simply, the user's rating for one item
  // that happened to have a defined similarity. The similarity score doesn't matter, and that
  // seems like a bad situation.
  if (count <= 1) {
    return Float.NaN;
  }
  float estimate = (float) (preference / totalSimilarity);
  if (capper != null) {
    estimate = capper.capEstimate(estimate);
  }
  return estimate;
}
Test program:
public static void userCF(DataModel dataModel) throws TasteException {
UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, true);
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.0
Recommender IR Evaluator: [Precision:0.5,Recall:0.5]
uid:1,(104,4.333333)(106,4.000000)
uid:2,(105,4.049678)
uid:3,(103,3.512787)(102,2.747869)
uid:4,(102,3.000000)
For a reimplementation of UserCF in R, see the article: Analyzing Mahout's UserCF Algorithm with R (用R解析Mahout用戶推薦協同過濾算法(UserCF)).
Item-based collaborative filtering (ItemCF) measures similarity between items from users' ratings of them, and makes recommendations based on that item-to-item similarity. In short: recommend to a user items similar to the ones they liked before.
An example:
Item-based CF works on the same principle as user-based CF, except that neighbors are computed among the items themselves rather than among users: similar items are found from users' preferences for them, and items similar to what the user has liked historically are then recommended. Computationally, all users' preferences for a given item form a vector, and item-to-item similarity is computed on these vectors. Once an item's similar items are known, the user's historical preferences are used to predict scores for the items the user has not yet expressed a preference on, yielding a ranked item list as the recommendation. Figure 3 gives an example: across all users' histories, the users who like item A also like item C, so items A and C are judged similar; since user C likes item A, we can infer that user C probably likes item C as well.
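Again, a minimal sketch of wiring the item-based recommender directly, mirroring the factory calls in the test program below; note that no neighborhood object is needed, since ItemCF weighs all the items the user has already rated:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemCFSketch {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("datafile/item.csv"));
        // EuclideanDistanceSimilarity implements ItemSimilarity as well as UserSimilarity
        ItemSimilarity similarity = new EuclideanDistanceSimilarity(model);
        Recommender recommender = new GenericItemBasedRecommender(model, similarity);
        for (RecommendedItem item : recommender.recommend(1, 3)) { // top 3 items for user 1
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}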
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
  PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);
  Float actualPref = getPreferenceForItem(preferencesFromUser, itemID);
  if (actualPref != null) {
    return actualPref;
  }
  return doEstimatePreference(userID, preferencesFromUser, itemID);
}

protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID) throws TasteException {
  double preference = 0.0;
  double totalSimilarity = 0.0;
  int count = 0;
  double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
  for (int i = 0; i < similarities.length; i++) {
    double theSimilarity = similarities[i];
    if (!Double.isNaN(theSimilarity)) {
      // Weights can be negative!
      preference += theSimilarity * preferencesFromUser.getValue(i);
      totalSimilarity += theSimilarity;
      count++;
    }
  }
  // Throw out the estimate if it was based on no data points, of course, but also if based on
  // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
  // The reason is that in this case the estimate is, simply, the user's rating for one item
  // that happened to have a defined similarity. The similarity score doesn't matter, and that
  // seems like a bad situation.
  if (count <= 1) {
    return Float.NaN;
  }
  float estimate = (float) (preference / totalSimilarity);
  if (capper != null) {
    estimate = capper.capEstimate(estimate);
  }
  return estimate;
}
Test program:
public static void itemCF(DataModel dataModel) throws TasteException {
ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, true);
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.8676552772521973
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(105,3.823529)(104,3.722222)(106,3.478261)
uid:2,(106,2.984848)(105,2.537037)(107,2.000000)
uid:3,(106,3.648649)(102,3.380000)(103,3.312500)
uid:4,(107,4.722222)(105,4.313953)(102,4.025000)
uid:5,(107,3.736842)
This algorithm was marked @Deprecated in mahout-0.8.
SlopeOne is a simple and efficient collaborative filtering algorithm that produces rating predictions from average rating differences. Download the Slope One paper (PDF).
1). An example:
Users X, Y, and Z have rated items A and B as in the table below. What should Z's rating of B be?
Slope One assumes that the average rating difference between two items can stand in for the unknown difference for a particular user. The average difference between item A and item B is ((5 - 4) + (4 - 2)) / 2 = 1.5, so Z's rating of B is 3 - 1.5 = 1.5.
Slope One treats the relationship between ratings as a simple linear one:
Y = mX + b
with the slope fixed at m = 1 (hence the name "slope one"), so only the offset b, the average difference, needs to be estimated from the data.
2). Weighted average calculation:
Users X, Y, and Z have rated items A, B, and C as in the table below. What should Z's rating of A be?
Through this simple approach we can quickly compute a predicted rating and complete the recommendation.
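To make the weighted version concrete, here is a small self-contained sketch; since the original tables are not reproduced here, the ratings, average differences, and co-rater counts below are hypothetical. Each co-rated item votes for (its rating + average difference), weighted by how many users that difference was computed from:

public class WeightedSlopeOneSketch {
    public static void main(String[] args) {
        // Hypothetical: Z rated B = 2.0 and C = 5.0; predict Z's rating of A.
        double[] zRatings = {2.0, 5.0};
        // diff(A,B) and diff(A,C): average of (rating_A - rating_X) over co-raters (hypothetical)
        double[] avgDiff = {0.5, -1.0};
        // how many users each average difference was computed from (the weights, hypothetical)
        double[] coRaters = {2, 1};

        double weighted = 0.0, totalWeight = 0.0;
        for (int i = 0; i < zRatings.length; i++) {
            weighted += coRaters[i] * (zRatings[i] + avgDiff[i]);
            totalWeight += coRaters[i];
        }
        System.out.println(weighted / totalWeight); // (2*2.5 + 1*4.0) / 3 = 3.0
    }
}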
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
  DataModel model = getDataModel();
  Float actualPref = model.getPreferenceValue(userID, itemID);
  if (actualPref != null) {
    return actualPref;
  }
  return doEstimatePreference(userID, itemID);
}

private float doEstimatePreference(long userID, long itemID) throws TasteException {
  double count = 0.0;
  double totalPreference = 0.0;
  PreferenceArray prefs = getDataModel().getPreferencesFromUser(userID);
  RunningAverage[] averages = diffStorage.getDiffs(userID, itemID, prefs);
  int size = prefs.length();
  for (int i = 0; i < size; i++) {
    RunningAverage averageDiff = averages[i];
    if (averageDiff != null) {
      double averageDiffValue = averageDiff.getAverage();
      if (weighted) {
        double weight = averageDiff.getCount();
        if (stdDevWeighted) {
          double stdev = ((RunningAverageAndStdDev) averageDiff).getStandardDeviation();
          if (!Double.isNaN(stdev)) {
            weight /= 1.0 + stdev;
          }
          // If stdev is NaN, then it is because count is 1. Because we're weighting by count,
          // the weight is already relatively low. We effectively assume stdev is 0.0 here and
          // that is reasonable enough. Otherwise, dividing by NaN would yield a weight of NaN
          // and disqualify this pref entirely
          // (Thanks Daemmon)
        }
        totalPreference += weight * (prefs.getValue(i) + averageDiffValue);
        count += weight;
      } else {
        totalPreference += prefs.getValue(i) + averageDiffValue;
        count += 1.0;
      }
    }
  }
  if (count <= 0.0) {
    RunningAverage itemAverage = diffStorage.getAverageItemPref(itemID);
    return itemAverage == null ? Float.NaN : (float) itemAverage.getAverage();
  } else {
    return (float) (totalPreference / count);
  }
}
Test program:
public static void slopeOne(DataModel dataModel) throws TasteException {
RecommenderBuilder recommenderBuilder = RecommendFactory.slopeOneRecommender();
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.3333333333333333
Recommender IR Evaluator: [Precision:0.25,Recall:0.5]
uid:1,(105,5.750000)(104,5.250000)(106,4.500000)
uid:2,(105,2.286115)(106,1.500000)
uid:3,(106,2.000000)(102,1.666667)(103,1.625000)
uid:4,(105,4.976859)(102,3.509071)
This algorithm was marked @Deprecated in mahout-0.8.
The algorithm comes from a paper:
This algorithm is based on the paper by Robert M. Bell and Yehuda Koren in ICDM '07.
(TODO: to be continued)
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.knn.KnnItemBasedRecommender
@Override
protected float doEstimatePreference(long theUserID, PreferenceArray preferencesFromUser, long itemID) throws TasteException {
  DataModel dataModel = getDataModel();
  int size = preferencesFromUser.length();
  FastIDSet possibleItemIDs = new FastIDSet(size);
  for (int i = 0; i < size; i++) {
    possibleItemIDs.add(preferencesFromUser.getItemID(i));
  }
  possibleItemIDs.remove(itemID);
  List<RecommendedItem> mostSimilar = mostSimilarItems(itemID, possibleItemIDs.iterator(), neighborhoodSize, null);
  long[] theNeighborhood = new long[mostSimilar.size() + 1];
  theNeighborhood[0] = -1;
  List<Long> usersRatedNeighborhood = Lists.newArrayList();
  int nOffset = 0;
  for (RecommendedItem rec : mostSimilar) {
    theNeighborhood[nOffset++] = rec.getItemID();
  }
  if (!mostSimilar.isEmpty()) {
    theNeighborhood[mostSimilar.size()] = itemID;
    for (int i = 0; i < theNeighborhood.length; i++) {
      PreferenceArray usersNeighborhood = dataModel.getPreferencesForItem(theNeighborhood[i]);
      int size1 = usersRatedNeighborhood.isEmpty() ? usersNeighborhood.length() : usersRatedNeighborhood.size();
      for (int j = 0; j < size1; j++) {
        if (i == 0) {
          usersRatedNeighborhood.add(usersNeighborhood.getUserID(j));
        } else {
          if (j >= usersRatedNeighborhood.size()) {
            break;
          }
          long index = usersRatedNeighborhood.get(j);
          if (!usersNeighborhood.hasPrefWithUserID(index) || index == theUserID) {
            usersRatedNeighborhood.remove(index);
            j--;
          }
        }
      }
    }
  }
  double[] weights = null;
  if (!mostSimilar.isEmpty()) {
    weights = getInterpolations(itemID, theNeighborhood, usersRatedNeighborhood);
  }
  int i = 0;
  double preference = 0.0;
  double totalSimilarity = 0.0;
  for (long jitem : theNeighborhood) {
    Float pref = dataModel.getPreferenceValue(theUserID, jitem);
    if (pref != null) {
      double weight = weights[i];
      preference += pref * weight;
      totalSimilarity += weight;
    }
    i++;
  }
  return totalSimilarity == 0.0 ? Float.NaN : (float) (preference / totalSimilarity);
}
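What distinguishes this from plain ItemCF is getInterpolations(): the neighbor weights are not raw similarity scores but are learned by solving a small constrained least-squares problem (the NonNegativeQuadraticOptimizer passed in by the test program below). Roughly, following Bell and Koren's formulation (simplified here; the paper additionally shrinks these estimates toward baseline values):

\hat{r}_{ui} = \frac{\sum_{j \in N(i)} w_{ij}\, r_{uj}}{\sum_{j \in N(i)} w_{ij}},
\qquad
w_{i\cdot} = \arg\min_{w \ge 0} \sum_{v \in U(i)} \Big( r_{vi} - \sum_{j \in N(i)} w_{ij}\, r_{vj} \Big)^2

where N(i) is the item neighborhood of item i and U(i) is the set of users who rated item i together with its neighbors.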
Test program:
public static void itemKNN(DataModel dataModel) throws TasteException {
ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
RecommenderBuilder recommenderBuilder = RecommendFactory.itemKNNRecommender(itemSimilarity, new NonNegativeQuadraticOptimizer(), 10);
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.5
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(107,5.000000)(104,3.501168)(106,3.498198)
uid:2,(105,2.878995)(106,2.878086)(107,2.000000)
uid:3,(103,3.667444)(102,3.667161)(106,3.667019)
uid:4,(107,4.750247)(102,4.122755)(105,4.122709)
uid:5,(107,3.833621)
(TODO: to be continued)
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
  double[] userFeatures = factorization.getUserFeatures(userID);
  double[] itemFeatures = factorization.getItemFeatures(itemID);
  double estimate = 0;
  for (int feature = 0; feature < userFeatures.length; feature++) {
    estimate += userFeatures[feature] * itemFeatures[feature];
  }
  return (float) estimate;
}
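So estimatePreference() is simply the dot product of the user's and the item's latent feature vectors produced by the Factorizer. A minimal sketch of wiring SVDRecommender up directly, mirroring the factory call in the test program below (the ALSWRFactorizer arguments are numFeatures, lambda, and numIterations):

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class SvdSketch {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("datafile/item.csv"));
        // factorize the rating matrix: 10 latent features, lambda = 0.05, 10 ALS iterations
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 10, 0.05, 10);
        Recommender recommender = new SVDRecommender(model, factorizer);
        // predicted rating = dot product of user 1's and item 107's feature vectors
        System.out.println(recommender.estimatePreference(1, 107));
    }
}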
Test program:
public static void svd(DataModel dataModel) throws TasteException {
RecommenderBuilder recommenderBuilder = RecommendFactory.svdRecommender(new ALSWRFactorizer(dataModel, 10, 0.05, 10));
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.09990564982096355
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(104,4.032909)(105,3.390885)(107,1.858541)
uid:2,(105,3.761718)(106,2.951908)(107,1.561116)
uid:3,(103,5.593422)(102,2.458930)(106,-0.091259)
uid:4,(105,4.068329)(102,3.534025)(107,0.206257)
uid:5,(107,0.105169)
This algorithm was marked @Deprecated in mahout-0.8.
(TODO: to be continued)
Algorithm API: org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
DataModel model = getDataModel();
Float actualPref = model.getPreferenceValue(userID, itemID);
if (actualPref != null) {
return actualPref;
}
buildClusters();
List<RecommendedItem> topRecsForUser = topRecsByUserID.get(userID);
if (topRecsForUser != null) {
for (RecommendedItem item : topRecsForUser) {
if (itemID == item.getItemID()) {
return item.getValue();
}
}
}
// Hmm, we have no idea. The item is not in the user's cluster
return Float.NaN;
}
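A hedged sketch of assembling this (deprecated) recommender directly, mirroring the factory calls in the test program below; the class names follow the pre-0.8 taste API and may need adjusting for other Mahout versions:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.FarthestNeighborClusterSimilarity;
import org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ClusterSimilarity;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TreeClusterSketch {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("datafile/item.csv"));
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        // cluster-to-cluster similarity = similarity of the farthest pair of members
        ClusterSimilarity clusterSimilarity = new FarthestNeighborClusterSimilarity(similarity);
        // merge user clusters bottom-up until 10 clusters remain
        Recommender recommender = new TreeClusteringRecommender(model, clusterSimilarity, 10);
        System.out.println(recommender.recommend(1, 3));
    }
}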
Test program:
public static void treeCluster(DataModel dataModel) throws TasteException {
UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.LOGLIKELIHOOD, dataModel);
ClusterSimilarity clusterSimilarity = RecommendFactory.clusterSimilarity(RecommendFactory.SIMILARITY.FARTHEST_NEIGHBOR_CLUSTER, userSimilarity);
RecommenderBuilder recommenderBuilder = RecommendFactory.treeClusterRecommender(clusterSimilarity, 10);
RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);
LongPrimitiveIterator iter = dataModel.getUserIDs();
while (iter.hasNext()) {
long uid = iter.nextLong();
List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
RecommendFactory.showItems(uid, list, true);
}
}
Program output:
AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:NaN
Recommender IR Evaluator: [Precision:NaN,Recall:0.0]
Algorithms and their applicable scenarios:
Algorithm evaluation results:
Comparing the evaluation scores of the algorithms above: itemCF, itemKNN, and SVD have the best Precision/Recall scores, and itemCF and SVD also have the lowest AVERAGE_ABSOLUTE_DIFFERENCE. So from the algorithmic standpoint we can tell which algorithms are more accurate, or will retrieve more of the dataset.
Some other factors to consider: