[Gandalf] Running the distributed ItemCF recommendation algorithm on Mahout 0.9 + CDH 5.2

Date: 2019-11-12

Environment:

hadoop-2.5.0-cdh5.2.0

mahout-0.9-cdh5.2.0

Introduction

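Mahout has announced that it will no longer continue MapReduce-based development and is migrating to Spark. In practice, however, our company's cluster does not have enough memory to feed Spark, a beast that eats memory for breakfast. Add project schedule pressure and the developers' current skill set, and we have no choice but to keep using Mahout on MapReduce for a while longer.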

This post records how to run ItemCF on Hadoop from the command line.

Background

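I had previously read several articles on programming ItemCF on Hadoop with Mahout. They all describe how to implement ItemCF on Hadoop by writing code against Mahout. Having had no time to investigate it myself, I simply followed that approach, for example the following code, which appears on many blogs: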

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class ItemCFHadoop {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ItemCFHadoop.class);

        // Apply generic Hadoop options to conf and keep only the job-specific arguments.
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();

        if (remainingArgs.length != 5) {
            System.out.println("args length: " + remainingArgs.length);
            System.err.println("Usage: hadoop jar <jarname> <package>.ItemCFHadoop <inputpath> <outputpath> <tmppath> <booleanData> <similarityClassname>");
            System.exit(2);
        }

        System.out.println("input : " + remainingArgs[0]);
        System.out.println("output : " + remainingArgs[1]);
        System.out.println("tempdir : " + remainingArgs[2]);
        System.out.println("booleanData : " + remainingArgs[3]);
        System.out.println("similarityClassname : " + remainingArgs[4]);

        // Rebuild the option string that RecommenderJob accepts on the command line.
        StringBuilder sb = new StringBuilder();
        sb.append("--input ").append(remainingArgs[0]);
        sb.append(" --output ").append(remainingArgs[1]);
        sb.append(" --tempDir ").append(remainingArgs[2]);
        sb.append(" --booleanData ").append(remainingArgs[3]);
        sb.append(" --similarityClassname ").append(remainingArgs[4]);

        conf.setJobName("ItemCFHadoop");

        RecommenderJob job = new RecommenderJob();
        job.setConf(conf);

        // Splitting on a single space assumes that no argument value contains a space.
        job.run(sb.toString().split(" "));
    }
}

The code above works: pass the correct arguments on the command line and the ItemCF on Hadoop job completes successfully.

But look at what this code actually does: it performs the command line's work inside Java. Why not run the job from the command line directly?

Official documentation

Those earlier authors pointed the way: the ItemCF on Hadoop job is implemented by the class org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.

The official Javadoc (https://builds.apache.org/job/Mahout-Quality/javadoc/ ) describes the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class as follows:

Runs a completely distributed recommender job as a series of mapreduces.

Preferences in the input file should look like userID, itemID[, preferencevalue]

Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).

The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.

Command line arguments specific to this class are:

--input (path): Directory containing one or more text files with the preference data

--output (path): output path where recommender output should go

--tempDir (path): Specifies a directory where the job may place temp files (default "temp")

--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure

--usersFile (path): only compute recommendations for user IDs contained in this file (optional)

--itemsFile (path): only include item IDs from this file in the recommendations (optional)

--filterFile (path): file containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)

--numRecommendations (integer): Number of recommendations to compute per user (10)

--booleanData (boolean): Treat input data as having no pref values (false)

--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)

--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)

--minPrefsPerUser (integer): ignore users with less preferences than this in the similarity computation (1)

--maxPrefsPerUserInItemSimilarity (integer): max number of preferences to consider per user in the item similarity computation phase, users with more preferences will be sampled down (1000)

--threshold (double): discard item pairs with a similarity value below this

The original Javadoc text is kept above verbatim. The same information is restated below, with a few additional notes:

Runs a fully distributed recommendation job, implemented as a series of MapReduce jobs.

Preference data in the input files has the format: userID, itemID[, preferencevalue].

The preferencevalue field is optional.

userID and itemID are parsed as long values; preferencevalue is parsed as a double.
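
To make the input format concrete, here is a minimal sketch of how one such line could be parsed in plain Java. The class PreferenceLine and its parse method are hypothetical names made up for this illustration; they are not part of Mahout, and Mahout's own parsing code may differ in its details (the sketch assumes comma-separated fields only).

```java
import java.util.OptionalDouble;

/** Hypothetical helper (not part of Mahout) that illustrates the input line format. */
final class PreferenceLine {
    final long userId;
    final long itemId;
    final OptionalDouble value; // empty when the line carries no preference value

    private PreferenceLine(long userId, long itemId, OptionalDouble value) {
        this.userId = userId;
        this.itemId = itemId;
        this.value = value;
    }

    /** Parses "userID,itemID[,preferencevalue]"; assumes comma-separated fields. */
    static PreferenceLine parse(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException(
                    "Expected userID,itemID[,preferencevalue] but got: " + line);
        }
        long userId = Long.parseLong(fields[0]);   // IDs are parsed as longs
        long itemId = Long.parseLong(fields[1]);
        OptionalDouble value = fields.length == 3
                ? OptionalDouble.of(Double.parseDouble(fields[2])) // value is parsed as a double
                : OptionalDouble.empty();
        return new PreferenceLine(userId, itemId, value);
    }

    public static void main(String[] args) {
        PreferenceLine rated = parse("1,101,2");   // line with a preference value
        PreferenceLine unrated = parse("3,105");   // line without one (boolean-style data)
        System.out.println(rated.userId + " " + rated.itemId + " " + rated.value);
        System.out.println(unrated.userId + " " + unrated.itemId + " " + unrated.value);
    }
}
```

A line without the third field corresponds to the case where the user simply expresses a preference for an item with no degree attached.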

The class accepts the following command-line arguments:

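--input (path): Directory holding the user preference data. It may contain one or more text files of preference data.
--output (path): Output directory for the computed results.
--tempDir (path): Directory for temporary files (default "temp").
--similarityClassname (classname): The vector similarity class. Available similarity implementations include CityBlockSimilarity, CooccurrenceCountSimilarity, CosineSimilarity, CountbasedMeasure, EuclideanDistanceSimilarity, LoglikelihoodSimilarity, PearsonCorrelationSimilarity, and TanimotoCoefficientSimilarity. Note that the argument must include the package name.
--usersFile (path): Path to one or more files of user IDs. Recommendations are computed only for the user IDs contained in those files (optional).
--itemsFile (path): Path to one or more files of item IDs. Only the item IDs contained in those files are included in the recommendations (optional).
--filterFile (path): Path to a file of comma-separated userID,itemID pairs. For each pair, the item is excluded from that user's recommendations (optional).
--numRecommendations (integer): Number of items to recommend per user. Default: 10.
--booleanData (boolean): Set to true if the input data contains no preference values. Default: false.
--maxPrefsPerUser (integer): Maximum number of preferences used per user in the final recommendation phase. Default: 10.
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item. Default: 100.
--minPrefsPerUser (integer): Users with fewer preferences than this value are ignored in the similarity computation. Default: 1.
--maxPrefsPerUserInItemSimilarity (integer): Maximum number of preferences considered per user in the item similarity phase; users with more preferences are sampled down. Default: 1000.
--threshold (double): Item pairs with a similarity below this threshold are discarded.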
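
As an illustration of how these options fit together, the sketch below assembles a "--name value" argument array directly from a map of options. The class RecommenderArgs and its build method are hypothetical helpers written for this post, not part of Mahout. The option names come from the list above and the values are only examples. Building the array element by element, rather than joining a string and splitting it on spaces as the Java code earlier in this post does, keeps a value intact even if it contains a space.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): turns option name/value pairs into an argument array. */
final class RecommenderArgs {

    /** Emits "--name", "value" for every entry, in the map's iteration order. */
    static String[] build(Map<String, String> options) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> option : options.entrySet()) {
            args.add("--" + option.getKey());
            args.add(option.getValue()); // kept as one element, even if it contains spaces
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // LinkedHashMap preserves insertion order, so the printed command line is predictable.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("input", "/UserPreference");   // example paths, not required names
        options.put("output", "/CFOutput");
        options.put("tempDir", "/tmp");
        options.put("similarityClassname",
                "org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity");
        options.put("numRecommendations", "10");   // same as the documented default

        System.out.println(String.join(" ", build(options)));
    }
}
```

An array built this way has the same shape as the one handed to RecommenderJob's run method in the earlier Java code, assuming the Mahout and Hadoop jars are on the classpath.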

Running from the command line

User preference data used for testing [userID, itemID, preferencevalue]:

1,101,2

1,102,5

1,103,1

2,101,1

2,102,3

2,103,2

2,104,6

3,101,1

3,104,1

3,105,1

3,107,2

4,101,2

4,103,2

4,104,5

4,106,3

5,101,3

5,102,5

5,103,6

5,104,8

5,105,1

5,106,1

Once the underlying environment is configured correctly, running the following command performs the ItemCF on Hadoop recommendation computation:

hadoop jar $MAHOUT_HOME/mahout-core-0.9-cdh5.2.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /UserPreference --output /CFOutput --tempDir /tmp --similarityClassname org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity

Note: only the most important arguments are used here. Tuning the many other arguments requires testing against the actual project.

Results [userID [itemID1:score1, itemID2:score2, ...]]:

1 [104:3.4706533,106:1.7326527,105:1.5989419]

2 [106:3.8991857,105:3.691359]

3 [106:1.0,103:1.0,102:1.0]

4 [105:3.2909648,102:3.2909648]

5 [107:3.2898135]
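
For downstream processing, a result line in the format shown above can be split into a user ID and an ordered map of item scores. The following is a minimal sketch under that assumption. RecommendationLine is a hypothetical helper name, not a Mahout class, and the exact separator between the user ID and the bracket in the real output file is not assumed: any whitespace is tolerated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Mahout) for lines like "1 [104:3.47,106:1.73]". */
final class RecommendationLine {

    /** Returns the user ID, i.e. everything before the opening bracket. */
    static long parseUserId(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("Missing '[' in: " + line);
        }
        return Long.parseLong(line.substring(0, open).trim());
    }

    /** Returns itemID -> score in the order the items appear on the line. */
    static Map<Long, Double> parseItems(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Expected userID [item:score,...] but got: " + line);
        }
        Map<Long, Double> scores = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return scores; // a user with no recommendations
        }
        for (String pair : body.split(",")) {
            String[] itemAndScore = pair.trim().split(":");
            scores.put(Long.parseLong(itemAndScore[0]), Double.parseDouble(itemAndScore[1]));
        }
        return scores;
    }

    public static void main(String[] args) {
        String line = "1 [104:3.4706533,106:1.7326527,105:1.5989419]";
        System.out.println("user " + parseUserId(line) + " -> " + parseItems(line));
    }
}
```

A LinkedHashMap is used so that the items keep the order in which they appear on the line.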