Building a Recommendation System with Mahout and Elasticsearch

Date: 2019-11-06

Contents

Software
Steps
Controlling relevance
Summary
References

This article shows how to build a recommendation engine with Elasticsearch and the MapR Sandbox for Hadoop, which includes Apache Mahout, using only a small amount of code.

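This tutorial gives step-by-step instructions on how to:

Use the movie rating data available at http://grouplens.org/datasets/movielens/
Use Apache Mahout's collaborative filtering to build and train a machine learning model
Use Elasticsearch's search technology to simplify development of the recommendation system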

This article has moved to: http://www.bdata-cap.com/newsinfo/1712675.html

Software

This tutorial runs on the MapR Sandbox. Elasticsearch and Mahout must also be installed on the Sandbox.

Download the 10M MovieLens data from http://grouplens.org/datasets/movielens/
Install Mahout
Install Elasticsearch

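Steps

Step 1: Index the movie metadata into Elasticsearch

In Elasticsearch, all fields of a document are indexed by default. The simplest documents have a single level of JSON structure. Documents live in an index, and a document's type tells Elasticsearch how to interpret its fields.

You can think of an Elasticsearch index as a database instance in a relational database, a type as a table, a field as a column definition (although fields mean something broader in Elasticsearch), and a document as a row in a table.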

In this example the document type is film, with the following fields: movie ID (id), title (title), release year (year), genre (genre), indicators (indicators), and the number of entries in the indicators array (numFields):

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["154","272","154","308", "535", "583", "593", "668", "670", "680", "702", "745"],
  "numFields": 12
}

You talk to Elasticsearch through its RESTful API on port 9200, or from the command line with curl. See the Elasticsearch REST interface and the Elasticsearch 101 tutorial.

curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

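The put mapping command of the Elasticsearch REST API defines a document type. The request below creates a mapping named film in the bigmovie index. The mapping defines a numFields field of type integer. By default all fields are stored and indexed, integers included.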

curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film": {
      "properties": {
        "numFields": { "type": "integer" }
      }
    }
  }
}'

The movie information is in the movies.dat file. Each line of the file describes one movie, with the following fields:

MovieID::Title::Genres

For example:

65006::Impulse (2008)::Mystery|Thriller

Figure 1. The movie Impulse (2008), genres Mystery/Thriller

The Python script below converts the data in movies.dat to JSON so that it can be imported into Elasticsearch:

import re
import json

# Convert movies.dat (MovieID::Title::Genres) into Elasticsearch bulk-API lines.
with open('movies.dat') as movie_file:
    for line in movie_file:
        fixed = line.rstrip().split('::')
        if len(fixed) == 3:
            # Drop double quotes and the trailing " (year)" from the title.
            title = re.sub(r' \(.*\)$', '', fixed[1].replace('"', ''))
            genre = fixed[2].split('|')
            # The year is the four characters inside the final parentheses.
            year = fixed[1][-5:-1]
            # Action line first, then the document itself.
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fixed[0])
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }'
                  % (fixed[0], title, year, json.dumps(genre)))

Run the Python script and redirect the converted output to index.json:

$ python index.py > index.json

This produces the format that Elasticsearch expects:

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }

{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }

{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }

{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }

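Each pair of lines in the file names the index and type, then supplies the movie's data. This is the input format for bulk importing data into Elasticsearch.

The Elasticsearch bulk API performs many operations on an index (such as create, index, update, and delete) in a single request. The command below tells Elasticsearch to bulk load the contents of index.json: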

curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo
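
Before posting a bulk file it can be worth checking that every action line is followed by a matching document line. The sketch below is only an illustration for the index.json layout shown above; the function name count_bulk_pairs is hypothetical and not part of any Elasticsearch client.

```python
import json


def count_bulk_pairs(text):
    """Validate "create" action/document line pairs and return how many there are.

    Assumes the layout produced for index.json: a {"create": ...} action line
    followed by a document line whose "id" equals the action's "_id".
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("bulk payload must contain action/document line pairs")
    pairs = 0
    for action_line, doc_line in zip(lines[0::2], lines[1::2]):
        action = json.loads(action_line)
        doc = json.loads(doc_line)
        if "create" not in action:
            raise ValueError("expected a create action, got: %s" % action_line)
        if action["create"]["_id"] != doc["id"]:
            raise ValueError("action _id and document id differ: %s" % action_line)
        pairs += 1
    return pairs
```

For example, count_bulk_pairs(open('index.json').read()) returns the number of movies that the bulk request will try to create.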

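Once the movie information is loaded, you can query it through the REST API. You can also use Sense, the Elasticsearch plugin for Chrome (also provided as a Kibana 4 plugin).

For example, the movie whose id is 1237 can be retrieved with a GET request to http://localhost:9200/bigmovie/film/1237.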

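Step 2: Use Mahout to create movie indicators from the user rating data

The ratings are in the ratings.dat file. Each line of the file records one user's rating of one movie, in the following format: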

UserID::MovieID::Rating::Timestamp

For example:

71567::2294::5::912577968

71567::2338::2::912578016

The ratings.dat file uses "::" as its delimiter, which has to be converted to tabs before Mahout can use the file. You can replace "::" with a tab using sed:

sed -i 's/::/\t/g' ratings.dat

This command opens the file, replaces every "::" with a tab, and saves it again. In-place updates are only supported with MapR NFS, so this command probably won't work on other NFS-on-Hadoop implementations. MapR Direct Access NFS allows files to be modified (it supports random reads and writes) and accessed by mounting the Hadoop cluster over NFS.

The sed command produces content in the following format, which Mahout accepts as input:

71567    2294    5    912580553

71567    2338    2    912580553

The general format is: user item rating timestamp. This example does not use the timestamp.
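
Where in-place editing with sed is not available, the same conversion can be done by writing a new file. This is a minimal sketch rather than part of the original tutorial; the function names and the idea of writing to a separate output path are assumptions.

```python
def convert_rating_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into a tab-separated line."""
    return "\t".join(line.rstrip("\n").split("::"))


def convert_ratings(src_path, dst_path):
    """Write a tab-separated copy of src_path to dst_path (both paths are placeholders)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(convert_rating_line(line) + "\n")


# One of the sample ratings from above, converted to Mahout's input format.
print(convert_rating_line("71567::2294::5::912577968"))
```

Calling convert_ratings('ratings.dat', 'ratings.tsv') would leave the original file untouched; the output name ratings.tsv is only an example.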

Start the Mahout item similarity (itemsimilarity) job with the following command:

mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp

The argument --similarityClassname SIMILARITY_LOGLIKELIHOOD (short form -s) tells the recommender to use the Log Likelihood Ratio (LLR) method to determine which items co-occur anomalously often, and thus which co-occurrences can serve as indicators of preference. The similarity threshold defaults to 0.9; it can be adjusted for the use case with the --threshold parameter, which discards pairs with lower similarity (the default is a fine choice). Mahout computes the recommendations by launching a series of Hadoop MapReduce jobs and finally writes an output file to the /user/user01/mloutput directory. The output file has the following format:

64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588

The general format is: item1id item2id similarity.
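
To make the LLR method more concrete, the sketch below computes the log-likelihood ratio for a 2x2 co-occurrence table using the entropy formulation. The function names are mine. The final rescaling to a 0..1 similarity, 1 - 1/(1 + LLR), is my understanding of how Mahout turns the ratio into scores like those above; treat it as an assumption rather than a statement about Mahout's source.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 table of user counts.

    k11: users who rated both item A and item B
    k12: users who rated A but not B
    k21: users who rated B but not A
    k22: users who rated neither
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Rounding can push the difference slightly below zero; clamp it.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Assumed rescaling of the LLR into the range [0, 1)."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


# Items that co-occur far more often than chance would predict score close to 1.
print(llr_similarity(100, 5, 5, 1000))
```

When two items are statistically independent, as in log_likelihood_ratio(10, 10, 10, 10), the ratio is zero up to rounding, so the pair would not serve as an indicator.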

Step 3: Add the movie indicators to the movie documents in Elasticsearch

Next, we add the indicators from the output file above to the film documents in Elasticsearch. For example, a movie's indicators go into its indicators field:

{
  "id": "65006",
  "title": "Impulse",
  "year": "2008",
  "genre": ["Mystery","Thriller"],
  "indicators": ["1076", "1936", "2057", "2204"],
  "numFields": 4
}

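In Figure 2, the table on the left shows the indicators that each document contains, and the table on the right shows which documents contain each indicator:

Figure 2. Documents and indicators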

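If you search for movies whose indicators include both 1237 and 551, this example returns the document (movie) with id 8298. If you search for 1237 or 551, it returns the movies with ids 8298, 3, and 64418.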
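
The two views of the data correspond to a forward index and an inverted index. The sketch below builds a tiny inverted index in plain Python and runs the AND and OR searches described above. The document contents are hypothetical: only the search results match the text, and which of movies 3 and 64418 holds which indicator is an assumption.

```python
from collections import defaultdict

# Hypothetical documents: movie id -> indicators (chosen to reproduce the example).
docs = {
    "8298": ["1237", "551"],
    "3": ["1237"],
    "64418": ["551"],
}


def build_inverted_index(documents):
    """Map each indicator to the set of movie ids whose documents contain it."""
    index = defaultdict(set)
    for doc_id, indicators in documents.items():
        for indicator in indicators:
            index[indicator].add(doc_id)
    return index


def search_all(index, terms):
    """Movies whose indicators contain every term (AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index, terms):
    """Movies whose indicators contain at least one term (OR)."""
    return set().union(*(index.get(term, set()) for term in terms))


index = build_inverted_index(docs)
print(sorted(search_all(index, ["1237", "551"])))
print(sorted(search_any(index, ["1237", "551"])))
```

Elasticsearch maintains this kind of inverted index for the indicators field, which is why a plain search query can act as the recommender.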

The script below reads Mahout's output file part-r-00000, builds an indicators array for each movie, and writes a JSON file that updates the indicators field of the film type in the Elasticsearch bigmovie index.

import json

def emit(movie_id, indicators):
    # One bulk "update" action line followed by its partial document.
    print(json.dumps({"update": {"_id": movie_id}}))
    print(json.dumps({"doc": {"indicators": indicators, "numFields": len(indicators)}}))

### read the output from Mahout and collect the indicators per movie ###
with open('/user/user01/mloutput/part-r-00000') as mahout_output:
    old_id = ""
    indicators = []
    for line in mahout_output:
        row = line.rstrip('\n').split('\t')
        movie_id = row[0]
        if movie_id != old_id and old_id != "":
            emit(old_id, indicators)
            indicators = []
        indicators.append(row[1])
        old_id = movie_id
    # Flush the last movie, which the loop above never emits on its own.
    if old_id != "":
        emit(old_id, indicators)

The command below runs the update.py Python script and writes its output to update.json:

$ python update.py > update.json

The Python script above creates a file with the following content:

{"update": {"_id": "1"}}

{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}

{"update": {"_id": "2"}}

{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}

From the command line, call the Elasticsearch REST bulk request with curl, passing update.json as input, to update the indicators field:

$ curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo

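Step 4: Recommend by searching the indicators field of the film index

You can now make recommendations by querying the indicators field of the film documents. For example, if someone likes movies 1237 and 551 and you want to recommend similar movies, run the following Elasticsearch query. It returns movies whose indicators array contains 1237 or 551, where 1237 = Seventh Seal and 551 = Nightmare Before Christmas: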

curl 'http://localhost:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  },
  "fields": ["_id", "title", "genre"],
  "size": "8"
}'

The query above finds movies whose indicators include 1237 or 551, excluding movies 1237 and 551 themselves. Run in the Sense plugin, it returns the recommendations "A Man Named Pearl" (a documentary) and "Used People".
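
The same request body can be generated for any list of liked movies. The helper below is a sketch that mirrors the query above; build_recommendation_query and its parameters are hypothetical names, and the defaults simply repeat the values used in this tutorial.

```python
import json


def build_recommendation_query(liked_ids, size=8, seed=48):
    """Build the search body for a user who likes the given movie ids."""
    liked = [str(movie_id) for movie_id in liked_ids]
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Match movies whose indicators mention any liked movie...
                        "must": [{"match": {"indicators": " ".join(liked)}}],
                        # ...but never recommend the liked movies themselves.
                        "must_not": [{"ids": {"values": liked}}],
                    }
                },
                "functions": [{"random_score": {"seed": str(seed)}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": str(size),
    }


print(json.dumps(build_recommendation_query([1237, 551]), indent=2))
```

The printed JSON can be passed to curl with -d in place of the hand-written body.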

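Controlling relevance

Full-text search engines sort results by relevance, and Elasticsearch reports each document's relevance score in the _score field. function_score lets you modify that score at query time. random_score generates a score by hashing with a seed value. In the Elasticsearch query below, the random_score function adds variation to the search results in order to achieve dithering:

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  }
}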

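Relevance dithering intentionally includes some lower-ranked, less relevant results in order to broaden the training data fed back to the recommendation engine. Without dithering, tomorrow's training data only teaches the model what it already knows today. Adding dithering helps the recommendation model explore more widely. If the model's answers are close to excellent, dithering can help it find the right ones. Effective dithering lowers accuracy today in exchange for better training data tomorrow (and better future performance, which here includes the algorithm's accuracy). In other words, to keep future recommendations accurate, the influence of the past on the future has to be reduced.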
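
Dithering can also be applied on the client after the results come back. The sketch below uses one common formulation: add Gaussian noise to the logarithm of each result's rank and re-sort. This is not what random_score does inside Elasticsearch, and the dither function and its epsilon parameter are assumptions made for illustration.

```python
import math
import random


def dither(ranked_items, epsilon=2.0, seed=None):
    """Reorder a ranked list by sorting on log(rank) + Gaussian noise.

    epsilon controls the amount of mixing: 1.0 leaves the order unchanged,
    larger values let lower-ranked items move up more often.
    """
    rng = random.Random(seed)
    sigma = math.log(epsilon)
    noisy = [
        (math.log(rank) + rng.gauss(0.0, sigma), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    noisy.sort(key=lambda pair: pair[0])
    return [item for _, item in noisy]


results = ["movie-%d" % n for n in range(1, 21)]
print(dither(results, epsilon=2.0, seed=48)[:8])
```

Because the noise is added to the logarithm of the rank, the top few results tend to stay near the top while deeper results are shuffled more freely. Using a different seed per day or per session changes which lower-ranked items surface.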

Summary

This tutorial showed how to use Apache Mahout and Elasticsearch with the MapR Sandbox to build a basic recommendation engine. You can go beyond a basic recommender and get even better results with a few simple additions to the design, such as cross-recommendation of items, which draws on a variety of interactions and items when making recommendations. More information about these technologies is available in the references below.

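References

To learn more about the components and logic of a recommendation engine, see "An Inside Look at the Components of a Recommendation Engine", which describes in detail the architecture of a recommendation engine, Mahout collaborative filtering, and the Elasticsearch search engine.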

Further resources on recommendation engines, machine learning, and Elasticsearch:
