Each film document is required to have the following format:

```json
{ "id": "65006", "title": "Impulse", "year": "2008", "genre": ["Mystery","Thriller"], "indicators": ["154", "272", "154", "308", "535", "583", "593", "668", "670", "680", "702", "745"], "numFields": 12 }
```
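As a small sketch (the helper name is my own, not from the tutorial), a document of this shape can be built in Python, with `numFields` kept in sync with the number of indicator IDs:

```python
import json

def make_film_doc(movie_id, title, year, genre, indicators):
    """Build a film document; numFields mirrors the indicator count."""
    return {
        "id": movie_id,
        "title": title,
        "year": year,
        "genre": genre,
        "indicators": indicators,
        "numFields": len(indicators),
    }

doc = make_film_doc("65006", "Impulse", "2008",
                    ["Mystery", "Thriller"],
                    ["154", "272", "308", "535"])
print(json.dumps(doc))
```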
Elasticsearch is operated on with commands of the form

```
curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'
```

where the body is JSON. See the Elasticsearch 101 tutorial for details.
In this example, the index is created as follows:

```
curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film": {
      "properties": {
        "numFields": { "type": "integer" }
      }
    }
  }
}'
```
Unpack the downloaded data file and look at movies.dat. Its records have the format

```
MovieID::Title::Genres
```

for example: `65006::Impulse (2008)::Mystery|Thriller`
Use Python to transform it for import into Elasticsearch:

```python
import re
import json

with open('movies.dat', 'r') as dat_file:
    for line in dat_file:
        # Split "MovieID::Title::Genres" into its three fields
        fields = re.sub("::", "\t", line).rstrip().split("\t")
        if len(fields) == 3:
            # Strip quotes and the trailing " (YYYY)" from the title
            title = re.sub(r" \(.*\)$", "", fields[1].replace('"', ''))
            genre = fields[2].split('|')
            # Emit a bulk-API action line followed by the document line
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fields[0])
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }'
                  % (fields[0], title, fields[1][-5:-1], json.dumps(genre)))
```
Running

```
$ python index.py > index.json
```

formats the data into index.json, whose format is:

```json
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }
```
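The bulk format alternates an action line with a document line. As a quick sanity check (this helper is my own, not part of the tutorial), you can verify that every `create` action's `_id` matches the document that follows it:

```python
import json

# Inline sample in the same shape as index.json
bulk = '''{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }'''

lines = [json.loads(l) for l in bulk.splitlines()]
# Even lines must be create actions; odd lines the matching documents
for action, source in zip(lines[0::2], lines[1::2]):
    assert action["create"]["_id"] == source["id"]
print("bulk pairs OK:", len(lines) // 2)
```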
It can now be imported into Elasticsearch:

```
curl -s -XPOST localhost:9200/_bulk --data-binary @index.json
```

Query results can now be fetched from Elasticsearch with any REST client or with curl.
The movie details are done; next, generate the recommendation data. Look at the data file ratings.dat, whose format is

```
UserID::MovieID::Rating::Timestamp
```

for example:

```
71567::2294::5::912577968
71567::2338::2::912578016
```
ratings.dat uses `::` as its separator, but Mahout requires `\t`, so it needs to be reformatted:

```
sed -i 's/::/\t/g' ratings.dat
```
This rewrites ratings.dat into the format `UserID MovieID Rating Timestamp`, e.g.:

```
71567   2294    5       912580553
71567   2338    2       912580553
```
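If sed is unavailable, the same `::` → tab conversion can be sketched in Python (a minimal equivalent, not from the tutorial):

```python
def to_mahout_format(line):
    """Convert a ratings.dat record from '::'-separated to tab-separated."""
    return line.replace("::", "\t")

record = "71567::2294::5::912577968"
converted = to_mahout_format(record)
print(converted)
```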
Mahout can now compute item similarities over the data:

```
mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp
```
Since only computation is involved here, you can set `MAHOUT_LOCAL=true` and run without Hadoop. This example uses `SIMILARITY_LOGLIKELIHOOD` (Log Likelihood Ratio, LLR), but other algorithms can be used as well. The generated files land in the /user/user01/mloutput directory, with the format `item1id item2id similarity`, e.g.:

```
64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588
```
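For intuition, LLR scores an item pair from its 2x2 co-occurrence counts: pairs rated together far more often than chance get high scores. A minimal sketch of Ted Dunning's log-likelihood ratio (my own re-implementation for illustration, not Mahout's code):

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy used by the LLR formula."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: users who rated both items; k12/k21: only one; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(llr(10, 0, 0, 10))  # strongly associated pair -> large score
print(llr(5, 5, 5, 5))    # independent pair -> 0
```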
Now the results generated above are added into Elasticsearch, e.g.:

```json
{ "id": "65006", "title": "Impulse", "year": "2008", "genre": ["Mystery","Thriller"], "indicators": ["1076", "1936", "2057", "2204"], "numFields": 4 }
```
Use Python to read the file and convert it to JSON:

```python
import csv
import json

def emit(movie_id, indicators):
    """Print a bulk-API update action and its partial document."""
    update = {"update": {"_id": movie_id}}
    doc = {"doc": {"indicators": indicators, "numFields": len(indicators)}}
    print(json.dumps(update))
    print(json.dumps(doc))

### read the output from Mahout and collect indicators per movie ###
with open('/user/user01/mloutput/part-r-00000', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    for row in csv_reader:
        movie_id = row[0]
        if movie_id != old_id and old_id != "":
            emit(old_id, indicators)
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = movie_id
    if old_id != "":
        emit(old_id, indicators)  # flush the final movie, which the loop misses
```
Run

```
$ python update.py > update.json
```

The result, update.json, has the format:

```json
{"update": {"_id": "1"}}
{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}
```
Import it into Elasticsearch:

```
curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo
```
```
curl 'http://master41:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  },
  "fields": ["_id", "title", "genre"],
  "size": "8"
}'
```
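This query matches films whose indicators overlap the movies the user has watched (1237 and 551), excludes those movies themselves, and uses a seeded random score to break ties. It can be sketched as a small Python helper that builds the same request body (the helper name and parameters are my own, not from the tutorial):

```python
import json

def recommendation_query(watched_ids, size=8, seed="48"):
    """Build a function_score query recommending films similar to watched_ids."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Films co-occurring with any watched movie
                        "must": [{"match": {"indicators": " ".join(watched_ids)}}],
                        # but not the watched movies themselves
                        "must_not": [{"ids": {"values": watched_ids}}],
                    }
                },
                "functions": [{"random_score": {"seed": seed}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": str(size),
    }

body = recommendation_query(["1237", "551"])
print(json.dumps(body, indent=2))
```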