Each film document is required to have the following format:

```json
{ "id": "65006", "title": "Impulse", "year": "2008", "genre": ["Mystery","Thriller"], "indicators": ["154", "272", "154", "308", "535", "583", "593", "668", "670", "680", "702", "745"], "numFields": 12 }
```
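As a small sketch (the helper name is my own, not from the tutorial), a document of this shape can be built in Python, with `numFields` kept in sync with the number of indicator IDs:

```python
import json

def make_film_doc(movie_id, title, year, genre, indicators):
    """Build a film document; numFields mirrors the indicator count."""
    return {
        "id": movie_id,
        "title": title,
        "year": year,
        "genre": genre,
        "indicators": indicators,
        "numFields": len(indicators),
    }

doc = make_film_doc("65006", "Impulse", "2008",
                    ["Mystery", "Thriller"],
                    ["154", "272", "308", "535"])
print(json.dumps(doc))
```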
Elasticsearch is operated on with commands of the form

```
curl -X<VERB> 'http://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'
```

where the body is JSON. See the Elasticsearch 101 tutorial for details.
In this example, the index is created as follows:

```
curl -XPUT 'http://localhost:9200/bigmovie' -d '
{
  "mappings": {
    "film": {
      "properties": {
        "numFields": { "type": "integer" }
      }
    }
  }
}'
```
Unpack the downloaded data file and look at movies.dat. Its records have the format

```
MovieID::Title::Genres
```

for example: `65006::Impulse (2008)::Mystery|Thriller`
Use Python to transform it for import into Elasticsearch:

```python
import re
import json

with open('movies.dat', 'r') as dat_file:
    for line in dat_file:
        # Split "MovieID::Title::Genres" into its three fields
        fields = re.sub("::", "\t", line).rstrip().split("\t")
        if len(fields) == 3:
            # Strip quotes and the trailing " (YYYY)" from the title
            title = re.sub(r" \(.*\)$", "", fields[1].replace('"', ''))
            genre = fields[2].split('|')
            # Emit a bulk-API action line followed by the document line
            print('{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "%s" } }' % fields[0])
            print('{ "id": "%s", "title" : "%s", "year":"%s" , "genre":%s }'
                  % (fields[0], title, fields[1][-5:-1], json.dumps(genre)))
```
Running

```
$ python index.py > index.json
```

formats the data into index.json, whose format is:

```json
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }
```
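The bulk format alternates an action line with a document line. As a quick sanity check (this helper is my own, not part of the tutorial), you can verify that every `create` action's `_id` matches the document that follows it:

```python
import json

# Inline sample in the same shape as index.json
bulk = '''{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "1" } }
{ "id": "1", "title" : "Toy Story", "year":"1995" , "genre":["Adventure", "Animation", "Children", "Comedy", "Fantasy"] }
{ "create" : { "_index" : "bigmovie", "_type" : "film", "_id" : "2" } }
{ "id": "2", "title" : "Jumanji", "year":"1995" , "genre":["Adventure", "Children", "Fantasy"] }'''

lines = [json.loads(l) for l in bulk.splitlines()]
# Even lines must be create actions; odd lines the matching documents
for action, source in zip(lines[0::2], lines[1::2]):
    assert action["create"]["_id"] == source["id"]
print("bulk pairs OK:", len(lines) // 2)
```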
It can now be imported into Elasticsearch:

```
curl -s -XPOST localhost:9200/_bulk --data-binary @index.json
```

Query results can now be fetched from Elasticsearch with any REST client or with curl.
The movie details are done; next, generate the recommendation data. Look at the data file ratings.dat, whose format is

```
UserID::MovieID::Rating::Timestamp
```

for example:

```
71567::2294::5::912577968
71567::2338::2::912578016
```
ratings.dat uses `::` as its separator, but Mahout requires `\t`, so it needs to be reformatted:

```
sed -i 's/::/\t/g' ratings.dat
```
This rewrites ratings.dat into the format `UserID MovieID Rating Timestamp`, e.g.:

```
71567   2294    5       912580553
71567   2338    2       912580553
```
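If sed is unavailable, the same `::` → tab conversion can be sketched in Python (a minimal equivalent, not from the tutorial):

```python
def to_mahout_format(line):
    """Convert a ratings.dat record from '::'-separated to tab-separated."""
    return line.replace("::", "\t")

record = "71567::2294::5::912577968"
converted = to_mahout_format(record)
print(converted)
```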
Mahout can now compute item similarities over the data:

```
mahout itemsimilarity \
  --input /user/user01/mlinput/ratings.dat \
  --output /user/user01/mloutput \
  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData TRUE \
  --tempDir /user/user01/temp
```
Since only computation is involved here, you can set `MAHOUT_LOCAL=true` and run without Hadoop. This example uses `SIMILARITY_LOGLIKELIHOOD` (Log Likelihood Ratio, LLR), but other algorithms can be used as well. The generated files land in the /user/user01/mloutput directory, with the format `item1id item2id similarity`, e.g.:

```
64957   64997   0.9604835425701245
64957   65126   0.919355104432831
64957   65133   0.9580439772229588
```
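For intuition, LLR scores an item pair from its 2x2 co-occurrence counts: pairs rated together far more often than chance get high scores. A minimal sketch of Ted Dunning's log-likelihood ratio (my own re-implementation for illustration, not Mahout's code):

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy used by the LLR formula."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: users who rated both items; k12/k21: only one; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(llr(10, 0, 0, 10))  # strongly associated pair -> large score
print(llr(5, 5, 5, 5))    # independent pair -> 0
```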
Now the results generated above are added into Elasticsearch, e.g.:

```json
{ "id": "65006", "title": "Impulse", "year": "2008", "genre": ["Mystery","Thriller"], "indicators": ["1076", "1936", "2057", "2204"], "numFields": 4 }
```
Use Python to read the file and convert it to JSON:

```python
import csv
import json

def emit(movie_id, indicators):
    """Print a bulk-API update action and its partial document."""
    update = {"update": {"_id": movie_id}}
    doc = {"doc": {"indicators": indicators, "numFields": len(indicators)}}
    print(json.dumps(update))
    print(json.dumps(doc))

### read the output from Mahout and collect indicators per movie ###
with open('/user/user01/mloutput/part-r-00000', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    old_id = ""
    indicators = []
    for row in csv_reader:
        movie_id = row[0]
        if movie_id != old_id and old_id != "":
            emit(old_id, indicators)
            indicators = [row[1]]
        else:
            indicators.append(row[1])
        old_id = movie_id
    if old_id != "":
        emit(old_id, indicators)  # flush the final movie, which the loop misses
```
Run

```
$ python update.py > update.json
```

The result, update.json, has the format:

```json
{"update": {"_id": "1"}}
{"doc": {"indicators": ["75", "118", "494", "512", "609", "626", "631", "634", "648", "711", "761", "810", "837", "881", "910", "1022", "1030", "1064", "1301", "1373", "1390", "1588", "1806", "2053", "2083", "2090", "2096", "2102", "2286", "2375", "2378", "2641", "2857", "2947", "3147", "3429", "3438", "3440", "3471", "3483", "3712", "3799", "3836", "4016", "4149", "4544", "4545", "4720", "4732", "4901", "5004", "5159", "5309", "5313", "5323", "5419", "5574", "5803", "5841", "5902", "5940", "6156", "6208", "6250", "6383", "6618", "6713", "6889", "6890", "6909", "6944", "7046", "7099", "7281", "7367", "7374", "7439", "7451", "7980", "8387", "8666", "8780", "8819", "8875", "8974", "9009", "25947", "27721", "31660", "32300", "33646", "40339", "42725", "45517", "46322", "46559", "46972", "47384", "48150", "49272", "55668", "63808"], "numFields": 102}}
{"update": {"_id": "2"}}
{"doc": {"indicators": ["15", "62", "153", "163", "181", "231", "239", "280", "333", "355", "374", "436", "473", "485", "489", "502", "505", "544", "546", "742", "829", "1021", "1474", "1562", "1588", "1590", "1713", "1920", "1967", "2002", "2012", "2045", "2115", "2116", "2139", "2143", "2162", "2296", "2338", "2399", "2408", "2447", "2616", "2793", "2798", "2822", "3157", "3243", "3327", "3438", "3440", "3477", "3591", "3614", "3668", "3802", "3869", "3968", "3972", "4090", "4103", "4247", "4370", "4467", "4677", "4686", "4846", "4967", "4980", "5283", "5313", "5810", "5843", "5970", "6095", "6383", "6385", "6550", "6764", "6863", "6881", "6888", "6952", "7317", "8424", "8536", "8633", "8641", "26870", "27772", "31658", "32954", "33004", "34334", "34437", "39419", "40278", "42011", "45210", "45447", "45720", "48142", "50347", "53464", "55553", "57528"], "numFields": 106}}
```
Import it into Elasticsearch:

```
curl -s -XPOST localhost:9200/bigmovie/film/_bulk --data-binary @update.json; echo
```
```
curl 'http://master41:9200/bigmovie/film/_search?pretty' -d '
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [ { "match": { "indicators": "1237 551" } } ],
          "must_not": [ { "ids": { "values": ["1237", "551"] } } ]
        }
      },
      "functions": [ { "random_score": { "seed": "48" } } ],
      "score_mode": "sum"
    }
  },
  "fields": ["_id", "title", "genre"],
  "size": "8"
}'
```
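This query matches films whose indicators overlap the movies the user has watched (1237 and 551), excludes those movies themselves, and uses a seeded random score to break ties. It can be sketched as a small Python helper that builds the same request body (the helper name and parameters are my own, not from the tutorial):

```python
import json

def recommendation_query(watched_ids, size=8, seed="48"):
    """Build a function_score query recommending films similar to watched_ids."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Films co-occurring with any watched movie
                        "must": [{"match": {"indicators": " ".join(watched_ids)}}],
                        # but not the watched movies themselves
                        "must_not": [{"ids": {"values": watched_ids}}],
                    }
                },
                "functions": [{"random_score": {"seed": seed}}],
                "score_mode": "sum",
            }
        },
        "fields": ["_id", "title", "genre"],
        "size": str(size),
    }

body = recommendation_query(["1237", "551"])
print(json.dumps(body, indent=2))
```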