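The dataset used in the exploration phase at the start of this post is so large that several of the analyses produced no result even after running for a day, so the recommender-system modeling later on switches to the smaller MovieLens 1M dataset.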
import pandas as pd
import numpy as np
import datetime
%matplotlib inline
import matplotlib.pyplot as plt
ratings = pd.read_csv('./ml-latest/ratings.csv', header=0)
| userId | movieId | rating | timestamp |
0 | 1 | 169 | 2.5 | 1204927694 |
1 | 1 | 2471 | 3.0 | 1204927438 |
2 | 1 | 48516 | 5.0 | 1204927435 |
3 | 2 | 2571 | 3.5 | 1436165433 |
4 | 2 | 109487 | 4.0 | 1436165496 |
ratings['timestamp'] = pd.to_datetime(ratings.timestamp, unit='s')  # convert Unix seconds to datetime (UTC)
| userId | movieId | rating | timestamp |
0 | 1 | 169 | 2.5 | 2008-03-07 22:08:14 |
1 | 1 | 2471 | 3.0 | 2008-03-07 22:03:58 |
2 | 1 | 48516 | 5.0 | 2008-03-07 22:03:55 |
3 | 2 | 2571 | 3.5 | 2015-07-06 06:50:33 |
4 | 2 | 109487 | 4.0 | 2015-07-06 06:51:36 |
userId 22884377 movieId 22884377 rating 22884377 timestamp 22884377 dtype: int64
movies = pd.read_csv('./ml-latest/movies.csv', header=0)
| movieId | title | genres |
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
movies = movies.set_index("movieId")
movies.head()
movieId | title | genres |
1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
2 | Jumanji (1995) | Adventure|Children|Fantasy |
3 | Grumpier Old Men (1995) | Comedy|Romance |
4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
5 | Father of the Bride Part II (1995) | Comedy |
title 34208 genres 34208 dtype: int64
The dataset contains 22,884,377 ratings of 34,208 movies.
moviefreq = ratings.movieId.value_counts()  # number of ratings per movie, a proxy for popularity; sorted in descending order by default
moviefreq.count()
sorted_byfreq = movies.loc[moviefreq.index]  # movie info, ordered from most to least rated
sorted_byfreq['ranking'] = range(moviefreq.count())  # add the popularity rank
sorted_byfreq['freq'] = moviefreq  # add the rating count
sorted_byfreq.iloc[0:10]  # the ten most-rated movies
| title | genres | ranking | freq |
356 | Forrest Gump (1994) | Comedy|Drama|Romance|War | 0 | 81296 |
296 | Pulp Fiction (1994) | Comedy|Crime|Drama|Thriller | 1 | 79091 |
318 | Shawshank Redemption, The (1994) | Crime|Drama | 2 | 77887 |
593 | Silence of the Lambs, The (1991) | Crime|Horror|Thriller | 3 | 76271 |
480 | Jurassic Park (1993) | Action|Adventure|Sci-Fi|Thriller | 4 | 69545 |
260 | Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Sci-Fi | 5 | 67092 |
2571 | Matrix, The (1999) | Action|Sci-Fi|Thriller | 6 | 64830 |
110 | Braveheart (1995) | Action|Drama|War | 7 | 61267 |
1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 8 | 60424 |
527 | Schindler's List (1993) | Drama|War | 9 | 59857 |
sorted_byfreq[sorted_byfreq.title.str.contains('Kill Bill')]  # look up the popularity of a specific movie
| title | genres | ranking | freq |
6874 | Kill Bill: Vol. 1 (2003) | Action|Crime|Thriller | 110 | 26225 |
7438 | Kill Bill: Vol. 2 (2004) | Action|Drama|Thriller | 157 | 22301 |
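moviefreq1 = moviefreq.copy()
moviefreq1.index = range(moviefreq1.count())  # reindex from 0 to make plotting easier
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
moviefreq1.plot(ax=ax, title='Rating times');
As shown above, even the most-rated of the 34,208 movies has only 81,296 ratings, while there are 247,753 users in total, and a large number of movies have far fewer than 10,000 ratings. There is clearly a lot of room for recommendation.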
ts = ratings.timestamp.copy()
0 2008-03-07 22:08:14 1 2008-03-07 22:03:58 2 2008-03-07 22:03:55 3 2015-07-06 06:50:33 4 2015-07-06 06:51:36 Name: timestamp, dtype: datetime64[ns]
ts2 = pd.Series(np.ones(ts.count()).astype('int32'), index=ts.values).sort_index()  # one count per rating, indexed by time
ts3 = ts2.to_period("Y").groupby(level=0).count()
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ts3.plot(ax=ax, kind='bar', title='Rating times');
meanrating = ratings['rating'].groupby(ratings['movieId']).mean()
meanrating = meanrating.sort_values(ascending = False)
movieId 95517 5.0 148781 5.0 141483 5.0 136872 5.0 139134 5.0 Name: rating, dtype: float64
sorted_byrate = movies.loc[meanrating.index]  # movie info, ordered from highest to lowest mean rating
sorted_byrate['ranking'] = range(meanrating.count())  # add the rank
sorted_byrate['rating'] = meanrating  # add the mean rating
sorted_byrate.iloc[0:10]  # the ten highest-rated movies
movieId | title | genres | ranking | rating |
95517 | Barchester Chronicles, The (1982) | Drama | 0 | 5.0 |
148781 | Under the Electric Sky (2014) | Documentary | 1 | 5.0 |
141483 | Lost Rivers (2013) | Documentary | 2 | 5.0 |
136872 | Zapatlela (1993) | (no genres listed) | 3 | 5.0 |
139134 | Soodhu Kavvum (2013) | Comedy|Thriller | 4 | 5.0 |
135727 | Aarya (2004) | Comedy|Drama|Romance | 5 | 5.0 |
103143 | Donos de Portugal (2012) | Documentary | 6 | 5.0 |
141434 | My Friend Victoria (2014) | Drama | 7 | 5.0 |
150268 | Dilwale (2015) | Action|Children|Comedy|Crime|Drama|Romance | 8 | 5.0 |
148857 | Christmas, Again (2015) | (no genres listed) | 9 | 5.0 |
I have not seen a single one of these perfect-5.0 movies!
sorted_byrate[sorted_byrate.title.str.contains('Kill Bill')]  # look up the mean rating of a specific movie
movieId | title | genres | ranking | rating |
6874 | Kill Bill: Vol. 1 (2003) | Action|Crime|Thriller | 3066 | 3.889743 |
7438 | Kill Bill: Vol. 2 (2004) | Action|Drama|Thriller | 3387 | 3.856621 |
sorted_byrate = movies.loc[meanrating.index]  # movie info, ordered from highest to lowest mean rating
sorted_byrate['ranking'] = range(meanrating.count())  # add the rank
sorted_byrate['rating'] = meanrating  # add the mean rating
sorted_byrate['freq'] = moviefreq.loc[meanrating.index]  # add the rating count
sorted_byrate.iloc[0:10]  # the ten highest-rated movies
movieId | title | genres | ranking | rating | freq |
95517 | Barchester Chronicles, The (1982) | Drama | 0 | 5.0 | 1 |
148781 | Under the Electric Sky (2014) | Documentary | 1 | 5.0 | 1 |
141483 | Lost Rivers (2013) | Documentary | 2 | 5.0 | 1 |
136872 | Zapatlela (1993) | (no genres listed) | 3 | 5.0 | 1 |
139134 | Soodhu Kavvum (2013) | Comedy|Thriller | 4 | 5.0 | 1 |
135727 | Aarya (2004) | Comedy|Drama|Romance | 5 | 5.0 | 1 |
103143 | Donos de Portugal (2012) | Documentary | 6 | 5.0 | 1 |
141434 | My Friend Victoria (2014) | Drama | 7 | 5.0 | 1 |
150268 | Dilwale (2015) | Action|Children|Comedy|Crime|Drama|Romance | 8 | 5.0 | 2 |
148857 | Christmas, Again (2015) | (no genres listed) | 9 | 5.0 | 1 |
sorted_byrate[sorted_byrate.title.str.contains('Kill Bill')]  # look up the mean rating and rating count of a specific movie
movieId | title | genres | ranking | rating | freq |
6874 | Kill Bill: Vol. 1 (2003) | Action|Crime|Thriller | 3066 | 3.889743 | 26225 |
7438 | Kill Bill: Vol. 2 (2004) | Action|Drama|Thriller | 3387 | 3.856621 | 22301 |
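It turns out that these perfect-5.0 movies have only one or two ratings each, which is enough to push them to the top of the ranking. The mean rating on its own is clearly unreliable; the number of ratings has to be taken into account as well.
First, drop the movies with 30 or fewer ratings.
sorted_byrate2 = sorted_byrate[sorted_byrate.freq>30]
sorted_byrate2.head(10)  # the ten highest-rated movies after filtering
movieId | title | genres | ranking | rating | freq |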
318 | Shawshank Redemption, The (1994) | Crime|Drama | 627 | 4.441710 | 77887 |
858 | Godfather, The (1972) | Crime|Drama | 641 | 4.353639 | 49846 |
50 | Usual Suspects, The (1995) | Crime|Mystery|Thriller | 668 | 4.318987 | 53195 |
527 | Schindler's List (1993) | Drama|War | 675 | 4.290952 | 59857 |
140737 | The Lost Room (2006) | Action|Fantasy|Mystery | 680 | 4.280822 | 73 |
1221 | Godfather: Part II, The (1974) | Crime|Drama | 681 | 4.268878 | 32247 |
2019 | Seven Samurai (Shichinin no samurai) (1954) | Action|Adventure|Drama | 682 | 4.262134 | 12753 |
904 | Rear Window (1954) | Mystery|Thriller | 815 | 4.246988 | 19422 |
1193 | One Flew Over the Cuckoo's Nest (1975) | Drama | 816 | 4.242451 | 35832 |
2959 | Fight Club (1999) | Action|Crime|Drama|Thriller | 817 | 4.233925 | 48879 |
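A hard cutoff is only one way to account for the number of ratings. As a hedged alternative (a swapped-in technique, not what this post does), a damped mean pulls movies with few ratings toward the global average. The function name `damped_mean`, the damping constant `k=30`, and the global mean of 3.5 are all assumptions chosen for illustration.

```python
def damped_mean(rating_sum, n_ratings, global_mean, k=30):
    """Shrink a movie's mean rating toward global_mean.

    k behaves like k extra pseudo-ratings at the global mean,
    so movies with very few ratings cannot reach the extremes.
    """
    return (rating_sum + k * global_mean) / (n_ratings + k)


global_mean = 3.5  # hypothetical global average rating

# A movie with a single 5.0 rating stays close to the global mean ...
single = damped_mean(5.0, 1, global_mean)
# ... while a movie with many ratings keeps a score close to its raw mean.
popular = damped_mean(4.44 * 77887, 77887, global_mean)

print(round(single, 3), round(popular, 3))
```

With this kind of score, a movie rated once can no longer outrank one with tens of thousands of ratings, and no threshold such as 30 has to be picked as a hard boundary.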
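The ten highest-rated movies and the ten most-rated movies overlap, for example The Shawshank Redemption and Schindler's List, so let us check how the mean rating correlates with the number of ratings.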
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ax.scatter(sorted_byrate['freq'], sorted_byrate['rating']);
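A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes a Pearson correlation by hand between the log of the rating count and the mean rating. The six (freq, rating) pairs are copied from the tables above purely as sample input, so the resulting value says nothing about the full dataset; the helper name `pearson` and the choice of a log scale are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Sample rows taken from the tables above (not a representative sample).
freq = [77887, 49846, 53195, 59857, 26225, 22301]
rating = [4.441710, 4.353639, 4.318987, 4.290952, 3.889743, 3.856621]

# Rating counts span several orders of magnitude, so compare on a log scale.
log_freq = [math.log10(f) for f in freq]
r = pearson(log_freq, rating)
print(round(r, 3))
```

On the real data the same check could be run over the `freq` and `rating` columns of `sorted_byrate`, for instance with the pandas `DataFrame.corr` method.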
userfreq = ratings.userId.value_counts()  # number of ratings per user, sorted in descending order by default
userfreq.count()
Exactly 247,753 users have rated movies, not one more or less, just as README.txt says.
185430 9281 46750 7515 204165 7057 135877 6015 58040 5801 Name: userId, dtype: int64
timesfreq = userfreq.copy()
timesfreq.index = range(timesfreq.count())  # reindex from 0 to make plotting easier
timesfreq.head()
0 9281 1 7515 2 7057 3 6015 4 5801 Name: userId, dtype: int64
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");
295 users have rated more than 2,000 movies; these people really love movies. As for the users further down with few ratings, they may have joined late, or they may have watched plenty of movies without rating them here, so there are surely quite a few film buffs hidden among them as well.
Below are the rating-count distributions for users with at least 2,000 ratings and for users with fewer than 2,000.
timesfreq1 = userfreq[userfreq >= 2000].copy()
timesfreq1.index = range(timesfreq1.count())  # reindex from 0 to make plotting easier
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq1.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");
timesfreq2 = userfreq[userfreq < 2000].copy()
timesfreq2.index = range(timesfreq2.count())  # reindex from 0 to make plotting easier
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq2.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");
Now look at the users with fewer than 10 ratings.
timesfreq3 = userfreq[userfreq < 10].copy()
timesfreq3.index = range(timesfreq3.count())  # reindex from 0 to make plotting easier
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq3.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");
More than four thousand users have rated only once.
onerating = ratings[ratings.userId.isin(userfreq[userfreq == 1].index)]  # the isin method took some effort to find
print(onerating.count())
print(onerating.head())
userId       4251
movieId      4251
rating       4251
timestamp    4251
dtype: int64
       userId  movieId  rating   timestamp
10137     108     2302     4.5  1352678182
19688     215      318     3.0  1434516586
23937     263     1029     4.5  1207138536
30266     356     3254     4.0  1325107825
32553     376     7153     5.0  1427304194
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
onerating.rating.value_counts().plot(ax=ax, kind='bar', title='Ratings');
movieId 1 Toy Story (1995) 2 Jumanji (1995) 3 Grumpier Old Men (1995) 4 Waiting to Exhale (1995) 5 Father of the Bride Part II (1995) Name: title, dtype: object
movieyears = movies.title.str.extract(r'(\((\d{4})\))', expand=True).iloc[:, 1]  # extract the release year with a regular expression
movieyears.head()
movieId 1 1995 2 1995 3 1995 4 1995 5 1995 Name: 1, dtype: object
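To make the regular expression easier to follow, here is a minimal sketch of the same idea using only the standard `re` module: match a four-digit number wrapped in parentheses and keep just the digits. The helper name `release_year` is hypothetical, and the sketch assumes the year is the first parenthesised four-digit group in the title.

```python
import re

# \( and \) match literal parentheses; (\d{4}) captures exactly four digits.
YEAR_PATTERN = re.compile(r"\((\d{4})\)")


def release_year(title):
    """Return the first parenthesised 4-digit number in title, or None."""
    match = YEAR_PATTERN.search(title)
    return match.group(1) if match else None


for title in ["Toy Story (1995)",
              "Seven Samurai (Shichinin no samurai) (1954)",
              "A title without a year"]:
    print(title, "->", release_year(title))
```

Parenthesised text that is not a four-digit number, such as an alternative title, is skipped, and titles with no year yield `None` instead of raising an error.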
yearfreq = movieyears.value_counts()  # number of movies per release year, sorted in descending order by default
yearfreq.count()
yearfreqsort = yearfreq.sort_index()
yearfreqsort.head()
1874 1 1878 1 1887 1 1888 2 1890 3 Name: 1, dtype: int64
fig, ax = plt.subplots(3, 1, figsize=(15, 12))
yearfreqsort.plot(ax=ax[0], kind='bar', title='freq');
yearfreqsort.iloc[0:60].plot(ax=ax[1], kind='bar', title='freq');
yearfreqsort.iloc[60:].plot(ax=ax[2], kind='bar', title='freq');
The number of movies per year reflects the passage of history. The count grew slowly until the 1990s, started to climb rapidly in the mid-1990s, and has kept growing to this day.
I did not expect there to be movies from before 1900. Let us see what they are called.
movies.loc[movieyears[movieyears<'1900'].index]  # movies released before 1900
movieId | title | genres |
82337 | Four Heads Are Better Than One (Un homme de tê... | Fantasy |
82362 | Pyramid of Triboulet, The (La pyramide de Trib... | Fantasy |
88674 | Edison Kinetoscopic Record of a Sneeze (1894) | Documentary |
94431 | Ella Lola, a la Trilby (1898) | (no genres listed) |
94657 | Turkish Dance, Ella Lola (1898) | (no genres listed) |
94951 | Dickson Experimental Sound Film (1894) | Musical |
95541 | Blacksmith Scene (1893) | (no genres listed) |
96009 | Kiss, The (1896) | Romance |
98981 | Arrival of a Train, The (1896) | Documentary |
113048 | Tables Turned on the Gardener (1895) | Comedy |
120869 | Employees Leaving the Lumière Factory (1895) | Documentary |
125978 | Santa Claus (1898) | Sci-Fi |
129849 | Old Man Drinking a Glass of Beer (1898) | (no genres listed) |
129851 | Dickson Greeting (1891) | (no genres listed) |
140539 | Pauvre Pierrot (1892) | Animation |
140547 | The Merry Skeleton (1898) | Comedy |
140549 | Serpentine Dance: Loïe Fuller (1897) | (no genres listed) |
142851 | Arab Cortege, Geneva (1896) | Documentary |
148040 | Man Walking Around a Corner (1887) | (no genres listed) |
148042 | Accordion Player (1888) | Documentary |
148044 | Monkeyshines, No. 1 (1890) | Comedy |
148046 | Monkeyshines, No. 2 (1890) | (no genres listed) |
148048 | Sallie Gardner at a Gallop (1878) | (no genres listed) |
148050 | Traffic Crossing Leeds Bridge (1888) | Documentary |
148052 | London's Trafalgar Square (1890) | (no genres listed) |
148054 | Passage de Venus (1874) | Documentary |
148064 | Newark Athlete (1891) | Documentary |
148462 | Men Boxing (1891) | Action|Documentary |
148703 | The Wave (1891) | Documentary |
148705 | A Hand Shake (1892) | (no genres listed) |
148877 | Fencing (1892) | (no genres listed) |
movieyears1 = movieyears.str[:3] + "0s"  # map each year to its decade, e.g. 1995 -> 1990s
yearfreq1 = movieyears1.value_counts()  # number of movies per decade, sorted in descending order by default
yearfreqsort1 = yearfreq1.sort_index()
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
yearfreqsort1.plot(ax=ax, kind='bar', title='freq');
genreslist = []  # all genre labels attached to all movies
genreseries = movies.genres.str.split(pat="|")
genrecount = genreseries.count()
for i in range(genrecount):
    genreslist.extend(genreseries.iloc[i])  # flatten the per-movie lists in the Series into one list
len(genreslist)
title 34208 genres 34208 dtype: int64
allmoviegenres = pd.Series(genreslist)
genrestats = allmoviegenres.value_counts()
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
genrestats.plot(ax=ax, kind='bar', title='freq');
#mi = movies.loc[ratings.movieId]
#genreslist = []  # all genre labels attached to all rated movies
#genreseries = mi.genres.str.split(pat="|")
#genrecount = genreseries.count()
#for i in range(genrecount):
#    genreslist.extend(genreseries.iloc[i])  # flatten the per-movie lists in the Series into one list
#allmoviegenres = pd.Series(genreslist)
#genrestats = allmoviegenres.value_counts()
#fig, ax = plt.subplots(1, 1, figsize=(15, 4))
#genrestats.plot(ax=ax, kind='bar', title='freq');
The commented-out code above ignores memory use. The version below extracts the genres for the movies in ratings one rating at a time; with this small change the memory problem goes away, but nothing can be done about the speed, since the genres still have to be looked up for each of the 22,884,377 ratings individually.
s1 = pd.Series(np.zeros(20, dtype=np.int32),
               index=['Drama', 'Comedy', 'Thriller', 'Romance', 'Action',
                      'Crime', 'Horror', 'Documentary', 'Adventure', 'Sci-Fi',
                      'Mystery', 'Fantasy', 'Children', 'Animation', 'War',
                      '(no genres listed)', 'Musical', 'Western', 'Film-Noir', 'IMAX'])
rcount = len(ratings)
for i in range(rcount):
    if i % 1000000 == 0:  # acts as a progress bar; otherwise a 7-hour run gives no sign of progress
        print(i)
    mid = ratings.movieId.iloc[i]
    grs = movies.loc[mid].genres.split("|")
    s1.loc[grs] += 1
0 1000000 2000000 3000000 4000000 5000000 6000000 7000000 8000000 9000000 10000000 11000000 12000000 13000000 14000000 15000000 16000000 17000000 18000000 19000000 20000000 21000000 22000000
Drama                 10137200
Comedy                 8437502
Thriller               6123348
Romance                4342070
Action                 6547286
Crime                  3803018
Horror                 1685352
Documentary             279609
Adventure              5117321
Sci-Fi                 3694890
Mystery                1794175
Fantasy                2449668
Children               1923874
Animation              1362681
War                    1206361
(no genres listed)        2454
Musical                 974697
Western                 472588
Film-Noir               238480
IMAX                    676420
dtype: int32
s1_sort = s1.sort_values(ascending = False)  # sort in descending order
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
s1_sort.plot(ax=ax, kind='bar', title='freq');
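The per-rating loop is slow because it looks up the same movie's genres again for every rating. One possible shortcut, sketched here as an alternative rather than as the method used in this post, is to count the ratings per movie first and then add that count to each of the movie's genres once. The toy dictionaries below are hypothetical stand-ins for the real tables.

```python
from collections import Counter

# Hypothetical stand-ins: movie id -> genre string, and one movie id per rating.
movie_genres = {1: "Adventure|Animation", 2: "Adventure|Children", 3: "Comedy"}
rated_movie_ids = [1, 1, 2, 3, 3, 3]

# Step 1: number of ratings per movie (what value_counts() yields on ratings.movieId).
ratings_per_movie = Counter(rated_movie_ids)

# Step 2: visit each movie once and credit all of its ratings to each of its genres.
genre_counts = Counter()
for movie_id, n_ratings in ratings_per_movie.items():
    for genre in movie_genres[movie_id].split("|"):
        genre_counts[genre] += n_ratings

print(genre_counts.most_common())
```

On the real data, `moviefreq` from earlier already holds the per-movie counts, so the second step would loop over at most 34,208 movies instead of 22,884,377 ratings while producing the same totals.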
After the painful lesson of working with the large dataset above, this part switches to the smaller MovieLens 1M dataset.
ratings1m_train = pd.read_csv('./ml-1m/ratings.dat', sep="::", names=['userId','movieId','rating','timestamp'],engine='python')
| userId | movieId | rating | timestamp |
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
ratings1m_train = ratings1m_train.drop(['timestamp','rating'], axis=1)  # top-N recommendation ignores the actual rating values
userId 1000209 movieId 1000209 dtype: int64
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
| userId | movieId |
0 | 1 | 1193 |
1 | 1 | 661 |
2 | 1 | 914 |
3 | 1 | 3408 |
4 | 1 | 2355 |
totalcount = len(ratings1m_train)
all_index = np.arange(totalcount)
train_index = np.random.choice(totalcount, int(0.8*totalcount), replace=False)  # sample 80% of the row positions in [0, totalcount) without replacement
test_index = np.setdiff1d(all_index, train_index)  # set difference: the remaining 20%
train_data = ratings1m_train.iloc[train_index]
test_data = ratings1m_train.iloc[test_index]
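Because the split above is drawn at random without a fixed seed, the evaluation numbers change from run to run. The sketch below shows one way to make an 80/20 split of row positions reproducible, using only the standard library; the function name `split_indices` and the seed value are assumptions, not part of the original notebook.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train, test) lists of row positions that partition range(n_rows)."""
    rng = random.Random(seed)      # private generator, so the global state is untouched
    positions = list(range(n_rows))
    rng.shuffle(positions)         # shuffle once, then cut: sampling without replacement
    cut = int(train_fraction * n_rows)
    return sorted(positions[:cut]), sorted(positions[cut:])


train_idx, test_idx = split_indices(10)
print(train_idx, test_idx)
```

The two lists can be passed to `.iloc` in the same way as `train_index` and `test_index`, and calling the function again with the same seed returns the same split.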
users = ratings1m_train.userId.unique()
users
array([ 1, 2, 3, ..., 6038, 6039, 6040], dtype=int64)
movies = ratings1m_train.movieId.unique()
movies
array([1193, 661, 914, ..., 2845, 3607, 2909], dtype=int64)
The dataset contains 1,000,209 ratings of 3,706 movies by 6,040 users.
userSimilarity = pd.DataFrame(0.0, columns=users, index=users)  # user similarity matrix, initialized to 0
RMSE is generally considered stricter than MAE, because squaring the errors increases the penalty on ratings that are predicted badly.
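A small worked example makes the difference concrete. In the sketch below, two hypothetical sets of predictions have the same total absolute error, but in one the error is spread evenly and in the other it is concentrated in a single rating. The function names and the toy ratings are assumptions for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [4.0, 3.0, 5.0, 2.0]
even_errors = [3.0, 4.0, 4.0, 3.0]    # every prediction is off by 1
one_big_error = [4.0, 3.0, 5.0, 6.0]  # a single prediction is off by 4

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Both prediction sets have an MAE of 1.0, but the RMSE rises from 1.0 to 2.0 when the error is concentrated in one badly predicted rating.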
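Rating prediction has long been a hot topic in recommender-system research. On this point, former Amazon scientist Greg Linden argues that the goal of movie recommendation is to find the movies a user is most likely to be interested in, not to predict what score the user would give after watching one, so top-N recommendation fits real applications better. A movie may get a very high score from a user who watches it, yet other users may be very unlikely to watch it at all, so predicting whether a user will watch a movie matters more than predicting the rating.
This assignment studies the top-N recommendation problem and ignores the actual ratings in the dataset. The task of top-N recommendation is to predict whether a user will rate a given movie, not what rating they will give.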
movie_users = pd.Series('', index=movies)  # a Series cannot be initialized with empty lists directly, so store empty strings first and turn them into lists with split
movie_users = movie_users.str.split()
for i in train_data.index.values:  # scan the training set to build the movie -> users inverted index
    movie_users.loc[train_data.movieId.loc[i]].append(train_data.userId.loc[i])
import math
C = dict()  # C[u][v]: number of movies rated by both u and v
N = dict()  # N[u]: number of movies rated by u
for i in movie_users.index.values:
    for u in movie_users.loc[i]:
        N.setdefault(u, 0)  # start from 0 so that N[u] is the exact count
        N[u] += 1
        for v in movie_users.loc[i]:
            if u == v:
                continue
            C.setdefault(u, {})
            C[u].setdefault(v, 0)
            C[u][v] += 1
for u, related_users in C.items():
    for v, cuv in related_users.items():
        userSimilarity.loc[u, v] = cuv / math.sqrt(N[u] * N[v])  # cosine similarity
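The similarity computed above is the cosine similarity between two users' sets of rated movies: the number of movies both have rated, divided by the square root of the product of their individual counts. The following self-contained sketch runs the same inverted-index idea on a hypothetical three-movie example so the numbers can be checked by hand; the function name `user_similarity` and the toy data are assumptions.

```python
import math
from collections import defaultdict


def user_similarity(movie_users):
    """movie_users maps a movie id to the list of distinct users who rated it."""
    n_rated = defaultdict(int)                        # N[u]: movies rated by u
    co_rated = defaultdict(lambda: defaultdict(int))  # C[u][v]: movies rated by both
    for users in movie_users.values():
        for u in users:
            n_rated[u] += 1
            for v in users:
                if u != v:
                    co_rated[u][v] += 1
    return {
        u: {v: c / math.sqrt(n_rated[u] * n_rated[v]) for v, c in related.items()}
        for u, related in co_rated.items()
    }


toy = {"m1": ["A", "B"], "m2": ["A", "C"], "m3": ["A", "B", "C"]}
sim = user_similarity(toy)
print(sim["A"]["B"], sim["B"]["C"])
```

Here A rated 3 movies and B rated 2, with 2 in common, so their similarity is 2 / sqrt(3 * 2); B and C share one movie out of 2 each, giving 1 / sqrt(2 * 2) = 0.5. Only pairs of users who share at least one movie are ever visited, which is the point of going through the inverted index.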
from operator import itemgetter
def recommend(uid, n_sim_user, n_rec_movie):
    K = n_sim_user
    N = n_rec_movie
    rank = dict()
    watched_movies = train_data[train_data.userId == uid].movieId.values
    simusers = userSimilarity.loc[uid].sort_values(ascending=False).iloc[0:K]  # the K users most similar to uid
    for v in simusers.index.values:
        for m in train_data[train_data.userId == v].movieId.values:
            if m in watched_movies:
                continue  # skip movies the user has already rated
            rank.setdefault(m, 0)
            rank[m] += simusers.loc[v]  # accumulate the similarity of every neighbor who rated m
    # return the N highest-scoring movies
    return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
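Let U be the set of users in the system, R(u) the recommendation list produced for user u from their behavior in the training set, and T(u) the list of user u's behavior in the test set. The precision and recall of the recommendations are then defined as: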
\[Precision=\frac{\sum_{u\in U}|R(u)\cap T(u)|}{\sum_{u\in U}|R(u)|}\]
\[Recall=\frac{\sum_{u\in U}|R(u)\cap T(u)|}{\sum_{u\in U}|T(u)|}\]
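Coverage is the share of the full movie set I that appears in at least one recommendation list:
\[Coverage=\frac{|\bigcup_{u\in U}R(u)|}{|I|}\]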
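These ratios can be checked by hand on a toy example. The sketch below takes hypothetical recommendation lists R(u) and held-out sets T(u) for two users and returns precision, recall, and coverage, where coverage is counted as the number of distinct recommended movies divided by the catalogue size (the same quantity the `evaluate` function tracks in `all_rec_movies`). The function name and the data are assumptions for illustration.

```python
def precision_recall_coverage(recommended, held_out, n_items):
    """recommended: user -> list R(u); held_out: user -> set T(u); n_items: |I|."""
    hits = sum(len(set(recs) & held_out.get(u, set()))
               for u, recs in recommended.items())
    n_recommended = sum(len(recs) for recs in recommended.values())
    n_held_out = sum(len(held_out.get(u, set())) for u in recommended)
    covered = set().union(*recommended.values())  # distinct movies recommended to anyone
    return hits / n_recommended, hits / n_held_out, len(covered) / n_items


recommended = {"A": ["m1", "m2"], "B": ["m2", "m3"]}
held_out = {"A": {"m1", "m4"}, "B": {"m5"}}

precision, recall, coverage = precision_recall_coverage(recommended, held_out, n_items=5)
print(precision, recall, coverage)
```

One of the four recommendations is a hit (m1 for user A), so precision is 1/4; the held-out sets contain three movies in total, so recall is 1/3; and three distinct movies out of five are recommended, so coverage is 3/5.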
def evaluate(n_sim_user, n_rec_movie):
    N = n_rec_movie
    hit = 0
    rec_count = 0
    test_count = 0
    all_rec_movies = set()
    movie_count = ratings1m_train.movieId.unique().shape[0]
    for uid in train_data.userId.unique():  # visit each user once, not once per rating
        test_movies = test_data[test_data.userId == uid].movieId
        rec_movies = recommend(uid, n_sim_user, n_rec_movie)
        for movie, w in rec_movies:
            if movie in test_movies.values:
                hit += 1
            all_rec_movies.add(movie)
        rec_count += N
        test_count += test_movies.count()
    precision = hit / (1.0*rec_count)
    recall = hit / (1.0*test_count)
    coverage = len(all_rec_movies) / (1.0*movie_count)
    return (precision, recall, coverage)
print(evaluate(20, 10))