Course Project: Movie Recommender System

Date: 2019-11-09

Movie Recommender System

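The literature on recommender systems is vast, and you are probably familiar with it. I chose this topic for two reasons. First, it is simple: with only the free time after 10 p.m. over a little more than a week, a simple topic was the only kind I could finish, and even so I was stretched thin. Second, I wanted to study the data properly and work step by step toward the design of a recommender system, rather than jumping straight to algorithms as I had done before. Admittedly, again for lack of time, the data exploration here is far from sufficient.

The dataset used in the exploration phase at the start of this article was too large: several analyses ran for a whole day without producing a result. For the recommender modeling later on I therefore switched to the smaller MovieLens 1M dataset.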

import pandas as pd
import numpy as np
import datetime

%matplotlib inline
import matplotlib.pyplot as plt

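Data Exploration

The exploration started out aimless: I followed my curiosity, and whenever some part of the data looked interesting I computed a few statistics, sorted them, and drew a plot. That is how I gradually became familiar with the data, and only then did I consider how to use users' historical behavior to make recommendations.

Loading the Data

Load the ratings data: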

ratings = pd.read_csv('./ml-latest/ratings.csv', header=0)

ratings.head()

	userId	movieId	rating	timestamp
0	1	169	2.5	1204927694
1	1	2471	3.0	1204927438
2	1	48516	5.0	1204927435
3	2	2571	3.5	1436165433
4	2	109487	4.0	1436165496

ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s') # convert Unix seconds to datetimes (UTC)

ratings.head()

	userId	movieId	rating	timestamp
0	1	169	2.5	2008-03-07 22:08:14
1	1	2471	3.0	2008-03-07 22:03:58
2	1	48516	5.0	2008-03-07 22:03:55
3	2	2571	3.5	2015-07-06 06:50:33
4	2	109487	4.0	2015-07-06 06:51:36

ratings.count()

userId       22884377
movieId      22884377
rating       22884377
timestamp    22884377
dtype: int64

Load the movie metadata:

movies = pd.read_csv('./ml-latest/movies.csv', header=0)

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

movies = movies.set_index("movieId")
movies.head()

	title	genres
movieId
1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
2	Jumanji (1995)	Adventure|Children|Fantasy
3	Grumpier Old Men (1995)	Comedy|Romance
4	Waiting to Exhale (1995)	Comedy|Drama|Romance
5	Father of the Bride Part II (1995)	Comedy

movies.count()

title     34208
genres    34208
dtype: int64

The dataset contains 22,884,377 ratings of 34,208 movies.

Popularity Survey

moviefreq = ratings.movieId.value_counts() # number of ratings per movie, a proxy for popularity; sorted in descending order by default
moviefreq.count()

sorted_byfreq = movies.loc[moviefreq.index] # look up movie info in order of rating count
sorted_byfreq['ranking']=range(moviefreq.count()) # add the rank
sorted_byfreq['freq']=moviefreq # add the rating count
sorted_byfreq.iloc[0:10] # the ten most popular movies

	title	genres	ranking	freq
356	Forrest Gump (1994)	Comedy|Drama|Romance|War	0	81296
296	Pulp Fiction (1994)	Comedy|Crime|Drama|Thriller	1	79091
318	Shawshank Redemption, The (1994)	Crime|Drama	2	77887
593	Silence of the Lambs, The (1991)	Crime|Horror|Thriller	3	76271
480	Jurassic Park (1993)	Action|Adventure|Sci-Fi|Thriller	4	69545
260	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Sci-Fi	5	67092
2571	Matrix, The (1999)	Action|Sci-Fi|Thriller	6	64830
110	Braveheart (1995)	Action|Drama|War	7	61267
1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy	8	60424
527	Schindler's List (1993)	Drama|War	9	59857

sorted_byfreq[sorted_byfreq.title.str.contains('Kill Bill')] # check the popularity of a particular movie

	title	genres	ranking	freq
6874	Kill Bill: Vol. 1 (2003)	Action|Crime|Thriller	110	26225
7438	Kill Bill: Vol. 2 (2004)	Action|Drama|Thriller	157	22301

Next, look at how the number of ratings is distributed across movies.

moviefreq1 = moviefreq.copy()
moviefreq1.index = range(moviefreq1.count()) # reset the index to 0..n-1 so the plot is easier to read
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
moviefreq1.plot(ax=ax, title='Rating times');

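As shown above, among 34,208 movies even the most-rated one has only 81,296 ratings, while there are 247,753 users in total, to say nothing of the large number of movies with far fewer than 10,000 ratings. There is clearly a great deal of room for recommendation.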
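To put a number on that "room for recommendation", here is a minimal sketch that uses only the figures reported above. The variable names are chosen for illustration. It computes how much of the user-by-movie matrix is filled, which comes out at roughly 0.27%, and the share of users who rated the single most popular movie.

```python
# Figures reported above for the ml-latest dataset
n_ratings = 22884377
n_movies = 34208
n_users = 247753
top_movie_ratings = 81296  # Forrest Gump, the most-rated movie

# Fraction of the user x movie matrix that actually holds a rating
density = n_ratings / (n_users * n_movies)

# Share of all users who rated the most popular movie
top_share = top_movie_ratings / n_users

print("matrix density: {:.4%}".format(density))
print("users who rated the top movie: {:.1%}".format(top_share))
```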

Rating Time Range

Count how the ratings are distributed over time.

ts = ratings.timestamp.copy()

ts.head()

0   2008-03-07 22:08:14
1   2008-03-07 22:03:58
2   2008-03-07 22:03:55
3   2015-07-06 06:50:33
4   2015-07-06 06:51:36
Name: timestamp, dtype: datetime64[ns]

ts2 = pd.Series(np.ones(ts.count(), dtype=np.int32), index=ts.values).sort_index() # one entry per rating, indexed by its timestamp

ts3 = ts2.to_period("Y").groupby(level=0).count()

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ts3.plot(ax=ax, kind='bar', title='Rating times');

The plot shows the number of ratings per year.
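The same yearly count can probably be obtained more directly from the datetime column, without building a helper Series of ones. A small sketch on toy data follows; the `toy` frame is a stand-in for `ratings`, and its timestamps are taken from the sample rows shown earlier.

```python
import pandas as pd

# Toy stand-in for the ratings table (Unix seconds)
toy = pd.DataFrame({"timestamp": [1204927694, 1204927438, 1436165433,
                                  1436165496, 1436165500]})
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")

# Number of ratings per calendar year, in chronological order
per_year = toy["timestamp"].dt.year.value_counts().sort_index()
print(per_year)
```

Calling `per_year.plot(kind='bar')` would then give the same kind of bar chart as above.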

Ranking by Rating

How should movies be ranked by rating? Try the mean rating first; it seems fair enough.

meanrating = ratings['rating'].groupby(ratings['movieId']).mean()

meanrating = meanrating.sort_values(ascending = False)

meanrating.head()

movieId
95517     5.0
148781    5.0
141483    5.0
136872    5.0
139134    5.0
Name: rating, dtype: float64

sorted_byrate = movies.loc[meanrating.index] # look up movie info in order of mean rating
sorted_byrate['ranking']=range(meanrating.count()) # add the rank
sorted_byrate['rating']=meanrating # add the mean rating
sorted_byrate.iloc[0:10] # the ten movies with the highest mean rating

	title	genres	ranking	rating
movieId
95517	Barchester Chronicles, The (1982)	Drama	0	5.0
148781	Under the Electric Sky (2014)	Documentary	1	5.0
141483	Lost Rivers (2013)	Documentary	2	5.0
136872	Zapatlela (1993)	(no genres listed)	3	5.0
139134	Soodhu Kavvum (2013)	Comedy|Thriller	4	5.0
135727	Aarya (2004)	Comedy|Drama|Romance	5	5.0
103143	Donos de Portugal (2012)	Documentary	6	5.0
141434	My Friend Victoria (2014)	Drama	7	5.0
150268	Dilwale (2015)	Action|Children|Comedy|Crime|Drama|Romance	8	5.0
148857	Christmas, Again (2015)	(no genres listed)	9	5.0

I have not seen a single one of these perfect-5.0 movies!

sorted_byrate[sorted_byrate.title.str.contains('Kill Bill')] # check the mean rating of a particular movie

	title	genres	ranking	rating
movieId
6874	Kill Bill: Vol. 1 (2003)	Action|Crime|Thriller	3066	3.889743
7438	Kill Bill: Vol. 2 (2004)	Action|Drama|Thriller	3387	3.856621

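Now add the number of ratings as well.

sorted_byrate = movies.loc[meanrating.index] # look up movie info in order of mean rating
sorted_byrate['ranking']=range(meanrating.count()) # add the rank
sorted_byrate['rating']=meanrating # add the mean rating
sorted_byrate['freq']=moviefreq.loc[meanrating.index] # add the number of ratings
sorted_byrate.iloc[0:10] # the ten movies with the highest mean rating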

	title	genres	ranking	rating	freq
movieId
95517	Barchester Chronicles, The (1982)	Drama	0	5.0	1
148781	Under the Electric Sky (2014)	Documentary	1	5.0	1
141483	Lost Rivers (2013)	Documentary	2	5.0	1
136872	Zapatlela (1993)	(no genres listed)	3	5.0	1
139134	Soodhu Kavvum (2013)	Comedy|Thriller	4	5.0	1
135727	Aarya (2004)	Comedy|Drama|Romance	5	5.0	1
103143	Donos de Portugal (2012)	Documentary	6	5.0	1
141434	My Friend Victoria (2014)	Drama	7	5.0	1
150268	Dilwale (2015)	Action|Children|Comedy|Crime|Drama|Romance	8	5.0	2
148857	Christmas, Again (2015)	(no genres listed)	9	5.0	1

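sorted_byrate[sorted_byrate.title.str.contains('Kill Bill')] # check the mean rating of a particular movie

	title	genres	ranking	rating	freq
movieId
6874	Kill Bill: Vol. 1 (2003)	Action|Crime|Thriller	3066	3.889743	26225
7438	Kill Bill: Vol. 2 (2004)	Action|Drama|Thriller	3387	3.856621	22301

So these perfect-5.0 movies almost all have a single rating, and that alone pushed them to the top of the ranking. The plain mean is not reliable; the number of ratings has to be taken into account too.

First, filter out the movies with too few ratings, keeping only those with more than 30.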

sorted_byrate2 = sorted_byrate[sorted_byrate.freq>30]

sorted_byrate2.head(10) # the ten highest-rated movies

	title	genres	ranking	rating	freq
movieId
318	Shawshank Redemption, The (1994)	Crime|Drama	627	4.441710	77887
858	Godfather, The (1972)	Crime|Drama	641	4.353639	49846
50	Usual Suspects, The (1995)	Crime|Mystery|Thriller	668	4.318987	53195
527	Schindler's List (1993)	Drama|War	675	4.290952	59857
140737	The Lost Room (2006)	Action|Fantasy|Mystery	680	4.280822	73
1221	Godfather: Part II, The (1974)	Crime|Drama	681	4.268878	32247
2019	Seven Samurai (Shichinin no samurai) (1954)	Action|Adventure|Drama	682	4.262134	12753
904	Rear Window (1954)	Mystery|Thriller	815	4.246988	19422
1193	One Flew Over the Cuckoo's Nest (1975)	Drama	816	4.242451	35832
2959	Fight Club (1999)	Action|Crime|Drama|Thriller	817	4.233925	48879

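This looks much more reasonable! The top-rated movies now look like genuinely good movies. Next time you run out of things to watch, work your way down this list; I doubt you have seen them all!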
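A hard cutoff is one way to account for the number of ratings. Another common option, which the analysis above does not use, is a damped mean that pulls movies with few ratings toward the global mean. The sketch below uses toy numbers; the prior strength `m = 30` and the row labels are assumptions made for illustration.

```python
import pandas as pd

# Toy per-movie statistics; column names mirror sorted_byrate above
stats = pd.DataFrame(
    {"rating": [5.0, 4.4, 3.9], "freq": [1, 77887, 26225]},
    index=["one_vote_movie", "popular_movie", "other_movie"],
)

# Mean over all individual ratings
global_mean = (stats.rating * stats.freq).sum() / stats.freq.sum()
m = 30  # hypothetical prior strength: acts like m extra votes at the global mean

# Damped mean: with few ratings the prior dominates, with many the data dominates
stats["damped"] = (stats.freq * stats.rating + m * global_mean) / (stats.freq + m)
print(stats.sort_values("damped", ascending=False))
```

With this score the single-vote movie no longer outranks the popular one, and no movie has to be dropped from the table.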

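Correlation Between Mean Rating and Rating Count

The top ten movies by mean rating and the top ten by rating count overlap, for example The Shawshank Redemption and Schindler's List, so it is worth checking how mean rating and rating count are correlated.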

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ax.scatter(sorted_byrate['freq'],sorted_byrate['rating']);

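As the plot shows, a movie with few ratings does not necessarily have a low mean, but the overall trend is that movies with more ratings also have higher means, so popular movies really are well liked. Seen from another angle, there are many widely loved movies, but also many that few people have watched and that are nevertheless rated very highly. Everyone has presumably already seen the very popular ones; the key is finding the ones that are still niche. Some of these may have mass appeal but went unnoticed because of poor promotion. Others may be truly niche, loved within a small circle but not by a wider audience. How to recommend these movies to the right people is the problem that personalized recommendation has to address.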
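The scatter plot gives a visual impression only. A rank correlation would put a number on the trend. The sketch below runs on toy values that carry no meaning of their own; the column names follow `sorted_byrate`. It computes Spearman's coefficient as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Toy per-movie table with the same column names as sorted_byrate
toy = pd.DataFrame({
    "freq":   [1, 2, 73, 12753, 26225, 77887],
    "rating": [5.0, 1.0, 4.28, 4.26, 3.89, 4.44],
})

# Spearman correlation = Pearson correlation of the ranks; working on ranks
# keeps a handful of blockbuster counts from dominating the result
rho = toy["freq"].rank().corr(toy["rating"].rank())
print("Spearman correlation: {:.3f}".format(rho))
```

On the real data the same line would be `sorted_byrate['freq'].rank().corr(sorted_byrate['rating'].rank())`.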

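Anti-Cheating

Try to find users who give only fake ratings rather than genuine ones.

First, look at the distribution of the number of ratings per user.

userfreq = ratings.userId.value_counts() # number of ratings per user, sorted in descending order by default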
userfreq.count()

There are indeed ratings from 247,753 users, not one short, exactly as README.txt says.

userfreq.head()

185430    9281
46750     7515
204165    7057
135877    6015
58040     5801
Name: userId, dtype: int64

timesfreq = userfreq.copy()
timesfreq.index = range(timesfreq.count()) # reset the index to 0..n-1 so the plot is easier to read
timesfreq.head()

0    9281
1    7515
2    7057
3    6015
4    5801
Name: userId, dtype: int64

fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");

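As the plot shows, the number of ratings per user appears to follow a power-law distribution: a small number of users have a huge number of ratings, and the curve then drops quickly to the level of the vast majority.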

timesfreq[timesfreq>2000].count()

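295 users rated more than 2,000 movies; these people really love movies. As for the users further down with few ratings, they may have joined late, or they may have watched many movies without rating them here, so there are surely plenty of movie buffs hidden among them as well.

Next, look separately at the rating-count distributions for users with at least 2,000 ratings and for users with fewer than 2,000.

timesfreq1 = userfreq[userfreq>=2000].copy()
timesfreq1.index = range(timesfreq1.count()) # reset the index to 0..n-1 so the plot is easier to read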
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq1.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");

timesfreq2 = userfreq[userfreq<2000].copy()
timesfreq2.index = range(timesfreq2.count()) # reset the index to 0..n-1 so the plot is easier to read
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq2.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");

It has to be said: power-law distributions are everywhere.
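The plots above use linear axes, where many heavy-tailed distributions look alike. A rough check is to look at the same rank-versus-count curve on log-log axes, where a power law appears as an approximately straight line. The sketch below uses synthetic Pareto-distributed counts as a stand-in for `userfreq`; the shape parameter and sample size are arbitrary assumptions, and a straight-line fit is only a crude diagnostic, not a test.

```python
import numpy as np

# Synthetic stand-in for userfreq: heavy-tailed counts, sorted in descending order
rng = np.random.default_rng(0)
counts = np.sort(np.floor(rng.pareto(1.5, size=5000) + 1))[::-1]

# Fit a line to log(count) against log(rank)
ranks = np.arange(1, counts.size + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print("log-log slope: {:.2f}".format(slope))
```

For the real data, `counts = userfreq.values` would take the place of the synthetic array, and `ax.set_xscale('log')` together with `ax.set_yscale('log')` would show the same view on a plot.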

Count the users with fewer than 10 ratings.

userfreq[userfreq<10].count()

Surprisingly, there are more than thirty thousand of them.

timesfreq3 = userfreq[userfreq<10].copy()
timesfreq3.index = range(timesfreq3.count()) # reset the index to 0..n-1 so the plot is easier to read
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
timesfreq3.plot(ax=ax);
ax.set_xlabel("People ID");
ax.set_ylabel("Rating times");

userfreq[userfreq==1].count()

More than four thousand users have exactly one rating.

onerating = ratings[ratings.userId.isin(userfreq[userfreq==1].index)] # it took a lot of digging to find the isin method
print(onerating.count())
print(onerating.head())

userId       4251
movieId      4251
rating       4251
timestamp    4251
dtype: int64
       userId  movieId  rating   timestamp
10137     108     2302     4.5  1352678182
19688     215      318     3.0  1434516586
23937     263     1029     4.5  1207138536
30266     356     3254     4.0  1325107825
32553     376     7153     5.0  1427304194

fig, ax = plt.subplots(1, 1, figsize=(15, 4))
onerating.rating.value_counts().plot(ax=ax, kind='bar', title='Ratings');

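Nothing unusual shows up. I had wondered whether users with a single rating were all there to inflate some movie's score, but the rating distribution gives no support for that idea.

Distribution of Release Years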

movies.title.head()

movieId
1                      Toy Story (1995)
2                        Jumanji (1995)
3               Grumpier Old Men (1995)
4              Waiting to Exhale (1995)
5    Father of the Bride Part II (1995)
Name: title, dtype: object

movieyears = movies.title.str.extract(r'(\((\d{4})\))', expand=True).iloc[:,1] # use a regular expression to pull out the release year
movieyears.head()

movieId
1    1995
2    1995
3    1995
4    1995
5    1995
Name: 1, dtype: object

yearfreq = movieyears.value_counts() # number of movies per release year, sorted in descending order by default
yearfreq.count()

yearfreqsort = yearfreq.sort_index()
yearfreqsort.head()

1874    1
1878    1
1887    1
1888    2
1890    3
Name: 1, dtype: int64

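Look at how these movies are distributed over the years.

The first plot is too dense, so the same data is also split across two further plots.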

fig, ax = plt.subplots(3, 1, figsize=(15, 12))
yearfreqsort.plot(ax=ax[0], kind='bar', title='freq');
yearfreqsort.iloc[0:60].plot(ax=ax[1], kind='bar', title='freq');
yearfreqsort.iloc[60:].plot(ax=ax[2], kind='bar', title='freq');

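The number of movies per year gives a feel for how history unfolded. The count grew slowly until the 1990s, then took off in the mid-1990s and has kept climbing since.

I did not expect to find movies from before 1900. Let us see what they are called.

movies.loc[movieyears[movieyears<'1900'].index] # movies released before 1900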

	title	genres
movieId
82337	Four Heads Are Better Than One (Un homme de tê...	Fantasy
82362	Pyramid of Triboulet, The (La pyramide de Trib...	Fantasy
88674	Edison Kinetoscopic Record of a Sneeze (1894)	Documentary
94431	Ella Lola, a la Trilby (1898)	(no genres listed)
94657	Turkish Dance, Ella Lola (1898)	(no genres listed)
94951	Dickson Experimental Sound Film (1894)	Musical
95541	Blacksmith Scene (1893)	(no genres listed)
96009	Kiss, The (1896)	Romance
98981	Arrival of a Train, The (1896)	Documentary
113048	Tables Turned on the Gardener (1895)	Comedy
120869	Employees Leaving the Lumière Factory (1895)	Documentary
125978	Santa Claus (1898)	Sci-Fi
129849	Old Man Drinking a Glass of Beer (1898)	(no genres listed)
129851	Dickson Greeting (1891)	(no genres listed)
140539	Pauvre Pierrot (1892)	Animation
140547	The Merry Skeleton (1898)	Comedy
140549	Serpentine Dance: Loïe Fuller (1897)	(no genres listed)
142851	Arab Cortege, Geneva (1896)	Documentary
148040	Man Walking Around a Corner (1887)	(no genres listed)
148042	Accordion Player (1888)	Documentary
148044	Monkeyshines, No. 1 (1890)	Comedy
148046	Monkeyshines, No. 2 (1890)	(no genres listed)
148048	Sallie Gardner at a Gallop (1878)	(no genres listed)
148050	Traffic Crossing Leeds Bridge (1888)	Documentary
148052	London's Trafalgar Square (1890)	(no genres listed)
148054	Passage de Venus (1874)	Documentary
148064	Newark Athlete (1891)	Documentary
148462	Men Boxing (1891)	Action|Documentary
148703	The Wave (1891)	Documentary
148705	A Hand Shake (1892)	(no genres listed)
148877	Fencing (1892)	(no genres listed)

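I have seen none of them, but it is impressive all the same: China was still under the Qing dynasty at the time.

Next, show the number of movies per decade.

movieyears1 = movieyears.str[:3] + "0s"
yearfreq1 = movieyears1.value_counts() # number of movies per decade, sorted in descending order by default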
yearfreqsort1 = yearfreq1.sort_index()
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
yearfreqsort1.plot(ax=ax, kind='bar', title='freq');

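The number of movies keeps growing, and it will presumably continue to do so.

Distribution of Movie Genres

genreslist = [] # collects every genre label attached to every movie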
genreseries = movies.genres.str.split(pat = "|")
genrecount = genreseries.count()
for i in range(genrecount):
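    genreslist.extend(genreseries.iloc[i]) # flatten all the per-movie lists in the Series into one list
len(genreslist)

The code above takes a while to run. Fortunately the data is small, so it finishes in about the time it takes to glance at the TV, but the same approach stops being usable in what follows.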
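The flattening loop can likely be replaced by pandas' own vectorized operations. The sketch below runs on a three-row toy table, with `toy_movies` standing in for `movies`, and assumes a pandas version that provides `Series.explode` (0.25 or later).

```python
import pandas as pd

# Toy stand-in for the movies table above
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# Split each genre string into a list, then give every genre its own row
genre_counts = toy_movies["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

On the full table this would be `movies.genres.str.split("|").explode().value_counts()`, which yields the per-genre counts directly without building an intermediate Python list.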

movies.count()

title     34208
genres    34208
dtype: int64

allmoviegenres = pd.Series(genreslist)
genrestats = allmoviegenres.value_counts()
fig, ax = plt.subplots(1, 1, figsize=(15, 4))
genrestats.plot(ax=ax, kind='bar', title='freq');

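Drama is the most common genre, followed by Comedy, then Thriller, Romance, Action, Crime, Horror, Documentary, Adventure, Sci-Fi, Mystery, Fantasy, Children, Animation, War, Musical, Western, Film-Noir, and IMAX.

Next, look at the genre distribution over all rated movies, counting each rating once.

#mi = movies.loc[ratings.movieId]
#genreslist = [] # collects the genre labels of the movie behind every rating
#genreseries = mi.genres.str.split(pat = "|")
#genrecount = genreseries.count()
#for i in range(genrecount):
#    genreslist.extend(genreseries.iloc[i]) # flatten all the per-movie lists in the Series into one list
#allmoviegenres = pd.Series(genreslist)
#genrestats = allmoviegenres.value_counts()
#fig, ax = plt.subplots(1, 1, figsize=(15, 4))
#genrestats.plot(ax=ax, kind='bar', title='freq');

The code above ran all night, and in the morning I found it had died with a memory error. I had taken shortcuts, and the code is painful to look at.

What it does is pull the genres of the movie behind every rating and tally them, mainly to see the distribution of genres that viewers actually watch, which is not the same as the per-movie genre statistics above. Viewing records reflect users' interests: whether the final rating is high or low, they watched because they were interested, so it is worth tallying.

The code above ignored memory use. The version below looks up the genres for each row of ratings in turn. With that small change the memory problem goes away, but nothing can be done about speed, since it still has to extract genres for each of the 22,884,377 ratings one at a time.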

s1 = pd.Series(np.zeros(20, dtype=np.int32), index=['Drama','Comedy','Thriller','Romance','Action',
                                                    'Crime','Horror','Documentary','Adventure','Sci-Fi',
                                                    'Mystery','Fantasy','Children','Animation','War',
                                                    '(no genres listed)','Musical','Western','Film-Noir','IMAX'])
rcount = ratings.shape[0] # number of rating rows
for i in range(rcount):
    if i % 1000000 == 0: # acts as a progress indicator; otherwise 7 hours pass with no sign of progress
        print(i)
    mid = ratings.movieId.iloc[i]
    grs = movies.loc[mid].genres.split("|") # genres of the rated movie
    s1.loc[grs] += 1 # one more rating for each of those genres

This block ran for more than seven hours! Enough said: save the result right away!

s1.to_csv('genres_distribution.csv')

s1

Drama                 10137200
Comedy                 8437502
Thriller               6123348
Romance                4342070
Action                 6547286
Crime                  3803018
Horror                 1685352
Documentary             279609
Adventure              5117321
Sci-Fi                 3694890
Mystery                1794175
Fantasy                2449668
Children               1923874
Animation              1362681
War                    1206361
(no genres listed)        2454
Musical                 974697
Western                 472588
Film-Noir               238480
IMAX                    676420
dtype: int32

s1_sort = s1.sort_values(ascending = False) # sort in descending order

fig, ax = plt.subplots(1, 1, figsize=(15, 4))
s1_sort.plot(ax=ax, kind='bar', title='freq');

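As shown above, the genre ranking in viewers' rating records is broadly consistent with the genre ranking of the movies that exist. One explanation is that movies are made to meet audience demand; another is that audiences watch whatever is available.

The exception is documentaries, whose share of ratings is comparatively low and ranks further down, which means supply exceeds demand. Still, I would encourage everyone to watch more documentaries: acquiring knowledge is a slow, gentle process.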
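The seven-hour loop above looks up genres once per rating row, although every rating of the same movie contributes the same genres. A faster route is probably to count ratings per movie first and then weight each movie's genres by that count. The sketch below shows the idea on toy tables; `toy_ratings` and `toy_movies` are stand-ins for `ratings` and `movies`, and the approach has not been run against the full dataset.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables
toy_ratings = pd.DataFrame({"userId": [1, 1, 2, 3], "movieId": [1, 3, 1, 2]})
toy_movies = pd.DataFrame(
    {"genres": ["Adventure|Children", "Adventure|Fantasy", "Comedy|Romance"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Ratings per movie, computed once instead of once per rating row
per_movie = toy_ratings["movieId"].value_counts()

# One entry per (movie, genre) pair, indexed by movieId
genres_long = toy_movies["genres"].str.split("|").explode()

# Attach each movie's rating count to every one of its genres, then sum per genre
counts = per_movie.reindex(genres_long.index, fill_value=0)
rated_genres = counts.groupby(genres_long.to_numpy()).sum().sort_values(ascending=False)
print(rated_genres)
```

The work then scales with the number of movies rather than the number of ratings, apart from the single `value_counts` pass.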

Recommender System

Given the painful lesson of working with the large dataset above, I switch here to the smaller MovieLens 1M Dataset.

Loading the Data

ratings1m_train = pd.read_csv('./ml-1m/ratings.dat', sep="::", names=['userId','movieId','rating','timestamp'],engine='python')

ratings1m_train.head()

	userId	movieId	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

ratings1m_train = ratings1m_train.drop(['timestamp','rating'], axis=1) # TopN recommendation ignores the actual rating values

ratings1m_train.count()

userId     1000209
movieId    1000209
dtype: int64

Splitting into Training and Test Sets

from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

ratings1m_train.head()

	userId	movieId
0	1	1193
1	1	661
2	1	914
3	1	3408
4	1	2355

totalcount = ratings1m_train.shape[0] # number of rows

all_index = np.arange(totalcount)
train_index = np.random.choice(totalcount, int(0.8*totalcount), replace=False) # draw 80% of the positions in [0, totalcount) without replacement
test_index = np.setdiff1d(all_index, train_index) # set difference: the remaining 20%

train_data = ratings1m_train.iloc[train_index]
test_data = ratings1m_train.iloc[test_index]
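The same 80/20 split can be written more compactly with pandas alone. The sketch below uses a toy frame as a stand-in for `ratings1m_train`; the fixed `random_state` is an arbitrary choice added here so the split is reproducible.

```python
import pandas as pd

# Toy stand-in for ratings1m_train
toy = pd.DataFrame({"userId": range(10), "movieId": range(100, 110)})

# Draw 80% of the rows without replacement
train = toy.sample(frac=0.8, random_state=42)

# Everything that was not drawn forms the test set
test = toy.drop(train.index)
print(len(train), len(test))
```

Because `drop` removes rows by index label, this relies on the frame having a unique index, which holds for a frame freshly read with `read_csv`.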

Computing User Similarity

users = ratings1m_train.userId.unique()
users

array([   1,    2,    3, ..., 6038, 6039, 6040], dtype=int64)

movies = ratings1m_train.movieId.unique()
movies

array([1193,  661,  914, ..., 2845, 3607, 2909], dtype=int64)

The dataset contains 1,000,209 ratings of 3,706 movies by 6,040 users.

userSimilarity = pd.DataFrame(0.0, columns=users, index=users) # user similarity matrix, initialized to 0.0 so it can hold float similarities

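Prediction accuracy measures a recommender system's ability to predict user behavior. It is a very important offline evaluation metric. Computing it requires an offline dataset of users' historical behavior, split into a training set and a test set. A user-behavior model built on the training set is applied to the test set, and the overlap between the predicted behavior and the actual behavior in the test set is taken as the prediction accuracy.

RMSE is generally regarded as harsher than MAE, because squaring the errors increases the penalty on badly mispredicted ratings.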
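To see how the squared term changes the picture, here is a small sketch that computes both metrics on hypothetical (actual, predicted) rating pairs. MAE is the mean of the absolute errors, and RMSE is the square root of the mean of the squared errors. The numbers are invented for illustration.

```python
import math

# Hypothetical (actual, predicted) rating pairs; one prediction is badly off
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 2.0), (2.0, 2.5)]

errors = [actual - predicted for actual, predicted in pairs]

# Mean absolute error: every unit of error counts the same
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: large errors are weighted more heavily
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print("MAE = {:.3f}, RMSE = {:.3f}".format(mae, rmse))
```

The single 3-point miss accounts for most of the RMSE, while in the MAE it weighs in only in proportion to its size.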

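Precision is the fraction of the recommendation list on which the user actually acted. Recall is the fraction of the user's actual actions that came from the recommendations.

Rating prediction has long been a hot topic in recommender-system research. On this point, former Amazon scientist Greg Linden argues that the purpose of movie recommendation is to find the movies a user is most likely to be interested in, not to predict the score the user would give after watching, so TopN better matches what applications need. A user might rate some movie very highly after watching it and yet be unlikely ever to watch it, so predicting whether a user will watch a movie matters more than predicting the rating.

This assignment studies the TopN recommendation problem and ignores the actual rating values in the dataset. The task in TopN recommendation is to predict whether a user will rate a given movie, not what rating they will give.

Making Recommendations

Build an inverted table from items to users: for each item, store the list of users who have acted on it.

movie_users = pd.Series('',index=movies) # a Series cannot be filled with empty lists directly, so store empty strings first and let split turn each one into a list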
movie_users = movie_users.str.split()

for uid, mid in zip(train_data.userId, train_data.movieId): # scan the training set
    movie_users.loc[mid].append(uid) # record that user uid rated movie mid

import math

C = dict()
N = dict()

for i in movie_users.index.values:
    for u in movie_users.ix[i]:
        N.setdefault(u,0) # start at 0 so that N[u] ends up as the number of movies u rated
        N[u] += 1
        for v in movie_users.ix[i]:
            if u == v:
                continue
            C.setdefault(u,{})
            C[u].setdefault(v,0)
            C[u][v] += 1
for u, related_users in C.items():
    for v, cuv in related_users.items():
        userSimilarity.loc[u, v] = cuv / math.sqrt(N[u]*N[v]) # write through .loc[u, v]; chained indexing may assign to a copy
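In formula form, the loops above appear intended to compute the cosine similarity between the sets of movies two users rated. Writing N(u) for the set of movies that user u rated in the training set (notation introduced here for illustration, distinct from the dictionary N in the code, which holds the set sizes):

\[w_{uv}=\frac{|N(u)\cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}\]

For example, if u rated 4 movies, v rated 9, and they have 3 in common, then w_{uv} = 3 / sqrt(4 × 9) = 0.5.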

from operator import itemgetter
def recommend(uid, n_sim_user, n_rec_movie):
    K = n_sim_user
    N = n_rec_movie
    rank = dict()
    watched_movies = set(train_data[train_data.userId == uid].movieId.values)

    # the K users most similar to uid
    simusers = userSimilarity.loc[uid].sort_values(ascending=False)[0:K]
    for v in simusers.index.values:
        for m in train_data[train_data.userId == v].movieId.values:
            if m in watched_movies:
                continue
            rank.setdefault(m,0)
            rank[m] += simusers.loc[v] # accumulate the similarity of every neighbour who rated m

    # return the N highest-scoring movies
    return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
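Read as a formula, the score that `recommend` accumulates for a candidate movie i is the summed similarity of the nearest neighbours who rated it. With S(u, K) denoting the K users most similar to u, N(v) the set of movies v rated in the training set, and w_{uv} the user similarity (symbols introduced here for illustration):

\[p(u,i)=\sum_{v\in S(u,K),\ i\in N(v)} w_{uv}\]

Movies already in N(u) are skipped, and the N movies with the largest p(u, i) are returned.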

Finally, compute the precision, recall, and coverage of the recommendation algorithm offline.

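Let U be the set of users in the system, R(u) the recommendation list produced for user u from behavior in the training set, and T(u) the list of that user's actions in the test set. The precision of the recommendations is then defined as: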

\[Precision=\frac{\sum_{u\in U}|R(u)\cap T(u)|}{\sum_{u\in U}|R(u)|}\]

The recall of the recommendations is defined as:

\[Recall=\frac{\sum_{u\in U}|R(u)\cap T(u)|}{\sum_{u\in U}|T(u)|}\]

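The coverage of the recommender system, where I denotes the set of all items, is the fraction of items that appear in at least one recommendation list:

\[Coverage=\frac{|\bigcup_{u\in U}R(u)|}{|I|}\]

The code is as follows: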

def evaluate(n_sim_user, n_rec_movie):
    hit = 0
    rec_count = 0
    test_count = 0
    all_rec_movies = set()
    movie_count = ratings1m_train.movieId.unique().shape[0]

    for uid in train_data.userId.unique(): # visit each user once, not once per rating row
        test_movies = set(test_data[test_data.userId == uid].movieId.values)
        rec_movies = recommend(uid, n_sim_user, n_rec_movie)
        for movie, w in rec_movies:
            if movie in test_movies:
                hit += 1
            all_rec_movies.add(movie)
        rec_count += len(rec_movies) # count what was actually recommended, which can be fewer than n_rec_movie
        test_count += len(test_movies)

    precision = hit / (1.0*rec_count)
    recall = hit / (1.0*test_count)
    coverage = len(all_rec_movies) / (1.0*movie_count)

    return (precision, recall, coverage)

print(evaluate(20, 10))

The code above ran all night without producing a result, so I had to give up: the computation is too expensive.
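Much of that cost probably comes from repeated DataFrame filtering and label lookups inside Python loops. As a hedged illustration rather than a tuned implementation, the sketch below runs the same user-based pipeline (inverted index, cosine similarity, top-K neighbours, precision, recall, coverage) on plain dictionaries and sets. The toy data and all function names are hypothetical, and coverage is measured against the items seen in training.

```python
import math
from collections import defaultdict

def user_similarity(train):
    """train: dict user -> set of items. Returns dict u -> {v: cosine similarity}."""
    item_users = defaultdict(set)                # inverted index: item -> users
    for u, items in train.items():
        for i in items:
            item_users[i].add(u)
    co = defaultdict(lambda: defaultdict(int))   # co[u][v] = number of shared items
    for users in item_users.values():
        for u in users:
            for v in users:
                if u != v:
                    co[u][v] += 1
    return {u: {v: c / math.sqrt(len(train[u]) * len(train[v]))
                for v, c in vs.items()}
            for u, vs in co.items()}

def recommend(u, train, sim, k, n):
    """Score unseen items by summing the similarity of the k nearest users who have them."""
    scores = defaultdict(float)
    neighbours = sorted(sim.get(u, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]
    for v, w in neighbours:
        for i in train[v] - train[u]:
            scores[i] += w
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [i for i, _ in ranked]

def evaluate(train, test, k, n):
    sim = user_similarity(train)
    all_items = set().union(*train.values())
    hit = rec_count = test_count = 0
    recommended = set()
    for u in train:                              # each user exactly once
        recs = recommend(u, train, sim, k, n)
        held_out = test.get(u, set())
        hit += len(set(recs) & held_out)
        rec_count += len(recs)
        test_count += len(held_out)
        recommended.update(recs)
    precision = hit / rec_count if rec_count else 0.0
    recall = hit / test_count if test_count else 0.0
    coverage = len(recommended) / len(all_items)
    return precision, recall, coverage

# Hypothetical toy data: user -> set of movie ids
train = {"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {2, 5}}
test = {"a": {4}, "b": {3}, "c": {1}}
result = evaluate(train, test, k=2, n=2)
print(result)
```

For the real data, `train` and `test` could be built with something like `train_data.groupby('userId').movieId.apply(set).to_dict()`. Whether this finishes in reasonable time on MovieLens 1M has not been verified here.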
