Comprehensive Guide to build a Recommendation Engine from scratch (in Python)

 

https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/ is a detailed, beginner-friendly article on recommendation systems. The content is thorough and nicely formatted; recommended.

Below is the translation. It follows the meaning rather than the literal wording, so bear with it; if your English is good, read the original, whose layout is much more comfortable than mine.

NOTE: I found one error in the original article and point it out below. Places where I had doubts or found something debatable during translation are flagged as translator's notes.

 

Comprehensive Guide to build a Recommendation Engine from scratch (in Python)


 

 

Introduction

In today’s world, every customer is faced with multiple choices. For example, if I’m looking for a book to read without any specific idea of what I want, there’s a wide range of possibilities for how my search might pan out. I might waste a lot of time browsing around on the internet and trawling through various sites hoping to strike gold. I might look for recommendations from other people.


 

But if there was a site or app which could recommend me books based on what I have read previously, that would be a massive help. Instead of wasting time on various sites, I could just log in and voila! 10 recommended books tailored to my taste.


 

  

 

This is what recommendation engines do and their power is being harnessed by most businesses these days. From Amazon to Netflix, Google to Goodreads, recommendation engines are one of the most widely used applications of machine learning techniques.


In this article, we will cover various types of recommendation engine algorithms and the fundamentals of creating them in Python. We will also see the mathematics behind the workings of these algorithms. Finally, we will create our own recommendation engine using matrix factorization.


 

Table of Contents

  1. What are recommendation engines?
  2. How does a recommendation engine work?
    1. Data collection
    2. Data storage
    3. Filtering the data
      1. Content based filtering
      2. Collaborative filtering
  3. Case study in Python using the MovieLens dataset
  4. Building a collaborative filtering model from scratch
  5. Building a simple popularity and collaborative filtering model using Turicreate
  6. Introduction to matrix factorization
  7. Building a recommendation engine using matrix factorization
  8. Evaluation metrics for recommendation engines
    1. Recall
    2. Precision
    3. RMSE (Root Mean Squared Error)
    4. Mean Reciprocal Rank
    5. MAP at k (Mean Average Precision at cutoff k)
    6. NDCG (Normalized Discounted Cumulative Gain)
  9. What else can be tried?

 

 

1. What are recommendation engines?

Till recently, people generally tended to buy products recommended to them by their friends or the people they trust. This used to be the primary method of purchase when there was any doubt about the product. But with the advent of the digital age, that circle has expanded to include online sites that utilize some sort of recommendation engine.


 

A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. It first captures the past behavior of a customer and based on that, recommends products which the users might be likely to buy.


 

If a completely new user visits an e-commerce site, that site will not have any past history of that user. So how does the site go about recommending products to the user in such a scenario? One possible solution could be to recommend the best selling products, i.e. the products which are high in demand. Another possible solution could be to recommend the products which would bring the maximum profit to the business.


 

If we can recommend a few items to a customer based on their needs and interests, it will create a positive impact on the user experience and lead to frequent visits. Hence, businesses nowadays are building smart and intelligent recommendation engines by studying the past behavior of their users.


 

Now that we have an intuition of recommendation engines, let’s now look at how they work.


 

2. How does a recommendation engine work?

 

Before we deep dive into this topic, first we’ll think of how we can recommend items to users:


  • We can recommend items to a user which are most popular among all the users
  • We can divide the users into multiple segments based on their preferences (user features) and recommend items to them based on the segment they belong to

 

Both of the above methods have their drawbacks. In the first case, the most popular items would be the same for each user so everybody will see the same recommendations. While in the second case, as the number of users increases, the number of features will also increase. So classifying the users into various segments will be a very difficult task.


 

The main problem here is that we are unable to tailor recommendations based on the specific interest of the users. It’s like Amazon is recommending you buy a laptop just because it’s been bought by the majority of the shoppers. But thankfully, Amazon (or any other big firm) does not recommend products using the above mentioned approach. They use some personalized methods which help them in recommending products more accurately.


 

Let’s now focus on how a recommendation engine works by going through the following steps.


 

 

2.1 Data collection

This is the first and most crucial step for building a recommendation engine. The data can be collected by two means: explicitly and implicitly. Explicit data is information that is provided intentionally, i.e. input from the users such as movie ratings. Implicit data is information that is not provided intentionally but gathered from available data streams like search history, clicks, order history, etc.


 

In the above image, Netflix is collecting the data explicitly in the form of ratings given by user to different movies.


 

Here the order history of a user is recorded by Amazon which is an example of implicit mode of data collection.


 

2.2 Data storage

The amount of data dictates how good the recommendations of the model can get. For example, in a movie recommendation system, the more ratings users give to movies, the better the recommendations get for other users. The type of data plays an important role in deciding the type of storage that has to be used. This type of storage could include a standard SQL database, a NoSQL database or some kind of object storage.


 

 

2.3 Filtering the data

After collecting and storing the data, we have to filter it so as to extract the relevant information required to make the final recommendations.


 

Source: intheshortestrun

There are various algorithms that help us make the filtering process easier. In the next section, we will go through each algorithm in detail.


 

2.3.1 Content based filtering

This algorithm recommends products which are similar to the ones that a user has liked in the past.


        Source: Medium

For example, if a person has liked the movie 「Inception」, then this algorithm will recommend movies that fall under the same genre. But how does the algorithm understand which genre to pick and recommend movies from?


 

Consider the example of Netflix. They save all the information related to each user in a vector form. This vector contains the past behavior of the user, i.e. the movies liked/disliked by the user and the ratings given by them. This vector is known as the profile vector. All the information related to movies is stored in another vector called the item vector. Item vector contains the details of each movie, like genre, cast, director, etc.


 

The content-based filtering algorithm finds the cosine of the angle between the profile vector and item vector, i.e. cosine similarity. Suppose A is the profile vector and B is the item vector, then the similarity between them can be calculated as:
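$$sim(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$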


(Translator's note: A and B are two entirely different kinds of vectors, so computing a similarity between them seems questionable to me.)

 

Based on the cosine value, which ranges between -1 to 1, the movies are arranged in descending order and one of the two below approaches is used for recommendations:


  • Top-n approach: the top n movies are recommended (here n can be decided by the business)
  • Rating scale approach: a threshold is set and all the movies above that threshold are recommended
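To make this concrete, here is a minimal sketch of the top-n approach (the profile and item vectors below are made-up toy data, not from the article):

import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vectors over the same 4 content features (e.g. genres)
profile = np.array([4.0, 0.0, 2.0, 1.0])          # the user's profile vector
items = {
    'movie_1': np.array([5.0, 0.0, 1.0, 0.0]),
    'movie_2': np.array([0.0, 3.0, 0.0, 4.0]),
    'movie_3': np.array([3.0, 0.0, 3.0, 1.0]),
}

# Score every item against the profile and keep the top n
n = 2
scores = {name: cosine_sim(profile, vec) for name, vec in items.items()}
print(sorted(scores, key=scores.get, reverse=True)[:n])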

 

Other methods that can be used to calculate the similarity are:

  • Euclidean Distance: Similar items will lie in close proximity to each other if plotted in n-dimensional space. So, we can calculate the distance between items and, based on that distance, recommend items to the user. The formula for the Euclidean distance is given below.

  • Pearson’s Correlation: It tells us how much two items are correlated. The higher the correlation, the greater the similarity. Pearson’s correlation is given below. (Translator's note: I see this as a variant of cosine similarity: subtract the mean first, then compute the cosine. This removes the distribution differences caused by personal rating habits, e.g. easily satisfied users who always rate high versus picky users who rate low.)
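Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Pearson's correlation:

$$sim(u, v) = \frac{\sum_{i}(r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i}(r_{ui} - \bar{r}_u)^2}\;\sqrt{\sum_{i}(r_{vi} - \bar{r}_v)^2}}$$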

A major drawback of this algorithm is that it is limited to recommending items that are of the same type. It will never recommend products which the user has not bought or liked in the past. So if a user has watched or liked only action movies in the past, the system will recommend only action movies. It’s a very narrow way of building an engine.


 

To improve on this type of system, we need an algorithm that can recommend items not just based on the content, but the behavior of users as well.


 

2.3.2 Collaborative filtering

Let us understand this with an example. If person A likes 3 movies, say Interstellar, Inception and Predestination, and person B likes Inception, Predestination and The Prestige, then they have almost similar interests. We can say with some certainty that A should like The Prestige and B should like Interstellar. The collaborative filtering algorithm uses 「User Behavior」 for recommending items. This is one of the most commonly used algorithms in the industry as it is not dependent on any additional information. There are different types of collaborating filtering techniques and we shall look at them in detail below.


User-User collaborative filtering

This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.


 


Source: Medium

In terms of our movies example from earlier, this algorithm finds the similarity between each user based on the ratings they have previously given to different movies. The prediction of an item for a user u is calculated by computing the weighted sum of the user ratings given by other users to an item i.


The prediction Pu,i is given by:
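$$P_{u,i} = \frac{\sum_{v} S_{u,v}\, R_{v,i}}{\sum_{v} |S_{u,v}|}$$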

Here,

  • Pu,i is the predicted rating of item i for user u
  • Rv,i is the rating given by user v to movie i
  • Su,v is the similarity between users u and v

Now, we have the ratings for users in the profile vector and, based on that, we have to predict the ratings for other users. The following steps are followed to do so:


  1. For predictions we need the similarity between users u and v. We can make use of Pearson correlation.
  2. First we find the items rated by both users; based on those ratings, the correlation between the users is calculated.
  3. The predictions can be calculated using the similarity values. This algorithm first calculates the similarity between each pair of users and then, based on each similarity, calculates the predictions. Users having higher correlation will tend to be similar.
  4. Based on these prediction values, recommendations are made. Let us understand it with an example:

Consider the user-movie rating matrix: 


User/Movie   x1   x2   x3   x4   x5   Mean User Rating
A             4    1    -    4    -    3
B             -    4    -    2    3    3
C             -    1    -    4    4    3

 

Here we have a user-movie rating matrix. To understand this in a more practical manner, let’s find the similarity between users (A, C) and (B, C) in the above table. Common movies rated by A and C are movies x2 and x4, and by B and C are movies x2, x4 and x5.

The Pearson correlations then work out as follows:
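For users A and C, the commonly rated movies are x2 and x4, and both users have a mean rating of 3:

$$S_{A,C} = \frac{(1-3)(1-3) + (4-3)(4-3)}{\sqrt{(1-3)^2+(4-3)^2}\;\sqrt{(1-3)^2+(4-3)^2}} = \frac{5}{\sqrt{5}\,\sqrt{5}} = 1$$

For users B and C, the commonly rated movies are x2, x4 and x5:

$$S_{B,C} = \frac{(4-3)(1-3) + (2-3)(4-3) + (3-3)(4-3)}{\sqrt{1^2+(-1)^2+0^2}\;\sqrt{(-2)^2+1^2+1^2}} = \frac{-3}{\sqrt{2}\,\sqrt{6}} \approx -0.87$$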

 

The correlation between user A and C is more than the correlation between B and C. Hence users A and C have more similarity and the movies liked by user A will be recommended to user C and vice versa.


 

This algorithm is quite time consuming as it involves calculating the similarity for each user and then calculating prediction for each similarity score. One way of handling this problem is to select only a few users (neighbors) instead of all to make predictions, i.e. instead of making predictions for all similarity values, we choose only few similarity values. There are various ways to select the neighbors:


  • Select a similarity threshold and choose all the users above that value
  • Randomly select the users
  • Arrange the neighbors in descending order of their similarity value and choose the top-N users
  • Use clustering for choosing neighbors

This algorithm is useful when the number of users is small. It's not effective when there are a large number of users, as it will take a lot of time to compute the similarity between all user pairs. This leads us to item-item collaborative filtering, which is effective when the number of users is more than the items being recommended.


 

Item-Item collaborative filtering

In this algorithm, we compute the similarity between each pair of items.


Source: Medium

So in our case we will find the similarity between each movie pair and based on that, we will recommend similar movies which are liked by the users in the past. This algorithm works similar to user-user collaborative filtering with just a little change – instead of taking the weighted sum of ratings of 「user-neighbors」, we take the weighted sum of ratings of 「item-neighbors」. The prediction is given by:
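$$P_{u,i} = \frac{\sum_{j} S_{i,j}\, R_{u,j}}{\sum_{j} |S_{i,j}|}$$

where the sum runs over the items j that user u has already rated, and Si,j is the similarity between items i and j.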


Now we will find the similarity between items, using cosine similarity.

Now, as we have the similarity between each movie and the ratings, predictions can be made, and based on those predictions similar movies are recommended. Let us understand it with an example.

User/Movie         x1   x2   x3   x4   x5
A                   4    1    2    4    4
B                   2    4    4    2    1
C                   -    1    3    -    4
Mean Item Rating    3    2    3    3    3

 

Here the mean item rating is the average of all the ratings given to a particular item (compare it with the table we saw in user-user filtering). Instead of finding the user-user similarity as we saw earlier, we find the item-item similarity.

To do this, first we need to find such users who have rated those items and based on the ratings, similarity between the items is calculated. Let us find the similarity between movies (x1, x4) and (x1, x5). Common users who have rated movies x1 and x4 are A and B while the users who have rated movies x1 and x5 are also A and B.
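With the item means (x1 = 3, x4 = 3, x5 = 3) subtracted, and A and B as the common raters:

$$S_{x1,x4} = \frac{(4-3)(4-3) + (2-3)(2-3)}{\sqrt{2}\,\sqrt{2}} = \frac{2}{2} = 1$$

$$S_{x1,x5} = \frac{(4-3)(4-3) + (2-3)(1-3)}{\sqrt{2}\,\sqrt{5}} = \frac{3}{\sqrt{10}} \approx 0.95$$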

(Translator's note: the author said cosine similarity would be used, but what is actually computed here is a Pearson-style similarity, since the item mean is subtracted from each vector first. Arguably unnecessary; plain cosine similarity would have worked just as well.)

 

The similarity between movie x1 and x4 is more than the similarity between movie x1 and x5. So based on these similarity values, if any user searches for movie x1, they will be recommended movie x4 and vice versa. Before going further and implementing these concepts, there is a question which we must know the answer to – what will happen if a new user or a new item is added in the dataset? It is called a Cold Start. There can be two types of cold start:


  1. Visitor Cold Start
  2. Product Cold Start

Visitor Cold Start means that a new user is introduced in the dataset. Since there is no history of that user, the system does not know the preferences of that user. It becomes harder to recommend products to that user. So, how can we solve this problem? One basic approach could be to apply a popularity based strategy, i.e. recommend the most popular products. These can be determined by what has been popular recently overall or regionally. Once we know the preferences of the user, recommending products will be easier.


On the other hand, Product Cold Start means that a new product is launched in the market or added to the system. User action is most important to determine the value of any product. More the interaction a product receives, the easier it is for our model to recommend that product to the right user. We can make use of Content based filtering to solve this problem. The system first uses the content of the new product for recommendations and then eventually the user actions on that product.

(Translator's note: with a new product we do not yet know its similarity to other products, so we can tag its attributes manually, e.g. label it as household goods or home appliances, and recommend based on content: users who are browsing appliances get the new product pushed to them.)

Now let’s solidify our understanding of these concepts using a case study in Python. Get your machines ready because this is going to be fun!


 

 

3. Case study in Python using the MovieLens dataset

 

We will work on the MovieLens dataset and build a model to recommend movies to the end users. This data has been collected by the GroupLens Research Project at the University of Minnesota. The dataset can be downloaded from here. This dataset consists of:


 

  • 100,000 ratings (1-5) from 943 users on 1682 movies
  • Demographic information of the users (age, gender, occupation, etc.)

 

First, we’ll import our standard libraries and read the dataset in Python.


 

import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# pass in column names for each CSV as the column name is not given in the file and read them using pandas.
# You can check the column names from the readme file

#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=u_cols,encoding='latin-1')

#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols,encoding='latin-1')

#Reading items file:
i_cols = ['movie id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv('ml-100k/u.item', sep='|', names=i_cols,
encoding='latin-1')

  

 

After loading the dataset, we should look at the content of each file (users, ratings, items).

 

  • Users

 

print(users.shape)
users.head()

 

 

So, we have 943 users in the dataset and each user has 5 features, i.e. user_ID, age, sex, occupation and zip_code. Now let’s look at the ratings file.

 

  • Ratings               

 

print(ratings.shape)
ratings.head()

 

 

 

We have 100k ratings for different user and movie combinations. Now finally examine the items file.

 

  • Items

 

print(items.shape)
items.head()

 

 

 

This dataset contains attributes of 1682 movies. There are 24 columns, of which the last 19 specify the genres of a particular movie. These are binary columns: a value of 1 denotes that the movie belongs to that genre, and 0 otherwise.


 

The dataset has already been divided into train and test by GroupLens where the test data has 10 ratings for each user, i.e. 9,430 rows in total. We will read both these files into our Python environment.


r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_train = pd.read_csv('ml-100k/ua.base', sep='\t', names=r_cols, encoding='latin-1')
ratings_test = pd.read_csv('ml-100k/ua.test', sep='\t', names=r_cols, encoding='latin-1')
ratings_train.shape, ratings_test.shape

 

 

It’s finally time to build our recommendation engine!

 

 

4. Building a collaborative filtering model from scratch

 

We will recommend movies based on user-user similarity and item-item similarity. For that, first we need to calculate the number of unique users and movies.


 

n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie_id.unique().shape[0]

 

 

Now, we will create a user-item matrix which can be used to calculate the similarity between users and items.


 

data_matrix = np.zeros((n_users, n_items))
for line in ratings.itertuples():
    # line = (index, user_id, movie_id, rating, timestamp); the IDs are 1-based
    data_matrix[line[1]-1, line[2]-1] = line[3]

 

 

Now, we will calculate the similarity. We can use the pairwise_distance function from sklearn to calculate the cosine similarity.

Now, we will calculate the similarity. (Translator's note: the original used sklearn's pairwise_distances with metric='cosine', which returns the cosine distance, i.e. 1 - cosine similarity. Since the prediction function below weights ratings by similarity, I believe cosine_similarity is what is needed here, so I changed the code accordingly.)

 

from sklearn.metrics.pairwise import cosine_similarity

# The original article used pairwise_distances(data_matrix, metric='cosine'),
# which yields cosine *distance* (1 - similarity); we want the similarity itself.
user_similarity = cosine_similarity(data_matrix)
item_similarity = cosine_similarity(data_matrix.T)

 

 

This gives us the item-item and user-user similarity in an array form. The next step is to make predictions based on these similarities. Let’s define a function to do just that.


 

def predict(ratings, similarity, type='user'):
    if type == 'user':
        # Mean-center each user's ratings, take the similarity-weighted average
        # of the deviations from other users, then add the user's mean back.
        mean_user_rating = ratings.mean(axis=1)
        # We use np.newaxis so that mean_user_rating has the same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Similarity-weighted average of the user's own ratings over similar items
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

 

 

Finally, we will make predictions based on user similarity and item similarity.


 

user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

 

 

As it turns out, we also have a library which generates all these recommendations automatically. Let us now learn how to create a recommendation engine using turicreate in Python. To get familiar with turicreate and to install it on your machine, refer here.


 

 

5. Building a simple popularity and collaborative filtering model using Turicreate

 

After installing turicreate, let’s first import it and read the train and test datasets into our environment. Since we will be using turicreate, we need to convert the datasets into SFrames.


 

import turicreate
train_data = turicreate.SFrame(ratings_train)
test_data = turicreate.SFrame(ratings_test)

 

We have user behavior as well as attributes of the users and movies, so we can make content based as well as collaborative filtering algorithms. We will start with a simple popularity model and then build a collaborative filtering model.


 

First we’ll build a model which will recommend movies based on the most popular choices, i.e., a model where all the users receive the same recommendation(s). We will use the turicreate recommender function popularity_recommender for this.


popularity_model = turicreate.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')
 

Various arguments which we have used are:


 

  • train_data: the SFrame which contains the required training data
  • user_id: the column name which represents each user ID
  • item_id: the column name which represents each item to be recommended (movie_id)
  • target: the column name representing scores/ratings given by the user

 

It’s prediction time! We will recommend the top 5 items for the first 5 users in our dataset.


 

popularity_recomm = popularity_model.recommend(users=[1,2,3,4,5],k=5)
popularity_recomm.print_rows(num_rows=25)

 

 

 

Note that the recommendations for all users are the same – 1467, 1201, 1189, 1122, 814. And they’re all in the same order! This confirms that all the recommended movies have an average rating of 5, i.e. all the users who watched the movie gave it a top rating. Thus our popularity system works as expected.


 

After building a popularity model, we will now build a collaborative filtering model. Let’s train the item similarity model and make top 5 recommendations for the first 5 users.


 

#Training the model
item_sim_model = turicreate.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='cosine')

#Making recommendations
item_sim_recomm = item_sim_model.recommend(users=[1,2,3,4,5],k=5)
item_sim_recomm.print_rows(num_rows=25)

 

 

 

Here we can see that the recommendations (movie_id) are different for each user. So personalization exists, i.e. for different users we have a different set of recommendations.


 

In this model, we do not have the ratings for each movie given by each user. We must find a way to predict all these missing ratings. For that, we have to find a set of features which can define how a user rates the movies. These are called latent features. We need to find a way to extract the most important latent features from the existing features. Matrix factorization, covered in the next section, is one such technique which uses the lower dimension dense matrix and helps in extracting the important latent features.

(Translator's note: the CF model is similarity-based. Matrix factorization is closer in spirit to the content-based model; the difference is that in content-based filtering the attribute tags are assigned by hand, while matrix factorization tries to discover latent attributes automatically. These latent features have no explicit names, and we do not know exactly which properties they capture. The introduction to matrix factorization follows.)

 

 


 

6. Introduction to matrix factorization

 

Let’s understand matrix factorization with an example. Consider a user-movie ratings matrix (1-5) given by different users to different movies.

 

 

Here user_id is the unique ID of different users and each movie is also assigned a unique ID. A rating of 0.0 represents that the user has not rated that particular movie (1 is the lowest rating a user can give). We want to predict these missing ratings. Using matrix factorization, we can find some latent features that can determine how a user rates a movie. We decompose the matrix into constituent parts in such a way that the product of these parts generates the original matrix.

 

 

Let us assume that we have to find k latent features. So we can divide our rating matrix R(MxN) into P(MxK) and Q(NxK) such that P x QT (here QT is the transpose of Q matrix) approximates the R matrix:
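$$R_{M \times N} \approx P_{M \times K} \;\Sigma_{K \times K}\; Q^{T}_{K \times N}$$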

 

where:

 

  • M is the total number of users
  • N is the total number of movies
  • K is the total latent features
  • R is MxN user-movie rating matrix
  • P is MxK user-feature affinity matrix which represents the association between users and features
  • Q is NxK item-feature relevance matrix which represents the association between movies and features
  • Σ is KxK diagonal feature weight matrix which represents the essential weights of features

 

Choosing the latent features through matrix factorization removes the noise from the data. How? Well, it removes the feature(s) which do not determine how a user rates a movie. Now, to get the rating rui that a user u would give to a movie i, we take the dot product of the user's latent vector pu and the movie's latent vector qi, i.e. we multiply the matching components across all the latent features k and add them up:
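$$\hat{r}_{ui} = \sum_{k} p_{uk}\, q_{ki} = p_u \cdot q_i$$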

 

 

This is how matrix factorization gives us the ratings for the movies which have not been rated by the users. But how can we add new data to our user-movie rating matrix, i.e. if a new user joins and rates a movie, how will we add this data to our pre-existing matrix?

 

Let me make it easier for you through the matrix factorization method. If a new user joins the system, there will be no change in the diagonal feature weight matrix Σ, as well as the item-feature relevance matrix Q. The only change will occur in the user-feature affinity matrix P. We can apply some matrix multiplication methods to do that.

 

We have,
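$$R = P\,\Sigma\,Q^T$$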

 

 

Let’s multiply with Q on both sides.
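$$R\,Q = P\,\Sigma\,Q^T Q$$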

 

 

Now, we have
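$$Q^T Q = I \quad \text{(the columns of } Q \text{ are orthonormal)}$$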

 

 

So,
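$$R\,Q = P\,\Sigma$$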

 

 

Simplifying it further, we can get the P matrix:
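$$P = R\,Q\,\Sigma^{-1}$$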

 

 

This is the updated user-feature affinity matrix. Similarly, if a new movie is added to the system, we can follow similar steps to get the updated item-feature relevance matrix Q.

 

Remember, we decomposed R matrix into P and Q. But how do we decide which P and Q matrix will approximate the R matrix? We can use the gradient descent algorithm for doing this. The objective here is to minimize the squared error between the actual rating and the one estimated using P and Q. The squared error is given by:
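$$e_{ui}^{2} = (r_{ui} - \hat{r}_{ui})^{2} = \Big(r_{ui} - \sum_{k} p_{uk}\,q_{ki}\Big)^{2}$$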

 

 

Here,

 

  • eui is the error
  • rui is the actual rating given by user u to the movie i
  • řui is the predicted rating by user u for the movie i

 

Our aim was to decide the p and q value in such a way that this error is minimized. We need to update the p and q values so as to get the optimized values of these matrices which will give the least error. Now we will define an update rule for puk and qki. The update rule in gradient descent is defined by the gradient of the error to be minimized.
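$$\frac{\partial e_{ui}^{2}}{\partial p_{uk}} = -2\,e_{ui}\,q_{ki}, \qquad \frac{\partial e_{ui}^{2}}{\partial q_{ki}} = -2\,e_{ui}\,p_{uk}$$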

 

 

 

As we now have the gradients, we can apply the update rule for puk and qki.
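$$p_{uk} \leftarrow p_{uk} + 2\alpha\,e_{ui}\,q_{ki}, \qquad q_{ki} \leftarrow q_{ki} + 2\alpha\,e_{ui}\,p_{uk}$$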

 

 

 

Here α is the learning rate which decides the size of each update. The above updates can be repeated until the error is minimized. Once that’s done, we get the optimal P and Q matrix which can be used to predict the ratings. Let us quickly recap how this algorithm works and then we will build the recommendation engine to predict the ratings for the unrated movies.

 

Below is how matrix factorization works for predicting ratings:

 

# for f = 1,2,....,k :
    # for rui ε R :
        # predict rui
        # update puk and qki

 

So based on each latent feature, all the missing ratings in the R matrix will be filled using the predicted rui value. Then puk and qki are updated using gradient descent and their optimal value is obtained. It can be visualized as shown below:

 

 

Now that we have understood the inner workings of this algorithm, we’ll take an example and see how a matrix is factorized into its constituents.

 

Consider a 2 X 3 matrix, A2X3 as shown below:
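(The matrices in this worked example appeared as images in the original article. A 2 x 3 matrix consistent with all the numbers quoted below, in particular the eigenvalues 25, 9 and 0, is the following; the reconstructed intermediate results in the rest of this section are computed from it.)

$$A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{bmatrix}$$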

 

 

Here we have 2 users and their corresponding ratings for 3 movies. Now, we will decompose this matrix into sub parts, such that:
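$$A_{2\times3} = P_{2\times2}\,\Sigma_{2\times3}\,Q^{T}_{3\times3}$$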

 

 

The eigenvectors of AAT give us the P matrix and the eigenvectors of ATA give us the Q matrix. Σ contains the square roots of the nonzero eigenvalues of AAT (equivalently, of ATA).

 

Calculate the eigenvalues for AAT.
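$$AA^T = \begin{bmatrix} 17 & 8 \\ 8 & 17 \end{bmatrix}, \qquad \det(AA^T - \lambda I) = (17-\lambda)^2 - 64 = (\lambda - 25)(\lambda - 9) = 0$$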

 

 

 

So, the eigenvalues of AAT are 25, 9. Similarly, we can calculate the eigenvalues of ATA. These values will be 25, 9, 0. Now we have to calculate the corresponding eigenvectors for AAT and ATA.

 

For λ = 25, we have:
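$$A^T A - 25I = \begin{bmatrix} -12 & 12 & 2 \\ 12 & -12 & -2 \\ 2 & -2 & -17 \end{bmatrix}$$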

 

 

It can be row reduced to:
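$$\begin{bmatrix} 1 & -1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$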

 

 

A unit-length vector in the kernel of that matrix is:
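$$q_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix}$$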

 

 

Similarly, for λ = 9 we have:
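$$A^T A - 9I = \begin{bmatrix} 4 & 12 & 2 \\ 12 & 4 & -2 \\ 2 & -2 & -1 \end{bmatrix}$$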

 

 

It can be row reduced to:
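$$\begin{bmatrix} 1 & 0 & -1/4 \\ 0 & 1 & 1/4 \\ 0 & 0 & 0 \end{bmatrix}$$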

 

 

A unit-length vector in the kernel of that matrix is:
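$$q_2 = \begin{bmatrix} 1/\sqrt{18} \\ -1/\sqrt{18} \\ 4/\sqrt{18} \end{bmatrix}$$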

 

 

For the last eigenvector, we could find a unit vector perpendicular to q1 and q2. So,
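$$q_3 = \begin{bmatrix} 2/3 \\ -2/3 \\ -1/3 \end{bmatrix}$$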

 

 

The Σ (2x3) matrix carries the square roots of the eigenvalues of AAT or ATA, i.e. of 25 and 9, on its diagonal:
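$$\Sigma = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix}$$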

 

 

Finally, we can compute P (2x2) using the relation σi pi = A qi, i.e. pi = (1/σi) A qi. This gives:
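$$P = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}$$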

 

 

So, the decomposed form of A matrix is given by:
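$$A = P\,\Sigma\,Q^T = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix} \begin{bmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{18} & -1/\sqrt{18} & 4/\sqrt{18} \\ 2/3 & -2/3 & -1/3 \end{bmatrix} = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{bmatrix}$$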

 

    

 

Since we have the P and Q matrix, we can use the gradient descent approach to get their optimized versions. Let us build our recommendation engine using matrix factorization.

 

 

 

7. Building a recommendation engine using matrix factorization

 

Let us define a class to predict the ratings a user would give to all the movies he/she has not rated.

 

class MF():

    # Initializing the user-movie rating matrix, no. of latent features, alpha and beta.
    def __init__(self, R, K, alpha, beta, iterations):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    # Initializing the user-feature and movie-feature matrices and training with SGD
    def train(self):
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initializing the bias terms
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # List of training samples (only the observed, i.e. non-zero, ratings)
        self.samples = [
            (i, j, self.R[i, j])
            for i in range(self.num_users)
            for j in range(self.num_items)
            if self.R[i, j] > 0
        ]

        # Stochastic gradient descent for the given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            mse = self.mse()
            training_process.append((i, mse))
            if (i+1) % 20 == 0:
                print("Iteration: %d ; error = %.4f" % (i+1, mse))

        return training_process

    # Computing the total (square-rooted) squared error over the observed ratings
    def mse(self):
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        return np.sqrt(error)

    # Stochastic gradient descent to get optimized P and Q matrices
    def sgd(self):
        for i, j, r in self.samples:
            prediction = self.get_rating(i, j)
            e = (r - prediction)

            # Update the biases
            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_i[j] += self.alpha * (e - self.beta * self.b_i[j])

            # Update the latent factor matrices
            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i, :])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j, :])

    # Rating for user i and movie j
    def get_rating(self, i, j):
        prediction = self.b + self.b_u[i] + self.b_i[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

    # Full user-movie rating matrix
    def full_matrix(self):
        return self.b + self.b_u[:, np.newaxis] + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)

 

Now we have a class that can predict the ratings. Its inputs are:

 

  • R – The user-movie rating matrix
  • K – Number of latent features
  • alpha – Learning rate for stochastic gradient descent
  • beta – Regularization parameter for bias
  • iterations – Number of iterations to perform stochastic gradient descent

 

We have to convert the user-item ratings to matrix form. This can be done using the pivot function in pandas.

 

R= np.array(ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0))

 

fillna(0) will fill all the missing ratings with 0. Now we have the R matrix. We can initialize the number of latent features, but the number of these features must be less than or equal to the number of original features.

 

Now let us predict all the missing ratings. Let’s take K=20, alpha=0.001, beta=0.01 and iterations=100.

 

mf = MF(R, K=20, alpha=0.001, beta=0.01, iterations=100)
training_process = mf.train()
print()
print("P x Q:")
print(mf.full_matrix())
print()

 

This will give us the error value corresponding to every 20th iteration and finally the complete user-movie rating matrix. The output looks like this:

 

 

We have created our recommendation engine. Let’s focus on how to evaluate a recommendation engine in the next section.

 

 

 

8. Evaluation metrics for recommendation engines

 

For evaluating recommendation engines, we can use the following metrics:

 

    8.1 Recall:

 

  • What proportion of items that a user likes were actually recommended
  • It is given by:
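$$Recall = \frac{tp}{tp + fn}$$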

 

 

    • Here tp represents the number of recommended items that the user likes and tp + fn represents the total number of items that the user likes
    • If a user likes 5 items and the recommendation engine decided to show 3 of them, then the recall will be 0.6
    • Larger the recall, better are the recommendations

 

    8.2 Precision:

 

    • Out of all the recommended items, how many did the user actually like?
    • It is given by:
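$$Precision = \frac{tp}{tp + fp}$$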

 

 

    • Here tp represents the number of recommended items that the user likes and tp + fp represents the total number of items recommended to the user
    • If 5 items were recommended to the user out of which he liked 4, then precision will be 0.8
    • Larger the precision, better the recommendations
    • But consider this case: if we simply recommend all the items, they will definitely cover the items which the user likes. So we have 100% recall! But think about precision for a second. If we recommend say 1000 items and the user likes only 10 of them, then precision is 1%. This is really low. So, our aim should be to maximize both precision and recall.

 

    8.3 RMSE (Root Mean Squared Error):

 

  • It measures the error in the predicted ratings:
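$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(Predicted_i - Actual_i\big)^2}$$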

 

 

    • Here, Predicted is the rating predicted by the model and Actual is the original rating
    • If a user has given a rating of 5 to a movie and we predicted the rating as 4, then RMSE is 1
    • Lesser the RMSE value, better the recommendations

 

The above metrics tell us how accurate our recommendations are but they do not focus on the order of recommendations, i.e. they do not focus on which product to recommend first and what follows after that. We need some metric that also considers the order of the products recommended. So, let’s look at some of the ranking metrics:

 

    8.4 Mean Reciprocal Rank:

 

  • Evaluates the list of recommendations
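$$MRR = \frac{1}{|U|}\sum_{u \in U}\frac{1}{rank_u}$$

where rank_u is the position of the first liked item in the list recommended to user u.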

 

 

    • Suppose we have recommended 3 movies to a user, say A, B, C in the given order, but the user only liked movie C. As the rank of movie C is 3, the reciprocal rank will be 1/3
    • Larger the mean reciprocal rank, better the recommendations

 

    8.5 MAP at k (Mean Average Precision at cutoff k):

 

  • Precision and Recall don’t care about ordering in the recommendations
  • Precision at cutoff k is the precision calculated by considering only the subset of your recommendations from rank 1 through k
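$$P(k) = \frac{\text{number of correct recommendations in the top } k}{k}, \qquad AP@k = \frac{1}{k}\sum_{i=1}^{k} P(i)$$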

 

 

    • Suppose we have made three recommendations [0, 1, 1]. Here 0 means the recommendation is not correct while 1 means that the recommendation is correct. Then the precision at k will be [0, 1/2, 2/3], and the average precision will be (1/3)*(0+1/2+2/3) = 0.38
    • Larger the mean average precision, more correct will be the recommendations

 

    8.6 NDCG (Normalized Discounted Cumulative Gain):

 

  • The main difference between MAP and NDCG is that MAP assumes that an item is either of interest (or not), while NDCG gives the relevance score
  • Let us understand it with an example: suppose out of 10 movies – A to J, we can recommend the first five movies, i.e. A, B, C, D and E while we must not recommend the other 5 movies, i.e., F, G, H, I and J. The recommendation was [A,B,C,D]. So the NDCG in this case will be 1 as the recommended products are relevant for the user
  • Higher the NDCG value, better the recommendations

 

 

 

9. What else can be tried?

 

Up to this point we have learnt what is a recommendation engine, its different types and their workings. Both content-based filtering and collaborative filtering algorithms have their strengths and weaknesses.

 

In some domains, generating a useful description of the content can be very difficult. A content-based filtering model will not select items if the user’s previous behavior does not provide evidence for this. Additional techniques have to be used so that the system can make suggestions outside the scope of what the user has already shown an interest in.

 

A collaborative filtering model doesn’t have these shortcomings. Because there is no need for a description of the items being recommended, the system can deal with any kind of information. Furthermore, it can recommend products which the user has not shown an interest in previously. But, collaborative filtering cannot provide recommendations for new items if there are no user ratings upon which to base a prediction. Even if users start rating the item, it will take some time before the item has received enough ratings in order to make accurate recommendations.

 

A system that combines content-based filtering and collaborative filtering could potentially take advantage from both the representation of the content as well as the similarities among users. One approach to combine collaborative and content-based filtering is to make predictions based on a weighted average of the content-based recommendations and the collaborative recommendations. Various means of doing so are:

 

  • Combining item scores
    • In this approach, we combine the ratings obtained from both the filtering methods. The simplest way is to take the average of the ratings
    • Suppose one method suggested a rating of 4 for a movie while the other suggested a rating of 5 for the same movie. So the final recommendation will be the average of both ratings, i.e. 4.5
    • We can assign different weights to different methods as well

 

  • Combining item ranks:
    • Suppose collaborative filtering recommended 5 movies A, B, C, D and E in the following order: A, B, C, D, E while content based filtering recommended them in the following order: B, D, A, C, E
    • The rank for the movies will be:

 

Collaborative filtering

 

Movie Rank
A 1
B 0.8
C 0.6
D 0.4
E 0.2

 

 

 

Content Based Filtering:

 

Movie Rank
B 1
D 0.8
A 0.6
C 0.4
E 0.2

 

 

 

So, a hybrid recommender engine will combine these ranks and make final recommendations based on the combined rankings. The combined rank will be:

 

Movie New Rank
A 1+0.6 = 1.6
B 0.8+1 = 1.8
C 0.6+0.4 = 1
D 0.4+0.8 = 1.2
E 0.2+0.2 = 0.4

 

 

 

The recommendations will be made based on these rankings. So, the final recommendations will look like this: B, A, D, C, E.
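As a quick illustration, a minimal sketch of this rank-combination scheme (a hypothetical helper that reproduces the tables above) could look like this:

def combine_ranks(cf_order, cb_order):
    # Turn each ordering into normalized rank scores (first item 1.0, last 1/n),
    # then sum the two scores per movie and sort by the combined score.
    n = len(cf_order)
    def scores(order):
        return {movie: (n - idx) / n for idx, movie in enumerate(order)}
    cf, cb = scores(cf_order), scores(cb_order)
    combined = {movie: cf[movie] + cb[movie] for movie in cf}
    return sorted(combined, key=combined.get, reverse=True)

print(combine_ranks(['A', 'B', 'C', 'D', 'E'], ['B', 'D', 'A', 'C', 'E']))
# -> ['B', 'A', 'D', 'C', 'E']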

 

In this way, two or more techniques can be combined to build a hybrid recommendation engine and to improve their overall recommendation accuracy and power.

 

 

 

End Notes

 

This was a very comprehensive article on recommendation engines. This tutorial should be good enough to get you started with this topic. We not only covered basic recommendation techniques but also saw how to implement some of the more advanced techniques available in the industry today.

 

We also covered some key facts associated with each technique. As somebody who wants to learn how to make a recommendation engine, I’d advise you to learn the techniques discussed in this tutorial and later implement them in your models.

 

Did you find this article useful? Share your opinions / views in the comments section below!

 

 

 

Ref:

  1. Another good related article: Implementing your own recommender systems in Python
  2.   http://cis.csuohio.edu/~sschung/CIS660/CollaborativeFilteringSuhua.pdf
  3.   https://www.math.uci.edu/icamp/courses/math77b/lecture_12w/pdfs/Chapter%2002%20-%20Collaborative%20recommendation.pdf