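Recommender Systems (Part 3)

(Original article; please credit the source when reposting!)

Recommender systems deal with people and items, and aim to predict how much a person will like an item. Different people can share similar tastes (for example, they all like wuxia novels), and different items can share similar features (for example, they are all wuxia novels). To predict the rating that user A would give to an item T that A has not yet rated, there are two angles of attack. The first is to find people whose tastes are similar to A's and estimate A's rating of T from their ratings of T. The second is to look at the items A has already rated, find the ones most similar to T, and estimate A's likely rating of T from A's ratings of those similar items. These two ideas lead to two ways of computing ratings in a recommender system: user-based collaborative filtering and item-based collaborative filtering.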

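1、User-Based Collaborative Filtering (UBCF)

1. Finding similar users

Approach 1: compute the similarity between user A and every other user who has rated item T, then use all of these similarities when computing the predicted rating;

Approach 2: compute the similarity between user A and every other user who has rated item T, keep the K users with the highest similarity, and use only those K when computing the predicted rating;

Approach 3: compute the similarity between user A and every other user who has rated item T, set a threshold, keep the users whose similarity exceeds it, and use only those users when computing the predicted rating.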
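The three selection strategies can be sketched in a few lines of R. This is only an illustration: the similarity values, the user names u1 to u5, and the choices k = 3 and threshold = 0.3 are made up and do not come from the article's data.

```r
# Hypothetical similarities between user A and five users who rated item T.
sims <- c(u1 = 0.91, u2 = 0.35, u3 = 0.78, u4 = 0.12, u5 = 0.66)

# Approach 1: keep every user who rated item T.
allNeighbours <- names(sims)

# Approach 2: keep the K most similar users.
k <- 3
topK <- names(sims)[head(order(sims, decreasing = TRUE), k)]

# Approach 3: keep the users whose similarity exceeds a threshold.
threshold <- 0.3
aboveThreshold <- names(sims)[sims > threshold]
```

Approach 2 fixes the number of neighbours, while approach 3 lets that number vary with how many users pass the threshold.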

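Several measures can be used for similarity: the Pearson correlation coefficient, the cosine of the angle between the two rating vectors, Euclidean distance (a distance, so a smaller value means more similar), and so on. (In R, the cor function computes the Pearson correlation coefficient and the dist function computes Euclidean distance; the daisy function can also be used, but the cluster package has to be installed first.)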
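As a rough illustration of these measures, the sketch below compares two made-up rating vectors with base R only. The vectors a and b are hypothetical, and the cosine is written out by hand rather than taken from a package.

```r
# Two hypothetical users' ratings of the same five items.
a <- c(5, 3, 4, 4, 2)
b <- c(4, 2, 5, 3, 1)

# Pearson correlation coefficient.
pearson <- cor(a, b, method = "pearson")

# Cosine of the angle between the two rating vectors.
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Euclidean distance; dist() works on the rows of a matrix.
euclid <- as.numeric(dist(rbind(a, b), method = "euclidean"))
```

Pearson and cosine grow with similarity, whereas the Euclidean distance shrinks, so a distance has to be converted (for instance with 1 / (1 + d), one possible choice) before it can be used as a weight.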

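2. Computing the predicted rating of user A for item T

Once the similar users have been found, the mean of their ratings of item T can serve as the prediction of user A's rating of T. Because each of these users is similar to A to a different degree, the prediction can instead be the average of their ratings weighted by similarity.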
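The difference between the two predictions is easiest to see on numbers. The ratings and similarities below are invented for illustration only.

```r
# Hypothetical neighbours of user A: their ratings of item T and
# their similarities to A.
neighbourRatings <- c(4, 5, 3)
neighbourSims    <- c(0.9, 0.6, 0.3)

# Plain average: every neighbour counts equally.
simpleMean <- mean(neighbourRatings)

# Similarity-weighted average: closer neighbours count more.
weightedMean <- sum(neighbourSims * neighbourRatings) / sum(neighbourSims)
```

Here the weighted prediction ends up slightly above the plain mean because the least similar neighbour gave the lowest rating.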

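3. Implementation

The code below measures similarity with the cosine of the angle between rating vectors, picks the K most similar users, and uses their ratings to compute the predicted ratings. The training data is a matrix in which each row holds all the ratings one item has received and each column holds one user's ratings of all items. Ratings range from 1 to 5, and an item a user has not rated is NA. The code is as follows: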

## normalize each column of a matrix with the z-score method ( (x-u)/sigma )
## a value of 0 marks a missing rating and is left untouched
## Args :
##     x - a matrix
## Returns :
##     a list containing the mean of each column,
##                       the standard deviation of each column,
##                       the normalized x
zScoreNormalization <- function(x)
{
    meanOfcol <- numeric(ncol(x))
    sdOfcol <- numeric(ncol(x))
    for (i in seq_len(ncol(x))) {
        t <- x[,i]
        idx <- which(t != 0)
        ## the z-score is undefined with fewer than two ratings or with zero spread
        if (length(idx) <= 1 || sd(t[idx]) == 0) {
            meanOfcol[i] <- NA
            sdOfcol[i] <- NA
            next
        }
        meanOfcol[i] <- mean(t[idx])
        sdOfcol[i] <- sd(t[idx])
        x[idx,i] <- (t[idx] - meanOfcol[i]) / sdOfcol[i] # z-score
    }

    return ( list(meanOfcol = meanOfcol, sdOfcol = sdOfcol, xNormalized=x) )
}

## invert the z-score normalization
## Args :
##     x  -  a vector of z-scores that should be mapped back
##     u  -  mean of the original x
##     sigma  -  standard deviation of the original x
## Returns :
##     the vector x on the original scale
zScoreNormalizationInverse <- function(x, u, sigma)
{
    return (x*sigma + u)
}

## calculate the cosine of the angle between two vectors
## Args :
##      x  -  a vector
##      y  -  a vector
## Returns :
##      cosine of the angle between the two vectors, rescaled to [0,1]
cosineSimilarity <- function(x, y) {
    if (length(x) != length(y)) {
        stop("Function cosineSimilarity : the two vectors have different lengths!")
    }
    xx <- x
    yy <- y
    xx[is.na(xx)] <- 0
    yy[is.na(yy)] <- 0
    ## if x and y share no non-zero position, return 0 without calculating
    if ( sum(abs(xx*yy)) == 0 ) {
        return (0)
    }

    sim <- sum(xx*yy) / ( sqrt(sum(xx^2)) * sqrt(sum(yy^2)) ) # cosine of vector angle
    return ( 0.5 + 0.5*sim )  # ensure the similarity is in range [0,1]
}

## find the top n items as the item recommendation list with the User-Based Collaborative Filtering algorithm
## Args :
##      x  -  a matrix containing all rating results.
##            Each column holds the ratings given by one user, each row the ratings of one movie.
##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##      userI - index of the specified user
##      k  -  number of nearest neighbours of user I
##      n  -  number of top items that will be recommended to user I
## Returns :
##      a list containing the recommendation result
recommendationUBCF <- function(x, userI, k, n)
{
    ## remember which items user I has not rated before NA is replaced by 0;
    ## after normalization a rated item can have a z-score of exactly 0
    unRatedItems <- which( is.na(x[,userI]) )
    x[is.na(x)] <- 0
    ## normalize the data
    normlizedResult <- zScoreNormalization(x)
    x <- normlizedResult$xNormalized

    ## find the k most similar users
    userSimilarity <- numeric(ncol(x))
    for (i in seq_len(ncol(x))) {
        if (i == userI) {
            userSimilarity[i] <- -1   # keep user I out of their own neighbourhood
            next
        }
        userSimilarity[i] <- cosineSimilarity(x[,i], x[, userI])
    }
    KSimilarUserIdx <- head( order(userSimilarity, decreasing=TRUE, na.last=TRUE), k )

    ## predict the rating of un-rated items; rated items stay NA
    ratingOfUnRatedItems <- rep( NA_real_, nrow(x) )
    for (i in unRatedItems) {
        ## keep "/" at the end of the line so that R continues the expression
        ratingOfUnRatedItems[i] <- sum( x[i,KSimilarUserIdx] * userSimilarity[KSimilarUserIdx] ) /
                                   sum( userSimilarity[KSimilarUserIdx] )
    }
    ratingOfUnRatedItems <- zScoreNormalizationInverse( ratingOfUnRatedItems,
                                                        normlizedResult$meanOfcol[userI],
                                                        normlizedResult$sdOfcol[userI] )

    ## find the Top-N items among the un-rated ones (na.last=NA drops the NA entries)
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing=TRUE, na.last=NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}

 

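2、Item-Based Collaborative Filtering (IBCF)

1. Algorithm outline

1) Find all the items that the specified user has not yet rated.

2) For each unrated item, find the k items most similar to it among the items the specified user has already rated, and use the user's ratings of those k items together with their similarity values to predict the rating of the unrated item.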
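The two steps can be traced on a toy example. The sketch below is a simplification under stated assumptions: the rating matrix, the names item1 to item4 and user1 to user5, the helper predictItem, and k = 2 are all hypothetical; it works on raw ratings without normalization, and it computes each correlation only over users who rated both items (use = "pairwise.complete.obs").

```r
# Toy rating matrix: rows are items, columns are users, NA = not rated.
ratings <- matrix(c( 5, 4, 5, 1, 2,
                     4, 5, 4, 2, 1,
                     1, 2, 1, 5, 4,
                    NA, 4, 5, 1, 1),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("item", 1:4), paste0("user", 1:5)))
user <- "user1"
k <- 2

# Step 1: items the user has not rated yet, and the items already rated.
unrated <- rownames(ratings)[is.na(ratings[, user])]
rated   <- setdiff(rownames(ratings), unrated)

# Step 2: predict one unrated item from its k most similar rated items.
predictItem <- function(item) {
    # Pearson correlation between this item and each rated item.
    sims <- sapply(rated, function(j)
        cor(ratings[item, ], ratings[j, ], use = "pairwise.complete.obs"))
    nearest <- head(order(sims, decreasing = TRUE), k)
    # Similarity-weighted average of the user's ratings of those items.
    sum(sims[nearest] * ratings[rated[nearest], user]) / sum(abs(sims[nearest]))
}
predicted <- sapply(unrated, predictItem)
```

In this toy data item4 is rated much like item1 and item2 by the other users, so the prediction for user1 lands between the ratings user1 gave to those two items.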

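2. Implementation

The similarity between items is computed with the Pearson correlation coefficient, and the training data is the same as for UBCF. The code is as follows: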

## find the top n items as the item recommendation list with the Item-Based Collaborative Filtering algorithm
## Args :
##      x  -  a matrix containing all rating results.
##            Each column holds the ratings given by one user, each row the ratings of one movie.
##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##      userI - index of the specified user
##      k  -  number of nearest items used for each un-rated item
##      n  -  number of top items that will be recommended to user-I
## Returns :
##      a list containing the recommendation result
recommendationIBCF <- function(x, userI, k, n)
{
    # Pearson correlation coefficient between two vectors of length m :
    # sum((x - u_x)*(y - u_y)) / ((m - 1) * sd_x * sd_y)

    ## remember user-I's rated and un-rated items before NA is replaced by 0;
    ## after normalization a rated item can have a z-score of exactly 0
    unRatedIdx <- which( is.na(x[,userI]) )
    ratedIdx <- which( !is.na(x[,userI]) )
    x[is.na(x)] <- 0
    ## normalize the data, item by item
    normlizedResult <- zScoreNormalization( t(x) )
    x <- t( normlizedResult$xNormalized )

    ## predicting the rating of user-I's un-rated items; rated items stay NA
    ratingOfUnRatedItems <- rep( NA_real_, nrow(x) )
    r <- x[ratedIdx, , drop=FALSE]   # drop=FALSE keeps a matrix even with one rated item
    for (i in unRatedIdx) {
        # item-i cannot be mapped back to a rating without its mean and standard deviation
        if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
            next
        }
        # calculate the Pearson correlation coefficient to each rated item
        itemSim <- as.vector( cor( x = x[i,], y = t(r), use = "everything", method = "pearson" ) )

        # find the k nearest items to item-i
        KSimilarItemIdx <- head( order(itemSim, decreasing=TRUE, na.last=TRUE), k )

        # predicting the rating of un-rated item-i
        # ("/" stays at the end of the line so that R continues the expression)
        ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx,userI] * itemSim[KSimilarItemIdx] ) /
                                   sum( itemSim[KSimilarItemIdx] )
        ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                               normlizedResult$meanOfcol[i],
                                                               normlizedResult$sdOfcol[i] )
    }

    ## find the Top-N items among the un-rated ones (na.last=NA drops the NA entries)
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing=TRUE, na.last=NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}

 

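3、Rating Normalization

Different users have different rating habits: some tend to give uniformly low ratings, while others tend to give uniformly high ones. The data therefore needs a normalization preprocessing step to remove the effect of these habits. The rule for choosing a normalization method is that the normalized values can still be mapped back to the original scale. Commonly used methods are mean normalization and z-score normalization.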
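The requirement that normalization be reversible can be checked directly. The sketch below uses base R's scale() as a stand-in for a hand-written z-score routine; the toy matrix and the names userMean, userSd, and restored are assumptions made for this illustration.

```r
# Toy rating matrix: rows are items, columns are users, NA = not rated.
# user1 is a low rater, user2 a high rater.
ratings <- matrix(c( 1, 5,  3,
                     2, 4, NA,
                     1, 5,  4,
                    NA, 3,  5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("item", 1:4), paste0("user", 1:3)))

# scale() centres and scales each column (user) and ignores NA entries.
z <- scale(ratings, center = TRUE, scale = TRUE)
userMean <- attr(z, "scaled:center")
userSd   <- attr(z, "scaled:scale")

# Undo the normalization column by column: x = z * sd + mean.
restored <- sweep(sweep(z, 2, userSd, "*"), 2, userMean, "+")
```

Keeping each user's mean and standard deviation next to the normalized matrix is what makes it possible to map a predicted z-score back to the 1-5 scale.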

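The code for mean normalization was given in the article Recommender Systems (Part 2); the implementation of z-score normalization is in the code listing earlier in this article.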
