(Original article; please credit the source when reposting!)
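A recommender system deals with people and items, and aims to predict how much a person will like an item. Different people can have similar tastes (for example, they all like martial-arts novels), and different items can share similar features (for example, they are all martial-arts novels). To predict the rating that user A would give to an item T that A has not yet rated, there are two angles of attack. One is to find people whose tastes are close to user A's and use their ratings of item T to estimate A's rating. The other is to look at the items user A has already rated, find those most similar to item T, and use A's ratings of them to estimate the likely rating of T. These two ideas give two ways of computing ratings in a recommender system: user-based collaborative filtering and item-based collaborative filtering.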
I. User-Based Collaborative Filtering (UBCF)
1. Finding similar users
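Approach 1: compute the similarity between user A and every other user who has rated item T, then use all of these users' similarities when computing the predicted rating;
Approach 2: compute the similarity between user A and every other user who has rated item T, keep the K users with the highest similarity, and use these K users when computing the predicted rating;
Approach 3: compute the similarity between user A and every other user who has rated item T, set a threshold, keep the users whose similarity exceeds it, and use these users when computing the predicted rating.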
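The three selection strategies above can be sketched in a few lines of R. The similarity values, user names, `K`, and `threshold` below are made up purely for illustration; they are assumptions, not data from this article.

```r
# Hypothetical similarities between user A and five users who rated item T
sims <- c(u1 = 0.91, u2 = 0.35, u3 = 0.78, u4 = 0.10, u5 = 0.66)

# Approach 1: keep every user who rated item T
allNeighbours <- names(sims)

# Approach 2: keep the K most similar users
K <- 3
topK <- head(names(sort(sims, decreasing = TRUE)), K)

# Approach 3: keep the users whose similarity exceeds a threshold
threshold <- 0.7
aboveThreshold <- names(sims)[sims > threshold]

print(allNeighbours)
print(topK)
print(aboveThreshold)
```

Approach 2 always yields the same number of neighbours, while approach 3 may yield very few (or none) when the threshold is strict, so the choice is a trade-off between a fixed neighbourhood size and a guaranteed minimum similarity.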
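Several similarity measures are available: the Pearson correlation coefficient, the cosine of the angle between the rating vectors, Euclidean distance, and so on. (In R, the cor function computes the Pearson correlation coefficient and the dist function computes Euclidean distance; the daisy function can also be used, but it belongs to the cluster package, which has to be installed and loaded first.)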
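As a minimal sketch of these measures, the snippet below compares two made-up rating vectors without missing values. The vectors `a` and `b` are hypothetical, and turning a distance into a similarity with `1 / (1 + d)` is only one common convention, assumed here for illustration.

```r
# Two hypothetical users' ratings of the same five items (no missing values)
a <- c(5, 3, 4, 4, 2)
b <- c(4, 2, 5, 3, 1)

# Pearson correlation coefficient (cor() defaults to method = "pearson")
pearson <- cor(a, b)

# Cosine of the angle between the two rating vectors
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Euclidean distance: dist() measures distances between the rows of a matrix
# (method = "euclidean" is the default)
euclidean <- as.numeric(dist(rbind(a, b)))

# A distance is not a similarity; 1 / (1 + d) is one possible conversion
euclideanSim <- 1 / (1 + euclidean)

print(c(pearson = pearson, cosine = cosine, euclidean = euclidean))
```

Pearson correlation subtracts each user's mean before comparing, so it is less sensitive than the raw cosine to users who rate systematically high or low.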
2. Computing the predicted rating of user A for item T
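Once the similar users are found, the mean of their ratings of item T can serve as the prediction of user A's rating of T. Since each of these users is similar to A to a different degree, a similarity-weighted average of their ratings can be used instead.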
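The difference between the two predictions is easy to see on a small example. The neighbour ratings and similarities below are hypothetical values chosen for illustration.

```r
# Hypothetical neighbours of user A: their ratings of item T
# and their similarity to user A
ratings <- c(4, 5, 2)
sims    <- c(0.9, 0.6, 0.3)

# Plain average: every neighbour counts equally
predMean <- mean(ratings)

# Similarity-weighted average: closer neighbours count more
predWeighted <- sum(sims * ratings) / sum(sims)

# The built-in weighted.mean() computes the same quantity
predBuiltin <- weighted.mean(ratings, sims)

print(c(mean = predMean, weighted = predWeighted))
```

Here the least similar neighbour gave the lowest rating, so the weighted prediction ends up higher than the plain mean.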
3. Implementation
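The code below measures similarity with the cosine of the angle between rating vectors, picks the K most similar users, and uses their ratings to compute the predicted rating. The training data is a matrix in which each row holds all the ratings received by one item and each column holds one user's ratings of all items. Ratings range from 1 to 5, and an entry that has not been rated is NA. The code is as follows: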
```r
## normalize each column of a matrix with the z-score method ( (x-u)/sigma )
## Zeros are treated as "not rated" and are left untouched.
## Args :
##   x - a matrix (un-rated entries must already be 0)
## Returns :
##   a list containing the mean of each column,
##   the standard deviation of each column,
##   and the normalized x
zScoreNormalization <- function(x)
{
    meanOfcol <- numeric(dim(x)[2])
    sdOfcol <- numeric(dim(x)[2])
    for (i in seq_len(dim(x)[2])) {
        t <- x[, i]
        idx <- which(t != 0)
        if (length(idx) <= 1) {
            # fewer than two ratings: sd is undefined, leave the column as-is
            meanOfcol[i] <- NA
            sdOfcol[i] <- NA
            next
        }
        meanOfcol[i] <- mean(t[idx])
        sdOfcol[i] <- sd(t[idx])
        x[idx, i] <- (t[idx] - meanOfcol[i]) / sdOfcol[i]  # z-score
    }

    return ( list(meanOfcol = meanOfcol, sdOfcol = sdOfcol, xNormalized = x) )
}

## invert the z-score normalization
## Args :
##   x - a vector of normalized values to map back
##   u - mean of the original x
##   sigma - standard deviation of the original x
## Returns :
##   x on the original scale
zScoreNormalizationInverse <- function(x, u, sigma)
{
    return (x * sigma + u)
}

## calculate the cosine of the angle between two vectors
## Args :
##   x - a vector
##   y - a vector
## Returns :
##   cosine of the angle between x and y, rescaled to [0,1]
cosineSimilarity <- function(x, y) {
    if (length(x) != length(y)) {
        stop("Function cosineSimilarity : length of two parameter vectors is different!")
    }
    xx <- x
    yy <- y
    xx[which(is.na(xx))] <- 0
    yy[which(is.na(yy))] <- 0
    ## if x and y share no non-zero position, return 0 without calculating
    if ( sum(abs(xx * yy)) == 0 ) {
        return (0)
    }

    sim <- sum(xx * yy) / ( sqrt(sum(xx^2)) * sqrt(sum(yy^2)) )  # cosine of vector angle
    return ( 0.5 + 0.5 * sim )  # ensure the similarity is in range [0,1]
}

## find the top n items as the item recommendation list with the User-Based Collaborative Filtering algorithm
## Args :
##   x - a matrix containing all rating results.
##       Each column holds the ratings by one user, each row the ratings of one movie.
##       If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##   userI - index of the specified user
##   k - number of nearest neighbours of user I to use
##   n - top n items that will be recommended to user I
## Returns :
##   a list containing the recommendation result
recommendationUBCF <- function(x, userI, k, n)
{
    ## record user I's un-rated items before NA is replaced by 0
    ## (after normalization a rated item can legitimately have value 0)
    unRatedItems <- which( is.na(x[, userI]) )
    x[which(is.na(x))] <- 0
    ## normalize the data
    normlizedResult <- zScoreNormalization(x)
    x <- normlizedResult$xNormalized

    ## find the k most similar users
    userSimilarity <- numeric(dim(x)[2])
    for (i in seq_len(dim(x)[2])) {
        if (i == userI) {
            userSimilarity[i] <- -1  # never pick user I as its own neighbour
            next
        }
        userSimilarity[i] <- cosineSimilarity(x[, i], x[, userI])
    }
    KSimilarUserIdx <- head( order(userSimilarity, decreasing = TRUE, na.last = TRUE), k )

    ## predict the rating of un-rated items (already rated items stay NA)
    ratingOfUnRatedItems <- rep(NA_real_, dim(x)[1])
    for (i in unRatedItems) {
        # the operator must end the line, otherwise R stops parsing at the line break
        ratingOfUnRatedItems[i] <- sum( x[i, KSimilarUserIdx] * userSimilarity[KSimilarUserIdx] ) /
                                   sum( userSimilarity[KSimilarUserIdx] )
    }
    ratingOfUnRatedItems <- zScoreNormalizationInverse( ratingOfUnRatedItems,
                                                        normlizedResult$meanOfcol[userI],
                                                        normlizedResult$sdOfcol[userI] )

    ## find the Top-N items (na.last = NA drops the already rated items)
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}
```
II. Item-Based Collaborative Filtering (IBCF)
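1. Algorithm flow
1) Find all items that the specified user has not rated yet
2) For each unrated item, find the k items most similar to it among those the user has already rated, and use the user's ratings of these k items together with the similarity values to predict the rating of the unrated item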
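The two steps above can be sketched on a toy matrix. This is a simplified illustration, not the implementation given in this article: the rating matrix is made up, no normalization is applied, similarities are Pearson correlations over the users who rated both items, and dividing by the sum of absolute similarities is an assumed safeguard against negative weights.

```r
# Hypothetical ratings: rows are items, columns are users, NA = not rated
ratings <- matrix(c( 5, 4, 1, 5, 2,
                     4, 5, 2, 4, 1,
                     1, 2, 5, 1, 4,
                    NA, 4, 1, 5, 2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("item", 1:4), paste0("user", 1:5)))
user <- "user1"
k <- 2

# Step 1: items the user has not rated yet (and those already rated)
unrated <- rownames(ratings)[is.na(ratings[, user])]
rated   <- setdiff(rownames(ratings), unrated)

# Step 2: for each unrated item, find the k most similar rated items and
# take the similarity-weighted average of the user's ratings of them
predictions <- sapply(unrated, function(item) {
    sims <- sapply(rated, function(other)
        cor(ratings[item, ], ratings[other, ], use = "pairwise.complete.obs"))
    nearest <- head(names(sort(sims, decreasing = TRUE)), k)
    sum(sims[nearest] * ratings[nearest, user]) / sum(abs(sims[nearest]))
})

print(predictions)
```

In this toy data item4 is rated almost exactly like item1 and item2 by the other users, so the prediction for user1 is pulled toward user1's high ratings of those two items.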
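2. Implementation
Item similarity is computed with the Pearson correlation coefficient, and the training data is the same as for UBCF. The implementation is as follows: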
```r
## find the top n items as the item recommendation list with the Item-Based Collaborative Filtering algorithm
## Relies on zScoreNormalization() and zScoreNormalizationInverse() from the UBCF code.
## Args :
##   x - a matrix containing all rating results.
##       Each column holds the ratings by one user, each row the ratings of one movie.
##       If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##   userI - index of the specified user
##   k - number of nearest neighbour items used for each prediction
##   n - top n items that will be recommended to user I
## Returns :
##   a list containing the recommendation result
recommendationIBCF <- function(x, userI, k, n)
{
    # Pearson correlation coefficient between two vectors of length m :
    #   sum((x - u_x)*(y - u_y)) / ((m - 1) * sd_x * sd_y)

    ## record user I's rated / un-rated items before NA is replaced by 0
    ## (after normalization a rated item can legitimately have value 0)
    unRatedIdx <- which( is.na(x[, userI]) )
    ratedIdx <- which( !is.na(x[, userI]) )
    x[which(is.na(x))] <- 0
    ## normalize the data per item (hence the transposes)
    normlizedResult <- zScoreNormalization( t(x) )
    x <- t( normlizedResult$xNormalized )

    ## predict the rating of user I's un-rated items (already rated items stay NA)
    ratingOfUnRatedItems <- rep(NA_real_, dim(x)[1])
    r <- x[ratedIdx, , drop = FALSE]  # drop = FALSE keeps a matrix even for a single rated item
    for (i in unRatedIdx) {
        # Pearson correlation coefficient between item i and each rated item
        itemSim <- as.vector( cor( x = x[i, ], y = t(r), use = "everything", method = "pearson" ) )

        # find the k nearest items to item i
        KSimilarItemIdx <- head( order(itemSim, decreasing = TRUE, na.last = TRUE), k )

        # predict the rating of un-rated item i
        # (the operator must end the line, otherwise R stops parsing at the line break)
        ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx, userI] * itemSim[KSimilarItemIdx] ) /
                                   sum( itemSim[KSimilarItemIdx] )
        if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
            next
        }
        ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                               normlizedResult$meanOfcol[i],
                                                               normlizedResult$sdOfcol[i] )
    }

    ## find the Top-N items (na.last = NA drops the already rated items)
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}
```
III. Rating Normalization
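Different users have different rating habits: some tend to give low scores across the board, while others tend to give high ones. The data therefore needs to be normalized in a preprocessing step to remove the effect of these habits. The guiding principle in choosing a normalization method is that normalized values can be converted back to the original scale. Common choices are mean normalization and z-score normalization.
The code for mean normalization was given in the article "Recommender Systems (2)"; the z-score implementation can be found in the code earlier in this article.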
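To make the "can be converted back" requirement concrete, here is a minimal sketch of both methods applied to one user's ratings, followed by the inverse step. The rating vector is hypothetical, and the mean normalization shown is the usual subtract-the-mean form, which may differ in detail from the code in the earlier article.

```r
# Hypothetical ratings given by one user
r <- c(4, 5, 3, 5, 3)

# Mean normalization: subtract the user's mean rating
u <- mean(r)
rCentered <- r - u
rFromCentered <- rCentered + u       # inverse: add the mean back

# Z-score normalization: subtract the mean, then divide by the standard deviation
sigma <- sd(r)
rZ <- (r - u) / sigma
rFromZ <- rZ * sigma + u             # inverse: rescale, then add the mean back

# Both inverses recover the original ratings
stopifnot(isTRUE(all.equal(rFromCentered, r)),
          isTRUE(all.equal(rFromZ, r)))
```

Both methods need the user's mean (and, for z-score, the standard deviation) to be stored alongside the normalized data, since the inverse cannot be computed without them.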