推薦系統(二)

 (原創文章,轉載請註明出處!)html

1、問題算法

實現對電影的推薦,數據集中有約1600部電影,有約900個用戶對這些電影進行了評價。設每一個電影有10個特徵,根據推薦系統(一)描述的算法,每一個用戶也相應的有10的參數,那麼總的參數個數 ≈ 1600 * 10 + 900 * 10 ≈ 16000 + 9000 ≈25000 個。電影評價的數據集能夠訪問連接:http://grouplens.org/datasets/movielens/編程

 

2、代碼實現dom

用R編程實現推薦系統(一)描述的算法。在實現中使用R提供的一個求解最優化的函數,optim。推薦算法的優化目標是無約束最優化問題,使用optim時,選擇使用共軛梯度算法求解無約束最優化問題。optim的參數須要兩個函數:目標函數和目標函數的一階導函數。函數

1. 目標函數的代碼以下:優化

1 JThetaXLamda <- function(thetaX) {
2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  
3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)
4     
5     return (  (1/2) * ( sum( (tcrossprod(X, theta) * R - Y)^2 )  
6                         + lamda * ( sum(theta^2) + sum(X^2) )  ) )
7 }

 

2. 目標函數的一階導函數代碼以下:spa

 1 JThetaXGrLamda <- function(thetaX) {
 2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  # 943 is the number of user,  each row is one user's parameter
 3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)  # 1682 is the number of movie
 4      
 5     # gradient of theta
 6     grTheta <- matrix(numeric(),ncol=nMovieFeature)
 7     for (j in 1:nUser) { # jth user
 8         idx <- which(R[,j]==1)  # which movies are rated by jth user
 9         tmpX <- X[idx,] 
10         tmpY <- matrix(Y[idx,j], nrow=1)
11         grTheta <- rbind( grTheta, alpha * ( (tcrossprod(theta[j,], tmpX) - tmpY) %*% tmpX 
12                                                        + lamda * theta[j,] )  )
13     }
14 
15     # gradient of x
16     # calculating gradient of each movie only need the users who rate it.
17     grX <- matrix(numeric(),ncol=nMovieFeature)
18     for (i in 1:nMovie) { # ith movie
19         idx <- which(R[i,]==1)  # exclude the user who haven't rated movie i
20         tmpTheta <- theta[idx,] 
21         tmpY <- matrix(Y[i,idx], nrow=1) # tmpY is a vector including rating credit of movie i
22         grX <- rbind( grX, alpha * ( (tcrossprod(X[i,], tmpTheta) - tmpY) %*% tmpTheta 
23                                                + lamda * X[i,]) )
24     }
25     
26     return (   c( t(grTheta), t(grX) )   )
27 }

 

3. 訓練:code

上面兩個函數使用的變量:nUser、nMovie、nMovieFeature、Y、R是自由變量,在訓練代碼中定義和賦值。orm

推薦系統(一)中還提到對數據須要作進行均值正規化(Mean Normalization), 代碼以下:htm

 1 ## normalizae the training data
 2 ## Args :
 3 ##     Y  -  a matrix, rating of movies by users 
 4 ##     R  -  a matrix, flags, which movies are rated by user
 5 ## Returns :
 6 ##     a list contains two elements: Ynormalized - a matrix, normalized Y
 7 ##                                   Ymean - a vector contains mean of each movie
 8 craEx8RS_meanNormalization <- function(Y, R)
 9 {
10     numOfMovies <- dim(Y)[1]
11     l <- list( Ynormalized = Y, Ymean = numeric(length = numOfMovies) )
12     for (i in 1:numOfMovies) {
13         idx <- which(R[i,] == 1)
14         meanOfOneMovie <- mean(Y[i,idx])
15         l$Ynormalized[i,] <- Y[i,] - meanOfOneMovie
16         l$Ymean[i] <- meanOfOneMovie
17     }
18     return( l )
19 }

訓練代碼以下:

 1 ## do the training
 2 Y <- read.table(file="Y.txt")
 3 R <- read.table(file="R.txt")
 4 Y <- as.matrix(Y)
 5 R <- as.matrix(R)
 6 l <- meanNormalization(Y, R)  # normalize the data
 7 Y <- l$Ynormalized
 8 nMovie <- dim(Y)[1]
 9 nUser  <- dim(Y)[2]
10 nMovieFeature <- 10 
11 lamda <- 0.01
12 alpha <- 0.01
13 
14 
15 ## random initialize theta, X
16 theta <- matrix(runif(nUser*nMovieFeature,min=0,max=0.1), nrow=nUser, ncol=nMovieFeature)  # each row is parameter of one user
17 X     <- matrix(runif(nMovie*nMovieFeature,min=0,max=0.1), nrow=nMovie, ncol=nMovieFeature)  # each row is parameter of one movie
18 # convert the two initialize value matrix to vector
19 thetaX <- c( t(theta), t(X) ) 
20 # using conjugate gradient algorithm to train the parameters
21 optimCG <- optim( par=thetaX, fn=JThetaX, gr = JThetaXGr, method = "CG", control = list(maxit=500000, abstol=1e-6) )
22 optimCG$par # display the training reslut, i.e. best value has been found for parameters  

 

4. 預測:
經過訓練獲得參數值後,就能夠經過推薦系統(一)中提到的計算公式:  θT*x + 平均值, 來計算每一個用戶對尚未評價電影的評分。經過比較計算獲得評分,就能找到用戶偏好的電影推薦給他。代碼以下:

 1 ## Find the top-n un-rating movies of the jth user
 2 ## Args :
 3 ##     topn  -  a positive integer
 4 ##     theta_j  -  a vector, contains theta parameters of jth user
 5 ##     X  -  a matrix, contains X paramters
 6 ##     Y  -  a matrix, rating of movies by users 
 7 ##     R  -  a matrix, flags, which movies are rated by user
 8 ##     j  -  a interger, jth user
 9 ##     Ymean  -  a vector, contains mean of all movies
10 ##     isAddMean  -  flag, is the mean added to rating credit, or not?
11 ##                   TRUE  -  add the mean to rating credit
12 ##                   FALSE  -  the mean will not be added to the rating credit
13 ## Returns :
14 ##     a list, contains the index of the top-n un-rating movies of jth user and the rating result
15 findTopN <- function(topn, theta_j, X, R, j, Ymean=NULL, isAddMean=FALSE)
16 {
17     idx <- which(R[,j]==0)    
18     Yvec <- rep(-1,length(R[,j]))
19       
20     if (isAddMean == TRUE) {
21         if (is.null(Ymean)) {
22             cat("Error: Ymean is NULL!!!\n")
23         }
24         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   +   as.matrix( Ymean[idx], ncol=1)
25     } else {
26         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   
27     }
28     
29     sortTating <- sort(Yvec, method = "sh", decreasing = TRUE, index.return = TRUE)
30     objList <- list(ratingResult = Yvec, topnIndex = sortTating$ix[1:topn])
31     return( objList )
32 }

(尋找N個數中最大的K個數的方法,可參考連接:http://zhangzhibiao02005.blog.163.com/blog/static/3736782020112123353695/)

相關文章
相關標籤/搜索