(原創文章,轉載請註明出處!)html
1、問題算法
實現對電影的推薦,數據集中有約1600部電影,有約900個用戶對這些電影進行了評價。設每一個電影有10個特徵,根據推薦系統(一)描述的算法,每一個用戶也相應的有10的參數,那麼總的參數個數 ≈ 1600 * 10 + 900 * 10 ≈ 16000 + 9000 ≈25000 個。電影評價的數據集能夠訪問連接:http://grouplens.org/datasets/movielens/編程
2、代碼實現dom
用R編程實現推薦系統(一)描述的算法。在實現中使用R提供的一個求解最優化的函數,optim。推薦算法的優化目標是無約束最優化問題,使用optim時,選擇使用共軛梯度算法求解無約束最優化問題。optim的參數須要兩個函數:目標函數和目標函數的一階導函數。函數
1. 目標函數的代碼以下:優化
1 JThetaXLamda <- function(thetaX) { 2 theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE) 3 X <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE) 4 5 return ( (1/2) * ( sum( (tcrossprod(X, theta) * R - Y)^2 ) 6 + lamda * ( sum(theta^2) + sum(X^2) ) ) ) 7 }
2. 目標函數的一階導函數代碼以下:spa
1 JThetaXGrLamda <- function(thetaX) { 2 theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE) # 943 is the number of user, each row is one user's parameter 3 X <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE) # 1682 is the number of movie 4 5 # gradient of theta 6 grTheta <- matrix(numeric(),ncol=nMovieFeature) 7 for (j in 1:nUser) { # jth user 8 idx <- which(R[,j]==1) # which movies are rated by jth user 9 tmpX <- X[idx,] 10 tmpY <- matrix(Y[idx,j], nrow=1) 11 grTheta <- rbind( grTheta, alpha * ( (tcrossprod(theta[j,], tmpX) - tmpY) %*% tmpX 12 + lamda * theta[j,] ) ) 13 } 14 15 # gradient of x 16 # calculating gradient of each movie only need the users who rate it. 17 grX <- matrix(numeric(),ncol=nMovieFeature) 18 for (i in 1:nMovie) { # ith movie 19 idx <- which(R[i,]==1) # exclude the user who haven't rated movie i 20 tmpTheta <- theta[idx,] 21 tmpY <- matrix(Y[i,idx], nrow=1) # tmpY is a vector including rating credit of movie i 22 grX <- rbind( grX, alpha * ( (tcrossprod(X[i,], tmpTheta) - tmpY) %*% tmpTheta 23 + lamda * X[i,]) ) 24 } 25 26 return ( c( t(grTheta), t(grX) ) ) 27 }
3. 訓練:code
上面兩個函數使用的變量:nUser、nMovie、nMovieFeature、Y、R是自由變量,在訓練代碼中定義和賦值。orm
推薦系統(一)中還提到對數據須要作進行均值正規化(Mean Normalization), 代碼以下:htm
1 ## normalizae the training data 2 ## Args : 3 ## Y - a matrix, rating of movies by users 4 ## R - a matrix, flags, which movies are rated by user 5 ## Returns : 6 ## a list contains two elements: Ynormalized - a matrix, normalized Y 7 ## Ymean - a vector contains mean of each movie 8 craEx8RS_meanNormalization <- function(Y, R) 9 { 10 numOfMovies <- dim(Y)[1] 11 l <- list( Ynormalized = Y, Ymean = numeric(length = numOfMovies) ) 12 for (i in 1:numOfMovies) { 13 idx <- which(R[i,] == 1) 14 meanOfOneMovie <- mean(Y[i,idx]) 15 l$Ynormalized[i,] <- Y[i,] - meanOfOneMovie 16 l$Ymean[i] <- meanOfOneMovie 17 } 18 return( l ) 19 }
訓練代碼以下:
1 ## do the training 2 Y <- read.table(file="Y.txt") 3 R <- read.table(file="R.txt") 4 Y <- as.matrix(Y) 5 R <- as.matrix(R) 6 l <- meanNormalization(Y, R) # normalize the data 7 Y <- l$Ynormalized 8 nMovie <- dim(Y)[1] 9 nUser <- dim(Y)[2] 10 nMovieFeature <- 10 11 lamda <- 0.01 12 alpha <- 0.01 13 14 15 ## random initialize theta, X 16 theta <- matrix(runif(nUser*nMovieFeature,min=0,max=0.1), nrow=nUser, ncol=nMovieFeature) # each row is parameter of one user 17 X <- matrix(runif(nMovie*nMovieFeature,min=0,max=0.1), nrow=nMovie, ncol=nMovieFeature) # each row is parameter of one movie 18 # convert the two initialize value matrix to vector 19 thetaX <- c( t(theta), t(X) ) 20 # using conjugate gradient algorithm to train the parameters 21 optimCG <- optim( par=thetaX, fn=JThetaX, gr = JThetaXGr, method = "CG", control = list(maxit=500000, abstol=1e-6) ) 22 optimCG$par # display the training reslut, i.e. best value has been found for parameters
4. 預測:
經過訓練獲得參數值後,就能夠經過推薦系統(一)中提到的計算公式: θT*x + 平均值, 來計算每一個用戶對尚未評價電影的評分。經過比較計算獲得評分,就能找到用戶偏好的電影推薦給他。代碼以下:
1 ## Find the top-n un-rating movies of the jth user 2 ## Args : 3 ## topn - a positive integer 4 ## theta_j - a vector, contains theta parameters of jth user 5 ## X - a matrix, contains X paramters 6 ## Y - a matrix, rating of movies by users 7 ## R - a matrix, flags, which movies are rated by user 8 ## j - a interger, jth user 9 ## Ymean - a vector, contains mean of all movies 10 ## isAddMean - flag, is the mean added to rating credit, or not? 11 ## TRUE - add the mean to rating credit 12 ## FALSE - the mean will not be added to the rating credit 13 ## Returns : 14 ## a list, contains the index of the top-n un-rating movies of jth user and the rating result 15 findTopN <- function(topn, theta_j, X, R, j, Ymean=NULL, isAddMean=FALSE) 16 { 17 idx <- which(R[,j]==0) 18 Yvec <- rep(-1,length(R[,j])) 19 20 if (isAddMean == TRUE) { 21 if (is.null(Ymean)) { 22 cat("Error: Ymean is NULL!!!\n") 23 } 24 Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) ) + as.matrix( Ymean[idx], ncol=1) 25 } else { 26 Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) ) 27 } 28 29 sortTating <- sort(Yvec, method = "sh", decreasing = TRUE, index.return = TRUE) 30 objList <- list(ratingResult = Yvec, topnIndex = sortTating$ix[1:topn]) 31 return( objList ) 32 }
(尋找N個數中最大的K個數的方法,可參考連接:http://zhangzhibiao02005.blog.163.com/blog/static/3736782020112123353695/)