Reposted from: http://www.cnblogs.com/tornadomeet/archive/2013/03/20/2970724.html
Preface:
Let's now work through a hands-on sparse autoencoder exercise, following Ng's web tutorial Exercise:Sparse Autoencoder. The task is roughly the following: from a set of natural images, extract 10000 small 8*8 patches, then use the sparse autoencoder method to train a network and look at the features learned by its hidden layer. The network has 3 layers: 64 input units, 25 hidden units, and of course 64 output units.
Experiment basics:
The core of the implementation is computing the network's cost function and its partial derivatives; the exact formulas can be found in the earlier post Deep learning: 8 (Sparse Autoencoder). Below is a brief plain-language outline of the steps, to make the flow of the algorithm clear.
1. Compute the input value of every node in the network (the z values in the code) and its output value (the a values in the code, where a is the sigmoid of z).
2. Use the z and a values to compute the error term of every node (the delta values in the code).
3. With the a, z, and delta values of every node, the cost function and its partial derivatives can be written down; this is just the mathematical derivation whose formulas appear in the earlier post Deep learning: 8 (Sparse Autoencoder).
Step 1 is a forward pass, i.e. it proceeds in the direction input layer -> hidden layer -> output layer, while step 2 runs in the reverse direction (which is where the name backpropagation, BP, comes from): the error term of each node is computed in the direction output layer -> hidden layer -> input layer. The relevant formulas are collected right after this summary.
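For reference, here is a compact restatement of those formulas (nothing new, just what sparseAutoencoderCost.m below implements), written with sigmoid activation f, m training patches x, target sparsity rho = sparsityParam, and hidden-layer average activation rho-hat:

$$
\begin{aligned}
z^{(l+1)} &= W^{(l)}a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr),\\
J(W,b) &= \frac{1}{2m}\sum_{i=1}^{m}\bigl\lVert a^{(3)}(x^{(i)}) - x^{(i)}\bigr\rVert^{2}
          + \frac{\lambda}{2}\Bigl(\lVert W^{(1)}\rVert_F^{2} + \lVert W^{(2)}\rVert_F^{2}\Bigr)
          + \beta\sum_{j}\mathrm{KL}\bigl(\rho\,\Vert\,\hat\rho_j\bigr),\\
\mathrm{KL}\bigl(\rho\,\Vert\,\hat\rho_j\bigr) &= \rho\log\frac{\rho}{\hat\rho_j} + (1-\rho)\log\frac{1-\rho}{1-\hat\rho_j},\\
\delta^{(3)} &= -\bigl(x - a^{(3)}\bigr)\odot f'\bigl(z^{(3)}\bigr),\qquad
\delta^{(2)} = \Bigl(W^{(2)\top}\delta^{(3)} + \beta\Bigl(-\frac{\rho}{\hat\rho} + \frac{1-\rho}{1-\hat\rho}\Bigr)\Bigr)\odot f'\bigl(z^{(2)}\bigr).
\end{aligned}
$$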
Some MATLAB functions used:
bsxfun:
C = bsxfun(fun, A, B) applies an element-wise binary operation to the two arrays A and B, where fun is a function handle, an m-file, or a built-in function; in practice fun is often something like subtraction or addition, written with a leading '@'. Normally A and B must be the same size; if their sizes differ, then in each dimension where they differ one of the two must have size 1, and that array is expanded along that dimension. For example, in bsxfun(@minus, A, mean(A)), A and mean(A) have different sizes: conceptually mean(A) is first expanded to the size of A, and then each element of A has the corresponding element of the expanded mean(A) subtracted from it.
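A tiny illustration (not part of the exercise code), showing the implicit expansion:

A = [1 2; 3 4; 5 6];
M = mean(A);               % 1x2 row vector [3 4]
C = bsxfun(@minus, A, M);  % M is expanded to 3x2, so C = [-2 -2; 0 0; 2 2]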
rand:
Generates uniformly distributed pseudorandom numbers in (0,1). Main syntax: rand(m,n) generates an m-by-n matrix of uniform pseudorandom numbers; rand(m,n,'double') generates them with the specified precision (the argument can also be 'single'); rand(RandStream,m,n) generates the pseudorandom numbers using the specified RandStream (which I understand as the random number stream/seed).
randn:
Generates pseudorandom numbers from the standard normal distribution (mean 0, variance 1). Main syntax: same as above.
randi:
Generates uniformly distributed pseudorandom integers. Main syntax: randi(iMax) generates a uniform pseudorandom integer in the closed interval [1, iMax]; randi(iMax,m,n) generates an m-by-n matrix of such integers; r = randi([iMin,iMax],m,n) generates an m-by-n matrix of integers in the closed interval [iMin, iMax].
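For example, sampleIMAGES.m below uses the [iMin,iMax] form to pick the top-left corner of each 8x8 patch; a standalone sketch (the exercise's images are 512x512, per the comments in that file):

patchsize = 8;
xPos = randi([1, 512 - patchsize + 1]);  % uniform integer in [1, 505]
yPos = randi([1, 512 - patchsize + 1]);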
exist:
Tests whether something exists. For example, exist('opt_normalize', 'var') checks whether the variable opt_normalize exists; the 'var' argument means it should be looked up as a variable.
colormap:
Sets the colormap (color lookup table) used by the current figure.
floor:
floor(A): the largest integer not greater than A (round toward minus infinity).
ceil:
ceil(A): the smallest integer not less than A (round toward plus infinity).
imagesc:
imagesc is similar to image and can be used to display an image. For example, in imagesc(array,'EraseMode','none',[-1 1]), the data in array are mapped linearly onto the current colormap with -1 and 1 at the two ends, so the interval [-1,1] spans the whole colormap; the erase mode is set to 'none', meaning the background is not erased.
repmat:
This function tiles a matrix, copying the original matrix's data into the result. For example, B = repmat(A,m,n) creates a matrix B made up of m*n copies of A arranged in an m-by-n grid, so B has size [size(A,1)*m size(A,2)*n].
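A small example; sparseAutoencoderCost.m below uses repmat(b1,1,m) in exactly this way to add the bias vector to every column of the data:

A = [1 2; 3 4];
B = repmat(A, 2, 3);  % B is 4x6: A tiled 2 times vertically and 3 times horizontally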
Why use function handles:
Without a function handle, every call to a function triggers a full path search for that function, which directly hurts speed; using a handle avoids this overhead entirely, since it amounts to storing a pointer to the function. A function handle acts like a name for the function, somewhat like a reference in a C++ program.
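This is the pattern train.m relies on when it passes the cost function to computeNumericalGradient and minFunc; a minimal sketch:

% anonymous function handle that fixes every argument except the parameter vector
costFun = @(p) sparseAutoencoderCost(p, visibleSize, hiddenSize, ...
                                     lambda, sparsityParam, beta, patches);
[cost, grad] = costFun(theta);  % call it like an ordinary function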
Experiment procedure:
First run step 1 of the main script train.m, which randomly samples 10000 small patches and displays 204 of them; the resulting image is shown below:
Then run steps 2 and 3 of train.m, which compute the cost function and its gradient and verify them. Gradient checking can take quite a long time; here it took me well over an hour and a half (it still had not finished after an hour or so, so I went to sleep). The gradient check reported a difference of only 6.5101e-11, far smaller than 1e-9, which shows that the cost and gradient code is correct. After that an optimization algorithm can be used to fit the parameters; this program uses L-BFGS. After a few minutes of optimization the result is ready.
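The "difference" reported by the check is the relative error computed in train.m, with numgrad coming from computeNumericalGradient and grad from sparseAutoencoderCost:

$$
\frac{\lVert \text{numgrad} - \text{grad} \rVert_2}{\lVert \text{numgrad} + \text{grad} \rVert_2} = 6.5101\times 10^{-11} < 10^{-9}.
$$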
The final W1 weights look as follows:
Experiment code:
train.m:
%% CS294A/CS294W Programming Assignment Starter Code
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  programming assignment. You will need to complete the code in sampleIMAGES.m,
%  sparseAutoencoderCost.m and computeNumericalGradient.m.
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.

visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units.
                       % (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
                       %  in the lecture notes).
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset

patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),204,1)),8); % randi(size(patches,2),204,1)
                                        % produces a 204-element column vector of random integers
                                        % between 1 and 10000, i.e. 204 patches are picked at
                                        % random for display

%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================
%% STEP 2: Implement sparseAutoencoderCost
%
%  You can implement all of the components (squared error cost, weight decay term,
%  sparsity penalty) in the cost function at once, but it may be easier to do
%  it step-by-step and run gradient checking (see STEP 3) after each step.  We
%  suggest implementing the sparseAutoencoderCost function using the following steps:
%
%  (a) Implement forward propagation in your neural network, and implement the
%      squared error term of the cost function.  Implement backpropagation to
%      compute the derivatives.  Then (using lambda=beta=0), run Gradient Checking
%      to verify that the calculations corresponding to the squared error cost
%      term are correct.
%
%  (b) Add in the weight decay term (in both the cost function and the derivative
%      calculations), then re-run Gradient Checking to verify correctness.
%
%  (c) Add in the sparsity penalty term, then re-run Gradient Checking to
%      verify correctness.
%
%  Feel free to change the training settings when debugging your
%  code.  (For example, reducing the training set size or
%  number of hidden units may make your code run faster; and setting beta
%  and/or lambda to zero may be helpful for debugging.)  However, in your
%  final submission of the visualized weights, please use parameters we
%  gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
                                     sparsityParam, beta, patches);

%%======================================================================
%% STEP 3: Gradient Checking
%
% Hint: If you are debugging your code, performing gradient checking on smaller models
% and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
% units) may speed things up.

% First, lets make sure your numerical gradient computation is correct for a
% simple function.  After you have implemented computeNumericalGradient.m,
% run the following:
checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations
% for the sparse autoencoder.
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                                                  hiddenSize, lambda, ...
                                                  sparsityParam, beta, ...
                                                  patches), theta);

% Use this to visually compare the gradients side by side
%disp([numgrad grad]);

% Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); % Should be small. In our implementation, these values are
            % usually less than 1e-9.

            % When you got this working, Congratulations!!!

%%======================================================================
%% STEP 4: After verifying that your implementation of
%  sparseAutoencoderCost is correct, You can start training your sparse
%  autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run
options.display = 'on';

[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                            theta, options);

%%======================================================================
%% STEP 5: Visualization

W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
figure;
display_network(W1', 12);

print -djpeg weights.jpg   % save the visualization to a file
sampleIMAGES.m:
function patches = sampleIMAGES()
% sampleIMAGES
% Returns 10000 patches for training

load IMAGES;    % load images from disk

patchsize = 8;  % we'll use 8x8 patches
numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data
%  from IMAGES.
%
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images look a bit off because they have
%  been preprocessed using using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1

for imageNum = 1:10  % sample 1000 patches at random from each image, 10000 patches in total
    [rowNum colNum] = size(IMAGES(:,:,imageNum));
    for patchNum = 1:1000  % take 1000 patches from this image
        xPos = randi([1, rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        patches(:,(imageNum-1)*1000+patchNum) = reshape(IMAGES(xPos:xPos+7, yPos:yPos+7, ...
                                                        imageNum), 64, 1);
    end
end

%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure
% the range of pixel values is also bounded between [0,1]
patches = normalizeData(patches);

end

%% ---------------------------------------------------------------
function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;  % by the 3-sigma rule, well over 95% of the data
                                                  % falls inside this range; after this step the
                                                  % data lies in [-1, 1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end
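As a quick sanity check on the last line of normalizeData: the affine map y = 0.4(x+1)+0.1 sends x = -1 to 0.1, x = 0 to 0.5, and x = 1 to 0.9, so the truncated, DC-removed data in [-1,1] ends up in [0.1,0.9], safely inside the (0,1) range of the sigmoid output.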
initializeParameters.m:
function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.
r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% your parameters into a vector, which can then be used with minFunc.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

end
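In formula form, the initialization above draws every weight uniformly from a symmetric interval whose width depends on the layer sizes (rand returns samples in (0,1), and *2*r - r maps them onto (-r, r)):

$$
r = \sqrt{\frac{6}{\text{hiddenSize} + \text{visibleSize} + 1}}, \qquad
W^{(1)}_{ij},\, W^{(2)}_{ij} \sim \mathrm{Uniform}[-r,\, r], \qquad b^{(1)} = b^{(2)} = 0.
$$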
sparseAutoencoderCost.m:
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.

% Unpack the long parameter vector into the weight matrices and bias vectors of each layer
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.

Jcost = 0;    % reconstruction (squared error) cost
Jweight = 0;  % weight decay penalty
Jsparse = 0;  % sparsity penalty
[n m] = size(data);  % m is the number of samples, n is the number of features per sample

% Forward pass: compute the linear combination (z) and activation (a) of every node
z2 = W1*data + repmat(b1,1,m);  % note that b1 must be replicated into an m-column matrix here
a2 = sigmoid(z2);
z3 = W2*a2 + repmat(b2,1,m);
a3 = sigmoid(z3);

% Reconstruction error
Jcost = (0.5/m)*sum(sum((a3-data).^2));

% Weight decay penalty
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));

% Sparsity penalty
rho = (1/m).*sum(a2,2);  % average activation vector of the hidden layer
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));

% Total cost
cost = Jcost + lambda*Jweight + beta*Jsparse;

% Backward pass: compute the error term of each node
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));  % extra term introduced by the
                                                                % sparsity penalty in the derivatives
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2);

% W1grad
W1grad = W1grad + d2*data';
W1grad = (1/m)*W1grad + lambda*W1;

% W2grad
W2grad = W2grad + d3*a2';
W2grad = (1/m).*W2grad + lambda*W2;

% b1grad
b1grad = b1grad + sum(d2,2);
b1grad = (1/m)*b1grad;  % the derivative with respect to b is a vector, so the columns are summed up

% b2grad
b2grad = b2grad + sum(d3,2);
b2grad = (1/m)*b2grad;

% %% Method 2: process one sample at a time (slow)
% m = size(data,2);
% rho = zeros(size(b1));
% for i = 1:m
%     % feedforward
%     a1 = data(:,i);
%     z2 = W1*a1+b1;
%     a2 = sigmoid(z2);
%     z3 = W2*a2+b2;
%     a3 = sigmoid(z3);
%     % cost = cost + (a1-a3)'*(a1-a3)*0.5;
%     rho = rho + a2;
% end
% rho = rho/m;
% sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));
% % sterm = beta*2*rho;
% for i = 1:m
%     % feedforward
%     a1 = data(:,i);
%     z2 = W1*a1+b1;
%     a2 = sigmoid(z2);
%     z3 = W2*a2+b2;
%     a3 = sigmoid(z3);
%     cost = cost + (a1-a3)'*(a1-a3)*0.5;
%     % backpropagation
%     delta3 = (a3-a1).*a3.*(1-a3);
%     delta2 = (W2'*delta3+sterm).*a2.*(1-a2);
%     W2grad = W2grad + delta3*a2';
%     b2grad = b2grad + delta3;
%     W1grad = W1grad + delta2*a1';
%     b1grad = b1grad + delta2;
% end
%
% kl = sparsityParam*log(sparsityParam./rho)+(1-sparsityParam)*log((1-sparsityParam)./(1-rho));
% % kl = rho.^2;
% cost = cost/m;
% cost = cost + sum(sum(W1.^2))*lambda/2.0 + sum(sum(W2.^2))*lambda/2.0 + beta*sum(kl);
% W2grad = W2grad./m + lambda*W2;
% b2grad = b2grad./m;
% W1grad = W1grad./m + lambda*W1;
% b1grad = b1grad./m;

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

% Derivative of the sigmoid function (despite the name sigmoidInv)
function sigmInv = sigmoidInv(x)
    sigmInv = sigmoid(x).*(1-sigmoid(x));
end