The first post in this series, UFLDL Deep Learning Notes (1): Basics and Sparse Autoencoders, discussed a sparse autoencoder whose activation function is the \(sigmoid\). This post covers the UFLDL "Linear Decoders" exercise. The difference is that the output layer drops the \(sigmoid\) and emits the pre-activation value \(z\) directly as the output. The reason for a linear output is to avoid having to rescale the input range:
The sigmoid activation outputs values in [0,1], so when \(f(z^{(3)})\) uses it, the inputs must be restricted or scaled to lie in [0,1]. Some datasets, such as MNIST, are easy to scale into [0,1], but in general it is hard to satisfy this requirement on the inputs. For example, PCA-whitened inputs do not lie in [0,1], and it is not clear that there is a good way to scale such data into a fixed range.
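The mismatch can be illustrated with a few lines of NumPy (this snippet is my own illustration, not part of the exercise code): zero-mean data, such as whitened inputs, necessarily takes values outside [0,1], while a sigmoid output can never leave (0,1), so a sigmoid output layer cannot reconstruct such inputs exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)   # zero-mean data, e.g. after PCA/ZCA whitening
y = sigmoid(x)

print(x.min() < 0 < x.max())            # whitened inputs fall outside [0, 1]
print(0.0 < y.min() and y.max() < 1.0)  # sigmoid outputs stay inside (0, 1)
```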
Since the output-layer activation has changed, the residual and partial-derivative formulas need to be re-derived.
The linear-output network still has three layers, \(n_l=3\). With a linear output \(a_i^{(n_l)} = z_i^{(n_l)}\), we have \(f'(z_i^{(n_l)})=1\), so the output-layer residual is:
\[\begin{align} \delta_i^{(3)} &= -(y_i-a_i^{(n_l)}) \cdot f'(z_i^{(n_l)}) \\ &= -(y_i-a_i^{(n_l)}) \end{align}\]
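As a quick sanity check (my own sketch, not part of the exercise code), the residual \(-(y-a^{(3)})\) should match the numerical gradient of the squared-error cost with respect to \(z^{(3)}\) when the output is linear:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=5)
z3 = rng.normal(size=5)
a3 = z3                      # linear output layer: f(z) = z, so f'(z) = 1

# analytic residual from the formula above
delta3 = -(y - a3)

# central-difference gradient of J(z) = 0.5 * ||y - z||^2
eps = 1e-6
num = np.zeros_like(z3)
for i in range(5):
    zp, zm = z3.copy(), z3.copy()
    zp[i] += eps
    zm[i] -= eps
    num[i] = (0.5*np.sum((y - zp)**2) - 0.5*np.sum((y - zm)**2)) / (2*eps)

print(np.max(np.abs(delta3 - num)) < 1e-8)
```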
Backpropagation then gives the hidden-layer residual:
\[ \begin{align} \delta^{(2)} &= {W^{(2)}}^T\delta^{(3)} .* f'(z^{(2)}) \\ &= {W^{(2)}}^T\delta^{(3)} .* (a^{(2)}.*(1-a^{(2)})) \end{align} \]
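The factor \(a^{(2)}.*(1-a^{(2)})\) uses the sigmoid identity \(f'(z)=f(z)(1-f(z))\), which can be verified numerically (my own sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
a = sigmoid(z)

# derivative via the identity used in the delta2 formula
analytic = a * (1.0 - a)

# central-difference derivative of the sigmoid
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-9))
```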
The relation between the gradient and the residual matrices then gives (the matrix products sum over the \(m\) training samples):
\[\begin{align} \frac {\partial J} {\partial W^{(2)}} &= \frac 1 m \, \delta^{(3)} {a^{(2)}}^T \\ \frac {\partial J} {\partial b^{(2)}} &= \frac 1 m \, \delta^{(3)} \end{align} \]
Similarly, for the first layer, whose activations \(a^{(1)}\) are simply the input data:

\[\begin{align} \frac {\partial J} {\partial W^{(1)}} &= \frac 1 m \, \delta^{(2)} {a^{(1)}}^T \\ \frac {\partial J} {\partial b^{(1)}} &= \frac 1 m \, \delta^{(2)} \end{align} \]

(No residual \(\delta^{(1)}\) is needed: there are no weights below the input layer.)
This gives the gradients of the linear-decoder autoencoder's cost function with respect to the network parameters \(W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\).
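The whole derivation can be verified with a small NumPy gradient check (my own sketch; weight decay and the sparsity penalty are omitted here to keep it short):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_grad(W1, b1, W2, b2, X):
    """Cost and weight gradients for a linear-decoder autoencoder
    (weight decay and the sparsity penalty omitted for brevity)."""
    m = X.shape[1]
    a2 = sigmoid(W1 @ X + b1)        # hidden activations
    a3 = W2 @ a2 + b2                # linear output: a3 = z3
    cost = np.sum((a3 - X) ** 2) / (2 * m)
    delta3 = -(X - a3)               # output residual, f'(z3) = 1
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)
    gW2 = delta3 @ a2.T / m
    gW1 = delta2 @ X.T / m
    return cost, gW1, gW2

rng = np.random.default_rng(2)
n, h, m = 4, 3, 6
W1 = rng.normal(scale=0.1, size=(h, n)); b1 = rng.normal(scale=0.1, size=(h, 1))
W2 = rng.normal(scale=0.1, size=(n, h)); b2 = rng.normal(scale=0.1, size=(n, 1))
X = rng.normal(size=(n, m))

_, gW1, gW2 = cost_grad(W1, b1, W2, b2, X)

# compare one entry of each analytic gradient against a central difference
eps, errs = 1e-5, []
for name, i, j in [("W1", 1, 2), ("W2", 0, 1)]:
    Wp = (W1.copy(), W2.copy())
    Wm = (W1.copy(), W2.copy())
    k = 0 if name == "W1" else 1
    Wp[k][i, j] += eps
    Wm[k][i, j] -= eps
    num = (cost_grad(Wp[0], b1, Wp[1], b2, X)[0]
           - cost_grad(Wm[0], b1, Wm[1], b2, X)[0]) / (2 * eps)
    errs.append(abs(num - (gW1 if name == "W1" else gW2)[i, j]))

print(max(errs) < 1e-6)
```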
As the derivation shows, the only difference from the sparse autoencoder is the form of the gradient formulas; the overall training flow, the weight-decay penalty, and the sparsity constraint are reused unchanged from the sparse autoencoder. The one module to add is the cost-and-gradient function sparseAutoencoderLinearCost.m; the full code is at https://github.com/codgeek/deeplearning
```matlab
function [cost,grad] = sparseAutoencoderLinearCost(theta, visibleSize, hiddenSize, ...
                                                   lambda, sparsityParam, beta, data)
% visibleSize:   the number of input units
% hiddenSize:    the number of hidden units
% lambda:        weight decay parameter
% sparsityParam: desired average activation rho for the hidden units
% beta:          weight of the sparsity penalty term
% data:          visibleSize x m matrix; data(:,i) is the i-th training example

% theta arrives as a vector (minFunc expects the parameters as a vector);
% convert it back to the (W1, W2, b1, b2) matrix/vector format.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% forward propagation
[~, m] = size(data);                   % data: visibleSize x m
a2 = sigmoid(W1*data + b1*ones(1,m));  % hidden activations: hiddenSize x m
a3 = W2*a2 + b2*ones(1,m);             % linear decoder outputs z directly: visibleSize x m
diff = a3 - data;
penalty = mean(a2, 2);                 % average hidden activation: hiddenSize x 1
residualPenalty = beta.*(-sparsityParam./penalty + (1-sparsityParam)./(1-penalty)); % sparsity term in delta2

cost = sum(sum(diff.*diff))./(2*m) + ...
       (sum(sum(W1.*W1)) + sum(sum(W2.*W2))).*lambda./2 + ...
       beta.*sum(KLdivergence(sparsityParam, penalty));

% back propagation
delta3 = -(data - a3);                                            % linear decoder: visibleSize x m
delta2 = (W2'*delta3 + residualPenalty*ones(1,m)).*(a2.*(1-a2));  % note: W2'*delta3, NOT W1'*delta3
W2grad = delta3*a2';    % dJ/dW = delta(l+1) * a(l)'; the matrix product sums over the m samples
W1grad = delta2*data';
b1grad = sum(delta2, 2);
b2grad = sum(delta3, 2);

% average over the m samples and add the weight-decay term
W1grad = W1grad./m + lambda.*W1;
W2grad = W2grad./m + lambda.*W2;
b1grad = b1grad./m;
b2grad = b2grad./m;

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

function value = KLdivergence(pmean, p)
    value = pmean.*log(pmean./p) + (1-pmean).*log((1-pmean)./(1-p));
end
```
The data come from the STL-10 dataset. Note that we use downsampled images: each is an 8×8 color patch. The raw data also need ZCA whitening; thanks to MATLAB's rich library functions, each step (SVD, whitening, and so on) takes only a single line of code.
```matlab
% Apply ZCA whitening
sigma = patches * patches' / numPatches;
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
patches = ZCAWhite * patches;
```
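For readers without MATLAB, the same ZCA pipeline can be sketched in NumPy (the toy data and the 1e-5 regularizer epsilon are my own choices). Whitening should make the empirical covariance of the patches close to the identity:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 2000
# toy data with unequal per-dimension variances, standing in for image patches
patches = rng.normal(size=(n, m)) * np.linspace(0.5, 3.0, n)[:, None]
patches -= patches.mean(axis=1, keepdims=True)

epsilon = 1e-5
sigma = patches @ patches.T / m                       # empirical covariance
u, s, _ = np.linalg.svd(sigma)
ZCAWhite = u @ np.diag(1.0 / np.sqrt(s + epsilon)) @ u.T
white = ZCAWhite @ patches

# covariance of the whitened data is approximately the identity
cov = white @ white.T / m
print(np.allclose(cov, np.eye(n), atol=1e-2))
```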
(Figure: STL-10 images downsampled to 8×8-pixel patches, and the color features learned from them)
Using the same parameters as the exercise instructions: since the STL-10 data are 8×8-pixel color patches, the input layer has 192 units, the hidden layer is set to 400 units, and the output layer again has 192 units. Running the main script linearDecoderExercise.m learns color-image features, as shown in the figure above. This post only extracts features from the data and does no further classification; the learned features are kept for the convolutional neural network in a later post.