The previous posts all revolved around neural networks, whose hallmark is a nonlinear activation function. Earlier research also produced sparse coding, which uses a linear activation: the method tries to learn a feature set directly from the data, and the basis vectors associated with that feature set map the learned features from feature space back into the sample space, so that the feature set can reconstruct the sample data.
Denote the data set, basis vectors, and feature set by \(x\), \(A\), and \(s\) respectively. We construct the following cost function: the reconstruction error is penalized with a squared L2 norm, and the sparsity of the features with an L1 norm. The original tutorial does not average the error term over the data set; in practice it is divided by the number of examples \(m\).
\[J(A,s)= \frac 1m||As-x||_2^2+\lambda||s||_1\]
Next, the tutorial explains that to strengthen the sparsity constraint and prevent reaching the same cost by simply scaling \(A\) down (and \(s\) up), the penalty must also account for the magnitude of the basis vectors; the cost function is refined to
\[J(A,s)= \frac 1m||As-x||_2^2+\lambda||s||_1+\gamma ||A||_2^2\]
The cost function still suffers from the L1 norm being non-differentiable at 0. This is handled by a smooth approximation: define a constant smoothing parameter \(\epsilon\) and rewrite the cost as
\[J(A,s)= \frac 1m||As-x||_2^2+\lambda \sum_k \sqrt{s_k^2+\epsilon} +\gamma ||A||_2^2\]
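As a concrete reading of this cost, here is a NumPy sketch with toy dimensions (the function and variable names are mine, not from the exercise code):

```python
import numpy as np

def sparse_coding_cost(A, s, x, lam, gamma, eps):
    """J(A,s) = (1/m)||As - x||_2^2 + lam*sum(sqrt(s^2 + eps)) + gamma*||A||_2^2."""
    m = x.shape[1]                          # number of examples
    residual = A @ s - x                    # reconstruction error, V x m
    fit = np.sum(residual**2) / m           # squared L2 term, averaged over m
    sparsity = np.sum(np.sqrt(s**2 + eps))  # smoothed L1 penalty on the features
    decay = gamma * np.sum(A**2)            # penalty on the basis magnitude
    return fit + lam * sparsity + decay

# toy sizes: 4-pixel patches, 6 features, 5 examples
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
s = rng.standard_normal((6, 5))
x = rng.standard_normal((4, 5))
J = sparse_coding_cost(A, s, x, lam=0.1, gamma=0.01, eps=1e-2)
```

Note that with a perfect reconstruction (\(x=As\)) only the two penalty terms remain, which is an easy sanity check.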
This cost is non-convex jointly, but convex in either variable when the other is fixed, so we can alternately fix \(A\) and \(s\) and optimize the other. In theory, the feature set obtained by optimizing the expression above is similar to the one learned by a sparse autoencoder. Human vision has a notable property: neurons in cortical area V1 detect edges at particular orientations, and these neurons are (physiologically) organized into hypercolumns, in which adjacent neurons detect edges at similar orientations; if one neuron detects horizontal edges, its neighbor detects edges slightly off horizontal. To give the algorithm this topographic property, i.e. to make the activations of adjacent features somewhat continuous and smooth, we modify the penalty to take neighboring feature values into account. With 2×2 grouping, the term \(\sqrt{s_{1,1}^2+\epsilon}\) is replaced by \(\sqrt{s_{1,1}^2+s_{1,2}^2+s_{2,1}^2 +s_{2,2}^2+ \epsilon}\). The cost function of topographic sparse coding is then
\[J(A,s)= \frac 1m||As-x||_2^2+\lambda \sum_{\text{all } G} \sqrt{ \sum_{s \in G}s^2+\epsilon} +\gamma ||A||_2^2\]
Further, write the neighborhood grouping rule as a grouping matrix \(G\), where \(G_{r,c}=1\) means that group \(r\) contains feature \(c\). With \(s \circ s\) denoting the element-wise square of \(s\), the objective is rewritten as
\[J(A,s)= \frac 1m||As-x||_2^2+\lambda \sum_{r,i} \sqrt{ \left(G(s\circ s)\right)_{r,i}+\epsilon} +\gamma ||A||_2^2\]
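The grouping-matrix form is easy to check against the ungrouped smoothed L1 penalty. Below is a NumPy sketch; `grid_group_matrix` is my own helper building one plausible wrap-around 2×2 grouping on a square feature grid, not the exercise's exact construction:

```python
import numpy as np

def topo_penalty(G, s, eps):
    """Topographic term: sum over all entries of sqrt(G(s .* s) + eps).
    G[l, c] = 1 when group l contains feature c; G(s .* s) sums the squared
    activations of each group, for every example (column of s)."""
    return np.sum(np.sqrt(G @ (s * s) + eps))

def grid_group_matrix(side, group):
    """One group per grid position: the group x group window (with wrap-around)
    on a side x side feature grid."""
    F = side * side
    G = np.zeros((F, F))
    idx = np.arange(F).reshape(side, side)
    for r in range(side):
        for c in range(side):
            window = np.roll(np.roll(idx, -r, axis=0), -c, axis=1)[:group, :group]
            G[r * side + c, window.ravel()] = 1.0
    return G

rng = np.random.default_rng(1)
s = rng.standard_normal((9, 4))   # 3x3 feature grid, 4 examples
G = grid_group_matrix(3, 2)       # 2x2 wrap-around neighborhoods
```

With \(G=I\) (each feature its own group), the topographic term reduces to the plain smoothed L1 penalty.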
The symbols above are quite abstract. When implementing this with matrices, you have to check carefully that each row and column operation realizes the intended per-entry rule; the tutorial omits this part, and it took me some effort to derive.
Following the alternating optimization strategy described above, we first derive the partial derivatives of the cost with respect to \(A\) and \(s\). Write the matrices entry-wise as \(A=[W_{j,f}]_{visibleSize \times featureSize}\) and \(s=[S_{f,i}]_{featureSize\times m}\), and let \(V=visibleSize\), \(F=featureSize\).
The L1 (sparsity) term of the cost has zero derivative with respect to \(A\), so
\[\frac {\nabla J(A,s)} {\nabla W_{j,f}} =\frac 1 m \sum _i^m 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\cdots+W_{j,F}S_{F,i} -x_{j,i}]S_{f,i}+ 2\gamma W_{j,f}\]
Collecting the entries into matrix form gives
\[\frac {\nabla J(A,s)} {A} = \frac 2 m (As-x)s^T +2\gamma A \]
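This matrix gradient can be sanity-checked against a finite difference (a NumPy sketch; the step size and tolerance are my choices):

```python
import numpy as np

def cost_A(A, s, x, gamma):
    """The A-dependent part of the cost: (1/m)||As - x||^2 + gamma*||A||^2."""
    m = x.shape[1]
    return np.sum((A @ s - x)**2) / m + gamma * np.sum(A**2)

def grad_A(A, s, x, gamma):
    """Matrix form of the derivative: (2/m)(As - x)s^T + 2*gamma*A."""
    m = x.shape[1]
    return (2.0 / m) * (A @ s - x) @ s.T + 2.0 * gamma * A

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 7))
s = rng.standard_normal((7, 6))
x = rng.standard_normal((5, 6))

# central finite difference on a single entry of A
h = 1e-6
E = np.zeros_like(A)
E[2, 3] = h
num = (cost_A(A + E, s, x, 0.01) - cost_A(A - E, s, x, 0.01)) / (2 * h)
```

The numerical derivative `num` should agree with entry (2, 3) of `grad_A`.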
We also notice that this expression is linear in \(A\), so we can solve directly for the \(A\) that minimizes the cost when \(s\) is fixed:
\[\min_A J(A,s) \Leftrightarrow A = xs^T\left(ss^T+m \gamma I\right)^{-1}\]
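A quick numerical check of the closed form (NumPy sketch): solve the linear system rather than forming the inverse, and verify that the gradient with respect to \(A\) vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.standard_normal((7, 20))   # features, F x m
x = rng.standard_normal((5, 20))   # data, V x m
m, gamma = s.shape[1], 0.01

# A = x s^T (s s^T + m*gamma*I)^{-1}; since M is symmetric, solve M Z = s x^T
# and transpose instead of inverting M explicitly
M = s @ s.T + m * gamma * np.eye(7)
A_opt = np.linalg.solve(M, s @ x.T).T

# the gradient (2/m)(As - x)s^T + 2*gamma*A should vanish at A_opt
grad = (2.0 / m) * (A_opt @ s - x) @ s.T + 2.0 * gamma * A_opt
```

This mirrors the `weightMatrix = .../(...)` update in the exercise code, where MATLAB's `/` likewise solves the system instead of inverting.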
Expanding the cost function and differentiating with respect to \(s\):
\[\begin{align} \frac {\nabla J(A,s)} { S_{f,i}} &= \frac 1 m \sum _j^V 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\cdots+W_{j,F}S_{F,i} -x_{j,i}]W_{j,f}+ \frac {\nabla \lambda \sum_l^F \sum_i^m \sqrt {\left(G(s\circ s)\right)_{l,i}+\epsilon }} {\nabla S_{f,i}} \\ &= \frac 1 m \sum _j^V 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\cdots+W_{j,F}S_{F,i} -x_{j,i}]W_{j,f} + \lambda S_{f,i}\sum_l^F{\frac {g_{l,f}} {S\_smooth_{l,i}}} \end{align}\]
where \(G=[g_{l,f}]_{F \times F}\) and \(g_{l,f}=1\) means that group \(l\) contains feature \(f\). \(S\_smooth\) denotes the feature matrix after smoothing each feature over its topographic neighborhood as required by topographic coding, i.e. \(S\_smooth=\sqrt{G(s\circ s)+\epsilon}\) taken element-wise.
Rewriting this in matrix form, the two sums become matrix products (the product with \(S\) and the reciprocal of \(S\_smooth\) are element-wise):
\[ \frac {\nabla J(A,s)} {\nabla S} = \frac 2 m A^T(As-x) + \lambda\, S \circ \left(G^T \frac 1 {S\_smooth}\right)\]
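The feature gradient, topographic term included, can likewise be checked by finite differences (a NumPy sketch; the helper names and the random 0/1 grouping are mine):

```python
import numpy as np

def cost_s(A, s, x, G, lam, eps):
    """The s-dependent part: (1/m)||As - x||^2 + lam*sum(sqrt(G(s.*s) + eps))."""
    m = x.shape[1]
    return np.sum((A @ s - x)**2) / m + lam * np.sum(np.sqrt(G @ (s * s) + eps))

def grad_s(A, s, x, G, lam, eps):
    """(2/m)A^T(As - x) + lam * s .* (G^T (1 ./ S_smooth))."""
    m = x.shape[1]
    smooth = np.sqrt(G @ (s * s) + eps)
    return (2.0 / m) * A.T @ (A @ s - x) + lam * s * (G.T @ (1.0 / smooth))

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 6))
s = rng.standard_normal((6, 4))
x = rng.standard_normal((5, 4))
G = (rng.random((6, 6)) < 0.4).astype(float)  # arbitrary 0/1 grouping for the check
G[np.arange(6), np.arange(6)] = 1.0           # ensure every feature is in some group

# central finite difference on a single entry of s
h = 1e-6
E = np.zeros_like(s)
E[1, 2] = h
num = (cost_s(A, s + E, x, G, 0.1, 1e-2) - cost_s(A, s - E, x, G, 0.1, 1e-2)) / (2 * h)
```

Note how the derivative of entry \(s_{f,i}\) picks up a contribution from every group containing feature \(f\), which is exactly what the `G.T @ (1/smooth)` factor encodes.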
This matrix expression does not yield a closed-form minimizing \(S\), so this half of the optimization must be done iteratively, e.g. with gradient descent or a similar method.
At this point we have all the matrix expressions needed to write the code.
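Putting the pieces together, the alternating scheme can be sketched as follows. This is a NumPy toy, not the exercise code: it uses \(G=I\), plain gradient descent on \(s\) with a hand-picked step size, and tiny dimensions, whereas the real exercise drives minFunc with cg/lbfgs.

```python
import numpy as np

def alternating_sparse_coding(x, F, lam=0.1, gamma=0.01, eps=1e-2,
                              outer=10, inner=50, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    V, m = x.shape
    A = rng.standard_normal((V, F)) * 0.1
    s = A.T @ x                      # initialize features from the basis
    G = np.eye(F)                    # no topographic grouping in this toy

    def cost(A, s):
        return (np.sum((A @ s - x)**2) / m
                + lam * np.sum(np.sqrt(G @ (s * s) + eps))
                + gamma * np.sum(A**2))

    history = [cost(A, s)]
    for _ in range(outer):
        # closed-form basis update: A = x s^T (s s^T + m*gamma*I)^{-1}
        A = np.linalg.solve(s @ s.T + m * gamma * np.eye(F), s @ x.T).T
        # iterative feature update: gradient descent on s with A fixed
        for _ in range(inner):
            smooth = np.sqrt(G @ (s * s) + eps)
            g = (2.0 / m) * A.T @ (A @ s - x) + lam * s * (G.T @ (1.0 / smooth))
            s -= lr * g
        history.append(cost(A, s))
    return A, s, history

rng = np.random.default_rng(5)
x = rng.standard_normal((8, 30))     # 30 toy 8-pixel "patches"
A, s, history = alternating_sparse_coding(x, F=10)
```

Each outer step can only lower the cost: the \(A\) solve is exact for the current \(s\), and the small gradient steps on \(s\) descend the cost for the current \(A\).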
In this section's exercise, the main file is sparseCodingExercise.m, and the cost/gradient modules for \(A\) and \(s\) are sparseCodingWeightCost.m and sparseCodingFeatureCost.m respectively. Fill in their formula sections following the matrix derivations above; the full code is at https://github.com/codgeek/deeplearning.
The steps that alternately fix \(A\) or \(s\) and optimize the other live in sparseCodingExercise.m. A few points need attention, or it will be very hard to train a usable result:

- Initialize the features from the current basis: `featureMatrix = weightMatrix'*batchPatches;`
- Unlike the sparse autoencoder exercise, sampleIMAGES.m must not call normalizeData.
- Update the basis with the closed-form minimizer derived above: `weightMatrix = (batchPatches*(featureMatrix'))/(featureMatrix*(featureMatrix')+gamma*batchNumPatches*eye(numFeatures));`

The code for the two gradient functions follows.
```matlab
function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingWeightCost - given the features in featureMatrix, computes the
%   cost and gradient with respect to the weights, given in weightMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the features
%                   included in the rth group. groupMatrix(r, c) is 1 if the cth
%                   feature is in the rth group and 0 otherwise.

if exist('groupMatrix', 'var')
    assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
else
    groupMatrix = eye(numFeatures);
end

numExamples = size(patches, 2);
weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

% -------------------- YOUR CODE HERE --------------------
linearError = weightMatrix * featureMatrix - patches;
% the error term in the cost is the *squared* L2 norm, so no extra square root
normError = sum(sum(linearError .* linearError))./numExamples;
normWeight = sum(sum(weightMatrix .* weightMatrix));
topoFeature = groupMatrix*(featureMatrix.*featureMatrix);
smoothFeature = sqrt(topoFeature + epsilon);
% smoothed L1 is sum(sqrt(x.^2+epsilon)); writing sqrt(sum(x.^2+epsilon))
% by mistake would be an L2 norm instead
costFeature = sum(sum(smoothFeature));
cost = normError + lambda.*costFeature + gamma.*normWeight;

grad = 2./numExamples.*(linearError*featureMatrix') + (2*gamma) .* weightMatrix;
grad = grad(:);
end
```
```matlab
function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingFeatureCost - given the weights in weightMatrix, computes the
%   cost and gradient with respect to the features, given in featureMatrix
%   (parameters as documented in sparseCodingWeightCost)

if exist('groupMatrix', 'var')
    assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
else
    groupMatrix = eye(numFeatures);
end

numExamples = size(patches, 2);
weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

linearError = weightMatrix * featureMatrix - patches;
normError = sum(sum(linearError .* linearError))./numExamples;
normWeight = sum(sum(weightMatrix .* weightMatrix));
topoFeature = groupMatrix*(featureMatrix.*featureMatrix);
smoothFeature = sqrt(topoFeature + epsilon);
% smoothed L1: sum(sqrt(x.^2+epsilon)), not sqrt(sum(...)) which would be an L2 norm
costFeature = sum(sum(smoothFeature));
cost = normError + lambda.*costFeature + gamma.*normWeight;

% not only entry (f,i) has a nonzero derivative: every group l with
% groupMatrix(l,f) == 1 contributes a term involving s(f,i)
grad = 2./numExamples.*(weightMatrix' * linearError) + lambda.*featureMatrix.*( (groupMatrix')*(1 ./ smoothFeature) );
grad = grad(:);
end
```
The data is the same image set used in the sparse autoencoder post. The feature layer has 121 nodes, the input layer is an 8×8 patch (64 nodes), the topographic neighborhood is a 3×3 square, and training runs for 200 iterations.
An activation responds most strongly when the input matches its feature, so reshaping each 64-element column of \(A\) back into an 8×8 image patch visualizes that feature, one per hidden node, 121 in total. The results are below; with the same parameters and iteration count, the cg algorithm yields visibly sharper features than lbfgs.
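The reshaping step itself is a one-liner (a NumPy sketch with random data standing in for the trained basis; in the exercise, display_network.m does this plus per-patch normalization):

```python
import numpy as np

# each column of A (visibleSize x numFeatures = 64 x 121) is one basis vector;
# reshaping a column to 8x8 recovers the image patch that feature responds to
A = np.random.default_rng(6).standard_normal((64, 121))
feature_patches = A.T.reshape(121, 8, 8)   # one 8x8 patch per hidden node
```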
*(Figure: learned basis features after 200 iterations, lbfgs vs. cg)*
To obtain finer training results, I increased the feature-layer and input-layer sizes: 256 feature nodes, with inputs of 14×14 and 15×15, and a correspondingly larger 5×5 topographic neighborhood, trained with cg. The sharpness of the features and the completeness of the topographic structure are indistinguishable from the tutorial's reference results; the edge features are arranged in an orderly topology. However, increasing the input to 16×16 degrades training and the edge features become blurry. The reason is understandable: the feature layer is no longer larger than the input layer, the overcomplete-basis condition fails, and the training result suffers accordingly.
*(Figure: 14×14 input with 5×5 topology vs. 15×15 input with 5×5 topology)*
Results with more input nodes:
*(Figure: 16×16 input with 3×3 topology vs. 16×16 input with 5×5 topology)*