Deep Learning 12_UFLDL Tutorial: Sparse Coding_exercise (Stanford Deep Learning Tutorial)

Preface

Theory: the UFLDL Tutorial; Deep learning: 26 (a simple understanding of Sparse coding); Deep learning: 27 (matrix norm derivatives in Sparse coding); Deep learning: 29 (Sparse coding exercise).

Experiment environment: Win7, MATLAB 2015b, 16 GB RAM, 2 TB hard disk.

This exercise is comparatively hard to understand and hard to get working; many people never obtain good results from it, so it is worth taking the time to understand it carefully.

Experiment content: Exercise: Sparse Coding. From 10 whitened 512*512 grayscale images (the IMAGES.mat data in sparseae_exercise.zip, already used in Deep Learning 1_UFLDL Tutorial: Sparse Autoencoder exercise (Stanford Deep Learning Tutorial)), randomly sample 20000 small image patches (of size 8*8 or 16*16), learn their features with sparse coding (Sparse Coding) and with topographic sparse coding (topographic sparse coding), and display the learned features for each.

Notes on the basics of this exercise

1. The difference between a sparse autoencoder (Sparse Autoencoder) and sparse coding (Sparse Coding)

① Ng's explanation: sparse coding can be seen as a variant of the sparse autoencoder; sparse coding tries to learn the data's feature set directly, whereas the sparse autoencoder learns its features indirectly, by learning to reproduce the raw data.

The sparse coding algorithm is an unsupervised learning method that looks for an "over-complete" set of basis vectors with which to represent the sample data more efficiently; the number of these over-complete basis vectors is larger than the dimension of the original data. A sparse autoencoder with a single hidden layer, by contrast, behaves much like PCA: it only lets us conveniently find a "complete" set of basis vectors, whose number is smaller than the dimension of the original data.

② The goal of the sparse coding algorithm is to find a set of basis vectors \mathbf{\phi}_i such that the input vector \mathbf{x} can be expressed as a linear combination of these basis vectors:

\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} 
\end{align}, where \mathbf{x} = [x_1, x_2, x_3, \ldots, x_n].

Usually the number of basis vectors k is required to be large, at least larger than the number of elements n in x, so the equation above generally has infinitely many solutions. Why look for such an "over-complete" basis? Because an over-complete basis is better at uncovering the structure and patterns hidden inside the input data. In ordinary PCA one can also find a set of basis vectors with which to decompose x, but there the number of basis vectors is small, so the decomposition coefficients a are uniquely determined; in sparse coding k is much larger than n, and the coefficients a are no longer unique. To pin a down, the usual approach is to impose a sparsity constraint on the coefficients, i.e. to require the coefficients a_i to be sparse: for any given input vector we want as few coefficients as possible to be far from zero, with all the others equal or close to zero. In other words, we want a sparse representation of x.

If sparse coding is viewed as a variant of the sparse autoencoder, then the a_i above correspond to the feature set featureMatrix (Ng's notation: s) and the \mathbf{\phi}_i correspond to the weight matrix weightMatrix (Ng's notation: A). This exercise essentially solves for A, under the constraint that only a few of the coefficients (entries of s) are far from zero, and then displays A. A is the "over-complete" basis we are after.

In sparse coding the weight matrix A maps the features s back to the original data x, whereas in all the earlier exercises (including the sparse autoencoder) the weight matrix W works in the opposite direction, mapping the original data x to the features.
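To keep the two directions straight, here is a small dimension sketch using the variable names from the exercise code further down (the concrete sizes are just this exercise's defaults):

visibleSize = 16*16;   % pixels per patch, so x lives in R^256
numFeatures = 256;     % number of basis vectors (features) to learn
numExamples = 2000;    % patches in one mini-batch
A = randn(visibleSize, numFeatures);   % weightMatrix: column i is the basis vector phi_i
s = randn(numFeatures, numExamples);   % featureMatrix: column j holds the coefficients of patch j
xhat = A * s;                          % sparse coding direction: features s -> reconstructed data x
% In the earlier exercises (e.g. the sparse autoencoder) the weights work the other
% way round: roughly, features = f(W*x + b) maps data x -> features.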

2. What is sparse coding for? In other words, why sparse coding?

Ng has already answered this: the sparse coding algorithm is an unsupervised learning method that looks for an "over-complete" set of basis vectors to represent the sample data more efficiently, and the advantage of an over-complete basis is that it is better at uncovering the structure and patterns hidden inside the input data.

 

3. The cost function

Cost function for non-topographic sparse coding (matrix form, as written in Ng's notes; the sparsity sum runs over all entries of s):

J(A, s) = \lVert A s - x \rVert_2^2 + \lambda \sum \sqrt{s \odot s + \epsilon} + \gamma \lVert A \rVert_2^2

Cost function for topographic sparse coding, where V is the grouping matrix (groupMatrix) and s \odot s squares s element-wise:

J(A, s) = \lVert A s - x \rVert_2^2 + \lambda \sum \sqrt{V (s \odot s) + \epsilon} + \gamma \lVert A \rVert_2^2

Note: \lVert x \rVert_k is the Lk norm of x, i.e. \left( \sum_i { \left| x_i \right|^k } \right) ^{\frac{1}{k}}. The L2 norm is the familiar Euclidean norm, the square root of the sum of squares; the L1 norm is the sum of the absolute values of the vector's elements.
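A quick numeric check of these definitions (a sketch; norm is the MATLAB built-in):

x  = [3; -4; 0; 1];
k  = 3;
Lk = sum(abs(x).^k)^(1/k);   % Lk norm straight from the definition above (= norm(x,3))
L1 = sum(abs(x));            % L1 norm: sum of absolute values (= norm(x,1))
L2 = sqrt(sum(x.^2));        % L2 norm: Euclidean length (= norm(x,2))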

In Ng's notes this cost function is written slightly inaccurately: the reconstruction term should at least be divided by the number of samples, i.e. it is really a mean squared error rather than a plain sum of squared errors. The derivative formulas below and the code in this post do include the division by m.
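For concreteness, here is how the two cost functions (with the 1/m reconstruction term just mentioned) translate into MATLAB. This is only a sketch, but it matches the fResidue / fSparsity / fWeight terms computed in the exercise code further down:

% Example values and toy shapes (8*8 patches, 121 features), only to make the sketch runnable
m = 2000; lambda = 5e-5; gamma = 1e-2; epsilon = 1e-5;
A = randn(64, 121); s = randn(121, m); x = randn(64, m); V = eye(121);

delta    = A*s - x;
fResidue = sum(delta(:).^2) / m;                                 % (1/m)*||A*s - x||^2
fWeight  = gamma * sum(A(:).^2);                                 % gamma*||A||^2
fSparsityNonTopo = lambda * sum(sum(sqrt(s.^2 + epsilon)));      % smoothed L1 penalty on s
fSparsityTopo    = lambda * sum(sum(sqrt(V*(s.^2) + epsilon)));  % grouped (topographic) version
costNonTopo = fResidue + fSparsityNonTopo + fWeight;
costTopo    = fResidue + fSparsityTopo    + fWeight;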

 

4. Derivatives of the cost function

Derivative of the cost function with respect to the weight matrix A (identical in the topographic and non-topographic cases, because the cost's dependence on A is the same in both):

\nabla_A J = \frac{2}{m} (A s - x) s^T + 2 \gamma A

Derivative of the non-topographic sparse coding cost with respect to s:

\nabla_s J = \frac{2}{m} A^T (A s - x) + \lambda \, \frac{s}{\sqrt{s \odot s + \epsilon}} (the squaring and the division taken element-wise)

Derivative of the topographic sparse coding cost with respect to s:

\nabla_s J = \frac{2}{m} A^T (A s - x) + \lambda \left( V^T \left( V (s \odot s) + \epsilon \right)^{-\frac{1}{2}} \right) \odot s

Here m is the number of samples.
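Continuing the sketch above, the three gradients as MATLAB one-liners; they match what sparseCodingWeightCost.m and sparseCodingFeatureCost.m below compute:

gradA         = (2*(A*s - x)*s') / m + 2*gamma*A;                   % dJ/dA, topographic and non-topographic alike
gradS_resid   = (2*A'*(A*s - x)) / m;                               % reconstruction part of dJ/ds
gradS_nontopo = gradS_resid + lambda * (s ./ sqrt(s.^2 + epsilon)); % non-topographic sparsity part
gradS_topo    = gradS_resid + lambda * (V' * (V*(s.^2) + epsilon).^(-0.5)) .* s;  % topographic version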

For the derivation of the matrix derivative formulas above, see Deep learning: 27 (matrix norm derivatives in Sparse coding), http://www.mathchina.net/dvbbs/dispbbs.asp?boardid=4&Id=3673, and The Matrix Cookbook (strongly recommended; Writing Fast MATLAB Code by Pascal Getreuer is also well worth reading).

 

5. Parameters during the iterative optimization

When the parameters are optimized alternately, A has a direct analytic solution once s is given, so it does not need to be found with lbfgs or another optimizer; setting the derivative of the cost function with respect to A to zero yields the closed-form solution

A = x s^T \left( s s^T + \gamma m I \right)^{-1}, where m is the number of samples.
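In MATLAB this update is a single line (a sketch of the same closed form, matching the update used in sparseCodingExercise.m below, where x = batchPatches, s = featureMatrix and m = batchNumPatches):

numFeatures = size(s, 1);
A = x * s' / (s * s' + gamma * m * eye(numFeatures));   % solves dJ/dA = 0 for A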

 

6. Important: understand how the cost function of this exercise is derived. Only once you understand that will you be able to derive cost functions for your own models and requirements later.

In fact, Ng's notes (Sparse Coding, and Sparse Coding: Autoencoder Interpretation) already explain this clearly and in detail; the explanation below just spells it out more plainly.

① The reconstruction term: the goal of this exercise is to find an "over-complete" basis for the data x and to express x in that basis, i.e. \begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} 
\end{align}, where the \mathbf{\phi}_i are the over-complete basis vectors, written in matrix form as A, and the a_i are the features of the data x (the coefficients of the over-complete basis), written in matrix form as s.

To make the error between the data x and its reconstruction as small as possible, the cost function must include the squared error between the two, and that error must be minimized, i.e.:

minimize \sum_{j=1}^{m} \left\lVert \mathbf{x}^{(j)} - \sum_{i=1}^{k} a^{(j)}_i \mathbf{\phi}_{i} \right\rVert^{2}

In matrix form: \lVert A s - x \rVert_2^2

Ng calls this term the "reconstruction term"; it is really a mean-squared-error term, so Ng's expression is slightly inaccurate in omitting the division by the number of samples m. This is only a matter of notation, though: his code does divide by m.

② The sparsity penalty: the reconstruction term guarantees that the sparse coding algorithm provides a linear expression that fits the input vector \mathbf{x} well, i.e. that the error between x and its reconstruction is small, but it does not guarantee that the coefficients have only a few non-zero (or far-from-zero) elements. To enforce that we must also minimize \sum_{i=1}^{k} S(a_i), where in this exercise S(.) is the L1 penalty S(a_i)=\left|a_i\right|_1; the log penalty S(a_i)=\log(1+a_i^2) would be another possible choice.

Minimizing this expression ensures that only a few of the coefficients a are far from zero, i.e. it makes the a_i sparse. Since the matrix form of the a_i is s, the sparsity penalty in matrix form is \lVert s \rVert_1.

Also, because we will later differentiate the cost function in order to minimize it, and the L1 term is not differentiable with respect to s at 0 (think of y=|x|, which is not differentiable at x=0), we replace \lVert s \rVert_1 by \sum \sqrt{s \odot s + \epsilon}, where \epsilon is a "smoothing parameter" (also called a "sparsity parameter").

Putting this together, the sparsity penalty is \lambda \sum \sqrt{s \odot s + \epsilon}.
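A quick illustration of why the smoothed penalty is convenient (a sketch; the epsilon value is just an example):

epsilon  = 1e-2;
sv       = -1:0.25:1;
smoothL1 = sqrt(sv.^2 + epsilon);        % tracks abs(sv) away from zero
dSmooth  = sv ./ sqrt(sv.^2 + epsilon);  % derivative exists everywhere, and equals 0 at sv = 0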

So at this stage the cost function is:

\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}

In matrix form: \lVert A s - x \rVert_2^2 + \lambda \sum \sqrt{s \odot s + \epsilon}

③ The third term: with only the first two terms there is a problem. If the coefficients a (matrix form s) are scaled down to very small values while each basis vector (matrix form A) is scaled up to very large values, the first term barely changes while the sparsity penalty shrinks accordingly. So simply shrinking all the coefficients while growing the basis would minimize the cost above, and all the coefficients a would end up small, i.e. all above zero but close to it. That does not satisfy our sparsity requirement, which is that only a few coefficients should be far larger than zero, not that most coefficients are slightly above zero. To prevent this it suffices to keep the entries of A (i.e. the basis vectors \mathbf{\phi}_i) from growing too large, which only requires adding the following constraint:

\lVert \mathbf{\phi}_j \rVert^2 \le 1 \;\; \forall j, written in matrix form as A_j^T A_j \le 1 \; \forall j.

In other words, we just need to additionally minimize \sum_r{\sum_c{A_{rc}^2}}, the sum of the squares of all the entries of A, whose matrix form is \lVert A \rVert_2^2; adding it to the cost function prevents the situation described above.

So the third term of the cost function is \gamma \lVert A \rVert_2^2.

Only after these three steps do we arrive at a cost function that satisfies our requirements: \lVert A s - x \rVert_2^2 + \lambda \sum \sqrt{s \odot s + \epsilon} + \gamma \lVert A \rVert_2^2.

For the topographic sparse coding cost function, see Ng's explanation; he already makes it very clear and easy to follow.

 

7. Note: because the 10 512*512 grayscale images have already been whitened, their values are not in [0,1]. In earlier exercises the sampleIMAGES function normalized the samples to [0,1] because they were fed into a sparse autoencoder, which requires inputs in [0,1]. In this exercise the data are fed into sparse coding, which does not require inputs in [0,1], and the normalization itself introduces error (it clips according to the 3-sigma rule). So in the modified sampleIMAGES function used here, the 20000 randomly sampled patches must not be normalized to [0,1]; that is, the line patches = normalizeData(patches); in the original sampleIMAGES must be commented out. This step matters a great deal: it directly determines whether you obtain the expected result images. The modified sampling function sampleIMAGES is given in the code below.

 

8. A useful programming technique

assert(diff < 1e-8, 'Weight difference too large. Check your weight cost function. ');

This checks whether the condition diff < 1e-8 is true; if it is false, it raises an error with the message: Weight difference too large. Check your weight cost function.

 

9. Some MATLAB functions

  circshift:

  This function circularly shifts a matrix. For example, B = circshift(A, shiftsize) shifts the matrix A according to shiftsize, which is generally a vector (the 2-D case is described here): its first element gives the shift along the first dimension (up/down; a positive value shifts down), and its second element gives the shift along the second dimension (left/right; a positive value shifts right).
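A tiny example of its behaviour (sketch):

A = [1 2 3; 4 5 6; 7 8 9];
B = circshift(A, [1 0]);    % shift down by one along dim 1: [7 8 9; 1 2 3; 4 5 6]
C = circshift(A, [0 -1]);   % shift left by one along dim 2: [2 3 1; 5 6 4; 8 9 7]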

  randperm:

  This function produces a random row vector: randperm(n) returns a row vector of length n containing the numbers 1 to n in random order, without repetition; randperm(n,k) returns a row vector of length k whose elements are drawn from 1 to n, also without repetition.

  questdlg

  button = questdlg('qstring','title','str1','str2','str3',default) creates a dialog box whose message is qstring and whose title is title; the next three arguments are custom labels for the three buttons (replacing the default Yes, No and Cancel), and the final argument default names the default button.

 

Questions

1. How exactly should the grouping matrix V be generated, and why that way?

Ng generates the grouping matrix V as follows:

donutDim = floor(sqrt(numFeatures));
assert(donutDim * donutDim == numFeatures,'donutDim^2 must be equal to numFeatures');

groupMatrix = zeros(numFeatures, donutDim, donutDim);

groupNum = 1;
for row = 1:donutDim
    for col = 1:donutDim
        groupMatrix(groupNum, 1:poolDim, 1:poolDim) = 1;
        groupNum = groupNum + 1;
        groupMatrix = circshift(groupMatrix, [0 0 -1]);
    end
    groupMatrix = circshift(groupMatrix, [0 -1, 0]);
end
% poolDim = 3 (set earlier in sparseCodingExercise.m)

Why is it generated this way? I have not figured it out!
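My own reading (an assumption, not something stated in the exercise): the numFeatures features are laid out on a donutDim-by-donutDim grid that wraps around at its edges (hence "donut"), and each row of groupMatrix selects one poolDim-by-poolDim window on that grid; the circshift calls simply slide this window across the grid one step at a time. Up to a relabelling of the groups, the more explicit construction below should be equivalent:

% Hypothetical equivalent construction (sketch): group g is a poolDim-by-poolDim
% wrap-around window on a donutDim-by-donutDim grid of features.
donutDim = 16; poolDim = 3; numFeatures = donutDim^2;    % values used in this exercise
G = zeros(numFeatures, numFeatures);
g = 0;
for r = 1:donutDim
    for c = 1:donutDim
        g = g + 1;
        rows = mod((r-1):(r+poolDim-2), donutDim) + 1;   % window rows, wrapping at the edge
        cols = mod((c-1):(c+poolDim-2), donutDim) + 1;   % window columns, wrapping at the edge
        mask = zeros(donutDim, donutDim);
        mask(rows, cols) = 1;
        G(g, :) = mask(:)';                              % flatten the window into row g
    end
end
% Every row of G then has poolDim^2 ones, and neighbouring groups overlap; that overlap
% couples neighbouring features, which is what makes the learned features vary smoothly
% ("topographically") across the grid.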

 

Experiment steps

1. Initialize the parameters, load the data (the 10 whitened 512*512 grayscale images), and randomly sample 20000 small patches from them. The modified sampling function used in this exercise is sampleIMAGES.m below.

2. Implement the sparse coding cost function and its gradient with respect to the weight matrix A, then check the implementation numerically. Because the non-topographic and topographic cost functions have the same derivative with respect to A, this does not need to be done separately.

3. Implement the non-topographic sparse coding cost function and its gradient with respect to the feature set s, then check the implementation numerically.

4. Implement the topographic sparse coding cost function and its gradient with respect to the feature set s, then check the implementation numerically.

5. Alternately optimize the parameters A and s (using lbfgs or cg) to obtain the best weight matrix A, and display it.

 

Results

① Some of the raw data

② Non-topographic sparse coding, lbfgs optimizer, 256 features, 16*16 patches: the extracted "over-complete" basis is:

③ Topographic sparse coding, lbfgs optimizer, 256 features, 16*16 patches: the extracted "over-complete" basis is:

④ Topographic sparse coding, cg optimizer, 256 features, 16*16 patches: the extracted "over-complete" basis is:

⑤ Topographic sparse coding, cg optimizer, 121 features, 8*8 patches: the extracted "over-complete" basis is:

Summary

1. The optimization algorithm affects the result. For example, from the figures above, for topographic sparse coding with 256 features and 16*16 patches, the cg result (④) is better than the lbfgs result (③).

2. The patch size affects the result. For example, for topographic sparse coding with 121 features and 8*8 patches, lbfgs fails to produce a result, while cg gives the result shown above (⑤).

3. A rule of thumb: the cost should keep decreasing as the iterations proceed; if it does not, you will never obtain a correct result. If the cost is not decreasing, the choice of optimization algorithm or of patch size may be at fault, among other possible causes that have to be analysed case by case; the more fundamental reason is still unclear to me.

 

Code

sparseCodingExercise.m

%% CS294A/CS294W Sparse Coding Exercise

%  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  sparse coding exercise. In this exercise, you will need to modify
%  sparseCodingFeatureCost.m and sparseCodingWeightCost.m. You will also
%  need to modify this file, sparseCodingExercise.m slightly.

% Add the paths to your earlier exercises if necessary
% addpath /path/to/solution
addpath minFunc/;
%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;   % number of patches
numFeatures = 256;    % number of features to learn
patchDim = 16;         % patch dimension
visibleSize = patchDim * patchDim; 

% dimension of the grouping region (poolDim x poolDim) for topographic sparse coding
poolDim = 3;

% number of patches per batch
batchNumPatches = 2000; 

lambda = 5e-5;  % L1-regularisation parameter (on features)
epsilon = 1e-5; % L1-regularisation epsilon |x| ~ sqrt(x^2 + epsilon)
gamma = 1e-2;   % L2-regularisation parameter (on basis)

%%======================================================================
%% STEP 1: Sample patches

images = load('IMAGES.mat');
images = images.IMAGES;

patches = sampleIMAGES(images, patchDim, numPatches);
display_network(patches(:, 1:64));

%%======================================================================
%% STEP 2: Implement and check sparse coding cost functions
%  Implement the two sparse coding cost functions and check your gradients.
%  The two cost functions are
%  1) sparseCodingFeatureCost (in sparseCodingFeatureCost.m) for the features 
%     (used when optimizing for s, which is called featureMatrix in this exercise) 
%  2) sparseCodingWeightCost (in sparseCodingWeightCost.m) for the weights
%     (used when optimizing for A, which is called weightMatrix in this exercise)

% We reduce the number of features and number of patches for debugging
debug = false;
if debug
numFeatures = 5;
patches = patches(:, 1:5);
numPatches = 5;

weightMatrix = randn(visibleSize, numFeatures) * 0.005;
featureMatrix = randn(numFeatures, numPatches) * 0.005;

%% STEP 2a: Implement and test weight cost
%  Implement sparseCodingWeightCost in sparseCodingWeightCost.m and check
%  the gradient

[cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon);

numgrad = computeNumericalGradient( @(x) sparseCodingWeightCost(x, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon), weightMatrix(:) );
% Uncomment the below line to display the numerical and analytic gradients side by side
% disp([numgrad grad]);     
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Weight difference: %g\n', diff);
assert(diff < 1e-8, 'Weight difference too large. Check your weight cost function. ');

%% STEP 2b: Implement and test feature cost (non-topographic)
%  Implement sparseCodingFeatureCost in sparseCodingFeatureCost.m and check
%  the gradient. You may wish to implement the non-topographic version of
%  the cost function first, and extend it to the topographic version later.

% Set epsilon to a larger value so checking the gradient numerically makes sense
epsilon = 1e-2;

% We pass in the identity matrix as the grouping matrix, putting each
% feature in a group on its own.
groupMatrix = eye(numFeatures);

[cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix);

numgrad = computeNumericalGradient( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix), featureMatrix(:) );
% Uncomment the below line to display the numerical and analytic gradients side by side
% disp([numgrad grad]); 
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Feature difference (non-topographic): %g\n', diff);
assert(diff < 1e-8, 'Feature difference too large. Check your feature cost function. ');

%% STEP 2c: Implement and test feature cost (topographic)
%  Implement sparseCodingFeatureCost in sparseCodingFeatureCost.m and check
%  the gradient. This time, we will pass a random grouping matrix in to
%  check if your costs and gradients are correct for the topographic
%  version.

% Set epsilon to a larger value so checking the gradient numerically makes sense
epsilon = 1e-2;

% This time we pass in a random grouping matrix to check if the grouping is
% correct.
groupMatrix = rand(100, numFeatures);

[cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix);

numgrad = computeNumericalGradient( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix), featureMatrix(:) );
% Uncomment the below line to display the numerical and analytic gradients side by side
% disp([numgrad grad]); 
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Feature difference (topographic): %g\n', diff);
assert(diff < 1e-8, 'Feature difference too large. Check your feature cost function. ');
end

%%======================================================================
%% STEP 3: Iterative optimization
%  Once you have implemented the cost functions, you can now optimize for
%  the objective iteratively. The code to do the iterative optimization 
%  using mini-batching and good initialization of the features has already
%  been included for you. 
% 
%  However, you will still need to derive and fill in the analytic solution 
%  for optimizing the weight matrix given the features. 
%  Derive the solution and implement it in the code below, verify the
%  gradient as described in the instructions below, and then run the
%  iterative optimization.


% Initialize options for minFunc
options.Method = 'lbfgs';
options.display = 'off';
options.verbose = 0;

% Initialize matrices
weightMatrix = rand(visibleSize, numFeatures);
featureMatrix = rand(numFeatures, batchNumPatches);

% Initialize grouping matrix
assert(floor(sqrt(numFeatures)) ^2 == numFeatures, 'numFeatures should be a perfect square');
donutDim = floor(sqrt(numFeatures));
assert(donutDim * donutDim == numFeatures,'donutDim^2 must be equal to numFeatures');

groupMatrix = zeros(numFeatures, donutDim, donutDim);

groupNum = 1;
for row = 1:donutDim
    for col = 1:donutDim
        groupMatrix(groupNum, 1:poolDim, 1:poolDim) = 1;
        groupNum = groupNum + 1;
        groupMatrix = circshift(groupMatrix, [0 0 -1]);
    end
    groupMatrix = circshift(groupMatrix, [0 -1, 0]);
end

groupMatrix = reshape(groupMatrix, numFeatures, numFeatures);
if isequal(questdlg('Initialize grouping matrix for topographic or non-topographic sparse coding?', 'Topographic/non-topographic?', 'Non-topographic', 'Topographic', 'Non-topographic'), 'Non-topographic')
    groupMatrix = eye(numFeatures);
end

% Initial batch
indices = randperm(numPatches);
indices = indices(1:batchNumPatches);
batchPatches = patches(:, indices);                           

fprintf('%6s%12s%12s%12s%12s\n','Iter', 'fObj','fResidue','fSparsity','fWeight');

for iteration = 1:200                      
    error = weightMatrix * featureMatrix - batchPatches;
    error = sum(error(:) .^ 2) / batchNumPatches;
    
    fResidue = error;
    
    R = groupMatrix * (featureMatrix .^ 2);
    R = sqrt(R + epsilon);    
    fSparsity = lambda * sum(R(:));    
    
    fWeight = gamma * sum(weightMatrix(:) .^ 2);
    
    fprintf('  %4d  %10.4f  %10.4f  %10.4f  %10.4f\n', iteration, fResidue+fSparsity+fWeight, fResidue, fSparsity, fWeight)
               
    % Select a new batch
    indices = randperm(numPatches);
    indices = indices(1:batchNumPatches);
    batchPatches = patches(:, indices);                    
    
    % Reinitialize featureMatrix with respect to the new batch
    featureMatrix = weightMatrix' * batchPatches;
    normWM = sum(weightMatrix .^ 2)';
    featureMatrix = bsxfun(@rdivide, featureMatrix, normWM); 
    
    % Optimize for feature matrix    
    options.maxIter = 20;
    [featureMatrix, cost] = minFunc( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix), ...
                                           featureMatrix(:), options);
    featureMatrix = reshape(featureMatrix, numFeatures, batchNumPatches);                                      
       
    % Optimize for weight matrix  
%     weightMatrix = zeros(visibleSize, numFeatures);     
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Fill in the analytic solution for weightMatrix that minimizes 
    %   the weight cost here.     
    %   Once that is done, use the code provided below to check that your
    %   closed form solution is correct.
    %   Once you have verified that your closed form solution is correct,
    %   you should comment out the checking code before running the
    %   optimization.
    
    weightMatrix = batchPatches * featureMatrix' / (featureMatrix * featureMatrix' + gamma * batchNumPatches * eye(numFeatures));
    
%     [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix);
%     assert(norm(grad) < 1e-12, 'Weight gradient should be close to 0. Check your closed form solution for weightMatrix again.')
%     error('Weight gradient is okay. Comment out checking code before running optimization.');
    % -------------------- YOUR CODE HERE --------------------   
    
    % Visualize learned basis
    figure(1);
    display_network(weightMatrix);           
end

 

sampleIMAGES.m

function patches = sampleIMAGES(IMAGES, patchsize, numpatches)
% sampleIMAGES
% Returns numpatches patches for training

% load IMAGES;    % load images from disk 
% 
% patchsize = 8;  % we'll use 8x8 patches 
% numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, numpatches columns. 
patches = zeros(patchsize*patchsize, numpatches);

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Fill in the variable called "patches" using data 
%  from IMAGES.  
%  
%  IMAGES is a 3D array containing 10 images
%  For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
%  and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
%  it. (The contrast on these images look a bit off because they have
%  been preprocessed using using "whitening."  See the lecture notes for
%  more details.) As a second example, IMAGES(21:30,21:30,1) is an image
%  patch corresponding to the pixels in the block (21,21) to (30,30) of
%  Image 1
tic
image_size=size(IMAGES);
i=randi(image_size(1)-patchsize+1,1,numpatches); % 1-by-numpatches vector of random integers in [1, image_size(1)-patchsize+1]
j=randi(image_size(2)-patchsize+1,1,numpatches);
k=randi(image_size(3),1,numpatches);
for num=1:numpatches
        patches(:,num)=reshape(IMAGES(i(num):i(num)+patchsize-1,j(num):j(num)+patchsize-1,k(num)),1,patchsize*patchsize);
end
toc

%% ---------------------------------------------------------------
% For the autoencoder to work well we need to normalize the data
% Specifically, since the output of the network is bounded between [0,1]
% (due to the sigmoid activation function), we have to make sure 
% the range of pixel values is also bounded between [0,1]
% patches = normalizeData(patches);

end


%% ---------------------------------------------------------------
function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images): subtract each patch's (column's) mean from that patch
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:)); % three standard deviations of all patch values
patches = max(min(patches, pstd), -pstd) / pstd; % by the 3-sigma rule almost all (about 99.7%) of the values lie in this range;
                                                 % after this step the data lies in [-1, 1]
% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end

 

sparseCodingWeightCost.m

function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures,  patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingWeightCost - given the features in featureMatrix, 
%                         computes the cost and gradient with respect to
%                         the weights, given in weightMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
    else
        groupMatrix = eye(numFeatures);
    end

    numExamples = size(patches, 2);

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);
    
    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.     
    % -------------------- YOUR CODE HERE --------------------    
 %% Compute the cost
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples;% reconstruction error (mean squared error)
    fWeight = gamma*sum(weightMatrix(:).^2);% keeps the basis (weight) values from growing too large
    sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
    fSparsity = lambda*sum(sparsityMatrix(:));% sparsity penalty on the feature coefficients
    cost = fResidue+fWeight+fSparsity; % total cost
%     cost = fResidue+fWeight;
    
    %% Gradient of the cost with respect to the weights
    grad = (2*weightMatrix*featureMatrix*(featureMatrix')-2*patches*featureMatrix')./numExamples+2*gamma*weightMatrix;
    grad = grad(:);
    
    
    
    
end

 

sparseCodingFeatureCost.m

function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingFeatureCost - given the weights in weightMatrix,
%                          computes the cost and gradient with respect to
%                          the features, given in featureMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
    else
        groupMatrix = eye(numFeatures);
    end

    numExamples = size(patches, 2);

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   features given in featureMatrix.     
    %   You may wish to write the non-topographic version, ignoring
    %   the grouping matrix groupMatrix first, and extend the 
    %   non-topographic version to the topographic version later.
    % -------------------- YOUR CODE HERE --------------------
    
    %% Compute the cost
    delta = weightMatrix*featureMatrix-patches;
    fResidue = sum(sum(delta.^2))./numExamples;% reconstruction error (mean squared error)
    fWeight = gamma*sum(weightMatrix(:).^2);% keeps the basis (weight) values from growing too large
    sparsityMatrix = sqrt(groupMatrix*(featureMatrix.^2)+epsilon);
    fSparsity = lambda*sum(sparsityMatrix(:)); % sparsity penalty on the feature coefficients
    cost = fResidue+fSparsity+fWeight;% A is constant here, so the fWeight term could in principle be dropped
%     cost = fResidue+fSparsity;

    %% Gradient of the cost with respect to the features
    gradResidue = (-2*weightMatrix'*patches+2*(weightMatrix')*weightMatrix*featureMatrix)./numExamples;
  
    % Non-topographic case:
    isTopo = 1;  % isTopo = 1 selects the topographic branch (with groupMatrix = eye the two branches coincide)
    if ~isTopo
        gradSparsity = lambda*(featureMatrix./sparsityMatrix);
        
    % Topographic case:
    else
        gradSparsity = lambda*groupMatrix'*(groupMatrix*(featureMatrix.^2)+epsilon).^(-0.5).*featureMatrix;% careful: the final multiplication by featureMatrix must be element-wise
    end
    grad = gradResidue+gradSparsity;
    grad = grad(:);
    
    
    
    
    
end

 


References

http://blog.csdn.net/zouxy09/article/details/8777094

http://blog.csdn.net/lu597203933/article/details/46489647

UFLDL Tutorial

Deep learning: 26 (a simple understanding of Sparse coding)

Deep learning: 27 (matrix norm derivatives in Sparse coding)

Deep learning: 29 (Sparse coding exercise)
