The contractive autoencoder, abbreviated CAE, is a variant of the ordinary autoencoder: it simply adds a regularization term to the autoencoder objective. For comparison, an autoencoder that penalizes the weight values directly has a cost of the form:

    J_{AE+wd}(\theta) = \sum_{x \in D} L(x, g_{\theta'}(f_{\theta}(x))) + \lambda \sum_{ij} W_{ij}^2
That cost penalizes the values of W directly. The CAE discussed here has an equally simple expression:

    J_{CAE}(\theta) = \sum_{x \in D} L(x, g_{\theta'}(f_{\theta}(x))) + \lambda \|J_f(x)\|_F^2
Here J_f(x) = \frac{\partial h(x)}{\partial x} is the Jacobian of the hidden-layer outputs with respect to the input, and \|J_f(x)\|_F^2 is the squared Frobenius norm of that Jacobian, i.e. every entry of the Jacobian is squared and the results are summed. More explicitly:

    \|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2
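Since the encoder used later in this post is a sigmoid layer h = s(Wx + b), the Jacobian has the closed form ∂h_j/∂x_i = h_j(1-h_j)W_ij, so the penalty can be computed without forming the full Jacobian. Below is a minimal numpy sketch (all names here are illustrative and unrelated to the Theano code further down) that checks the closed form against the explicit Jacobian:

    # Minimal check of the CAE penalty for a sigmoid encoder h = s(Wx + b):
    #   dh_j/dx_i = h_j(1-h_j) W_ij, hence ||J||_F^2 = sum_j (h_j(1-h_j))^2 * sum_i W_ij^2
    import numpy as np

    rng = np.random.RandomState(0)
    n_visible, n_hidden = 5, 3
    W = rng.randn(n_visible, n_hidden)         # illustrative encoder weights
    b = rng.randn(n_hidden)
    x = rng.rand(n_visible)                    # one input example

    h = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))  # hidden activations

    # explicit Jacobian dh/dx, shape (n_hidden, n_visible)
    J = (h * (1 - h))[:, None] * W.T
    frob_explicit = np.sum(J ** 2)

    # closed form used in the CAE objective
    frob_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))

    print(frob_explicit, frob_closed)          # the two numbers agree

The Theano code below constructs the same h(1-h)·W product explicitly in get_jacobian and then sums its squares.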
For background on Jacobian matrices see the post 雅克比矩陣&行列式——單純的矩陣和算子, and for the Frobenius norm see the material on matrix-norm derivatives in my earlier Sparse coding post.
With the loss function in hand, the model can be trained with ordinary mini-batch stochastic gradient descent.
Why does the contractive autoencoder work so well? The authors spend several pages on this in the paper, and frankly I have not fully understood it; I would appreciate a simple, plain-language explanation from anyone who has. Below are some of my takeaways from reading the paper:
A good feature representation is judged by roughly two criteria: 1. it allows the input data to be reconstructed well; 2. it is invariant to a certain amount of perturbation of the input. The plain autoencoder and the sparse autoencoder mainly satisfy the first criterion, while the denoising autoencoder and the contractive autoencoder mainly target the second. For classification tasks, the second criterion is the more important one.
The Jacobian matrix carries information about the data in every direction. One can take its singular value decomposition and plot the singular values against their index: the large singular values correspond to the amount of variation allowed along the learned local directions, and the more sharply the spectrum decays the better (I did not fully understand this figure, so this explanation is essentially a direct translation of statements in the original paper).
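As a rough illustration (a hypothetical numpy sketch, not the authors' analysis code), one can form the Jacobian of a toy sigmoid encoder at a data point and inspect its singular values; a spectrum with only a few large values indicates that the representation is still sensitive along a few local directions and contracted along the rest:

    # Hypothetical sketch: singular value spectrum of the encoder Jacobian
    import numpy as np

    rng = np.random.RandomState(1)
    n_visible, n_hidden = 20, 10
    W = rng.randn(n_visible, n_hidden) * 0.1   # stand-in for trained encoder weights
    b = np.zeros(n_hidden)
    x = rng.rand(n_visible)                    # a data point

    h = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))
    J = (h * (1 - h))[:, None] * W.T           # Jacobian dh/dx at x, shape (n_hidden, n_visible)

    s = np.linalg.svd(J, compute_uv=False)     # singular values, largest first
    print(s)                                   # a sharply decaying spectrum means the mapping is
                                               # contractive in most directions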
Another figure is the contraction-ratio plot. The contraction ratio is defined as the distance between two sample points in feature space (the space after the mapping) divided by the distance between the corresponding two points in the original input space. The contraction of the local mapping at a point x is the Frobenius norm of the Jacobian at that point. According to the authors, it is better when the contraction-ratio curve rises with the radius (why?), and the CAE happens to behave this way.
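A hedged sketch of how such a contraction ratio could be measured (hypothetical code, using the same kind of toy sigmoid encoder as above, not the authors' implementation): sample points at a fixed radius around x, map them through the encoder, and compare feature-space to input-space distances:

    # Hypothetical sketch: average contraction ratio around a point x
    import numpy as np

    def encode(x, W, b):
        """Sigmoid encoder h = s(Wx + b)."""
        return 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))

    rng = np.random.RandomState(2)
    n_visible, n_hidden = 20, 10
    W = rng.randn(n_visible, n_hidden) * 0.1   # stand-in for trained encoder weights
    b = np.zeros(n_hidden)

    x = rng.rand(n_visible)
    radius = 0.1
    ratios = []
    for _ in range(100):
        eps = rng.randn(n_visible)
        x2 = x + radius * eps / np.linalg.norm(eps)   # neighbour at the given radius
        # contraction ratio: feature-space distance over input-space distance
        ratios.append(np.linalg.norm(encode(x2, W, b) - encode(x, W, b)) /
                      np.linalg.norm(x2 - x))
    print(np.mean(ratios))   # values well below 1 mean the mapping contracts locally around x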
In short, the contractive autoencoder mainly suppresses perturbations of the training samples (which lie on a low-dimensional manifold surface) in all directions.
Code for the CAE can be found at pylearn2/cA.py:
"""This tutorial introduces Contractive auto-encoders (cA) using Theano. They are based on auto-encoders as the ones used in Bengio et al. 2007. An autoencoder takes an input x and first maps it to a hidden representation y = f_{\theta}(x) = s(Wx+b), parameterized by \theta={W,b}. The resulting latent representation y is then mapped back to a "reconstructed" vector z \in [0,1]^d in input space z = g_{\theta'}(y) = s(W'y + b'). The weight matrix W' can optionally be constrained such that W' = W^T, in which case the autoencoder is said to have tied weights. The network is trained such that to minimize the reconstruction error (the error between x and z). Adding the squared Frobenius norm of the Jacobian of the hidden mapping h with respect to the visible units yields the contractive auto-encoder: - \sum_{k=1}^d[ x_k \log z_k + (1-x_k) \log( 1-z_k)] + \| \frac{\partial h(x)}{\partial x} \|^2 References : - S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio: Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, ICML-11 - S. Rifai, X. Muller, X. Glorot, G. Mesnil, Y. Bengio, and Pascal Vincent. Learning invariant features through local space contraction. Technical Report 1360, Universite de Montreal - Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle: Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19, 2007 """ import cPickle import gzip import os import sys import time import numpy import theano import theano.tensor as T from logistic_sgd import load_data from utils import tile_raster_images import PIL.Image class cA(object): """ Contractive Auto-Encoder class (cA) The contractive autoencoder tries to reconstruct the input with an additional constraint on the latent space. With the objective of obtaining a robust representation of the input space, we regularize the L2 norm(Froebenius) of the jacobian of the hidden representation with respect to the input. Please refer to Rifai et al.,2011 for more details. If x is the input then equation (1) computes the projection of the input into the latent space h. Equation (2) computes the jacobian of h with respect to x. Equation (3) computes the reconstruction of the input, while equation (4) computes the reconstruction error and the added regularization term from Eq.(2). .. math:: h_i = s(W_i x + b_i) (1) J_i = h_i (1 - h_i) * W_i (2) x' = s(W' h + b') (3) L = -sum_{k=1}^d [x_k \log x'_k + (1-x_k) \log( 1-x'_k)] + lambda * sum_{i=1}^d sum_{j=1}^n J_{ij}^2 (4) """ def __init__(self, numpy_rng, input=None, n_visible=784, n_hidden=100, n_batchsize=1, W=None, bhid=None, bvis=None): """Initialize the cA class by specifying the number of visible units (the dimension d of the input ), the number of hidden units ( the dimension d' of the latent or hidden space ) and the contraction level. The constructor also receives symbolic variables for the input, weights and bias. 
:type numpy_rng: numpy.random.RandomState :param numpy_rng: number random generator used to generate weights :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams :param theano_rng: Theano random generator; if None is given one is generated based on a seed drawn from `rng` :type input: theano.tensor.TensorType :param input: a symbolic description of the input or None for standalone cA :type n_visible: int :param n_visible: number of visible units :type n_hidden: int :param n_hidden: number of hidden units :type n_batchsize int :param n_batchsize: number of examples per batch :type W: theano.tensor.TensorType :param W: Theano variable pointing to a set of weights that should be shared belong the dA and another architecture; if dA should be standalone set this to None :type bhid: theano.tensor.TensorType :param bhid: Theano variable pointing to a set of biases values (for hidden units) that should be shared belong dA and another architecture; if dA should be standalone set this to None :type bvis: theano.tensor.TensorType :param bvis: Theano variable pointing to a set of biases values (for visible units) that should be shared belong dA and another architecture; if dA should be standalone set this to None """ self.n_visible = n_visible self.n_hidden = n_hidden self.n_batchsize = n_batchsize # note : W' was written as `W_prime` and b' as `b_prime` if not W: # W is initialized with `initial_W` which is uniformely sampled # from -4*sqrt(6./(n_visible+n_hidden)) and # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if # converted using asarray to dtype # theano.config.floatX so that the code is runable on GPU initial_W = numpy.asarray(numpy_rng.uniform( low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), high=4 * numpy.sqrt(6. / (n_hidden + n_visible)), size=(n_visible, n_hidden)), dtype=theano.config.floatX) W = theano.shared(value=initial_W, name='W', borrow=True) if not bvis: bvis = theano.shared(value=numpy.zeros(n_visible, dtype=theano.config.floatX), borrow=True) if not bhid: bhid = theano.shared(value=numpy.zeros(n_hidden, dtype=theano.config.floatX), name='b', borrow=True) self.W = W # b corresponds to the bias of the hidden self.b = bhid # b_prime corresponds to the bias of the visible self.b_prime = bvis # tied weights, therefore W_prime is W transpose self.W_prime = self.W.T # if no input is given, generate a variable representing the input if input == None: # we use a matrix because we expect a minibatch of several # examples, each example being a row self.x = T.dmatrix(name='input') else: self.x = input self.params = [self.W, self.b, self.b_prime] def get_hidden_values(self, input): #激發函數爲sigmoid看,這裏只向前進一次 """ Computes the values of the hidden layer """ return T.nnet.sigmoid(T.dot(input, self.W) + self.b) def get_jacobian(self, hidden, W): """Computes the jacobian of the hidden layer with respect to the input, reshapes are necessary for broadcasting the element-wise product on the right axis """ return T.reshape(hidden * (1 - hidden), #計算雅克比矩陣,先將h(1-h)變成3維矩陣,而後將w也變成3維矩陣,而後將這2個3維矩陣 (self.n_batchsize, 1, self.n_hidden)) * T.reshape( #對應元素相乘,但怎麼感受2個矩陣尺寸不對應呢? 
W, (1, self.n_visible, self.n_hidden)) def get_reconstructed_input(self, hidden): #重構輸入時得到的輸出端數據 """Computes the reconstructed input given the values of the hidden layer """ return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) def get_cost_updates(self, contraction_level, learning_rate): """ This function computes the cost and the updates for one trainng step of the cA """ y = self.get_hidden_values(self.x) z = self.get_reconstructed_input(y) J = self.get_jacobian(y, self.W) # note : we sum over the size of a datapoint; if we are using # minibatches, L will be a vector, with one entry per # example in minibatch self.L_rec = - T.sum(self.x * T.log(z) + #交叉熵做爲重構偏差(當輸入是[0,1],且是sigmoid時能夠採用) (1 - self.x) * T.log(1 - z), axis=1) # Compute the jacobian and average over the number of samples/minibatch self.L_jacob = T.sum(J ** 2) / self.n_batchsize # note : L is now a vector, where each element is the # cross-entropy cost of the reconstruction of the # corresponding example of the minibatch. We need to # compute the average of all these to get the cost of # the minibatch cost = T.mean(self.L_rec) + contraction_level * T.mean(self.L_jacob) # compute the gradients of the cost of the `cA` with respect # to its parameters gparams = T.grad(cost, self.params) #Theano特有的功能,自動求導 # generate the list of updates updates = [] for param, gparam in zip(self.params, gparams): updates.append((param, param - learning_rate * gparam)) #SGD算法 return (cost, updates) def test_cA(learning_rate=0.01, training_epochs=20, dataset='./data/mnist.pkl.gz', batch_size=10, output_folder='cA_plots', contraction_level=.1): """ This demo is tested on MNIST :type learning_rate: float :param learning_rate: learning rate used for training the contracting AutoEncoder :type training_epochs: int :param training_epochs: number of epochs used for training :type dataset: string :param dataset: path to the picked dataset """ datasets = load_data(dataset) train_set_x, train_set_y = datasets[0] # compute number of minibatches for training, validation and testing n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size #標識borrow=True表示不須要複製樣本 # allocate symbolic variables for the data index = T.lscalar() # index to a [mini]batch x = T.matrix('x') # the data is presented as rasterized images if not os.path.isdir(output_folder): os.makedirs(output_folder) os.chdir(output_folder) #################################### # BUILDING THE MODEL # #################################### rng = numpy.random.RandomState(123) ca = cA(numpy_rng=rng, input=x, n_visible=28 * 28, n_hidden=500, n_batchsize=batch_size) #500個隱含層節點 cost, updates = ca.get_cost_updates(contraction_level=contraction_level, #update裏面裝的是參數的更新過程 learning_rate=learning_rate) train_ca = theano.function([index], [T.mean(ca.L_rec), ca.L_jacob], #定義函數,輸入爲batch的索引,輸出爲該batch下的重構偏差和雅克比偏差 updates=updates, givens={x: train_set_x[index * batch_size: (index + 1) * batch_size]}) start_time = time.clock() ############ # TRAINING # ############ # go through training epochs for epoch in xrange(training_epochs): #循環20次 # go through trainng set c = [] for batch_index in xrange(n_train_batches): c.append(train_ca(batch_index)) #計算loss值,計算過程當中其實也一直在更新updates權值 c_array = numpy.vstack(c) #vstack()爲將矩陣序列c按照每行疊加,從新構造一個矩陣 print 'Training epoch %d, reconstruction cost ' % epoch, numpy.mean( c_array[0]), ' jacobian norm ', numpy.mean(numpy.sqrt(c_array[1])) end_time = time.clock() training_time = (end_time - start_time) #下面是顯示和保存學習到的權值結果 print >> sys.stderr, ('The code for file ' + 
os.path.split(__file__)[1] + ' ran for %.2fm' % ((training_time) / 60.)) image = PIL.Image.fromarray(tile_raster_images( X=ca.W.get_value(borrow=True).T, img_shape=(28, 28), tile_shape=(10, 10), tile_spacing=(1, 1))) image.save('cae_filters.png') os.chdir('../') if __name__ == '__main__': test_cA()
Running the original program for 20 epochs took more than 6 hours; the reconstruction error term and the contraction term evolved as follows:
... loading data
Training epoch 0, reconstruction cost 589.571872577 jacobian norm 20.9938791886
Training epoch 1, reconstruction cost 115.13390224 jacobian norm 10.673699659
Training epoch 2, reconstruction cost 101.291018001 jacobian norm 10.134422748
Training epoch 3, reconstruction cost 94.220284334 jacobian norm 9.84685383242
Training epoch 4, reconstruction cost 89.5890225412 jacobian norm 9.64736166807
Training epoch 5, reconstruction cost 86.1490384385 jacobian norm 9.49857669084
Training epoch 6, reconstruction cost 83.4664242016 jacobian norm 9.38143172793
Training epoch 7, reconstruction cost 81.3512907826 jacobian norm 9.28327421556
Training epoch 8, reconstruction cost 79.6482831506 jacobian norm 9.19748922967
Training epoch 9, reconstruction cost 78.2066659332 jacobian norm 9.12143982155
Training epoch 10, reconstruction cost 76.9456192804 jacobian norm 9.05343287129
Training epoch 11, reconstruction cost 75.8435863545 jacobian norm 8.99151663486
Training epoch 12, reconstruction cost 74.8999458491 jacobian norm 8.9338049163
Training epoch 13, reconstruction cost 74.1060022563 jacobian norm 8.87925367541
Training epoch 14, reconstruction cost 73.4415396294 jacobian norm 8.8291852146
Training epoch 15, reconstruction cost 72.879630175 jacobian norm 8.78442892358
Training epoch 16, reconstruction cost 72.3729563995 jacobian norm 8.74324402838
Training epoch 17, reconstruction cost 71.8622392555 jacobian norm 8.70262903409
Training epoch 18, reconstruction cost 71.3049790204 jacobian norm 8.66103980493
Training epoch 19, reconstruction cost 70.6462751293 jacobian norm 8.61777944201
References:
S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML 2011.