The contractive autoencoder, abbreviated CAE, is a variant of the ordinary autoencoder: it simply adds a regularization term to the autoencoder objective. For comparison, an autoencoder that penalizes the weight values directly has a cost of the form:

    J_{AE+wd}(\theta) = \sum_{x \in D} L(x, g_{\theta'}(f_{\theta}(x))) + \lambda \sum_{ij} W_{ij}^2
That cost penalizes the values of W directly. The CAE discussed here has an equally simple expression:

    J_{CAE}(\theta) = \sum_{x \in D} L(x, g_{\theta'}(f_{\theta}(x))) + \lambda \|J_f(x)\|_F^2
Here J_f(x) = \frac{\partial h(x)}{\partial x} is the Jacobian of the hidden-layer outputs with respect to the input, and \|J_f(x)\|_F^2 is the squared Frobenius norm of that Jacobian, i.e. every entry of the Jacobian is squared and the results are summed. More explicitly:

    \|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2
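Since the encoder used later in this post is a sigmoid layer h = s(Wx + b), the Jacobian has the closed form ∂h_j/∂x_i = h_j(1-h_j)W_ij, so the penalty can be computed without forming the full Jacobian. Below is a minimal numpy sketch (all names here are illustrative and unrelated to the Theano code further down) that checks the closed form against the explicit Jacobian:

    # Minimal check of the CAE penalty for a sigmoid encoder h = s(Wx + b):
    #   dh_j/dx_i = h_j(1-h_j) W_ij, hence ||J||_F^2 = sum_j (h_j(1-h_j))^2 * sum_i W_ij^2
    import numpy as np

    rng = np.random.RandomState(0)
    n_visible, n_hidden = 5, 3
    W = rng.randn(n_visible, n_hidden)         # illustrative encoder weights
    b = rng.randn(n_hidden)
    x = rng.rand(n_visible)                    # one input example

    h = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))  # hidden activations

    # explicit Jacobian dh/dx, shape (n_hidden, n_visible)
    J = (h * (1 - h))[:, None] * W.T
    frob_explicit = np.sum(J ** 2)

    # closed form used in the CAE objective
    frob_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))

    print(frob_explicit, frob_closed)          # the two numbers agree

The Theano code below constructs the same h(1-h)·W product explicitly in get_jacobian and then sums its squares.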
For background on Jacobian matrices see the post 雅克比矩陣&行列式——單純的矩陣和算子, and for the Frobenius norm see the material on matrix-norm derivatives in my earlier Sparse coding post.
With the loss function in hand, the model can be trained with ordinary mini-batch stochastic gradient descent.
Why does the contractive autoencoder work so well? The authors spend several pages on this in the paper, and frankly I have not fully understood it; I would appreciate a simple, plain-language explanation from anyone who has. Below are some of my takeaways from reading the paper:
A good feature representation is judged by roughly two criteria: 1. it allows the input data to be reconstructed well; 2. it is invariant to a certain amount of perturbation of the input. The plain autoencoder and the sparse autoencoder mainly satisfy the first criterion, while the denoising autoencoder and the contractive autoencoder mainly target the second. For classification tasks, the second criterion is the more important one.
The Jacobian matrix carries information about the data in every direction. One can take its singular value decomposition and plot the singular values against their index: the large singular values correspond to the amount of variation allowed along the learned local directions, and the more sharply the spectrum decays the better (I did not fully understand this figure, so this explanation is essentially a direct translation of statements in the original paper).
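As a rough illustration (a hypothetical numpy sketch, not the authors' analysis code), one can form the Jacobian of a toy sigmoid encoder at a data point and inspect its singular values; a spectrum with only a few large values indicates that the representation is still sensitive along a few local directions and contracted along the rest:

    # Hypothetical sketch: singular value spectrum of the encoder Jacobian
    import numpy as np

    rng = np.random.RandomState(1)
    n_visible, n_hidden = 20, 10
    W = rng.randn(n_visible, n_hidden) * 0.1   # stand-in for trained encoder weights
    b = np.zeros(n_hidden)
    x = rng.rand(n_visible)                    # a data point

    h = 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))
    J = (h * (1 - h))[:, None] * W.T           # Jacobian dh/dx at x, shape (n_hidden, n_visible)

    s = np.linalg.svd(J, compute_uv=False)     # singular values, largest first
    print(s)                                   # a sharply decaying spectrum means the mapping is
                                               # contractive in most directions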
Another figure is the contraction-ratio plot. The contraction ratio is defined as the distance between two sample points in feature space (the space after the mapping) divided by the distance between the corresponding two points in the original input space. The contraction of the local mapping at a point x is the Frobenius norm of the Jacobian at that point. According to the authors, it is better when the contraction-ratio curve rises with the radius (why?), and the CAE happens to behave this way.
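A hedged sketch of how such a contraction ratio could be measured (hypothetical code, using the same kind of toy sigmoid encoder as above, not the authors' implementation): sample points at a fixed radius around x, map them through the encoder, and compare feature-space to input-space distances:

    # Hypothetical sketch: average contraction ratio around a point x
    import numpy as np

    def encode(x, W, b):
        """Sigmoid encoder h = s(Wx + b)."""
        return 1.0 / (1.0 + np.exp(-(x.dot(W) + b)))

    rng = np.random.RandomState(2)
    n_visible, n_hidden = 20, 10
    W = rng.randn(n_visible, n_hidden) * 0.1   # stand-in for trained encoder weights
    b = np.zeros(n_hidden)

    x = rng.rand(n_visible)
    radius = 0.1
    ratios = []
    for _ in range(100):
        eps = rng.randn(n_visible)
        x2 = x + radius * eps / np.linalg.norm(eps)   # neighbour at the given radius
        # contraction ratio: feature-space distance over input-space distance
        ratios.append(np.linalg.norm(encode(x2, W, b) - encode(x, W, b)) /
                      np.linalg.norm(x2 - x))
    print(np.mean(ratios))   # values well below 1 mean the mapping contracts locally around x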
In short, the contractive autoencoder mainly suppresses perturbations of the training samples (which lie on a low-dimensional manifold surface) in all directions.
Code for the CAE can be found at pylearn2/cA.py:
"""This tutorial introduces Contractive auto-encoders (cA) using Theano. They are based on auto-encoders as the ones used in Bengio et al. 2007. An autoencoder takes an input x and first maps it to a hidden representation y = f_{\theta}(x) = s(Wx+b), parameterized by \theta={W,b}. The resulting latent representation y is then mapped back to a "reconstructed" vector z \in [0,1]^d in input space z = g_{\theta'}(y) = s(W'y + b'). The weight matrix W' can optionally be constrained such that W' = W^T, in which case the autoencoder is said to have tied weights. The network is trained such that to minimize the reconstruction error (the error between x and z). Adding the squared Frobenius norm of the Jacobian of the hidden mapping h with respect to the visible units yields the contractive auto-encoder: - \sum_{k=1}^d[ x_k \log z_k + (1-x_k) \log( 1-z_k)] + \| \frac{\partial h(x)}{\partial x} \|^2 References : - S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio: Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, ICML-11 - S. Rifai, X. Muller, X. Glorot, G. Mesnil, Y. Bengio, and Pascal Vincent. Learning invariant features through local space contraction. Technical Report 1360, Universite de Montreal - Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle: Greedy Layer-Wise Training of Deep Networks, Advances in Neural Information Processing Systems 19, 2007 """ import cPickle import gzip import os import sys import time import numpy import theano import theano.tensor as T from logistic_sgd import load_data from utils import tile_raster_images import PIL.Image class cA(object): """ Contractive Auto-Encoder class (cA) The contractive autoencoder tries to reconstruct the input with an additional constraint on the latent space. With the objective of obtaining a robust representation of the input space, we regularize the L2 norm(Froebenius) of the jacobian of the hidden representation with respect to the input. Please refer to Rifai et al.,2011 for more details. If x is the input then equation (1) computes the projection of the input into the latent space h. Equation (2) computes the jacobian of h with respect to x. Equation (3) computes the reconstruction of the input, while equation (4) computes the reconstruction error and the added regularization term from Eq.(2). .. math:: h_i = s(W_i x + b_i) (1) J_i = h_i (1 - h_i) * W_i (2) x' = s(W' h + b') (3) L = -sum_{k=1}^d [x_k \log x'_k + (1-x_k) \log( 1-x'_k)] + lambda * sum_{i=1}^d sum_{j=1}^n J_{ij}^2 (4) """ def __init__(self, numpy_rng, input=None, n_visible=784, n_hidden=100, n_batchsize=1, W=None, bhid=None, bvis=None): """Initialize the cA class by specifying the number of visible units (the dimension d of the input ), the number of hidden units ( the dimension d' of the latent or hidden space ) and the contraction level. The constructor also receives symbolic variables for the input, weights and bias. 
:type numpy_rng: numpy.random.RandomState :param numpy_rng: number random generator used to generate weights :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams :param theano_rng: Theano random generator; if None is given one is generated based on a seed drawn from `rng` :type input: theano.tensor.TensorType :param input: a symbolic description of the input or None for standalone cA :type n_visible: int :param n_visible: number of visible units :type n_hidden: int :param n_hidden: number of hidden units :type n_batchsize int :param n_batchsize: number of examples per batch :type W: theano.tensor.TensorType :param W: Theano variable pointing to a set of weights that should be shared belong the dA and another architecture; if dA should be standalone set this to None :type bhid: theano.tensor.TensorType :param bhid: Theano variable pointing to a set of biases values (for hidden units) that should be shared belong dA and another architecture; if dA should be standalone set this to None :type bvis: theano.tensor.TensorType :param bvis: Theano variable pointing to a set of biases values (for visible units) that should be shared belong dA and another architecture; if dA should be standalone set this to None """ self.n_visible = n_visible self.n_hidden = n_hidden self.n_batchsize = n_batchsize # note : W' was written as `W_prime` and b' as `b_prime` if not W: # W is initialized with `initial_W` which is uniformely sampled # from -4*sqrt(6./(n_visible+n_hidden)) and # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if # converted using asarray to dtype # theano.config.floatX so that the code is runable on GPU initial_W = numpy.asarray(numpy_rng.uniform( low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), high=4 * numpy.sqrt(6. / (n_hidden + n_visible)), size=(n_visible, n_hidden)), dtype=theano.config.floatX) W = theano.shared(value=initial_W, name='W', borrow=True) if not bvis: bvis = theano.shared(value=numpy.zeros(n_visible, dtype=theano.config.floatX), borrow=True) if not bhid: bhid = theano.shared(value=numpy.zeros(n_hidden, dtype=theano.config.floatX), name='b', borrow=True) self.W = W # b corresponds to the bias of the hidden self.b = bhid # b_prime corresponds to the bias of the visible self.b_prime = bvis # tied weights, therefore W_prime is W transpose self.W_prime = self.W.T # if no input is given, generate a variable representing the input if input == None: # we use a matrix because we expect a minibatch of several # examples, each example being a row self.x = T.dmatrix(name='input') else: self.x = input self.params = [self.W, self.b, self.b_prime] def get_hidden_values(self, input): #激發函數爲sigmoid看,這裏只向前進一次 """ Computes the values of the hidden layer """ return T.nnet.sigmoid(T.dot(input, self.W) + self.b) def get_jacobian(self, hidden, W): """Computes the jacobian of the hidden layer with respect to the input, reshapes are necessary for broadcasting the element-wise product on the right axis """ return T.reshape(hidden * (1 - hidden), #計算雅克比矩陣,先將h(1-h)變成3維矩陣,而後將w也變成3維矩陣,而後將這2個3維矩陣 (self.n_batchsize, 1, self.n_hidden)) * T.reshape( #對應元素相乘,但怎麼感受2個矩陣尺寸不對應呢? 
W, (1, self.n_visible, self.n_hidden)) def get_reconstructed_input(self, hidden): #重構輸入時得到的輸出端數據 """Computes the reconstructed input given the values of the hidden layer """ return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) def get_cost_updates(self, contraction_level, learning_rate): """ This function computes the cost and the updates for one trainng step of the cA """ y = self.get_hidden_values(self.x) z = self.get_reconstructed_input(y) J = self.get_jacobian(y, self.W) # note : we sum over the size of a datapoint; if we are using # minibatches, L will be a vector, with one entry per # example in minibatch self.L_rec = - T.sum(self.x * T.log(z) + #交叉熵做爲重構偏差(當輸入是[0,1],且是sigmoid時能夠採用) (1 - self.x) * T.log(1 - z), axis=1) # Compute the jacobian and average over the number of samples/minibatch self.L_jacob = T.sum(J ** 2) / self.n_batchsize # note : L is now a vector, where each element is the # cross-entropy cost of the reconstruction of the # corresponding example of the minibatch. We need to # compute the average of all these to get the cost of # the minibatch cost = T.mean(self.L_rec) + contraction_level * T.mean(self.L_jacob) # compute the gradients of the cost of the `cA` with respect # to its parameters gparams = T.grad(cost, self.params) #Theano特有的功能,自動求導 # generate the list of updates updates = [] for param, gparam in zip(self.params, gparams): updates.append((param, param - learning_rate * gparam)) #SGD算法 return (cost, updates) def test_cA(learning_rate=0.01, training_epochs=20, dataset='./data/mnist.pkl.gz', batch_size=10, output_folder='cA_plots', contraction_level=.1): """ This demo is tested on MNIST :type learning_rate: float :param learning_rate: learning rate used for training the contracting AutoEncoder :type training_epochs: int :param training_epochs: number of epochs used for training :type dataset: string :param dataset: path to the picked dataset """ datasets = load_data(dataset) train_set_x, train_set_y = datasets[0] # compute number of minibatches for training, validation and testing n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size #標識borrow=True表示不須要複製樣本 # allocate symbolic variables for the data index = T.lscalar() # index to a [mini]batch x = T.matrix('x') # the data is presented as rasterized images if not os.path.isdir(output_folder): os.makedirs(output_folder) os.chdir(output_folder) #################################### # BUILDING THE MODEL # #################################### rng = numpy.random.RandomState(123) ca = cA(numpy_rng=rng, input=x, n_visible=28 * 28, n_hidden=500, n_batchsize=batch_size) #500個隱含層節點 cost, updates = ca.get_cost_updates(contraction_level=contraction_level, #update裏面裝的是參數的更新過程 learning_rate=learning_rate) train_ca = theano.function([index], [T.mean(ca.L_rec), ca.L_jacob], #定義函數,輸入爲batch的索引,輸出爲該batch下的重構偏差和雅克比偏差 updates=updates, givens={x: train_set_x[index * batch_size: (index + 1) * batch_size]}) start_time = time.clock() ############ # TRAINING # ############ # go through training epochs for epoch in xrange(training_epochs): #循環20次 # go through trainng set c = [] for batch_index in xrange(n_train_batches): c.append(train_ca(batch_index)) #計算loss值,計算過程當中其實也一直在更新updates權值 c_array = numpy.vstack(c) #vstack()爲將矩陣序列c按照每行疊加,從新構造一個矩陣 print 'Training epoch %d, reconstruction cost ' % epoch, numpy.mean( c_array[0]), ' jacobian norm ', numpy.mean(numpy.sqrt(c_array[1])) end_time = time.clock() training_time = (end_time - start_time) #下面是顯示和保存學習到的權值結果 print >> sys.stderr, ('The code for file ' + 
os.path.split(__file__)[1] + ' ran for %.2fm' % ((training_time) / 60.)) image = PIL.Image.fromarray(tile_raster_images( X=ca.W.get_value(borrow=True).T, img_shape=(28, 28), tile_shape=(10, 10), tile_spacing=(1, 1))) image.save('cae_filters.png') os.chdir('../') if __name__ == '__main__': test_cA()
Running the original program for 20 epochs took more than 6 hours; the reconstruction error term and the contraction term evolved as follows:
... loading data
Training epoch 0, reconstruction cost 589.571872577 jacobian norm 20.9938791886
Training epoch 1, reconstruction cost 115.13390224 jacobian norm 10.673699659
Training epoch 2, reconstruction cost 101.291018001 jacobian norm 10.134422748
Training epoch 3, reconstruction cost 94.220284334 jacobian norm 9.84685383242
Training epoch 4, reconstruction cost 89.5890225412 jacobian norm 9.64736166807
Training epoch 5, reconstruction cost 86.1490384385 jacobian norm 9.49857669084
Training epoch 6, reconstruction cost 83.4664242016 jacobian norm 9.38143172793
Training epoch 7, reconstruction cost 81.3512907826 jacobian norm 9.28327421556
Training epoch 8, reconstruction cost 79.6482831506 jacobian norm 9.19748922967
Training epoch 9, reconstruction cost 78.2066659332 jacobian norm 9.12143982155
Training epoch 10, reconstruction cost 76.9456192804 jacobian norm 9.05343287129
Training epoch 11, reconstruction cost 75.8435863545 jacobian norm 8.99151663486
Training epoch 12, reconstruction cost 74.8999458491 jacobian norm 8.9338049163
Training epoch 13, reconstruction cost 74.1060022563 jacobian norm 8.87925367541
Training epoch 14, reconstruction cost 73.4415396294 jacobian norm 8.8291852146
Training epoch 15, reconstruction cost 72.879630175 jacobian norm 8.78442892358
Training epoch 16, reconstruction cost 72.3729563995 jacobian norm 8.74324402838
Training epoch 17, reconstruction cost 71.8622392555 jacobian norm 8.70262903409
Training epoch 18, reconstruction cost 71.3049790204 jacobian norm 8.66103980493
Training epoch 19, reconstruction cost 70.6462751293 jacobian norm 8.61777944201
References:
S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML 2011.