Preface:
Pylearn2 is said to be a library well suited to deep learning work. It is built on top of Theano, supports GPU computation (I will probably only get to play with that later; I don't have the hardware right now), and comes from Bengio's group of DL heavyweights. Since Pylearn2 already integrates a number of common DL algorithms, and I really want to read the source and the details of those algorithms, I decided to learn how to use it. Chinese-language material on this is extremely scarce online, so even though this post has little original content, I am publishing it anyway in the hope that it helps some beginners.
From Bengio's paper Pylearn2: a machine learning research library, one can see that Pylearn2 is aimed mainly at machine learning developers (meaning that users of the library are expected to have some machine learning background). With Pylearn2 you can flexibly design your own machine learning models and algorithms, and it is quite extensible (exactly how is still unclear to me). According to the feature list on the official site, Pylearn2 ships with a set of common dataset modules, model modules, and training algorithm modules. The dataset modules cover the usual suspects: MNIST, CIFAR10, CIFAR100, STL10, NORB, and so on. The DL model modules include the RBM family, the AutoEncoder family, LCC, maxout, etc. The training algorithm modules are mainly the SGD family.
A brief introduction to installing Pylearn2:
OK, down to business. First, installing the library; I am running on 64-bit Ubuntu 13.10.
1. Before anything else you need to install Theano (a Python library for symbolic computation, similar to Numpy but stronger at multi-dimensional array handling). For the installation, see Installing Theano (Bleeding-edge install instructions); it links to the Ubuntu instructions, and following them step by step should work (google your way through whatever problems come up!). One thing worth mentioning: after a successful install you need to upgrade Theano to the bleeding-edge development version, because Pylearn2 relies on new features of development Theano. For the upgrade itself, see the "Bleeding-edge install instructions" section of that page.
2. For installing Pylearn2 you can follow the post pylearn2安裝及測試 (lucktroy's CSDN blog). There are three main steps:
a. Open a terminal in the directory where you want Pylearn2 installed and run:
git clone git://github.com/lisa-lab/pylearn2.git
b. Configure the environment variable for the directory Pylearn2 uses for data (when running the standard experiments, put the datasets there). Run vim ~/.bashrc and append this line at the end of .bashrc: export PYLEARN2_DATA_PATH=YourPath/data, then save and quit; YourPath is the full path of wherever you want the data to live. Finally run source ~/.bashrc so the change takes effect.
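To confirm the variable actually took effect, a quick check of my own (not part of the official install steps) from a python shell:

import os

# Should print YourPath/data; None means the export was not picked up,
# e.g. the shell was not re-sourced or it was set under a different user.
print(os.environ.get('PYLEARN2_DATA_PATH'))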
c. cd into the pylearn2 directory (it will be there after the git clone) and run: python setup.py build
Running the Quick-start example:
With Pylearn2 installed I wanted to run a sample right away, and picked the GRBM example; see the official Quick-start example tutorial. The example has three main steps (if something goes wrong along the way, the appendix of this post may be of some help):
Step 1: create the dataset.
In the directory YourPath/pylearn2/scripts/tutorials/grbm_smd/, run: python make_dataset.py
As the source of make_dataset.py shows, it uses the CIFAR10 image set (http://www.cs.toronto.edu/~kriz/cifar.html): 32*32 color images, 50k training samples and 10k test samples. The GRBM is trained on 8*8 patches, 150k of them in total. The script also applies some preprocessing to the images, e.g. ZCA whitening. Finally the preprocessed result is saved as a pickle file (pickle is the Python module for serialization; it stores data to disk in .pkl format so the data can be reloaded later): cifar10_preprocessed_train.pkl.
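If I read make_dataset.py correctly, the saved file can be reloaded later like this (a small sketch of mine; serial.load is pylearn2's thin wrapper around pickle):

from pylearn2.utils import serial

dataset = serial.load('cifar10_preprocessed_train.pkl')
# For this tutorial the design matrix should be 150000 patches by
# 192 columns (8*8 pixels * 3 color channels).
print(dataset.X.shape)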
Step 2: train the GRBM model parameters.
The command (still in the same directory as before) is: python ../../train.py cifar_grbm_smd.yaml
The file cifar_grbm_smd.yaml is the configuration file of this experiment; it sets the parameters of the three modules: data, model, and algorithm. The yaml file is how we talk to pylearn2: if your experiment uses the common deep learning models and the common optimization algorithms, configuring this one .yaml file is all you need, which saves a lot of work. Below is the cifar_grbm_smd.yaml code with its comments; for a short introduction to yaml syntax see YAML for Pylearn2. Also, if you want to learn about GRBMs, the post DeepLearning(深度學習)原理與實現(四) is well written.
# pylearn2 tutorial example: cifar_grbm_smd.yaml by Ian Goodfellow
#
# Read the README file before reading this file
#
# This is an example of yaml file, which is the main way that an experimenter
# interacts with pylearn2.
#
# A yaml file is very similar to a python dictionary, with a bit of extra
# syntax.
# The !obj tag allows us to create a specific class of object. The text after
# the : indicates what class should be loaded. This is followed by a pair of
# braces containing the arguments to that class's __init__ method.
#
# Here, we allocate a Train object, which represents the main loop of the
# training script. The train script will run this loop repeatedly. Each time
# through the loop, the model is trained on data from a training dataset, then
# saved to file.
!obj:pylearn2.train.Train {
    # The !pkl tag is used to create an object from a pkl file. Here we retrieve
    # the dataset made by make_dataset.py and use it as our training dataset.
    dataset: !pkl: "cifar10_preprocessed_train.pkl",
    # Next we make the model to be trained. It is a Binary Gaussian RBM
    model: !obj:pylearn2.models.rbm.GaussianBinaryRBM {
        # The RBM needs 192 visible units (its inputs are 8x8 patches with 3
        # color channels)
        nvis : 192,
        # We'll use 400 hidden units for this RBM. That's a small number but we
        # want this example script to train quickly.
        nhid : 400,
        # The elements of the weight matrices of the RBM will be drawn
        # independently from U(-0.05, 0.05)
        irange : 0.05,
        # There are many ways to parameterize a GRBM. Here we use a
        # parameterization that makes the correspondence to denoising
        # autoencoders more clear.
        energy_function_class : !obj:pylearn2.energy_functions.rbm_energy.grbm_type_1 {},
        # Some learning algorithms are capable of estimating the standard
        # deviation of the visible units of a GRBM successfully, others are not
        # and just fix the standard deviation to 1. We're going to show off
        # and learn the standard deviation.
        learn_sigma : True,
        # Learning works better if we provide a smart initialization for the
        # parameters. Here we start sigma at .4 , which is about the same
        # standard deviation as the training data. We start the biases on the
        # hidden units at -2, which will make them have fairly sparse
        # activations.
        init_sigma : .4,
        init_bias_hid : -2.,
        # Some GRBM training algorithms can't handle the visible units being
        # noisy and just use their mean for all computations. We will show off
        # and not use that hack here.
        mean_vis : False,
        # One hack we will make is we will scale back the gradient steps on the
        # sigma parameter. This way we don't need to worry about sigma getting
        # too small prematurely (if it gets too small too fast the learning
        # signal gets weak).
        sigma_lr_scale : 1e-3
    },
    # Next we need to specify the training algorithm that will be used to train
    # the model. Here we use stochastic gradient descent.
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        # The learning rate determines how big of steps the learning algorithm
        # takes. Here we use fairly big steps initially because we have a
        # learning rate adjustment scheme that will scale them down if
        # necessary.
        learning_rate : 1e-1,
        # Each gradient step will be based on this many examples
        batch_size : 5,
        # We'll monitor our progress by looking at the first 20 batches of the
        # training dataset. This is an estimate of the training error. To be
        # really exhaustive, we could use the entire training set instead,
        # or to avoid overfitting, we could use held out data instead.
        monitoring_batches : 20,
        monitoring_dataset : !pkl: "cifar10_preprocessed_train.pkl",
        # Here we specify the objective function that stochastic gradient
        # descent should minimize. In this case we use denoising score
        # matching, which makes this RBM behave as a denoising autoencoder.
        # See
        # Pascal Vincent. "A Connection Between Score Matching and Denoising
        # Autoencoders." Neural Computation, 2011
        # for details.
        cost : !obj:pylearn2.costs.ebm_estimation.SMD {
            # Denoising score matching uses a corruption process to transform
            # the raw data. Here we use additive gaussian noise.
            corruptor : !obj:pylearn2.corruption.GaussianCorruptor {
                stdev : 0.4
            },
        },
        # We'll use the monitoring dataset to figure out when to stop training.
        #
        # In this case, we stop if there is less than a 1% decrease in the
        # training error in the last epoch. You'll notice that the learned
        # features are a bit noisy. If you'd like nice smooth features you can
        # make this criterion stricter so that the model will train for longer.
        # (setting N to 10 should make the weights prettier, but will make it
        # run a lot longer)
        termination_criterion : !obj:pylearn2.termination_criteria.MonitorBased {
            prop_decrease : 0.01,
            N : 1,
        },
        # Let's throw a learning rate adjuster into the training algorithm.
        # To do this we'll use an "extension," which is basically an event
        # handler that can be registered with the Train object.
        # This particular one is triggered on each epoch.
        # It will shrink the learning rate if the objective goes up and increase
        # the learning rate if the objective decreases too slowly. This makes
        # our learning rate hyperparameter less important to get right.
        # This is not a very mathematically principled approach, but it works
        # well in practice.
    },
    extensions : [!obj:pylearn2.training_algorithms.sgd.MonitorBasedLRAdjuster {}],
    # Finally, request that the model be saved after each epoch
    save_freq : 1
}
As the yaml file above shows, yaml content is rather like a python dictionary: each key maps to a value. The keys correspond to the parameters of the matching class's __init__() constructor; in other words, the values are passed into those constructors and received by the resulting objects. In the yaml above, the dataset comes from the cifar10_preprocessed_train.pkl file of Step 1, the model comes from the pylearn2.models.rbm.GaussianBinaryRBM class, and the algorithm comes from the pylearn2.training_algorithms.sgd.SGD class.
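So the model section of the yaml should be roughly equivalent to constructing the object by hand, as in this sketch (the keyword names are copied straight from the yaml; I have not re-checked them against every pylearn2 version):

from pylearn2.models.rbm import GaussianBinaryRBM
from pylearn2.energy_functions.rbm_energy import grbm_type_1

# Mirrors the model: block of cifar_grbm_smd.yaml; each yaml key becomes
# a keyword argument of __init__().
model = GaussianBinaryRBM(nvis=192, nhid=400, irange=0.05,
                          energy_function_class=grbm_type_1(),
                          learn_sigma=True, init_sigma=.4,
                          init_bias_hid=-2., mean_vis=False,
                          sigma_lr_scale=1e-3)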
Once the .yaml file is configured, we launch the program that actually trains the parameters. That is exactly what train.py does; its code is:
#!/usr/bin/env python
"""
Script implementing the logic for training pylearn2 models.

This is intended to be a "driver" for most training experiments. A user
specifies an object hierarchy in a configuration file using a dictionary-like
syntax and this script takes care of the rest.

For example configuration files that are consumable by this script, see
    pylearn2/scripts/train_example
    pylearn2/scripts/autoencoder_example
"""
__authors__ = "Ian Goodfellow"
__copyright__ = "Copyright 2010-2012, Universite de Montreal"
__credits__ = ["Ian Goodfellow", "David Warde-Farley"]
__license__ = "3-clause BSD"
__maintainer__ = "Ian Goodfellow"
__email__ = "goodfeli@iro"

# Standard library imports
import argparse
import gc
import logging
import os

# Third-party imports
import numpy as np

# Local imports
from pylearn2.utils import serial
from pylearn2.utils.logger import (
    CustomStreamHandler, CustomFormatter, restore_defaults
)


class FeatureDump(object):
    def __init__(self, encoder, dataset, path, batch_size=None, topo=False):
        self.encoder = encoder
        self.dataset = dataset
        self.path = path
        self.batch_size = batch_size
        self.topo = topo

    def main_loop(self):
        if self.batch_size is None:
            if self.topo:
                data = self.dataset.get_topological_view()
            else:
                data = self.dataset.get_design_matrix()
            output = self.encoder.perform(data)
        else:
            myiterator = self.dataset.iterator(mode='sequential',
                                               batch_size=self.batch_size,
                                               topo=self.topo)
            chunks = []
            for data in myiterator:
                chunks.append(self.encoder.perform(data))
            output = np.concatenate(chunks)
        np.save(self.path, output)


def make_argument_parser():
    # The parser is what receives the command-line arguments.
    parser = argparse.ArgumentParser(
        description="Launch an experiment from a YAML configuration file.",
        epilog='\n'.join(__doc__.strip().split('\n')[1:]).strip(),
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument('--level-name', '-L',
                        action='store_true',
                        help='Display the log level (e.g. DEBUG, INFO) '
                             'for each logged message')
    parser.add_argument('--timestamp', '-T',
                        action='store_true',
                        help='Display human-readable timestamps for '
                             'each logged message')
    parser.add_argument('--verbose-logging', '-V',
                        action='store_true',
                        help='Display timestamp, log level and source '
                             'logger for every logged message '
                             '(implies -T).')
    parser.add_argument('--debug', '-D',
                        action='store_true',
                        help='Display any DEBUG-level log messages, '
                             'suppressed by default.')
    # Positional argument: the value given on the command line is stored
    # in config.
    parser.add_argument('config', action='store',
                        choices=None,
                        help='A YAML configuration file specifying the '
                             'training procedure')
    return parser


if __name__ == "__main__":
    parser = make_argument_parser()
    # Read the arguments passed in; the yaml file given on the command
    # line ends up in args.config.
    args = parser.parse_args()
    # serial.load_train_file() ends with:
    #     return yaml_parse.load_path(config_file_path)
    # i.e. it delegates to yaml_parse.load_path() (yaml_parse lives in
    # pylearn2.config) and returns an object of the Train class.
    train_obj = serial.load_train_file(args.config)

    try:
        iter(train_obj)   # iter() probes whether train_obj is iterable
        iterable = True
    except TypeError as e:
        iterable = False

    # Undo our custom logging setup.
    restore_defaults()
    # Set up the root logger with a custom handler that logs stdout for INFO
    # and DEBUG and stderr for WARNING, ERROR, CRITICAL.
    # logging is python's standard module for handling logs; with no name
    # given, getLogger() returns the root logger.
    root_logger = logging.getLogger()
    if args.verbose_logging:
        formatter = logging.Formatter(fmt="%(asctime)s %(name)s %(levelname)s "
                                          "%(message)s")
        handler = CustomStreamHandler(formatter=formatter)
    else:
        if args.timestamp:
            prefix = '%(asctime)s '
        else:
            prefix = ''   # empty in our case
        formatter = CustomFormatter(prefix=prefix, only_from='pylearn2')
        handler = CustomStreamHandler(formatter=formatter)
    # Attach the handler so the root logger can process the log records.
    root_logger.addHandler(handler)
    # Set the root logger level (every log message carries a level);
    # INFO is the default here.
    if args.debug:
        root_logger.setLevel(logging.DEBUG)
    else:
        root_logger.setLevel(logging.INFO)

    if iterable:
        # enumerate() is used when we need both the index and the element
        # while walking a list or array; train_obj holds the parsed yaml
        # contents, much like a dictionary.
        for number, subobj in enumerate(iter(train_obj)):
            # Publish a variable indicating the training phase.
            phase_variable = 'PYLEARN2_TRAIN_PHASE'
            phase_value = 'phase%d' % (number + 1)
            os.environ[phase_variable] = phase_value
            os.putenv(phase_variable, phase_value)
            # Execute this training phase.
            subobj.main_loop()
            # Clean up, in case there's a lot of memory used that's
            # necessary for the next phase.
            del subobj
            gc.collect()
    else:
        # train_obj already bundles the data, the model and the algorithm,
        # so calling main_loop() trains the model on the data with the
        # chosen algorithm, until the termination criterion is met.
        train_obj.main_loop()
The core of all this is the main_loop() function: once main_loop() is called, the program automatically uses the algorithm object to train the model object's parameters on the data. How exactly does this function tie data, model, and algorithm together? Let's try reading the source:
First, it is train_obj.main_loop() that connects data, model, and algorithm. From the name, train_obj should be an object of some class — presumably Pylearn2's Train class, since the pylearn2 package contains a module train.py, that file has a Train class, and that Train class has a main_loop() method. Everything fits the guess; but is it actually true?
First, where does train_obj come from (since main_loop() is called on it)? From the program above: train_obj = serial.load_train_file(args.config). Tracing into serial, the last line of serial.load_train_file() is: return yaml_parse.load_path(args.config). Tracing further, load_path() calls load(), which ultimately calls yaml.load(); per the comments in the source, it turns the .yaml configuration file into a graph of objects, and that graph should be a Train object...
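The guess is easy to verify from a python shell (run it inside the grbm_smd directory, since loading the yaml also loads the relative .pkl path it references):

from pylearn2.config import yaml_parse

train_obj = yaml_parse.load_path('cifar_grbm_smd.yaml')
print(type(train_obj))   # expect <class 'pylearn2.train.Train'>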
OK, time to look at the body of main_loop():
def main_loop(self):
    """
    Repeatedly runs an epoch of the training algorithm, runs any
    epoch-level callbacks, and saves the model.
    """
    if self.algorithm is None:
        self.model.monitor = Monitor.get_monitor(self.model)
        self.setup_extensions()
        self.run_callbacks_and_monitoring()
        while True:
            rval = self.model.train_all(dataset=self.dataset)
            if rval is not None:
                raise ValueError("Model.train_all should not return "
                                 "anything. Use Model.continue_learning "
                                 "to control whether learning continues.")
            self.model.monitor.report_epoch()
            if self.save_freq > 0 and \
                    self.model.monitor.epochs_seen % self.save_freq == 0:
                self.save()
            continue_learning = self.model.continue_learning()
            assert continue_learning in [True, False, 0, 1]
            if not continue_learning:
                break
    else:
        # This line is what ties the model and the dataset together.
        self.algorithm.setup(model=self.model, dataset=self.dataset)
        # This hooks up the extensions entry of the .yaml file.
        self.setup_extensions()
        if not hasattr(self.model, 'monitor'):
            # TODO: is this really necessary? I just put this error here
            # to prevent an AttributeError later, but I think we could
            # rewrite to avoid the AttributeError
            raise RuntimeError("The algorithm is responsible for setting"
                               " up the Monitor, but failed to.")
        if len(self.model.monitor._datasets) > 0:
            # This monitoring channel keeps track of a shared variable,
            # which does not need inputs nor data.
            self.model.monitor.add_channel(name="monitor_seconds_per_epoch",
                                           ipt=None,
                                           val=self.monitor_time,
                                           data_specs=(NullSpace(), ''),
                                           dataset=self.model.monitor._datasets[0])
        self.run_callbacks_and_monitoring()
        while True:   # loop until the termination criterion is met
            with log_timing(log, None, final_msg='Time this epoch:',
                            callbacks=[self.monitor_time.set_value]):
                # The core call where the algorithm actually trains.
                rval = self.algorithm.train(dataset=self.dataset)
            if rval is not None:
                raise ValueError("TrainingAlgorithm.train should not "
                                 "return anything. Use "
                                 "TrainingAlgorithm.continue_learning "
                                 "to control whether learning continues.")
            self.model.monitor.report_epoch()
            self.run_callbacks_and_monitoring()
            if self.save_freq > 0 and \
                    self.model.monitor._epochs_seen % self.save_freq == 0:
                self.save()
            # Test the termination criterion.
            continue_learning = self.algorithm.continue_learning(self.model)
            assert continue_learning in [True, False, 0, 1]
            if not continue_learning:
                break
    self.model.monitor.training_succeeded = True
    if self.save_freq > 0:
        self.save()
Step 3: view the results.
This part is just viewing the results. Run: python ../../show_weights.py cifar_grbm_smd.pkl, which opens a visualization of the learned weights.
And of course you can also use plot_monitor.py to look at some of the corresponding training statistics.
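As far as I can tell, what plot_monitor.py reads is the monitor that gets saved along with the model, so you can also poke at it by hand (a sketch; the exact channel names depend on what was monitored):

from pylearn2.utils import serial

model = serial.load('cifar_grbm_smd.pkl')
# Each channel holds a per-epoch series, e.g. the objective value.
print(model.monitor.channels.keys())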
Summary:
When using the DL models already in Pylearn2 together with the optimization algorithms already in it, all we need to do is configure the experiment's .yaml file; parameter tuning is then just repeatedly editing the .yaml configuration. But if you want to use a newly proposed DL model of your own, or your own objective function and optimization method, you also have to write the corresponding classes yourself. How exactly to do that (how to implement the class, how to design the interface, what to change in the .yaml file, and so on) I have no experience with yet — if you understand this part, please do share your thoughts. There is a tutorial online that uses Pylearn2 as an ordinary python library and implements an XOR network; it is quite good: Neural network example using Pylearn2.
Also, from reading the Pylearn2 source, every algorithm must provide four functions: __init__(), setup(), train(), continue_learning() — respectively the constructor, building the network structure from the model, training the model parameters, and deciding whether training should terminate. The model module presumably has a similar set of standard functions.
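Here is my own bare skeleton of such an algorithm class, inferred from those four functions and from how Train.main_loop() calls them — not an official pylearn2 template:

class MyTrainingAlgorithm(object):
    def __init__(self, learning_rate):
        # Constructor: just record the hyperparameters.
        self.learning_rate = learning_rate

    def setup(self, model, dataset):
        # Build whatever update machinery this model needs.
        self.model = model

    def train(self, dataset):
        # Run one pass of parameter updates; must return nothing
        # (main_loop() raises if train() returns a value).
        pass

    def continue_learning(self, model):
        # Termination handling: True to keep training, False to stop.
        return False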
Appendix:
Some errors that may come up during the experiment, and how I handled them:
A:
If running python make_dataset.py gives the error:
raise IOError("permission error creating %s" % filepath)
IOError: permission error creating cifar10_preprocessed_train.pkl
then judging by the message it is a permissions problem, so switch to:
sudo python make_dataset.py
If the error then becomes:
pylearn2.datasets.exc.NoDataPathError: You need to define your PYLEARN2_DATA_PATH environment variable. If you are using a computer at LISA, this should be set to /data/lisa/data.
this says the PYLEARN2_DATA_PATH environment variable is not set — but we set it earlier! Why? Possibly because the environment variable was set under one user while the command now runs under another (root's environment does not see your user's .bashrc, and vice versa). Switching to root and running python make_dataset.py there succeeded, and cifar10_preprocessed_train.pkl was generated.
But then running ../../train.py cifar_grbm_smd.yaml failed with: ImportError: Could not import pylearn2.models but could import pylearn2. Original exception: No module named compat.python2x
At this point it is fairly clear that permissions are to blame. The fix: reinstall pylearn2 as the ordinary user, set the environment variable, put the downloaded data in place, and then run (as the ordinary user):
python make_dataset.py
which successfully generates cifar10_preprocessed_train.pkl. Annoyingly, the subsequent ../../train.py cifar_grbm_smd.yaml still fails with the same error.
That error is in fact caused by the wrong Theano version: when using pylearn2 you should be on the development version of Theano, so update Theano as described earlier in this post and the error goes away.
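To check which Theano you have (development builds report a version string containing 'dev', something like '0.6.0rc3.dev-...'):

import theano

print(theano.__version__)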
B.
If, at the weight-visualization stage, running sudo python ../../show_weights.py cifar_grbm_smd.pkl prints something like:
You need to choose an image viewer program that pylearn2 should use. Then
tell pylearn2 to use that image viewer program by defining your
PYLEARN2_VIEWER_COMMAND environment variable. You need to choose
PYLEARN2_VIEWER_COMMAND such that running
    ${PYLEARN2_VIEWER_COMMAND} image.png
in a command prompt on your machine will do the following:
    - open an image viewer in a new process.
    - not return until you have closed the image.
Acceptable commands include:
    gwenview
    eog --new-instance
This is assuming that you have gwenview or a version of eog that supports
--new-instance
...
this means no image-viewing program has been configured for pylearn2. First install gwenview: sudo apt-get install gwenview
Then set the PYLEARN2_VIEWER_COMMAND environment variable: vim ~/.bashrc and append a line pointing at the gwenview binary. With the default install location, the line I added is:
export PYLEARN2_VIEWER_COMMAND=/usr/bin/gwenview
Save, then run source ~/.bashrc.
References:
Pylearn2: a machine learning research library
Installing Theano (Bleeding-edge install instructions)
pylearn2安裝及測試 (lucktroy's CSDN blog)