Preface:
Pylearn2 is said to be a library well suited to deep learning work. It is built on top of Theano, supports GPU computation (I will probably only get to play with that later; I don't have the hardware right now), and comes from Bengio's group of DL heavyweights. Since Pylearn2 already integrates a number of common DL algorithms, and I really want to read the source and the details of those algorithms, I decided to learn how to use it. Chinese-language material on this is extremely scarce online, so even though this post has little original content, I am publishing it anyway in the hope that it helps some beginners.
From Bengio's paper Pylearn2: a machine learning research library, one can see that Pylearn2 is aimed mainly at machine learning developers (meaning that users of the library are expected to have some machine learning background). With Pylearn2 you can flexibly design your own machine learning models and algorithms, and it is quite extensible (exactly how is still unclear to me). According to the feature list on the official site, Pylearn2 ships with a set of common dataset modules, model modules, and training algorithm modules. The dataset modules cover the usual suspects: MNIST, CIFAR10, CIFAR100, STL10, NORB, and so on. The DL model modules include the RBM family, the AutoEncoder family, LCC, maxout, etc. The training algorithm modules are mainly the SGD family.
A brief introduction to installing Pylearn2:
OK, down to business. First, installing the library; I am running on 64-bit Ubuntu 13.10.
1. Before anything else you need to install Theano (a Python library for symbolic computation, similar to Numpy but stronger at multi-dimensional array handling). For the installation, see Installing Theano (Bleeding-edge install instructions); it links to the Ubuntu instructions, and following them step by step should work (google your way through whatever problems come up!). One thing worth mentioning: after a successful install you need to upgrade Theano to the bleeding-edge development version, because Pylearn2 relies on new features of development Theano. For the upgrade itself, see the "Bleeding-edge install instructions" section of that page.
2. For installing Pylearn2 you can follow the post pylearn2安裝及測試 (lucktroy's CSDN blog). There are three main steps:
a. Open a terminal in the directory where you want Pylearn2 installed and run:
git clone git://github.com/lisa-lab/pylearn2.git
b. Configure the environment variable for the directory Pylearn2 uses for data (when running the standard experiments, put the datasets there). Run vim ~/.bashrc and append this line at the end of .bashrc: export PYLEARN2_DATA_PATH=YourPath/data, then save and quit; YourPath is the full path of wherever you want the data to live. Finally run source ~/.bashrc so the change takes effect.
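To confirm the variable actually took effect, a quick check of my own (not part of the official install steps) from a python shell:

import os

# Should print YourPath/data; None means the export was not picked up,
# e.g. the shell was not re-sourced or it was set under a different user.
print(os.environ.get('PYLEARN2_DATA_PATH'))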
c. cd into the pylearn2 directory (it will be there after the git clone) and run: python setup.py build
Running the Quick-start example:
With Pylearn2 installed I wanted to run a sample right away, and picked the GRBM example; see the official Quick-start example tutorial. The example has three main steps (if something goes wrong along the way, the appendix of this post may be of some help):
Step 1: create the dataset.
In the directory YourPath/pylearn2/scripts/tutorials/grbm_smd/, run: python make_dataset.py
As the source of make_dataset.py shows, it uses the CIFAR10 image set (http://www.cs.toronto.edu/~kriz/cifar.html): 32*32 color images, 50k training samples and 10k test samples. The GRBM is trained on 8*8 patches, 150k of them in total. The script also applies some preprocessing to the images, e.g. ZCA whitening. Finally the preprocessed result is saved as a pickle file (pickle is the Python module for serialization; it stores data to disk in .pkl format so the data can be reloaded later): cifar10_preprocessed_train.pkl.
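If I read make_dataset.py correctly, the saved file can be reloaded later like this (a small sketch of mine; serial.load is pylearn2's thin wrapper around pickle):

from pylearn2.utils import serial

dataset = serial.load('cifar10_preprocessed_train.pkl')
# For this tutorial the design matrix should be 150000 patches by
# 192 columns (8*8 pixels * 3 color channels).
print(dataset.X.shape)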
Step 2: train the GRBM model parameters.
The command (still in the same directory as before) is: python ../../train.py cifar_grbm_smd.yaml
The file cifar_grbm_smd.yaml is the configuration file of this experiment; it sets the parameters of the three modules: data, model, and algorithm. The yaml file is how we talk to pylearn2: if your experiment uses the common deep learning models and the common optimization algorithms, configuring this one .yaml file is all you need, which saves a lot of work. Below is the cifar_grbm_smd.yaml code with its comments; for a short introduction to yaml syntax see YAML for Pylearn2. Also, if you want to learn about GRBMs, the post DeepLearning(深度學習)原理與實現(四) is well written.
# pylearn2 tutorial example: cifar_grbm_smd.yaml by Ian Goodfellow
#
# Read the README file before reading this file
#
# This is an example of yaml file, which is the main way that an experimenter
# interacts with pylearn2.
#
# A yaml file is very similar to a python dictionary, with a bit of extra
# syntax.
# The !obj tag allows us to create a specific class of object. The text after
# the : indicates what class should be loaded. This is followed by a pair of
# braces containing the arguments to that class's __init__ method.
#
# Here, we allocate a Train object, which represents the main loop of the
# training script. The train script will run this loop repeatedly. Each time
# through the loop, the model is trained on data from a training dataset, then
# saved to file.
!obj:pylearn2.train.Train {
    # The !pkl tag is used to create an object from a pkl file. Here we retrieve
    # the dataset made by make_dataset.py and use it as our training dataset.
    dataset: !pkl: "cifar10_preprocessed_train.pkl",
    # Next we make the model to be trained. It is a Binary Gaussian RBM
    model: !obj:pylearn2.models.rbm.GaussianBinaryRBM {
        # The RBM needs 192 visible units (its inputs are 8x8 patches with 3
        # color channels)
        nvis : 192,
        # We'll use 400 hidden units for this RBM. That's a small number but we
        # want this example script to train quickly.
        nhid : 400,
        # The elements of the weight matrices of the RBM will be drawn
        # independently from U(-0.05, 0.05)
        irange : 0.05,
        # There are many ways to parameterize a GRBM. Here we use a
        # parameterization that makes the correspondence to denoising
        # autoencoders more clear.
        energy_function_class : !obj:pylearn2.energy_functions.rbm_energy.grbm_type_1 {},
        # Some learning algorithms are capable of estimating the standard
        # deviation of the visible units of a GRBM successfully, others are not
        # and just fix the standard deviation to 1. We're going to show off
        # and learn the standard deviation.
        learn_sigma : True,
        # Learning works better if we provide a smart initialization for the
        # parameters. Here we start sigma at .4 , which is about the same
        # standard deviation as the training data. We start the biases on the
        # hidden units at -2, which will make them have fairly sparse
        # activations.
        init_sigma : .4,
        init_bias_hid : -2.,
        # Some GRBM training algorithms can't handle the visible units being
        # noisy and just use their mean for all computations. We will show off
        # and not use that hack here.
        mean_vis : False,
        # One hack we will make is we will scale back the gradient steps on the
        # sigma parameter. This way we don't need to worry about sigma getting
        # too small prematurely (if it gets too small too fast the learning
        # signal gets weak).
        sigma_lr_scale : 1e-3
    },
    # Next we need to specify the training algorithm that will be used to train
    # the model. Here we use stochastic gradient descent.
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        # The learning rate determines how big of steps the learning algorithm
        # takes. Here we use fairly big steps initially because we have a
        # learning rate adjustment scheme that will scale them down if
        # necessary.
        learning_rate : 1e-1,
        # Each gradient step will be based on this many examples
        batch_size : 5,
        # We'll monitor our progress by looking at the first 20 batches of the
        # training dataset. This is an estimate of the training error. To be
        # really exhaustive, we could use the entire training set instead,
        # or to avoid overfitting, we could use held out data instead.
        monitoring_batches : 20,
        monitoring_dataset : !pkl: "cifar10_preprocessed_train.pkl",
        # Here we specify the objective function that stochastic gradient
        # descent should minimize. In this case we use denoising score
        # matching, which makes this RBM behave as a denoising autoencoder.
        # See
        # Pascal Vincent. "A Connection Between Score Matching and Denoising
        # Autoencoders." Neural Computation, 2011
        # for details.
        cost : !obj:pylearn2.costs.ebm_estimation.SMD {
            # Denoising score matching uses a corruption process to transform
            # the raw data. Here we use additive gaussian noise.
            corruptor : !obj:pylearn2.corruption.GaussianCorruptor {
                stdev : 0.4
            },
        },
        # We'll use the monitoring dataset to figure out when to stop training.
        #
        # In this case, we stop if there is less than a 1% decrease in the
        # training error in the last epoch. You'll notice that the learned
        # features are a bit noisy. If you'd like nice smooth features you can
        # make this criterion stricter so that the model will train for longer.
        # (setting N to 10 should make the weights prettier, but will make it
        # run a lot longer)
        termination_criterion : !obj:pylearn2.termination_criteria.MonitorBased {
            prop_decrease : 0.01,
            N : 1,
        },
        # Let's throw a learning rate adjuster into the training algorithm.
        # To do this we'll use an "extension," which is basically an event
        # handler that can be registered with the Train object.
        # This particular one is triggered on each epoch.
        # It will shrink the learning rate if the objective goes up and increase
        # the learning rate if the objective decreases too slowly. This makes
        # our learning rate hyperparameter less important to get right.
        # This is not a very mathematically principled approach, but it works
        # well in practice.
    },
    extensions : [!obj:pylearn2.training_algorithms.sgd.MonitorBasedLRAdjuster {}],
    # Finally, request that the model be saved after each epoch
    save_freq : 1
}
As the yaml file above shows, yaml content is rather like a python dictionary: each key maps to a value. The keys correspond to the parameters of the matching class's __init__() constructor; in other words, the values are passed into those constructors and received by the resulting objects. In the yaml above, the dataset comes from the cifar10_preprocessed_train.pkl file of Step 1, the model comes from the pylearn2.models.rbm.GaussianBinaryRBM class, and the algorithm comes from the pylearn2.training_algorithms.sgd.SGD class.
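So the model section of the yaml should be roughly equivalent to constructing the object by hand, as in this sketch (the keyword names are copied straight from the yaml; I have not re-checked them against every pylearn2 version):

from pylearn2.models.rbm import GaussianBinaryRBM
from pylearn2.energy_functions.rbm_energy import grbm_type_1

# Mirrors the model: block of cifar_grbm_smd.yaml; each yaml key becomes
# a keyword argument of __init__().
model = GaussianBinaryRBM(nvis=192, nhid=400, irange=0.05,
                          energy_function_class=grbm_type_1(),
                          learn_sigma=True, init_sigma=.4,
                          init_bias_hid=-2., mean_vis=False,
                          sigma_lr_scale=1e-3)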
Once the .yaml file is configured, we launch the program that actually trains the parameters. That is exactly what train.py does; its code is:
#!/usr/bin/env python
"""
Script implementing the logic for training pylearn2 models.

This is intended to be a "driver" for most training experiments. A user
specifies an object hierarchy in a configuration file using a dictionary-like
syntax and this script takes care of the rest.

For example configuration files that are consumable by this script, see
    pylearn2/scripts/train_example
    pylearn2/scripts/autoencoder_example
"""
__authors__ = "Ian Goodfellow"
__copyright__ = "Copyright 2010-2012, Universite de Montreal"
__credits__ = ["Ian Goodfellow", "David Warde-Farley"]
__license__ = "3-clause BSD"
__maintainer__ = "Ian Goodfellow"
__email__ = "goodfeli@iro"

# Standard library imports
import argparse
import gc
import logging
import os

# Third-party imports
import numpy as np

# Local imports
from pylearn2.utils import serial
from pylearn2.utils.logger import (
    CustomStreamHandler, CustomFormatter, restore_defaults
)


class FeatureDump(object):
    def __init__(self, encoder, dataset, path, batch_size=None, topo=False):
        self.encoder = encoder
        self.dataset = dataset
        self.path = path
        self.batch_size = batch_size
        self.topo = topo

    def main_loop(self):
        if self.batch_size is None:
            if self.topo:
                data = self.dataset.get_topological_view()
            else:
                data = self.dataset.get_design_matrix()
            output = self.encoder.perform(data)
        else:
            myiterator = self.dataset.iterator(mode='sequential',
                                               batch_size=self.batch_size,
                                               topo=self.topo)
            chunks = []
            for data in myiterator:
                chunks.append(self.encoder.perform(data))
            output = np.concatenate(chunks)
        np.save(self.path, output)


def make_argument_parser():
    # The parser is what receives the command-line arguments.
    parser = argparse.ArgumentParser(
        description="Launch an experiment from a YAML configuration file.",
        epilog='\n'.join(__doc__.strip().split('\n')[1:]).strip(),
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument('--level-name', '-L',
                        action='store_true',
                        help='Display the log level (e.g. DEBUG, INFO) '
                             'for each logged message')
    parser.add_argument('--timestamp', '-T',
                        action='store_true',
                        help='Display human-readable timestamps for '
                             'each logged message')
    parser.add_argument('--verbose-logging', '-V',
                        action='store_true',
                        help='Display timestamp, log level and source '
                             'logger for every logged message '
                             '(implies -T).')
    parser.add_argument('--debug', '-D',
                        action='store_true',
                        help='Display any DEBUG-level log messages, '
                             'suppressed by default.')
    # Positional argument: the value given on the command line is stored
    # in config.
    parser.add_argument('config', action='store',
                        choices=None,
                        help='A YAML configuration file specifying the '
                             'training procedure')
    return parser


if __name__ == "__main__":
    parser = make_argument_parser()
    # Read the arguments passed in; the yaml file given on the command
    # line ends up in args.config.
    args = parser.parse_args()
    # serial.load_train_file() ends with:
    #     return yaml_parse.load_path(config_file_path)
    # i.e. it delegates to yaml_parse.load_path() (yaml_parse lives in
    # pylearn2.config) and returns an object of the Train class.
    train_obj = serial.load_train_file(args.config)

    try:
        iter(train_obj)   # iter() probes whether train_obj is iterable
        iterable = True
    except TypeError as e:
        iterable = False

    # Undo our custom logging setup.
    restore_defaults()
    # Set up the root logger with a custom handler that logs stdout for INFO
    # and DEBUG and stderr for WARNING, ERROR, CRITICAL.
    # logging is python's standard module for handling logs; with no name
    # given, getLogger() returns the root logger.
    root_logger = logging.getLogger()
    if args.verbose_logging:
        formatter = logging.Formatter(fmt="%(asctime)s %(name)s %(levelname)s "
                                          "%(message)s")
        handler = CustomStreamHandler(formatter=formatter)
    else:
        if args.timestamp:
            prefix = '%(asctime)s '
        else:
            prefix = ''   # empty in our case
        formatter = CustomFormatter(prefix=prefix, only_from='pylearn2')
        handler = CustomStreamHandler(formatter=formatter)
    # Attach the handler so the root logger can process the log records.
    root_logger.addHandler(handler)
    # Set the root logger level (every log message carries a level);
    # INFO is the default here.
    if args.debug:
        root_logger.setLevel(logging.DEBUG)
    else:
        root_logger.setLevel(logging.INFO)

    if iterable:
        # enumerate() is used when we need both the index and the element
        # while walking a list or array; train_obj holds the parsed yaml
        # contents, much like a dictionary.
        for number, subobj in enumerate(iter(train_obj)):
            # Publish a variable indicating the training phase.
            phase_variable = 'PYLEARN2_TRAIN_PHASE'
            phase_value = 'phase%d' % (number + 1)
            os.environ[phase_variable] = phase_value
            os.putenv(phase_variable, phase_value)
            # Execute this training phase.
            subobj.main_loop()
            # Clean up, in case there's a lot of memory used that's
            # necessary for the next phase.
            del subobj
            gc.collect()
    else:
        # train_obj already bundles the data, the model and the algorithm,
        # so calling main_loop() trains the model on the data with the
        # chosen algorithm, until the termination criterion is met.
        train_obj.main_loop()
The core of all this is the main_loop() function: once main_loop() is called, the program automatically uses the algorithm object to train the model object's parameters on the data. How exactly does this function tie data, model, and algorithm together? Let's try reading the source:
First, it is train_obj.main_loop() that connects data, model, and algorithm. From the name, train_obj should be an object of some class — presumably Pylearn2's Train class, since the pylearn2 package contains a module train.py, that file has a Train class, and that Train class has a main_loop() method. Everything fits the guess; but is it actually true?
First, where does train_obj come from (since main_loop() is called on it)? From the program above: train_obj = serial.load_train_file(args.config). Tracing into serial, the last line of serial.load_train_file() is: return yaml_parse.load_path(args.config). Tracing further, load_path() calls load(), which ultimately calls yaml.load(); per the comments in the source, it turns the .yaml configuration file into a graph of objects, and that graph should be a Train object...
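The guess is easy to verify from a python shell (run it inside the grbm_smd directory, since loading the yaml also loads the relative .pkl path it references):

from pylearn2.config import yaml_parse

train_obj = yaml_parse.load_path('cifar_grbm_smd.yaml')
print(type(train_obj))   # expect <class 'pylearn2.train.Train'>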
OK, time to look at the body of main_loop():
def main_loop(self):
    """
    Repeatedly runs an epoch of the training algorithm, runs any
    epoch-level callbacks, and saves the model.
    """
    if self.algorithm is None:
        self.model.monitor = Monitor.get_monitor(self.model)
        self.setup_extensions()
        self.run_callbacks_and_monitoring()
        while True:
            rval = self.model.train_all(dataset=self.dataset)
            if rval is not None:
                raise ValueError("Model.train_all should not return "
                                 "anything. Use Model.continue_learning "
                                 "to control whether learning continues.")
            self.model.monitor.report_epoch()
            if self.save_freq > 0 and \
                    self.model.monitor.epochs_seen % self.save_freq == 0:
                self.save()
            continue_learning = self.model.continue_learning()
            assert continue_learning in [True, False, 0, 1]
            if not continue_learning:
                break
    else:
        # This line is what ties the model and the dataset together.
        self.algorithm.setup(model=self.model, dataset=self.dataset)
        # This hooks up the extensions entry of the .yaml file.
        self.setup_extensions()
        if not hasattr(self.model, 'monitor'):
            # TODO: is this really necessary? I just put this error here
            # to prevent an AttributeError later, but I think we could
            # rewrite to avoid the AttributeError
            raise RuntimeError("The algorithm is responsible for setting"
                               " up the Monitor, but failed to.")
        if len(self.model.monitor._datasets) > 0:
            # This monitoring channel keeps track of a shared variable,
            # which does not need inputs nor data.
            self.model.monitor.add_channel(name="monitor_seconds_per_epoch",
                                           ipt=None,
                                           val=self.monitor_time,
                                           data_specs=(NullSpace(), ''),
                                           dataset=self.model.monitor._datasets[0])
        self.run_callbacks_and_monitoring()
        while True:   # loop until the termination criterion is met
            with log_timing(log, None, final_msg='Time this epoch:',
                            callbacks=[self.monitor_time.set_value]):
                # The core call where the algorithm actually trains.
                rval = self.algorithm.train(dataset=self.dataset)
            if rval is not None:
                raise ValueError("TrainingAlgorithm.train should not "
                                 "return anything. Use "
                                 "TrainingAlgorithm.continue_learning "
                                 "to control whether learning continues.")
            self.model.monitor.report_epoch()
            self.run_callbacks_and_monitoring()
            if self.save_freq > 0 and \
                    self.model.monitor._epochs_seen % self.save_freq == 0:
                self.save()
            # Test the termination criterion.
            continue_learning = self.algorithm.continue_learning(self.model)
            assert continue_learning in [True, False, 0, 1]
            if not continue_learning:
                break
    self.model.monitor.training_succeeded = True
    if self.save_freq > 0:
        self.save()
Step 3: view the results.
This part is just viewing the results. Run: python ../../show_weights.py cifar_grbm_smd.pkl, which opens a visualization of the learned weights.
And of course you can also use plot_monitor.py to look at some of the corresponding training statistics.
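As far as I can tell, what plot_monitor.py reads is the monitor that gets saved along with the model, so you can also poke at it by hand (a sketch; the exact channel names depend on what was monitored):

from pylearn2.utils import serial

model = serial.load('cifar_grbm_smd.pkl')
# Each channel holds a per-epoch series, e.g. the objective value.
print(model.monitor.channels.keys())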
Summary:
When using the DL models already in Pylearn2 together with the optimization algorithms already in it, all we need to do is configure the experiment's .yaml file; parameter tuning is then just repeatedly editing the .yaml configuration. But if you want to use a newly proposed DL model of your own, or your own objective function and optimization method, you also have to write the corresponding classes yourself. How exactly to do that (how to implement the class, how to design the interface, what to change in the .yaml file, and so on) I have no experience with yet — if you understand this part, please do share your thoughts. There is a tutorial online that uses Pylearn2 as an ordinary python library and implements an XOR network; it is quite good: Neural network example using Pylearn2.
Also, from reading the Pylearn2 source, every algorithm must provide four functions: __init__(), setup(), train(), continue_learning() — respectively the constructor, building the network structure from the model, training the model parameters, and deciding whether training should terminate. The model module presumably has a similar set of standard functions.
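Here is my own bare skeleton of such an algorithm class, inferred from those four functions and from how Train.main_loop() calls them — not an official pylearn2 template:

class MyTrainingAlgorithm(object):
    def __init__(self, learning_rate):
        # Constructor: just record the hyperparameters.
        self.learning_rate = learning_rate

    def setup(self, model, dataset):
        # Build whatever update machinery this model needs.
        self.model = model

    def train(self, dataset):
        # Run one pass of parameter updates; must return nothing
        # (main_loop() raises if train() returns a value).
        pass

    def continue_learning(self, model):
        # Termination handling: True to keep training, False to stop.
        return False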
Appendix:
Some errors that may come up during the experiment, and how I handled them:
A:
If running python make_dataset.py gives the error:
raise IOError("permission error creating %s" % filepath)
IOError: permission error creating cifar10_preprocessed_train.pkl
then judging by the message it is a permissions problem, so switch to:
sudo python make_dataset.py
If the error then becomes:
pylearn2.datasets.exc.NoDataPathError: You need to define your PYLEARN2_DATA_PATH environment variable. If you are using a computer at LISA, this should be set to /data/lisa/data.
this says the PYLEARN2_DATA_PATH environment variable is not set — but we set it earlier! Why? Possibly because the environment variable was set under one user while the command now runs under another (root's environment does not see your user's .bashrc, and vice versa). Switching to root and running python make_dataset.py there succeeded, and cifar10_preprocessed_train.pkl was generated.
But then running ../../train.py cifar_grbm_smd.yaml failed with: ImportError: Could not import pylearn2.models but could import pylearn2. Original exception: No module named compat.python2x
At this point it is fairly clear that permissions are to blame. The fix: reinstall pylearn2 as the ordinary user, set the environment variable, put the downloaded data in place, and then run (as the ordinary user):
python make_dataset.py
which successfully generates cifar10_preprocessed_train.pkl. Annoyingly, the subsequent ../../train.py cifar_grbm_smd.yaml still fails with the same error.
That error is in fact caused by the wrong Theano version: when using pylearn2 you should be on the development version of Theano, so update Theano as described earlier in this post and the error goes away.
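To check which Theano you have (development builds report a version string containing 'dev', something like '0.6.0rc3.dev-...'):

import theano

print(theano.__version__)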
B.
If, at the weight-visualization stage, running sudo python ../../show_weights.py cifar_grbm_smd.pkl prints something like:
You need to choose an image viewer program that pylearn2 should use. Then
tell pylearn2 to use that image viewer program by defining your
PYLEARN2_VIEWER_COMMAND environment variable. You need to choose
PYLEARN2_VIEWER_COMMAND such that running
    ${PYLEARN2_VIEWER_COMMAND} image.png
in a command prompt on your machine will do the following:
    - open an image viewer in a new process.
    - not return until you have closed the image.
Acceptable commands include:
    gwenview
    eog --new-instance
This is assuming that you have gwenview or a version of eog that supports
--new-instance
...
this means no image-viewing program has been configured for pylearn2. First install gwenview: sudo apt-get install gwenview
Then set the PYLEARN2_VIEWER_COMMAND environment variable: vim ~/.bashrc and append a line pointing at the gwenview binary. With the default install location, the line I added is:
export PYLEARN2_VIEWER_COMMAND=/usr/bin/gwenview
Save, then run source ~/.bashrc.
References:
Pylearn2: a machine learning research library
Installing Theano (Bleeding-edge install instructions)
pylearn2安裝及測試 (lucktroy's CSDN blog)