Deep learning: 45 (A simple understanding of maxout)

 

  Maxout appeared at ICML 2013. Its author, Goodfellow, combined maxout with dropout and reported state-of-the-art recognition rates on four datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.

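  As the paper shows, maxout is essentially a form of activation function. Normally, if the activation function is the sigmoid, the output of a hidden-layer node during forward propagation is:

    $$h_i(x) = \mathrm{sigmoid}(x^T W_{\cdots i} + b_i)$$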

   

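  Here W is generally 2-dimensional; the expression takes its i-th column, and the ellipsis before the subscript i stands for all rows of that column. With the maxout activation function, the output of a hidden-layer node is instead:

    $$h_i(x) = \max_{j \in [1, k]} z_{ij}, \quad z_{ij} = x^T W_{\cdots ij} + b_{ij}$$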

    

  

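  Here W is 3-dimensional, of size d*m*k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k means that each hidden-layer node corresponds to k "hidden-hidden layer" nodes. These k "hidden-hidden layer" nodes all have linear outputs, and each maxout node takes the largest of the k output values. Because the activation contains a max operation, the maxout network as a whole is still a nonlinear transformation. So whenever we see a conventionally drawn neural network that uses the maxout activation, we should mentally insert this "hidden-hidden layer" into it. One slide from a Japanese maxout presentation illustrates this:

  [Figure: slide from the Japanese maxout presentation]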

   

  Comparing the diagrams before and after the arrow in the slide should make clear what the maxout activation function is.
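  To make the two activation forms described above concrete, here is a minimal NumPy sketch. It is not the pylearn2 implementation; the function names, the random test data, and the use of `np.einsum` are my own assumptions, chosen only to mirror the d*m*k weight layout.

```python
import numpy as np

def sigmoid_layer(x, W, b):
    """Conventional hidden layer: h_i = sigmoid(x^T W[:, i] + b_i).

    x: (batch, d), W: (d, m), b: (m,) -> output (batch, m)
    """
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

def maxout_layer(x, W, b):
    """Maxout hidden layer: h_i = max_j (x^T W[:, i, j] + b[i, j]).

    x: (batch, d), W: (d, m, k), b: (m, k) -> output (batch, m)
    """
    # z[n, i, j] is the linear output of the j-th "hidden-hidden" node
    # belonging to hidden node i, for example n.
    z = np.einsum("nd,dmk->nmk", x, W) + b
    # Each maxout node keeps only the largest of its k linear outputs.
    return z.max(axis=2)

# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
d, m, k = 4, 3, 5
x = rng.normal(size=(2, d))
W = rng.normal(size=(d, m, k))
b = rng.normal(size=(m, k))

h = maxout_layer(x, W, b)
print(h.shape)  # (2, 3)
```

  Note that the output has m columns, not m*k: the k linear pieces exist only inside each node.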

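  Maxout has very strong fitting ability: it can approximate any convex function. The most intuitive explanation is that any convex function can be approximated to arbitrary precision by a piecewise linear function (anyone who has studied calculus should see this), and maxout takes the maximum over k "hidden-hidden layer" nodes, each of which is linear. Over different input ranges a different linear piece is the largest, so the maximum is itself piecewise linear (the number of pieces depends on k). Figure 1 of the paper is shown below (its point is that maxout can approximate any convex function, which of course includes ReLU):

  [Figure: Figure 1 of the maxout paper]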
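  The piecewise-linear argument can be checked numerically. The sketch below is an illustration under my own assumptions (a scalar input, and tangent lines of x^2 as the linear pieces); it shows one maxout unit reproducing ReLU and |x| exactly, and approximating x^2 better as k grows.

```python
import numpy as np

def maxout_1d(x, slopes, intercepts):
    """One maxout unit on a scalar input: max_j (slopes[j] * x + intercepts[j])."""
    return np.max(np.outer(x, slopes) + intercepts, axis=1)

x = np.linspace(-2.0, 2.0, 401)

# k = 2 is already enough for ReLU = max(0, x) and |x| = max(-x, x).
relu = maxout_1d(x, np.array([0.0, 1.0]), np.array([0.0, 0.0]))
abs_ = maxout_1d(x, np.array([-1.0, 1.0]), np.array([0.0, 0.0]))

def quadratic_error(k):
    """Worst-case gap on the grid between x^2 and a k-piece maxout unit.

    The pieces are tangents to x^2 at k evenly spaced points t:
    y = 2*t*x - t^2. Their maximum lies below x^2 and touches it at each t.
    """
    t = np.linspace(-2.0, 2.0, k)
    approx = maxout_1d(x, 2.0 * t, -t ** 2)
    return np.max(np.abs(approx - x ** 2))

errors = {k: quadratic_error(k) for k in (2, 5, 9, 17)}
for k, err in errors.items():
    print(k, err)
```

  The printed error shrinks as k increases, which is the "number of pieces depends on k" remark in numbers.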

   

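  The authors also establish this fitting ability mathematically: just 2 maxout nodes are enough to approximate any continuous function, not only convex ones (the two convex outputs are subtracted from each other), provided that the number of "hidden-hidden layer" nodes can be arbitrarily large, as shown below:

  [Figure: excerpt from the paper]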
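  As a small worked instance of the subtraction idea, the sketch below writes a non-convex "hat" function, max(0, 1 - |x|), as the difference of two maxout units. The particular pieces are my own choice for illustration and do not come from the paper.

```python
import numpy as np

def maxout_1d(x, slopes, intercepts):
    """One maxout unit on a scalar input: max_j (slopes[j] * x + intercepts[j])."""
    return np.max(np.outer(x, slopes) + intercepts, axis=1)

x = np.linspace(-3.0, 3.0, 601)

# First convex unit: g1(x) = max(0, x + 1, 2x)
g1 = maxout_1d(x, np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 0.0]))
# Second convex unit: g2(x) = max(0, 2x)
g2 = maxout_1d(x, np.array([0.0, 2.0]), np.array([0.0, 0.0]))

# Neither unit alone can produce a non-convex shape, but their difference can.
hat = g1 - g2
target = np.maximum(0.0, 1.0 - np.abs(x))

print(np.allclose(hat, target))  # True
```

  A smooth non-convex target would need more pieces per unit, which is where the "arbitrarily many hidden-hidden nodes" condition comes in.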

   

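  Now let's look at the maxout source code to check whether its activation function matches our understanding. Find the file pylearn2/models/maxout.py in the library directory and pick the non-convolutional Maxout class; the part of interest is its forward-propagation function fprop():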

  def fprop(self, state_below): # forward pass: max-pooling over groups of linear units

      self.input_space.validate(state_below)

      if self.requires_reformat:
          if not isinstance(state_below, tuple):
              for sb in get_debug_values(state_below):
                  if sb.shape[0] != self.dbm.batch_size:
                      raise ValueError("self.dbm.batch_size is %d but got shape of %d" % (self.dbm.batch_size, sb.shape[0]))
                  assert reduce(lambda x,y: x * y, sb.shape[1:]) == self.input_dim

          state_below = self.input_space.format_as(state_below, self.desired_space) # bring the input into a uniform format

      z = self.transformer.lmul(state_below) + self.b # lmul() returns T.dot(x, self._W)

      if not hasattr(self, 'randomize_pools'):
          self.randomize_pools = False

      if not hasattr(self, 'pool_stride'):
          self.pool_stride = self.pool_size # by default the pools do not overlap

      if self.randomize_pools:
          z = T.dot(z, self.permute)

      if not hasattr(self, 'min_zero'):
          self.min_zero = False

      if self.min_zero:
          p = T.zeros_like(z) # all-zero matrix with the same shape and dtype as z
      else:
          p = None

      last_start = self.detector_layer_dim  - self.pool_size
      for i in xrange(self.pool_size): # xrange (Python 2) behaves like range
          cur = z[:,i:last_start+i+1:self.pool_stride]  # L[start:end:step] slices [start, end), taking every step-th element
          if p is None:
              p = cur
          else:
              p = T.maximum(cur, p) # each pass picks the i-th element of every group, so after pool_size passes p holds each group's maximum

      p.name = self.layer_name + '_p_'

      return p

  A careful read of the source above shows that it largely agrees with the description in the paper, just with many more details.
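  The strided-slice loop in fprop() is the least obvious part, so here is a NumPy sketch of just that step. It assumes the default case pool_stride == pool_size (non-overlapping pools) and ignores the randomize_pools and min_zero options; the two helper function names are hypothetical. It checks that the loop gives the same result as reshaping z into groups and taking the maximum of each group.

```python
import numpy as np

def maxout_pool_strided(z, pool_size):
    """Mimic the slicing loop from fprop() with pool_stride == pool_size."""
    detector_layer_dim = z.shape[1]
    last_start = detector_layer_dim - pool_size
    p = None
    for i in range(pool_size):
        # Columns i, i + pool_size, i + 2 * pool_size, ...:
        # the i-th member of every group.
        cur = z[:, i:last_start + i + 1:pool_size]
        p = cur if p is None else np.maximum(cur, p)
    return p

def maxout_pool_reshape(z, pool_size):
    """Same pooling written as an explicit group-wise maximum."""
    batch, dim = z.shape
    return z.reshape(batch, dim // pool_size, pool_size).max(axis=2)

# Hypothetical sizes: 4 maxout units with 3 linear pieces each.
rng = np.random.default_rng(0)
z = rng.normal(size=(2, 4 * 3))

a = maxout_pool_strided(z, 3)
b = maxout_pool_reshape(z, 3)
print(a.shape, np.allclose(a, b))  # (2, 4) True
```

  So consecutive columns of z form one group, and the 2-D weight matrix used by lmul() plays the role of the d*m*k tensor flattened along its last two axes.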

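  Since I have no GPU, I ran only a simple MNIST experiment on the CPU, following the readme file under the maxout directory. (The MNIST dataset must first be downloaded into the PYLEARN2_DATA_PATH directory.)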

  Run ../../train.py mnist_pi.yaml

  The .yaml configuration file at this point reads as follows:

!obj:pylearn2.train.Train {
    dataset: &train !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        one_hot: 1,
        start: 0,
        stop: 50000
    },
    model: !obj:pylearn2.models.mlp.MLP {
        layers: [
                 !obj:pylearn2.models.maxout.Maxout {
                     layer_name: 'h0',
                     num_units: 240,
                     num_pieces: 5,
                     irange: .005,
                     max_col_norm: 1.9365,
                 },
                 !obj:pylearn2.models.maxout.Maxout {
                     layer_name: 'h1',
                     num_units: 240,
                     num_pieces: 5,
                     irange: .005,
                     max_col_norm: 1.9365,
                 },
                 !obj:pylearn2.models.mlp.Softmax {
                     max_col_norm: 1.9365,
                     layer_name: 'y',
                     n_classes: 10,
                     irange: .005
                 }
                ],
        nvis: 784,
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        batch_size: 100,
        learning_rate: .1,
        learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
            init_momentum: .5,
        },
        monitoring_dataset:
            {
                'train' : *train,
                'valid' : !obj:pylearn2.datasets.mnist.MNIST {
                              which_set: 'train',
                              one_hot: 1,
                              start: 50000,
                              stop:  60000
                          },
                'test'  : !obj:pylearn2.datasets.mnist.MNIST {
                              which_set: 'test',
                              one_hot: 1,
                          }
            },
        cost: !obj:pylearn2.costs.mlp.dropout.Dropout {
            input_include_probs: { 'h0' : .8 },
            input_scales: { 'h0': 1. }
        },
        termination_criterion: !obj:pylearn2.termination_criteria.MonitorBased {
            channel_name: "valid_y_misclass",
            prop_decrease: 0.,
            N: 100
        },
        update_callbacks: !obj:pylearn2.training_algorithms.sgd.ExponentialDecay {
            decay_factor: 1.000004,
            min_lr: .000001
        }
    },
    extensions: [
        !obj:pylearn2.train_extensions.best_params.MonitorBasedSaveBest {
             channel_name: 'valid_y_misclass',
             save_path: "${PYLEARN2_TRAIN_FILE_FULL_STEM}_best.pkl"
        },
        !obj:pylearn2.training_algorithms.learning_rule.MomentumAdjustor {
            start: 1,
            saturate: 250,
            final_momentum: .7
        }
    ],
    save_path: "${PYLEARN2_TRAIN_FILE_FULL_STEM}.pkl",
    save_freq: 1
}
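  The max_col_norm: 1.9365 entries in the configuration bound the L2 norm of each weight column. The sketch below shows the general idea of such a constraint in NumPy. It is my own illustration rather than pylearn2's code: the function name and the small 1e-7 stabilizer are assumptions, and the 784 x 1200 shape assumes that layer h0 has num_units * num_pieces = 240 * 5 linear units fed by the 784 inputs.

```python
import numpy as np

def clip_col_norms(W, max_col_norm):
    """Rescale every column of W whose L2 norm exceeds max_col_norm.

    Columns that are already within the limit are left (almost) unchanged.
    """
    col_norms = np.sqrt((W ** 2).sum(axis=0))
    desired = np.clip(col_norms, 0.0, max_col_norm)
    # 1e-7 avoids division by zero for all-zero columns (assumed value).
    return W * (desired / (1e-7 + col_norms))

rng = np.random.default_rng(0)
W = rng.normal(size=(784, 240 * 5))      # deliberately far too large
W_clipped = clip_col_norms(W, 1.9365)

norms = np.sqrt((W_clipped ** 2).sum(axis=0))
print(norms.max() <= 1.9365 + 1e-6)  # True
```

  Such a projection would typically be applied after each gradient step, so that no column can grow past the limit during training.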

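  After running for a whole night it had completed only 210 epochs, and I killed it (the laptop was needed elsewhere). At that point the validation error rate was 1.22% (test error 1.14%). Running for a few more hours would probably bring it down to the authors' 0.94% error rate.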

  The monitor output is as follows:

Monitoring step:
    Epochs seen: 210
    Batches seen: 105000
    Examples seen: 10500000
    learning_rate: 0.0657047371741
    momentum: 0.667871485944
    monitor_seconds_per_epoch: 121.0
    test_h0_col_norms_max: 1.9364999
    test_h0_col_norms_mean: 1.09864382902
    test_h0_col_norms_min: 0.0935518826938
    test_h0_p_max_x.max_u: 3.97355476543
    test_h0_p_max_x.mean_u: 2.14463905251
    test_h0_p_max_x.min_u: 0.961549570265
    test_h0_p_mean_x.max_u: 0.878285389379
    test_h0_p_mean_x.mean_u: 0.131020009421
    test_h0_p_mean_x.min_u: -0.373017504665
    test_h0_p_min_x.max_u: -0.202480633479
    test_h0_p_min_x.mean_u: -1.31821964107
    test_h0_p_min_x.min_u: -2.52428183099
    test_h0_p_range_x.max_u: 5.56309069078
    test_h0_p_range_x.mean_u: 3.46285869357
    test_h0_p_range_x.min_u: 2.01775637301
    test_h0_row_norms_max: 2.67556467
    test_h0_row_norms_mean: 1.15743973628
    test_h0_row_norms_min: 0.0951322935423
    test_h1_col_norms_max: 1.12119975186
    test_h1_col_norms_mean: 0.595629304226
    test_h1_col_norms_min: 0.183531862659
    test_h1_p_max_x.max_u: 6.42944749321
    test_h1_p_max_x.mean_u: 3.74599401756
    test_h1_p_max_x.min_u: 2.03028191814
    test_h1_p_mean_x.max_u: 1.38424650414
    test_h1_p_mean_x.mean_u: 0.583690886644
    test_h1_p_mean_x.min_u: 0.0253866100292
    test_h1_p_min_x.max_u: -0.830110300894
    test_h1_p_min_x.mean_u: -1.73539242398
    test_h1_p_min_x.min_u: -3.03677525979
    test_h1_p_range_x.max_u: 8.63650239768
    test_h1_p_range_x.mean_u: 5.48138644154
    test_h1_p_range_x.min_u: 3.36428499068
    test_h1_row_norms_max: 1.95904749183
    test_h1_row_norms_mean: 1.40561339238
    test_h1_row_norms_min: 1.16953677471
    test_objective: 0.0959691806325
    test_y_col_norms_max: 1.93642459019
    test_y_col_norms_mean: 1.90996961714
    test_y_col_norms_min: 1.88659811751
    test_y_max_max_class: 1.0
    test_y_mean_max_class: 0.996910632311
    test_y_min_max_class: 0.824416386342
    test_y_misclass: 0.0114
    test_y_nll: 0.0609837733094
    test_y_row_norms_max: 0.536167736581
    test_y_row_norms_mean: 0.386866656967
    test_y_row_norms_min: 0.266996530755
    train_h0_col_norms_max: 1.9364999
    train_h0_col_norms_mean: 1.09864382902
    train_h0_col_norms_min: 0.0935518826938
    train_h0_p_max_x.max_u: 3.98463017313
    train_h0_p_max_x.mean_u: 2.16546276053
    train_h0_p_max_x.min_u: 0.986865505974
    train_h0_p_mean_x.max_u: 0.850944629066
    train_h0_p_mean_x.mean_u: 0.135825383808
    train_h0_p_mean_x.min_u: -0.354841456
    train_h0_p_min_x.max_u: -0.20750516843
    train_h0_p_min_x.mean_u: -1.32748375925
    train_h0_p_min_x.min_u: -2.49716541111
    train_h0_p_range_x.max_u: 5.61263186775
    train_h0_p_range_x.mean_u: 3.49294651978
    train_h0_p_range_x.min_u: 2.07324073262
    train_h0_row_norms_max: 2.67556467
    train_h0_row_norms_mean: 1.15743973628
    train_h0_row_norms_min: 0.0951322935423
    train_h1_col_norms_max: 1.12119975186
    train_h1_col_norms_mean: 0.595629304226
    train_h1_col_norms_min: 0.183531862659
    train_h1_p_max_x.max_u: 6.49689754011
    train_h1_p_max_x.mean_u: 3.77637040198
    train_h1_p_max_x.min_u: 2.03274038543
    train_h1_p_mean_x.max_u: 1.34966894021
    train_h1_p_mean_x.mean_u: 0.57555584546
    train_h1_p_mean_x.min_u: 0.0176827309146
    train_h1_p_min_x.max_u: -0.845786992369
    train_h1_p_min_x.mean_u: -1.74696425227
    train_h1_p_min_x.min_u: -3.05703072635
    train_h1_p_range_x.max_u: 8.73556577905
    train_h1_p_range_x.mean_u: 5.52333465425
    train_h1_p_range_x.min_u: 3.379501944
    train_h1_row_norms_max: 1.95904749183
    train_h1_row_norms_mean: 1.40561339238
    train_h1_row_norms_min: 1.16953677471
    train_objective: 0.0119584870103
    train_y_col_norms_max: 1.93642459019
    train_y_col_norms_mean: 1.90996961714
    train_y_col_norms_min: 1.88659811751
    train_y_max_max_class: 1.0
    train_y_mean_max_class: 0.999958965285
    train_y_min_max_class: 0.996295480193
    train_y_misclass: 0.0
    train_y_nll: 4.22109408992e-05
    train_y_row_norms_max: 0.536167736581
    train_y_row_norms_mean: 0.386866656967
    train_y_row_norms_min: 0.266996530755
    valid_h0_col_norms_max: 1.9364999
    valid_h0_col_norms_mean: 1.09864382902
    valid_h0_col_norms_min: 0.0935518826938
    valid_h0_p_max_x.max_u: 3.970333514
    valid_h0_p_max_x.mean_u: 2.15548653063
    valid_h0_p_max_x.min_u: 0.99228626325
    valid_h0_p_mean_x.max_u: 0.84583547397
    valid_h0_p_mean_x.mean_u: 0.143554208322
    valid_h0_p_mean_x.min_u: -0.349097300524
    valid_h0_p_min_x.max_u: -0.218285757389
    valid_h0_p_min_x.mean_u: -1.28008164111
    valid_h0_p_min_x.min_u: -2.41494612443
    valid_h0_p_range_x.max_u: 5.54136030367
    valid_h0_p_range_x.mean_u: 3.43556817173
    valid_h0_p_range_x.min_u: 2.03580165751
    valid_h0_row_norms_max: 2.67556467
    valid_h0_row_norms_mean: 1.15743973628
    valid_h0_row_norms_min: 0.0951322935423
    valid_h1_col_norms_max: 1.12119975186
    valid_h1_col_norms_mean: 0.595629304226
    valid_h1_col_norms_min: 0.183531862659
    valid_h1_p_max_x.max_u: 6.4820340666
    valid_h1_p_max_x.mean_u: 3.75160795812
    valid_h1_p_max_x.min_u: 2.00587987424
    valid_h1_p_mean_x.max_u: 1.38777592924
    valid_h1_p_mean_x.mean_u: 0.578550013139
    valid_h1_p_mean_x.min_u: 0.0232071426066
    valid_h1_p_min_x.max_u: -0.84151110053
    valid_h1_p_min_x.mean_u: -1.73734213646
    valid_h1_p_min_x.min_u: -3.09680505839
    valid_h1_p_range_x.max_u: 8.72732563235
    valid_h1_p_range_x.mean_u: 5.48895009458
    valid_h1_p_range_x.min_u: 3.32030803638
    valid_h1_row_norms_max: 1.95904749183
    valid_h1_row_norms_mean: 1.40561339238
    valid_h1_row_norms_min: 1.16953677471
    valid_objective: 0.104670540623
    valid_y_col_norms_max: 1.93642459019
    valid_y_col_norms_mean: 1.90996961714
    valid_y_col_norms_min: 1.88659811751
    valid_y_max_max_class: 1.0
    valid_y_mean_max_class: 0.99627268242
    valid_y_min_max_class: 0.767024730168
    valid_y_misclass: 0.0122
    valid_y_nll: 0.0682986195071
    valid_y_row_norms_max: 0.536167736581
    valid_y_row_norms_mean: 0.38686665696
    valid_y_row_norms_min: 0.266996530755
Saving to mnist_pi.pkl...
Saving to mnist_pi.pkl done. Time elapsed: 3.000000 seconds
Time this epoch: 0:02:08.747395
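  The learning_rate and momentum values in the log can be related back to the configuration. The sketch below assumes simple closed forms for the two schedules (the learning rate divided by decay_factor once per batch, and momentum interpolated linearly between the start and saturate epochs). These formulas and function names are my reading of the config parameters, not code taken from pylearn2, but they do reproduce the logged numbers.

```python
def exponential_decay_lr(base_lr, decay_factor, min_lr, batches_seen):
    """Learning rate after dividing by decay_factor once per batch."""
    return max(base_lr / decay_factor ** batches_seen, min_lr)

def linear_momentum(init_momentum, final_momentum, start, saturate, epochs_seen):
    """Momentum interpolated linearly from epoch `start` to epoch `saturate`."""
    alpha = (epochs_seen - start) / (saturate - start)
    alpha = min(max(alpha, 0.0), 1.0)
    return init_momentum * (1.0 - alpha) + final_momentum * alpha

# Values from the .yaml file and the monitor output
# (Batches seen: 105000, Epochs seen: 210).
lr = exponential_decay_lr(0.1, 1.000004, 0.000001, 105000)
momentum = linear_momentum(0.5, 0.7, 1, 250, 210)
print(lr, momentum)
```

  Both results agree with the logged learning_rate (about 0.0657) and momentum (about 0.6679) to several decimal places.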

 

 

  References:

  Maxout Networks. Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio. ICML 2013.

  A Japanese maxout presentation (ppt).

  Goodfellow's talk on maxout at ICML.

  The readme file under the maxout directory.
