Caffe Learning Series (8): Solver Optimization Methods

Date: 2019-11-13

As mentioned in the previous post, Caffe currently provides six optimization methods:

Stochastic Gradient Descent (type: "SGD"),
AdaDelta (type: "AdaDelta"),
Adaptive Gradient (type: "AdaGrad"),
Adam (type: "Adam"),
Nesterov’s Accelerated Gradient (type: "Nesterov") and
RMSprop (type: "RMSProp")

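The solver is the optimization method used to minimize the loss. For a dataset D, the objective to optimize is the average loss over all |D| samples in the dataset:

$$L(W) = \frac{1}{|D|} \sum_{i=1}^{|D|} f_W\left(x^{(i)}\right) + \lambda r(W)$$

Here f_W(x⁽ⁱ⁾) is the loss on sample x⁽ⁱ⁾: the loss is computed for each sample individually, then summed, then averaged. r(W) is the regularization term, whose weight λ is set by weight_decay; it is there to reduce overfitting.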

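With this loss function, every iteration requires a pass over the entire dataset, which is very inefficient when the dataset is large. This is the approach taken by the familiar (full-batch) gradient descent.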

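In practice, the dataset is split into batches. Each batch is a mini-batch of size N << |D| (batch_size), and the loss function becomes:

$$L(W) \approx \frac{1}{N} \sum_{i=1}^{N} f_W\left(x^{(i)}\right) + \lambda r(W)$$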

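Given the loss function, the problem can be optimized by iteratively computing the loss and its gradient. In a neural network, the forward pass computes the loss and the backward pass computes the gradient.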
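To make the difference between the full-dataset loss and the mini-batch loss concrete, here is a minimal Python sketch. The one-parameter linear model, the squared-error `sample_loss`, and the L2 `regularizer` are illustrative assumptions standing in for f_W and r(W); they are not taken from Caffe.

```python
import random

def sample_loss(w, x, y):
    """Squared error of a 1-D linear model; stands in for f_W(x)."""
    return (w * x - y) ** 2

def regularizer(w):
    """L2 penalty; stands in for r(W)."""
    return 0.5 * w * w

def full_loss(w, data, lam):
    """Average loss over the whole dataset D, plus lam * r(W)."""
    total = sum(sample_loss(w, x, y) for x, y in data)
    return total / len(data) + lam * regularizer(w)

def minibatch_loss(w, data, lam, batch_size, rng):
    """Same quantity estimated from N = batch_size randomly drawn samples."""
    batch = rng.sample(data, batch_size)
    total = sum(sample_loss(w, x, y) for x, y in batch)
    return total / batch_size + lam * regularizer(w)

rng = random.Random(0)
# Toy dataset generated by the "true" weight 2.0
data = [(x, 2.0 * x) for x in (rng.uniform(-1, 1) for _ in range(10000))]

print("full dataset:", full_loss(1.5, data, 0.0005))
print("mini-batch  :", minibatch_loss(1.5, data, 0.0005, 64, rng))
```

The mini-batch value is a noisy estimate of the full-dataset value, but it touches only 64 samples instead of 10000.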

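By default, Caffe uses Stochastic Gradient Descent (SGD) for optimization. The other methods are also gradient-based (like SGD), so this post covers only SGD in detail. Interested readers can consult the original papers for the other methods.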

1. Stochastic gradient descent (SGD)

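Stochastic gradient descent builds on gradient descent, which is also called steepest descent. Andrew Ng explains the underlying principle in detail in the NetEase open course "Machine Learning". SGD updates W with a linear combination of the negative gradient and the previous weight update V_t. The iteration is:

$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$

$$W_{t+1} = W_t + V_{t+1}$$

Here α is the learning rate applied to the negative gradient (base_lr), and μ is the weight of the previous update (momentum), which controls how much earlier gradient directions influence the current descent direction. Both parameters need tuning to get the best result and are usually set from experience. If you do not know how to set them, refer to the relevant papers.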
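The update can be sketched in a few lines of Python. This is only an illustration on a scalar weight: the function name `sgd_momentum_step` and the toy objective L(w) = 0.5 * w^2 are assumptions made for the example, not Caffe code.

```python
def sgd_momentum_step(w, v, grad, base_lr=0.01, momentum=0.9):
    """One SGD step: combine the previous update v with the negative gradient."""
    v_next = momentum * v - base_lr * grad
    w_next = w + v_next
    return w_next, v_next

# Toy objective L(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)

print("w after 200 steps:", w)
```

With momentum set to 0 the same function reduces to plain gradient descent, which makes it easy to compare the two on a toy problem.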

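When using SGD in deep learning, a good strategy is to start with a learning rate of around 0.01 (base_lr: 0.01). During training, once the loss starts to plateau, multiply the learning rate by a constant factor (gamma), and repeat this several times.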

Momentum usually takes a value between 0.5 and 0.99 and is typically set to 0.9. Momentum can make deep learning with SGD both more stable and faster.

For more on momentum, see Hinton's "A Practical Guide to Training Restricted Boltzmann Machines".

Example:

base_lr: 0.01 
lr_policy: "step"
gamma: 0.1   
stepsize: 1000  
max_iter: 3500 
momentum: 0.9

With lr_policy set to "step", the learning rate follows base_lr * gamma ^ (floor(iter / stepsize)).

That is, the learning rate is 0.01 for the first 1000 iterations, 0.001 for iterations 1001-2000, 0.0001 for iterations 2001-3000, and 0.00001 (10^-5) for iterations 3001-3500.
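The step formula is easy to check numerically. The sketch below evaluates it in Python with the values from the example; the helper name `step_lr` is made up for illustration, and it assumes the iteration counter starts at 0, so the "1001st iteration" corresponds to iter = 1000.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=1000):
    """Learning rate under the "step" policy: base_lr * gamma^floor(iter / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

# Print the rate at the start and end of each stage of the 3500-iteration run.
for it in (0, 999, 1000, 1999, 2000, 2999, 3000, 3499):
    print(f"iter {it:4d}: lr = {step_lr(it):.5f}")
```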

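The settings above are only a guideline. They are not guaranteed to give the best result in every case, and sometimes this approach does not work at all. If training diverges (for example, you see very large, NaN, or inf loss values or outputs right from the start), lower base_lr (for example, to 0.001) and retrain, repeating this until you find a base_lr that works.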
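The "lower base_lr and retrain" loop can be sketched as follows. Everything here is a toy stand-in: `train` runs plain gradient descent on a steep quadratic instead of a real Caffe training run, and the curvature, the divergence threshold, and the helper names are assumptions chosen so that base_lr = 0.01 diverges.

```python
import math

def train(base_lr, steps=200, curvature=300.0):
    """Toy training run: gradient descent on L(w) = 0.5 * curvature * w**2.

    Returns the final loss, or None if the loss blows up.
    """
    w = 1.0
    loss = 0.5 * curvature * w * w
    for _ in range(steps):
        w -= base_lr * curvature * w
        loss = 0.5 * curvature * w * w
        if not math.isfinite(loss) or loss > 1e6:
            return None  # diverged
    return loss

def find_working_lr(base_lr=0.01, factor=0.1, max_tries=5):
    """Retry with a smaller base_lr until a run no longer diverges."""
    for _ in range(max_tries):
        if train(base_lr) is not None:
            return base_lr
        base_lr *= factor
    return None

print("first base_lr that trains:", find_working_lr())
```

In a real setting the check would be applied to the loss values printed during training rather than to a toy function.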

2. AdaDelta

AdaDelta is a "robust learning rate method" and, like SGD, a gradient-based optimization method.

Reference:

M. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.

Example:

net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.95
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet_adadelta"
solver_mode: GPU
type: "AdaDelta"
delta: 1e-6

As the last two lines show, when the solver type is set to AdaDelta, a value for delta is set as well.
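For readers who want to see what delta does, here is a rough Python sketch of the AdaDelta update from Zeiler's paper on a scalar weight. It assumes that momentum plays the role of the decay rate ρ and delta the role of the small constant ε; that mapping is an assumption of this sketch, and the function name `adadelta_step` is made up for illustration.

```python
import math

def adadelta_step(w, grad, state, rho=0.95, delta=1e-6, lr=1.0):
    """One AdaDelta step. state = (running avg of grad^2, running avg of update^2)."""
    avg_sq_grad, avg_sq_update = state
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Scale the gradient by RMS(previous updates) / RMS(gradients).
    update = math.sqrt((avg_sq_update + delta) / (avg_sq_grad + delta)) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update * update
    return w - lr * update, (avg_sq_grad, avg_sq_update)

# Toy objective L(w) = 0.5 * w**2, whose gradient is w.
w, state = 1.0, (0.0, 0.0)
for _ in range(100):
    w, state = adadelta_step(w, w, state)

print("w after 100 steps:", w)
```

Without delta, the very first step would divide by a running average that is still zero, which is why the constant appears in both square roots.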

3. AdaGrad

Adaptive gradient (AdaGrad) is, like SGD, a gradient-based optimization method.

Reference:

J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.

Example:

net: "examples/mnist/mnist_autoencoder.prototxt"
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100
test_interval: 500
test_compute_loss: true
base_lr: 0.01
lr_policy: "fixed"
display: 100
max_iter: 65000
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "examples/mnist/mnist_autoencoder_adagrad_train"
# solver mode: CPU or GPU
solver_mode: GPU
type: "AdaGrad"
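The idea behind AdaGrad can be sketched in Python on a scalar weight: each step is divided by the square root of the accumulated squared gradients, so the effective step size shrinks over time. The function name `adagrad_step` and the small constant `delta` are assumptions of this sketch.

```python
import math

def adagrad_step(w, grad, hist, base_lr=0.01, delta=1e-8):
    """One AdaGrad step. hist is the running sum of squared gradients."""
    hist = hist + grad * grad
    w = w - base_lr * grad / (math.sqrt(hist) + delta)
    return w, hist

# Feed a constant gradient of 1.0 to see the step size shrink.
w, hist = 1.0, 0.0
for step in range(1, 4):
    w_new, hist = adagrad_step(w, 1.0, hist)
    print(f"step {step}: moved by {w - w_new:.6f}")
    w = w_new
```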

4. Adam

Adam is, like SGD, a gradient-based optimization method.

Reference:

D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference for Learning Representations, 2015.
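This section has no example, so here is a rough Python sketch of the Adam update from the Kingma and Ba paper on a scalar weight. The function name `adam_step` and the default values are assumptions of this sketch (the defaults follow the paper); as far as I know Caffe exposes the two decay rates and the small constant as momentum, momentum2, and delta, but treat that mapping as an assumption too.

```python
import math

def adam_step(w, grad, m, v, t, base_lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. m and v are running averages of grad and grad^2; t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - base_lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective L(w) = 0.5 * w**2, whose gradient is w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, w, m, v, t)

print("w after 500 steps:", w)
```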

5. NAG

Nesterov's accelerated gradient (NAG) was proposed as an "optimal" method for convex optimization: it converges very fast, at a rate of O(1/t^2) rather than O(1/t).

Reference:

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.

Example:

net: "examples/mnist/mnist_autoencoder.prototxt"
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100
test_interval: 500
test_compute_loss: true
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 10000
display: 100
max_iter: 65000
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "examples/mnist/mnist_autoencoder_nesterov_train"
momentum: 0.95
# solver mode: CPU or GPU
solver_mode: GPU
type: "Nesterov"
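The difference from plain momentum SGD is where the gradient is evaluated. The Python sketch below follows the formulation in the Sutskever et al. paper cited above, on a scalar weight; the function name `nesterov_step` and the toy objective are assumptions of this sketch, not Caffe code.

```python
def nesterov_step(w, v, grad_fn, base_lr=0.01, momentum=0.95):
    """One NAG step: take the gradient at the look-ahead point w + momentum * v."""
    lookahead = w + momentum * v
    v_next = momentum * v - base_lr * grad_fn(lookahead)
    return w + v_next, v_next

def grad(w):
    """Gradient of the toy objective L(w) = 0.5 * w**2."""
    return w

w, v = 1.0, 0.0
for _ in range(500):
    w, v = nesterov_step(w, v, grad)

print("w after 500 steps:", w)
```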

6. RMSprop

RMSprop was proposed by Tieleman in a Coursera course lecture and is, like SGD, a gradient-based optimization method.

Reference:

T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Technical report, 2012.

Example:

net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
# the RMSProp solver expects momentum to be 0
momentum: 0.0
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet_rmsprop"
solver_mode: GPU
type: "RMSProp"
rms_decay: 0.98
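The role of rms_decay is easiest to see in the basic RMSprop rule from the lecture: divide the gradient by a running average of its recent magnitude. The Python sketch below shows that rule on a scalar weight. The function name `rmsprop_step`, the learning rate, and the small constant `delta` are assumptions of this sketch, and Caffe's own implementation may differ in detail.

```python
import math

def rmsprop_step(w, grad, mean_sq, base_lr=0.01, rms_decay=0.98, delta=1e-8):
    """One RMSprop step. mean_sq is the running average of squared gradients."""
    mean_sq = rms_decay * mean_sq + (1 - rms_decay) * grad * grad
    w = w - base_lr * grad / (math.sqrt(mean_sq) + delta)
    return w, mean_sq

# Toy objective L(w) = 0.5 * w**2, whose gradient is w.
w, mean_sq = 1.0, 0.0
for _ in range(50):
    w, mean_sq = rmsprop_step(w, w, mean_sq)

print("w after 50 steps:", w)
```

A larger rms_decay gives the running average a longer memory, so the step size reacts more slowly to changes in gradient magnitude.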