Deep Learning: Improving Deep Neural Networks, Week 2: Optimization Algorithms (Andrew Ng)

Date: 2020-05-18

Many people highly recommend An overview of gradient descent optimization algorithms, so I am putting it at the top; read it when you get the chance.

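This article is based mainly on Andrew Ng's Coursera course "Improving Deep Neural Networks". It also includes some comparisons between different optimization algorithms; their sources are not cited one by one, and all references are given at the end of the article.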

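The topics covered here are:

Mini-batch gradient descent
Exponentially weighted averages
Gradient descent with momentum
RMSprop
Adam

All of these are refinements of gradient descent. For each one, the article covers the basic principle and how to use it in TensorFlow, followed by a summary of the pros and cons of the different algorithms. The final sections cover learning rate decay and the problem of local optima.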

[TOC]

1. Mini-batch gradient descent

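When the training set is not too large, batch gradient descent is the usual choice: the whole training set is fed through the network for every gradient descent step. In practice, however, training sets are often very large, for example millions of samples, and batch gradient descent then becomes very slow.

Running each gradient descent step on only a subset of the training set instead gives Mini-batch gradient descent.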

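1.1 How the algorithm works

The full training set is $X=[x^1, x^2, \cdots, x^m] \in R^{n \times m}$, $Y=[y^1, y^2, \cdots, y^m] \in R^{1 \times m}$.

Assume:

$m=5000000$ and each mini-batch contains 1000 samples, i.e. $X^{\{t\}} \in R^{n \times 1000},Y^{\{t\}} \in R^{1 \times 1000}, t=1, 2, \cdots, 5000$.

$x^i$ denotes the $i$-th sample; $Z^{[l]}$ denotes the linear output of layer $l$ of the network; $X^{\{t\}}, Y^{\{t\}}$ denote the $t$-th mini-batch.

Gradient descent is run on each mini-batch in turn. In pseudo-code: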

# one epoch (T = number of mini-batches)
for t = 1, ..., T {
    Forward Propagation on X^{t}
    Compute Cost Function on (X^{t}, Y^{t})
    Backward Propagation, then update W and b
}

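The steps in detail:

(1) Forward Propagation

Output of the first layer: $$ Z^{[1]} = W^{[1]}X^{\{t\}} + b^{[1]} $$

$$ A^{[1]} = g^{[1]}(Z^{[1]}) $$

Nonlinear output of layer $l$: $$ A^{[l]} = g^{[l]}(Z^{[l]}) $$

(2) Compute Cost Function

The cost on the current mini-batch of 1000 samples: $$ J = \dfrac{1}{1000} \sum_{i=1}^{1000}Loss(\hat{y}^i, y^i) + \dfrac{\lambda}{2 \times 1000} \sum_{l}||W^{[l]}||_F^2 $$

(3) Backward Propagation

Compute the gradients $dW^{[l]}, db^{[l]}$, then update the weights and biases: $$ W^{[l]} := W^{[l]} - \alpha dW^{[l]} $$

$$ b^{[l]} := b^{[l]} - \alpha db^{[l]} $$

After T iterations of the for loop, the whole training set has been used once; this is one epoch. Training can run for multiple epochs.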
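The loop above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the course's implementation: it swaps the deep network for a toy linear model $\hat{y} = wx + b$ with a mean squared error loss, and the helper names `make_mini_batches` and `train_one_epoch` as well as the toy data are hypothetical.

```python
import random

def make_mini_batches(xs, ys, batch_size, seed=0):
    """Shuffle the samples and split them into consecutive mini-batches."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    return [
        ([xs[i] for i in idx[s:s + batch_size]],
         [ys[i] for i in idx[s:s + batch_size]])
        for s in range(0, len(idx), batch_size)
    ]

def train_one_epoch(w, b, batches, alpha):
    """One gradient-descent step per mini-batch (T steps per epoch)."""
    for bx, by in batches:
        n = len(bx)
        # Forward propagation: predictions of the toy linear model.
        preds = [w * x + b for x in bx]
        # Backward propagation: gradients of the mean squared error.
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, by, bx)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, by)) / n
        # Parameter update: W := W - alpha * dW, b := b - alpha * db.
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data generated from y = 2x + 1 (illustrative only).
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]

w, b = 0.0, 0.0
for epoch in range(200):
    # Reshuffle every epoch so the mini-batches differ between epochs.
    batches = make_mini_batches(xs, ys, batch_size=10, seed=epoch)
    w, b = train_one_epoch(w, b, batches, alpha=0.1)
print(w, b)
```

With 100 samples and a batch size of 10, each epoch performs T = 10 parameter updates, compared with a single update per epoch for batch gradient descent.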

1.2 A closer look at Mini-batch gradient descent

With Batch Gradient Descent, one epoch performs a single gradient descent step; with Mini-batch Gradient Descent, one epoch performs T steps.

1.2.1 Cost function

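(1) The left plot shows a typical network trained with Batch Gradient Descent: the cost decreases steadily as the number of passes over the whole training set grows.

(2) The right plot shows Mini-batch Gradient Descent: as training moves across different mini-batches, the cost trends downward overall but oscillates because of noise.

(3) The cost oscillates under Mini-batch Gradient Descent because mini-batches differ from one another: some are easy subsets while others contain noisy samples, so the cost computed on each one fluctuates.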

1.2.2 How to choose the batch size

There are three options: (1) batch_size=m; (2) batch_size=1; (3) batch_size between 1 and m.

(1) Batch Gradient Descent (batch_size=m)

With batch_size=m the method becomes Batch Gradient Descent: there is a single subset, the whole dataset, i.e. $(X^{\{1\}}, Y^{\{1\}})=(X,Y)$.

(2) Stochastic Gradient Descent (batch_size=1)

With batch_size=1 the method becomes Stochastic Gradient Descent: there are m subsets and each sample is its own subset, i.e. $(X^{\{i\}}, Y^{\{i\}})=(x^i,y^i)$.

(3) Mini-batch gradient descent (batch_size between 1 and m)

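The figure above compares the descent paths of the three methods:

a. Blue is Batch Gradient Descent. It approaches the global minimum smoothly, but because every step uses the whole dataset, each step is slow.

b. Purple is Stochastic Gradient Descent. Each step is fast, but because only one sample is used at a time, the path oscillates strongly; it does not converge to the minimum and ends up wandering around it.

c. Green is Mini-batch gradient descent. Each step is fairly fast and the oscillation is small, so it gets close to the minimum; if it wanders around the minimum, the learning rate can be reduced.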

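| Algorithm | Stochastic Gradient Descent | Mini-batch gradient descent | Batch Gradient Descent |
| --- | --- | --- | --- |
| Advantages | Works on one sample at a time | (1) Learns quickly; (2) benefits from vectorization; (3) makes progress before a full pass over the training set has finished | |
| Disadvantages | (1) Loses the speed-up from vectorization; (2) inefficient | | A single iteration takes too long |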

How should the batch size be chosen for Mini-batch gradient descent?

Typically between 64 and 512, as a power of 2, which tends to run faster;
$X^{\{t\}}, Y^{\{t\}}$ must fit in GPU/CPU memory.

1.3 Gradient descent in TensorFlow

1.3.1 Building the optimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate)  # TensorFlow 1.x API
train = optimizer.minimize(loss)

1.3.2 tf.train.GradientDescentOptimizer()

tf.train.GradientDescentOptimizer.__init__(self, 
                                           learning_rate, 
                                           use_locking=False, 
                                           name="GradientDescent"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate to use.
	use_locking: If True use locks for update operations.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "GradientDescent".

1.3.3 Usage in TensorFlow

#coding=utf-8
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
y_pred = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(y_pred - y))  # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))

2. Exponentially weighted averages

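Exponentially weighted averages are a key building block of the optimization algorithms beyond plain gradient descent, so the concept is introduced first.

2.1 London temperatures

The motivation for exponentially weighted averages is not repeated here; see NetEase Cloud Classroom, Andrew Ng, "Improving Deep Neural Networks", Week 2, or the course notes by 紅色石頭Will.

Assume $V_0 = 0$: $$ V_t = \beta V_{t-1} + (1 - \beta) \theta_t $$ where $\theta_t$ is the temperature on day $t$ and $V_t$ is the daily temperature after smoothing with a moving average. The value of $\beta$ determines roughly how many days are averaged, namely $\dfrac{1}{1-\beta}$. The larger $\beta$ is, the more days are averaged, the smoother the resulting curve, and the more it lags behind the data (shifts to the right).

For example, with $\beta=0.9$, $\dfrac{1}{1-\beta}=10$, which corresponds to an exponentially weighted average over roughly the last 10 days.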
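One way to see why $\dfrac{1}{1-\beta}$ acts as the averaging window: the weight of a sample that is $\dfrac{1}{1-\beta}$ days old has decayed by a factor $\beta^{1/(1-\beta)}$, which approaches $1/e \approx 0.37$ as $\beta$ gets close to 1. A small check of this rule of thumb; the function names and the listed $\beta$ values are chosen only for illustration:

```python
import math

def effective_window(beta):
    """Approximate number of days averaged: 1 / (1 - beta)."""
    return 1 / (1 - beta)

def residual_weight(beta):
    """Decay factor applied to a sample that is one window old."""
    return beta ** effective_window(beta)

for beta in (0.5, 0.9, 0.98):
    print(beta, round(effective_window(beta)), round(residual_weight(beta), 3))

print("1/e =", round(1 / math.e, 3))
```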

2.2 A closer look at Exponentially weighted averages

2.2.1 The general form

$$ V_t = \beta V_{t-1} + (1-\beta)\theta_{t} $$

$$ V_t = (1-\beta) \cdot \theta_{t} + (1-\beta) \cdot \beta \cdot \theta_{t-1} + (1-\beta) \cdot \beta^2 \cdot \theta_{t-2} + \cdots + (1-\beta)\cdot \beta^{t-1}\cdot \theta_1 + \beta^t\cdot V_0 $$

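Here $\theta_t, \theta_{t-1}, \cdots , \theta_1$ are the raw data, shown in the first plot of the figure below.

The coefficients $(1-\beta), (1-\beta)\cdot \beta, \cdots, (1-\beta)\cdot \beta^{t-1}$ follow an exponential curve, shown in the second plot of the figure below; going from right to left (further into the past) they decay exponentially.

$V_t$ is the sum of the element-wise product of the two: each raw value is multiplied by its decay coefficient. The more recent a value, the larger its influence; the older it is, the more strongly it is attenuated.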

2.2.2 Computing the average in practice

In practice, to save memory, the average is computed with a single running variable:

$V_{\theta} = 0$

Repeat {

Get next $\theta_t$

$V_{\theta} := \beta V_{\theta} + (1-\beta)\theta_t$

}
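The running update needs only one stored value. A minimal sketch in Python, assuming made-up temperature readings (the list `temps` and both function names are illustrative, not from the course); it also compares the running update with the expanded weighted sum from 2.2.1:

```python
def ewa_running(thetas, beta):
    """Exponentially weighted average with O(1) memory: only v is kept."""
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    return v

def ewa_expanded(thetas, beta):
    """Same quantity via the explicit sum of (1 - beta) * beta**k * theta_{t-k}."""
    t = len(thetas)
    return sum((1 - beta) * beta ** k * thetas[t - 1 - k] for k in range(t))

temps = [4.0, 9.0, 6.0, 11.0, 13.0, 10.0, 15.0]  # hypothetical daily temperatures
print(ewa_running(temps, 0.9), ewa_expanded(temps, 0.9))
```

Both functions return the same value up to floating-point rounding, but the running version never has to store the history.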

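2.3 Bias correction

Because of the initial assumption $V_0=0$, the first few values computed with $V_t = \beta V_{t-1} + (1-\beta)\theta_t$ are strongly affected: they come out smaller than they should. The effect fades as more data is processed, and the values return to normal.

Bias correction fixes this by rescaling $V_t$: $$ \dfrac{V_t}{1-\beta^t} $$ In machine learning, bias correction is not strictly required.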
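To see the effect of the correction, consider a constant signal $\theta_t = 10$ (an illustrative value, as is the function name below). The raw average starts far below 10 and only creeps up, while the corrected value $V_t/(1-\beta^t)$ recovers 10 from the first step, because for a constant input $V_t = (1-\beta^t)\cdot 10$:

```python
def ewa_with_bias_correction(thetas, beta):
    """Yield (raw, corrected) exponentially weighted averages for t = 1, 2, ..."""
    v = 0.0
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        yield v, v / (1 - beta ** t)

results = list(ewa_with_bias_correction([10.0] * 5, beta=0.9))
for t, (raw, corrected) in enumerate(results, start=1):
    print(t, round(raw, 3), round(corrected, 3))
```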

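3. Gradient descent with momentum

**Gradient descent with momentum** usually converges faster than standard gradient descent.

The idea: at every training step, compute an exponentially weighted average of the gradients, then use that averaged gradient to update the weights and biases.

3.1 Gradient descent

The blue zigzag line in the figure above is standard gradient descent. It oscillates on the way down because each update depends only on the gradient at the current point, which produces the zigzag.

The red line in the figure above is gradient descent with momentum. The descent no longer oscillates violently, because each update depends on the previous gradients as well as the current one; the swings along the vertical axis shrink while progress along the horizontal axis speeds up.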

3.2 Pseudo-code

On iteration t{

Compute dW, db on the current mini-batch

$V_{dW} = \beta V_{dW} + (1-\beta)dW$

$V_{db} = \beta V_{db} + (1-\beta)db$

Update the weights and biases:

$W := W - \alpha V_{dW}, b := b - \alpha V_{db}$

}

At initialization $V_{dW}=0, V_{db}=0$; a common choice is $\beta=0.9$.
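A minimal sketch of this update on a toy problem. The elongated quadratic cost $J(w) = \frac{1}{2}(w_1^2 + 25 w_2^2)$, the starting point, and the function names are assumptions chosen for illustration; the steep $w_2$ direction plays the role of the vertical axis in the figure.

```python
def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def momentum_descent(w, alpha=0.05, beta=0.9, steps=300):
    v = [0.0 for _ in w]  # V_dW initialised to 0
    for _ in range(steps):
        g = grad(w)
        # V_dW = beta * V_dW + (1 - beta) * dW
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
        # W := W - alpha * V_dW
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w

print(momentum_descent([10.0, 1.0]))
```

Both coordinates end up very close to the minimum at the origin.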

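3.3 Gradient descent with momentum in TensorFlow

3.3.1 Building the optimizer

# optimizer
# momentum plays the role of \beta; note that TensorFlow accumulates
# momentum * accumulation + gradient, i.e. without the (1 - \beta) factor
optimizer = tf.train.MomentumOptimizer(0.01, momentum)
train = optimizer.minimize(loss)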

3.3.2 tf.train.MomentumOptimizer()

tf.train.MomentumOptimizer.__init__(self, learning_rate, momentum,
               use_locking=False, name="Momentum", use_nesterov=False):
    
Args:
	learning_rate: A `Tensor` or a floating point value.  The learning rate.
	momentum: A `Tensor` or a floating point value.  The momentum. # the hyperparameter \beta of the exponentially weighted average, e.g. 0.9
	use_locking: If `True` use locks for update operations. 
	name: Optional name prefix for the operations created when applying gradients.  Defaults to "Momentum".
	use_nesterov: If `True` use Nesterov Momentum. # a variant derived from momentum that often works better; see: http://jmlr.org/proceedings/papers/v28/sutskever13.pdf

Return:
    optimizer

4. RMSprop

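RMSprop (Root Mean Square prop) is another algorithm that improves gradient descent. Like Gradient descent with momentum, it damps the swings along the vertical axis and speeds up progress along the horizontal axis.

4.1 Pseudo-code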

On iteration t{

Compute dW, db on the current mini-batch

$S_{dW} = \beta S_{dW} + (1-\beta)(dW)^2$

$S_{db} = \beta S_{db} + (1-\beta)(db)^2$

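Update the weights and biases:

$W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}}+\epsilon}, b := b - \alpha \dfrac{db}{\sqrt{S_{db}}+\epsilon}$

}

A typical choice is $\epsilon=10^{-8}$, which keeps the denominator away from 0. The squares $(dW)^2$ and $(db)^2$ are element-wise.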
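The same toy setup can be used to sketch the RMSprop update. As before, the quadratic cost, the starting point, the learning rate, and the function names are illustrative assumptions, not part of the course material.

```python
import math

def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def rmsprop_descent(w, alpha=0.01, beta=0.9, eps=1e-8, steps=2000):
    s = [0.0 for _ in w]  # S_dW initialised to 0
    for _ in range(steps):
        g = grad(w)
        # S_dW = beta * S_dW + (1 - beta) * dW^2  (element-wise square)
        s = [beta * si + (1 - beta) * gi ** 2 for si, gi in zip(s, g)]
        # W := W - alpha * dW / (sqrt(S_dW) + eps)
        w = [wi - alpha * gi / (math.sqrt(si) + eps)
             for wi, gi, si in zip(w, g, s)]
    return w

print(rmsprop_descent([5.0, 1.0]))
```

Because each gradient is divided by the root of its own running average, the steep and the shallow direction take steps of comparable size, which is what damps the swings along the steep axis.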

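4.2 RMSprop in TensorFlow

4.2.1 Building the optimizer

# optimizer
# decay is the discount factor of the squared-gradient average (\beta in 4.1);
# momentum is an optional, separate momentum term (0.0 disables it)
optimizer = tf.train.RMSPropOptimizer(0.01, decay, momentum)
train = optimizer.minimize(loss)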

4.2.2 tf.train.RMSPropOptimizer()

tf.train.RMSPropOptimizer.__init__(self,
                                  learning_rate,
                                  decay=0.9,
                                  momentum=0.0,
                                  epsilon=1e-10,
                                  use_locking=False,
                                  centered=False,
                                  name="RMSProp")
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate.
	decay: Discounting factor for the history/coming gradient  # \beta in 4.1
	momentum: A scalar tensor. # a separate momentum term, not \beta
	epsilon: Small value to avoid zero denominator.  # \epsilon, keeps the denominator away from 0
	use_locking: If True use locks for update operation.
	centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
	name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSProp".

5. Adam optimization algorithm

Adam combines Gradient descent with momentum and RMSprop; it has been shown to work well across a wide variety of neural networks.

5.1 Adam: pseudo-code

Initialize: $V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0$.

On iteration t {

Compute $dW, db$ on each mini-batch

$V_{dW} = \beta_1 V_{dW} + (1-\beta_1)dW$

$V_{db} = \beta_1 V_{db} + (1-\beta_1)db$

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2$

$S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$

$V_{dW}^{corrected}= \dfrac{V_{dW}}{1-\beta_1^t}, V_{db}^{corrected}= \dfrac{V_{db}}{1-\beta_1^t}$

$S_{dW}^{corrected}= \dfrac{S_{dW}}{1-\beta_2^t}, S_{db}^{corrected}= \dfrac{S_{db}}{1-\beta_2^t}$

$W := W - \alpha \dfrac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}, \quad b := b - \alpha \dfrac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$

}

Adam applies bias correction to both averages.

Hyperparameters: $\beta_1 = 0.9, \beta_2=0.999, \epsilon = 10^{-8}$; usually only the learning rate $\alpha$ needs tuning.
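The Adam steps can be sketched the same way. The toy quadratic cost, the starting point, the learning rate of 0.1, and the function names are assumptions for illustration; $\beta_1, \beta_2, \epsilon$ use the values listed above.

```python
import math

def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def adam_descent(w, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    v = [0.0 for _ in w]  # first-moment average, V_dW
    s = [0.0 for _ in w]  # second-moment average, S_dW
    for t in range(1, steps + 1):
        g = grad(w)
        v = [beta1 * vi + (1 - beta1) * gi for vi, gi in zip(v, g)]
        s = [beta2 * si + (1 - beta2) * gi ** 2 for si, gi in zip(s, g)]
        # Bias correction for both averages.
        v_hat = [vi / (1 - beta1 ** t) for vi in v]
        s_hat = [si / (1 - beta2 ** t) for si in s]
        # W := W - alpha * V_corrected / (sqrt(S_corrected) + eps)
        w = [wi - alpha * vh / (math.sqrt(sh) + eps)
             for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w

print(adam_descent([5.0, 1.0]))
```

With $\beta_2=0.999$ the uncorrected $S_{dW}$ would be about a thousand times too small on the first step, which is why the correction matters more here than for a plain exponentially weighted average.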

5.2 Adam in TensorFlow

5.2.1 Building the optimizer

optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon)
train = optimizer.minimize(loss)

5.2.2 tf.train.AdamOptimizer

tf.train.AdamOptimizer.__init__(self,
                                learning_rate=0.001,
                                beta1=0.9,
                                beta2=0.999,
                                epsilon=1e-8,
                                use_locking=False,
                                name="Adam"):
Args:
	learning_rate: A Tensor or a floating point value.  The learning rate.
	beta1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. # \beta_1
	beta2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. # \beta_2
	epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
	use_locking: If True use locks for update operations.
	name: Optional name for the operations created when applying gradients. Defaults to "Adam".

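6. Pros and cons of the different optimization algorithms

6.1 Batch Gradient Descent

**Idea:** run gradient descent on the whole training set to update the weights.

Advantages:

Every step uses the loss over the whole training set, so the descent direction is exact and stable; for a convex cost it converges to the global minimum.

Disadvantages:

Each iteration is computationally expensive and uses a lot of memory.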

6.2 Stochastic Gradient Descent

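**Idea:** pick one random sample from the training set, compute its gradient, and update the parameters.

Advantages:

The gradient is computed on the loss of a single sample, so each update is cheap;
The noise in the updates acts as a form of regularization.

Disadvantages:

Only one sample is considered per update, so the path is noisy and can settle in a poor local optimum;
With a large training set, training takes a long time;
Choosing a suitable learning rate is difficult;
Sensitive to parameter initialization.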

6.3 Mini Batch Gradient Descent

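**Idea:** pick batch_size samples from the training set, compute the gradient of their loss, and update the weights.

Advantages:

Converges fairly quickly even on very large training sets.

Disadvantages:

The update direction depends on the samples in the current batch, so it is unstable;
It may fail to settle at the minimum, in which case the learning rate has to be reduced gradually.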

6.4 Gradient Descent with Momentum

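**Idea:** update using the direction of the previous gradients together with the gradient of the current batch.

Advantages:

Damps the swings in the vertical direction, which suppresses oscillation to some extent;
Speeds up movement in the horizontal direction, so the optimum is approached faster and convergence accelerates.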

6.5 RMSprop

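**Idea:** like gradient descent with momentum, it uses an exponentially weighted average, here of the squared gradients, which is then used to scale each update.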

6.6 AdaGrad

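**Idea:** divide each parameter's learning rate by the square root of the accumulated sum of its squared gradients. (The combination of Gradient Descent with Momentum and RMSprop is Adam, see Section 5.)

Advantages:

Early in training, updates are large;
Late in training, updates are small;
Well suited to sparse gradients.

Disadvantages:

Late in training the effective learning rate becomes very small, so updates are very slow;
A global learning rate still has to be chosen by hand.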

7. Learning rate decay

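During training, reducing the learning rate appropriately helps speed up convergence. This family of methods is called learning rate decay: the learning rate $\alpha$ is gradually reduced as the number of iterations grows.

7.1 Ways of reducing the learning rate

(1) First: $$ \alpha = \dfrac{1}{1+ decay\_rate \times epoch\_num}\cdot \alpha_0 $$ where $decay\_rate$ is the decay parameter and $epoch\_num$ is the number of epochs completed.

(2) Second: $$ \alpha = 0.95^{epoch\_num} \cdot \alpha_0 $$

(3) Third: $$ \alpha = \dfrac{k}{\sqrt{epoch\_num}}\cdot \alpha_0 \quad \text{or} \quad \dfrac{k}{\sqrt{t}}\cdot \alpha_0 $$

(4) Fourth:

Set $\alpha$ to a piecewise-constant function of $t$, so that it drops in steps as $t$ grows.

(5) Fifth:

Watch the training logs and adjust the learning rate by hand.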
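The first four schedules are easy to write down as functions. This is a sketch only: the function names are hypothetical, and the `drop` and `every` parameters of the staircase schedule are assumptions, since the text does not fix how large or how frequent the steps are.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """(1) alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """(2) alpha = base ** epoch_num * alpha0."""
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """(3) alpha = k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1."""
    return k / math.sqrt(epoch_num) * alpha0

def staircase_decay(alpha0, epoch_num, drop=0.5, every=10):
    """(4) Multiply alpha by `drop` once every `every` epochs."""
    return alpha0 * drop ** (epoch_num // every)

for epoch in (1, 2, 3, 4):
    print(epoch, round(inverse_decay(0.2, 1.0, epoch), 3))
```

With $\alpha_0 = 0.2$ and $decay\_rate = 1$, the first schedule halves the learning rate after one epoch and then decreases it more and more slowly.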

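7.2 Setting the learning rate in TensorFlow

TensorFlow offers many ways of setting the learning rate. Since this article is about optimization algorithms for gradient descent, covering them here would take a lot of space and bloat the text, so they are summarized in a separate post for my own reference:

Ways of setting the learning rate in TensorFlow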

8. The problem of local optima

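When gradient descent is used to reduce the cost function, it may end up at a local optimum rather than the global optimum.

The local optima we tend to imagine look like the left side of the figure below. In neural networks, however, the picture is different: most points where the gradient is zero are not such basins but saddle-shaped points like the one on the right side of the figure below, called saddle points.

Plateaus around saddle points slow down learning. A plateau is a flat region where the gradient is close to zero, as shown in the figure below. On a plateau the gradient is small and progress is slow, so reaching the saddle point takes a long time. Once the saddle point is reached, random perturbations allow the descent to continue, but a lot of time has already been spent on the plateau.

Gradient descent with momentum, RMSprop, and Adam help with the slow progress on plateaus and speed up training.

The end!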
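To make the saddle-point picture concrete, consider the toy cost $J(x, y) = x^2 - y^2$ (an illustrative function, not from the course; it is unbounded below, so it only shows the local behaviour). Its gradient is zero at the origin, which is a minimum along $x$ but a maximum along $y$. Plain gradient descent started exactly on the $x$ axis stalls at the saddle, while a tiny perturbation in $y$ lets it move away:

```python
def grad(p):
    """Gradient of the toy cost J(x, y) = x**2 - y**2."""
    x, y = p
    return (2 * x, -2 * y)

def gradient_descent(p, alpha=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(p)
        p = (p[0] - alpha * gx, p[1] - alpha * gy)
    return p

on_axis = gradient_descent((1.0, 0.0))     # no component along y
perturbed = gradient_descent((1.0, 1e-6))  # tiny perturbation along y
print(on_axis)
print(perturbed)
```

The first run converges to the saddle point and stays there; in the second run the $y$ component grows by a factor of 1.2 per step, so the path leaves the saddle.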