Adam, with its automatic per-parameter step sizes, is already very capable, but it is mainly popular in industry. In most research labs, models are still tuned with plain SGD, on top of which various hand-designed learning-rate schedules are layered, so that complex models can be tuned finely enough to squeeze out the highest scores.
The ICLR paper On the Convergence of Adam and Beyond launches a pointed attack on the Adam algorithm and proposes a new Adam variant.
Earlier posts: Optimization Methods: GD and SGD; A Summary of Regression/Fitting Methods in Optimization; Optimization Methods: SGD and Adam.
Further reading: A Summary of GD Optimization Algorithms -- it shows that every step in the evolution of these update formulas was hard-won.
The Adam Algorithm
Conclusions first:
1. Adam can be viewed as a bias-corrected combination of Momentum and RMSProp.
2. The momentum term is folded directly into the (exponentially weighted) first-moment estimate of the gradient.
3. Adam is generally considered quite robust to the choice of hyperparameters.
4. The recommended learning rate is 0.001.
Looking at the algorithm itself: it is essentially Momentum combined with RMSProp, followed by a bias-correction step.
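As an illustration, here is a minimal, framework-free sketch of a single Adam update for a scalar parameter; the function name `adam_step` and the `grad` callback are made up for this example, and the defaults follow the commonly cited values (lr = 0.001, β1 = 0.9, β2 = 0.999).

```python
import numpy as np

# Minimal sketch of one Adam update for a scalar parameter.
# `grad` is a caller-supplied gradient function.
def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g          # Momentum part: EMA of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp part: EMA of squared gradients (2nd moment)
    m_hat = m / (1 - beta1 ** t)             # bias correction of the 1st moment
    v_hat = v / (1 - beta2 ** t)             # bias correction of the 2nd moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Example: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    x, m, v = adam_step(x, lambda x: 2 * x, m, v, t)
print(x)  # approaches 0
```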
Problems with Adam
Reference: "Adam is so great, why do we still cling to SGD?" (Part 2)
1. Adam may fail to converge
The effective learning rates of the major optimizers, as written in that article:
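The formula itself appears as an image in the original post; a plausible reconstruction, following the unified update rule used in that reference (α is the base step size, m_t the first-moment estimate, V_t the second-moment estimate), is:

$$\Delta\theta_t = -\,\eta_t\, m_t, \qquad \eta_t = \frac{\alpha}{\sqrt{V_t}+\epsilon}$$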
SGD does not use a second moment, so its learning rate is constant (in practice a decay schedule is usually applied, so it decreases). AdaGrad's second moment keeps accumulating and grows monotonically, so its learning rate decreases monotonically. In both cases the learning rate keeps shrinking and eventually decays toward zero, so the model converges.
AdaDelta and Adam are different. Their second moment is accumulated over an effectively fixed time window, and as that window slides, the data it covers can change drastically, so the second moment (and hence the learning rate) can swing up and down rather than change monotonically. Late in training this can make the learning rate oscillate, and the model may fail to converge.
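A quick numerical illustration of that difference (the gradient stream and all constants below are made up): AdaGrad's accumulated second moment can only grow, so its step size can only shrink, while an Adam/RMSProp-style exponential moving average forgets old spikes, so the step size can bounce back up.

```python
import numpy as np

# Made-up gradient stream: mostly small gradients with an occasional large spike.
np.random.seed(0)
grads = np.where(np.arange(200) % 50 == 0, 5.0, 0.05 * np.random.randn(200))

alpha, beta2, eps = 0.01, 0.9, 1e-8
v_adagrad, v_ema = 0.0, 0.0

for t, g in enumerate(grads, start=1):
    v_adagrad += g ** 2                           # AdaGrad: monotone accumulation
    v_ema = beta2 * v_ema + (1 - beta2) * g ** 2  # Adam/RMSProp-style: fixed-window EMA
    lr_adagrad = alpha / (np.sqrt(v_adagrad) + eps)  # can only shrink
    lr_ema = alpha / (np.sqrt(v_ema) + eps)          # can grow back after a spike
    if t in (50, 51, 60, 80):
        print(t, f"adagrad={lr_adagrad:.5f}", f"ema={lr_ema:.5f}")
```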
2. Adam may miss the global optimum
Improving on Adam: variant methods
Conference reviews: On the Convergence of Adam and Beyond; paper: https://openreview.net/pdf?id=ryQu7f-RZ.
1. THE NON-CONVERGENCE OF ADAM
With the problem setup in the previous section, we discuss a fundamental flaw in the current exponential moving average methods like ADAM. We show that ADAM can fail to converge to an optimal solution even in simple one-dimensional convex settings. These examples of non-convergence contradict the claim of convergence in (Kingma & Ba, 2015), and the main issue lies in the following quantity of interest:
$$\Gamma_{t+1} = \frac{\sqrt{V_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{V_t}}{\alpha_t} \qquad (2)$$
This quantity essentially measures the change in the inverse of learning rate of the adaptive method with respect to time. One key observation is that for SGD and ADAGRAD, Γ_t ⪰ 0 for all t ∈ [T]. This simply follows from update rules of SGD and ADAGRAD in the previous section. In particular, update rules for these algorithms lead to "non-increasing" learning rates. However, this is not necessarily the case for exponential moving average variants like ADAM and RMSPROP, i.e., Γ_t can potentially be indefinite for t ∈ [T]. We show that this violation of positive definiteness can lead to undesirable convergence behavior for ADAM and RMSPROP. Consider the following simple sequence of linear functions for F = [−1, 1]:

$$f_t(x) = \begin{cases} Cx, & \text{for } t \bmod 3 = 1 \\ -x, & \text{otherwise,} \end{cases}$$
where C > 2. For this function sequence, it is easy to see that the point x = −1 provides the minimum regret. Suppose β1 = 0 and β2 = 1/(1 + C²). We show that ADAM converges to a highly suboptimal solution of x = +1 for this setting. Intuitively, the reasoning is as follows. The algorithm obtains the large gradient C once every 3 steps, and while the other 2 steps it observes the gradient −1, which moves the algorithm in the wrong direction. The large gradient C is unable to counteract this effect since it is scaled down by a factor of almost C for the given value of β2, and hence the algorithm converges to +1 rather than −1. We formalize this intuition in the result below.
Theorem 1. There is an online convex optimization problem where ADAM has non-zero average regret, i.e., R_T/T ↛ 0 as T → ∞.
We relegate all proofs to the appendix.
In other words, the RMSProp/Adam formulas built on exponential moving averages are flawed. The quantity Γ_t above essentially captures how the inverse of the learning rate of these "adaptive learning rate" methods changes over time. For SGD and AdaGrad, Γ_t ≥ 0 for all t ∈ [T]; this follows directly from their update rules, so their learning rates are always monotonically non-increasing. RMSProp and Adam, which rely on exponential moving averages, cannot guarantee this: for t ∈ [T], their Γ_t may be non-negative or negative. That can make the learning rate oscillate back and forth, and the model may then fail to converge.
Take the simple piecewise-linear example above on F = [−1, 1] with C > 2: it is easy to see that the iterates should converge to x = −1. But run Adam with the momentum hyperparameters β1 = 0 and β2 = 1/(1 + C²), and the algorithm converges to x = +1 instead. The intuition: in every block of three steps the algorithm sees the large gradient C once, and in the other two steps it sees the gradient −1, which pushes it in the wrong direction. After the exponential moving average is applied, the large gradient C is scaled down by a factor of roughly C for this choice of β2, so it cannot cancel out the two wrong-direction steps, and the iterates drift away from the correct solution.
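To make the argument concrete, here is a small simulation of that counterexample (C = 10 and the constant step size lr = 0.1 are arbitrary choices for this demo; bias correction is omitted, matching the simplified setting analysed in the paper):

```python
import numpy as np

# The paper's 1-D online counterexample: f_t(x) = C*x when t % 3 == 1,
# f_t(x) = -x otherwise, with x constrained to F = [-1, 1]; the optimum is x = -1.
C = 10.0
beta1, beta2 = 0.0, 1.0 / (1.0 + C ** 2)
lr, eps = 0.1, 1e-8

x, m, v = 0.0, 0.0, 0.0
for t in range(1, 30001):
    g = C if t % 3 == 1 else -1.0              # gradient of the current f_t at any x
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    x = float(np.clip(x - lr * m / (np.sqrt(v) + eps), -1.0, 1.0))

print(x)  # ends up pinned against +1, far from the optimum -1
```

Each three-step cycle moves x slightly toward +1 on balance: the two −1 gradients are divided by a small √v, while the single +C gradient is divided by roughly C.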
2. Now recall the familiar plots of how the various optimizers iterate near an optimum and near a saddle point.
Judging from those plots, SGD looks like the "dumbest" method of all, yet the paper The Marginal Value of Adaptive Gradient Methods in Machine Learning reaches the opposite conclusion:
Models trained with adaptive optimizers usually end up worse than those trained with SGD, even though the adaptive methods look better during training. Practitioners should therefore be cautious about using adaptive optimizers. The adaptive methods behave a bit like over-training: the resulting model is overfit with respect to the overall data distribution.
The authors put it this way: "our experimental evidence demonstrates that adaptive methods are not advantageous for machine learning, [yet] the Adam algorithm remains incredibly popular. We are not sure exactly as to why, but hope that our step-size tuning suggestions make it easier for practitioners to use standard stochastic gradient methods in their research." That said, this does not seem to be Adam's main weakness.
The Adam variant AMSGrad
Since Γ_t can be negative under RMSProp and Adam, the paper constrains the algorithms so that Γ_t stays non-negative, and it also explores an alternative approach in which the hyperparameters β1 and β2 change with t.
For the first part, we modify these algorithms to satisfy this additional constraint. Later on, we also explore an alternative approach where Γ_t can be made positive semi-definite by using values of β1 and β2 that change with t.
AMSGRAD uses a smaller learning rate in comparison to ADAM and yet incorporates the intuition of slowly decaying the effect of past gradients on the learning rate as long as Γ_t is positive semidefinite.
By adding this extra constraint, AMSGrad guarantees that the effective learning rate never increases; the price is that, most of the time, AMSGrad's learning rate is smaller than Adam's or RMSProp's. The key difference is that AMSGrad keeps the maximum of all second-moment estimates v_t seen so far and uses that maximum to normalize the update, whereas Adam uses the exponential moving average. As a result, Γ_t ≥ 0 also holds for AMSGrad for all t ∈ [T].
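In code, the change from Adam to AMSGrad amounts to one extra line: track the running maximum of the second-moment estimate and divide by that instead (a sketch only; the names are placeholders, and bias correction is left out, as in the paper's AMSGrad pseudocode).

```python
import numpy as np

# Sketch of the AMSGrad modification: normalize by the running maximum of the
# second-moment estimate instead of the current EMA value.
def amsgrad_step(x, g, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = max(v_max, v)                   # the key change: the denominator never shrinks
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```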
Experimental results
Combining multiple methods
The paper Improving Generalization Performance by Switching from Adam to SGD ran experiments on CIFAR-10: Adam converges faster than SGD, but its final result is not as good. Digging further, the authors found that the main culprit is Adam's learning rate becoming too small late in training, which hampers effective convergence. When they put a lower bound on Adam's learning rate, the results improved considerably.
So they proposed a scheme for improving Adam: use Adam early on to enjoy its fast convergence, then switch to SGD later to slowly home in on the optimum. Researchers have used this trick before, but the switching time and the post-switch learning rate were usually chosen by hand. This paper automates the transition: it gives a criterion for when to switch to SGD and a formula for the SGD learning rate, and the reported results look good.
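A minimal sketch of the "Adam first, SGD later" idea in PyTorch is below; the switch epoch, the SGD learning rate, and the dummy model/data are hand-picked placeholders, whereas the paper derives both the switch point and the SGD step size automatically from the Adam trajectory.

```python
import torch
import torch.nn as nn

# Toy model and data standing in for a real training setup.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
switch_epoch = 50  # hand-picked here; the paper chooses this automatically

for epoch in range(100):
    if epoch == switch_epoch:
        # Late phase: hand over to plain SGD with momentum.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    x = torch.randn(32, 10)   # dummy batch
    y = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```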
Timing is everything: rather than relying on the adaptive rule alone, analysing the data first and then switching manually between fixed optimizers can, in principle, work even better. Alternatively, the Adam algorithm itself can be modified to handle such novel cases and guarantee its convergence.