李宏毅 (Hung-yi Lee) Machine Learning Notes --- Gradient Descent

gradient

  • gradient is a vector
  • the gradient is the normal direction of the contour lines

set the learning rate carefully

  • too small: convergence takes far too many iterations
  • too large: the update overshoots and cannot find the minimum (see the sketch after this list)
  • with more than three parameters you cannot visualize the loss surface, so plot the loss against the number of updates to check the learning rate instead
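
A minimal sketch of this learning-rate trade-off on a toy quadratic loss $L(w)=w^2$ (the loss function, step counts, and learning rates here are illustrative choices, not the lecture's example):

```python
def gradient_descent(grad, w0, lr, steps=50):
    """Plain gradient descent: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy loss L(w) = w^2, whose gradient is 2w and whose minimum is at w = 0.
grad = lambda w: 2 * w

print(gradient_descent(grad, w0=10.0, lr=0.01))  # too small: still far from 0 after 50 steps
print(gradient_descent(grad, w0=10.0, lr=0.4))   # reasonable: converges very close to 0
print(gradient_descent(grad, w0=10.0, lr=1.1))   # too large: every step overshoots and diverges
```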

adaptive learning rate (adagrad)

  • 1/t decay: $ \eta^t = \frac{\eta}{\sqrt{t+1}}$
  • vanilla gradient descent vs. Adagrad (the two update rules are written out after this list)
  • the apparent contradiction in Adagrad's learning rate: a larger gradient enlarges the step through the numerator, yet also shrinks it through the accumulated denominator
  • larger gradient, larger step? Once more than one parameter is involved, not necessarily: the parameter with the larger gradient is not always farther from its minimum
  • the best step size is $\frac{|\text{first derivative}|}{\text{second derivative}}$; Adagrad uses the first derivatives it already computes to estimate the size of the second derivative, without extra computation
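
The comparison in the first bullet above, written out. This follows the standard Adagrad formulation (with $g^t$ denoting the gradient at step $t$); combined with the 1/t-decayed learning rate above, the $\sqrt{t+1}$ factors cancel:

$$\text{vanilla: } w^{t+1} = w^t - \eta^t g^t$$

$$\text{Adagrad: } w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t,\quad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}\ \Rightarrow\ w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$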

stochastic gradient descent

  • in $L(\theta)$, instead of summing the error over all training examples, consider only the error of the current randomly picked example and update after each one (a sketch follows below)
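
A minimal stochastic gradient descent sketch on a toy one-parameter linear regression; the data, learning rate, and epoch count are illustrative assumptions, not the course's example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                  # toy inputs
y = 3.0 * X + rng.normal(0.0, 0.1, 100)   # toy targets, true weight = 3

w, lr = 0.0, 0.05
for epoch in range(5):
    for i in rng.permutation(len(X)):     # one randomly chosen example at a time
        err = w * X[i] - y[i]             # error on the current example only
        w -= lr * 2.0 * err * X[i]        # step on this single example's gradient
print(w)                                  # ends up close to the true weight 3.0
```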

feature scaling

  • make different features have the same scale
  • the contour plot then becomes close to circular, so each update points toward the minimum and learning is more efficient (see the sketch after this list)
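
A minimal feature-scaling (standardization) sketch; the sample matrix is made up for illustration:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features with very different ranges, e.g. x1 around 1 and x2 in the hundreds.
X = np.array([[0.2, 300.0],
              [0.5, 900.0],
              [0.9, 100.0]])
print(standardize(X))  # every column now has mean 0 and standard deviation 1
```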

gradient descent is based on the Taylor series

e.g.: pick a point $(a,b)$ on the contour plot at random as the initial point, draw a small circle centred on it, and look for the point $\theta=(\theta_1,\theta_2)$ inside the circle that minimizes the loss function. Within a small enough circle, the first-order Taylor expansion gives $$ L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$$ Let $u=\frac{\partial L(a,b)}{\partial \theta_1}$, $v=\frac{\partial L(a,b)}{\partial \theta_2}$, $\triangle \theta_1=\theta_1-a$, $\triangle \theta_2=\theta_2-b$, and recall the inner product of two vectors, $\vec{a}\cdot\vec{b}=|\vec{a}|\,|\vec{b}|\cos(\vec{a},\vec{b})$.

In this expression $L(a,b)$, $u$, and $v$ are constants, so $L(\theta)$ is minimized when the inner product $(u,v)\cdot(\triangle \theta_1,\triangle \theta_2)$ is smallest.

To make the inner product as small as possible, the cosine should be $-1$ and the vector lengths as large as the circle allows, i.e. $(\triangle \theta_1,\triangle \theta_2)$ should point opposite to $(u,v)$: $(\triangle \theta_1,\triangle \theta_2)=-\eta (u,v)$ for some positive constant $\eta$, which simplifies to $(\theta_1,\theta_2)=(a,b)-\eta (u,v)$.

The expression above is the basic gradient descent formula, where $(u,v)=(\frac{\partial L(a,b)}{\partial \theta_1},\frac{\partial L(a,b)}{\partial \theta_2})$ is the gradient.
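
Written in vector form for any number of parameters, the same derivation gives the usual update rule; it relies on the first-order Taylor approximation being accurate, which is why the neighbourhood (and therefore $\eta$) must be small enough:

$$\theta^{t+1} = \theta^t - \eta \nabla L(\theta^t)$$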
