gradient
- gradient is a vector
- the gradient points along the normal direction of the contour lines of the loss
set the learning rate carefully
- too small: iteration takes too long to converge
- too large: the updates overshoot and cannot find the minima (see the sketch after this list)
- if there are more than three parameters, you cannot visualize the loss surface.
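A minimal sketch of the learning-rate trade-off; the 1-D loss $L(w)=w^2$, the starting point, and the step count are assumptions chosen only for illustration:

```python
# Assumed toy loss: L(w) = w**2, so dL/dw = 2*w and the minimum is at w = 0.
def gradient_descent(eta, steps=20, w0=5.0):
    w = w0
    for _ in range(steps):
        grad = 2 * w        # gradient of L(w) = w^2
        w = w - eta * grad  # vanilla update: w <- w - eta * dL/dw
    return w

print(gradient_descent(eta=0.01))  # too small: after 20 steps w is still far from 0
print(gradient_descent(eta=0.1))   # reasonable: w ends up close to 0
print(gradient_descent(eta=1.1))   # too large: w oscillates and diverges
```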
adaptive learning rate (adagrad)
- 1/t decay: $ \eta^t = \frac{\eta}{\sqrt{t+1}}$
- vanilla gradient descent: $w^{t+1} = w^t - \eta^t g^t$ | adagrad: $w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t$, where $\sigma^t$ is the root mean square of all past gradients of that parameter
- the apparent contradiction in adagrad's learning rate: a larger gradient $g^t$ enlarges the step through the numerator, yet it also enlarges $\sigma^t$ in the denominator, which shrinks the step
- larger gradient, larger step? | when comparing across different parameters, not necessarily!
- the best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$ | adagrad uses the first derivatives to estimate the size of the second derivative without adding extra computation (see the sketch below)
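A minimal numpy sketch of the adagrad update above; the badly scaled quadratic loss, the value of eta, and the small epsilon term are assumptions chosen to show the per-parameter adaptation:

```python
import numpy as np

# Assumed toy loss: L(w) = w1^2 + 100 * w2^2 (the two parameters have very different curvature).
def grad(w):
    return np.array([2 * w[0], 200 * w[1]])

eta = 0.5
w = np.array([1.0, 1.0])
sum_sq_grad = np.zeros_like(w)   # accumulated squared gradients, kept per parameter

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    # adagrad: divide the step by the root of the accumulated squared gradients,
    # so each parameter effectively gets its own learning rate.
    w = w - eta / (np.sqrt(sum_sq_grad) + 1e-8) * g

print(w)  # both parameters approach 0 despite the 100x difference in curvature
```

With the $\frac{\eta}{\sqrt{t+1}}$ decay from above, the $\sqrt{t+1}$ factors in $\eta^t$ and $\sigma^t$ cancel, and the effective step becomes $\frac{\eta}{\sqrt{\sum_i (g^i)^2}} g^t$, which is what the loop implements.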
stochastic gradient descent
- $L(\theta)$ changes from summing the errors over all training examples to using only the error of the current example, so the parameters are updated after each example (see the sketch below)
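A minimal sketch contrasting the two update schemes on a one-parameter linear model; the synthetic data, learning rate, and epoch count are assumptions for illustration:

```python
import numpy as np

# Assumed synthetic data: y ~= 3x, model y_hat = w * x, squared error loss.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 0.1 * rng.normal(size=100)

eta = 0.1
w_batch, w_sgd = 0.0, 0.0

for epoch in range(100):
    # Batch gradient descent: one update per epoch, using the gradient of the mean loss.
    grad_all = np.mean(-2 * x * (y - w_batch * x))
    w_batch -= eta * grad_all

    # Stochastic gradient descent: one update per example, using only that example's error.
    for xi, yi in zip(x, y):
        grad_i = -2 * xi * (yi - w_sgd * xi)
        w_sgd -= eta * grad_i

print(w_batch, w_sgd)  # both end up near 3; SGD makes 100 cheap, noisy updates per epoch
```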
feature scaling
- make different features have the same scale
- the contour plot becomes (close to) circular, so each update points more directly toward the minimum and the descent is more efficient (see the sketch below)
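A minimal standardization sketch, one common way to give features the same scale; the toy matrix is an assumption:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Assumed example: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
print(standardize(X))  # every column now has mean 0 and standard deviation 1
```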
gradient descent is based on the Taylor series
e.g., randomly pick a point $(a,b)$ on the contour plot as the initial point, draw a small circle centered on it, and look for the point $\theta=(\theta_1,\theta_2)$ inside the circle that minimizes the loss function. The first-order Taylor expansion around $(a,b)$ gives $$ L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$$ Let $u=\frac{\partial L(a,b)}{\partial \theta_1}$, $v=\frac{\partial L(a,b)}{\partial \theta_2}$, $\Delta\theta_1=\theta_1-a$, $\Delta\theta_2=\theta_2-b$, and recall the inner product definition $a\cdot b=|a||b|\cos(a,b)$.
In $L(\theta)$, the terms $L(a,b)$, $u$, and $v$ are constants, so the minimum of $L(\theta)$ is found when the inner product $(u,v)\cdot(\Delta\theta_1,\Delta\theta_2)$ is as small as possible.
To make the inner product smallest, the two vectors should satisfy $\cos=-1$ (point in opposite directions) and $|(\Delta\theta_1,\Delta\theta_2)|$ should be as large as the circle allows, i.e. $(\Delta\theta_1,\Delta\theta_2)=-\eta(u,v)$ for some positive constant $\eta$; simplifying gives $(\theta_1,\theta_2)=(a,b)-\eta(u,v)$.
The expression above is exactly the original gradient descent formula, where $(u,v)=(\frac{\partial L(a,b)}{\partial \theta_1},\frac{\partial L(a,b)}{\partial \theta_2})$ is the gradient. Note that the Taylor approximation is accurate only when the circle is small, which is why the learning rate must not be too large.
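A small numerical check of this derivation; the quadratic loss and the expansion point $(a,b)$ below are assumptions chosen to make the sketch concrete. On a small circle around $(a,b)$, the lowest-loss point lies along $-(u,v)$, as the Taylor argument predicts:

```python
import numpy as np

# Assumed toy loss L(t1, t2) = t1^2 + 4*t2^2, expanded around (a, b) = (2, 1).
def L(t1, t2):
    return t1 ** 2 + 4 * t2 ** 2

a, b = 2.0, 1.0
u, v = 2 * a, 8 * b                  # partial derivatives of L at (a, b): the gradient (u, v)

# Sample points on a small circle of radius r around (a, b) and keep the one with
# the smallest loss; its direction from (a, b) should match -(u, v).
r = 1e-3
angles = np.linspace(0, 2 * np.pi, 3600, endpoint=False)
points = np.stack([a + r * np.cos(angles), b + r * np.sin(angles)], axis=1)
best = points[np.argmin(L(points[:, 0], points[:, 1]))]

direction = (best - np.array([a, b])) / r       # unit vector toward the best point
neg_grad = -np.array([u, v]) / np.hypot(u, v)   # unit vector along -(u, v)
print(direction, neg_grad)                      # the two directions nearly coincide
```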