gradient
- gradient is a vector
- the gradient points along the normal direction of the contour lines of the loss
set the learning rate carefully
- too small: iteration takes too long to converge
- too large: the updates overshoot and cannot find the minima (see the sketch after this list)
- if there are more than three parameters, you cannot visualize the loss surface.
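A minimal sketch of the learning-rate trade-off; the 1-D loss $L(w)=w^2$, the starting point, and the step count are assumptions chosen only for illustration:

```python
# Assumed toy loss: L(w) = w**2, so dL/dw = 2*w and the minimum is at w = 0.
def gradient_descent(eta, steps=20, w0=5.0):
    w = w0
    for _ in range(steps):
        grad = 2 * w        # gradient of L(w) = w^2
        w = w - eta * grad  # vanilla update: w <- w - eta * dL/dw
    return w

print(gradient_descent(eta=0.01))  # too small: after 20 steps w is still far from 0
print(gradient_descent(eta=0.1))   # reasonable: w ends up close to 0
print(gradient_descent(eta=1.1))   # too large: w oscillates and diverges
```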
adaptive learning rate (adagrad)
- 1/t decay: $ \eta^t = \frac{\eta}{\sqrt{t+1}}$
- vanilla gradient descent: $w^{t+1} = w^t - \eta^t g^t$ | adagrad: $w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t$, where $\sigma^t$ is the root mean square of all past gradients of that parameter
- the apparent contradiction in adagrad's learning rate: a larger gradient $g^t$ enlarges the step through the numerator, yet it also enlarges $\sigma^t$ in the denominator, which shrinks the step
- larger gradient, larger step? | when comparing across different parameters, not necessarily!
- the best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$ | adagrad uses the first derivatives to estimate the size of the second derivative without adding extra computation (see the sketch below)
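A minimal numpy sketch of the adagrad update above; the badly scaled quadratic loss, the value of eta, and the small epsilon term are assumptions chosen to show the per-parameter adaptation:

```python
import numpy as np

# Assumed toy loss: L(w) = w1^2 + 100 * w2^2 (the two parameters have very different curvature).
def grad(w):
    return np.array([2 * w[0], 200 * w[1]])

eta = 0.5
w = np.array([1.0, 1.0])
sum_sq_grad = np.zeros_like(w)   # accumulated squared gradients, kept per parameter

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    # adagrad: divide the step by the root of the accumulated squared gradients,
    # so each parameter effectively gets its own learning rate.
    w = w - eta / (np.sqrt(sum_sq_grad) + 1e-8) * g

print(w)  # both parameters approach 0 despite the 100x difference in curvature
```

With the $\frac{\eta}{\sqrt{t+1}}$ decay from above, the $\sqrt{t+1}$ factors in $\eta^t$ and $\sigma^t$ cancel, and the effective step becomes $\frac{\eta}{\sqrt{\sum_i (g^i)^2}} g^t$, which is what the loop implements.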
stochastic gradient descent
- $L(\theta)$ changes from summing the errors over all training examples to using only the error of the current example, so the parameters are updated after each example (see the sketch below)
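A minimal sketch contrasting the two update schemes on a one-parameter linear model; the synthetic data, learning rate, and epoch count are assumptions for illustration:

```python
import numpy as np

# Assumed synthetic data: y ~= 3x, model y_hat = w * x, squared error loss.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 0.1 * rng.normal(size=100)

eta = 0.1
w_batch, w_sgd = 0.0, 0.0

for epoch in range(100):
    # Batch gradient descent: one update per epoch, using the gradient of the mean loss.
    grad_all = np.mean(-2 * x * (y - w_batch * x))
    w_batch -= eta * grad_all

    # Stochastic gradient descent: one update per example, using only that example's error.
    for xi, yi in zip(x, y):
        grad_i = -2 * xi * (yi - w_sgd * xi)
        w_sgd -= eta * grad_i

print(w_batch, w_sgd)  # both end up near 3; SGD makes 100 cheap, noisy updates per epoch
```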
feature scaling
- make different features have the same scale
- the contour plot becomes (close to) circular, so each update points more directly toward the minimum and the descent is more efficient (see the sketch below)
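A minimal standardization sketch, one common way to give features the same scale; the toy matrix is an assumption:

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Assumed example: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
print(standardize(X))  # every column now has mean 0 and standard deviation 1
```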
gradient descent is based on the Taylor series
e.g., randomly pick a point $(a,b)$ on the contour plot as the initial point, draw a small circle centered on it, and look for the point $\theta=(\theta_1,\theta_2)$ inside the circle that minimizes the loss function. The first-order Taylor expansion around $(a,b)$ gives $$ L(\theta)\approx L(a,b)+\frac{\partial L(a,b)}{\partial \theta_1}(\theta_1-a)+\frac{\partial L(a,b)}{\partial \theta_2}(\theta_2-b)$$ Let $u=\frac{\partial L(a,b)}{\partial \theta_1}$, $v=\frac{\partial L(a,b)}{\partial \theta_2}$, $\Delta\theta_1=\theta_1-a$, $\Delta\theta_2=\theta_2-b$, and recall the inner product definition $a\cdot b=|a||b|\cos(a,b)$.
In $L(\theta)$, the terms $L(a,b)$, $u$, and $v$ are constants, so the minimum of $L(\theta)$ is found when the inner product $(u,v)\cdot(\Delta\theta_1,\Delta\theta_2)$ is as small as possible.
To make the inner product smallest, the two vectors should satisfy $\cos=-1$ (point in opposite directions) and $|(\Delta\theta_1,\Delta\theta_2)|$ should be as large as the circle allows, i.e. $(\Delta\theta_1,\Delta\theta_2)=-\eta(u,v)$ for some positive constant $\eta$; simplifying gives $(\theta_1,\theta_2)=(a,b)-\eta(u,v)$.
The expression above is exactly the original gradient descent formula, where $(u,v)=(\frac{\partial L(a,b)}{\partial \theta_1},\frac{\partial L(a,b)}{\partial \theta_2})$ is the gradient. Note that the Taylor approximation is accurate only when the circle is small, which is why the learning rate must not be too large.
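A small numerical check of this derivation; the quadratic loss and the expansion point $(a,b)$ below are assumptions chosen to make the sketch concrete. On a small circle around $(a,b)$, the lowest-loss point lies along $-(u,v)$, as the Taylor argument predicts:

```python
import numpy as np

# Assumed toy loss L(t1, t2) = t1^2 + 4*t2^2, expanded around (a, b) = (2, 1).
def L(t1, t2):
    return t1 ** 2 + 4 * t2 ** 2

a, b = 2.0, 1.0
u, v = 2 * a, 8 * b                  # partial derivatives of L at (a, b): the gradient (u, v)

# Sample points on a small circle of radius r around (a, b) and keep the one with
# the smallest loss; its direction from (a, b) should match -(u, v).
r = 1e-3
angles = np.linspace(0, 2 * np.pi, 3600, endpoint=False)
points = np.stack([a + r * np.cos(angles), b + r * np.sin(angles)], axis=1)
best = points[np.argmin(L(points[:, 0], points[:, 1]))]

direction = (best - np.array([a, b])) / r       # unit vector toward the best point
neg_grad = -np.array([u, v]) / np.hypot(u, v)   # unit vector along -(u, v)
print(direction, neg_grad)                      # the two directions nearly coincide
```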