Gradient Descent and Newton's Method

The general form of the Taylor series is given in equation (1):
$$\begin{equation}f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}\tag{1}\end{equation}$$
In particular, the first-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)\tag{2}\end{equation}$$
and the second-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)+f^{''}(x_0)\frac{(x-x_0)^2}{2}\tag{3}\end{equation}$$
Let $x^t = x^{t-1}+\Delta x$. The second-order Taylor expansion of $f(x^t)$ can then be written as
$$\begin{equation}f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f^{'}(x^{t-1})\Delta x+f^{''}(x^{t-1})\frac{\Delta x^2}{2}\tag{4}\end{equation}$$
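As a quick numerical sanity check of expansion (4), the following short sketch (an example of my own, not from the code below; $f(x)=e^x$ is chosen because all of its derivatives are $e^x$) compares the exact value with the second-order approximation:

import math

# Second-order Taylor approximation (4) of f(x) = exp(x) around x0, compared with the exact value.
x0, dx = 1.0, 0.1
exact = math.exp(x0 + dx)
approx = math.exp(x0) + math.exp(x0) * dx + math.exp(x0) * dx ** 2 / 2
print(exact, approx)   # the two values agree up to an O(dx**3) error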

Gradient Descent

In machine learning tasks, we need to minimize a loss function $L(\theta)$, where $\theta$ is the model parameter to be estimated. Gradient descent is commonly used to solve this kind of unconstrained optimization problem. It is an iterative algorithm: we first assign an initial value $\theta^0$, and then in each iteration we update $\theta^t$ so that the loss $L(\theta^t)$ decreases.
$$\begin{equation}\theta^t = \theta^{t-1}+\Delta \theta\tag{5}\end{equation}$$
Expanding $L(\theta^t)$ to first order around $\theta^{t-1}$:
$$\begin{equation}L(\theta^t)=L(\theta^{t-1}+\Delta \theta)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta\tag{6}\end{equation}$$
Each iteration of gradient descent must make the loss $L$ smaller, i.e., we need $L(\theta^t)<L(\theta^{t-1})$. Combining this with equation (6), we can take $\Delta \theta = -\alpha L^{'}(\theta^{t-1})$, which gives the update rule for $\theta$:
$$\begin{equation}\theta^t = \theta^{t-1}-\alpha L^{'}(\theta^{t-1})\tag{7}\end{equation}$$
Here $\alpha$ is the step size, usually called the learning rate, and is typically just set to a small value.
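To illustrate update rule (7) before the full sympy implementation later in this post, here is a minimal sketch (the example and names are my own) that minimizes $L(\theta)=(\theta-3)^2$ with a fixed learning rate:

# Gradient descent update (7) on L(theta) = (theta - 3)**2, whose gradient is 2*(theta - 3).
theta, alpha = 0.0, 0.1
for t in range(50):
    theta = theta - alpha * 2 * (theta - 3)
print(theta)   # approaches the minimizer theta = 3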

Newton's Method

Newton's method is, in its original form, a root-finding algorithm that only involves first derivatives. Deriving it from the first-order Taylor expansion: setting $L(\theta^t)=0$ in equation (6) yields
$$\begin{equation}\theta^t = \theta^{t-1}-\frac{L(\theta^{t-1})}{L^{'}(\theta^{t-1})}\tag{8}\end{equation}$$
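As a quick sketch of the root-finding form (8), assuming the simple example $f(x)=x^2-2$ (my own choice), the iteration converges to the root $\sqrt{2}$:

# Root-finding Newton update (8) on f(x) = x**2 - 2, with f'(x) = 2*x.
x = 1.0
for t in range(6):
    x = x - (x ** 2 - 2) / (2 * x)
print(x)   # converges to sqrt(2) ~ 1.4142135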
In machine learning tasks, however, Newton's method is used to find the optimum of an objective function. To find an optimum (an extremum), we need the root of the equation "derivative equals zero", which brings in the second derivative and turns the method into a second-order algorithm. Newton's method for optimization is iterative: in each iteration, around the current estimate $\theta^{t-1}$ of the minimizer, we take the second-order Taylor expansion of $L(\theta^{t})$ and use it to obtain the next estimate $\theta^t$.
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta + L^{''}(\theta^{t-1})\frac{\Delta \theta^2}{2}\tag{9}\end{equation}$$
Consider first the case where the parameter $\theta$ is a scalar, and write the first derivative $L^{'}$ and second derivative $L^{''}$ as $g$ and $h$ respectively:
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+g\Delta \theta + h\frac{\Delta \theta^2}{2}\tag{10}\end{equation}$$
To make $L(\theta^t)$ as small as possible, we minimize $g\Delta \theta + h\frac{\Delta \theta^2}{2}$ (treating $L(\theta^{t-1})$ as a constant). Setting $\frac{\partial(g\Delta \theta + h\frac{\Delta \theta^2}{2})}{\partial \Delta \theta}=0$ gives $\Delta \theta = -\frac{g}{h}$, and therefore
$$\begin{equation}\theta^{t} = \theta^{t-1}+\Delta \theta=\theta^{t-1}-\frac{g}{h}\tag{11}\end{equation}$$
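As a quick illustration of the scalar update (11), assuming the test function $L(\theta)=\theta-\ln\theta$ (an example of my own) with $g=1-1/\theta$ and $h=1/\theta^2$, the iteration converges to the minimizer $\theta=1$:

# Scalar Newton update (11) on L(theta) = theta - ln(theta); the minimizer is theta = 1.
theta = 0.5
for t in range(6):
    g = 1 - 1 / theta        # first derivative
    h = 1 / theta ** 2       # second derivative
    theta = theta - g / h
print(theta)                 # approaches 1.0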
Generalizing the parameter $\theta$ from a scalar to a vector, the update in each iteration becomes
$$\begin{equation}\theta^{t} =\theta^{t-1}-H^{-1}g\tag{12}\end{equation}$$
where $g$ is the gradient (Jacobian) vector, whose entries are the first-order partial derivatives, and $H$ is the Hessian matrix, whose entries are the second-order partial derivatives.
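As a small numpy illustration of update (12) (my own sketch, separate from the sympy implementation below), consider a quadratic loss $L(\theta)=\frac{1}{2}\theta^{T}A\theta-b^{T}\theta$, for which $g=A\theta-b$ and $H=A$; a single Newton step from any starting point lands exactly on the minimizer $A^{-1}b$.

import numpy as np

# One Newton step (equation (12)) on a quadratic loss with g = A*theta - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
theta = np.zeros(2)
g = A @ theta - b
theta = theta - np.linalg.inv(A) @ g
print(theta, np.linalg.solve(A, b))      # both print the minimizer A^{-1} b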

Comparison

  1. Both gradient descent and Newton's method are iterative solvers, but gradient descent uses only the gradient, while Newton's method also uses the inverse of the second-order Hessian matrix. Newton's method therefore converges in fewer iterations, though each iteration is more expensive than a gradient descent step. Intuitively, gradient descent only looks at your current position and steps in the steepest direction, whereas Newton's method, when choosing a direction, also considers whether the slope will become steeper after you take that step. In this sense Newton's method looks a bit further ahead and reaches the bottom faster; the sketch after this list illustrates why it needs fewer iterations.
  2. Gradient descent solves for an optimum, whereas the first-order Newton's method solves for a zero; only the second-order Newton's method solves for an optimum. Moreover, Newton's method can be derived rigorously from the Taylor expansion, while gradient descent cannot.
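To make point 1 concrete, here is a minimal sketch (an example of my own, not taken from the implementation below) comparing the two methods on the one-dimensional quadratic $L(\theta)=2\theta^2$: a single Newton step lands exactly on the minimizer, while gradient descent with a fixed learning rate only approaches it geometrically.

# Gradient descent vs. Newton's method on L(theta) = 2*theta**2 (gradient 4*theta, second derivative 4).
theta_gd, theta_nt, alpha = 10.0, 10.0, 0.1
for t in range(10):
    theta_gd = theta_gd - alpha * 4 * theta_gd      # equation (7)
    theta_nt = theta_nt - (4 * theta_nt) / 4.0      # equation (11): reaches 0 after the first step
    print(t + 1, theta_gd, theta_nt)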

Code Implementation

import numpy as np
import sympy
from sympy import symbols, diff
from sympy.tensor.array import derive_by_array

x1, x2, x3, x4 = symbols('x1, x2, x3, x4')
# Objective function; the commented-out lines are alternative test functions.
# Y = x1 ** 2 + x2 ** 2
# Y = x1**2 + 3*x1 - 3*x1*x2 + 2*x2**2 + 4*x2
Y = x1 ** 2 + x2 ** 2 - 4 * x1 - 6 * x2 + 13 + sympy.sqrt(x3) - sympy.sqrt(x4)
var_list = [x1, x2, x3, x4]

def gradient(f, X_, X):
    """Return the symbolic gradient of f w.r.t. the symbols X_ and its numeric value at the point X.
    (Identical to jacobian2; kept for the gradient-descent code path.)"""
    grad_ = sympy.Array([diff(f, x_) for x_ in X_])
    grad = grad_
    for i, x_ in enumerate(X_):
        grad = grad.subs(x_, X[i])
    return grad_, np.array(grad.tolist(), dtype=np.float32)

def jacobian2(f, X_, X):
    """Return the symbolic first-order derivatives of f and their numeric values at X."""
    G_ = sympy.Array([diff(f, x_) for x_ in X_])
    G = G_
    for i, x_ in enumerate(X_):
        G = G.subs(x_, X[i])
    return G_, np.array(G.tolist(), dtype=np.float32)

def hessian2(f, X_, X):
    """Return the symbolic Hessian of f and its numeric value at X."""
    H_ = sympy.Array([[diff(f, x_1).diff(x_2) for x_2 in X_] for x_1 in X_])
    H = H_
    for i, x_ in enumerate(X_):
        H = H.subs(x_, X[i])
    return H_, np.array(H.tolist(), dtype=np.float32)

def jacobian3():
    """Unused alternative: gradient of the global Y w.r.t. (x1, x2) via derive_by_array."""
    res = derive_by_array(Y, (x1, x2))
    return res

def hessian3():
    """Unused alternative: Hessian of the global Y w.r.t. (x1, x2) via derive_by_array."""
    res = derive_by_array(derive_by_array(Y, (x1, x2)), (x1, x2))
    return res

def newton(f, X_, X, iters):
    """
    Newton's method: X <- X - H^{-1} g, see equation (12).
    :param f: objective function (sympy expression)
    :param X_: list of sympy symbols
    :param X: initial point (numpy array)
    :param iters: number of iterations
    """
    G_, G = jacobian2(f, X_, X)
    H_, H = hessian2(f, X_, X)
    H_inv = np.linalg.inv(H)
    H_G = np.matmul(H_inv, G)
    X_new = X
    for i in range(iters):
        X_new = X_new - H_G
        print("Iteration {}: {}".format(i + 1, X_new))
        # Re-evaluate the gradient and Hessian at the new point for the next step.
        G_tmp = G_
        H_tmp = H_
        for j, x_ in enumerate(X_):
            H_tmp = H_tmp.subs(x_, X_new[j])
            G_tmp = G_tmp.subs(x_, X_new[j])
        H_inv = np.linalg.inv(np.array(H_tmp.tolist(), dtype=np.float32))
        H_G = np.matmul(H_inv, np.array(G_tmp.tolist(), dtype=np.float32))
    return X_new

def gradient_descent_1d_decay(f, X_, X, iters, learning_rate=0.01, decay=0.5):
    """Gradient descent, equation (7); the decay schedule below is left commented out."""
    for i in range(iters):
        # learning_rate = learning_rate * 1.0 / (1.0 + decay * i)
        _, grad = gradient(f, X_, X)
        X = X - learning_rate * grad
        # Clip to a small positive value so sqrt(x3) and sqrt(x4) stay real
        # and their gradients stay finite.
        X[X < 1e-6] = 1e-6
        print("Iteration {}: {}".format(i + 1, X))
    return X



if __name__ == "__main__":
    j2_, j2 = jacobian2(Y, var_list, np.array([1, 2, 1, 1]))
    h2_, h2 = hessian2(Y, var_list, np.array([1, 2, 1, 1]))
    print(j2_, j2)
    print(h2_, h2)
    print("newton:----------------------------")
    print(newton(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 20))
    print("gradient descent:----------------------------")
    print(gradient_descent_1d_decay(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 200))
