The basic form of Taylor's formula is shown in Eq. (1):
$$\begin{equation}f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}\tag{1}\end{equation}$$
In particular, the first-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)\tag{2}\end{equation}$$
and the second-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)+f^{''}(x_0)\frac{(x-x_0)^2}{2}\tag{3}\end{equation}$$
Let $x^t = x^{t-1}+\Delta x$; then the second-order Taylor expansion of $f(x^t)$ can be written as
$$\begin{equation}f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f^{'}(x^{t-1})\Delta x+f^{''}(x^{t-1})\frac{\Delta x^2}{2}\tag{4}\end{equation}$$
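As a quick sanity check of the second-order expansion in Eq. (3)/(4), the sketch below compares $f(x_0+\Delta x)$ with its second-order approximation. The test function $f(x)=e^{x}$ and the expansion point $x_0=0$ are assumptions chosen purely for illustration; they are not part of the derivation above.

```python
import math

# Second-order Taylor approximation of f around x0, evaluated at x0 + dx (Eq. (3)).
def taylor2(f, f1, f2, x0, dx):
    return f(x0) + f1(x0) * dx + f2(x0) * dx ** 2 / 2

# Illustrative assumption: f(x) = exp(x), so f' = f'' = exp; expand around x0 = 0.
f = f1 = f2 = math.exp
for dx in (0.5, 0.1, 0.01):
    exact = f(0.0 + dx)
    approx = taylor2(f, f1, f2, 0.0, dx)
    print("dx={}: exact={:.6f}, approx={:.6f}, error={:.2e}".format(dx, exact, approx, exact - approx))
```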
In machine learning tasks, we need to minimize a loss function $L(\theta)$, where $\theta$ is the model parameter to be estimated. Gradient descent is commonly used to solve this kind of unconstrained optimization problem. Gradient descent is an iterative algorithm: it starts by assigning an initial value $\theta^0$, and then at each iteration updates $\theta^t$ so that the loss $L(\theta^t)$ decreases.
$$\begin{equation}\theta^t = \theta^{t-1}+\Delta \theta\tag{5}\end{equation}$$
Taking the first-order Taylor expansion of $L(\theta^t)$ at $\theta^{t-1}$:
$$\begin{equation}L(\theta^t)=L(\theta^{t-1}+\Delta \theta)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta\tag{6}\end{equation}$$
At each iteration, gradient descent must make the loss function $L$ smaller, i.e. we require $L(\theta^t)<L(\theta^{t-1})$. Combining this with Eq. (6), we can take $\Delta \theta = -\alpha L^{'}(\theta^{t-1})$; this makes the first-order term $L^{'}(\theta^{t-1})\Delta \theta = -\alpha \left(L^{'}(\theta^{t-1})\right)^2 \leq 0$, so for a sufficiently small step the loss decreases. The update rule for $\theta$ is then
$$\begin{equation}\theta^t = \theta^{t-1}-\alpha L^{'}(\theta^{t-1})\tag{7}\end{equation}$$
Here $\alpha$ is the step size, commonly called the learning rate, and is usually just set to a small value.
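Below is a minimal sketch of the update in Eq. (7). The quadratic loss $L(\theta)=(\theta-3)^2$ and the chosen learning rate are illustrative assumptions, not part of the original text (a full multivariate implementation appears at the end of this post).

```python
# Gradient descent on the illustrative loss L(theta) = (theta - 3)^2, minimum at theta = 3.
def L_prime(theta):
    return 2 * (theta - 3)            # L'(theta)

theta = 0.0                           # initial value theta^0
alpha = 0.1                           # learning rate (step size)
for t in range(50):
    theta = theta - alpha * L_prime(theta)   # update rule of Eq. (7)
print(theta)                          # approaches 3
```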
Newton's method is, in itself, a first-order algorithm whose essence is root finding. To derive Newton's method from the first-order Taylor formula, set $L(\theta^t)=0$ in Eq. (6); this yields
$$\begin{equation}\theta^t = \theta^{t-1}-\frac{L(\theta^{t-1})}{L^{'}(\theta^{t-1})}\tag{8}\end{equation}$$
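For reference, here is a minimal sketch of the root-finding iteration in Eq. (8); the example equation $\theta^2-2=0$ (positive root $\sqrt{2}$) is an assumption chosen only for illustration.

```python
# Newton's method as a root finder: theta <- theta - L(theta) / L'(theta)  (Eq. (8)).
def L(theta):
    return theta ** 2 - 2             # illustrative equation L(theta) = 0

def L_prime(theta):
    return 2 * theta

theta = 1.0                           # initial guess
for t in range(6):
    theta = theta - L(theta) / L_prime(theta)
    print(theta)                      # converges rapidly to sqrt(2) ≈ 1.414213...
```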
In machine learning tasks, however, Newton's method is used to find the optimum of the objective function. When it is used to find an optimum (an extremum), we need the root of "derivative equals zero", which requires the second derivative, so the method becomes a second-order algorithm. Newton's method for optimization is also iterative: at each iteration, we take a second-order Taylor expansion of $L(\theta^{t})$ around the current estimate of the minimizer, $\theta^{t-1}$, and from it obtain the next estimate $\theta^t$.
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta + L^{''}(\theta^{t-1})\frac{\Delta \theta^2}{2}\tag{9}\end{equation}$$
First consider the case where the parameter $\theta$ is a scalar, and write the first derivative $L^{'}$ and the second derivative $L^{''}$ as $g$ and $h$ respectively:
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+g\Delta \theta + h\frac{\Delta \theta^2}{2}\tag{10}\end{equation}$$
To make $L(\theta^t)$ as small as possible, we minimize $g\Delta \theta + h\frac{\Delta \theta^2}{2}$ (treating $L(\theta^{t-1})$ as a constant). Setting $\frac{\partial(g\Delta \theta + h\frac{\Delta \theta^2}{2})}{\partial \Delta \theta}=0$ gives $\Delta \theta = -\frac{g}{h}$, and therefore
$$\begin{equation}\theta^{t} = \theta^{t-1}+\Delta \theta=\theta^{t-1}-\frac{g}{h}\tag{11}\end{equation}$$
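A minimal sketch of the scalar Newton update in Eq. (11) follows; the loss $L(\theta)=e^{\theta}-2\theta$, whose minimizer is $\theta=\ln 2$, is an illustrative assumption rather than part of the original derivation.

```python
import math

# Newton's method for 1-D minimization: theta <- theta - g / h   (Eq. (11)).
def g(theta):                         # first derivative L'(theta) of the illustrative loss
    return math.exp(theta) - 2

def h(theta):                         # second derivative L''(theta)
    return math.exp(theta)

theta = 0.0                           # initial estimate theta^0
for t in range(5):
    theta = theta - g(theta) / h(theta)
    print(theta)                      # converges to ln(2) ≈ 0.6931
```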
Generalizing the parameter $\theta$ from a scalar to a vector, the parameter update at each iteration becomes
$$\begin{equation}\theta^{t} =\theta^{t-1}-H^{-1}g\tag{12}\end{equation}$$
where $g$ is the gradient vector (the Jacobian of the scalar loss), whose entries are the first-order partial derivatives, and $H$ is the Hessian matrix, whose entries are the second-order partial derivatives.
The complete implementation below computes the gradient and Hessian symbolically with SymPy, then runs both Newton's method (Eq. (12)) and gradient descent (Eq. (7)) on an example objective function.

```python
import numpy as np
import sympy
from sympy import symbols, diff

# Symbolic variables and the example objective function.
x1, x2, x3, x4 = symbols('x1, x2, x3, x4')
# Y = x1 ** 2 + x2 ** 2
# Y = x1**2 + 3*x1 - 3*x1*x2 + 2*x2**2 + 4*x2
Y = x1 ** 2 + x2 ** 2 - 4 * x1 - 6 * x2 + 13 + sympy.sqrt(x3) - sympy.sqrt(x4)
var_list = [x1, x2, x3, x4]


def gradient(f, X_, X):
    """Symbolic gradient of f w.r.t. the variables X_, evaluated at the point X."""
    grad_ = sympy.Array([diff(f, x_) for x_ in X_])
    grad = grad_
    for i, x_ in enumerate(X_):
        grad = grad.subs(x_, X[i])
    return grad_, np.array(grad.tolist(), dtype=np.float32)


def jacobian2(f, X_, X):
    """First-order partial derivatives (the vector g in Eq. (12)), evaluated at X."""
    G_ = sympy.Array([diff(f, x_) for x_ in X_])
    G = G_
    for i, x_ in enumerate(X_):
        G = G.subs(x_, X[i])
    return G_, np.array(G.tolist(), dtype=np.float32)


def hessian2(f, X_, X):
    """Second-order partial derivatives (the Hessian H in Eq. (12)), evaluated at X."""
    H_ = sympy.Array([[diff(f, x_1).diff(x_2) for x_2 in X_] for x_1 in X_])
    H = H_
    for i, x_ in enumerate(X_):
        H = H.subs(x_, X[i])
    return H_, np.array(H.tolist(), dtype=np.float32)


def newton(f, X_, X, iters):
    """Newton's method: X <- X - H^{-1} g  (Eq. (12)).

    :param f: symbolic objective function
    :param X_: list of symbolic variables
    :param X: initial point (numpy array)
    :param iters: number of iterations
    """
    G_, G = jacobian2(f, X_, X)
    H_, H = hessian2(f, X_, X)
    H_inv = np.linalg.inv(H)
    H_G = np.matmul(H_inv, G)
    X_new = X
    for i in range(iters):
        X_new = X_new - H_G
        print("Iteration {}: {}".format(i + 1, X_new))
        # Re-evaluate the gradient and Hessian at the new point.
        G_tmp = G_
        H_tmp = H_
        for j, x_ in enumerate(X_):
            H_tmp = H_tmp.subs(x_, X_new[j])
            G_tmp = G_tmp.subs(x_, X_new[j])
        H_inv = np.linalg.inv(np.array(H_tmp.tolist(), dtype=np.float32))
        H_G = np.matmul(H_inv, np.array(G_tmp.tolist(), dtype=np.float32))
    return X_new


def gradient_descent_1d_decay(f, X_, X, iters, learning_rate=0.01, decay=0.5):
    """Gradient descent: X <- X - alpha * g  (Eq. (7))."""
    for i in range(iters):
        # learning_rate = learning_rate * 1.0 / (1.0 + decay * i)
        _, grad = gradient(f, X_, X)
        X = X - learning_rate * grad
        # Clamp to a small positive value so the sqrt terms stay defined and differentiable.
        X[X < 1e-6] = 1e-6
        print("Iteration {}: {}".format(i + 1, X))
    return X


if __name__ == "__main__":
    j2_, j2 = jacobian2(Y, var_list, np.array([1, 2, 1, 1]))
    h2_, h2 = hessian2(Y, var_list, np.array([1, 2, 1, 1]))
    print(j2_, j2)
    print(h2_, h2)
    print("newton:----------------------------")
    print(newton(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 20))
    print("gradient descent:----------------------------")
    print(gradient_descent_1d_decay(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 200))
```