This article is based on 深入浅出--梯度下降法及其实现 (an accessible introduction to gradient descent and its implementation), adding some explanation of how the formulas are converted into matrix form to make them easier to follow.
Gradient descent is a method for finding a minimum of a function. For a one-variable function, the gradient is simply the slope at a point; for a multivariable function, it is a vector of partial derivatives, and it points in the direction in which the function value increases fastest. In formula form:
\[ \nabla J(\Theta) = \left\langle \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \frac{\partial J}{\partial \theta_3} \right\rangle \]
For example:
\[ \begin{aligned} J(\theta) &= \theta^2 \qquad \nabla J(\theta) = J'(\theta) = 2\theta\\ J(\theta) &= 2\theta_1+3\theta_2 \quad \nabla J(\theta) = \langle 2,3 \rangle \end{aligned} \]
Since the gradient is the direction of fastest increase, if we keep moving in the direction opposite to the gradient, we will reach a local minimum of the function.
This gives the update rule of gradient descent:
\[ \Theta^1 = \Theta^0 - \alpha \nabla J(\Theta^0) \]
Here \(\alpha\) is called the learning rate; intuitively, it controls how far we move in the direction opposite to the gradient at each step. If the learning rate is too large we may step over the minimum, while a learning rate that is too small makes learning slow.
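As a minimal sketch of this update rule (not from the original article), a single step can be written in NumPy; `grad` here is a hypothetical function that returns \(\nabla J(\Theta)\):

```python
import numpy as np

def gradient_step(theta, grad, alpha):
    """One gradient-descent update: theta_new = theta - alpha * grad(theta)."""
    return theta - alpha * grad(theta)

# Example with J(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
theta = gradient_step(theta, lambda t: 2 * t, alpha=0.1)
print(theta)  # [0.8]
```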
To understand this formula better, consider an example:
\[ \quad J(\theta) = \theta^2 \Longrightarrow \nabla J(\theta) = J'(\theta) = 2\theta \]
則梯度降低的迭代過程有:
\[ \begin{aligned} \theta^0 &= 1 \\ \theta^1 &= \theta^0 - \alpha \times J'(\theta^0) \\ &= 1-0.4 \times 2 \\ &= 0.2 \\ \theta^2 &= \theta^1 - \alpha \times J'(\theta^1)\\ &= 0.04\\ \theta^3 &= 0.008\\ \theta^4 &= 0.0016\\ \end{aligned} \]
We know that the minimum of \(J(\theta) = \theta^2\) is at \((0,0)\), and after four iterations the result \((0.0016, 0.00000256)\) is already very close to it.
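The same iteration can be reproduced with a few lines of Python (a sketch added here for checking, not part of the original derivation):

```python
def grad(theta):
    """Gradient of J(theta) = theta**2."""
    return 2 * theta

theta, alpha = 1.0, 0.4
for i in range(1, 5):
    theta = theta - alpha * grad(theta)
    print(f"theta^{i} = {theta:.6f}, J(theta^{i}) = {theta ** 2:.8f}")
# theta^1 = 0.2, theta^2 = 0.04, theta^3 = 0.008, theta^4 = 0.0016
```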
Now consider a two-variable example:
\[J(\Theta) = \theta_1^2 + \theta_2^2 \Longrightarrow \nabla J(\Theta) = \langle 2\theta_1, 2\theta_2 \rangle\]
With \(\alpha = 0.1\) and initial point \((1,3)\), we use gradient descent to find the minimum of the function:
\[ \begin{aligned} \Theta^0 &= (1,3) \\ \Theta^1 &= \Theta^0 - \alpha \nabla J(\Theta^0)\\ &= (1,3) - 0.1(2,6)\\ &= (0.8,2.4)\\ \Theta^2 &= (0.8,2.4) - 0.1(1.6,4.8)\\ &= (0.64,1.92) \\ \Theta^3 &= (0.512,1.536) \\ \end{aligned} \]
Note that the efficiency of the iteration depends not only on the step size (learning rate) \(\alpha\) but also on the choice of the initial point.
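The three steps above can be checked numerically with the sketch below (assumed code, not from the original article), running the update on \(J(\Theta) = \theta_1^2 + \theta_2^2\) with NumPy:

```python
import numpy as np

def grad(theta):
    """Gradient of J(theta1, theta2) = theta1**2 + theta2**2."""
    return 2 * theta

theta, alpha = np.array([1.0, 3.0]), 0.1
for i in range(1, 4):
    theta = theta - alpha * grad(theta)
    print(f"Theta^{i} = {theta}")
# Theta^1 = [0.8 2.4], Theta^2 = [0.64 1.92], Theta^3 = [0.512 1.536]
```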
Next, let us use gradient descent to fit a straight line that minimizes the mean squared error.
Suppose the fitted line is:
\[h_\Theta(x) = \theta_0 + \theta_1 \times x \]
where \(\Theta = (\theta_0, \theta_1)\).
Taking the mean squared error as the cost function:
\[J(\Theta) = \frac{1}{2m}\sum_{i=1}^m \left\lbrace h_\Theta(x_i) - y_i \right\rbrace ^2\]
\[ \begin{aligned} &\nabla J(\Theta) = \left\langle \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial\theta_1} \right\rangle \\ &\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left\lbrace h_\Theta(x_i) - y_i \right\rbrace\\ &\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^m \left\lbrace h_\Theta(x_i) - y_i \right\rbrace x_i\\ \end{aligned} \]
To make it easier to use NumPy's matrix operations in Python, we can rewrite the formulas slightly:
\[ \begin{aligned} h_\Theta(x_i) &= \theta_0 \times 1 + \theta_1 \times x_i \\ &= \begin{bmatrix} 1 & x_0\\ 1 & x_1\\ \vdots & \vdots\\ 1 & x_{19}\\ \end{bmatrix} \times \begin{bmatrix} \theta_0\\ \theta_1 \end{bmatrix} \end{aligned} \]
Here \(i\) ranges over \([1,20]\) (the matrix rows use zero-based subscripts \(x_0,\dots,x_{19}\) to match the Python code below), so stacking all the predictions finally gives
\[ h_\Theta(x_i)= \begin{bmatrix} h_0\\ h_1\\ \vdots\\ h_{19} \end{bmatrix} \]
\[ \text{error: } \mathit{diff} = \begin{bmatrix} h_0\\ h_1\\ \vdots\\ h_{19} \end{bmatrix}- \begin{bmatrix} y_0\\ y_1\\ \vdots\\ y_{19} \end{bmatrix} \]
\[ \text{sum of squared errors: } \sum_{i=1}^m \left\lbrace h_\Theta(x_i) - y_i \right\rbrace ^2 = \mathit{diff}^T \times \mathit{diff} \]
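Combining the two partial derivatives, the gradient itself also takes a compact matrix form (this step is left implicit in the derivation above); writing \(X\) for the \(20 \times 2\) matrix of inputs, it is exactly what the `gradient_function` in the code below computes:
\[ \nabla J(\Theta) = \frac{1}{m} X^T \times \mathit{diff} = \frac{1}{m} X^T \left( X\Theta - y \right) \]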
With the formulas rewritten as matrix-vector products, the Python code below should be easier to follow.
```python
import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m+1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
    12, 11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./(2*m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./m) * np.dot(np.transpose(X), diff)

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

optimal = gradient_descent(X, y, alpha)
print('optimal:', optimal)
print('error function:', error_function(optimal, X, y)[0, 0])
```
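As an optional sanity check (not part of the original article), the gradient-descent result can be compared against NumPy's closed-form least-squares solution, which it should approach:

```python
# Compare with the closed-form least-squares fit (assumes X and y from above).
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print('least squares:', theta_exact.ravel())
```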