Machine learning has become more and more popular recently. A while ago, Stanford professor Andrew Ng personally recorded the Deep Learning Specialization course, setting off a huge wave of enthusiasm for the subject in China. Not wanting to be left behind by the times, I started studying machine learning myself. Everyone says it is very hard to learn, but how would you discover its subtleties and its fun without trying it yourself? Only by trying, again and again, can you find the path that suits you best.
Please forgive the sentimental opening; now to the main topic. This article takes a thorough look at Gradient Descent as it is used in machine learning. After reading it, you should have a clear understanding of how gradient descent works.
Gradient descent is a first-order optimization algorithm, also commonly known as steepest descent. To find a local minimum of a function with gradient descent, we repeatedly step from the current point in the direction opposite to the gradient (or an approximate gradient) by a prescribed step size. Gradient descent can therefore help us find a local minimum, or the minimum, of a function, and for optimizing n-dimensional problems it is one of the most commonly used methods. Below we build it up from scratch, with a detailed derivation along the way.
Let's start with something simple: the one-dimensional function below.
f(x) = x^3 + 2 * x - 3
In mathematics, if we want to find the solution of f(x) = 0, we can obtain it through the following error equation:
error = (f(x) - 0)^2
When error approaches its minimum, the corresponding x is the solution of f(x) = 0. We can also see this by looking at the curve of error(x).
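A quick way to look at this curve yourself is to plot error(x) over a range of x. This is only a sketch, and it assumes numpy and matplotlib are available (neither is used elsewhere in this post):

import numpy as np
import matplotlib.pyplot as plt

# error(x) = (f(x) - 0)^2 with f(x) = x^3 + 2 * x - 3
xs = np.linspace(-1.0, 2.5, 400)
errors = (xs**3 + 2 * xs - 3) ** 2

plt.plot(xs, errors)
plt.xlabel('x')
plt.ylabel('error(x)')
plt.title('error(x) has its minimum where f(x) = 0')
plt.show()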
The plot makes it very intuitive: to reach the minimum, x simply has to end up at the lowest point of the valley. We already learned how to find such a minimum in high school: differentiate the error function (i.e., compute its slope):
derivative(x) = 6 * x^5 + 16 * x^3 - 18 * x^2 + 8 * x - 12
To get the minimum we only need to set derivative(x) = 0, which gives x = 1. Combining the plot with the derivative, we can also observe that: when x < 1, derivative < 0, so the slope is negative; when x > 1, derivative > 0, so the slope is positive; and as x gets arbitrarily close to 1, derivative approaches 0, so the slope goes to zero. Based on these observations, we can move x through the function with the following update rule:
x = x - rate * derivative
When the slope is negative, x increases; when the slope is positive, x decreases. Either way, x always moves toward the bottom of the valley, driving error toward its minimum and thereby giving us the solution of f(x) = 0. Here rate controls how far x moves against the direction of the derivative: the larger rate is, the farther x moves on each step, and the smaller it is, the less it moves.
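Before generalizing, here is a minimal sketch of the update rule in action, using the analytic derivative(x) computed above (the starting point 0.8, rate 0.01 and 30 iterations are arbitrary illustrative choices):

# apply x = x - rate * derivative repeatedly, with the analytic derivative of error(x)
def derivative(x):
    return 6 * x**5 + 16 * x**3 - 18 * x**2 + 8 * x - 12

x = 0.8
rate = 0.01
for i in range(30):
    x = x - rate * derivative(x)
print(x)   # moves toward 1.0, where f(x) = 0 and error(x) is at its minimum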
That works for a simple function, whose derivative we can write down directly. To handle more complicated functions, we can instead express the derivative through its definition: if f(x) is differentiable at a point x0, then

derivative(x0) = lim (delta -> 0) of (f(x0 + delta) - f(x0)) / delta
That covers the formulas; now let's turn them into code. All of the code below is written in Python.
>>> def f(x):
...     return x**3 + 2 * x - 3
...
>>> def error(x):
...     return (f(x) - 0)**2
...
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative = (error(x + delta) - error(x)) / delta
...     rate = 0.01
...     return x - rate * derivative
...
>>> x = 0.8
>>> for i in range(50):
...     x = gradient_descent(x)
...     print('x = {:6f}, f(x) = {:6f}'.format(x, f(x)))
...
Running the program above produces the following output:
x = 0.869619, f(x) = -0.603123
x = 0.921110, f(x) = -0.376268
x = 0.955316, f(x) = -0.217521
x = 0.975927, f(x) = -0.118638
x = 0.987453, f(x) = -0.062266
x = 0.993586, f(x) = -0.031946
x = 0.996756, f(x) = -0.016187
x = 0.998369, f(x) = -0.008149
x = 0.999182, f(x) = -0.004088
x = 0.999590, f(x) = -0.002048
x = 0.999795, f(x) = -0.001025
x = 0.999897, f(x) = -0.000513
x = 0.999949, f(x) = -0.000256
x = 0.999974, f(x) = -0.000128
x = 0.999987, f(x) = -0.000064
x = 0.999994, f(x) = -0.000032
x = 0.999997, f(x) = -0.000016
x = 0.999998, f(x) = -0.000008
x = 0.999999, f(x) = -0.000004
x = 1.000000, f(x) = -0.000002
x = 1.000000, f(x) = -0.000001
x = 1.000000, f(x) = -0.000001
x = 1.000000, f(x) = -0.000000
x = 1.000000, f(x) = -0.000000
x = 1.000000, f(x) = -0.000000
This output confirms our earlier conclusion: at x = 1, f(x) = 0.
So with this method, as long as we take enough steps, we can obtain a very precise value.
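If you prefer not to guess the number of steps in advance, a small variation of the loop stops once |f(x)| falls below a tolerance. The tolerance and iteration cap below are illustrative choices, not part of the original example:

def f(x):
    return x**3 + 2 * x - 3

def error(x):
    return (f(x) - 0)**2

def gradient_descent(x):
    delta = 0.00000001
    derivative = (error(x + delta) - error(x)) / delta
    rate = 0.01
    return x - rate * derivative

def solve(x, tolerance=1e-6, max_steps=10000):
    # keep stepping until f(x) is close enough to 0 (or the cap is hit)
    for _ in range(max_steps):
        if abs(f(x)) < tolerance:
            break
        x = gradient_descent(x)
    return x

print(solve(0.8))   # roughly 1.0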
That was a one-dimensional function. How do we solve a multi-dimensional one? Look at the function below, and you will see that the multi-dimensional case is just as simple.
f(x) = x[0] + 2 * x[1] + 4
Likewise, if we want the values of x[0] and x[1] at which f(x) = 0, we can find them indirectly by minimizing the error function. The only difference from the one-dimensional case is that we must differentiate with respect to x[0] and x[1] separately; in mathematics these are called partial derivatives:
Differentiating with respect to x[0] while holding x[1] fixed gives the partial derivative with respect to x[0]:

derivative_x0 = lim (delta -> 0) of (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta

Differentiating with respect to x[1] while holding x[0] fixed gives the partial derivative with respect to x[1]:

derivative_x1 = lim (delta -> 0) of (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta

With this in hand, we define gradient_descent as follows:
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
...     derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
...     rate = 0.01
...     x[0] = x[0] - rate * derivative_x0
...     x[1] = x[1] - rate * derivative_x1
...     return [x[0], x[1]]
...
rate plays the same role as before; the only new thing is that x[0] and x[1] each get their own update. Here is the complete program:
>>> def f(x):
...     return x[0] + 2 * x[1] + 4
...
>>> def error(x):
...     return (f(x) - 0)**2
...
>>> def gradient_descent(x):
...     delta = 0.00000001
...     derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
...     derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
...     rate = 0.02
...     x[0] = x[0] - rate * derivative_x0
...     x[1] = x[1] - rate * derivative_x1
...     return [x[0], x[1]]
...
>>> x = [-0.5, -1.0]
>>> for i in range(100):
...     x = gradient_descent(x)
...     print('x = {:6f},{:6f}, f(x) = {:6f}'.format(x[0],x[1],f(x)))
...
The output is:
x = -0.560000,-1.120000, f(x) = 1.200000
x = -0.608000,-1.216000, f(x) = 0.960000
x = -0.646400,-1.292800, f(x) = 0.768000
x = -0.677120,-1.354240, f(x) = 0.614400
x = -0.701696,-1.403392, f(x) = 0.491520
x = -0.721357,-1.442714, f(x) = 0.393216
x = -0.737085,-1.474171, f(x) = 0.314573
x = -0.749668,-1.499337, f(x) = 0.251658
x = -0.759735,-1.519469, f(x) = 0.201327
x = -0.767788,-1.535575, f(x) = 0.161061
x = -0.774230,-1.548460, f(x) = 0.128849
x = -0.779384,-1.558768, f(x) = 0.103079
x = -0.783507,-1.567015, f(x) = 0.082463
x = -0.786806,-1.573612, f(x) = 0.065971
x = -0.789445,-1.578889, f(x) = 0.052777
x = -0.791556,-1.583112, f(x) = 0.042221
x = -0.793245,-1.586489, f(x) = 0.033777
x = -0.794596,-1.589191, f(x) = 0.027022
x = -0.795677,-1.591353, f(x) = 0.021617
x = -0.796541,-1.593082, f(x) = 0.017294
x = -0.797233,-1.594466, f(x) = 0.013835
x = -0.797786,-1.595573, f(x) = 0.011068
x = -0.798229,-1.596458, f(x) = 0.008854
x = -0.798583,-1.597167, f(x) = 0.007084
x = -0.798867,-1.597733, f(x) = 0.005667
x = -0.799093,-1.598187, f(x) = 0.004533
x = -0.799275,-1.598549, f(x) = 0.003627
x = -0.799420,-1.598839, f(x) = 0.002901
x = -0.799536,-1.599072, f(x) = 0.002321
x = -0.799629,-1.599257, f(x) = 0.001857
x = -0.799703,-1.599406, f(x) = 0.001486
x = -0.799762,-1.599525, f(x) = 0.001188
x = -0.799810,-1.599620, f(x) = 0.000951
x = -0.799848,-1.599696, f(x) = 0.000761
x = -0.799878,-1.599757, f(x) = 0.000608
x = -0.799903,-1.599805, f(x) = 0.000487
x = -0.799922,-1.599844, f(x) = 0.000389
x = -0.799938,-1.599875, f(x) = 0.000312
x = -0.799950,-1.599900, f(x) = 0.000249
x = -0.799960,-1.599920, f(x) = 0.000199
x = -0.799968,-1.599936, f(x) = 0.000159
x = -0.799974,-1.599949, f(x) = 0.000128
x = -0.799980,-1.599959, f(x) = 0.000102
x = -0.799984,-1.599967, f(x) = 0.000082
x = -0.799987,-1.599974, f(x) = 0.000065
x = -0.799990,-1.599979, f(x) = 0.000052
x = -0.799992,-1.599983, f(x) = 0.000042
x = -0.799993,-1.599987, f(x) = 0.000033
x = -0.799995,-1.599989, f(x) = 0.000027
x = -0.799996,-1.599991, f(x) = 0.000021
x = -0.799997,-1.599993, f(x) = 0.000017
x = -0.799997,-1.599995, f(x) = 0.000014
x = -0.799998,-1.599996, f(x) = 0.000011
x = -0.799998,-1.599997, f(x) = 0.000009
x = -0.799999,-1.599997, f(x) = 0.000007
x = -0.799999,-1.599998, f(x) = 0.000006
x = -0.799999,-1.599998, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000003
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000000
If you look closely, you may notice that x = -0.8, -1.6 is not the only solution of f(x) = 0; x = -2, -1 works as well. This is because gradient descent only descends into the valley of the error function in which the current point happens to sit, and the error function does not have just one valley where f(x) = 0. Gradient descent therefore finds only a local solution, which depends on where it starts, and it is not guaranteed to find all solutions. Of course, for very complicated functions, even a local solution is already quite valuable.
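To see how the result depends on the starting point, here is a minimal sketch that reuses the definitions above and runs the same descent from two different initial guesses (the second starting point, [-2.0, -0.9], is an arbitrary choice for illustration). One run settles near x = -0.8, -1.6 and the other near x = -2, -1, both of which satisfy f(x) = 0:

def f(x):
    return x[0] + 2 * x[1] + 4

def error(x):
    return (f(x) - 0)**2

def gradient_descent(x):
    delta = 0.00000001
    derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
    derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
    rate = 0.02
    return [x[0] - rate * derivative_x0, x[1] - rate * derivative_x1]

for start in ([-0.5, -1.0], [-2.0, -0.9]):
    x = list(start)
    for _ in range(200):
        x = gradient_descent(x)
    # each run ends on the line x[0] + 2 * x[1] + 4 = 0, but at a different point
    print('start =', start, '-> x = [{:.4f}, {:.4f}], f(x) = {:.6f}'.format(x[0], x[1], f(x)))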
With the examples above, you should now have a basic understanding of gradient descent. Let's go back to where we started and use gradient descent in TensorFlow.
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
The code above is the example from the TensorFlow website. It defines the model linear_model = W * x + b, whose per-sample error is linear_model - y. The goal is to train on the data x_train and y_train to solve for W and b. To find the best fit over the whole data set, the squared error of every sample is summed into loss, and gradient descent is then run on loss to find the optimal values.
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
Here rate is 0.01. Since this example is also multi-dimensional (both W and b are being optimized), partial derivatives are again used to step gradually toward the optimum.
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})
Finally, the training loop applies gradient descent over and over. Here are some of the intermediate results:
W: [-0.21999997] b: [-0.456] loss: 4.01814
W: [-0.39679998] b: [-0.49552] loss: 1.81987
W: [-0.45961601] b: [-0.4965184] loss: 1.54482
W: [-0.48454273] b: [-0.48487374] loss: 1.48251
W: [-0.49684232] b: [-0.46917531] loss: 1.4444
W: [-0.50490189] b: [-0.45227283] loss: 1.4097
W: [-0.5115062] b: [-0.43511063] loss: 1.3761
....
....
....
W: [-0.99999678] b: [ 0.99999058] loss: 5.84635e-11
W: [-0.99999684] b: [ 0.9999907] loss: 5.77707e-11
W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11
I won't verify these step by step here; if you have followed the walkthrough of gradient descent above, you should be able to derive them yourself. Looking at the final result, we can estimate W = -1.0 and b = 1.0. Substituting them into loss gives 0.0, the smallest possible error, so W = -1.0 and b = 1.0 is the optimal solution for the training data x_train and y_train.
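As a cross-check, the same fit can be redone without TensorFlow, reusing the numerical-gradient recipe from earlier in the post. This is only a sketch: rate, the number of iterations and the initial values mirror the TensorFlow example, delta is an arbitrary choice, and the printed digits will differ slightly from the output above.

# minimize loss(W, b) = sum((W * x + b - y)^2) by hand, with numerical partial derivatives
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

def loss(W, b):
    return sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))

def gradient_descent(W, b):
    delta = 0.000001
    rate = 0.01
    derivative_W = (loss(W + delta, b) - loss(W, b)) / delta
    derivative_b = (loss(W, b + delta) - loss(W, b)) / delta
    return W - rate * derivative_W, b - rate * derivative_b

W, b = 0.3, -0.3   # same starting values as the TensorFlow example
for _ in range(1000):
    W, b = gradient_descent(W, b)

print('W = {:.5f}, b = {:.5f}, loss = {:.2e}'.format(W, b, loss(W, b)))
# ends near W = -1.0, b = 1.0 with a loss close to zero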
That's it for gradient descent. I hope it helps; if anything is missing, you are welcome to discuss it. If you enjoyed this article, you can follow my blog, or scan the QR code below to follow my WeChat official account 怪談時間到了 and read my other articles.