Copyright © Microsoft Corporation. All rights reserved.
Licensed under the terms of the License.
For more Microsoft AI learning resources, please visit the Microsoft AI Education & Learning Community on GitHub.
This article is mostly a collection of derivative formulas and derivations that may be needed later. It is theoretical background: readers who are interested can go through it carefully, while readers who want to get hands-on right away can simply skip this one.
You may want to bookmark it and come back to look things up when needed, treating it as a dictionary.
Now, on to the main content!
Derivatives of the basic functions (each line lists the function \(y\) together with its derivative \(y'\)):
\[y=C,\quad y'=0 \tag{1}\]
\[y=x^a,\quad y'=ax^{a-1} \tag{2}\]
\[y=log_ax,\quad y'=\frac{1}{x}log_ae=\frac{1}{xlna} \tag{3}\]
\[(\text{since } log_ae=\frac{1}{log_ea}=\frac{1}{lna})\]
\[y=lnx,\quad y'=\frac{1}{x} \tag{4}\]
\[y=a^x,\quad y'=a^xlna \tag{5}\]
\[y=e^x,\quad y'=e^x \tag{6}\]
\[y=e^{-x},\quad y'=-e^{-x} \tag{7}\]
\[y=sin(x),\quad y'=cos(x) \tag{8}\]
\[y=cos(x),\quad y'=-sin(x) \tag{9}\]
\[y=tan(x),\quad y'=sec^2(x)=\frac{1}{cos^2x} \tag{10}\]
\[y=cot(x),\quad y'=-csc^2(x) \tag{11}\]
\[y=arcsin(x),\quad y'=\frac{1}{\sqrt{1-x^2}} \tag{12}\]
\[y=arccos(x),\quad y'=-\frac{1}{\sqrt{1-x^2}} \tag{13}\]
\[y=arctan(x),\quad y'=\frac{1}{1+x^2} \tag{14}\]
\[y=arccot(x),\quad y'=-\frac{1}{1+x^2} \tag{15}\]
\[y=sinh(x),\quad y'=cosh(x) \tag{16}\]
\[y=cosh(x),\quad y'=sinh(x) \tag{17}\]
\[y=tanh(x),\quad y'=sech^2(x)=1-tanh^2(x) \tag{18}\]
\[y=coth(x),\quad y'=-csch^2(x) \tag{19}\]
\[y=sech(x),\quad y'=-sech(x)*tanh(x) \tag{20}\]
\[y=csch(x),\quad y'=-csch(x)*coth(x) \tag{21}\]
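As a quick sanity check, the short sketch below compares a few of the formulas above against central-difference approximations. The example values and the helper `numeric_derivative` are my own illustration, not part of the original reference.

```python
# A minimal numerical check of a few of the formulas above (helper names and
# test point are assumptions of this sketch, not from the original text).
import numpy as np

def numeric_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
checks = {
    "(x^3)'    = 3x^2":         (lambda t: t**3, 3 * x**2),           # formula (2)
    "(ln x)'   = 1/x":          (np.log,         1 / x),              # formula (4)
    "(sin x)'  = cos x":        (np.sin,         np.cos(x)),          # formula (8)
    "(tanh x)' = 1 - tanh^2 x": (np.tanh,        1 - np.tanh(x)**2),  # formula (18)
}
for name, (f, analytic) in checks.items():
    approx = numeric_derivative(f, x)
    assert abs(approx - analytic) < 1e-8, name
    print(f"{name:28s} numeric={approx:.6f} analytic={analytic:.6f}")
```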
For a function of two variables \(Z=f(x,y)\), the partial derivative of \(Z\) with respect to \(x\) can be understood as treating \(y\) as a constant and differentiating \(Z\) with respect to \(x\) alone:
\[Z'_x=f'_x(x,y)=\frac{\partial{Z}}{\partial{x}} \tag{40}\]
Likewise, the partial derivative of \(Z\) with respect to \(y\) treats \(x\) as a constant and differentiates \(Z\) with respect to \(y\) alone:
\[Z'_y=f'_y(x,y)=\frac{\partial{Z}}{\partial{y}} \tag{41}\]
The geometric meaning of a partial derivative of a two-variable function is this: for any fixed value \(y=y_0\), slicing the surface of the function with the plane \(y=y_0\) gives the curve \(Z = f(x, y_0)\), and the ordinary derivative of this curve is the partial derivative of \(Z\) with respect to \(x\). Slicing with \(x=x_0\) in the same way gives the partial derivative of \(Z\) with respect to \(y\).
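The same kind of numerical check works for partial derivatives. The sketch below picks an arbitrary example function \(Z=x^2y+y^3\) (my own choice, not from the text) and verifies formulas (40) and (41) by holding one variable fixed at a time.

```python
# Hypothetical illustration of formulas (40)-(41): for Z = f(x, y) = x^2*y + y^3,
# hold one variable fixed and differentiate with respect to the other.
import numpy as np

def f(x, y):
    return x**2 * y + y**3

x0, y0, h = 1.5, -0.5, 1e-5
dZ_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # y treated as a constant
dZ_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # x treated as a constant

assert abs(dZ_dx - 2 * x0 * y0) < 1e-8               # analytic: 2xy
assert abs(dZ_dy - (x0**2 + 3 * y0**2)) < 1e-8       # analytic: x^2 + 3y^2
print(dZ_dx, dZ_dy)
```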
For a composite function \(y=f(u)\) with \(u=u(x)\), the chain rule gives:
\[y'_x = f'(u)*u'(x) = y'_u*u'_x=\frac{dy}{du}*\frac{du}{dx} \tag{50}\]
For \(y=f(u)\), \(u=g(v)\), \(v=h(x)\):
\[ \frac{dy}{dx}=f'(u)*g'(v)*h'(x)=\frac{dy}{du}*\frac{du}{dv}*\frac{dv}{dx} \tag{51} \]
For \(Z=f(U,V)\), where \(U\) and \(V\) are both functions of \(x\) and \(y\), the multivariable chain rule is:
\[ \frac{\partial{Z}}{\partial{x}}=\frac{\partial{Z}}{\partial{U}} * \frac{\partial{U}}{\partial{x}} + \frac{\partial{Z}}{\partial{V}} * \frac{\partial{V}}{\partial{x}} \tag{52} \]
\[ \frac{\partial{Z}}{\partial{y}}=\frac{\partial{Z}}{\partial{U}} * \frac{\partial{U}}{\partial{y}} + \frac{\partial{Z}}{\partial{V}} * \frac{\partial{V}}{\partial{y}} \]
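Here is a minimal numerical illustration of the chain rules (50) and (52), using example functions of my own choosing (\(y=sin(x^2)\), and \(Z=UV\) with \(U=x+y\), \(V=xy\)):

```python
# A small check of the chain rules (50) and (52), with example functions that
# are my own choice (not from the original text).
import numpy as np

h = 1e-6

# --- single-variable chain rule (50): y = sin(u), u = x^2, dy/dx = cos(x^2) * 2x
x = 0.8
analytic = np.cos(x**2) * 2 * x
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
assert abs(analytic - numeric) < 1e-6

# --- multivariable chain rule (52): Z = U*V, U = x + y, V = x*y
xv, yv = 1.2, -0.7
U, V = xv + yv, xv * yv
analytic = V * 1 + U * yv          # dZ/dU = V, dU/dx = 1, dZ/dV = U, dV/dx = y
numeric = (((xv + h) + yv) * ((xv + h) * yv)
           - ((xv - h) + yv) * ((xv - h) * yv)) / (2 * h)
assert abs(analytic - numeric) < 1e-6
print("chain rules (50) and (52) check out numerically")
```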
If \(A\), \(B\), and \(X\) are all matrices, then:
\[ B\frac{\partial{(AX)}}{\partial{X}} = A^TB \tag{60} \]
\[ B\frac{\partial{(XA)}}{\partial{X}} = BA^T \tag{61} \]
\[ \frac{\partial{(X^TA)}}{\partial{X}} = \frac{\partial{(A^TX)}}{\partial{X}}=A \tag{62} \]
\[ \frac{\partial{(A^TXB)}}{\partial{X}} = AB^T \tag{63} \]
\[ \frac{\partial{(A^TX^TB)}}{\partial{X}} = BA^T \tag{64} \]
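A quick numerical spot-check of two of these identities, (62) and (64), is sketched below. It assumes the denominator-layout convention implied by the text (the gradient has the same shape as \(X\)); the random shapes and the `numeric_grad` helper are my own.

```python
# Finite-difference spot-check of formulas (62) and (64).
import numpy as np

rng = np.random.default_rng(0)
h = 1e-6

def numeric_grad(f, X):
    """Finite-difference gradient of a scalar-valued f with respect to X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X); E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

# formula (62): d(X^T A)/dX = A, with X and A both N x 1 vectors
N, M = 4, 3
A, X = rng.normal(size=(N, 1)), rng.normal(size=(N, 1))
G62 = numeric_grad(lambda X: (X.T @ A).item(), X)
assert np.allclose(G62, A, atol=1e-6)

# formula (64): d(A^T X^T B)/dX = B A^T, with X N x M, A M x 1, B N x 1
A2, B2, X2 = rng.normal(size=(M, 1)), rng.normal(size=(N, 1)), rng.normal(size=(N, M))
G64 = numeric_grad(lambda X: (A2.T @ X.T @ B2).item(), X2)
assert np.allclose(G64, B2 @ A2.T, atol=1e-6)
print("formulas (62) and (64) verified numerically")
```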
For the Sigmoid function \(A=\frac{1}{1+e^{-z}}\), use formula 30 (the quotient rule) and let \(u=1,\ v=1+e^{-z}\); then
\[ A'_z = \frac{u'v-v'u}{v^2}=\frac{0-(1+e^{-z})'}{(1+e^{-z})^2} \tag{70} \]
\[ =\frac{e^{-z}}{(1+e^{-z})^2} =\frac{1+e^{-z}-1}{(1+e^{-z})^2} \]
\[ =\frac{1}{1+e^{-z}}-(\frac{1}{1+e^{-z}})^2 \]
\[ =A-A^2=A(1-A) \]
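A short numerical confirmation of result (70), namely \(A'_z = A(1-A)\) for the Sigmoid function:

```python
# Numerical confirmation of (70): for A = 1/(1+e^{-z}), dA/dz = A(1-A).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
assert np.allclose(numeric, analytic, atol=1e-8)
print(np.round(analytic, 6))
```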
For the Tanh function \(A=\frac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}\), use formula 23 and let \(u=e^{Z}-e^{-Z},\ v=e^{Z}+e^{-Z}\); then
\[ A'_Z=\frac{u'v-v'u}{v^2} \tag{71} \]
\[ =\frac{(e^{Z}-e^{-Z})'(e^{Z}+e^{-Z})-(e^{Z}+e^{-Z})'(e^{Z}-e^{-Z})}{(e^{Z}+e^{-Z})^2} \]
\[ =\frac{(e^{Z}+e^{-Z})(e^{Z}+e^{-Z})-(e^{Z}-e^{-Z})(e^{Z}-e^{-Z})}{(e^{Z}+e^{-Z})^2} \]
\[ =\frac{(e^{Z}+e^{-Z})^2-(e^{Z}-e^{-Z})^2}{(e^{Z}+e^{-Z})^2} \]
\[ =1-\left(\frac{e^{Z}-e^{-Z}}{e^{Z}+e^{-Z}}\right)^2=1-A^2 \]
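And the corresponding check for result (71), \(A'_Z = 1-A^2\) for the Tanh function:

```python
# Numerical confirmation of (71): for A = tanh(Z), dA/dZ = 1 - A^2.
import numpy as np

Z = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (np.tanh(Z + h) - np.tanh(Z - h)) / (2 * h)
analytic = 1 - np.tanh(Z)**2
assert np.allclose(numeric, analytic, atol=1e-8)
print(np.round(analytic, 6))
```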
The four famous backpropagation equations are:
\[\delta^{L} = \nabla_{a}C \odot \sigma^{'}(Z^L) \tag{80}\]
\[\delta^{l} = ((W^{l + 1})^T\delta^{l+1})\odot\sigma^{'}(Z^l) \tag{81}\]
\[\frac{\partial{C}}{\partial{b_j^l}} = \delta_j^l \tag{82}\]
\[\frac{\partial{C}}{\partial{w_{jk}^{l}}} = a_k^{l-1}\delta_j^l \tag{83}\]
Below we use a small fully connected network, with two neurons per layer, to explain these four equations intuitively.
The inputs and outputs of each node are labeled as shown in the figure. Using MSE as the loss function, the computations in this computational graph are as follows:
\[e_{01} = \frac{1}{2}(y-a_1^3)^2\]
\[a_1^3 = sigmoid(z_1^3)\]
\[z_1^3 = (w_{11}^2 * a_1^2 + w_{12}^2 * a_2^2 + b_1^3)\]
\[a_1^2 = sigmoid(z_1^2)\]
\[z_1^2 = (w_{11}^1 * a_1^1 + w_{12}^1 * a_2^1 + b_1^2)\]
Following the gradient-descent principle of backpropagation, we take the gradient of the loss; the computation goes as follows:
\[\frac{\partial{e_{o1}}}{\partial{w_{11}^2}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{w_{11}^2}}=\frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}a_{1}^2\]
\[\frac{\partial{e_{o1}}}{\partial{w_{12}^2}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{w_{12}^2}}=\frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}a_{2}^2\]
\[\frac{\partial{e_{o1}}}{\partial{w_{11}^1}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{a_{1}^2}}\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}\frac{\partial{z_{1}^2}}{\partial{w_{11}^1}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{a_{1}^2}}\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}a_1^1\]
\[=\frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}w_{11}^2\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}a_1^1\]
\[\frac{\partial{e_{o1}}}{\partial{w_{12}^1}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{a_{1}^2}}\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}\frac{\partial{z_{1}^2}}{\partial{w_{12}^1}} = \frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\frac{\partial{z_{1}^3}}{\partial{a_{1}^2}}\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}a_2^1\]
\[=\frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}w_{11}^2\frac{\partial{a_{1}^2}}{\partial{z_{1}^2}}a_2^1\]
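To make the expansion concrete, the sketch below plugs made-up numbers into the first of these expressions (the figure's actual values are not reproduced here) and compares the analytic gradient with a finite difference:

```python
# A concrete instance of the chain-rule expansion for de/dw11_2, using assumed
# example values (a1_2, a2_2, y, b1_3, w11_2, w12_2 are my own numbers).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a1_2, a2_2, y, b1_3 = 0.6, 0.3, 1.0, 0.1
w11_2, w12_2 = 0.4, -0.2

def loss(w11_2):
    z1_3 = w11_2 * a1_2 + w12_2 * a2_2 + b1_3
    a1_3 = sigmoid(z1_3)
    return 0.5 * (y - a1_3) ** 2

# analytic gradient following the expansion above:
# de/da1_3 = (a1_3 - y), da1_3/dz1_3 = a1_3(1 - a1_3), dz1_3/dw11_2 = a1_2
z1_3 = w11_2 * a1_2 + w12_2 * a2_2 + b1_3
a1_3 = sigmoid(z1_3)
analytic = (a1_3 - y) * a1_3 * (1 - a1_3) * a1_2

h = 1e-6
numeric = (loss(w11_2 + h) - loss(w11_2 - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-8
print("de/dw11_2 =", analytic)
```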
In the expansions above, \(\frac{\partial{a}}{\partial{z}}\) is the derivative of the activation function, i.e. the \(\sigma^{'}(z)\) term. Notice that all of the partial derivatives share the common factor \(\frac{\partial{e_{o1}}}{\partial{a_{1}^3}}\frac{\partial{a_{1}^3}}{\partial{z_{1}^3}}\). We record it with the symbol \(\delta\) and write it in matrix form, that is:
\[\delta^L = [\frac{\partial{e_{o1}}}{\partial{a_{i}^L}}\frac{\partial{a_{i}^L}}{\partial{z_{i}^L}}] = \nabla_{a}C\odot\sigma^{'}(Z^L)\]
In the expression above, \([a_i]\) denotes the matrix (here a column vector) whose \(i\)-th entry is the bracketed quantity; \(\nabla_{a}C\) denotes the gradient of the loss \(C\) with respect to \(a\); and \(\odot\) denotes the element-wise product of matrices (multiplying the entries at corresponding positions).
From the derivation above, we can obtain the recurrence for the \(\delta\) matrices:
\[\delta^{L-1} = (W^L)^T[\frac{\partial{e_{o1}}}{\partial{a_{i}^L}}\frac{\partial{a_{i}^L}}{\partial{z_{i}^L}}]\odot\sigma^{'}(Z^{L - 1})\]
So during backpropagation we only need to work layer by layer, using the \(\delta^l\) of the layer above to recurse through this formula.
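As a compact sketch of how equations (80)-(83) are used in practice, the code below implements them for a small two-layer sigmoid network with MSE loss and checks one weight gradient against a numerical estimate. All shapes and variable names here are my own assumptions, not from the original text.

```python
# Sketch of backpropagation equations (80)-(83) for a 2-layer sigmoid network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 4, 2
W1, b1 = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, 1))
W2, b2 = rng.normal(size=(n_out, n_hid)), rng.normal(size=(n_out, 1))
x, y = rng.normal(size=(n_in, 1)), rng.normal(size=(n_out, 1))

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(W1, b1, W2, b2)
C = 0.5 * np.sum((a2 - y) ** 2)            # MSE loss at the current parameters

delta2 = (a2 - y) * a2 * (1 - a2)          # (80): nabla_a C  (*) sigma'(z^L)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # (81)
dC_db2, dC_db1 = delta2, delta1            # (82)
dC_dW2 = delta2 @ a1.T                     # (83)
dC_dW1 = delta1 @ x.T

# numerical check of one entry of dC/dW1
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
W1m = W1.copy(); W1m[0, 0] -= h
Cp = 0.5 * np.sum((forward(W1p, b1, W2, b2)[3] - y) ** 2)
Cm = 0.5 * np.sum((forward(W1m, b1, W2, b2)[3] - y) ** 2)
assert abs((Cp - Cm) / (2 * h) - dC_dW1[0, 0]) < 1e-6
print("backprop equations (80)-(83) agree with the numerical gradient")
```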
This is a fairly intuitive result, but the derivation above is not rigorous. Next we derive it from a stricter mathematical standpoint; first we need to add a few definitions.
Suppose \(y\) is a scalar and \(X\) is an \(N \times M\) matrix, with \(y=f(X)\) for some function \(f\). Let us look at how \(df\) should be computed.
First, the definition:
\[ df = \sum_j^M\sum_i^N \frac{\partial{f}}{\partial{x_{ij}}}dx_{ij} \]
Next we introduce the concept of the trace of a matrix. The trace of a matrix is simply the sum of its diagonal elements, that is:
\[ tr(X) = \sum_i x_{ii} \]
Having introduced the trace, let us see whether the gradient computation above can be expressed with a trace.
\[ \frac{\partial{f}}{\partial{X}} = \begin{pmatrix} \frac{\partial{f}}{\partial{x_{11}}} & \frac{\partial{f}}{\partial{x_{12}}} & \dots & \frac{\partial{f}}{\partial{x_{1M}}} \\ \frac{\partial{f}}{\partial{x_{21}}} & \frac{\partial{f}}{\partial{x_{22}}} & \dots & \frac{\partial{f}}{\partial{x_{2M}}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial{f}}{\partial{x_{N1}}} & \frac{\partial{f}}{\partial{x_{N2}}} & \dots & \frac{\partial{f}}{\partial{x_{NM}}} \end{pmatrix} \tag{90} \]
\[ dX = \begin{pmatrix} dx_{11} & d{x_{12}} & \dots & d{x_{1M}} \\ d{x_{21}} & d{x_{22}} & \dots & d{x_{2M}} \\ \vdots & \vdots & \ddots & \vdots \\ d{x_{N1}} & d{x_{N2}} & \dots & d{x_{NM}} \end{pmatrix} \tag{91} \]
Now look at the diagonal entries of the product of the transpose of matrix \((90)\) with matrix \((91)\):
\[ {({(\frac{\partial{f}}{\partial{X}})}^TdX)}_{jj} = \sum_i^N\frac{\partial{f}}{\partial{x_{ij}}}dx_{ij} \]
Therefore,
\[ tr({(\frac{\partial{f}}{\partial{X}})}^TdX) = \sum_j^M\sum_i^N\frac{\partial{f}}{\partial{x_{ij}}}dx_{ij} = df = tr(df) \tag{92} \]
The last equality holds because \(df\) is a scalar, and the trace of a scalar is the scalar itself.
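A tiny numerical illustration of equation (92), using an example of my own (\(f(X)=\sum_{ij} x_{ij}^2\), for which \(\frac{\partial{f}}{\partial{X}}=2X\)):

```python
# Illustrating (92) numerically: for f(X) = sum(X**2), df/dX = 2X, and for a
# small perturbation dX the change in f is close to tr((df/dX)^T dX).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
dX = 1e-6 * rng.normal(size=(3, 4))

f = lambda M: np.sum(M ** 2)
df_exact = f(X + dX) - f(X)
df_trace = np.trace((2 * X).T @ dX)        # tr((df/dX)^T dX)
assert abs(df_exact - df_trace) < 1e-9     # equal up to second-order terms in dX
print(df_exact, df_trace)
```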
Below we list some properties of matrix traces and matrix differentials, to be used as references in the derivations that follow. Impatient readers can simply take them as given conclusions for now.
\[ d(X + Y) = dX + dY \tag{93} \]
\[ d(XY) = (dX)Y + X(dY)\tag{94} \]
\[ dX^T = {(dX)}^T \tag{95} \]
\[ d(tr(X)) = tr(dX) \tag{96} \]
\[ d(X \odot Y) = dX \odot Y + X \odot dY \tag{97} \]
\[ d(f(X)) = f^{'}(X) \odot dX \tag{98} \]
\[ tr(XY) = tr(YX) \tag{99} \]
\[ tr(A^T (B \odot C)) = tr((A \odot B)^T C) \tag{100} \]
The proofs of the properties above are all similar; we take equation (94) as an example. Let
\[ Z = XY \]
Then an arbitrary entry of \(Z\) is
\[ z_{ij} = \sum_k x_{ik}y_{kj} \\ dz_{ij} = \sum_k d(x_{ik}y_{kj}) = \sum_k (dx_{ik}) y_{kj} + \sum_k x_{ik} (dy_{kj}) = ((dX)Y)_{ij} + (X(dY))_{ij} \]
From the expression above, every entry of \(dZ\) equals the corresponding entry of \((dX)Y + X(dY)\); therefore equation (94) holds.
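The trace identities can also be spot-checked numerically; the snippet below verifies (99) and (100) with random matrices (my own example):

```python
# A quick numerical check of (99) and (100).
import numpy as np

rng = np.random.default_rng(3)
A, B, C = (rng.normal(size=(4, 4)) for _ in range(3))
X, Y = rng.normal(size=(4, 5)), rng.normal(size=(5, 4))

assert np.isclose(np.trace(X @ Y), np.trace(Y @ X))                  # (99)
assert np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))  # (100), * is element-wise
print("trace identities (99) and (100) hold")
```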
First, consider a general case: given \(f = A^TXB\), where \(A\) and \(B\) are constant vectors, we want \(\frac{\partial{f}}{\partial{X}}\). The derivation is as follows.
By equation (94),
\[ df = d(A^TXB) = d(A^TX)B + A^TX(dB) = d(A^TX)B + 0 = d(A^T)XB+A^TdXB = A^TdXB \]
Since \(df\) is a scalar and the trace of a scalar equals itself, and using equation (99):
\[ df = tr(df) = tr(A^TdXB) = tr(BA^TdX) \]
And by equation (92):
\[ tr(df) = tr({(\frac{\partial{f}}{\partial{X}})}^TdX) \]
we can obtain:
\[ (\frac{\partial{f}}{\partial{X}})^T = BA^T \]
\[ \frac{\partial{f}}{\partial{X}} = AB^T \tag{101} \]
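A numerical confirmation of (101), which is the same identity as (63) above:

```python
# Finite-difference check: for f = A^T X B with constant vectors A, B,
# df/dX = A B^T.
import numpy as np

rng = np.random.default_rng(4)
N, M = 5, 3
A, B, X = rng.normal(size=(N, 1)), rng.normal(size=(M, 1)), rng.normal(size=(N, M))
h = 1e-6

G = np.zeros_like(X)
for i in range(N):
    for j in range(M):
        E = np.zeros_like(X); E[i, j] = h
        G[i, j] = ((A.T @ (X + E) @ B).item() - (A.T @ (X - E) @ B).item()) / (2 * h)

assert np.allclose(G, A @ B.T, atol=1e-6)
print("df/dX = A B^T confirmed")
```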
Now let us look at the fully connected layer:
\[ Y = WX + B\]
Take one element of the fully connected layer:
\[ y = wX + b\]
Here \(w\) is one row of the weight matrix, with size \(1 \times M\); \(X\) is a vector of size \(M \times 1\); and \(y\) is a scalar. If we append an identity matrix of size 1, the expression as a whole is unchanged:
\[ y = (w^T)^TXI + b\]
Using equation (101), we obtain
\[ \frac{\partial{y}}{\partial{X}} = I^Tw^T = w^T\]
Therefore, in the four error-propagation equations, when the error \(\delta\) passed back from the layer above continues to propagate, the chain rule gives
\[\delta^{L-1} = (W^L)^T \delta^L \odot \sigma^{'}(Z^{L - 1})\]
Similarly, if \(y=wX+b\) is viewed as:
\[ y = IwX + b \]
then using equation (101), we obtain:
\[ \frac{\partial{y}}{\partial{w}} = X^T\]
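The two fully connected layer gradients derived above, \(\frac{\partial{y}}{\partial{X}} = w^T\) and \(\frac{\partial{y}}{\partial{w}} = X^T\), can be checked the same way (the example sizes and values are my own):

```python
# Checking dy/dX = w^T and dy/dw = X^T for y = w X + b
# (w is a 1 x M row, X is an M x 1 column, b is a scalar).
import numpy as np

rng = np.random.default_rng(5)
M = 4
w, X, b = rng.normal(size=(1, M)), rng.normal(size=(M, 1)), 0.3
h = 1e-6

# dy/dX
gX = np.zeros_like(X)
for i in range(M):
    E = np.zeros_like(X); E[i, 0] = h
    gX[i, 0] = ((w @ (X + E) + b).item() - (w @ (X - E) + b).item()) / (2 * h)
assert np.allclose(gX, w.T, atol=1e-6)

# dy/dw
gw = np.zeros_like(w)
for j in range(M):
    E = np.zeros_like(w); E[0, j] = h
    gw[0, j] = (((w + E) @ X + b).item() - ((w - E) @ X + b).item()) / (2 * h)
assert np.allclose(gw, X.T, atol=1e-6)
print("dy/dX = w^T and dy/dw = X^T verified")
```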
Now consider the case where softmax and cross entropy are used to compute the loss:
\[ l = - Y^Tlog(softmax(Z))\]
In this expression, \(Y\) is the data label and \(Z\) is the output predicted by the network; both \(Y\) and \(Z\) have dimension \(N \times 1\). \(Z\) is passed through softmax to give probabilities. We want to obtain \(\frac{\partial{l}}{\partial{Z}}\); the derivation is as follows:
\[ softmax(Z) = \frac{exp(Z)}{\boldsymbol{1}^Texp(Z)} \]
where \(\boldsymbol{1}\) is an all-ones vector of dimension \(N \times 1\). Substituting the softmax expression into the loss function, we have
\[ dl = -Y^T d(log(softmax(Z)))\\ = -Y^T d (log\frac{exp(Z)}{\boldsymbol{1}^Texp(Z)}) \\ = -Y^T dZ + Y^T \boldsymbol{1}d(log(\boldsymbol{1}^Texp(Z))) \tag{102} \]
Now simplify the second half of equation (102). Using equation (98),
\[ d(log(\boldsymbol{1}^Texp(Z))) = \frac{d(\boldsymbol{1}^Texp(Z))}{\boldsymbol{1}^Texp(Z)} = \frac{\boldsymbol{1}^T(exp(Z)\odot dZ)}{\boldsymbol{1}^Texp(Z)} \]
Using equation (100), we obtain
\[ tr(Y^T \boldsymbol{1}\frac{\boldsymbol{1}^T(exp(Z)\odot dZ)}{\boldsymbol{1}^Texp(Z)}) = tr(Y^T \boldsymbol{1}\frac{(\boldsymbol{1} \odot exp(Z))^T dZ}{\boldsymbol{1}^Texp(Z)}) = tr(Y^T \boldsymbol{1}\frac{exp(Z)^T dZ}{\boldsymbol{1}^Texp(Z)}) = tr(Y^T \boldsymbol{1}\ softmax(Z)^TdZ) \tag{103} \]
Substituting equation (103) into equation (102) and taking the trace of both sides, we obtain:
\[ dl = tr(dl) = tr(-Y^T dZ + Y^T\boldsymbol{1}\ softmax(Z)^TdZ) = tr((\frac{\partial{l}}{\partial{Z}})^TdZ) \]
In a classification problem, only one entry of the label is 1, so \(Y^T\boldsymbol{1} = 1\), and therefore
\[ \frac{\partial{l}}{\partial{Z}} = softmax(Z) - Y \]
This is exactly the formula for the backpropagated error computed at the loss function.
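Finally, a short numerical check of \(\frac{\partial{l}}{\partial{Z}} = softmax(Z) - Y\), with a one-hot label and example logits of my own choosing:

```python
# Numerical check of dl/dZ = softmax(Z) - Y for l = -Y^T log(softmax(Z)).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

Z = np.array([1.2, -0.3, 0.8, 0.1])    # example logits (assumed values)
Y = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot label

def loss(Z):
    return -np.sum(Y * np.log(softmax(Z)))

h = 1e-6
numeric = np.array([(loss(Z + h * e) - loss(Z - h * e)) / (2 * h)
                    for e in np.eye(len(Z))])
analytic = softmax(Z) - Y
assert np.allclose(numeric, analytic, atol=1e-6)
print("dl/dZ =", np.round(analytic, 6))
```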