Some Concepts of Differentiation and Gradient Descent in Neural Networks

Theory for f : \(\mathbb{R}^{n} \mapsto \mathbb{R}\)

First, define a notation: the scalar product \(\langle a | b\rangle=\sum_{i=1}^{n} a_{i} b_{i}\).

We can define the derivative as follows:
\[ f(x+h)=f(x)+\mathrm{d}_{x} f(h)+o_{h \rightarrow 0}(h) \]
where the remainder \(o_{h \rightarrow 0}(h)\) can be written as \(\|h\|\,\epsilon(h)\) with \(\lim _{h \rightarrow 0} \epsilon(h)=0\).

The differential \(\mathrm{d}_{x} f: \mathbb{R}^{n} \mapsto \mathbb{R}\) is a linear map. For example, consider
\[ f\left(\left( \begin{array}{l}{x_{1}} \\ {x_{2}}\end{array}\right)\right)=3 x_{1}+x_{2}^{2} \]
For \(\left( \begin{array}{l}{a} \\ {b}\end{array}\right) \in \mathbb{R}^{2}\) and \(h=\left( \begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right) \in \mathbb{R}^{2}\), we have
\[ \begin{aligned} f\left(\left( \begin{array}{c}{a+h_{1}} \\ {b+h_{2}}\end{array}\right)\right) &=3\left(a+h_{1}\right)+\left(b+h_{2}\right)^{2} \\ &=3 a+3 h_{1}+b^{2}+2 b h_{2}+h_{2}^{2} \\ &=3 a+b^{2}+3 h_{1}+2 b h_{2}+h_{2}^{2} \\ &=f(a, b)+3 h_{1}+2 b h_{2}+o(h) \end{aligned} \]
That is, \(\mathrm{d}_{\left(\begin{array}{l}{a} \\ {b}\end{array}\right)} f\left(\left( \begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right)\right)=3 h_{1}+2 b h_{2}\).
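As a quick numerical sanity check, the following sketch (assuming NumPy; the values of \(a\), \(b\), and the displacement \(h\) are chosen arbitrarily) compares \(f(x+h)-f(x)\) with the linear term \(3 h_{1}+2 b h_{2}\); the gap between the two is exactly the \(o(h)\) remainder.

```python
# Minimal check of the differential: f(x1, x2) = 3*x1 + x2**2,
# with d_{(a,b)} f(h) = 3*h1 + 2*b*h2 as derived above.
import numpy as np

def f(x):
    return 3 * x[0] + x[1] ** 2

a, b = 1.0, 2.0                          # the point (a, b), arbitrary
h = np.array([1e-4, -2e-4])              # a small displacement h

exact = 3 * h[0] + 2 * b * h[1]          # the linear term from the text
approx = f(np.array([a, b]) + h) - f(np.array([a, b]))

print(exact, approx)                     # the two values agree up to o(h)
```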

Gradient Descent in Neural Networks

Vectorized Gradients

We can view a map \(\mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) (in the linear case, a matrix) as a function \(\boldsymbol{f}(\boldsymbol{x})=\left[f_{1}\left(x_{1}, \ldots, x_{n}\right), f_{2}\left(x_{1}, \ldots, x_{n}\right), \ldots, f_{m}\left(x_{1}, \ldots, x_{n}\right)\right]\) of the vector \(\boldsymbol{x}=\left[x_{1}, \ldots, x_{n}\right]\). The derivative of this vector-valued function with respect to the vector is:
\[ \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}=\left[ \begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{n}}} \\ {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}}{\partial x_{n}}}\end{array}\right] \]
This derivative matrix is called the Jacobian matrix. Its entries can be written as:
\[ \left(\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial f_{i}}{\partial x_{j}} \]
This matrix is very useful. A neural network can be viewed as a function of a vector, i.e. \(\boldsymbol{f}(\boldsymbol{x})\); when that function is a linear transformation of the vector, it is simply a matrix. Backpropagation requires partial derivatives with respect to the parameters, and in a multi-layer network this becomes chain-rule differentiation; the Jacobian is what the chain rule uses for such vector-valued functions.
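To make the definition concrete, here is a small sketch (assuming NumPy; the function and the evaluation point are made up for illustration) that assembles the Jacobian entry by entry with central finite differences, following \(\left(\partial \boldsymbol{f} / \partial \boldsymbol{x}\right)_{i j}=\partial f_{i} / \partial x_{j}\).

```python
# Build the Jacobian of an example f: R^3 -> R^2 by finite differences.
import numpy as np

def f(x):
    # illustrative vector-valued function
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def numerical_jacobian(f, x, eps=1e-6):
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)   # (df/dx)_{ij} = df_i/dx_j
    return J

x = np.array([1.0, 2.0, 0.5])
print(numerical_jacobian(f, x))        # a 2 x 3 Jacobian matrix
```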

Consider the following example: \(f(x)=\left[f_{1}(x), f_{2}(x)\right]\) maps a scalar to \(\mathbb{R}^{2}\) (its Jacobian is \(2 \times 1\)), and \(g(y)=\left[g_{1}\left(y_{1}, y_{2}\right), g_{2}\left(y_{1}, y_{2}\right)\right]\) maps \(\mathbb{R}^{2}\) to \(\mathbb{R}^{2}\) (its Jacobian is \(2 \times 2\)). Their composition is \(g \circ f\), that is, \(g(f(x))=\left[g_{1}\left(f_{1}(x), f_{2}(x)\right), g_{2}\left(f_{1}(x), f_{2}(x)\right)\right]\). Differentiating this composite function with respect to \(x\) gives:
\[ \frac{\partial \boldsymbol{g}}{\partial x}=\left[ \begin{array}{c}{\frac{\partial}{\partial x} g_{1}\left(f_{1}(x), f_{2}(x)\right)} \\ {\frac{\partial}{\partial x} g_{2}\left(f_{1}(x), f_{2}(x)\right)}\end{array}\right]=\left[ \begin{array}{c}{\frac{\partial g_{1}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{1}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}} \\ {\frac{\partial g_{2}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{2}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}}\end{array}\right] \]
In essence this is the same as the chain rule for ordinary scalar functions: the result equals the matrix product of the two Jacobians.
\[ \frac{\partial g}{\partial x}=\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}=\left[ \begin{array}{ll}{\frac{\partial g_{1}}{\partial f_{1}}} & {\frac{\partial g_{1}}{\partial f_{2}}} \\ {\frac{\partial g_{2}}{\partial f_{1}}} & {\frac{\partial g_{2}}{\partial f_{2}}}\end{array}\right] \left[ \begin{array}{c}{\frac{\partial f_{1}}{\partial x}} \\ {\frac{\partial f_{2}}{\partial x}}\end{array}\right] \]
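A short sketch (again assuming NumPy, with illustrative choices of \(f\) and \(g\)) confirms that multiplying the two hand-written Jacobians reproduces the finite-difference derivative of the composition.

```python
# Check that the Jacobian of g∘f equals the product (dg/df)(df/dx)
# for the shapes used in the text: f: R -> R^2, g: R^2 -> R^2.
import numpy as np

x = 0.7                                           # an arbitrary scalar input

def f(x):
    return np.array([x ** 2, np.sin(x)])          # f1, f2 (illustrative)

def g(y):
    return np.array([y[0] * y[1], y[0] + y[1]])   # g1, g2 (illustrative)

# Hand-written Jacobians
df_dx = np.array([[2 * x], [np.cos(x)]])          # 2 x 1
y = f(x)
dg_df = np.array([[y[1], y[0]], [1.0, 1.0]])      # 2 x 2

chain = dg_df @ df_dx                             # product of Jacobians, 2 x 1

eps = 1e-6
direct = ((g(f(x + eps)) - g(f(x - eps))) / (2 * eps)).reshape(2, 1)
print(np.allclose(chain, direct, atol=1e-5))      # True: both give dg/dx
```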

Useful Identities

For a general matrix

We view a general matrix \(\boldsymbol{W} \in \mathbb{R}^{n \times m}\) as a map that transforms an \(m\)-dimensional vector into an \(n\)-dimensional vector, written \(z=W x\), where:
\[ z_{i}=\sum_{k=1}^{m} W_{i k} x_{k} \]
So each entry of the derivative is easy to compute:
\[ \left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial z_{i}}{\partial x_{j}}=\frac{\partial}{\partial x_{j}} \sum_{k=1}^{m} W_{i k} x_{k}=\sum_{k=1}^{m} W_{i k} \frac{\partial}{\partial x_{j}} x_{k}=W_{i j} \]
Therefore:
\[ \frac{\partial z}{\partial x}=W \]
The matrix can also be applied in the other order, \(z=x W\), where \(x\) is a row vector whose dimension matches the number of rows of \(W\); \(z\) is then a linear combination of the rows of \(W\). In this case:
\[ \frac{\partial z}{\partial x}=\boldsymbol{W}^{T} \]
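Both identities are easy to verify numerically; the sketch below (assuming NumPy and random data) checks \(\partial z / \partial x=W\) for \(z=W x\) and \(\partial z / \partial x=W^{T}\) for \(z=x W\).

```python
# Verify the two identities with a finite-difference Jacobian.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

def jac(f, x, eps=1e-6):
    # central finite-difference Jacobian: rows = outputs, cols = inputs
    out = f(x)
    J = np.zeros((out.shape[0], x.shape[0]))
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

print(np.allclose(jac(lambda v: W @ v, x), W))        # dz/dx = W

x_row = rng.normal(size=n)                            # row vector for z = x W
print(np.allclose(jac(lambda v: v @ W, x_row), W.T))  # dz/dx = W^T
```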

Another view of the general matrix

We can write the linear transformation as \(z=W x\). So far we have always differentiated with respect to \(x\); what does the derivative look like if we instead treat \(W\) as the variable?

Suppose we have:
\[ z=\boldsymbol{W} \boldsymbol{x}, \quad \boldsymbol{\delta}=\frac{\partial J}{\partial \boldsymbol{z}} \]

\[ \frac{\partial J}{\partial \boldsymbol{W}}=\frac{\partial J}{\partial \boldsymbol{z}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}=\delta \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} \]

How do we compute this? For \(\frac{\partial z}{\partial W}\) we can write
\[ \begin{aligned} z_{k} &=\sum_{l=1}^{m} W_{k l} x_{l} \\ \frac{\partial z_{k}}{\partial W_{i j}} &=\sum_{l=1}^{m} x_{l} \frac{\partial}{\partial W_{i j}} W_{k l} \end{aligned} \]
The only nonzero term in the sum is the one with \(k=i\) and \(l=j\), so \(\frac{\partial z_{k}}{\partial W_{i j}}=x_{j}\) when \(k=i\) and \(0\) otherwise. That is (the \(i\)-th entry is \(x_{j}\), all other entries are \(0\)):
\[ \frac{\partial z}{\partial W_{i j}}=\left[ \begin{array}{c}{0} \\ {\vdots} \\ {0} \\ {x_{j}} \\ {0} \\ {\vdots} \\ {0}\end{array}\right] \]
Therefore:
\[ \frac{\partial J}{\partial W_{i j}}=\frac{\partial J}{\partial z} \frac{\partial z}{\partial W_{i j}}=\delta \frac{\partial z}{\partial W_{i j}}=\sum_{k=1}^{n} \delta_{k} \frac{\partial z_{k}}{\partial W_{i j}}=\delta_{i} x_{j} \]
Therefore \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{\delta}^{T} \boldsymbol{x}^{T}\).

Similarly, if we write the transformation as \(z=x W\), then \(\frac{\partial J}{\partial W}=x^{T} \delta\).
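The identity can be checked with a toy loss; in the sketch below (assuming NumPy and random data) we take \(J=\sum_{k} z_{k}\) purely so that \(\delta=\partial J / \partial \boldsymbol{z}\) is a row vector of ones, and compare \(\delta^{T} x^{T}\) with a finite-difference gradient over the entries of \(W\).

```python
# Check dJ/dW = delta^T x^T for z = W x with a toy loss J = sum(z).
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

def J(W):
    return np.sum(W @ x)                  # toy loss; delta = dJ/dz = ones

delta = np.ones((1, n))
analytic = delta.T @ x.reshape(1, m)      # delta^T x^T, shape n x m

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (J(Wp) - J(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric))     # True: dJ/dW_ij = delta_i * x_j
```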

Example: a one-layer neural network

A simple neural network, trained with the cross-entropy loss to optimize its parameters, can be described as follows:
\[ \begin{array}{l}{x=\text { input }} \\ {z=W x+b_{1}} \\ {h=\operatorname{ReLU}(z)} \\ {\theta=U h+b_{2}} \\ {\hat{y}=\operatorname{softmax}(\theta)} \\ {J=C E(y, \hat{y})}\end{array} \]
The dimensions of the quantities involved are:
\[ \boldsymbol{x} \in \mathbb{R}^{D_{x} \times 1} \quad \boldsymbol{b}_{1} \in \mathbb{R}^{D_{h} \times 1} \quad \boldsymbol{W} \in \mathbb{R}^{D_{h} \times D_{x}} \quad \boldsymbol{b}_{2} \in \mathbb{R}^{N_{c} \times 1} \quad \boldsymbol{U} \in \mathbb{R}^{N_{c} \times D_{h}} \]
where \(D_{x}\) is the input dimension, \(D_{h}\) is the hidden-layer dimension, and \(N_{c}\) is the number of classes.

The gradients we need are:
\[ \frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_{2}} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_{1}} \quad \frac{\partial J}{\partial x} \]
These are all fairly easy to compute. Define:
\[ \delta_{1}=\frac{\partial J}{\partial \theta} \quad \delta_{2}=\frac{\partial J}{\partial z} \]

\[ \begin{aligned} \delta_{1} &=\frac{\partial J}{\partial \theta}=(\hat{y}-y)^{T} \\ \delta_{2} &=\frac{\partial J}{\partial z}=\frac{\partial J}{\partial \theta} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} U \frac{\partial h}{\partial z} \\ &=\delta_{1} U \circ \operatorname{ReLU}^{\prime}(z) \\ &=\delta_{1} U \circ \operatorname{sgn}(h) \end{aligned} \]
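Putting everything together, below is a minimal sketch (assuming NumPy, toy dimensions, a random one-hot label, and the row-vector convention used above; this is not the original author's code) of one forward pass and the backward pass that implements \(\delta_{1}\), \(\delta_{2}\), and the parameter gradients via the identities from the previous section.

```python
# One forward/backward pass of the one-hidden-layer network above.
import numpy as np

rng = np.random.default_rng(42)
Dx, Dh, Nc = 5, 4, 3                             # toy dimensions

x  = rng.normal(size=(Dx, 1))
W  = rng.normal(size=(Dh, Dx)); b1 = rng.normal(size=(Dh, 1))
U  = rng.normal(size=(Nc, Dh)); b2 = rng.normal(size=(Nc, 1))
y  = np.eye(Nc)[:, [0]]                          # one-hot label (arbitrary class)

# forward pass
z      = W @ x + b1
h      = np.maximum(z, 0)                        # ReLU
theta  = U @ h + b2
y_hat  = np.exp(theta) / np.sum(np.exp(theta))   # softmax
J      = -np.sum(y * np.log(y_hat))              # cross-entropy loss

# backward pass, row-vector convention
delta1 = (y_hat - y).T                           # 1 x Nc, dJ/dtheta
delta2 = (delta1 @ U) * np.sign(h).T             # 1 x Dh, delta1 U ∘ sgn(h)

grad_U  = delta1.T @ h.T                         # dJ/dU  = delta1^T h^T
grad_b2 = delta1.T                               # dJ/db2
grad_W  = delta2.T @ x.T                         # dJ/dW  = delta2^T x^T
grad_b1 = delta2.T                               # dJ/db1
grad_x  = (delta2 @ W).T                         # dJ/dx
print(J, grad_W.shape, grad_U.shape)
```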

The backpropagation derivation here uses the cross-entropy loss; for the least-squares case, the following video is recommended:

BP
