Preface
[Deep Learning] From Neural Networks to Convolutional Neural Networks
Earlier we introduced the BP neural network and the convolutional neural network (CNN), so why do we also need the recurrent neural network (RNN)?

The inputs and outputs of BP neural networks and CNNs are independent of one another, but in real applications there are scenarios where the output is related to what came before.

BP neural networks and CNNs share one trait: they assume the input is an independent unit with no contextual ties, for example an image that the network classifies as a dog or a cat. But for sequential inputs with clear contextual structure, such as predicting the content of the next frame of a video, the output must obviously depend on earlier inputs; in other words, the network must have a certain "memory". To give the network this kind of memory, a neural network with a special structure, the recurrent neural network (Recurrent Neural Network, RNN), came into being.

The RNN introduces the notion of "memory". "Recurrent" means that every element performs the same task, but the output depends on both the input and the "memory".

Typical RNN applications: natural language processing, machine translation, speech recognition, and so on.
1. RNN (Recurrent Neural Network)
A recurrent neural network is a class of neural networks for processing sequential data. Just as the convolutional neural network is specialized for grid-structured data (such as an image), the recurrent neural network is specialized for processing a sequence $x^{(1)}, \dots, x^{(T)}$.

The RNN network structure is shown below:

The structure of a recurrent neural network is simpler than that of a convolutional neural network: usually it contains only an input layer, a hidden layer, and an output layer, so counting the input and output layers it has at most about five layers.

Unrolling the sequence over time yields the RNN structure, as in the following figure:
The network input at a given time $t$ is $x_t$. Like the input of the BP neural network introduced earlier, $x_t$ is an $n$-dimensional vector; the difference is that the input of a recurrent network is an entire sequence, i.e. $x = [x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T]$. For a language model, each $x_t$ represents a word vector, and a whole sequence represents a sentence.

$h_t$ denotes the linear-transformation value of the hidden neurons at time $t$.

$s_t$ denotes the hidden state at time $t$, i.e. the "memory".

$o_t$ denotes the output at time $t$.

The weights from the input layer to the hidden layer are denoted $U$.

The hidden-to-hidden weights are denoted $W$; $W$ is the network's memory controller, in charge of scheduling the memory.

The hidden-to-output weights are denoted $V$.
1.1 Training the RNN: BPTT
An RNN is trained in the same way as a CNN/ANN, likewise using the BP (error backpropagation) algorithm.

The difference is that the parameters $U$, $V$, $W$ in an RNN are shared, and in stochastic gradient descent the output at each step depends not only on the network at the current step but also on the network state at several previous steps. This modified version of BP is called Backpropagation Through Time (BPTT).

Like the BP algorithm, BPTT can run into vanishing and exploding gradients during multi-step training with long-term dependencies (i.e. when the current output depends on a long preceding stretch of the sequence, generally more than 10 steps).

BPTT follows the same idea as BP, computing partial derivatives; the difference is that it has to account for the effect of time on each step.
1.2 RNN forward propagation
At time $t=1$, $U$, $V$, $W$ have all been randomly initialized, and $s_0$ is usually initialized to 0; we then compute:

$$h_1 = U x_1 + W s_0$$
$$s_1 = f(h_1)$$
$$o_1 = g(V s_1)$$

At time $t=2$, the state $s_1$, as the memory state of time 1, takes part in the prediction at the next time step:

$$h_2 = U x_2 + W s_1$$
$$s_2 = f(h_2)$$
$$o_2 = g(V s_2)$$

Continuing in this way, for a general time $t$:

$$h_t = U x_t + W s_{t-1}$$
$$s_t = f(h_t)$$
$$o_t = g(V s_t)$$

Here $f$ can be an activation function such as tanh, relu, or sigmoid, while $g$ is usually softmax, though other choices are possible.

It is worth noting that when we say a recurrent neural network has memory, that ability comes from $W$, which summarizes past input states and feeds the summary in as an aid to the next input.

The hidden state can be understood as:

$$h = f(\text{current input} + \text{summary of past memory})$$
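To make the recursion concrete, here is a minimal NumPy sketch of the forward pass above. It is only an illustration: $f$ is taken to be tanh and $g$ softmax, and the dimensions and random initialization are assumptions, not from the original text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, U, W, V, s0=None):
    """Run h_t = U x_t + W s_{t-1}, s_t = f(h_t), o_t = g(V s_t) over a sequence."""
    s_prev = np.zeros(W.shape[0]) if s0 is None else s0   # s_0 is usually 0
    states, outputs = [], []
    for x_t in xs:
        h_t = U @ x_t + W @ s_prev   # linear transformation at time t
        s_t = np.tanh(h_t)           # f = tanh: the new "memory"
        o_t = softmax(V @ s_t)       # g = softmax: the output at time t
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return states, outputs

# Toy usage: T=4 steps, n=3 input dims, m=5 hidden units, c=2 output classes.
rng = np.random.default_rng(0)
n, m, c, T = 3, 5, 2, 4
U = rng.normal(size=(m, n))
W = rng.normal(size=(m, m))
V = rng.normal(size=(c, m))
xs = [rng.normal(size=n) for _ in range(T)]
states, outputs = rnn_forward(xs, U, W, V)
print(outputs[-1])                   # o_T: a probability distribution over c classes
```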
1.3 RNN backpropagation
As in the error backpropagation method used for BP neural networks, we sum the errors at the output layer, take the partial derivatives with respect to each weight to obtain the gradients $\nabla U$, $\nabla V$, $\nabla W$, and then update the weights by gradient descent.

At every time step $t$, the output $o_t$ of the RNN incurs some error $e_t$; the error's loss function can be cross-entropy, squared error, and so on. The total error is $E = \sum_t e_t$, and our goal is to compute:

$$\nabla U = \frac{\partial E}{\partial U} = \sum_t\frac{\partial e_t}{\partial U}$$
$$\nabla V = \frac{\partial E}{\partial V} = \sum_t\frac{\partial e_t}{\partial V}$$
$$\nabla W = \frac{\partial E}{\partial W} = \sum_t\frac{\partial e_t}{\partial W}$$

Below we take $t=3$ as an example.

Assume we use mean squared error and the true value is $y_t$; then:
$$e_3 = \frac{1}{2}(o_3 - y_3)^2$$
$$o_3 = g(V s_3)$$
$$e_3 = \frac{1}{2}\big(g(V s_3) - y_3\big)^2$$
$$s_3 = f(U x_3 + W s_2)$$
$$e_3 = \frac{1}{2}\big(g(V f(U x_3 + W s_2)) - y_3\big)^2$$
Computing the partial derivative with respect to $W$: the only term above involving $W$ is $W s_2$, which is clearly a composite function, so we can differentiate with the chain rule for composite functions:

$$\frac{\partial e_3}{\partial W} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial W}$$
We now work through the factors in turn (using the squared-error loss):

$$e_3 = \frac{1}{2}(o_3-y_3)^2 \quad\Longrightarrow\quad \frac{\partial e_3}{\partial o_3} = o_3 - y_3$$
$$o_3 = g(V s_3) \quad\Longrightarrow\quad \frac{\partial o_3}{\partial s_3} = g'\,V$$

where $g'$ denotes the derivative of the function $g$.
The first two factors are straightforward; the important one is the third. From the formula

$$s_t = f(U x_t + W s_{t-1})$$
we see that $s_3$ depends not only on $W$ itself but also on the previous state $s_2$. Expanding $s_3$ directly gives:

$$\frac{\partial s_3}{\partial W} = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}$$

Here $\frac{\partial s_3^+}{\partial W}$ denotes the direct derivative, without composite differentiation, treating everything other than $W$ as constant, while $\frac{\partial s_2}{\partial W}$ denotes the composite derivative.
Expanding $s_2$ directly gives:

$$\frac{\partial s_2}{\partial W} = \frac{\partial s_2}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}$$

Expanding $s_1$ directly gives:

$$\frac{\partial s_1}{\partial W} = \frac{\partial s_1}{\partial s_1}\frac{\partial s_1^+}{\partial W} + \frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}$$
Substituting the last two expansions into the first gives:

$$\frac{\partial s_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$

Finally:

$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
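As a sanity check on this sum, the sketch below (my own illustration, not from the original) implements the equivalent recursion $\frac{\partial s_t}{\partial W} = f'(h_t)\big(s_{t-1} + W\frac{\partial s_{t-1}}{\partial W}\big)$ for a scalar RNN with $f = \tanh$ and $g$ the identity, and compares the analytic $\frac{\partial e_3}{\partial W}$ against a finite-difference estimate.

```python
import numpy as np

# Scalar RNN: s_t = tanh(u*x_t + w*s_{t-1}), o_3 = v*s_3 (g = identity),
# e_3 = 1/2 (o_3 - y_3)^2.

def final_state(w_val, u, xs, s0=0.0):
    s = s0
    for x in xs:
        s = np.tanh(u * x + w_val * s)
    return s                                        # s_3

def analytic_grad_w(w_val, u, v, xs, y, s0=0.0):
    s, ds_dw = s0, 0.0
    for x in xs:
        s_new = np.tanh(u * x + w_val * s)
        # ds_t/dW = f'(h_t) * (s_{t-1} + W * ds_{t-1}/dW); unrolling this
        # recursion reproduces the sum over k = 0..3 derived above.
        ds_dw = (1.0 - s_new ** 2) * (s + w_val * ds_dw)
        s = s_new
    return (v * s - y) * v * ds_dw                  # de3/do3 * do3/ds3 * ds3/dW

w, u, v, y = 0.7, 0.3, 1.1, 0.5
xs = [0.2, -0.4, 0.9]                               # x_1, x_2, x_3

def e3(w_val):
    return 0.5 * (v * final_state(w_val, u, xs) - y) ** 2

eps = 1e-6
numeric = (e3(w + eps) - e3(w - eps)) / (2 * eps)   # central difference
print(analytic_grad_w(w, u, v, xs, y), numeric)     # the two values should agree closely
```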
Another way to see this (suppose we do not consider $f$):

$$s_t = U x_t + W s_{t-1}$$
$$s_3 = U x_3 + W s_2$$
$$\frac{\partial s_3}{\partial W} = s_2 + W\frac{\partial s_2}{\partial W} = s_2 + W s_1 + W W\frac{\partial s_1}{\partial W}$$
$$s_2 = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}$$

where $\frac{\partial s_3}{\partial s_3} = 1$, and $\frac{\partial s_3^+}{\partial W} = s_2$ denotes differentiating $s_3$ with respect to $W$ directly, without composite differentiation.

$$s_2 = U x_2 + W s_1$$

$$W s_1 = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}$$

where $\frac{\partial s_3}{\partial s_2} = W$ and $\frac{\partial s_2^+}{\partial W} = s_1$.

$$s_1 = U x_1 + W s_0$$
$$W W\frac{\partial s_1}{\partial W} = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1^+}{\partial W} = \frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}$$

Finally:

$$\frac{\partial s_3}{\partial W} = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W} = \sum_{k=1}^{3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$$
Following the chain rule along the unrolled graph above:

$$\frac{\partial e_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\Big(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\Big)\frac{\partial s_k^+}{\partial W}$$
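The product $\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}$ is the crux of the limitations discussed later: for $f=\tanh$ each factor is $\mathrm{diag}(1-s_j^2)\,W$, so a long-range gradient multiplies many such Jacobians together, which tends to shrink toward zero when $W$ is small and can blow up when $W$ is large. The short sketch below (my own illustration, with arbitrary scales) accumulates this product and prints its norm:

```python
import numpy as np

# Each step Jacobian of s_j = tanh(W s_{j-1}) is diag(1 - s_j^2) @ W.
# Accumulate 20 of them and compare the norm of the product across scales.
rng = np.random.default_rng(1)
m = 8
for scale in (0.5, 2.0):
    W = scale * rng.normal(size=(m, m)) / np.sqrt(m)
    s = rng.uniform(-0.5, 0.5, size=m)
    J = np.eye(m)
    for _ in range(20):
        s = np.tanh(W @ s)                   # inputs omitted for simplicity
        J = np.diag(1.0 - s ** 2) @ W @ J    # prod_j ds_j/ds_{j-1}
    print(scale, np.linalg.norm(J))          # typically tiny for the small scale
```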
Computing the partial derivative with respect to $U$ (analogous to the derivation for $W$):

$$\frac{\partial e_3}{\partial U} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial U}$$

Suppose:

$$a_t = U x_t, \qquad b_t = W s_{t-1}$$
$$s_t = f(a_t + b_t)$$

For the third factor, from the formula

$$s_3 = f(U x_3 + W s_2)$$
$$\frac{\partial s_3}{\partial U} = f' \times \Big(\frac{\partial U x_3}{\partial U} + W\frac{\partial s_2}{\partial U}\Big)$$
$$= f' \times \bigg(\frac{\partial U x_3}{\partial U} + W f' \times \Big(\frac{\partial U x_2}{\partial U} + W\frac{\partial s_1}{\partial U}\Big)\bigg)$$
$$= f' \times \bigg(\frac{\partial U x_3}{\partial U} + W f' \times \Big(\frac{\partial U x_2}{\partial U} + W f' \times \big(\frac{\partial U x_1}{\partial U} + W\frac{\partial s_0}{\partial U}\big)\Big)\bigg)$$
$$= f' \times \Bigg(\frac{\partial U x_3}{\partial U} + W f' \times \bigg(\frac{\partial U x_2}{\partial U} + W f' \times \Big(\frac{\partial U x_1}{\partial U} + W f' \times \big(\frac{\partial U x_0}{\partial U}\big)\Big)\bigg)\Bigg)$$
$$= f' \times \frac{\partial U x_3}{\partial U} + W (f')^2 \times \frac{\partial U x_2}{\partial U} + W^2 (f')^3 \times \frac{\partial U x_1}{\partial U} + W^3 (f')^4 \times \frac{\partial U x_0}{\partial U}$$
$$= \sum_{k=0}^{3} (f')^{4-k}\,\frac{\partial (W^{3-k} a_k)}{\partial U}$$
$$\frac{\partial e_3}{\partial U} = \sum_{k=0}^{3}\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k} a_k)}{\partial U}(f')^{4-k}$$
I am not sure this result is correct; I would be very grateful if readers who understand it could offer guidance.
Without considering $f$:

$$s_t = U x_t + W s_{t-1}$$
$$s_3 = U x_3 + W\Big(U x_2 + W\big(U x_1 + W U x_0\big)\Big) = U x_3 + W U x_2 + W^2 U x_1 + W^3 U x_0$$
$$s_3 = a_3 + W a_2 + W^2 a_1 + W^3 a_0$$
$$\frac{\partial s_3}{\partial U} = \sum_{k=0}^{3}\frac{\partial (W^{3-k} a_k)}{\partial U}$$
$$\frac{\partial e_3}{\partial U} = \sum_{k=0}^{3}\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k} a_k)}{\partial U}$$
Computing the partial derivative with respect to $V$: since $V$ is related only to the output $o_t$,

$$\frac{\partial e_3}{\partial V} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial V}$$
1.4 RNN limitations

From the derivation above: if information from time $t=0$ has to reach time $t=100$, the power of $W$ in front of it becomes so large that the network may forget the information from $t=0$. We call this the RNN vanishing-gradient problem. It is not vanishing in the literal sense, because the gradient is an accumulated sum and cannot be exactly 0; rather, the gradient at some time step becomes so small that the content of earlier steps is forgotten.

To overcome the vanishing-gradient problem, the LSTM and GRU models were subsequently introduced. Because they both have special ways of storing "memories", a memory with a large gradient is not immediately erased as in a plain RNN, so they can mitigate the vanishing-gradient problem to some extent.

Another simple trick, used to overcome the exploding-gradient problem, is gradient clipping: whenever a computed gradient exceeds the threshold $c$ or falls below $-c$, set the gradient to $c$ or $-c$ respectively.
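A minimal sketch of this element-wise clipping (my own illustration; deep-learning frameworks also offer a norm-based variant that rescales the whole gradient vector):

```python
import numpy as np

def clip_gradient(grad, c):
    """Set every gradient component above c to c and below -c to -c."""
    return np.clip(grad, -c, c)

grad = np.array([0.3, -7.2, 4.1])
print(clip_gradient(grad, c=5.0))   # [ 0.3 -5.   4.1]
```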
The figure below shows the RNN error surface:

As the figure shows, the RNN error surface is either very steep or very flat. If you take no countermeasures and a parameter update happens to land on a steep region, the gradient becomes very large, the parameter update becomes very large as well, and oscillation easily follows. With gradient clipping, even if you are unlucky enough to hit a steep region, the gradient cannot explode, because it is capped at the threshold $c$.
2. LSTM (Long Short-Term Memory)

Because RNNs suffer from the long-term dependency problem, they are prone to vanishing and exploding gradients. As its name suggests, the LSTM is especially well suited to problems that require long-term dependencies. Compared with the RNN:

The LSTM redesigns the "memory cell": information that should be recorded keeps being passed along, while information that should not be recorded is cut off.

The figure below shows the unrolled structure of a recurrent network; each box labeled A represents a "memory cell".

The RNN "memory cell" looks like this: just a simple nonlinear mapping.

The LSTM "memory cell" looks like this: three gates are added to control the cell.
2.1 The memory cell

The cell state is like a conveyor belt: it runs straight along the entire chain with only a few small linear interactions, so it is easy for information to flow along it unchanged.

How does the LSTM control the cell state? The LSTM can remove information from, or add information to, the cell state through gate structures. The LSTM has three main gates controlling the cell state: the forget gate, the input (information-add) gate, and the output gate.
2.2 Forget gate

The output of the previous time step and the input at the current time step are passed through a sigmoid, producing a probability value between 0 and 1 that describes how much of each component is allowed through.

If the value is 0, multiplying it with $C_{t-1}$ yields 0, meaning "let nothing through"; if the value is 1, multiplying it with $C_{t-1}$ yields $C_{t-1}$ itself, meaning "let everything through".

The forget gate decides what information to discard from the cell state. For example, in a language model the cell state may contain gender information ("he" or "she"); when we see a new pronoun, we may want to forget the old one.
2.3 Input gate (information-add gate)

This gate decides what new information to put into the cell state: a sigmoid layer decides which values need updating, and a tanh layer creates a new candidate vector $\widetilde{C}_t$ in preparation for the state update.

After the forget gate and the input gate, the deletions and additions are determined, and the cell state can be updated from $C_{t-1}$ to $C_t$: multiply the old state by $f_t$ to drop the information we decided to discard, then add the new candidate $i_t * \widetilde{C}_t$ to obtain the final updated cell state.
2.4 Output gate

The output gate produces the output from the cell state: first a sigmoid layer determines which part of the cell state will be output; then the cell state is passed through tanh, yielding a value between -1 and 1, which is multiplied by the sigmoid gate's output so that only the chosen part is emitted.
2.5 LSTM forward propagation

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where we abbreviate $[h_{t-1}, x_t]$ as $x_f$.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

where we abbreviate $[h_{t-1}, x_t]$ as $x_i$.

$$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

where we abbreviate $[h_{t-1}, x_t]$ as $x_C$.

$$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

where we abbreviate $[h_{t-1}, x_t]$ as $x_o$.

$$h_t = o_t * \tanh(C_t)$$

$$\hat{y}_t = W_y \cdot h_t + b_y$$
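The NumPy sketch below strings these equations together for a single time step. The shapes, the concatenation $[h_{t-1}, x_t]$, and the random initialization are illustrative assumptions in the spirit of the formulas above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, WC, bC, Wo, bo, Wy, by):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)          # forget gate
    i_t = sigmoid(Wi @ z + bi)          # input gate
    C_tilde = np.tanh(WC @ z + bC)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde  # cell-state update
    o_t = sigmoid(Wo @ z + bo)          # output gate
    h_t = o_t * np.tanh(C_t)            # hidden output
    y_hat = Wy @ h_t + by               # y_hat_t = W_y h_t + b_y
    return h_t, C_t, y_hat

# Toy usage: n=3 input dims, m=4 hidden units, 2 output dims.
rng = np.random.default_rng(0)
n, m = 3, 4
Wf, Wi, WC, Wo = (rng.normal(size=(m, m + n)) for _ in range(4))
bf = bi = bC = bo = np.zeros(m)
Wy, by = rng.normal(size=(2, m)), np.zeros(2)
h, C = np.zeros(m), np.zeros(m)
h, C, y_hat = lstm_step(rng.normal(size=n), h, C,
                        Wf, bf, Wi, bi, WC, bC, Wo, bo, Wy, by)
print(y_hat)
```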
2.6 LSTM backpropagation

Using mean squared error:

$$E = \sum_{t=0}^{T} E_t, \qquad E_t = \frac{1}{2}(\hat{y}_t - y_t)^2$$
$$\frac{\partial E}{\partial W_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial W_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\cdot h_t$$
$$\frac{\partial E}{\partial b_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial b_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial b_y} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\cdot 1$$
Because $W_f$, $W_i$, $W_C$, and $W_o$ are all related to $h_t$ or $C_t$, their derivatives can all be written as chain rules through $h_t$ or $C_t$.

(1) First compute the derivatives of $E$ with respect to $h_t$ and $C_t$.

As the figure above shows, $h_t$ and $C_t$ each lie on two paths, so each derivative consists of two parts: the derivative of the error at the current time step, and the derivative of all errors accumulated from time $t+1$ through $T$.
$$\frac{\partial E}{\partial h_t} = \frac{\partial E_t}{\partial h_t} + \frac{\partial \big(\sum_{k=t+1}^{T}E_k\big)}{\partial h_t}$$
$$\frac{\partial E}{\partial C_t} = \frac{\partial E_t}{\partial C_t} + \frac{\partial \big(\sum_{k=t+1}^{T}E_k\big)}{\partial C_t}$$
$$\frac{\partial E_t}{\partial h_t} = \frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t} = \frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T$$
$$\frac{\partial E_t}{\partial C_t} = \frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial C_t} = \frac{\partial E_t}{\partial h_t}\cdot o_t\cdot\big(1-\tanh^2(C_t)\big) = \frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T\cdot o_t\cdot\big(1-\tanh^2(C_t)\big)$$
The following two quantities cannot be computed yet, so for now we give them names:

$$\frac{\partial \big(\sum_{k=t+1}^{T}E_k\big)}{\partial h_t} = dh_{next}, \qquad \frac{\partial \big(\sum_{k=t+1}^{T}E_k\big)}{\partial C_t} = dC_{next}$$
(2) The partial derivative of $W_o$:

$$\frac{\partial E}{\partial W_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial W_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial W_o}$$
$$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$$
$$\frac{\partial o_t}{\partial W_o} = o_t\cdot(1-o_t)\cdot x_o^T$$
$$\frac{\partial E}{\partial W_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T\cdot \tanh(C_t)\cdot o_t\cdot(1-o_t)\cdot x_o^T$$
(3) The partial derivative of $b_o$:

$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial b_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial b_o}$$
$$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$$
$$\frac{\partial o_t}{\partial b_o} = o_t(1-o_t)$$
$$\frac{\partial E}{\partial b_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T\cdot \tanh(C_t)\cdot o_t\cdot(1-o_t)$$
(4) The partial derivative of $x_o$:

$$\frac{\partial E}{\partial x_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial x_o} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial x_o}$$
$$\frac{\partial o_t}{\partial x_o} = o_t(1-o_t)\cdot W_o^T$$
(5) The partial derivative of $W_C$:

$$\frac{\partial E}{\partial W_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial W_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial W_C}$$
$$\frac{\partial C_t}{\partial \widetilde{C}_t} = i_t$$
$$\frac{\partial \widetilde{C}_t}{\partial W_C} = \big(1-\widetilde{C}_t^2\big)\cdot x_C^T$$
(6) The partial derivative of $b_C$:

$$\frac{\partial E}{\partial b_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial b_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial b_C}$$
$$\frac{\partial \widetilde{C}_t}{\partial b_C} = \big(1-\widetilde{C}_t^2\big)\cdot 1$$
(7) The partial derivative of $x_C$:

$$\frac{\partial E}{\partial x_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial x_C} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial x_C}$$
$$\frac{\partial \widetilde{C}_t}{\partial x_C} = \big(1-\widetilde{C}_t^2\big)\cdot W_C^T$$
(8) The partial derivatives of $W_i$, $b_i$, $x_i$:

$$\frac{\partial E}{\partial W_i} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial W_i}$$
$$\frac{\partial E}{\partial b_i} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial b_i}$$
$$\frac{\partial E}{\partial x_i} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial x_i}$$
$$\frac{\partial C_t}{\partial i_t} = \widetilde{C}_t$$
$$\frac{\partial i_t}{\partial W_i} = i_t\cdot(1-i_t)\cdot x_i^T$$
$$\frac{\partial i_t}{\partial b_i} = i_t\cdot(1-i_t)\cdot 1$$
$$\frac{\partial i_t}{\partial x_i} = i_t\cdot(1-i_t)\cdot W_i^T$$
(9) The partial derivatives of $W_f$, $b_f$, $x_f$:

$$\frac{\partial E}{\partial W_f} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial W_f}$$
$$\frac{\partial E}{\partial b_f} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial b_f}$$
$$\frac{\partial E}{\partial x_f} = \sum_{t=0}^{T}\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial x_f}$$
$$\frac{\partial C_t}{\partial f_t} = C_{t-1}$$

and, by the same pattern as for $i_t$ in step (8):

$$\frac{\partial f_t}{\partial W_f} = f_t\cdot(1-f_t)\cdot x_f^T, \qquad \frac{\partial f_t}{\partial b_f} = f_t\cdot(1-f_t), \qquad \frac{\partial f_t}{\partial x_f} = f_t\cdot(1-f_t)\cdot W_f^T$$
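Putting steps (1) through (9) together, one backward step could be sketched as follows (my own arrangement under the same illustrative shapes as the forward sketch above; here `dh` is the full $\partial E/\partial h_t$, i.e. the current-step term plus $dh_{next}$, and `dC_next` is $dC_{next}$):

```python
import numpy as np

def lstm_step_backward(dh, dC_next, z, f_t, i_t, C_tilde, o_t, C_t, C_prev,
                       Wf, Wi, WC, Wo):
    """One backward step; z = [h_{t-1}, x_t], gates, and states are cached
    from the forward pass."""
    tanhC = np.tanh(C_t)
    dC = dh * o_t * (1.0 - tanhC ** 2) + dC_next    # dE/dC_t, both paths

    d_o = dh * tanhC                                # dh_t/do_t = tanh(C_t)
    d_f = dC * C_prev                               # dC_t/df_t = C_{t-1}
    d_i = dC * C_tilde                              # dC_t/di_t = C~_t
    d_Ct = dC * i_t                                 # dC_t/dC~_t = i_t

    # Pre-activation deltas: sigmoid' = g(1-g), tanh' = 1 - tanh^2
    go = d_o * o_t * (1.0 - o_t)
    gf = d_f * f_t * (1.0 - f_t)
    gi = d_i * i_t * (1.0 - i_t)
    gC = d_Ct * (1.0 - C_tilde ** 2)

    dWo, dbo = np.outer(go, z), go                  # e.g. do_t/dW_o = o_t(1-o_t) x_o^T
    dWf, dbf = np.outer(gf, z), gf
    dWi, dbi = np.outer(gi, z), gi
    dWC, dbC = np.outer(gC, z), gC

    # Gradient flowing back into z = [h_{t-1}, x_t] (the x_o, x_f, x_i, x_C terms)
    dz = Wo.T @ go + Wf.T @ gf + Wi.T @ gi + WC.T @ gC
    dC_prev = dC * f_t                              # becomes dC_next one step earlier
    return (dWf, dbf, dWi, dbi, dWC, dbC, dWo, dbo), dz, dC_prev
```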