(Part 1) Linear recurrent neural network (RNN)

Date: 2019-11-29


Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai
Personal blog: click here

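This tutorial is a translation of the recurrent neural network tutorial written by Peter Roelants. The author has authorized the translation; this is the original.

The tutorial shows how to implement a recurrent neural network (RNN) and consists of two parts. The complete content is available at the following links.

The code in this tutorial was generated from a Python 2 IPython Notebook; a link to the complete code is given at the end of the tutorial. Matrix operations in the network are implemented with NumPy, and the plots are drawn with Matplotlib. If you have not installed these packages yet, I strongly recommend installing them through Anaconda Python, which bundles every package needed to run this tutorial.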

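Recurrent neural networks

This tutorial covers three main topics:

A very simple recurrent neural network (RNN)
Backpropagation through time (BPTT)
Resilient optimization (Rprop)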

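A recurrent neural network is a model for processing sequential data. For a model that evolves over time, the recurrence relation can be defined as:

$S_k = f(S_{k-1} \cdot W_{rec} + X_k \cdot W_x)$

Here S_k is the state at time k, X_k is the input at time k, and W_rec and W_x are the weights of the network. f is the activation function, which is the identity in the linear model used in this tutorial. Put simply, an RNN can be viewed as a state model with a feedback loop. The recurrence relation, together with the delay it implies, brings the state over time into the model. The delay gives the model memory, because it lets the model remember its previous state.

The final output of the network, Y_k, is computed at time k from one or more states S_k, ..., S_(k+j).

Note that we can compute the current state S_k from the input X_k and the previous state S_(k-1), or predict the next state S_(k+1) from the input X_k and the current state S_k.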
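
As a minimal sketch of this recurrence, assuming a scalar state, a linear update without activation function, and illustrative weight and input values that are not taken from the tutorial, the state can be rolled forward one input at a time:

```python
# Minimal sketch of the recurrence S_k = S_{k-1} * W_rec + X_k * W_x.
# The weight values and the input sequence are illustrative assumptions.
w_x = 0.5     # input weight (hypothetical value)
w_rec = 0.9   # recurrent weight (hypothetical value)

inputs = [1.0, 0.0, 1.0]  # X_1, X_2, X_3
state = 0.0               # initial state S_0
states = [state]
for x_k in inputs:
    # The new state depends on the previous state and the current input.
    state = state * w_rec + x_k * w_x
    states.append(state)

print(states)
```

Each entry of `states` depends only on the entry before it and on the current input, which is the feedback loop described above.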

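This tutorial will show that an RNN is not very different from an ordinary feedforward neural network, although the way it is trained differs in some respects.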

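Linear recurrent neural network

In this part of the tutorial we design a simple RNN whose input is a binary stream. Its task is to count how many 1s occur in that stream.

The state of this RNN is one-dimensional, and the input at each timestep is one-dimensional as well. The output is the last state of the sequence, y = S_k. Unfolding the RNN over time gives the model shown in the figure below. Note that the unfolded model can be viewed as an (n+1)-layer neural network in which every layer shares the same weights W_rec and W_x.

Implementing and training this model is an interesting exercise, but it is easy to see that the model is optimal when W_rec = W_x = 1.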
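
The claim about the optimal weights can be checked with a few lines of code. The sketch below is a stand-alone illustration: the helper name `final_state` and the example sequence are assumptions, not part of the tutorial code. With both weights equal to 1, every input is simply added to the running state.

```python
import numpy as np

def final_state(x, w_x, w_rec):
    """Run the linear recurrence over one sequence and return the last state."""
    s = 0.0
    for x_k in x:
        s = s * w_rec + x_k * w_x
    return s

# Illustrative binary sequence (an assumption, not data from the tutorial).
x = np.array([0, 1, 1, 0, 1])
print(final_state(x, 1.0, 1.0), x.sum())  # both equal the number of ones
```

With any other recurrent weight, for example `final_state(x, 1.0, 0.5)`, earlier inputs are scaled down or up and the output no longer equals the count.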

We start by importing the packages needed for this tutorial:

import numpy as np 
import matplotlib
import matplotlib.pyplot as plt 
from matplotlib import cm
from matplotlib.colors import LogNorm

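Defining the dataset

The input dataset X consists of 20 sequences, each of length 10, so every sequence spans 10 timesteps. The inputs are drawn from a uniform random distribution and rounded to 0 or 1.

The target for each sequence is the number of 1s in the binary stream, which is simply the sum of all elements of the sequence.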

# Create dataset
nb_of_samples = 20
sequence_len = 10
# Create the sequences
X = np.zeros((nb_of_samples, sequence_len))
for row_idx in range(nb_of_samples):
    X[row_idx,:] = np.around(np.random.rand(sequence_len)).astype(int)
# Create the targets for each sequence
t = np.sum(X, axis=1)
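
The same kind of dataset can also be drawn without an explicit loop. The following is only a sketch of an alternative: it assumes NumPy 1.17 or later for `np.random.default_rng`, and the seed and the names `X_alt` and `t_alt` are illustrative choices, not part of the original code.

```python
import numpy as np

# Sketch of a loop-free alternative; the seed and the use of
# np.random.default_rng are assumptions, not part of the original code.
rng = np.random.default_rng(seed=0)
nb_of_samples = 20
sequence_len = 10
# Draw every bit at once: integers in {0, 1} with equal probability.
X_alt = rng.integers(0, 2, size=(nb_of_samples, sequence_len)).astype(float)
# Each target is the number of ones in the corresponding row.
t_alt = X_alt.sum(axis=1)
```

Fixing the seed makes the dataset reproducible between runs, which helps when comparing optimization results.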

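Training with backpropagation through time (BPTT)

The standard algorithm for training an RNN is backpropagation through time (BPTT). As the name suggests, it is based on the backpropagation (BP) algorithm.

If you are familiar with regular backpropagation, BPTT holds few surprises. The only difference is that the RNN has to be unfolded over a given number of timesteps. The unfolded network was described at the beginning of this tutorial. After unfolding, the model looks very much like a regular multilayer neural network. The remaining differences are that every layer has multiple inputs (the state from the previous timestep and the current input data) and that the weights (W_rec and W_x) are shared by all layers.

Computing the RNN output with forward propagation

For the forward pass we unfold the RNN so that it can be processed like a regular neural network. The final output of the RNN is then used to compute the loss function on which the network is trained. (This is the same as in a regular multilayer neural network.)

When the RNN is unfolded, the same recurrence relation applies at every timestep. This relation is implemented in the update_state function.

The forward_states function applies update_state at every timestep in a for loop. Each step is vectorized over the samples, so the states of all input sequences are computed in parallel. Because the network has not seen any input at the start, an initial state has to be provided. In this tutorial the initial state S_0 is set to 0.

Finally, the loss ξ is the mean squared error (MSE) over all input sequences. It is implemented in the cost function.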

# Define the forward step functions
def update_state(xk, sk, wx, wRec):
    """ Compute state k from the previous state (sk) and current input (xk), by use of the input weights (wx) and recursive weights (wRec). """
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    """ Unfold the network and compute all state activations given the input X, and input weights (wx) and recursive weights (wRec). Return the state activations in a matrix, the last column S[:,-1] contains the final activations. """
    # Initialise the matrix that holds all states for all input sequences.
    # The initial state s0 is set to 0.
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the
    # states through time.
    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    return S

def cost(y, t): 
    """ Return the MSE between the targets t and the outputs y. """
    return ((t - y)**2).sum() / nb_of_samples
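
Because the update in update_state is linear and the initial state is 0, unrolling the loop gives a closed form for the final state: $S_n = \sum_{k=1}^{n} x_k \, w_x \, w_{rec}^{n-k}$. The sketch below compares this sum with the step-by-step recurrence. The function names, the example input and the weight values are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

def forward_final(x, w_x, w_rec):
    """Final state of the linear recurrence, computed step by step."""
    s = 0.0
    for x_k in x:
        s = s * w_rec + x_k * w_x
    return s

def closed_form_final(x, w_x, w_rec):
    """Same quantity from the unrolled sum: sum_k x_k * w_x * w_rec**(n-k)."""
    n = len(x)
    powers = w_rec ** np.arange(n - 1, -1, -1)  # w_rec^(n-1), ..., w_rec^0
    return float(np.sum(np.asarray(x) * w_x * powers))

x = [1, 0, 1, 1]  # illustrative input (assumption)
print(forward_final(x, 1.2, 0.8), closed_form_final(x, 1.2, 0.8))
```

The closed form makes the role of w_rec visible: an input seen n-k steps before the end is scaled by w_rec raised to the power n-k.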

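Computing the gradients with backpropagation

Before backpropagating, we first compute the gradient of the loss with respect to the output, ∂ξ/∂y. This is done by the output_gradient function. The gradient is then propagated backwards through time, layer by layer, by the backward_gradient function. The recurrence relation for the gradient is:

$\frac{\partial \xi}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_k} \frac{\partial S_k}{\partial S_{k-1}} = \frac{\partial \xi}{\partial S_k} w_{rec}$

The recursion starts at:

$\frac{\partial \xi}{\partial y} = \frac{\partial \xi}{\partial S_n}$

where n is the number of timesteps over which the RNN is unfolded. Note that only the parameter w_rec is involved in propagating the error backwards.

The gradients of the loss with respect to the weights are obtained by summing the contributions of every layer:

$\frac{\partial \xi}{\partial w_x} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_k} x_k$

$\frac{\partial \xi}{\partial w_{rec}} = \sum_{k=1}^{n} \frac{\partial \xi}{\partial S_k} S_{k-1}$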

def output_gradient(y, t):
    """ Compute the gradient of the MSE cost function with respect to the output y. """
    return 2.0 * (y - t) / nb_of_samples

def backward_gradient(X, S, grad_out, wRec):
    """ Backpropagate the gradient computed at the output (grad_out) through the network. Accumulate the parameter gradients for wx and wRec over all layers by addition. Return the parameter gradients as a tuple, and the gradients at the output of each layer. """
    # Initialise the array that stores the gradients of the cost with respect to the states.
    grad_over_time = np.zeros((X.shape[0], X.shape[1]+1))
    grad_over_time[:,-1] = grad_out
    # Set the gradient accumulations to 0
    wx_grad = 0
    wRec_grad = 0
    for k in range(X.shape[1], 0, -1):
        # Compute the parameter gradients and accumulate the results.
        wx_grad += np.sum(grad_over_time[:,k] * X[:,k-1])
        wRec_grad += np.sum(grad_over_time[:,k] * S[:,k-1])
        # Compute the gradient at the output of the previous layer
        grad_over_time[:,k-1] = grad_over_time[:,k] * wRec
    return (wx_grad, wRec_grad), grad_over_time

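Gradient checking

As with regular networks, the gradients of the RNN should be checked. The procedure is the same as the gradient check for a regular multilayer neural network. If the gradients computed by backpropagation are correct, they should be close to the numerically estimated gradients.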

# Perform gradient checking
# Set the weight parameters used during gradient checking
params = [1.2, 1.2]  # [wx, wRec]
# Set the small change to compute the numerical gradient
eps = 1e-7
# Compute the backprop gradients
S = forward_states(X, params[0], params[1])
grad_out = output_gradient(S[:,-1], t)
backprop_grads, grad_over_time = backward_gradient(X, S, grad_out, params[1])
# Compute the numerical gradient for each parameter in the layer
for p_idx, _ in enumerate(params):
    grad_backprop = backprop_grads[p_idx]
    # + eps
    params[p_idx] += eps
    plus_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    # - eps
    params[p_idx] -= 2 * eps
    min_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    # reset param value
    params[p_idx] += eps
    # calculate numerical gradient
    grad_num = (plus_cost - min_cost) / (2*eps)
    # Raise error if the numerical gradient is not close to the backprop gradient
    if not np.isclose(grad_num, grad_backprop):
        raise ValueError('Numerical gradient of {:.6f} is not close to the backpropagation gradient of {:.6f}!'.format(float(grad_num), float(grad_backprop)))
print('No gradient errors found')

No gradient errors found

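Updating the parameters

Because their gradients are unstable, RNNs are notoriously hard to train. This instability also means that plain gradient-based optimizers, such as gradient descent, often fail to find a good local minimum.

The two figures below illustrate this instability. The first figure shows the cost surface as a function of w_x and w_rec. The coloured markers are the parameter settings examined in the second figure. The cost is close to 0 around w_x = w_rec = 1, but it grows very rapidly when |w_rec| > 1.

The second figure shows how the gradient evolves while it is propagated backwards for the marked parameter settings. Its growth or decay over the timesteps has the form of a geometric sequence. The gradient of state S_k with respect to the state m steps earlier, S_(k-m), can be written as:

$\frac{\partial S_k}{\partial S_{k-m}} = \frac{\partial S_k}{\partial S_{k-1}} \cdots \frac{\partial S_{k-m+1}}{\partial S_{k-m}} = w_{rec}^m$

In our simple linear model the gradient grows exponentially if |w_rec| > 1 (exploding gradient) and shrinks exponentially if |w_rec| < 1 (vanishing gradient).

Exploding gradient: in the second figure, the gradient explodes exponentially for w_x = 1, w_rec = 2. For w_x = 1, w_rec = -2 it also grows exponentially, but it alternates between positive and negative values at every step because w_rec is negative. This exploding gradient shows that training is extremely sensitive to the parameter w_rec.

Vanishing gradient: in the second figure, the gradient decreases exponentially until it vanishes for w_x = 1, w_rec = 0.5 and for w_x = 1, w_rec = -0.5. A vanishing gradient means that the model cannot learn long-term dependencies, because the gradient coming from distant timesteps fades away.

If w_rec = 0 the gradient becomes 0 immediately. If w_rec = 1 the gradient stays constant over time.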
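
The geometric behaviour can be reproduced without the network. The sketch below only evaluates the factor w_rec**m by which a gradient is multiplied after m backward steps, for the w_rec values of the annotated points. The dictionary name `scale` and the choice of 10 steps are illustrative assumptions.

```python
import numpy as np

steps = np.arange(0, 11)  # m = 0, 1, ..., 10 backward steps
scale = {}
for w_rec in [2.0, -2.0, 0.5, -0.5, 0.0, 1.0]:
    # Factor w_rec**m that multiplies the gradient after m backward steps.
    scale[w_rec] = w_rec ** steps
    print(w_rec, scale[w_rec][-1])
```

After only 10 steps the factor is 1024 for |w_rec| = 2 and below 0.001 for |w_rec| = 0.5, which is the explosion and the vanishing described above.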

In the next section we show how to optimize such an unstable cost function.

# Define plotting functions

# Define points to annotate (wx, wRec, color)
points = [(2,1,'r'), (1,2,'b'), (1,-2,'g'), (1,0,'c'), (1,0.5,'m'), (1,-0.5,'y')]

def get_cost_surface(w1_low, w1_high, w2_low, w2_high, nb_of_ws, cost_func):
    """Define a vector of weights for which we want to plot the cost."""
    w1 = np.linspace(w1_low, w1_high, num=nb_of_ws)  # Weight 1
    w2 = np.linspace(w2_low, w2_high, num=nb_of_ws)  # Weight 2
    ws1, ws2 = np.meshgrid(w1, w2)  # Generate grid
    cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # Initialize cost matrix
    # Fill the cost matrix for each combination of weights
    for i in range(nb_of_ws):
        for j in range(nb_of_ws):
            cost_ws[i,j] = cost_func(ws1[i,j], ws2[i,j])
    return ws1, ws2, cost_ws

def plot_surface(ax, ws1, ws2, cost_ws):
    """Plot the cost in function of the weights."""
    surf = ax.contourf(ws1, ws2, cost_ws, levels=np.logspace(-0.2, 8, 30), cmap=cm.pink, norm=LogNorm())
    ax.set_xlabel('$w_{in}$', fontsize=15)
    ax.set_ylabel('$w_{rec}$', fontsize=15)
    return surf

def plot_points(ax, points):
    """Plot the annotation points on the given axis."""
    for wx, wRec, c in points:
        ax.plot(wx, wRec, c+'o', linewidth=2)

def get_cost_surface_figure(cost_func, points):
    """Plot the cost surfaces together with the annotated points."""
    # Plot figures
    fig = plt.figure(figsize=(10, 4))   
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    plot_points(ax_1, points)
    ax_1.set_xlim(-3, 3)
    ax_1.set_ylim(-3, 3)
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    plot_points(ax_2, points)
    ax_2.set_xlim(0, 2)
    ax_2.set_ylim(0, 2)
    # Show the colorbar
    fig.subplots_adjust(right=0.8)
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15, rotation=0, labelpad=20)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    fig.suptitle('Cost surface', fontsize=15)
    return fig

def plot_gradient_over_time(points, get_grad_over_time):
    """Plot the gradients of the annotated points and how they evolve over time."""
    fig = plt.figure(figsize=(6.5, 4))  
    ax = plt.subplot(111)
    # Plot points
    for wx, wRec, c in points:
        grad_over_time = get_grad_over_time(wx, wRec)
        x = np.arange(-grad_over_time.shape[1]+1, 1, 1)
        plt.plot(x, np.sum(grad_over_time, axis=0), c+'-', label='({0}, {1})'.format(wx, wRec), linewidth=1, markersize=8)
    plt.xlim(0, -grad_over_time.shape[1]+1)
    # Set up plot axis
    plt.xticks(x)
    plt.yscale('symlog')
    plt.yticks([10**8, 10**6, 10**4, 10**2, 0, -10**2, -10**4, -10**6, -10**8])
    plt.xlabel('timestep k', fontsize=12)
    plt.ylabel('$\\frac{\\partial \\xi}{\\partial S_{k}}$', fontsize=20, rotation=0)
    plt.grid()
    plt.title('Instability of gradient in backward propagation.\n(backpropagate from left to right)')
    # Set legend
    leg = plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), frameon=False, numpoints=1)
    leg.set_title('$(w_x, w_{rec})$', prop={'size':15})
    
def get_grad_over_time(wx, wRec):
    """Helper func to only get the gradient over time from wx and wRec."""
    S = forward_states(X, wx, wRec)
    grad_out = output_gradient(S[:,-1], t).sum()
    _, grad_over_time = backward_gradient(X, S, grad_out, wRec)
    return grad_over_time

# Plot cost surface and gradients

# Get and plot the cost surface figure with markers
fig = get_cost_surface_figure(lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t), points)

# Get the plots of the gradients changing by backpropagating.
plot_gradient_over_time(points, get_grad_over_time)
# Show figures
plt.show()

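Rprop optimization

The previous section showed that the gradients of an RNN are highly unstable, so a single update can jump very far across the cost surface. In other words, the optimizer may move the parameters far away from the true optimum, as illustrated in the figure below:

Recall from our basic neural network tutorial that gradient descent updates the parameters as follows:

$W(i+1) = W(i) - \mu \frac{\partial \xi}{\partial W(i)}$

where W(i) is the value of W at iteration i and μ is the learning rate.

During training, at w_x = 1 and w_rec = 2 (the blue point on the cost surface) the gradient is of the order of 10^7. Even with a very small learning rate such as 0.000001 (1e-6), the parameters W would move about 10 units away from their current position, which is catastrophic for our model. One remedy is to lower the learning rate even further, but then the parameters barely move at all wherever the gradient is small.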
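
The arithmetic behind this dilemma is easy to reproduce. In the sketch below the large gradient is the order of magnitude quoted in the text, while the small gradient is a hypothetical value chosen only for illustration.

```python
# Worked arithmetic for a single gradient descent step.
learning_rate = 1e-6

large_gradient = 1e7   # order of magnitude near (w_x, w_rec) = (1, 2)
step_large = learning_rate * large_gradient
print(step_large)      # size of the step: about 10 units

small_gradient = 1e-2  # hypothetical small gradient in a flat region
step_small = learning_rate * small_gradient
print(step_small)      # about 1e-8: practically no movement
```

A single fixed learning rate cannot serve both regions: it is too large where the gradient explodes and too small where the surface is flat.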

Researchers have proposed many methods to cope with unstable gradients, such as gradient clipping, Hessian-free optimization and momentum.
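
Of these methods, gradient clipping is the simplest to illustrate. The following is a minimal sketch of one common variant, clipping by Euclidean norm. It is not used anywhere in this tutorial, and the function name `clip_gradient` and the threshold `max_norm` are assumptions made for the example.

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale grad so that its Euclidean norm does not exceed max_norm."""
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

print(clip_gradient([3e6, -4e6], max_norm=1.0))  # direction kept, norm capped
print(clip_gradient([0.3, -0.4], max_norm=1.0))  # small gradients are unchanged
```

Clipping bounds the size of a step but keeps its direction, so it addresses exploding gradients only.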

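We can also use an optimization algorithm that is less sensitive to the magnitude of the gradient. One such technique is resilient backpropagation (Rprop). Rprop resembles the momentum method from the earlier tutorial, but it uses only the sign of the gradient to update the parameters. The algorithm works as follows. Each weight w has its own update value Δ, initialized to a nonzero value. At each iteration, if the sign of the gradient with respect to w is the same as in the previous iteration, Δ is multiplied by η^+; otherwise Δ is multiplied by η^-. The weight is then moved by Δ in the direction opposite to the sign of the gradient:

$w(i+1) = w(i) - \mathrm{sign}\left(\frac{\partial \xi}{\partial w(i)}\right) \Delta$

The hyperparameters are usually set to η^+ = 1.2 and η^- = 0.5. Compared with momentum, this means that the update value is increased by 20% when the sign of the gradient does not change and is halved when the sign changes. The Rprop update value Δ plays a role similar to the velocity in momentum. The difference is that Δ only reflects the magnitude of the velocity, not its direction. The direction is determined by the sign of the current gradient.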
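
Since Rprop looks only at the sign of the gradient, its trajectory does not depend on how steep the cost function is. The sketch below illustrates this on two hypothetical one-dimensional quadratic costs whose gradients differ by a factor of one million. The function `rprop_minimise` and both cost functions are assumptions made for this example; the sign comparison follows the same rule as update_rprop in the code further down.

```python
import numpy as np

def rprop_minimise(grad_func, w, delta=0.001, eta_p=1.2, eta_n=0.5, nb_iter=200):
    """Minimise a 1-D function with a sign-based Rprop rule."""
    prev_sign = 0.0
    for _ in range(nb_iter):
        sign = np.sign(grad_func(w))
        # Grow the update value while the sign is stable, shrink it otherwise.
        delta *= eta_p if sign == prev_sign else eta_n
        # Step against the gradient; only its sign is used, not its size.
        w -= sign * delta
        prev_sign = sign
    return w

# Two hypothetical quadratic costs with very different gradient magnitudes.
w_gentle = rprop_minimise(lambda w: 2.0 * (w - 1.0), w=-1.5)
w_steep = rprop_minimise(lambda w: 2.0e6 * (w - 1.0), w=-1.5)
print(w_gentle, w_steep)  # same trajectory, because only the sign matters
```

Both runs end at the same point near the minimum at w = 1, whereas gradient descent would need a different learning rate for each of the two costs.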

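In this tutorial we run Rprop for 500 iterations. The blue dots in the figure below are the parameter values visited on the cost surface. Note that although the weights start in a region with a very high cost and a very large gradient, Rprop still ends up very close to the optimum at (1, 1).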

# Define Rprop optimisation function
def update_rprop(X, t, W, W_prev_sign, W_delta, eta_p, eta_n):
    """ Update Rprop values in one iteration. X: input data. t: targets. W: Current weight parameters. W_prev_sign: Previous sign of the W gradient. W_delta: Rprop update values (Delta). eta_p, eta_n: Rprop hyperparameters. """
    # Perform forward and backward pass to get the gradients
    S = forward_states(X, W[0], W[1])
    grad_out = output_gradient(S[:,-1], t)
    W_grads, _ = backward_gradient(X, S, grad_out, W[1])
    W_sign = np.sign(W_grads)  # Sign of new gradient
    # Update the Delta (update value) for each weight parameter separately
    for i, _ in enumerate(W):
        if W_sign[i] == W_prev_sign[i]:
            W_delta[i] *= eta_p
        else:
            W_delta[i] *= eta_n
    return W_delta, W_sign

# Perform Rprop optimisation

# Set hyperparameters
eta_p = 1.2
eta_n = 0.5

# Set initial parameters
W = [-1.5, 2]  # [wx, wRec]
W_delta = [0.001, 0.001]  # Update values (Delta) for W
W_sign = [0, 0]  # Previous sign of W

ls_of_ws = [(W[0], W[1])]  # List of weights to plot
# Iterate over 500 iterations
for i in range(500):
    # Get the update values and sign of the last gradient
    W_delta, W_sign = update_rprop(X, t, W, W_sign, W_delta, eta_p, eta_n)
    # Update each weight parameter separately
    # (j is used so the outer loop variable i is not shadowed)
    for j, _ in enumerate(W):
        W[j] -= W_sign[j] * W_delta[j]
    ls_of_ws.append((W[0], W[1]))  # Add weights to list to plot

print('Final weights are: wx = {0}, wRec = {1}'.format(W[0], W[1]))

Final weights are: wx = 1.00135554721, wRec = 0.999674473785

# Plot the cost surface with the weights over the iterations.

# Define plot function
def plot_optimisation(ls_of_ws, cost_func):
    """Plot the optimisation iterations on the cost surface."""
    ws1, ws2 = zip(*ls_of_ws)
    # Plot figures
    fig = plt.figure(figsize=(10, 4))
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    ax_1.plot(ws1, ws2, 'b.')
    ax_1.set_xlim([-3,3])
    ax_1.set_ylim([-3,3])
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    # Draw the zoomed surface once, with the same +1 offset as the overview
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    ax_2.set_xlim([0,2])
    ax_2.set_ylim([0,2])
    ax_2.plot(ws1, ws2, 'b.')
    # Show the colorbar
    fig.subplots_adjust(right=0.8)
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    plt.suptitle('Cost surface', fontsize=15)
    plt.show()
    
# Plot the optimisation
plot_optimisation(ls_of_ws, lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t))
plt.show()

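Testing the model

Finally, we write some test code. Running it shows that the model output is very close to the target value. Rounding the model output to the nearest integer makes the prediction match the target exactly.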

# print() with a single argument works in both Python 2 and Python 3
test_inpt = np.array([[0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]])
print(test_inpt)
test_outpt = forward_states(test_inpt, W[0], W[1])[:,-1]
print('Target output: {:d} vs Model output: {:.2f}'.format(int(test_inpt.sum()), float(test_outpt[0])))

Target output: 5 vs Model output: 4.99
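
The rounding step mentioned above can be sketched as follows. This stand-alone example reuses the final weights printed after the Rprop run and the test sequence from the code above. The helper `linear_rnn_output` is a hypothetical stand-in for forward_states restricted to a single sequence.

```python
import numpy as np

def linear_rnn_output(x, w_x, w_rec):
    """Final state of the linear RNN for a single input sequence x."""
    s = 0.0
    for x_k in x:
        s = s * w_rec + x_k * w_x
    return s

# Weights printed after the Rprop run; test sequence taken from the text.
w_x, w_rec = 1.00135554721, 0.999674473785
test_seq = np.array([0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1])
raw_output = linear_rnn_output(test_seq, w_x, w_rec)
# Round to the nearest integer to obtain a count.
rounded_output = int(np.rint(raw_output))
print(raw_output, rounded_output, int(test_seq.sum()))
```

Rounding works here because the trained weights are close enough to 1 that the error on a sequence of this length stays well below 0.5.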

Full code: click here

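CoderPai is a platform focused on hands-on algorithm practice, covering everything from basic algorithms to artificial intelligence algorithms. If you are interested in algorithm practice, follow us. You can join the AI practice WeChat group, the AI practice QQ group, the ACM algorithm WeChat group and the ACM algorithm QQ group. For details, follow the WeChat account "CoderPai" (coderpai).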