Study Notes CB012: Simple LSTM Implementation, Full LSTM Implementation, torch, and a word2vec + LSTM Bot Trained on a Novel

The most practical way to truly master an algorithm is to write it out completely by hand.

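LSTM (Long Short-Term Memory) is a special kind of recurrent neural network whose neurons retain a memory of past inputs. It addresses a limitation of statistical methods in natural language processing, which can only consider the most recent n words and ignore anything earlier. Applications include word representation (embedding, i.e. word vectors), sequence-to-sequence learning (predicting a sentence from an input sentence), machine translation, and speech recognition.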

A binary adder built on an LSTM-style recurrent network in just over 100 lines of plain Python: https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/ (Chinese translation: http://blog.csdn.net/zzukun/article/details/49968129 ).

import copy, numpy as np
np.random.seed(0)

Start by importing numpy for matrix operations.

def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

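Declare the sigmoid activation function, a basic building block of neural networks. Common activation functions include sigmoid, tanh, and relu. sigmoid outputs values in (0, 1) and tanh in (-1, 1). Here x is a vector and the returned output is a vector of the same shape.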

def sigmoid_output_to_derivative(output):
    return output*(1-output)

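Declare the derivative of sigmoid, expressed in terms of the sigmoid output.

The idea behind the adder: binary addition adds two numbers bit by bit and carries a 1 whenever a position overflows. Training draws random samples with c = a + b. Feeding in a and b and producing c is the network's entire prediction process; what training learns is the set of transformation matrices and weights that map the bits of a and b to the bits of c, i.e. the neural network itself.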
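The helper sigmoid_output_to_derivative takes the sigmoid output, not the raw input x. A minimal Python 3 sketch that compares it against a finite-difference estimate; numerical_derivative is a hypothetical helper introduced only for this check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_output_to_derivative(output):
    # Expects the sigmoid *output* s, and returns s * (1 - s).
    return output * (1.0 - output)

def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference (hypothetical helper for this check only).
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in (-2.0, 0.0, 0.5, 3.0):
    analytic = sigmoid_output_to_derivative(sigmoid(x))
    numeric = numerical_derivative(sigmoid, x)
    print(x, analytic, numeric)
```

The two columns should agree to several decimal places, which is why the training loop can reuse the stored layer outputs instead of recomputing the inputs.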

int2binary = {}

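Declare a dictionary that maps integers to their binary representation. Storing the results up front is faster than computing them on every lookup.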

binary_dim = 8

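largest_number = pow(2,binary_dim)

Declare the width of the binary numbers, 8 bits. Eight bits can represent 2^8 = 256 distinct values (0 to 255); this count is stored in largest_number.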

binary = np.unpackbits(
                       np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

Precompute and store the integer-to-binary lookup dictionary.
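To see what the lookup table contains, here is a small Python 3 sketch of the same construction. binary2int is a hypothetical helper added for the round trip; it is not part of the original listing:

```python
import numpy as np

binary_dim = 8
largest_number = 2 ** binary_dim  # 256 distinct values: 0..255

# Column vector of 0..255, unpacked row by row into 8 bits each
# (most significant bit first).
binary = np.unpackbits(
    np.arange(largest_number, dtype=np.uint8).reshape(-1, 1), axis=1)
int2binary = {i: binary[i] for i in range(largest_number)}

def binary2int(bits):
    # Hypothetical inverse of the lookup: most significant bit first.
    out = 0
    for bit in bits:
        out = out * 2 + int(bit)
    return out

print(int2binary[5])                # bits of 5
print(binary2int(int2binary[200]))  # back to the integer
```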

alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1

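Set the hyperparameters. alpha is the learning rate. input_dim is the dimension of the input layer vector; each time step takes one bit from each of a and b, so it is 2. hidden_dim is the dimension of the hidden layer vector, i.e. the number of hidden neurons. output_dim is the dimension of the output layer vector; each step outputs one bit of c, so it is 1. The input-to-hidden weight matrix is 2×16, the hidden-to-output weight matrix is 16×1, and the hidden-to-hidden weight matrix is 16×16: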

synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

About the 2x-1 form: np.random.random generates random floats in [0, 1), and 2x-1 rescales them to [-1, 1).

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

Declare the three update matrices (the accumulated deltas), one per weight matrix.

for j in range(10000):

Run 10000 training iterations.

a_int = np.random.randint(largest_number/2)
a = int2binary[a_int]
b_int = np.random.randint(largest_number/2)
b = int2binary[b_int]
c_int = a_int + b_int
c = int2binary[c_int]

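Generate a random sample consisting of the binary numbers a, b, and c with c = a + b. a_int, b_int, and c_int are the corresponding integers. Drawing a_int and b_int below largest_number/2 guarantees that the sum still fits in 8 bits.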

d = np.zeros_like(c)

d stores the model's prediction of c.

overallError = 0

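The overall error, used to monitor how well the model is doing.

layer_2_deltas = list()

Stores the layer 2 (output layer) residuals. For the derivation of the output-layer residual formula, see http://deeplearning.stanford.edu/wiki/index.php/%E5%8F%8D%E5%90%91%E4%BC%A0%E5%AF%BC%E7%AE%97%E6%B3%95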

layer_1_values = list()
layer_1_values.append(np.zeros(hidden_dim))

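Stores the layer 1 (hidden layer) outputs. It starts with a zero vector that stands in for the hidden state before the first time step.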

for position in range(binary_dim):

Iterate over every bit of the binary number.

X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
y = np.array([[c[binary_dim - position - 1]]]).T

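X and y are the input and output bits of the sample at the current position. X holds two values per step: the corresponding bits of a and b. The index binary_dim - position - 1 means the loop starts at the least significant bit. Each sample is split into its individual bits for training. Because binary addition has to remember a carry, it is a natural fit for the memory of a recurrent network, and the 8 bits of each sample form one time series.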
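The reason a recurrent state is needed can be seen in ordinary bitwise addition: the carry is the only information passed from one bit position to the next. A minimal Python 3 sketch of that rule (not the network itself); to_bits and add_bitwise are hypothetical names:

```python
def to_bits(n, width=8):
    # Most significant bit first, like the int2binary table.
    return [(n >> (width - 1 - k)) & 1 for k in range(width)]

def add_bitwise(a_bits, b_bits):
    width = len(a_bits)
    result = [0] * width
    carry = 0  # the only state carried between time steps
    for position in range(width):
        idx = width - position - 1  # least significant bit first
        total = a_bits[idx] + b_bits[idx] + carry
        result[idx] = total % 2
        carry = total // 2
    return result

print(add_bitwise(to_bits(9), to_bits(60)))  # bits of 9 + 60
```

The network has to learn the equivalent of the carry variable inside its hidden state, which is why the bits are fed in from the least significant position upward.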

layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

Formula: C_t = sigma(W0·X_t + Wh·C_{t-1})

layer_2 = sigmoid(np.dot(layer_1,synapse_1))

The formula used here is C2 = sigma(W1·C1).

layer_2_error = y - layer_2

Compute the error between the true value and the prediction.

layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))

Backpropagation: compute the output delta and append it to the list layer_2_deltas.

overallError += np.abs(layer_2_error[0])

Accumulate the total error, for display and monitoring.

d[binary_dim - position - 1] = np.round(layer_2[0][0])

Store the predicted output bit for this position.

layer_1_values.append(copy.deepcopy(layer_1))

Store the hidden layer values produced along the way.

future_layer_1_delta = np.zeros(hidden_dim)

Holds the hidden layer delta of the next time step, which the backward pass needs; initialize it to zeros.

for position in range(binary_dim):

Iterate over every bit again, this time for the backward pass.

X = np.array([[a[position],b[position]]])

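Fetch X. The updates start from the most significant bit, because backpropagation walks backward through time, one step at a time.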

layer_1 = layer_1_values[-position-1]

Fetch the hidden layer output for this position.

prev_layer_1 = layer_1_values[-position-2]

Fetch the hidden layer output of the previous time step.

layer_2_delta = layer_2_deltas[-position-1]

Fetch the output layer delta for this position.

layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

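The backpropagation formula for the hidden layer: the delta coming from the output layer plus the hidden layer delta (δ) coming from the next time step, multiplied by the sigmoid derivative.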

synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)

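Accumulate the update for the hidden-to-output weight matrix. The partial derivative with respect to a weight matrix is the product of this layer's output (transposed) and the next layer's delta.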

synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)

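Update for the hidden-to-hidden weight matrix: the product of the previous time step's hidden output (transposed) and this time step's hidden delta.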

synapse_0_update += X.T.dot(layer_1_delta)

Update for the input-to-hidden weight matrix.

future_layer_1_delta = layer_1_delta

Save this time step's hidden delta for use at the previous time step.

synapse_0 += synapse_0_update * alpha
synapse_1 += synapse_1_update * alpha
synapse_h += synapse_h_update * alpha

Apply the accumulated updates to the weight matrices.

synapse_0_update *= 0
synapse_1_update *= 0
synapse_h_update *= 0

Reset the update accumulators to zero.

if(j % 1000 == 0):
        print "Error:" + str(overallError)
        print "Pred:" + str(d)
        print "True:" + str(c)
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print str(a_int) + " + " + str(b_int) + " = " + str(out)
        print "------------"

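Print the total error every 1000 training samples so that convergence can be watched while the program runs.

This is the simplest possible implementation: it has no bias terms and only a single hidden layer. Strictly speaking, the listing is a plain recurrent network with one hidden state and no gates; a gated LSTM cell follows in the next section.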
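Because the listing above is split into fragments without their loop indentation, it can help to see the pieces assembled. The following is a Python 3 sketch of the same procedure, not the original author's file: the function name train_adder, its parameters, and the returned error list are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1.0 - output)

def train_adder(iterations=10000, seed=0, alpha=0.1, binary_dim=8, hidden_dim=16):
    """Hypothetical wrapper: trains the adder, returns the per-sample error history."""
    rng = np.random.RandomState(seed)
    largest_number = 2 ** binary_dim
    # Row i holds the bits of integer i, most significant bit first.
    int2binary = np.unpackbits(
        np.arange(largest_number, dtype=np.uint8).reshape(-1, 1), axis=1)

    synapse_0 = 2 * rng.random_sample((2, hidden_dim)) - 1           # input -> hidden
    synapse_1 = 2 * rng.random_sample((hidden_dim, 1)) - 1           # hidden -> output
    synapse_h = 2 * rng.random_sample((hidden_dim, hidden_dim)) - 1  # hidden -> hidden
    errors = []

    for _ in range(iterations):
        a_int = rng.randint(largest_number // 2)
        b_int = rng.randint(largest_number // 2)
        a, b, c = int2binary[a_int], int2binary[b_int], int2binary[a_int + b_int]

        overall_error = 0.0
        layer_2_deltas = []
        layer_1_values = [np.zeros((1, hidden_dim))]

        # Forward pass, least significant bit first.
        for position in range(binary_dim):
            bit = binary_dim - position - 1
            X = np.array([[a[bit], b[bit]]])
            y = np.array([[c[bit]]])
            layer_1 = sigmoid(X.dot(synapse_0) + layer_1_values[-1].dot(synapse_h))
            layer_2 = sigmoid(layer_1.dot(synapse_1))
            layer_2_error = y - layer_2
            layer_2_deltas.append(layer_2_error * sigmoid_output_to_derivative(layer_2))
            overall_error += abs(float(layer_2_error[0, 0]))
            layer_1_values.append(layer_1.copy())

        # Backward pass through time, last time step first.
        synapse_0_update = np.zeros_like(synapse_0)
        synapse_1_update = np.zeros_like(synapse_1)
        synapse_h_update = np.zeros_like(synapse_h)
        future_layer_1_delta = np.zeros((1, hidden_dim))
        for position in range(binary_dim):
            X = np.array([[a[position], b[position]]])
            layer_1 = layer_1_values[-position - 1]
            prev_layer_1 = layer_1_values[-position - 2]
            layer_2_delta = layer_2_deltas[-position - 1]
            layer_1_delta = (future_layer_1_delta.dot(synapse_h.T)
                             + layer_2_delta.dot(synapse_1.T)
                             ) * sigmoid_output_to_derivative(layer_1)
            synapse_1_update += layer_1.T.dot(layer_2_delta)
            synapse_h_update += prev_layer_1.T.dot(layer_1_delta)
            synapse_0_update += X.T.dot(layer_1_delta)
            future_layer_1_delta = layer_1_delta

        synapse_0 += synapse_0_update * alpha
        synapse_1 += synapse_1_update * alpha
        synapse_h += synapse_h_update * alpha
        errors.append(overall_error)
    return errors
```

Calling errors = train_adder() and comparing the average of the first few hundred entries with the last few hundred gives a rough view of convergence without printing inside the loop.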

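A complete LSTM implementation in Python. It follows the paper that its author calls a "great intro paper". Code: https://github.com/nicodjimenez/lstm ; the author's explanation: http://nicodjimenez.github.io/2014/08/08/lstm.html ; for the step-by-step picture, see the diagrams in http://colah.github.io/posts/2015-08-Understanding-LSTMs/ .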

import random
import numpy as np
import math

def sigmoid(x):
    return 1. / (1 + np.exp(-x))

Declare the sigmoid function.

def rand_arr(a, b, *args):
    np.random.seed(0)
    return np.random.rand(*args) * (b - a) + a

Generate a random array with values in [a, b); the shape is given by args.
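One detail of rand_arr is worth checking: because it calls np.random.seed(0) on every call, two arrays of the same shape come out identical. A small Python 3 sketch of that behavior; rand_arr_seeded_once is a hypothetical variant shown only for contrast and is not in the original repository:

```python
import numpy as np

def rand_arr(a, b, *args):
    np.random.seed(0)  # as in the listing: reseeds on every call
    return np.random.rand(*args) * (b - a) + a

wg = rand_arr(-0.1, 0.1, 3, 5)
wi = rand_arr(-0.1, 0.1, 3, 5)
print(np.array_equal(wg, wi))  # same seed, same shape, same values

# Hypothetical variant: one generator seeded once and shared across calls.
_rng = np.random.RandomState(0)

def rand_arr_seeded_once(a, b, *args):
    return _rng.rand(*args) * (b - a) + a

vg = rand_arr_seeded_once(-0.1, 0.1, 3, 5)
vi = rand_arr_seeded_once(-0.1, 0.1, 3, 5)
print(np.array_equal(vg, vi))  # successive draws differ
```

Whether identical initial gate weights matter for this toy example is a separate question; the sketch only shows what the listing does.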

class LstmParam:
    def __init__(self, mem_cell_ct, x_dim):
        self.mem_cell_ct = mem_cell_ct
        self.x_dim = x_dim
        concat_len = x_dim + mem_cell_ct
        # weight matrices
        self.wg = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wi = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wf = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wo = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        # bias terms
        self.bg = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bi = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bf = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bo = rand_arr(-0.1, 0.1, mem_cell_ct)
        # diffs (derivative of loss function w.r.t. all parameters)
        self.wg_diff = np.zeros((mem_cell_ct, concat_len))
        self.wi_diff = np.zeros((mem_cell_ct, concat_len))
        self.wf_diff = np.zeros((mem_cell_ct, concat_len))
        self.wo_diff = np.zeros((mem_cell_ct, concat_len))
        self.bg_diff = np.zeros(mem_cell_ct)
        self.bi_diff = np.zeros(mem_cell_ct)
        self.bf_diff = np.zeros(mem_cell_ct)
        self.bo_diff = np.zeros(mem_cell_ct)

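The LstmParam class holds the parameters. mem_cell_ct is the number of LSTM memory cells, x_dim is the dimension of the input data, and concat_len is the sum of x_dim and mem_cell_ct. wg is the weight matrix of the input node, wi of the input gate, wf of the forget gate, and wo of the output gate; bg, bi, bf, and bo are the corresponding biases. wg_diff, wi_diff, wf_diff, and wo_diff hold the gradients of the loss with respect to those weights, and bg_diff, bi_diff, bf_diff, and bo_diff the gradients with respect to the biases. Weights and biases are initialized randomly according to their shapes, and the gradient arrays start at zero.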

def apply_diff(self, lr = 1):
        self.wg -= lr * self.wg_diff
        self.wi -= lr * self.wi_diff
        self.wf -= lr * self.wf_diff
        self.wo -= lr * self.wo_diff
        self.bg -= lr * self.bg_diff
        self.bi -= lr * self.bi_diff
        self.bf -= lr * self.bf_diff
        self.bo -= lr * self.bo_diff
        # reset diffs to zero
        self.wg_diff = np.zeros_like(self.wg)
        self.wi_diff = np.zeros_like(self.wi)
        self.wf_diff = np.zeros_like(self.wf)
        self.wo_diff = np.zeros_like(self.wo)
        self.bg_diff = np.zeros_like(self.bg)
        self.bi_diff = np.zeros_like(self.bi)
        self.bf_diff = np.zeros_like(self.bf)
        self.bo_diff = np.zeros_like(self.bo)

Define the weight update step: subtract the learning rate times the gradient, then reset the gradient arrays to zero.

class LstmState:
    def __init__(self, mem_cell_ct, x_dim):
        self.g = np.zeros(mem_cell_ct)
        self.i = np.zeros(mem_cell_ct)
        self.f = np.zeros(mem_cell_ct)
        self.o = np.zeros(mem_cell_ct)
        self.s = np.zeros(mem_cell_ct)
        self.h = np.zeros(mem_cell_ct)
        self.bottom_diff_h = np.zeros_like(self.h)
        self.bottom_diff_s = np.zeros_like(self.s)
        self.bottom_diff_x = np.zeros(x_dim)

LstmState stores the state of the LSTM cells: g, i, f, o, s, and h. s is the internal state (the memory) and h is the hidden layer output.

class LstmNode:
    def __init__(self, lstm_param, lstm_state):
        # store reference to parameters and to activations
        self.state = lstm_state
        self.param = lstm_param
        # non-recurrent input to node
        self.x = None
        # non-recurrent input concatenated with recurrent input
        self.xc = None

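An LstmNode corresponds to one input sample, i.e. one time step. x is the input sample and xc is x concatenated with the recurrent input h(t-1) using hstack (hstack concatenates horizontally, vstack vertically).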

def bottom_data_is(self, x, s_prev = None, h_prev = None):
        # if this is the first lstm node in the network
        if s_prev is None: s_prev = np.zeros_like(self.state.s)
        if h_prev is None: h_prev = np.zeros_like(self.state.h)
        # save data for use in backprop
        self.s_prev = s_prev
        self.h_prev = h_prev

        # concatenate x(t) and h(t-1)
        xc = np.hstack((x,  h_prev))
        self.state.g = np.tanh(np.dot(self.param.wg, xc) + self.param.bg)
        self.state.i = sigmoid(np.dot(self.param.wi, xc) + self.param.bi)
        self.state.f = sigmoid(np.dot(self.param.wf, xc) + self.param.bf)
        self.state.o = sigmoid(np.dot(self.param.wo, xc) + self.param.bo)
        self.state.s = self.state.g * self.state.i + s_prev * self.state.f
        self.state.h = self.state.s * self.state.o
        self.x = x
        self.xc = xc

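bottom and top are two directions: input samples enter from the bottom, and backpropagation flows from the top down to the bottom. bottom_data_is is the forward (input) step: it concatenates x with the previous output h(t-1), computes g, i, f, and o with the formula wx + b, and applies the tanh and sigmoid activations.

At each time step the cell contains four neural network layers (activation functions). The leftmost is the forget gate, which acts directly on the memory C. The second is the input gate, which depends on the input sample and controls the proportion of new information that enters the memory C. The new information itself comes from the third layer (tanh), whose range of (-1, 1) lets it push the memory in either the positive or the negative direction. The last is the output gate: the output at each time step depends on the input sample x and the previous output, and also on the memory C. The design imitates the memory function of biological neurons.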
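Written as equations, the forward step in bottom_data_is reads as follows. The symbols mirror the code (wg becomes W_g, xc becomes x_c, and so on); the elementwise product sign is notation added here:

```latex
\begin{aligned}
x_c(t) &= [\,x(t),\ h(t-1)\,] \\
g(t) &= \tanh\bigl(W_g\, x_c(t) + b_g\bigr) \\
i(t) &= \sigma\bigl(W_i\, x_c(t) + b_i\bigr) \\
f(t) &= \sigma\bigl(W_f\, x_c(t) + b_f\bigr) \\
o(t) &= \sigma\bigl(W_o\, x_c(t) + b_o\bigr) \\
s(t) &= g(t) \odot i(t) + s(t-1) \odot f(t) \\
h(t) &= s(t) \odot o(t)
\end{aligned}
```

Note that this listing uses h(t) = s(t) ⊙ o(t) directly, whereas many LSTM descriptions apply a tanh to s(t) before the output gate.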

def top_diff_is(self, top_diff_h, top_diff_s):
        # notice that top_diff_s is carried along the constant error carousel
        ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds

        # diffs w.r.t. vector inside sigma / tanh function
        di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg

        # diffs w.r.t. inputs
        self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input

        # compute bottom diff
        dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)

        # save bottom diffs
        self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]

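Backpropagation is the core of the whole training process. Suppose that at time t the LSTM outputs the prediction h(t) while the actual value is y(t). The difference between them is the loss. Take the loss function to be l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2, the squared Euclidean distance, and the overall loss to be L = ∑l(t) for t from 1 to T, where T is the length of the whole sequence. The goal is to minimize L by gradient descent: find weights w for which L is smallest, so that a tiny change in w no longer changes L. That is a local optimum, where the gradient of L with respect to w is 0.

dL/dw says how much L changes per unit change in w, dh(t)/dw how much h(t) changes per unit change in w, and dL/dh(t) how much L changes per unit change in h(t). For the i-th memory cell at time step t, (dL/dh(t)) * (dh(t)/dw) is the contribution to the change in L per unit change in w; summing over all i from 1 to M and all t from 1 to T gives the full dL/dw.

For the i-th memory cell, dL/dh(t) is the change in the sum of all local losses l over time steps 1 to T per unit change in h(t). Since h(t) only affects the local losses from time step t to T, the earlier terms drop out.

Let L(t) denote the loss summed from t to T: L(t) = ∑l(s) for s from t to T. Then dL/dh(t) = dL(t)/dh(t).

What remains is this quantity and the derivative of h(t) with respect to w.

L(t) = l(t) + L(t+1), so dL(t)/dh(t) = dl(t)/dh(t) + dL(t+1)/dh(t): the derivative at the current time step is obtained from the derivative at the next one. The recursion starts at time T, where dL(T)/dh(T) = dl(T)/dh(T), and works backward.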
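The chain of reasoning above can be collected into one set of equations. This is a restatement of the text in LaTeX, with the subscript i marking the i-th memory cell and M the number of cells:

```latex
\begin{aligned}
l(t) &= \lVert h(t) - y(t) \rVert^{2}, \qquad L = \sum_{t=1}^{T} l(t) \\
\frac{dL}{dw} &= \sum_{t=1}^{T} \sum_{i=1}^{M} \frac{dL}{dh_i(t)}\, \frac{dh_i(t)}{dw} \\
L(t) &= \sum_{s=t}^{T} l(s) = l(t) + L(t+1) \\
\frac{dL}{dh_i(t)} &= \frac{dL(t)}{dh_i(t)} = \frac{dl(t)}{dh_i(t)} + \frac{dL(t+1)}{dh_i(t)} \\
\frac{dL(T)}{dh_i(T)} &= \frac{dl(T)}{dh_i(T)}
\end{aligned}
```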

class LstmNetwork():
    def __init__(self, lstm_param):
        self.lstm_param = lstm_param
        self.lstm_node_list = []
        # input sequence
        self.x_list = []

    def y_list_is(self, y_list, loss_layer):
        """
        Updates diffs by setting target sequence
        with corresponding loss layer.
        Will *NOT* update parameters.  To update parameters,
        call self.lstm_param.apply_diff()
        """
        assert len(y_list) == len(self.x_list)
        idx = len(self.x_list) - 1
        # first node only gets diffs from label ...
        loss = loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
        diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
        # here s is not affecting loss due to h(t+1), hence we set equal to zero
        diff_s = np.zeros(self.lstm_param.mem_cell_ct)
        self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
        idx -= 1

        ### ... following nodes also get diffs from next nodes, hence we add diffs to diff_h
        ### we also propagate error along constant error carousel using diff_s
        while idx >= 0:
            loss += loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h += self.lstm_node_list[idx + 1].state.bottom_diff_h
            diff_s = self.lstm_node_list[idx + 1].state.bottom_diff_s
            self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
            idx -= 1

        return loss

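diff_h is the numerical value of dL(t)/dh(t), i.e. how much the loss L changes per unit change in the prediction. idx runs backward from T to 1; at each step diff_h is the sum of loss_layer.bottom_diff and the bottom_diff_h of the next time step (at the first step, t = T, no bottom_diff_h is added).

loss_layer.bottom_diff: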

def bottom_diff(self, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff

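The derivative of l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2 is l'(t) = 2 * (h(t) - y(t)).

When s(t) changes, L(t) changes through two paths: s(t) affects h(t), and s(t) affects h(t+1). h(t+1) does not affect l(t). This gives dL(t)/ds(t) = (dL(t)/dh(t)) * (dh(t)/ds(t)) + dL(t+1)/ds(t). The left-hand term is (dL(t)/dh(t)) * (dh(t)/ds(t)); the right-hand term is handed back from step t+1, so dL(t)/ds(t) is derived step by step from t+1 to t.

In the cell, self.state.h = self.state.s * self.state.o, i.e. h(t) = s(t) * o(t), so dh(t)/ds(t) = o(t), and dL(t)/dh(t) is top_diff_h.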
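The derivative 2 * (h(t) - y(t)) can be checked numerically. A minimal Python 3 sketch using plain lists instead of numpy arrays; numerical_diff is a hypothetical helper, and the sample values are arbitrary:

```python
def loss(pred, label):
    # Squared error on the first element only, as in ToyLossLayer.
    return (pred[0] - label) ** 2

def bottom_diff(pred, label):
    diff = [0.0] * len(pred)
    diff[0] = 2.0 * (pred[0] - label)
    return diff

def numerical_diff(pred, label, k, eps=1e-6):
    # Central finite difference with respect to pred[k] (hypothetical helper).
    up = list(pred)
    up[k] += eps
    down = list(pred)
    down[k] -= eps
    return (loss(up, label) - loss(down, label)) / (2.0 * eps)

pred, label = [0.3, -0.7, 0.1], -0.5
analytic = bottom_diff(pred, label)
numeric = [numerical_diff(pred, label, k) for k in range(len(pred))]
print(analytic)
print(numeric)
```

Only the first component is non-zero, because the toy loss looks at h[0] alone; the other memory cells receive gradient only through the recurrent connections.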

About top_diff_is, the original author notes: "Bottom means input to the layer, top means output of the layer. Caffe also uses this terminology." In def top_diff_is(self, top_diff_h, top_diff_s), top_diff_h is dL(t)/dh(t) at the current time step t, and top_diff_s is dL(t+1)/ds(t), the gradient with respect to the memory cell passed back from time step t+1.

ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds

The prefix d denotes the derivative of the loss L with respect to a quantity. ds computes dL(t)/ds(t) for the current time step t from top_diff_h and top_diff_s. do computes dL(t)/do(t): since h(t) = s(t) * o(t), dh(t)/do(t) = s(t), so dL(t)/do(t) = (dL(t)/dh(t)) * (dh(t)/do(t)) = top_diff_h * s(t). di computes dL(t)/di(t): since s(t) = f(t) * s(t-1) + i(t) * g(t), dL(t)/di(t) = (dL(t)/ds(t)) * (ds(t)/di(t)) = ds * g(t). dg computes dL(t)/dg(t) = (dL(t)/ds(t)) * (ds(t)/dg(t)) = ds * i(t). df computes dL(t)/df(t) = (dL(t)/ds(t)) * (ds(t)/df(t)) = ds * s(t-1).

di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg

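These lines apply the derivatives of sigmoid and tanh. In di_input, (1. - self.state.i) * self.state.i is the sigmoid derivative: how much the output of gate i changes per unit change in its input. Multiplying by di gives how much the loss L(t) changes per unit change in the input of gate i, i.e. dL(t)/d i_input(t). For g, 1 - g^2 is the derivative of tanh.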

self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input

w*_diff are the gradients for the weight matrices and b*_diff the gradients for the biases; both are used in the update.

dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)

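Accumulate the gradient with respect to the concatenated input xc. xc feeds into four places (i, f, o, and g), so the four contributions are summed to give dxc.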

self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]

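bottom_diff_s relates a change in s at time step t-1 to a change in s at time step t: the factor between them is f. dxc is the gradient for the horizontally concatenated x and h, so it is split into the two parts bottom_diff_x and bottom_diff_h.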
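The gate derivatives in top_diff_is can be verified on a single scalar cell. The sketch below is a Python 3 gradient check under stated assumptions: zg, zi, zf, zo stand for the pre-activation inputs of the four layers, and the objective top_diff_h * h + top_diff_s * s is a linear stand-in for the downstream loss. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(zg, zi, zf, zo, s_prev):
    g, i, f, o = math.tanh(zg), sigmoid(zi), sigmoid(zf), sigmoid(zo)
    s = g * i + s_prev * f
    h = s * o
    return g, i, f, o, s, h

def backward(args, top_diff_h, top_diff_s):
    # Same formulas as top_diff_is, for one scalar cell.
    g, i, f, o, s, h = forward(**args)
    ds = o * top_diff_h + top_diff_s
    do = s * top_diff_h
    di = g * ds
    dg = i * ds
    df = args["s_prev"] * ds
    return {
        "zi": (1.0 - i) * i * di,
        "zf": (1.0 - f) * f * df,
        "zo": (1.0 - o) * o * do,
        "zg": (1.0 - g ** 2) * dg,
        "s_prev": ds * f,  # corresponds to bottom_diff_s
    }

def objective(args, top_diff_h, top_diff_s):
    # Linear proxy for everything downstream of h(t) and s(t).
    *_, s, h = forward(**args)
    return top_diff_h * h + top_diff_s * s

def numerical_grad(args, name, top_diff_h, top_diff_s, eps=1e-6):
    up = dict(args)
    up[name] += eps
    down = dict(args)
    down[name] -= eps
    return (objective(up, top_diff_h, top_diff_s)
            - objective(down, top_diff_h, top_diff_s)) / (2.0 * eps)

args = {"zg": 0.3, "zi": -0.2, "zf": 0.5, "zo": 0.1, "s_prev": 0.7}
analytic = backward(args, top_diff_h=0.4, top_diff_s=-0.3)
for name in sorted(args):
    print(name, analytic[name], numerical_grad(args, name, 0.4, -0.3))
```

Each analytic value should match its finite-difference estimate closely, which confirms the chain-rule steps described above for this scalar case.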

def x_list_clear(self):
        self.x_list = []

    def x_list_add(self, x):
        self.x_list.append(x)
        if len(self.x_list) > len(self.lstm_node_list):
            # need to add new lstm node, create new state mem
            lstm_state = LstmState(self.lstm_param.mem_cell_ct, self.lstm_param.x_dim)
            self.lstm_node_list.append(LstmNode(self.lstm_param, lstm_state))

        # get index of most recent x input
        idx = len(self.x_list) - 1
        if idx == 0:
            # no recurrent inputs yet
            self.lstm_node_list[idx].bottom_data_is(x)
        else:
            s_prev = self.lstm_node_list[idx - 1].state.s
            h_prev = self.lstm_node_list[idx - 1].state.h
            self.lstm_node_list[idx].bottom_data_is(x, s_prev, h_prev)

Add a training sample: feed in the input x and run the forward step for it.

def example_0():
    # learns to repeat simple sequence from random inputs
    np.random.seed(0)

    # parameters for input data dimension and lstm cell count
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)
    y_list = [-0.5,0.2,0.1, -0.5]
    input_val_arr = [np.random.random(x_dim) for _ in y_list]

    for cur_iter in range(100):
        print "cur iter: ", cur_iter
        for ind in range(len(y_list)):
            lstm_net.x_list_add(input_val_arr[ind])
            print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        print "loss: ", loss
        lstm_param.apply_diff(lr=0.1)
        lstm_net.x_list_clear()

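Initialize LstmParam with 100 memory cells and an input dimension of 50. Initialize the LstmNetwork model, generate 4 groups of 50 random numbers each, and train with [-0.5, 0.2, 0.1, -0.5] as the y values: each step feeds in 50 random numbers and one y value, for 100 iterations.

A small test: feed the LSTM a run of consecutive primes and have it estimate the next one. Generate the primes below 100, cycle through them to take a sequence of 50 primes as x and the 51st as y, and train on 10 such samples for 10,000 iterations. The squared-error loss went from 0.17973 to 1.05172e-06, almost exactly right: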

import numpy as np
import sys

from lstm import LstmParam, LstmNetwork

class ToyLossLayer:
    """
    Computes square loss with first element of hidden layer array.
    """
    @classmethod
    def loss(self, pred, label):
        return (pred[0] - label) ** 2

    @classmethod
    def bottom_diff(self, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff

class Primes:
    def __init__(self):
        self.primes = list()
        for i in range(2, 100):
            is_prime = True
            for j in range(2, i-1):
                if i % j == 0:
                    is_prime = False
            if is_prime:
                self.primes.append(i)
        self.primes_count = len(self.primes)
    def get_sample(self, x_dim, y_dim, index):
        result = np.zeros((x_dim+y_dim))
        for i in range(index, index + x_dim + y_dim):
            result[i-index] = self.primes[i%self.primes_count]/100.0
        return result

def example_0():
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)

    primes = Primes()
    x_list = []
    y_list = []
    for i in range(0, 10):
        sample = primes.get_sample(x_dim, 1, i)
        x = sample[0:x_dim]
        y = sample[x_dim:x_dim+1].tolist()[0]
        x_list.append(x)
        y_list.append(y)

    for cur_iter in range(10000):
        if cur_iter % 1000 == 0:
            print "y_list=", y_list
        for ind in range(len(y_list)):
            lstm_net.x_list_add(x_list[ind])
            if cur_iter % 1000 == 0:
                print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        if cur_iter % 1000 == 0:
            print "loss: ", loss
        lstm_param.apply_diff(lr=0.01)
        lstm_net.x_list_clear()

if __name__ == "__main__":
    example_0()

All the primes are divided by 100, because this code requires the training data to be values smaller than 1.
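Since there are fewer than 50 primes below 100, get_sample has to wrap around the list. A short Python 3 sketch of the sampling logic with plain lists; primes_below and the free function get_sample are hypothetical rewrites of the Primes class for illustration:

```python
def primes_below(limit):
    # Trial division is enough for such a small range.
    return [n for n in range(2, limit)
            if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))]

def get_sample(primes, x_dim, y_dim, index, scale=100.0):
    # Wrap around with the modulo, and scale every value below 1.
    count = len(primes)
    return [primes[i % count] / scale
            for i in range(index, index + x_dim + y_dim)]

primes = primes_below(100)
sample = get_sample(primes, 50, 1, 0)
x, y = sample[:50], sample[50]
print(len(primes), x[:5], y)
```

With 25 primes available, a window of 50 inputs covers the list exactly twice, so the target for index 0 is the first prime again.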

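torch is a deep learning framework. For comparison: 1) tensorflow, backed by Google and currently the most popular, suits both small experiments and large-scale computation; it is based on python; its drawbacks are a relatively steep learning curve and average speed. 2) torch, backed by facebook, is used for small experiments and has many open-source applications; it is based on lua, is quick to pick up, and has fairly complete online documentation; its drawback is that lua is a relatively niche language. 3) mxnet, backed by Amazon, is mainly used for large-scale computation; it is based on python and R; its drawback is that there are few open-source projects online. 4) caffe, developed at UC Berkeley, is used for large-scale computation; it is based on c++ and python; its drawback is that development is not very convenient. 5) theano has average speed, is based on python, and is well regarded.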

There are quite a few LSTM implementations for torch on github.

Installing torch on a mac: https://github.com/torch/torch7/wiki/Cheatsheet#installing-and-running-torch

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh

If the qt installation fails, install it separately:

brew install cartr/qt4/qt

After installation, add the following line to ~/.bash_profile by hand:

. ~/torch/install/bin/torch-activate

Run source ~/.bash_profile, then run th to use torch.

To install itorch, first install its dependencies:

brew install zeromq
brew install openssl
luarocks install luacrypto OPENSSL_DIR=/usr/local/opt/openssl/

git clone https://github.com/facebook/iTorch.git
cd iTorch
luarocks make

Image recognition with a convolutional neural network. Create pattern_recognition.lua:

require 'nn'
require 'paths'
if (not paths.filep("cifar10torchsmall.zip")) then
    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
    os.execute('unzip cifar10torchsmall.zip')
end
trainset = torch.load('cifar10-train.t7')
testset = torch.load('cifar10-test.t7')
classes = {'airplane', 'automobile', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck'}
setmetatable(trainset,
{__index = function(t, i)
    return {t.data[i], t.label[i]}
end}
);
trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor.

function trainset:size()
    return self.data:size(1)
end
mean = {} -- store the mean, to normalize the test set in the future
stdv  = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation
    print('Channel ' .. i .. ', Mean: ' .. mean[i])
    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])
    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end
net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU())                       -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU())                       -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5))                    -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120))             -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU())                       -- non-linearity
net:add(nn.Linear(120, 84))
net:add(nn.ReLU())                       -- non-linearity
net:add(nn.Linear(84, 10))                   -- 10 is the number of outputs of the network (in this case, 10 digits)
net:add(nn.LogSoftMax())                     -- converts the output to a log-probability. Useful for classification problems
criterion = nn.ClassNLLCriterion()
trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 5
trainer:train(trainset)
testset.data = testset.data:double()   -- convert from Byte tensor to Double tensor
for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction
    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end
predicted = net:forward(testset.data[100])
print(classes[testset.label[100]])
print(predicted:exp())
for i=1,predicted:size(1) do
    print(classes[i], predicted[i])
end
correct = 0
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true)  -- true means sort in descending order
    if groundtruth == indices[1] then
        correct = correct + 1
    end
end

print(correct, 100*correct/10000 .. ' % ')
class_performance = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true)  -- true means sort in descending order
    if groundtruth == indices[1] then
        class_performance[groundtruth] = class_performance[groundtruth] + 1
    end
end

for i=1,#classes do
    print(classes[i], 100*class_performance[i]/1000 .. ' %')
end

Run th pattern_recognition.lua.

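The script first downloads the cifar10torchsmall.zip sample set, which has 50000 training images and 10000 test images, each labeled with one of 10 classes such as airplane and automobile. An __index function and a size method are attached to trainset so that it can be used by nn.StochasticGradient; for how such functions are attached, see the lua tutorial: http://tylerneylon.com/a/learn-lua/ . The trainset data is normalized: it is converted to a double tensor and scaled per channel to mean 0 and standard deviation 1. The convolutional network is then built with two convolution layers, two pooling layers, three fully connected layers, and a log-softmax layer. It is trained with a learning rate of 0.001 for 5 iterations. Once trained, the model predicts image number 100 of the test set, and the script prints the overall accuracy and the accuracy for each class. Source: https://github.com/soumith/cvpr2015/blob/master/Deep%20Learning%20with%20Torch.ipynb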
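The nn.View(16*5*5) line in the script depends on the feature map being 5×5 after the second pooling layer. A short Python 3 sketch of that size arithmetic, assuming 32×32 input images (the CIFAR-10 image size) and no padding; conv_out and pool_out are hypothetical helper names:

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output width of a convolution along one spatial dimension.
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window, stride):
    # Output width of a pooling layer along one spatial dimension.
    return (size - window) // stride + 1

size = 32                     # assumed input width and height
size = conv_out(size, 5)      # SpatialConvolution(3, 6, 5, 5)
size = pool_out(size, 2, 2)   # SpatialMaxPooling(2, 2, 2, 2)
size = conv_out(size, 5)      # SpatialConvolution(6, 16, 5, 5)
size = pool_out(size, 2, 2)   # SpatialMaxPooling(2, 2, 2, 2)
flat = 16 * size * size       # input size of the first Linear layer
print(size, flat)
```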

torch supports gpu computation easily, though the code has to be modified for it.

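The popular seq2seq models are mostly built as encoder-decoder models out of LSTMs, and most open-source implementations use one-hot embeddings, which carry less information than word vectors. What follows is a bot built from word2vec word vectors and a model with only a single LSTM layer.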

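Download the text of the novel Zhen Huan Zhuan (甄嬛傳). Search online for "甄嬛傳 txt", download the file, convert it to utf-8 encoding, and replace all windows line endings with \n to make later processing easier.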
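The conversion step can be scripted. The following Python 3 sketch assumes the downloaded file is in a GBK-family encoding, which is common for Chinese txt downloads but is not stated in these notes; convert_to_utf8 and its src_encoding default are hypothetical:

```python
import io

def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    # src_encoding is an assumption; change it if the file decodes incorrectly.
    # newline="" keeps the raw line endings so they can be replaced explicitly.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Windows (\r\n) and bare \r line endings both become \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    with io.open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write(text)
```

A call such as convert_to_utf8("download.txt", "zhenhuanzhuan.txt") would produce the input file that the segmentation step expects; the first file name is a placeholder.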

Segment the novel into words. The segmentation tool word_segment.py is on github: https://github.com/warmheartli/ChatBotCourse/blob/master/word_segment.py

python ./word_segment.py zhenhuanzhuan.txt zhenhuanzhuan.segment

Generate the word vectors with word2vec. The word2vec source is at https://github.com/warmheartli/ChatBotCourse/tree/master/word2vec ; build it with make and it is ready to run.

./word2vec -train ./zhenhuanzhuan.segment -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

This produces vectors.bin, a word vector file trained on the text of the novel.
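For readers on Python 3, here is a minimal sketch of reading such a file. It assumes the usual word2vec binary layout (a text header with the vocabulary size and dimension, then for each word the word itself, a space, the raw float32 values, and a newline) and little-endian floats; the function returns a new dict rather than filling a global:

```python
import struct

def load_vectors(path):
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocabulary size> <vector dimension>\n"
        vocab_size, dim = (int(v) for v in f.readline().split())
        for _ in range(vocab_size):
            chars = bytearray()
            while True:
                c = f.read(1)
                if not c or c == b" ":
                    break
                if c != b"\n":  # skip the newline that ends the previous record
                    chars.extend(c)
            raw = f.read(4 * dim)
            if len(raw) < 4 * dim:
                break  # truncated file
            # "<" assumes little-endian float32 values.
            vectors[chars.decode("utf-8")] = list(struct.unpack("<%df" % dim, raw))
    return vectors
```

Reading exactly 4 * dim bytes per vector matters, because the raw float bytes may themselves contain space or newline byte values.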

The training code:

# -*- coding: utf-8 -*-

import sys
import math
import tflearn
import chardet
import numpy as np
import struct

seq = []

max_w = 50
float_size = 4
word_vector_dict = {}

def load_vectors(input):
    """Load word vectors from vectors.bin into word_vector_dict: keys are words, values are 200-dimensional vectors.
    """
    print "begin load vectors"

    input_file = open(input, "rb")

    # read the vocabulary size and the vector dimension
    words_and_size = input_file.readline()
    words_and_size = words_and_size.strip()
    words = long(words_and_size.split(' ')[0])
    size = long(words_and_size.split(' ')[1])
    print "words =", words
    print "size =", size

    for b in range(0, words):
        a = 0
        word = ''
        # read one word, up to the separating space
        while True:
            c = input_file.read(1)
            word = word + c
            # read() returns '' at end of file
            if not c or c == ' ':
                break
            if a < max_w and c != '\n':
                a = a + 1
        word = word.strip()

        vector = []
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack('f', m)
            vector.append(weight)

        # store the word and its vector in the dict
        word_vector_dict[word.decode('utf-8')] = vector

    input_file.close()
    print "load vectors finish"

def init_seq():
    """Read the segmented text file and load the full word sequence.
    """
    file_object = open('zhenhuanzhuan.segment', 'r')
    vocab_dict = {}
    while True:
        line = file_object.readline()
        if line:
            for word in line.decode('utf-8').split(' '):
                if word_vector_dict.has_key(word):
                    seq.append(word_vector_dict[word])
        else:
            break
    file_object.close()

def vector_sqrtlen(vector):
    len = 0
    for item in vector:
        len += item * item
    len = math.sqrt(len)
    return len

def vector_cosine(v1, v2):
    if len(v1) != len(v2):
        sys.exit(1)
    sqrtlen1 = vector_sqrtlen(v1)
    sqrtlen2 = vector_sqrtlen(v2)
    value = 0
    for item1, item2 in zip(v1, v2):
        value += item1 * item2
    return value / (sqrtlen1*sqrtlen2)

def vector2word(vector):
    max_cos = -10000
    match_word = ''
    for word in word_vector_dict:
        v = word_vector_dict[word]
        cosine = vector_cosine(vector, v)
        if cosine > max_cos:
            max_cos = cosine
            match_word = word
    return (match_word, max_cos)

def main():
    load_vectors("./vectors.bin")
    init_seq()
    xlist = []
    ylist = []
    test_X = None
    #for i in range(len(seq)-100):
    for i in range(10):
        sequence = seq[i:i+20]
        xlist.append(sequence)
        ylist.append(seq[i+20])
        if test_X is None:
            test_X = np.array(sequence)
            (match_word, max_cos) = vector2word(seq[i+20])
            print "right answer=", match_word, max_cos

    X = np.array(xlist)
    Y = np.array(ylist)
    net = tflearn.input_data([None, 20, 200])
    net = tflearn.lstm(net, 200)
    net = tflearn.fully_connected(net, 200, activation='linear')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1,
                                     loss='mean_square')
    model = tflearn.DNN(net)
    model.fit(X, Y, n_epoch=500, batch_size=10,snapshot_epoch=False,show_metric=True)
    model.save("model")
    predict = model.predict([test_X])
    #print predict
    #for v in test_X:
    #    print vector2word(v)
    (match_word, max_cos) = vector2word(predict[0])
    print "predict=", match_word, max_cos

main()

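load_vectors loads the word vectors from vectors.bin, init_seq loads the segmented text of the novel into one sequence, and vector2word finds the word closest to a given vector. The model has only one LSTM layer.

After 500 epochs of training, the mean squared loss dropped to 0.33673, and the next word was predicted with a cosine similarity of 0.941794432002.

With a powerful gpu and tuned parameters, the model could be trained on the whole text; changing the predict part of the code to keep emitting the next word would then make it produce Zhen Huan-style prose on its own. This version is built on tflearn. The seq2seq example in the official tflearn examples calls tensorflow/python/ops/seq2seq.py in tensorflow directly and uses one-hot embeddings, which should not work as well as word vectors.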
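The idea of repeatedly emitting the next word can be sketched independently of tflearn. In the Python 3 sketch below, predict_next is a hypothetical callable standing in for the trained model (it takes a window of vectors and returns one vector), and feeding the matched word's vector back into the window is a design choice made here, not something taken from the original code:

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

def vector2word(vector, word_vectors):
    # Linear scan for the word whose vector has the highest cosine similarity.
    best_word, best_cos = None, -2.0
    for word, v in word_vectors.items():
        c = cosine(vector, v)
        if c > best_cos:
            best_word, best_cos = word, c
    return best_word, best_cos

def generate(seed_window, predict_next, word_vectors, n_words):
    # predict_next is a hypothetical stand-in for the trained model.
    window = list(seed_window)
    words = []
    for _ in range(n_words):
        word, _cos = vector2word(predict_next(window), word_vectors)
        words.append(word)
        # Slide the window: drop the oldest vector, append the matched word's vector.
        window = window[1:] + [word_vectors[word]]
    return words

# Toy usage with a two-word vocabulary and a predictor that swaps coordinates.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
swap = lambda window: [window[-1][1], window[-1][0]]
print(generate([[1.0, 0.0]], swap, toy_vectors, 3))
```

With the real model, predict_next would wrap something like a call to the model's predict method on the current window of 20 vectors.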

