Study Notes CB014: TensorFlow seq2seq Model, Step by Step

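Neural networks. "Make Your Own Neural Network" explains how artificial neural networks work in plain, accessible language and implements them in code; the experiments work very well.

Recurrent neural networks and LSTM. See Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/ .

The seq2seq model is a sequence-to-sequence model built on recurrent neural networks. It applies to any sequence-to-sequence task, such as language translation and automatic question answering. For how a chatbot is built with seq2seq, see http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/ .

The attention model addresses a weakness of plain seq2seq: the decoder only receives the encoder's final output, which is far removed from the earlier outputs, so information is lost. An answer usually depends on information at a few key positions in the question, and those are the places attention concentrates on. See http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/ .

Building a chatbot with TensorFlow seq2seq. The key interface TensorFlow provides is: https://www.tensorflow.org/api_docs/python/tf/contrib/legacy_seq2seq/embedding_attention_seq2seq .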

embedding_attention_seq2seq(
    encoder_inputs,
    decoder_inputs,
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    num_heads=1,
    output_projection=None,
    feed_previous=False,
    dtype=None,
    scope=None,
    initial_state_attention=False
)

The encoder_inputs parameter is a list. Each item is a 1D Tensor of shape [batch_size] whose elements are integers, for example:

[array([0, 0, 0, 0], dtype=int32), 
array([0, 0, 0, 0], dtype=int32), 
array([8, 3, 5, 3], dtype=int32), 
array([7, 8, 2, 1], dtype=int32), 
array([6, 2, 10, 9], dtype=int32)]

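The 5 arrays mean the sentence is 5 words long. Each array holds 4 numbers, so the batch size is 4, that is, 4 samples. The first sample is [[0],[0],[8],[7],[6]] and the second is [[0],[0],[3],[8],[2]]. Each number is a word id; ids are usually assigned from corpus statistics, one id per word.

The decoder_inputs parameter has the same structure as encoder_inputs.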
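To make this time-major layout concrete, here is a small sketch in plain Python lists (no TensorFlow; the helper names to_samples and to_time_major are made up for illustration) that transposes the example above between the per-time-step view and the per-sample view:

```python
# Time-major layout: one entry per time step, and each entry holds
# that step's word id for every sample in the batch.
encoder_inputs = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [8, 3, 5, 3],
    [7, 8, 2, 1],
    [6, 2, 10, 9],
]

def to_samples(time_major):
    """Transpose [time][batch] into [batch][time]."""
    return [list(sample) for sample in zip(*time_major)]

def to_time_major(samples):
    """Transpose [batch][time] back into [time][batch]."""
    return [list(step) for step in zip(*samples)]

samples = to_samples(encoder_inputs)
print(samples[0])  # ids of the first sample, in time order
print(samples[1])  # ids of the second sample, in time order
```

Reading down a column of the original list gives one sample, which is why the first sample is 0, 0, 8, 7, 6.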

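The cell parameter is a recurrent unit of type tf.nn.rnn_cell.RNNCell; tf.contrib.rnn.BasicLSTMCell or tf.contrib.rnn.GRUCell can be used.

num_encoder_symbols is an integer: the number of distinct word ids that can appear in encoder_inputs.

num_decoder_symbols is the number of distinct word ids that can appear in decoder_inputs.

embedding_size is the dimensionality of the internal word embedding vectors; these notes set it equal to the RNNCell size.

num_heads is the number of attention heads that read from attention_states.

output_projection is a (W, B) tuple. W is a weight matrix of shape [output_size x num_decoder_symbols] and B is a bias vector of shape [num_decoder_symbols]. Each RNNCell output is mapped through WX+B to a num_decoder_symbols-dimensional vector whose values score each possible decoder symbol; a softmax turns them into probabilities.

feed_previous selects where the decoder's inputs come from: directly from the training data in decoder_inputs, or from the previous RNNCell output. If feed_previous is True, each step uses the previous step's output, mapped through WX+B.

dtype is the data type of the RNN state; the default is tf.float32.

scope is the name of the subgraph; the default is "embedding_attention_seq2seq".

initial_state_attention says whether to initialize the attentions from the initial state; the default is False, which initializes them all to 0.

The return value is an (outputs, state) tuple. outputs is a list as long as the sentence (the number of words, the same length as decoder_inputs). Each item is a 2D tf.float32 Tensor whose first dimension is the number of samples, so 4 samples give 4 rows. When output_projection is None, as in the calls in these notes, each row has num_decoder_symbols entries; when a projection is supplied, each row has the cell's output size. For a sentence length of 5, 4 samples and 8-dimensional rows, outputs holds 5*4*8 floats.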
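The WX+B mapping described for output_projection can be traced by hand. The following is a minimal sketch in plain Python, not the TensorFlow implementation; the function names project and softmax and all the numbers are made up for illustration:

```python
import math

def project(output, W, B):
    """Map one cell output (length output_size) to len(B) symbol scores."""
    return [sum(output[i] * W[i][k] for i in range(len(output))) + B[k]
            for k in range(len(B))]

def softmax(logits):
    """Turn scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: output_size = 2, num_decoder_symbols = 3
output = [1.0, -1.0]
W = [[0.5, 0.0, -0.5],
     [0.0, 1.0, 0.5]]
B = [0.0, 0.1, 0.2]

logits = project(output, W, B)
probs = softmax(logits)
predicted_id = probs.index(max(probs))
print(predicted_id)
```

With feed_previous=True, the predicted id at one step is what the decoder looks up and feeds in at the next step.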

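The returned state is a tuple of num_layers LSTMStateTuples. num_layers is a parameter used when the cell is constructed and gives the number of stacked layers. Consider an encoder-decoder multi-layer recurrent network built from 3 layers of LSTM units: encoder_inputs feeds the encoder's first LSTM layer, whose output is passed to the second layer, whose output is passed to the third; the state produced by the encoder's first layer is passed to the decoder's first LSTM layer, and so on for the other layers.

An LSTMStateTuple is a tuple of two Tensors. The first, named c, consists of 4 vectors of 8 dimensions (4 is the batch size, 8 is state_size, the word vector dimension); the second, named h, likewise consists of 4 vectors of 8 dimensions.

c is the cell state carried to the next time step, and h is the hidden output.

The TensorFlow implementation: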

concat = _linear([inputs, h], 4 * self._num_units, True)
i, j, f, o = array_ops.split(value=concat, num_or_size_splits=4, axis=1)
new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
new_h = self._activation(new_c) * sigmoid(o)

When training directly with embedding_attention_seq2seq, the returned state is usually not needed.
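The c and h update quoted above can be followed for a single unit with scalar arithmetic. This is a sketch under two assumptions that match the BasicLSTMCell defaults: the activation is tanh and forget_bias is 1.0. The function name lstm_step is made up for illustration, and the four pre-activations i, j, f, o are passed in directly instead of being computed by the linear layer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(i, j, f, o, c, forget_bias=1.0):
    """One LSTM update for a single unit, given the four pre-activations."""
    # Keep part of the old cell state, add the gated new candidate
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * math.tanh(j)
    # The hidden output is the squashed cell state, scaled by the output gate
    new_h = math.tanh(new_c) * sigmoid(o)
    return new_c, new_h

# With all pre-activations at 0, the candidate tanh(j) is 0, so the new
# cell state is just the old one scaled by the forget gate sigmoid(1.0).
new_c, new_h = lstm_step(i=0.0, j=0.0, f=0.0, o=0.0, c=1.0)
print(new_c, new_h)
```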

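Next, construct the input parameters and train a seq2seq model. Using the odd-number sequence 1, 3, 5, 7, 9, ... as an example, build samples; for instance, the two samples [[1,3,5],[7,9,11]] and [[3,5,7],[9,11,13]]: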

train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]

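To accommodate sequences of different lengths, the training sequence length must be longer than the sample sequences; here it is set to 5: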

input_seq_len = 5
output_seq_len = 5

When a sample is shorter than the training sequence length, pad it with 0:

PAD_ID = 0

The encoder_input for the first sample:

encoder_input_0 = [PAD_ID] * (input_seq_len - len(train_set[0][0])) + train_set[0][0]

The encoder_input for the second sample:

encoder_input_1 = [PAD_ID] * (input_seq_len - len(train_set[1][0])) + train_set[1][0]

decoder_input starts with GO_ID, followed by the sample sequence, and is padded at the end with PAD_ID.

GO_ID = 1
decoder_input_0 = ([GO_ID] + train_set[0][1]
    + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1))
decoder_input_1 = ([GO_ID] + train_set[1][1]
    + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1))

Convert the inputs into the encoder_inputs and decoder_inputs format that embedding_attention_seq2seq expects:

encoder_inputs = []
decoder_inputs = []
for length_idx in xrange(input_seq_len):
    encoder_inputs.append(np.array([encoder_input_0[length_idx], 
                          encoder_input_1[length_idx]], dtype=np.int32))
for length_idx in xrange(output_seq_len):
    decoder_inputs.append(np.array([decoder_input_0[length_idx], 
                          decoder_input_1[length_idx]], dtype=np.int32))

Wrapped in a standalone function:

# coding:utf-8
import numpy as np

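# Input sequence length
input_seq_len = 5
# Output sequence length
output_seq_len = 5
# Padding value for empty positions
PAD_ID = 0
# Start marker for the output sequence
GO_ID = 1

def get_samples():
    """Build the sample data.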

    :return:
        encoder_inputs: [array([0, 0], dtype=int32), 
                         array([0, 0], dtype=int32), 
                         array([1, 3], dtype=int32),
                         array([3, 5], dtype=int32), 
                         array([5, 7], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32), 
                         array([7, 9], dtype=int32), 
                         array([ 9, 11], dtype=int32),
                         array([11, 13], dtype=int32), 
                         array([0, 0], dtype=int32)]
    """
    train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]
    encoder_input_0 = ([PAD_ID] * (input_seq_len - len(train_set[0][0]))
                       + train_set[0][0])
    encoder_input_1 = ([PAD_ID] * (input_seq_len - len(train_set[1][0]))
                       + train_set[1][0])
    decoder_input_0 = ([GO_ID] + train_set[0][1]
                       + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1))
    decoder_input_1 = ([GO_ID] + train_set[1][1]
                       + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input_0[length_idx], 
                              encoder_input_1[length_idx]], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input_0[length_idx], 
                              decoder_input_1[length_idx]], dtype=np.int32))
    return encoder_inputs, decoder_inputs

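Building the model. TensorFlow first constructs a graph and then feeds data through it to compute, so building the model really means building a graph. First create placeholders for encoder_inputs and decoder_inputs: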

import tensorflow as tf
encoder_inputs = []
decoder_inputs = []
for i in xrange(input_seq_len):
    encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="encoder{0}".format(i)))
for i in xrange(output_seq_len):
    decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="decoder{0}".format(i)))

Create an LSTM cell with size=8 memory units:

size = 8
cell = tf.contrib.rnn.BasicLSTMCell(size)

In the odd-number training sequences, every input value is below 10 and every output value is below 16, so set:

num_encoder_symbols = 10
num_decoder_symbols = 16

Pass the parameters to embedding_attention_seq2seq to obtain the outputs:

from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq
outputs, _ = seq2seq.embedding_attention_seq2seq(
                    encoder_inputs,
                    decoder_inputs[:output_seq_len],
                    cell,
                    num_encoder_symbols=num_encoder_symbols,
                    num_decoder_symbols=num_decoder_symbols,
                    embedding_size=size,
                    output_projection=None,
                    feed_previous=False,
                    dtype=tf.float32)

Put the model construction in its own function:

def get_model():
    """Build the model.
    """
    encoder_inputs = []
    decoder_inputs = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="encoder{0}".format(i)))
    for i in xrange(output_seq_len):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="decoder{0}".format(i)))

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # The returned state is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
                        encoder_inputs,
                        decoder_inputs,
                        cell,
                        num_encoder_symbols=num_encoder_symbols,
                        num_decoder_symbols=num_decoder_symbols,
                        embedding_size=size,
                        output_projection=None,
                        feed_previous=False,
                        dtype=tf.float32)
    return encoder_inputs, decoder_inputs, outputs

Create a runtime session and feed in the sample data:

with tf.Session() as sess:
    sample_encoder_inputs, sample_decoder_inputs = get_samples()
    encoder_inputs, decoder_inputs, outputs = get_model()
    input_feed = {}
    for l in xrange(input_seq_len):
        input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
    for l in xrange(output_seq_len):
        input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]

    sess.run(tf.global_variables_initializer())
    outputs = sess.run(outputs, input_feed)
    print outputs

The outputs result is a list of 5 arrays (5 is the sequence length). Each array consists of two lists of size 16 (2 for the 2 samples, 16 for the 16 output symbols). outputs corresponds to the seq2seq output W, X, Y, Z, EOS, that is, decoder_inputs[1:], which is [7,9,11] and [9,11,13] in the samples.

The structure of decoder_inputs:

[array([1, 1], dtype=int32), array([7, 9], dtype=int32), array([ 9, 11], dtype=int32), array([11, 13], dtype=int32), array([0, 0], dtype=int32)]

The loss function is documented at: https://www.tensorflow.org/api_docs/python/tf/contrib/legacy_seq2seq/sequence_loss

sequence_loss(
    logits,
    targets,
    weights,
    average_across_timesteps=True,
    average_across_batch=True,
    softmax_loss_function=None,
    name=None
)

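The loss minimizes the average negative log probability of the target words. logits is a list of 2D Tensors of shape [batch * num_decoder_symbols]; here batch is 2 and num_decoder_symbols is 16, and the list contains output_seq_len Tensors. targets is a list of the same length as logits (output_seq_len); each item is a 1D Tensor of integers with shape [batch] and dtype tf.int32, with the same structure as decoder_inputs[1:], that is, W, X, Y, Z, EOS. weights has the same structure as targets, with dtype tf.float32.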
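To see what "average negative log probability" means with weights, here is a plain-Python sketch of a weighted sequence loss. It is an approximation written for these notes, not the TensorFlow source: it assumes each sample's weighted cross-entropy is summed over time steps, divided by that sample's total weight, and then averaged over the batch. The names log_softmax and sequence_loss and the toy data are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]

def sequence_loss(logits, targets, weights):
    """logits: [time][batch][num_symbols]; targets, weights: [time][batch]."""
    batch = len(targets[0])
    total = 0.0
    for b in range(batch):
        weighted_sum = 0.0
        weight_sum = 0.0
        for t in range(len(logits)):
            log_probs = log_softmax(logits[t][b])
            weighted_sum += -log_probs[targets[t][b]] * weights[t][b]
            weight_sum += weights[t][b]
        # Small epsilon avoids division by zero for an all-padding sample
        total += weighted_sum / (weight_sum + 1e-12)
    return total / batch

# 3 time steps, 2 samples, 4 symbols, all scores equal:
# every symbol has probability 1/4 at every position.
logits = [[[0.0] * 4 for _ in range(2)] for _ in range(3)]
targets = [[1, 2], [3, 0], [0, 0]]
weights = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
loss = sequence_loss(logits, targets, weights)
print(loss)
```

Positions with weight 0 contribute nothing, so padding does not affect the result; with uniform scores the loss is log(4) regardless of the targets.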

To compute the weighted cross-entropy loss, weights needs placeholders too:

target_weights = []
for i in xrange(output_seq_len):
    target_weights.append(tf.placeholder(tf.float32, shape=[None],
                          name="weight{0}".format(i)))

Compute the loss:

targets = [decoder_inputs[i + 1] for i in xrange(len(decoder_inputs) - 1)]
loss = seq2seq.sequence_loss(outputs, targets, target_weights)

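targets is one element shorter than decoder_inputs. To keep the lengths equal, decoder_inputs is created with one extra element. The weighted cross-entropy loss gives meaningful positions a large weight and meaningless ones a small weight: positions where targets has a value get 1, and positions without a value get 0: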

# coding:utf-8
import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq

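# Input sequence length
input_seq_len = 5
# Output sequence length
output_seq_len = 5
# Padding value for empty positions
PAD_ID = 0
# Start marker for the output sequence
GO_ID = 1
# Size of the LSTM cell
size = 8
# Maximum number of input symbols
num_encoder_symbols = 10
# Maximum number of output symbols
num_decoder_symbols = 16

def get_samples():
    """Build the sample data.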

    :return:
        encoder_inputs: [array([0, 0], dtype=int32), 
                         array([0, 0], dtype=int32), 
                         array([1, 3], dtype=int32),
                         array([3, 5], dtype=int32), 
                         array([5, 7], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32), 
                         array([7, 9], dtype=int32), 
                         array([ 9, 11], dtype=int32),
                         array([11, 13], dtype=int32), 
                         array([0, 0], dtype=int32)]
    """
    train_set = [[[1, 3, 5], [7, 9, 11]], [[3, 5, 7], [9, 11, 13]]]
    encoder_input_0 = ([PAD_ID] * (input_seq_len - len(train_set[0][0]))
                       + train_set[0][0])
    encoder_input_1 = ([PAD_ID] * (input_seq_len - len(train_set[1][0]))
                       + train_set[1][0])
    decoder_input_0 = ([GO_ID] + train_set[0][1]
                       + [PAD_ID] * (output_seq_len - len(train_set[0][1]) - 1))
    decoder_input_1 = ([GO_ID] + train_set[1][1]
                       + [PAD_ID] * (output_seq_len - len(train_set[1][1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input_0[length_idx], 
                         encoder_input_1[length_idx]], dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input_0[length_idx], 
                         decoder_input_1[length_idx]], dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 
                         or decoder_input_0[length_idx] == PAD_ID else 1.0,
            0.0 if length_idx == output_seq_len - 1 
                         or decoder_input_1[length_idx] == PAD_ID else 1.0,
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights

def get_model():
    """Build the model.
    """
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="encoder{0}".format(i)))
    for i in xrange(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="decoder{0}".format(i)))
    for i in xrange(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None],
                          name="weight{0}".format(i)))

    # Shift decoder_inputs left by one time step to get targets
    targets = [decoder_inputs[i + 1] for i in xrange(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # The returned state is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
                        encoder_inputs,
                        decoder_inputs[:output_seq_len],
                        cell,
                        num_encoder_symbols=num_encoder_symbols,
                        num_decoder_symbols=num_decoder_symbols,
                        embedding_size=size,
                        output_projection=None,
                        feed_previous=False,
                        dtype=tf.float32)

    # Compute the weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    return encoder_inputs, decoder_inputs, target_weights, outputs, loss

def main():
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
            get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss = get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        sess.run(tf.global_variables_initializer())
        loss = sess.run(loss, input_feed)
        print loss

if __name__ == "__main__":
    main()
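
The relationship between decoder_inputs, targets and weights can be checked on a single sequence. The sketch below follows the rule stated in the text (weight 1 where the target has a value, 0 where it does not) and therefore looks at the shifted target; note that the listing above tests the decoder input at the same position instead. The helper name targets_and_weights is made up for illustration.

```python
PAD_ID = 0

def targets_and_weights(decoder_input):
    """Shift the decoder input left by one step and weight real targets only."""
    # The target at step t is the decoder input at step t + 1;
    # the final step has no successor, so it is padded.
    targets = decoder_input[1:] + [PAD_ID]
    weights = [0.0 if t == PAD_ID else 1.0 for t in targets]
    return targets, weights

# GO_ID = 1, then the answer 7, 9, 11, then one PAD
targets, weights = targets_and_weights([1, 7, 9, 11, 0])
print(targets)
print(weights)
```

Under this rule only the three real answer tokens contribute to the loss.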

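Training the model. Training runs many rounds of computation to make the loss very small, using gradient descent to update the parameters. TensorFlow provides a gradient descent class: https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer .

The constructor of class GradientDescentOptimizer: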

__init__(
    learning_rate,
    use_locking=False,
    name='GradientDescent'
)

The key is the first parameter, the learning rate. The method that computes gradients:

compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)

The key parameter loss is the loss value to minimize; the return value is a list of (gradient, variable) pairs. The method that applies the update:

apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)

grads_and_vars is the return value of compute_gradients. To compute gradients from loss and update the parameters:

learning_rate = 0.1
opt = tf.train.GradientDescentOptimizer(learning_rate)
update = opt.apply_gradients(opt.compute_gradients(loss))
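
The two-step pattern (compute the gradient, then apply it scaled by the learning rate) can be illustrated without TensorFlow. This toy sketch minimizes a hypothetical one-parameter loss (w - 3)^2; the function names mirror the optimizer methods for readability but are plain Python written for these notes.

```python
def compute_gradients(w):
    """Gradient of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)

def apply_gradients(w, grad, learning_rate):
    """One gradient descent update: step against the gradient."""
    return w - learning_rate * grad

learning_rate = 0.1
w = 0.0
for _ in range(100):
    w = apply_gradients(w, compute_gradients(w), learning_rate)
print(w)  # approaches the minimum at w = 3
```

Each update shrinks the distance to the minimum by a constant factor here, which is the behavior the training loop relies on when it repeats the update op.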

Add an iteration loop to the main function:

def main():
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
            get_samples()
        encoder_inputs, decoder_inputs, target_weights, outputs, loss, update = \
            get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        sess.run(tf.global_variables_initializer())
        while True:
            [loss_ret, _] = sess.run([loss, update], input_feed)
            print loss_ret

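Implementing prediction: feed only the sample's encoder_input and predict the decoder output automatically. Save the trained model so that it can be loaded when prediction starts in a fresh process: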

def get_model():
    ...
    saver = tf.train.Saver(tf.global_variables())
    return ..., saver

After training finishes, run:

saver.save(sess, './model/demo')

The model is stored in files whose names start with demo under the ./model directory. To load it, first call:

saver.restore(sess, './model/demo')

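At prediction time there should, in principle, be no decoder_inputs fed in; during execution each decoder input is taken from the previous time step's output. That is what the feed_previous parameter of embedding_attention_seq2seq controls: if it is True, the decoder's input at every step is filled with the output of the previous step.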
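The idea behind feed_previous=True can be sketched as a greedy loop: start from GO_ID, pick the highest-scoring symbol, and feed it back in as the next input. The code below is a plain-Python illustration, not the TensorFlow mechanism; toy_step is a hypothetical stand-in for a trained decoder step and simply favours the previous id plus 2.

```python
GO_ID = 1

def greedy_decode(step_fn, steps):
    """Feed each step's best symbol back in as the next step's input."""
    outputs = []
    prev = GO_ID
    for _ in range(steps):
        scores = step_fn(prev)            # one score per symbol
        prev = scores.index(max(scores))  # greedy choice
        outputs.append(prev)
    return outputs

def toy_step(prev_id, num_symbols=16):
    """Hypothetical decoder step: scores prev_id + 2 highest."""
    scores = [0.0] * num_symbols
    scores[(prev_id + 2) % num_symbols] = 1.0
    return scores

decoded = greedy_decode(toy_step, 4)
print(decoded)
```

No ground-truth decoder sequence is needed: only the start marker and the model's own predictions drive the loop.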

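So get_model takes a parameter that distinguishes training from prediction through different feed_previous settings. The main function also differs at prediction time, so it is split into two functions, train and predict.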

# coding:utf-8
import sys
import numpy as np
import tensorflow as tf
from tensorflow.contrib.legacy_seq2seq.python.ops import seq2seq

# Input sequence length
input_seq_len = 5
# Output sequence length
output_seq_len = 5
# Padding value for empty positions
PAD_ID = 0
# Start marker for the output sequence
GO_ID = 1
# End-of-sequence marker
EOS_ID = 2
# Size of the LSTM cell
size = 8
# Maximum number of input symbols
num_encoder_symbols = 10
# Maximum number of output symbols
num_decoder_symbols = 16
# Learning rate
learning_rate = 0.1

def get_samples():
    """Build the sample data.

    :return:
        encoder_inputs: [array([0, 0], dtype=int32),
                         array([0, 0], dtype=int32),
                         array([5, 7], dtype=int32),
                         array([7, 9], dtype=int32),
                         array([ 9, 11], dtype=int32)]
        decoder_inputs: [array([1, 1], dtype=int32),
                         array([11, 13], dtype=int32),
                         array([13, 15], dtype=int32),
                         array([15, 17], dtype=int32),
                         array([2, 2], dtype=int32)]
    """
    train_set = [[[5, 7, 9], [11, 13, 15, EOS_ID]], [[7, 9, 11], [13, 15, 17, EOS_ID]]]
    raw_encoder_input = []
    raw_decoder_input = []
    for sample in train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] 
                         + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

    encoder_inputs = []
    decoder_inputs = []
    target_weights = []

    for length_idx in xrange(input_seq_len):
        encoder_inputs.append(np.array([encoder_input[length_idx] 
                          for encoder_input in raw_encoder_input], 
                                                  dtype=np.int32))
    for length_idx in xrange(output_seq_len):
        decoder_inputs.append(np.array([decoder_input[length_idx] 
                          for decoder_input in raw_decoder_input], 
                                                  dtype=np.int32))
        target_weights.append(np.array([
            0.0 if length_idx == output_seq_len - 1 
                         or decoder_input[length_idx] == PAD_ID else 1.0 
                         for decoder_input in raw_decoder_input
        ], dtype=np.float32))
    return encoder_inputs, decoder_inputs, target_weights

def get_model(feed_previous=False):
    """Build the model.
    """
    encoder_inputs = []
    decoder_inputs = []
    target_weights = []
    for i in xrange(input_seq_len):
        encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="encoder{0}".format(i)))
    for i in xrange(output_seq_len + 1):
        decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], 
                          name="decoder{0}".format(i)))
    for i in xrange(output_seq_len):
        target_weights.append(tf.placeholder(tf.float32, shape=[None], 
                         name="weight{0}".format(i)))

    # Shift decoder_inputs left by one time step to get targets
    targets = [decoder_inputs[i + 1] for i in xrange(output_seq_len)]

    cell = tf.contrib.rnn.BasicLSTMCell(size)

    # The returned state is not needed here
    outputs, _ = seq2seq.embedding_attention_seq2seq(
                        encoder_inputs,
                        decoder_inputs[:output_seq_len],
                        cell,
                        num_encoder_symbols=num_encoder_symbols,
                        num_decoder_symbols=num_decoder_symbols,
                        embedding_size=size,
                        output_projection=None,
                        feed_previous=feed_previous,
                        dtype=tf.float32)

    # Compute the weighted cross-entropy loss
    loss = seq2seq.sequence_loss(outputs, targets, target_weights)
    # Gradient descent optimizer
    opt = tf.train.GradientDescentOptimizer(learning_rate)
    # Optimization target: minimize loss
    update = opt.apply_gradients(opt.compute_gradients(loss))
    # Model persistence
    saver = tf.train.Saver(tf.global_variables())
    return (encoder_inputs, decoder_inputs, target_weights,
            outputs, loss, update, saver, targets)

def train():
    """
    Training procedure.
    """
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
            get_samples()
        (encoder_inputs, decoder_inputs, target_weights,
         outputs, loss, update, saver, targets) = get_model()

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

        # Initialize all variables
        sess.run(tf.global_variables_initializer())

        # Train for 200 iterations, printing the loss every 10
        for step in xrange(200):
            [loss_ret, _] = sess.run([loss, update], input_feed)
            if step % 10 == 0:
                print 'step=', step, 'loss=', loss_ret

        # Persist the model
        saver.save(sess, './model/demo')

def predict():
    """
    Prediction procedure.
    """
    with tf.Session() as sess:
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
            get_samples()
        (encoder_inputs, decoder_inputs, target_weights,
         outputs, loss, update, saver, targets) = get_model(feed_previous=True)
        # Restore the model from file
        saver.restore(sess, './model/demo')

        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([2], dtype=np.int32)

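        # Predict the outputs
        outputs = sess.run(outputs, input_feed)
        # There are 2 test samples, so iterate over each
        for sample_index in xrange(2):
            # Each output vector has num_decoder_symbols entries, so the
            # index of the largest value is the predicted id; argmax finds it
            outputs_seq = [int(np.argmax(logit[sample_index], axis=0)) for logit in outputs]
            # Stop at the end marker: drop it and everything after it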
            if EOS_ID in outputs_seq:
                outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
            outputs_seq = [str(v) for v in outputs_seq]
            print " ".join(outputs_seq)

if __name__ == "__main__":
    if sys.argv[1] == 'train':
        train()
    else:
        predict()

Save the file as demo.py. Run python demo.py train to train the model and python demo.py predict to predict.
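The post-processing at the end of predict (argmax per time step, then cut at EOS_ID) is easy to check in isolation. This sketch uses plain lists instead of NumPy arrays; the helpers argmax, decode_sample and one_hot are illustrative names, and the scores are fabricated one-hot vectors rather than real model output.

```python
EOS_ID = 2

def argmax(values):
    """Index of the first largest value, like np.argmax on a 1D array."""
    return max(range(len(values)), key=lambda i: values[i])

def decode_sample(outputs, sample_index):
    """outputs: [time][batch][num_symbols] scores; returns the ids before EOS."""
    ids = [argmax(step[sample_index]) for step in outputs]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return ids

def one_hot(index, size=16):
    """Fabricated score vector with a single winning symbol."""
    scores = [0.0] * size
    scores[index] = 1.0
    return scores

# One sample, five time steps: 11, 13, 15, EOS, then padding
outputs = [[one_hot(i)] for i in [11, 13, 15, EOS_ID, 0]]
decoded = decode_sample(outputs, 0)
print(" ".join(str(v) for v in decoded))
```

Everything from the end marker onward is discarded, so trailing padding never reaches the printed answer.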

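So far prediction still runs with the complete encoder_inputs and decoder_inputs. The next improvement to predict reads a string of numbers typed in by hand (the encoder part only).

First, a function that converts a space-separated string of numeric ids into the encoder, decoder and target_weight inputs used for prediction: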

def seq_to_encoder(input_seq):
    """Convert a space-separated string of numeric ids into the encoder, decoder and target_weight inputs used for prediction.
    """
    input_seq_array = [int(v) for v in input_seq.split()]
    encoder_input = [PAD_ID] * (input_seq_len - len(input_seq_array)) + input_seq_array
    decoder_input = [GO_ID] + [PAD_ID] * (output_seq_len - 1)
    encoder_inputs = [np.array([v], dtype=np.int32) for v in encoder_input]
    decoder_inputs = [np.array([v], dtype=np.int32) for v in decoder_input]
    target_weights = [np.array([1.0], dtype=np.float32)] * output_seq_len
    return encoder_inputs, decoder_inputs, target_weights

Then rewrite the predict function as follows:

def predict():
    """
    Prediction procedure.
    """
    with tf.Session() as sess:
        (encoder_inputs, decoder_inputs, target_weights,
         outputs, loss, update, saver, targets) = get_model(feed_previous=True)
        saver.restore(sess, './model/demo')
        sys.stdout.write("> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
                seq_to_encoder(input_seq)

            input_feed = {}
            for l in xrange(input_seq_len):
                input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
            for l in xrange(output_seq_len):
                input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                input_feed[target_weights[l].name] = sample_target_weights[l]
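            # Only one sample is fed at a time here, so the batch size is 1
            input_feed[decoder_inputs[output_seq_len].name] = np.zeros([1], dtype=np.int32)

            # Predict the outputs
            outputs_seq = sess.run(outputs, input_feed)
            # Each output vector has num_decoder_symbols entries, so the
            # index of the largest value is the predicted id; argmax finds it
            outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
            # Stop at the end marker: drop it and everything after it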
            if EOS_ID in outputs_seq:
                outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
            outputs_seq = [str(v) for v in outputs_seq]
            print " ".join(outputs_seq)

            sys.stdout.write("> ")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()

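Run python demo.py predict.

With num_encoder_symbols = 10, the id 11 cannot be represented, so change the parameters and add a sample:

# Maximum number of input symbols
num_encoder_symbols = 32
# Maximum number of output symbols
num_decoder_symbols = 32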
……
train_set = [
              [[5, 7, 9], [11, 13, 15, EOS_ID]], 
              [[7, 9, 11], [13, 15, 17, EOS_ID]], 
              [[15, 17, 19], [21, 23, 25, EOS_ID]]
            ]
……

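Increase the number of iterations to 10000.

With inputs taken from the samples, prediction works very well. With other inputs, the model still picks whichever sample output is closest as its prediction: it does not think and has no intelligence, so this model is better suited to classification than to reasoning.

For real dialogue, Chinese words are converted to ids for training, and predicted ids are converted back to Chinese words at prediction time.

Create a file word_token.py containing a WordToken class. Its load_file_list function loads the samples and builds the word2id_dict and id2word_dict dictionaries; word2id converts a word to its id, and id2word converts an id back to a word: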

# coding:utf-8
import sys
import jieba

class WordToken(object):
    def __init__(self):
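        # Smallest id assigned to a word; lower ids are reserved for special markers
        self.START_ID = 4
        self.word2id_dict = {}
        self.id2word_dict = {}

    def load_file_list(self, file_list):
        """
        Load the list of sample files, segment everything into words, count
        word frequencies, sort from most to least frequent and number the words
        in order, storing them in self.word2id_dict and self.id2word_dict.
        """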
        words_count = {}
        for file in file_list:
            with open(file, 'r') as file_object:
                for line in file_object.readlines():
                    line = line.strip()
                    seg_list = jieba.cut(line)
                    for str in seg_list:
                        if str in words_count:
                            words_count[str] = words_count[str] + 1
                        else:
                            words_count[str] = 1

        sorted_list = [[v[1], v[0]] for v in words_count.items()]
        sorted_list.sort(reverse=True)
        for index, item in enumerate(sorted_list):
            word = item[1]
            self.word2id_dict[word] = self.START_ID + index
            self.id2word_dict[self.START_ID + index] = word
        # Return the last index so the caller can size the symbol tables
        return index

    def word2id(self, word):
        if not isinstance(word, unicode):
            print "Exception: error word not unicode"
            sys.exit(1)
        if word in self.word2id_dict:
            return self.word2id_dict[word]
        else:
            return None

    def id2word(self, id):
        id = int(id)
        if id in self.id2word_dict:
            return self.id2word_dict[id]
        else:
            return None

In demo.py, get_train_set becomes:

def get_train_set():
    global num_encoder_symbols, num_decoder_symbols
    train_set = []
    with open('./samples/question', 'r') as question_file:
        with open('./samples/answer', 'r') as answer_file:
            while True:
                question = question_file.readline()
                answer = answer_file.readline()
                if question and answer:
                    question = question.strip()
                    answer = answer.strip()

                    question_id_list = get_id_list_from(question)
                    answer_id_list = get_id_list_from(answer)
                    answer_id_list.append(EOS_ID)
                    train_set.append([question_id_list, answer_id_list])
                else:
                    break
    return train_set

The implementation of get_id_list_from:

def get_id_list_from(sentence):
    sentence_id_list = []
    seg_list = jieba.cut(sentence)
    for str in seg_list:
        id = wordToken.word2id(str)
        if id:
            sentence_id_list.append(wordToken.word2id(str))
    return sentence_id_list

Setting up wordToken:

import word_token
import jieba
wordToken = word_token.WordToken()

# Kept at module level so that num_encoder_symbols and num_decoder_symbols can be computed dynamically
max_token_id = wordToken.load_file_list(['./samples/question', './samples/answer'])
num_encoder_symbols = max_token_id + 5
num_decoder_symbols = max_token_id + 5

The training code:

# Train for many iterations, printing the loss every 10; stop with ctrl+c when appropriate
        for step in xrange(100000):
            [loss_ret, _] = sess.run([loss, update], input_feed)
            if step % 10 == 0:
                print 'step=', step, 'loss=', loss_ret

                # Persist the model
                saver.save(sess, './model/demo')

The modified prediction code:

def predict():
    """
    Prediction procedure.
    """
    with tf.Session() as sess:
        (encoder_inputs, decoder_inputs, target_weights,
         outputs, loss, update, saver, targets) = get_model(feed_previous=True)
        saver.restore(sess, './model/demo')
        sys.stdout.write("> ")
        sys.stdout.flush()
        input_seq = sys.stdin.readline()
        while input_seq:
            input_seq = input_seq.strip()
            input_id_list = get_id_list_from(input_seq)
            if (len(input_id_list)):
                sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = \
                    seq_to_encoder(' '.join([str(v) for v in input_id_list]))

                input_feed = {}
                for l in xrange(input_seq_len):
                    input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
                for l in xrange(output_seq_len):
                    input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
                    input_feed[target_weights[l].name] = sample_target_weights[l]
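                # Only one sample is fed at a time here, so the batch size is 1
                input_feed[decoder_inputs[output_seq_len].name] = \
                    np.zeros([1], dtype=np.int32)

                # Predict the outputs
                outputs_seq = sess.run(outputs, input_feed)
                # Each output vector has num_decoder_symbols entries, so the
                # index of the largest value is the predicted id; argmax finds it
                outputs_seq = [int(np.argmax(logit[0], axis=0)) for logit in outputs_seq]
                # Stop at the end marker: drop it and everything after it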
                if EOS_ID in outputs_seq:
                    outputs_seq = outputs_seq[:outputs_seq.index(EOS_ID)]
                outputs_seq = [wordToken.id2word(v) for v in outputs_seq]
                print " ".join(outputs_seq)
            else:
                print "WARN: words not in vocabulary"

            sys.stdout.write("> ")
            sys.stdout.flush()
            input_seq = sys.stdin.readline()

Train with the 1000 dialogue samples stored in ['./samples/question', './samples/answer'] until the printed loss converges below some threshold (for example 1.0):

python demo.py train

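Once the loss drops below 1.0, stop manually with ctrl+c; the model is saved every 10 steps.

With the learning rate fixed at 0.1, the model converges very slowly. A better approach is to start with a larger learning rate and lower it whenever the loss rebounds (increases) compared with the previous steps. Instead of using learning_rate directly, define an initial learning rate: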

init_learning_rate = 1

In get_model, create a variable initialized with init_learning_rate:

learning_rate = tf.Variable(float(init_learning_rate), trainable=False, dtype=tf.float32)

Then create an op that cuts the learning rate by 10% when appropriate:

learning_rate_decay_op = learning_rate.assign(learning_rate * 0.9)

The adjusted training code:

# Train for many iterations, printing the loss every 10; stop with ctrl+c when appropriate
        previous_losses = []
        for step in xrange(100000):
            [loss_ret, _] = sess.run([loss, update], input_feed)
            if step % 10 == 0:
                print 'step=', step, 'loss=', \
                        loss_ret, 'learning_rate=', learning_rate.eval()

                # Decay only once there is enough history; max() of an empty list would raise
                if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
                    sess.run(learning_rate_decay_op)
                previous_losses.append(loss_ret)

                # Persist the model
                saver.save(sess, './model/demo')

Training now converges quickly.
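The decay rule (multiply the learning rate by 0.9 whenever the current loss exceeds the largest of the last five recorded losses) can be simulated on a list of numbers. This is a plain-Python sketch with fabricated loss values; the function name decay_schedule is made up for illustration.

```python
def decay_schedule(losses, init_learning_rate=1.0, decay=0.9, window=5):
    """Return the learning rate in effect after each recorded loss."""
    learning_rate = init_learning_rate
    previous = []
    history = []
    for loss in losses:
        # A rebound: the loss is worse than anything in the recent window
        if len(previous) > window and loss > max(previous[-window:]):
            learning_rate *= decay
        previous.append(loss)
        history.append(learning_rate)
    return history

# Fabricated losses: steady improvement, then a rebound at the end
losses = [5.0, 4.0, 3.0, 2.0, 1.0, 1.5, 1.2, 3.5]
history = decay_schedule(losses)
print(history)
```

Small fluctuations inside the window leave the rate untouched; only a loss above the whole recent window triggers the 10% cut.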

References:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
https://arxiv.org/abs/1406.1078
https://arxiv.org/abs/1409.3215
https://arxiv.org/abs/1409.0473

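Loading all samples at once and training on a large number of samples exhausts memory, and the process keeps failing with Out of memory. The fix is to load samples in batches instead of all at once; then memory no longer grows without bound however large the sample set is.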

https://github.com/warmheartli/ChatBotCourse/tree/master/chatbotv5

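Memory grows with the sample count: at tens of thousands of samples, memory usage reaches 10 GB. Every iteration loads all samples into memory and trains on them in one pass before updating the model. In addition, the vocabulary is generated from the samples without any limit, so more samples mean a larger vocabulary, a larger model, and even more memory.

The optimization: change full loading to batch loading by modifying the train() function.

# Train for many iterations, printing the loss every 10; stop with ctrl+c when appropriate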
    previous_losses = []
    for step in xrange(20000):
        sample_encoder_inputs, sample_decoder_inputs, sample_target_weights = get_samples(train_set, 1000)
        input_feed = {}
        for l in xrange(input_seq_len):
            input_feed[encoder_inputs[l].name] = sample_encoder_inputs[l]
        for l in xrange(output_seq_len):
            input_feed[decoder_inputs[l].name] = sample_decoder_inputs[l]
            input_feed[target_weights[l].name] = sample_target_weights[l]
        input_feed[decoder_inputs[output_seq_len].name] = np.zeros([len(sample_decoder_inputs[0])], dtype=np.int32)
        [loss_ret, _] = sess.run([loss, update], input_feed)
        if step % 10 == 0:
            print 'step=', step, 'loss=', loss_ret, 'learning_rate=', learning_rate.eval()

            if len(previous_losses) > 5 and loss_ret > max(previous_losses[-5:]):
                sess.run(learning_rate_decay_op)
            previous_losses.append(loss_ret)

            # Persist the model
            saver.save(sess, './model/demo')

get_samples(train_set, 1000) fetches a batch of samples; 1000 is the number of samples fetched each time:

    if batch_num >= len(train_set):
        batch_train_set = train_set
    else:
        random_start = random.randint(0, len(train_set)-batch_num)
        batch_train_set = train_set[random_start:random_start+batch_num]
    for sample in batch_train_set:
        raw_encoder_input.append([PAD_ID] * (input_seq_len - len(sample[0])) + sample[0])
        raw_decoder_input.append([GO_ID] + sample[1] + [PAD_ID] * (output_seq_len - len(sample[1]) - 1))

Each call draws 1000 consecutive samples starting at a random position in the full sample set.
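The batch selection itself can be isolated into a small function and tried on toy data. This is a sketch of the same slicing logic in plain Python; the function name sample_batch and the integer stand-ins for samples are illustrative.

```python
import random

def sample_batch(train_set, batch_num):
    """Return batch_num consecutive samples from a random start position."""
    if batch_num >= len(train_set):
        # Fewer samples than requested: use them all
        return train_set
    # randint is inclusive at both ends, so the slice always fits
    random_start = random.randint(0, len(train_set) - batch_num)
    return train_set[random_start:random_start + batch_num]

random.seed(0)
toy_train_set = list(range(10))  # stand-ins for (question, answer) pairs
batch = sample_batch(toy_train_set, 3)
print(batch)
```

Because the slice is contiguous, neighbouring samples always travel together; shuffling train_set once before training would be one way to vary the batch composition, if that matters for the data at hand.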

When loading the sample vocabulary, apply a minimum word frequency limit:

def load_file_list(self, file_list, min_freq):
    ......
        for index, item in enumerate(sorted_list):
            word = item[1]
            if item[0] < min_freq:
                break
            self.word2id_dict[word] = self.START_ID + index
            self.id2word_dict[self.START_ID + index] = word
        return index
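
The effect of the minimum frequency limit on vocabulary size can be seen on a toy corpus. This sketch uses collections.Counter and pre-tokenized word lists instead of jieba and files, and breaks frequency ties alphabetically; the function name build_vocab and the sample tokens are illustrative assumptions.

```python
from collections import Counter

START_ID = 4  # lower ids stay reserved for special markers

def build_vocab(token_lists, min_freq):
    """Number words by descending frequency, dropping words below min_freq."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    word2id, id2word = {}, {}
    for word, freq in ranked:
        if freq < min_freq:
            break  # everything after this point is rarer still
        index = START_ID + len(word2id)
        word2id[word] = index
        id2word[index] = word
    return word2id, id2word

token_lists = [["a", "b", "a"], ["b", "a", "c"]]
word2id, id2word = build_vocab(token_lists, min_freq=2)
print(word2id)
```

Rare words never receive an id, so the embedding and output layers shrink along with the vocabulary.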


Further reading: "Natural Language Processing with Python"; "NLTK Essentials: Build Machine Learning Applications with NLTK and Python Libraries"; http://www.shareditor.com/blogshow?blogId=136 ; http://www.shareditor.com/blogshow?blogId=137
