Study Notes TF059: Natural Language Processing and Intelligent Chatbots

Date: 2019-11-10

Tags: study notes, tf059, natural language processing, intelligent chatbots


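Natural language processing (NLP) covers speech processing and text processing. Speech recognition lets a computer "understand" human speech by extracting the text carried in the audio.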

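Japan's Fukoku Mutual Life Insurance spent US$1.7 million installing an AI system that converts customers' speech to text and analyzes whether the words are positive or negative. Intelligent customer service is a research focus for AI companies. The model family used here is the recurrent neural network (RNN).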

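Model selection. In the standard diagram each rectangle is a vector and each arrow is a function: the bottom row holds the input vectors, the top row the output vectors, and the middle row the RNN state. One-to-one uses no RNN; it is the vanilla model that maps a fixed-size input to a fixed-size output (image classification). One-to-many produces a sequence output, as in image captioning: one image goes in and a sequence of words comes out, combining a CNN with an RNN and vision with language. Many-to-one takes a sequence input, as in sentiment analysis: a piece of text is classified as positive or negative (for example Taobao product reviews), typically with an LSTM. Asynchronous many-to-many reads a whole sequence before emitting a sequence, as in machine translation, where an RNN reads an English sentence and outputs it in French. Synchronized many-to-many emits one output per input step, as in video classification that labels every frame. The RNN state transformation in the middle is fixed and applied repeatedly, so the sequence length does not have to be constrained in advance. See Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks", http://karpathy.github.io/201... .

Natural language processing includes speech synthesis (generating speech from text), speech recognition, voiceprint recognition (voiceprint authentication), and text processing (word segmentation, sentiment analysis, text mining).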
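
The many-to-one and synchronized many-to-many patterns can be sketched with a plain NumPy RNN cell. This is only an illustration, not code from the original notes: the function name rnn_forward, the weight shapes and the random inputs are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    inputs: array of shape (time_steps, input_dim)
    returns: array of shape (time_steps, hidden_dim)
    """
    hidden = np.zeros(w_hh.shape[0])
    states = []
    for x_t in inputs:
        # The same weights are reused at every step, so any sequence length works.
        hidden = np.tanh(x_t @ w_xh + hidden @ w_hh + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes = 4, 8, 2
w_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
w_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
w_hy = rng.normal(scale=0.1, size=(hidden_dim, num_classes))

sequence = rng.normal(size=(5, input_dim))  # 5 time steps
states = rnn_forward(sequence, w_xh, w_hh, b_h)

# Many-to-one (e.g. sentiment analysis): classify from the last state only.
many_to_one_logits = states[-1] @ w_hy      # shape (num_classes,)

# Synchronized many-to-many (e.g. per-frame video labels): one output per step.
many_to_many_logits = states @ w_hy         # shape (5, num_classes)
```

Because the loop reuses the same weights at each step, calling rnn_forward on a 9-step sequence needs no change to the parameters.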

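English spoken-digit recognition. https://github.com/pannous/te... builds a very simple speech recognizer in about 20 lines of Python: an LSTM recurrent neural network trained with TFLearn on a dataset of spoken English digits, the spoken numbers pcm dataset, http://pannous.net/spoken_num... . Several speakers, male and female, read the digits 0 to 9 in English, and each audio clip (a wav file) contains the English pronunciation of exactly one digit. Files are named {digit}_{speaker name}_xxx.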
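
Given that naming scheme, the training label can be read straight from the file name. A minimal sketch, assuming the digit is always the part before the first underscore; the helper name label_from_filename and the example path are hypothetical.

```python
import os
import numpy as np

def label_from_filename(path, num_classes=10):
    """Return a one-hot label taken from the leading digit of the file name."""
    name = os.path.basename(path)
    digit = int(name.split("_")[0])
    if not 0 <= digit < num_classes:
        raise ValueError("unexpected digit %d in %r" % (digit, name))
    one_hot = np.zeros(num_classes, dtype=np.float32)
    one_hot[digit] = 1.0
    return one_hot

# Hypothetical file name that follows the {digit}_{speaker name}_xxx pattern.
label = label_from_filename("data/spoken_numbers_pcm/5_Alex_260.wav")
```

The one-hot vector has ten entries, matching the ten digit classes the classifier predicts.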

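Define the input data and preprocess it. The speech is converted into matrix form using Mel-frequency cepstral coefficient (MFCC) feature vectors: the audio is split into frames, the logarithm of each frame's mel-scaled spectrum is taken, and an inverse transform (in practice a discrete cosine transform) is applied, producing MFCCs that represent the speech features.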
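
As a rough sketch of those steps (framing, log of mel-scaled energies, inverse transform via a DCT), the NumPy code below computes MFCCs from a raw waveform. It is an illustration, not the feature extractor used by the original project; the frame length, hop, FFT size, filter count and the synthetic test tone are all assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    bins = np.floor((fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((num_filters, fft_size // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of the triangle
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc(signal, sample_rate, frame_len=400, hop=160, fft_size=512,
         num_filters=26, num_coeffs=20):
    # 1. Framing: slice the waveform into overlapping, windowed frames.
    num_frames = 1 + (len(signal) - frame_len) // hop
    index = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[index] * np.hamming(frame_len)
    # 2. Power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, n=fft_size)) ** 2 / fft_size
    # 3. Mel filterbank energies, then the logarithm.
    energies = power @ mel_filterbank(num_filters, fft_size, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 4. A DCT-II over the filter axis plays the role of the inverse transform.
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_coeffs), 2 * n + 1)
                   / (2 * num_filters))
    return (log_energies @ basis.T).T       # shape (num_coeffs, num_frames)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)        # one second of a 440 Hz tone, a stand-in for speech
features = mfcc(tone, sample_rate)
```

With 20 coefficients per frame the result has the same first dimension as the `width = 20` setting in the training script; padding or truncating the frame axis to a fixed length would still be needed before batching.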

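Define the network model: an LSTM.

Train the model and save it.

Predict: feed in any speech file and get a prediction.

Speech recognition can be used in smart input methods, rapid meeting transcription, voice control systems and smart homes.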

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division, print_function, absolute_import
import tflearn
import speech_data
learning_rate = 0.0001
training_iters = 300000  # total number of training steps
batch_size = 64
width = 20  # number of MFCC features
height = 80  # (max) length of utterance
classes = 10  # digits
batch = word_batch = speech_data.mfcc_batch_generator(batch_size)  # yields batches of MFCC features with labels
X, Y = next(batch)
# train, test, _ = ,X
trainX, trainY = X, Y
testX, testY = X, Y #overfit for now
# Data preprocessing
# Sequence padding
# trainX = pad_sequences(trainX, maxlen=100, value=0.)
# testX = pad_sequences(testX, maxlen=100, value=0.)
# # Converting labels to binary vectors
# trainY = to_categorical(trainY, nb_classes=2)
# testY = to_categorical(testY, nb_classes=2)
# Network building: an LSTM model
net = tflearn.input_data([None, width, height])
# net = tflearn.embedding(net, input_dim=10000, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate, loss='categorical_crossentropy')
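# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
# Uncomment to resume from a previously saved model; the file does not exist on a first run.
# model.load("tflearn.lstm.model")
n_epoch = 100  # the single batch above is the whole training set, so one epoch is one step
for _ in range(training_iters // n_epoch):  # bounded loop so the code after it is reachable
  model.fit(trainX, trainY, n_epoch=n_epoch, validation_set=(testX, testY), show_metric=True,
          batch_size=batch_size)
  model.save("tflearn.lstm.model")  # save after every round so progress survives an interrupt
_y = model.predict(X)
print(_y)
print(Y)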

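Intelligent chatbots. The future direction is natural-language human-computer interaction. Examples include Apple Siri, Microsoft Cortana and Xiaoice, Google Now, Baidu Duer, Alexa (the voice assistant built into Amazon's Echo Bluetooth speaker), and Facebook's assistant M. By talking with a voice bot, the user is guided to the corresponding service, which opens the way to embedded applications in smart hardware and smart homes.
Chatbot technology has gone through three generations. The first relied on feature engineering and large amounts of hand-written logic. The second relied on a retrieval library: given a question or chat message, it finds the stored answer that best matches. The third uses deep learning, namely the seq2seq+Attention model, which is trained on large amounts of data and generates the output from the input.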
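
To make the second-generation (retrieval) approach concrete, here is a minimal sketch that matches a query against stored questions by bag-of-words cosine similarity. The tiny question/answer store and the function names are made up for illustration; a real system would use a proper index and better text features.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_answer(question, qa_pairs):
    """Return the stored answer whose question is most similar to the query."""
    query = Counter(question.lower().split())
    _, best_answer = max(
        qa_pairs,
        key=lambda pair: cosine(query, Counter(pair[0].lower().split())))
    return best_answer

# Hypothetical retrieval library of (question, answer) pairs.
qa_pairs = [
    ("what time do you open", "We open at 9 am."),
    ("how do i reset my password", "Use the reset link on the login page."),
    ("where is my order", "You can track it under My Orders."),
]
reply = retrieve_answer("how can i reset the password", qa_pairs)
```

The bot can only return answers that already exist in the store, which is the limitation the generative third generation removes.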

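How the seq2seq+Attention model works and how to build it. It is a translation model that turns one sequence into another. Two RNN language models (RNNLMs), one acting as encoder and one as decoder, form an RNN encoder-decoder. The encoder-decoder framework is widely used in text processing: input -> encoder -> semantic encoding C -> decoder -> output. It is a general model for generating a target from a context. Given a sentence pair <X, Y>, the input sentence X is passed through the encoder-decoder framework to generate the target sentence Y. If X and Y are in different languages, the task is machine translation; if they are a question and its answer, it is a chatbot; if they are an image and its description, it is image captioning.
X is a sequence of words x1, x2, ... and Y is a sequence of words y1, y2, ... . The encoder encodes X into the intermediate semantic encoding C, and the decoder decodes C, at each step i combining C with the history y1, y2, ..., yi-1 already generated to produce yi. Every generated word uses the same encoding C, which works well for short sentences but loses meaning on long ones.
In a practical chat system, the encoder and decoder are RNN or LSTM models. Once the sentence length exceeds 30, LSTM performance drops sharply, so an Attention model is introduced to improve results on long sentences. The Attention mechanism mirrors how a person concentrates on the task at hand and ignores everything else around it: the source words that matter most for the word being generated get higher weights, which yields more accurate replies. The encoder-decoder framework with Attention becomes: input -> encoder -> semantic encodings C1, C2, C3 -> decoder -> outputs Y1, Y2, Y3. The intermediate encoding Ci changes at every step, producing a more accurate Yi.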
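
The step-dependent encoding Ci can be sketched in a few lines of NumPy. This is a simplified illustration: it scores source positions with a plain dot product, whereas attention layers in real libraries usually learn the scoring function. The names attention_context, encoder_states and the toy vectors are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Return (weights over source positions, context vector C_i)."""
    scores = encoder_states @ decoder_state   # one score per source word
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context

# Three source words, each with a 3-dimensional encoder state (toy values).
encoder_states = np.eye(3)

# Two different decoder steps attend to different source words.
weights_1, c_1 = attention_context(np.array([4.0, 0.0, 0.0]), encoder_states)
weights_2, c_2 = attention_context(np.array([0.0, 0.0, 4.0]), encoder_states)
```

Because the weights are recomputed from the decoder state at every step, c_1 and c_2 differ, unlike the single fixed C of the plain encoder-decoder.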

Best practice. https://github.com/suriyadeep... depends on a TensorFlow 0.12.1 environment. The data is the Cornell Movie Dialogs Corpus from Cornell University, http://www.cs.cornell.edu/~cr... , which contains dialogue from about 600 movies.

Processing the chat data.

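First organize the dataset into "question" and "answer" files, producing .enc (question) and .dec (answer) files: test.dec (test-set answers), test.enc (test-set questions), train.dec (training-set answers), and train.enc (training-set questions).
Then build the vocabularies and convert the questions and answers to id form. Each vocabulary file holds 20,000 entries: vocab20000.dec (answer vocabulary) and vocab20000.enc (question vocabulary). _GO, _EOS, _UNK and _PAD are the special seq2seq tokens used to mark up and pad the dialogue: _GO marks the start of a reply, _EOS marks its end, _UNK stands for tokens that are not in the vocabulary and replaces rare words, and _PAD pads sequences so that all sequences in a batch have the same length. The text is converted into ids files: test.enc.ids20000, train.dec.ids20000 and train.enc.ids20000. In these files each line is one question or answer, and each id on a line stands for the word at that position.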
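
A minimal sketch of the vocabulary and id conversion steps, assuming whitespace tokenization and the usual convention that the four special tokens take ids 0 to 3. The function names and the two sample sentences are illustrative, not the project's data_utils code.

```python
from collections import Counter

_PAD, _GO, _EOS, _UNK = "_PAD", "_GO", "_EOS", "_UNK"
START_VOCAB = [_PAD, _GO, _EOS, _UNK]
PAD_ID, GO_ID, EOS_ID, UNK_ID = 0, 1, 2, 3

def build_vocab(sentences, max_size):
    """Map the most frequent words to ids, special tokens first."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    words = START_VOCAB + [w for w, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(words[:max_size])}

def sentence_to_ids(sentence, vocab):
    """Replace each word by its id; unknown words become UNK_ID."""
    return [vocab.get(word, UNK_ID) for word in sentence.lower().split()]

sentences = ["how are you", "how is it going"]
vocab = build_vocab(sentences, max_size=8)   # 4 special tokens + 4 most frequent words
ids = sentence_to_ids("how is it", vocab)    # "it" falls outside the vocabulary
```

Capping the vocabulary at max_size is what forces rare words onto _UNK; in the notes that cap is 20,000.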

Train with the encoder-decoder framework.

Define the training parameters in seq2seq.ini.

[strings]
# Mode : train, test, serve
mode = train
train_enc = data/train.enc
train_dec = data/train.dec
test_enc = data/test.enc
test_dec = data/test.dec
# folder where checkpoints, vocabulary, temporary data will be stored
working_directory = working_dir/
[ints]
# vocabulary size
#     20,000 is a reasonable size
enc_vocab_size = 20000
dec_vocab_size = 20000
# number of LSTM layers : 1/2/3
num_layers = 3
# typical options for the size of each layer : 128, 256, 512, 1024
layer_size = 256
# dataset size limit; typically none : no limit
max_train_data_size = 0
batch_size = 64
# steps per checkpoint
#     Note : At a checkpoint, models parameters are saved, model is evaluated
#            and results are printed
steps_per_checkpoint = 300
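[floats]
# learning rate (kept on its own line: the config parser does not strip trailing "#" comments by default, so float() would fail)
learning_rate = 0.5
# learning rate decay factor
learning_rate_decay_factor = 0.99
max_gradient_norm = 5.0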
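
The section names in this file double as type tags. As a sketch of how such a file can be loaded into one flat dictionary with Python 3's configparser: the embedded sample text and the function name load_config are assumptions, and inline_comment_prefixes is passed so that values tolerate a trailing "# ..." comment, which configparser does not strip by default.

```python
import configparser

# A shortened stand-in for seq2seq.ini, embedded so the example is self-contained.
SAMPLE = """
[strings]
mode = train
working_directory = working_dir/
[ints]
num_layers = 3
layer_size = 256
[floats]
learning_rate = 0.5  # trailing comment
"""

def load_config(text):
    """Parse the ini text and cast each value according to its section."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    casts = {"ints": int, "floats": float, "strings": str}
    config = {}
    for section, cast in casts.items():
        for key, value in parser.items(section):
            config[key] = cast(value)
    return config

config = load_config(SAMPLE)
```

Keeping one section per type means adding a new hyperparameter never requires touching the loading code.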

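Define the network model, seq2seq, in seq2seq_model.py (TensorFlow 0.12). The file defines the seq2seq+Attention model class; the reference is "Grammar as a Foreign Language", http://arxiv.org/abs/1412.7449 . The class has three functions: the model initializer (__init__), the training step (step), and the function that fetches the next batch of training data (get_batch).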

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import random
import numpy as np
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.models.rnn.translate import data_utils
class Seq2SeqModel(object):
  def __init__(self, source_vocab_size, target_vocab_size, buckets, size,
               num_layers, max_gradient_norm, batch_size, learning_rate,
               learning_rate_decay_factor, use_lstm=False,
               num_samples=512, forward_only=False):
    """Create the model.

    Args:
      source_vocab_size: size of the source (question) vocabulary.
      target_vocab_size: size of the target (answer) vocabulary.
      buckets: a list of pairs (I, O), where I specifies maximum input length
        that will be processed in that bucket, and O specifies maximum output
        length. Training instances that have inputs longer than I or outputs
        longer than O will be pushed to the next bucket and padded accordingly.
        We assume that the list is sorted, e.g., [(2, 4), (8, 16)].
      size: number of units in each layer of the model.
      num_layers: number of layers in the model.
      max_gradient_norm: gradients will be clipped to maximally this norm.
      batch_size: the size of the batches used during training;
        the model construction is independent of batch_size, so it can be
        changed after initialization if this is convenient, e.g., for decoding.
      learning_rate: learning rate to start with.
      learning_rate_decay_factor: decay learning rate by this much when needed.
      use_lstm: if true, we use LSTM cells instead of GRU cells.
      num_samples: number of samples for sampled softmax.
      forward_only: if set, we do not construct the backward pass in the model.
    """
    self.source_vocab_size = source_vocab_size
    self.target_vocab_size = target_vocab_size
    self.buckets = buckets
    self.batch_size = batch_size
    self.learning_rate = tf.Variable(float(learning_rate), trainable=False)
    self.learning_rate_decay_op = self.learning_rate.assign(
        self.learning_rate * learning_rate_decay_factor)
    self.global_step = tf.Variable(0, trainable=False)
    # If we use sampled softmax, we need an output projection.
    output_projection = None
    softmax_loss_function = None
    # Sampled softmax only makes sense if we sample less than vocabulary size.
    if num_samples > 0 and num_samples < self.target_vocab_size:
      w = tf.get_variable("proj_w", [size, self.target_vocab_size])
      w_t = tf.transpose(w)
      b = tf.get_variable("proj_b", [self.target_vocab_size])
      output_projection = (w, b)
      def sampled_loss(inputs, labels):
        labels = tf.reshape(labels, [-1, 1])
        return tf.nn.sampled_softmax_loss(w_t, b, inputs, labels, num_samples,
                self.target_vocab_size)
      softmax_loss_function = sampled_loss
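    # Create the internal multi-layer cell for our RNN.
    single_cell = tf.nn.rnn_cell.GRUCell(size)
    if use_lstm:
      single_cell = tf.nn.rnn_cell.BasicLSTMCell(size)
    # Apply dropout to each layer's output, and only while training. Wrapping
    # `cell` alone would be discarded when MultiRNNCell is built below.
    if not forward_only:
      single_cell = tf.nn.rnn_cell.DropoutWrapper(single_cell, output_keep_prob=0.5)
    cell = single_cell
    if num_layers > 1:
      cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * num_layers)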

    # The seq2seq function: we use embedding for the input and attention.
    def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
      return tf.nn.seq2seq.embedding_attention_seq2seq(
          encoder_inputs, decoder_inputs, cell,
          num_encoder_symbols=source_vocab_size,
          num_decoder_symbols=target_vocab_size,
          embedding_size=size,
          output_projection=output_projection,
          feed_previous=do_decode)
    # Feeds for inputs.
    self.encoder_inputs = []
    self.decoder_inputs = []
    self.target_weights = []
    for i in xrange(buckets[-1][0]):  # Last bucket is the biggest one.
      self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="encoder{0}".format(i)))
    for i in xrange(buckets[-1][1] + 1):
      self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None],
                                                name="decoder{0}".format(i)))
      self.target_weights.append(tf.placeholder(tf.float32, shape=[None],
                                                name="weight{0}".format(i)))
    # Our targets are decoder inputs shifted by one.
    targets = [self.decoder_inputs[i + 1]
               for i in xrange(len(self.decoder_inputs) - 1)]
    # Training outputs and losses.
    if forward_only:
      self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
          self.encoder_inputs, self.decoder_inputs, targets,
          self.target_weights, buckets, lambda x, y: seq2seq_f(x, y, True),
          softmax_loss_function=softmax_loss_function)
      # If we use output projection, we need to project outputs for decoding.
      if output_projection is not None:
        for b in xrange(len(buckets)):
          self.outputs[b] = [
              tf.matmul(output, output_projection[0]) + output_projection[1]
              for output in self.outputs[b]
          ]
    else:
      self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
          self.encoder_inputs, self.decoder_inputs, targets,
          self.target_weights, buckets,
          lambda x, y: seq2seq_f(x, y, False),
          softmax_loss_function=softmax_loss_function)
    # Gradients and update operation for training the model. Note that Adam runs
    # with its default learning rate here; self.learning_rate is not passed to it.
    params = tf.trainable_variables()
    if not forward_only:
      self.gradient_norms = []
      self.updates = []
      opt = tf.train.AdamOptimizer()
      for b in xrange(len(buckets)):
        gradients = tf.gradients(self.losses[b], params)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients,
                                                         max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(
            zip(clipped_gradients, params), global_step=self.global_step))
    self.saver = tf.train.Saver(tf.global_variables())
  def step(self, session, encoder_inputs, decoder_inputs, target_weights,
           bucket_id, forward_only):
    """Run a step of the model feeding the given inputs.

    Args:
      session: tensorflow session to use.
      encoder_inputs: list of numpy int vectors to feed as encoder inputs (the question).
      decoder_inputs: list of numpy int vectors to feed as decoder inputs (the answer).
      target_weights: list of numpy float vectors to feed as target weights.
      bucket_id: which bucket of the model to use.
      forward_only: whether to do the backward step or only forward.
    Returns:
      A triple consisting of gradient norm (or None if we did not do backward),
      average perplexity, and the outputs.
    Raises:
      ValueError: if length of encoder_inputs, decoder_inputs, or
        target_weights disagrees with bucket size for the specified bucket_id.
    """
    # Check if the sizes match.
    encoder_size, decoder_size = self.buckets[bucket_id]
    if len(encoder_inputs) != encoder_size:
      raise ValueError("Encoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(encoder_inputs), encoder_size))
    if len(decoder_inputs) != decoder_size:
      raise ValueError("Decoder length must be equal to the one in bucket,"
                       " %d != %d." % (len(decoder_inputs), decoder_size))
    if len(target_weights) != decoder_size:
      raise ValueError("Weights length must be equal to the one in bucket,"
                       " %d != %d." % (len(target_weights), decoder_size))
    # Input feed: encoder inputs, decoder inputs, target_weights, as provided.
    input_feed = {}
    for l in xrange(encoder_size):
      input_feed[self.encoder_inputs[l].name] = encoder_inputs[l]
    for l in xrange(decoder_size):
      input_feed[self.decoder_inputs[l].name] = decoder_inputs[l]
      input_feed[self.target_weights[l].name] = target_weights[l]
    # Since our targets are decoder inputs shifted by one, we need one more.
    last_target = self.decoder_inputs[decoder_size].name
    input_feed[last_target] = np.zeros([self.batch_size], dtype=np.int32)
    # Output feed: depends on whether we do a backward step or not.
    if not forward_only:
      output_feed = [self.updates[bucket_id],  # Update Op that does SGD.
                     self.gradient_norms[bucket_id],  # Gradient norm.
                     self.losses[bucket_id]]  # Loss for this batch.
    else:
      output_feed = [self.losses[bucket_id]]  # Loss for this batch.
      for l in xrange(decoder_size):  # Output logits.
        output_feed.append(self.outputs[bucket_id][l])
    outputs = session.run(output_feed, input_feed)
    if not forward_only:
      return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.
    else:
      return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.
  def get_batch(self, data, bucket_id):
    """Get a random batch of data from the specified bucket, for use in step.

    Args:
      data: a tuple of size len(self.buckets) in which each element contains
        lists of pairs of input and output data that we use to create a batch.
      bucket_id: integer, which bucket to get the batch for.
    Returns:
      The triple (encoder_inputs, decoder_inputs, target_weights) for
      the constructed batch that has the proper format to call step(...) later.
    """
    encoder_size, decoder_size = self.buckets[bucket_id]
    encoder_inputs, decoder_inputs = [], []
    # Get a random batch of encoder and decoder inputs from data,
    # pad them if needed, reverse encoder inputs and add GO to decoder.
    for _ in xrange(self.batch_size):
      encoder_input, decoder_input = random.choice(data[bucket_id])
      # Encoder inputs are padded and then reversed.
      encoder_pad = [data_utils.PAD_ID] * (encoder_size - len(encoder_input))
      encoder_inputs.append(list(reversed(encoder_input + encoder_pad)))
      # Decoder inputs get an extra "GO" symbol, and are padded then.
      decoder_pad_size = decoder_size - len(decoder_input) - 1
      decoder_inputs.append([data_utils.GO_ID] + decoder_input +
                            [data_utils.PAD_ID] * decoder_pad_size)
    # Now we create batch-major vectors from the data selected above.
    batch_encoder_inputs, batch_decoder_inputs, batch_weights = [], [], []
    # Batch encoder inputs are just re-indexed encoder_inputs.
    for length_idx in xrange(encoder_size):
      batch_encoder_inputs.append(
          np.array([encoder_inputs[batch_idx][length_idx]
                    for batch_idx in xrange(self.batch_size)], dtype=np.int32))
    # Batch decoder inputs are re-indexed decoder_inputs, we create weights.
    for length_idx in xrange(decoder_size):
      batch_decoder_inputs.append(
          np.array([decoder_inputs[batch_idx][length_idx]
                    for batch_idx in xrange(self.batch_size)], dtype=np.int32))
      # Create target_weights to be 0 for targets that are padding.
      batch_weight = np.ones(self.batch_size, dtype=np.float32)
      for batch_idx in xrange(self.batch_size):
        # We set weight to 0 if the corresponding target is a PAD symbol.
        # The corresponding target is decoder_input shifted by 1 forward.
        if length_idx < decoder_size - 1:
          target = decoder_inputs[batch_idx][length_idx + 1]
        if length_idx == decoder_size - 1 or target == data_utils.PAD_ID:
          batch_weight[batch_idx] = 0.0
      batch_weights.append(batch_weight)
    return batch_encoder_inputs, batch_decoder_inputs, batch_weights
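
What get_batch does to a single (question, answer) pair can be sketched in plain Python: pad and reverse the encoder input, prepend GO to the decoder input, and zero the weight wherever the shifted target is padding. The helper name prepare_pair and the example ids are assumptions; the special ids follow the 0/1/2 convention for PAD/GO/EOS.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2

def prepare_pair(encoder_input, decoder_input, encoder_size, decoder_size):
    """Pad one training pair to the bucket size and compute target weights."""
    # Encoder inputs are padded and then reversed.
    encoder_pad = [PAD_ID] * (encoder_size - len(encoder_input))
    encoder_row = list(reversed(encoder_input + encoder_pad))
    # Decoder inputs get a leading GO symbol and are padded afterwards.
    decoder_pad = [PAD_ID] * (decoder_size - len(decoder_input) - 1)
    decoder_row = [GO_ID] + decoder_input + decoder_pad
    # The target at step t is decoder_row[t + 1]; weight is 0 where that
    # target is padding, and at the last step, which has no target.
    weights = []
    for t in range(decoder_size):
        is_last = t == decoder_size - 1
        target_is_pad = (not is_last) and decoder_row[t + 1] == PAD_ID
        weights.append(0.0 if is_last or target_is_pad else 1.0)
    return encoder_row, decoder_row, weights

enc, dec, w = prepare_pair([4, 5, 6], [7, 8, EOS_ID],
                           encoder_size=5, decoder_size=6)
```

The zero weights keep padded positions from contributing to the loss, so short answers in a large bucket are not penalized for their padding.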

Train the model: set mode to "train" in seq2seq.ini and run execute.py.

Validate the model: set mode to "test" in seq2seq.ini and run execute.py.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import math
import os
import random
import sys
import time
import numpy as np
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf
import data_utils
import seq2seq_model
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2
gConfig = {}
def get_config(config_file='seq2seq.ini'):
    parser = ConfigParser()
    parser.read(config_file)
    # get the ints, floats and strings
    _conf_ints = [ (key, int(value)) for key,value in parser.items('ints') ]
    _conf_floats = [ (key, float(value)) for key,value in parser.items('floats') ]
    _conf_strings = [ (key, str(value)) for key,value in parser.items('strings') ]
    return dict(_conf_ints + _conf_floats + _conf_strings)
# We use a number of buckets and pad to the closest one for efficiency.
# See seq2seq_model.Seq2SeqModel for details of how they work.
_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
def read_data(source_path, target_path, max_size=None):
  """Read data from source and target files and put into buckets.
  Args:
    source_path: path to the files with token-ids for the source language.
    target_path: path to the file with token-ids for the target language;
      it must be aligned with the source file: n-th line contains the desired
      output for n-th line from the source_path.
    max_size: maximum number of lines to read, all other will be ignored;
      if 0 or None, data files will be read completely (no limit).
  Returns:
    data_set: a list of length len(_buckets); data_set[n] contains a list of
      (source, target) pairs read from the provided data files that fit
      into the n-th bucket, i.e., such that len(source) < _buckets[n][0] and
      len(target) < _buckets[n][1]; source and target are lists of token-ids.
  """
  data_set = [[] for _ in _buckets]
  with tf.gfile.GFile(source_path, mode="r") as source_file:
    with tf.gfile.GFile(target_path, mode="r") as target_file:
      source, target = source_file.readline(), target_file.readline()
      counter = 0
      while source and target and (not max_size or counter < max_size):
        counter += 1
        if counter % 100000 == 0:
          print("  reading data line %d" % counter)
          sys.stdout.flush()
        source_ids = [int(x) for x in source.split()]
        target_ids = [int(x) for x in target.split()]
        target_ids.append(data_utils.EOS_ID)
        for bucket_id, (source_size, target_size) in enumerate(_buckets):
          if len(source_ids) < source_size and len(target_ids) < target_size:
            data_set[bucket_id].append([source_ids, target_ids])
            break
        source, target = source_file.readline(), target_file.readline()
  return data_set
def create_model(session, forward_only):
  """Create model and initialize or load parameters"""
  model = seq2seq_model.Seq2SeqModel( gConfig['enc_vocab_size'], gConfig['dec_vocab_size'], _buckets, gConfig['layer_size'], gConfig['num_layers'], gConfig['max_gradient_norm'], gConfig['batch_size'], gConfig['learning_rate'], gConfig['learning_rate_decay_factor'], forward_only=forward_only)
  if 'pretrained_model' in gConfig:
      model.saver.restore(session,gConfig['pretrained_model'])
      return model
  ckpt = tf.train.get_checkpoint_state(gConfig['working_directory'])
  if ckpt and ckpt.model_checkpoint_path:
    print("Reading model parameters from %s" % ckpt.model_checkpoint_path)
    model.saver.restore(session, ckpt.model_checkpoint_path)
  else:
    print("Created model with fresh parameters.")
    session.run(tf.global_variables_initializer())
  return model
def train():
  # Prepare the dataset.
  print("Preparing data in %s" % gConfig['working_directory'])
  enc_train, dec_train, enc_dev, dec_dev, _, _ = data_utils.prepare_custom_data(gConfig['working_directory'],gConfig['train_enc'],gConfig['train_dec'],gConfig['test_enc'],gConfig['test_dec'],gConfig['enc_vocab_size'],gConfig['dec_vocab_size'])
  # setup config to use BFC allocator
  config = tf.ConfigProto()
  config.gpu_options.allocator_type = 'BFC'
  with tf.Session(config=config) as sess:
    # Create model.
    print("Creating %d layers of %d units." % (gConfig['num_layers'], gConfig['layer_size']))
    model = create_model(sess, False)
    # Read data into buckets and compute their sizes.
    print ("Reading development and training data (limit: %d)."
           % gConfig['max_train_data_size'])
    dev_set = read_data(enc_dev, dec_dev)
    train_set = read_data(enc_train, dec_train, gConfig['max_train_data_size'])
    train_bucket_sizes = [len(train_set[b]) for b in xrange(len(_buckets))]
    train_total_size = float(sum(train_bucket_sizes))
    # A bucket scale is a list of increasing numbers from 0 to 1 that we'll use
    # to select a bucket. Length of [scale[i], scale[i+1]] is proportional to
    # the size if i-th training bucket, as used later.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                           for i in xrange(len(train_bucket_sizes))]
    # This is the training loop.
    step_time, loss = 0.0, 0.0
    current_step = 0
    previous_losses = []
    while True:
      # Choose a bucket according to data distribution. We pick a random number
      # in [0, 1] and use the corresponding interval in train_buckets_scale.
      random_number_01 = np.random.random_sample()
      bucket_id = min([i for i in xrange(len(train_buckets_scale))
                       if train_buckets_scale[i] > random_number_01])
      # Get a batch and make a step.
      start_time = time.time()
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
         train_set, bucket_id)
      _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                   target_weights, bucket_id, False)
      step_time += (time.time() - start_time) / gConfig['steps_per_checkpoint']
      loss += step_loss / gConfig['steps_per_checkpoint']
      current_step += 1
      # Once in a while, we save checkpoint, print statistics, and run evals.
      if current_step % gConfig['steps_per_checkpoint'] == 0:
        # Print statistics for the previous epoch.
        perplexity = math.exp(loss) if loss < 300 else float('inf')
        print ("global step %d learning rate %.4f step-time %.2f perplexity "
               "%.2f" % (model.global_step.eval(), model.learning_rate.eval(),
                         step_time, perplexity))
        # Decrease learning rate if no improvement was seen over last 3 times.
        if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
          sess.run(model.learning_rate_decay_op)
        previous_losses.append(loss)
        # Save checkpoint and zero timer and loss.
        checkpoint_path = os.path.join(gConfig['working_directory'], "seq2seq.ckpt")
        model.saver.save(sess, checkpoint_path, global_step=model.global_step)
        step_time, loss = 0.0, 0.0
        # Run evals on development set and print their perplexity.
        for bucket_id in xrange(len(_buckets)):
          if len(dev_set[bucket_id]) == 0:
            print("  eval: empty bucket %d" % (bucket_id))
            continue
          encoder_inputs, decoder_inputs, target_weights = model.get_batch(
              dev_set, bucket_id)
          _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
          eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
          print("  eval: bucket %d perplexity %.2f" % (bucket_id, eval_ppx))
        sys.stdout.flush()
def decode():
  with tf.Session() as sess:
    # Create model and load parameters.
    model = create_model(sess, True)
    model.batch_size = 1  # We decode one sentence at a time.
    # Load vocabularies.
    enc_vocab_path = os.path.join(gConfig['working_directory'],"vocab%d.enc" % gConfig['enc_vocab_size'])
    dec_vocab_path = os.path.join(gConfig['working_directory'],"vocab%d.dec" % gConfig['dec_vocab_size'])
    enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
    _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)
    # Decode from standard input.
    sys.stdout.write("> ")
    sys.stdout.flush()
    sentence = sys.stdin.readline()
    while sentence:
      # Get token-ids for the input sentence.
      token_ids = data_utils.sentence_to_token_ids(tf.compat.as_bytes(sentence), enc_vocab)
      # Which bucket does it belong to?
      bucket_id = min([b for b in xrange(len(_buckets))
                       if _buckets[b][0] > len(token_ids)])
      # Get a 1-element batch to feed the sentence to the model.
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          {bucket_id: [(token_ids, [])]}, bucket_id)
      # Get output logits for the sentence.
      _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                       target_weights, bucket_id, True)
      # This is a greedy decoder - outputs are just argmaxes of output_logits.
      outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
      # If there is an EOS symbol in outputs, cut them at that point.
      if data_utils.EOS_ID in outputs:
        outputs = outputs[:outputs.index(data_utils.EOS_ID)]
      # Print out the reply sentence corresponding to outputs.
      print(" ".join([tf.compat.as_str(rev_dec_vocab[output]) for output in outputs]))
      print("> ", end="")
      sys.stdout.flush()
      sentence = sys.stdin.readline()
def self_test():
  """Test the translation model."""
  with tf.Session() as sess:
    print("Self-test for neural translation model.")
    # Create model with vocabularies of 10, 2 small buckets, 2 layers of 32.
    model = seq2seq_model.Seq2SeqModel(10, 10, [(3, 3), (6, 6)], 32, 2,
                                       5.0, 32, 0.3, 0.99, num_samples=8)
    sess.run(tf.global_variables_initializer())
    # Fake data set for both the (3, 3) and (6, 6) bucket.
    data_set = ([([1, 1], [2, 2]), ([3, 3], [4]), ([5], [6])],
                [([1, 1, 1, 1, 1], [2, 2, 2, 2, 2]), ([3, 3, 3], [5, 6])])
    for _ in xrange(5):  # Train the fake model for 5 steps.
      bucket_id = random.choice([0, 1])
      encoder_inputs, decoder_inputs, target_weights = model.get_batch(
          data_set, bucket_id)
      model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                 bucket_id, False)
def init_session(sess, conf='seq2seq.ini'):
    global gConfig
    gConfig = get_config(conf)
    # Create model and load parameters.
    model = create_model(sess, True)
    model.batch_size = 1  # We decode one sentence at a time.
    # Load vocabularies.
    enc_vocab_path = os.path.join(gConfig['working_directory'],"vocab%d.enc" % gConfig['enc_vocab_size'])
    dec_vocab_path = os.path.join(gConfig['working_directory'],"vocab%d.dec" % gConfig['dec_vocab_size'])
    enc_vocab, _ = data_utils.initialize_vocabulary(enc_vocab_path)
    _, rev_dec_vocab = data_utils.initialize_vocabulary(dec_vocab_path)
    return sess, model, enc_vocab, rev_dec_vocab
def decode_line(sess, model, enc_vocab, rev_dec_vocab, sentence):
    # Get token-ids for the input sentence.
    token_ids = data_utils.sentence_to_token_ids(tf.compat.as_bytes(sentence), enc_vocab)
    # Which bucket does it belong to?
    bucket_id = min([b for b in xrange(len(_buckets)) if _buckets[b][0] > len(token_ids)])
    # Get a 1-element batch to feed the sentence to the model.
    encoder_inputs, decoder_inputs, target_weights = model.get_batch({bucket_id: [(token_ids, [])]}, bucket_id)
    # Get output logits for the sentence.
    _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
    # This is a greedy decoder - outputs are just argmaxes of output_logits.
    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
    # If there is an EOS symbol in outputs, cut them at that point.
    if data_utils.EOS_ID in outputs:
        outputs = outputs[:outputs.index(data_utils.EOS_ID)]
    return " ".join([tf.compat.as_str(rev_dec_vocab[output]) for output in outputs])
if __name__ == '__main__':
    if len(sys.argv) - 1:
        gConfig = get_config(sys.argv[1])
    else:
        # get configuration from seq2seq.ini
        gConfig = get_config()
    print('\n>> Mode : %s\n' %(gConfig['mode']))
    if gConfig['mode'] == 'train':
        # start training
        train()
    elif gConfig['mode'] == 'test':
        # interactive decode
        decode()
    else:
        # wrong way to execute "serve"
        #   Use : >> python ui/app.py
        #           uses seq2seq_serve.ini as conf file
        print('Serve Usage : >> python ui/app.py')
        print('# uses seq2seq_serve.ini as conf file')
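
Two small pieces of logic in this script are worth isolating: choosing the bucket for an input sentence and greedy decoding of the output logits. The sketch below restates them without TensorFlow. The function names are assumptions, the logits are fabricated one-hot rows, and unlike the script it raises a clear error when a sentence is longer than the largest bucket.

```python
import numpy as np

_buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
EOS_ID = 2

def pick_bucket(num_tokens, buckets=_buckets):
    """Return the id of the smallest bucket whose input size exceeds num_tokens."""
    for bucket_id, (source_size, _) in enumerate(buckets):
        if source_size > num_tokens:
            return bucket_id
    raise ValueError("sentence of %d tokens exceeds the largest bucket" % num_tokens)

def greedy_decode(output_logits):
    """Take the argmax at every step and cut the result at the first EOS.

    output_logits: list of arrays shaped (1, vocab_size), one per decoder step.
    """
    outputs = [int(np.argmax(logit[0])) for logit in output_logits]
    if EOS_ID in outputs:
        outputs = outputs[:outputs.index(EOS_ID)]
    return outputs

# Fabricated logits for four decoder steps over a 6-word vocabulary.
step_logits = [np.eye(6)[[4]], np.eye(6)[[5]], np.eye(6)[[2]], np.eye(6)[[3]]]
reply_ids = greedy_decode(step_logits)
```

Greedy decoding keeps only the single most likely word at each step; it is simple and fast, but alternatives such as beam search can produce better replies.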

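A text-based bot combined with speech recognition becomes a bot you can talk to directly. System architecture:
person -> automatic speech recognition (ASR) -> natural language understanding (NLU) -> dialogue management -> natural language generation (NLG) -> text-to-speech synthesis (TTS) -> person. Source: Communications of the Chinese Association for Artificial Intelligence (《中國人工智能學會通信》), vol. 6, no. 1, 2016.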

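Turing Robot works on improving dialogue and semantic accuracy and on making bots more intelligent in Chinese-language contexts. Emotibot (竹間智能科技) researches emotional robots with memory and self-learning, aiming for robots that truly understand multimodal, multichannel information, respond in a highly humanlike way, and communicate in the most natural conversational style. Tencent has social conversation data: WeChat is the largest corpus of natural-language communication, and by exploiting this huge body of real data together with mini programs it can become the entry point to every service.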

References:
《TensorFlow技術解析與實戰》
