Catalogue
0. Introduction
1. Training Corpus
2. Data Preprocessing
3. Words to Vectors
4. Training
5. Chatbot - Validating the Results
0. Introduction
I'm not a machine-learning algorithms specialist by trade. Three months ago I started catching up on neural networks: convolutions, network architectures, a whole pile of basic concepts. Honestly, it is a bit involved, but once those basic mathematical ideas clicked, I found that, stumbling along, I could actually read the TensorFlow API and the Python code, and even grasp a little of the intent behind it.
0x1: Model Taxonomy
1. Retrieval-Based Models vs. Generative Models
Retrieval-Based Models rely on a predefined repository of responses, together with heuristics that pick a suitable response given the input question and its context. These heuristics range from simple rule-based expression matching to ensembles of fairly complex machine-learning classifiers. A retrieval-based model never produces new text; it can only select the most appropriate answer from the predefined repository.
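To make this concrete, here is a toy sketch of a retrieval-based responder (my own illustration, not from any cited system): the repository pairs keyword sets with canned responses, and the "heuristic" is simple keyword overlap.

# Toy retrieval-based model: pick the canned response whose keywords
# overlap most with the user's question. Repository contents are invented.
repository = [
    ({"refund", "return", "money"}, "You can request a refund within 30 days."),
    ({"shipping", "deliver", "arrive"}, "Standard shipping takes 3-5 business days."),
    ({"hello", "hi", "hey"}, "Hello! How can I help you?"),
]

def retrieve_response(question):
    words = set(question.lower().split())
    best_score, best_response = 0, "Sorry, I don't understand."
    for keywords, canned in repository:
        score = len(words & keywords)   # heuristic: keyword overlap
        if score > best_score:
            best_score, best_response = score, canned
    return best_response

print(retrieve_response("when will my order arrive"))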
Generative Models do not depend on a predefined response set; they generate a new response from scratch. Classic generative models are based on machine-translation techniques, except that instead of translating one language into another, they "translate" the question into a response.
2. Long-Conversation Models vs. Short-Conversation Models
A short conversation is a single-turn, one-question-one-answer exchange: when the machine receives a question from the user, it returns a suitable answer. By contrast, a long conversation is a multi-turn, back-and-forth exchange, such as two friends discussing some topic. In that setting, both parties (the chatbot may be one of them) must remember what has already been discussed, which is one of the differences from the short-conversation setting. Customer-service bot systems are typically long-conversation models.
3. Open-Domain Models vs. Closed-Domain Models
In the open-domain setting, the user can say anything, with no particular goal or intent behind the query. Conversations on social networks such as Twitter and Reddit are typical open-domain scenarios. Because the number of possible topics is unbounded, and a certain amount of common sense is needed as a basis for chatting, building such a chatbot is relatively difficult.
The closed-domain setting, also called goal-driven, focuses the system on problems in a specific domain, so the space of possible questions and answers is relatively limited. Technical support systems and shopping assistants are examples of closed-domain models. We don't require these systems to discuss politics; we only need them to solve our problem as effectively as possible. Users can still ask wildly off-topic questions, but the system is equally free to answer wildly off-topic ;)
Relevant Link:
http://naturali.io/deeplearning/chatbot/introduction/2016/04/28/chatbot-part1.html
http://blog.topspeedsnail.com/archives/10735/comment-page-1#comment-1161
http://blog.csdn.net/malefactor/article/details/51901115
1. Training Corpus
wget https://raw.githubusercontent.com/rustch3n/dgk_lost_conv/master/dgk_shooter_min.conv.zip
# unzip the archive
unzip dgk_shooter_min.conv.zip
Relevant Link:
https://github.com/rustch3n/dgk_lost_conv
2. Data Preprocessing
Generally, the base corpus we get consists of movie-dialogue lines or something like the Ubuntu Dialog Corpus, but either way we basically have to complete the following major steps:
1. Tokenization
2. Stemming (for English words)
3. Lemmatization of inflected English forms (for example, normalizing singular/plural)
4. Additionally, proper nouns such as person names, place names, organization names, URLs, and filesystem paths can be replaced with unified type identifiers (a sketch follows this list)
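As an illustration of step 4, here is a minimal sketch of type-identifier substitution (my own example; the __URL__ / __PATH__ tokens are assumptions, not from the article):

import re

def normalize_entities(text):
    # replace URLs and filesystem paths with unified type identifiers
    text = re.sub(r'https?://\S+', '__URL__', text)
    text = re.sub(r'(?:/[\w.-]+){2,}', '__PATH__', text)
    return text

print(normalize_entities("docs at http://example.com and binary at /usr/local/bin/python"))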
In the corpus, M marks an utterance and E marks a separator. On an M we append the current utterance to a temporary conversation buffer; on an E, which signals a break or a change of speakers, we flush the whole buffer into the overall conversation set convs, one conversation at a time. Think of it as the director's "cut!" on a film set.
convs = []  # conversation set
with open(conv_path, encoding="utf8") as f:
    one_conv = []  # a complete conversation
    for line in f:
        line = line.strip('\n').replace('/', '')
        if line == '':
            continue
        if line[0] == 'E':
            # separator: flush the finished conversation, if any
            if one_conv:
                convs.append(one_conv)
            one_conv = []
        elif line[0] == 'M':
            # utterance: keep the text after the 'M ' prefix
            one_conv.append(line.split(' ')[1])
Since the target scenario is a chatbot, and movie dialogue alternates one line per speaker, we need to discard two special cases: conversations with only a question or only an answer, and conversations where questions and answers don't pair up, i.e. the last speaker asks and never gets a reply.
# split each conversation into alternating ask/answer pairs
ask = []       # questions
response = []  # answers
for conv in convs:
    if len(conv) == 1:
        continue
    if len(conv) % 2 != 0:
        # odd number of utterances: drop the trailing unanswered question
        conv = conv[:-1]
    for i in range(len(conv)):
        if i % 2 == 0:
            ask.append(conv[i])
        else:
            response.append(conv[i])
3. Words to Vectors
We know that image recognition and speech recognition were the first fields where deep learning achieved major results, and one reason is that the raw input data in these two domains inherently carries strong internal correlation: for example, the distribution of pixel weights is basically consistent across different images of the same class of object. This mirrors how the human brain recognizes objects of the same kind, the ability we usually call "learning by analogy": the more characters we have learned, the better we can wield them, even combining them into new usages and writing florid prose.
The input data of NLP and semantic understanding, however — dialogue, or corpus text — usually lacks this strong correlation. To address that, we introduce a modeling concept: word vectors (word2vec) or, at the sequence level, seq2seq. Put simply, the words of the corpus are mapped into a vector space, where the arrangement of the vectors is determined by grammar and semantic context. For example, "中國 -> 人" (the character 人 is extremely likely to follow 中國), or "你今年幾歲了 -> 我 ** 歲了".
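As a toy illustration of "similar context means nearby vectors" (the 3-d vectors below are invented for the example, not learned from the corpus):

import numpy as np

def cosine(a, b):
    # cosine similarity between two word vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec = {
    "中國": np.array([0.9, 0.1, 0.0]),
    "人":   np.array([0.8, 0.2, 0.1]),
    "歲":   np.array([0.1, 0.9, 0.3]),
}
print(cosine(vec["中國"], vec["人"]))   # high: words that share contexts sit close together
print(cosine(vec["中國"], vec["歲"]))   # lower: different contexts, farther apart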
0x1: Tokenization and Word Encoding
Split every line of dialogue in the training set into individual characters to build a word table (vocabulary).
def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    with open(input_file) as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]  # split into characters
            for word in tokens:
                if word in vocabulary:
                    vocabulary[word] += 1
                else:
                    vocabulary[word] = 1
    vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
    # keep at most the 10000 most frequent characters
    if len(vocabulary_list) > 10000:
        vocabulary_list = vocabulary_list[:10000]
    print(input_file + " phrase table size:", len(vocabulary_list))
    with open(output_file, "w") as ff:
        for word in vocabulary_list:
            ff.write(word + "\n")
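The snippet above references START_VOCABULART (spelling as in the original code) without defining it; it presumably holds the special tokens prepended to the vocabulary so that they occupy the first ids. A plausible definition, consistent with the PAD_ID/GO_ID/EOS_ID/UNK_ID constants used in the training script later, would be:

# assumed definition: special tokens first, so they get ids 0..3
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']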
Once tokenization is done, the words need numeric codes so that later stages can work in a vector space. The core idea is as follows:
The dialogues in our training corpus are strongly correlated with one another, so the word table derived from this correlated dialogue set also carries logical relations between words. Therefore, if we encode each word simply by its position in the table's native order, the resulting [word, id] table already carries vector-space relatedness.
def convert_conversation_to_vector(input_file, vocabulary_file, output_file):
    tmp_vocab = []
    with open(vocabulary_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    # invert the word list into a {word: id} dict, id = line position
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    for item in vocab:
        print(item)
So the word table derived from the training corpus serves as the basis for vectorizing both the training conversations and the test conversations; our goal is to map the questions and answers of both sets into the vector space.
土 968 — the character 土 sits at position 968 in the training-set vocabulary, so we assign it the code 968.
0x2: Converting Conversations to Vectors
The original author trimmed the word table, keeping only the top 5000 words; thinking it over, though, I feel the root of the problem is that the training corpus is not rich enough to cover all conversational scenarios.
This step yields a set of ask/answer sentence sequences in the vector space; for the training set, we establish the mapping between each ask and its answer (a sketch of the conversion follows).
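The article doesn't show the conversion code itself; below is a minimal sketch of how the ask/answer sentences could be written out as id sequences, assuming the {word: id} vocab dict built above (the helper name sentences_to_vec and the unk_id default are my own):

def sentences_to_vec(sentences, vocab, output_file, unk_id=3):
    # map every character of every sentence to its vocabulary id;
    # characters missing from the vocabulary fall back to UNK
    with open(output_file, "w") as f:
        for sentence in sentences:
            ids = [str(vocab.get(ch, unk_id)) for ch in sentence.strip()]
            f.write(" ".join(ids) + "\n")

# usage, matching the file names the training script expects:
# sentences_to_vec(ask, vocab, "train_ask.vec")
# sentences_to_vec(response, vocab, "train_answer.vec")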
4. Training
0x1: Sequence-to-sequence basics
A basic sequence-to-sequence model, as introduced in Cho et al., 2014, consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output.
Each box in the tutorial's diagram of this architecture represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. Encoder and decoder can share weights or, as is more common, use different sets of parameters. Multi-layer cells have also been used successfully in sequence-to-sequence models.
In the basic model described above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an attention mechanism was introduced in Bahdanau et al., 2014; suffice it to say that it allows the decoder to peek into the input at every decoding step. What we train here is a multi-layer sequence-to-sequence network with LSTM cells and an attention mechanism in the decoder (a numeric sketch of the attention idea follows).
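At its core, attention computes, at every decoding step, a softmax-weighted average of the encoder's hidden states. Here is a minimal numpy sketch of that idea (using a simple dot-product score rather than the additive score of Bahdanau et al.; shapes are illustrative):

import numpy as np

def attention_context(encoder_states, decoder_state):
    # encoder_states: (T, d) hidden states for T input positions
    # decoder_state:  (d,)   current decoder hidden state
    scores = encoder_states @ decoder_state        # alignment scores, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over input positions
    return weights @ encoder_states                # context vector, shape (d,)

context = attention_context(np.random.randn(5, 4), np.random.randn(4))
print(context.shape)  # (4,)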
0x2: Training Process
Feed the ask/answer training set into the neural network, using the answer vectors to drive backpropagation, and use the ask/answer test vector mapping set for evaluation; with a three-layer network, TensorFlow adjusts the weight parameters automatically, producing an ask-to-answer model.
# -*- coding: utf-8 -*-
import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import numpy as np
import math

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

# ask/answer conversation vector files
train_ask_vec_file = 'train_ask.vec'
train_answer_vec_file = 'train_answer.vec'
test_ask_vec_file = 'test_ask.vec'
test_answer_vec_file = 'test_answer.vec'

# word table 6000
vocabulary_ask_size = 6000
vocabulary_answer_size = 6000

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
layer_size = 256
num_layers = 3
batch_size = 64

# read the source (ask) and target (answer) .vec files into memory, bucketed by length
def read_data(source_path, target_path, max_size=None):
    data_set = [[] for _ in buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(EOS_ID)
                for bucket_id, (source_size, target_size) in enumerate(buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

if __name__ == '__main__':
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size,
                                       num_layers=num_layers, max_gradient_norm=5.0,
                                       batch_size=batch_size, learning_rate=0.5,
                                       learning_rate_decay_factor=0.97,
                                       forward_only=False)
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'  # avoid out-of-memory errors
    with tf.Session(config=config) as sess:
        # restore the previous training run, if any
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt is not None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            sess.run(tf.global_variables_initializer())
        train_set = read_data(train_ask_vec_file, train_answer_vec_file)
        test_set = read_data(test_ask_vec_file, test_answer_vec_file)
        train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
        train_total_size = float(sum(train_bucket_sizes))
        train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                               for i in range(len(train_bucket_sizes))]
        loss = 0.0
        total_step = 0
        previous_losses = []
        # train indefinitely, saving the model every 500 steps
        while True:
            random_number_01 = np.random.random_sample()
            bucket_id = min([i for i in range(len(train_buckets_scale))
                             if train_buckets_scale[i] > random_number_01])
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
            _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                         target_weights, bucket_id, False)
            loss += step_loss / 500
            total_step += 1
            print(total_step)
            if total_step % 500 == 0:
                print(model.global_step.eval(), model.learning_rate.eval(), loss)
                # if the model hasn't improved, decrease the learning rate
                if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                    sess.run(model.learning_rate_decay_op)
                previous_losses.append(loss)
                # save model
                checkpoint_path = "chatbot_seq2seq.ckpt"
                model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                loss = 0.0
                # evaluate the model on the test dataset
                for bucket_id in range(len(buckets)):
                    if len(test_set[bucket_id]) == 0:
                        continue
                    encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                    _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs,
                                                 target_weights, bucket_id, True)
                    eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                    print(bucket_id, eval_ppx)
Relevant Link:
https://www.tensorflow.org/tutorials/seq2seq
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
5. Chatbot - Validating the Results
# -*- coding: utf-8 -*-
import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import sys
import locale
import numpy as np

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

train_ask_vocabulary_file = "train_ask_vocabulary.vec"
train_answer_vocabulary_file = "train_answer_vocabulary.vec"

def read_vocabulary(input_file):
    tmp_vocab = []
    with open(input_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    return vocab, tmp_vocab

if __name__ == '__main__':
    vocab_en, _ = read_vocabulary(train_ask_vocabulary_file)
    _, vocab_de = read_vocabulary(train_answer_vocabulary_file)
    # word table 6000
    vocabulary_ask_size = 6000
    vocabulary_answer_size = 6000
    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
    layer_size = 256
    num_layers = 3
    batch_size = 1
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size,
                                       num_layers=num_layers, max_gradient_norm=5.0,
                                       batch_size=batch_size, learning_rate=0.5,
                                       learning_rate_decay_factor=0.99,
                                       forward_only=True)
    model.batch_size = 1
    with tf.Session() as sess:
        # restore the last training checkpoint
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt is not None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            print("model not found")
        while True:
            input_string = raw_input('me > ').decode(sys.stdin.encoding or locale.getpreferredencoding(True)).strip()
            # quit
            if input_string == 'quit':
                exit()
            # convert the user's input to an id vector
            input_string_vec = []
            for words in input_string.strip():
                input_string_vec.append(vocab_en.get(words, UNK_ID))
            bucket_id = min([b for b in range(len(buckets))
                             if buckets[b][0] > len(input_string_vec)])
            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                {bucket_id: [(input_string_vec, [])]}, bucket_id)
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs,
                                             target_weights, bucket_id, True)
            # greedy decoding: take the argmax at each output position
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            if EOS_ID in outputs:
                outputs = outputs[:outputs.index(EOS_ID)]
            response = "".join([tf.compat.as_str(vocab_de[output]) for output in outputs])
            print('AI > ' + response)
Neural networks really do depend heavily on the training data. In my experiments, only after running past 20,000 steps on a GPU did the model's effect start to show, and only then did the exchanges begin to resemble normal human-machine dialogue.
Copyright (c) 2017 LittleHann All rights reserved