In the previous sections we gave a general introduction to intent recognition; abstracted, it is essentially a classification problem. Structurally, we use an LSTM to extract features and a softmax layer for the final multi-class decision. Because of corpus limitations, we currently consider only three intent classes: radio station, music, and question answering. Supporting more intent classes simply means adding corpus data for those classes and changing the number of softmax outputs. The final goal is to reach 90% classification accuracy on these three classes.
We will use Keras (strictly speaking, only a high-level interface) to implement this intent recognition task.
Figure 1: Intent classification training pipeline
The overall pipeline is shown in the figure. First we preprocess the corpus, which includes removing punctuation, removing stopwords, and so on. Once the corpus is prepared, we use word2vec to generate word vectors, then use an LSTM to extract features, and finally use softmax to perform the intent classification. The overall flow is very straightforward.
Our data consists of three files: question.txt, music.txt, and station.txt. The format is shown below; organize your training data the same way, and adding more classes works in exactly the same manner.
music.txt
我想聽千千闕歌 (I want to listen to the song 千千闕歌)
汪峯的歌曲 (songs by Wang Feng)
question.txt
天爲甚麼這麼藍 (Why is the sky so blue)
中國有多大 (How big is China)
station.txt
我要聽郭德綱的相聲 (I want to listen to Guo Degang's crosstalk)
交通廣播電臺 (the traffic radio station)
On corpus preprocessing, our work is currently quite crude: we merely extract the corpus in a 1:1:1 ratio across the three classes for training. Here is a question worth thinking about: why should the different classes be kept as close to a 1:1:1 ratio as possible during training?
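(A short answer: an unbalanced training set tends to bias the classifier toward the majority class.) To make the 1:1:1 balancing concrete, here is a minimal sketch that draws the same number of queries from each class file before training. The file names match the classifier script below; the sample size of 5,000 per class (roughly a third of the fifteen thousand sentences mentioned later) is an assumption, not a value from the original post.

# Minimal sketch: keep the three intent classes balanced at 1:1:1 by sampling
# the same number of queries from each file. sample_size is an assumed value.
import random

def load_balanced(sample_size=5000):
    files = ['data/question_query.txt',
             'data/music_query.txt',
             'data/station_query.txt']
    balanced = {}
    for path in files:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        random.shuffle(lines)
        balanced[path] = lines[:sample_size]   # same count for every class
    return balanced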
Generating word vectors
Generating word vectors is the process of converting the corpus from text into numbers so that it can be handled by the rest of the program. We use word2vec directly for this and will not go into its underlying principles here. For this training step, we feed in all fifteen thousand sentences.
# -*- coding: UTF-8 -*-
import os
import numpy as np
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary

class Embedding(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # yield one tokenised sentence per line, across every file in the directory
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

if __name__ == '__main__':
    # train the word2vec model
    sentences = Embedding('../data/')  # a memory-friendly iterator
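The listing above stops just short of the actual training call. Below is a minimal sketch of the missing step, assuming the same hyperparameters that the classifier script declares later (vocab_dim=100, n_exposures=10, window_size=7, n_iterations=1) and the model path that script loads from; the exact values the author used are not shown in the post, so treat these as placeholders.

# Assumed continuation of the script above: train on the corpus iterator and
# save the model where the classifier script expects to find it.
model = Word2Vec(sentences,
                 size=100,       # vocab_dim
                 min_count=10,   # n_exposures
                 window=7,       # window_size
                 iter=1,         # n_iterations -- ideally more
                 workers=4)
model.save('lstm_data/model/Word2vec_model.pkl')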
Figure 2: Multi-layer LSTM feature extraction with a softmax layer on top for 3-way classification
# -*- coding: utf-8 -*-

import yaml
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from sklearn.cross_validation import train_test_split
import multiprocessing
import numpy as np
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec
from gensim.corpora.dictionary import Dictionary

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout, Activation
from keras.models import model_from_yaml
from sklearn.preprocessing import LabelEncoder
np.random.seed(1337)  # For Reproducibility
import jieba
import pandas as pd
sys.setrecursionlimit(1000000)

# set parameters:
vocab_dim = 100
maxlen = 100
n_iterations = 1  # ideally more..
n_exposures = 10
window_size = 7
batch_size = 32
n_epoch = 15
input_length = 100
cpu_count = multiprocessing.cpu_count()

# load the training files
def loadfile():
    fopen = open('data/question_query.txt', 'r')
    question = []
    for line in fopen:
        question.append(line)

    fopen = open('data/music_query.txt', 'r')
    music = []
    for line in fopen:
        music.append(line)

    fopen = open('data/station_query.txt', 'r')
    station = []
    for line in fopen:
        station.append(line)

    combined = np.concatenate((station, music, question))
    question_array = np.array([-1]*len(question), dtype=int)
    station_array = np.array([0]*len(station), dtype=int)
    music_array = np.array([1]*len(music), dtype=int)
    # labels follow the same order as combined: station=0, music=1, question=-1
    y = np.hstack((station_array, music_array, question_array))
    print "y is:"
    print y.size
    print "combined is:"
    print combined.size
    return combined, y
# segment each sentence with jieba and strip the newline characters
def tokenizer(document):
    ''' Segments each document with jieba and joins the tokens with
        whitespace, stripping line breaks.
    '''
    result_list = []
    for text in document:
        result_list.append(' '.join(jieba.cut(text)).encode('utf-8').strip())
    return result_list
# build the word dictionary and return each word's index, its word vector,
# and the index sequence for each sentence
def create_dictionaries(model=None, combined=None):
    ''' This function does a number of jobs:
        1- creates a word-to-index mapping
        2- creates a word-to-vector mapping
        3- transforms the training and testing data
        4- returns the padded index sequence of every sentence
    '''
    if (combined is not None) and (model is not None):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
        w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word kept by the word2vec model (frequency above the min_count threshold)
        w2vec = {word: model[word] for word in w2indx.keys()}  # vector of every kept word

        def parse_dataset(combined):
            ''' Words become integers '''
            data = []
            for sentence in combined:
                new_txt = []
                sentences = sentence.split(' ')
                for word in sentences:
                    try:
                        word = unicode(word, errors='ignore')
                        new_txt.append(w2indx[word])
                    except:
                        new_txt.append(0)
                data.append(new_txt)
            return data

        combined = parse_dataset(combined)
        combined = sequence.pad_sequences(combined, maxlen=maxlen)  # index sequence per sentence; out-of-vocabulary words get index 0
        return w2indx, w2vec, combined
    else:
        print 'No data provided...'
# load the word2vec model and return each word's index, its word vector,
# and the index sequence for each sentence
def word2vec_train(combined):
    # load the trained word2vec model
    model = Word2Vec.load('lstm_data/model/Word2vec_model.pkl')
    index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
    return index_dict, word_vectors, combined

def get_data(index_dict, word_vectors, combined, y):
    # build the embedding matrix and split the data for training
    n_symbols = len(index_dict) + 1  # total number of word indices; low-frequency words share index 0, hence the +1
    embedding_weights = np.zeros((n_symbols, vocab_dim))  # the word at index 0 keeps an all-zero vector
    for word, index in index_dict.items():  # starting from index 1, fill in each word's vector
        embedding_weights[index, :] = word_vectors[word]
    x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2)
    # encode class values as integers
    encoder = LabelEncoder()
    encoded_y_train = encoder.fit_transform(y_train)
    encoded_y_test = encoder.transform(y_test)
    # convert integers to dummy variables (one-hot encoding)
    y_train = np_utils.to_categorical(encoded_y_train)
    y_test = np_utils.to_categorical(encoded_y_test)
    print x_train.shape, y_train.shape
    return n_symbols, embedding_weights, x_train, y_train, x_test, y_test
## define the network structure
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    nb_classes = 3
    print 'Defining a Simple Keras Model...'
    ## basic network structure
    model = Sequential()  # or Graph or whatever
    ## the Embedding layer maps the variable-length index sequences to fixed-length dense vectors for the LSTM
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
    ## a single LSTM layer: output dimension 50, input dimension vocab_dim, relu activation
    model.add(LSTM(output_dim=50, activation='relu', inner_activation='hard_sigmoid'))
    model.add(Dropout(0.5))
    ## attach a softmax layer for the final 3-way classification
    model.add(Dense(output_dim=nb_classes, input_dim=50, activation='softmax'))
    print 'Compiling the Model...'
    ## the optimizer is adam
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    print "Train..."
    print y_train
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
              verbose=1, validation_data=(x_test, y_test))
    print "Evaluate..."
    score = model.evaluate(x_test, y_test, batch_size=batch_size)
    yaml_string = model.to_yaml()
    with open('lstm_data/lstm_koubei.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm_koubei.h5')
    print 'Test score:', score
# train the model and save it
def train():
    print 'Loading Data...'
    combined, y = loadfile()
    print len(combined), len(y)
    print 'Tokenising...'
    combined = tokenizer(combined)
    print 'Training a Word2vec model...'
    index_dict, word_vectors, combined = word2vec_train(combined)
    print 'Setting up Arrays for Keras Embedding Layer...'
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined, y)
    print x_train.shape, y_train.shape
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)

# train the model and save it (identical to train(); this is the entry point used below)
def self_train():
    print 'Loading Data...'
    combined, y = loadfile()
    print len(combined), len(y)
    print 'Tokenising...'
    combined = tokenizer(combined)
    print 'Training a Word2vec model...'
    index_dict, word_vectors, combined = word2vec_train(combined)
    print 'Setting up Arrays for Keras Embedding Layer...'
    n_symbols, embedding_weights, x_train, y_train, x_test, y_test = get_data(index_dict, word_vectors, combined, y)
    print x_train.shape, y_train.shape
    train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test)

# turn a raw query into the padded index sequence expected by the model
def input_transform(string):
    words = ' '.join(jieba.cut(string)).encode('utf-8').strip()
    tmp_list = []
    tmp_list.append(words)
    model = Word2Vec.load('lstm_data/model/Word2vec_model.pkl')
    _, _, combined = create_dictionaries(model, tmp_list)
    return combined

if __name__ == '__main__':
    self_train()
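The script defines input_transform(), but the prediction code itself is not part of the excerpt above. Purely as an illustration (this is not the author's code), the saved architecture and weights could be used roughly as follows; lstm_predict_demo is a hypothetical name.

# Illustrative only: load the artifacts saved by train_lstm() and classify a
# new query with the input_transform() helper defined above.
import yaml
from keras.models import model_from_yaml

def lstm_predict_demo(text):
    with open('lstm_data/lstm_koubei.yml', 'r') as f:
        yaml_string = yaml.load(f)
    # for the stacked model below, also pass custom_objects={'NonMasking': NonMasking}
    model = model_from_yaml(yaml_string)
    model.load_weights('lstm_data/lstm_koubei.h5')
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    data = input_transform(text)         # padded word-index sequence
    probabilities = model.predict(data)  # softmax scores over the three intents
    return probabilities.argmax(axis=-1)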
With a single-layer LSTM, the training accuracy already exceeds 96% after 15 epochs. Going a step further: can stacking LSTM layers reach an even higher training accuracy? Keeping everything else unchanged, we modify only the network definition:
## define the network structure (stacked LSTM version)
def train_lstm(n_symbols, embedding_weights, x_train, y_train, x_test, y_test):
    nb_classes = 3
    print 'Defining a Simple Keras Model...'
    model = Sequential()  # or Graph or whatever
    model.add(Embedding(output_dim=vocab_dim,
                        input_dim=n_symbols,
                        mask_zero=True,
                        weights=[embedding_weights],
                        input_length=input_length))  # Adding Input Length
    print vocab_dim
    print n_symbols
    ## two stacked LSTM layers; return_sequences=True passes the full sequence on to the next layer
    model.add(LSTM(64, input_dim=vocab_dim, activation='relu', return_sequences=True))
    model.add(LSTM(32, return_sequences=True))
    model.add(Dropout(0.5))
    print model.summary()
    model.add(NonMasking())   # custom layer that strips the mask so Flatten can follow
    model.add(Flatten())
    model.add(Dense(output_dim=nb_classes, activation='softmax'))
    print 'Compiling the Model...'
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    print "Train..."
    print y_train
    model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
              verbose=1, validation_data=(x_test, y_test))
    print "Evaluate..."
    score = model.evaluate(x_test, y_test, batch_size=batch_size)

    yaml_string = model.to_yaml()
    with open('lstm_data/lstm_koubei.yml', 'w') as outfile:
        outfile.write(yaml.dump(yaml_string, default_flow_style=True))
    model.save_weights('lstm_data/lstm_koubei.h5')
    print 'Test score:', score
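The stacked version calls NonMasking() and Flatten(), neither of which appears in the earlier import list (Flatten lives in keras.layers.core). The post does not show the NonMasking definition; it is a small custom layer that passes its input through unchanged while discarding the mask created by mask_zero=True, since Flatten cannot handle masked input. A common implementation for this generation of Keras looks like the sketch below; treat it as an assumption about what the author used.

# Sketch of the NonMasking helper assumed above: an identity layer that drops
# the incoming mask so that Flatten() can follow a masked Embedding/LSTM stack.
from keras.engine.topology import Layer

class NonMasking(Layer):
    def __init__(self, **kwargs):
        self.supports_masking = True
        super(NonMasking, self).__init__(**kwargs)

    def build(self, input_shape):
        pass

    def compute_mask(self, inputs, input_mask=None):
        return None   # do not propagate the mask

    def call(self, x, mask=None):
        return x

    def get_output_shape_for(self, input_shape):
        return input_shape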
We find that with the same 15 training epochs, the training accuracy reaches roughly 97%. This shows that stacking LSTM layers is indeed effective and captures the features of the training corpus better.
Training reflections and summary
At this point we can only claim to have built a demo of intent recognition. It already reaches a fairly high training accuracy, but there are still many areas to improve. First, and most obviously, our training corpus is small and covers only a few categories; we would like to scale up to more data and more intent classes while keeping the accuracy. Second, the corpus preprocessing is very crude: we did not remove stopwords or punctuation. We only got away with this because our training corpus is relatively clean. Third, the word segmentation is equally crude: we use jieba with its default dictionary, which does not match the vocabulary of our domain. Fourth, we would also like to try a CNN and compare its feature extraction with the LSTM.
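For the second and third points, a minimal sketch of the intended improvements is shown below: loading a domain dictionary into jieba and filtering stopwords during segmentation. The file names userdict.txt and stopwords.txt are placeholders, not files from the original project.

# Sketch of the preprocessing improvements discussed above; file names are placeholders.
import jieba

jieba.load_userdict('data/userdict.txt')   # add domain-specific vocabulary

with open('data/stopwords.txt') as f:
    stopwords = set(line.strip() for line in f if line.strip())

def segment(text):
    # segment with the augmented dictionary and drop stopword / punctuation tokens
    return [w for w in jieba.cut(text.strip()) if w and w not in stopwords]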
Even so, you can see the enormous power of deep learning in natural language processing: instead of painstakingly hand-crafting unigram, bigram and similar features, we describe the text with embeddings. This saves a great deal of manual effort, and the resulting accuracy far exceeds our expectations.