1 Overview
This text classification series will run to roughly ten posts, covering classification based on word2vec pre-trained word vectors as well as classification based on the latest pre-trained models (ELMo, BERT, etc.). The series includes the following parts:
word2vec pre-trained word vectors
textCNN model
charCNN model
Bi-LSTM model
RCNN model
All of the code is available in the textClassifier repository.
2 Dataset
The dataset is the IMDB movie review dataset. There are three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv and testData.tsv. Text classification requires labeled data, so we use labeledTrainData. The data is preprocessed in the same way as in Text Classification in Practice (1): word2vec pre-trained word vectors; the preprocessed file is /data/preprocess/labeledTrain.csv.
3 The ELMo pre-trained model
ELMo uses a BiLM (bidirectional language model) to pre-train vector representations of words, and it can dynamically generate word vectors for our own training set. The ELMo pre-trained model comes from the paper Deep contextualized word representations. For a detailed introduction, see the separate post ELMO模型(Deep contextualized word representation).
The ELMo model code is released on GitHub. When calling the pre-trained model we mainly use the code in the bilm folder, so you can copy the bilm folder into your own project path and import its classes and functions from there. In addition, the files usage_cached.py, usage_character.py and usage_token.py show how to call ELMo to dynamically generate word vectors. Here we use the approach in usage_token.py, which is computationally cheaper.
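To make the calling convention concrete, here is a minimal, self-contained sketch of the token-based usage in the spirit of usage_token.py. It assumes the pre-trained option/weight files described in the next paragraph (and a vocabulary/token-embedding file, see section 5) are already in place; the file paths and example sentence are purely illustrative, and the same TokenBatcher / BidirectionalLanguageModel / weight_layers calls appear again in the training code in section 9:

import tensorflow as tf
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers

# Illustrative paths; these match the names used in the configuration below
vocabFile = "modelParams/vocab.txt"
optionFile = "modelParams/elmo_options.json"
weightFile = "modelParams/elmo_weights.hdf5"
tokenEmbeddingFile = "modelParams/elmo_token_embeddings.hdf5"

# Placeholder for batches of token ids
tokenIds = tf.placeholder('int32', shape=(None, None))

# BiLM driven by pre-computed token embeddings instead of characters
bilm = BidirectionalLanguageModel(
    optionFile, weightFile,
    use_character_inputs=False,
    embedding_weight_file=tokenEmbeddingFile)

# Weighted sum over the BiLM layers -> contextual word vectors
elmoOp = weight_layers('input', bilm(tokenIds), l2_coef=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batcher = TokenBatcher(vocabFile)
    sentences = [["this", "movie", "was", "great"]]
    vectors = sess.run(elmoOp['weighted_op'],
                       feed_dict={tokenIds: batcher.batch_sentences(sentences)})
    print(vectors.shape)  # e.g. (batch, time, 256) with the Small model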
Before using it we also need to download the pre-trained model weights. Open https://allennlp.org/elmo; under the Pre-trained ELMo Models section there are four model sizes to choose from, and here we pick the Small model. Two files need to be downloaded: an "options" json file, which stores the model's configuration parameters, and a "weights" hdf5 file, which stores the model structure and weight values (you can open it with h5py to take a look).
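As a quick sanity check after downloading, the weights file can be opened with h5py and its contents listed. A small sketch; the file name below is simply the name used later in the configuration:

import h5py

# Walk the downloaded weights file and print every group/dataset it contains
with h5py.File("modelParams/elmo_weights.hdf5", "r") as f:
    f.visit(print)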
4 Configuration parameters
Here we need to set the paths for optionFile, vocabFile, weightFile and tokenEmbeddingFile. Another thing to note is that embeddingSize must equal the size of the ELMo word vectors.
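The ELMo vectors produced by weight_layers concatenate the forward and backward projections, so their size is 2 * projection_dim (256 for the Small model). A small, hedged check against the options file, assuming the standard options format with an lstm.projection_dim field:

import json

with open("modelParams/elmo_options.json") as f:
    options = json.load(f)

# ModelConfig.embeddingSize below should be set to this value
elmoOutputSize = 2 * options["lstm"]["projection_dim"]
print(elmoOutputSize)  # 256 for the Small model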
We also need to import the functions and classes from the bilm folder:
import os
import csv
import time
import datetime
import random
from collections import Counter
from math import sqrt

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, dump_token_embeddings, Batcher
# Configuration parameters
class TrainingConfig(object):
    epoches = 10
    evaluateEvery = 100
    checkpointEvery = 100
    learningRate = 0.001


class ModelConfig(object):
    embeddingSize = 256  # must match the output size of the ELMo model
    hiddenSizes = [128]  # number of units in each LSTM layer
    dropoutKeepProb = 0.5
    l2RegLambda = 0.0


class Config(object):
    sequenceLength = 200  # based on the mean length of all sequences
    batchSize = 128

    dataSource = "../data/preProcess/labeledTrain.csv"
    stopWordSource = "../data/english"

    optionFile = "modelParams/elmo_options.json"
    weightFile = "modelParams/elmo_weights.hdf5"
    vocabFile = "modelParams/vocab.txt"
    tokenEmbeddingFile = 'modelParams/elmo_token_embeddings.hdf5'

    numClasses = 2

    rate = 0.8  # proportion of the data used for training

    training = TrainingConfig()

    model = ModelConfig()


# Instantiate the configuration object
config = Config()
5 Data preprocessing
1) Read in the data.
2) Generate the vocabFile from the training set.
3) Call dump_token_embeddings from the bilm folder to generate the initial word vector representations and save them to an hdf5 file, whose key is "embedding" (a small sanity-check sketch follows this list).
4) Fix the input sequences to a constant length.
5) Split the data into training and evaluation sets.
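Once step 3 has been run, the generated token embedding file can be sanity-checked by reading its "embedding" dataset and comparing the number of rows with the vocabulary size. A small sketch using the file names from the configuration above:

import h5py

vocabFile = "modelParams/vocab.txt"
tokenEmbeddingFile = "modelParams/elmo_token_embeddings.hdf5"

with open(vocabFile) as f:
    vocabSize = len(f.read().splitlines())

with h5py.File(tokenEmbeddingFile, "r") as f:
    embeddings = f["embedding"][:]

# Expect roughly one row per vocabulary token (the exact count may differ
# by one depending on how bilm handles the padding row)
print(embeddings.shape, vocabSize)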
# Data preprocessing class: builds the training and evaluation sets
class Dataset(object):
    def __init__(self, config):
        self._dataSource = config.dataSource
        self._stopWordSource = config.stopWordSource

        self._optionFile = config.optionFile
        self._weightFile = config.weightFile
        self._vocabFile = config.vocabFile
        self._tokenEmbeddingFile = config.tokenEmbeddingFile

        self._sequenceLength = config.sequenceLength  # every input sequence is fixed to this length
        self._embeddingSize = config.model.embeddingSize
        self._batchSize = config.batchSize
        self._rate = config.rate

        self.trainReviews = []
        self.trainLabels = []

        self.evalReviews = []
        self.evalLabels = []

    def _readData(self, filePath):
        """
        Read the dataset from the csv file
        """
        df = pd.read_csv(filePath)
        labels = df["sentiment"].tolist()
        review = df["review"].tolist()
        reviews = [line.strip().split() for line in review]

        return reviews, labels

    def _genVocabFile(self, reviews):
        """
        Build a vocabulary file from our training data and add three special tokens
        """
        allWords = [word for review in reviews for word in review]
        wordCount = Counter(allWords)  # count word frequencies
        sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True)
        words = [item[0] for item in sortWordCount]
        allTokens = ['<S>', '</S>', '<UNK>'] + words
        with open(self._vocabFile, 'w') as fout:
            fout.write('\n'.join(allTokens))

    def _fixedSeq(self, reviews):
        """
        Truncate sequences longer than 200 tokens to a length of 200
        """
        return [review[:self._sequenceLength] for review in reviews]

    def _genElmoEmbedding(self):
        """
        Call dump_token_embeddings from the ELMo source code to generate token-level vectors from the
        character-based representation and save them to an hdf5 file. The value under the "embedding"
        key is the vector representation of each token in the vocabulary file; these vectors are later
        used as the initial input to the BiLM.
        """
        dump_token_embeddings(
            self._vocabFile, self._optionFile, self._weightFile, self._tokenEmbeddingFile)

    def _genTrainEvalData(self, x, y, rate):
        """
        Split the data into training and evaluation sets
        """
        y = [[item] for item in y]
        trainIndex = int(len(x) * rate)

        trainReviews = x[:trainIndex]
        trainLabels = y[:trainIndex]

        evalReviews = x[trainIndex:]
        evalLabels = y[trainIndex:]

        return trainReviews, trainLabels, evalReviews, evalLabels

    def dataGen(self):
        """
        Initialize the training and evaluation sets
        """
        # Load the dataset
        reviews, labels = self._readData(self._dataSource)
        # self._genVocabFile(reviews)  # generate the vocabFile
        # self._genElmoEmbedding()  # generate elmo_token_embeddings
        reviews = self._fixedSeq(reviews)

        # Build the training and evaluation sets
        trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(reviews, labels, self._rate)
        self.trainReviews = trainReviews
        self.trainLabels = trainLabels

        self.evalReviews = evalReviews
        self.evalLabels = evalLabels


data = Dataset(config)
data.dataGen()
6 Generating batch data
# Emit the dataset in batches
def nextBatch(x, y, batchSize):
    """
    Generate batches of data with a generator
    """
    # Shuffle the dataset at the start of each epoch
    midVal = list(zip(x, y))
    random.shuffle(midVal)
    x, y = zip(*midVal)
    x = list(x)
    y = list(y)

    numBatches = len(x) // batchSize

    for i in range(numBatches):
        start = i * batchSize
        end = start + batchSize
        batchX = x[start: end]
        batchY = y[start: end]

        yield batchX, batchY
7 Model structure
Here the input is no longer index-encoded data but data whose word vectors have already been dynamically generated, so inputX is three-dimensional. In addition, before feeding the data into the Bi-LSTM we add a fully connected layer, which allows the input word vectors to be trained further; otherwise the ELMo word vectors would go straight into the Bi-LSTM, and those vectors may not be optimal.
# Build the model
class BiLSTMAttention(object):
    """
    Bi-LSTM + Attention model for text classification
    """
    def __init__(self, config):

        # Define the model inputs
        self.inputX = tf.placeholder(tf.float32, [None, config.sequenceLength, config.model.embeddingSize],
                                     name="inputX")
        self.inputY = tf.placeholder(tf.float32, [None, 1], name="inputY")

        self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")

        # Define the l2 loss
        l2Loss = tf.constant(0.0)

        with tf.name_scope("embedding"):
            embeddingW = tf.get_variable(
                "embeddingW",
                shape=[config.model.embeddingSize, config.model.embeddingSize],
                initializer=tf.contrib.layers.xavier_initializer())

            reshapeInputX = tf.reshape(self.inputX, shape=[-1, config.model.embeddingSize])

            self.embeddedWords = tf.reshape(tf.matmul(reshapeInputX, embeddingW),
                                            shape=[-1, config.sequenceLength, config.model.embeddingSize])
            self.embeddedWords = tf.nn.dropout(self.embeddedWords, self.dropoutKeepProb)

        # Define the stacked bidirectional LSTM structure
        with tf.name_scope("Bi-LSTM"):

            for idx, hiddenSize in enumerate(config.model.hiddenSizes):
                with tf.name_scope("Bi-LSTM" + str(idx)):
                    # Forward LSTM cell
                    lstmFwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)
                    # Backward LSTM cell
                    lstmBwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)

                    # Dynamic rnn accepts variable sequence lengths; if no lengths are given, the full sequence is used.
                    # outputs_ is a tuple (output_fw, output_bw); both have shape [batch_size, max_time, hidden_size],
                    # with the same hidden_size for the forward and backward directions.
                    # self.current_state is the final state, a tuple (state_fw, state_bw), where each state is an (h, c) tuple.
                    outputs_, self.current_state = tf.nn.bidirectional_dynamic_rnn(
                        lstmFwCell, lstmBwCell, self.embeddedWords, dtype=tf.float32,
                        scope="bi-lstm" + str(idx))

                    # Concatenate the forward and backward outputs: [batch_size, time_step, hidden_size * 2],
                    # which is fed into the next Bi-LSTM layer
                    self.embeddedWords = tf.concat(outputs_, 2)
        # Split the output of the last Bi-LSTM layer into its forward and backward parts
        outputs = tf.split(self.embeddedWords, 2, -1)

        # As in the Bi-LSTM + Attention paper, the forward and backward outputs are summed
        with tf.name_scope("Attention"):
            H = outputs[0] + outputs[1]

            # Output of the attention layer
            output = self._attention(H)
            outputSize = config.model.hiddenSizes[-1]

        # Fully connected output layer
        with tf.name_scope("output"):
            outputW = tf.get_variable(
                "outputW",
                shape=[outputSize, 1],
                initializer=tf.contrib.layers.xavier_initializer())

            outputB = tf.Variable(tf.constant(0.1, shape=[1]), name="outputB")
            l2Loss += tf.nn.l2_loss(outputW)
            l2Loss += tf.nn.l2_loss(outputB)
            self.predictions = tf.nn.xw_plus_b(output, outputW, outputB, name="predictions")
            self.binaryPreds = tf.cast(tf.greater_equal(self.predictions, 0.0), tf.float32, name="binaryPreds")

        # Binary cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.predictions, labels=self.inputY)
            self.loss = tf.reduce_mean(losses) + config.model.l2RegLambda * l2Loss

    def _attention(self, H):
        """
        Use an attention mechanism to obtain a vector representation of the sentence
        """
        # Number of units in the last LSTM layer
        hiddenSize = config.model.hiddenSizes[-1]

        # Initialize a trainable weight vector
        W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))

        # Apply a nonlinear activation to the Bi-LSTM output
        M = tf.tanh(H)

        # Multiply M and W. M is [batch_size, time_step, hidden_size] and is reshaped to
        # [batch_size * time_step, hidden_size] before the matmul; newM is [batch_size * time_step, 1],
        # so each time step's output vector is reduced to a single score
        newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))

        # Reshape newM to [batch_size, time_step]
        restoreM = tf.reshape(newM, [-1, config.sequenceLength])

        # Normalize with softmax: [batch_size, time_step]
        self.alpha = tf.nn.softmax(restoreM)

        # Compute the weighted sum of H with the attention weights alpha as a single matmul
        r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))

        # Squeeze the result down to two dimensions: squeezeR = [batch_size, hidden_size]
        squeezeR = tf.squeeze(r)

        sentenceRepren = tf.tanh(squeezeR)

        # Apply dropout to the attention output
        output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)

        return output
8 Performance metric functions
# Define the performance metric functions
def mean(item):
    return sum(item) / len(item)


def genMetrics(trueY, predY, binaryPredY):
    """
    Compute accuracy, auc, precision and recall
    """
    auc = roc_auc_score(trueY, predY)
    accuracy = accuracy_score(trueY, binaryPredY)
    precision = precision_score(trueY, binaryPredY)
    recall = recall_score(trueY, binaryPredY)

    return round(accuracy, 4), round(auc, 4), round(precision, 4), round(recall, 4)
9 Training the model
When training the model we need to generate the word vector representations dynamically. The BiLM model must be instantiated at the global level of the session, and we define an elmo function that dynamically generates the ELMo word vectors for each batch.
# Train the model

# Training and evaluation sets
trainReviews = data.trainReviews
trainLabels = data.trainLabels
evalReviews = data.evalReviews
evalLabels = data.evalLabels

# Define the computation graph
with tf.Graph().as_default():

    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    session_conf.gpu_options.allow_growth = True
    session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # limit gpu memory usage

    sess = tf.Session(config=session_conf)  # define the session

    with sess.as_default():
        cnn = BiLSTMAttention(config)

        # Instantiate the BiLM object. This has to be done at the global level and must not be placed
        # inside the elmo function, otherwise duplicate tensorflow nodes would be created.
        with tf.variable_scope("bilm", reuse=True):
            bilm = BidirectionalLanguageModel(
                config.optionFile,
                config.weightFile,
                use_character_inputs=False,
                embedding_weight_file=config.tokenEmbeddingFile
            )

        inputData = tf.placeholder('int32', shape=(None, None))

        # Call the __call__ method of bilm to create the op objects
        inputEmbeddingsOp = bilm(inputData)

        # Compute the ELMo vector representation
        elmoInput = weight_layers('input', inputEmbeddingsOp, l2_coef=0.0)

        globalStep = tf.Variable(0, name="globalStep", trainable=False)
        # Define the optimizer with the configured learning rate
        optimizer = tf.train.AdamOptimizer(config.training.learningRate)
        # Compute the gradients, returning (gradient, variable) pairs
        gradsAndVars = optimizer.compute_gradients(cnn.loss)
        # Apply the gradients to the variables to build the training op
        trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

        # Use summaries for tensorBoard
        gradSummaries = []
        for g, v in gradsAndVars:
            if g is not None:
                tf.summary.histogram("{}/grad/hist".format(v.name), g)
                tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

        outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
        print("Writing to {}\n".format(outDir))

        lossSummary = tf.summary.scalar("loss", cnn.loss)

        summaryOp = tf.summary.merge_all()

        trainSummaryDir = os.path.join(outDir, "train")
        trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)

        evalSummaryDir = os.path.join(outDir, "eval")
        evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

        # Initialize all variables
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

        # One way to save the model: export it as a pb file
        # builder = tf.saved_model.builder.SavedModelBuilder("../model/textCNN/savedModel")

        sess.run(tf.global_variables_initializer())

        def elmo(reviews):
            """
            Dynamically generate the ELMo word vectors for each input batch
            """
            # tf.reset_default_graph()
            # TokenBatcher turns tokenized sentences into batches of token ids
            batcher = TokenBatcher(config.vocabFile)
            # Generate the batch of token ids
            inputDataIndex = batcher.batch_sentences(reviews)

            # Compute the ELMo vector representation
            elmoInputVec = sess.run(
                [elmoInput['weighted_op']],
                feed_dict={inputData: inputDataIndex}
            )

            return elmoInputVec

        def trainStep(batchX, batchY):
            """
            One training step
            """
            feed_dict = {
                cnn.inputX: elmo(batchX)[0],  # feed inputX with the dynamically generated ELMo vectors
                cnn.inputY: np.array(batchY, dtype="float32"),
                cnn.dropoutKeepProb: config.model.dropoutKeepProb
            }
            _, summary, step, loss, predictions, binaryPreds = sess.run(
                [trainOp, summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                feed_dict)

            timeStr = datetime.datetime.now().isoformat()
            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
            print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                timeStr, step, loss, acc, auc, precision, recall))
            trainSummaryWriter.add_summary(summary, step)

        def devStep(batchX, batchY):
            """
            One evaluation step
            """
            feed_dict = {
                cnn.inputX: elmo(batchX)[0],
                cnn.inputY: np.array(batchY, dtype="float32"),
                cnn.dropoutKeepProb: 1.0
            }
            summary, step, loss, predictions, binaryPreds = sess.run(
                [summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                feed_dict)

            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)

            evalSummaryWriter.add_summary(summary, step)

            return loss, acc, auc, precision, recall

        for i in range(config.training.epoches):
            # Train the model
            print("start training model")
            for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                trainStep(batchTrain[0], batchTrain[1])

                currentStep = tf.train.global_step(sess, globalStep)
                if currentStep % config.training.evaluateEvery == 0:
                    print("\nEvaluation:")

                    losses = []
                    accs = []
                    aucs = []
                    precisions = []
                    recalls = []

                    for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                        loss, acc, auc, precision, recall = devStep(batchEval[0], batchEval[1])
                        losses.append(loss)
                        accs.append(acc)
                        aucs.append(auc)
                        precisions.append(precision)
                        recalls.append(recall)

                    time_str = datetime.datetime.now().isoformat()
                    print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                        time_str, currentStep, mean(losses), mean(accs), mean(aucs),
                        mean(precisions), mean(recalls)))

                # if currentStep % config.training.checkpointEvery == 0:
                #     # Another way to save the model: checkpoint files
                #     path = saver.save(sess, "../model/textCNN/model/my-model", global_step=currentStep)
                #     print("Saved model checkpoint to {}\n".format(path))

        # inputs = {"inputX": tf.saved_model.utils.build_tensor_info(cnn.inputX),
        #           "keepProb": tf.saved_model.utils.build_tensor_info(cnn.dropoutKeepProb)}
        # outputs = {"binaryPreds": tf.saved_model.utils.build_tensor_info(cnn.binaryPreds)}
        # prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(
        #     inputs=inputs, outputs=outputs,
        #     method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
        # legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
        # builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
        #                                      signature_def_map={"predict": prediction_signature},
        #                                      legacy_init_op=legacy_init_op)
        # builder.save()