1 Overview
This text classification series will run to roughly ten posts, covering classification based on word2vec pre-trained word vectors as well as classification based on the latest pre-trained models (ELMo, BERT, etc.). The series includes the following parts:
word2vec pre-trained word vectors
textCNN model
charCNN model
Bi-LSTM model
RCNN model
All of the code is available in the textClassifier repository.
2 Dataset
The dataset is the IMDB movie review dataset. There are three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv and testData.tsv. Text classification requires labeled data, so we use labeledTrainData. The data is preprocessed in the same way as in Text Classification in Practice (1): word2vec pre-trained word vectors; the preprocessed file is /data/preprocess/labeledTrain.csv.
3 The ELMo pre-trained model
ELMo uses a BiLM (bidirectional language model) to pre-train vector representations of words, and it can dynamically generate word vectors for our own training set. The ELMo pre-trained model comes from the paper Deep contextualized word representations. For a detailed introduction, see the separate post ELMO模型(Deep contextualized word representation).
The ELMo model code is released on GitHub. When calling the pre-trained model we mainly use the code in the bilm folder, so you can copy the bilm folder into your own project path and import its classes and functions from there. In addition, the files usage_cached.py, usage_character.py and usage_token.py show how to call ELMo to dynamically generate word vectors. Here we use the approach in usage_token.py, which is computationally cheaper.
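To make the calling convention concrete, here is a minimal, self-contained sketch of the token-based usage in the spirit of usage_token.py. It assumes the pre-trained option/weight files described in the next paragraph (and a vocabulary/token-embedding file, see section 5) are already in place; the file paths and example sentence are purely illustrative, and the same TokenBatcher / BidirectionalLanguageModel / weight_layers calls appear again in the training code in section 9:

import tensorflow as tf
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers

# Illustrative paths; these match the names used in the configuration below
vocabFile = "modelParams/vocab.txt"
optionFile = "modelParams/elmo_options.json"
weightFile = "modelParams/elmo_weights.hdf5"
tokenEmbeddingFile = "modelParams/elmo_token_embeddings.hdf5"

# Placeholder for batches of token ids
tokenIds = tf.placeholder('int32', shape=(None, None))

# BiLM driven by pre-computed token embeddings instead of characters
bilm = BidirectionalLanguageModel(
    optionFile, weightFile,
    use_character_inputs=False,
    embedding_weight_file=tokenEmbeddingFile)

# Weighted sum over the BiLM layers -> contextual word vectors
elmoOp = weight_layers('input', bilm(tokenIds), l2_coef=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batcher = TokenBatcher(vocabFile)
    sentences = [["this", "movie", "was", "great"]]
    vectors = sess.run(elmoOp['weighted_op'],
                       feed_dict={tokenIds: batcher.batch_sentences(sentences)})
    print(vectors.shape)  # e.g. (batch, time, 256) with the Small model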
Before using it we also need to download the pre-trained model weights. Open https://allennlp.org/elmo; under the Pre-trained ELMo Models section there are four model sizes to choose from, and here we pick the Small model. Two files need to be downloaded: an "options" json file, which stores the model's configuration parameters, and a "weights" hdf5 file, which stores the model structure and weight values (you can open it with h5py to take a look).
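As a quick sanity check after downloading, the weights file can be opened with h5py and its contents listed. A small sketch; the file name below is simply the name used later in the configuration:

import h5py

# Walk the downloaded weights file and print every group/dataset it contains
with h5py.File("modelParams/elmo_weights.hdf5", "r") as f:
    f.visit(print)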
4 Configuration parameters
Here we need to set the paths for optionFile, vocabFile, weightFile and tokenEmbeddingFile. Another thing to note is that embeddingSize must equal the size of the ELMo word vectors.
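The ELMo vectors produced by weight_layers concatenate the forward and backward projections, so their size is 2 * projection_dim (256 for the Small model). A small, hedged check against the options file, assuming the standard options format with an lstm.projection_dim field:

import json

with open("modelParams/elmo_options.json") as f:
    options = json.load(f)

# ModelConfig.embeddingSize below should be set to this value
elmoOutputSize = 2 * options["lstm"]["projection_dim"]
print(elmoOutputSize)  # 256 for the Small model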
We also need to import the functions and classes from the bilm folder:
import os
import csv
import time
import datetime
import random
from collections import Counter
from math import sqrt

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, dump_token_embeddings, Batcher
# Configuration parameters
class TrainingConfig(object):
    epoches = 10
    evaluateEvery = 100
    checkpointEvery = 100
    learningRate = 0.001


class ModelConfig(object):
    embeddingSize = 256  # must match the output size of the ELMo model
    hiddenSizes = [128]  # number of units in each LSTM layer
    dropoutKeepProb = 0.5
    l2RegLambda = 0.0


class Config(object):
    sequenceLength = 200  # based on the mean length of all sequences
    batchSize = 128

    dataSource = "../data/preProcess/labeledTrain.csv"
    stopWordSource = "../data/english"

    optionFile = "modelParams/elmo_options.json"
    weightFile = "modelParams/elmo_weights.hdf5"
    vocabFile = "modelParams/vocab.txt"
    tokenEmbeddingFile = 'modelParams/elmo_token_embeddings.hdf5'

    numClasses = 2

    rate = 0.8  # proportion of the data used for training

    training = TrainingConfig()

    model = ModelConfig()


# Instantiate the configuration object
config = Config()
5 Data preprocessing
1) Read in the data.
2) Generate the vocabFile from the training set.
3) Call dump_token_embeddings from the bilm folder to generate the initial word vector representations and save them to an hdf5 file, whose key is "embedding" (a small sanity-check sketch follows this list).
4) Fix the input sequences to a constant length.
5) Split the data into training and evaluation sets.
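Once step 3 has been run, the generated token embedding file can be sanity-checked by reading its "embedding" dataset and comparing the number of rows with the vocabulary size. A small sketch using the file names from the configuration above:

import h5py

vocabFile = "modelParams/vocab.txt"
tokenEmbeddingFile = "modelParams/elmo_token_embeddings.hdf5"

with open(vocabFile) as f:
    vocabSize = len(f.read().splitlines())

with h5py.File(tokenEmbeddingFile, "r") as f:
    embeddings = f["embedding"][:]

# Expect roughly one row per vocabulary token (the exact count may differ
# by one depending on how bilm handles the padding row)
print(embeddings.shape, vocabSize)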
# Data preprocessing class: builds the training and evaluation sets
class Dataset(object):
    def __init__(self, config):
        self._dataSource = config.dataSource
        self._stopWordSource = config.stopWordSource

        self._optionFile = config.optionFile
        self._weightFile = config.weightFile
        self._vocabFile = config.vocabFile
        self._tokenEmbeddingFile = config.tokenEmbeddingFile

        self._sequenceLength = config.sequenceLength  # every input sequence is fixed to this length
        self._embeddingSize = config.model.embeddingSize
        self._batchSize = config.batchSize
        self._rate = config.rate

        self.trainReviews = []
        self.trainLabels = []

        self.evalReviews = []
        self.evalLabels = []

    def _readData(self, filePath):
        """
        Read the dataset from the csv file
        """
        df = pd.read_csv(filePath)
        labels = df["sentiment"].tolist()
        review = df["review"].tolist()
        reviews = [line.strip().split() for line in review]

        return reviews, labels

    def _genVocabFile(self, reviews):
        """
        Build a vocabulary file from our training data and add three special tokens
        """
        allWords = [word for review in reviews for word in review]
        wordCount = Counter(allWords)  # count word frequencies
        sortWordCount = sorted(wordCount.items(), key=lambda x: x[1], reverse=True)
        words = [item[0] for item in sortWordCount]
        allTokens = ['<S>', '</S>', '<UNK>'] + words
        with open(self._vocabFile, 'w') as fout:
            fout.write('\n'.join(allTokens))

    def _fixedSeq(self, reviews):
        """
        Truncate sequences longer than 200 tokens to a length of 200
        """
        return [review[:self._sequenceLength] for review in reviews]

    def _genElmoEmbedding(self):
        """
        Call dump_token_embeddings from the ELMo source code to generate token-level vectors from the
        character-based representation and save them to an hdf5 file. The value under the "embedding"
        key is the vector representation of each token in the vocabulary file; these vectors are later
        used as the initial input to the BiLM.
        """
        dump_token_embeddings(
            self._vocabFile, self._optionFile, self._weightFile, self._tokenEmbeddingFile)

    def _genTrainEvalData(self, x, y, rate):
        """
        Split the data into training and evaluation sets
        """
        y = [[item] for item in y]
        trainIndex = int(len(x) * rate)

        trainReviews = x[:trainIndex]
        trainLabels = y[:trainIndex]

        evalReviews = x[trainIndex:]
        evalLabels = y[trainIndex:]

        return trainReviews, trainLabels, evalReviews, evalLabels

    def dataGen(self):
        """
        Initialize the training and evaluation sets
        """
        # Load the dataset
        reviews, labels = self._readData(self._dataSource)
        # self._genVocabFile(reviews)  # generate the vocabFile
        # self._genElmoEmbedding()  # generate elmo_token_embeddings
        reviews = self._fixedSeq(reviews)

        # Build the training and evaluation sets
        trainReviews, trainLabels, evalReviews, evalLabels = self._genTrainEvalData(reviews, labels, self._rate)
        self.trainReviews = trainReviews
        self.trainLabels = trainLabels

        self.evalReviews = evalReviews
        self.evalLabels = evalLabels


data = Dataset(config)
data.dataGen()
6 Generating batch data
# Emit the dataset in batches
def nextBatch(x, y, batchSize):
    """
    Generate batches of data with a generator
    """
    # Shuffle the dataset at the start of each epoch
    midVal = list(zip(x, y))
    random.shuffle(midVal)
    x, y = zip(*midVal)
    x = list(x)
    y = list(y)

    numBatches = len(x) // batchSize

    for i in range(numBatches):
        start = i * batchSize
        end = start + batchSize
        batchX = x[start: end]
        batchY = y[start: end]

        yield batchX, batchY
7 Model structure
Here the input is no longer index-encoded data but data whose word vectors have already been dynamically generated, so inputX is three-dimensional. In addition, before feeding the data into the Bi-LSTM we add a fully connected layer, which allows the input word vectors to be trained further; otherwise the ELMo word vectors would go straight into the Bi-LSTM, and those vectors may not be optimal.
# Build the model
class BiLSTMAttention(object):
    """
    Bi-LSTM + Attention model for text classification
    """
    def __init__(self, config):

        # Define the model inputs
        self.inputX = tf.placeholder(tf.float32, [None, config.sequenceLength, config.model.embeddingSize],
                                     name="inputX")
        self.inputY = tf.placeholder(tf.float32, [None, 1], name="inputY")

        self.dropoutKeepProb = tf.placeholder(tf.float32, name="dropoutKeepProb")

        # Define the l2 loss
        l2Loss = tf.constant(0.0)

        with tf.name_scope("embedding"):
            embeddingW = tf.get_variable(
                "embeddingW",
                shape=[config.model.embeddingSize, config.model.embeddingSize],
                initializer=tf.contrib.layers.xavier_initializer())

            reshapeInputX = tf.reshape(self.inputX, shape=[-1, config.model.embeddingSize])

            self.embeddedWords = tf.reshape(tf.matmul(reshapeInputX, embeddingW),
                                            shape=[-1, config.sequenceLength, config.model.embeddingSize])
            self.embeddedWords = tf.nn.dropout(self.embeddedWords, self.dropoutKeepProb)

        # Define the stacked bidirectional LSTM structure
        with tf.name_scope("Bi-LSTM"):

            for idx, hiddenSize in enumerate(config.model.hiddenSizes):
                with tf.name_scope("Bi-LSTM" + str(idx)):
                    # Forward LSTM cell
                    lstmFwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)
                    # Backward LSTM cell
                    lstmBwCell = tf.nn.rnn_cell.DropoutWrapper(
                        tf.nn.rnn_cell.LSTMCell(num_units=hiddenSize, state_is_tuple=True),
                        output_keep_prob=self.dropoutKeepProb)

                    # Dynamic rnn accepts variable sequence lengths; if no lengths are given, the full sequence is used.
                    # outputs_ is a tuple (output_fw, output_bw); both have shape [batch_size, max_time, hidden_size],
                    # with the same hidden_size for the forward and backward directions.
                    # self.current_state is the final state, a tuple (state_fw, state_bw), where each state is an (h, c) tuple.
                    outputs_, self.current_state = tf.nn.bidirectional_dynamic_rnn(
                        lstmFwCell, lstmBwCell, self.embeddedWords, dtype=tf.float32,
                        scope="bi-lstm" + str(idx))

                    # Concatenate the forward and backward outputs: [batch_size, time_step, hidden_size * 2],
                    # which is fed into the next Bi-LSTM layer
                    self.embeddedWords = tf.concat(outputs_, 2)
        # Split the output of the last Bi-LSTM layer into its forward and backward parts
        outputs = tf.split(self.embeddedWords, 2, -1)

        # As in the Bi-LSTM + Attention paper, the forward and backward outputs are summed
        with tf.name_scope("Attention"):
            H = outputs[0] + outputs[1]

            # Output of the attention layer
            output = self._attention(H)
            outputSize = config.model.hiddenSizes[-1]

        # Fully connected output layer
        with tf.name_scope("output"):
            outputW = tf.get_variable(
                "outputW",
                shape=[outputSize, 1],
                initializer=tf.contrib.layers.xavier_initializer())

            outputB = tf.Variable(tf.constant(0.1, shape=[1]), name="outputB")
            l2Loss += tf.nn.l2_loss(outputW)
            l2Loss += tf.nn.l2_loss(outputB)
            self.predictions = tf.nn.xw_plus_b(output, outputW, outputB, name="predictions")
            self.binaryPreds = tf.cast(tf.greater_equal(self.predictions, 0.0), tf.float32, name="binaryPreds")

        # Binary cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.sigmoid_cross_entropy_with_logits(logits=self.predictions, labels=self.inputY)
            self.loss = tf.reduce_mean(losses) + config.model.l2RegLambda * l2Loss

    def _attention(self, H):
        """
        Use an attention mechanism to obtain a vector representation of the sentence
        """
        # Number of units in the last LSTM layer
        hiddenSize = config.model.hiddenSizes[-1]

        # Initialize a trainable weight vector
        W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))

        # Apply a nonlinear activation to the Bi-LSTM output
        M = tf.tanh(H)

        # Multiply M and W. M is [batch_size, time_step, hidden_size] and is reshaped to
        # [batch_size * time_step, hidden_size] before the matmul; newM is [batch_size * time_step, 1],
        # so each time step's output vector is reduced to a single score
        newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))

        # Reshape newM to [batch_size, time_step]
        restoreM = tf.reshape(newM, [-1, config.sequenceLength])

        # Normalize with softmax: [batch_size, time_step]
        self.alpha = tf.nn.softmax(restoreM)

        # Compute the weighted sum of H with the attention weights alpha as a single matmul
        r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))

        # Squeeze the result down to two dimensions: squeezeR = [batch_size, hidden_size]
        squeezeR = tf.squeeze(r)

        sentenceRepren = tf.tanh(squeezeR)

        # Apply dropout to the attention output
        output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)

        return output
8 Performance metric functions
# Define the performance metric functions
def mean(item):
    return sum(item) / len(item)


def genMetrics(trueY, predY, binaryPredY):
    """
    Compute accuracy, auc, precision and recall
    """
    auc = roc_auc_score(trueY, predY)
    accuracy = accuracy_score(trueY, binaryPredY)
    precision = precision_score(trueY, binaryPredY)
    recall = recall_score(trueY, binaryPredY)

    return round(accuracy, 4), round(auc, 4), round(precision, 4), round(recall, 4)
9 Training the model
When training the model we need to generate the word vector representations dynamically. The BiLM model must be instantiated at the global level of the session, and we define an elmo function that dynamically generates the ELMo word vectors for each batch.
# Train the model

# Training and evaluation sets
trainReviews = data.trainReviews
trainLabels = data.trainLabels
evalReviews = data.evalReviews
evalLabels = data.evalLabels

# Define the computation graph
with tf.Graph().as_default():

    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    session_conf.gpu_options.allow_growth = True
    session_conf.gpu_options.per_process_gpu_memory_fraction = 0.9  # limit gpu memory usage

    sess = tf.Session(config=session_conf)  # define the session

    with sess.as_default():
        cnn = BiLSTMAttention(config)

        # Instantiate the BiLM object. This has to be done at the global level and must not be placed
        # inside the elmo function, otherwise duplicate tensorflow nodes would be created.
        with tf.variable_scope("bilm", reuse=True):
            bilm = BidirectionalLanguageModel(
                config.optionFile,
                config.weightFile,
                use_character_inputs=False,
                embedding_weight_file=config.tokenEmbeddingFile
            )

        inputData = tf.placeholder('int32', shape=(None, None))

        # Call the __call__ method of bilm to create the op objects
        inputEmbeddingsOp = bilm(inputData)

        # Compute the ELMo vector representation
        elmoInput = weight_layers('input', inputEmbeddingsOp, l2_coef=0.0)

        globalStep = tf.Variable(0, name="globalStep", trainable=False)
        # Define the optimizer with the configured learning rate
        optimizer = tf.train.AdamOptimizer(config.training.learningRate)
        # Compute the gradients, returning (gradient, variable) pairs
        gradsAndVars = optimizer.compute_gradients(cnn.loss)
        # Apply the gradients to the variables to build the training op
        trainOp = optimizer.apply_gradients(gradsAndVars, global_step=globalStep)

        # Use summaries for tensorBoard
        gradSummaries = []
        for g, v in gradsAndVars:
            if g is not None:
                tf.summary.histogram("{}/grad/hist".format(v.name), g)
                tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

        outDir = os.path.abspath(os.path.join(os.path.curdir, "summarys"))
        print("Writing to {}\n".format(outDir))

        lossSummary = tf.summary.scalar("loss", cnn.loss)

        summaryOp = tf.summary.merge_all()

        trainSummaryDir = os.path.join(outDir, "train")
        trainSummaryWriter = tf.summary.FileWriter(trainSummaryDir, sess.graph)

        evalSummaryDir = os.path.join(outDir, "eval")
        evalSummaryWriter = tf.summary.FileWriter(evalSummaryDir, sess.graph)

        # Initialize all variables
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

        # One way to save the model: export it as a pb file
        # builder = tf.saved_model.builder.SavedModelBuilder("../model/textCNN/savedModel")

        sess.run(tf.global_variables_initializer())

        def elmo(reviews):
            """
            Dynamically generate the ELMo word vectors for each input batch
            """
            # tf.reset_default_graph()
            # TokenBatcher turns tokenized sentences into batches of token ids
            batcher = TokenBatcher(config.vocabFile)
            # Generate the batch of token ids
            inputDataIndex = batcher.batch_sentences(reviews)

            # Compute the ELMo vector representation
            elmoInputVec = sess.run(
                [elmoInput['weighted_op']],
                feed_dict={inputData: inputDataIndex}
            )

            return elmoInputVec

        def trainStep(batchX, batchY):
            """
            One training step
            """
            feed_dict = {
                cnn.inputX: elmo(batchX)[0],  # feed inputX with the dynamically generated ELMo vectors
                cnn.inputY: np.array(batchY, dtype="float32"),
                cnn.dropoutKeepProb: config.model.dropoutKeepProb
            }
            _, summary, step, loss, predictions, binaryPreds = sess.run(
                [trainOp, summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                feed_dict)

            timeStr = datetime.datetime.now().isoformat()
            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)
            print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                timeStr, step, loss, acc, auc, precision, recall))
            trainSummaryWriter.add_summary(summary, step)

        def devStep(batchX, batchY):
            """
            One evaluation step
            """
            feed_dict = {
                cnn.inputX: elmo(batchX)[0],
                cnn.inputY: np.array(batchY, dtype="float32"),
                cnn.dropoutKeepProb: 1.0
            }
            summary, step, loss, predictions, binaryPreds = sess.run(
                [summaryOp, globalStep, cnn.loss, cnn.predictions, cnn.binaryPreds],
                feed_dict)

            acc, auc, precision, recall = genMetrics(batchY, predictions, binaryPreds)

            evalSummaryWriter.add_summary(summary, step)

            return loss, acc, auc, precision, recall

        for i in range(config.training.epoches):
            # Train the model
            print("start training model")
            for batchTrain in nextBatch(trainReviews, trainLabels, config.batchSize):
                trainStep(batchTrain[0], batchTrain[1])

                currentStep = tf.train.global_step(sess, globalStep)
                if currentStep % config.training.evaluateEvery == 0:
                    print("\nEvaluation:")

                    losses = []
                    accs = []
                    aucs = []
                    precisions = []
                    recalls = []

                    for batchEval in nextBatch(evalReviews, evalLabels, config.batchSize):
                        loss, acc, auc, precision, recall = devStep(batchEval[0], batchEval[1])
                        losses.append(loss)
                        accs.append(acc)
                        aucs.append(auc)
                        precisions.append(precision)
                        recalls.append(recall)

                    time_str = datetime.datetime.now().isoformat()
                    print("{}, step: {}, loss: {}, acc: {}, auc: {}, precision: {}, recall: {}".format(
                        time_str, currentStep, mean(losses), mean(accs), mean(aucs),
                        mean(precisions), mean(recalls)))

                # if currentStep % config.training.checkpointEvery == 0:
                #     # Another way to save the model: checkpoint files
                #     path = saver.save(sess, "../model/textCNN/model/my-model", global_step=currentStep)
                #     print("Saved model checkpoint to {}\n".format(path))

        # inputs = {"inputX": tf.saved_model.utils.build_tensor_info(cnn.inputX),
        #           "keepProb": tf.saved_model.utils.build_tensor_info(cnn.dropoutKeepProb)}
        # outputs = {"binaryPreds": tf.saved_model.utils.build_tensor_info(cnn.binaryPreds)}
        # prediction_signature = tf.saved_model.signature_def_utils.build_signature_def(
        #     inputs=inputs, outputs=outputs,
        #     method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME)
        # legacy_init_op = tf.group(tf.tables_initializer(), name="legacy_init_op")
        # builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING],
        #                                      signature_def_map={"predict": prediction_signature},
        #                                      legacy_init_op=legacy_init_op)
        # builder.save()