BERT Text Classification in Practice

Date: 2019-11-06

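The previous article described how to install BERT and use it for a text similarity task, including how to modify the code for training and testing. This article builds on that and shows how to handle a text classification task.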

For details on the text similarity task, see: Introduction to BERT and a Chinese Text Similarity Task in Practice

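The text similarity task and the text classification task differ in two places: how the dataset is prepared, and how the data class in run_classifier.py is constructed.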

0. Preparation

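To fine-tune on our own dataset, we first need to download a pre-trained model. Because we are processing Chinese text, we download the corresponding Chinese pre-trained model.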

BERT Git repository: google-research/bert

BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

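The file is named chinese_L-12_H-768_A-12.zip. Unzip it into the bert folder. It contains the following three kinds of files:

Configuration file (bert_config.json): specifies the model's hyperparameters

Vocabulary file (vocab.txt): maps WordPiece tokens to word ids

TensorFlow checkpoint (bert_model.ckpt): contains the weights of the pre-trained model (actually three files)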
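Before moving on, it can help to confirm that the unzipped folder really contains these files. The following is a minimal sketch; the helper name missing_bert_files is hypothetical and not part of the BERT repository, and the demo uses a temporary directory as a stand-in for the real model folder.

```python
import os
import tempfile


def missing_bert_files(model_dir):
    """Return the expected file names that are not present in model_dir.

    Hypothetical helper, not part of the BERT repository.
    """
    expected = ["bert_config.json", "vocab.txt"]
    present = set(os.listdir(model_dir))
    missing = [name for name in expected if name not in present]
    # The checkpoint is stored as several files that share one prefix.
    if not any(name.startswith("bert_model.ckpt") for name in present):
        missing.append("bert_model.ckpt*")
    return missing


# Demo on a temporary directory standing in for the unzipped model folder.
demo_dir = tempfile.mkdtemp()
for name in ["bert_config.json", "vocab.txt", "bert_model.ckpt.index"]:
    open(os.path.join(demo_dir, name), "w").close()
print(missing_bert_files(demo_dir))
```

An empty list means all three kinds of files were found.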

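1. Preparing the dataset

For a text classification task, each line of the dataset has the form label, text, where the label can be either a Chinese string or a number. For example: 天氣, 一會好像要下雨了 or 0, 一會好像要下雨了 (the label 天氣 means "weather" and the text means "it looks like it is about to rain").

Store the prepared data in text files such as .txt or .csv. The file name and extension can be anything, as long as they match the names used in the data class. For example, the get_train_examples method of the data class in run_classifier.py reads train.csv by default; change it to your own file name if needed.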

def get_train_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'train.csv')
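As an illustration of the label, text format, the sketch below writes a tiny train.csv and reads it back. The first row is the example from this article; the second row and the use of a temporary directory are assumptions made only for the demo.

```python
import os
import tempfile

# (label, text) pairs. The second row is a made-up sample for illustration.
rows = [
    ("天氣", "一會好像要下雨了"),
    ("音樂", "播放一首歌"),
]

data_dir = tempfile.mkdtemp()  # stand-in for the real data directory
train_path = os.path.join(data_dir, "train.csv")

# One example per line: the label, a comma, then the text.
with open(train_path, "w", encoding="utf-8") as f:
    for label, text in rows:
        f.write("%s,%s\n" % (label, text))

with open(train_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```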

2. Adding a custom data class

Name the new data class for text classification TextClassifierProcessor, as follows:

class TextClassifierProcessor(DataProcessor):

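Override four methods of the parent class to implement data loading.

get_train_examples: builds the list of InputExample objects for the training set
get_dev_examples: the same for the validation set
get_test_examples: the same for the test set
get_labels: returns the list of class labels in the dataset

The InputExample class represents a single training/test example for sequence classification. Each InputExample holds guid, text_a, text_b and label. It is defined as follows: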

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
          guid: Unique id for the example.
          text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

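Override the get_train_examples method. A text classification task only needs a label and a single text, so only text_a has to be assigned.

Because the label and the text in the prepared dataset are separated by a comma, each line is first split on the comma: split_line[0] is the label and is assigned to label, and split_line[1] is the text and is assigned to text_a.

Note that the text itself may also contain ASCII commas. To avoid ending up with truncated text, it is better to use str.find(',') to locate the first comma, so that label = line[:line.find(',')].strip()

The validation and test sets are handled in the same way.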
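The difference between splitting on every comma and cutting at the first comma can be sketched as follows. The function name parse_line and the sample sentence are illustrative assumptions, not part of run_classifier.py.

```python
def parse_line(line):
    """Split a 'label,text' line at the first comma only (hypothetical helper)."""
    line = line.strip()
    pos = line.find(",")
    if pos < 0:
        raise ValueError("no comma found in line: %r" % line)
    label = line[:pos].strip()
    text = line[pos + 1:].strip()
    return label, text


# Made-up sample whose text contains a second ASCII comma.
sample = "天氣, 一會好像要下雨了, 記得帶傘"

# Splitting on every comma keeps only the part before the second comma.
naive_text = sample.strip().split(",")[1].strip()
print(naive_text)

# Cutting at the first comma keeps the whole text.
print(parse_line(sample))
```

The same effect can be obtained with line.split(",", 1), which limits the split to the first comma.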

def get_train_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'train.csv')
        examples = []
        with open(file_path, encoding='utf-8') as f:
            reader = f.readlines()
        for (i, line) in enumerate(reader):
            guid = "train-%d" % (i)
            # Split on the first comma only, so commas inside the text are kept
            split_line = line.strip().split(",", 1)
            text_a = tokenization.convert_to_unicode(split_line[1])
            text_b = None
            label = str(split_line[0]).strip()
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
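Since the validation and test sets are read in the same way, the three get_*_examples methods could share one reading helper. The sketch below is one possible shape for it, under stated assumptions: read_examples is a hypothetical name, the namedtuple is only a stand-in for run_classifier.InputExample so that the code runs without BERT, and dev.csv is an assumed file name.

```python
import collections
import os
import tempfile

# Stand-in for run_classifier.InputExample so the sketch runs without BERT.
InputExample = collections.namedtuple(
    "InputExample", ["guid", "text_a", "text_b", "label"])


def read_examples(file_path, set_type):
    """Read 'label,text' lines into InputExample objects (hypothetical helper)."""
    examples = []
    with open(file_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            # partition cuts at the first comma only.
            label, sep, text = line.strip().partition(",")
            if not sep:
                continue  # skip lines without a comma, e.g. blank lines
            examples.append(InputExample(
                guid="%s-%d" % (set_type, i),
                text_a=text.strip(),
                text_b=None,
                label=label.strip()))
    return examples


# Usage with a temporary file; "dev.csv" is an assumed file name.
dev_path = os.path.join(tempfile.mkdtemp(), "dev.csv")
with open(dev_path, "w", encoding="utf-8") as f:
    f.write("天氣,一會好像要下雨了\n")
dev_examples = read_examples(dev_path, "dev")
print(dev_examples[0].guid, dev_examples[0].label)
```

Inside the real processor class, each get_*_examples method would then only differ in the file name and the set_type string it passes.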

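The get_labels method returns all class labels of the dataset. Here the classes are represented by the numbers 1, 2, 3, and so on. With 66 classes (1 to 66), it is implemented as follows: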

def get_labels(self):
        """See base class."""
        labels = [str(i) for i in range(1,67)]
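        return labels

<Note>

For convenience, you can build a dict that stores the mapping between the numeric classes and their text labels. You can of course also use the text labels directly; use whichever you prefer.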
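Such a mapping might look like the sketch below. The three category names are made up for illustration (the dataset in this article has 66 classes), and the variable names are assumptions.

```python
# Hypothetical category names; a real dataset would list all of its classes.
categories = ["天氣", "音樂", "新聞"]

# Numeric ids start at 1 to match get_labels(), which returns "1", "2", ...
label_to_cate = {i: name for i, name in enumerate(categories, start=1)}

# Reverse mapping, useful when converting raw text labels to numbers.
cate_to_label = {name: i for i, name in label_to_cate.items()}

print(label_to_cate[1], cate_to_label["新聞"])
```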

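After defining the TextClassifierProcessor class, it also has to be added to the processors variable in the main function.

Find the main() function and add the newly defined data class, as shown below: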

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "sim": SimProcessor,
        "classifier": TextClassifierProcessor,  # add this line
    }

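3. Modifying the predict output

In run_classifier.py, the prediction step writes two files: predict.tf_record and test_results.tsv. test_results.tsv stores, for every test example, the probability of each class, so its shape is n × num_labels.

These probabilities do not directly show the predicted class, so we add some processing code that outputs the predicted class itself.

The original code is as follows: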

if FLAGS.do_predict:
        print('*'*30,'do_predict', '*'*30)
        predict_examples = processor.get_test_examples(FLAGS.data_dir)
        num_actual_predict_examples = len(predict_examples)
        if FLAGS.use_tpu:
            # TPU requires a fixed batch size for all batches, therefore the number
            # of examples must be a multiple of the batch size, or else examples
            # will get dropped. So we pad with fake examples which are ignored
            # later on.
            while len(predict_examples) % FLAGS.predict_batch_size != 0:
                predict_examples.append(PaddingInputExample())

        predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
        file_based_convert_examples_to_features(predict_examples, label_list,
                                                FLAGS.max_seq_length, tokenizer,
                                                predict_file)

        tf.logging.info("***** Running prediction*****")
        tf.logging.info(" Num examples = %d (%d actual, %d padding)",
                        len(predict_examples), num_actual_predict_examples,
                        len(predict_examples) - num_actual_predict_examples)
        tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)

        predict_drop_remainder = True if FLAGS.use_tpu else False
        predict_input_fn = file_based_input_fn_builder(
            input_file=predict_file,
            seq_length=FLAGS.max_seq_length,
            is_training=False,
            drop_remainder=predict_drop_remainder)

        result = estimator.predict(input_fn=predict_input_fn)

        output_predict_file = os.path.join(
            FLAGS.output_dir, "test_results.tsv")
        with tf.gfile.GFile(output_predict_file, "w") as writer:
            num_written_lines = 0
            tf.logging.info("***** Predict results *****")
            for (i, prediction) in enumerate(result):
                probabilities = prediction["probabilities"]
                if i >= num_actual_predict_examples:
                    break
                output_line = "\t".join(
                    str(class_probability)
                    for class_probability in probabilities) + "\n"
                writer.write(output_line)
                num_written_lines += 1
        assert num_written_lines == num_actual_predict_examples

The modified code is as follows:

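        # Requires `import numpy as np` at the top of run_classifier.py.
        # real_label (the true integer label of each test example) and
        # lable_to_cate (a dict from integer label to category name) are not
        # shown in this article and must be defined before this point.
        result_predict_file = os.path.join(
            FLAGS.output_dir, "test_labels_out.txt")

        right = 0  # number of correct predictions
        # The readable results are saved to result_predict_file
        with tf.gfile.GFile(output_predict_file, "w") as writer, \
                open(result_predict_file, 'w', encoding='utf-8') as f_res:
            num_written_lines = 0
            tf.logging.info("***** Predict results *****")
            for (i, prediction) in enumerate(result):
                probabilities = prediction["probabilities"]  # predicted probabilities
                if i >= num_actual_predict_examples:
                    break
                output_line = "\t".join(
                    str(class_probability)
                    for class_probability in probabilities) + "\n"
                # Index of the class with the highest probability
                index = np.argmax(probabilities, axis=0)
                # Write the true label, the predicted label and its probability
                # to the results file (labels start at 1, hence index + 1)
                res_line = 'real: %s, \tpred:%s, \tscore = %.2f\n' \
                        % (lable_to_cate[real_label[i]], lable_to_cate[index + 1], probabilities[index])
                f_res.write(res_line)
                writer.write(output_line)
                num_written_lines += 1

                if real_label[i] == (index + 1):
                    right += 1

            # Fraction of correctly predicted examples
            print('precision = %.2f' % (right / len(real_label)))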
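If you would rather leave run_classifier.py untouched, the predicted class can also be recovered afterwards from the rows of test_results.tsv. The sketch below uses only the standard library and assumes that column i of each row corresponds to the i-th entry of the list returned by get_labels(). The function name predicted_labels and the probabilities in the demo are made up for illustration.

```python
def predicted_labels(tsv_lines, label_list):
    """Return (label, probability) for each non-empty row of class probabilities."""
    results = []
    for line in tsv_lines:
        if not line.strip():
            continue
        probs = [float(x) for x in line.strip().split("\t")]
        # Position of the largest probability in this row.
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((label_list[best], probs[best]))
    return results


# Demo with three classes and made-up probabilities.
labels = ["1", "2", "3"]
rows = ["0.1\t0.7\t0.2\n", "0.9\t0.05\t0.05\n"]
predictions = predicted_labels(rows, labels)
print(predictions)
```

For a real run, tsv_lines would be the lines read from test_results.tsv in the output directory, and label_list the value returned by get_labels().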

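4. Fine-tuning the model

With the dataset prepared and the data class modified, the next step is to fine-tune the model. The entry point of run_classifier.py lists the parameters that are required for fine-tuning: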

if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()

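Notes on some of the parameters:

data_dir: path where the data is stored
task_name: name of the processor; for the text classification task this is classifier
vocab_file: path to the vocabulary file
bert_config_file: path to the configuration file
output_dir: directory where the model is written

Because there are many parameters to set, they are collected in a shell script named fine-tuning_classifier.sh, as shown below: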

#!/usr/bin/env bash
export BERT_BASE_DIR=/**/NLP/bert/chinese_L-12_H-768_A-12 # environment variable: path to the downloaded pre-trained BERT model
export MY_DATASET=/**/NLP/bert/data/text_classifition # environment variable: path to the dataset

python run_classifier.py \
  --task_name=classifier  \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=$MY_DATASET \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=32  \
  --train_batch_size=64 \
  --learning_rate=5e-5 \
  --num_train_epochs=10.0 \
  --output_dir=./fine_tuning_out/text_classifier_64_epoch10_5e5

Run the command:

sh ./fine-tuning_classifier.sh

The generated model files are written to the output_dir directory.

The resulting test output file test_labels_out.txt contains lines like the following:

real: 天氣, pred:天氣, score = 1.00

The loss curve can be viewed with TensorBoard.
