Chinese Short-Text Classification with BERT

Date: 2019-11-07

1. Environment setup

The experiments were run on Ubuntu 18.04.3 LTS (kernel 4.15.0-29-generic, GNU/Linux).

1.1 Check the CUDA version

cat /usr/local/cuda/version.txt

Output:

CUDA Version 10.0.130

1.2 Check the cuDNN version

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Output:

#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 3
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
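The CUDNN_VERSION macro in the output packs the three numbers into a single integer. The sketch below repeats that arithmetic in Python. The function name parse_cudnn_version and the sample header text are made up for illustration and are not part of cuDNN.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch, combined) from cudnn.h-style #define lines."""
    values = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("missing %s in header text" % name)
        values[name] = int(match.group(1))
    major = values["CUDNN_MAJOR"]
    minor = values["CUDNN_MINOR"]
    patch = values["CUDNN_PATCHLEVEL"]
    # Same formula as the CUDNN_VERSION macro shown in the grep output.
    combined = major * 1000 + minor * 100 + patch
    return major, minor, patch, combined


# Hypothetical header content, not the values from the machine used in this post.
sample = """#define CUDNN_MAJOR 1
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 3
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(parse_cudnn_version(sample))  # (1, 2, 3, 1203)
```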

If CUDA and cuDNN are not installed, download the versions that match your GPU model from the official NVIDIA website and install them.

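1.3 Install tensorflow-gpu

Install tensorflow-gpu inside a virtual environment created with Anaconda (installing Anaconda itself is not covered here).

Create the virtual environment

The environment is named tensorflow:

conda create -n tensorflow python=3.7.1

Activate the virtual environment

The same command re-enters the environment in later sessions:

source activate tensorflow

Install tensorflow-gpu

Running pip install tensorflow-gpu directly is not recommended because the download is slow. The Douban mirror is much faster: https://pypi.doubanio.com/simple/tensorflow-gpu/

Find the build that matches your setup (cp37 means Python 3.7), then install it with pip install:

pip install https://pypi.doubanio.com/packages/15/21/17f941058556b67ce6d1e3f0e0932c9c2deaf457e3d45eecd93f2c20827d/tensorflow_gpu-1.14.0rc1-cp37-cp37m-manylinux1_x86_64.whl

I chose the Linux build of tensorflow-gpu 1.14.0 (the wheel above is 1.14.0rc1) for Python 3.7. BERT requires tensorflow-gpu 1.11.0 or later. Version 2.0 is not recommended: it appears to have changed some APIs, so the BERT code would need manual changes.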
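To see which build matches your interpreter, it helps to read the tags in the wheel file name. The sketch below splits a name into its parts and compares the Python tag with the running interpreter. The helper names parse_wheel_name and current_cpython_tag are made up for illustration, and the sketch assumes a CPython interpreter.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel file name into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %s" % filename)
    parts = filename[:-len(".whl")].split("-")
    # An optional build tag may sit between the version and the Python tag.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: %s" % filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_cpython_tag():
    """Build a tag such as 'cp37' from the running interpreter (assumes CPython)."""
    return "cp%d%d" % sys.version_info[:2]


wheel = "tensorflow_gpu-1.14.0rc1-cp37-cp37m-manylinux1_x86_64.whl"
info = parse_wheel_name(wheel)
print(info["version"], info["python_tag"], info["platform_tag"])
print("matches this interpreter:", info["python_tag"] == current_cpython_tag())
```

Run inside the activated tensorflow environment, the last line should report a match when the environment uses Python 3.7.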

Environment test

Inside the tensorflow virtual environment, run python to start the interpreter, then enter import tensorflow and check that the import succeeds.
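The import test can be extended to check the version range mentioned above as well. This is a minimal sketch that assumes the supported range is 1.11.0 or later and below 2.0. The helper names parse_version and is_supported are hypothetical.

```python
def parse_version(version):
    """Turn a string such as '1.14.0rc1' into (1, 14, 0), dropping any suffix."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(version):
    """True for versions in [1.11.0, 2.0.0), the range assumed in this post."""
    return (1, 11, 0) <= parse_version(version) < (2, 0, 0)


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        print(tf.__version__, "supported:", is_supported(tf.__version__))
```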

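2. Preparation

2.1 Download a pretrained model

Bert-base Chinese

BERT-wwm: released by the Joint Laboratory of HIT and iFLYTEK; it performs somewhat better than Bert-base Chinese. (The download link points to iFLYTEK cloud, password: mva8. Sadly, by the time I had finished training with wwm, the submission channel had already closed.)

The model archive includes:

bert_model.ckpt: the checkpoint from which the model variables are loaded

vocab.txt: the vocabulary used for Chinese text during training

bert_config.json: parameters of the BERT model, some of which can optionally be adjusted for training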
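Before training, it can save time to confirm that the unpacked model folder is complete. The sketch below is one way to do that. The function name missing_model_files is hypothetical. It treats bert_model.ckpt as a file-name prefix, since the checkpoint is normally stored as several files that share that prefix.

```python
import os
import tempfile


def missing_model_files(model_dir):
    """Return the expected model files that are absent from model_dir."""
    present = os.listdir(model_dir)
    missing = []
    for name in ("vocab.txt", "bert_config.json"):
        if name not in present:
            missing.append(name)
    # The checkpoint is a prefix, e.g. bert_model.ckpt.index, so match by prefix.
    if not any(f.startswith("bert_model.ckpt") for f in present):
        missing.append("bert_model.ckpt")
    return missing


# Demo on a temporary folder that contains only vocab.txt.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "vocab.txt"), "w").close()
    print(missing_model_files(demo_dir))  # ['bert_config.json', 'bert_model.ckpt']
```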

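2.2 Data preparation

1) Convert your dataset to the following format: the first column is the label and the second column is the text, separated by a tab (if the test set has no labels, keep only the text column). Name the training, validation, and test files train.tsv, val.tsv, and test.tsv. The files must be encoded as UTF-8 without a BOM.

2) Create a folder named data and put the three files in it.

3) Unpack the pretrained model into a new folder named chinese.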
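The file format from step 1 can be produced with a few lines of Python. This is a minimal sketch: the helper name write_tsv and the sample sentences are invented for illustration, and the demo writes to a temporary folder rather than the real data folder.

```python
import os
import tempfile


def write_tsv(path, rows):
    """Write rows of strings as tab-separated UTF-8 text without a BOM."""
    # encoding="utf-8" writes no BOM (unlike "utf-8-sig").
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            # Collapse tabs, line breaks, and other whitespace runs to one space
            # so that a field can never break the column layout.
            cleaned = [" ".join(field.split()) for field in row]
            f.write("\t".join(cleaned) + "\n")


# Hypothetical samples: label first, text second; the test set has text only.
train_rows = [("Age", "年龄在18至65岁之间"), ("Disease", "患有严重\t心脏疾病")]
test_rows = [("年龄大于60岁",)]

with tempfile.TemporaryDirectory() as data_dir:
    train_path = os.path.join(data_dir, "train.tsv")
    write_tsv(train_path, train_rows)
    write_tsv(os.path.join(data_dir, "test.tsv"), test_rows)
    with open(train_path, "rb") as f:
        print(f.read().startswith(b"\xef\xbb\xbf"))  # False: no BOM
```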

2.3 Code changes

Two changes are needed in run_classifier.py from the BERT source.

1) Add our task class to run_classifier.py

Model your own task class on the other Processor classes:

# Custom Processor class. Add it to run_classifier.py, which already
# imports os and tokenization and defines DataProcessor and InputExample.
class MyProcessor(DataProcessor):
    def __init__(self):
        self.labels = ['Addictive Behavior',
                       'Address',
                       'Age',
                       'Alcohol Consumer',
                       'Allergy Intolerance',
                       'Bedtime',
                       'Blood Donation',
                       'Capacity',
                       'Compliance with Protocol',
                       'Consent',
                       'Data Accessible',
                       'Device',
                       'Diagnostic',
                       'Diet',
                       'Disabilities',
                       'Disease',
                       'Education',
                       'Encounter',
                       'Enrollment in other studies',
                       'Ethical Audit',
                       'Ethnicity',
                       'Exercise',
                       'Gender',
                       'Healthy',
                       'Laboratory Examinations',
                       'Life Expectancy',
                       'Literacy',
                       'Multiple',
                       'Neoplasm Status',
                       'Non-Neoplasm Disease Stage',
                       'Nursing',
                       'Oral related',
                       'Organ or Tissue Status',
                       'Pharmaceutical Substance or Drug',
                       'Pregnancy-related Activity',
                       'Receptor Status',
                       'Researcher Decision',
                       'Risk Assessment',
                       'Sexual related',
                       'Sign',
                       'Smoking Status',
                       'Special Patient Characteristic',
                       'Symptom',
                       'Therapy or Surgery']

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "val.tsv")), "val")

    def get_test_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        return self.labels

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            if set_type == "test":
                # My test set has no labels, so test is handled separately.
                # The label is set to an arbitrary value; it must be an
                # existing class label, otherwise predict raises a KeyError.
                # If your test set has labels, drop this branch and handle
                # all three sets the same way.
                text_a = tokenization.convert_to_unicode(line[0])
                label = "Address"
            else:
                text_a = tokenization.convert_to_unicode(line[1])
                label = tokenization.convert_to_unicode(line[0])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
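The column handling in _create_examples can be tried out without TensorFlow. The sketch below is a simplified stand-in, not the BERT code: InputExample is replaced by a namedtuple, convert_to_unicode is omitted, and the sample rows are invented.

```python
import collections

# Stand-in for the InputExample class defined in run_classifier.py.
InputExample = collections.namedtuple(
    "InputExample", ["guid", "text_a", "text_b", "label"])

# Any label that exists in the label list works as a placeholder.
PLACEHOLDER_LABEL = "Address"


def create_examples(lines, set_type):
    """Build examples from parsed TSV rows; test rows have text only."""
    examples = []
    for i, line in enumerate(lines):
        guid = "%s-%s" % (set_type, i)
        if set_type == "test":
            text_a, label = line[0], PLACEHOLDER_LABEL
        else:
            text_a, label = line[1], line[0]
        examples.append(InputExample(guid, text_a, None, label))
    return examples


train = create_examples([["Age", "年龄在18至65岁之间"]], "train")
test = create_examples([["年龄大于60岁"]], "test")
print(train[0].guid, train[0].label)  # train-0 Age
print(test[0].guid, test[0].label)    # test-0 Address
```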

2) Add the task to the processors dictionary

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "mytask": MyProcessor,  # register your own Processor here
    }
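As far as I can tell from the BERT source, run_classifier.py lowercases the --task_name value and uses it as a key into this dictionary, so the key must match the name passed on the command line. The sketch below imitates that lookup with stand-in classes. The function name build_processor and the shortened label lists are assumptions for the demo.

```python
class DataProcessor(object):
    """Stand-in for the base class in run_classifier.py."""

    def get_labels(self):
        raise NotImplementedError()


class ColaProcessor(DataProcessor):
    def get_labels(self):
        return ["0", "1"]


class MyProcessor(DataProcessor):
    def get_labels(self):
        return ["Address", "Age"]  # shortened list for the demo


processors = {"cola": ColaProcessor, "mytask": MyProcessor}


def build_processor(task_name):
    """Look up a task name case-insensitively and instantiate its Processor."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()


processor = build_processor("MyTask")
label_list = processor.get_labels()
print(type(processor).__name__, len(label_list))  # MyProcessor 2
```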

3. Getting started

3.1 Configure the training script

Create and run a file named run.sh:

python run_classifier.py \
  --data_dir=data \
  --task_name=mytask \
  --do_train=true \
  --do_eval=true \
  --vocab_file=chinese/vocab.txt \
  --bert_config_file=chinese/bert_config.json \
  --init_checkpoint=chinese/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=out

Fine-tuning takes some time: with about 20,000 training samples and 8,000 validation samples on a 2080 Ti GPU, it took roughly 20 minutes. If GPU memory is insufficient, reduce max_seq_length and train_batch_size accordingly.
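To get a feel for the length of the run: as far as I can tell, run_classifier.py derives the number of training steps from the training-set size, train_batch_size, and num_train_epochs, and warms up the learning rate over a fraction of those steps. The sketch below repeats that arithmetic, assuming exactly 20,000 training samples and a warm-up proportion of 0.1. The function name training_schedule is made up for illustration.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (training steps, warm-up steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Settings from run.sh above, with an assumed 20,000 training samples.
print(training_schedule(20000, 8, 3.0))  # (7500, 750)
# Halving the batch size to save GPU memory doubles the number of steps.
print(training_schedule(20000, 4, 3.0))  # (15000, 1500)
```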

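3.2 Prediction

Create and run test.sh (note: init_checkpoint is the location of the model produced by the training run above):

python run_classifier.py \
  --task_name=mytask \
  --do_predict=true \
  --data_dir=data \
  --vocab_file=chinese/vocab.txt \
  --bert_config_file=chinese/bert_config.json \
  --init_checkpoint=out \
  --max_seq_length=128 \
  --output_dir=out

After prediction, test_results.tsv is written to the out directory. Each row corresponds to one sample in your test set, and each column holds the probability of one class (in the order of the custom label list defined earlier). For example, row 5, column 8 is the probability that the 5th sample belongs to the 8th class.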
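The row and column layout can be illustrated with a small, dependency-free sketch. The helper names read_probabilities and lookup, the three-label list, and the probability values are all invented for the demo.

```python
import csv
import io


def read_probabilities(tsv_text):
    """Parse tab-separated probabilities into a list of rows of floats."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [[float(value) for value in row] for row in reader if row]


def lookup(rows, labels, sample_no, class_no):
    """Return (label, probability) for 1-based sample and class numbers."""
    return labels[class_no - 1], rows[sample_no - 1][class_no - 1]


# Shortened, hypothetical label list and made-up probabilities.
demo_labels = ["Address", "Age", "Disease"]
demo_tsv = "0.1\t0.7\t0.2\n0.6\t0.3\t0.1\n"

rows = read_probabilities(demo_tsv)
print(lookup(rows, demo_labels, 2, 1))  # ('Address', 0.6)
```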

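3.3 Post-processing the predictions

Because the predictions are probabilities, they need post-processing: take the largest value in each row as the prediction and convert it to the corresponding label.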

import pandas as pd

# Path to the prediction file produced in 3.2
data_dir = "C:\\test_results.tsv"

# Must be in the same order as the label list in MyProcessor
labels = ['Addictive Behavior',
          'Address',
          'Age',
          'Alcohol Consumer',
          'Allergy Intolerance',
          'Bedtime',
          'Blood Donation',
          'Capacity',
          'Compliance with Protocol',
          'Consent',
          'Data Accessible',
          'Device',
          'Diagnostic',
          'Diet',
          'Disabilities',
          'Disease',
          'Education',
          'Encounter',
          'Enrollment in other studies',
          'Ethical Audit',
          'Ethnicity',
          'Exercise',
          'Gender',
          'Healthy',
          'Laboratory Examinations',
          'Life Expectancy',
          'Literacy',
          'Multiple',
          'Neoplasm Status',
          'Non-Neoplasm Disease Stage',
          'Nursing',
          'Oral related',
          'Organ or Tissue Status',
          'Pharmaceutical Substance or Drug',
          'Pregnancy-related Activity',
          'Receptor Status',
          'Researcher Decision',
          'Risk Assessment',
          'Sexual related',
          'Sign',
          'Smoking Status',
          'Special Patient Characteristic',
          'Symptom',
          'Therapy or Surgery']

# Read test_results.tsv with pandas, using the labels as column names
data_df = pd.read_table(data_dir, sep="\t", names=labels, encoding="utf-8")

label_test = []
for i in range(data_df.shape[0]):
    # Take the column name of the largest value in the row and append it
    label_test.append(data_df.loc[i, :].idxmax())
