Notes on Few-shot Learning for Named Entity Recognition in Medical Text

Date: 2020-09-06



1. Summary

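This paper studies named entity recognition on several electronic health record (EHR) datasets. Building on other related datasets, it performs few-shot learning with only 10 examples sampled from the target dataset, and proposes five methods (tricks) for improving performance:
(1) layer-wise initialization with pre-trained weights
(2) hyperparameter tuning
(3) combining pre-training data
(4) custom word embeddings
(5) optimizing out-of-vocabulary (OOV) words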
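
To make trick (1) concrete, here is a minimal, framework-free sketch of layer-wise initialization. The paper compares four settings: pre-trained weights for all layers, for the BLSTM only, for all layers except the BLSTM, and for none. The layer names (`char_cnn`, `word_embedding`, `blstm`, `output`), the `CONFIGS` table, and the helper `init_from_pretrained` are hypothetical names chosen for illustration, and weights are plain lists rather than real tensors.

```python
from typing import Dict, Iterable, List

Weights = Dict[str, List[float]]

# Hypothetical layer names for a BLSTM-CNNs style tagger (an assumption).
LAYERS = ["char_cnn", "word_embedding", "blstm", "output"]

# The four settings compared in the paper, expressed as "which layers to copy".
CONFIGS = {
    "all_layers": list(LAYERS),
    "blstm_only": ["blstm"],
    "all_except_blstm": [name for name in LAYERS if name != "blstm"],
    "no_pretraining": [],
}


def init_from_pretrained(
    fresh: Weights, pretrained: Weights, layers_to_copy: Iterable[str]
) -> Weights:
    """Return a copy of `fresh` where the selected layers take pre-trained values."""
    initialized = {name: list(values) for name, values in fresh.items()}
    for name in layers_to_copy:
        # Only copy layers that exist in both models.
        if name in pretrained and name in initialized:
            initialized[name] = list(pretrained[name])
    return initialized


if __name__ == "__main__":
    fresh = {name: [0.0, 0.0] for name in LAYERS}
    pretrained = {name: [1.0, 1.0] for name in LAYERS}
    for setting, layers in CONFIGS.items():
        print(setting, init_from_pretrained(fresh, pretrained, layers))
```

In a real model the output layer may not be copyable when the source and target datasets use different label sets, so a shape check would be needed there.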

2. Content

The datasets used in the paper are mainly medical-domain datasets, plus the CoNLL-2003 English newswire dataset.

The baseline model is the BLSTM-CNNs architecture proposed by J. Chiu et al. Its highlight is that it concatenates character, word, and casing embeddings. The casing embedding covers the categories numeric, allLower, allUpper, mainly_numeric (more than 50% of characters of a word are numeric), initialUpper, contains_digit, padding, and other.
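
The casing categories above can be computed with a small lookup function. The sketch below is one plausible reading, not the paper's code: the order in which the rules are checked, the padding token `<PAD>`, and the names `casing_label` and `CASING_TO_ID` are assumptions made for illustration.

```python
# Category names follow the list in the notes; their order here is arbitrary.
CASING_LABELS = [
    "padding", "other", "numeric", "mainly_numeric",
    "allLower", "allUpper", "initialUpper", "contains_digit",
]
CASING_TO_ID = {label: index for index, label in enumerate(CASING_LABELS)}

PAD_TOKEN = "<PAD>"  # hypothetical padding token


def casing_label(word: str) -> str:
    """Map a token to one casing category (rule order is an assumption)."""
    if word == PAD_TOKEN:
        return "padding"
    if not word:
        return "other"
    num_digits = sum(ch.isdigit() for ch in word)
    if num_digits == len(word):
        return "numeric"
    if num_digits / len(word) > 0.5:
        return "mainly_numeric"
    if word.islower():
        return "allLower"
    if word.isupper():
        return "allUpper"
    if word[0].isupper():
        return "initialUpper"
    if num_digits > 0:
        return "contains_digit"
    return "other"


if __name__ == "__main__":
    for token in ["2020", "12a", "patient", "MRI", "Aspirin", "b12X", "-", PAD_TOKEN]:
        label = casing_label(token)
        print(token, label, CASING_TO_ID[label])
```

The resulting integer id would index a small embedding table whose output is concatenated with the character and word embeddings.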

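The five performance-improving tricks are as follows:
(1) Single pre-training: pre-train on each of the other datasets individually, comparing four settings: pre-trained weights for all layers, for the BLSTM only, for all layers except the BLSTM, and no pre-trained weights.
(2) Hyperparameter tuning: covers the optimizer, the pre-training dataset, the SGD learning rate, batch normalization (used or not), word embeddings (trainable or not), and learning rate decay (constant or time scheduled).
(3) Combined pre-training: chain several datasets to pre-train the model, then load the resulting weights when training on the target dataset.
(4) Customized word embeddings: either use GloVe word embeddings or train new ones with FastText on medical datasets.
(5) Optimizing OOV words: remove trailing ":", ";", "." and "-"; remove quotations; remove leading "+".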
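
The OOV clean-up rules in trick (5) amount to simple string normalization before the vocabulary lookup. The following is a rough sketch under assumptions: the function names `normalize_oov` and `resolve_token`, the `<UNK>` fallback token, and the exact set of characters treated as quotations are all illustrative choices, not taken from the paper.

```python
from typing import Set

QUOTE_CHARS = "\"'“”‘’"   # assumed set of quotation characters
TRAILING_CHARS = ":;.-"    # trailing characters named in the notes
UNK_TOKEN = "<UNK>"        # hypothetical unknown-word token


def normalize_oov(token: str) -> str:
    """Apply the three clean-up rules; keep the original if nothing is left."""
    cleaned = token
    for quote in QUOTE_CHARS:
        cleaned = cleaned.replace(quote, "")   # remove quotations
    cleaned = cleaned.lstrip("+")              # remove leading "+"
    cleaned = cleaned.rstrip(TRAILING_CHARS)   # remove trailing ":", ";", ".", "-"
    return cleaned or token


def resolve_token(token: str, vocab: Set[str]) -> str:
    """Look up a token, retrying with the normalized form before giving up."""
    if token in vocab:
        return token
    cleaned = normalize_oov(token)
    if cleaned in vocab:
        return cleaned
    return UNK_TOKEN


if __name__ == "__main__":
    vocab = {"dose", "aspirin", "fever"}
    for raw in ["dose:", "+aspirin", '"fever".', "xyz-"]:
        print(raw, "->", resolve_token(raw, vocab))
```

Normalizing only after a failed lookup leaves in-vocabulary tokens untouched. Removing every apostrophe also alters possessives such as "patient's", so a real pipeline might restrict that rule to leading and trailing quotes.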

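Results for the five tricks:
(1) Single pre-training: F1 score improves by 4.52%.
(2) Hyperparameter tuning: the choice of optimizer matters most (NAdam >> SGD), followed by the choice of pre-training dataset (+2.34%).
(3) Combined pre-training: chaining multiple datasets for pre-training has a negative effect (-1.85%).
(4) Customizing word embeddings: self-trained word embeddings give an improvement of 3.78%.
(5) Optimizing OOV words: an improvement of 0.87%.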
