Research Progress in Biomedical Named Entity Recognition (BioNER)

Date: 2019-11-05


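I recently organized the Biomedical Named Entity Recognition (BioNER) papers I had collected into a BioNER Progress repository on GitHub (https://github.com/lingluodlut/BioNER-Progress). It contains a list of representative papers tracing the progress of BioNER, together with state-of-the-art results on the major datasets and the papers that report them. I hope it is useful to newcomers to the field.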

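The paper list opens with survey papers and then follows the development of BioNER research, covering representative work on dictionary-based, rule-based and machine learning-based methods. The machine learning methods are further divided into traditional machine learning models (SVM, HMM, MEMM and CRF) and the neural network methods that are now mainstream.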

BioNER Papers

A paper list for BioNER

Over the past decades, many automatic BioNER methods have been proposed and used to recognise biomedical entities. They can be categorised into dictionary-based, rule-based and machine learning-based methods. Recently, neural network-based machine learning methods have shown promising results.

Survey Papers

Overview of BioCreative II gene mention recognition. Smith L, Tanabe L K, nee Ando R J, et al. Genome biology, 2008, 9(2): S2. [paper]
Biomedical named entity recognition: a survey of machine-learning tools. Campos D, Matos S, Oliveira J L. Theory and Applications for Advanced Text Mining, 2012: 175-195. [paper]
Chemical named entities recognition: a review on approaches and applications. Eltyeb S, Salim N. Journal of cheminformatics, 2014, 6(1): 17. [paper]
CHEMDNER: The drugs and chemical names extraction challenge. Krallinger M, Leitner F, Rabal O, et al. Journal of cheminformatics, 2015, 7(1): S1. [paper]
A comparative study for biomedical named entity recognition. Wang X, Yang C, Guan R. International Journal of Machine Learning and Cybernetics, 2015, 9(3): 373-382. [paper]

Dictionary-based Methods

Using BLAST for identifying gene and protein names in journal articles. Krauthammer M, Rzhetsky A, Morozov P, et al. Gene, 2000, 259(1-2): 245-252. [paper]
Boosting precision and recall of dictionary-based protein name recognition. Tsuruoka Y, Tsujii J. Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine-Volume 13, 2003: 41-48. [paper]
Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature. Yang Z, Lin H, Li Y. Computational Biology and Chemistry, 2008, 32(4): 287-291. [paper]
A dictionary to identify small molecules and drugs in free text. Hettne K M, Stierum R H, Schuemie M J, et al. Bioinformatics, 2009, 25(22): 2983-2991. [paper] [dictionary]
LINNAEUS: a species name identification system for biomedical literature. Gerner M, Nenadic G, Bergman C M. BMC bioinformatics, 2010, 11(1): 85. [paper]
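
As a rough illustration of the dictionary-based idea, the sketch below tags text by longest-match lookup against a lexicon. The lexicon entries, the function name `dictionary_tag` and the simple `\w+` tokenizer are assumptions made for this example and are not taken from any of the systems listed above, which rely on much larger curated resources and more careful matching.

```python
import re

# Hypothetical toy lexicon; real systems use curated resources with far more names.
LEXICON = {
    "aspirin": "Chemical",
    "acetylsalicylic acid": "Chemical",
    "homo sapiens": "Species",
}
MAX_LEN = max(len(name.split()) for name in LEXICON)


def dictionary_tag(text):
    """Return (start, end, surface, type) for the longest lexicon match at each position."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest candidate first so multi-word names win over shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(t[0].lower() for t in tokens[i:i + n])
            if candidate in LEXICON:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((start, end, text[start:end], LEXICON[candidate]))
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches


for match in dictionary_tag("Aspirin (acetylsalicylic acid) was given to Homo sapiens."):
    print(match)
```

Exact lookup like this misses spelling variants and new names, which is one reason later work moved toward rules and learned models.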

Rule-based Methods

Toward information extraction: identifying protein names from biological papers. Fukuda K, Tsunoda T, Tamura A, et al. Pac symp biocomput. 1998, 707(18): 707-718. [paper]
A biological named entity recognizer. Narayanaswamy M, Ravikumar K E, Vijay-Shanker K. Biocomputing 2003. 2002: 427-438. [paper]
ProMiner: rule-based protein and gene entity recognition. Hanisch D, Fundel K, Mevissen H T, et al. BMC bioinformatics, 2005, 6(1): S14. [paper]
MutationFinder: a high-performance system for extracting point mutation mentions from text. Caporaso J G, Baumgartner Jr W A, Randolph D A, et al. Bioinformatics, 2007, 23(14): 1862-1865. [paper] [code]
Drug name recognition and classification in biomedical texts: a case study outlining approaches underpinning automated systems. Segura-Bedmar I, Martínez P, Segura-Bedmar M. Drug discovery today, 2008, 13(17-18): 816-823. [paper]
Investigation of unsupervised pattern learning techniques for bootstrap construction of a medical treatment lexicon. Xu R, Morgan A, Das A K, et al. Proceedings of the workshop on current trends in biomedical natural language processing, 2009: 63-70. [paper]
Linguistic approach for identification of medication names and related information in clinical narratives. Hamon T, Grabar N. Journal of the American Medical Informatics Association, 2010, 17(5): 549-554. [paper]
SETH detects and normalizes genetic variants in text. Thomas P, Rocktäschel T, Hakenberg J, et al. Bioinformatics, 2016, 32(18): 2883-2885. [paper] [code]
PENNER: Pattern-enhanced Nested Named Entity Recognition in Biomedical Literature. Wang X, Zhang Y, Li Q, et al. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2018: 540-547. [paper]
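
To give a feel for the rule-based approach, here is a minimal sketch that uses one regular expression to find point mutations written as wild-type residue, position, mutant residue (for example "V600E"). The pattern and the function name `find_point_mutations` are illustrative assumptions; this is not the pattern set of MutationFinder, SETH or any other system in the list.

```python
import re

# Illustrative pattern only: one-letter amino acid codes around a position number.
# This is NOT the rule set used by any published system.
AA = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AA}])([1-9][0-9]*)([{AA}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples where the two residues differ."""
    results = []
    for m in POINT_MUTATION.finditer(text):
        wild, pos, mutant = m.group(1), int(m.group(2)), m.group(3)
        if wild != mutant:  # a substitution must change the residue
            results.append((wild, pos, mutant))
    return results


print(find_point_mutations("The V600E and G12D substitutions; H2O is not one."))
```

A single pattern like this also matches look-alikes such as the cell line name T47D, which is why practical rule-based systems add further patterns and filters.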

Machine Learning-based Methods

SVM-based Methods

Tuning support vector machines for biomedical named entity recognition. Kazama J, Makino T, Ohta Y, et al. Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain-Volume 3, 2002: 1-8. [paper]
Biomedical named entity recognition using two-phase model based on SVMs. Lee K J, Hwang Y S, Kim S, et al. Journal of Biomedical Informatics, 2004, 37(6): 436-447. [paper]
Exploring deep knowledge resources in biomedical name recognition. GuoDong Z, Jian S. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004: 96-99. [paper]

HMM-based Methods

Named entity recognition in biomedical texts using an HMM model. Zhao S. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 2004: 84-87. [paper]
Annotation of chemical named entities. Corbett P, Batchelor C, Teufel S. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, 2007: 57-64. [paper]
Conditional random fields vs. hidden markov models in a biomedical named entity recognition task. Ponomareva N, Rosso P, Pla F, et al. Proc. of Int. Conf. Recent Advances in Natural Language Processing, RANLP. 2007, 479: 483. [paper]

MEMM-based Methods

Cascaded classifiers for confidence-based chemical named entity recognition. Corbett P, Copestake A. BMC bioinformatics, 2008, 9(11): S4. [paper]
OSCAR4: a flexible architecture for chemical text-mining. Jessop D M, Adams S E, Willighagen E L, et al. Journal of cheminformatics, 2011, 3(1): 41. [paper]

CRF-based Methods

ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Settles B. Bioinformatics, 2005, 21(14): 3191-3192. [paper]
BANNER: an executable survey of advances in biomedical named entity recognition. Leaman R, Gonzalez G. Biocomputing 2008. 2008: 652-663. [paper]
Detection of IUPAC and IUPAC-like chemical names. Klinger R, Kolářik C, Fluck J, et al. Bioinformatics, 2008, 24(13): i268-i276. [paper]
Incorporating rich background knowledge for gene named entity classification and recognition. Li Y, Lin H, Yang Z. BMC bioinformatics, 2009, 10(1): 223. [paper]
A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. Jiang M, Chen Y, Liu M, et al. Journal of the American Medical Informatics Association, 2011, 18(5): 601-606. [paper]
ChemSpot: a hybrid system for chemical named entity recognition. Rocktäschel T, Weidlich M, Leser U. Bioinformatics, 2012, 28(12): 1633-1640. [paper]
Gimli: open source and high-performance biomedical name recognition. Campos D, Matos S, Oliveira J L. BMC bioinformatics, 2013, 14(1): 54. [paper]
tmVar: a text mining approach for extracting sequence variants in biomedical literature. Wei C H, Harris B R, Kao H Y, et al. Bioinformatics, 2013, 29(11): 1433-1439. [paper] [code]
Evaluating word representation features in biomedical named entity recognition tasks. Tang B, Cao H, Wang X, et al. BioMed research international, 2014, 2014. [paper]
Drug name recognition in biomedical texts: a machine-learning-based method. He L, Yang Z, Lin H, et al. Drug discovery today, 2014, 19(5): 610-617. [paper]
tmChem: a high performance approach for chemical named entity recognition and normalization. Leaman R, Wei C H, Lu Z. Journal of cheminformatics, 2015, 7(1): S3. [paper]
GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Wei C H, Kao H Y, Lu Z. BioMed research international, 2015, 2015. [paper]
Mining chemical patents with an ensemble of open systems. Leaman R, Wei C H, Zou C, et al. Database, 2016, 2016. [paper]
nala: text mining natural language mutation mentions. Cejuela J M, Bojchevski A, Uhlig C, et al. Bioinformatics, 2017, 33(12): 1852-1858. [paper]

Neural Network-based Methods

Recurrent neural network models for disease name recognition using domain invariant features. Sahu S, Anand A. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 2216-2225. [paper]
Deep learning with word embeddings improves biomedical named entity recognition. Habibi M, Weber L, Neves M, et al. Bioinformatics, 2017, 33(14): i37-i48. [paper]
A neural joint model for entity and relation extraction from biomedical text. Li F, Zhang M, Fu G, et al. BMC bioinformatics, 2017, 18(1): 198. [paper]
A neural network multi-task learning approach to biomedical named entity recognition. Crichton G, Pyysalo S, Chiu B, et al. BMC bioinformatics, 2017, 18(1): 368. [paper] [code]
Disease named entity recognition from biomedical literature using a novel convolutional neural network. Zhao Z, Yang Z, Luo L, et al. BMC medical genomics, 2017, 10(5): 73. [paper]
An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Luo L, Yang Z, Yang P, et al. Bioinformatics, 2018, 34(8): 1381-1388. [paper] [code]
GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text. Zhu Q, Li X, Conesa A, et al. Bioinformatics, 2018, 34(9): 1547-1554. [paper] [code]
D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information. Dang T H, Le H Q, Nguyen T M, et al. Bioinformatics, 2018, 34(20): 3539-3546. [paper] [code]
Transfer learning for biomedical named entity recognition with neural networks. Giorgi J M, Bader G D. Bioinformatics, 2018, 34(23): 4087-4094. [paper]
Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition. Wang Z, Qu Y, Chen L, et al. NAACL. 2018: 1-15. [paper]
Recognizing irregular entities in biomedical text via deep neural networks. Li F, Zhang M, Tian B, et al. Pattern Recognition Letters, 2018, 105: 105-113. [paper]
Cross-type biomedical named entity recognition with deep multi-task learning. Wang X, Zhang Y, Ren X, et al. Bioinformatics, 2019, 35(10): 1745-1752. [paper] [code]
Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Zhai Z, Nguyen D Q, Akhondi S, et al. Proceedings of the 18th BioNLP Workshop and Shared Task. 2019: 328-338. [paper] [code]
Chinese Clinical Named Entity Recognition Using Residual Dilated Convolutional Neural Network with Conditional Random Field. Qiu J, Zhou Y, Wang Q, et al. IEEE Transactions on NanoBioscience, 2019, 18(3): 306-315. [paper]
A Neural Multi-Task Learning Framework to Jointly Model Medical Named Entity Recognition and Normalization. Zhao S, Liu T, Zhao S, et al. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 817-824. [paper]
CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. Yoon W, So C H, Lee J, et al. BMC bioinformatics, 2019, 20(10): 249. [paper] [code]
BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Lee J, Yoon W, Kim S, et al. Bioinformatics, Advance article, 2019. [paper] [code]
HUNER: Improving Biomedical NER with Pretraining. Weber L, Münchmeyer J, Rocktäschel T, et al. Bioinformatics, Advance article, 2019. [paper] [code]
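
Sequence-labelling models such as the CRF and BiLSTM-CRF systems above typically predict one tag per token, commonly in a BIO scheme (B- begins an entity, I- continues it, O is outside any entity). The sketch below shows how such tags can be turned back into entity spans. The function name `bio_to_spans` and the lenient handling of a stray I- tag are assumptions of this example, not the behaviour of any particular system in the list.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags to (start, end_exclusive, type) token spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last open span
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                spans.append((start, i, current))
                start, current = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # An I- tag with no matching open span starts a new span (lenient choice).
                start, current = i, tag[2:]
    return spans


tokens = ["Treated", "acetylsalicylic", "acid", "for", "migraine"]
tags = ["O", "B-Chemical", "I-Chemical", "O", "B-Disease"]
for start, end, entity_type in bio_to_spans(tags):
    print(entity_type, " ".join(tokens[start:end]))
```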

Others

TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Leaman R, Lu Z. Bioinformatics, 2016, 32(18): 2839-2846. [paper] [code]
A transition-based joint model for disease named entity recognition and normalization. Lou Y, Zhang Y, Qian T, et al. Bioinformatics, 2017, 33(15): 2363-2371. [paper] [code]

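In addition, state-of-the-art results on the major datasets are summarized below, grouped by entity type: chemicals and drugs (Chemical), diseases (Disease), genes and proteins (Gene/Protein), gene mutations (Mutation) and species (Species).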

Chemical NER

CHEMDNER

The CHEMDNER (chemical compound and drug name recognition) task, part of the BioCreative IV challenge, aims to promote the development of systems for the automatic recognition of chemical entities in text. It was divided into two tasks: one covered the indexing of documents with chemicals (chemical document indexing, the CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition, the CEM task). Here, we focus only on the CEM task.

The CHEMDNER corpus consists of 10,000 PubMed abstracts containing a total of 84,355 chemical entity mentions. The original corpus is divided into a training set (3,500 abstracts), a development set (3,500 abstracts) and a test set (3,000 abstracts).

Method	P	R	F1	Paper
tmChem (Leaman et al., 2015)	89.09	85.75	87.39	tmChem: a high performance approach for chemical named entity recognition and normalization
CRF (Lu et al., 2015)	88.73	87.41	88.06	CHEMDNER system with mixed conditional random fields and multi-scale word clustering
BiLSTM-CRF (Lample et al., 2016), Luo et al. (2018) rebuilt the model on the dataset	91.31	87.73	89.48	Neural architectures for named entity recognition
Att-BiLSTM-CRF (Luo et al., 2018)	92.29	90.01	91.14	An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition
MTM-CW (Wang et al., 2019)	91.30	87.53	89.37	Cross-type biomedical named entity recognition with deep multi-task learning
BioBERT (Lee et al., 2019)	92.80	91.92	92.36	BioBERT: a pre-trained biomedical language representation model for biomedical text mining
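
In these tables P, R and F1 are precision, recall and F1 score in percent, where F1 is the harmonic mean of P and R. As a quick sanity check, the sketch below recomputes F1 for three rows copied from the CHEMDNER table above; the helper name `f1_score` is just a label chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, P, R, reported F1) copied from the CHEMDNER table above.
rows = [
    ("tmChem", 89.09, 85.75, 87.39),
    ("Att-BiLSTM-CRF", 92.29, 90.01, 91.14),
    ("BioBERT", 92.80, 91.92, 92.36),
]
for method, p, r, reported in rows:
    print(f"{method}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

Small differences in the last digit can appear when a paper rounds P and R before reporting them.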

CDR-Chemical

The CDR (chemical disease relation) task, part of the BioCreative V challenge, aims to automatically extract CDRs from the literature. The CDR corpus consists of 1,500 PubMed abstracts annotated with chemicals, diseases and chemical-disease interactions, and contains a total of 15,933 chemical entity mentions. The original corpus is separated into a training set (500 abstracts), a development set (500 abstracts) and a test set (500 abstracts).

Method	P	R	F1	Paper
TaggerOne (Leaman and Lu, 2016)	94.20	88.80	91.40	TaggerOne: joint named entity recognition and normalization with semi-Markov Models
BiLSTM-CRF (Lample et al., 2016), Luo et al. (2018) rebuilt the model on the dataset	92.82	88.52	90.62	Neural architectures for named entity recognition
Att-BiLSTM-CRF (Luo et al., 2018)	93.49	91.68	92.57	An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition
D3NER (Dang et al., 2018)	93.73	92.56	93.14	D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information
CollaboNet (Yoon et al., 2018)	94.26	92.38	93.31	CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition
ELMo (Peng et al., 2019)	-	-	91.5	Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
BERT (Peng et al., 2019)	-	-	93.5	Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
BioBERT (Lee et al., 2019)	93.68	93.26	93.47	BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Disease NER

NCBI-Disease

The NCBI Disease corpus consists of 793 PubMed abstracts separated into training (593), development (100) and test (100) subsets. It contains a total of 6,892 disease entity mentions.

Method	P	R	F1	Paper
DNorm (Leaman et al., 2015)	82.2	77.5	79.8	DNorm: Disease name normalization with pairwise learning to rank
TaggerOne (Leaman and Lu, 2016)	85.1	80.8	82.9	TaggerOne: joint named entity recognition and normalization with semi-Markov Models
MCNN (Zhao et al., 2017)	85.08	85.26	85.17	Disease named entity recognition from biomedical literature using a novel convolutional neural network
BiLSTM-CRF (Lample et al., 2016), Wang et al. (2019) rebuilt the model on the dataset	86.11	85.49	85.80	Neural architectures for named entity recognition
CollaboNet (Yoon et al., 2018)	85.61	82.61	84.08	CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition
GRAM-CNN (Zhu et al., 2018)	86.46	88.07	87.26	GRAM-CNN: A deep learning approach with local context for named entity recognition in biomedical text
MTM-CW (Wang et al., 2019)	85.86	86.42	86.14	Cross-type biomedical named entity recognition with deep multi-task learning
Dic-Att-BiLSTM-CRF (Xu et al., 2019)	88.3	89.0	88.6	Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition
BioBERT (Lee et al., 2019)	88.22	91.25	89.71	BioBERT: a pre-trained biomedical language representation model for biomedical text mining

CDR-Disease

The CDR (chemical disease relation) task, part of the BioCreative V challenge, aims to automatically extract CDRs from the literature. The CDR corpus consists of 1,500 PubMed abstracts annotated with chemicals, diseases and chemical-disease interactions, and contains a total of 12,864 disease entity mentions. The original corpus is separated into a training set (500 abstracts), a development set (500 abstracts) and a test set (500 abstracts).

Method	P	R	F1	Paper
HITSZ_CDR (Li et al., 2016)	88.68	85.23	86.93	HITSZ_CDR: an end-to-end chemical and disease relation extraction system for BioCreative V
TaggerOne (Leaman and Lu, 2016)	85.2	80.2	82.6	TaggerOne: joint named entity recognition and normalization with semi-Markov Models
BiLSTM-CRF (Lample et al., 2016), Wang et al. (2019) rebuilt the model on the dataset	87.60	86.25	86.92	Neural architectures for named entity recognition
MCNN (Zhao et al., 2017)	88.20	87.46	87.83	Disease named entity recognition from biomedical literature using a novel convolutional neural network
Transition-based joint model (Lou et al., 2017)	89.61	83.09	86.23	A Transition-based Joint Model for Disease Named Entity Recognition and Normalization
CollaboNet (Yoon et al., 2018)	85.61	82.61	84.08	CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition
MTM-CW (Wang et al., 2019)	89.10	88.47	88.78	Cross-type biomedical named entity recognition with deep multi-task learning
Dic-Att-BiLSTM-CRF (Xu et al., 2019)	89.1	87.5	88.3	Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition
BERT (Peng et al., 2019)	-	-	86.6	Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
BioBERT (Lee et al., 2019)	86.47	87.84	87.15	BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Gene/Protein NER

BC2GM

The Gene Mention Tagging task, part of the BioCreative II challenge, is concerned with extracting mentions of genes and gene products from text. The BC2GM corpus contains a total of 24,583 gene entity mentions.

Method	P	R	F1	Paper
BiLSTM-CRF (Lample et al., 2016), Wang et al. (2019) rebuilt the model on the dataset	81.57	79.48	80.51	Neural architectures for named entity recognition
CollaboNet (Yoon et al., 2018)	80.49	78.99	79.73	CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition
BiLM-NER (Sachan et al., 2018)	81.81	81.57	81.69	Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
MTM-CW (Wang et al., 2019)	82.10	79.42	80.74	Cross-type biomedical named entity recognition with deep multi-task learning
BioBERT (Lee et al., 2019)	84.32	85.12	84.72	BioBERT: a pre-trained biomedical language representation model for biomedical text mining

JNLPBA

The JNLPBA corpus contains 2,404 abstracts extracted from MEDLINE using the MeSH terms "human", "blood cell" and "transcription factor". The manual annotation of these abstracts was based on five classes of the GENIA ontology, namely protein, DNA, RNA, cell line, and cell type. This corpus was used in the Bio-Entity Recognition Task in BioNLP/NLPBA 2004, providing 2,000 abstracts for training and the remaining 404 abstracts for testing. The overall results are shown in the following table.

Method	P	R	F1	Paper
SVM (Zhou and Su, 2004)	69.42	75.99	72.55	Exploring Deep Knowledge Resources in Biomedical Name Recognition
CRF_NERBio (Tsai et al., 2006)	72.01	73.98	72.98	NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition
BiLSTM-CRF (Lample et al., 2016), Wang et al. (2019) rebuilt the model on the dataset	71.35	75.74	73.48	Neural architectures for named entity recognition
BiLM-NER (Sachan et al., 2018)	71.39	79.06	75.03	Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
CollaboNet (Yoon et al., 2018)	74.43	83.22	78.58	CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition
MTM-CW (Wang et al., 2019)	70.91	76.34	73.52	Cross-type biomedical named entity recognition with deep multi-task learning
BioBERT (Lee et al., 2019)	72.24	83.56	77.49	BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Mutation NER

MutationFinder corpus and tmVar corpus

The MutationFinder corpus was established to guide the construction of the patterns. The development data set is made up of 605 point mutation mentions in 305 abstracts selected randomly from primary citations in PDB. The evaluation data set is made up of 910 point mutation mentions in 508 abstracts annotated by two of the authors, not involved in the development of the system.

The tmVar corpus comprises 500 manually annotated abstracts, of which 334 were used for training tmVar and the remaining 166 for testing it.

Method	MF-P	MF-R	MF-F1	tmVar-P	tmVar-R	tmVar-F1	Paper
MutationFinder (Caporaso et al., 2007)	98.41	81.92	89.41	89.66	69.15	78.08	MutationFinder: A high-performance system for extracting point mutation mentions from text
tmVar (Wei et al., 2013)	98.80	89.62	93.98	91.38	91.40	91.39	tmVar: a text mining approach for extracting sequence variants in biomedical literature
SETH (Thomas et al., 2016)	98	82	89	95	77	85	SETH detects and normalizes genetic variants in text
Character-based network (Thomas et al., 2018)	-	-	-	88.1	86.6	87.4	Recognition of genetic mutations in text using Deep Learning

Species NER

LINNAEUS corpus

The LINNAEUS corpus is a set of open access documents in text format, manually annotated for species mentions. It consists of 100 full-text documents from the PMC OA document set and contains a total of 4,259 species entity mentions.

Method	P	R	F1	Paper
LINNAEUS (Gerner et al., 2010)	97.07	94.28	95.65	LINNAEUS: A species name identification system for biomedical literature
SR4GN (Wei et al., 2012)	86	85	86	SR4GN: A Species Recognition Software Tool for Gene Normalization
BiLSTM-CRF (Habibi et al., 2017)	-	-	94.03	Deep learning with word embeddings improves biomedical named entity recognition
Transfer learning-based model (Giorgi and Bader, 2018)	92.80	94.29	93.54	Transfer learning for biomedical named entity recognition with neural networks
BioBERT (Lee et al., 2019)	93.84	86.11	89.81	BioBERT: a pre-trained biomedical language representation model for biomedical text mining