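This article gives a brief introduction to named entity recognition (NER) in natural language processing (NLP).
Named entity recognition (NER) is a basic building block for applications such as information extraction, question answering, syntactic parsing, and machine translation, and it plays an important role in making natural language processing practical. Broadly speaking, the task is to identify the named entities in a text that fall into three main classes (entity, time, and number) and seven subtypes (person, organization, location, time, date, money, and percentage).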
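As a simple example, running NER on the sentence 「小明早上8點去學校上課。」 ("Xiao Ming goes to school for class at 8 in the morning.") should extract the following information:
Person: 小明 (Xiao Ming); Time: 早上8點 (8 in the morning); Location: 學校 (school).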
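As a rough illustration, the category scheme and the expected result for the example sentence can be written down as plain Python data. The names `ENTITY_CLASSES` and `top_level_class`, and the label strings themselves, are choices made for this sketch and do not come from any library.

```python
# Three top-level classes and seven subtypes, as described in the text.
# The label strings are illustrative choices for this sketch.
ENTITY_CLASSES = {
    "ENTITY": ["PERSON", "ORGANIZATION", "LOCATION"],
    "TIME": ["TIME", "DATE"],
    "NUMBER": ["MONEY", "PERCENT"],
}


def top_level_class(label):
    """Return the top-level class that a subtype label belongs to."""
    for top, subtypes in ENTITY_CLASSES.items():
        if label in subtypes:
            return top
    raise KeyError(f"unknown entity label: {label}")


# Expected entities for the example sentence, written by hand.
sentence = "小明早上8點去學校上課。"
entities = [("小明", "PERSON"), ("早上8點", "TIME"), ("學校", "LOCATION")]

for text, label in entities:
    # Every entity string should be a substring of the sentence.
    assert text in sentence
    print(f"{text}\t{label}\t{top_level_class(label)}")
```

A real NER system has to produce the `entities` list on its own; here it is filled in manually only to show the shape of the result.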
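This article introduces a few tools for named entity recognition. Later, if the opportunity arises, we will try implementing NER with HMMs, CRFs, or deep learning.
First, let's look at the named entity categories used by NLTK and Stanford NLP, shown in the figure below: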
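In the figure above, LOCATION and GPE overlap. GPE generally denotes geo-political entities such as cities, states, countries, and continents. LOCATION covers these and can also denote natural features such as famous mountains and rivers. FACILITY generally denotes well-known monuments or man-made artifacts.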
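Because GPE and LOCATION overlap, it can help to fold one into the other before comparing labels from different tools. The sketch below does this with a helper named `normalize_label`, which is invented for this example; the (name, label) pairs are hand-written samples in NLTK's label style.

```python
def normalize_label(label):
    """Fold GPE into LOCATION; leave every other label unchanged."""
    return "LOCATION" if label == "GPE" else label


# Hand-written (entity, label) pairs using NLTK-style labels.
nltk_style = [
    ("Belgium", "GPE"),
    ("Caribbean", "LOCATION"),
    ("FIFA", "ORGANIZATION"),
]

normalized = [(name, normalize_label(label)) for name, label in nltk_style]
print(normalized)
# [('Belgium', 'LOCATION'), ('Caribbean', 'LOCATION'), ('FIFA', 'ORGANIZATION')]
```

Folding GPE into LOCATION loses the geo-political distinction, so this is only appropriate when the coarser label is enough.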
Below, two tools are used for the NER task: NLTK and Stanford NLP.
We start with NLTK. Our sample document (an introduction to FIFA, taken from Wikipedia) is as follows:
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
The Python code that implements NER is as follows:
import re

import nltk
import pandas as pd


def parse_document(document):
    # validate the input before doing any string processing
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # collapse newlines, then split the text into trimmed sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]


# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

# tokenize sentences
sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# tag sentences and use nltk's Named Entity Chunker
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]

# extract all named entities
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # extract only chunks having NE labels (plain tokens are tuples)
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())  # get NE name
            entity_type = tagged_tree.label()  # get NE category
            named_entities.append((entity_name, entity_type))

# get unique named entities
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)
The output is as follows:
        Entity Name   Entity Type
0              FIFA  ORGANIZATION
1   Central America  ORGANIZATION
2           Belgium           GPE
3         Caribbean      LOCATION
4              Asia           GPE
5            France           GPE
6           Oceania           GPE
7           Germany           GPE
8     South America           GPE
9           Denmark           GPE
10           Zürich           GPE
11           Africa        PERSON
12           Sweden           GPE
13      Netherlands           GPE
14            Spain           GPE
15      Switzerland           GPE
16            North           GPE
17           Europe           GPE
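As we can see, NLTK handles the NER task reasonably well overall: it identifies FIFA as an ORGANIZATION and Belgium and Asia as GPE. Some results are less satisfactory, though. For example, it labels Central America as ORGANIZATION when it should be GPE, and it labels Africa as PERSON when it should also be GPE.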
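Next, we try the Stanford NLP tools, mainly the Stanford NER tagger. Before using it, you need to install Java (typically the JDK) on your machine and add Java to the system path. You also need to download the English NER package, stanford-ner-2018-10-16.zip (172 MB), from https://nlp.stanford.edu/software/CRF-NER.shtml. On the author's machine, for example, Java is located at C:\Program Files\Java\jdk1.8.0_161\bin\java.exe, and the folder extracted from the Stanford NER zip file is at E://stanford-ner-2018-10-16, as shown in the figure below:
The classifiers folder contains the following files:
Their meanings are as follows: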
3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time
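When picking a classifier, one simple rule is to use the smallest label set that still covers the entity types you need. The sketch below encodes the three label sets listed above; `MODEL_LABELS` and `smallest_model_for` are names invented for this example, and the keys are just the class counts rather than actual model file names.

```python
# Label sets of the three English models, keyed by their class count.
MODEL_LABELS = {
    3: {"LOCATION", "PERSON", "ORGANIZATION"},
    4: {"LOCATION", "PERSON", "ORGANIZATION", "MISC"},
    7: {"LOCATION", "PERSON", "ORGANIZATION",
        "MONEY", "PERCENT", "DATE", "TIME"},
}


def smallest_model_for(required):
    """Return the smallest class count whose label set covers `required`."""
    required = set(required)
    for n_classes in sorted(MODEL_LABELS):
        if required <= MODEL_LABELS[n_classes]:
            return n_classes
    return None  # no single model covers all requested labels


print(smallest_model_for({"PERSON", "LOCATION"}))  # 3
print(smallest_model_for({"DATE", "LOCATION"}))    # 7
print(smallest_model_for({"MISC", "DATE"}))        # None
```

Note that the 7-class set is not a superset of the 4-class set: MISC appears only in the 4-class model, which is why the last call finds no match.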
Stanford NER can be used from Python. The complete code is as follows:
import os
import re

import nltk
import pandas as pd
from nltk.tag import StanfordNERTagger


def parse_document(document):
    # validate the input before doing any string processing
    if not isinstance(document, str):
        raise ValueError('Document is not string!')
    # collapse newlines, then split the text into trimmed sentences
    document = re.sub('\n', ' ', document).strip()
    sentences = nltk.sent_tokenize(document)
    return [sentence.strip() for sentence in sentences]


# sample document
text = """
FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. Member countries must each also be members of one of
the six regional confederations into which the world is divided: Africa, Asia, Europe, North & Central America
and the Caribbean, Oceania, and South America.
"""

sentences = parse_document(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# set java path in environment variables
java_path = r'C:\Program Files\Java\jdk1.8.0_161\bin\java.exe'
os.environ['JAVAHOME'] = java_path

# load stanford NER
sn = StanfordNERTagger(
    'E://stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz',
    path_to_jar='E://stanford-ner-2018-10-16/stanford-ner.jar')

# tag sentences
ne_annotated_sentences = [sn.tag(sent) for sent in tokenized_sentences]

# extract named entities
named_entities = []
for sentence in ne_annotated_sentences:
    temp_entity_name = ''
    temp_named_entity = None
    for term, tag in sentence:
        # get terms with NE tags
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()  # get NE name
            temp_named_entity = (temp_entity_name, tag)  # get NE and its category
        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # keep an entity that runs up to the end of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)

# get unique named entities
named_entities = list(set(named_entities))

# store named entities in a data frame
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])

# display results
print(entity_frame)
The output is as follows:
                Entity Name   Entity Type
0                      1904          DATE
1                   Denmark      LOCATION
2                     Spain      LOCATION
3   North & Central America  ORGANIZATION
4             South America      LOCATION
5                   Belgium      LOCATION
6                    Zürich      LOCATION
7           the Netherlands      LOCATION
8                    France      LOCATION
9                 Caribbean      LOCATION
10                   Sweden      LOCATION
11                  Oceania      LOCATION
12                     Asia      LOCATION
13                     FIFA  ORGANIZATION
14                   Europe      LOCATION
15                   Africa      LOCATION
16              Switzerland      LOCATION
17                  Germany      LOCATION
As we can see, Stanford NER performs better on this document: it labels Africa as LOCATION and 1904 as DATE (which NLTK did not detect at all). It still gets North & Central America wrong, however, labeling it as ORGANIZATION.
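The extraction loop in the script joins every run of consecutive non-O tokens into one entity, whatever their tags. A slightly stricter variant, sketched here with `itertools.groupby`, starts a new entity whenever the tag changes. The (token, tag) pairs are written by hand to mimic the format of the tagger's output; they are not real tagger output, and `group_entities` is a name invented for this example.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge consecutive tokens that share the same non-'O' tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


# Hand-written (token, tag) pairs in the (token, tag) format.
tagged = [
    ("FIFA", "ORGANIZATION"), ("was", "O"), ("founded", "O"), ("in", "O"),
    ("1904", "DATE"), (".", "O"), ("South", "LOCATION"), ("America", "LOCATION"),
]

print(group_entities(tagged))
# [('FIFA', 'ORGANIZATION'), ('1904', 'DATE'), ('South America', 'LOCATION')]
```

This version also keeps an entity that ends the token list, but two distinct entities with the same tag that sit directly next to each other would still be merged into one.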
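Note that this does not mean Stanford NER will always outperform NLTK's NER. The two may differ in their target text, training corpora, and algorithms, so choose a tool according to your own needs.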
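To make the comparison concrete, the two result tables can be compared programmatically. The sketch below copies the entities from the two outputs into dictionaries, folds NLTK's GPE label into LOCATION (an assumption made here so that the label sets line up), and lists where the tools disagree. All variable names are invented for this example.

```python
# Entities copied from the NLTK output table.
nltk_out = {name: "GPE" for name in [
    "Belgium", "Asia", "France", "Oceania", "Germany", "South America",
    "Denmark", "Zürich", "Sweden", "Netherlands", "Spain", "Switzerland",
    "North", "Europe"]}
nltk_out.update({"FIFA": "ORGANIZATION", "Central America": "ORGANIZATION",
                 "Caribbean": "LOCATION", "Africa": "PERSON"})

# Entities copied from the Stanford NER output table.
stanford_out = {name: "LOCATION" for name in [
    "Denmark", "Spain", "South America", "Belgium", "Zürich",
    "the Netherlands", "France", "Caribbean", "Sweden", "Oceania", "Asia",
    "Europe", "Africa", "Switzerland", "Germany"]}
stanford_out.update({"1904": "DATE", "FIFA": "ORGANIZATION",
                     "North & Central America": "ORGANIZATION"})


def normalize(label):
    """Assumption for this sketch: treat GPE as LOCATION."""
    return "LOCATION" if label == "GPE" else label


shared = sorted(set(nltk_out) & set(stanford_out))
disagreements = [
    (name, nltk_out[name], stanford_out[name])
    for name in shared
    if normalize(nltk_out[name]) != stanford_out[name]
]
only_nltk = sorted(set(nltk_out) - set(stanford_out))
only_stanford = sorted(set(stanford_out) - set(nltk_out))

print(len(shared))      # 15
print(disagreements)    # [('Africa', 'PERSON', 'LOCATION')]
print(only_nltk)        # ['Central America', 'Netherlands', 'North']
print(only_stanford)    # ['1904', 'North & Central America', 'the Netherlands']
```

On this document the tools mostly differ in entity boundaries ("Netherlands" versus "the Netherlands", "North" and "Central America" versus "North & Central America") rather than in labels, which an exact-match comparison like this one reports as entities unique to each tool.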
That concludes this post. If the opportunity arises, we will later try implementing named entity recognition with HMMs, CRFs, or deep learning.