Chinese Natural Language Processing with Python and Stanford CoreNLP

Date: 2019-11-09


Test environment: Windows 7 / Python 3.6.1 / CoreNLP 3.7.0

1. Download CoreNLP

Download the latest model files from the Stanford NLP website:

Full CoreNLP package, stanford-corenlp-full-2016-10-31.zip: download it and unzip it into a working directory.
Chinese models, stanford-chinese-corenlp-2016-10-31-models.jar: download it and copy it into the same working directory.
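
Before going further, it can help to confirm that both downloads ended up in the same place. The following is a minimal sketch rather than part of the original setup; the helper name `missing_files` is an assumption made for illustration, and the directory you pass in is wherever you unzipped the package.

```python
from pathlib import Path

# The Chinese models jar named above; it must sit next to the CoreNLP jars.
CHINESE_MODELS_JAR = "stanford-chinese-corenlp-2016-10-31-models.jar"


def missing_files(workdir, required=(CHINESE_MODELS_JAR,)):
    """Return the names in `required` that are not present in `workdir`."""
    root = Path(workdir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_files` on the working directory returns an empty list when the models jar is in place, and the jar's name otherwise.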

2. Install stanza

stanza is the newest official Python interface to Stanford CoreNLP.

According to an answer by StanfordNLPHelp on Stack Overflow, Python users are advised to use stanza rather than the nltk interface:

If you want to use our tools in Python, I would recommend using the Stanford CoreNLP 3.7.0 server and making small server requests (or using the stanza library).

If you use nltk what I believe happens is Python just calls our Java code with subprocess and this can actually be very inefficient since distinct calls reload all of the models.

Note that near the end of stanza\setup.py there is this line:

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util'],

Installing with this line leaves some modules out, so it has to be edited by hand to read:

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util', 'stanza.corenlp', 'stanza.ml', 'stanza.cluster', 'stanza.research'],
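
After reinstalling, one way to see whether the edit took effect is to ask Python which of these packages it can locate. This is a rough sketch under assumptions: the helper `missing_modules` is hypothetical, and the list simply repeats the names from the edited line plus `stanza.nlp.corenlp`, the module the example in section 3 imports from.

```python
import importlib.util

# Names copied from the edited setup.py line, plus the module imported later on.
EXPECTED = ["stanza", "stanza.text", "stanza.monitoring", "stanza.util",
            "stanza.corenlp", "stanza.ml", "stanza.cluster", "stanza.research",
            "stanza.nlp.corenlp"]


def missing_modules(names):
    """Return the dotted module names that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is itself missing or broken.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Running `missing_modules(EXPECTED)` in the environment where stanza was installed lists whatever is still absent; an empty list would indicate that every name could be found.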

3. Testing

Open a cmd window in the CoreNLP working directory and start the server.

To process English, enter:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
To process Chinese, enter:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000

Note that stanford-chinese-corenlp-2016-10-31-models.jar must be in the working directory.
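
If you would rather start the server from Python than from a cmd window, the two commands above can be assembled programmatically. The sketch below is an illustration only: `server_command` and `start_server` are hypothetical helper names, and it assumes `java` is on the PATH. It always uses the `-Xmx` spelling of the heap option.

```python
import subprocess


def server_command(port=9000, timeout_ms=15000, chinese=False, heap="4g"):
    """Build the java command line shown above for starting the CoreNLP server."""
    cmd = ["java", f"-Xmx{heap}", "-cp", "*",
           "edu.stanford.nlp.pipeline.StanfordCoreNLPServer"]
    if chinese:
        # Same flag as the Chinese command above.
        cmd += ["-serverProperties", "StanfordCoreNLP-chinese.properties"]
    cmd += ["-port", str(port), "-timeout", str(timeout_ms)]
    return cmd


def start_server(workdir, **kwargs):
    """Launch the server as a child process from the CoreNLP working directory."""
    # An argument list (no shell) leaves the "*" classpath wildcard for java to expand.
    return subprocess.Popen(server_command(**kwargs), cwd=workdir)
```

`start_server(workdir, chinese=True)` returns the process handle, so the server can later be stopped with its `terminate()` method.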

To try it out interactively, open http://localhost:9000/ or corenlp.run in a browser.
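
The same check can be made without a browser or stanza by sending one of the "small server requests" mentioned in the quote above. This sketch uses only the standard library. It assumes the server accepts the text as the POST body and the pipeline settings as a JSON `properties` query parameter, which is how I understand the CoreNLP server interface; the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_url(server="http://localhost:9000",
              annotators=("tokenize", "ssplit", "pos")):
    """Return the server URL with the pipeline properties encoded as a query."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    return server + "/?" + urllib.parse.urlencode({"properties": json.dumps(props)})


def annotate(text, **kwargs):
    """POST raw UTF-8 text to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(build_url(**kwargs), data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

`annotate` only works while the server from the previous step is running; `build_url` can be inspected on its own to see exactly what would be sent.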

Python example code:

from stanza.nlp.corenlp import CoreNLPClient
client = CoreNLPClient(server='http://localhost:9000', default_annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner']) # Note: earlier versions called the Chinese word segmenter "segment"; the new version uses "tokenize", the same as for other languages

# Word segmentation and part-of-speech tagging test
test1 = "深藍的天空中掛着一輪金黃的圓月，下面是海邊的沙地，都種着無邊無際的碧綠的西瓜，其間有一個十一二歲的少年，項帶銀圈，手捏一柄鋼叉，向一匹猹盡力的刺去，那猹卻將身一扭，反從他的胯下逃走了。"
annotated = client.annotate(test1)
for sentence in annotated.sentences:
    for token in sentence:
        print(token.word, token.pos)

# Named entity recognition test
test2 = "大概是物以希爲貴罷。北京的白菜運往浙江，便用紅頭繩繫住菜根，倒掛在水果店頭，尊爲膠菜；福建野生着的蘆薈，一到北京就請進溫室，且美其名曰龍舌蘭。我到仙台也頗受了這樣的優待……"
annotated = client.annotate(test2)
for sentence in annotated.sentences:
    for token in sentence:
        if token.ner != 'O':
            print(token.word, token.ner)
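
The loop above prints one tagged token per line, so an entity that the segmenter splits into several tokens appears in pieces. A possible follow-up, sketched here under assumptions, is to merge adjacent tokens that share a tag. `group_entities` is a hypothetical helper, not part of stanza, and it works on plain (word, tag) pairs.

```python
def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens sharing one NER tag into (text, tag) pairs."""
    entities = []
    words, current = [], outside
    for word, tag in tagged:
        if tag != current and words:
            # The tag changed, so the run collected so far is complete.
            entities.append(("".join(words), current))
            words = []
        current = tag
        if tag != outside:
            # Joined without spaces, which suits Chinese text.
            words.append(word)
    if words:
        entities.append(("".join(words), current))
    return entities
```

With the client above, the pairs could be supplied as `group_entities((t.word, t.ner) for s in annotated.sentences for t in s)`.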