Chinese Natural Language Processing with Python and Stanford CoreNLP

Test environment: Windows 7 / Python 3.6.1 / CoreNLP 3.7.0

1. Download CoreNLP

Download the latest model files from the Stanford NLP website.

2. Install stanza

stanza is the newest Python interface officially developed for Stanford CoreNLP.

According to StanfordNLPHelp's explanation on Stack Overflow, Python users are advised to use stanza rather than the nltk interface:

If you want to use our tools in Python, I would recommend using the Stanford CoreNLP 3.7.0 server and making small server requests (or using the stanza library).

If you use nltk what I believe happens is Python just calls our Java code with subprocess and this can actually be very inefficient since distinct calls reload all of the models.

Note that near the end of stanza\setup.py there is this line:

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util'],

Change it to the following so that the additional subpackages are installed as well:

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util', 'stanza.corenlp', 'stanza.ml', 'stanza.cluster', 'stanza.research'],
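
After reinstalling, one quick way to confirm that the packages were picked up is to ask Python whether the modules can be located. The sketch below is only a sanity check: `missing_modules` is a hypothetical helper name, and the only stanza module it looks for is `stanza.nlp.corenlp`, the one this post imports when creating the client.

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names that Python cannot locate."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised when a parent package of the dotted name is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(["stanza.nlp.corenlp"])
    print("missing:", absent if absent else "none")
```

If the module is reported as missing, the package list in setup.py is the first thing to recheck before reinstalling.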



Start the server from a command prompt in the CoreNLP directory:

  • For English, enter
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

  • For Chinese, enter
    java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000
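
The two commands differ only in the `-serverProperties` argument, so a script that launches the server can generate both from one helper. This is a minimal sketch, assuming Java is on the PATH and that the script is pointed at the directory holding the CoreNLP jars; `build_server_command` and `start_server` are hypothetical helper names, not part of CoreNLP or stanza.

```python
import subprocess


def build_server_command(chinese=False, port=9000, timeout=15000, memory="4g"):
    """Build the java command line shown above as an argument list."""
    command = [
        "java", "-Xmx" + memory, "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
    ]
    if chinese:
        # The Chinese pipeline is selected through its properties file
        command += ["-serverProperties", "StanfordCoreNLP-chinese.properties"]
    command += ["-port", str(port), "-timeout", str(timeout)]
    return command


def start_server(corenlp_dir, **options):
    """Launch the server as a child process; corenlp_dir must hold the CoreNLP jars."""
    return subprocess.Popen(build_server_command(**options), cwd=corenlp_dir)
```

Calling `start_server(path_to_corenlp, chinese=True)` returns a `Popen` object, and calling `terminate()` on it stops the server. Passing the command as a list avoids shell quoting problems with the `"*"` classpath.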


To try it out interactively, open http://localhost:9000/ in a browser, or visit corenlp.run.
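
The same server can also be queried without a browser. Below is a minimal sketch using only the standard library, assuming the server is listening on localhost:9000. It follows the CoreNLP server convention of passing a JSON `properties` object in the query string and the raw text as the POST body; `build_request_url` and `annotate_raw` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def build_request_url(server, annotators, output_format="json"):
    """Encode the annotator list as the 'properties' query parameter."""
    props = {"annotators": ",".join(annotators), "outputFormat": output_format}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return server.rstrip("/") + "/?" + query


def annotate_raw(text, server="http://localhost:9000",
                 annotators=("tokenize", "ssplit", "pos")):
    """POST the text to a running CoreNLP server and return the parsed JSON reply."""
    url = build_request_url(server, annotators)
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.loads(response.read().decode("utf-8"))
```

With JSON output, the reply contains a `sentences` list whose entries each carry a `tokens` list. This is the kind of small server request the quoted Stack Overflow answer recommends.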


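from stanza.nlp.corenlp import CoreNLPClient

# Connect to the running CoreNLP server.
# Note: earlier versions used 'segment' for Chinese word segmentation;
# newer versions use 'tokenize', consistent with the other languages.
client = CoreNLPClient(server='http://localhost:9000', default_annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'])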

# Word segmentation and part-of-speech tagging test
test1 = "深藍的天空中掛着一輪金黃的圓月,下面是海邊的沙地,都種着無邊無際的碧綠的西瓜,其間有一個十一二歲的少年,項帶銀圈,手捏一柄鋼叉,向一匹猹盡力的刺去,那猹卻將身一扭,反從他的胯下逃走了。"
annotated = client.annotate(test1)
for sentence in annotated.sentences:
    for token in sentence:
        print(token.word, token.pos)

# Named entity recognition test
test2 = "大概是物以希爲貴罷。北京的白菜運往浙江,便用紅頭繩繫住菜根,倒掛在水果店頭,尊爲膠菜;福建野生着的蘆薈,一到北京就請進溫室,且美其名曰龍舌蘭。我到仙台也頗受了這樣的優待……"
annotated = client.annotate(test2)
for sentence in annotated.sentences:
    for token in sentence:
        if token.ner != 'O':
            print(token.word, token.ner)
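
The loop above prints one token per line, so an entity that was split into several tokens shows up on several lines. Here is a small post-processing sketch that merges consecutive tokens sharing the same label. It works on plain `(word, label)` pairs; `merge_entities` is a hypothetical helper, not part of stanza, and the sample pairs and their labels are made up for illustration rather than taken from actual CoreNLP output.

```python
from itertools import groupby


def merge_entities(tagged_tokens, outside="O"):
    """Merge runs of adjacent tokens with the same NER label into (text, label) pairs."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside:
            # Chinese is written without spaces, so the words are joined directly
            entities.append(("".join(word for word, _ in group), label))
    return entities


# Illustrative input only, not real server output
sample = [("中華", "ORG"), ("書局", "ORG"), ("在", "O"), ("北京", "LOC")]
print(merge_entities(sample))
```

With the client from this post, the input could be built as `merge_entities((token.word, token.ner) for token in sentence)`. Two distinct entities of the same type that happen to be adjacent would be merged into one, which is a known limitation of grouping by label alone.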