For work I recently needed to investigate the named entity recognition (NER) feature of Stanford CoreNLP.
Stanford CoreNLP is a rather impressive natural language processing toolkit, and many of its models are trained with deep learning methods.
First, the link to its official site: https://stanfordnlp.github.io/CoreNLP/
This article mainly explains how to use Stanford CoreNLP in a Java project.
Versions from 3.5 onward require Java 8 or later to run. Chinese processing is fairly memory-hungry, consuming around 3 GB, so give the JVM enough heap (e.g. -Xmx3g).
I use Maven to pull in the dependencies, with version 3.9.1.
Just add the following dependencies to the pom file:
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
    <classifier>models</classifier>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
    <classifier>models-chinese</classifier>
</dependency>
These three artifacts are, respectively, the CoreNLP code, the English models, and the Chinese models, totaling about 1.43 GB. Maven's default mirror is hosted overseas and these dependencies are very large, so it is worth trying a domestic mirror that carries all three; I used my own company's Maven repository.
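If the download is too slow, you can point Maven at a mirror in ~/.m2/settings.xml. A minimal sketch, where the id and url are placeholders for whichever mirror or internal repository you actually use:

<mirrors>
    <mirror>
        <id>internal-mirror</id>
        <!-- Placeholder URL: substitute your company repository or a nearby public mirror -->
        <url>https://maven.example.com/repository/public</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>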
Note that since I need Chinese named entity recognition, I have to use Chinese word segmentation and Chinese dictionaries. Let's first look at the structure of the imported jar:
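You can list the jar's contents with the JDK's jar tool; the path below assumes the default local Maven repository layout:

jar tf ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/3.9.1/stanford-corenlp-3.9.1-models-chinese.jar | grep properties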
Inside there is a StanfordCoreNLP-chinese.properties file, which sets the parameters for Chinese language processing; it mainly specifies the pipeline's annotation steps and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models, so you can define a custom configuration file and load that instead. In my project, I read this properties file directly.
Attention: I only want the ner functionality here and would prefer to drop the other annotators. However, Stanford CoreNLP has a limitation: before ner can run, the pipeline must include tokenize, ssplit, pos, lemma, which of course adds considerable processing time. You can still trim everything after ner, as the sketch below shows.
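For instance, a minimal sketch of an NER-only pipeline that keeps just those four prerequisite annotators (the class name is mine; the annotators override itself is standard CoreNLP configuration):

import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NerOnlyPipeline {
    public static StanfordCoreNLP create() throws Exception {
        Properties props = new Properties();
        props.load(NerOnlyPipeline.class.getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        // Keep ner plus the four annotators it requires to run first;
        // dropping parse and coref skips loading their large models.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        return new StanfordCoreNLP(props);
    }
}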
Actually, let's first walk through this properties file:
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner: sets the language and the (CRF) model used for NER; SUTime currently only supports English, not Chinese, so it is set to false.
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
So we load this properties file directly in code. Reference code below:
package com.baidu.corenlp;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

/**
 * Created by sonofelice on 2018/3/27.
 */
public class TestNLP {
    public void test() throws Exception {
        // Build a StanfordCoreNLP object and configure the NLP features,
        // e.g. lemma for lemmatization, ner for named entity recognition, etc.
        Properties props = new Properties();
        props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "袁隆平是中國科學院的院士,他於2009年10月到中國山東省東營市東營區永樂機場附近承包了一千畝鹽鹼地,"
                + "開始種植棉花, 年產量達到一萬噸, 哈哈, 反正棣琦說的是假的,逗你玩兒,明天下午2點來我家吃飯吧。"
                + "棣琦是山東大學畢業的,目前在百度作java開發,位置是東北旺東路102號院,手機號14366778890";

        long startTime = System.currentTimeMillis();
        // Create an empty Annotation object
        Annotation document = new Annotation(text);
        // Run the pipeline on the text
        pipeline.annotate(document);
        // Retrieve the processing results
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Get the token text (a word produced by segmentation)
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tagging
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity recognition
                String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + " | analysis : { original : " + ner + ", normalized : " + ne + "}");
                // Lemmatization
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(lema);
            }
            // The parse tree of the sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree of the sentence:");
            tree.pennPrint();
            // The dependency graph of the sentence
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependency graph of the sentence:");
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }
        long endTime = System.currentTimeMillis();
        long time = endTime - startTime;
        System.out.println("The analysis lasts " + time + " seconds * 1000");

        // Coreference chains
        // Each chain stores the set of mentions that co-refer
        // Sentence and token offsets both start from 1
        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey() + " ");
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want the last token of
                // the mention, not the one following it.
                System.out.println("  " + m + ", i.e., 0-based character offsets ["
                        + tokens.get(m.startIndex - 1).beginPosition() + ", "
                        + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}
Of course, when I actually ran it I kept only the NER-related analysis and commented out the other features. The output is as follows:
19:46:16.000 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
19:46:19.387 [main] INFO e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].
19:46:19.388 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
19:46:19.389 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
19:46:21.938 [main] INFO e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].
19:46:22.099 [main] WARN e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亞 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] WARN e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亞 州 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] INFO e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab, 0 TokensRegex patterns.
19:46:22.532 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
19:46:35.855 [main] INFO e.s.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].
19:46:35.859 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
19:46:43.139 [main] INFO e.s.n.pipeline.CorefMentionAnnotator - Using mention detector type: rule
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
19:46:43.329 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
19:46:43.379 [main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
19:46:43.380 [main] INFO e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
袁隆平 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
中國 | analysis : { original : ORGANIZATION, normalized : null}
科學院 | analysis : { original : ORGANIZATION, normalized : null}
的 | analysis : { original : O, normalized : null}
院士 | analysis : { original : TITLE, normalized : null}
, | analysis : { original : O, normalized : null}
他 | analysis : { original : O, normalized : null}
於 | analysis : { original : O, normalized : null}
2009年 | analysis : { original : DATE, normalized : 2009-10-XX}
10月 | analysis : { original : DATE, normalized : 2009-10-XX}
到 | analysis : { original : O, normalized : null}
中國 | analysis : { original : COUNTRY, normalized : null}
山東省 | analysis : { original : STATE_OR_PROVINCE, normalized : null}
東營市 | analysis : { original : CITY, normalized : null}
東營區 | analysis : { original : FACILITY, normalized : null}
永樂 | analysis : { original : FACILITY, normalized : null}
機場 | analysis : { original : FACILITY, normalized : null}
附近 | analysis : { original : O, normalized : null}
承包 | analysis : { original : O, normalized : null}
了 | analysis : { original : O, normalized : null}
一千 | analysis : { original : NUMBER, normalized : 1000}
畝 | analysis : { original : O, normalized : null}
鹽 | analysis : { original : O, normalized : null}
鹼地 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
開始 | analysis : { original : O, normalized : null}
種植 | analysis : { original : O, normalized : null}
棉花 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
年產量 | analysis : { original : O, normalized : null}
達到 | analysis : { original : O, normalized : null}
一萬 | analysis : { original : NUMBER, normalized : 10000}
噸 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
哈哈 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
反正 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
說 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
假 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
逗 | analysis : { original : O, normalized : null}
你 | analysis : { original : O, normalized : null}
玩兒 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
明天 | analysis : { original : DATE, normalized : XXXX-XX-XX}
下午 | analysis : { original : TIME, normalized : null}
2點 | analysis : { original : TIME, normalized : null}
來 | analysis : { original : O, normalized : null}
我 | analysis : { original : O, normalized : null}
家 | analysis : { original : O, normalized : null}
吃飯 | analysis : { original : O, normalized : null}
吧 | analysis : { original : O, normalized : null}
。 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
山東 | analysis : { original : ORGANIZATION, normalized : null}
大學 | analysis : { original : ORGANIZATION, normalized : null}
畢業 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
目前 | analysis : { original : DATE, normalized : null}
在 | analysis : { original : O, normalized : null}
百度 | analysis : { original : ORGANIZATION, normalized : null}
作 | analysis : { original : O, normalized : null}
java | analysis : { original : O, normalized : null}
開發 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
位置 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
東北 | analysis : { original : LOCATION, normalized : null}
旺 | analysis : { original : O, normalized : null}
東路 | analysis : { original : O, normalized : null}
102 | analysis : { original : NUMBER, normalized : 102}
號院 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
手機號 | analysis : { original : O, normalized : null}
143667788 | analysis : { original : NUMBER, normalized : 14366778890}
90 | analysis : { original : NUMBER, normalized : 14366778890}
The analysis lasts 819 seconds * 1000
Process finished with exit code 0
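Note that these tags are per token, so multi-token entities such as 中國 + 科學院 still have to be stitched together. A minimal sketch of such a merge step, run on an already-annotated document (the class and method names are my own, not CoreNLP API):

import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

public class NerSpans {
    /** Merges consecutive tokens that share the same NER tag into entity spans. */
    public static void printEntitySpans(Annotation document) {
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            StringBuilder entity = new StringBuilder();
            String currentTag = "O";
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String tag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                if (!tag.equals(currentTag)) {
                    // Tag changed: flush the span we were building, then start a new one.
                    if (!"O".equals(currentTag)) {
                        System.out.println(currentTag + "\t" + entity);
                    }
                    entity.setLength(0);
                    currentTag = tag;
                }
                if (!"O".equals(tag)) {
                    entity.append(token.word());
                }
            }
            // Flush a span that runs to the end of the sentence.
            if (!"O".equals(currentTag)) {
                System.out.println(currentTag + "\t" + entity);
            }
        }
    }
}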
As you can see, starting up the whole pipeline takes quite a long time, and the analysis itself is also fairly slow: 819 milliseconds.
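Most of that startup cost is one-time model loading at pipeline construction, so in a real service you would build the pipeline once and reuse it for every document, paying only the per-document analysis time afterwards. A minimal sketch of that pattern (the holder class is my own naming):

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineHolder {
    // Built once when the class is loaded; every request reuses the same pipeline.
    private static final StanfordCoreNLP PIPELINE = build();

    private static StanfordCoreNLP build() {
        Properties props = new Properties();
        try {
            props.load(PipelineHolder.class.getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        } catch (Exception e) {
            throw new RuntimeException("failed to load CoreNLP properties", e);
        }
        return new StanfordCoreNLP(props);
    }

    public static Annotation annotate(String text) {
        Annotation document = new Annotation(text);
        PIPELINE.annotate(document);
        return document;
    }
}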
Moreover, the results are not accurate enough; they differ somewhat from what I got from the online demo on the official website: