For work I recently needed to investigate the named entity recognition (NER) feature of Stanford CoreNLP.
Stanford CoreNLP is a rather impressive natural language processing toolkit, and many of its models are trained with deep learning methods.
First, the link to its official site: https://stanfordnlp.github.io/CoreNLP/
This article mainly explains how to use Stanford CoreNLP in a Java project.
Versions from 3.5 onward require Java 8 or later to run. Chinese processing is fairly memory-hungry, consuming around 3 GB, so give the JVM enough heap (e.g. -Xmx3g).
I use Maven to pull in the dependencies, with version 3.9.1.
Just add the following dependencies to the pom file:
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
    <classifier>models</classifier>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.9.1</version>
    <classifier>models-chinese</classifier>
</dependency>
These three artifacts are, respectively, the CoreNLP code, the English models, and the Chinese models, totaling about 1.43 GB. Maven's default mirror is hosted overseas and these dependencies are very large, so it is worth trying a domestic mirror that carries all three; I used my own company's Maven repository.
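If the download is too slow, you can point Maven at a mirror in ~/.m2/settings.xml. A minimal sketch, where the id and url are placeholders for whichever mirror or internal repository you actually use:

<mirrors>
    <mirror>
        <id>internal-mirror</id>
        <!-- Placeholder URL: substitute your company repository or a nearby public mirror -->
        <url>https://maven.example.com/repository/public</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>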
Note that since I need Chinese named entity recognition, I have to use Chinese word segmentation and Chinese dictionaries. Let's first look at the structure of the imported jar:
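You can list the jar's contents with the JDK's jar tool; the path below assumes the default local Maven repository layout:

jar tf ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/3.9.1/stanford-corenlp-3.9.1-models-chinese.jar | grep properties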
Inside there is a StanfordCoreNLP-chinese.properties file, which sets the parameters for Chinese language processing; it mainly specifies the pipeline's annotation steps and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models, so you can define a custom configuration file and load that instead. In my project, I read this properties file directly.
Attention: I only want the ner functionality here and would prefer to drop the other annotators. However, Stanford CoreNLP has a limitation: before ner can run, the pipeline must include tokenize, ssplit, pos, lemma, which of course adds considerable processing time. You can still trim everything after ner, as the sketch below shows.
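For instance, a minimal sketch of an NER-only pipeline that keeps just those four prerequisite annotators (the class name is mine; the annotators override itself is standard CoreNLP configuration):

import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NerOnlyPipeline {
    public static StanfordCoreNLP create() throws Exception {
        Properties props = new Properties();
        props.load(NerOnlyPipeline.class.getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        // Keep ner plus the four annotators it requires to run first;
        // dropping parse and coref skips loading their large models.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        return new StanfordCoreNLP(props);
    }
}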
Actually, let's first walk through this properties file:
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner: sets the language and the (CRF) model used for NER; SUTime currently only supports English, not Chinese, so it is set to false.
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
So we load this properties file directly in code. Reference code below:
package com.baidu.corenlp;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

/**
 * Created by sonofelice on 2018/3/27.
 */
public class TestNLP {
    public void test() throws Exception {
        // Build a StanfordCoreNLP object and configure the NLP features,
        // e.g. lemma for lemmatization, ner for named entity recognition, etc.
        Properties props = new Properties();
        props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "袁隆平是中國科學院的院士,他於2009年10月到中國山東省東營市東營區永樂機場附近承包了一千畝鹽鹼地,"
                + "開始種植棉花, 年產量達到一萬噸, 哈哈, 反正棣琦說的是假的,逗你玩兒,明天下午2點來我家吃飯吧。"
                + "棣琦是山東大學畢業的,目前在百度作java開發,位置是東北旺東路102號院,手機號14366778890";

        long startTime = System.currentTimeMillis();
        // Create an empty Annotation object
        Annotation document = new Annotation(text);
        // Run the pipeline on the text
        pipeline.annotate(document);
        // Retrieve the processing results
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Get the token text (a word produced by segmentation)
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tagging
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity recognition
                String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + " | analysis : { original : " + ner + ", normalized : " + ne + "}");
                // Lemmatization
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(lema);
            }
            // The parse tree of the sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree of the sentence:");
            tree.pennPrint();
            // The dependency graph of the sentence
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependency graph of the sentence:");
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }
        long endTime = System.currentTimeMillis();
        long time = endTime - startTime;
        System.out.println("The analysis lasts " + time + " seconds * 1000");

        // Coreference chains
        // Each chain stores the set of mentions that co-refer
        // Sentence and token offsets both start from 1
        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey() + " ");
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want the last token of
                // the mention, not the one following it.
                System.out.println("  " + m + ", i.e., 0-based character offsets ["
                        + tokens.get(m.startIndex - 1).beginPosition() + ", "
                        + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}
Of course, when I actually ran it I kept only the NER-related analysis and commented out the other features. The output is as follows:
19:46:16.000 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
19:46:19.387 [main] INFO e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].
19:46:19.388 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
19:46:19.389 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
19:46:21.938 [main] INFO e.s.n.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].
19:46:22.099 [main] WARN e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亞 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] WARN e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Entry has multiple types for ner: 巴伐利亞 州 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] INFO e.s.n.p.TokensRegexNERAnnotator - TokensRegexNERAnnotator ner.fine.regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab, 0 TokensRegex patterns.
19:46:22.532 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
19:46:35.855 [main] INFO e.s.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].
19:46:35.859 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
19:46:43.139 [main] INFO e.s.n.pipeline.CorefMentionAnnotator - Using mention detector type: rule
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
19:46:43.329 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
19:46:43.379 [main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
19:46:43.380 [main] INFO e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
袁隆平 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
中國 | analysis : { original : ORGANIZATION, normalized : null}
科學院 | analysis : { original : ORGANIZATION, normalized : null}
的 | analysis : { original : O, normalized : null}
院士 | analysis : { original : TITLE, normalized : null}
, | analysis : { original : O, normalized : null}
他 | analysis : { original : O, normalized : null}
於 | analysis : { original : O, normalized : null}
2009年 | analysis : { original : DATE, normalized : 2009-10-XX}
10月 | analysis : { original : DATE, normalized : 2009-10-XX}
到 | analysis : { original : O, normalized : null}
中國 | analysis : { original : COUNTRY, normalized : null}
山東省 | analysis : { original : STATE_OR_PROVINCE, normalized : null}
東營市 | analysis : { original : CITY, normalized : null}
東營區 | analysis : { original : FACILITY, normalized : null}
永樂 | analysis : { original : FACILITY, normalized : null}
機場 | analysis : { original : FACILITY, normalized : null}
附近 | analysis : { original : O, normalized : null}
承包 | analysis : { original : O, normalized : null}
了 | analysis : { original : O, normalized : null}
一千 | analysis : { original : NUMBER, normalized : 1000}
畝 | analysis : { original : O, normalized : null}
鹽 | analysis : { original : O, normalized : null}
鹼地 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
開始 | analysis : { original : O, normalized : null}
種植 | analysis : { original : O, normalized : null}
棉花 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
年產量 | analysis : { original : O, normalized : null}
達到 | analysis : { original : O, normalized : null}
一萬 | analysis : { original : NUMBER, normalized : 10000}
噸 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
哈哈 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
反正 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
說 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
假 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
逗 | analysis : { original : O, normalized : null}
你 | analysis : { original : O, normalized : null}
玩兒 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
明天 | analysis : { original : DATE, normalized : XXXX-XX-XX}
下午 | analysis : { original : TIME, normalized : null}
2點 | analysis : { original : TIME, normalized : null}
來 | analysis : { original : O, normalized : null}
我 | analysis : { original : O, normalized : null}
家 | analysis : { original : O, normalized : null}
吃飯 | analysis : { original : O, normalized : null}
吧 | analysis : { original : O, normalized : null}
。 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
山東 | analysis : { original : ORGANIZATION, normalized : null}
大學 | analysis : { original : ORGANIZATION, normalized : null}
畢業 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
目前 | analysis : { original : DATE, normalized : null}
在 | analysis : { original : O, normalized : null}
百度 | analysis : { original : ORGANIZATION, normalized : null}
作 | analysis : { original : O, normalized : null}
java | analysis : { original : O, normalized : null}
開發 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
位置 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
東北 | analysis : { original : LOCATION, normalized : null}
旺 | analysis : { original : O, normalized : null}
東路 | analysis : { original : O, normalized : null}
102 | analysis : { original : NUMBER, normalized : 102}
號院 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
手機號 | analysis : { original : O, normalized : null}
143667788 | analysis : { original : NUMBER, normalized : 14366778890}
90 | analysis : { original : NUMBER, normalized : 14366778890}
The analysis lasts 819 seconds * 1000
Process finished with exit code 0
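Note that these tags are per token, so multi-token entities such as 中國 + 科學院 still have to be stitched together. A minimal sketch of such a merge step, run on an already-annotated document (the class and method names are my own, not CoreNLP API):

import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

public class NerSpans {
    /** Merges consecutive tokens that share the same NER tag into entity spans. */
    public static void printEntitySpans(Annotation document) {
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            StringBuilder entity = new StringBuilder();
            String currentTag = "O";
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String tag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                if (!tag.equals(currentTag)) {
                    // Tag changed: flush the span we were building, then start a new one.
                    if (!"O".equals(currentTag)) {
                        System.out.println(currentTag + "\t" + entity);
                    }
                    entity.setLength(0);
                    currentTag = tag;
                }
                if (!"O".equals(tag)) {
                    entity.append(token.word());
                }
            }
            // Flush a span that runs to the end of the sentence.
            if (!"O".equals(currentTag)) {
                System.out.println(currentTag + "\t" + entity);
            }
        }
    }
}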
As you can see, starting up the whole pipeline takes quite a long time, and the analysis itself is also fairly slow: 819 milliseconds.
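Most of that startup cost is one-time model loading at pipeline construction, so in a real service you would build the pipeline once and reuse it for every document, paying only the per-document analysis time afterwards. A minimal sketch of that pattern (the holder class is my own naming):

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineHolder {
    // Built once when the class is loaded; every request reuses the same pipeline.
    private static final StanfordCoreNLP PIPELINE = build();

    private static StanfordCoreNLP build() {
        Properties props = new Properties();
        try {
            props.load(PipelineHolder.class.getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        } catch (Exception e) {
            throw new RuntimeException("failed to load CoreNLP properties", e);
        }
        return new StanfordCoreNLP(props);
    }

    public static Annotation annotate(String text) {
        Annotation document = new Annotation(text);
        PIPELINE.annotate(document);
        return document;
    }
}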
Moreover, the results are not accurate enough; they differ somewhat from what I got from the online demo on the official website: