I've recently been doing some natural language understanding work for music and reading content, and looked into Stanford CoreNLP for it; these are my notes.
Stanford CoreNLP is a suite of natural language analysis tools, including word segmentation, sentence splitting, part-of-speech tagging, named entity recognition, and parsing.
Voice-driven applications (voice assistants, smart speakers like the Echo) usually receive colloquial natural language, e.g. 我想聽一個段子 ("I want to hear a joke") or 給我來個牛郎織女的故事 ("tell me the story of the Cowherd and the Weaver Girl"). To return accurate results, you have to pull out the useful topic words: 段子 / 牛郎織女 / 故事. After looking around I decided to try CoreNLP's TokensRegex: regular expressions over token sequences. The building blocks it offers are regular expressions, word segmentation, part-of-speech tags, and entity classes, and you can also define your own entity classes, e.g. tagging 牛郎織女 as an entity of class READ.
Rule format:
```
{
  // ruleType is "text", "tokens", "composite", or "filter"
  // "tokens" matches over the segmented token sequence; "text" matches over
  // the raw string; I haven't figured out composite/filter yet
  ruleType: "tokens",
  // pattern to be matched
  pattern: ( ( [ { ner:PERSON } ] ) /was/ /born/ /on/ ( [ { ner:DATE } ] ) ),
  // value associated with the expression for which the pattern was matched
  // matched expressions are returned with "DATE_OF_BIRTH" as the value
  // (as part of the MatchedExpression class)
  result: "DATE_OF_BIRTH"
}
```
Besides the fields above, a rule can also carry action, name, stage, active, priority, and so on; see the references at the end.
When ruleType is "tokens", the basic unit of pattern is a token: the whole pattern is wrapped in ( ), a single token is written as [<expression>], and an expression is written as {tag:xx; ner:xx}.
When ruleType is "text", pattern is an ordinary regular expression whose basic unit is the character, wrapped in / /.
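As an illustration of the contrast, a "text"-type rule might look like the following (this is my own made-up example, not taken from the CoreNLP docs, matching a date string directly in the raw text):

```
// hypothetical "text"-type rule: the pattern runs over the raw string,
// character by character, instead of over segmented tokens
{
  ruleType: "text",
  pattern: /(\d{4})-(\d{2})-(\d{2})/,
  result: "DATE"
}
```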
CoreNLP supports extraction with a single rule or with a set of rules; this article loads the rules from a file, uses them to intercept the text we care about, and extracts the topic words from it.
Maven dependencies:

```xml
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.4.1</version>
    <classifier>models</classifier>
</dependency>
<!-- Chinese support -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
    <classifier>models-chinese</classifier>
</dependency>
```
CoreNLP-chinese.properties:

```
annotators = segment, ssplit, pos, ner, regexner, parse
# custom entity regex mapping file
regexner.mapping = regexner.txt
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
segment.model = edu/stanford/nlp/models/segmenter/chinese/pku.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true
# sentence boundary characters
ssplit.boundaryTokenRegex = [.]|[!?]+|[。]|[!?]+
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.useSUTime = false
parse.model = edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz
```
In CoreNLP, one pass over a piece of text is called a pipeline, and each annotator is one processing step: segment (word segmentation), ssplit (sentence splitting, i.e. breaking a passage into sentences), pos (part-of-speech tagging), ner (named entity recognition), regexner (tagging entity types with custom regular expressions), and parse (syntactic parsing). The remaining lines configure each annotator's properties.
regexner.txt (tags 牛郎織女 as a READ entity):
```
牛郎織女	READ
```
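If I understand the mapping format correctly, the file holds one tab-separated entry per line, so more titles can be registered the same way (白雪公主 below is my own made-up second entry, not from the original post):

```
牛郎織女	READ
白雪公主	READ
```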
rule.txt (the TokensRegex rules):
```
// single-type patterns
$TYPE = "/笑話|故事|段子|口技|謎語|寓言|評書|相聲|小品|唐詩|古詩|宋詞|繞口令|故事|小說/ | /腦筋/ /急轉彎/"

{
  ruleType: "tokens",
  pattern: ((?$type $TYPE)),
  result: Format("%s;%s;%s", "", $$type.text.replace(" ",""), "")
}
```
(?$type xx) defines a named capture group; the group's match is extracted and assembled into a result of the form xx;xx;xx.
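To make the assembly concrete, here is a plain-Java sketch of what the rule's Format(...) result evaluates to for a matched topic word. FormatDemo and buildValue are my own illustrative names, not CoreNLP API; the space-stripping mirrors the .replace(" ","") in the rule, since the segmenter can leave spaces between tokens:

```java
// Illustrative only: mimics the string assembly done by the rule's
// Format("%s;%s;%s", "", $$type.text.replace(" ",""), "") result
// using plain java.lang.String.format.
public class FormatDemo {
    static String buildValue(String matchedText) {
        // the segmenter may leave spaces between tokens; strip them first
        String topic = matchedText.replace(" ", "");
        // empty first and third slots, topic word in the middle
        return String.format("%s;%s;%s", "", topic, "");
    }

    public static void main(String[] args) {
        System.out.println(buildValue("故 事")); // prints ;故事;
    }
}
```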
```java
// load the TokensRegex rules
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
        .createExtractorFromFile(TokenSequencePattern.getNewEnv(), "rule.txt");
// build the pipeline
StanfordCoreNLP coreNLP = new StanfordCoreNLP("CoreNLP-chinese.properties");
// annotate the text
Annotation annotation = coreNLP.process("聽個故事");
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
CoreMap sentence = sentences.get(0); // analysis result for the first sentence
// run the token regex over the sentence
List<MatchedExpression> matchedExpressions = extractor.extractExpressions(sentence);
for (MatchedExpression match : matchedExpressions) {
    System.out.println("Matched expression: " + match.getText()
            + " with value " + match.getValue());
}
```
To inspect the analysis results (segmentation, POS tags, entity tags), you can use the function below:
```java
private void debug(CoreMap sentence) {
    // pull the CoreLabel list out of the CoreMap and print each token
    List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println("word" + "\t " + "POS" + "\t " + "NER");
    System.out.println("-----------------------------");
    for (CoreLabel token : tokens) {
        String word = token.getString(CoreAnnotations.TextAnnotation.class);
        String pos = token.getString(CoreAnnotations.PartOfSpeechAnnotation.class);
        String ner = token.getString(CoreAnnotations.NamedEntityTagAnnotation.class);
        System.out.println(word + "\t " + pos + "\t " + ner);
    }
}
```
All in all it's quite powerful; the more tools you have at hand, the more ways you have to attack a problem.
TokensRegex: http://nlp.stanford.edu/software/tokensregex.shtml
SequenceMatchRules: http://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html