CoreNLP is an open-source Java NLP toolkit from Stanford University that provides part-of-speech (POS) tagging, named entity recognition (NER), sentiment analysis, and other functions.
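CoreNLP's Chinese word segmenter is based on a CRF model [1, 2], which defines the conditional probability of a label sequence \(y\) given the input \(x\) as:
\[ P_w(y|x) = \frac{\exp \left( \sum_i w_i f_i(x,y) \right)}{Z_w(x)} \]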
where \(Z_w(x)\) is the normalization factor, \(w\) is the vector of model parameters, and \(f_i(x,y)\) are the feature functions.
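For completeness, the normalization factor can be written out explicitly. This is the standard log-linear definition rather than something quoted from the CoreNLP source; \(y'\) ranges over all candidate label sequences for \(x\):

\[ Z_w(x) = \sum_{y'} \exp \left( \sum_i w_i f_i(x,y') \right) \]

Because \(Z_w(x)\) does not depend on \(y\), decoding the best label sequence only needs the unnormalized score \(\sum_i w_i f_i(x,y)\).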
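The source-code analysis below is based on version 3.7.0; for a segmentation example, see the SegDemo class.
There are two main model files. The first is the dictionary file dict-chris6.ser.gz: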

    // dict-chris6.ser.gz deserializes to an array of 7 Sets (indices 0..6)
    // Word counts per index: 0 + 7323 + 125336 + 142252 + 82139 + 26907 + 39243
    ChineseDictionary::loadDictionary(String serializePath) {
      Set<String>[] dict = new HashSet[MAX_LEXICON_LENGTH + 1];
      for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
        dict[i] = Generics.newHashSet();
      }
      dict = IOUtils.readObjectFromURLOrClasspathOrFileSystem(serializePath);
      return dict;
    }

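The array index is the word length: set 0 contains no words, set 1 holds words of length 1, and so on up to set 6, which holds entries of length 6. Set 6 also contains partial words, that is, fragments rather than complete words, for example 「《雙峯》(電」, 「80年國家領」, and 「1824年英」.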
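The length-indexed layout can be illustrated with a small stand-alone sketch. `LengthIndexedDictionary` is a hypothetical class, not the CoreNLP `ChineseDictionary`; in particular, storing longer words as their first six characters is only an assumption about how partial entries such as 「1824年英」 could arise.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a length-indexed dictionary; not the CoreNLP class. */
final class LengthIndexedDictionary {
    static final int MAX_LEXICON_LENGTH = 6;

    // buckets[n] holds entries that are n characters long; bucket 0 stays empty.
    private final Set<String>[] buckets;

    @SuppressWarnings("unchecked")
    LengthIndexedDictionary() {
        buckets = new HashSet[MAX_LEXICON_LENGTH + 1];
        for (int i = 0; i <= MAX_LEXICON_LENGTH; i++) {
            buckets[i] = new HashSet<>();
        }
    }

    /** Adds a word. ASSUMPTION: longer words are stored as their first six characters. */
    void add(String word) {
        if (word.isEmpty()) {
            return;
        }
        String key = word.length() > MAX_LEXICON_LENGTH
                ? word.substring(0, MAX_LEXICON_LENGTH)
                : word;
        buckets[key.length()].add(key);
    }

    /** Looks a candidate up in the bucket that matches its length. */
    boolean contains(String candidate) {
        int n = candidate.length();
        return n >= 1 && n <= MAX_LEXICON_LENGTH && buckets[n].contains(candidate);
    }

    /** Number of entries stored for the given length. */
    int size(int length) {
        return buckets[length].size();
    }

    public static void main(String[] args) {
        LengthIndexedDictionary dict = new LengthIndexedDictionary();
        dict.add("中國");
        dict.add("1824年英國倫敦");
        System.out.println(dict.contains("中國"));
        System.out.println(dict.contains("1824年英"));
        System.out.println(dict.size(0));
    }
}
```

Indexing by length means a lookup only touches the one set whose entries have the same length as the candidate string.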
The second is the trained CRF model file ctb.gz:

    CRFClassifier::loadClassifier(ObjectInputStream ois, Properties props) {
      Object o = ois.readObject();
      if (o instanceof List) {
        labelIndices = (List<Index<CRFLabel>>) o;            // label indices
      }
      classIndex = (Index<String>) ois.readObject();         // sequence-labeling labels
      featureIndex = (Index<String>) ois.readObject();       // features
      flags = (SeqClassifierFlags) ois.readObject();         // model configuration
      Object featureFactory = ois.readObject();              // feature template, used to generate features
      // (the preceding branch of this if/else is elided in this excerpt)
      if (featureFactory instanceof FeatureFactory) {
        featureFactories = Generics.newArrayList();
        featureFactories.add((FeatureFactory<IN>) featureFactory);
      }
      windowSize = ois.readInt();                            // window size is 2
      weights = (double[][]) ois.readObject();               // weight of each (feature, label) pair
      Set<String> lcWords = (Set<String>) ois.readObject();  // the Set is empty
      // (fallback branch of an if/else that is elided in this excerpt)
      knownLCWords = new MaxSizeConcurrentHashSet<>(lcWords);
      reinit();
    }

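Unlike other segmenters, which use the four labels B, M, E, and S, CoreNLP's Chinese segmenter uses only two labels: "1" means the current character joins the previous character in the same word, and "0" means the current character starts a new word (in other words, the previous character ends the preceding word).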

    class CRFClassifier {
      classIndex: class edu.stanford.nlp.util.HashIndex
        ["1","0"]
    }

    // annotation class that holds the Chinese segmentation label
    public static class AnswerAnnotation implements CoreAnnotation<String>{}

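To make the two-label scheme concrete, here is a minimal sketch that turns a label string into words. `BinaryLabelDecoder` is a hypothetical helper written for this illustration, not CoreNLP code; it assumes one label character per input character and treats the first character as a word start regardless of its label.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: rebuilds words from per-character "0"/"1" labels. */
final class BinaryLabelDecoder {

    /** '1' attaches character i to the previous character; '0' starts a new word. */
    static List<String> decode(String text, String labels) {
        if (text.length() != labels.length()) {
            throw new IllegalArgumentException("need exactly one label per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // A '0' closes the word collected so far (if any) before starting a new one.
            if (labels.charAt(i) == '0' && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [我, 愛, 自然, 語言]
        System.out.println(decode("我愛自然語言", "000101"));
    }
}
```

In B/M/E/S terms the same segmentation would be labeled S S B E B E; the binary scheme carries the same boundary information with fewer classes.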
CoreNLP's features look like this (examples):

    class CRFClassifier {
      // features
      featureIndex: class edu.stanford.nlp.util.HashIndex
        size = 3408491
        0=的膀cc2|C
        1=身也pc|C
        44=LSSLp2spscsc2s|C
        45=科背p2p|C
        46=迪。cc2|C
        ...
        =球-行pc2|CnC
        =音非cc2|CpC

      // weights
      weights: double[3408491][2]
        [[2.2114868426005005E-5, -2.2114868091546352E-5]...]
    }

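Feature names carry one of only three suffixes, C, CpC, or CnC, which correspond to the three major feature classes. All of them are generated by the feature template: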

    // list of feature templates
    featureFactories: ArrayList<FeatureFactory>
      0 = Gale2007ChineseSegmenterFeatureFactory

    // the concrete feature template
    Gale2007ChineseSegmenterFeatureFactory::getCliqueFeatures() {
      if (clique == cliqueC) {
        addAllInterningAndSuffixing(features, featuresC(cInfo, loc), "C");
      } else if (clique == cliqueCpC) {
        addAllInterningAndSuffixing(features, featuresCpC(cInfo, loc), "CpC");
        addAllInterningAndSuffixing(features, featuresCnC(cInfo, loc - 1), "CnC");
      }
    }

The feature template uses only two cliques, cliqueC and cliqueCpC. cliqueC is implemented by featuresC(), and cliqueCpC is implemented by featuresCpC() and featuresCnC():

    Gale2007ChineseSegmenterFeatureFactory::featuresC() {
      if (flags.useWord1) {
        // Unigram features
        features.add(charc  + "::c");            // c[0]
        features.add(charc2 + "::c2");           // c[1]
        features.add(charp  + "::p");            // c[-1]
        features.add(charp2 + "::p2");           // c[-2]
        // Bigram features
        features.add(charc  + charc2 + "::cn");  // c[0]c[1]
        features.add(charc  + charc3 + "::cn2"); // c[0]c[2]
        features.add(charp  + charc  + "::pc");  // c[-1]c[0]
        features.add(charp  + charc2 + "::pn");  // c[-1]c[1]
        features.add(charp2 + charp  + "::p2p"); // c[-2]c[-1]
        features.add(charp2 + charc  + "::p2c"); // c[-2]c[0]
        features.add(charc2 + charc  + "::n2c"); // c[1]c[0]
      }

      // Dictionary features: LBeginAnnotation, LMiddleAnnotation and LEndAnnotation
      // for the three characters c[-1], c[0], c[1].
      // The resulting features end in one of 9 suffixes:
      // "-lb", "-lm", "-le", "-plb", "-plm", "-ple", "-c2lb", "-c2lm", "-c2le"
      // flags.dictionary is null; flags.serializedDictionary is ".../models/segmenter/chinese/dict-chris6.ser.gz"
      if (flags.dictionary != null || flags.serializedDictionary != null) {
        dictionaryFeaturesC(CoreAnnotations.LBeginAnnotation.class,
            CoreAnnotations.LMiddleAnnotation.class,
            CoreAnnotations.LEndAnnotation.class,
            "", features, p, c, c2);
      }

      // Features c[-2]c[-1] and c[-2]
      if (flags.useFeaturesC4gram || flags.useFeaturesC5gram || flags.useFeaturesC6gram) {
        features.add(charp2 + charp + "p2p");
        features.add(charp2 + "p2");
      }

      // Unicode type trigram feature
      if (flags.useUnicodeType || flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep + "-" + uTypec + "-" + uTypec2 + "-uType3");
      }

      // Unicode type 4-gram feature
      if (flags.useUnicodeType4gram || flags.useUnicodeType5gram) {
        features.add(uTypep2 + "-" + uTypep + "-" + uTypec + "-" + uTypec2 + "-uType4");
      }

      // Unicode block feature
      if (flags.useUnicodeBlock) {
        features.add(p.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
            + c.getString(CoreAnnotations.UBlockAnnotation.class) + "-"
            + c2.getString(CoreAnnotations.UBlockAnnotation.class) + "-uBlock");
      }

      // Shape features
      if (flags.useShapeStrings) {
        if (flags.useShapeStrings1) {
          features.add(p.getString(CoreAnnotations.ShapeAnnotation.class) + "ps");
          features.add(c.getString(CoreAnnotations.ShapeAnnotation.class) + "cs");
          features.add(c2.getString(CoreAnnotations.ShapeAnnotation.class) + "c2s");
        }
        if (flags.useShapeStrings3) {
          features.add(p.getString(CoreAnnotations.ShapeAnnotation.class)
              + c.getString(CoreAnnotations.ShapeAnnotation.class)
              + c2.getString(CoreAnnotations.ShapeAnnotation.class) + "pscsc2s");
        }
        if (flags.useShapeStrings4) {
          features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
              + p.getString(CoreAnnotations.ShapeAnnotation.class)
              + c.getString(CoreAnnotations.ShapeAnnotation.class)
              + c2.getString(CoreAnnotations.ShapeAnnotation.class) + "p2spscsc2s");
        }
        if (flags.useShapeStrings5) {
          features.add(p2.getString(CoreAnnotations.ShapeAnnotation.class)
              + p.getString(CoreAnnotations.ShapeAnnotation.class)
              + c.getString(CoreAnnotations.ShapeAnnotation.class)
              + c2.getString(CoreAnnotations.ShapeAnnotation.class)
              + c3.getString(CoreAnnotations.ShapeAnnotation.class) + "p2spscsc2sc3s");
        }
      }
    }

    Gale2007ChineseSegmenterFeatureFactory::featuresCpC() {}

    Gale2007ChineseSegmenterFeatureFactory::featuresCnC() {}

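To see what such a template produces for a concrete position, the following stand-alone sketch reproduces a few of the character unigram and bigram features from the `useWord1` branch and appends a "|C" suffix, mirroring the dumped feature names. `CharWindowFeatures` and the "#" padding symbol are hypothetical; how CoreNLP pads sentence boundaries is not shown in the excerpt and is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical class, not part of CoreNLP. */
final class CharWindowFeatures {
    // ASSUMPTION: a made-up padding symbol for positions outside the sentence.
    static final String PAD = "#";

    /** Returns the character at index i as a String, or PAD when i is out of range. */
    static String at(String s, int i) {
        return (i >= 0 && i < s.length()) ? String.valueOf(s.charAt(i)) : PAD;
    }

    /** Builds a subset of the clique-C character features for position loc. */
    static List<String> featuresC(String s, int loc) {
        String c = at(s, loc);       // c[0]
        String c2 = at(s, loc + 1);  // c[1]
        String p = at(s, loc - 1);   // c[-1]
        String p2 = at(s, loc - 2);  // c[-2]

        List<String> raw = new ArrayList<>();
        raw.add(c + "::c");
        raw.add(c2 + "::c2");
        raw.add(p + "::p");
        raw.add(p2 + "::p2");
        raw.add(c + c2 + "::cn");
        raw.add(p + c + "::pc");
        raw.add(p2 + p + "::p2p");

        // Tag every feature with its clique name, as the "|C" suffix does in the dump.
        List<String> suffixed = new ArrayList<>();
        for (String f : raw) {
            suffixed.add(f + "|C");
        }
        return suffixed;
    }

    public static void main(String[] args) {
        // Prints [文::c|C, 分::c2|C, 中::p|C, #::p2|C, 文分::cn|C, 中文::pc|C, #中::p2p|C]
        System.out.println(featuresC("中文分詞", 1));
    }
}
```

Each generated string is then looked up in the feature index, so a character window yields a handful of active features out of the millions stored in the model.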
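The three feature classes end in "|C" (32 kinds), "|CpC" (37 kinds), and "|CnC" (9 kinds), for a total of 78 kinds of features. Personally, I find the features defined by CoreNLP overly complex, and most of them contribute little. The rest of the CoreNLP pipeline is the same as in other segmenters: sum the weights of the active features for each label, run Viterbi decoding to find the most probable path, and parse the label sequence to obtain the segmentation.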
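The scoring and decoding steps can be sketched as follows. This is a generic Viterbi decoder under simplifying assumptions, not the CoreNLP implementation: `ViterbiSketch` is a hypothetical class, label ids are plain array positions (0 for "start a new word", 1 for "join the previous character"), and the transition scores are taken to be the same at every position, whereas real clique features depend on the position.

```java
import java.util.Arrays;

/** Generic Viterbi sketch; hypothetical class, not CoreNLP code. */
final class ViterbiSketch {

    /** Sums the weights of the active features for one label (weights[feature][label]). */
    static double labelScore(double[][] weights, int[] activeFeatures, int label) {
        double score = 0.0;
        for (int f : activeFeatures) {
            score += weights[f][label];
        }
        return score;
    }

    /**
     * node[t][y] is the score of label y at position t.
     * trans[a][b] is the score of label a followed by label b (position-independent here).
     * Returns the highest-scoring label sequence.
     */
    static int[] decode(double[][] node, double[][] trans) {
        int n = node.length;
        if (n == 0) {
            return new int[0];
        }
        int k = node[0].length;
        double[][] best = new double[n][k]; // best score of any path ending in y at t
        int[][] back = new int[n][k];       // previous label on that best path
        best[0] = node[0].clone();
        for (int t = 1; t < n; t++) {
            for (int b = 0; b < k; b++) {
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int a = 0; a < k; a++) {
                    double s = best[t - 1][a] + trans[a][b];
                    if (s > bestScore) {
                        bestScore = s;
                        bestPrev = a;
                    }
                }
                best[t][b] = bestScore + node[t][b];
                back[t][b] = bestPrev;
            }
        }
        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] node = {{1.0, -1.0}, {-0.5, 0.5}, {0.2, 0.6}, {-0.2, 0.2}};
        double[][] trans = {{0.0, 0.0}, {0.0, -1.0}}; // penalize label 1 followed by label 1
        // Prints [0, 1, 0, 1]
        System.out.println(Arrays.toString(decode(node, trans)));
    }
}
```

In this toy input, picking the best label independently at each position would give [0, 1, 1, 1]; the transition penalty makes [0, 1, 0, 1] the higher-scoring path, which is why decoding is done over whole sequences.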
CoreNLP's segmenter is extremely slow and its accuracy is mediocre. Its results on the PKU and MSR test sets are as follows:
Test set | Segmenter | Precision | Recall | F1 |
---|---|---|---|---|
PKU | thulac4j | 0.948 | 0.936 | 0.942 |
PKU | CoreNLP | 0.901 | 0.894 | 0.897 |
MSR | thulac4j | 0.866 | 0.896 | 0.881 |
MSR | CoreNLP | 0.822 | 0.859 | 0.840 |
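
The F1 column is the harmonic mean of precision \(P\) and recall \(R\). As a quick check, plugging in the CoreNLP figures for PKU reproduces the reported value:

\[ F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.901 \times 0.894}{0.901 + 0.894} \approx 0.897 \]
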
[1] Tseng, Huihsin, et al. "A conditional random field word segmenter." Fourth SIGHAN Workshop. 2005.
[2] Chang, Pi-Chuan, Michel Galley, and Christopher D. Manning. "Optimizing Chinese word segmentation for machine translation performance." Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2008.