LTP is an open-source Chinese language processing suite from the Harbin Institute of Technology (HIT). It covers the basic tasks: word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic role labeling, semantic dependency parsing, and more.
This post is part of the "Exploring Open-source Chinese Word Segmentation Tools" series.
Like THULAC, LTP is based on the structured perceptron (Structured Perceptron, SP), modeling the score function of a label sequence \(Y\) given an input sequence \(X\):
\[ S(Y,X) = \sum_s \alpha_s \Phi_s(Y,X) \]
where the \(\Phi_s(Y,X)\) are local feature functions. Chinese word segmentation is then equivalent to finding, for a given sequence \(X\), the sequence \(Y\) that maximizes the score function:
\[ \mathop{\arg \max}_Y S(Y,X) \]
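Because the score decomposes into local feature functions, this argmax can be computed exactly by dynamic programming. A minimal sketch of first-order Viterbi decoding in Java (the `score`/`trans` matrices and tag indices here are illustrative assumptions, not LTP's actual data layout):

```java
import java.util.Arrays;

public class ViterbiSketch {
    // score[t][y]: summed weight of local features for tag y at position t.
    // trans[p][y]: weight of the tag bigram p -> y. Both are assumptions
    // standing in for LTP's ViterbiScoreMatrix.
    static int[] decode(double[][] score, double[][] trans) {
        int n = score.length, k = score[0].length;
        double[][] dp = new double[n][k];   // best path score ending in tag y at t
        int[][] back = new int[n][k];       // backpointers
        dp[0] = score[0].clone();
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                dp[t][y] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double s = dp[t - 1][p] + trans[p][y] + score[t][y];
                    if (s > dp[t][y]) { dp[t][y] = s; back[t][y] = p; }
                }
            }
        }
        // pick the best final tag, then follow backpointers
        int best = 0;
        for (int y = 1; y < k; y++) if (dp[n - 1][y] > dp[n - 1][best]) best = y;
        int[] tags = new int[n];
        tags[n - 1] = best;
        for (int t = n - 1; t > 0; t--) tags[t - 1] = back[t][tags[t]];
        return tags;
    }

    public static void main(String[] args) {
        double[][] score = {{1, 0}, {0, 2}, {3, 0}};
        double[][] trans = {{0.5, 0}, {0, 0.5}};
        System.out.println(Arrays.toString(decode(score, trans))); // prints [0, 1, 0]
    }
}
```

The complexity is O(n·k²) for n characters and k tags, which is what makes the exact argmax tractable.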
The source code analysis below is based on version 3.4.0.
The segmentation pipeline is much like any other segmenter's: extract character features, compute feature scores, then Viterbi-decode. See `__ltp_dll_segmentor_wrapper::segment()`:
```cpp
int segment(const char *str, std::vector<std::string> &words) {
  ltp::framework::ViterbiFeatureContext ctx;
  ltp::framework::ViterbiScoreMatrix scm;
  ltp::framework::ViterbiDecoder decoder;
  ltp::segmentor::Instance inst;

  int ret = preprocessor.preprocess(str, inst.raw_forms, inst.forms, inst.chartypes);
  if (-1 == ret || 0 == ret) {
    words.clear();
    return 0;
  }

  ltp::segmentor::SegmentationConstrain con;
  con.regist(&(inst.chartypes));
  build_lexicon_match_state(lexicons, &inst);
  extract_features(inst, model, &ctx, false);
  calculate_scores(inst, (*model), ctx, true, &scm);

  // allocate a new decoder so that the segmentor support multithreaded
  // decoding. this modification was committed by niuox
  decoder.decode(scm, con, inst.predict_tagsidx);
  build_words(inst.raw_forms, inst.predict_tagsidx, words);

  return words.size();
}
```
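The final `build_words` step joins characters back into words according to the predicted tags. A sketch in Java, assuming a B/I/E/S-style tag scheme (B = word begin, I = inside, E = end, S = single-character word; the character tags here are illustrative, LTP itself works with integer tag indices):

```java
import java.util.ArrayList;
import java.util.List;

public class BuildWords {
    // Join characters into words at E/S tag boundaries.
    static List<String> buildWords(String[] chars, char[] tags) {
        List<String> words = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            cur.append(chars[i]);
            if (tags[i] == 'E' || tags[i] == 'S') { // word boundary
                words.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) words.add(cur.toString()); // trailing fragment, if any
        return words;
    }

    public static void main(String[] args) {
        String[] chars = {"我", "爱", "北", "京"};
        char[] tags = {'S', 'S', 'B', 'E'};
        System.out.println(buildWords(chars, tags)); // prints [我, 爱, 北京]
    }
}
```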
The model file `cws.model` contains the labels, features, weights, and an internal lexicon. I rewrote the model parsing in Java; the code is as follows:
```java
DataInputStream is = new DataInputStream(new FileInputStream(path));
char[] octws = readCharArray(is, 128);

// 1. read label
SmartMap label = readSmartMap(is);
int[] entries = readIntArray(is, label.numEntries);

// 2. read feature space
char[] space = readCharArray(is, 16);
int offset = readInt(is);
int sz = readInt(is);
SmartMap[] dicts = new SmartMap[sz];
for (int i = 0; i < sz; i++) {
    dicts[i] = readSmartMap(is);
}

// 3. read param
char[] param = readCharArray(is, 16);
int dim = readInt(is);
double[] w = readDoubleArray(is, dim);
double[] wSum = readDoubleArray(is, dim);
int lastTimestamp = readInt(is);

// 4. read internal lexicon
SmartMap internalLexicon = readSmartMap(is);

// read char array
private static char[] readCharArray(DataInputStream is, int length) throws IOException {
    char[] chars = new char[length];
    for (int i = 0; i < length; i++) {
        chars[i] = (char) is.read();
    }
    return chars;
}

// read int array
private static int[] readIntArray(DataInputStream is, int length) throws IOException {
    byte[] bytes = new byte[4 * length];
    is.read(bytes);
    IntBuffer intBuffer = ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asIntBuffer();
    int[] array = new int[length];
    intBuffer.get(array);
    return array;
}
```
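The helper `readDoubleArray` is not shown above; a plausible implementation mirroring `readIntArray` (little-endian, 8 bytes per double, used for the weight vectors `w` and `wSum`) would be the following. This is my own sketch, not code from the post:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ReadDoubles {
    // Read `length` little-endian IEEE-754 doubles from the stream.
    static double[] readDoubleArray(DataInputStream is, int length) throws IOException {
        byte[] bytes = new byte[8 * length];
        is.readFully(bytes); // read exactly 8 * length bytes, unlike plain read()
        double[] array = new double[length];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
                  .asDoubleBuffer().get(array);
        return array;
    }

    public static void main(String[] args) throws IOException {
        // round-trip check: write two doubles little-endian, read them back
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        buf.putDouble(1.5).putDouble(-2.25);
        DataInputStream is = new DataInputStream(new ByteArrayInputStream(buf.array()));
        double[] a = readDoubleArray(is, 2);
        System.out.println(a[0] + " " + a[1]); // prints 1.5 -2.25
    }
}
```

Note that `readFully` is safer than the bare `is.read(bytes)` used in `readIntArray` above, since `read` may return fewer bytes than requested.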
LTP uses 15 feature templates in total, hence `sz` is 15. Features are stored in a map structure that LTP calls `SmartMap`, which from the code is essentially a hash map. Segmentation tool benchmarks show that LTP is slower than THULAC; the likely reason is that THULAC represents its model with a double-array trie, whose feature lookup is faster than LTP's hash-based `SmartMap`.
LTP所用到的特徵大體可分爲如下幾類:
源碼見extractor.cpp
:
```cpp
Extractor::Extractor() {
  // delimit feature templates
  templates.push_back(new Template("1={c-2}"));
  templates.push_back(new Template("2={c-1}"));
  templates.push_back(new Template("3={c-0}"));
  templates.push_back(new Template("4={c+1}"));
  templates.push_back(new Template("5={c+2}"));
  templates.push_back(new Template("6={c-2}-{c-1}"));
  templates.push_back(new Template("7={c-1}-{c-0}"));
  templates.push_back(new Template("8={c-0}-{c+1}"));
  templates.push_back(new Template("9={c+1}-{c+2}"));
  templates.push_back(new Template("14={ct-1}"));
  templates.push_back(new Template("15={ct-0}"));
  templates.push_back(new Template("16={ct+1}"));
  templates.push_back(new Template("17={lex1}"));
  templates.push_back(new Template("18={lex2}"));
  templates.push_back(new Template("19={lex3}"));
}

#define TYPE(x) (strutils::to_str(inst.chartypes[(x)] & 0x07))
data.set("c-2", (idx - 2 < 0 ? BOS : inst.forms[idx - 2]));
data.set("c-1", (idx - 1 < 0 ? BOS : inst.forms[idx - 1]));
data.set("c-0", inst.forms[idx]);
data.set("c+1", (idx + 1 >= len ? EOS : inst.forms[idx + 1]));
data.set("c+2", (idx + 2 >= len ? EOS : inst.forms[idx + 2]));
data.set("ct-1", (idx - 1 < 0 ? BOT : TYPE(idx - 1)));
data.set("ct-0", TYPE(idx));
data.set("ct+1", (idx + 1 >= len ? EOT : TYPE(idx + 1)));
data.set("lex1", strutils::to_str(inst.lexicon_match_state[idx] & 0x0f));
data.set("lex2", strutils::to_str((inst.lexicon_match_state[idx] >> 4) & 0x0f));
data.set("lex3", strutils::to_str((inst.lexicon_match_state[idx] >> 8) & 0x0f));
#undef TYPE
```
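As the bit masks above show, the three `lex*` features are unpacked from a single `lexicon_match_state` integer as 4-bit fields. A small Java sketch of the packing and unpacking; which field corresponds to a word starting at, ending at, or covering the character is my reading of the code, not something the snippet confirms:

```java
public class LexState {
    // Unpack the three 4-bit lexicon-match lengths from one state int.
    static int[] unpack(int state) {
        return new int[] {
            state & 0x0f,         // lex1
            (state >> 4) & 0x0f,  // lex2
            (state >> 8) & 0x0f   // lex3
        };
    }

    // Pack them the other way round, for illustration (hypothetical helper).
    static int pack(int lex1, int lex2, int lex3) {
        return (lex1 & 0x0f) | ((lex2 & 0x0f) << 4) | ((lex3 & 0x0f) << 8);
    }

    public static void main(String[] args) {
        int s = pack(3, 1, 2);
        int[] f = unpack(s);
        System.out.println(f[0] + " " + f[1] + " " + f[2]); // prints 3 1 2
    }
}
```

Storing all three match lengths in one int keeps the per-character state compact; the 4-bit fields cap each recorded match length at 15 characters.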