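The word segmenter is a Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names and organization names. Plugins for Lucene, Solr and ElasticSearch are also provided.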
The evaluation of the word segmenter covers the following 7 segmentation algorithms:
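Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching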
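As a rough illustration of what forward and reverse maximum matching do, here is a minimal sketch. The class `MatchingSketch`, the tiny dictionary and the `maxLen` parameter are assumptions made for this example; it is not the word library's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; names and dictionary are hypothetical. */
class MatchingSketch {

    /** Forward maximum matching: scan left to right, taking the longest dictionary word each time. */
    static List<String> forwardMax(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxLen);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    /** Reverse maximum matching: the same idea, scanning right to left. */
    static List<String> reverseMax(String text, Set<String> dict, int maxLen) {
        LinkedList<String> words = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLen);
            // Shrink the window from the left until it is a dictionary word or a single character.
            while (start < end - 1 && !dict.contains(text.substring(start, end))) {
                start++;
            }
            words.addFirst(text.substring(start, end));
            end = start;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(forwardMax("研究生命起源", dict, 3)); // [研究生, 命, 起源]
        System.out.println(reverseMax("研究生命起源", dict, 3)); // [研究, 生命, 起源]
    }
}
```

The minimum-matching variants would take the shortest dictionary word at each position instead of the longest. The two directions disagree on this input, which is the kind of ambiguity a bidirectional algorithm has to resolve.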
All of the bidirectional algorithms use ngram to disambiguate, so the evaluation is run separately with bigram and with trigram.
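One plausible way to choose between candidate segmentations is to score each one with bigram statistics and keep the highest-scoring candidate. The sketch below assumes a hypothetical in-memory bigram table keyed as `"w1:w2"` with made-up scores; how the word library actually stores and weights its ngram model is not shown in this article.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; the table layout and scores are assumptions. */
class BigramDisambiguationSketch {

    /** Sums the bigram scores of all adjacent word pairs in one candidate segmentation. */
    static int score(List<String> words, Map<String, Integer> bigram) {
        int total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += bigram.getOrDefault(words.get(i) + ":" + words.get(i + 1), 0);
        }
        return total;
    }

    /** Returns the candidate with the highest score; the earliest candidate wins ties. */
    static List<String> choose(List<List<String>> candidates, Map<String, Integer> bigram) {
        List<String> best = candidates.get(0);
        int bestScore = score(best, bigram);
        for (List<String> candidate : candidates.subList(1, candidates.size())) {
            int s = score(candidate, bigram);
            if (s > bestScore) {
                best = candidate;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> bigram = new HashMap<>();
        bigram.put("研究:生命", 3); // made-up scores
        bigram.put("生命:起源", 5);
        List<String> forward = Arrays.asList("研究生", "命", "起源");
        List<String> reverse = Arrays.asList("研究", "生命", "起源");
        System.out.println(choose(Arrays.asList(forward, reverse), bigram)); // [研究, 生命, 起源]
    }
}
```

A trigram model would score triples of adjacent words in the same way, at the cost of a larger table.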
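The test text has 2,533,709 lines and 28,374,490 characters in total. The standard text and the test text correspond line by line, and words in the standard text are separated by spaces. The evaluation criterion is strict equality: a line counts as perfect only if its segmentation is identical to the standard line. The core evaluation code is as follows: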
```java
/**
 * Evaluates segmentation quality.
 * @param resultText path of the actual segmentation result file
 * @param standardText path of the standard segmentation result file
 * @return evaluation result
 */
public static EvaluationResult evaluation(String resultText, String standardText) {
    int perfectLineCount=0;
    int wrongLineCount=0;
    int perfectCharCount=0;
    int wrongCharCount=0;
    try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
        BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
        String result;
        while( (result = resultReader.readLine()) != null ){
            result = result.trim();
            String standard = standardReader.readLine();
            if(standard == null){
                // the standard file ended early; stop instead of throwing a NullPointerException
                break;
            }
            standard = standard.trim();
            if(result.equals("")){
                continue;
            }
            if(result.equals(standard)){
                // the segmentation result is identical to the standard
                perfectLineCount++;
                perfectCharCount+=standard.replaceAll("\\s+", "").length();
            }else{
                // the segmentation result differs from the standard
                wrongLineCount++;
                wrongCharCount+=standard.replaceAll("\\s+", "").length();
            }
        }
    } catch (IOException ex) {
        LOGGER.error("Segmentation evaluation failed:", ex);
    }
    int totalLineCount = perfectLineCount+wrongLineCount;
    int totalCharCount = perfectCharCount+wrongCharCount;
    EvaluationResult er = new EvaluationResult();
    er.setPerfectCharCount(perfectCharCount);
    er.setPerfectLineCount(perfectLineCount);
    er.setTotalCharCount(totalCharCount);
    er.setTotalLineCount(totalLineCount);
    er.setWrongCharCount(wrongCharCount);
    er.setWrongLineCount(wrongLineCount);
    return er;
}
```
```java
/**
 * Result of a Chinese word segmentation evaluation
 * @author 楊尚川
 */
public class EvaluationResult implements Comparable<EvaluationResult>{
    // Referenced by toString(); their declarations were omitted from the original excerpt.
    // SegmentationAlgorithm is the word library's enum of segmentation algorithms.
    private SegmentationAlgorithm segmentationAlgorithm;
    private float segSpeed;
    private int totalLineCount;
    private int perfectLineCount;
    private int wrongLineCount;
    private int totalCharCount;
    private int perfectCharCount;
    private int wrongCharCount;

    // Rates are percentages in the range 0..100
    public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
    public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
    public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
    public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

    public int getTotalLineCount() { return totalLineCount; }
    public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
    public int getPerfectLineCount() { return perfectLineCount; }
    public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
    public int getWrongLineCount() { return wrongLineCount; }
    public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
    public int getTotalCharCount() { return totalCharCount; }
    public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
    public int getPerfectCharCount() { return perfectCharCount; }
    public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
    public int getWrongCharCount() { return wrongCharCount; }
    public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

    @Override
    public String toString(){
        return segmentationAlgorithm.name()+"("+segmentationAlgorithm.getDes()+"):"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +"  Char error rate: "+getCharWrongRate()+"%"
                +"  Total chars: "+totalCharCount
                +"  Perfect chars: "+perfectCharCount
                +"  Wrong chars: "+wrongCharCount;
    }

    // Sorts in descending order of line perfect rate
    @Override
    public int compareTo(EvaluationResult other) {
        if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
            return 1;
        }
        if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
            return -1;
        }
        return 0;
    }
}
```
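To make the strict criterion concrete, here is a small in-memory sketch of the same counting logic applied to three lines. The class `StrictEvaluationSketch` and the sample lines are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; mirrors the counting rules on in-memory lines instead of files. */
class StrictEvaluationSketch {

    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        for (int i = 0; i < results.size() && i < standards.size(); i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines are skipped entirely
            }
            // Characters are counted on the standard line with all whitespace removed.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> standard = Arrays.asList("研究 生命 起源", "中文 信息", "上海 人民");
        List<String> result = Arrays.asList("研究生 命 起源", "中文 信息", "上海 人民");
        int[] c = evaluate(result, standard);
        System.out.println("perfect lines: " + c[0] + ", wrong lines: " + c[1]); // perfect lines: 2, wrong lines: 1
        System.out.println("perfect chars: " + c[2] + ", wrong chars: " + c[3]); // perfect chars: 8, wrong chars: 6
    }
}
```

A single wrong word boundary marks the whole line, and all of its characters, as wrong. This is why the character perfect rate is lower than the line perfect rate whenever longer lines fail more often than shorter ones.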
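Evaluation results of the word segmenter using trigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Line perfect rate: 55.352688%  Line error rate: 44.647312%  Total lines: 2533709  Perfect lines: 1402476  Wrong lines: 1131233
Char perfect rate: 46.23227%  Char error rate: 53.76773%  Total chars: 28374490  Perfect chars: 13118171  Wrong chars: 15256319

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Line perfect rate: 50.16934%  Line error rate: 49.83066%  Total lines: 2533709  Perfect lines: 1271145  Wrong lines: 1262564
Char perfect rate: 40.692997%  Char error rate: 59.307003%  Total chars: 28374490  Perfect chars: 11546430  Wrong chars: 16828060

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598%  Char error rate: 63.32402%  Total chars: 28374490  Perfect chars: 10406622  Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861%  Char error rate: 63.271393%  Total chars: 28374490  Perfect chars: 10421556  Wrong chars: 17952934

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Line perfect rate: 45.863907%  Line error rate: 54.136093%  Total lines: 2533709  Perfect lines: 1162058  Wrong lines: 1371651
Char perfect rate: 35.942123%  Char error rate: 64.05788%  Total chars: 28374490  Perfect chars: 10198395  Wrong chars: 18176095

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978%  Char error rate: 68.32102%  Total chars: 28374490  Perfect chars: 8988748  Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812%  Char error rate: 73.14019%  Total chars: 28374490  Perfect chars: 7621334  Wrong chars: 20753156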
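Evaluation results of the word segmenter using bigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Line perfect rate: 55.31531%  Line error rate: 44.68469%  Total lines: 2533709  Perfect lines: 1401529  Wrong lines: 1132180
Char perfect rate: 45.834396%  Char error rate: 54.165604%  Total chars: 28374490  Perfect chars: 13005277  Wrong chars: 15369213

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Line perfect rate: 52.007233%  Line error rate: 47.992767%  Total lines: 2533709  Perfect lines: 1317712  Wrong lines: 1215997
Char perfect rate: 42.424194%  Char error rate: 57.575806%  Total chars: 28374490  Perfect chars: 12037649  Wrong chars: 16336841

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Line perfect rate: 46.766422%  Line error rate: 53.23358%  Total lines: 2533709  Perfect lines: 1184925  Wrong lines: 1348784
Char perfect rate: 36.52718%  Char error rate: 63.47282%  Total chars: 28374490  Perfect chars: 10364401  Wrong chars: 18010089

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598%  Char error rate: 63.32402%  Total chars: 28374490  Perfect chars: 10406622  Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861%  Char error rate: 63.271393%  Total chars: 28374490  Perfect chars: 10421556  Wrong chars: 17952934

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978%  Char error rate: 68.32102%  Total chars: 28374490  Perfect chars: 8988748  Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812%  Char error rate: 73.14019%  Total chars: 28374490  Perfect chars: 7621334  Wrong chars: 20753156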
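Evaluation results for Ansj 0.9:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 495.9188 chars/ms
Line perfect rate: 58.609295%  Line error rate: 41.390705%  Total lines: 2533709  Perfect lines: 1484989  Wrong lines: 1048720
Char perfect rate: 50.97614%  Char error rate: 49.023857%  Total chars: 28374490  Perfect chars: 14464220  Wrong chars: 13910270

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 350.7527 chars/ms
Line perfect rate: 58.60353%  Line error rate: 41.396465%  Total lines: 2533709  Perfect lines: 1484843  Wrong lines: 1048866
Char perfect rate: 50.75546%  Char error rate: 49.244545%  Total chars: 28374490  Perfect chars: 14401602  Wrong chars: 13972888

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 532.65424 chars/ms
Line perfect rate: 54.028584%  Line error rate: 45.97142%  Total lines: 2533709  Perfect lines: 1368927  Wrong lines: 1164782
Char perfect rate: 46.84512%  Char error rate: 53.15488%  Total chars: 28374490  Perfect chars: 13292064  Wrong chars: 15082426

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 564.6103 chars/ms
Line perfect rate: 53.510803%  Line error rate: 46.489197%  Total lines: 2533709  Perfect lines: 1355808  Wrong lines: 1177901
Char perfect rate: 46.355087%  Char error rate: 53.644913%  Total chars: 28374490  Perfect chars: 13153019  Wrong chars: 15221471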
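Evaluation results for Ansj 1.4:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 581.7306 chars/ms
Line perfect rate: 58.60302%  Line error rate: 41.39698%  Total lines: 2533709  Perfect lines: 1484830  Wrong lines: 1048879
Char perfect rate: 50.968987%  Char error rate: 49.031013%  Total chars: 28374490  Perfect chars: 14462190  Wrong chars: 13912300

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 138.81165 chars/ms
Line perfect rate: 58.1515%  Line error rate: 41.8485%  Total lines: 2533687  Perfect lines: 1473377  Wrong lines: 1060310
Char perfect rate: 49.806484%  Char error rate: 50.19352%  Total chars: 28374398  Perfect chars: 14132290  Wrong chars: 14242108

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 627.68475 chars/ms
Line perfect rate: 55.3174%  Line error rate: 44.6826%  Total lines: 2533709  Perfect lines: 1401582  Wrong lines: 1132127
Char perfect rate: 48.177986%  Char error rate: 51.822014%  Total chars: 28374490  Perfect chars: 13670258  Wrong chars: 14704232

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 715.55176 chars/ms
Line perfect rate: 50.89444%  Line error rate: 49.10556%  Total lines: 2533709  Perfect lines: 1289517  Wrong lines: 1244192
Char perfect rate: 42.965115%  Char error rate: 57.034885%  Total chars: 28374490  Perfect chars: 12191132  Wrong chars: 16183358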
The Ansj evaluation program is as follows:
```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

/**
 * Segmentation quality evaluation for the Ansj segmenter
 * @author 楊尚川
 */
public class AnsjEvaluation {
    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt
        // can be downloaded from: http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();

        // segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    /**
     * Segments the input file line by line and writes the words, separated by spaces, to the output file.
     * @return segmentation speed in characters per millisecond
     */
    private static float seg(final String input, final String output, final String type) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                switch(type){
                    case "BaseAnalysis":
                        for(Term term : BaseAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "ToAnalysis":
                        for(Term term : ToAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "NlpAnalysis":
                        try{
                            for(Term term : NlpAnalysis.parse(line)){
                                writer.write(term.getName()+" ");
                            }
                        }catch(Exception e){
                            // Exceptions are swallowed here: a line that fails before any word is written
                            // stays empty and is then skipped by evaluation().
                        }
                        break;
                    case "IndexAnalysis":
                        for(Term term : IndexAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                }
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes; 2.99 approximates the bytes per character of this UTF-8 file
                    System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Characters: "+textLength);
            System.out.println("Time taken: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    /**
     * Evaluates segmentation quality.
     * @param resultText path of the actual segmentation result file
     * @param standardText path of the standard segmentation result file
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // the standard file ended early; stop instead of throwing a NullPointerException
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // the segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // the segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult>{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        // Rates are percentages in the range 0..100
        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +"  Char error rate: "+getCharWrongRate()+"%"
                    +"  Total chars: "+totalCharCount
                    +"  Perfect chars: "+perfectCharCount
                    +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(EvaluationResult other) {
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
```
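Evaluation results for MMSeg4j 1.9.1:

MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Line perfect rate: 38.817604%  Line error rate: 61.182396%  Total lines: 2533688  Perfect lines: 983517  Wrong lines: 1550171
Char perfect rate: 29.604435%  Char error rate: 70.39557%  Total chars: 28374428  Perfect chars: 8400089  Wrong chars: 19974339

MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Line perfect rate: 37.570095%  Line error rate: 62.429905%  Total lines: 2533688  Perfect lines: 951909  Wrong lines: 1581779
Char perfect rate: 28.455273%  Char error rate: 71.54473%  Total chars: 28374428  Perfect chars: 8074021  Wrong chars: 20300407

MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Line perfect rate: 34.27573%  Line error rate: 65.72427%  Total lines: 2533688  Perfect lines: 868440  Wrong lines: 1665248
Char perfect rate: 25.20896%  Char error rate: 74.79104%  Total chars: 28374428  Perfect chars: 7152898  Wrong chars: 21221530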
The MMSeg4j 1.9.1 evaluation program is as follows:
```java
import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Segmentation quality evaluation for the MMSeg4j segmenter
 * @author 楊尚川
 */
public class MMSeg4jEvaluation {
    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt
        // can be downloaded from: http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();

        // segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    /**
     * Segments the input file line by line and writes the words, separated by spaces, to the output file.
     * @return segmentation speed in characters per millisecond
     */
    private static float seg(final String input, final String output, final Seg seg) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, seg));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes; 2.99 approximates the bytes per character of this UTF-8 file
                    System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Characters: "+textLength);
            System.out.println("Time taken: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    /** Segments one line of text and returns the words joined by single spaces. */
    private static String seg(String text, Seg seg) throws IOException {
        StringBuilder result = new StringBuilder();
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
        Word word = null;
        while((word=mmSeg.next())!=null) {
            result.append(word.getString()).append(" ");
        }
        return result.toString().trim();
    }

    /**
     * Evaluates segmentation quality.
     * @param resultText path of the actual segmentation result file
     * @param standardText path of the standard segmentation result file
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // the standard file ended early; stop instead of throwing a NullPointerException
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // the segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // the segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult>{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        // Rates are percentages in the range 0..100
        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +"  Char error rate: "+getCharWrongRate()+"%"
                    +"  Total chars: "+totalCharCount
                    +"  Perfect chars: "+perfectCharCount
                    +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(EvaluationResult other) {
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
```
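Evaluation results for ik-analyzer2012_u6:

IKAnalyzer (smart segmentation):
Segmentation speed: 178.3516 chars/ms
Line perfect rate: 37.55943%  Line error rate: 62.440567%  Total lines: 2533686  Perfect lines: 951638  Wrong lines: 1582048
Char perfect rate: 27.978464%  Char error rate: 72.02154%  Total chars: 28374416  Perfect chars: 7938726  Wrong chars: 20435690

IKAnalyzer (fine-grained segmentation):
Segmentation speed: 182.97859 chars/ms
Line perfect rate: 18.872742%  Line error rate: 81.12726%  Total lines: 2533686  Perfect lines: 478176  Wrong lines: 2055510
Char perfect rate: 10.936535%  Char error rate: 89.06347%  Total chars: 28374416  Perfect chars: 3103178  Wrong chars: 25271238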
The ik-analyzer2012_u6 evaluation program is as follows:
```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * Segmentation quality evaluation for the IKAnalyzer segmenter
 * @author 楊尚川
 */
public class IKAnalyzerEvaluation {
    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt
        // can be downloaded from: http://pan.baidu.com/s/1hqihzjY
        List<EvaluationResult> list = new ArrayList<>();

        // segment the text (smart mode)
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (smart segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text (fine-grained mode)
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
        // evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer (fine-grained segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }

    /**
     * Segments the input file line by line and writes the words, separated by spaces, to the output file.
     * @return segmentation speed in characters per millisecond
     */
    private static float seg(final String input, final String output, final boolean useSmart) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, useSmart));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes; 2.99 approximates the bytes per character of this UTF-8 file
                    System.out.println("Progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Characters: "+textLength);
            System.out.println("Time taken: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");
        }
        return rate;
    }

    /** Segments one line of text and returns the words joined by single spaces. */
    private static String seg(String text, boolean useSmart) throws IOException {
        StringBuilder result = new StringBuilder();
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme word = null;
        while((word=ik.next())!=null) {
            result.append(word.getLexemeText()).append(" ");
        }
        return result.toString().trim();
    }

    /**
     * Evaluates segmentation quality.
     * @param resultText path of the actual segmentation result file
     * @param standardText path of the standard segmentation result file
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standard = standardReader.readLine();
                if(standard == null){
                    // the standard file ended early; stop instead of throwing a NullPointerException
                    break;
                }
                standard = standard.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // the segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // the segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);
        return er;
    }

    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable<EvaluationResult>{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() { return analyzer; }
        public void setAnalyzer(String analyzer) { this.analyzer = analyzer; }
        public float getSegSpeed() { return segSpeed; }
        public void setSegSpeed(float segSpeed) { this.segSpeed = segSpeed; }

        // Rates are percentages in the range 0..100
        public float getLinePerfectRate(){ return perfectLineCount/(float)totalLineCount*100; }
        public float getLineWrongRate(){ return wrongLineCount/(float)totalLineCount*100; }
        public float getCharPerfectRate(){ return perfectCharCount/(float)totalCharCount*100; }
        public float getCharWrongRate(){ return wrongCharCount/(float)totalCharCount*100; }

        public int getTotalLineCount() { return totalLineCount; }
        public void setTotalLineCount(int totalLineCount) { this.totalLineCount = totalLineCount; }
        public int getPerfectLineCount() { return perfectLineCount; }
        public void setPerfectLineCount(int perfectLineCount) { this.perfectLineCount = perfectLineCount; }
        public int getWrongLineCount() { return wrongLineCount; }
        public void setWrongLineCount(int wrongLineCount) { this.wrongLineCount = wrongLineCount; }
        public int getTotalCharCount() { return totalCharCount; }
        public void setTotalCharCount(int totalCharCount) { this.totalCharCount = totalCharCount; }
        public int getPerfectCharCount() { return perfectCharCount; }
        public void setPerfectCharCount(int perfectCharCount) { this.perfectCharCount = perfectCharCount; }
        public int getWrongCharCount() { return wrongCharCount; }
        public void setWrongCharCount(int wrongCharCount) { this.wrongCharCount = wrongCharCount; }

        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +"  Char error rate: "+getCharWrongRate()+"%"
                    +"  Total chars: "+totalCharCount
                    +"  Perfect chars: "+perfectCharCount
                    +"  Wrong chars: "+wrongCharCount;
        }

        // Sorts in descending order of line perfect rate
        @Override
        public int compareTo(EvaluationResult other) {
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
```
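The evaluation programs differ only in how a single line is segmented; the surrounding loop and the speed calculation are the same. As a rough sketch of that shared part, the example below works on in-memory lines. The class `SegHarnessSketch`, its method names and the toy one-character segmenter are hypothetical, and the guard against a 0 ms run is an addition that the original programs do not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; not part of any of the evaluated libraries. */
class SegHarnessSketch {

    /**
     * Segments every non-blank line with the given segmenter.
     * Blank lines stay blank so that line numbers still align with the standard text.
     */
    static List<String> segmentAll(List<String> lines, Function<String, String> segmenter) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.trim().isEmpty() ? "" : segmenter.apply(line).trim());
        }
        return out;
    }

    /** Speed in characters per millisecond, the unit used in the results. */
    static float charsPerMilli(int textLength, long costMillis) {
        // Guard against a 0 ms run, which would otherwise yield Infinity.
        return textLength / (float) Math.max(1, costMillis);
    }

    public static void main(String[] args) {
        // Toy segmenter: every character becomes its own word.
        Function<String, String> perChar = s -> String.join(" ", s.split(""));
        System.out.println(segmentAll(Arrays.asList("ab", " ", "cd"), perChar)); // [a b, , c d]
        System.out.println(charsPerMilli(1000, 4)); // 250.0
    }
}
```

Any of the segmenters above could be plugged in as the `segmenter` function, which would let the duplicated `evaluation` method and `EvaluationResult` class be shared instead of copied into each program.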
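The evaluation programs for ansj, mmseg4j and ik-analyzer can be downloaded from the attachment. For the word segmenter, simply run the evaluation.bat script in the project root directory.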
References: