Evaluating the segmentation quality of the word, ansj, mmseg4j, and ik-analyzer tokenizers

Date: 2019-11-16

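word is a Chinese word segmentation component written in Java. It provides several dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can identify out-of-vocabulary words such as person names, place names, and organization names. Plugins for Lucene, Solr, and ElasticSearch are also provided.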

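The evaluation of the word tokenizer covers the following seven segmentation algorithms:

Forward maximum matching: MaximumMatching
Reverse maximum matching: ReverseMaximumMatching
Forward minimum matching: MinimumMatching
Reverse minimum matching: ReverseMinimumMatching
Bidirectional maximum matching: BidirectionalMaximumMatching
Bidirectional minimum matching: BidirectionalMinimumMatching
Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching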
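
To make the matching direction concrete, here is a minimal sketch of forward and reverse maximum matching. It is not the word library's implementation: the class name, the four-word toy dictionary, and the maximum word length of 3 are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {
    // Toy dictionary and length limit; both are assumptions for illustration.
    static final Set<String> DICT = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
    static final int MAX_LEN = 3;

    /** Forward maximum matching: scan left to right, taking the longest dictionary word each time. */
    static List<String> forward(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + MAX_LEN);
            // Shrink the candidate from the right until it is a dictionary word or a single character.
            while (end - start > 1 && !DICT.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    /** Reverse maximum matching: scan right to left, taking the longest dictionary word each time. */
    static List<String> reverse(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - MAX_LEN);
            // Shrink the candidate from the left until it is a dictionary word or a single character.
            while (end - start > 1 && !DICT.contains(text.substring(start, end))) {
                start++;
            }
            words.add(text.substring(start, end));
            end = start;
        }
        // Words were collected back to front, so restore reading order.
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(forward("研究生命起源")); // [研究生, 命, 起源]
        System.out.println(reverse("研究生命起源")); // [研究, 生命, 起源]
    }
}
```

With this toy dictionary the two directions disagree on the same input, which is exactly the kind of ambiguity a bidirectional algorithm has to resolve.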

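All of the bidirectional algorithms use an n-gram model for disambiguation, so they are evaluated separately with a bigram model and with a trigram model.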
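
The article does not show how word scores candidate segmentations, so the following is only a sketch of the general idea: score each candidate by the bigram scores of its adjacent word pairs and keep the higher-scoring one. The class name, the score table, and the tie-breaking rule are assumptions, not word's actual model.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramChoiceSketch {
    // Hypothetical bigram scores keyed by "left right"; real scores would be learned from a corpus.
    static final Map<String, Integer> BIGRAM = new HashMap<>();
    static {
        BIGRAM.put("研究 生命", 8);
        BIGRAM.put("生命 起源", 12);
    }

    /** Sum of the scores of all adjacent word pairs; unseen pairs score 0. */
    static int score(List<String> words) {
        int total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += BIGRAM.getOrDefault(words.get(i) + " " + words.get(i + 1), 0);
        }
        return total;
    }

    /** Keep the candidate with the higher score; ties go to the first argument. */
    static List<String> choose(List<String> a, List<String> b) {
        return score(b) > score(a) ? b : a;
    }

    public static void main(String[] args) {
        // Two candidate segmentations of the same sentence, e.g. from a forward and a reverse pass.
        List<String> forward = Arrays.asList("研究生", "命", "起源");
        List<String> reverse = Arrays.asList("研究", "生命", "起源");
        System.out.println(choose(forward, reverse)); // [研究, 生命, 起源]
    }
}
```

A trigram variant would score triples of adjacent words instead of pairs, at the cost of more lookups per candidate.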

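The test text has 2,533,709 lines and 28,374,490 characters in total. The standard text corresponds to the test text line by line, and the words in the standard text are separated by spaces. The criterion is strict equality: a line counts as perfect only if the segmented line is identical to the standard line. The core evaluation code is as follows: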

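/**
 * Evaluate segmentation quality
 * @param resultText path of the file holding the actual segmentation result
 * @param standardText path of the file holding the standard segmentation
 * @return the evaluation result
 */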
public static EvaluationResult evaluation(String resultText, String standardText) {
	int perfectLineCount=0;
	int wrongLineCount=0;
	int perfectCharCount=0;
	int wrongCharCount=0;
	try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
		BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
		String result;
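		while( (result = resultReader.readLine()) != null ){
			result = result.trim();
			String standardLine = standardReader.readLine();
			if(standardLine == null){
				// the standard file has fewer lines than the result file; stop comparing
				break;
			}
			String standard = standardLine.trim();
			if(result.equals("")){
				continue;
			}
			if(result.equals(standard)){
				// the segmentation result is identical to the standard
				perfectLineCount++;
				perfectCharCount+=standard.replaceAll("\\s+", "").length();
			}else{
				// the segmentation result differs from the standard
				wrongLineCount++;
				wrongCharCount+=standard.replaceAll("\\s+", "").length();
			}
		}
	} catch (IOException ex) {
		LOGGER.error("Segmentation evaluation failed: ", ex);
	}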
	int totalLineCount = perfectLineCount+wrongLineCount;
	int totalCharCount = perfectCharCount+wrongCharCount;
	EvaluationResult er = new EvaluationResult();
	er.setPerfectCharCount(perfectCharCount);
	er.setPerfectLineCount(perfectLineCount);
	er.setTotalCharCount(totalCharCount);
	er.setTotalLineCount(totalLineCount);
	er.setWrongCharCount(wrongCharCount);
	er.setWrongLineCount(wrongLineCount);     
	return er;
}

/**
 * Evaluation result for Chinese word segmentation
 * @author 楊尚川
 */
public class EvaluationResult implements Comparable{
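    // toString() below reads these two fields; their declarations and accessors
    // are not shown in this excerpt and are assumed here
    private SegmentationAlgorithm segmentationAlgorithm;
    private float segSpeed;
    private int totalLineCount;
    private int perfectLineCount;
    private int wrongLineCount;
    private int totalCharCount;
    private int perfectCharCount;
    private int wrongCharCount;
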
    public float getLinePerfectRate(){
        return perfectLineCount/(float)totalLineCount*100;
    }
    public float getLineWrongRate(){
        return wrongLineCount/(float)totalLineCount*100;
    }
    public float getCharPerfectRate(){
        return perfectCharCount/(float)totalCharCount*100;
    }
    public float getCharWrongRate(){
        return wrongCharCount/(float)totalCharCount*100;
    }
    public int getTotalLineCount() {
        return totalLineCount;
    }
    public void setTotalLineCount(int totalLineCount) {
        this.totalLineCount = totalLineCount;
    }
    public int getPerfectLineCount() {
        return perfectLineCount;
    }
    public void setPerfectLineCount(int perfectLineCount) {
        this.perfectLineCount = perfectLineCount;
    }
    public int getWrongLineCount() {
        return wrongLineCount;
    }
    public void setWrongLineCount(int wrongLineCount) {
        this.wrongLineCount = wrongLineCount;
    }
    public int getTotalCharCount() {
        return totalCharCount;
    }
    public void setTotalCharCount(int totalCharCount) {
        this.totalCharCount = totalCharCount;
    }
    public int getPerfectCharCount() {
        return perfectCharCount;
    }
    public void setPerfectCharCount(int perfectCharCount) {
        this.perfectCharCount = perfectCharCount;
    }
    public int getWrongCharCount() {
        return wrongCharCount;
    }
    public void setWrongCharCount(int wrongCharCount) {
        this.wrongCharCount = wrongCharCount;
    }
    @Override
    public String toString(){
        return segmentationAlgorithm.name()+" ("+segmentationAlgorithm.getDes()+"):"
                +"\n"
                +"Segmentation speed: "+segSpeed+" chars/ms"
                +"\n"
                +"Line perfect rate: "+getLinePerfectRate()+"%"
                +"  Line error rate: "+getLineWrongRate()+"%"
                +"  Total lines: "+totalLineCount
                +"  Perfect lines: "+perfectLineCount
                +"  Wrong lines: "+wrongLineCount
                +"\n"
                +"Char perfect rate: "+getCharPerfectRate()+"%"
                +" Char error rate: "+getCharWrongRate()+"%"
                +" Total chars: "+totalCharCount
                +" Perfect chars: "+perfectCharCount
                +" Wrong chars: "+wrongCharCount;
    }
    @Override
    public int compareTo(Object o) {
        EvaluationResult other = (EvaluationResult)o;
        if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
            return 1;
        }
        if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
            return -1;
        }
        return 0;
    }
}
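
The strict line-level criterion can be tried out on a few in-memory lines. The sketch below mirrors the counting logic of the evaluation method but works on lists instead of files; the class and method names are hypothetical, and the sample lines are made up.

```java
import java.util.Arrays;
import java.util.List;

public class LineEvalSketch {
    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        for (int i = 0; i < results.size() && i < standards.size(); i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // empty result lines count toward neither total
            }
            // Characters are counted on the standard line with all whitespace removed.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    /** Percentage of part in total, computed the same way as the rate getters. */
    static float rate(int part, int total) {
        return part / (float) total * 100;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("研究 生命 起源", "研究生 命 起源", "");
        List<String> standards = Arrays.asList("研究 生命 起源", "研究 生命 起源", "你好");
        int[] c = evaluate(results, standards);
        System.out.println("perfect lines: " + c[0] + ", wrong lines: " + c[1]);
        System.out.println("line perfect rate: " + rate(c[0], c[0] + c[1]) + "%");
        System.out.println("char perfect rate: " + rate(c[2], c[2] + c[3]) + "%");
    }
}
```

Because one wrong boundary anywhere in a line marks the whole line, and all of its characters, as wrong, these rates are much harsher than word-level precision or recall would be.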

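Evaluation results for word using trigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 265.62566 chars/ms
Line perfect rate: 55.352688%  Line error rate: 44.647312%  Total lines: 2533709  Perfect lines: 1402476  Wrong lines: 1131233
Char perfect rate: 46.23227% Char error rate: 53.76773% Total chars: 28374490 Perfect chars: 13118171 Wrong chars: 15256319

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 335.62155 chars/ms
Line perfect rate: 50.16934%  Line error rate: 49.83066%  Total lines: 2533709  Perfect lines: 1271145  Wrong lines: 1262564
Char perfect rate: 40.692997% Char error rate: 59.307003% Total chars: 28374490 Perfect chars: 11546430 Wrong chars: 16828060

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 686.71045 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598% Char error rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 733.9535 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861% Char error rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 432.87375 chars/ms
Line perfect rate: 45.863907%  Line error rate: 54.136093%  Total lines: 2533709  Perfect lines: 1162058  Wrong lines: 1371651
Char perfect rate: 35.942123% Char error rate: 64.05788% Total chars: 28374490 Perfect chars: 10198395 Wrong chars: 18176095

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 1033.58636 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978% Char error rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1175.4431 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812% Char error rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156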

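Evaluation results for word using bigram:

BidirectionalMaximumMinimumMatching (bidirectional maximum-minimum matching):
Segmentation speed: 233.49121 chars/ms
Line perfect rate: 55.31531%  Line error rate: 44.68469%  Total lines: 2533709  Perfect lines: 1401529  Wrong lines: 1132180
Char perfect rate: 45.834396% Char error rate: 54.165604% Total chars: 28374490 Perfect chars: 13005277 Wrong chars: 15369213

BidirectionalMaximumMatching (bidirectional maximum matching):
Segmentation speed: 303.59401 chars/ms
Line perfect rate: 52.007233%  Line error rate: 47.992767%  Total lines: 2533709  Perfect lines: 1317712  Wrong lines: 1215997
Char perfect rate: 42.424194% Char error rate: 57.575806% Total chars: 28374490 Perfect chars: 12037649 Wrong chars: 16336841

BidirectionalMinimumMatching (bidirectional minimum matching):
Segmentation speed: 349.67215 chars/ms
Line perfect rate: 46.766422%  Line error rate: 53.23358%  Total lines: 2533709  Perfect lines: 1184925  Wrong lines: 1348784
Char perfect rate: 36.52718% Char error rate: 63.47282% Total chars: 28374490 Perfect chars: 10364401 Wrong chars: 18010089

ReverseMaximumMatching (reverse maximum matching):
Segmentation speed: 598.04272 chars/ms
Line perfect rate: 46.723125%  Line error rate: 53.27688%  Total lines: 2533709  Perfect lines: 1183828  Wrong lines: 1349881
Char perfect rate: 36.67598% Char error rate: 63.32402% Total chars: 28374490 Perfect chars: 10406622 Wrong chars: 17967868

MaximumMatching (forward maximum matching):
Segmentation speed: 676.7993 chars/ms
Line perfect rate: 46.661713%  Line error rate: 53.338287%  Total lines: 2533709  Perfect lines: 1182272  Wrong lines: 1351437
Char perfect rate: 36.72861% Char error rate: 63.271393% Total chars: 28374490 Perfect chars: 10421556 Wrong chars: 17952934

ReverseMinimumMatching (reverse minimum matching):
Segmentation speed: 806.9586 chars/ms
Line perfect rate: 41.776066%  Line error rate: 58.223934%  Total lines: 2533709  Perfect lines: 1058484  Wrong lines: 1475225
Char perfect rate: 31.678978% Char error rate: 68.32102% Total chars: 28374490 Perfect chars: 8988748 Wrong chars: 19385742

MinimumMatching (forward minimum matching):
Segmentation speed: 1020.9208 chars/ms
Line perfect rate: 36.853836%  Line error rate: 63.146164%  Total lines: 2533709  Perfect lines: 933769  Wrong lines: 1599940
Char perfect rate: 26.859812% Char error rate: 73.14019% Total chars: 28374490 Perfect chars: 7621334 Wrong chars: 20753156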

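Evaluation results for Ansj 0.9:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 495.9188 chars/ms
Line perfect rate: 58.609295%  Line error rate: 41.390705%  Total lines: 2533709  Perfect lines: 1484989  Wrong lines: 1048720
Char perfect rate: 50.97614% Char error rate: 49.023857% Total chars: 28374490 Perfect chars: 14464220 Wrong chars: 13910270

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 350.7527 chars/ms
Line perfect rate: 58.60353%  Line error rate: 41.396465%  Total lines: 2533709  Perfect lines: 1484843  Wrong lines: 1048866
Char perfect rate: 50.75546% Char error rate: 49.244545% Total chars: 28374490 Perfect chars: 14401602 Wrong chars: 13972888

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 532.65424 chars/ms
Line perfect rate: 54.028584%  Line error rate: 45.97142%  Total lines: 2533709  Perfect lines: 1368927  Wrong lines: 1164782
Char perfect rate: 46.84512% Char error rate: 53.15488% Total chars: 28374490 Perfect chars: 13292064 Wrong chars: 15082426

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 564.6103 chars/ms
Line perfect rate: 53.510803%  Line error rate: 46.489197%  Total lines: 2533709  Perfect lines: 1355808  Wrong lines: 1177901
Char perfect rate: 46.355087% Char error rate: 53.644913% Total chars: 28374490 Perfect chars: 13153019 Wrong chars: 15221471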

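Evaluation results for Ansj 1.4:

Ansj ToAnalysis (precise segmentation):
Segmentation speed: 581.7306 chars/ms
Line perfect rate: 58.60302%  Line error rate: 41.39698%  Total lines: 2533709  Perfect lines: 1484830  Wrong lines: 1048879
Char perfect rate: 50.968987% Char error rate: 49.031013% Total chars: 28374490 Perfect chars: 14462190 Wrong chars: 13912300

Ansj NlpAnalysis (NLP segmentation):
Segmentation speed: 138.81165 chars/ms
Line perfect rate: 58.1515%  Line error rate: 41.8485%  Total lines: 2533687  Perfect lines: 1473377  Wrong lines: 1060310
Char perfect rate: 49.806484% Char error rate: 50.19352% Total chars: 28374398 Perfect chars: 14132290 Wrong chars: 14242108

Ansj BaseAnalysis (basic segmentation):
Segmentation speed: 627.68475 chars/ms
Line perfect rate: 55.3174%  Line error rate: 44.6826%  Total lines: 2533709  Perfect lines: 1401582  Wrong lines: 1132127
Char perfect rate: 48.177986% Char error rate: 51.822014% Total chars: 28374490 Perfect chars: 13670258 Wrong chars: 14704232

Ansj IndexAnalysis (index-oriented segmentation):
Segmentation speed: 715.55176 chars/ms
Line perfect rate: 50.89444%  Line error rate: 49.10556%  Total lines: 2533709  Perfect lines: 1289517  Wrong lines: 1244192
Char perfect rate: 42.965115% Char error rate: 57.034885% Total chars: 28374490 Perfect chars: 12191132 Wrong chars: 16183358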

The Ansj evaluation program is as follows:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.NlpAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

/**
 * Segmentation quality evaluation for the Ansj tokenizer
 * @author 楊尚川
 */
public class AnsjEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY

        List<EvaluationResult> list = new ArrayList<>();
        // segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-BaseAnalysis.txt", "BaseAnalysis");
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-BaseAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj BaseAnalysis (basic segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-ToAnalysis.txt", "ToAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-ToAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj ToAnalysis (precise segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-NlpAnalysis.txt", "NlpAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-NlpAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj NlpAnalysis (NLP segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-IndexAnalysis.txt", "IndexAnalysis");
        // evaluate the segmentation result
        result = evaluation("d:/result-text-IndexAnalysis.txt", "d:/standard-text.txt");
        result.setAnalyzer("Ansj IndexAnalysis (index-oriented segmentation)");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
    private static float seg(final String input, final String output, final String type) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                switch(type){
                    case "BaseAnalysis":
                        for(Term term : BaseAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "ToAnalysis":
                        for(Term term : ToAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                    case "NlpAnalysis":
                        try{
                            for(Term term : NlpAnalysis.parse(line)){
                                writer.write(term.getName()+" ");
                            }
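                        }catch(Exception e){
                            // deliberately ignored: a line that fails here is written out incomplete or empty,
                            // and evaluation() skips empty result lines
                        }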
                        break;
                    case "IndexAnalysis":
                        for(Term term : IndexAnalysis.parse(line)){
                            writer.write(term.getName()+" ");
                        }
                        break;
                }                
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
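                    // 2.99 is roughly the number of UTF-8 bytes per character in mostly-Chinese text
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Elapsed time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");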
        }
        return rate;
    }
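    /**
     * Evaluate segmentation quality
     * @param resultText path of the file holding the actual segmentation result
     * @param standardText path of the file holding the standard segmentation
     * @return the evaluation result
     */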
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
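            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standardLine = standardReader.readLine();
                if(standardLine == null){
                    // the standard file has fewer lines than the result file; stop comparing
                    break;
                }
                String standard = standardLine.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // the segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // the segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }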
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);     
        return er;
    }
    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +" Char error rate: "+getCharWrongRate()+"%"
                    +" Total chars: "+totalCharCount
                    +" Perfect chars: "+perfectCharCount
                    +" Wrong chars: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
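
The segment-then-measure pattern used by the seg method does not depend on any particular tokenizer. As a rough sketch, the segmenter can be passed in as a function; the class name, the method names, and the per-character stand-in segmenter below are all hypothetical, and nanosecond timing is a swapped-in choice to avoid a zero elapsed time on tiny inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SegHarnessSketch {
    /** Segments every non-blank line and joins its words with single spaces; blank lines stay empty. */
    static List<String> segmentAll(List<String> lines, Function<String, List<String>> segmenter) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.trim().isEmpty() ? "" : String.join(" ", segmenter.apply(line)));
        }
        return out;
    }

    /** Characters per millisecond; the elapsed time is clamped to at least one nanosecond. */
    static double charsPerMs(int chars, long elapsedNanos) {
        return chars / (Math.max(1L, elapsedNanos) / 1_000_000.0);
    }

    public static void main(String[] args) {
        // Stand-in segmenter that emits one word per character; a real run would call a tokenizer here.
        Function<String, List<String>> perChar = s -> Arrays.asList(s.split(""));
        List<String> lines = Arrays.asList("中文分詞", "", "評估");
        long start = System.nanoTime();
        List<String> result = segmentAll(lines, perChar);
        long elapsed = System.nanoTime() - start;
        System.out.println(result);
        System.out.println(charsPerMs(6, elapsed) + " chars/ms");
    }
}
```

Keeping blank lines in the output preserves the line-by-line alignment with the standard text that the evaluation step relies on.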

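Evaluation results for MMSeg4j 1.9.1:

MMSeg4j ComplexSeg:
Segmentation speed: 794.24805 chars/ms
Line perfect rate: 38.817604%  Line error rate: 61.182396%  Total lines: 2533688  Perfect lines: 983517  Wrong lines: 1550171
Char perfect rate: 29.604435% Char error rate: 70.39557% Total chars: 28374428 Perfect chars: 8400089 Wrong chars: 19974339

MMSeg4j SimpleSeg:
Segmentation speed: 1026.1058 chars/ms
Line perfect rate: 37.570095%  Line error rate: 62.429905%  Total lines: 2533688  Perfect lines: 951909  Wrong lines: 1581779
Char perfect rate: 28.455273% Char error rate: 71.54473% Total chars: 28374428 Perfect chars: 8074021 Wrong chars: 20300407

MMSeg4j MaxWordSeg:
Segmentation speed: 813.0676 chars/ms
Line perfect rate: 34.27573%  Line error rate: 65.72427%  Total lines: 2533688  Perfect lines: 868440  Wrong lines: 1665248
Char perfect rate: 25.20896% Char error rate: 74.79104% Total chars: 28374428 Perfect chars: 7152898 Wrong chars: 21221530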

The MMSeg4j 1.9.1 evaluation program is as follows:

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Segmentation quality evaluation for the MMSeg4j tokenizer
 * @author 楊尚川
 */
public class MMSeg4jEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY

        List<EvaluationResult> list = new ArrayList<>();
        Dictionary dic = Dictionary.getInstance();
        // segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", new ComplexSeg(dic));
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j ComplexSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", new SimpleSeg(dic));
        // evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j SimpleSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-MaxWordSeg.txt", new MaxWordSeg(dic));
        // evaluate the segmentation result
        result = evaluation("d:/result-text-MaxWordSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("MMSeg4j MaxWordSeg");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
    private static float seg(final String input, final String output, final Seg seg) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, seg));
                writer.write("\n");
                progress += line.length();
                if( progress > 500000){
                    progress = 0;
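                    // 2.99 is roughly the number of UTF-8 bytes per character in mostly-Chinese text
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Elapsed time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" chars/ms");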
        }
        return rate;
    }
    private static String seg(String text, Seg seg) throws IOException {
        StringBuilder result = new StringBuilder();
        MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
        Word word = null;
        while((word=mmSeg.next())!=null) {
            result.append(word.getString()).append(" ");			
        }
        return result.toString().trim();
    }
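    /**
     * Evaluate segmentation quality
     * @param resultText path of the file holding the actual segmentation result
     * @param standardText path of the file holding the standard segmentation
     * @return the evaluation result
     */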
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
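            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standardLine = standardReader.readLine();
                if(standardLine == null){
                    // the standard file has fewer lines than the result file; stop comparing
                    break;
                }
                String standard = standardLine.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // the segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // the segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }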
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);     
        return er;
    }
    /**
     * Evaluation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" chars/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Char perfect rate: "+getCharPerfectRate()+"%"
                    +" Char error rate: "+getCharWrongRate()+"%"
                    +" Total chars: "+totalCharCount
                    +" Perfect chars: "+perfectCharCount
                    +" Wrong chars: "+wrongCharCount;
        }
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}

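Evaluation results for ik-analyzer 2012_u6:

IKAnalyzer smart segmentation:
Segmentation speed: 178.3516 chars/ms
Line perfect rate: 37.55943%  Line error rate: 62.440567%  Total lines: 2533686  Perfect lines: 951638  Wrong lines: 1582048
Char perfect rate: 27.978464% Char error rate: 72.02154% Total chars: 28374416 Perfect chars: 7938726 Wrong chars: 20435690

IKAnalyzer fine-grained segmentation:
Segmentation speed: 182.97859 chars/ms
Line perfect rate: 18.872742%  Line error rate: 81.12726%  Total lines: 2533686  Perfect lines: 478176  Wrong lines: 2055510
Char perfect rate: 10.936535% Char error rate: 89.06347% Total chars: 28374416 Perfect chars: 3103178 Wrong chars: 25271238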

The ik-analyzer 2012_u6 evaluation program is as follows:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * Segmentation quality evaluation for the IKAnalyzer tokenizer
 * @author 楊尚川
 */
public class IKAnalyzerEvaluation {

    public static void main(String[] args) throws Exception{
        // The test file d:/test-text.txt and the standard segmentation file d:/standard-text.txt can be downloaded from:
        // http://pan.baidu.com/s/1hqihzjY

        List<EvaluationResult> list = new ArrayList<>();

        // segment the text
        float rate = seg("d:/test-text.txt", "d:/result-text-ComplexSeg.txt", true);
        // evaluate the segmentation result
        EvaluationResult result = evaluation("d:/result-text-ComplexSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer smart segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // segment the text
        rate = seg("d:/test-text.txt", "d:/result-text-SimpleSeg.txt", false);
        // evaluate the segmentation result
        result = evaluation("d:/result-text-SimpleSeg.txt", "d:/standard-text.txt");
        result.setAnalyzer("IKAnalyzer fine-grained segmentation");
        result.setSegSpeed(rate);
        list.add(result);

        // print the evaluation results, best line perfect rate first
        Collections.sort(list);
        System.out.println("");
        for(EvaluationResult r : list){
            System.out.println(r+"\n");
        }
    }
    private static float seg(final String input, final String output, final boolean useSmart) throws Exception{
        float rate = 0;
        try(BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(input),"utf-8"));
                BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output),"utf-8"))){
            long size = Files.size(Paths.get(input));
            System.out.println("size:"+size);
            System.out.println("File size: "+(float)size/1024/1024+" MB");
            int textLength=0;
            int progress=0;
            long start = System.currentTimeMillis();
            String line = null;
            while((line = reader.readLine()) != null){
                // Keep blank lines so the output stays aligned line by line with the standard file
                if("".equals(line.trim())){
                    writer.write("\n");
                    continue;
                }
                textLength += line.length();
                writer.write(seg(line, useSmart));
                writer.write("\n");
                progress += line.length();
                // Report progress roughly every 500000 characters
                if( progress > 500000){
                    progress = 0;
                    // size is in bytes; the factor 2.99 presumably approximates the UTF-8 bytes per character
                    System.out.println("Segmentation progress: "+(int)(textLength*2.99/size*100)+"%");
                }
            }
            long cost = System.currentTimeMillis() - start;
            rate = textLength/(float)cost;
            System.out.println("Character count: "+textLength);
            System.out.println("Segmentation time: "+cost+" ms");
            System.out.println("Segmentation speed: "+rate+" characters/ms");
        }
        return rate;
    }
    private static String seg(String text, boolean useSmart) throws IOException {
        StringBuilder result = new StringBuilder();
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        Lexeme word = null;
        while((word=ik.next())!=null) {
            result.append(word.getLexemeText()).append(" ");			
        }
        return result.toString().trim();
    }
    /**
     * Evaluates segmentation quality
     * @param resultText path of the file holding the actual segmentation result
     * @param standardText path of the file holding the standard segmentation result
     * @return evaluation result
     */
    private static EvaluationResult evaluation(String resultText, String standardText) {
        int perfectLineCount=0;
        int wrongLineCount=0;
        int perfectCharCount=0;
        int wrongCharCount=0;
        try(BufferedReader resultReader = new BufferedReader(new InputStreamReader(new FileInputStream(resultText),"utf-8"));
            BufferedReader standardReader = new BufferedReader(new InputStreamReader(new FileInputStream(standardText),"utf-8"))){
            String result;
            while( (result = resultReader.readLine()) != null ){
                result = result.trim();
                String standardLine = standardReader.readLine();
                if(standardLine == null){
                    // The standard file has fewer lines than the result file; stop instead of throwing a NullPointerException
                    break;
                }
                String standard = standardLine.trim();
                if(result.equals("")){
                    continue;
                }
                if(result.equals(standard)){
                    // The segmentation result is identical to the standard
                    perfectLineCount++;
                    perfectCharCount+=standard.replaceAll("\\s+", "").length();
                }else{
                    // The segmentation result differs from the standard
                    wrongLineCount++;
                    wrongCharCount+=standard.replaceAll("\\s+", "").length();
                }
            }
        } catch (IOException ex) {
            System.err.println("Segmentation evaluation failed: " + ex.getMessage());
        }
        int totalLineCount = perfectLineCount+wrongLineCount;
        int totalCharCount = perfectCharCount+wrongCharCount;
        EvaluationResult er = new EvaluationResult();
        er.setPerfectCharCount(perfectCharCount);
        er.setPerfectLineCount(perfectLineCount);
        er.setTotalCharCount(totalCharCount);
        er.setTotalLineCount(totalLineCount);
        er.setWrongCharCount(wrongCharCount);
        er.setWrongLineCount(wrongLineCount);     
        return er;
    }
    /**
     * Segmentation evaluation result
     */
    private static class EvaluationResult implements Comparable{
        private String analyzer;
        private float segSpeed;
        private int totalLineCount;
        private int perfectLineCount;
        private int wrongLineCount;
        private int totalCharCount;
        private int perfectCharCount;
        private int wrongCharCount;

        public String getAnalyzer() {
            return analyzer;
        }
        public void setAnalyzer(String analyzer) {
            this.analyzer = analyzer;
        }
        public float getSegSpeed() {
            return segSpeed;
        }
        public void setSegSpeed(float segSpeed) {
            this.segSpeed = segSpeed;
        }
        public float getLinePerfectRate(){
            return perfectLineCount/(float)totalLineCount*100;
        }
        public float getLineWrongRate(){
            return wrongLineCount/(float)totalLineCount*100;
        }
        public float getCharPerfectRate(){
            return perfectCharCount/(float)totalCharCount*100;
        }
        public float getCharWrongRate(){
            return wrongCharCount/(float)totalCharCount*100;
        }
        public int getTotalLineCount() {
            return totalLineCount;
        }
        public void setTotalLineCount(int totalLineCount) {
            this.totalLineCount = totalLineCount;
        }
        public int getPerfectLineCount() {
            return perfectLineCount;
        }
        public void setPerfectLineCount(int perfectLineCount) {
            this.perfectLineCount = perfectLineCount;
        }
        public int getWrongLineCount() {
            return wrongLineCount;
        }
        public void setWrongLineCount(int wrongLineCount) {
            this.wrongLineCount = wrongLineCount;
        }
        public int getTotalCharCount() {
            return totalCharCount;
        }
        public void setTotalCharCount(int totalCharCount) {
            this.totalCharCount = totalCharCount;
        }
        public int getPerfectCharCount() {
            return perfectCharCount;
        }
        public void setPerfectCharCount(int perfectCharCount) {
            this.perfectCharCount = perfectCharCount;
        }
        public int getWrongCharCount() {
            return wrongCharCount;
        }
        public void setWrongCharCount(int wrongCharCount) {
            this.wrongCharCount = wrongCharCount;
        }
        @Override
        public String toString(){
            return analyzer+":"
                    +"\n"
                    +"Segmentation speed: "+segSpeed+" characters/ms"
                    +"\n"
                    +"Line perfect rate: "+getLinePerfectRate()+"%"
                    +"  Line error rate: "+getLineWrongRate()+"%"
                    +"  Total lines: "+totalLineCount
                    +"  Perfect lines: "+perfectLineCount
                    +"  Wrong lines: "+wrongLineCount
                    +"\n"
                    +"Character perfect rate: "+getCharPerfectRate()+"%"
                    +" Character error rate: "+getCharWrongRate()+"%"
                    +" Total characters: "+totalCharCount
                    +" Perfect characters: "+perfectCharCount
                    +" Wrong characters: "+wrongCharCount;
        }
        // Orders results by line perfect rate, highest first
        @Override
        public int compareTo(Object o) {
            EvaluationResult other = (EvaluationResult)o;
            if(other.getLinePerfectRate() - getLinePerfectRate() > 0){
                return 1;
            }
            if(other.getLinePerfectRate() - getLinePerfectRate() < 0){
                return -1;
            }
            return 0;
        }
    }
}
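
The strict-match criterion used by `evaluation` is easier to see on a tiny in-memory example. The following is a minimal sketch, assuming the same rule as the listing: a line counts as perfect only if the whole segmented line equals the standard line, and character counts are taken from the standard line with whitespace removed. The class `StrictMatchDemo`, the method `evaluate`, and the sample sentences are hypothetical illustrations, not part of the original program.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of line-level strict matching; not part of the original program.
final class StrictMatchDemo {

    /** Returns {perfectLines, wrongLines, perfectChars, wrongChars}. */
    static int[] evaluate(List<String> results, List<String> standards) {
        int[] counts = new int[4];
        int n = Math.min(results.size(), standards.size());
        for (int i = 0; i < n; i++) {
            String result = results.get(i).trim();
            String standard = standards.get(i).trim();
            if (result.isEmpty()) {
                continue; // blank result lines are skipped, as in evaluation()
            }
            // Characters are counted on the standard line with spaces removed.
            int chars = standard.replaceAll("\\s+", "").length();
            if (result.equals(standard)) {
                counts[0]++;
                counts[2] += chars;
            } else {
                counts[1]++;
                counts[3] += chars;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> standards = Arrays.asList("分詞 效果 評估", "中文 分詞 組件");
        // The second line merges two words, so the whole line is counted as wrong.
        List<String> results = Arrays.asList("分詞 效果 評估", "中文分詞 組件");
        System.out.println(Arrays.toString(evaluate(results, standards)));
    }
}
```

Because one wrong boundary marks every character of a line as wrong, this metric is much harsher than word-level precision or recall, which should be kept in mind when reading the rates reported here.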

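The evaluation programs for ansj, mmseg4j, and ik-analyzer can be downloaded from the attachment. To evaluate the word segmenter, simply run the evaluation.bat script in the project root directory.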

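References:

1. Test dataset and standard dataset for the word segmenter evaluation

2. Evaluation program for the word segmenter

3. word segmenter home page

4. ansj segmenter home page

5. mmseg4j segmenter home page

6. ik-analyzer segmenter home page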