Chinese Word Segmentation Algorithms: Dictionary-Based Forward Maximum Matching

Date: 2019-11-20

Tags: Chinese word segmentation, algorithm, dictionary-based, forward maximum matching


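The dictionary-based forward maximum matching algorithm (longest word matched first) automatically adjusts its maximum match length to the dictionary file. Segmentation quality depends entirely on the dictionary.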

The flowchart of the algorithm is as follows:

The Java implementation is as follows:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Dictionary-based forward maximum matching algorithm
 * @author 楊尚川
 */
public class WordSeg {
    private static final List<String> DIC = new ArrayList<>();
    private static final int MAX_LENGTH;
    static{
        //Track the longest word in a local variable so the final field is assigned exactly once, even if loading fails
        int max=1;
        try {
            System.out.println("Initializing dictionary");
            int count=0;
            List<String> lines = Files.readAllLines(Paths.get("D:/dic.txt"), StandardCharsets.UTF_8);
            for(String line : lines){
                DIC.add(line);
                count++;
                if(line.length()>max){
                    max=line.length();
                }
            }
            System.out.println("Dictionary initialized, word count: "+count);
        } catch (IOException ex) {
            System.err.println("Failed to load dictionary: "+ex.getMessage());
        }
        MAX_LENGTH = max;
        System.out.println("Maximum word length: "+MAX_LENGTH);
    }
    public static void main(String[] args){
        String text = "楊尚川是APDPlat應用級產品開發平臺的作者";
        System.out.println(seg(text));
    }
    public static List<String> seg(String text){
        List<String> result = new ArrayList<>();
        while(text.length()>0){
            int len=MAX_LENGTH;
            if(text.length()<len){
                len=text.length();
            }
            //Take a prefix of the maximum length and look it up in the dictionary
            String tryWord = text.substring(0, len);
            while(!DIC.contains(tryWord)){
                //If a single character is left and it is not in the dictionary, emit it as a one-character token
                if(tryWord.length()==1){
                    break;
                }
                //No match: shorten the candidate by one character and try again
                tryWord=tryWord.substring(0, tryWord.length()-1);
            }
            result.add(tryWord);
            //Remove the segmented part from the remaining text
            text=text.substring(tryWord.length());
        }
        return result;
    }
}

The dictionary file can be downloaded as dic.rar. Simple, isn't it?
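To try the matching loop without the dictionary file, here is a minimal sketch with a tiny in-memory dictionary. The class name WordSegDemo, the four sample words, and the sample sentence are illustrative assumptions; they are not part of the article's code or of dic.rar. The sketch walks an index through the text instead of repeatedly cutting the string, but the matching rule is the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class WordSegDemo {
    // Forward maximum matching against any dictionary collection.
    // maxLength is the length of the longest dictionary word.
    public static List<String> seg(String text, Collection<String> dic, int maxLength) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(maxLength, text.length() - start);
            // Shrink the candidate until it is in the dictionary or only one character is left.
            while (len > 1 && !dic.contains(text.substring(start, start + len))) {
                len--;
            }
            result.add(text.substring(start, start + len));
            start += len;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary, for illustration only.
        List<String> dic = Arrays.asList("中華", "中華人民共和國", "人民", "萬歲");
        int maxLength = 0;
        for (String word : dic) {
            maxLength = Math.max(maxLength, word.length());
        }
        // The longest entry wins over the shorter "中華" and "人民".
        System.out.println(seg("中華人民共和國萬歲", dic, maxLength)); // prints [中華人民共和國, 萬歲]
    }
}
```

A character that matches no dictionary entry falls through to the one-character case, so with this dictionary "大中華" would be split into 大 and 中華.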

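The implementation is simple, but this dictionary holds 427452 words, and DIC.contains(tryWord) is called constantly to check whether a word is in the dictionary. Optimizing that one line can therefore markedly improve segmentation speed (while keeping in mind not to optimize prematurely).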

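The code above relies on the contains method of the JDK Collection interface to check whether a word is in the dictionary. Performance differs enormously between implementations of this method. The initial version uses ArrayList: List<String> DIC = new ArrayList<>(). How well does ArrayList perform, and is there a faster implementation?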

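Generally speaking, searching a sorted list is faster than searching an unsorted one, and searching within a partition is faster than scanning everything.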

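The source code of contains in ArrayList, LinkedList, and HashSet shows that ArrayList and LinkedList scan the whole list and take no advantage of ordering, whereas HashSet searches within a partition (a hash bucket): if the hashes are evenly distributed with few collisions, the list left to scan is very short or even empty. Theory aside, measuring is more reassuring. The test code is as follows: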

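import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Compares the performance of dictionary lookup implementations
 * @author 楊尚川
 */
public class SearchTest {
    //Word list used to generate random queries
    private static final List<String> DIC_FOR_TEST = new ArrayList<>();
    //Change the implementation of DIC here to compare different implementations
    private static final List<String> DIC = new ArrayList<>();
    static{
        try {
            System.out.println("Initializing dictionary");
            int count=0;
            List<String> lines = Files.readAllLines(Paths.get("D:/dic.txt"), StandardCharsets.UTF_8);
            for(String line : lines){
                DIC.add(line);
                DIC_FOR_TEST.add(line);
                count++;
            }
            System.out.println("Dictionary initialized, word count: "+count);
        } catch (IOException ex) {
            System.err.println("Failed to load dictionary: "+ex.getMessage());
        }
    }
    public static void main(String[] args){
        //Pick random words; one Random instance is enough, bounded by the actual dictionary size
        Random random = new Random();
        List<String> words = new ArrayList<>();
        for(int i=0;i<100000;i++){
            words.add(DIC_FOR_TEST.get(random.nextInt(DIC_FOR_TEST.size())));
        }
        //Count the hits so that the lookups have an observable result
        int hits = 0;
        long start = System.currentTimeMillis();
        for(String word : words){
            if(DIC.contains(word)){
                hits++;
            }
        }
        long cost = System.currentTimeMillis()-start;
        System.out.println("hits:"+hits+" cost time:"+cost+" ms");
    }
}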

# Each test was run 10 times and the results averaged
LinkedList     10000 queries       cost time:48812 ms
ArrayList      10000 queries       cost time:40219 ms
HashSet        10000 queries       cost time:8 ms
HashSet        1000000 queries     cost time:258 ms
HashSet        100000000 queries   cost time:28575 ms

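HashSet performs best, about three orders of magnitude faster than LinkedList and ArrayList. This matches the analysis above. LinkedList is somewhat slower than ArrayList: both scan the whole list, but LinkedList has to follow a reference to reach each next element, which adds work. LinkedList also uses more memory, because each node stores references to its predecessor and successor.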
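The comparison can be reproduced in rough form without dic.txt. The following is only a sketch under stated assumptions: the class name ContainsBench, the synthetic words ("w0", "w1", ...), and the sizes are made up for illustration, so the absolute timings will not match the article's figures. Only the gap between linear scans and hashing should be visible.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class ContainsBench {
    // Builds n synthetic "words"; placeholders, not the article's dictionary.
    public static List<String> syntheticWords(int n) {
        List<String> words = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            words.add("w" + i);
        }
        return words;
    }

    // Runs every query against dic and returns the number of hits.
    public static int countHits(Collection<String> dic, List<String> queries) {
        int hits = 0;
        for (String query : queries) {
            if (dic.contains(query)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = syntheticWords(20000);
        Random random = new Random(42);
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            queries.add(words.get(random.nextInt(words.size())));
        }
        // The same words stored in three Collection implementations.
        List<Collection<String>> candidates = new ArrayList<>();
        candidates.add(new LinkedList<>(words));
        candidates.add(new ArrayList<>(words));
        candidates.add(new HashSet<>(words));
        for (Collection<String> dic : candidates) {
            long start = System.nanoTime();
            int hits = countHits(dic, queries);
            long costMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(dic.getClass().getSimpleName()
                    + " hits:" + hits + " cost time:" + costMs + " ms");
        }
    }
}
```

A single untimed warm-up pass and repeated runs would make the numbers steadier; this sketch skips both for brevity.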

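HashSet already performs well, but what if the dictionary keeps growing and memory use grows with it? A data structure with performance close to HashSet that also compresses the dictionary data to reduce memory use would be ideal.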

A prefix tree (Trie) might offer the best of both worlds. Here is a hand-written Trie:

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Java implementation of a prefix tree
 * Used to check whether a given string is in the dictionary
 * @author 楊尚川
 */
public class Trie {
    private final TrieNode ROOT_NODE = new TrieNode('/');

    public boolean contains(String item){
        //Strip leading and trailing whitespace
        item=item.trim();
        int len = item.length();
        if(len < 1){
            return false;
        }
        //Start the search from the root node
        TrieNode node = ROOT_NODE;
        for(int i=0;i<len;i++){
            char character = item.charAt(i);
            TrieNode child = node.getChild(character);
            if(child == null){
                //No matching node found
                return false;
            }else{
                //Node found, keep descending
                node = child;
            }
        }
        //The item is a word only if the last node is marked as terminal
        return node.isTerminal();
    }
    public void addAll(List<String> items){
        for(String item : items){
            add(item);
        }
    }
    public void add(String item){
        //Strip leading and trailing whitespace
        item=item.trim();
        int len = item.length();
        if(len < 1){
            //Ignore empty items
            return;
        }
        //Start adding from the root node
        TrieNode node = ROOT_NODE;
        for(int i=0;i<len;i++){
            char character = item.charAt(i);
            TrieNode child = node.getChildIfNotExistThenCreate(character);
            //Descend to the child node
            node = child;
        }
        //Mark the terminal character: the path from the root to this node is a valid word
        node.setTerminal(true);
    }
    private static class TrieNode{
        private char character;
        private boolean terminal;
        private final Map<Character,TrieNode> children = new ConcurrentHashMap<>();        
        public TrieNode(char character){
            this.character = character;
        }
        public boolean isTerminal() {
            return terminal;
        }
        public void setTerminal(boolean terminal) {
            this.terminal = terminal;
        }        
        public char getCharacter() {
            return character;
        }
        public void setCharacter(char character) {
            this.character = character;
        }
        public Collection<TrieNode> getChildren() {
            return this.children.values();
        }
        public TrieNode getChild(char character) {
            return this.children.get(character);
        }        
        public TrieNode getChildIfNotExistThenCreate(char character) {
            TrieNode child = getChild(character);
            if(child == null){
                child = new TrieNode(character);
                addChild(child);
            }
            return child;
        }
        public void addChild(TrieNode child) {
            this.children.put(child.getCharacter(), child);
        }
        public void removeChild(TrieNode child) {
            this.children.remove(child.getCharacter());
        }        
    }
    
    public void show(){
        show(ROOT_NODE,"");
    }
    private void show(TrieNode node, String indent){
        if(node.isTerminal()){
            System.out.println(indent+node.getCharacter()+"(T)");
        }else{
            System.out.println(indent+node.getCharacter());
        }
        for(TrieNode item : node.getChildren()){
            show(item,indent+"\t");
        }
    }
    public static void main(String[] args){
        Trie trie = new Trie();
        trie.add("APDPlat");
        trie.add("APP");
        trie.add("APD");
        trie.add("Nutch");
        trie.add("Lucene");
        trie.add("Hadoop");
        trie.add("Solr");
        trie.add("楊尚川");
        trie.add("楊尚昆");
        trie.add("楊尚喜");
        trie.add("中華人民共和國");
        trie.add("中華人民打太極");
        trie.add("中華");
        trie.add("中心思想");
        trie.add("楊家將");        
        trie.show();
    }
}

Modify the earlier test code, changing List<String> DIC = new ArrayList<>() to Trie DIC = new Trie(), so that the Trie does the dictionary lookups. The final results are as follows:

# Each test was run 10 times and the results averaged
LinkedList     10000 queries       cost time:48812 ms
ArrayList      10000 queries       cost time:40219 ms
HashSet        10000 queries       cost time:8 ms
HashSet        1000000 queries     cost time:258 ms
HashSet        100000000 queries   cost time:28575 ms
Trie           10000 queries       cost time:15 ms
Trie           1000000 queries     cost time:1024 ms
Trie           100000000 queries   cost time:104635 ms

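Trie and HashSet are fairly close in speed, within half an order of magnitude. Surprisingly, jvisualvm shows that Trie uses about 2.6 times as much memory as HashSet, as the figures below show: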

HashSet:

Trie:

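The dictionary contains 427452 words. HashSet is built on HashMap, so the types using the most memory are HashMap$Node, char[], and String. Across several manually triggered GCs the instance counts of these three types keep changing, though they always stay above the word count of 427452. This Trie is built on ConcurrentHashMap, so the types using the most memory are ConcurrentHashMap, ConcurrentHashMap$Node[], ConcurrentHashMap$Node, Trie$TrieNode, and Character. Across several manually triggered GCs the Trie$TrieNode instance count stays constant, which shows that the 427452 words become 603141 nodes in the Trie.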
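How prefix sharing keeps the node count down can be seen on a handful of words. The sketch below is an illustration under my own assumptions: the class TrieNodeCount and its methods are hypothetical helpers, not part of the article's Trie, and the count excludes the root node.

```java
import java.util.HashMap;
import java.util.Map;

public class TrieNodeCount {
    // Bare node: only the child links are needed for counting.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    // Inserts all words into a fresh trie and returns how many nodes
    // had to be created (the root is not counted).
    public static int countNodes(String... words) {
        Node root = new Node();
        int created = 0;
        for (String word : words) {
            Node node = root;
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                Node child = node.children.get(c);
                if (child == null) {
                    child = new Node();
                    node.children.put(c, child);
                    created++;
                }
                node = child;
            }
        }
        return created;
    }

    // Total number of chars across all words, i.e. the node count
    // there would be if no prefix were shared.
    public static int countChars(String... words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] words = {"中華", "中華人民共和國", "中華人民打太極", "中心思想"};
        System.out.println("characters: " + countChars(words)); // prints characters: 20
        System.out.println("trie nodes: " + countNodes(words)); // prints trie nodes: 13
    }
}
```

By the article's figures, 603141 nodes for 427452 words is about 1.4 nodes per word, so the character data itself is shared well; each node, however, carries its own map of children.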

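Clearly this Trie implementation is not good enough: ConcurrentHashMap takes a lot of memory. How can it be improved? Would replacing ConcurrentHashMap with HashMap help? After all, HashSet is also built on HashMap. The effect of replacing ConcurrentHashMap with HashMap is shown in the figure below: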

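Memory use drops by about 10 MB, but is still about 2.4 times that of HashSet. The Trie was meant to save memory, yet it ends up using more. Since a HashMap-based Trie is this memory-hungry, let's try arrays instead, as in the following code: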

import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/**
 * Java implementation of a prefix tree
 * Used to check whether a given string is in the dictionary
 * @author 楊尚川
 */
public class TrieV2 {
    private final TrieNode ROOT_NODE = new TrieNode('/');

    public boolean contains(String item){
        //Strip leading and trailing whitespace
        item=item.trim();
        int len = item.length();
        if(len < 1){
            return false;
        }
        //Start the search from the root node
        TrieNode node = ROOT_NODE;
        for(int i=0;i<len;i++){
            char character = item.charAt(i);
            TrieNode child = node.getChild(character);
            if(child == null){
                //No matching node found
                return false;
            }else{
                //Node found, keep descending
                node = child;
            }
        }
        //The item is a word only if the last node is marked as terminal
        return node.isTerminal();
    }
    public void addAll(List<String> items){
        for(String item : items){
            add(item);
        }
    }
    public void add(String item){
        //Strip leading and trailing whitespace
        item=item.trim();
        int len = item.length();
        if(len < 1){
            //Ignore empty items
            return;
        }
        //Start adding from the root node
        TrieNode node = ROOT_NODE;
        for(int i=0;i<len;i++){
            char character = item.charAt(i);
            TrieNode child = node.getChildIfNotExistThenCreate(character);
            //Descend to the child node
            node = child;
        }
        //Mark the terminal character: the path from the root to this node is a valid word
        node.setTerminal(true);
    }
    private static class TrieNode{
        private char character;
        private boolean terminal;
        private TrieNode[] children = new TrieNode[0];
        public TrieNode(char character){
            this.character = character;
        }
        public boolean isTerminal() {
            return terminal;
        }
        public void setTerminal(boolean terminal) {
            this.terminal = terminal;
        }        
        public char getCharacter() {
            return character;
        }
        public void setCharacter(char character) {
            this.character = character;
        }
        public Collection<TrieNode> getChildren() {
            return Arrays.asList(children);            
        }
        public TrieNode getChild(char character) {
            for(TrieNode child : children){
                if(child.getCharacter() == character){
                    return child;
                }
            }
            return null;
        }        
        public TrieNode getChildIfNotExistThenCreate(char character) {
            TrieNode child = getChild(character);
            if(child == null){
                child = new TrieNode(character);
                addChild(child);
            }
            return child;
        }
        public void addChild(TrieNode child) {
            children = Arrays.copyOf(children, children.length+1);
            this.children[children.length-1]=child;
        }
    }
    
    public void show(){
        show(ROOT_NODE,"");
    }
    private void show(TrieNode node, String indent){
        if(node.isTerminal()){
            System.out.println(indent+node.getCharacter()+"(T)");
        }else{
            System.out.println(indent+node.getCharacter());
        }        
        for(TrieNode item : node.getChildren()){
            show(item,indent+"\t");
        }
    }
    public static void main(String[] args){
        TrieV2 trie = new TrieV2();
        trie.add("APDPlat");
        trie.add("APP");
        trie.add("APD");
        trie.add("楊尚川");
        trie.add("楊尚昆");
        trie.add("楊尚喜");
        trie.add("中華人民共和國");
        trie.add("中華人民打太極");
        trie.add("中華");
        trie.add("中心思想");
        trie.add("楊家將");        
        trie.show();
    }
}

Memory use is shown in the figure below:

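Memory use is now only 80% of the HashSet approach, so the memory problem is finally solved. Going further, if the dictionary is large enough and enough of its words share prefixes, the memory saved should be considerable. What about speed? Here are the re-run measurements: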

# Each test was run 10 times and the results averaged
LinkedList     10000 queries       cost time:48812 ms
ArrayList      10000 queries       cost time:40219 ms
HashSet        10000 queries       cost time:8 ms
HashSet        1000000 queries     cost time:258 ms
HashSet        100000000 queries   cost time:28575 ms
Trie           10000 queries       cost time:15 ms
Trie           1000000 queries     cost time:1024 ms
Trie           100000000 queries   cost time:104635 ms
TrieV1         10000 queries       cost time:16 ms
TrieV1         1000000 queries     cost time:780 ms
TrieV1         100000000 queries   cost time:90949 ms
TrieV2         10000 queries       cost time:50 ms
TrieV2         1000000 queries     cost time:4361 ms
TrieV2         100000000 queries   cost time:483398 ms

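To summarize: ArrayList and LinkedList are far too slow, about three orders of magnitude slower than the fastest option, HashSet, so they are ruled out. Trie is about half an order of magnitude slower than HashSet and uses about 2.6 times the memory. TrieV1, the HashMap-based variant, saves a little memory over Trie (about 10%) at similar speed. TrieV2 uses far less memory than Trie, only 80% of HashSet, but in the table above it is about 17 times slower than HashSet.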

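TrieV2 meets the memory goal, using about 70% less memory than Trie, but it is also slower: on the 100000000-query run it is roughly 5 times slower than Trie. TrieV2 can be optimized further by keeping each TrieNode's children array sorted and using binary search to speed up lookups.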

Here is the TrieV3 implementation:

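It uses a new method, insert, to add array elements. The sorted array is built up from nothing by inserting each new element into the existing sorted array. The insert code is as follows: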

        /**
         * Inserts an element into a sorted array
         * @param array sorted array
         * @param element element to insert
         * @return new sorted array
         */
        private TrieNode[] insert(TrieNode[] array, TrieNode element){
            int length = array.length;
            if(length == 0){
                array = new TrieNode[1];
                array[0] = element;
                return array;
            }
            TrieNode[] newArray = new TrieNode[length+1];
            boolean insert=false;
            for(int i=0; i<length; i++){
                if(element.getCharacter() <= array[i].getCharacter()){
                    //Found the insertion position for the new element
                    newArray[i]=element;
                    //Copy the remaining elements of array into newArray, then stop comparing
                    System.arraycopy(array, i, newArray, i+1, length-i);
                    insert=true;
                    break;
                }else{
                    newArray[i]=array[i];
                }
            }
            if(!insert){
                //Append the new element at the end
                newArray[length]=element;
            }
            return newArray;
        }

With a sorted array in place, the lookup method getChild can be refactored to exploit the ordering.

The array elements are TrieNode objects, so TrieNode needs its own comparison method.
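As a stand-in for those two steps, here is a minimal sketch of a trie whose children arrays are kept sorted and searched by binary search. It is written under my own assumptions and is not the article's TrieV3: the class SortedTrieSketch and the helper indexOf are hypothetical, and characters are compared directly inside a hand-written binary search rather than through a comparison method on the node.

```java
public class SortedTrieSketch {
    private final Node root = new Node('/');

    private static final class Node {
        final char character;
        boolean terminal;
        // Children sorted by character in ascending order.
        Node[] children = new Node[0];

        Node(char character) {
            this.character = character;
        }

        // Binary search: returns the index of the child holding c,
        // or -(insertionPoint + 1) if there is no such child.
        int indexOf(char c) {
            int low = 0;
            int high = children.length - 1;
            while (low <= high) {
                int mid = (low + high) >>> 1;
                char midChar = children[mid].character;
                if (midChar < c) {
                    low = mid + 1;
                } else if (midChar > c) {
                    high = mid - 1;
                } else {
                    return mid;
                }
            }
            return -(low + 1);
        }

        Node getChild(char c) {
            int index = indexOf(c);
            return index >= 0 ? children[index] : null;
        }

        // Returns the child for c, inserting it at its sorted position if absent.
        Node getChildIfNotExistThenCreate(char c) {
            int index = indexOf(c);
            if (index >= 0) {
                return children[index];
            }
            int at = -(index + 1);
            Node[] grown = new Node[children.length + 1];
            System.arraycopy(children, 0, grown, 0, at);
            grown[at] = new Node(c);
            System.arraycopy(children, at, grown, at + 1, children.length - at);
            children = grown;
            return grown[at];
        }
    }

    public void add(String item) {
        item = item.trim();
        if (item.isEmpty()) {
            return;
        }
        Node node = root;
        for (int i = 0; i < item.length(); i++) {
            node = node.getChildIfNotExistThenCreate(item.charAt(i));
        }
        node.terminal = true;
    }

    public boolean contains(String item) {
        item = item.trim();
        if (item.isEmpty()) {
            return false;
        }
        Node node = root;
        for (int i = 0; i < item.length(); i++) {
            node = node.getChild(item.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        SortedTrieSketch trie = new SortedTrieSketch();
        trie.add("中華");
        trie.add("中心思想");
        System.out.println(trie.contains("中華")); // prints true
        System.out.println(trie.contains("中"));   // prints false: a prefix, not a word
    }
}
```

Reusing the failed search's insertion point means each new child costs one binary search plus one array copy, and lookups on a node with n children take O(log n) comparisons instead of O(n).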

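With that, the refactoring to binary search over a sorted array is complete. Good unit tests are the safety net for refactoring; refactoring without them is as dangerous as walking a high wire without a net. At the same time, remember the principle of not optimizing prematurely, and weigh an algorithm's time-space trade-offs against the concrete use case.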

Now for TrieV3's performance. Memory use, of course, is unchanged from TrieV2:

TrieV2         10000 queries       cost time:50 ms
TrieV2         1000000 queries     cost time:4361 ms
TrieV2         100000000 queries   cost time:483398 ms
TrieV3         10000 queries       cost time:21 ms
TrieV3         1000000 queries     cost time:1264 ms
TrieV3         100000000 queries   cost time:121740 ms

The improvement is clear, about 4 times. Is there still room for more?
