Introduction to Natural Language Processing (He Han): Reading Notes, Chapter 2, Dictionary-Based Word Segmentation

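Chinese word segmentation is the process of splitting a piece of text into a sequence of words that, concatenated in order, reproduce the original text. Chinese word segmentation algorithms fall broadly into two camps: dictionary- and rule-based, and machine-learning-based. This chapter starts from simple rules and introduces some efficient dictionary matching algorithms.

Dictionary-based segmentation is the simplest and most common segmentation algorithm. It needs only a dictionary and a set of lookup rules, which makes it suitable for beginners. Given a dictionary, dictionary-based segmentation is a deterministic rule system of lookup and output. Its focus is not segmentation itself but the data structure that backs the dictionary.

This chapter first introduces the definition and properties of words, and then presents a dictionary.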

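2.1 What Is a Word

2.1.1 Definition of a Word

In dictionary-based Chinese word segmentation, the definition of a word is far more pragmatic: a string that is in the dictionary is a word. By this definition, any string outside the dictionary is not a word. This corollary may not match the reader's expectations, but it is an inherent weakness of dictionary-based segmentation. In fact, the vocabulary of a language is unbounded and cannot be fully recorded by any dictionary.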

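2.1.2 A Property of Words: Zipf's Law

Zipf's law: the frequency of a word is inversely proportional to its frequency rank. In other words, although many out-of-vocabulary words exist, their frequencies are small, approaching 0, and they are rarely encountered in practice. At least for segmenting common words, dictionary-based segmentation can be tried with confidence.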
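To make the inverse relation concrete, here is a small sketch. The frequency table is made up for illustration (it is not taken from the book or from any corpus) and is constructed to follow Zipf's law exactly, so that frequency times rank stays constant. Real corpora only follow the law approximately.

```python
# Hypothetical frequencies, constructed so that freq = 1200 / rank exactly.
freqs = {"the": 1200, "of": 600, "and": 400, "to": 300, "a": 240}

# Rank words by descending frequency (rank 1 = most frequent).
ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)

# Under Zipf's law, frequency * rank is (roughly) constant.
products = [freq * rank for rank, (word, freq) in enumerate(ranked, start=1)]
print(products)
```

With this idealized table every product equals 1200, which is what "inversely proportional" means here.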

The first step in implementing dictionary-based segmentation is, of course, to prepare a dictionary.

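2.2 Dictionaries

Many public Chinese lexicons are available on the Internet,

for example the Internet lexicon released by Sogou Labs (SogouW, with 150,000 entries): https://www.sogou.com/labs/resource/w.php,

the Tsinghua University Open Chinese Lexicon (THUOCL): http://thunlp.org

and the giant Chinese lexicon released by He Han (tens of millions of entries): http://www.hankcs.com/nlp/corpus/tens-of-millions-of-giant-chinese-word-library-share.html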

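2.2.1 The HanLP Dictionary

The file CoreNatureDictionary.mini.txt

The first column is the word itself. After it, every two columns give a part of speech and the corresponding frequency. For example, the word meaning "hope" occurs 386 times as a verb and 96 times as a noun.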
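The column layout described above can be sketched as a tiny parser. This is an illustration only: `parse_entry` is a hypothetical helper, not part of HanLP, and the sample line (including the tag letters `v` and `n` and whitespace as the separator) is an assumption reconstructed from the description.

```python
def parse_entry(line):
    """Split one dictionary line into (word, {pos_tag: frequency})."""
    fields = line.split()            # assumption: columns are whitespace-separated
    word, rest = fields[0], fields[1:]
    # The remaining columns come in (tag, frequency) pairs.
    tags = {rest[k]: int(rest[k + 1]) for k in range(0, len(rest), 2)}
    return word, tags

# Sample line written for this sketch, modelled on the description above.
word, tags = parse_entry("希望 v 386 n 96")
print(word, tags)
```

Summing `tags.values()` would give the total frequency of the word across all parts of speech.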

2.2.2 Loading the Dictionary

With HanLP, reading the file CoreNatureDictionary.mini.txt takes a single line of code

TreeMap<String, CoreDictionary.Attribute> dictionary = IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");

This yields a TreeMap whose keys are the words themselves and whose values are CoreDictionary.Attribute objects.

Check the size of this dictionary and the first word in lexicographic order:

System.out.printf("Dictionary size: %d entries\n", dictionary.size());
System.out.println(dictionary.keySet().iterator().next());

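2.3 Segmentation Algorithms

2.3.1 Full Segmentation

Full segmentation means finding every word in a piece of text. The naive full segmentation algorithm is very simple: iterate over every contiguous substring of the text and check whether it is in the dictionary. With the dictionary as dic, the text as text, and the current position as i, the Python algorithm for full segmentation is as follows: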

def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):                  # i runs from 0 to the index of the last character of text
        for j in range(i + 1, len(text) + 1):   # j runs over [i + 1, len(text)]
            word = text[i:j]                    # substring for the half-open interval [i, j)
            if word in dic:                     # if it is in the dictionary, treat it as a word
                word_list.append(word)
    return word_list

See tests/book/ch02/fully_segment.py for the code.

Main function

if __name__ == '__main__':
    dic = load_dictionary()
    print(fully_segment('商品和服務', dic))

Output:

Java code

    /**
     * Full-segmentation Chinese word segmentation algorithm
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentFully(String text, Map<String, CoreDictionary.Attribute> dictionary) {
        List<String> wordList = new LinkedList<String>();
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                String word = text.substring(i, j);
                if (dictionary.containsKey(word)) {
                    wordList.add(word);
                }
            }
        }
        return wordList;
    }

// Full segmentation
System.out.println(segmentFully("就讀北京大學", dictionary));

Result:

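2.3.2 Forward Longest Matching

The output of full segmentation is not very meaningful. What we want is a meaningful sequence of words, not a list of every dictionary word that occurs in the text. The rule therefore needs refining. Since longer words carry richer meaning, we give longer words higher priority. Concretely, while looking up words of increasing length from a given start index, the longest word found is the one that is output. This rule is called the longest matching algorithm. If the scan runs from front to back it is called forward longest matching, and in the opposite direction it is called backward longest matching.

Python code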

def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]                      # single character at the current scan position
        for j in range(i + 1, len(text) + 1):       # all possible end positions
            word = text[i:j]                        # contiguous string from the current position up to j
            if word in dic:                         # it is in the dictionary
                if len(word) > len(longest_word):   # and it is longer
                    longest_word = word             # so it takes priority
        word_list.append(longest_word)              # output the longest word
        i += len(longest_word)                      # scan forward
    return word_list

Usage

if __name__ == '__main__':
    dic = load_dictionary()
    print(forward_segment('就讀北京大學', dic))
    print(forward_segment('研究生命起源', dic))

Java code

    /**
     * Forward longest matching Chinese word segmentation algorithm
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentForwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary) {
        List<String> wordList = new LinkedList<String>();
        for (int i = 0; i < text.length(); ) {
            String longestWord = text.substring(i, i + 1);
            for (int j = i + 1; j <= text.length(); ++j) {
                String word = text.substring(i, j);
                if (dictionary.containsKey(word)) {
                    if (word.length() > longestWord.length()) {
                        longestWord = word;
                    }
                }
            }
            wordList.add(longestWord);
            i += longestWord.length();
        }
        return wordList;
    }

2.3.3 Backward Longest Matching

Python code

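def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:                                   # the scan position is the end of the candidate word
        longest_word = text[i]                      # single character at the scan position
        for j in range(0, i):                       # j runs over [0, i) as candidate start positions
            word = text[j: i + 1]                   # characters at positions j..i form the candidate word
            if word in dic:
                if len(word) > len(longest_word):   # longer words have higher priority
                    longest_word = word
        word_list.insert(0, longest_word)           # scanning backward, so words found earlier sit later in the text
        i -= len(longest_word)
    return word_list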

Java code

    /**
     * Backward longest matching Chinese word segmentation algorithm
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentBackwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary) {
        List<String> wordList = new LinkedList<String>();
        for (int i = text.length() - 1; i >= 0; ) {
            String longestWord = text.substring(i, i + 1);
            for (int j = 0; j <= i; ++j) {
                String word = text.substring(j, i + 1);
                if (dictionary.containsKey(word)) {
                    if (word.length() > longestWord.length()) {
                        longestWord = word;
                    }
                }
            }
            wordList.add(0, longestWord);
            i -= longestWord.length();
        }
        return wordList;
    }

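The results still show problems, so some have proposed combining the two rules in the hope that each covers the other's weaknesses. This is called bidirectional longest matching.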
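A toy run makes the disagreement between the two directions visible. The sketch below uses compact re-implementations of the forward and backward rules above and a hand-made five-word dictionary. That dictionary is an assumption chosen for illustration, not the HanLP dictionary, so results with the real dictionary may differ.

```python
def forward_longest(text, dic):
    words, i = [], 0
    while i < len(text):
        longest = text[i]                       # fall back to a single character
        for j in range(i + 2, len(text) + 1):   # candidates of length >= 2
            if text[i:j] in dic:
                longest = text[i:j]             # j grows, so later hits are longer
        words.append(longest)
        i += len(longest)
    return words

def backward_longest(text, dic):
    words, i = [], len(text) - 1
    while i >= 0:
        longest = text[i]                       # fall back to a single character
        for j in range(i - 1, -1, -1):          # move the start leftwards
            if text[j:i + 1] in dic:
                longest = text[j:i + 1]         # further left means longer
        words.insert(0, longest)
        i -= len(longest)
    return words

toy_dic = {"研究", "研究生", "生命", "命", "起源"}   # hypothetical toy dictionary
text = "研究生命起源"
forward = forward_longest(text, toy_dic)
backward = backward_longest(text, toy_dic)
print(forward)
print(backward)
```

With this toy dictionary the forward scan greedily takes 研究生 and is left with a stray 命, while the backward scan yields 研究 / 生命 / 起源. Neither direction is right on every sentence.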

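2.3.4 Bidirectional Longest Matching

Statistics show that sentences on which forward matching is wrong but backward matching is right account for 9.24%.

The bidirectional longest matching rule set works as follows:

(1) Run both forward and backward longest matching. If the two results differ in word count, return the one with fewer words.

(2) Otherwise, return the one with fewer single-character words. When the single-character counts are also equal, prefer the result of backward longest matching.

Python code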

from backward_segment import backward_segment
from forward_segment import forward_segment
from utility import load_dictionary


def count_single_char(word_list: list):  # count the words that consist of a single character
    return sum(1 for word in word_list if len(word) == 1)

def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):                                  # fewer words means higher priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single characters means higher priority
            return f
        else:
            return b                                     # on a full tie, backward matching takes priority


if __name__ == '__main__':
    dic = load_dictionary()
    print(bidirectional_segment('研究生命起源', dic))

Java version

    /**
     * Bidirectional longest matching Chinese word segmentation algorithm
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentBidirectional(String text, Map<String, CoreDictionary.Attribute> dictionary) {
        List<String> forwardLongest = segmentForwardLongest(text, dictionary);
        List<String> backwardLongest = segmentBackwardLongest(text, dictionary);
        if (forwardLongest.size() < backwardLongest.size())
            return forwardLongest;
        else if (forwardLongest.size() > backwardLongest.size())
            return backwardLongest;
        else {
            if (countSingleChar(forwardLongest) < countSingleChar(backwardLongest))
                return forwardLongest;
            else
                return backwardLongest;
        }
    }

Calling code in the main function

// Bidirectional longest matching
        String[] text = new String[]{
                "項目的研究",
                "商品和服務",
                "研究生命起源",
                "當下雨天地面積水",
                "結婚的和尚未結婚的",
                "歡迎新老師生前來就餐",
        };
        for (int i = 0; i < text.length; i++) {
            System.out.printf("| %d | %s | %s | %s | %s |\n",
                    i + 1, text[i],
                    segmentForwardLongest(text[i], dictionary),
                    segmentBackwardLongest(text[i], dictionary),
                    segmentBidirectional(text[i], dictionary));
        }

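The comparison shows that bidirectional longest matching picks the best result in cases 2, 3, and 5, but picks the wrong result on sentence 4, so its final accuracy of 3/6 is actually lower than the 4/6 of backward longest matching. This gives a sense of how fragile rule systems are. Maintaining a rule set sometimes means robbing Peter to pay Paul, and sometimes does more harm than good.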

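2.3.5 Speed Evaluation

The rules of dictionary-based segmentation are technically unsophisticated and disambiguate poorly. The core value of dictionary-based segmentation lies in speed, not accuracy.

Python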

import time


def evaluate_speed(segment, text, dic):
    start_time = time.time()
    for i in range(pressure):                   # pressure is the module-level repeat count set below
        segment(text, dic)
    elapsed_time = time.time() - start_time
    # throughput in units of 10,000 characters per second
    print('%.2f x 10,000 chars/sec' % (len(text) * pressure / 10000 / elapsed_time))


if __name__ == '__main__':
    text = "江西鄱陽湖乾枯，中國最大淡水湖變成大草原"
    pressure = 10000
    dic = load_dictionary()
    print('Because JPype call overhead is large, the speeds below are significantly slower than native Java')
    evaluate_speed(forward_segment, text, dic)
    evaluate_speed(backward_segment, text, dic)
    evaluate_speed(bidirectional_segment, text, dic)

Java

    public static void evaluateSpeed(Map<String, CoreDictionary.Attribute> dictionary) {
        String text = "江西鄱陽湖乾枯，中國最大淡水湖變成大草原";
        long start;
        double costTime;
        final int pressure = 10000;

        System.out.println("Forward longest");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i) {
            segmentForwardLongest(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f x 10,000 chars/sec\n", text.length() * pressure / 10000 / costTime);

        System.out.println("Backward longest");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i) {
            segmentBackwardLongest(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f x 10,000 chars/sec\n", text.length() * pressure / 10000 / costTime);

        System.out.println("Bidirectional longest");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i) {
            segmentBidirectional(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f x 10,000 chars/sec\n", text.length() * pressure / 10000 / costTime);
    }

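Summary:

1. Python runs slower than Java, at less than half of Java's throughput.

2. Forward matching and backward matching run at about the same speed, roughly twice that of bidirectional matching, because bidirectional matching does twice the work.

3. In the Java implementation, forward matching is faster than backward matching.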

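2.4 The Trie

2.4.1 What Is a Trie

Sets of strings are commonly stored in a trie, a tree data structure over strings. Each edge of a trie corresponds to one character, and the paths from the root downward spell out strings. A trie does not store strings directly on its nodes. Instead, it treats a word as a path from the root to some node and marks that end node as "this node is the end of a word". A string is simply a path: to look up a word, follow that path down from the root. If the walk ends on a specially marked node, the string is in the set; otherwise it is not.

In the accompanying figure, blue marks a node as the end of a word, and the numbers are arbitrary labels. The dictionary stored in this tree is as follows: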

入門:   0--1--2

天然:   0--3--4

天然人: 0--3--4--5
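The listing above can be reproduced with a minimal sketch that stores the trie as a list of per-node child maps and numbers the nodes in creation order. The function names `build_trie` and `path` are made up for this illustration, and the numbering only matches the listing because the words are inserted in the same order.

```python
def build_trie(words):
    """Return (children, ends): children[node] maps a character to a child id."""
    children = [{}]          # node 0 is the root
    ends = set()             # ids of nodes that mark the end of a word
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})                      # create a new node
                children[node][ch] = len(children) - 1   # its id is the next free number
            node = children[node][ch]
        ends.add(node)
    return children, ends

def path(children, word):
    """List the node ids visited while spelling out word from the root."""
    node, ids = 0, [0]
    for ch in word:
        node = children[node][ch]
        ids.append(node)
    return ids

words = ["入門", "天然", "天然人"]
children, ends = build_trie(words)
for w in words:
    print(w, "--".join(map(str, path(children, w))))
```

Note that 天然 and 天然人 share the nodes 0, 3, and 4: a common prefix is stored only once, and `ends` records that both node 4 and node 5 terminate a word.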

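2.4.2 Implementing the Trie Node

By convention, a value of None means that the node does not correspond to a word. This means a key whose value is None cannot be inserted, but it makes the implementation simpler.

The node is described in Python as follows: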

class Node(object):
    def __init__(self, value) -> None:
        self._children = {}
        self._value = value

    def _add_child(self, char, value, overwrite=False):
        child = self._children.get(char)
        if child is None:
            child = Node(value)
            self._children[char] = child
        elif overwrite:
            child._value = value
        return child

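2.4.3 Implementing Insert, Delete, Update, and Lookup on the Trie

"Delete, update, and lookup" are really the same thing: they are all lookups. Deleting just sets the value of the end node to None, and updating just sets it to a different value.

From the viewpoint of a deterministic finite automaton, every node is a state, and the state represents the prefix that has been matched so far.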

State    Prefix

0        (empty)

1        入

2        入門

...

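Moving from a parent node to a child node can be viewed as a state transition.

"Inserting a key-value pair" is still a lookup, except that when a state transition fails, the corresponding child node is created so that the transition succeeds.

The complete trie implementation is as follows: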

class Trie(Node):
    def __init__(self) -> None:
        super().__init__(None)

    def __contains__(self, key):
        return self[key] is not None

    def __getitem__(self, key):
        state = self
        for char in key:
            state = state._children.get(char)    # state transition
            if state is None:                    # transition failed: key is absent
                return None
        return state._value

    def __setitem__(self, key, value):
        state = self
        for i, char in enumerate(key):
            if i < len(key) - 1:
                state = state._add_child(char, None, False)   # inner node: keep any existing value
            else:
                state = state._add_child(char, value, True)   # end node: overwrite the value

Some tests:

if __name__ == '__main__':
    trie = Trie()
    # insert
    trie['天然'] = 'nature'
    trie['天然人'] = 'human'
    trie['天然語言'] = 'language'
    trie['自語'] = 'talk to oneself'
    trie['入門'] = 'introduction'
    assert '天然' in trie
    # delete
    trie['天然'] = None
    assert '天然' not in trie
    # update
    trie['天然語言'] = 'human language'
    assert trie['天然語言'] == 'human language'
    # lookup
    assert trie['入門'] == 'introduction'

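2.4.4 A Trie with First-Character Hashing and Binary Search for the Rest

Readers may have heard of hash functions, which convert an object into an integer. The basic requirement a hash function must satisfy is that equal objects have equal hash values. If the hash function is poorly designed, both the memory efficiency and the lookup efficiency of the hash table suffer. Python has no char type; a character is treated as a string of length 1, so the function actually called is the hash function of str. On a 64-bit system, the hash function of str returns a 64-bit integer. Yet Unicode has only 136,690 characters in total, far fewer than 2^64. As a result, two characters that are adjacent in the character set can have hash values that are worlds apart.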
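The contrast between code points and str hashes can be observed directly. This is only a sketch: the two characters are arbitrary neighbours picked for illustration, and because CPython randomizes str hashes per process by default, the printed hash values differ from run to run.

```python
a, b = "一", "丁"            # U+4E00 and U+4E01, neighbours in Unicode
gap = ord(b) - ord(a)        # distance between the two code points
print(ord(a), ord(b), gap)   # the code points differ by exactly 1
print(hash(a), hash(b))      # str hashes: typically far apart, vary between runs
```

This is why the code point itself, rather than the str hash, is the natural array index for a character.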

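The character hash function in Java is friendlier. Java encodes characters as UTF-16, so every char maps to a distinct 16-bit integer, which is exactly a perfect hash. This perfect hash function outputs non-negative integers in the range [0, 65535], which are well suited to indexing child nodes. Concretely, create an array of length 65536 and place each child node at the index given by the integer value of its character. Each state transition then only needs to access the corresponding index, which is extremely fast in any programming language. However, not every node can enjoy this treatment: if the longest word in the dictionary has length l, then in the worst case the total array capacity at level l of the trie is O(65536^l). Memory grows exponentially, which is impractical. A workaround is to apply the hashing strategy at the root node only.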
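The root-only strategy can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's or HanLP's implementation: the class names `BinNode` and `FirstCharHashTrie` are made up, the binary-search handling of deeper levels is inferred from the section title, and keys are assumed to be non-empty with a first character below U+10000.

```python
import bisect

class BinNode:
    """Inner node: children kept sorted by character, found by binary search."""
    def __init__(self):
        self.chars = []        # sorted list of child characters
        self.nodes = []        # child nodes, parallel to self.chars
        self.value = None      # None means "no word ends here"

    def child(self, ch, create=False):
        k = bisect.bisect_left(self.chars, ch)          # binary search for ch
        if k < len(self.chars) and self.chars[k] == ch:
            return self.nodes[k]
        if not create:
            return None
        self.chars.insert(k, ch)                        # keep both lists sorted
        self.nodes.insert(k, BinNode())
        return self.nodes[k]

class FirstCharHashTrie:
    """Root: direct array lookup by code point. Deeper levels: binary search."""
    def __init__(self):
        self.root = [None] * 65536      # one slot per 16-bit value

    def __setitem__(self, key, value):
        slot = ord(key[0])              # assumption: non-empty key, first char < 65536
        if self.root[slot] is None:
            self.root[slot] = BinNode()
        node = self.root[slot]
        for ch in key[1:]:
            node = node.child(ch, create=True)
        node.value = value

    def __getitem__(self, key):
        node = self.root[ord(key[0])] if key else None
        for ch in key[1:]:
            if node is None:
                return None
            node = node.child(ch)
        return None if node is None else node.value

trie = FirstCharHashTrie()
trie["天然"] = "nature"
trie["天然人"] = "human"
trie["入門"] = "introduction"
print(trie["天然人"], trie["天"], trie["天然語"])
```

Only the root pays for a 65536-slot array. Every other node stores just the children it actually has, at the cost of a logarithmic search per transition instead of a constant-time index.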
