How jieba Word Segmentation Works

Date: 2019-12-21


Introduction

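jieba is a popular Chinese word segmentation library whose source code is hosted on GitHub. This post walks through that source to see how the segmentation works.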

How it works

def cut(self, sentence, cut_all=False, HMM=True):
        '''
        The main function that segments an entire sentence that contains
        Chinese characters into separated words.

        Parameter:
            - sentence: The str(unicode) to be segmented.
            - cut_all: Model type. True for full pattern, False for accurate pattern.
            - HMM: Whether to use the Hidden Markov Model.
        '''

jieba offers three segmentation modes. Inside cut, the mode is selected by the following conditions:

if cut_all:
    cut_block = self.__cut_all
elif HMM:
    cut_block = self.__cut_DAG
else:
    cut_block = self.__cut_DAG_NO_HMM

First, here is what each of the three modes produces:

__cut_all (full mode, cut_all=True):
1. Input: 我來到北京清華大學　Output: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學
2. Input: 他來到了網易杭研大廈　Output: 他/ 來到/ 了/ 網易/ 杭/ 研/ 大廈
__cut_DAG (accurate mode, HMM=True):
1. Input: 我來到北京清華大學　Output: 我/ 來到/ 北京/ 清華大學
2. Input: 他來到了網易杭研大廈　Output: 他/ 來到/ 了/ 網易/ 杭研/ 大廈
__cut_DAG_NO_HMM (accurate mode, HMM=False):
1. Input: 我來到北京清華大學　Output: 我/ 來到/ 北京/ 清華大學
2. Input: 他來到了網易杭研大廈　Output: 他/ 來到/ 了/ 網易/ 杭/ 研/ 大廈

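Now let's analyze the three modes.
They have one thing in common: the first step is always to build a DAG (directed acyclic graph) over the sentence.
The source is as follows: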

def get_DAG(self, sentence):
        self.check_initialized()
        DAG = {}
        N = len(sentence)
        for k in xrange(N):
            tmplist = []
            i = k
            frag = sentence[k]
            while i < N and frag in self.FREQ:
                if self.FREQ[frag]:
                    tmplist.append(i)
                i += 1
                frag = sentence[k:i + 1]
            if not tmplist:
                tmplist.append(k)
            DAG[k] = tmplist
        return DAG

If sentence is '我來到北京清華大學', the resulting DAG is:

{0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4], 5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}

Intuitively, DAG[5] = [5, 6, 8] means that a word starting at '清' (index 5) can end at index 5, 6, or 8, giving the candidate words '清', '清華', and '清華大學'.
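
The DAG above can be reproduced with a small standalone sketch. It mirrors the logic of get_DAG but is not jieba's code: the function name get_dag, the dictionary TOY_FREQ, and all counts in it are invented for illustration. A count of 0 marks an entry that is only a prefix of a longer word, a convention the text explains further below.

```python
def get_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i = k
        frag = sentence[k]
        # Keep extending while the fragment is a known word or a known prefix.
        while i < n and frag in freq:
            if freq[frag]:       # nonzero count: a real word ends at index i
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not ends:             # unknown character: treat it as a one-character word
            ends.append(k)
        dag[k] = ends
    return dag


# Hypothetical toy dictionary; the counts are made up.
TOY_FREQ = {
    "我": 10, "來": 5, "來到": 8, "到": 6, "北": 3, "北京": 9, "京": 2,
    "清": 2, "清華": 4, "清華大": 0, "清華大學": 7,
    "華": 2, "華大": 1, "大": 6, "大學": 8, "學": 4,
}

dag = get_dag("我來到北京清華大學", TOY_FREQ)
print(dag)
```

Note that without the zero-count entry '清華大', the inner loop would stop at index 7 and never discover '清華大學'.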
The most important piece of get_DAG is self.FREQ. Where does it come from?

self.FREQ is built from the dict.txt file in the jieba directory, as described below.
dict.txt has 349046 lines, each in the following format:

一 217830 m
一一 1670 m
一一二 11 m
一一例 3 m
一一分 8 m
一一列舉 34 i

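The first field is the word, the second is how often the word occurs, and the third is its part-of-speech tag.
Take the line for '一一列舉' as an example. The loader first sets self.FREQ['一一列舉'] = 34, then checks whether each prefix '一', '一一', '一一列', '一一列舉' is already stored in self.FREQ. A prefix that is already stored is skipped; one that is not is stored with a count of 0, for example self.FREQ['一一列'] = 0. In the excerpt above, '一' and '一一' were already read with their own counts, so they are left unchanged.
So self.FREQ stores not only real words and their counts but also every prefix of every word. Prefixes that are not words themselves get a count of 0, which distinguishes them from real words.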
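
The loading rule just described can be sketched in a few lines. This is an illustrative reconstruction from the description above, not jieba's loader: the function name build_prefix_dict and the three-line SAMPLE input are assumptions, with SAMPLE taken from the dict.txt excerpt shown earlier.

```python
def build_prefix_dict(lines):
    """Build {word_or_prefix: count} and the total count from 'word freq tag' lines."""
    freq = {}
    total = 0
    for line in lines:
        word, count = line.split(" ")[:2]
        count = int(count)
        freq[word] = count
        total += count
        # Register every prefix; a prefix that is not already present gets count 0.
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            if prefix not in freq:
                freq[prefix] = 0
    return freq, total


SAMPLE = ["一 217830 m", "一一 1670 m", "一一列舉 34 i"]
freq, total = build_prefix_dict(SAMPLE)
print(freq)   # '一一列' appears with count 0; the real words keep their counts
print(total)
```

Because a word's own line always assigns its count directly, a prefix that was stored as 0 earlier is overwritten correctly if the same string later appears as a real word.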

That covers the DAG. Next, let's look at the three modes one by one:

__cut_all

The source is as follows:

def __cut_all(self, sentence):
        dag = self.get_DAG(sentence)
        old_j = -1
        for k, L in iteritems(dag):
            if len(L) == 1 and k > old_j:
                yield sentence[k:L[0] + 1]
                old_j = L[0]
            else:
                for j in L:
                    if j > k:
                        yield sentence[k:j + 1]
                        old_j = j

We won't go through the traversal in detail here; it can be followed directly from the source.
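
For readers who want to trace the traversal, here is a standalone sketch that runs the same loop over the example DAG from earlier. It assumes start indices are visited in ascending order (made explicit with sorted), whereas the source iterates with iteritems(dag); the function name cut_all is illustrative.

```python
def cut_all(sentence, dag):
    """Yield every multi-character dictionary word, plus single characters not already covered."""
    old_j = -1  # end index of the most recently emitted word
    for k in sorted(dag):
        ends = dag[k]
        if len(ends) == 1 and k > old_j:
            # Only one candidate and not covered by an earlier word: emit it.
            yield sentence[k:ends[0] + 1]
            old_j = ends[0]
        else:
            # Emit every candidate longer than one character.
            for j in ends:
                if j > k:
                    yield sentence[k:j + 1]
                    old_j = j


SENTENCE = "我來到北京清華大學"
DAG = {0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4],
       5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}

print("/ ".join(cut_all(SENTENCE, DAG)))
```

The old_j check is what suppresses single characters such as '到' or '京' that are already covered by a longer word emitted just before them.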

__cut_DAG

def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        ......

First, let's look at the self.calc method:

def calc(self, sentence, DAG, route):
        N = len(sentence)
        route[N] = (0, 0)
        logtotal = log(self.total)
        for idx in xrange(N - 1, -1, -1):
            route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                              logtotal + route[x + 1][0], x) for x in DAG[idx])

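A trick is used here: because log(a) + log(b) = log(ab), the product of word probabilities becomes a sum of log probabilities. This avoids multiplying many small numbers together, and with it the risk of floating-point underflow.
calc is a dynamic-programming search in the same spirit as the Viterbi algorithm: working backwards from the end of the sentence, it records for each position the best score and the word end that achieves it, which gives the maximum-probability path through the DAG. Readers unfamiliar with the Viterbi algorithm may want to look it up.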
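
To make the backward pass concrete, here is a minimal standalone sketch of the same computation. The toy dictionary TOY_FREQ, its counts, and the helper best_path are assumptions for illustration; the DAG is the one shown earlier for '我來到北京清華大學'.

```python
from math import log


def calc(sentence, dag, freq, total):
    """Return route[idx] = (best log-probability from idx to the end, end index of the first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = log(total)
    for idx in range(n - 1, -1, -1):
        # Score of a candidate word = log(count / total) + best score of the remainder.
        route[idx] = max(
            (log(freq.get(sentence[idx:x + 1]) or 1) - log_total + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route


def best_path(sentence, route):
    """Follow the recorded end indices from position 0 to read off the words."""
    words, x = [], 0
    while x < len(sentence):
        y = route[x][1] + 1
        words.append(sentence[x:y])
        x = y
    return words


# Hypothetical counts, invented for this example.
TOY_FREQ = {
    "我": 10, "來": 5, "來到": 8, "到": 6, "北": 3, "北京": 9, "京": 2,
    "清": 2, "清華": 4, "清華大學": 7, "華": 2, "華大": 1, "大": 6, "大學": 8, "學": 4,
}
SENTENCE = "我來到北京清華大學"
DAG = {0: [0], 1: [1, 2], 2: [2], 3: [3, 4], 4: [4],
       5: [5, 6, 8], 6: [6, 7], 7: [7, 8], 8: [8]}

route = calc(SENTENCE, DAG, TOY_FREQ, sum(TOY_FREQ.values()))
print("/ ".join(best_path(SENTENCE, route)))
```

Each word on a path contributes one `- log_total` term, so a path made of fewer, longer words tends to score higher unless the short words are much more frequent.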

Here is the full source of __cut_DAG:

def __cut_DAG(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        buf = ''
        N = len(sentence)
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if y - x == 1:
                buf += l_word
            else:
                if buf:
                    if len(buf) == 1:
                        yield buf
                        buf = ''
                    else:
                        if not self.FREQ.get(buf):
                            recognized = finalseg.cut(buf)
                            for t in recognized:
                                yield t
                        else:
                            for elem in buf:
                                yield elem
                        buf = ''
                yield l_word
            x = y

        if buf:
            if len(buf) == 1:
                yield buf
            elif not self.FREQ.get(buf):
                recognized = finalseg.cut(buf)
                for t in recognized:
                    yield t
            else:
                for elem in buf:
                    yield elem

Pay particular attention to this part:

if not self.FREQ.get(buf):
    recognized = finalseg.cut(buf)
    for t in recognized:
        yield t

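When is finalseg.cut(buf) called? Only when buf, a run of consecutive single characters longer than one character, is not itself a word in dict.txt.
finalseg.cut uses an HMM to tag these unrecognized characters. The relevant files in the project are described below.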

prob_start.py stores the HMM's initial state probabilities. All numbers in these files are log probabilities:

P={'B': -0.26268660809250016,
 'E': -3.14e+100,
 'M': -3.14e+100,
 'S': -1.4652633398537678}

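B stands for begin, E for end, M for middle, and S for single. At the start of a sequence the HMM state can therefore only be S or B; E and M are set to -3.14e+100, which stands in for negative infinity.
prob_trans.py stores the state transition matrix: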

P={'B': {'E': -0.510825623765990, 'M': -0.916290731874155},
 'E': {'B': -0.5897149736854513, 'S': -0.8085250474669937},
 'M': {'E': -0.33344856811948514, 'M': -1.2603623820268226},
 'S': {'B': -0.7211965654669841, 'S': -0.6658631448798212}}
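
As a quick sanity check that these values are natural-log probabilities, one can exponentiate each row and confirm that it sums to roughly 1. The sketch below copies the numbers from the two tables above; the names START and TRANS are chosen here for illustration.

```python
from math import exp

# Values copied from the prob_start.py and prob_trans.py excerpts above.
START = {'B': -0.26268660809250016, 'E': -3.14e+100,
         'M': -3.14e+100, 'S': -1.4652633398537678}
TRANS = {'B': {'E': -0.510825623765990, 'M': -0.916290731874155},
         'E': {'B': -0.5897149736854513, 'S': -0.8085250474669937},
         'M': {'E': -0.33344856811948514, 'M': -1.2603623820268226},
         'S': {'B': -0.7211965654669841, 'S': -0.6658631448798212}}

# exp(-3.14e+100) underflows to 0.0, so E and M contribute nothing to the start sum.
start_sum = sum(exp(v) for v in START.values())
row_sums = {state: sum(exp(v) for v in row.values()) for state, row in TRANS.items()}

print(round(start_sum, 4))
for state, total in row_sums.items():
    print(state, round(total, 4))
```

The missing entries in each row (for example B to B, or B to S) are transitions that cannot occur: a word that has begun must continue with M or finish with E.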

prob_emit.py stores the log probability of emitting a given character in a given state, for example log p('劉' | S) = -0.916:

P={'B': {'\u4e00': -3.6544978750449433,
       '\u4e01': -8.125041941842026,
       '\u4e03': -7.817392401429855,
       '\u4e07': -6.3096425804013165,
       '\u4e08': -8.866689067453933,
       '\u4e09': -5.932085850549891,
       '\u4e0a': -5.739552583325728,
       '\u4e0b': -5.997089097239644,
       '\u4e0d': -4.274262055936421,
       '\u4e0e': -8.355569307500769,
       ......

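With these three tables, the HMM can segment text that the dictionary does not cover.
For example, '我/ 來到/ 北京/ 清華大學' corresponds to the state sequence 'SBEBEBMME'.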
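
The sketch below shows how such tables can be decoded with the Viterbi algorithm and how a B/M/E/S sequence maps back to words. It is a simplified illustration, not the finalseg implementation: the start and transition values are copied from the tables above, while TOY_EMIT, PREV, and both function names are assumptions made up for this example.

```python
MIN_FLOAT = -3.14e100

START = {'B': -0.26268660809250016, 'E': MIN_FLOAT,
         'M': MIN_FLOAT, 'S': -1.4652633398537678}
TRANS = {'B': {'E': -0.510825623765990, 'M': -0.916290731874155},
         'E': {'B': -0.5897149736854513, 'S': -0.8085250474669937},
         'M': {'E': -0.33344856811948514, 'M': -1.2603623820268226},
         'S': {'B': -0.7211965654669841, 'S': -0.6658631448798212}}
# States that may precede each state (derived from the non-missing TRANS entries).
PREV = {'B': 'ES', 'M': 'MB', 'S': 'SE', 'E': 'BM'}

# Hypothetical emission log-probabilities, invented for this example.
TOY_EMIT = {'B': {'杭': -1.0}, 'M': {}, 'E': {'研': -1.0},
            'S': {'杭': -5.0, '研': -5.0}}


def viterbi(obs, emit):
    """Return the most probable B/M/E/S state list for the characters in obs."""
    scores = {s: START[s] + emit[s].get(obs[0], MIN_FLOAT) for s in 'BMES'}
    paths = {s: [s] for s in 'BMES'}
    for ch in obs[1:]:
        new_scores, new_paths = {}, {}
        for s in 'BMES':
            em = emit[s].get(ch, MIN_FLOAT)
            # Best previous state for s, adding log-probabilities instead of multiplying.
            score, prev = max((scores[p] + TRANS[p].get(s, MIN_FLOAT) + em, p)
                              for p in PREV[s])
            new_scores[s] = score
            new_paths[s] = paths[prev] + [s]
        scores, paths = new_scores, new_paths
    # A valid sequence must finish on E or S.
    _, last = max((scores[s], s) for s in 'ES')
    return paths[last]


def tags_to_words(sentence, tags):
    """Turn a B/M/E/S tag sequence back into a list of words."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag == 'B':
            start = i
        elif tag == 'E':
            words.append(sentence[start:i + 1])
        elif tag == 'S':
            words.append(sentence[i])
    return words


print(viterbi("杭研", TOY_EMIT))
print(tags_to_words("我來到北京清華大學", "SBEBEBMME"))
```

With these toy emissions, '杭研' is tagged B, E and so comes out as one word, which is the behaviour seen in the __cut_DAG example at the start of the article.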

__cut_DAG_NO_HMM

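The only difference between __cut_DAG_NO_HMM and __cut_DAG is that, for the parts the dictionary-based search could not segment into known words, __cut_DAG_NO_HMM does not fall back to the HMM. The source is as follows: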

def __cut_DAG_NO_HMM(self, sentence):
        DAG = self.get_DAG(sentence)
        route = {}
        self.calc(sentence, DAG, route)
        x = 0
        N = len(sentence)
        buf = ''
        while x < N:
            y = route[x][1] + 1
            l_word = sentence[x:y]
            if re_eng.match(l_word) and len(l_word) == 1:
                buf += l_word
                x = y
            else:
                if buf:
                    yield buf
                    buf = ''
                yield l_word
                x = y
        if buf:
            yield buf
            buf = ''
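
The buffering in this function can be isolated into a small sketch. It assumes re_eng matches a single ASCII letter or digit (the pattern below is an assumption, as re_eng is not shown in this article), and merge_single_alnum is a hypothetical helper that works on an already-split token list rather than on route.

```python
import re

# Assumed pattern: one ASCII letter or digit.
re_eng = re.compile('[a-zA-Z0-9]')


def merge_single_alnum(tokens):
    """Glue runs of single alphanumeric characters together; pass other tokens through."""
    buf = ''
    for tok in tokens:
        if len(tok) == 1 and re_eng.match(tok):
            buf += tok          # keep collecting letters and digits
        else:
            if buf:
                yield buf       # flush the collected run before the next word
                buf = ''
            yield tok
    if buf:
        yield buf


tokens = ['我', '用', 'j', 'i', 'e', 'b', 'a', '3', '分詞']
print(list(merge_single_alnum(tokens)))
```

Single Chinese characters are emitted as they are, which is why '杭' and '研' stay separate in this mode.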
