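Search Engine -- Example: Implementing Mixed Chinese-English Word Segmentation -- Principle and Implementation of the Forward Maximum Matching Algorithm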

Date: 2019-11-12

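The only difference between pure Chinese text and mixed Chinese-English text is how, during segmentation, you tell whether a character is an English character or a Chinese character.

The human eye tells them apart easily, but for a computer it is not so simple. Once Chinese characters can be distinguished from English ones, the segmentation itself is not a hard problem.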

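1: Text encoding issues:

  utf8: on Windows, text saved as UTF-8 starts with a 3-byte BOM (EF BB BF in hexadecimal), and you cannot be sure that every Chinese character takes exactly 3 bytes, although English characters are encoded exactly as in ASCII.

  ascii (in practice the Windows ANSI code page, cp936/GBK, which is what the script at the end of this post declares): an English character is one byte and a Chinese character is two bytes, and the first byte of a Chinese character is greater than 128, so it cannot be confused with English. Recommended.

  unicode: a Chinese character is basically two bytes, which suits pure-Chinese segmentation for web pages.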
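To make the byte-level differences concrete, here is a small sketch that encodes one mixed string in each encoding and inspects the bytes. It is written for Python 3 (the script at the end of this post is Python 2), and the sample string and the helper name is_gbk_lead_byte are my own choices for illustration.

```python
import codecs

text = "中文abc"  # sample string chosen for illustration

gbk = text.encode("gbk")              # the "ascii"/ANSI case on Chinese Windows (cp936)
utf8 = text.encode("utf-8")           # 3 bytes for each of these Chinese characters
utf8_bom = text.encode("utf-8-sig")   # UTF-8 with the BOM prepended
utf16 = text.encode("utf-16-le")      # 2 bytes per character for these code points

def is_gbk_lead_byte(b):
    """True if byte value b can start a two-byte GBK character (hypothetical helper)."""
    return b > 128

print(len(gbk), len(utf8), len(utf16))    # 7 9 10
print(utf8_bom[:3] == codecs.BOM_UTF8)    # True
print([is_gbk_lead_byte(b) for b in gbk])
```

In the GBK bytes only the first byte of each Chinese character is guaranteed to be above 128, which is why the segmenter has to consume Chinese characters two bytes at a time.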

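2: Analyzing and choosing the text encoding

  For offline segmentation I use ascii-encoded text: English and Chinese bytes are easy to tell apart, so this is the easiest to implement.

  For online segmentation the incoming parameter is unicode, and pure-Chinese segmentation is even simpler.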
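For the unicode case, one common way to tell Chinese from everything else is a code point range check. The sketch below (Python 3) assumes that "Chinese character" means the CJK Unified Ideographs block U+4E00..U+9FFF; the function names is_chinese and split_runs are hypothetical, and Chinese punctuation falls outside this range and is treated as non-Chinese.

```python
def is_chinese(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"

def split_runs(text):
    """Split text into maximal runs of Chinese / non-Chinese characters."""
    runs = []
    for ch in text:
        kind = is_chinese(ch)
        if runs and runs[-1][0] == kind:
            runs[-1][1] += ch          # extend the current run
        else:
            runs.append([kind, ch])    # start a new run
    return [(kind, chunk) for kind, chunk in runs]

print(split_runs("我用python分詞"))
# [(True, '我用'), (False, 'python'), (True, '分詞')]
```

Only the runs flagged True need to go through the dictionary matcher; the others can be passed through unchanged.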

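3: How Chinese word segmentation works (dictionary-based or semantic segmentation; the latter needs a large amount of data). Dictionary-based segmentation is used here.

  The Chinese dictionary I use: click here to download. The original dictionary had 300,000 entries; after removing words longer than 5 characters, a little over 250,000 remain, which is basically enough. PS: I extracted it myself from the jieba segmentation dictionary, so you can extract the 300,000-entry dictionary yourself ^_^

  Chinese stop-word list: click here to download

  Segmentation material: click here to download 10,000 Sina Weibo posts

  If the download does not start when you click, copy the link into a download tool such as Xunlei or QQ Xuanfeng.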
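The dictionary trimming mentioned above (dropping entries longer than 5 characters) might look roughly like the following Python 3 sketch. It assumes each dictionary line starts with the word followed by whitespace-separated extra fields, which is my assumption about the input format; the function name trim_dictionary and the sample lines are made up for illustration.

```python
def trim_dictionary(lines, max_chars=5):
    """Keep the first field of each line if it has at most max_chars characters."""
    words = set()
    for line in lines:
        fields = line.split()
        if fields and len(fields[0]) <= max_chars:
            words.add(fields[0])
    return words

# Made-up sample lines: word, then extra fields that are ignored here.
sample = ["中國 1000 ns", "搜索引擎 50 n", "中華人民共和國 30 ns", ""]
print(trim_dictionary(sample))
```

Capping the word length also bounds max_length in the matcher, so every lookup window stays small.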

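4: How forward maximum matching segmentation works

Define a maximum match length, max_length.

Scan the document from left to right:

    If the current character is a Chinese character (ord > 128):
        If the buffer is shorter than max_length, keep reading.
        If the buffer length == max_length:
            Try to match words of length max_length down to 1, in that order.
            If a word matches:
                Check whether it is a stop word.
                    If it is not, record it.
                    If it is a stop word, resume segmenting.

    If the current character is not a Chinese character, process the Chinese run collected so far:
        If it is empty, continue.
        If it is not empty, segment it.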
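The steps above can be sketched compactly on unicode strings. This is a minimal Python 3 illustration rather than the script from this post: the function names, the toy dictionary and the choice to emit an unmatched single character as its own token (the usual convention for this algorithm) are my assumptions.

```python
def is_chinese(ch):
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def forward_max_match(text, dictionary, stop_words, max_length=5):
    """Return (tokens, recorded): all pieces in order, and (word, start)
    pairs for dictionary words that are not stop words."""
    tokens, recorded = [], []
    i, n = 0, len(text)
    while i < n:
        if not is_chinese(text[i]):
            # Pass a run of non-Chinese characters through as one token.
            j = i
            while j < n and not is_chinese(text[j]):
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Window of up to max_length consecutive Chinese characters.
        end = i
        while end < n and end - i < max_length and is_chinese(text[end]):
            end += 1
        # Try the longest candidate first, shrinking one character at a time.
        for length in range(end - i, 0, -1):
            word = text[i:i + length]
            if word in dictionary or length == 1:
                tokens.append(word)
                if word in dictionary and word not in stop_words:
                    recorded.append((word, i))
                i += length
                break
    return tokens, recorded

words = {"搜索", "引擎", "搜索引擎", "的", "分詞"}   # toy dictionary
tokens, recorded = forward_max_match("搜索引擎的分詞abc", words, {"的"})
print("/".join(tokens))   # 搜索引擎/的/分詞/abc
print(recorded)           # [('搜索引擎', 0), ('分詞', 5)]
```

Because the longest candidate is tried first, "搜索引擎" wins over the shorter "搜索", and the stop word "的" still appears in the output but is not recorded.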

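5: The Python implementation of forward maximum matching for mixed Chinese-English text follows. weibo is the text being segmented and result is the segmented output; print it to check that it looks right.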

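# -*- coding: cp936 -*-
# Python 2 script: str is a byte string here, so each Chinese character
# read from the cp936 (GBK) files occupies two bytes.
import string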

# dist: the word dictionary, one term per line in bdist.txt
dist = {}
df = open("bdist.txt","r")
for line in df:
    term = line.strip()
    dist[term]=1

# stop: the stop-word list, one word per line in stopword.txt
stop = {}
sf = open("stopword.txt","r")
for line in sf:
    stopping = line.strip()
    stop[stopping]=1

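# re maps each term to its positions in the current post, formatted as
# "<weibo id>[<pos>.<pos>." (note: the name shadows the standard re module)
re = {}
def record(t,i,w_id):
    if re.get(t) is None:
        re[t]=str(w_id)+'['+str(i)+'.'
    else:
        re[t]=re[t]+str(i)+'.'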

wf = open("weibo.txt","r")
while True:
    re = {}    # positions are rebuilt for every post
    line = wf.readline()
    if len(line) ==0:
        break
    # skip lines whose last character before the newline is not '1'
    if line[len(line)-2:len(line)-1]!='1':
        continue
    # fields are comma-separated: id, user id, user name, content, pt, count
    w_id_end = line.find(',',0)
    w_id = line[0:w_id_end]
    if not w_id.isdigit():
        continue
    w_id = int(w_id)
    w_userid_end = line.find(',',w_id_end+1)
    w_userid = line[w_id_end+1:w_userid_end]
    w_username_end = line.find(',',w_userid_end+1)
    w_username = line[w_userid_end+1:w_username_end]
    # the content field ends at the next comma, so a post that itself
    # contains an ASCII comma is cut short there
    w_content_end = line.find(',',w_username_end+1)
    w_content = line[w_username_end+1:w_content_end]

    w_pt_end = line.find(',',w_content_end+1)
    w_pt = line[w_content_end+1:w_pt_end]
    w_count = int(line[w_pt_end+1:])

    weibo = w_content

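    # begin segmentation
    max_length = 10    # in bytes: five two-byte Chinese characters
    weibo_length = len(weibo)
    index = 0
    temp = ''          # buffer of Chinese bytes not yet segmented
    result = ''        # segmented output, terms separated by '/'
    while index<weibo_length:
        s=weibo[index:index+1]
        if ord(s[0])>128:
            # lead byte of a two-byte Chinese character: buffer both bytes
            s = weibo[index:index+2]
            temp = temp+s
            index = index+2
            # keep reading until the buffer is full or at most one byte is left
            if len(temp)<max_length and index+1<weibo_length:
                continue
            else:
                t =temp
                while True:
                    if temp=='':
                        break
                    if dist.get(temp)==1:
                        # longest dictionary prefix of the buffer: output it and,
                        # unless it is a stop word, record it at index-2 (the
                        # offset of the last Chinese character read so far)
                        result = result+temp+'/'
                        if stop.get(temp) is None:
                            record(temp,index-2,w_id)
                        temp = t[len(temp):len(t)]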
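                        if temp!='' and index+1>weibo_length:
                            # end of text: segment the rest of the buffer now
                            t =temp
                            while True:
                                if temp=='':
                                    break
                                if dist.get(temp)==1:
                                    result = result+temp+'/'
                                    if stop.get(temp) is None:
                                        record(temp,index-2,w_id)
                                    temp = t[len(temp):len(t)]
                                    t = temp
                                else:
                                    if len(temp)>0:
                                        temp=temp[0:len(temp)-2]
                        else:
                            # otherwise return to reading; any remainder stays buffered
                            break
                    else:
                        # no match: drop the last character (two bytes) and retry.
                        # If even a single character is not in the dictionary, temp
                        # becomes '' and the rest of the buffer is discarded.
                        if len(temp)>0:
                            temp=temp[0:len(temp)-2]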
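        else:
            # single-byte (non-Chinese) character: it ends the current Chinese run
            index=index+1
            if temp=='':
                # nothing buffered: copy the byte straight to the output
                result =result+s
                continue
            else:
                # segment the buffered Chinese run, longest match first
                t =temp
                while True:
                    if temp=='':
                        break
                    if dist.get(temp)==1:
                        result = result+temp+'/'
                        if stop.get(temp) is None:
                            record(temp,index-2,w_id)
                        # continue with the rest of the buffer
                        temp = t[len(temp):len(t)]
                        t = temp
                    else:
                        # shorten the candidate by one character (two bytes)
                        if len(temp)>0:
                            temp=temp[0:len(temp)-2]
                result =result+s
    print(result)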
sf.close()
df.close()
wf.close()
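The record helper in the script above accumulates, for each post, a string of the form "<weibo id>[<pos>.<pos>." per term. The following Python 3 sketch reproduces that format in isolation so it is easier to see; the function name build_postings and the sample hits are hypothetical.

```python
def build_postings(doc_id, hits):
    """Build {term: "<doc_id>[<pos>.<pos>."} from (term, position) pairs."""
    postings = {}
    for term, pos in hits:
        if term not in postings:
            # First occurrence: start the entry with the document id.
            postings[term] = str(doc_id) + "[" + str(pos) + "."
        else:
            # Later occurrences: append the position only.
            postings[term] += str(pos) + "."
    return postings

hits = [("分詞", 0), ("算法", 4), ("分詞", 10)]   # made-up sample positions
print(build_postings(7, hits))
# {'分詞': '7[0.10.', '算法': '7[4.'}
```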