PKU's open-source Chinese word segmenter gets a reality check...

Anyone who has worked on search knows that segmentation quality directly affects the final search results. English segmentation is much easier, because English sentences mark word boundaries with spaces. Chinese has no such delimiters, and the same word can mean very different things in different contexts; sometimes only the surrounding text pins down its exact meaning. That is why Chinese word segmentation has long been one of the big challenges in the field.
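To see the difference concretely, compare whitespace tokenization of English with segmenting a Chinese sentence. This is just an illustrative sketch; the exact split of the Chinese sentence depends on the segmenter's dictionary and model:

import jieba

# English: spaces already mark the word boundaries
print("search results depend on tokens".split())

# Chinese: no delimiters, so the segmenter has to infer the boundaries
print(list(jieba.cut("曹操煮酒論英雄")))   # e.g. ['曹操', '煮酒', '論', '英雄']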

I previously covered a newly open-sourced segmenter from Peking University, pkuseg. According to its authors' benchmarks, it beats jieba and other segmenters on both accuracy and speed.
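For reference, basic pkuseg usage (per its README) looks like this, after a pip install pkuseg:

import pkuseg

seg = pkuseg.pkuseg()            # load the default, general-purpose model
print(seg.cut("我爱北京天安门"))   # ['我', '爱', '北京', '天安门']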

So I wanted to run a simple test of my own: use Romance of the Three Kingdoms and count how often famous characters' names appear.

First, I searched for a list of Three Kingdoms characters.

With the names in hand, I put them into a list and then built a dictionary with each character's name as the key and 0 as the value. I only took some of the names from Cao Wei and Shu Han; the code is as follows:

wei = ["許褚","荀攸","賈詡","郭嘉","程昱","戲志","劉曄","蔣濟","陳羣","華歆","鍾繇","滿寵","董昭","王朗","崔琰","鄧艾","杜畿","田疇","王修","楊修","辛毗",
       "楊阜",
       "田豫","王粲","蒯越","張繼","于禁","棗祗","曹操","孟德","任峻","陳矯","郗慮","桓玠","丁儀","丁廙","司馬朗","韓暨","韋康","邴原","趙儼","婁圭","賈逵",
       "陳琳",
       "司馬懿","張遼","徐晃","夏侯惇","夏侯淵","龐德","張郃","李典","樂進","典韋","曹洪","曹仁","曹彰"]

wei_dict = dict.fromkeys(wei, 0)
shu_dict = dict.fromkeys(shu, 0)

Next, download an e-book of the novel and read it in:

def read_txt():
    # read the full text of the novel
    with open("三國.txt", encoding="utf-8") as f:
        content = f.read()

    return content

pkuseg test results

pkuseg is simple to use: first instantiate a pkuseg object. The idea for counting names is to loop over the token list produced by segmentation and, whenever a token matches a name in one of the name dictionaries, add 1 to its count. The code is as follows:

import time
import pkuseg

def extract_pkuseg(content):
    start = time.time()
    seg = pkuseg.pkuseg()      # instantiate with the default model
    text = seg.cut(content)    # segment the whole novel

    for name in text:
        if name in wei:
            wei_dict[name] += 1
        elif name in shu:
            shu_dict[name] += 1

    print(f"pkuseg time: {time.time() - start}")
    print(f"pkuseg names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")

Here's the result:

(screenshot: pkuseg test results)
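One aside before the jieba run: name in wei scans a Python list element by element, so every token pays an O(len(wei)) membership test across the whole novel. Swapping the lists for sets, or counting with collections.Counter, tightens the loop without changing the counts. A sketch, reusing the wei and shu lists from above:

from collections import Counter

wei_set, shu_set = set(wei), set(shu)   # O(1) membership tests

def count_names(tokens):
    # count every token once, then keep only the names we track
    counts = Counter(tokens)
    return ({n: counts[n] for n in wei_set},
            {n: counts[n] for n in shu_set})

This only changes the constant factor of the counting loop; most of the measured time is likely the segmentation itself.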

jieba test results

The code is basically the same; only the segmenter's API differs.

import time
import jieba

def extract_jieba(content):
    # note: wei_dict / shu_dict are shared with the pkuseg test,
    # so reset them between runs to keep the comparison fair
    start = time.time()
    seg_list = jieba.cut(content)   # returns a generator of tokens
    for name in seg_list:
        if name in wei:
            wei_dict[name] += 1
        elif name in shu:
            shu_dict[name] += 1

    print(f"jieba time: {time.time() - start}")
    print(f"jieba names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")

Here's the result:

(screenshot: jieba test results)
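Part of the gap is likely names being split into pieces: three-character names like 夏侯惇 or 司馬懿 are easy for a general-purpose model to break apart. jieba lets you register words up front so they survive segmentation; a minimal sketch using the name lists from above:

import jieba

# register every character name so the tokenizer keeps it whole
for name in wei + shu:
    jieba.add_word(name)

pkuseg accepts a user dictionary as well, via the user_dict argument to pkuseg.pkuseg(), so the same trick applies on both sides of the comparison.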

Hmm, the results were a bit unexpected. So much for pkuseg's higher accuracy??? pkuseg took nearly three times as long as jieba, and its name extraction was worse too! So I went and searched Zhihu for pkuseg, and found this...

Overall, pkuseg is somewhat overhyped. It isn't as magical as its authors claim, and there's a whiff of headline-chasing; perhaps it is better understood as a segmenter aimed at specialized domains.
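That last point is worth a test of its own: pkuseg ships domain-specific models, and picking one that matches your text is a one-line change. A sketch, with model names as listed in the pkuseg docs:

import pkuseg

# domain models include 'news', 'web', 'medicine', 'tourism'
seg = pkuseg.pkuseg(model_name="news")

The default model targets mixed-domain modern text, so classical fiction like Romance of the Three Kingdoms is arguably out of domain for it, which may explain part of the result above.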
