Anyone who has built a search feature knows that the quality of word segmentation directly affects the final search results. Segmenting English is much easier, because English sentences come pre-split by spaces; Chinese, on the other hand, is far more subtle: the same word can mean very different things in different contexts, and sometimes you need the surrounding text to pin down its meaning. That is why Chinese word segmentation has long been one of the big challenges in this field.
I previously wrote about pkuseg, a word segmenter newly open-sourced by Peking University. According to its authors' benchmarks, it beats jieba and other segmenters in both accuracy and speed.
So I decided to run a simple test of my own: take Romance of the Three Kingdoms (《三國演義》) and count how often the famous characters' names appear.
First, I searched for a list of the characters in the novel.
With the names in hand, I put them into a list and then turned it into a dictionary whose keys are the character names and whose values start at 0. I only took some of the names from Cao Wei and Shu Han (the shu list is built exactly like the wei one; only a shortened version is shown below). The code looks like this:
wei = ["許褚","荀攸","賈詡","郭嘉","程昱","戲志","劉曄","蔣濟","陳羣","華歆","鍾繇","滿寵","董昭","王朗","崔琰","鄧艾","杜畿","田疇","王修","楊修","辛毗",
"楊阜",
"田豫","王粲","蒯越","張繼","于禁","棗祗","曹操","孟德","任峻","陳矯","郗慮","桓玠","丁儀","丁廙","司馬朗","韓暨","韋康","邴原","趙儼","婁圭","賈逵",
"陳琳",
"司馬懿","張遼","徐晃","夏侯惇","夏侯淵","龐德","張郃","李典","樂進","典韋","曹洪","曹仁","曹彰"]
wei_dict = dict.fromkeys(wei, 0)
shu_dict = dict.fromkeys(shu, 0)
Next, I downloaded an e-book of the novel and read it in:
def read_txt():
    with open("三國.txt", encoding="utf-8") as f:
        content = f.read()
    return content
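As a quick sanity check (a minimal sketch of my own, not from the original post), here is what the two segmenters' basic calls look like. Note the difference in return types: pkuseg's cut() returns a list, while jieba.cut() returns a generator.

import pkuseg
import jieba

sentence = "卻說曹操引兵追趕劉備"

# pkuseg: instantiate a segmenter object, then cut() returns a list of tokens
seg = pkuseg.pkuseg()
print(seg.cut(sentence))

# jieba: cut() returns a generator, so wrap it in list() to inspect the tokens
print(list(jieba.cut(sentence)))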
pkuseg is simple to use: first instantiate a pkuseg object. The counting logic is straightforward: loop over the segmented tokens and, whenever a token matches a name in one of the name dictionaries, add 1 to its count. The code is as follows:
import time
import pkuseg

def extract_pkuseg(content):
    start = time.time()
    seg = pkuseg.pkuseg()    # load the default pkuseg model
    text = seg.cut(content)  # a list of tokens
    for name in text:
        if name in wei:
            wei_dict[name] += 1
        elif name in shu:
            shu_dict[name] += 1
    print(f"pkuseg time: {time.time() - start}")
    print(f"pkuseg total names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")
The output is as follows:
For jieba the code is basically the same; only the way the segmenter is called differs.
import time
import jieba

def extract_jieba(content):
    start = time.time()
    seg_list = jieba.cut(content)  # a generator of tokens
    for name in seg_list:
        if name in wei:
            wei_dict[name] += 1
        elif name in shu:
            shu_dict[name] += 1
    print(f"jieba time: {time.time() - start}")
    print(f"jieba total names found: {sum(wei_dict.values()) + sum(shu_dict.values())}")
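For completeness, here is a minimal driver (my own sketch, not from the original post) that ties the pieces together. One thing to watch: both functions update the same global dictionaries, so the counters should be reset between runs, otherwise the second segmenter's totals include the first's.

if __name__ == "__main__":
    content = read_txt()

    extract_pkuseg(content)
    # the ten most frequent names according to pkuseg
    print(sorted({**wei_dict, **shu_dict}.items(), key=lambda kv: kv[1], reverse=True)[:10])

    # reset the counters before the jieba run
    for d in (wei_dict, shu_dict):
        for k in d:
            d[k] = 0

    extract_jieba(content)
    print(sorted({**wei_dict, **shu_dict}.items(), key=lambda kv: kv[1], reverse=True)[:10])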
The output is as follows:
Hmm, the results were a little unexpected. Wasn't pkuseg supposed to be more accurate? It took nearly three times as long as jieba, and it actually extracted fewer of the names. So I went and searched Zhihu for pkuseg, and this is what I found....
Overall, pkuseg has been somewhat over-hyped: it is not as impressive as its authors claim, and the headline numbers feel a bit attention-grabbing. Perhaps it is better seen as a segmenter aimed at specific, specialised domains!
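If the goal is specifically to catch character names, it is also worth noting (as an aside beyond the original test) that both libraries accept a user dictionary, which often matters more than which segmenter you pick. A rough sketch, writing a hypothetical names.txt on the fly:

import jieba
import pkuseg

# jieba: register each name so the segmenter will not split it apart
for name in wei + shu:
    jieba.add_word(name)

# pkuseg: pass a file with one word per line as the user dictionary
with open("names.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(wei + shu))
seg = pkuseg.pkuseg(user_dict="names.txt")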