Using the jieba and wordcloud Libraries

Date: 2019-11-25

Tags: jieba, wordcloud, usage


Contents:

1. Using the jieba library

2. Using the wordcloud library

References:

https://github.com/fxsjy/jieba

https://blog.csdn.net/fontthrone/article/details/72775865

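1、Using the jieba library

1. Introduction to jieba

jieba is a widely used third-party library for Chinese word segmentation. Once installed with pip (pip install jieba), it can be used to split Chinese text into words.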

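Features:

Supports three segmentation modes:
- Accurate mode: tries to cut the sentence as precisely as possible, with no redundant words; suited to text analysis.
- Full mode: scans out every word that can be formed from the sentence. It is very fast but cannot resolve ambiguity, so the output contains redundant words.
- Search engine mode: builds on accurate mode and splits long words again to improve recall; suited to search engine tokenization.
Supports Traditional Chinese segmentation
Supports custom dictionaries
MIT license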

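2. jieba functions

(1) The three segmentation functions (three modes)

The three modes map to the following functions:

cut(s) and lcut(s)                                # accurate mode
cut(s, cut_all=True) and lcut(s, cut_all=True)    # full mode (redundant output)
cut_for_search(s) and lcut_for_search(s)          # search engine mode (redundant output)

Note: cut() and lcut() differ in their return type: cut() returns a generator, while lcut() returns a list. The same holds for cut_for_search() (generator) and lcut_for_search() (list).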

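In addition:

cut and lcut accept three main parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used.
cut_for_search and lcut_for_search accept two parameters: the string to segment, and whether to use the HMM model. These functions produce fine-grained output and are suited to building inverted indexes for search engines.
The string to segment can be a unicode string, a UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, because it may be unpredictably mis-decoded as UTF-8.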
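One way to sidestep the GBK pitfall is to decode bytes explicitly before segmenting. The sketch below assumes the data arrives as bytes in a known encoding; to_text is a hypothetical helper, not part of jieba.

```python
def to_text(raw, encoding="utf-8"):
    """Return raw as str, decoding bytes with the given encoding (hypothetical helper)."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


gbk_bytes = "中文分詞".encode("gbk")        # stand-in for data read from a GBK-encoded file

text = to_text(gbk_bytes, encoding="gbk")   # decode with the encoding the data really uses
print(text)

# text is now an ordinary str and can be passed to jieba.lcut(text)
```

When reading from a file, opening it with open(path, encoding="gbk") achieves the same result.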

Examples:

import jieba

s = "中國是一個偉大的國家"
res1 = jieba.lcut(s)                                     # accurate mode
res2 = jieba.lcut(s, cut_all=True)                       # full mode (redundant output)
res3 = jieba.lcut_for_search("中華人民共和國是偉大的")      # search engine mode (redundant output)

print(res1, res2, res3, sep="\n")

import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他來到了網易杭研大廈")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明碩士畢業於中國科學院計算所，後在日本京都大學深造")  # search engine mode
print(", ".join(seg_list))


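(2) Adding new words or loading a custom dictionary

add_word(word, freq=None, tag=None) and del_word(word) modify the dictionary dynamically at run time; load_userdict(file_name) loads a custom dictionary from a file.

The simplest usage is calling add_word() to add a new word directly to the segmentation dictionary.

Example:

import jieba

s = "李小福是創新辦主任也是雲計算方面的專家"
print(jieba.lcut(s))        # before adding the word
jieba.add_word("創新辦")
print(jieba.lcut(s))        # after adding the word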

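load_userdict(file_name) can also be used to load a custom dictionary.

Example:

The custom dictionary file dict.txt contains one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces:

雲計算 5
李小福 2 nr
創新辦 3 i
easy_install 3 eng
好用 300

import jieba

s = "李小福是創新辦主任也是雲計算方面的專家"
print(jieba.lcut(s))              # before loading the custom dictionary
jieba.load_userdict("dict.txt")
print(jieba.lcut(s))              # after loading the custom dictionary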
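Because a malformed line in the dictionary file is easy to miss, it can help to check entries before loading them. The sketch below parses the "word [frequency] [tag]" layout used in dict.txt; parse_userdict_line is a hypothetical helper written for illustration, not a jieba function, and jieba's own parser may accept or reject slightly different input.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, tag); freq and tag may be None."""
    parts = line.split()
    if not parts:
        raise ValueError("empty dictionary line")
    word, rest = parts[0], parts[1:]
    freq = None
    tag = None
    if rest and rest[0].isdigit():      # the optional frequency comes first
        freq = int(rest[0])
        rest = rest[1:]
    if rest:                            # the optional part-of-speech tag follows
        tag = rest[0]
        rest = rest[1:]
    if rest:
        raise ValueError("too many fields: " + line)
    return word, freq, tag


entries = ["雲計算 5", "李小福 2 nr", "easy_install 3 eng", "好用"]
parsed = [parse_userdict_line(entry) for entry in entries]
for item in parsed:
    print(item)
```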

3. General-purpose code for word frequency counting

import string

import jieba

# Count word frequencies in Hamlet and in Romance of the Three Kingdoms


# Word frequencies for Hamlet -> usable as general English tokenizing and counting code
class Hamlet(object):
    def __init__(self, name):
        """
        :param name: name or path of the text file
        """
        self.text_name = name

    def get_text(self):
        """
        Read the text, lowercase it and replace punctuation with spaces
        :return: the processed text
        """
        with open(self.text_name, "r", encoding="utf-8") as f:
            txt = f.read().lower()
        for ch in string.punctuation:
            txt = txt.replace(ch, " ")
        return txt

    def count(self):
        """
        Count word occurrences and print the 10 most frequent words
        """
        words = self.get_text().split()
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        items = list(counts.items())
        # Sort by the second element of each item (the count); reverse=True sorts from largest to smallest
        items.sort(key=lambda x: x[1], reverse=True)
        # Slicing avoids an IndexError when the text has fewer than 10 distinct words
        for word, count in items[:10]:
            print(word, count)


# Frequencies of character names in Romance of the Three Kingdoms -> usable as general Chinese segmentation and counting code
class ThreeKingdoms(object):
    def __init__(self, name):
        """
        :param name: name or path of the text file
        """
        self.text_name = name

    def get_text(self):
        """
        Read the text
        :return: the text content
        """
        with open(self.text_name, "r", encoding="utf-8") as f:
            return f.read()

    def split_txt(self):
        """
        Segment the text
        :return: list of segmented words
        """
        return jieba.lcut(self.get_text())

    def count(self):
        """
        Count word occurrences and print the 8 most frequent names
        """
        words = self.split_txt()
        # excludes holds frequent words that are not character names
        # (all string literals here must use the same script as the input text)
        excludes = {"將軍", "卻說", "二人", "不可", "荊州", "不能", "如此", "商議", "如何", "左右",
                    "軍馬", "引兵", "軍士", "次日", "主公", "大喜", "天下", "東吳", "於是", "今日", "魏兵"}
        counts = {}
        for word in words:
            rword = word
            if len(word) == 1:
                continue
            # Merge the different names of the same character
            elif word == "諸葛亮" or word == "孔明" or word == "孔明曰":
                rword = "孔明"
            elif word == "關公" or word == "雲長":
                rword = "關羽"
            elif word == "玄德" or word == "玄德曰":
                rword = "劉備"
            elif word == "孟德" or word == "丞相":
                rword = "曹操"
            counts[rword] = counts.get(rword, 0) + 1
        for word in excludes:
            # pop() with a default avoids a KeyError when a word never occurs in the text
            counts.pop(word, None)
        items = list(counts.items())
        # Sort by the second element of each item (the count); reverse=True sorts from largest to smallest
        items.sort(key=lambda x: x[1], reverse=True)
        for word, count in items[:8]:
            print(word, count)


if __name__ == '__main__':
    # s1 = Hamlet("hamlet.txt")
    # s1.count()

    s2 = ThreeKingdoms("threekingdoms.txt")
    s2.count()
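The counting step can also be written with collections.Counter from the standard library, which replaces the manual dictionary and sort. This is an alternative sketch rather than the code above: top_words and the ALIASES table are hypothetical names, and the word list is a small hand-made sample standing in for the output of jieba.lcut().

```python
from collections import Counter

# Maps alternative names to one canonical name per character (illustrative subset)
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def top_words(words, n=8, aliases=None, excludes=()):
    """Return the n most common words after alias merging and filtering."""
    aliases = aliases or {}
    excluded = set(excludes)
    counts = Counter()
    for word in words:
        if len(word) == 1:                  # skip single characters and punctuation
            continue
        word = aliases.get(word, word)      # merge aliases into the canonical name
        if word in excluded:
            continue
        counts[word] += 1
    return counts.most_common(n)


sample = ["孔明", "諸葛亮", "曰", "玄德", "將軍", "孔明曰"]
result = top_words(sample, n=2, aliases=ALIASES, excludes={"將軍"})
print(result)   # → [('孔明', 3), ('劉備', 1)]
```

Keeping the aliases in a dictionary means new name variants can be added without touching the counting loop.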

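2、Using the wordcloud library

1. Introduction to wordcloud

wordcloud is a Python library for generating word clouds. It is easy to use and offers many configuration options.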


2. Basic usage of wordcloud

Example:

import wordcloud

c = wordcloud.WordCloud()                # create a WordCloud object
c.generate("wordcloud by Python")        # load the text for the word cloud
c.to_file("wordcloud.png")               # write the word cloud to an image file

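The WordCloud constructor accepts the following parameters:

width: width of the generated image in pixels (default 400)
height: height of the generated image in pixels (default 200)
min_font_size: smallest font size used (default 4)
max_font_size: largest font size used (by default derived from the image height)
font_step: step size between font sizes (default 1)
font_path: path to the font file; needed for Chinese text, since the default font has no Chinese glyphs
max_words: maximum number of words shown (default 200)
stopwords: set of words to leave out of the word cloud
background_color: background color of the image (default "black")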

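Specifying the word cloud shape:

import jieba
import numpy as np
import wordcloud
from PIL import Image

# scipy.misc.imread has been removed from SciPy; read the image with Pillow instead
mask = np.array(Image.open("yun.png"))  # load the image data into mask

with open("文檔.txt", "r", encoding="utf-8") as f:
    data = f.read()

ls = jieba.lcut(data)                   # segment the text
txt = " ".join(ls)                      # join the words into one space-separated string

w = wordcloud.WordCloud(mask=mask)      # set the shape; for Chinese text also pass font_path
w.generate(txt)
w.to_file("output.png")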
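A mask does not have to come from an image file. Since wordcloud treats pure white (255) mask entries as masked out and draws words everywhere else, a shape can be built directly with NumPy. The sketch below creates a circular mask; circle_mask is a hypothetical helper and the size is arbitrary.

```python
import numpy as np


def circle_mask(size=400):
    """Return a size x size uint8 array: 0 inside the circle, 255 (masked out) outside."""
    y, x = np.ogrid[:size, :size]
    radius = size / 2
    outside = (x - radius) ** 2 + (y - radius) ** 2 > radius ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255                 # white entries are where no words are drawn
    return mask


mask = circle_mask(100)
print(mask.shape, mask[50, 50], mask[0, 0])   # → (100, 100) 0 255

# The array can then be passed as wordcloud.WordCloud(mask=mask)
```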

3. General-purpose code for generating word clouds

import jieba
import numpy as np
import wordcloud
from PIL import Image


def make_cloud(input_file, output_file, **kwargs):
    """
    General-purpose word cloud generator
    :param input_file: path or name of the input text file
    :param output_file: path or name of the output image
    :param kwargs: WordCloud parameters (width, height, background_color, font_path, max_words)
    :return: None
    """
    with open(input_file, "r", encoding="utf-8") as f:
        data = f.read()

    ls = jieba.lcut(data)                   # segment the text
    txt = " ".join(ls)                      # join the words into one space-separated string

    # Forward kwargs unchanged so that omitted parameters keep WordCloud's defaults;
    # passing None for width or height would break image creation
    w = wordcloud.WordCloud(**kwargs)
    w.generate(txt)
    w.to_file(output_file)


def make_cloud_png(input_file, output_file, png_file, **kwargs):
    """
    General-purpose generator for word clouds with a custom shape
    :param input_file: path or name of the input text file
    :param output_file: path or name of the output image
    :param png_file: path or name of the image that defines the word cloud shape
    :param kwargs: WordCloud parameters (width, height, background_color, font_path, max_words)
    :return: None
    """
    # scipy.misc.imread has been removed from SciPy; read the image with Pillow instead
    mask = np.array(Image.open(png_file))

    with open(input_file, "r", encoding="utf-8") as f:
        data = f.read()

    ls = jieba.lcut(data)                   # segment the text
    txt = " ".join(ls)                      # join the words into one space-separated string

    w = wordcloud.WordCloud(mask=mask, **kwargs)
    w.generate(txt)
    w.to_file(output_file)
