Term frequency: the number of times a word appears in a given document.
Stop words: words filtered out during processing, e.g. "網站" (website), "的" (the particle "de").
Corpus: the collection of all documents we want to analyze.
Chinese word segmentation: splitting a sequence of Chinese characters into individual words.

Python libraries
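To make the term-frequency and stop-word definitions concrete, here is a minimal sketch using only the standard library. The tokens and stop words are made up for illustration; in the real pipeline they would come from a segmenter and a stop-word file.

```python
from collections import Counter

# A toy "document" that has already been segmented into words
# (stand-ins for the output of a Chinese word segmenter).
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
stopwords = {"the", "on"}

# Term frequency: count occurrences, then drop stop words.
tf = Counter(t for t in tokens if t not in stopwords)
print(tf.most_common())
```

Counting first or filtering first gives the same result here; filtering first simply avoids storing counts for words that will be discarded anyway.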
jieba: jieba.cut(content), where content is the text to segment.
pandas: pandas.DataFrame() builds a DataFrame; pandas.DataFrame.groupby() performs grouped statistics. Grouped-statistics pattern: pandas.DataFrame.groupby(by=[columns])[stat columns].agg({name: function}).
wordcloud: a Python library for building word clouds; install it yourself.

Example
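The grouped-count pattern above can be sketched on a tiny made-up word list (the words and column names here are illustrative, not from the original corpus). Note that the dict form agg({name: function}) on a grouped Series was deprecated and later removed in newer pandas; groupby(...).size() is the equivalent that works across versions:

```python
import pandas as pd

# Hypothetical segmented words; in the real pipeline these come from jieba.cut.
df = pd.DataFrame({"segment": ["數據", "分析", "數據", "詞雲"]})

# Grouped count, equivalent to the agg({name: function}) pattern described above.
stat = (df.groupby("segment")
          .size()
          .reset_index(name="count")
          .sort_values("count", ascending=False))
print(stat)
```

The result is a two-column frequency table sorted with the most frequent word first, which is exactly the shape the word-cloud step needs.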
#!/usr/bin/env python
# coding=utf-8
import os
import codecs

import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator  # libraries used

basefile = '...'  # data storage path; fill in for your setup

# load the corpus
f_in = codecs.open(basefile + 'an.txt', 'r', 'utf-8')
content = f_in.read()

# segment the text; keep tokens longer than one character
segments = []
segs = jieba.cut(content)
for seg in segs:
    if len(seg) > 1:
        segments.append(seg)

# build a DataFrame
segmentDF = pd.DataFrame({'segment': segments})

# grouped count; agg({'計數': np.size}) on a grouped Series was removed in
# pandas 1.0, so use size() with a named column instead ('計數' = count)
segStat = (segmentDF.groupby(by=['segment'])['segment']
           .size()
           .reset_index(name='計數')
           .sort_values(by=['計數'], ascending=False))

# load stop words
stopwords = pd.read_csv(
    "./StopwordsCN.txt",
    encoding='utf8',
    index_col=False)

# remove stop words by inverting the isin() mask with ~
fSegStat = segStat[~segStat.segment.isin(stopwords.stopword)]

# build the word cloud
wordcloud = WordCloud(
    font_path='./simhei.ttf',   # font used to render the words
    background_color="black",   # background color
)
words = fSegStat.set_index('segment').to_dict()
wordcloud.fit_words(words['計數'])  # fit_words expects a {word: frequency} dict
plt.imshow(wordcloud)
plt.show()
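The stop-word removal line above relies on `~` inverting a boolean mask. A minimal sketch with made-up words and counts shows just that step in isolation:

```python
import pandas as pd

# Hypothetical frequency table and stop-word list.
seg_stat = pd.DataFrame({"segment": ["的", "數據", "分析"],
                         "count": [10, 4, 3]})
stop = pd.Series(["的", "了"])

# isin() marks rows whose segment IS a stop word;
# ~ flips the mask so we keep everything else.
filtered = seg_stat[~seg_stat.segment.isin(stop)]
print(filtered)
```

Without the `~`, the filter would keep only the stop words instead of dropping them.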
# scipy.misc.imread was removed from SciPy; imageio.imread is a drop-in replacement
from imageio import imread

# read the background image
bimg = imread(basefile + 'An.png')
wordcloud = WordCloud(
    background_color="white",
    mask=bimg,
    font_path='./simhei.ttf')
wordcloud = wordcloud.fit_words(words['計數'])

# set the figure size
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')

# extract the colors of the background image
bimgColors = ImageColorGenerator(bimg)
plt.axis("off")

# recolor the word cloud with the image colors
plt.imshow(wordcloud.recolor(color_func=bimgColors))
plt.show()