Chinese Word Frequency Statistics and Word Cloud Creation

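1. Mr. Zeng, Technical Director for the South China region at ChinaSoft International, will teach two more classes. What would you like him to cover? (Think it over carefully before answering.)

a. His work experience related to this course

b. His own views on this course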

2. Chinese word segmentation

a. As in the earlier English exercise, put the article whose word frequencies are to be counted into a TXT file:

>>> fo = open('kang.txt','w')
>>> fo.write('''A Chinese company offering sex dolls for rent has withdrawn its services just days after launching.

Touch had begun offering five different sex doll types for daily or longer-term rent on Thursday in Beijing but quickly drew complaints and criticism.

The company said in a statement on Weibo it "sincerely apologised for the negative impact" of the concept.

But the firm stressed sex was "not vulgar" and said it would keep working towards more people enjoying it.

Touch told the BBC the rental service had operated for two days.

"We prepared ten dolls for the trial operation," a company spokesperson said via email, adding that they received very positive feedback from users.

"But it’s really hard in China," the firm wrote, saying there had been a lot of controversy with the police over the issue.

The company had offered the sex dolls for a daily fee of 298 yuan ($46), according to Chinese media.

The models on offer were marketed as Chinese, Korean and Russian women, with one also modelled on the movie character Wonder Woman, complete with a sword and shield.

In its Weibo statement, the firm said its original intention had been to make expensive silicone dolls more affordable but conceded that the service triggered a heated public debate.

The company also said it would pay out compensation to users worth double the amount they had paid as a deposit for reserving a doll.

The statement added that Touch would in future pay more attention to its "social duty", and would actively promote a "healthier and more harmonious sex lifestyle".

Aside from its short-lived rental offering, the firm sells an array of sex toys, including sex dolls.''')
1664
>>> fo.close()
>>> fr=open('kang.txt','r')
>>> fr.readline()
'A Chinese company offering sex dolls for rent has withdrawn its services just days after launching.\n'
>>> 
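The same write-then-read step can also be written with `with` blocks, so that each file is closed automatically. This is only a sketch: the file name `sample.txt`, the temporary directory, and the two-paragraph sample text are illustrative assumptions, not the article used above.

```python
import os
import tempfile

# Hypothetical sample text standing in for the article.
sample = "First line of the article.\n\nSecond paragraph of the article."

# Write into a temporary directory so nothing in the working folder is touched.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-8") as fo:
    written = fo.write(sample)   # write() returns the number of characters written

with open(path, "r", encoding="utf-8") as fr:
    first_line = fr.readline()   # reads up to and including the first newline

print(written)
print(repr(first_line))
```

Passing `encoding` explicitly on both calls avoids depending on the platform default, which differs between Windows and other systems.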

Then read the file back and count the words:

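# Read the whole article; the with block closes the file automatically
with open('kang.txt','r') as fo:
    news = fo.read()

# Normalise case and turn punctuation into spaces
news = news.lower()
for ch in ',./-_"":;':
    news = news.replace(ch,' ')
words = news.split(' ')

# Words to ignore; the '\n\n...' entries are needed because split(' ')
# leaves paragraph breaks attached to the word that follows them
exp = {'','the','\n\nthe','its','that','it','a','for','and','had','said','to','of','in','on','as','they','also','or','an','\n\nin','\n\n','\n\ntouch'}

# Count each remaining word (named counts so the built-in dict is not shadowed)
counts = {}
keys = set(words)-exp
for w in keys:
    counts[w] = words.count(w)

# Sort by frequency, highest first, and print the top 10
tj = list(counts.items())
tj.sort(key=lambda x:x[1],reverse=True)
for item in tj[:10]:
    print(item)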

The result is as follows:

>>> 
=============== RESTART: C:/Users/Administrator/Desktop/詞頻2.py ===============
('sex', 7)
('company', 5)
('dolls', 5)
('would', 4)
('firm', 4)
('more', 4)
('but', 3)
('chinese', 3)
('statement', 3)
('offering', 3)
>>> 
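The stop-word set above has to list entries such as '\n\nthe' because splitting on a single space leaves paragraph breaks glued to the next word. One possible alternative, sketched below, extracts words with a regular expression and counts them with `collections.Counter`. The function name `top_words`, the sample sentence, and the short stop-word set are assumptions made for illustration; running this on the article may rank words slightly differently from the listing above (for example, "touch" would be counted as a single word).

```python
import re
from collections import Counter

def top_words(text, stopwords, n=10):
    """Return the n most frequent words in text, ignoring stopwords."""
    # Lowercase, then pull out runs of letters (apostrophes inside words are kept).
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical two-paragraph sample, not the article above.
sample = ("The firm said the dolls were popular.\n\n"
          "The firm apologised, and the dolls were withdrawn.")
stop = {"the", "and", "were"}

result = top_words(sample, stop, n=2)
print(result)
```

Because the regular expression never matches whitespace or punctuation, the stop-word set only needs plain words.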

b. Testing jieba

>>> import jieba
>>> word = jieba.cut('太陽出來就去耕作田地,太陽落山就回家去休息。')
>>> w=list(word)
>>> w
['太陽', '出來', '就', '去', '耕作', '田地', ',', '太陽', '落山', '就', '回家', '去', '休息', '。']
>>> a = list(jieba.cut('太陽出來就去耕作田地,太陽落山就回家去休息。',cut_all=True))
>>> a
['太陽', '出來', '就', '去', '耕作', '田地', '', '', '太陽', '落山', '就', '回家', '去', '休息', '', '']
>>> s = list(jieba.cut_for_search('太陽出來就去耕作田地,太陽落山就回家去休息。'))
>>> s
['太陽', '出來', '就', '去', '耕作', '田地', ',', '太陽', '落山', '就', '回家', '去', '休息', '。']

c. This time I chose to run Chinese word segmentation on the novel 雪山飛狐 (Fox Volant of the Snowy Mountain).

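import jieba

# Save the novel to a text file; the string below is a placeholder
# for the downloaded text of the novel
with open('xs.txt','w',encoding='GBK') as xs:
    xs.write('''下載的雪山飛狐小說''')

# Read it back with the same encoding and segment it
with open('xs.txt','r',encoding='GBK') as f:
    fr = f.read()
zs = jieba.cut(fr)

# Count every token longer than one character
dic = {}
for z in zs:
    if len(z) == 1:
        continue
    dic[z] = dic.get(z,0) + 1

# Sort by frequency, highest first, and print the top 20
tj = list(dic.items())
tj.sort(key=lambda x:x[1],reverse=True)
for item in tj[:20]:
    print(item)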

Result:

>>> 
 RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python36/zhongwen222.py 
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.706 seconds.
Prefix dict has been built succesfully.
('曹雲奇', 227)
('苗若蘭', 210)
('一個', 206)
('胡一刀', 204)
('衆人', 203)
('說道', 201)
('金面佛', 174)
('胡斐', 136)
('本身', 132)
('兩人', 132)
('心中', 127)
('阮士中', 127)
('寶樹', 126)
('爹爹', 124)
('苗人鳳', 121)
('孩子', 114)
('一聲', 113)
('不知', 110)
('劉元鶴', 106)
('什麼', 104)
>>> 
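The top-20 list mixes character names with generic words such as 一個, 說道 and 什麼. One possible refinement, sketched here under stated assumptions, is to drop a small stop-word list while counting. The helper `count_tokens`, the hand-written token list (standing in for the output of `jieba.cut`), and the stop-word set are all hypothetical; they are not part of the script above.

```python
from collections import Counter

def count_tokens(tokens, min_len=2, stopwords=frozenset()):
    """Count tokens, skipping short ones, whitespace-only ones and stop words."""
    counts = Counter()
    for tok in tokens:
        tok = tok.strip()
        if len(tok) < min_len or tok in stopwords:
            continue
        counts[tok] += 1
    return counts

# Hand-written tokens standing in for the output of jieba.cut(...)
tokens = ["胡斐", "說道", ",", "胡斐", "心中", "不知", "什麼", "胡一刀", "說道", "。"]
stop = {"說道", "什麼", "不知"}   # hypothetical stop-word list

counts = count_tokens(tokens, stopwords=stop)
print(counts.most_common(3))
```

The function accepts any iterable of strings, so in the script above it could be fed the generator returned by `jieba.cut(fr)` directly.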