Assignment 4: Word Frequency Count, Basic Features

Date: 2019-11-21


1. Basic information

　　Assignment page: https://edu.cnblogs.com/campus/ntu/Embedded_Application/homework/2088

　　Project Git repository: https://gitee.com/ntucs/PairProg.git

　　Pair members: 王立凱 1613072048

　　　　　　　張天弈 1613072049

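2. Project analysis

　　Task 1: Basic task

　　　　Implement a console program that, given a file of English text, counts how often each English word occurs.

　　1. Program modules (methods, functions):

　　(1) Count the valid lines in the file and store the count in lines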

with open(dst, 'r') as file:  # dst is the path to the text file
    lines = len(file.readlines())  # counts every line, including blank ones
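
The snippet above counts every line, blank or not. If "valid lines" is meant to exclude lines that contain only whitespace, a minimal sketch could look like the following. The helper name count_valid_lines and this reading of "valid" are assumptions, not part of the original program.

```python
import io

def count_valid_lines(file):
    """Count lines that contain at least one non-whitespace character."""
    # Iterating over the file object reads one line at a time,
    # so the whole file never has to be held in memory.
    return sum(1 for line in file if line.strip())

# Demonstration on an in-memory file; a real call would pass open(dst, 'r').
sample = io.StringIO("first line\n\n   \nsecond line\n")
valid = count_valid_lines(sample)
```

Here valid is 2, because the empty line and the whitespace-only line are skipped.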

　　(2) Count the total number of words in the file

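import re
from string import punctuation


def process_buffer(bvffer):
    """Process the buffer; return the word-frequency dict WordCount and the total token count."""
    WordCount = {}
    words = []  # defined up front so the return also works for an empty buffer
    if bvffer:
        # Lowercase the whole text
        bvffer = bvffer.lower()
        # Replace each punctuation character with a space, then split on whitespace.
        # (str.replace(punctuation, ' ') would only match the entire punctuation string.)
        table = str.maketrans(punctuation, ' ' * len(punctuation))
        words = bvffer.translate(table).split()
        # A word starts with at least four letters, followed by any word characters
        regex_word = r"^[a-z]{4}\w*"
        for word in words:
            if re.match(regex_word, word):
                # Add 1 to the count; a word seen for the first time starts at 0
                WordCount[word] = WordCount.get(word, 0) + 1
    return WordCount, len(words)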
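
To make the word rule concrete, the sketch below applies the same pattern, `^[a-z]{4}\w*`, to a few sample tokens. The helper name is_word and the sample tokens are illustrative only.

```python
import re

# Same rule as in process_buffer: at least four leading lowercase letters,
# then any number of word characters (letters, digits, underscore).
WORD_PATTERN = re.compile(r"^[a-z]{4}\w*")

def is_word(token):
    """Return True if the token counts as a word under the rule."""
    return WORD_PATTERN.match(token) is not None

samples = ["file", "file123", "abc", "123file", "abc1d", "windows"]
accepted = [t for t in samples if is_word(t)]
```

With these samples, accepted is ["file", "file123", "windows"]. Because re.match only anchors at the start, a token such as "word!" would also pass, which is why punctuation has to be stripped before matching.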

　　(3) Count how often each word occurs and print the ten most frequent words.

def output_result(WordCount):
    """Sort words by count; print and return the ten most frequent (word, count) pairs."""
    # Sorting an empty dict yields an empty list, so no separate emptiness check is needed
    sorted_WordCount = sorted(WordCount.items(), key=lambda v: v[1], reverse=True)
    for item in sorted_WordCount[:10]:  # print the top 10 words
        print('<' + str(item[0]) + '>:' + str(item[1]))
    return sorted_WordCount[:10]
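
The standard library's collections.Counter offers the same counting and top-N selection in fewer lines. This is an alternative sketch, not the code used in the project; the function name top_words is hypothetical.

```python
from collections import Counter

def top_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter(words)        # counts each item in a single pass
    return counts.most_common(n)   # ordered from most to least frequent

result = top_words(["wind", "gone", "wind", "with", "wind", "gone"], n=2)
```

In this example result is [("wind", 3), ("gone", 2)].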

　　(4) Write the results to the file result.txt.

def save_result(lines, words, items):
    """Save the line count, word count and top words to result.txt."""
    # Mode "w" creates the file if it does not exist and truncates it otherwise,
    # so no fallback to mode "x" is needed
    with open("result.txt", "w") as result:
        # str() is required because lines and words are integers
        result.write("lines:" + str(lines) + "\n")
        result.write("words:" + str(words) + "\n")
        for item in items:
            result.write('<' + str(item[0]) + '>:' + str(item[1]) + '\n')
    print('Finished writing result.txt')
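
Separating the formatting from the file I/O makes the output format easy to check without touching the disk. The following is a sketch under that assumption; format_result is a hypothetical helper, not part of the project.

```python
def format_result(lines, words, items):
    """Build the text that would be written to result.txt."""
    rows = [f"lines:{lines}", f"words:{words}"]
    rows.extend(f"<{word}>:{count}" for word, count in items)
    return "\n".join(rows) + "\n"

text = format_result(3, 12, [("wind", 4), ("gone", 2)])
```

The returned string can then be passed to a single write call on the open file.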

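　　2. Time and space complexity analysis

　　Time complexity: every loop in the program is a single, non-nested loop whose body does a constant amount of work per iteration (dictionary lookups and updates take O(1) on average), so the loops run in O(n), where n is the number of words. Python's sorted function runs in O(n log n) in the worst case, so the program as a whole runs in O(n log n).

　　Space complexity: the word list and the dictionary take O(n) space, and sorted needs at most O(n) additional space, so the space complexity of the program is O(n).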
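
Since only the ten most frequent entries are needed, a full sort is not strictly required. As a sketch of an alternative (not what the project does), heapq.nlargest selects the k largest items in roughly O(n log k) time. The names top_k and counts are illustrative.

```python
import heapq

def top_k(word_count, k=10):
    """Return the k (word, count) pairs with the highest counts."""
    # nlargest scans all n entries while tracking only the k best seen so far.
    return heapq.nlargest(k, word_count.items(), key=lambda item: item[1])

counts = {"wind": 5, "gone": 3, "with": 8, "scarlett": 13, "rhett": 2}
best = top_k(counts, k=3)
```

For this input best is [("scarlett", 13), ("with", 8), ("wind", 5)]. For a fixed k = 10 the selection step is effectively linear in the number of distinct words.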

　　3. Screenshots of a sample run

　　(1) After writing WordCount.py, run python WordCount.py Gone_with_the_wind.txt in a DOS window

　　(2) The results are saved in result.txt
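
The command in step (1) passes the text file as the first command-line argument. How WordCount.py reads that argument is not shown in this post; one plausible sketch uses argparse. The parser layout and the name parse_args are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Count word frequencies in an English text file.")
    parser.add_argument("path", help="path to the text file to analyse")
    return parser.parse_args(argv)

# Equivalent to running: python WordCount.py Gone_with_the_wind.txt
args = parse_args(["Gone_with_the_wind.txt"])
```

After parsing, args.path holds the file name and can be handed to open.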

　　Task 2: Advanced task

　　1. Program description

　　(1) Support stop words: words that appear in the stop-word list are skipped in the output

# Stop-word module (a fragment of process_buffer; words, regex_word and word_freq are defined there)
with open("stopwords.txt", 'r') as f:  # read the stop-word file
    # Each line read from the file ends with a newline, so strip it before storing the word
    stopWords = {line.strip() for line in f}
for word in words:
    # Only words that are not in the stop-word list are matched against the regex
    if word not in stopWords and re.match(regex_word, word):
        # Add 1 to the count; a word seen for the first time starts at 0
        word_freq[word] = word_freq.get(word, 0) + 1
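
A self-contained sketch of the same filtering step follows, storing the stop words in a set so that each membership test takes constant time on average. The function names and the in-memory stop-word list are assumptions made for the example; the project itself reads stopwords.txt.

```python
import io
import re

def load_stop_words(file):
    """Read one stop word per line, ignoring blank lines."""
    return {line.strip().lower() for line in file if line.strip()}

def count_words(words, stop_words, pattern=r"^[a-z]{4}\w*"):
    """Count words that match the pattern and are not stop words."""
    word_freq = {}
    for word in words:
        if word not in stop_words and re.match(pattern, word):
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq

# In-memory stand-in for stopwords.txt
stop_words = load_stop_words(io.StringIO("that\nwith\n\n"))
freq = count_words(["that", "wind", "with", "wind", "gone"], stop_words)
```

Here freq is {"wind": 2, "gone": 1}; "that" and "with" are dropped before the regex is even tried.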

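　　2. Time and space complexity analysis

　　Time complexity: the module uses only single, non-nested loops. Reading the stop-word list takes O(m) for m stop words, and the main loop visits each of the n words once. A membership test against a list scans up to m entries, so the loop costs O(n·m), which reduces to O(n) when the stop-word list has a fixed size or is stored in a set.

　　Space complexity: the word list and the frequency dictionary take O(n) space and the stop-word list takes O(m), so this part of the program needs O(n + m) space, or O(n) for a stop-word list of fixed size.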

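　　3. Screenshots of a sample run

　　　　Words in the stop-word list:

　　　　Program output after applying the stop-word list:

3. Performance analysis

　　Task 1

　　1. Profiling with cProfile

python -m cProfile WordCount.py Gone_with_the_wind.txt | grep WordCount.py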
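
cProfile can also be driven from inside Python, which gives the same kind of report without filtering the output through grep. The sketch below profiles a small stand-in function; count_words here is a placeholder workload, not the project's code.

```python
import cProfile
import io
import pstats

def count_words(text):
    """Stand-in workload: count whitespace-separated tokens."""
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq

profiler = cProfile.Profile()
profiler.enable()
freq = count_words("gone with the wind " * 1000)
profiler.disable()

# Sort by cumulative time and print only the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists call counts and per-function times, with count_words at the top because its cumulative time includes every call it makes.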

　　2. Visualization

F:\通大教學網\軟件工程>python -m cProfile -o result.out -s cumulative WordCount.py Gone_with_the_wind.txt
F:\通大教學網\軟件工程>python gprof2dot.py -f pstats result.out | dot -Tpng -o result.png

　　The resulting graph is shown below:

　　Task 2

　　1. Profiling with cProfile

　　2. Visualization

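4. Other

　　Time spent on pair programming: two days exploring, analysing and improving the code, plus one day writing this blog post, for a total of about 13 hours.

　　Pair programming photo:

5. Retrospective

　　1. Our discussion about the regular expression

　　　　At first we did not know that normalizing and restricting the words could be done with a single pattern. We combined len(word)>3 with other checks to get there, which took a lot of time and did not work well. Only after consulting finished program code did we understand how to improve our algorithm.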
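
To illustrate the difference, the sketch below compares a hand-written check built around len(word) > 3 with the regular expression. The manual check is a guess at what such a combination might look like, not the authors' earlier code, and the regex version uses re.fullmatch so that both functions judge the whole token.

```python
import re

def is_word_manual(word):
    """Hand-written rule: four leading lowercase letters, then word characters."""
    if len(word) <= 3:
        return False
    if not all("a" <= ch <= "z" for ch in word[:4]):
        return False
    return all(ch.isalnum() or ch == "_" for ch in word[4:])

def is_word_regex(word):
    """The same rule expressed as a single pattern."""
    return re.fullmatch(r"[a-z]{4}\w*", word) is not None

samples = ["gone", "the", "wind2", "4ever", "with_"]
manual = [is_word_manual(s) for s in samples]
by_regex = [is_word_regex(s) for s in samples]
```

Both lists come out as [True, False, True, False, True], but the pattern states the rule in one line where the manual version needs three separate checks.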

　　2. Peer evaluation