Software Engineering Practice, Assignment 2: WordCount

Date: 2020-05-07

Git repository: https://github.com/cwabc/PersonProject-Cios

1. Problem description

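Given the name of a txt file as a command-line argument, the program computes the following metrics for that file:

Number of characters in the file:

Only ASCII characters need to be counted; Chinese characters can be ignored.

Spaces, horizontal tabs, and newlines all count as characters.

Total number of words in the file. A word starts with at least four English letters, followed by any number of alphanumeric characters; words are delimited by separators and are case-insensitive.

English letters: A-Z, a-z

Alphanumeric characters: A-Z, a-z, 0-9

Separators: spaces and non-alphanumeric characters

Example: file123 is a word, 123file is not. file, File, and FILE are the same word.

Number of valid lines in the file: every line that contains a non-whitespace character is counted.

Number of occurrences of each word in the file; only the 10 most frequent words are output. Among words with the same frequency, the one that comes first in dictionary order is output first.

Output is written to the file result.txt in dictionary order: for example, if windows95, windows98, and windows2000 all appear, windows2000 is output first.

Output words are all in lowercase.
The output format is:
characters: number
words: number
lines: number
<word1>: number
<word2>: number
...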
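The word rule above can be sketched as a small standalone function. This is only an illustration of the stated rule, not the code used later in this post; the names lowerAscii, isLetter, isLetterOrDigit, and extractWords are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Lowercase ASCII letters only; every other byte is returned unchanged.
static char lowerAscii(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}
static bool isLetter(char c) { return c >= 'a' && c <= 'z'; }
static bool isLetterOrDigit(char c) { return isLetter(c) || (c >= '0' && c <= '9'); }

// Split text into tokens at every non-alphanumeric character and keep
// only the tokens whose first four characters are all letters.
std::vector<std::string> extractWords(const std::string &text) {
    std::vector<std::string> words;
    std::string token;
    // Go one position past the end so the final token is flushed as well.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        char c = (i < text.size()) ? lowerAscii(text[i]) : ' ';
        if (isLetterOrDigit(c)) {
            token += c;
            continue;
        }
        if (token.size() >= 4 && isLetter(token[0]) && isLetter(token[1]) &&
            isLetter(token[2]) && isLetter(token[3])) {
            words.push_back(token);
        }
        token.clear();
    }
    return words;
}
```

For the input "file123 123file File FILE" this yields file123, file, file: the token 123file is rejected as a whole because it starts with a digit.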

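This course assignment asks us to implement a console program that counts word frequencies in a text document. It relies on C++ file streams. Because I was unfamiliar with file stream handling in C++ and had also forgotten part of the C++ syntax, I spent quite a lot of time at the start looking up material on file streams. Once the text has been read in, it must be split into words; the occurrences of each valid word are counted, the words are ordered by frequency first and dictionary order second, and the ten most frequent words are printed and written to the file "result.txt".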

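So the key to the problem is to design a sound word-splitting algorithm that extracts words accurately and stores them in a suitable data structure, and then to use an appropriate algorithm to count the extracted words and sort them as required.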

2. Solution

1) PSP table

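PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning
• Estimate	• Estimate how long the task will take	700	990
Development	Development	600	890
• Analysis	• Requirements analysis (including learning new technology)	180	220
• Design Spec	• Produce the design document	20	30
• Design Review	• Design review	30	60
• Coding Standard	• Coding standard (set suitable conventions for the current development)	10	20
• Design	• Detailed design	30	60
• Coding	• Detailed coding	180	200
• Code Review	• Code review	30	100
• Test	• Testing (self-test, fix code, commit changes)	120	200
Reporting	Reporting	100	100
• Test Report	• Test report	10	10
• Size Measurement	• Measure the workload	30	20
• Postmortem & Process Improvement Plan	• Postmortem and process improvement plan	60	70
	Total	700	990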

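2) Implementation

(1) Header files and class definition

File handling in C++ requires the header <fstream>, and the std::map container used later to keep the words ordered requires <map>. A file class is defined with a public member content (the contents of the text file), private members characters (character count), lines (line count), and words (word count), and several member functions.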

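#include <cstdlib>   // malloc, free
#include <ctime>     // clock
#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

// MAXN, the capacity of the content buffer, is not defined in the code shown in this post.

class testfile
{
public:
    testfile countcha(char *, testfile);   // count characters
    testfile countword(char *, testfile);  // count words
    testfile countline(char *, testfile);  // count lines
    int getcharacters();
    int getlines();
    int getwords();
    char *content;  // holds the contents of the text file
    void init();
private:
    int characters;
    int words;
    int lines;
};

void testfile::init()
{
    characters = 0;
    words = 0;
    lines = 0;
    // sizeof(char*) reserves several bytes per element; sizeof(char) would be
    // enough as long as MAXN is at least the size of the file
    content = (char*)malloc(sizeof(char*) * MAXN);
}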

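(2) Counting the characters and lines of the text file

The character and line counts are obtained with C++ text-file input, reading character by character and line by line respectively. For the character count, the project requires spaces and newlines to be read too, so whitespace skipping is turned off. Each function also checks whether the file was opened successfully.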

testfile testfile::countcha(char *t, testfile f1)
{
    int i = 0;
    ifstream myfile;
    myfile.open(t);
    if (!myfile.is_open()) {
        cout << "Failed to open file" << endl;
        return f1;  // nothing to count
    }
    char c;
    myfile >> noskipws;  // force spaces and newlines to be read
    while (myfile >> c)  // testing the read itself avoids counting the last character twice
        i++;
    f1.characters = i;
    myfile.close();
    return f1;
}

testfile testfile::countline(char *t, testfile f1)
{
    ifstream myfile;
    myfile.open(t, ios::in);
    int i = 0;
    string temp;  // receives each line from getline
    if (!myfile.is_open()) {
        cout << "Failed to open file" << endl;
        return f1;
    }
    while (getline(myfile, temp)) {
        // a line is valid only if it contains a non-whitespace character
        if (temp.find_first_not_of(" \t\r\v\f") == string::npos)
            continue;
        i++;
    }
    f1.lines = i;
    myfile.close();
    return f1;
}
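Both counts can also be gathered in one pass over the stream, which avoids opening the file twice. The following is a minimal sketch under that assumption, not the code used in this post; BasicCounts and countBasics are hypothetical names.

```cpp
#include <istream>
#include <sstream>  // only needed to try the function on an in-memory string

struct BasicCounts {
    long characters = 0;  // every character read, including spaces, tabs, and newlines
    long lines = 0;       // lines that contain at least one non-whitespace character
};

// Count characters and valid lines in a single pass over the stream.
BasicCounts countBasics(std::istream &in) {
    BasicCounts result;
    bool lineHasContent = false;
    char c;
    while (in.get(c)) {  // get() never skips whitespace
        ++result.characters;
        if (c == '\n') {
            if (lineHasContent) ++result.lines;
            lineHasContent = false;
        } else if (c != ' ' && c != '\t' && c != '\r' && c != '\v' && c != '\f') {
            lineHasContent = true;
        }
    }
    if (lineHasContent) ++result.lines;  // last line without a trailing newline
    return result;
}
```

Because the function takes a std::istream, it can be called with a std::ifstream opened on the input file or, for a quick check, with a std::istringstream.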

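(3) Counting words and storing them

Words are counted and stored one by one in a std::map, an associative container that maintains the word-to-count mapping automatically, with O(log n) lookups. Internally the map is a balanced binary search tree (typically a red-black tree) that keeps its keys sorted, so the words end up in dictionary order and the frequency of each word can be looked up directly. The splitting algorithm reads the file into the public member content of the testfile class, converting uppercase to lowercase, and then splits content into tokens at every non-alphanumeric character; a token whose first four characters are all letters counts as a word.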

map<string, int> mapword1;  // word -> frequency; std::map keeps the keys in dictionary order

// Record one occurrence of a word.
void loadword(const string &w)
{
    ++mapword1[w];  // operator[] inserts the word with count 0 if it is not present yet
}

// True for the characters allowed inside a token: 0-9 and a-z
// (uppercase letters have already been converted to lowercase).
static bool isalnumlower(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z');
}

testfile testfile::countword(char *t, testfile f1)
{
    int n = 0;
    ifstream myfile;
    myfile.open(t);
    if (!myfile.is_open()) {
        cout << "Failed to open file" << endl;
        return f1;
    }
    char c;
    myfile >> noskipws;
    while (myfile >> c) {
        if (c >= 'A' && c <= 'Z')
            c += 32;          // convert uppercase to lowercase
        f1.content[n++] = c;  // copy the file data into the content buffer
    }
    myfile.close();

    int words = 0;
    int i = 0;
    while (i < n) {
        if (!isalnumlower(f1.content[i])) {  // skip separators
            i++;
            continue;
        }
        // Take the whole run of letters and digits as one token, so a rejected
        // token such as 123filename is never re-scanned from the middle.
        string w;
        while (i < n && isalnumlower(f1.content[i]))
            w += f1.content[i++];
        // A token is a word only if its first four characters are all letters.
        bool isword = w.size() >= 4;
        for (int m = 0; isword && m < 4; m++) {
            if (w[m] < 'a' || w[m] > 'z')
                isword = false;
        }
        if (isword) {
            loadword(w);  // store the word in the map
            words++;
        }
    }
    f1.words = words;
    return f1;
}
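The two properties of std::map relied on here, counting through operator[] and iteration in dictionary order, can be seen in isolation in the sketch below. countWords is a hypothetical helper written for this illustration only.

```cpp
#include <map>
#include <string>
#include <vector>

// Build a word -> frequency table from a list of already-extracted words.
std::map<std::string, int> countWords(const std::vector<std::string> &words) {
    std::map<std::string, int> freq;
    for (const std::string &w : words) {
        ++freq[w];  // a missing key is first inserted with value 0, then incremented
    }
    return freq;
}
```

Iterating over the result of countWords({"windows98", "windows2000", "windows95", "windows98"}) visits windows2000, windows95, windows98 in that order, because std::string keys compare character by character and '2' sorts before '9'.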

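(4) Sorting the words in the map by frequency and printing the required output, done in the main function

A struct sWord holds a word w and its frequency count. The words and frequencies returned in turn by the map are stored in an array of sWord, which is then sorted by count in descending order. After the required output has been printed to the console it is also written to the file "result.txt", and the memory is released.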

struct sWord  // a word together with its frequency
{
    string w;
    int count;
};

// Merge the sorted halves a[l..mid] and a[mid+1..r], using c as scratch space.
// On equal counts the left element goes first, so the sort is stable.
void merge(sWord *a, sWord *c, int l, int mid, int r)
{
    int i = l, j = mid + 1, m = 0;  // m starts at 0 so c is never indexed past its end
    while (i <= mid && j <= r) {
        if (a[i].count < a[j].count)
            c[m++] = a[j++];
        else
            c[m++] = a[i++];
    }
    while (i <= mid)
        c[m++] = a[i++];
    while (j <= r)
        c[m++] = a[j++];
    for (int k = 0; k < m; k++)
        a[l + k] = c[k];
}

void sort(sWord *a, sWord *c, int l, int r)
{
    if (l < r) {
        int mid = (l + r) / 2;
        sort(a, c, l, mid);
        sort(a, c, mid + 1, r);
        merge(a, c, l, mid, r);
    }
}

int main(int argc, char *argv[])
{
    clock_t start = clock();
    int i, num = 0;
    testfile f1;
    f1.init();
    if (argc < 2) {
        cout << "No file name given, or the file does not exist" << endl;
        return 0;
    }
    f1 = f1.countcha(argv[1], f1);
    f1 = f1.countline(argv[1], f1);
    f1 = f1.countword(argv[1], f1);
    // the total word count is an upper bound on the number of distinct words
    sWord *ww = new sWord[f1.getwords()];
    sWord *temp = new sWord[f1.getwords()];
    for (map<string, int>::iterator it = mapword1.begin(); it != mapword1.end(); it++) {
        ww[num].w = it->first;
        ww[num].count = it->second;
        num++;
    }
    // the words are already in dictionary order; a stable merge sort by
    // descending frequency keeps that order among equal counts
    sort(ww, temp, 0, num - 1);

    // output
    ofstream fout;
    fout.open("result.txt");
    if (!fout)
        cout << "Failed to open file" << endl;
    cout << "characters: " << f1.getcharacters() << endl;
    fout << "characters: " << f1.getcharacters() << endl;
    cout << "words: " << f1.getwords() << endl;
    fout << "words: " << f1.getwords() << endl;
    cout << "lines: " << f1.getlines() << endl;
    fout << "lines: " << f1.getlines() << endl;
    int top = num < 10 ? num : 10;  // at most ten words are printed
    for (i = 0; i < top; i++) {
        cout << "<" << ww[i].w << ">" << ": " << ww[i].count << endl;
        fout << "<" << ww[i].w << ">" << ": " << ww[i].count << endl;
    }
    delete[] ww;
    delete[] temp;
    free(f1.content);  // release the dynamically allocated buffers
    clock_t ends = clock();
    cout << "Run time: " << (double)(ends - start) / CLOCKS_PER_SEC << " seconds" << endl;
    return 0;
}

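(5) Code optimization

Regarding the use of map: I later learned that the word/count pairs can also be output in descending order of frequency with standard-library facilities (for example, by copying them into a vector and sorting with a custom comparator), which would replace the hand-written word struct and sorting code. I had not mastered this well enough, so I did not use it. For sorting the structs in descending order, my first idea was to build a max-heap, but heap sort is not a stable sorting algorithm and would destroy the existing dictionary order. I therefore considered merge sort, which is stable, but ran into a bug whose cause I could not find, so bubble sort was used for the time being (the listing above already reflects the later switch to merge sort described in the 2018-09-18 update). I will study this area further and improve the code.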
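One possible shape of the standard-library approach mentioned above is sketched here. It is an assumption about how it could be done, not what the project code does; topWords and WordCount are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;

// Return the k most frequent entries: higher count first,
// dictionary order among equal counts.
std::vector<WordCount> topWords(const std::map<std::string, int> &freq, std::size_t k) {
    std::vector<WordCount> entries(freq.begin(), freq.end());
    k = std::min(k, entries.size());
    std::partial_sort(entries.begin(),
                      entries.begin() + static_cast<std::ptrdiff_t>(k),
                      entries.end(),
                      [](const WordCount &a, const WordCount &b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });
    entries.resize(k);  // keep only the sorted prefix
    return entries;
}
```

Because the comparator breaks ties on the word itself, the result does not depend on the sort being stable, and std::partial_sort only orders the first k entries instead of the whole table.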

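3. Error detection

If the file name given on the command line is wrong, or the current folder does not contain the corresponding txt file, the program prints an error message and exits.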

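if (argc < 2) {
    cout << "No file name given, or the file does not exist" << endl;
    return 0;
}

ifstream myfile;
myfile.open(t);
if (!myfile.is_open()) {
    cout << "Failed to open file" << endl;
    return f1;  // skip the counting when the file cannot be opened
}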
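The two checks can also be combined into one helper, so that the caller gets a single yes/no answer and can stop with a non-zero exit code. This is a sketch only; openInput and the wording of the messages are hypothetical and not part of the project.

```cpp
#include <fstream>
#include <iostream>

// Open the file named by the first command-line argument.
// Returns false, after reporting on stderr, if the argument is missing
// or the file cannot be opened.
bool openInput(int argc, char *argv[], std::ifstream &file) {
    if (argc < 2) {
        std::cerr << "usage: wordcount <file.txt>" << std::endl;
        return false;
    }
    file.open(argv[1]);
    if (!file.is_open()) {
        std::cerr << "cannot open " << argv[1] << std::endl;
        return false;
    }
    return true;
}
```

In main this could be used as: declare a std::ifstream, call openInput(argc, argv, file), and return 1 when it yields false, so that scripts calling the program can tell a failure from a successful run.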

4. Example test

1) Test file: input6.txt

A large-text test with more than two million characters.

2) Test results

5. Class encapsulation

1) Encapsulating the testfile class

#ifndef wordcount_h
#define wordcount_h

class testfile
{
public:
    testfile countcha(char *, testfile);   // count characters
    testfile countword(char *, testfile);  // count words
    testfile countline(char *, testfile);  // count lines
    int getcharacters();
    int getlines();
    int getwords();
    char *content;  // holds the contents of the text file
    void init();
private:
    int characters;
    int words;
    int lines;
};

#endif

2) Functional test

#include "wordcount.h"
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main()
{
    string filename;  // a std::string avoids overflowing a fixed-size buffer on long names
    testfile f1;
    cin >> filename;
    f1.init();
    f1 = f1.countcha(&filename[0], f1);
    cout << f1.getcharacters() << endl;
    return 0;
}

The character-count function works correctly.

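6. Performance analysis

The data set of two million characters takes a little over 16 seconds to process, and I have already optimized as much as I could... However, when the exe is placed on the desktop and run from there, it finishes in under 5 seconds, and I do not know why.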
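One way to narrow such a difference down would be to time each stage (character count, line count, word count, sorting) separately with a monotonic clock. The helper below is a sketch for that purpose; secondsTaken is a hypothetical name and is not part of the project.

```cpp
#include <chrono>

// Run the given callable once and return the elapsed wall-clock time in seconds.
template <typename F>
double secondsTaken(F &&work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Each stage can be wrapped in a lambda and passed to secondsTaken, and the per-stage times compared between the two ways of launching the program.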

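7. Summary

Through this project I reviewed some C++ syntax again and learned how to handle text-file input and output in C++. While learning, though, I also realized that I have practiced algorithms far too little: I can follow the idea behind many algorithms but cannot write them out, and the result either fails to compile or stops with an unknown error. This shows a lack of fluency, and I need to work harder on this from now on!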

2018-09-18：

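Added a description of the simple error detection to the blog post, replaced the original bubble sort with merge sort, and revised some details of the code. To keep the functions independent, each one reads the text file again, which lowers the program's efficiency. Correcting and improving step by step shows me my own shortcomings and has let me learn a lot from the strong classmates around me; I feel this is one of the most important reasons for taking this course. I also think an assignment should not be put aside the moment it is finished, especially a coding one: many errors are not obvious to us when we first run the program, but plenty of problems surface as time goes by. The thought that future students might see my bug-ridden code and laugh at it makes me a little nervous...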
