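Modern Software Engineering Individual Assignment: Word Frequency Statistics (Character Count, Line Count, Word Count, High-Frequency Words and Phrases)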

Date: 2020-05-07

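Tags: modern software engineering, individual assignment, word frequency statistics, characters, lines, words, high frequency, phrases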


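I did quite poorly on the first individual assignment of the Modern Software Engineering course, and it made me clearly aware of the gap between myself and others.

In this post I show how I ended up on a path of twice the effort for half the result, and I analyze what went wrong, in the hope that you will not repeat my mistakes.

First, the assignment requirements. The full details are on instructor Deng Hongping's blog: First Individual Assignment: Word Frequency Statistics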

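The main features of this word frequency program are:

1. Count the number of characters in the file (only ASCII characters need to be counted; Chinese characters, line breaks and '\0' are ignored) (ASCII codes in [32, 126])

2. Count the total number of words in the file

3. Count the total number of lines in the file (a line made up of any characters counts) (do not just count newline characters; watch for a last line without a trailing newline) (an empty line counts as a line)

4. Count the number of occurrences of each word in the file

5. Process all files in the given folder and, recursively, its subfolders

6. Count how often two words appear together (a phrase) and output the 10 most frequent.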
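Features 1 and 3 above can be sketched in a single pass over the file contents. This is a minimal sketch, not the author's implementation: the names `TextCounts` and `countText` are hypothetical, and it assumes that an empty file has zero lines and that a trailing newline does not start an extra line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; the field names are illustrative only.
struct TextCounts {
    std::size_t characters = 0;  // bytes whose ASCII code lies in [32, 126]
    std::size_t lines = 0;       // every line, including empty ones
};

// Scans the text once, counting printable characters and lines.
// Assumption: a final line without a trailing '\n' still counts as one line,
// while a trailing '\n' does not open an additional empty line.
TextCounts countText(const std::string& text) {
    TextCounts counts;
    bool lineOpen = false;  // true when the current line has content but no '\n' yet
    for (char ch : text) {
        if (ch == '\n') {
            ++counts.lines;
            lineOpen = false;
        } else {
            lineOpen = true;
            if (ch >= 32 && ch <= 126) ++counts.characters;
        }
    }
    if (lineOpen) ++counts.lines;  // last line had no trailing newline
    return counts;
}
```

For the text "ab\n\ncd" this reports 3 lines (the empty middle line is counted) and 4 characters.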

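Notes:

a) Spaces, horizontal tabs and line breaks all count as characters

b) Definition of a word: it starts with at least 4 English letters, followed by alphanumeric characters; words are separated by delimiters, and case is ignored.

English letters: A-Z, a-z

Alphanumeric characters: A-Z, a-z, 0-9

Delimiters: spaces and non-alphanumeric characters

For example, "file123" is a word and "123file" is not. file, File and FILE are the same word.

If two words differ only in their trailing digits, they are treated as the same word. For example, windows, windows95 and windows7 are the same word, and iPhone4 and IPhone5 are the same word, but windows and windows32a are different words because the difference is not limited to trailing digits. Output follows dictionary order: when windows95, windows98 and windows2000 all appear, output windows2000. Only word lengths in [4, 1024] need to be considered; anything outside this range is not counted.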
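One way to make this equivalence rule cheap to test is to reduce every word to a comparison key once, and then compare keys with ordinary string equality. The sketch below is only an illustration of that idea; the helper name `normalizeKey` is hypothetical and does not appear in the assignment or in the listings later in this post.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a word to the key shared by all words that the
// assignment treats as identical (trailing digits dropped, case ignored).
std::string normalizeKey(const std::string& word) {
    std::size_t end = word.size();
    // Drop digits from the end only; digits in the middle stay significant.
    while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) {
        --end;
    }
    std::string key(word, 0, end);
    for (char& c : key) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return key;
}
```

With this key, windows, windows95 and windows7 all map to WINDOWS, while windows32a maps to WINDOWS32A and therefore stays a different word.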

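c) Definition of a phrase: windows95 good and windows2000 good123 count as the same phrase. Output follows dictionary order.

When three consecutive words are the same word, as in good123 good456 good789, all three are the same word as good123, so the final result is that the phrase good123 good123 appears 2 times.

Two words on different lines can still form a phrase. For phrases, the only thing that matters is whether the words are adjacent in order.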
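The phrase rule can be sketched as counting adjacent pairs over the sequence of words. This is a minimal sketch under one assumption: the input vector already holds the words of the whole input in order, each reduced to a comparison key (for example GOOD for good123). The function name `countPhrases` is hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using PhraseKey = std::pair<std::string, std::string>;

// Counts every pair of neighbouring words. Line breaks play no role here
// because the caller passes one flat sequence of word keys.
std::map<PhraseKey, unsigned> countPhrases(const std::vector<std::string>& keys) {
    std::map<PhraseKey, unsigned> freq;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        ++freq[std::make_pair(keys[i - 1], keys[i])];
    }
    return freq;
}
```

For the keys GOOD, GOOD, GOOD (the good123 good456 good789 case) the single phrase (GOOD, GOOD) is counted twice, which matches the rule above.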

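d) The input file name is passed as a command-line argument. To traverse a whole folder, pass the folder path instead.

e) Output file: result.txt

characters: number

words: number

lines: number

<word>: number

<word> uses the casing that actually appears in the file. For example, if only File and file appear in the file, the program must not output FILE. <word> is chosen by dictionary order (based on ASCII), so in this example the program should output File: 2

f) Determine from the command-line argument whether it is a directory

g) Tally the words across all files and output a single overall word frequency result.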

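Scoring criteria

1. Count the number of characters in the file (1 point)

2. Count the total number of words in the file (1 point)

3. Count the total number of lines in the file (1 point)

4. Count the number of occurrences of each word in the file (1 point)

5. Process all files in the given folder and, recursively, its subfolders (2 points)

6. Count how often two words appear together (a phrase) and output the 10 most frequent (2 points)

If any of the six results above is wrong, the corresponding subtask scores -1. If all are correct, ranking is decided by running time (sorted by ascending time, the top 30% get the full 8 points, 30%-70% get 7.5 points, and the bottom 30% get 7 points).

7. Blog write-up (implementation process, performance analysis, optimization report, etc.) (2 points)

8. Performance analysis under Linux, with the process written up in the blog (bonus, 2 points)

Time limit: one week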

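Requirements analysis:

1. Counting characters and lines is easy to implement.

2. Counting total words: the assignment defines a word as 4 English letters followed by zero or more letters or digits, with a length in [4, 1024]. Matching strings of a given format is commonly done with a regular expression and an iterator.

3. Counting the occurrences of each word and outputting the 10 most frequent: words are stored in a container. Each time a word appears, look up whether it already exists; if it does, increment its frequency, otherwise add it to the container. How to decide whether two words are equal is a key point. Once all words are collected, sort them by frequency and output the top 10.

4. Outputting the 10 most frequent phrases: two adjacent words form a phrase, which also needs duplicate checking and sorting by frequency.

5. Processing all files in the given folder and, recursively, its subfolders: determine whether the path is a directory or a file; if it is a directory, get the names of the files in it and process each one.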

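Then I took my first wrong step: choosing an unfamiliar programming language

Generally, to finish a complex and difficult project in a short time, you should pick a programming language you already know well.

I am fairly comfortable with C but had never studied C++ systematically, and I hoped to pick up C++ through the practice in this course. So I naively spent a whole day reading C++ Primer (╯°Д°）╯ and hastily started coding the next day. Many people know how dense that classic is, so you can imagine how much I absorbed in one day. In fact, more of that time should have gone into analyzing the requirements further, looking up solutions in a targeted way, designing several candidate approaches with both functionality and performance in mind, and picking a sound one after comparison and testing. I still recommend writing this assignment in C++; it is just that with such a tight schedule, more time should go to requirements analysis and up-front design, and new C++ knowledge can be looked up as needed.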

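Then I went wrong again: choosing an unsuitable data structure

The data structure is the most important decision and deserves great care

Without finding out the data volume first, I chose vector as the container, because std::find and std::sort can be used on it directly. This kind of laziness is not acceptable.

Design work should put the requirements first, not your current programming skill or the amount of work involved.

I did consider map. Why did I not choose it? Mainly because my initial idea was to define one class for words and one for phrases, store the word and its frequency in the class, and overload the == operator to test equality; in addition, if an equal word came before the stored word in dictionary order, the stored word would be replaced. With a map, the word would be the key and the frequency the value, but map does not allow a key to be modified, because map keeps its elements ordered by key and the key is the basis of that ordering. So I simply dropped it.

In fact, you can build two maps that share the simplified form of a word as their key: the value in one map is the full form of the word, and the value in the other is its frequency. (The teaching assistant's idea.)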
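The two-map idea can be sketched as follows. This is only an illustration: the post does not show the teaching assistant's code, so the type `WordTable`, its members and the rule used in `simplify` (drop trailing digits, uppercase the rest) are assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical pair of maps sharing the simplified word as key.
struct WordTable {
    std::map<std::string, std::string> display;  // key -> smallest spelling seen so far
    std::map<std::string, unsigned> frequency;   // key -> number of occurrences

    // Assumed simplification: drop trailing digits, ignore case.
    static std::string simplify(const std::string& word) {
        std::size_t end = word.size();
        while (end > 0 && std::isdigit(static_cast<unsigned char>(word[end - 1]))) --end;
        std::string key(word, 0, end);
        for (char& c : key) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return key;
    }

    void add(const std::string& word) {
        const std::string key = simplify(word);
        auto it = display.find(key);
        if (it == display.end()) {
            display.emplace(key, word);   // first time this word is seen
        } else if (word < it->second) {
            it->second = word;            // keep the spelling that sorts first (ASCII order)
        }
        ++frequency[key];
    }
};
```

Because the spelling is stored as a value rather than as the key, it can be replaced freely, which sidesteps the "map keys cannot be modified" problem described above.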

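In addition, C++11 provides unordered_map, which is built on a hash table: lookup takes O(1) time on average and the elements are not kept sorted by key, but it uses relatively more space. To sort a map by value, however, you generally copy its data as pairs into a sequence container (such as the vector I chose) and sort that with sort. std::sort is efficient, with O(n log n) comparisons; common implementations use introsort, a quicksort hybrid, rather than plain quicksort.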
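The copy-then-sort step can be sketched like this. Since only the first ten entries are needed, the sketch swaps in `std::partial_sort` for a full `std::sort`; that substitution, the function name `topWords` and the tie-break by dictionary order are assumptions, not something the original post does.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, unsigned>;

// Copies the (word, count) pairs into a vector and orders only the first
// `limit` of them: higher count first, ties broken by dictionary order.
std::vector<Entry> topWords(const std::unordered_map<std::string, unsigned>& freq,
                            std::size_t limit = 10) {
    std::vector<Entry> entries(freq.begin(), freq.end());
    const std::size_t n = std::min(limit, entries.size());
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(),
                      [](const Entry& a, const Entry& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });
    entries.resize(n);  // keep only the requested top entries
    return entries;
}
```

The copy is a one-time O(n) cost, and the partial sort avoids fully ordering the words that will never be printed.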

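If I had another chance, what would I choose? map. Moving the data from the map into a vector does take time, but it only has to be done once, which is O(n). By contrast, find on a vector is a linear search, and every incoming word has to be checked for duplicates, so the total cost is already O(n*n). Later I wrote my own hash lookup function, but combining it with on-demand memory allocation would be very complicated (I ran out of time and did not implement it (ಥ_ಥ)). On balance, map or unordered_map is the more reasonable choice.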

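Now for the code.

First, version one, which uses the STL algorithms find and sort on vector.

Notes: 1. This version may still contain some small bugs and some clumsy code; feel free to point them out. 05.txt is a text file used for testing.

2. The whole content of a file is read into one string, which is used to count characters and to collect new words and phrases.

3. New words are extracted from str with a regular expression; see the getNewExpr function.

4. Before using sort you need to define a comparison function or overload the > and < operators; I chose the former.

5. getAllFiles determines whether a path is a directory or a file; for a directory, it collects all file names into a vector of string.

6. When reading the source, start with the class definitions and then follow the flow of execution from main.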

#include <algorithm>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <io.h>  // _findfirst / _findnext (Windows only)
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

int begFlag = 1;  // 1 until the very first word has been seen

typedef struct {
    unsigned int charNum;
    unsigned int lineNum;
    unsigned int wordNum;
} amount;

class word {
private:
    string wordStr;
    unsigned int freq;

public:
    word() = default;
    word(string str) {
        wordStr = str;
        freq = 1;
    }
    string getWordStr() { return wordStr; }
    unsigned int getFreq() { return freq; }
    void addFreq() { freq++; }
    // keep the spelling that comes first in dictionary order
    void resetWordStr(string str) {
        if (str < wordStr) {
            wordStr = str;
        }
    }
    // two words are equal if they match after dropping trailing digits, ignoring case
    bool operator==(const word &obj) const {
        const string &word1 = this->wordStr, &word2 = obj.wordStr;
        int i = word1.length() - 1;
        int j = word2.length() - 1;
        while (i >= 0 && isdigit((unsigned char)word1[i])) i--;
        while (j >= 0 && isdigit((unsigned char)word2[j])) j--;
        if (i != j) return false;
        for (int t = 0; t <= i; t++) {
            if (tolower((unsigned char)word1[t]) != tolower((unsigned char)word2[t]))
                return false;
        }
        return true;
    }
    void printWord(ofstream &output) { output << wordStr << "\t" << freq << endl; }
};

class phrase {
private:
    unsigned int freq;
    word part1, part2;

public:
    // lacks a default constructor
    phrase(word part1, word part2) {
        this->part1 = part1;
        this->part2 = part2;
        freq = 1;
    }
    word getPart1() { return part1; }
    word getPart2() { return part2; }
    unsigned int getFreq() { return freq; }
    void addFreq() { freq++; }
    void resetPhrase(phrase &obj) {
        word objPart1 = obj.getPart1();
        word objPart2 = obj.getPart2();
        // '||' is a short-circuit operator
        if (objPart1.getWordStr() < this->part1.getWordStr() ||
            objPart2.getWordStr() < this->part2.getWordStr()) {
            this->part1 = objPart1;
            this->part2 = objPart2;
        }
    }
    bool operator==(const phrase &obj) const {
        return part1 == obj.part1 && part2 == obj.part2;
    }
    // prints the two words with the lexicographically smaller one first
    void printPhrase(ofstream &output) {
        string word1 = part1.getWordStr(), word2 = part2.getWordStr();
        if (word1 < word2)
            output << word1 + " " + word2 << "\t" << freq << endl;
        else
            output << word2 + " " + word1 << "\t" << freq << endl;
    }
};

bool wordCompare(word former, word latter) { return former.getFreq() > latter.getFreq(); }

bool phraseCompare(phrase former, phrase latter) { return former.getFreq() > latter.getFreq(); }

void examineNewWord(vector<word> &wvec, word &newWord) {
    vector<word>::iterator end = wvec.end();
    vector<word>::iterator itr = find(wvec.begin(), end, newWord);  // is there any repetition?
    if (itr != end) {  // this word already exists in wvec
        itr->resetWordStr(newWord.getWordStr());
        itr->addFreq();
    } else {
        wvec.push_back(newWord);
    }
}

void examineNewPhr(vector<phrase> &pvec, phrase &newPhrase) {
    vector<phrase>::iterator end = pvec.end();
    vector<phrase>::iterator itr = find(pvec.begin(), end, newPhrase);  // is there any repetition?
    if (itr != end) {
        itr->resetPhrase(newPhrase);
        itr->addFreq();
    } else {
        pvec.push_back(newPhrase);
    }
}

/* collect all expressions in str that match the definition of a word */
void getNewExpr(string &str, vector<word> &wvec, vector<phrase> &pvec, amount &result) {
    word newWord;
    regex reg("[[:alpha:]]{4}[[:alnum:]]{0,1020}");
    // static: like begFlag, the previous word carries over from one file to the next
    static word part1("");
    word part2("");
    phrase newPhrase(part1, part2);
    /* remember the previous word, then pair it with the current word to form a phrase */
    for (sregex_iterator it(str.begin(), str.end(), reg), end_it; it != end_it; it++) {
        result.wordNum++;
        newWord = word(it->str());
        examineNewWord(wvec, newWord);
        if (begFlag) {
            begFlag = 0;
            part1 = newWord;
        } else {
            part2 = newWord;
            newPhrase = phrase(part1, part2);
            examineNewPhr(pvec, newPhrase);
            part1 = part2;
        }
    }
}

/* count the characters with ASCII code within [32,126] */
unsigned long getCharNum(string &str) {
    unsigned long charNum = 0;
    for (string::iterator citr = str.begin(); citr != str.end(); citr++) {
        if (*citr >= 32 && *citr <= 126) charNum++;
    }
    return charNum;
}

/* count the lines in one file; note that a file ending in '\n' is counted
   with one extra (empty) line, and an empty file counts as one line */
unsigned long getLineNum(string filename) {
    ifstream input(filename);
    unsigned long lines = 0;
    string str;
    while (!input.eof()) {
        getline(input, str);
        lines++;
    }
    return lines;
}

/* process one file: update the character and line totals and collect all
   expressions that match the word definition into wvec and pvec */
void fileProcess(string filename, amount &result, vector<word> &wvec, vector<phrase> &pvec) {
    ifstream input(filename);
    if (!input.is_open()) {
        cout << "cannot open the file";
        return;
    }
    stringstream buffer;
    buffer << input.rdbuf();
    string srcStr = buffer.str();
    result.charNum += getCharNum(srcStr);    // update the amount of characters
    result.lineNum += getLineNum(filename);  // update the amount of lines
    getNewExpr(srcStr, wvec, pvec, result);  // update wvec and pvec
    input.close();
}

/* print the results */
void getResult(const char *resfile, amount &result, vector<word> &wvec, vector<phrase> &pvec) {
    ofstream output(resfile);
    output << "char_number :" << result.charNum << endl;
    output << "line_number :" << result.lineNum << endl;
    output << "word_number :" << result.wordNum << endl;

    // sort wvec in descending frequency order and print at most ten words
    sort(wvec.begin(), wvec.end(), wordCompare);
    output << " " << endl;
    output << "the top ten frequency of words" << endl;
    vector<word>::iterator wlast = wvec.size() < 10 ? wvec.end() : wvec.begin() + 10;
    for (vector<word>::iterator witr = wvec.begin(); witr != wlast; witr++) {
        witr->printWord(output);
    }

    // sort pvec in descending frequency order and print at most ten phrases
    sort(pvec.begin(), pvec.end(), phraseCompare);
    output << " " << endl;
    output << "the top ten frequency of phrases" << endl;
    vector<phrase>::iterator plast = pvec.size() < 10 ? pvec.end() : pvec.begin() + 10;
    for (vector<phrase>::iterator pitr = pvec.begin(); pitr != plast; pitr++) {
        pitr->printPhrase(output);
    }
}

/* if path is a directory, push the names of all files under it (recursively)
   into files and return 0; otherwise return -1 */
int getAllFiles(string path, vector<string> &files) {
    intptr_t hFile = 0;  // _findfirst returns intptr_t; long truncates it on 64-bit builds
    int flag = -1;
    struct _finddata_t fileinfo;
    string p;
    if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1) {
        flag = 0;
        do {  // do-while so that the entry returned by _findfirst is not skipped
            if (fileinfo.attrib & _A_SUBDIR) {  // it is a folder
                if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0) {
                    getAllFiles(p.assign(path).append("/").append(fileinfo.name), files);
                }
            } else {  // it is a file
                files.push_back(p.assign(path).append("/").append(fileinfo.name));
            }
        } while (_findnext(hFile, &fileinfo) == 0);
        _findclose(hFile);
    }
    return flag;
}

int main(int argc, char *argv[]) {
    amount result;
    result.charNum = 0;
    result.lineNum = 0;
    result.wordNum = 0;
    vector<word> wvec;
    vector<phrase> pvec;
    vector<string> fvec;
    string path = "05.txt";
    const char *resFile = "AllFiles.txt";
    int dirFlag = getAllFiles(path, fvec);
    if (dirFlag == 0) {
        for (vector<string>::iterator it = fvec.begin(); it != fvec.end(); it++) {
            fileProcess(*it, result, wvec, pvec);
        }
    } else {
        fileProcess(path, result, wvec, pvec);
    }
    getResult(resFile, result, wvec, pvec);
    system("pause");
}

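Output:

Results from the Visual Studio Performance Wizard:

This shows that the find lookup in examineNewPhr is very time-consuming, because find relies on the overloaded == operator to test equality.

To address this, I decided to store a hash table in the vector.

Version two:

Notes: this version implements only word counting, not phrase counting. Apart from the changed word class and examineNewWord function and the added Hash function, the rest is largely unchanged. newsample is the test set for this assignment, about 175 MB of data. Version one above cannot get through it at all.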

#include <algorithm>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <io.h>  // _findfirst / _findnext (Windows only)
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

#define WORD_POOL_SIZE 18000000  // number of slots in the hash table
#define MAX_FIGURES 20           // Hash stops once this character count is exceeded

int begFlag = 1;

typedef struct {
    unsigned int charNum;
    unsigned int lineNum;
    unsigned int wordNum;
} amount;

class word {
public:
    string wordStr;
    unsigned int freq;

    word(string str, unsigned int fre) {
        wordStr = str;
        freq = fre;
    }
    // keep the spelling that comes first in dictionary order
    void resetWordStr(string str) {
        if (str < wordStr) {
            wordStr = str;
        }
    }
    // two words are equal if they match after dropping trailing digits, ignoring case
    bool operator==(const word &obj) const {
        const string &word1 = this->wordStr, &word2 = obj.wordStr;
        int i = word1.length() - 1;
        int j = word2.length() - 1;
        while (i >= 0 && isdigit((unsigned char)word1[i])) i--;
        while (j >= 0 && isdigit((unsigned char)word2[j])) j--;
        if (i != j) return false;
        for (int t = 0; t <= i; t++) {
            if (tolower((unsigned char)word1[t]) != tolower((unsigned char)word2[t]))
                return false;
        }
        return true;
    }
    void printWord(ofstream &output) { output << wordStr << "\t" << freq << endl; }
};

bool wordCompare(word former, word latter) { return former.freq > latter.freq; }

/* polynomial hash over the characters of str up to the first '\0' */
unsigned int Hash(string str) {
    const char *p = str.c_str();
    unsigned int seed = 7, key;
    unsigned long long hash = 0;
    int figures = 0;
    while (*p != '\0' && figures <= MAX_FIGURES) {
        hash = hash * seed + (*p);
        p++;
        figures++;
    }
    key = hash % WORD_POOL_SIZE;
    return key;
}

void examineNewWord(vector<word> &wvec, word &newWord) {
    // build the hash input: every digit becomes '\0' (Hash stops at the first
    // one) and lowercase letters become uppercase
    string str = newWord.wordStr;
    int i = str.length() - 1;
    while (i >= 0) {
        if (str[i] >= '0' && str[i] <= '9') {
            str[i] = '\0';
        } else if (str[i] >= 'a' && str[i] <= 'z') {
            str[i] = str[i] - 32;
        }
        i--;
    }
    unsigned int key = Hash(str);
    int outOfSlot = 1;
    int open = 0;
    vector<word>::iterator beg = wvec.begin();
    vector<word>::iterator itr = beg + key;
    // open addressing: probe until an empty slot or an equal word is found
    while (outOfSlot) {
        itr = beg + (itr - beg + open * 13) % WORD_POOL_SIZE;
        if (itr->wordStr.empty()) {  // empty slot: store the new word
            itr->wordStr = newWord.wordStr;
            itr->freq++;
            outOfSlot = 0;
        } else if (*itr == newWord) {  // same word: update spelling and count
            itr->resetWordStr(newWord.wordStr);
            itr->freq++;
            outOfSlot = 0;
        }
        open++;
    }
}

/* collect all expressions in str that match the definition of a word */
void getNewExpr(string &str, vector<word> &wvec, unsigned int &wordNum) {
    word newWord("", 1);
    regex reg("[[:alpha:]]{4}[[:alnum:]]{0,1020}");
    for (sregex_iterator it(str.begin(), str.end(), reg), end_it; it != end_it; it++) {
        wordNum++;
        newWord.wordStr = it->str();
        examineNewWord(wvec, newWord);
    }
}

/* count the characters with ASCII code within [32,126] */
unsigned long getCharNum(string &str) {
    unsigned long charNum = 0;
    for (string::iterator citr = str.begin(); citr != str.end(); citr++) {
        if (*citr >= 32 && *citr <= 126) charNum++;
    }
    return charNum;
}

/* count the lines in one file; note that a file ending in '\n' is counted
   with one extra (empty) line, and an empty file counts as one line */
unsigned long getLineNum(string filename) {
    ifstream input(filename);
    unsigned long lines = 0;
    string str;
    while (!input.eof()) {
        getline(input, str);
        lines++;
    }
    return lines;
}

void fileProcess(const char *filename, amount &result, vector<word> &wvec) {
    ifstream input(filename);
    if (!input.is_open()) {
        cout << "cannot open the file";
        return;
    }
    stringstream buffer;
    buffer << input.rdbuf();
    string srcStr = buffer.str();
    result.charNum += getCharNum(srcStr);      // update the amount of characters
    result.lineNum += getLineNum(filename);    // update the amount of lines
    getNewExpr(srcStr, wvec, result.wordNum);  // update wvec
    input.close();
}

/* print the results */
void getResult(const char *resfile, amount &result, vector<word> &wvec) {
    ofstream output(resfile);
    output << "char_number :" << result.charNum << endl;
    output << "line_number :" << result.lineNum << endl;
    output << "word_number :" << result.wordNum << endl;

    // sort wvec in descending frequency order and print at most ten words
    sort(wvec.begin(), wvec.end(), wordCompare);
    output << " " << endl;
    output << "the top ten frequency of words" << endl;
    vector<word>::iterator wlast = wvec.size() < 10 ? wvec.end() : wvec.begin() + 10;
    for (vector<word>::iterator witr = wvec.begin(); witr != wlast; witr++) {
        witr->printWord(output);
    }
}

/* if path is a directory, push the names of all files under it (recursively)
   into files and return 0; otherwise return -1 */
int getAllFiles(string path, vector<string> &files) {
    intptr_t hFile = 0;  // _findfirst returns intptr_t; long truncates it on 64-bit builds
    int flag = -1;
    struct _finddata_t fileinfo;
    string p;
    if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1) {
        flag = 0;
        do {  // do-while so that the entry returned by _findfirst is not skipped
            if (fileinfo.attrib & _A_SUBDIR) {  // it is a folder
                if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0) {
                    getAllFiles(p.assign(path).append("/").append(fileinfo.name), files);
                }
            } else {  // it is a file
                files.push_back(p.assign(path).append("/").append(fileinfo.name));
            }
        } while (_findnext(hFile, &fileinfo) == 0);
        _findclose(hFile);
    }
    return flag;
}

int main(int argc, char *argv[]) {
    amount result;
    result.charNum = 0;
    result.lineNum = 0;
    result.wordNum = 0;
    vector<word> wvec(WORD_POOL_SIZE, word("", 0));  // every slot starts empty
    vector<string> fvec;
    const char *path = "D:/Visual Studio/newsample";
    const char *resFile = "AllFiles.txt";
    int dirFlag = getAllFiles(path, fvec);
    if (dirFlag == 0) {
        for (vector<string>::iterator it = fvec.begin(); it != fvec.end(); it++) {
            fileProcess(it->c_str(), result, wvec);
        }
    } else {
        fileProcess(path, result, wvec);
    }
    getResult(resFile, result, wvec);
}

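Now let's look at the Visual Studio Performance Wizard report again:

Lookup is clearly much more efficient, and the new hotspot has moved to regular expression matching.

I have not yet investigated how to optimize this. Another prominent problem with this version is wasted space: main directly creates a vector of size 18000000, which consumes a great deal of memory. Since the number of phrases is at least twice the number of words, the vector would have to grow to 35000000 (the actual result is a little over 33000000), which I expected to end in a Stack Overflow. (Strictly speaking, a vector's elements are allocated on the heap, so the failure would more likely be memory exhaustion than a stack overflow.) This shows that my hash collision strategy is unsuitable; I should choose one that allocates memory dynamically and uses space more sensibly.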
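One possible direction for the regular expression cost, which the post leaves uninvestigated, is a hand-written scanner that walks the text once. The sketch below is an assumption-laden illustration rather than a measured optimization: the function name `extractWords` is hypothetical, and it treats each maximal run of letters and digits as one token. Unlike an unanchored search with the pattern used above, which would find "file" inside "123file", it rejects such a token as a whole, in line with the assignment's example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hand-written scanner for the assignment's word rule: a maximal run of
// letters and digits is a word when it starts with at least four letters
// and is at most 1024 characters long.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        if (!std::isalnum(static_cast<unsigned char>(text[i]))) {
            ++i;  // delimiter: skip it
            continue;
        }
        const std::size_t start = i;
        std::size_t leadingLetters = 0;
        bool stillLeading = true;  // false once a digit has been seen
        while (i < n && std::isalnum(static_cast<unsigned char>(text[i]))) {
            if (stillLeading && std::isalpha(static_cast<unsigned char>(text[i]))) {
                ++leadingLetters;
            } else {
                stillLeading = false;
            }
            ++i;
        }
        const std::size_t len = i - start;
        if (leadingLetters >= 4 && len <= 1024) {
            words.push_back(text.substr(start, len));
        }
    }
    return words;
}
```

Combined with a std::unordered_map keyed by the simplified word, this would also remove the need for the fixed 18000000-slot table, since the map grows on demand. Whether it is actually faster on the newsample data would have to be measured.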

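By now you can probably understand my regret. If I had to name where things started to go wrong, it was the rushed requirements analysis and design work at the beginning. As the instructor required, we used Teambition to produce a PSP (Personal Software Process). However, I could not find the export button (ಠ_ಠ), so here is a screenshot first, followed by a quickly made table.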

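Task

Estimated time

Actual time

Learn the basics of C++

10h

Work out a solution for each feature

Design the overall program architecture

30min

25min

Implement counting of total characters, lines, total words, and word occurrences

2.5h

Implement phrase frequency counting and output the top 10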