Word frequency statistics: first individual assignment for Software Engineering

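1. Project requirements and basic features

Project requirements

  • Count the characters, words, lines, and word frequencies of source files (*.txt, *.cpp, *.h, *.cs, *.html, *.js, *.java, *.py, *.php, and so on, covering every file in a folder), write the results to a default file in the specified format, support further extensions, and handle many files quickly.
  • Profile the program with a performance tool, find the bottlenecks, and improve them.
  • Run code quality analysis and eliminate all warnings: http://msdn.microsoft.com/en-us/library/dd264897.aspx
  • Design 10 test cases to make sure the program runs correctly (for example: an empty file, a file containing a single word, a single-line file, a typical file, and so on).
  • Manage the code with GitHub.
  • Write a blog post.

Basic features

  • Count the characters in a file (ASCII only; Chinese characters can be ignored).
  • Count the total number of words in a file.
  • Count the total number of lines in a file (any line counts, whatever characters it contains).
  • Count the occurrences of each word and output the 10 most frequent.
  • Process every file under a given folder and, recursively, its subfolders.
  • Count how often two words appear together (a phrase) and output the 10 most frequent.
  • Profile the program on Linux and describe the process in the blog post (bonus).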

2. PSP table

Status   Stage                        Estimated (min)   Actual (min)
Accept   [Planning]                   30                20
Accept   Time estimation              30                20
Accept   [Development]                1330              1910
Accept   Requirements analysis        20                30
Accept   Design document              30                30
Accept   Design review                10
Accept   Coding standard              10
Accept   Detailed design              60                60
Accept   Coding                       600               1000
Accept   Code review                  300               300
Accept   Testing                      300               480
Accept   [Recording time]             10                10
Accept   [Test report]                30                60
Accept   [Workload measurement]       10                10
Accept   [Summary and improvement]    60                60
Accept   [Total]                      1470              2090

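3. Approach

Data structures:

Global variables

unsigned long characterNum;  // number of characters
unsigned long wordNum;       // number of words
unsigned long lineNum;       // number of lines

Words and their occurrence counts are stored in dynamically allocated arrays of structs.

struct wordInfo {
    char*  wordStr;             // the word itself
    char** nextWordPoint;       // wordStr pointers of the words seen right after this one
    int*   nextWordFrequency;   // nextWordFrequency[k] counts the phrase (this word, nextWordPoint[k])
    int    presentNextWordNum;  // current capacity of the two arrays above
    int    frequency;           // occurrences of this word
    int    strlength;           // capacity of the wordStr buffer
    int    wordLength;          // excludes the trailing digits
};
// One word table per leading character.
struct alphaArray {
    wordInfo* wordArray;
    int       presentWordArrayLength;  // current capacity of wordArray
};

// One entry of the top-10 word list.
struct wordStatisticsResult {
    char* wordStr;
    int   wordFrequency;
};
// One entry of the top-10 phrase list.
struct phaseStatisticsResult {
    char* firstStr;
    char* secondStr;
    int   phaseFrequency;
};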

File traversal:

_findfirst and _findnext (Windows), adapted from a reference example

readdir (Linux), adapted from a reference example

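Implementation plan:

1> Main function:

Initialize the variables
Traverse every file in the given folder
Open each qualifying file read-only
Count words and phrases
Close the file
Repeat until every file has been traversed
Output the statistics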

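2> Word counting:

Walk through the characters and count them
Check whether each character is a newline and count lines
Keep a buffer for the consecutive characters of one word
Collect the word string
Hash the word (ELFHash as the hash function, quadratic probing to resolve collisions)
Use the first letter and the hash value to pick the word's slot, then store the word's information
Store the current word's address in the previous word's struct so that phrase frequencies can be counted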

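3> Phrase counting:

Store the word
Get back the current word's position in the word table
If it is not the first word
   Use the position to get the string pointer
   Search the previous word's struct for that pointer
   If it is there, increment the matching counter
   If it is not, store the pointer and initialize its count to 1
Remember the position

A final pass over the table then yields the frequency of every phrase.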
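The bookkeeping above can be sketched with standard containers. This is a simplified illustration, not the pointer-based structure used in the assignment: PhraseCounter and its methods are hypothetical names, and a std::map keyed by the word pair stands in for the per-word arrays of next-word pointers.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration: counts adjacent word pairs as words arrive one
// at a time, remembering only the previous word.
class PhraseCounter
{
public:
    // Feeds the next word of the current file.
    void add(const std::string& word)
    {
        assert(!word.empty());
        if (hasPrevious_)
            ++counts_[std::make_pair(previous_, word)];   // a new pair starts from 0
        previous_ = word;
        hasPrevious_ = true;
    }

    // Starts a new file: the first word of a file has no predecessor.
    void reset() { hasPrevious_ = false; }

    // Returns how often second directly followed first.
    int frequency(const std::string& first, const std::string& second) const
    {
        auto it = counts_.find(std::make_pair(first, second));
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, int> counts_;
    std::string previous_;
    bool hasPrevious_ = false;
};
```

Storing pointers in the previous word's struct, as the assignment does, avoids copying strings; the map version trades that efficiency for brevity.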

4. Code implementation

(1) Initializing the word table

void dictionaryInit(struct alphaArray* dictionary)
{
    int i, j, k;
    characterNum = 0;
    wordNum = 0;
    lineNum = 0;
    for (i = 0; i < alphabet; i++)
    {
        // One table of wordArrayLength slots per leading character.
        (dictionary + i)->wordArray = (wordInfo*)malloc(sizeof(wordInfo)*wordArrayLength);
        if ((dictionary + i)->wordArray == NULL) exit(-1);
        (dictionary + i)->presentWordArrayLength = wordArrayLength;
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            struct wordInfo* slot = (dictionary + i)->wordArray + j;
            slot->wordStr = (char*)malloc(sizeof(char)*wordStrLength);
            if (slot->wordStr == NULL) exit(-1);
            *(slot->wordStr) = '\0';
            slot->frequency = 0;
            slot->strlength = wordStrLength;
            slot->wordLength = 0;
            slot->nextWordPoint = (char**)malloc(sizeof(char*)*nextWordNum);
            if (slot->nextWordPoint == NULL) exit(-1);
            slot->nextWordFrequency = (int*)malloc(sizeof(int)*nextWordNum);
            if (slot->nextWordFrequency == NULL) exit(-1);
            for (k = 0; k < nextWordNum; k++)
            {
                *(slot->nextWordPoint + k) = NULL;
                *(slot->nextWordFrequency + k) = 0;
            }
            slot->presentNextWordNum = nextWordNum;
        }
    }
}

This allocates the initial memory and zeroes every value.

(2) Traversing the folder

  a) Windows

// Requires <io.h> (_findfirst, _findnext, _findclose) and <stdint.h> (intptr_t).
void traverseFileandCount(char* filePath, struct alphaArray* dictionary)
{
    struct _finddata_t FileInfo;
    char* presentPath;
    char* newPath;
    presentPath = (char*)malloc(sizeof(char)*filePathLength);
    if (presentPath == NULL) exit(-1);
    newPath = (char*)malloc(sizeof(char)*filePathLength);
    if (newPath == NULL) exit(-1);
    strcpy_s(presentPath, filePathLength, filePath);
    strcat_s(presentPath, filePathLength, "\\*");
    // _findfirst returns intptr_t; storing it in a long truncates the handle on 64-bit builds.
    intptr_t Handle = _findfirst(presentPath, &FileInfo);
    if (Handle == -1) exit(-1);
    do {
        if (FileInfo.attrib & _A_SUBDIR)
        {
            // Skip the current and parent directory entries, then recurse.
            if ((strcmp(FileInfo.name, ".") != 0) && (strcmp(FileInfo.name, "..") != 0))
            {
                generatePath(FileInfo, filePath, newPath);
                traverseFileandCount(newPath, dictionary);
            }
        }
        else
        {
            // Ordinary file: build its full path and run the statistics.
            generatePath(FileInfo, filePath, presentPath);
            count(presentPath, dictionary);
        }
    } while (_findnext(Handle, &FileInfo) == 0);
    _findclose(Handle);
    free(presentPath);
    free(newPath);
}

  b) Linux

// Requires <dirent.h>, <stdio.h>, <stdlib.h> and <string.h>.
void traverseFileandCount(char* path, struct alphaArray* dictionary)
{
    DIR *pDir;
    struct dirent *ent = NULL;
    char childpath[512];              // buffer for the path of each entry
    pDir = opendir(path);
    if (pDir == NULL) exit(-1);       // same policy as the Windows version
    while ((ent = readdir(pDir)) != NULL)
    {
        // Skip "." and "..", the current and parent directory entries.
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        snprintf(childpath, sizeof(childpath), "%s/%s", path, ent->d_name);
        // d_type is an enumeration, not a bit mask, so compare with == rather than &.
        if (ent->d_type == DT_DIR)
            traverseFileandCount(childpath, dictionary);   // recurse into the subdirectory
        else
            count(childpath, dictionary);                  // childpath is the file to process
    }
    closedir(pDir);
}

(3) Counting

void count(char* path, struct alphaArray* dictionary)
{
    FILE* fp;
    bool firstWordSign = true;        // true until the first word of this file is stored
    int i = 0;
    int finalAlphaPosition = 0;       // index of the last letter in the current word
    int presentWordOffset;
    int ch;                           // int, not char, so that EOF can be told apart from data
    char* tempWordStr;
    unsigned long hash;
    struct wordInfo* lastWordInfo = NULL, *presentWordInfo = NULL;
    tempWordStr = (char*)malloc(sizeof(char)*wordStrLength);
    if (tempWordStr == NULL) exit(-1);
    if (fopen_s(&fp, path, "r") != 0) exit(-1);
    do
    {
        ch = fgetc(fp);
        if (ch != EOF) characterNumandLineNum((char)ch);
        // EOF is treated as a separator so that the last word is flushed.
        if (ch == EOF || !isDigitorAlpha((char)ch))
        {
            tempWordStr[i] = '\0';
            if (isWord(tempWordStr))
            {
                hash = storeTempWord(dictionary, tempWordStr, finalAlphaPosition);
                getOffset(presentWordOffset, tempWordStr[0]);
                presentWordInfo = ((dictionary + presentWordOffset)->wordArray + hash);
                if (!firstWordSign)
                    storePhaseInfo(lastWordInfo, presentWordInfo);
                lastWordInfo = presentWordInfo;
                firstWordSign = false;
            }
            i = 0;
            finalAlphaPosition = 0;
            tempWordStr[0] = '\0';
        }
        else if (i < wordStrLength - 1)   // leave room for the terminating '\0'
        {
            if (isAlpha((char)ch)) finalAlphaPosition = i;
            tempWordStr[i++] = (char)ch;
        }
    } while (ch != EOF);
    free(tempWordStr);
    lineNum++;                        // count the last line of the file
    fclose(fp);
}

(4) Storing word information

unsigned long storeTempWord(struct alphaArray* dictionary, char* tempWordArray, int lastAlphaPosition)
{
    unsigned long hash = 0;
    int i = 0, offset;
    struct alphaArray* page;
    struct wordInfo* slot;
    hash = ELFHash(tempWordArray, lastAlphaPosition);
    getOffset(offset, tempWordArray[0]);
    page = dictionary + offset;           // table for this leading character
    hash = hash % (page->presentWordArrayLength);
    slot = page->wordArray + hash;
    // Probe until an empty slot or the slot already holding this word is found.
    while (!isEmpty(slot->wordStr) && isDifferent(slot, tempWordArray, lastAlphaPosition))
    {
        i++;
        if (i > (page->presentWordArrayLength))
        {
            // Growing the table changes hash % length for words already stored; see section 7.
            enlargeWordArrayLength(page);
            i = 0;
        }
        hash += i * i;
        hash = hash % (page->presentWordArrayLength);
        slot = page->wordArray + hash;
    }
    // Truncate words that do not fit into the slot's buffer.
    if ((int)strlen(tempWordArray) >= slot->strlength)
        *(tempWordArray + slot->strlength - 1) = '\0';
    if (isEmpty(slot->wordStr))
    {
        // The second argument of strcpy_s is the size of the destination buffer.
        strcpy_s(slot->wordStr, slot->strlength, tempWordArray);
        slot->wordLength = lastAlphaPosition;
    }
    else if (strcmp(slot->wordStr, tempWordArray) > 0)
    {
        // The word is already stored: keep whichever spelling sorts first.
        strcpy_s(slot->wordStr, slot->strlength, tempWordArray);
    }
    slot->frequency++;
    wordNum++;
    return hash;
}

(5) Storing phrase information

void storePhaseInfo(struct wordInfo* lastWordInfo, struct wordInfo* presentWordInfo)
{
    int i, k;
    int capacity = lastWordInfo->presentNextWordNum;
    // Used entries are packed at the front; the first zero count marks the end.
    for (i = 0; i < capacity && lastWordInfo->nextWordFrequency[i] != 0; i++)
    {
        if (lastWordInfo->nextWordPoint[i] == presentWordInfo->wordStr)
        {
            lastWordInfo->nextWordFrequency[i]++;   // phrase seen before
            return;
        }
    }
    if (i == capacity)
    {
        // Both arrays are full: double them and clear the new half.
        lastWordInfo->nextWordPoint = (char**)realloc(lastWordInfo->nextWordPoint, sizeof(char*)*capacity * 2);
        if (lastWordInfo->nextWordPoint == NULL) exit(-1);
        lastWordInfo->nextWordFrequency = (int*)realloc(lastWordInfo->nextWordFrequency, sizeof(int)*capacity * 2);
        if (lastWordInfo->nextWordFrequency == NULL) exit(-1);
        for (k = capacity; k < capacity * 2; k++)
        {
            lastWordInfo->nextWordPoint[k] = NULL;
            lastWordInfo->nextWordFrequency[k] = 0;
        }
        lastWordInfo->presentNextWordNum = capacity * 2;
    }
    // New phrase: record the following word with a count of 1.
    lastWordInfo->nextWordPoint[i] = presentWordInfo->wordStr;
    lastWordInfo->nextWordFrequency[i] = 1;
}

(6) ELF hash function

// Requires <ctype.h> for tolower.
unsigned long ELFHash(char* tempWordArray, int lastAlphaPosition)
{
    unsigned long hash = 0, x = 0;
    int i;
    // Hash characters 0..lastAlphaPosition, folding letters to lower case so that
    // spellings differing only in case share a hash value. Trailing digits are left out.
    for (i = 0; i <= lastAlphaPosition; i++)
    {
        hash = (hash << 4) + (unsigned long)tolower((unsigned char)tempWordArray[i]);
        if ((x = hash & 0xf0000000) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    hash &= 0x7fffffff;
    return hash;
}

(7) Top 10 words and phrases

void topFrequencyWordStatistics(struct alphaArray* dictionary, struct wordStatisticsResult* topFrequencyWord)
{
    int i = 0, j = 0;
    int minWordFrequency = 0;     // smallest frequency currently in the top list
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        (topFrequencyWord + i)->wordStr = NULL;
        (topFrequencyWord + i)->wordFrequency = 0;
    }
    for (i = 0; i < alphabet; i++)
    {
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            if (((dictionary + i)->wordArray + j)->frequency > minWordFrequency)
                updateTopFrequencyWord(topFrequencyWord, ((dictionary + i)->wordArray + j), minWordFrequency);
        }
    }
    sortTopFrequencyWord(topFrequencyWord);
    puts("Top 10 word:");
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordStr == NULL) break;   // fewer than 10 distinct words
        printf("%s\t%d\n", (topFrequencyWord + i)->wordStr, (topFrequencyWord + i)->wordFrequency);
    }
    printf("\n");
}

// Replaces one entry holding the current minimum with the new word, then recomputes the minimum.
void updateTopFrequencyWord(struct wordStatisticsResult* topFrequencyWord, struct wordInfo* dictionary_i_j, int &minWordFrequency)
{
    int i = 0;
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordFrequency == minWordFrequency)
        {
            (topFrequencyWord + i)->wordStr = dictionary_i_j->wordStr;
            (topFrequencyWord + i)->wordFrequency = dictionary_i_j->frequency;
            break;   // replace a single entry; continuing could insert the same word twice
        }
    }
    minWordFrequency = topFrequencyWord->wordFrequency;
    for (i = 1; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordFrequency < minWordFrequency)
            minWordFrequency = (topFrequencyWord + i)->wordFrequency;
    }
}

// Selection sort into descending order: each pass moves the smallest remaining entry to the end.
void sortTopFrequencyWord(struct wordStatisticsResult* topFrequencyWord)
{
    int i = 0, j = 0;
    int minPosition;
    struct wordStatisticsResult tempWord;
    for (i = 0; i < topFrequencyWordNum - 1; i++)
    {
        int last = topFrequencyWordNum - i - 1;
        minPosition = 0;
        for (j = 1; j <= last; j++)
        {
            if ((topFrequencyWord + j)->wordFrequency < (topFrequencyWord + minPosition)->wordFrequency)
                minPosition = j;
        }
        tempWord = topFrequencyWord[minPosition];
        topFrequencyWord[minPosition] = topFrequencyWord[last];
        topFrequencyWord[last] = tempWord;
    }
}

void topFrequencyPhaseStatistics(struct alphaArray* dictionary, struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0, j = 0, k = 0;
    int minPhaseFrequency = 0;    // smallest frequency currently in the top list
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        (topFrequencyPhase + i)->firstStr = NULL;
        (topFrequencyPhase + i)->secondStr = NULL;
        (topFrequencyPhase + i)->phaseFrequency = 0;
    }
    for (i = 0; i < alphabet; i++)
    {
        for (j = 0; j < (dictionary + i)->presentWordArrayLength; j++)
        {
            // Every (word, following word) pair stored in the slot is a candidate phrase.
            for (k = 0; k < ((dictionary + i)->wordArray + j)->presentNextWordNum; k++)
            {
                if (*(((dictionary + i)->wordArray + j)->nextWordFrequency + k) > minPhaseFrequency)
                    updateTopFrequencyPhase(topFrequencyPhase, ((dictionary + i)->wordArray + j), k, minPhaseFrequency);
            }
        }
    }
    sortTopFrequencyPhase(topFrequencyPhase);
    puts("Top 10 phase:");
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->firstStr == NULL) break;   // fewer than 10 distinct phrases
        printf("%s %s\t%d\n", (topFrequencyPhase + i)->firstStr, (topFrequencyPhase + i)->secondStr, (topFrequencyPhase + i)->phaseFrequency);
    }
    printf("\n");
}

// Same scheme as updateTopFrequencyWord, applied to the phrase (word, nextWordPoint[offset]).
void updateTopFrequencyPhase(struct phaseStatisticsResult* topFrequencyPhase, wordInfo* dictionary_i_j, int offset, int &minPhaseFrequency)
{
    int i = 0;
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->phaseFrequency == minPhaseFrequency)
        {
            (topFrequencyPhase + i)->firstStr = dictionary_i_j->wordStr;
            (topFrequencyPhase + i)->secondStr = *(dictionary_i_j->nextWordPoint + offset);
            (topFrequencyPhase + i)->phaseFrequency = *(dictionary_i_j->nextWordFrequency + offset);
            break;   // replace a single entry
        }
    }
    minPhaseFrequency = topFrequencyPhase->phaseFrequency;
    for (i = 1; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->phaseFrequency < minPhaseFrequency)
            minPhaseFrequency = (topFrequencyPhase + i)->phaseFrequency;
    }
}

// Selection sort into descending order, as in sortTopFrequencyWord.
void sortTopFrequencyPhase(struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0, j = 0;
    int minPosition;
    struct phaseStatisticsResult tempPhase;
    for (i = 0; i < topFrequencyPhaseNum - 1; i++)
    {
        int last = topFrequencyPhaseNum - i - 1;
        minPosition = 0;
        for (j = 1; j <= last; j++)
        {
            if ((topFrequencyPhase + j)->phaseFrequency < (topFrequencyPhase + minPosition)->phaseFrequency)
                minPosition = j;
        }
        tempPhase = topFrequencyPhase[minPosition];
        topFrequencyPhase[minPosition] = topFrequencyPhase[last];
        topFrequencyPhase[last] = tempPhase;
    }
}
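For comparison, the same top-10 selection can be written with the standard library. This is an alternative sketch rather than the code used in this assignment: WordCount and topWords are hypothetical names, and breaking ties alphabetically is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one word and how often it occurred.
struct WordCount
{
    std::string word;
    int frequency;
};

// Returns the k most frequent entries, highest frequency first.
// Assumption: equal frequencies are ordered alphabetically.
std::vector<WordCount> topWords(std::vector<WordCount> all, std::size_t k)
{
    const std::size_t n = std::min(k, all.size());
    // Only the first n positions need to end up sorted.
    std::partial_sort(all.begin(), all.begin() + static_cast<std::ptrdiff_t>(n), all.end(),
        [](const WordCount& a, const WordCount& b)
        {
            if (a.frequency != b.frequency) return a.frequency > b.frequency;
            return a.word < b.word;
        });
    all.resize(n);
    return all;
}
```

When there are fewer than k distinct words, the result is simply shorter, so no placeholder entries have to be skipped while printing.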

(8) Output

// Prints the three totals to the console.
void outputResult(struct alphaArray* dictionary)
{
    puts("Statistics result:");
    printf("characterNum:%lu\n", characterNum);
    printf("wordNum:%lu\n", wordNum);
    printf("lineNum:%lu\n\n", lineNum);
}

// Writes the totals and both top-10 lists to the result file.
void outputToFile(struct wordStatisticsResult* topFrequencyWord, struct phaseStatisticsResult* topFrequencyPhase)
{
    int i = 0;
    FILE* fp;
    if (fopen_s(&fp, "D:\\RGhw\\result.txt", "wb") != 0 || fp == NULL) exit(-1);
    fputs("characterNum:", fp);
    fprintf(fp, "%lu\r\n", characterNum);
    fputs("wordNum:", fp);
    fprintf(fp, "%lu\r\n", wordNum);
    fputs("lineNum:", fp);
    fprintf(fp, "%lu\r\n\r\n", lineNum);
    fputs("Top 10 frequency words:\r\n", fp);
    for (i = 0; i < topFrequencyWordNum; i++)
    {
        if ((topFrequencyWord + i)->wordStr == NULL) break;     // fewer than 10 distinct words
        fprintf(fp, "%s:  %d\r\n", (topFrequencyWord + i)->wordStr, (topFrequencyWord + i)->wordFrequency);
    }
    fputs("\r\n", fp);
    fputs("Top 10 frequency phases:\r\n", fp);
    for (i = 0; i < topFrequencyPhaseNum; i++)
    {
        if ((topFrequencyPhase + i)->firstStr == NULL) break;   // fewer than 10 distinct phrases
        fprintf(fp, "%s %s:  %d\r\n", (topFrequencyPhase + i)->firstStr, (topFrequencyPhase + i)->secondStr, (topFrequencyPhase + i)->phaseFrequency);
    }
    fputs("\r\n", fp);
    fclose(fp);
}

(9) Freeing memory

void dictionaryDestroy(struct alphaArray* dictionary)
{
    int i, j;
    for (i = 0; i < alphabet; i++)
    {
        // Free the buffers owned by each slot before freeing the table itself.
        for (j = 0; j < ((dictionary + i)->presentWordArrayLength); j++)
        {
            free(((dictionary + i)->wordArray + j)->wordStr);
            free(((dictionary + i)->wordArray + j)->nextWordPoint);
            free(((dictionary + i)->wordArray + j)->nextWordFrequency);
        }
        free((dictionary + i)->wordArray);
    }
}

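5. Performance analysis

(1) CPU and GPU usage

(2) CPU usage by function

(3) CPU usage inside main

(4) CPU usage inside the traversal-and-count function (traverseFileandCount)

(5) CPU usage inside the counting function (count)

Analysis:

  • The profile shows that the program spends almost all of its run time in the counting function.
  • Inside the counting function, fgetc() takes the most time, which shows that reading the file one byte at a time is inefficient. I considered switching to fread() (reading the whole file into an array in one call) to speed this up, but ran out of time and did not make the change.
  • After fgetc(), the function that stores words also takes a fair amount of time, probably because of the dynamic memory: every call has to check whether there is enough room and request a larger block when there is not. For the sake of robustness, I consider this cost unavoidable.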
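The fread() idea mentioned in the analysis could look roughly like the following. It is a sketch of the optimization that was considered but not implemented: readWholeFile is a hypothetical helper, the 64 KiB block size is an arbitrary choice, and the file is assumed to fit in memory.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: reads an entire file in large blocks instead of one
// fgetc() call per byte. Returns false if the file cannot be opened or read.
bool readWholeFile(const char* path, std::string& contents)
{
    assert(path != nullptr);
    contents.clear();
    std::FILE* fp = std::fopen(path, "rb");
    if (fp == nullptr) return false;
    char block[1 << 16];                       // 64 KiB per fread call
    std::size_t got;
    while ((got = std::fread(block, 1, sizeof block, fp)) > 0)
        contents.append(block, got);
    const bool ok = std::ferror(fp) == 0;      // distinguish a read error from end of file
    std::fclose(fp);
    return ok;
}
```

The characters in contents can then be scanned in a loop in the same way count scans the results of fgetc(). Note that the sketch opens the file in binary mode, so on Windows a "\r\n" line ending arrives as two characters, unlike the text mode used by count.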

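6. Test cases and analysis

(1) Test set provided by the teaching assistant

    Run time: 32 seconds (release build). The results are shown below (my program's output on the left, the TA's on the right, here and in the later cases; note that my line count and word count are printed in a different order from the TA's):

    The first three figures each differ by about 100, which may be due to differences in the counting method.

    The word and phrase statistics match the TA's.

(2) Empty folder

(3) Empty file

(4) File containing a single word

(5) Words of the same class are output in dictionary order

    File contents:

    Results:

(6) Phrases are output in dictionary order

    File contents:

    Results:

(7) Files of different types

    Folder:

    Results:

(8) Invalid path

    My program exits immediately (exit(-1)) without printing an error message.

(9) First version of the test set

(10) Image file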

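7. Known problems

After the program first ran successfully, I ran it on the test set and found that the word THAT appeared twice in the output; in other words, the same word was stored in two different places. This puzzled me for a while until I traced the problem to the dynamic memory. For robustness, the word table grows: when it cannot hold another word, the program requests twice the space. What I overlooked is that when the table's capacity changes, the slot derived from a word's hash value changes too, so the same word ended up stored in different places. The fix I came up with was to search the table at one times the initial capacity, then two times, and so on, so that every word has exactly one well-defined position. By the time I found the problem the deadline was close, so I simply enlarged the initial capacity to work around it.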
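A common remedy for this kind of bug is to re-insert every stored word whenever the table grows, so that each word's slot is always derived from the current capacity. The sketch below illustrates that idea only; it is neither the fix described above nor the one applied in this assignment. WordTable, its methods, and the simple hashWord function are hypothetical, and it uses linear probing for brevity where the assignment uses ELFHash with quadratic probing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical open-addressing table that counts words.
class WordTable
{
public:
    explicit WordTable(std::size_t capacity) : slots_(capacity)
    {
        assert(capacity > 0);
    }

    // Increments the count of word, inserting it first if necessary.
    void add(const std::string& word)
    {
        assert(!word.empty());                 // the empty string marks a free slot
        if ((used_ + 1) * 2 > slots_.size())   // keep the table at most half full
            grow();
        Slot& slot = slots_[findSlot(slots_, word)];
        if (slot.word.empty())
        {
            slot.word = word;
            ++used_;
        }
        ++slot.count;
    }

    // Returns the stored count, or 0 for a word that was never added.
    int count(const std::string& word) const
    {
        return slots_[findSlot(slots_, word)].count;
    }

    std::size_t distinctWords() const { return used_; }

private:
    struct Slot
    {
        std::string word;
        int count = 0;
    };

    static std::size_t hashWord(const std::string& word)
    {
        std::size_t h = 0;
        for (unsigned char c : word) h = h * 31 + c;
        return h;
    }

    // Linear probing: returns the slot holding word, or the free slot where it belongs.
    static std::size_t findSlot(const std::vector<Slot>& slots, const std::string& word)
    {
        std::size_t i = hashWord(word) % slots.size();
        while (!slots[i].word.empty() && slots[i].word != word)
            i = (i + 1) % slots.size();
        return i;
    }

    // Doubles the capacity and re-inserts every word, because hash % capacity
    // changes for existing entries when the capacity changes.
    void grow()
    {
        std::vector<Slot> bigger(slots_.size() * 2);
        for (Slot& old : slots_)
            if (!old.word.empty())
                bigger[findSlot(bigger, old.word)] = std::move(old);
        slots_.swap(bigger);
    }

    std::vector<Slot> slots_;
    std::size_t used_ = 0;
};
```

Rehashing costs one pass over the table per doubling, but lookups never have to consider earlier capacities, and a word such as THAT can only ever occupy one slot.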

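8. Summary and reflection

Overall, because I made a rough plan at the start, the work went fairly smoothly. I got stuck twice: on the dynamic memory code and on using the virtual machine. The word table uses dynamic memory, so the code has to check whether there is enough room and reallocate when there is not. My thinking on this part was not clear enough when I wrote it, and it took a lot of time. Once the program ran successfully I started modifying it for portability. To test it I installed an Ubuntu virtual machine. After one successful test the virtual machine suddenly died; I reinstalled it three times and it still failed (so tiring). As a result, I could not verify the function that writes the output file.

On coding standards, I have improved a little compared with before. This time I paid particular attention to variable and function names to make the code more readable. I also split long functions into smaller ones wherever I could, although some functions still run to forty or fifty lines.

On scheduling, all I can say is that I did the software engineering assignment first and the assignments for my other courses afterwards.

Shortcomings: I am not proficient with the virtual machine and could not resolve problems quickly; the performance analysis is not detailed enough; the code is verbose and hard to read.

I will keep practicing and improving in future programming work.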

 

Appendix: GitHub repository for the code
