Pair Project 2: Hot-Word Statistics for Paper Abstracts, with Advanced Requirements

Date: 2019-11-09

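Course: Software Engineering Practice
Assignment: Pair Project 2: Hot-Word Statistics for Paper Abstracts, with Advanced Requirements
Pair student IDs: 221600428 | 221600438
GitHub repositories: Basic requirements | Advanced requirements
GitHub commit history: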

Division of work:

221600438 鄭厚楚 (Zheng Houchu)
- Main coding
- Requirements analysis
- Unit testing
221600428 林煜 (Lin Yu)
- Blog writing
- Requirements analysis
- Unit testing

PSP

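PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning
• Estimate	• Estimate how long the task will take	1300	1540
Development	Development	1140	1420
• Analysis	• Requirements analysis (including learning new technologies)	100	120
• Design Spec	• Write the design document	60	60
• Design Review	• Design review	60	60
• Coding Standard	• Coding standard (set suitable conventions for this project)	60	60
• Design	• Detailed design	200	240
• Coding	• Coding	500	660
• Code Review	• Code review	100	120
• Test	• Testing (self-test, fix code, commit changes)	60	100
Reporting	Reporting	160	120
• Test Report	• Test report	30	30
• Size Measurement	• Measure the workload	30	30
• Postmortem & Process Improvement Plan	• Postmortem and process improvement plan	100	60
Total		1300	1540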

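Approach
After receiving the assignment, we spent a good deal of time on requirements analysis. For the basic requirements we then wrote separate functions to count characters, words, valid lines, and word frequencies. After that we extended the program to cover the advanced requirements and optimized each function.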

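Basic requirements
- The input file name is passed as a command-line argument.
- Count the characters in the file.
- Count the total number of words in the file.
- Count the valid lines in the file.
- Count the occurrences of each word in the file, and output only the 10 most frequent.
- Write the output to the file result.txt in dictionary order.
Advanced requirements
- Use a tool to crawl paper information.
- Custom input and output files.
- Weighted word-frequency counting.
- Phrase-frequency counting: count the frequency of phrases of a specified length in the file.
- Custom number of word-frequency results to output.
- Combined use of multiple parameters.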

Implementation
Design process

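Basic requirements
- Required project layout
  221600428&221600438
- src
  - Main.java
  - lib.java
    Functions and flow:
    There are two main classes, Main.java and lib.java. To make the code easier to maintain and modify, the logic is split into four main functions, countChar, countWord, countLine and countMostWord, which count characters, count words, count lines, and output word frequencies respectively. The main function calls countChar, countWord, countLine and countMostWord in turn to compute and print the results.
    Class diagram:
    
    Flowchart: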
Advanced requirements
Building on the basic requirements, we extended the word-frequency function.
Class diagram:

Flowchart:
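Crawler:
- Load the page at http://openaccess.thecvf.com/CVPR2018.py into an InputStream.
- Read the buffered stream line by line and use a regular expression to keep only the lines that contain ptitle.
- Split each of those lines to get the paper title and the link to its abstract page.
- Fetch the page behind each abstract link.
- Read that page line by line. A line containing Abstract marks the spot; the next line is read and split to obtain the abstract text.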
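
The title-and-link extraction step above can be sketched as follows. This is a minimal illustration, not the authors' GetInfo() method: the class name PtitleParser is hypothetical, and the assumed line shape (a dt element with class "ptitle" wrapping a single link) is a guess about the index page that should be checked against the real HTML. The sketch works on lines that have already been downloaded, so it needs no network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls {link, title} pairs out of index-page lines.
class PtitleParser {
    // Assumed line shape: <dt class="ptitle"><br><a href="LINK">TITLE</a></dt>
    private static final Pattern PTITLE =
            Pattern.compile("class=\"ptitle\".*?<a href=\"([^\"]+)\">([^<]+)</a>");

    /** Returns one {link, title} array for every line that matches. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> papers = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PTITLE.matcher(line);
            if (m.find()) {
                papers.add(new String[] { m.group(1), m.group(2).trim() });
            }
        }
        return papers;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real page content.
        List<String> sample = Arrays.asList(
                "<dl>",
                "<dt class=\"ptitle\"><br><a href=\"content/html/Sample_paper.html\">A Sample Title</a></dt>");
        for (String[] p : parse(sample)) {
            System.out.println(p[1] + " -> " + p[0]);
        }
    }
}
```

Using capture groups instead of split keeps the link and the title together in one match, which may be easier to adapt if the markup changes.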

Improvement ideas

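Basic:
- Character counting: the first approach read characters with read() into one string, handled \r\n, and then counted the characters. Performance was very poor, and it could not cope with files of 1 MB or more. The improvement: read the file with readLine(), keep the length of each line in an int variable (len), examine the line character by character (non-ASCII characters are not counted), and accumulate the total. At the end, if len is not 0, charCount is decremented by 1 (to compensate for the newline added after the last line returned by readLine()).
- Word counting: at first a hand-written function decided whether a token was a word. The improvement: use the regular expression .matches("[a-z]{4}[a-z0-9]*") to decide.
Advanced:
- Character reading works much as in the basic version, with the extra step of recognizing "Title:" and "Abstract:" lines. The other features were implemented by adding code for each of them, without major changes elsewhere.
- Phrase counting does not include the separator characters inside a phrase. We did not spot this requirement, and we did not check the teaching assistant's blog for updates in time, so there was no time left to add it.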
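
As a small illustration of the word rule used above ("[a-z]{4}[a-z0-9]*" applied to lower-cased tokens), here is a hedged sketch. The class name WordRule and its methods are hypothetical and are not part of the project; only the regular expression and the split pattern come from the post.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper that isolates the word rule from the post.
class WordRule {
    // At least four leading letters, then any mix of letters and digits.
    private static final Pattern WORD = Pattern.compile("[a-z]{4}[a-z0-9]*");

    /** True if the lower-cased token satisfies the word rule. */
    static boolean isWord(String token) {
        return WORD.matcher(token.toLowerCase(Locale.ROOT)).matches();
    }

    /** Splits on every non-alphanumeric character and counts valid words. */
    static int countWords(String line) {
        int count = 0;
        for (String token : line.split("[^a-zA-Z0-9]")) {
            if (isWord(token)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isWord("file123"));   // true
        System.out.println(isWord("123file"));   // false: starts with digits
        System.out.println(countWords("Hello file123 123file abc good-morning")); // 4
    }
}
```

Compiling the pattern once, rather than calling String.matches() per token, avoids recompiling the expression for every token.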

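Code walkthrough
Basic functions

Character counting
Read the file line by line with readLine() and keep the length of the string in an int variable len, then examine the string character by character (non-ASCII characters are not counted). One newline character is added after each line read. Finally, if (len != 0), the character count is decremented by 1.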

public void CountChar(String path)
    {       
        try
        {
            
            InputStreamReader r = new InputStreamReader(new FileInputStream(path));
            BufferedReader br = new BufferedReader(r);
            int a=0;
            int len=0;
            String str=null;
            while((str=br.readLine()) != null)
            {
                //str = str +(char)a;
                a=str.length()-1;
                len=a+1;
                while(a>=0)
                {
                    char o=str.charAt(a);
                    if(o>=0 && o<=127)
                    {
                        charCount++;
                    }
                    a--;
                }
                charCount++;
            }
            if(len != 0) charCount--;
            //str=str.replaceAll("\\r\\n+", "a");
            //charCount=str.length();
            r.close();
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }   
    }
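
Because readLine() strips line terminators, the final len check cannot tell whether the file actually ends with a newline. As a hedged alternative, and not the authors' implementation, the sketch below counts characters directly from a buffered stream without building an intermediate string. It assumes that "\r\n" should count as a single character and that non-ASCII characters are skipped; the class name StreamingCharCounter is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical alternative to CountChar: streams characters one at a time.
class StreamingCharCounter {
    /** Counts ASCII characters; a "\r\n" pair is counted once (assumption). */
    static long count(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        long total = 0;
        boolean prevCr = false;
        int c;
        while ((c = br.read()) != -1) {
            if (c == '\n' && prevCr) {
                // Second half of "\r\n": already counted with the '\r'.
                prevCr = false;
                continue;
            }
            prevCr = (c == '\r');
            if (c <= 127) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count(new StringReader("ab\r\ncd\n"))); // 6
    }
}
```

To run it on a file, the StringReader would be replaced with a reader opened on the input path.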

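Valid line counting
Check each line for validity (a line is valid if it is not empty after trimming whitespace), then count the valid lines.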

public void CountLine(String path)
    {
        try
        {
            InputStreamReader r = new InputStreamReader(new FileInputStream(path));
            BufferedReader br = new BufferedReader(r);
            String s= br.readLine();
            while(s != null)
            {
                //charCount+=s.length();
                //System.out.println(s);
                if(!s.trim().equals(""))
                {
                    lineCount++;
                }
                s=br.readLine();
            }   
            r.close();      
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }       
        
    }

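Word counting
Read lines with readLine(), split each line into a string array, and test each element against the word rule. If it is a word, increment the word count and add it to list; otherwise ignore it.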

public void CountWord(String path)
    {   
        try
        {
            InputStreamReader r = new InputStreamReader(new FileInputStream(path));
            BufferedReader br = new BufferedReader(r);
            String s= br.readLine();
            while(s != null)
            {
                s=s.toLowerCase();          
                String wordstr[] = s.split("[^a-zA-Z0-9]");
                int x = wordstr.length;
                for(int i=0; i<x; i++)
                {                   
                    /*if(wordstr[i].length() >= 4)
                    {   
                        //System.out.println(wordstr[i]+"  "+wordstr[i].length()+"  ");
                        String st = wordstr[i].substring(0, 4);
                        //System.out.print(st.matches("[a-zA-Z]+"));
                        if(st.matches("[a-zA-Z]+"))
                        {
                            ++wordCount;
                            list.add(wordstr[i]);
                        }
                    }*/
                    if(wordstr[i].matches("[a-z]{4}[a-z0-9]*"))
                    {
                        ++wordCount;
                        list.add(wordstr[i]);
                    }
                }
                s=br.readLine();
            }   
            r.close();      
            
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }   
    }

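Word frequencies
A Map counts the frequency of each word in list. The entries are sorted in descending order of frequency, the 10 most frequent words are printed, and the results are written to result.txt.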

public void CountFre()
    {
        for(String li: list)
        {
            if(words.get(li) != null)
            {
                words.put(li, words.get(li)+1);
            }
            else words.put(li, 1);
        }
        maplist = new ArrayList<Map.Entry<String, Integer>>(words.entrySet());
        Collections.sort(maplist, new Comparator<Map.Entry<String, Integer>>(){
            @Override
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                // TODO Auto-generated method stub
                return Integer.compare(o2.getValue(), o1.getValue()); // descending by count
            }
        });
        
        try
        {
            File file = new File("result.txt");
            BufferedWriter br = new BufferedWriter(new FileWriter(file));
            br.write("characters: "+charCount+"\r\n");
            br.write("words: "+wordCount+"\r\n");
            br.write("lines: "+lineCount+"\r\n");
            for(int i=0; i<maplist.size(); i++)
            {
                if(i>=10) break;
                br.write("<"+maplist.get(i).getKey()+">: "+maplist.get(i).getValue()+"\r\n");
            }   
            br.close();
        }
        catch(Exception e)
        {
            System.out.println("Error writing file!");
        }
        
        for(int i=0; i<maplist.size(); i++)
        {
            if(i>=10) break;
            System.out.println("<"+maplist.get(i).getKey()+">: "+maplist.get(i).getValue());
        }       
    }
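
The comparator above orders entries by count only, so the order of words with equal counts depends on the map and the sort. If the dictionary-order requirement is taken to mean that ties are broken alphabetically (an assumption about the assignment), the ranking could be sketched like this. TopWords and top() are hypothetical names, not part of the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks words by count, breaking ties alphabetically.
class TopWords {
    static List<Map.Entry<String, Integer>> top(List<String> words, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum); // add 1, or start at 1
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("word", "abcd", "word", "abcd", "zeta", "alpha");
        for (Map.Entry<String, Integer> e : top(words, 3)) {
            System.out.println("<" + e.getKey() + ">: " + e.getValue());
        }
    }
}
```

With the sample input, abcd and word both occur twice, so abcd is listed first, followed by word and then alpha.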

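Advanced part

Character counting
Read the file line by line with readLine() and use regular expressions ("Title: " and "Abstract: ") to decide whether a line is a Title line or an Abstract line. Then walk the string character by character, starting from an index derived from its length (non-ASCII characters are not counted). One newline is added per line; 7 is subtracted for a Title line (the length of "Title: ") and 10 for an Abstract line (the length of "Abstract: ").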

public void CountChar(String inpath)
    {       
        try
        {
            InputStreamReader r = new InputStreamReader(new FileInputStream(inpath));
            BufferedReader br = new BufferedReader(r);
            //int a=0;
            //int cnt=0;
            //int csnt=0;
            String str=null;
            Pattern psa = Pattern.compile("^Title: ");     // match the prefix at the start of the line only
            Pattern psb = Pattern.compile("^Abstract: ");
            
            //int lines=0;
            while((str=br.readLine()) != null)
            {           
                Matcher msa = psa.matcher(str);
                Matcher msb = psb.matcher(str);
                if(msa.find())  
                {
                    int index=str.length()-1;
                    //System.out.println(str.charAt(10));
                    while(index >= 0)
                    {
                        char o = str.charAt(index);
                        if(o>=0 && o<=127)  // count ASCII characters only ("||" was always true)
                        {
                            charCount++;
                        }
                        index--;
                    }
                    charCount-=7; // subtract the "Title: " prefix
                    charCount++;  // add the newline
                    //charCount += str.length()-7;
                }
                else if(msb.find()) 
                {
                    //charCount+=str.length()-10;           
                    int index=str.length()-1;
                    while(index >= 0)
                    {
                        char o = str.charAt(index);
                        if(o>=0 && o<=127)  // count ASCII characters only ("||" was always true)
                        {
                            charCount++;
                            //System.out.println(o);
                        }
                        index--;
                    }
                    charCount-=10; // subtract the "Abstract: " prefix
                    charCount++;
                }
                
            }
            //if(a != 0) charCount--;
            //charCount = charCount - (csnt/2)*17;// valid lines minus "\r\n", "Title: ", "Abstract: "
            //System.out.println(csnt); 
            r.close();
            System.out.println("characters: "+charCount);
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }
    }

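Valid line counting
Count only the valid lines, that is, the Title lines and the Abstract lines. In the code this means skipping every fifth line, starting from the first, along with blank lines.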

public void CountLine(String inpath)
    {
        try
        {
            InputStreamReader r = new InputStreamReader(new FileInputStream(inpath));
            BufferedReader br = new BufferedReader(r);
            String s= br.readLine();
            int cnt=0;
            while(s != null)
            {
                //charCount+=s.length();
                //System.out.println(s);
                if(cnt%5 != 0 && !s.trim().equals(""))
                {
                    lineCount++;
                }
                cnt++;
                s=br.readLine();
            }   
            r.close();  
            
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }       
        
    }

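Words and word frequencies
xx is the value of -w, yy the value of -m, zz the value of -n, inpath the value of -i, and outpath the value of -o.
Each line of the file is read into a string s; s is split into the string array wordstr, and x holds the length of wordstr.
A first loop does the phrase counting for -m. Here I only considered phrases made of words and did not handle special characters.
Two Lists support the -w weighting: listt collects the phrases found in titles and lista collects the phrases found in abstracts.
A second loop counts the words that the -m loop did not reach, that is, the elements wordstr[(x-yy+1)~(x)].
A Map accumulates the frequencies.
Finally the top zz phrases by frequency are output (either multi-word phrases or single words).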

public void CountWord(int xx, int yy, int zz, String inpath, String outpath)
    {
        try
        {
            InputStreamReader r = new InputStreamReader(new FileInputStream(inpath));
            BufferedReader br = new BufferedReader(r);
            String s= br.readLine();
            while(s != null)
            {
                s=s.toLowerCase();          
                String wordstr[] = s.split("[^a-zA-Z0-9]");
                int x = wordstr.length;
                if(x>=yy)
                {
                    for(int i=1; i<(x-yy+1); i++)
                    {                   
                        if(wordstr[i].matches("[a-z]{4}[a-z0-9]*"))
                        {
                            ++wordCount;
                            //System.out.println(wordstr[0]);
                            String phrase = "";
                            if(wordstr[0].equals("title"))  
                            {
                                //System.out.println("Title--");
                                int j=0;
                                for(j=0; j<yy; j++)
                                {
                                    if(!wordstr[i+j].matches("[a-z]{4}[a-z0-9]*")) break;
                                    else{
                                        phrase += (wordstr[i+j]+" ");
                                    }
                                }
                                phrase = phrase.substring(0, phrase.length()-1);
                                if(j == yy) listt.add(phrase);                              
                                
                            }
                            else  
                            {
                                //System.out.println("Abstract--");
                                int j=0;
                                for(j=0; j<yy; j++)
                                {
                                    if(!wordstr[i+j].matches("[a-z]{4}[a-z0-9]*")) break;
                                    else{
                                        phrase += (wordstr[i+j]+" ");
                                    }
                                }
                                phrase = phrase.substring(0, phrase.length()-1);
                                if(j == yy) lista.add(phrase);                              
                            }
                        }
                    }               
                    for(int k=(x-yy+1); k<x; k++)
                    {
                        if(wordstr[k].matches("[a-z]{4}[a-z0-9]*"))
                        {   
                            
                            ++wordCount;
                        }
                    }
                                
                }
                else
                {
                    for(int k=1; k<x; k++)
                    {
                        if(wordstr[k].matches("[a-z]{4}[a-z0-9]*"))
                        {   
                            ++wordCount;
                        }
                    }                   
                }
                
                s=br.readLine();
            }   
            r.close();      
            System.out.println("words: "+wordCount);
            System.out.println("lines: "+lineCount);
        }
        catch(IOException e)
        {
            System.out.println("Error reading file!");
        }   
        
        
        
        if(xx == 0)
        {
            for(String li: listt)
            {
                if(words.get(li) != null)
                {
                    words.put(li, words.get(li)+1);
                }
                else words.put(li, 1);
            }       
        }
        else
        {
            for(String li: listt)
            {
                if(words.get(li) != null)
                {
                    words.put(li, words.get(li)+10);
                }
                else words.put(li, 10);
            }               
        }

        for(String li: lista)
        {
            if(words.get(li) != null)
            {
                words.put(li, words.get(li)+1);
            }
            else words.put(li, 1);
        }
        maplist = new ArrayList<Map.Entry<String, Integer>>(words.entrySet());
        Collections.sort(maplist, new Comparator<Map.Entry<String, Integer>>(){
            @Override
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                // TODO Auto-generated method stub
                return Integer.compare(o2.getValue(), o1.getValue()); // descending by count
            }
        });
        
        // print the top zz results
        for(int i=0; i<maplist.size(); i++)
        {
            if(i>=zz) break;
            System.out.println("<"+maplist.get(i).getKey()+">: "+maplist.get(i).getValue());
        }           
        // write the statistics to the output file
        try
        {
            File file = new File(outpath);
            BufferedWriter br = new BufferedWriter(new FileWriter(file));
            br.write("characters: "+charCount+"\r\n");
            br.write("words: "+wordCount+"\r\n");
            br.write("lines: "+lineCount+"\r\n");
            for(int i=0; i<maplist.size(); i++)
            {
                if(i>=zz) break;  // honor -n, as the console output does
                br.write("<"+maplist.get(i).getKey()+">: "+maplist.get(i).getValue()+"\r\n");
            }   
            br.close();
        }
        catch(Exception e)
        {
            System.out.println("Error writing file!");
        }
        
            
    }
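
The phrase and weight logic described above can be condensed into a short sketch. It is an illustration under stated assumptions, not a drop-in replacement for CountWord: PhraseCounter and its methods are hypothetical names, the weight of 10 for title phrases mirrors the value used in the code above, and, as in the original, a phrase only forms across a single separator character.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: counts phrases of m consecutive valid words.
class PhraseCounter {
    private static final Pattern WORD = Pattern.compile("[a-z]{4}[a-z0-9]*");

    /** Adds every run of m consecutive valid words in text, each worth `weight`. */
    static void addPhrases(String text, int m, int weight, Map<String, Integer> freq) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]");
        for (int i = 0; i + m <= tokens.length; i++) {
            StringBuilder phrase = new StringBuilder();
            int j = 0;
            for (; j < m; j++) {
                if (!WORD.matcher(tokens[i + j]).matches()) {
                    break; // an invalid or empty token ends the phrase
                }
                if (j > 0) {
                    phrase.append(' ');
                }
                phrase.append(tokens[i + j]);
            }
            if (j == m) {
                freq.merge(phrase.toString(), weight, Integer::sum);
            }
        }
    }

    /** Title phrases weigh 10 when weighted is true, otherwise 1; abstract phrases weigh 1. */
    static Map<String, Integer> count(List<String> lines, int m, boolean weighted) {
        Map<String, Integer> freq = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("Title: ")) {
                addPhrases(line.substring(7), m, weighted ? 10 : 1, freq);
            } else if (line.startsWith("Abstract: ")) {
                addPhrases(line.substring(10), m, 1, freq);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "0", "Title: Deep learning", "Abstract: deep learning for deep learning");
        System.out.println(count(lines, 2, true)); // {deep learning=12}
    }
}
```

Stripping the "Title: " and "Abstract: " prefixes before splitting removes the need to start the loop at index 1, and a single merge call replaces the get-then-put pattern.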

main
xx is the value of -w, yy the value of -m, zz the value of -n, inpath the value of -i, and outpath the value of -o.

public class Main {
    public static void main(String[] args)
    {
        //long startTime = System.currentTimeMillis();
        
        
        // crawler part; uncomment to use
        /*
        lib count = new lib();
        count.GetInfo();
        */
        WordCount words = new WordCount();
        //CountPhrase cp = new CountPhrase();
        String inpath = null; //-i
        String outpath = null;  //-o
        int xx = 0;  //-w
        int yy = 1;  //-m
        int zz = 10;  //-n
        for(int i=0; i<args.length;i+=2)
        {
            //if(args[i].equals("-i")) ;
            //else if(args[i].equals("-o")) ;
            if(args[i].equals("-i"))
            {   
                inpath = args[i+1];
            }
            if(args[i].equals("-o"))
            {
                outpath = args[i+1];
            }
            if(args[i].equals("-w")){
                xx = Integer.parseInt(args[i+1]);
            }
            if(args[i].equals("-m"))
            {
                yy = Integer.parseInt(args[i+1]);
            }
            if(args[i].equals("-n"))
            {
                zz = Integer.parseInt(args[i+1]);
            }
        }
        
        words.CountChar(inpath);
        words.CountLine(inpath);
        words.CountWord(xx, yy, zz, inpath, outpath);
        
        
        //long endTime = System.currentTimeMillis();    
        //System.out.println("Elapsed time: "+(endTime-startTime)+"ms");
        
    
    }
}
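
The loop in main reads args[i+1] without checking that it exists, so a trailing flag with no value throws ArrayIndexOutOfBoundsException. A more defensive parser might look like the sketch below. ArgParser is a hypothetical class, the defaults (-w 0, -m 1, -n 10) are taken from the code above, and treating -i and -o as mandatory is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: validates "-flag value" pairs before use.
class ArgParser {
    /** Returns a flag-to-value map; throws IllegalArgumentException on bad input. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        opts.put("-w", "0");   // defaults as in main
        opts.put("-m", "1");
        opts.put("-n", "10");
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].matches("-[iowmn]")) {
                throw new IllegalArgumentException("unknown option: " + args[i]);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value for " + args[i]);
            }
            opts.put(args[i], args[i + 1]);
        }
        // Assumption: both paths are required.
        if (!opts.containsKey("-i") || !opts.containsKey("-o")) {
            throw new IllegalArgumentException("-i and -o are required");
        }
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(new String[] { "-i", "in.txt", "-o", "out.txt", "-m", "2" });
        System.out.println(opts.get("-m") + " " + opts.get("-n"));
    }
}
```

The numeric values stay as strings here; the caller would still convert them with Integer.parseInt, as main does.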

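Selected test code
Unit tests for the basic requirements: each of the four functions (CountChar(), CountLine(), CountWord(), CountFre()) was run on a file of about 110 MB. The results show that CountWord() takes the most time:

Unit tests for the advanced requirements: each of the three functions (CountChar(), CountLine(), CountWord()) was run on a file of about 110 MB. The results show that CountWord() again takes the most time.

Some of the test results: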

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    
\t\n

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    軟件工程
軟件工程\t\n軟件工程

abcdefghijklmnopqrstuvwxyz
1234567890
,./;'[]\<>?:"{}|`-=~!@#$%^&*()_+

e
a
d
軟工  
dhoisahdio

e
a 
  d 
 s 
dhoisahdio

weqweq
eqweee
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww
wwwwww

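Difficulties and solutions
We were not familiar enough with crawlers, and our programming skills were fairly weak. In the end we completed the task by looking up material online and reading up on the relevant topics.

Teammate evaluation
- Evaluation of 221600428: my teammate is quick-thinking and actively involved; our division of work was clear and the collaboration was fairly pleasant.
- Evaluation of 221600438: my teammate is careful and conscientious, very patient, and strong both hands-on and at coding; someone I should learn from.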
