Pair Assignment 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Date: 2019-11-09

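Course: 軟件工程1916|W (Software Engineering)

Assignment link: Pair Assignment 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Pair student IDs: 221600421-孔偉民 | 221600422-李東權

Assignment Body

Effort Analysis and PSP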

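PSP 2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning
• Estimate	• Estimate how long the task will take
Development	Development
• Analysis	• Requirements analysis (including learning new technologies)	200	240
• Design Spec	• Write the design document	60	60
• Design Review	• Design review	30	40
• Coding Standard	• Coding standard (define suitable conventions for the current development)
• Design	• Detailed design	60	60
• Coding	• Implementation	400	600
• Code Review	• Code review	100	200
• Test	• Testing (self-testing, fixing code, committing changes)	60	300
Reporting	Reporting
• Test Report	• Test report	60	100
• Size Measurement	• Measure the amount of work	30	30
• Postmortem & Process Improvement Plan	• Postmortem and process improvement plan	20	20
	Total	1020	1650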

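Division of Work

221600422 李東權
- Main code implementation
- Requirements analysis and discussion
- Assisted with the blog post
- Unit testing
221600421 孔偉民
- Blog post
- Crawler implementation
- Requirements analysis and discussion
- Unit testing

Basic Requirements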

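After reading the assignment, we first asked how words, characters, and separators should each be defined. Several days of discussion in the class group chat still produced no precise answer, so we started on a first version of the program. The approach is to read the input line by line, split each line into an array with a regular expression, and then walk the array to check each candidate word. Line and character counts need only a single pass over the data.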
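
The line-by-line approach can be sketched roughly as below. This is only an illustration, not the project's code: the class `BasicCountSketch` and its method names are hypothetical, the word pattern is copied from the WordCount listing later in this post, and non-empty lines are counted the same way as in LineCount.

```java
import java.util.regex.Pattern;

class BasicCountSketch {
    // Word pattern copied from the WordCount listing in this post
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{4,}[0-9]*[a-zA-Z]*");
    // Any run of non-alphanumeric characters is treated as a separator
    private static final Pattern SEPARATORS = Pattern.compile("[^a-zA-Z0-9]+");

    // Split each line on separators and count the tokens that match the word pattern
    static int countWords(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            for (String token : SEPARATORS.split(line)) {
                if (WORD.matcher(token).matches()) {
                    count++;
                }
            }
        }
        return count;
    }

    // Count lines that contain at least one character
    static int countNonEmptyLines(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Tokens such as "abc" (too short) or "123file" (starts with digits) are rejected by the pattern, while "test1" is accepted.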

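The main functionality lives in the WordCount class, while Main handles the command-line arguments. The class diagram is as follows:

WordCount has three main methods, which count characters, words, and lines. The key method, the WordCount word counter, is implemented as follows: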

Crawler

The crawler uses the Jsoup library. It first fetches the paper list from the CVPR2018 website, where the list's HTML is structured as follows:

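The link to each paper sits inside a dt tag with class ptitle, so Elements elements = doc.select(".ptitle a") selects all of them. Note that these links are relative paths without a host name, so we prepend the prefix http://openaccess.thecvf.com/ when fetching an individual paper.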
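
As a small illustration of the prefixing step, the sketch below resolves a relative link against the site root with the standard java.net.URI class. The class name `LinkResolveSketch` and the sample path in the usage note are hypothetical; the project itself may simply concatenate strings. Jsoup's Element.absUrl("href") can produce the same result when the document was loaded with a base URI.

```java
import java.net.URI;

class LinkResolveSketch {
    // Site root taken from the post; every relative link is resolved against it
    static final URI BASE = URI.create("http://openaccess.thecvf.com/");

    // Turn a relative href into an absolute URL; absolute hrefs pass through unchanged
    static String toAbsolute(String href) {
        return BASE.resolve(href).toString();
    }
}
```

For a made-up relative link such as papers/example.html, toAbsolute returns http://openaccess.thecvf.com/papers/example.html.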

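The paper detail page:

The HTML structure is again simple. The highlighted parts are what we need: the title, authors, and abstract. Calls such as doc.select("#papertitle").text() retrieve each of them.

Crawling the pages one by one turned out to be too slow, so we used multiple threads to speed it up. ExecutorService pool = Executors.newScheduledThreadPool(8); creates a thread pool, and during the traversal a task is created for each paper and submitted to the pool.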
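
The pooling pattern might look roughly like the sketch below. It is not the project's crawler: `CrawlPoolSketch`, `fetchAll`, and the `fetch` parameter are hypothetical names, the `fetch` function stands in for the real download-and-parse step, and the sketch swaps in Executors.newFixedThreadPool because nothing here needs scheduling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class CrawlPoolSketch {
    // Run fetch once per URL on a fixed-size pool and return the results in input order
    static List<String> fetchAll(List<String> urls, Function<String, String> fetch, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // waits for that task to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("crawl task failed", e);
        } finally {
            pool.shutdown(); // stop accepting tasks so the worker threads can exit
        }
    }
}
```

Collecting the Future objects first and calling get() afterwards lets all downloads run concurrently while still returning results in the order of the input list.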

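Advanced Requirements

The main program first reads the input arguments and passes them to the constructor of the CountAchieve class. CountAchieve implements line counting, character counting, and word counting, plus the phrase counting and weighting from the advanced requirements. The flowchart is as follows:

Phrase counting and weight calculation are added on top of the basic functionality. For ordering, where output must be in lexicographic order, we used TreeMap, which keeps its keys sorted automatically.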
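
A minimal sketch of how a TreeMap can hold weighted counts and still yield a ranked list is shown below. The names `WeightedCountSketch` and `topEntries` are hypothetical, and the ranking rule (higher count first, ties in lexicographic order) is an assumption about the expected output rather than something stated in this post. The weights 10 and 1 match the Title and Abstract weights used in the WordCount listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WeightedCountSketch {
    // Return up to limit entries, highest count first
    static List<Map.Entry<String, Integer>> topEntries(TreeMap<String, Integer> counts, int limit) {
        // entrySet() iterates in key order, so the list starts out lexicographically sorted
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so entries with equal counts keep their lexicographic order
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.merge("vision", 10, Integer::sum); // occurrence in a title, weight 10
        counts.merge("vision", 1, Integer::sum);  // occurrence in an abstract, weight 1
        counts.merge("image", 1, Integer::sum);
        counts.merge("depth", 1, Integer::sum);
        for (Map.Entry<String, Integer> entry : topEntries(counts, 10)) {
            System.out.println("<" + entry.getKey() + ">: " + entry.getValue());
        }
    }
}
```

Map.merge replaces the containsKey/get/put sequence with a single call that either inserts the weight or adds it to the existing count.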

Code Walkthrough

charCount and lineCount are simple to implement: reset to the start of the file stream and count while reading through it.

public int LineCount() throws IOException {
    // Count only the lines that contain at least one character
    int count = 0;
    bufferedReader.reset();
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        if (!line.isEmpty())
            count++;
    }
    return count;
}

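public int CharCount() throws IOException {   // counts "\r\n" as a single character
    int count = 0;
    bufferedReader.reset();
    int temp;
    boolean previousWasCr = false;
    while ((temp = bufferedReader.read()) != -1) {
        // Skip the '\n' of a "\r\n" pair so the pair is counted once;
        // a lone '\r' no longer swallows the character that follows it
        if (temp == '\n' && previousWasCr) {
            previousWasCr = false;
            continue;
        }
        count++;
        previousWasCr = (temp == '\r');
    }
    return count;
}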

The core function is WordCount:

@Override
    public int WordCount() throws IOException {
        int count=0;
        String line;
        bufferedReader.reset();
        StringBuffer stringBuffer=new StringBuffer();
        while ((line=bufferedReader.readLine())!=null)
            stringBuffer.append(line+"\n");
        String content=stringBuffer.toString();

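        // Split the text twice, once on separators and once on alphanumeric runs,
        // to get an array of alphanumeric tokens (1) and an array of separators (2)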
        String [] words=content.split("([^a-zA-Z0-9]|\n)+");//1
        String [] division=content.split("[a-zA-Z0-9]+");//2

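        // Determine whether the text starts with an alphanumeric token or with a separator;
        // this decides which separator index belongs to each word when counting M-word phrases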
        int whofirst=1;
        if(words.length>0&&division.length>0){
            if(content.indexOf(words[0])<content.indexOf(division[0]))
                whofirst=1;
            else
                whofirst=2;
        }
        else if(words.length>0)
            whofirst=1;
        else if(division.length>0)
            whofirst=2;


        // Holds the words of the current phrase of length M
        List<String> wordgroup=new ArrayList<>();

        // Regular expression for a word
        Pattern pattern=Pattern.compile("^[a-zA-Z]{4,}[0-9]*[a-zA-Z]*");
        Integer value=0;
        String temp="";
        int weight=1; // weight of the current section

//        for(int i=0;i<words.length;i++)
//            System.out.println(words[i].toLowerCase());

        for(int i=0,record=0;i<words.length;i++){
            //System.out.println(words[i].toLowerCase());
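            /*
            Weight handling: Title and Abstract are separate sections, so the accumulated state is cleared at each one.
            Variables:
                temp      holds the phrase built so far; with M=2 it may hold [A+]B, i.e. word, separator, word
                wordgroup is like temp, except that it holds [A+][B+], i.e. B is stored together with the separator that follows it
                record    counts how many consecutive valid words have been collected so far
             */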

            if(weightjudge){
                if(words[i].equals("Title")){
                    weight=10;
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
                else if(words[i].equals("Abstract")){
                    weight=1;
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
            }
            else{
                if(words[i].equals("Title")||words[i].equals("Abstract")){
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
            }

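            // Check whether the token is a word
            if(pattern.matcher(words[i]).matches()){
                count++;   // one more word
                words[i]=words[i].toLowerCase(); // convert to lower case
                temp+=words[i];
                record++;    // one more word in the current phrase

                // Decide whether temp needs the separator that follows this word and where that separator
                // sits in division: whofirst==1 means the text starts with a word, 2 means it starts with a separator
                if(this.wordlength>1&&record<this.wordlength){  // before the M-th word, temp also takes the separator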
                    if((whofirst==1)&&(i<division.length)){
                        temp+=division[i];
                        wordgroup.add((words[i]+division[i]));
                    }
                    else if((whofirst==2)&&((i+1)<division.length)){
                        temp+=division[i+1];
                        wordgroup.add((words[i]+division[i+1]));
                    }
                    else
                        wordgroup.add((words[i]));
                }
                else if (this.wordlength>1&&record==this.wordlength){ // at the M-th word, only wordgroup stores the trailing separator
                    if((whofirst==1)&&(i<division.length))
                        wordgroup.add((words[i]+division[i]));
                    else if((whofirst==2)&&((i+1)<division.length))
                        wordgroup.add((words[i]+division[i+1]));
                    else
                        wordgroup.add((words[i]));
                }
                if(record==this.wordlength){   // the phrase has reached length M
                    if(treeMap.containsKey(temp)) {
                        value = treeMap.get(temp) + weight;   // already present: add the weight
                        treeMap.put(temp, value);
                    }
                    else{
                        treeMap.put(temp,weight);
                    }
                    temp="";
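                    if(this.wordlength>1){
                        // For a b c d with M=3 there are two phrases, <abc> and <bcd>, so wordgroup
                        // has to retain b and c: a is dropped from wordgroup, leaving b and c, and
                        // temp is rebuilt as b+c+ (each word with its trailing separator).
                        // This is why wordgroup stores one more separator than temp.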
                        for(int x=1;x<wordgroup.size();x++)
                            temp+=wordgroup.get(x);
                        wordgroup.remove(0);
                    }
                    record--;
                }
            }
            else{
                temp="";
                record=0;
                wordgroup.clear();
            }
        }
        return count;
    }
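
For comparison, the sliding-window idea behind the phrase counting can be sketched more compactly with Matcher.find() and token offsets. This is an alternative illustration, not the authors' method: `PhraseWindowSketch` and `countPhrases` are hypothetical names, the sketch ignores the Title/Abstract markers and weights, and it reuses the word pattern from the listing above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseWindowSketch {
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");
    // Word pattern copied from the WordCount listing
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{4,}[0-9]*[a-zA-Z]*");

    // Count every run of m consecutive valid words, keeping the original separators between them
    static TreeMap<String, Integer> countPhrases(String text, int m) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        Deque<Integer> starts = new ArrayDeque<>(); // start offsets of the most recent valid words
        Matcher token = TOKEN.matcher(text);
        while (token.find()) {
            if (!WORD.matcher(token.group()).matches()) {
                starts.clear(); // an invalid token breaks the current phrase
                continue;
            }
            starts.addLast(token.start());
            if (starts.size() > m) {
                starts.removeFirst(); // slide the window forward by one word
            }
            if (starts.size() == m) {
                // The phrase spans from the first word in the window to the end of this word
                String phrase = text.substring(starts.peekFirst(), token.end()).toLowerCase();
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because each phrase is cut directly out of the original text by offset, there is no need to track a parallel array of separators or to decide which of the two arrays comes first.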

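Performance Analysis

We used the JProfiler profiling tool. It shows that most of the program's time goes to string splitting and regular-expression matching, that is, the split and match calls. The wordcount function is the program's main function and accounts for 15% of the running time.

Unit Testing

We built several sets of test data and ran unit tests with the JUnit support built into IDEA, mainly checking whether the output of charCount, wordCount, and lineCount matches what we expect. The unit test class is as follows: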

package Test;

import demo.CountAchieve;
import org.junit.Assert;
import org.junit.Test;

import static org.junit.Assert.*;

public class CountAchieveTest {

    private String load = "D:\\program\\IntellijIdeaProjects\\WordCount\\src\\Test\\";

    // List of test files
    private String[] files = {
            "test1.txt", "test2.txt", "test3.txt", "test4.txt", "test5.txt",
            "test6.txt", "test7.txt", "test8.txt", "test9.txt", "test10.txt"
    };
    // Expected outputs, one entry per test file
    private int[] chars = {
            33, 34,0, 10, 67,
            40, 32, 56,55, 19
    };
    private int[] lines = {
            2, 2,0, 0, 3,
            3, 5, 3,2, 5
    };
    private int[] words = {
            1, 1,0, 0, 5,
            4, 3,6,6, 0
    };


    @Test
    public void charCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Character count mismatch: "+files[i],chars[i],t.CharCount());
            t.CloseFile();
        }
    }

    @Test
    public void wordCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Word count mismatch: "+files[i],words[i],t.WordCount());
            t.CloseFile();
        }
    }

    @Test
    public void lineCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Line count mismatch: "+files[i],lines[i],t.LineCount());
            t.CloseFile();
        }
    }
}

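A list of test files supplies the inputs, and each test method loops over those files to check their line, character, and word counts.

Some sample test files are shown below: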