Pair Assignment 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Date: 2019-12-07


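Format description

Course this assignment belongs to: Software Engineering Practice
Where the assignment requirements are: Assignment requirements
Pair student IDs: 221600434 吳何, 221500318 陳一聰
GitHub: 吳何
Goals of this assignment:
1. Basic requirement: implement a console program that counts the frequency of words in a text file.
2. Advanced requirement: building on the basic requirement, implement a hot-word counter for top-conference papers.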

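Part 1: WordCount basic requirements

1. Approach analysis

Characters are read one at a time from the file stream; each one is checked against the requirements (per the assignment, CR, ASCII code 13, is not counted) and then counted. An alternative is to read line by line and convert each line to a char array before checking each character, but that approach drops the line terminators.
Because words have to be extracted, a regular expression is used to match words and filter out illegal characters. Each word and its frequency are stored as a key-value pair in a TreeMap. A TreeMap keeps its keys in ascending lexicographic order automatically, so all that remains is a sort that puts frequency first.
For the line count, each line is read, trim() strips the surrounding whitespace, and the line is counted if anything is left.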
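The word rule described above can be tried on a single line of text. This is a minimal sketch, assuming the same separator and word patterns that appear in the key code further down; the class name WordRuleSketch and the method validWords are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WordRuleSketch {
    // A valid word starts with at least four letters and may end with digits.
    private static final Pattern WORD = Pattern.compile("^[A-Za-z]{4,}[0-9]*$");

    // Splits a line on runs of non-alphanumeric characters and keeps the valid words, lower-cased.
    static List<String> validWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("[^A-Za-z0-9]+")) {
            if (WORD.matcher(token).matches()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "abc" is too short and "123file" starts with digits, so both are rejected.
        System.out.println(validWords("File123 abc file123 123file Hello, world2019!"));
    }
}
```

Lower-casing before the words are stored is what lets File123 and file123 count as the same key in the TreeMap.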

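2. Design process

The program is split into several functions, each implementing part of the functionality. Counting the total number of words and counting word frequencies could be merged into one function.
The planned split: charactersNum(file) counts characters, linesNum(file) counts lines, wordsNum(file) counts words and frequencies, and writeResult(infile,outfile) writes the output.
For each function, the basic functionality is implemented first; edge cases are then considered and the function is refined step by step.
While wrapping the functions into interfaces, we found that putting the total word count and the word frequencies in one function made it too coarse-grained: the output was hard to format, and neither feature could be used on its own. It was therefore split into wordsNum(file), which counts the total number of words, and wordMap(file), which stores the frequency of each word, with writeMostWords(infile,outfile) writing the result in the format <word>: num. This costs one more pass over the file but lowers the coupling, which is acceptable.

2. Key code

Counting the characters in the file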

public static int charactersNum(String filename) throws IOException  {
        int num = 0;        
        BufferedReader br = new BufferedReader(new FileReader(filename));
        int value = -1;
        while ((value = br.read()) != -1) {
            // count ASCII characters only; CR (13) is skipped so that CRLF counts as one character
            if (value > 0 && value < 128 && value != 13) {
                num ++;
            }           
        }
        br.close();     
        return num;
    }

Counting the valid lines in the file

public static int linesNum(String filename) throws IOException  {
        int num = 0;            
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String line = null;
        while ((line  = br.readLine()) != null) {
            if (line.trim().length() != 0) {
                num ++;                 
            }
        }
        br.close();     
        return num;
    }

Counting the words in the file

public static int wordsNum(String filename) throws IOException  {
        int num = 0;            
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String separator = "[^A-Za-z0-9]"; // separator: any non-alphanumeric character
        String regex = "^[A-Za-z]{4,}[0-9]*$"; // a valid word: at least four letters, optionally followed by digits
        Pattern p = Pattern.compile(regex);
        Matcher m = null;
        String line = null;
        String[] array = null;
        while ((line  = br.readLine()) != null) {
            line = line.replaceAll("[\\u4e00-\\u9fa5]", ""); // remove Chinese characters
            line = line.replaceAll(separator, " "); // replace separators with spaces
            array = line.split("\\s+"); // split on whitespace
            for (int i = 0;i<array.length;i++) { 
                m = p.matcher(array[i]);
                if (m.matches()) {
                    num++;
                }
            }
        }                       
        br.close();     
        return num;
    }

Counting the hot words (word frequencies) in the file

public static TreeMap<String, Integer> wordMap(String filename) throws IOException {    
        // First traversal approach:
        // entrySet() returns a Set of Map.Entry objects,
        // and each Map.Entry holds an element's key and value, so the set can be walked with an iterator
//      Iterator<Entry<String, Integer>> it = tm.entrySet().iterator();  
//      while (it.hasNext()) {  
//          Map.Entry<String, Integer> entry =(Map.Entry<String, Integer>) it.next();  
//          String key = entry.getKey();  
//          Integer value=entry.getValue();  
//
//          System.out.println("<" + key + ">:" + value); 
//      }
        
        TreeMap<String, Integer> tm = new TreeMap<String, Integer>();               
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String separator = "[^A-Za-z0-9]"; // separator: any non-alphanumeric character
        String regex = "^[A-Za-z]{4,}[0-9]*$"; // a valid word: at least four letters, optionally followed by digits
        Pattern p = Pattern.compile(regex);
        String str = null;
        Matcher m = null;
        String line = null;
        String[] array = null;
        while ((line  = br.readLine()) != null) {
            line = line.replaceAll("[\\u4e00-\\u9fa5]", ""); // remove Chinese characters
            line = line.replaceAll(separator, " "); // replace separators with spaces
            array = line.split("\\s+"); // split on whitespace
            for (int i = 0;i<array.length;i++) { 
                m = p.matcher(array[i]);
                if (m.matches()) {
                    str = array[i].toLowerCase();                
                    if (!tm.containsKey(str)) {
                        tm.put(str, 1);
                    } else {
                        int count = tm.get(str) + 1;
                        tm.put(str, count);
                    }
                }
            }
        }                       
        br.close();     
        return tm;
    }

Sorting and writing the words

public static void writeMostWords(String infilename,String outfilename) throws IOException {
        String outpath = new File(outfilename).getAbsolutePath();
        FileWriter fw = new FileWriter(outpath, true);
        TreeMap<String, Integer> tm = wordMap(infilename);
        if(tm != null && tm.size()>=1)
        {         
            List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(tm.entrySet());
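            // sort with a comparator
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                    // the TreeMap is ordered by key, so the list starts in ascending key order;
                    // Collections.sort is stable, so entries with equal values keep that order
                    // sort by value, descending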
                    return o2.getValue().compareTo(o1.getValue());
                }
            });   
            int i = 1;
            String key = null;
            Integer value = null;
            for (Map.Entry<String, Integer> mapping : list) {
                key = mapping.getKey();
                value = mapping.getValue();
                System.out.print("<" + key + ">: " + value + '\n');
                fw.write("<" + key + ">: " + value + '\n');
                // write only the top 10
                if (i == 10) {
                    break;
                }
                i++;
            }
        }
        fw.close();
    }
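The comparator comment above relies on two facts: the entries copied out of a TreeMap are already in ascending key order, and Collections.sort is a stable sort, so entries with equal counts keep that order. As an alternative that does not depend on the input order, the tie-break can be written into the comparator itself. This is a minimal sketch of that variant, not the code used in the project; TopWordsSketch and topN are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWordsSketch {
    // Returns up to n entries ordered by count descending, then by word ascending.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
        list.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            // Equal counts fall back to lexicographic order of the word.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(list.subList(0, Math.min(n, list.size())));
    }

    public static void main(String[] args) {
        // A HashMap is used on purpose: its iteration order is unspecified,
        // so the result depends only on the comparator.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("data", 3);
        counts.put("image", 5);
        counts.put("deep", 3);
        counts.put("learning", 5);
        for (Map.Entry<String, Integer> e : topN(counts, 3)) {
            System.out.println("<" + e.getKey() + ">: " + e.getValue());
        }
    }
}
```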

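3. Test analysis

4. Difficulties and solutions

When counting characters, the original plan was to read line by line and use line.length() directly, but the result came out lower than the test sample. The cause is that readLine() does not return the line terminator; reading one character at a time with read() solved the problem.
When sorting the words, we could not find a sorting interface on TreeMap<String,Integer> that accepts a Comparator<Entry<String, Integer>> (a TreeMap orders its entries by key only). Copying the TreeMap entries into a list and sorting the list with Collections.sort solved this.
When compiling from the command line: the project initially had a package under the SRC directory with the .java files inside it, so javac had to be run from the innermost directory that contains the .java files, and the generated .class files ended up next to the .java files. The java command, by contrast, had to be run from the SRC directory with the package name as a prefix and without a file extension, in the form java packageName.ClassName. The project later switched to the default package, which needs no package prefix.
If the project uses additional .jar files, they must be added with Build Path -> Add to Build Path so that they appear under Referenced Libraries before the project compiles and runs in the IDE. On the command line, the steps above fail with "error: package xxx does not exist". The material we found says to add the .jar path with the -classpath option or its short form -cp, but several attempts did not work, so in the end we used the IDE's built-in program-arguments tool to simulate the command-line input.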
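The readLine() difference noted at the start of this subsection can be reproduced without a file. This is a minimal sketch, assuming the text is supplied as a string through StringReader; LineTerminatorSketch, countByLines and countByChars are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class LineTerminatorSketch {
    // Sums line.length() over readLine(); line terminators are never included.
    static int countByLines(String text) throws IOException {
        int num = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                num += line.length();
            }
        }
        return num;
    }

    // Counts character by character and skips CR (13), as charactersNum does.
    static int countByChars(String text) throws IOException {
        int num = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            int value;
            while ((value = br.read()) != -1) {
                if (value > 0 && value < 128 && value != 13) {
                    num++;
                }
            }
        }
        return num;
    }

    public static void main(String[] args) throws IOException {
        String text = "hello\r\nworld\n";
        System.out.println(countByLines(text)); // 10: the two line terminators are missing
        System.out.println(countByChars(text)); // 12: each line end counts once, CR is skipped
    }
}
```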

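Part 2: WordCount advanced requirements

1. Approach analysis

jsoup is used to crawl the CVPR2018 data and save it; the statistics are then computed and written out.
commons-cli is used to accept any combination of command-line options in any order. The functions from the basic requirements are extended with a boolean parameter w that switches word weighting on; when w is on, each occurrence of a word in a Title counts 10 times. When counting characters, replaceAll("Title: |Abstract: ", "") strips the labels before counting (readLine() drops the line terminator, so the character count is incremented by 1 for every line read). When counting words, each line is matched against the regular expressions Title: .* and Abstract: .*; when a label matches, it is cut out of the string before the words are counted. When counting lines, per the requirements, lines that match the regular expression [0-9]* and blank lines are not counted.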
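The weighting rule can be illustrated on a few in-memory lines. This is a minimal sketch under stated assumptions: it detects the labels with startsWith instead of the regular expressions and deleteSubString used in the key code, it leaves out the commons-cli option parsing, and WeightedCountSketch and count are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class WeightedCountSketch {
    // A valid word starts with at least four letters and may end with digits.
    private static final Pattern WORD = Pattern.compile("^[A-Za-z]{4,}[0-9]*$");

    // Counts words in "Title: ..." and "Abstract: ..." lines.
    // With w switched on, every word found in a Title line adds 10 instead of 1.
    static Map<String, Integer> count(String[] lines, boolean w) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String text = line;
            int weight = 1;
            if (line.startsWith("Title: ")) {
                text = line.substring("Title: ".length());
                if (w) {
                    weight = 10;
                }
            } else if (line.startsWith("Abstract: ")) {
                text = line.substring("Abstract: ".length());
            }
            for (String token : text.split("[^A-Za-z0-9]+")) {
                if (WORD.matcher(token).matches()) {
                    counts.merge(token.toLowerCase(), weight, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "0",
            "Title: Deep Image Prior",
            "Abstract: A deep prior for image restoration."
        };
        System.out.println(count(lines, true));
        System.out.println(count(lines, false));
    }
}
```

Map.merge replaces the containsKey/put branch of wordMap with a single call; the resulting counts are the same.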

2. Key code

Counting the characters in the file

public static int charactersNum(String filename) throws IOException  {      
        int num = 0;            
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String regex = "[0-9]*"; // matches the paper-number lines
        String separator = "Title: |Abstract: "; // the labels "Title: " and "Abstract: " to strip
        Pattern p = Pattern.compile(regex);
        Matcher m = null;
        String line = null;
        char[] charArray = null;
        int value = -1;
        while ((line  = br.readLine()) != null) {
            num++; // also count the line terminator that readLine() drops
            line = line.replaceAll(separator, ""); // strip "Title: " and "Abstract: "
            m = p.matcher(line);
            if (line.trim().length() != 0 && !m.matches()) {
                charArray = line.toCharArray();
                for (int i = 0;i < line.length();i++) {
                    value = (int)charArray[i];
                    if (value > 0 && value < 128 && value != 13) {
                        num ++;
                    }           
                }               
            }
        }
        br.close();     
        return num;     
    }

Counting the valid lines in the file

public static int linesNum(String filename) throws IOException  {
        int num = 0;            
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String regex = "[0-9]*"; // matches the paper-number lines
        Pattern p = Pattern.compile(regex);
        Matcher m = null;
        String line = null;
        while ((line  = br.readLine()) != null) {
            m = p.matcher(line);
            if (line.trim().length() != 0 && !m.matches()) {
                num ++;                 
            }
        }
        br.close();     
        return num;
    }

Counting the words in the file

public static int wordsNum(String filename,boolean w) throws IOException  {
        int num = 0;            
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String separator = "[^A-Za-z0-9]"; // separator: any non-alphanumeric character
        String regex = "^[A-Za-z]{4,}[0-9]*$"; // a valid word: at least four letters, optionally followed by digits
        String titleRegex = "Title: .*";
        String abstractRegex = "Abstract: .*";
        Pattern p = Pattern.compile(regex);
        Pattern tp = Pattern.compile(titleRegex);
        Pattern ap = Pattern.compile(abstractRegex);
        Matcher m = null;
        Matcher titleMacher = null;
        Matcher abstractMacher = null;
        String line = null;
        String[] array = null;
        boolean intitle = false;
        while ((line  = br.readLine()) != null) {
            titleMacher = tp.matcher(line);
            abstractMacher = ap.matcher(line);
            if (titleMacher.matches()) {
                line = deleteSubString(line,"Title: ");
                intitle = true;
            }
            if (abstractMacher.matches()) {         
                line = deleteSubString(line,"Abstract: ");
            }
            line = line.replaceAll("[\\u4e00-\\u9fa5]", ""); // remove Chinese characters
            line = line.replaceAll(separator, " "); // replace separators with spaces
            array = line.split("\\s+"); // split on whitespace
            for (int i = 0;i<array.length;i++) { 
                m = p.matcher(array[i]);
                if (m.matches()) {
                    num = (w && intitle)?(num+10):(num+1);
                }
            }
            intitle = false;
        }                       
        br.close();     
        return num;
    }

Counting the hot words (word frequencies) in the file

public static TreeMap<String, Integer> wordMap(String filename,boolean w) throws IOException {       
        TreeMap<String, Integer> tm = new TreeMap<String, Integer>();               
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String separator = "[^A-Za-z0-9]"; // separator: any non-alphanumeric character
        String regex = "^[A-Za-z]{4,}[0-9]*$"; // a valid word: at least four letters, optionally followed by digits
        String titleRegex = "Title: .*";
        String abstractRegex = "Abstract: .*";
        Pattern p = Pattern.compile(regex);
        Pattern tp = Pattern.compile(titleRegex);
        Pattern ap = Pattern.compile(abstractRegex);
        Matcher m = null;
        Matcher titleMacher = null;
        Matcher abstractMacher = null;
        String str = null;
        String line = null;
        String[] array = null;
        boolean intitle = false;        
        while ((line  = br.readLine()) != null) {
            titleMacher = tp.matcher(line);
            abstractMacher = ap.matcher(line);
            if (titleMacher.matches()) {
                line = deleteSubString(line,"Title: ");
                intitle = true;
            }
            if (abstractMacher.matches()) {         
                line = deleteSubString(line,"Abstract: ");
            }
            line = line.replaceAll("[\\u4e00-\\u9fa5]", ""); // remove Chinese characters
            line = line.replaceAll(separator, " "); // replace separators with spaces
            array = line.split("\\s+"); // split on whitespace
            for (int i = 0;i<array.length;i++) { 
                m = p.matcher(array[i]);
                if (m.matches()) {
                    str = array[i].toLowerCase();                
                    if (!tm.containsKey(str)) {
                        tm.put(str, w&&intitle?10:1);
                    } else {
                        int count = tm.get(str) + (w&&intitle?10:1);
                        tm.put(str, count);
                    }
                }
            }
            intitle = false;
        }                       
        br.close();     
        return tm;
    }

Sorting and writing the words

public static void writeMostWords(String infilename,String outfilename,boolean w,int n) throws IOException {
        String outpath = new File(outfilename).getAbsolutePath();
        FileWriter fw = new FileWriter(outpath, true);
        TreeMap<String, Integer> tm = wordMap(infilename,w);
        if(tm != null && tm.size()>=1)
        {         
            List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(tm.entrySet());
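            // sort with a comparator
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
                @Override
                public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                    // the TreeMap is ordered by key, so the list starts in ascending key order;
                    // Collections.sort is stable, so entries with equal values keep that order
                    // sort by value, descending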
                    return o2.getValue().compareTo(o1.getValue());
                }
            });   
            int i = 1;
            String key = null;
            Integer value = null;
            for (Map.Entry<String, Integer> mapping : list) {
                if (n == 0) {
                    break;              
                }
                key = mapping.getKey();
                value = mapping.getValue();
                System.out.print("<" + key + ">: " + value + '\n');
                fw.write("<" + key + ">: " + value + '\n');
                // write only the top n
                if (i == n) {
                    break;
                }
                i++;
            }
        }
        fw.close();
    }

Cutting a substring out of a string

public static String deleteSubString(String str1,String str2) {
        StringBuffer sb = new StringBuffer(str1);
        while (true) {
            int index = sb.indexOf(str2);
            if(index == -1) {
                break;
            }
            sb.delete(index, index+str2.length());      
        }       
        return sb.toString();
    }

Accepting command-line options in any order and combination

public static void main(String[] args) throws Exception {       
            CommandLineParser parser = new GnuParser();
            Options options = new Options();
            options.addOption("i",true,"input file name");
            options.addOption("o",true,"output file name");
            options.addOption("w",true,"word weight");
            options.addOption("m",true,"phrase frequency statistics");
            options.addOption("n",true,"the n most frequent words or phrases");
            CommandLine commandLine = parser.parse(options, args);
            
            if (commandLine.hasOption("i") && commandLine.hasOption("o") && commandLine.hasOption("w")) {
                String infilename = commandLine.getOptionValue("i");
                String outfilename = commandLine.getOptionValue("o");
                String w = commandLine.getOptionValue("w");
                if (commandLine.hasOption("n")) {
                    String n = commandLine.getOptionValue("n");
                    if (isNumeric(n)) {                     
                        if (w.equals("1")) {
                            writeResult(infilename,outfilename,true,Integer.valueOf(n));
                        }
                        else  {
                            writeResult(infilename,outfilename,false,Integer.valueOf(n));
                        }
                    }
                    else {
                        System.out.println("-n [0<=number<=100]");
                    }
                }
                else {
                    if (w.equals("1")) {
                        writeResult(infilename,outfilename,true,10);
                    }
                    else  {
                        writeResult(infilename,outfilename,false,10);
                    }
                }
            }
            else {
                System.out.print("The -i, -o and -w options and their arguments are required");
            }
    }

Other functions

public static boolean isNumeric(String str){ 
        // require at least one digit so that an empty value is rejected before Integer.valueOf
        Pattern pattern = Pattern.compile("[0-9]+"); 
        return pattern.matcher(str).matches(); 
    }
    
    public static void initTxt(String string) throws IOException {
        String path = new File(string).getAbsolutePath();
        FileWriter fw = new FileWriter(path, false);
        fw.write("");
        fw.flush();
        fw.close();
    }
      public static void writeResult(String infilename,String outfilename,boolean w,int n) throws IOException {
            File file = new File(infilename);
            if (file.exists()) {    
                initTxt(outfilename);
                String outpath = new File(outfilename).getAbsolutePath();
                FileWriter fw = new FileWriter(outpath, true);
                int charactersNum = charactersNum(infilename);
                int wordsNum = wordsNum(infilename,w);
                int linesNum = linesNum(infilename);
                System.out.print("characters: " + charactersNum + '\n');
                System.out.print("words: " + wordsNum + '\n');
                System.out.print("lines: " + linesNum + '\n');
                fw.write("characters: " + charactersNum + '\n');
                fw.write("words: " + wordsNum + '\n');
                fw.write("lines: " + linesNum + '\n');
                fw.flush();
                writeMostWords(infilename,outfilename,w,n);
                if (fw != null) {
                    fw.close();
                }
            }
            else {
                System.out.println(infilename + ": file does not exist!");
            }
        }
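The usage message printed by main mentions the range 0<=number<=100 for -n, but isNumeric only checks that the value consists of digits. One possible way to enforce the range is sketched below; this is an illustration under that assumption, not part of the submitted program, and TopNOptionSketch and parseTopN are hypothetical names.

```java
import java.util.OptionalInt;

class TopNOptionSketch {
    // Returns the parsed value when str is a decimal integer in [0, 100], otherwise empty.
    static OptionalInt parseTopN(String str) {
        // One to three digits keeps Integer.parseInt well inside the int range.
        if (str == null || !str.matches("[0-9]{1,3}")) {
            return OptionalInt.empty();
        }
        int n = Integer.parseInt(str);
        return n <= 100 ? OptionalInt.of(n) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(parseTopN("10").isPresent());  // a value inside the range
        System.out.println(parseTopN("").isPresent());    // an empty value is rejected
        System.out.println(parseTopN("101").isPresent()); // a value above the range is rejected
    }
}
```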

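3. Test analysis

4. Difficulties and solutions

The crawled paper data contained non-ASCII characters, which made the display and the statistics incorrect; filtering with the regular expression [^\\x00-\\xff] solved this. Crawling with a single thread was too slow; multithreaded crawling has already been worked out by this classmate.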
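One detail of that pattern is worth checking: [^\\x00-\\xff] removes only characters above U+00FF, so Latin-1 characters such as é survive, while a strict ASCII filter would use [^\\x00-\\x7f]. This is a minimal sketch comparing the two; NonAsciiFilterSketch and its method names are hypothetical.

```java
class NonAsciiFilterSketch {
    // Removes every character above U+00FF (the pattern quoted in the text).
    static String stripAboveLatin1(String s) {
        return s.replaceAll("[^\\x00-\\xff]", "");
    }

    // Removes every character above U+007F, leaving pure ASCII.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7f]", "");
    }

    public static void main(String[] args) {
        // "café", an en dash, two Chinese characters and "ok", written with Unicode escapes.
        String s = "caf\u00e9 \u2013 \u4e2d\u6587 ok";
        System.out.println(stripAboveLatin1(s)); // keeps the é
        System.out.println(stripNonAscii(s));    // drops the é as well
    }
}
```

Which of the two is appropriate depends on whether characters in the range U+0080 to U+00FF should reach the character counter, which itself counts only values below 128.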

Part 3: Reflections and evaluation

221600434吳何

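Although what had to be implemented was simple, it still took a lot of time. Small details kept getting in the way, and solving them meant looking up a lot of material, which broke the flow of coding; spending a long time on a problem that still was not solved was simply torture. Fortunately, every idea was implemented in the end. I also picked up extra knowledge, such as using the jsoup crawler library, managing code with GitHub, and unit testing.
I have to say that my own comprehension still feels lacking: it took me quite a while to roughly understand what had to be implemented.
Evaluation of my teammate: strongly motivated to learn, happy to communicate, and eager for knowledge.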

221500318陳一聰

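When all the problems were finally solved, seeing the assignment I had submitted gave me a real sense of achievement. The things I did not understand were resolved one by one, and I learned a lot.
It took a lot of time to finish, and I feel I still have plenty of room to improve. I am also glad to have had a teammate who helped me understand the task and write the code.
Evaluation of my teammate: strong programming ability, patient, and driven.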

PSP

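PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning
• Estimate	• Estimate how much time the task will take	610	630
Development	Development
• Analysis	• Requirements analysis (including learning new technologies)	30	90
• Design Spec	• Write the design document	20	10
• Design Review	• Design review	20	10
• Coding Standard	• Coding standard (set suitable conventions for the current development)	20	20
• Design	• Detailed design	120	250
• Coding	• Implementation	640	720
• Code Review	• Code review	30	30
• Test	• Testing (self-test, fix the code, commit the changes)	40	60
Reporting	Reporting
• Test Report	• Test report	10	15
• Size Measurement	• Measure the workload	15	10
• Postmortem & Process Improvement Plan	• Postmortem and process improvement plan	30	25
	Total	975	1240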