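The program is split into several functions, each implementing one part of the functionality; counting the total number of words and counting word frequencies could be merged into a single function.
The planned split was: charactersNum(file) for character counting, linesNum(file) for line counting, wordsNum(file) for word and frequency counting, and writeResult(infile,outfile) for output.
For each function, the basic functionality is implemented first, then edge cases are considered, and the function is optimized step by step.
When wrapping the functions behind an interface, it turned out that putting the total word count and the word frequencies in one function was too coarse-grained: the output was hard to format, and neither feature could be used on its own. The function was therefore split into wordsNum(file), which counts the total number of words, wordMap(file), which stores the frequency of each word, and writeMostWords(file), which writes the result in the <word>: num format. This costs one extra read of the file, but it lowers the coupling, which is an acceptable trade-off.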
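
    // Needs imports: java.io.BufferedReader, java.io.FileReader, java.io.IOException
    // Counts ASCII characters in the file; carriage returns (13) are skipped.
    public static int charactersNum(String filename) throws IOException {
        int num = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            int value;
            while ((value = br.read()) != -1) {
                if (value > 0 && value < 128 && value != 13) {
                    num++;
                }
            }
        }
        return num;
    }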

    // Counts the lines that contain at least one non-whitespace character.
    public static int linesNum(String filename) throws IOException {
        int num = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().length() != 0) {
                    num++;
                }
            }
        }
        return num;
    }

    // Needs imports: java.util.regex.Matcher, java.util.regex.Pattern
    // Counts valid words: at least four letters, optionally followed by digits.
    public static int wordsNum(String filename) throws IOException {
        int num = 0;
        String separator = "[^A-Za-z0-9]";      // every non-alphanumeric character is a separator
        String regex = "^[A-Za-z]{4,}[0-9]*$";  // decides whether a token is a valid word
        Pattern p = Pattern.compile(regex);
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.replaceAll("[(\\u4e00-\\u9fa5)]", ""); // strip Chinese characters
                line = line.replaceAll(separator, " ");            // replace separators with spaces
                String[] array = line.split("\\s+");               // split on whitespace
                for (int i = 0; i < array.length; i++) {
                    Matcher m = p.matcher(array[i]);
                    if (m.matches()) {
                        num++;
                    }
                }
            }
        }
        return num;
    }

    // Needs import: java.util.TreeMap
    // Builds a map from each valid word (lower-cased) to its frequency.
    // The result can be traversed with an Iterator over tm.entrySet(),
    // reading the key and the value of each Map.Entry.
    public static TreeMap<String, Integer> wordMap(String filename) throws IOException {
        TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
        String separator = "[^A-Za-z0-9]";      // every non-alphanumeric character is a separator
        String regex = "^[A-Za-z]{4,}[0-9]*$";  // decides whether a token is a valid word
        Pattern p = Pattern.compile(regex);
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.replaceAll("[(\\u4e00-\\u9fa5)]", ""); // strip Chinese characters
                line = line.replaceAll(separator, " ");            // replace separators with spaces
                String[] array = line.split("\\s+");               // split on whitespace
                for (int i = 0; i < array.length; i++) {
                    if (p.matcher(array[i]).matches()) {
                        String str = array[i].toLowerCase();
                        if (!tm.containsKey(str)) {
                            tm.put(str, 1);
                        } else {
                            tm.put(str, tm.get(str) + 1);
                        }
                    }
                }
            }
        }
        return tm;
    }

    // Needs imports: java.io.File, java.io.FileWriter, java.util.ArrayList,
    //                java.util.Collections, java.util.List, java.util.Map
    // Prints the 10 most frequent words and appends them to the output file.
    public static void writeMostWords(String infilename, String outfilename) throws IOException {
        String outpath = new File(outfilename).getAbsolutePath();
        try (FileWriter fw = new FileWriter(outpath, true)) {
            TreeMap<String, Integer> tm = wordMap(infilename);
            if (tm != null && tm.size() >= 1) {
                List<Map.Entry<String, Integer>> list =
                        new ArrayList<Map.Entry<String, Integer>>(tm.entrySet());
                // A TreeMap keeps its keys in ascending lexicographic order, so the list starts
                // out sorted by key. Collections.sort is stable, so words with equal counts keep
                // that order and only the descending sort by value is needed.
                Collections.sort(list, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));
                int i = 1;
                for (Map.Entry<String, Integer> mapping : list) {
                    String key = mapping.getKey();
                    Integer value = mapping.getValue();
                    System.out.print("<" + key + ">: " + value + '\n');
                    fw.write("<" + key + ">: " + value + '\n');
                    // output only the first 10 entries
                    if (i == 10) {
                        break;
                    }
                    i++;
                }
            }
        }
    }

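The ordering used by writeMostWords can be tried without any file I/O. The following is a minimal sketch, not the project's code: the class name TopWordsSketch and the helper topN are hypothetical, and the frequency map is built by hand. It relies on the fact that List.sort is a stable sort, so entries with equal counts stay in the TreeMap's ascending key order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopWordsSketch {
    // Hypothetical helper: returns up to n "<word>: count" lines, highest count first.
    static List<String> topN(TreeMap<String, Integer> freq, int n) {
        // entrySet() of a TreeMap iterates in ascending key order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        // Stable sort by descending count; ties keep the key order from above.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (out.size() == n) {
                break;
            }
            out.add("<" + e.getKey() + ">: " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        freq.put("word", 2);
        freq.put("count", 3);
        freq.put("file", 2);
        for (String line : topN(freq, 2)) {
            System.out.println(line);
        }
    }
}
```

With the sample map, "file" and "word" both have a count of 2, and "file" is listed first because it sorts first alphabetically.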
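jsoup is used to crawl the CVPR2018 data and save it; the statistics are then computed on the saved file and written out.
commons-cli is used so that the command-line options can be given in any combination and in any order. The functions from the basic requirements are extended with a boolean parameter w that decides whether word weighting is enabled, and the weighted statistics are computed accordingly. With w enabled, each word in a Title line counts 10 times. When counting characters, replaceAll("Title: |Abstract: ", "") removes the tags before counting (readLine() drops the line terminator at the end of each line, so the character count is incremented by 1 for every line read). When counting words, the regular expressions Title: .* and Abstract: .* are used to detect the tags, which are then cut out of the string before the words are counted. When counting lines, as the requirements specify, lines that match the regular expression [0-9]* and blank lines are not counted.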
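The weighting rule described above can be illustrated on in-memory lines. This is a simplified sketch under stated assumptions: the class WeightedCountSketch and the method count are hypothetical names, the input is a list of strings instead of a file, and only a leading "Title: " or "Abstract: " tag is stripped (the project's deleteSubString removes every occurrence in the line).

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Pattern;

class WeightedCountSketch {
    // Same word rule as in the write-up: at least four letters, then optional digits.
    private static final Pattern WORD = Pattern.compile("^[A-Za-z]{4,}[0-9]*$");

    // Hypothetical helper: word frequencies, with Title words counted 10 times when weighted.
    static TreeMap<String, Integer> count(List<String> lines, boolean weighted) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        for (String raw : lines) {
            boolean inTitle = raw.startsWith("Title: ");
            String line = raw;
            if (inTitle) {
                line = raw.substring("Title: ".length());
            } else if (raw.startsWith("Abstract: ")) {
                line = raw.substring("Abstract: ".length());
            }
            int weight = (weighted && inTitle) ? 10 : 1;
            // Every run of non-alphanumeric characters separates two tokens.
            for (String token : line.split("[^A-Za-z0-9]+")) {
                if (WORD.matcher(token).matches()) {
                    freq.merge(token.toLowerCase(), weight, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Title: Deep Learning for Vision",
                "Abstract: We study deep learning with vision2018 data.");
        System.out.println(count(lines, true));
        System.out.println(count(lines, false));
    }
}
```

In the sample, "deep" appears once in the title and once in the abstract, so it scores 11 with weighting and 2 without; "for" and "We" are too short to count as words.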

    // Counts ASCII characters after removing the "Title: " and "Abstract: " tags.
    public static int charactersNum(String filename) throws IOException {
        int num = 0;
        String separator = "Title: |Abstract: ";  // tags to filter out
        Pattern p = Pattern.compile("[0-9]*");     // matches numeric index lines
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                num++; // also count the line terminator that readLine() drops
                line = line.replaceAll(separator, ""); // remove "Title: " and "Abstract: "
                Matcher m = p.matcher(line);
                if (line.trim().length() != 0 && !m.matches()) {
                    for (int i = 0; i < line.length(); i++) {
                        int value = line.charAt(i);
                        if (value > 0 && value < 128 && value != 13) {
                            num++;
                        }
                    }
                }
            }
        }
        return num;
    }

    // Counts lines, skipping blank lines and numeric index lines.
    public static int linesNum(String filename) throws IOException {
        int num = 0;
        Pattern p = Pattern.compile("[0-9]*"); // matches numeric index lines
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (line.trim().length() != 0 && !m.matches()) {
                    num++;
                }
            }
        }
        return num;
    }

    // Counts valid words; with w enabled, each word in a Title line counts 10 times.
    public static int wordsNum(String filename, boolean w) throws IOException {
        int num = 0;
        String separator = "[^A-Za-z0-9]";      // every non-alphanumeric character is a separator
        String regex = "^[A-Za-z]{4,}[0-9]*$";  // decides whether a token is a valid word
        Pattern p = Pattern.compile(regex);
        Pattern tp = Pattern.compile("Title: .*");
        Pattern ap = Pattern.compile("Abstract: .*");
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                boolean intitle = false;
                if (tp.matcher(line).matches()) {
                    line = deleteSubString(line, "Title: ");
                    intitle = true;
                }
                if (ap.matcher(line).matches()) {
                    line = deleteSubString(line, "Abstract: ");
                }
                line = line.replaceAll("[(\\u4e00-\\u9fa5)]", ""); // strip Chinese characters
                line = line.replaceAll(separator, " ");            // replace separators with spaces
                String[] array = line.split("\\s+");               // split on whitespace
                for (int i = 0; i < array.length; i++) {
                    if (p.matcher(array[i]).matches()) {
                        num = (w && intitle) ? (num + 10) : (num + 1);
                    }
                }
            }
        }
        return num;
    }

    // Builds the word-frequency map; with w enabled, Title words are weighted by 10.
    public static TreeMap<String, Integer> wordMap(String filename, boolean w) throws IOException {
        TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
        String separator = "[^A-Za-z0-9]";      // every non-alphanumeric character is a separator
        String regex = "^[A-Za-z]{4,}[0-9]*$";  // decides whether a token is a valid word
        Pattern p = Pattern.compile(regex);
        Pattern tp = Pattern.compile("Title: .*");
        Pattern ap = Pattern.compile("Abstract: .*");
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                boolean intitle = false;
                if (tp.matcher(line).matches()) {
                    line = deleteSubString(line, "Title: ");
                    intitle = true;
                }
                if (ap.matcher(line).matches()) {
                    line = deleteSubString(line, "Abstract: ");
                }
                line = line.replaceAll("[(\\u4e00-\\u9fa5)]", ""); // strip Chinese characters
                line = line.replaceAll(separator, " ");            // replace separators with spaces
                String[] array = line.split("\\s+");               // split on whitespace
                int weight = (w && intitle) ? 10 : 1;
                for (int i = 0; i < array.length; i++) {
                    if (p.matcher(array[i]).matches()) {
                        String str = array[i].toLowerCase();
                        if (!tm.containsKey(str)) {
                            tm.put(str, weight);
                        } else {
                            tm.put(str, tm.get(str) + weight);
                        }
                    }
                }
            }
        }
        return tm;
    }

    // Prints the n most frequent words and appends them to the output file.
    public static void writeMostWords(String infilename, String outfilename, boolean w, int n)
            throws IOException {
        String outpath = new File(outfilename).getAbsolutePath();
        try (FileWriter fw = new FileWriter(outpath, true)) {
            TreeMap<String, Integer> tm = wordMap(infilename, w);
            if (tm != null && tm.size() >= 1) {
                List<Map.Entry<String, Integer>> list =
                        new ArrayList<Map.Entry<String, Integer>>(tm.entrySet());
                // A TreeMap keeps its keys in ascending lexicographic order, so the list starts
                // out sorted by key. Collections.sort is stable, so words with equal counts keep
                // that order and only the descending sort by value is needed.
                Collections.sort(list, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));
                int i = 1;
                for (Map.Entry<String, Integer> mapping : list) {
                    if (n == 0) {
                        break;
                    }
                    String key = mapping.getKey();
                    Integer value = mapping.getValue();
                    System.out.print("<" + key + ">: " + value + '\n');
                    fw.write("<" + key + ">: " + value + '\n');
                    // output only the first n entries
                    if (i == n) {
                        break;
                    }
                    i++;
                }
            }
        }
    }

    // Removes every occurrence of str2 from str1.
    public static String deleteSubString(String str1, String str2) {
        StringBuffer sb = new StringBuffer(str1);
        while (true) {
            int index = sb.indexOf(str2);
            if (index == -1) {
                break;
            }
            sb.delete(index, index + str2.length());
        }
        return sb.toString();
    }

    // Needs imports from org.apache.commons.cli: CommandLine, CommandLineParser, GnuParser, Options
    public static void main(String[] args) throws Exception {
        CommandLineParser parser = new GnuParser();
        Options options = new Options();
        options.addOption("i", true, "input file name");
        options.addOption("o", true, "output file name");
        options.addOption("w", true, "word weighting");
        options.addOption("m", true, "phrase frequency statistics");
        options.addOption("n", true, "the n most frequent words or phrases");
        CommandLine commandLine = parser.parse(options, args);
        if (!(commandLine.hasOption("i") && commandLine.hasOption("o")
                && commandLine.hasOption("w"))) {
            System.out.print("The -i, -o and -w options and their arguments are required");
            return;
        }
        String infilename = commandLine.getOptionValue("i");
        String outfilename = commandLine.getOptionValue("o");
        boolean w = "1".equals(commandLine.getOptionValue("w")); // "1" enables weighting
        int n = 10; // default when -n is not given
        if (commandLine.hasOption("n")) {
            String value = commandLine.getOptionValue("n");
            if (!isNumeric(value)) {
                System.out.println("-n [0<=number<=100]");
                return;
            }
            n = Integer.valueOf(value);
        }
        writeResult(infilename, outfilename, w, n);
    }

    // Returns true if str consists only of digits. The pattern uses + instead of *
    // so that an empty string is rejected before it reaches Integer.valueOf.
    public static boolean isNumeric(String str) {
        Pattern pattern = Pattern.compile("[0-9]+");
        return pattern.matcher(str).matches();
    }

    // Truncates the output file (creating it if necessary).
    public static void initTxt(String string) throws IOException {
        String path = new File(string).getAbsolutePath();
        try (FileWriter fw = new FileWriter(path, false)) {
            fw.write("");
            fw.flush();
        }
    }

    // Writes the character, word and line counts, followed by the n most frequent words.
    public static void writeResult(String infilename, String outfilename, boolean w, int n)
            throws IOException {
        File file = new File(infilename);
        if (!file.exists()) {
            System.out.println(infilename + " does not exist!");
            return;
        }
        initTxt(outfilename);
        String outpath = new File(outfilename).getAbsolutePath();
        int charactersNum = charactersNum(infilename);
        int wordsNum = wordsNum(infilename, w);
        int linesNum = linesNum(infilename);
        System.out.print("characters: " + charactersNum + '\n');
        System.out.print("words: " + wordsNum + '\n');
        System.out.print("lines: " + linesNum + '\n');
        // Close this writer before writeMostWords opens its own writer on the same file.
        try (FileWriter fw = new FileWriter(outpath, true)) {
            fw.write("characters: " + charactersNum + '\n');
            fw.write("words: " + wordsNum + '\n');
            fw.write("lines: " + linesNum + '\n');
        }
        writeMostWords(infilename, outfilename, w, n);
    }

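The crawled paper data contained non-ASCII characters, which made the display and the statistics incorrect; filtering the text with the regular expression [^\\x00-\\xff] solved this. Crawling with a single thread was too slow; a classmate has already solved multithreaded crawling.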
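The effect of that filter can be checked on a short string. This is a minimal sketch with a hypothetical class NonLatin1FilterSketch and method strip; how the project applies the pattern to the crawled text is not shown in the write-up. Note that the range \x00-\xff covers Latin-1, so accented letters such as é are kept; a strictly ASCII-only filter would use [^\\x00-\\x7f] instead, which is an assumption and not what the write-up states.

```java
class NonLatin1FilterSketch {
    // Hypothetical helper: removes every character outside the range U+0000..U+00FF.
    static String strip(String s) {
        return s.replaceAll("[^\\x00-\\xff]", "");
    }

    public static void main(String[] args) {
        // \u2013 is an en dash; \u4e2d\u6587 are two Chinese characters.
        System.out.println(strip("deep\u2013learning\u4e2d\u6587"));
        // \u00e9 (e with acute accent) is inside the range and survives the filter.
        System.out.println(strip("caf\u00e9"));
    }
}
```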
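Although the required features are simple, they still took quite a lot of time. Small details sometimes got in the way, and solving them meant looking up a lot of material, which disrupted the flow of coding. Spending a long time on a problem that still was not solved was simply torture. Fortunately, every idea was implemented in the end. I also learned some extra things, such as using the crawler tool jsoup, managing code with GitHub, and unit testing.
I have to admit that my comprehension still falls short: it took me quite a while to roughly understand what had to be implemented.
Evaluation of my teammate: strongly motivated to learn, happy to communicate, and eager for knowledge.
When all the problems were finally solved, seeing the finished assignment I had submitted gave me a real sense of accomplishment. The problems I did not understand were resolved one by one, and I learned a lot.
It took a long time to finish, and I feel I still have a lot of room to improve. I am also glad to have had a teammate who helped me understand the task and write the code.
Evaluation of my teammate: strong programming skills, patient, and driven.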
PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
---|---|---|---|
Planning | Planning | | |
• Estimate | • Estimate how long the task will take | 610 | 630 |
Development | Development | | |
• Analysis | • Requirements analysis (including learning new technologies) | 30 | 90 |
• Design Spec | • Write the design document | 20 | 10 |
• Design Review | • Design review | 20 | 10 |
• Coding Standard | • Coding standard (define suitable conventions for the current development) | 20 | 20 |
• Design | • Detailed design | 120 | 250 |
• Coding | • Coding | 640 | 720 |
• Code Review | • Code review | 30 | 30 |
• Test | • Testing (self-testing, fixing code, committing changes) | 40 | 60 |
Reporting | Reporting | | |
• Test Report | • Test report | 10 | 15 |
• Size Measurement | • Measure the workload | 15 | 10 |
• Postmortem & Process Improvement Plan | • Postmortem and process improvement plan | 30 | 25 |
Total | | 975 | 1240 |