Pair Project 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Date: 2019-11-09

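Assignment link: Pair Project 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements
Team members
- 221500201_孫文慈: code testing, uploading the program to the GitHub repository, writing documentation, researching references, schedule planning
- 226100125_劉傑: writing the main WordCount program for the basic and advanced requirements, describing the approach, design and implementation, schedule planning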

- GitHub
- Basic requirements (https://github.com/swc221500201/PairProject1-C/)
- Advanced requirements (https://github.com/swc221500201/PairProject2-C/)

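PSP table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning
Estimate	Estimate how long the task will take	1420
Development	Development
Analysis	Requirements analysis (including learning new technologies)	60	50
Design Spec	Produce the design document	100	200
Design Review	Design review	60	80
Coding Standard	Coding standard (define suitable conventions for this development)	120	180
Design	Detailed design	120	170
Coding	Implementation	660	720
Code Review	Code review	60	120
Test	Testing (self-test, fix code, commit changes)	90	100
Reporting	Reporting
Test Report	Test report	60	90
Size Measurement	Size measurement	60	70
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	30	60
	Total	1420	1840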

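Approach

Our first reaction on reading the assignment was to use C++ file streams to read the text and split it into words one by one, and we quickly had source code that ran. For the follow-up word counting and sorting we first wrote the routines by hand. They worked, but once the text grew beyond about 1 MB they clearly could not keep up. After reading several related posts on CSDN and cnblogs, we found that a hash table is a good choice for this kind of counting, and that the C++ map container fits the job well. Note that std::map is not itself a hash table: it is an ordered container, typically implemented as a red-black tree, so insertion and lookup take logarithmic time and the keys stay sorted. Switching to it made the program much faster.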
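
To make the hash-table idea concrete, here is a minimal sketch that swaps in std::unordered_map (an actual hash table) for counting and sorts only once at the end. It is not the code from our repository: the function name topWords, the whitespace tokenization and the tie-breaking rule are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: count whitespace-separated tokens with a hash table,
// then return the `limit` most frequent ones (ties broken alphabetically).
std::vector<std::pair<std::string, int>> topWords(const std::string& text, std::size_t limit) {
    std::unordered_map<std::string, int> freq;  // average O(1) insert and lookup, no ordering
    std::istringstream in(text);
    std::string token;
    while (in >> token)
        ++freq[token];                          // operator[] starts a new key at 0

    // Copy into a vector and sort once: by count descending, then by word.
    std::vector<std::pair<std::string, int>> sorted(freq.begin(), freq.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    if (sorted.size() > limit) sorted.resize(limit);
    return sorted;
}
```

Calling topWords(text, 10) would give the ten most frequent tokens. std::map gives the same counts with the keys already in dictionary order, which is what the code later in this post relies on.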

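Implementation

After settling on how to implement the basic and advanced requirements:

For the basic requirements: to keep the functionality as independent as possible, that is, to package it as a "Core" module, we implemented the three main functions the assignment asks for, plus a few auxiliary bool predicate functions. Because the functionality is fairly simple, it all fits in one source file.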

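int countCharacter(string f);// f: path of the file to count; returns the number of characters as an int
int countLine(string f);// f: path of the file to count; returns the number of lines as an int
int countWords(string f);// f: path of the file to count; returns the number of words as an int
void sortwordCount(string f, string resultTxt);// sorts the words of file f and writes them to the file resultTxt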

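For the advanced requirements: compared with the basic features, the advanced ones add command-line operation, word-frequency statistics, weighted statistics and similar requirements, so overall we rewrote some functionality on top of the basic functions to meet them.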

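void countWordsWithWeight(string f, string resultTxt, int w);// weighted counting; f: path of the file to count,
// w: weight switch; the output is written to the file resultTxt
void countGroupWordsWithLength(string f, string resultTxt, int n);// f: path of the file to count; the output is written to the file resultTxt; n: user-defined phrase length
.....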

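Below is the main pattern used when running from the command line: check each argv[i] in turn and perform the corresponding operation.

for (int i = 0; i < argc; ++i) {
    if (string(argv[i]) == "-i"){
        // handle the -i option
    }
    else if (/* next option */) {
    }
    ......
}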
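
A fuller version of this loop might look like the sketch below. Only the "-i" flag appears in the skeleton above; the "-o", "-w" and "-n" flag names, the Options struct and the parseArgs function are assumptions made up for this example, not the actual interface of our program.

```cpp
#include <exception>
#include <string>

// Hypothetical option set. Only "-i" comes from the post; every other
// flag name and field here is an assumption for illustration.
struct Options {
    std::string input;    // value of -i
    std::string output;   // value of -o (assumed flag)
    int weight = 0;       // value of -w (assumed flag): 0 = off, 1 = on
    int phraseLen = 1;    // value of -n (assumed flag)
    bool ok = true;       // false if the arguments could not be parsed
};

Options parseArgs(int argc, const char* const argv[]) {
    Options opt;
    try {
        for (int i = 1; i < argc; ++i) {           // argv[0] is the program name
            const std::string flag = argv[i];
            if (i + 1 >= argc) { opt.ok = false; break; }  // every flag takes a value
            const std::string value = argv[++i];
            if (flag == "-i") opt.input = value;
            else if (flag == "-o") opt.output = value;
            else if (flag == "-w") opt.weight = std::stoi(value);
            else if (flag == "-n") opt.phraseLen = std::stoi(value);
            else opt.ok = false;                   // unknown flag
        }
    } catch (const std::exception&) {              // std::stoi failed on a non-number
        opt.ok = false;
    }
    if (opt.input.empty()) opt.ok = false;         // an input file is required
    return opt;
}
```

From main one would call parseArgs(argc, argv) and check opt.ok before dispatching to the counting functions. Reading each flag together with its value keeps the loop body flat instead of a long chain of nested checks.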

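For the crawler: we originally planned to implement the crawler with C++ libraries, but the amount of code turned out to be huge, so we switched to Java's jsoup. It fetches the whole HTML document and builds a DOM that can be modified. In the DOM we used regular-expression matching to find the href of each paper, then sent a request to each of those links in turn and parsed the data we needed, such as .ptitle, from the returned text. Overall it was not very hard, although it was difficult to spot the pattern when analyzing the DOM structure ourselves. Fortunately the site has no anti-crawler traps, and no complex operations such as pagination are needed.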
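
The href-matching step can be sketched in C++ with std::regex, for readers who want to see the idea without jsoup. This is not the crawler we used (that one is the Java program shown later); the function name extractLinks is hypothetical, and the pattern assumes hrefs are written with double quotes inside <a> tags.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the href value of every <a ... href="..."> tag.
// Assumes double-quoted attribute values; real-world HTML is better handled
// by a proper parser such as jsoup.
std::vector<std::string> extractLinks(const std::string& html) {
    static const std::regex hrefPattern(R"re(<a\s+[^>]*href="([^"]*)")re",
                                        std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it)
        links.push_back((*it)[1].str());   // capture group 1 is the URL
    return links;
}
```

Each returned link would then be prefixed with the site root and requested in turn, as described above.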

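Ideas for improvement

Crawler

The crawler is too slow: fetching all 900-odd paper records takes nearly a minute. Using multiple threads should speed it up considerably, but we did not implement this for lack of time.

Basic features

In fact, I think the three functions for counting lines, words and characters could be merged into one, which would save two extra passes over the file during analysis. However, the teaching assistant said they are kept separate so that the features stay independent.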
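
A single-pass version could look like the sketch below. It is only an illustration of the idea, not code from our repository: the Counts struct and the countAll and countText names are hypothetical, whitespace is counted as a character, and words are plain whitespace-separated tokens with no isword() filter applied.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for the merged function.
struct Counts {
    int characters = 0;  // ASCII characters, whitespace included
    int lines = 0;       // non-empty lines
    int words = 0;       // whitespace-separated tokens (no word filter)
};

// Read the stream once and update all three counters per character.
Counts countAll(std::istream& in) {
    Counts c;
    bool inWord = false;
    bool lineHasContent = false;
    char ch;
    while (in.get(ch)) {
        const unsigned char u = static_cast<unsigned char>(ch);
        if (u <= 127) ++c.characters;
        if (ch == '\n') {                 // end of line: count it if it was not empty
            if (lineHasContent) ++c.lines;
            lineHasContent = false;
        } else if (ch != '\r') {
            lineHasContent = true;
        }
        if (std::isspace(u)) {
            inWord = false;
        } else if (!inWord) {             // first character of a new token
            inWord = true;
            ++c.words;
        }
    }
    if (lineHasContent) ++c.lines;        // last line without a trailing newline
    return c;
}

// Convenience wrapper for in-memory text.
Counts countText(const std::string& text) {
    std::istringstream in(text);
    return countAll(in);
}
```

Passing an std::ifstream to countAll reads the file once instead of three times, at the cost of coupling the three statistics together.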

Advanced features

The advanced features build on the basic ones, so we still think the basic features are what needs improving.

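Code walkthrough

Basic requirements

// f: path of the file whose characters are counted
// returns: the number of characters
// c >= 0 && c <= 127 is tested directly because the crawled file contains French characters and calling isascii(c) on them raised an error
int countCharacter(string f) {
    int ascii = 0;
    ifstream read;
    read.open(f, ios::in);
    char c;
    // Loop on the read itself: testing eof() before reading would process the last character twice.
    // Note that >> skips whitespace; use read.get(c) instead if spaces and newlines should count.
    while (read >> c) {
        if (c >= 0 && c <= 127)
            ascii++;
    }
    read.close();
    return ascii;
}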

// f: path of the file to count
// returns: the number of lines
// eachline.empty() is used to skip empty lines
int countLine(string f) {
    ifstream input(f, ios::in);
    string eachline;
    int line = 0;
    while (getline(input, eachline))
    {
        if (!eachline.empty())
            line++;
    }
    input.close();
    return line;
}

// f: path of the file to count
// returns: the number of words
// isword() is a custom predicate that filters out tokens that do not qualify as words under the assignment rules
int countWords(string f) {
    int wordNum = 0;
    ifstream input;
    input.open(f, ios::in);
    string aline;

    while (getline(input, aline))
    {
        // string::size_type, the standard library's own type, is the safest way to hold a string size or position
        string::size_type start = 0;
        while (start <= aline.size())
        {
            string::size_type end = aline.find_first_of(" ", start);// a space is the word separator
            if (end == string::npos) // npos: no further space, so the last token runs to the end of the line
                end = aline.size();
            string content = aline.substr(start, end - start);
            if (isword(content))// count only tokens that qualify as words
                wordNum++;
            start = end + 1;
        }

    }
    input.close();
    return wordNum;
}
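
The helpers isword() and tolowerString() are used throughout but not listed in this post. The sketch below shows what they might look like. The word rule in it is an assumption made for illustration (at least four leading letters, followed only by letters or digits); the real rule is whatever the assignment specifies.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// ASSUMED rule, not taken from the post: a word starts with at least four
// ASCII letters and contains only ASCII letters or digits after that.
bool isword(const std::string& s) {
    if (s.size() < 4) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast to unsigned char: passing a negative char to isalpha/isalnum
        // is undefined behavior, which matters for non-ASCII input.
        const unsigned char ch = static_cast<unsigned char>(s[i]);
        if (i < 4 ? !std::isalpha(ch) : !std::isalnum(ch)) return false;
    }
    return true;
}

// Lowercase the string in place.
void tolowerString(std::string& s) {
    for (char& ch : s)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}
```

Under this assumed rule "file123" is a word, while "123file", "abc" and "word," are not.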


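// f: path of the file to count; resultTxt: path of the output file
// returns: nothing
// A map stores each word's frequency; a multimap then orders the output by frequency, with ties in dictionary order
void sortwordCount(string f, string resultTxt) {

    ofstream out(resultTxt,ios::app);
    ifstream input;

    input.open(f, ios::in);
    string eachline;
    map<string, int> mapA; // key: the word, value: how many times it occurs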

    while (getline(input, eachline))
    {
        // string::size_type, the standard library's own type, is the safest way to hold a string size or position
        string::size_type start = 0;
        while (start <= eachline.size())
        {
            string::size_type end = eachline.find_first_of(" ", start);// a space is the word separator
            if (end == string::npos) // npos: no further space, so the last token runs to the end of the line
                end = eachline.size();
            string content = eachline.substr(start, end - start);
            if (isword(content)) {
                tolowerString(content);// lowercase content to simplify counting and output
                
                //if (!isLetter(content[end])&&!isdigit(content[end]))
                //  content.erase(content.end());
                map<string, int>::iterator it = mapA.find(content);
                if (it == mapA.end())// first occurrence of this word
                    mapA.insert(pair<string, int>(content, 1));// insert() accepts only a pair
                else
                    ++it->second;// the word already exists

            }
            start = end + 1;
        }

    }

    multimap<int, string, greater<int> > mapB;// multimap ordered by count, descending

    // copy mapA into mapB
    for (map<string, int>::iterator it1 = mapA.begin(); it1 != mapA.end(); ++it1)
    {
        mapB.insert(pair<int, string>(it1->second, it1->first));
    }


    // print the top ten to the console
    // the iterator type must match mapB's type, including the greater<int> comparator
    int i = 0;
    for (multimap<int, string, greater<int> >::iterator it2 = mapB.begin(); i < 10&&it2!=mapB.end(); ++it2, ++i)
        cout <<"<"<<it2->second <<">:"<< it2->first << endl;
    // write the sorted map to the output file
    
    for (multimap<int, string, greater<int> >::iterator it2 = mapB.begin(); it2 != mapB.end(); ++it2)
    {
        //      if ((it2->first) > 1)
        out << "<" << it2->second << ">:" << it2->first << endl;
    }

    out.close();
    input.close();
}

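Advanced requirements

Word-frequency statistics

// f: path of the file to count
// returns: nothing
// Same approach as the basic version, except that the "Title:" and "Abstract:" labels are filtered out
// For brevity only the key code is shown below
void sortwordCount(string f) {
    if (isAbstract(eachline.substr(start, end - start)) || isTitle(eachline.substr(start, end - start))) {
    // skip the label token and move on to the next one
    start = end + 1;
    end = eachline.find_first_of(" ", start);// a space is the word separator
    }
}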


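// f: path of the file to count; w: whether weighted counting is enabled (0 = off, 1 = on)
// returns: nothing
// Same approach as the basic version, with weighted counting added
void countWordsWithWeight(string f, int w) {
    // ... (setup and the opening of the enclosing per-line loop are not shown in this excerpt)

        flag = isTitle(eachline.substr(start, end - start));
        // do not count the "Title:" and "Abstract:" labels themselves
        if (isAbstract(eachline.substr(start, end - start))||isTitle(eachline.substr(start, end - start))) {
            start = end + 1;
            end = eachline.find_first_of(" ", start);// a space is the word separator
        }
        while (end != string::npos) // npos means the end of the line has been reached
        {
            string content = eachline.substr(start, end - start);

            if (isword(content)) {
                tolowerString(content);// lowercase content to simplify counting and output
                // a word on a title line weighs 10 when w == 1; every other word weighs 1
                int weight = (w == 1 && flag == true) ? 10 : 1;
                map<string, int>::iterator it = mapA.find(content);
                if (it == mapA.end())// first occurrence: insert with the same weight as later occurrences
                    mapA.insert(pair<string, int>(content, weight));// insert() accepts only a pair
                else
                    it->second += weight;// the word already exists
            }
            start = end + 1;
            end = eachline.find_first_of(" ", start);// a space is the word separator
        }

    }

}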
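
Because the listing above is an excerpt, here is a small self-contained sketch of the same weighting idea that can be compiled and tried on its own. It assumes the crawler's output layout ("Title: ..." and "Abstract: ..." lines); the function name countWithWeight is hypothetical, and for brevity it applies no isword() filter or lowercasing, so it is not equivalent to the real function.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch: with w == 1, tokens on a "Title:" line weigh 10 and
// all other tokens weigh 1; with w == 0 every token weighs 1.
// No word filter or lowercasing is applied here.
std::map<std::string, int> countWithWeight(const std::string& text, int w) {
    std::map<std::string, int> freq;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream tokens(line);
        std::string token;
        if (!(tokens >> token)) continue;          // blank line
        const bool inTitle = (token == "Title:");
        const bool isLabel = inTitle || token == "Abstract:";
        const int weight = (w == 1 && inTitle) ? 10 : 1;
        if (!isLabel) freq[token] += weight;       // first token is ordinary text
        while (tokens >> token)
            freq[token] += weight;                 // same weight for first and later occurrences
    }
    return freq;
}
```

Using freq[token] += weight avoids the separate find/insert branches, so a word that first appears in a title gets its full weight.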

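// f: path of the file to count; n: user-defined phrase length
// returns: nothing
// Approach: read from the current position up to the n-th separator and cut there
void countGroupWordsWithLength(string f,int n) {
    // ... (setup is not shown in this excerpt)

    while (getline(input, eachline))
    {
        // string::size_type, the standard library's own type, is the safest way to hold a string size or position
        string::size_type start = 0;
        string::size_type end = eachline.find_first_of(" ");// a space is the word separator
        // do not count the "Title:" and "Abstract:" labels themselves
        if (isAbstract(eachline.substr(start, end - start)) || isTitle(eachline.substr(start, end - start))) {
            start = end + 1;
            end = eachline.find_first_of(" ", start);// a space is the word separator
        }

        content = eachline.substr(start);// take the rest of the line so the scan below can cross several words
        cntNum = 0;
        for (i = 0; i < content.size() && cntNum < n; ++i) {
            if (content[i]==' ')
                cntNum++;
        }
        end = start + i;// just past the n-th separator, or the end of the line if it has fewer than n
        // ... (counting the phrase is not shown in this excerpt)
    }
}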
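
The excerpt above leaves out how phrases are stored, so here is one possible reading as a self-contained sketch: slide a window of n consecutive tokens over each line and count every window. The sliding-window interpretation, the function name countPhrases and the whitespace tokenization are all assumptions, and no isword() filter is applied.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: count every run of n consecutive whitespace-separated
// tokens on each line. Phrases never span two lines.
std::map<std::string, int> countPhrases(const std::string& text, std::size_t n) {
    std::map<std::string, int> freq;
    if (n == 0) return freq;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream tokens(line);
        std::vector<std::string> words;
        std::string token;
        while (tokens >> token) words.push_back(token);
        // i + n <= size keeps the window inside the line
        for (std::size_t i = 0; i + n <= words.size(); ++i) {
            std::string phrase = words[i];
            for (std::size_t j = 1; j < n; ++j)
                phrase += " " + words[i + j];
            ++freq[phrase];
        }
    }
    return freq;
}
```

Splitting the line into a vector first costs some memory but avoids the index arithmetic on separators that makes the in-place version easy to get wrong.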

Crawling the paper information

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class  Spiter {
    public static void main(String[] args) throws IOException {
        // maxBodySize(0) lifts jsoup's default response-size limit so the whole index page is parsed
        Document doc=Jsoup.connect("http://openaccess.thecvf.com/CVPR2018.py").maxBodySize(0).get();
        Elements listClass = doc.getElementsByAttributeValue("class", "ptitle");
        Document paper;
        int num=0;
        File file=new File("spider.txt");
        // try-with-resources closes the writer even if a request fails midway
        try (Writer out=new FileWriter(file)) {
            System.out.print("Crawl started\n");
            for(Element element:listClass) {

                // each .ptitle element wraps a link to the paper's own page
                String link = element.getElementsByTag("a").attr("href");
                link="http://openaccess.thecvf.com/"+link;
                paper=Jsoup.connect(link).get();

                Element Etitle=paper.getElementById("papertitle");
                Element Eabstr=paper.getElementById("abstract");
                String abstr=Eabstr.text();
                String title=Etitle.text();
                out.write(num+"\r\n");
                out.write("Title: "+title+"\r\n");
                out.write("Abstract: "+abstr+"\r\n"); // \r\n is a line break

                out.write("\r\n");
                out.write("\r\n");
                num++;
                out.flush(); // push the buffered content to the file

            }
            System.out.print("Crawl finished");

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

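Test code

// Compare the program output (input.txt) with the expected answers (ans.txt) line by line
ifstream input("input.txt");
ifstream ans("ans.txt");
string getAns, getInput;
bool ok = true;
while (getline(input, getInput)) {
    // a missing or different line in ans.txt counts as a mismatch
    if (!getline(ans, getAns) || getAns != getInput) {
        showMessage(); // reports the mismatch (helper not shown)
        ok = false;
    }
}
if (ok)
    cout<<"success"<<endl;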

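Difficulties

We had not worked with file operations for a long time and were rusty with the C++ API, so getting familiar with it again took a fair amount of time, and the crawler took a fair amount of time as well. For example, Jsoup.connect() imposes a default size limit of 1 MB on the response body, so our crawl returned only 500-odd records while other teams using Python got 900-odd. Just as we were about to rewrite the crawler, my teammate found the problem, which saved us a lot of time. On top of that, the requirements kept changing late in the project, and each change forced us to rethink things. Fortunately we worked together much more smoothly this time.

Summary

Through this assignment we effectively relearned C++ and gained a deeper understanding and command of the relevant material. After finishing the basic requirements we found we still had time to spare, so we tried the advanced requirements and encountered web crawling along the way. We ran into some problems at first, but by working through them together we found and solved them, which gave us a further appreciation of the advantages of teamwork.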
