A running log of solving a web page information extraction task

Date: 2019-11-20

Following the recommendation of the bounty task's author, I downloaded the suggested HTML parsers, along with cJSON.

First I wrote a workflow.h file containing a few function declarations:

// true(1) or false(0)
int validate_args(int argc, char** argv);
int validate_file(FILE* file);

/* parse fileContent to page */
int parse_filecontent(FileContents* fileContent, PageTree* page);
void clean_pagetree(PageTree* page);

/* return true(1) or false(0) */
int check_page_type(PageTree* page);

/* return true(1) or false(0).
   If return true, the json fragment is stored in argument (*json).
*/
int output_json(PageTree* page, jsonPtr* json);

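This is already quite self-explanatory: main calls these functions in order, and the JSON printed at the end is exactly what the task author asked for. It really is that direct. The main function is as follows: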

int main(int argc, char** argv)
{
    int retcode;
    /*retcode = validate_args(argc, argv);
    if (!retcode) {
        report_msg(0, "Invalid arguments.\n");
        return 1;
    }*/

    const char* filename = "F:/迅雷下載/懸賞/html解析工具參考附件/html_samples/Barack Obama.html";
    FILE* file = 0;
    fopen_s(&file, filename, "rb");
    if (!file) {
        report_msg(0, "Open file failed.\n");
        return 1;
    }

    retcode = validate_file(file);
    if (!retcode) {
        return 1;
    }

    FileContents fc;
    if (!load_file_to_memory(file, &fc)) {
        return 1;
    }

    PageTree page;
    if (!parse_filecontent(&fc, &page)) {
        return 1;
    }

    jsonPtr json;
    if (!output_json(&page, &json)) {
        return 1;
    }

    char* json_print_content = cJSON_Print(json);
    report_msg_with_newline(1, json_print_content);
    free(json_print_content);

    clean_pagetree(&page);
    unload_file_from_memory(&fc);
    fclose(file);
    return 0;
}

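load_file_to_memory is there because the parse function of the gumbo library only accepts a char* buffer, that is, HTML text; it cannot parse from a file path. The library also supports only UTF-8 input, so every file has to be saved as UTF-8.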
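As an illustration of that step, here is a minimal sketch of what load_file_to_memory and unload_file_from_memory could look like. The FileContents layout below is an assumption made for the example; the real struct in the project is not shown in this post and may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical layout; the project's real FileContents may differ. */
typedef struct {
    char*  data;    /* NUL-terminated copy of the file contents */
    size_t length;  /* number of bytes, excluding the terminator */
} FileContents;

/* Reads the whole stream into fc->data. Returns 1 on success, 0 on failure. */
int load_file_to_memory(FILE* file, FileContents* fc)
{
    assert(file != NULL && fc != NULL);
    fc->data = NULL;
    fc->length = 0;

    /* Measure the file by seeking to its end, then rewind. */
    if (fseek(file, 0, SEEK_END) != 0) return 0;
    long size = ftell(file);
    if (size < 0) return 0;
    if (fseek(file, 0, SEEK_SET) != 0) return 0;

    char* buffer = malloc((size_t)size + 1);
    if (!buffer) return 0;

    size_t read_count = fread(buffer, 1, (size_t)size, file);
    if (read_count != (size_t)size) {
        free(buffer);
        return 0;
    }
    buffer[size] = '\0';  /* lets the buffer be used as a C string */

    fc->data = buffer;
    fc->length = (size_t)size;
    return 1;
}

/* Releases the buffer and resets the struct so it is safe to reuse. */
void unload_file_from_memory(FileContents* fc)
{
    assert(fc != NULL);
    free(fc->data);
    fc->data = NULL;
    fc->length = 0;
}
```

The file should be opened in binary mode ("rb"), as main does, so that the size reported by ftell matches the number of bytes fread returns. The resulting buffer can be handed to gumbo_parse, which expects a NUL-terminated string, or to gumbo_parse_with_options together with the length.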

I tested it with the Obama Facebook page, and the console printed the correct JSON string at the end, so the workflow runs end to end.

What remains is working out how to determine the page type, and applying a different function to each type of page to extract the information.

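The page source is too messy to read in Chrome, and pasting it into Visual Studio froze my computer, so I followed recommendations online and downloaded WebStorm, which works rather well. After pasting the HTML into WebStorm and pressing Alt+F8 to format it, I went looking for clues that reveal the page type. When the mouse hovers over an HTML tag, a breadcrumb bar showing the nesting hierarchy appears at the top of the editor; right-clicking a node in that bar selects the content of that node in the editor, including the node itself.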

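For Facebook pages, I found that the html element carries id="facebook", which is enough to decide whether a page is a Facebook page. I also found that the content attribute of head -> noscript -> meta is the key to telling whether a Facebook page is a list page or a comment page. Infuriatingly, gumbo has a big pitfall when parsing noscript nodes: everything inside a noscript node is treated as raw text. For now I make do with the code below: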

/* gumbo exposes <noscript> content as raw text, so search the original
   source starting at the <noscript> tag instead of walking child nodes. */
if ((noscript = get_child_node(head, GUMBO_TAG_NOSCRIPT, 0, 0)) != NULL) {
  const char* content = strstr(get_elem(noscript)->original_tag.data, "URL=");
  /* Intended approach, unusable while <noscript> has no parsed children:
  if (meta = get_child_node(noscript, GUMBO_TAG_META, 0, 0)) {
    GumboAttribPtr content = gumbo_get_attribute(&get_elem(meta)->attributes, "content");
    if (content) {
      const char* p, *q;
      p = strchr(content->value, '/');
      if (p && (q = strchr(p + 1, '/')) && (q > p + 1)) {
        if (*(q + 1) == '?') {
          page->page_type = eFacebookListPage;
        }
        else {
          page->page_type = eFacebookCommentPage;
        }
        int username_length = q - p - 1;
        page->user_name = malloc(username_length + 1);
        memcpy(page->user_name, p + 1, username_length);
        page->user_name[username_length] = 0;

        return 1;
      }
    }
  }*/
  if (content) {
    const char* p, *q;
    /* p: first '/' after "URL=", q: start of the query string */
    p = strchr(content, '/');
    if (p && (q = strchr(p + 1, '?')) && (q > p + 1)) {
      /* A letter right before '?' means a list page, otherwise a comment page. */
      if (isalpha((unsigned char)*(q - 1))) {
        page->page_type = eFacebookListPage;
      }
      else {
        page->page_type = eFacebookCommentPage;
      }
      size_t username_length = (size_t)(q - p - 1);
      page->user_name = malloc(username_length + 1);
      if (!page->user_name) {
        return 0;
      }
      memcpy(page->user_name, p + 1, username_length);
      page->user_name[username_length] = 0;

      return 1;
    }
  }
}
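To make the heuristic easier to try in isolation, here is a sketch that pulls it out into a standalone function operating on a plain string, with no gumbo dependency. The names PageKind and classify_redirect are made up for this example, and the sample URLs in the usage note are invented rather than taken from real pages.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, introduced only for this sketch. */
typedef enum { PAGE_UNKNOWN, PAGE_LIST, PAGE_COMMENT } PageKind;

/* Looks for "URL=/<path>?..." in text and applies the same rule as above:
   a letter right before '?' means a list page, anything else a comment page.
   On success *path_out receives a malloc'ed copy of <path> (caller frees). */
PageKind classify_redirect(const char* text, char** path_out)
{
    assert(text != NULL && path_out != NULL);
    *path_out = NULL;

    const char* url = strstr(text, "URL=");
    if (!url) return PAGE_UNKNOWN;

    const char* slash = strchr(url, '/');
    if (!slash) return PAGE_UNKNOWN;

    const char* query = strchr(slash + 1, '?');
    if (!query || query == slash + 1) return PAGE_UNKNOWN;

    /* Copy everything between the first '/' and the '?'. */
    size_t length = (size_t)(query - slash - 1);
    char* path = malloc(length + 1);
    if (!path) return PAGE_UNKNOWN;
    memcpy(path, slash + 1, length);
    path[length] = '\0';
    *path_out = path;

    return isalpha((unsigned char)query[-1]) ? PAGE_LIST : PAGE_COMMENT;
}
```

With an invented input such as "0; URL=/barackobama?_fb_noscript=1" this yields PAGE_LIST and the path "barackobama". With "0; URL=/barackobama/posts/12345?x=1" it yields PAGE_COMMENT, and the copied path is the whole "barackobama/posts/12345", so on comment pages the stored string is more than just the user name.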

In WebStorm's navigation pane, selecting a file and pressing Shift+F6 renames it.

Twitter list pages and comment pages are told apart the same way as on Facebook, and the same spot also shows whether a page is a Twitter page at all.

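On YouTube pages, the head -> link element with rel="canonical" has an href containing www.youtube.com, and the same href also tells list pages from comment pages. I went back to look at the Twitter source and found that Twitter does the same. Checking this field is the easier way.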
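A rough sketch of that check, working on the href string only. The post does not spell out the exact rule used to separate list pages from comment pages, so the "/watch" and "/status/" markers below are assumptions based on how video and tweet URLs usually look, and Site, CanonicalInfo and inspect_canonical are names invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types for this sketch only. */
typedef enum { SITE_UNKNOWN, SITE_YOUTUBE, SITE_TWITTER } Site;

typedef struct {
    Site site;
    int  is_comment_page;  /* 1 for a single video/tweet page, 0 otherwise */
} CanonicalInfo;

/* Inspects the href of <link rel="canonical"> and guesses site and page kind. */
CanonicalInfo inspect_canonical(const char* href)
{
    assert(href != NULL);
    CanonicalInfo info = { SITE_UNKNOWN, 0 };

    if (strstr(href, "www.youtube.com")) {
        info.site = SITE_YOUTUBE;
        /* Assumption: a single video lives under /watch. */
        info.is_comment_page = strstr(href, "/watch") != NULL;
    } else if (strstr(href, "twitter.com")) {
        info.site = SITE_TWITTER;
        /* Assumption: a single tweet lives under /<user>/status/<id>. */
        info.is_comment_page = strstr(href, "/status/") != NULL;
    }
    return info;
}
```

The href itself would come from gumbo_get_attribute on the link node, in the same way the commented-out Facebook code reads the content attribute.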

Once the specific page type is known, the last remaining job is extracting the information from the page.

Downloading and building the code:

git clone https://github.com/google/gumbo-parser.git

Installing gumbo:

$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

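Along the way you may be told that g++, automake or autoconf is not installed, and you may also need to install pkg-config and so on; these are all required to build the gumbo library.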

git clone https://git.oschina.net/zbjxb/jinmudaozhang.git

This is the parser code.

Build command:

gcc -o parser parse_htmls.c cJSON.c zbj_utils.c zbj_workflow.c -I/usr/local/include /usr/local/lib/libgumbo.a -lm
