Writing a Crawler in Golang (5): Using XPath

Date: 2019-11-16



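Earlier articles in this series introduced soup, a stand-in for BeautifulSoup, and goquery, a stand-in for Pyquery. When I write crawlers in Python, though, my favorite page-parsing combination is lxml + XPath. Why? Let me first go over the strengths of lxml and XPath.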

lxml

lxml is an HTML/XML parser: a Python binding for the C libraries libxml2 and libxslt. Besides being fast, it is also notably tolerant of malformed documents.

XPath

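XPath stands for XML Path Language. It is a language for locating information in XML documents; it was originally designed for searching XML, but it works just as well on HTML. By writing a path expression, or by using the built-in standard functions, you can get at exactly the content you want directly. You don't have to chain Find calls to walk to a node and then call Text or a similar method to read the value, as you do with soup and goquery; a single expression gets you the result. That is its defining feature and its main strength. lxml happens to support XPath, so lxml + XPath has long been my first choice for writing crawlers.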

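Compared with BeautifulSoup (soup) and Pyquery (goquery), XPath has a steeper learning curve, but it is well worth learning, and you will come to love it. Take me as an example: having learned XPath while writing crawlers in Python, I can now simply pick a Go library that supports XPath and use it right away.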

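One more thing: if you really like BeautifulSoup, be sure to use the BeautifulSoup + lxml combination. BeautifulSoup's default HTML parser is html.parser from the Python standard library; it also tolerates malformed documents well, but it is considerably slower.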

I learned XPath from w3school; the link is in the further reading section.

XPath libraries in Golang

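There are quite a few XPath libraries written in Golang. Since I don't have much real-world Go development experience yet, I'll try out the few libraries I could find and draw conclusions afterwards.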

First, here is part of the HTML of the Douban Top 250 page:

<ol class="grid_view">
  <li>
    <div class="item">
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">肖申克的救贖</span>
            <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
            <span class="other">&nbsp;/&nbsp;月黑高飛(港)  /  刺激 1995(臺)</span>
          </a>
          <span class="playable">[可播放]</span>
        </div>
      </div>
    </div>
  </li>
  ....
</ol>

The requirement is the same as before: get each entry's ID and title.
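Whichever XPath library is used, the ID itself comes from the link's href rather than from the markup: in the sample above it is the path segment right after /subject/. Here is a minimal sketch of that step using only the standard library. The helper name subjectID is my own and does not come from the original code, which simply indexes the result of strings.Split.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectID is a hypothetical helper for this sketch. It returns the path
// segment that follows "/subject/" in href, and false when the URL does not
// contain such a segment.
func subjectID(href string) (string, bool) {
	const marker = "/subject/"
	i := strings.Index(href, marker)
	if i < 0 {
		return "", false
	}
	rest := href[i+len(marker):]
	// Keep only the text before the next slash, if there is one.
	id := strings.SplitN(rest, "/", 2)[0]
	if id == "" {
		return "", false
	}
	return id, true
}

func main() {
	id, ok := subjectID("https://movie.douban.com/subject/1292052/")
	fmt.Println(id, ok) // prints "1292052 true"
}
```

For this particular URL shape, strings.Split(href, "/")[4] gives the same value; the helper only adds a guard against links that do not follow the pattern.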

github.com/lestrrat-go/libxml2

lestrrat-go/libxml2 is a Golang binding for libxml2.

First, install it:

❯ go get github.com/lestrrat-go/libxml2

Then change the code:

import (
        "log"
        "time"
        "strings"
        "strconv"
        "net/http"

        "github.com/lestrrat-go/libxml2"
        "github.com/lestrrat-go/libxml2/types"
        "github.com/lestrrat-go/libxml2/xpath"
)

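// fetch downloads url and parses the response body into a libxml2 document.
// The caller must call Free on the returned document.
func fetch(url string) types.Document {
        log.Println("Fetch Url", url)
        client := &http.Client{}
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
                log.Fatal("Http new request err:", err)
        }
        // Identify as Googlebot via the User-Agent header.
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        resp, err := client.Do(req)
        if err != nil {
                log.Fatal("Http get err:", err)
        }
        defer resp.Body.Close()
        if resp.StatusCode != 200 {
                log.Fatal("Http status code:", resp.StatusCode)
        }
        // Parse the HTML straight from the response stream.
        doc, err := libxml2.ParseHTMLReader(resp.Body)
        if err != nil {
                log.Fatal(err)
        }
        return doc
}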

The fetch function is essentially the same as before; the only difference is that doc now comes from libxml2.ParseHTMLReader(resp.Body). parseUrls changes more:

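func parseUrls(url string, ch chan bool) {
        doc := fetch(url)
        defer doc.Free() // release the memory held by libxml2
        // One node per entry: the <div class="hd"> inside each list item.
        nodes := xpath.NodeList(doc.Find(`//ol[@class="grid_view"]/li//div[@class="hd"]`))
        for _, node := range nodes {
                // Both queries are relative to the current entry node.
                urls, _ := node.Find("./a/@href")
                titles, _ := node.Find(`.//span[@class="title"]/text()`)
                // Element 4 of the "/"-split href is the entry ID.
                log.Println(strings.Split(urls.NodeList()[0].TextContent(), "/")[4],
                        titles.NodeList()[0].TextContent())
        }
        time.Sleep(2 * time.Second) // pause before signalling completion
        ch <- true
}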

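Personally, I find the interface libxml2 exposes unpleasant to use: every time you want a matched value you have to write something as cumbersome as NodeList()[index].TextContent().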

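The documentation is also very sparse. Reading the project source, I found that you can create a context with xpath.NewContext and then get the result of an XPath expression with xpath.String(ctx.Find("/foo/bar")), but that is still a hassle!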

github.com/antchfx/htmlquery

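As its name suggests, htmlquery is a package for running XPath queries against HTML documents. Its core is antchfx/xpath; the project is updated frequently and the documentation is fairly complete.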

First, install it:

❯ go get github.com/antchfx/htmlquery

Then adapt the code to our needs:

import (
    "log"
    "time"
    "strings"
    "strconv"
    "net/http"

    "golang.org/x/net/html"
    "github.com/antchfx/htmlquery"
)

// fetch downloads url and parses the response body into an *html.Node tree.
func fetch(url string) *html.Node {
    log.Println("Fetch Url", url)
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal("Http new request err:", err)
    }
    // Identify as Googlebot via the User-Agent header.
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal("Http get err:", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != 200 {
        log.Fatal("Http status code:", resp.StatusCode)
    }
    // Parse the HTML straight from the response stream.
    doc, err := htmlquery.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    return doc
}

In fetch, the main changes are the call to htmlquery.Parse(resp.Body) and the return type, *html.Node. Now look at parseUrls:

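func parseUrls(url string, ch chan bool) {
    doc := fetch(url)
    // One node per entry: the <div class="hd"> inside each list item.
    nodes := htmlquery.Find(doc, `//ol[@class="grid_view"]/li//div[@class="hd"]`)
    for _, node := range nodes {
        // FindOne returns nil when nothing matches, so skip incomplete entries.
        link := htmlquery.FindOne(node, "./a/@href")
        title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
        if link == nil || title == nil {
            continue
        }
        // Element 4 of the "/"-split href is the entry ID.
        log.Println(strings.Split(htmlquery.InnerText(link), "/")[4],
            htmlquery.InnerText(title))
    }
    time.Sleep(2 * time.Second) // pause before signalling completion
    ch <- true
}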