Still writing crawlers in Python? Give a Go crawler framework a try

Date: 2019-11-10

Today I would like to introduce colly, a crawler framework for Go.

Getting started

First, install colly with the following command.

go get -u github.com/gocolly/colly/...

Next, build a Collector, register event callbacks, and then visit a page:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Initialize the colly collector
    c := colly.NewCollector(
        // Only crawl content under the allowed domains
        colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
    )

    // The callback fires for every element that has an href attribute
    // The first argument is a jQuery-style selector
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
        // Visit the linked page
        c.Visit(e.Request.AbsoluteURL(link))
    })

    // Print the URL before each request is sent
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Start crawling from this address
    c.Visit("https://hackerspaces.org/")
}

Running the code above starts at the initial address and recursively crawls the pages under the two allowed domains. See how simple and convenient that is!

Login and authentication

Some pages on some sites are only accessible when you are logged in. Colly provides a Post method for sending login requests (colly maintains the cookies itself).

// authenticate
err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
if err != nil {
    log.Fatal(err)
}

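Many sites also use anti-attack measures such as CAPTCHAs and a csrf_token. The csrf_token is usually located somewhere on the page, for example in a form or a meta tag, and is easy to extract. For CAPTCHAs, you can try typing the answer in at the console or using image recognition.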

Rate limiting

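Many content sites have anti-scraping measures, so sending requests too quickly is likely to get your IP banned. You can use a LimitRule to limit the crawl rate.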

// For any domain, allow at most two concurrent requests to that domain
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

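That was a simple example. Besides limiting the concurrency per domain, you can also limit the interval between requests, among other things. Let's look at the LimitRule struct: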

type LimitRule struct {
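   // Regular expression used to match domains
   DomainRegexp string
   // Glob pattern used to match domains
   DomainGlob string
   // Time to wait before a new request is sent
   Delay time.Duration
   // Random extra time to wait before a new request is sent
   RandomDelay time.Duration
   // Number of concurrent requests allowed for the matched domains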
   Parallelism    int
   waitChan       chan bool
   compiledRegexp *regexp.Regexp
   compiledGlob   glob.Glob
}

Queue and `redis` storage support

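A crawler may stop, either on purpose or because of a failure, so saving progress in a sensible way lets us skip content that has already been crawled. This is where a queue and storage support come in.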

Colly has its own file-based storage mode, which is disabled by default. Using redis for storage is recommended.

urls := []string{
   "http://httpbin.org/",
   "http://httpbin.org/ip",
   "http://httpbin.org/cookies/set?a=b&c=d",
   "http://httpbin.org/cookies",
}

c := colly.NewCollector()

// Create the redis storage
storage := &redisstorage.Storage{
   Address:  "127.0.0.1:6379",
   Password: "",
   DB:       0,
   Prefix:   "httpbin_test",
}

// Attach the storage to the collector
err := c.SetStorage(storage)
if err != nil {
   panic(err)
}

// Delete previous data (if needed)
if err := storage.Clear(); err != nil {
   log.Fatal(err)
}

// Close the redis connection when done
defer storage.Client.Close()

// Create a request queue backed by the redis storage
// with 2 consumer threads
q, _ := queue.New(2, storage)

c.OnResponse(func(r *colly.Response) {
   log.Println("Cookies:", c.Cookies(r.Request.URL.String()))
})

// Add the URLs to the queue
for _, u := range urls {
   q.AddURL(u)
}
// Start crawling
q.Run(c)

When using a queue, you can keep adding the URLs of links to the queue as they are found while parsing pages.

Content parsing

Once the content has been fetched, how do we parse it and extract what we want?
Taking HTML as an example (colly can also parse XML and other content):

// Content of .refentry elements
c.OnHTML(".refentry", func(element *colly.HTMLElement) {
   // ...
})

The first argument of OnHTML is a jQuery-style selector and the second is a callback, which receives an HTMLElement object. The HTMLElement struct:
type HTMLElement struct {
   // Name of the tag
   Name       string
   Text       string
   attributes []html.Attribute
   // The current request
   Request *Request
   // The current response
   Response *Response
   // DOM element of the current node
   DOM *goquery.Selection
   // Index of this element within this callback
   Index int
}

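Through the DOM field you can manipulate nodes (add or remove them), traverse them, and read their content.
The DOM field is of type Selection, which provides a large number of methods. If you have used jQuery, it will feel familiar.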

For example, suppose we want to remove the h1.refname tag and return the HTML content of the parent element:

c.OnHTML(".refentry", func(element *colly.HTMLElement) {
   titleDom := element.DOM.Find("h1.refname")
   title := titleDom.Text()
   titleDom.Remove()
   
   content, _ := element.DOM.Html()
   // ...
})