Go Library of the Day: colly

Date: 2021-07-03


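Introduction

colly is a powerful web-scraping framework written in Go. It offers a clean API and strong performance, handles cookies and sessions automatically, and provides a flexible extension mechanism.

We first introduce colly's basic concepts, then walk through its usage and features with a few examples: fetching GitHub Trending, fetching Baidu's novel hot list, and downloading images from Unsplash.

Quick Start

The code in this article uses Go Modules.

Create a directory and initialize the module: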

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install the colly library:

$ go get -u github.com/gocolly/colly/v2

Usage:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

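Using colly is straightforward:

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Every page contains many links to other pages, so without a limit the crawl might never stop. That is why the example passes the option colly.AllowedDomains("www.baidu.com"), which restricts crawling to pages on the domain www.baidu.com.

Next, call c.OnHTML() to register an HTML callback that runs for every a element with an href attribute. Here the callback visits the URL that href points to. In other words, the crawler parses each fetched page and then follows its links to other pages.

Call c.OnRequest() to register a request callback, which runs each time a request is sent. Here it simply prints the request URL.

Call c.OnResponse() to register a response callback, which runs each time a response is received. Here it just prints the URL and the response size.

Call c.OnError() to register an error callback, which runs when a request fails. Here it prints the URL and the error.

Finally, call c.Visit() to start with the first page.

Run it: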

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "百度首頁" -> /
Link found: "設置" -> javascript:;
Link found: "登陸" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "新聞" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "地圖" -> http://map.baidu.com
Link found: "直播" -> https://live.baidu.com/
Link found: "視頻" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "貼吧" -> http://tieba.baidu.com
...
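
The output above shows that the a[href] callback also sees links such as javascript:; and links to other hosts. As a rough illustration of the kind of check that keeps a crawl on one site, here is a minimal sketch that uses only the standard library. The helper name shouldVisit and the allow-list map are assumptions made for this example; this is not a description of how colly implements AllowedDomains.

```go
package main

import (
	"fmt"
	"net/url"
)

// shouldVisit is a hypothetical helper (not part of colly). It resolves href
// against the page URL and reports whether the result is an http(s) URL whose
// host is in the allow-list.
func shouldVisit(pageURL, href string, allowed map[string]bool) (string, bool) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	// Reject non-web schemes such as "javascript:" or "mailto:".
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	if !allowed[abs.Hostname()] {
		return "", false
	}
	return abs.String(), true
}

func main() {
	allowed := map[string]bool{"www.baidu.com": true}
	page := "http://www.baidu.com/"
	for _, href := range []string{"/", "javascript:;", "http://news.baidu.com"} {
		u, ok := shouldVisit(page, href, allowed)
		fmt.Printf("%-26q visit=%v %s\n", href, ok, u)
	}
}
```

Only the first of the three sample links passes the check; the other two are rejected because of their scheme and host respectively.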

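After colly fetches a page, it parses the page with goquery. It then looks up the element selector of each registered HTML callback, wraps every matching goquery.Selection in a colly.HTMLElement, and runs the callback.

colly.HTMLElement is really just a thin wrapper around goquery.Selection: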

type HTMLElement struct {
  Name string
  Text string
  Request *Request
  Response *Response
  DOM *goquery.Selection
  Index int
}

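It also provides some easy-to-use methods:

Attr(k string): returns an attribute of the current element; the example above used e.Attr("href") to get the href attribute;
ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element matched by goquerySelector;
ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements matched by goquerySelector, as a []string;
ChildText(goquerySelector string): concatenates the text content of the child elements matched by goquerySelector and returns it;
ChildTexts(goquerySelector string): returns the text content of each child element matched by goquerySelector, as a []string;
ForEach(goquerySelector string, callback func(int, *HTMLElement)): runs callback for every child element matched by goquerySelector;
Unmarshal(v interface{}): by giving struct fields tags in goquerySelector format, an HTMLElement can be unmarshaled into a struct instance.

These methods are used frequently. The examples below demonstrate colly's features and usage.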

GitHub Trending

I previously wrote an API that fetches GitHub Trending; with colly it is even easier:

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )


  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func (e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    if len(parts) < 2 {
      // skip rows whose heading is not in "author / name" form
      return
    }
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 >a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add
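    // the count may contain thousands separators (e.g. "1,024"), so strip commas before parsing
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    if fields := strings.Fields(addStr); len(fields) > 0 {
      repo.Add, _ = strconv.Atoi(strings.Replace(fields[0], ",", "", -1))
    }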

    // built by
    e.ForEach("div.mt-2 > span.mr-3  img[src]", func (index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")
  
  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
      fmt.Println("Author:", repo.Author)
      fmt.Println("Name:", repo.Name)
      break
  }
}

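We use ChildText to get the author, repository name, language, star and fork counts, and the number of stars added today, and ChildAttr to get the repository link. The link is a relative path, so we call e.Request.AbsoluteURL() to turn it into an absolute URL.

Run it: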

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn
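
The callback above discards the errors from strconv.Atoi, so a count that fails to parse silently becomes 0. As a minimal sketch of stricter parsing, the helpers below split the "author / name" text and parse counts that may contain thousands separators. The function names splitAuthorRepo, parseCount and leadingCount are hypothetical, and the sample strings are assumptions about what the page text looks like.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitAuthorRepo splits text such as "Shopify /\n\n  dawn" into author and
// repository name. ok is false when the text contains no slash.
func splitAuthorRepo(s string) (author, name string, ok bool) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]), true
}

// parseCount parses a number that may contain thousands separators ("1,234").
func parseCount(s string) (int, error) {
	return strconv.Atoi(strings.ReplaceAll(strings.TrimSpace(s), ",", ""))
}

// leadingCount parses the first whitespace-separated field of s as a count,
// for text such as "1,024 stars today".
func leadingCount(s string) (int, error) {
	fields := strings.Fields(s)
	if len(fields) == 0 {
		return 0, fmt.Errorf("no count in %q", s)
	}
	return parseCount(fields[0])
}

func main() {
	author, name, ok := splitAuthorRepo("Shopify /\n\n      dawn")
	fmt.Println(author, name, ok)

	stars, err := parseCount("12,345")
	fmt.Println(stars, err)

	added, err := leadingCount("1,024 stars today")
	fmt.Println(added, err)
}
```

Returning the error lets the caller decide whether to skip the row or log it, rather than storing a misleading zero.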

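Baidu Novel Hot List

The page is structured as follows:

Each hot-list entry sits in its own div.category-wrap_iQLoo;
the div.index_1Ew5p under the a element holds the rank;
the content is in div.content_1YWBm;
within the content, a.title_dIF3B is the title;
within the content there are two div.intro_1l0wp elements: the first is the author, the second is the genre;
within the content, div.desc_3CTjT is the description.

From this we define the struct: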

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

The tags use CSS selector syntax. They are there so that we can call HTMLElement.Unmarshal() directly to populate a Hot object.
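
Struct tags are plain strings that a library can read at run time through reflection. To show what a consumer of the selector tags sees, here is a small standard-library sketch that lists them. The helper selectorTags is a hypothetical name, and the sketch only reads the tags; it does not reproduce what Unmarshal does with them.

```go
package main

import (
	"fmt"
	"reflect"
)

// Hot mirrors the struct from the article.
type Hot struct {
	Rank   string `selector:"a > div.index_1Ew5p"`
	Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
	Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
	Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
	Desc   string `selector:"div.desc_3CTjT"`
}

// selectorTags returns a map from field name to the value of its `selector`
// tag. It accepts a struct or a pointer to a struct; anything else yields an
// empty map.
func selectorTags(v interface{}) map[string]string {
	tags := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return tags
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return tags
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if sel, ok := f.Tag.Lookup("selector"); ok {
			tags[f.Name] = sel
		}
	}
	return tags
}

func main() {
	tags := selectorTags(&Hot{})
	for _, field := range []string{"Rank", "Name", "Author", "Type", "Desc"} {
		fmt.Printf("%-6s -> %s\n", field, tags[field])
	}
}
```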

Then create the Collector object:

c := colly.NewCollector()

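Register the callbacks:

// hots collects the parsed entries; the OnHTML callback appends to it.
var hots []*Hot

c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {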
  hot := &Hot{}

  err := e.Unmarshal(hot)
  if err != nil {
    fmt.Println("error:", err)
    return
  }

  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:", len(r.Body))
})

OnHTML runs Unmarshal on each entry to produce a Hot object.

OnRequest and OnResponse simply print debugging information.

Then call c.Visit() to visit the URL:

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally, add some debug output:

fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}

Output:

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: 逆天邪神
Author: 作者：火星引力
Type: 類型：玄幻
Desc: 掌天毒之珠，承邪神之血，修逆天之力，一代邪神，君臨天下！  查看更多>

Unsplash

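The background images for my WeChat official-account articles mostly come from Unsplash, a site that offers a large and varied collection of free images. Its one drawback is that it is slow to access. Since we are learning about crawlers anyway, this is a good opportunity to download the images automatically with a program.

On the Unsplash home page, each photo sits in a figure element that contains a link to the photo's own page (the selector figure[itemProp] a[itemProp] in the code below matches these links). The home page only shows small versions of the images, though. Following a photo's link leads to a detail page where the larger image is an img element inside div._1g5Lu.

Three levels of pages are involved (the img URL itself needs one final request). Handling all of them with a single colly.Collector means the OnHTML callbacks have to be set up with great care, which adds considerable mental overhead. colly supports multiple Collectors, so we use that approach: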

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
}

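We use three Collector objects. The first collects the photo links on the home page, the second visits those photo pages, and the third downloads the images. Above we also registered request and error callbacks for the first Collector.

Once the third Collector has downloaded the image content, it saves it locally: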

func main() {
  // ... omitted
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })
}

The code above uses atomic.AddUint32() to generate a sequence number for each image.
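
An atomic counter matters once callbacks may run concurrently, because two plain count++ operations could otherwise hand out the same number. The following standard-library sketch generates file names from many goroutines and checks that none repeats. The helper nextFileName is a hypothetical name; the images/img%d.jpg pattern is the one used above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// nextFileName atomically increments the counter and returns a file name
// built from the new value, so concurrent callers never get the same name.
func nextFileName(counter *uint32) string {
	return fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(counter, 1))
}

func main() {
	var counter uint32

	var mu sync.Mutex
	seen := make(map[string]bool)

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			name := nextFileName(&counter)
			mu.Lock()
			seen[name] = true
			mu.Unlock()
		}()
	}
	wg.Wait()

	// Every goroutine received a distinct name, so the map has one entry each.
	fmt.Println("unique names:", len(seen))
}
```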

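Running the program saves the downloaded images as images/img1.jpg, images/img2.jpg, and so on.

Asynchronous Crawling

By default colly crawls pages synchronously, finishing one page before moving on to the next, and the Unsplash program above works that way. This takes a long time. colly also supports asynchronous crawling: passing the option colly.Async(true) when constructing the Collector is all it takes to enable it: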

c1 := colly.NewCollector(
  colly.Async(true),
)

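However, because crawling is now asynchronous, the program has to wait at the end for the Collectors to finish. Otherwise main returns early and the program exits: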

c1.Wait()
c2.Wait()
c3.Wait()

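Run it again and it is much faster 😀.

Second Version

Scrolling down the Unsplash page, we find that the later images are loaded asynchronously. Scrolling the page while watching the Network tab in Chrome shows the requests.

The request goes to the path /photos with the per_page and page parameters and returns a JSON array. This gives us another approach:

Define a struct for each item, keeping only the fields we need: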

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

Then parse the JSON in the OnResponse callback and, for each item's Download link, call Visit() on the Collector responsible for downloading images:

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  if err := json.Unmarshal(r.Body, &items); err != nil {
    fmt.Println("decoding response failed:", err)
    return
  }
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})
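
To try the decoding step without hitting the network, here is a minimal standard-library sketch. The JSON payload is a made-up sample shaped like the structs above, not a real Unsplash response, and downloadLinks is a hypothetical helper name. encoding/json matches keys to exported field names case-insensitively, which is why the structs need no json tags.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Item struct {
	Id     string
	Width  int
	Height int
	Links  Links
}

type Links struct {
	Download string
}

// downloadLinks decodes a JSON array of items and returns the non-empty
// download links. It returns an error for malformed input instead of
// silently ignoring it.
func downloadLinks(body []byte) ([]string, error) {
	var items []Item
	if err := json.Unmarshal(body, &items); err != nil {
		return nil, fmt.Errorf("decode photos: %w", err)
	}
	links := make([]string, 0, len(items))
	for _, it := range items {
		if it.Links.Download != "" {
			links = append(links, it.Links.Download)
		}
	}
	return links, nil
}

func main() {
	// Hypothetical sample payload; the URLs are placeholders.
	body := []byte(`[
	  {"id": "abc", "width": 4000, "height": 3000,
	   "links": {"download": "https://example.com/photos/abc/download"}},
	  {"id": "def", "links": {}}
	]`)

	links, err := downloadLinks(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```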

For the initial requests, we fetch 3 pages of 12 items each (the same page size the site itself requests):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}
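
Formatting the query string by hand works for two integer parameters. As an alternative sketch, net/url can build it and takes care of escaping if more parameters are added later. The helper name pageURL is an assumption for this example; the host, path and parameter names are the ones used above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// pageURL builds the listing URL for one page of results.
func pageURL(page, perPage int) string {
	q := url.Values{}
	q.Set("page", strconv.Itoa(page))
	q.Set("per_page", strconv.Itoa(perPage))

	u := url.URL{
		Scheme:   "https",
		Host:     "unsplash.com",
		Path:     "/napi/photos",
		RawQuery: q.Encode(), // Encode sorts parameters by key
	}
	return u.String()
}

func main() {
	for page := 1; page <= 3; page++ {
		fmt.Println(pageURL(page, 12))
	}
}
```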

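Run it and check the downloaded images.

Rate Limiting

Sometimes too many concurrent requests cause the site to restrict access. This is where LimitRule comes in. Put simply, LimitRule limits the request rate and the level of concurrency: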

type LimitRule struct {
  DomainRegexp string
  DomainGlob string
  Delay time.Duration
  RandomDelay time.Duration
  Parallelism    int
}

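The commonly used fields are Delay, RandomDelay, and Parallelism, which set the delay between requests, an additional random delay, and the number of concurrent requests, respectively. You must also specify which domains the rule applies to, through DomainRegexp or DomainGlob; if neither field is set, Limit() returns an error. Applied to the example above: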

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

For the domain unsplash.com, this sets a random delay of up to 500ms between requests and allows at most 12 concurrent requests.

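Setting Timeouts

Sometimes the network is slow. The http.Client that colly uses has a default timeout, and we can override the transport settings by calling the Collector's WithTransport() method: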

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})

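Extensions

colly provides some extension features in the extensions subpackage. The most commonly used one is the random User-Agent. Websites often use the User-Agent header to tell whether a request comes from a browser, so crawlers usually set this header to pass themselves off as a browser. It is easy to use: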

import "github.com/gocolly/colly/v2/extensions"

func main() {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The random User-Agent implementation is simple too: it picks one at random from a predefined list of User-Agents and sets it in the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}

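Writing your own extension is not hard either. For example, if every request needs to carry a specific header, the extension can be written like this: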

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Then call MyHeader() with the Collector object:

MyHeader(c)

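Summary

colly is one of the most popular crawler frameworks in Go and supports a rich set of features. This article introduced some of the commonly used ones, with examples. For reasons of space, some advanced features such as queues and storage were not covered. If you are interested in crawlers, they are worth exploring in more depth.

If you come across a fun or useful Go library, feel free to open an issue on the Go Library of the Day GitHub repository 😄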

References

Go Library of the Day on GitHub: https://github.com/darjun/go-daily-lib
Go Library of the Day: goquery: https://darjun.github.io/2020/10/11/godailylib/goquery/
Implementing a GitHub Trending API in Go: https://darjun.github.io/2021/06/16/github-trending-api/
colly on GitHub: https://github.com/gocolly/colly

About Me

My blog: https://darjun.github.io

Follow my WeChat official account [GoUpUp] and let's learn and improve together~
