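Getting started with go-colly (scraping English quotes from famous people)

I recently came across a quote library website: https://www.goodreads.com
That led me to Colly, a Golang scraping framework with more than 6K stars on GitHub.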

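Project goal

First I want to scrape all of the quotes from goodreads and save them to a file. Then I parse the scraped quotes and format each one for clean Markdown output, so that they render on a web page like this:

Each quote has three elements: the quote type, the quote text, and the author or source.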

“We are what we pretend to be, so we must be careful about what we pretend to be.”
Kurt Vonnegut, Mother Night

“Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”
Neil Gaiman, Fables & Reflections
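
As a rough sketch of the three-element model described above, a quote could be held in a small struct and rendered into an admonition block. The `Quote` type, its field names, and the `Markdown` method are hypothetical names chosen for illustration; the shortcode syntax follows the admonition strings used in the draft program later in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// Quote is a hypothetical record for one scraped quote.
type Quote struct {
	Kind   string // admonition type, e.g. "quote"
	Text   string // body of the quote
	Author string // author or source
}

// Markdown renders the quote as an admonition shortcode block:
// an opening tag carrying the type and author, the text, and a closing tag.
func (q Quote) Markdown() string {
	var b strings.Builder
	// %% produces a literal percent sign in the output.
	fmt.Fprintf(&b, "{{%% admonition %s \"%s\" %%}}\n", q.Kind, q.Author)
	b.WriteString(q.Text)
	b.WriteString("\n{{% /admonition %}}\n")
	return b.String()
}

func main() {
	q := Quote{
		Kind:   "quote",
		Text:   "“We are what we pretend to be, so we must be careful about what we pretend to be.”",
		Author: "Kurt Vonnegut, Mother Night",
	}
	fmt.Print(q.Markdown())
}
```

Keeping the rendering in one method means the output format can be changed without touching the scraping code.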

Preparation

Brief introduction

Lightning Fast and Elegant Scraping Framework for Gophers.

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

go-colly Git repository URL

gocolly/colly : https://github.com/gocolly/colly

Installation

$ go get -u github.com/gocolly/colly/...

Go environment

$ go version
go version go1.12.8 linux/amd64

Optionally, you can export GO111MODULE=on.

Quick start

Draft:

package main

import (
    "fmt"
    "os"
    "regexp"
    "strings"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
)

func main() {

    fileName := "quote.md"
    file, errFile := os.Create(fileName)
    if errFile != nil {
        // The builtin println does not interpret format verbs, so use fmt instead.
        fmt.Fprintf(os.Stderr, "operating system create file error: %s\n", errFile.Error())
        panic(errFile)
    }
    defer func() {
        err := file.Close()
        if err != nil {
            println("file close error")
        }
    }()

    c := colly.NewCollector()
    errProxy := c.SetProxy("http://127.0.0.1:1080/")
    if errProxy != nil {
        // The builtin println does not interpret format verbs, so use fmt instead.
        fmt.Fprintf(os.Stderr, "colly set proxy error: %s\n", errProxy.Error())
        panic(errProxy)
    }
    // AllowedDomains takes host names, not full URLs:
    // c.AllowedDomains = []string{"www.goodreads.com"}
    c.AllowURLRevisit = true
    extensions.RandomUserAgent(c)

    c.OnHTML(".quoteText ",
        func(e *colly.HTMLElement) {
            text := strings.TrimSpace(strings.Split(e.Text, "―")[0])
            author := TrimSpaceNewlineInString(strings.TrimSpace(e.ChildText(".authorOrTitle")))

            fileWriteForMarkdown(file, text, author)
        })

    c.OnHTML(".next_page", func(e *colly.HTMLElement) {
        println("visit: ", e.Request.AbsoluteURL(e.Attr("href")))
        errHrefVisit := c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
        if errHrefVisit != nil {
            panic(errHrefVisit)
        }

    })

    errVisit := c.Visit("https://www.goodreads.com/quotes/tag/philosophy")
    if errVisit != nil {
        panic(errVisit)
    }

}

func TrimSpaceNewlineInString(s string) string {
    re := regexp.MustCompile(`\n`)
    return re.ReplaceAllString(s, " ")
}

func fileWriteForMarkdown(file *os.File, lines ...string) {
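    // Closing tag of the admonition shortcode.
    var admotionBot = `
{{% /admonition %}}
`
    // Opening tag; %% yields a literal percent sign, lines[1] is the author.
    head := fmt.Sprintf(`
{{%% admonition quote "%s" %%}}
`, lines[1])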
    _, err := (*file).Write([]byte(head))
    if err != nil {
        println("file write error ", err.Error())
    }
    _, err = (*file).Write([]byte(lines[0]))
    if err != nil {
        println("file write error ", err.Error())
    }
    _, err = (*file).Write([]byte(admotionBot))
    if err != nil {
        println("file write error ", err.Error())
    }
}

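// fileWriteDirect writes the text and author as-is, without the admonition
// wrapper. It is not called by main in this draft.
func fileWriteDirect(file *os.File, lines ...string) {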
    _, err := (*file).Write([]byte(lines[0]))
    if err != nil {
        println("file write error ", err.Error())
    }
    _, err = (*file).Write([]byte(lines[1]))
    if err != nil {
        println("file write error ", err.Error())
    }
}
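
The text handling inside the `.quoteText` callback can be tried without any network access. The following is a minimal sketch using only the standard library: `splitQuote` is a hypothetical helper, and the sample input is made up to resemble the whitespace-heavy text of a quote element. Unlike the draft, which reads the author from `.authorOrTitle` and only replaces newlines, this variant takes the attribution from the text after the bar and collapses every whitespace run to a single space.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuote separates the quote body from the attribution that follows
// the "―" bar, similar to the Split and TrimSpace calls in the draft.
func splitQuote(raw string) (text, attribution string) {
	parts := strings.SplitN(raw, "―", 2)
	text = strings.TrimSpace(parts[0])
	if len(parts) == 2 {
		// strings.Fields splits on any run of whitespace, so joining with
		// one space removes newlines and indentation in a single step.
		attribution = strings.Join(strings.Fields(parts[1]), " ")
	}
	return text, attribution
}

func main() {
	// Made-up sample resembling the raw text of one quote element.
	raw := "\n      “Sometimes you wake up.”\n  ―\n  Neil Gaiman,\n    Fables & Reflections\n"
	text, who := splitQuote(raw)
	fmt.Println(text)
	fmt.Println(who)
}
```

If the bar is missing, the whole input is returned as the text and the attribution stays empty, so the caller can decide how to handle unattributed quotes.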