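I recently came across a quote library site: https://www.goodreads.com
That led me to Colly, a Go scraping framework with more than 6K stars on GitHub.
My plan is to first scrape all the quotes from goodreads and save them to a file. Then I parse the scraped quotes and format each one as markdown, so that it renders nicely on a web page like this:
Each quote has three elements: the quote type, the quote body, and the author or source.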
「We are what we pretend to be, so we must be careful about what we pretend to be.」
Kurt Vonnegut, Mother Night

「Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.」
Neil Gaiman, Fables & Reflections
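The three elements of a quote can be modelled as a small struct. The following is only a sketch: the `Quote` type, its field names, and the `Render` method are hypothetical, and the text does not spell out what "type" means, so the tag the quote was listed under is assumed here.

```go
package main

import "fmt"

// Quote holds the three elements of a scraped quote.
// The type and its field names are hypothetical, chosen for this sketch.
type Quote struct {
	Kind   string // quote type (assumed here to be the tag it was listed under)
	Text   string // quote body
	Author string // author or source
}

// Render returns the quote body wrapped in corner brackets,
// followed by the author or source on its own line.
func (q Quote) Render() string {
	return fmt.Sprintf("「%s」\n%s", q.Text, q.Author)
}

func main() {
	q := Quote{
		Kind:   "philosophy", // assumed tag, for illustration only
		Text:   "We are what we pretend to be, so we must be careful about what we pretend to be.",
		Author: "Kurt Vonnegut, Mother Night",
	}
	fmt.Println(q.Render())
}
```

Keeping the fields separate until the last moment makes it easy to switch the output format later without touching the scraping step.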
Lightning Fast and Elegant Scraping Framework for Gophers.
Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
gocolly/colly : https://github.com/gocolly/colly
$ go get -u github.com/gocolly/colly/...
$ go version
go version go1.12.8 linux/amd64
Optionally, you can export GO111MODULE=on.
Draft:
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

// newlineRe matches newlines inside the scraped author/title text.
var newlineRe = regexp.MustCompile(`\n`)

func main() {
	fileName := "quote.md"
	file, errFile := os.Create(fileName)
	if errFile != nil {
		fmt.Fprintf(os.Stderr, "operating system create file error: %s\n", errFile.Error())
		panic(errFile)
	}
	defer func() {
		if err := file.Close(); err != nil {
			println("file close error")
		}
	}()

	c := colly.NewCollector()
	// Send requests through a local proxy.
	errProxy := c.SetProxy("http://127.0.0.1:1080/")
	if errProxy != nil {
		fmt.Fprintf(os.Stderr, "colly set proxy error: %s\n", errProxy.Error())
		panic(errProxy)
	}
	// c.AllowedDomains = []string{"www.goodreads.com"}
	c.AllowURLRevisit = true
	extensions.RandomUserAgent(c)

	// Each .quoteText element holds the quote body, a "―" separator,
	// and the author or title.
	c.OnHTML(".quoteText", func(e *colly.HTMLElement) {
		text := strings.TrimSpace(strings.Split(e.Text, "―")[0])
		author := TrimSpaceNewlineInString(strings.TrimSpace(e.ChildText(".authorOrTitle")))
		fileWriteForMarkdown(file, text, author)
	})

	// Follow the "next page" link until there is none left.
	c.OnHTML(".next_page", func(e *colly.HTMLElement) {
		next := e.Request.AbsoluteURL(e.Attr("href"))
		println("visit: ", next)
		if errHrefVisit := c.Visit(next); errHrefVisit != nil {
			panic(errHrefVisit)
		}
	})

	errVisit := c.Visit("https://www.goodreads.com/quotes/tag/philosophy")
	if errVisit != nil {
		panic(errVisit)
	}
}

// TrimSpaceNewlineInString replaces every newline in s with a space.
func TrimSpaceNewlineInString(s string) string {
	return newlineRe.ReplaceAllString(s, " ")
}

// fileWriteForMarkdown writes one quote wrapped in an admonition shortcode.
// lines[0] is the quote body, lines[1] is the author or source.
func fileWriteForMarkdown(file *os.File, lines ...string) {
	admonitionBot := `
{{% /admonition %}}
`
	head := fmt.Sprintf(`
{{%% admonition quote "%s" %%}}
`, lines[1])
	for _, s := range []string{head, lines[0], admonitionBot} {
		if _, err := file.WriteString(s); err != nil {
			println("file write error ", err.Error())
		}
	}
}

// fileWriteDirect writes the first two lines to the file without any markup.
func fileWriteDirect(file *os.File, lines ...string) {
	for _, s := range lines[:2] {
		if _, err := file.WriteString(s); err != nil {
			println("file write error ", err.Error())
		}
	}
}
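The parsing and formatting steps of the draft can be tried offline, without hitting goodreads. The sketch below is a rough illustration only: `splitQuote` and `toAdmonition` are hypothetical helpers, and the sample `raw` string is an assumed approximation of the text inside a `.quoteText` element. Unlike the draft, which reads the attribution from the `.authorOrTitle` child element, this version takes it from the text after the "―" separator so that it can run on a plain string.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuote is a hypothetical helper mirroring the draft's parsing step:
// the text before the "―" separator is the quote body, and the text after
// it is treated as the author or source.
func splitQuote(raw string) (text, author string) {
	parts := strings.SplitN(raw, "―", 2)
	text = strings.TrimSpace(parts[0])
	if len(parts) == 2 {
		// Collapse newlines and runs of whitespace in the attribution.
		author = strings.Join(strings.Fields(parts[1]), " ")
	}
	return text, author
}

// toAdmonition wraps a quote in the admonition shortcode used by the draft.
// The exact line breaks around the shortcode are an assumption.
func toAdmonition(text, author string) string {
	return fmt.Sprintf("{{%% admonition quote \"%s\" %%}}\n%s\n{{%% /admonition %%}}\n", author, text)
}

func main() {
	// Assumed shape of the scraped element text, for illustration only.
	raw := "\n  “Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”\n  ―\n  Neil Gaiman,\n  Fables & Reflections\n"
	text, author := splitQuote(raw)
	fmt.Print(toAdmonition(text, author))
}
```

Separating these pure string functions from the Colly callbacks makes the formatting easy to check before running a full crawl.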