A Great Toolkit for Writing Web Crawlers - Groovy + Jsoup + Sublime

Date: 2019-11-29


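I have written quite a few small crawlers. The last few times I mainly used C# + Html Agility Pack to get the job done. Because the .NET BCL only offers the "low-level" HttpWebRequest and the "mid-level" WebClient, HTTP operations still take a lot of code. On top of that, writing C# means using Visual Studio, a very "heavy" tool, so my development efficiency stayed low for a long time.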

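Recently, in a project, I came across a wonderful language, Groovy: a dynamic language that is largely compatible with Java syntax and adds a great deal of extra syntax on top. Combine it with the open-source Jsoup project, a lightweight library that parses HTML using CSS selectors, and writing crawlers becomes a breeze.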

A script that grabs the news titles from the cnblogs home page

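@Grab('org.jsoup:jsoup:1.15.3') // assumed version; any recent Jsoup release should work
import org.jsoup.Jsoup

// Print the text of each headline link on the home page
Jsoup.connect("http://cnblogs.com").get().select("#post_list > div > div.post_item_body > h3 > a").each {
    println it.text()
}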

output

Grabbing the news details from the cnblogs home page

@Grab('org.jsoup:jsoup:1.15.3') // assumed version; any recent Jsoup release should work
import org.jsoup.Jsoup

// Take the first five posts and read each field with a selector relative to the post's own div
Jsoup.connect("http://cnblogs.com").get().select("#post_list > div").take(5).each {
    def url = it.select("> div.post_item_body > h3 > a").attr("href")
    def title = it.select("> div.post_item_body > h3 > a").text()
    def description = it.select("> div.post_item_body > p").text()
    def author = it.select("> div.post_item_body > div > a").text()
    def comments = it.select("> div.post_item_body > div > span.article_comment > a").text()
    def view = it.select("> div.post_item_body > div > span.article_view > a").text()

    println ""
    println "News: $title"
    println "Link: $url"
    println "Description: $description"
    println "Author: $author, Comments: $comments, Views: $view"
}

output

Pretty convenient, isn't it? If it feels like writing front-end JavaScript and jQuery code, that is exactly the point!

Here is a tip: when writing CSS selectors, you can lean on the developer tools in Google Chrome, as shown in the figure.
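A selector worked out in the browser can also be checked offline before it goes into a crawler. The following is only a minimal sketch: it assumes Jsoup 1.15.3 fetched through @Grab (the version is an assumption), and the HTML fragment is made up to imitate the post_list structure targeted above. The names html, doc, links, titles, and hrefs are illustrative.

```groovy
@Grab('org.jsoup:jsoup:1.15.3') // assumed version; any recent Jsoup release should work
import org.jsoup.Jsoup

// Hypothetical markup that imitates the list structure targeted by the selectors above
def html = '''
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body"><h3><a href="/p/1">First post</a></h3></div>
  </div>
  <div class="post_item">
    <div class="post_item_body"><h3><a href="/p/2">Second post</a></h3></div>
  </div>
</div>
'''

// Parse the string instead of fetching a URL, then apply the selector under test
def doc = Jsoup.parse(html)
def links = doc.select("#post_list > div > div.post_item_body > h3 > a")

// Collect the link texts and targets so they can be inspected or asserted on
def titles = links.collect { it.text() }
def hrefs = links.collect { it.attr("href") }
println titles
println hrefs
```

Working against a saved or hand-written fragment like this keeps the trial-and-error loop fast and avoids hitting the live site while the selector is still being tuned.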

Now let's see how quickly Groovy handles JSON and XML. In a word: it could hardly be more convenient.
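Before pointing the parsers at live URLs, the same idea can be tried on inline strings. This is a hedged sketch, not the original author's code: the JSON payload and the Atom-style feed are invented sample data, and the import of groovy.xml.XmlSlurper assumes Groovy 3 or later.

```groovy
import groovy.json.JsonSlurper
import groovy.xml.XmlSlurper // assumes Groovy 3+; on Groovy 2.x drop this line, XmlSlurper is imported by default

// Hypothetical JSON payload: a list of objects, each with a Name field
def json = '[{"Name": "Tools"}, {"Name": "Servers"}]'
def names = new JsonSlurper().parseText(json).collect { it.Name }
println names

// Hypothetical Atom-style feed with two entries
def atom = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry><title>Post A</title><author><name>alice</name></author></entry>
  <entry><title>Post B</title><author><name>bob</name></author></entry>
</feed>'''

// GPath navigation: property access walks child elements by name
def feed = new XmlSlurper().parseText(atom)
def feedTitle = feed.title.text()
def entries = feed.entry.collect { [it.title.text(), it.author.name.text()] }
println feedTitle
println entries
```

In both cases the parsed result is navigated with plain property access, which is why no separate model classes are needed.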

Grabbing the cnblogs feeds

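// On Groovy 4 and later, add "import groovy.xml.XmlSlurper"; older versions import XmlSlurper by default
// Read the feed header, then print the first three entries
new XmlSlurper().parse("http://feed.cnblogs.com/blog/sitehome/rss").with { xml ->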
    def title = xml.title.text()
    def subtitle = xml.subtitle.text()
    def updated = xml.updated.text()

    println "feeds"
    println "title -> $title"
    println "subtitle -> $subtitle"
    println "updated -> $updated"

    def entryList = xml.entry.take(3).collect {
        def id = it.id.text()
        def subject = it.title.text()
        def summary = it.summary.text()
        def author = it.author.name.text()
        def published = it.published.text()
        [id, subject, summary, author, published]
    }.each {
        println ""
        println "article -> ${it[1]}"
        println it[0]
        println "author -> ${it[3]}"
    }
}

output

Grabbing the MSDN subscription product categories

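import groovy.json.JsonSlurper

// Parse the JSON response and print the Name of each product category
new JsonSlurper().parse(new URL("http://msdn.microsoft.com/en-us/subscriptions/json/GetProductCategories?brand=MSDN&localeCode=en-us")).with { rs ->
    println rs.collect { it.Name }
}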

output

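Now a word about the code editor. Because this approach uses Groovy, a dynamic language, a lightweight text editor is enough, and the one I recommend here is Sublime. Its name translates into Chinese as roughly "高大上" (grand and classy). Judging by the rich features and excellent user experience this small text editor delivers, it certainly lives up to the name.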

Advantages: