An E-book Website Crawler in Practice

Date: 2019-11-13


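While searching for e-books recently, I came across an interesting site with quite a few carefully proofread editions. Out of curiosity, I decided to write a crawler to compile the titles. (The real reason is that the site's canvas page-background effect is especially resource-hungry in Chrome.) With the list in hand, I can search for a title myself and then look up its download address. The script was basically finished in ten-odd minutes, and one night is about enough for it to finish running.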

Here is the code, for reference only (it is fairly rough).

package com.fun

import com.fun.db.mysql.MySqlTest
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import org.slf4j.Logger
import org.slf4j.LoggerFactory

class T extends FanLibrary {

    static Logger logger = LoggerFactory.getLogger(T.class)


    public static void main(String[] args) {
//        test(322)

        def list = 1..1000 as List

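        list.each { x ->
            try {
                test(x)
            } catch (Exception e) {
                // Log the failing id and carry on with the next page
                logger.error(x.toString())
                output(e)
            }
            // Progress marker, then a 2-second pause between requests
            logger.warn(x.toString())
            sleep(2000)
        }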

        testOver()
    }

    static def test(int id) {
//        def get = getHttpGet("https://****/books/9798.html")
        def get = getHttpGet("https://****/books/" + id + ".html")
        def response = getHttpResponse(get)
        def string = response.getString("content")
        if (string.contains("您需求的文件不存在")|| string.contains("頁面未找到")) return
        output(string)
        def all = Regex.regexAll(string, "class=\"bookpic\"> <img title=\".*?\"").get(0)
        def all2 = Regex.regexAll(string, "content=\"內容簡介.*?\"").get(0)
        def all3 = Regex.regexAll(string, "title=\"作者：.*?\"").get(0)
        def all40 = Regex.regexAll(string, "https://sobooks\\.cc/go\\.html\\?url=https{0,1}://.*?\\.ctfile\\.com/.*?\"")
        def all4 = all40.size() == 0 ? "" : all40.get(0)
        def all50 = Regex.regexAll(string, "https://sobooks\\.cc/go\\.html\\?url=https{0,1}://pan\\.baidu\\.com/.*?\"")
        def all5 = all50.size() == 0 ? "" : all50.get(0)
        output(all)
        output(all2)
        output(all3)
        output(all4)
        output(all5)
        def name = all.substring(all.lastIndexOf("=") + 2, all.length() - 1)
        def author = all3.substring(all3.lastIndexOf("=") + 2, all3.length() - 1)
        def intro = all2.substring(all2.lastIndexOf("=") + 2, all2.length() - 1)
        def url1 = all4 == "" ? "" : all4.substring(all4.lastIndexOf("=") + 1, all4.length() - 1)
        def url2 = all5 == "" ? "" : all5.substring(all5.lastIndexOf("=") + 1, all5.length() - 1)
        output(name, author, intro, url1, url2)
        def sql = String.format("INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d)", name, author, intro, url1, url2, id)
        MySqlTest.sendWork(sql)
    }
}
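
Two spots in the listing above are a little fragile: the `substring(lastIndexOf("=") ...)` arithmetic breaks if a title or URL itself contains `=`, and building the INSERT with `String.format` breaks if a value contains a quote. Below is a minimal sketch of one possible alternative, written in plain Java (which Groovy can call directly). It is an assumption on my part, not the author's code: `BookPageSketch`, `firstGroup`, and `insert` are hypothetical names, and it uses `java.util.regex` and JDBC's `PreparedStatement` rather than the `Regex` and `MySqlTest` helpers from `com.fun`. The patterns and the column list are copied from the listing.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the author's com.fun library.
class BookPageSketch {

    // Same anchors as in the listing, but the wanted text is a capture group,
    // so no substring/lastIndexOf arithmetic is needed afterwards.
    static final Pattern NAME =
            Pattern.compile("class=\"bookpic\"> <img title=\"(.*?)\"");
    static final Pattern CTFILE = Pattern.compile(
            "https://sobooks\\.cc/go\\.html\\?url=(https?://.*?\\.ctfile\\.com/.*?)\"");

    // Table and column names are taken from the INSERT statement in the listing.
    static final String INSERT_SQL =
            "INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (?,?,?,?,?,?)";

    /** Returns group 1 of the first match, or "" when the pattern is absent. */
    static String firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    /** Binds every value as a parameter, so quotes in them cannot alter the SQL. */
    static void insert(Connection conn, String name, String author, String intro,
                       String urlC, String urlB, int bookId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setString(2, author);
            ps.setString(3, intro);
            ps.setString(4, urlC);
            ps.setString(5, urlB);
            ps.setInt(6, bookId);
            ps.executeUpdate();
        }
    }
}
```

Returning an empty string for a missing pattern mirrors how the listing treats absent download links, and would also avoid the `get(0)` failure when a page has no title or author match. The `Connection` would come from whatever JDBC driver and credentials are already in use.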

Personally, I am fairly satisfied with the result.

Reply "電子書" (e-book) to the official account to get the site address and the download link for the CSV file.