Recently, while searching for ebooks, I came across an interesting site with quite a few carefully proofread ("精校") editions. Out of curiosity, I decided to write a crawler to collect all the titles in one place (the main reason being that the site's canvas page-background effect is especially resource-hungry in Chrome), so I could search for a title myself and then look up its download link. The script took a dozen or so minutes to write, and one night was roughly enough for it to finish running.
Here is the code, for reference only (it is fairly rough).
package com.fun

import com.fun.db.mysql.MySqlTest
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import org.slf4j.Logger
import org.slf4j.LoggerFactory

class T extends FanLibrary {

    static Logger logger = LoggerFactory.getLogger(T.class)

    public static void main(String[] args) {
//        test(322)
        // crawl book ids 1..1000, one page every 2 seconds
        def list = 1..1000 as List
        list.each { x ->
            try {
                test(x)
            } catch (Exception e) {
                logger.error(x.toString())
                output(e)
            }
            logger.warn(x.toString())
            sleep(2000)
        }
        testOver()
    }

    static def test(int id) {
//        def get = getHttpGet("https://****/books/9798.html")
        def get = getHttpGet("https://****/books/" + id + ".html")
        def response = getHttpResponse(get)
        def string = response.getString("content")
        // skip ids whose page does not exist
        if (string.contains("您需求的文件不存在") || string.contains("頁面未找到")) return
        output(string)
        // pick out the fragments carrying the title, intro, author and the two download links
        def all = Regex.regexAll(string, "class=\"bookpic\"> <img title=\".*?\"").get(0)
        def all2 = Regex.regexAll(string, "content=\"內容簡介.*?\"").get(0)
        def all3 = Regex.regexAll(string, "title=\"做者:.*?\"").get(0)
        def all40 = Regex.regexAll(string, "https://sobooks\\.cc/go\\.html\\?url=https{0,1}://.*?\\.ctfile\\.com/.*?\"")
        def all4 = all40.size() == 0 ? "" : all40.get(0)
        def all50 = Regex.regexAll(string, "https://sobooks\\.cc/go\\.html\\?url=https{0,1}://pan\\.baidu\\.com/.*?\"")
        def all5 = all50.size() == 0 ? "" : all50.get(0)
        output(all)
        output(all2)
        output(all3)
        output(all4)
        output(all5)
        // strip the attribute names and quotes, keeping only the values
        def name = all.substring(all.lastIndexOf("=") + 2, all.length() - 1)
        def author = all3.substring(all3.lastIndexOf("=") + 2, all3.length() - 1)
        def intro = all2.substring(all2.lastIndexOf("=") + 2, all2.length() - 1)
        def url1 = all4 == "" ? "" : all4.substring(all4.lastIndexOf("=") + 1, all4.length() - 1)
        def url2 = all5 == "" ? "" : all5.substring(all5.lastIndexOf("=") + 1, all5.length() - 1)
        output(name, author, intro, url1, url2)
        // write one row per book into MySQL
        def sql = String.format("INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d)", name, author, intro, url1, url2, id)
        MySqlTest.sendWork(sql)
    }

}
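To make the parsing step clearer, here is a minimal, self-contained sketch of the same regex-plus-substring trick, using plain java.util.regex instead of the framework's Regex.regexAll helper; the HTML fragment and book title are made up for illustration.

import java.util.regex.Matcher
import java.util.regex.Pattern

// hypothetical fragment of a book page, only for demonstration
def html = 'class="bookpic"> <img title="某本精校電子書" src="cover.jpg">'

// same lazy pattern as in the crawler: stop at the first closing quote after title="
Matcher m = Pattern.compile('class="bookpic"> <img title=".*?"').matcher(html)
if (m.find()) {
    def all = m.group()
    // lastIndexOf("=") + 2 skips the =" in front of the value, length() - 1 drops the closing quote
    def name = all.substring(all.lastIndexOf("=") + 2, all.length() - 1)
    println name   // prints: 某本精校電子書
}

The author and intro fields are sliced the same way; for the two download links the code skips only the = (+1 instead of +2), because those fragments are bare ...?url=... links rather than attr="value pairs.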
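One note on the last step: String.format splices the scraped text straight into the INSERT statement, so a title or intro that happens to contain a double quote would break the SQL. Below is a hedged sketch of a safer variant using a plain JDBC PreparedStatement instead of the framework's MySqlTest.sendWork; the JDBC URL, credentials, and helper name are placeholders, not part of the original code.

import java.sql.Connection
import java.sql.DriverManager

// hypothetical helper: parameterised insert, so quotes in the scraped fields cannot break the SQL
void saveBook(String name, String author, String intro, String urlc, String urlb, int bookid) {
    Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/books?useUnicode=true&characterEncoding=utf8", "user", "password")
    def ps = conn.prepareStatement(
            "INSERT INTO books (name, author, intro, urlc, urlb, bookid) VALUES (?, ?, ?, ?, ?, ?)")
    ps.setString(1, name)
    ps.setString(2, author)
    ps.setString(3, intro)
    ps.setString(4, urlc)
    ps.setString(5, urlb)
    ps.setInt(6, bookid)
    ps.executeUpdate()
    ps.close()
    conn.close()
}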
我的感受仍是比較滿意的。python
Reply with 「電子書」 (ebook) to the WeChat official account to get the website address and the download link for the CSV file.