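Our earlier crawlers simply posed as a browser and fetched pages directly, without rotating IP proxies or User-Agent strings, so the server could easily ban our IP. We therefore need IP proxies, but we would rather not pay for them. Free IP proxies are available online, but most of them are unusable or unstable, so we have to scrape and validate them ourselves.
This post records how to maintain a free IP proxy pool on a schedule, wrap general-purpose crawler utility classes that draw a random IP proxy and a random User-Agent from the pools on every request, and build a simple traffic crawler to verify our IP proxy pool and UserAgent pool.
Main topics: crawling and SpringBoot. The project builds on several earlier posts:
httpclient+jsoup: collecting and reading novels online
SpringBoot series: elegant asynchronous calls with @Async
SpringBoot series: Spring-Data-JPA
SpringBoot series: WebSocket
SpringBoot series: Logback logging, output to file and real-time output to a web page
Project structure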
The pom references the parent project and pulls in the dependencies a basic crawler needs, plus the MySQL and JPA dependencies.
    <!-- Crawler -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.9</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <dependency>
        <groupId>net.sf.json-lib</groupId>
        <artifactId>json-lib</artifactId>
        <version>2.4</version>
        <classifier>jdk15</classifier>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.32</version>
    </dependency>

    <!-- Spring Data JPA dependency -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>

    <!-- MySQL driver dependency -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
    </dependency>
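PS: the actual database connection settings belong in each concrete crawler project.
This module can then serve as a shared utility project that concrete crawler projects pull in through their pom.
The response object of an HttpClient request differs from that of WebClient, so for consistency we define a unified response object.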
    import lombok.Data;

    /**
     * Unified response object
     */
    @Data
    public class ResultVo<E> {

        // Response status
        private Integer statusCode;

        // Response message
        private String statusMessage;

        // Response payload
        private E page;

        private ResultVo(Integer statusCode, String statusMessage, E page) {
            this.statusCode = statusCode;
            this.statusMessage = statusMessage;
            this.page = page;
        }

        /**
         * Obtain an instance through a static factory method
         */
        public static <E> ResultVo<E> of(Integer statusCode, String statusMessage, E page) {
            return new ResultVo<>(statusCode, statusMessage, page);
        }
    }
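As a rough illustration of how calling code might consume such a wrapper, here is a minimal sketch. It uses a Lombok-free stand-in for ResultVo so that it compiles on its own; the isSuccess helper and its "any 2xx counts as success" rule are assumptions made for illustration, not part of the original project.

```java
class ResultVoDemo {

    // Lombok-free stand-in for the ResultVo class, so this sketch is self-contained.
    static final class ResultVo<E> {
        private final Integer statusCode;
        private final String statusMessage;
        private final E page;

        private ResultVo(Integer statusCode, String statusMessage, E page) {
            this.statusCode = statusCode;
            this.statusMessage = statusMessage;
            this.page = page;
        }

        static <E> ResultVo<E> of(Integer statusCode, String statusMessage, E page) {
            return new ResultVo<>(statusCode, statusMessage, page);
        }

        Integer getStatusCode() { return statusCode; }
        String getStatusMessage() { return statusMessage; }
        E getPage() { return page; }
    }

    // Hypothetical helper: treat any 2xx status as success, whichever client produced it.
    static boolean isSuccess(ResultVo<?> result) {
        Integer code = result.getStatusCode();
        return code != null && code >= 200 && code < 300;
    }

    public static void main(String[] args) {
        // The page type is generic: a raw HTML string here, but it could be a parsed document.
        ResultVo<String> ok = ResultVo.of(200, "OK", "<html></html>");
        ResultVo<String> failed = ResultVo.of(503, "Service Unavailable", null);

        System.out.println(isSuccess(ok) + " " + ok.getPage());
        System.out.println(isSuccess(failed) + " " + failed.getStatusMessage());
    }
}
```

Because the payload type is a type parameter, the same wrapper can carry whatever page object each client returns, and callers only need to look at the status fields to decide what to do next.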
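There are quite a few free IP proxies, but most are unstable, so we have to scrape and validate them ourselves. This post mainly scrapes the free proxies from 89ip (http://www.89ip.cn/index_1.html): the first ten pages yield 150 proxies, of which roughly 50 pass validation. Two scheduled asynchronous tasks maintain the pool: one refreshes the IP proxy pool, currently once an hour, and one checks the IP proxy pool, currently every half hour. (Too few of Xici's free IP proxies were usable, so that source is commented out for now.)
Scraped IP proxies are stored in the database with the IP address as the primary key, so an existing record is replaced and a new one is inserted. To check whether a proxy works, we use it to request a site that reports the caller's public IP (http://pv.sohu.com/cityjson). If the request succeeds and the returned public IP equals the proxy's IP, the proxy works; proxies that fail are removed from the database, and the IP proxy pool is refreshed once the check finishes.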
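The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK 11 java.net.http client rather than the project's Apache HttpClient, the class and method names are made up, and it assumes the echo service's response contains a field of the form "cip": "x.x.x.x".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProxyCheckSketch {

    // Assumption: the echo service embeds the caller's IP as "cip": "x.x.x.x".
    private static final Pattern CIP = Pattern.compile("\"cip\"\\s*:\\s*\"([0-9.]+)\"");

    /** Extracts the public IP reported in the echo response, if there is one. */
    static Optional<String> extractReportedIp(String body) {
        if (body == null) {
            return Optional.empty();
        }
        Matcher matcher = CIP.matcher(body);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** A proxy only counts as working if the echoed IP equals the proxy's own IP. */
    static boolean reportsProxyIp(String body, String proxyIp) {
        return extractReportedIp(body).map(proxyIp::equals).orElse(false);
    }

    /** Sends the echo request through the proxy; any failure means "not usable". */
    static boolean isProxyUsable(String proxyIp, int proxyPort) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .proxy(ProxySelector.of(new InetSocketAddress(proxyIp, proxyPort)))
                    .connectTimeout(Duration.ofSeconds(5))
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://pv.sohu.com/cityjson"))
                    .timeout(Duration.ofSeconds(10))
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 && reportsProxyIp(response.body(), proxyIp);
        } catch (IOException | IllegalArgumentException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Offline demonstration of the comparison step with a made-up response body.
        String sample = "var returnCitySN = {\"cip\": \"1.2.3.4\", \"cname\": \"demo\"};";
        System.out.println(reportsProxyIp(sample, "1.2.3.4"));
        System.out.println(reportsProxyIp(sample, "5.6.7.8"));
    }
}
```

Keeping the comparison in a pure method separate from the network call makes the rule easy to verify without a live proxy; short timeouts matter because dead free proxies tend to hang rather than fail quickly.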
IP proxy table structure (SQL)
    /*
     Navicat Premium Data Transfer

     Source Server         : localhost
     Source Server Type    : MySQL
     Source Server Version : 50528
     Source Host           : localhost:3306
     Source Schema         : test

     Target Server Type    : MySQL
     Target Server Version : 50528
     File Encoding         : 65001

     Date: 13/08/2019 15:55:59
    */

    SET NAMES utf8mb4;
    SET FOREIGN_KEY_CHECKS = 0;

    -- ----------------------------
    -- Table structure for spider_ip_proxy
    -- ----------------------------
    DROP TABLE IF EXISTS `spider_ip_proxy`;
    CREATE TABLE `spider_ip_proxy` (
      `ip` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'IP address',
      `port` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'port',
      `city` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'city',
      `operator` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'carrier',
      PRIMARY KEY (`ip`) USING BTREE
    ) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

    SET FOREIGN_KEY_CHECKS = 1;
JPA entity mapping
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import lombok.Data;

    /**
     * Entity for the crawler IP proxy pool
     */
    @Data
    @Entity(name = "spider_ip_proxy")
    public class IpProxy {

        // IP address (primary key)
        @Id
        private String ip;

        // Port
        private String port;

        // City
        private String city;

        // Carrier
        private String operator;
    }
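I could not find any website that offers a UserAgent pool, so I collected a batch of User-Agent strings and stored them in the database as the UserAgent pool. I think that many should be enough, so there is no scheduled task to update them.
UserAgent table structure and data (SQL)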
/* Navicat Premium Data Transfer Source Server : localhost Source Server Type : MySQL Source Server Version : 50528 Source Host : localhost:3306 Source Schema : test Target Server Type : MySQL Target Server Version : 50528 File Encoding : 65001 Date: 13/08/2019 15:58:07 */ SET NAMES utf8mb4; SET FOREIGN_KEY_CHECKS = 0; -- ---------------------------- -- Table structure for spider_user_agent -- ---------------------------- DROP TABLE IF EXISTS `spider_user_agent`; CREATE TABLE `spider_user_agent` ( `user_agent` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'User Agent', PRIMARY KEY (`user_agent`) USING BTREE ) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact; -- ---------------------------- -- Records of spider_user_agent -- ---------------------------- INSERT INTO `spider_user_agent` VALUES ('Chrome/10.0.648.133 Safari/534.16'); INSERT INTO `spider_user_agent` VALUES ('Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'); INSERT INTO `spider_user_agent` VALUES 
('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; 360SE) '); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; '); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center 
PC 6.0)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) '); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Mobile Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 
5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Mobile Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Mobile Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; LGMS323 Build/KOT49I.MS32310c) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/67.0.3396.87 Mobile Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like 
Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'); INSERT INTO 
`spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 
(Change: 287 c9dfb30)'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) '); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10'); INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6)'); INSERT INTO `spider_user_agent` VALUES ('MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'); INSERT INTO `spider_user_agent` VALUES ('NOKIA5700/ UCWEB7.0.2.37/28/999'); INSERT INTO `spider_user_agent` VALUES ('Openwave/ UCWEB7.0.2.37/28/999'); INSERT INTO `spider_user_agent` VALUES ('Opera/8.0 (Windows NT 5.1; U; en)'); INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; 
en-GB) Presto/2.8.149 Version/11.10'); INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52'); INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11'); INSERT INTO `spider_user_agent` VALUES ('UCWEB7.0.2.37/28/999'); SET FOREIGN_KEY_CHECKS = 1;
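JPA entity mapping
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import lombok.Data;

    /**
     * Entity for the crawler User-Agent pool
     */
    @Data
    @Entity(name = "spider_user_agent")
    public class UserAgent {

        // User-Agent string (primary key)
        @Id
        private String userAgent;
    }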
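HttpClient comes from Apache HttpComponents. It can issue simple requests and fetch data, but it does not parse the DOM or execute JS or CSS, so we rely on Jsoup to parse the HTML document. The utility class holds the IP proxy pool and the UserAgent pool; every request picks a random IP proxy from the proxy pool and a random User-Agent from the UserAgent pool. The IP proxy pool is refreshed by the scheduled task.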
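The "pick a random entry per request, refresh on a schedule" idea can be sketched in a few lines. RandomPool is a hypothetical name, and the volatile immutable snapshot is one possible design, not necessarily what the project's utility classes do.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

class RandomPool<T> {

    // Immutable snapshot; swapping the reference keeps readers safe while a refresh happens.
    private volatile List<T> items = List.of();

    /** Replaces the pool contents, for example after the scheduled proxy refresh. */
    void refresh(List<T> fresh) {
        items = List.copyOf(fresh);
    }

    /** Returns a uniformly random entry, or empty if the pool currently has none. */
    Optional<T> pick() {
        List<T> snapshot = items;
        if (snapshot.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(snapshot.get(ThreadLocalRandom.current().nextInt(snapshot.size())));
    }

    public static void main(String[] args) {
        RandomPool<String> userAgents = new RandomPool<>();
        userAgents.refresh(List.of(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
                "Opera/8.0 (Windows NT 5.1; U; en)"));

        // Each request would call pick() once for a proxy and once for a User-Agent.
        System.out.println(userAgents.pick().orElse("pool is empty"));
    }
}
```

Returning an Optional forces the caller to decide what happens when every proxy has been removed by the checker, for instance falling back to a direct connection or skipping the request.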
It exposes a static method that returns an HttpClient object and supports bypassing SSL verification.
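For readers unfamiliar with what bypassing SSL verification involves, here is a minimal sketch using only JDK classes. It is an assumption about the general technique rather than the project's actual code: it builds an SSLContext whose trust manager accepts every certificate. In HttpClient 4.5 such a context would typically be handed to the client builder (for example through HttpClientBuilder.setSSLContext). Accepting all certificates removes protection against man-in-the-middle attacks, so it only makes sense for throwaway crawling.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

class TrustAllSslSketch {

    /** Builds a TLS context that skips certificate validation entirely. */
    static SSLContext trustAllContext() {
        TrustManager[] trustAll = {
            new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: every client certificate is accepted.
                }

                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) {
                    // Intentionally empty: every server certificate is accepted.
                }

                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustAll, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create trust-all SSL context", e);
        }
    }

    public static void main(String[] args) {
        SSLContext context = trustAllContext();
        System.out.println(context.getProtocol());
    }
}
```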
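WebClient comes from htmlunit. It simulates a browser, parsing the DOM and executing JS and CSS, and lets you work with the HTML document much as jQuery manipulates DOM objects. The utility class likewise holds the IP proxy pool and the UserAgent pool; every request picks a random IP proxy from the proxy pool and a random User-Agent from the UserAgent pool. The IP proxy pool is refreshed by the scheduled task.
It exposes a static method that returns a WebClient object with some features enabled.
The traffic crawler currently has the following projects:
We pull in common-spider and start writing the traffic crawler. It mainly uses WebClient to visit blog posts on Cnblogs (博客園), switching the IP proxy and User-Agent and enabling JS execution. Everything is randomized: random proxy IP, random User-Agent, random visit interval, random blog post. We could even send random cookies (this needs careful analysis of which cookies are sent and what rules their values follow; Firefox is recommended for the analysis), so as to mimic real user visits and raise the blog's read count, commonly known as inflating page views.
To watch the logs in real time, we bring back our earlier trick (SpringBoot series: Logback logging, output to file and real-time output to a web page) and start building the project.
Project structure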
The pom file references the parent project and pulls in common-spider, thymeleaf, and websocket.
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <artifactId>flow-spider</artifactId>
        <version>0.0.1</version>
        <name>flow-spider</name>
        <description>Traffic crawler</description>

        <!-- Parent project -->
        <parent>
            <groupId>cn.huanzi.qch</groupId>
            <artifactId>parent</artifactId>
            <version>1.0.0</version>
        </parent>

        <dependencies>
            <dependency>
                <groupId>cn.huanzi.qch</groupId>
                <artifactId>common-spider</artifactId>
                <version>0.0.1</version>
            </dependency>

            <!-- springboot websocket -->
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-websocket</artifactId>
            </dependency>

            <!-- thymeleaf templates -->
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-thymeleaf</artifactId>
            </dependency>
        </dependencies>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-maven-plugin</artifactId>
                </plugin>
            </plugins>
        </build>
    </project>
The configuration file holds the database settings.
    # Database settings
    spring.datasource.url=jdbc:mysql://localhost:3306/test?serverTimezone=GMT%2B8&characterEncoding=utf-8
    spring.datasource.username=root
    spring.datasource.password=123456
    spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
The other steps needed for real-time logging are not repeated here; see the earlier post.
Blog entity object
For convenience, we crawl the list of blog posts and store it in the database.
Database table structure (SQL)
    /*
     Navicat Premium Data Transfer

     Source Server         : localhost
     Source Server Type    : MySQL
     Source Server Version : 50528
     Source Host           : localhost:3306
     Source Schema         : test

     Target Server Type    : MySQL
     Target Server Version : 50528
     File Encoding         : 65001

     Date: 13/08/2019 16:48:14
    */

    SET NAMES utf8mb4;
    SET FOREIGN_KEY_CHECKS = 0;

    -- ----------------------------
    -- Table structure for spider_blog
    -- ----------------------------
    DROP TABLE IF EXISTS `spider_blog`;
    CREATE TABLE `spider_blog` (
      `blog_url` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'blog post URL',
      `blog_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog post title',
      PRIMARY KEY (`blog_url`) USING BTREE
    ) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

    SET FOREIGN_KEY_CHECKS = 1;
JPA entity mapping
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import lombok.Data;

    /**
     * Entity for a Cnblogs blog post
     */
    @Data
    @Entity(name = "spider_blog")
    public class Blog {

        // Blog post URL (primary key)
        @Id
        private String blogUrl;

        // Blog post title
        private String blogName;
    }
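controller
To be lazy, we skip the service layer altogether and write the business logic directly in the controller layer.
Startup class
The startup class also needs some annotation configuration. SpringBoot by default only scans the current package and its subpackages, so we have to add annotations that specify the scan paths before Spring can pick up the annotated classes.
    import lombok.extern.slf4j.Slf4j;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.autoconfigure.domain.EntityScan;
    import org.springframework.context.annotation.ComponentScan;
    import org.springframework.data.jpa.repository.config.EnableJpaRepositories;
    import org.springframework.scheduling.annotation.EnableScheduling;

    // Lombok's @Slf4j creates the Logger object for us, with the same effect as obtaining the logger object below
    @Slf4j
    // By default only the current package and its subpackages are scanned
    @SpringBootApplication
    // Scan for @Repository
    @EnableJpaRepositories(basePackages = {"cn.huanzi.qch.commonspider.repository", "cn.huanzi.qch.flowspider.cnblogs.repository"})
    // Scan for @Entity
    @EntityScan(basePackages = {"cn.huanzi.qch.commonspider.pojo", "cn.huanzi.qch.flowspider.cnblogs.pojo"})
    // Scan for @Component-based annotations such as @Controller and @Service
    @ComponentScan(basePackages = {"cn.huanzi.qch.commonspider.**", "cn.huanzi.qch.flowspider.**"})
    // Enable scheduled tasks
    @EnableScheduling
    public class FlowSpiderApplication {
        // Part of the code omitted...
    }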
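Because we specify the paths with annotations, SpringBoot's default scan path no longer applies, so every path that needs scanning must be listed.
The project is mostly configured. For convenience, we add a few buttons to the real-time log page to trigger these functions manually.
Running result
The page looks roughly like this
So how well does this traffic crawler actually work? This is the result of leaving it running from a little after 6 p.m. until a little after 9 a.m. the next day. I kept only one post in the blog list and deleted all the others; that post's view count rose from 34 to 890.
Over a thousand successes yet only about eight hundred more views? And more than three thousand failures? Isn't that efficiency a bit too low?
1. There are plenty of free IP proxies, but few really work, and they are unstable: a proxy that passed validation a few minutes ago may fail by the time you use it. For stable IP proxies, paying is the more reliable option.
2. "400 The plain HTTP request was sent to HTTPS port" shows up often (the message nginx returns when a plain-HTTP request arrives at a port that expects HTTPS), and I don't yet know how to fix it.
3. With small probability the same IP proxy is picked several times within the same time window, and Cnblogs does not count those visits.
4. Other, unknown factors affecting the view count...
PS:
As the saying goes, why should programmers make life hard for programmers... Please don't set the random visit interval too short. We are doing this to learn, not to farm traffic, and we should spare a thought for the Cnblogs ops staff!
(Quietly: you could write a scheduled task that refreshes the blog list, and the traffic bot would then run fully automatically. At the current rate it can contribute about 2000 views a day; package it, deploy it to a cloud server, and it runs unattended around the clock [hidden smirk~~])
Also, after you check out the code, please don't all test it on my blog; I'm afraid of getting my account banned...
Addendum:
Well, this is awkward...
For now it can only handle votes that don't require WeChat login authorization, such as the voting example below; the reasons are discussed later.
Let's first briefly analyze this type of WeChat vote and do some preparation. Find the page link of an ongoing WeChat voting campaign (http://www.dzmshd.com/Home/index.php?m=Index&a=content&id=42&fid=8130&subscribe=1), right-click to view the source, and find the request link that casts the vote.
PS: WeChat is sneaky; the page only opens in WeChat's built-in browser...
Open it in the WeChat desktop client, right-click the page, and view the source.
WeChat generates a TXT file at this location and opens it for us automatically; we then search by keyword,
search for this js method, find the request link, and append the parameters: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8130&id=42&tp=
With the link found, we start writing code. Again, keep it simple and put it in the controller.
Note that the User-Agent must be a WeChat one; the UserAgent pool from earlier cannot be used here. I found a few online.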
    // WeChat User-Agent strings
    String[] webKitUserAgent = {
            "Mozilla/5.0 (Linux; Android 7.1.1; MI 6 Build/NMF26X; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/043807 Mobile Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
            "Mozilla/5.0 (Linux; Android 7.1.1; OD103 Build/NMF26F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
            "Mozilla/5.0 (Linux; Android 6.0.1; SM919 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
            "Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
            "Mozilla/5.0 (Linux; Android 5.1; HUAWEI TAG-AL00 Build/HUAWEITAG-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043622 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_2 like Mac OS X) AppleWebKit/604.4.7 (KHTML, like Gecko) Mobile/15C202 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 MicroMessenger/6.6.1 NetType/WIFI Language/zh_CN",
            "Mozilla/5.0 (iphone x Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
    };
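controller
This link is fairly simple, a GET request, so HttpClientUtil is enough. Start the project and visit: http://localhost:10087/weChatVote/start
Result
As before, we send requests at random intervals and switch IP proxies, this time with WeChat User-Agent strings. After running for a short while
The log shows 13 successes. On checking, the count has gone from 2983 to 2997; the one extra is probably someone who happened to vote for it at the same time...
To make verification easier, we find an entry with zero votes and try that. Everyone else has thousands of votes and it has none, poor thing, so let's boost its popularity (hehe~)
First find the request link: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8679&id=42&tp=
Start the project and visit: http://localhost:10087/weChatVote/start
Result
After running for a short while, it gained 137 votes and jumped straight to 12th place (facepalm)
PS: many of the failures had this cause: we have too few proxy IPs, and some of them had already been used to vote for the second-place entry, so the vote failed. Afterwards, refresh the IP proxy pool, run the validation check, and continue.
Votes that require login authorization are more troublesome. First look at the general flow of WeChat web page authorization: (WeChat Official Accounts Platform: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1421140842)
An ordinary browser cannot debug or inspect WeChat's links; you need packet-capture software such as fiddler for the analysis.
If the parameters are set incorrectly, even the authorization page cannot be reached.
Forcing a request to the link found in the source returns this error page: because parameters are missing, it cannot even redirect to the authorization page.
Automatic tasks refresh the free IP proxies, and every request uses a random interval, a random IP, and a random User-Agent, optionally even random cookies, to mimic requests a real user would send from a browser.
That's all for this post. To be clear, the techniques are for study and research only; please do not use them anywhere that would break the law. Discussion is welcome.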
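Originally the two utility classes only supported GET requests; they now also support POST requests. One thing to note: in my tests, POST request parameters have to be set in one of two ways for the backend to receive them:
1. The server side uses @RequestBody: set the request header Content-type=application/json; charset=UTF-8 and put the request parameters in the body.
2. The server side does not use @RequestBody: set the request header Content-type=application/x-www-form-urlencoded; charset=UTF-8 and put the request parameters in the URL parameters.
Both variants are currently in the code, with one commented out by default; adjust and extend as needed when you use it.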
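The two cases can be illustrated with a small sketch. It is not the project's utility code: it uses the JDK 11 java.net.http request builder instead of Apache HttpClient or htmlunit, and the class name, method names, and example URL are all hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class PostStyleSketch {

    /** Encodes parameters as key=value pairs joined by '&' (query-string style). */
    static String formEncode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Case 1: the server method uses @RequestBody, so the JSON goes into the body. */
    static HttpRequest jsonPost(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    /** Case 2: no @RequestBody, so the parameters travel as URL parameters. */
    static HttpRequest formPost(String url, Map<String, String> params) {
        return HttpRequest.newBuilder(URI.create(url + "?" + formEncode(params)))
                .header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "a b");
        params.put("id", "42");

        // Hypothetical endpoint, used only to show how each request is put together.
        HttpRequest json = jsonPost("http://localhost:8080/demo", "{\"name\":\"a b\",\"id\":42}");
        HttpRequest form = formPost("http://localhost:8080/demo", params);

        System.out.println(json.method() + " " + json.uri());
        System.out.println(form.method() + " " + form.uri());
    }
}
```

Neither request is actually sent here; the sketch only shows where the parameters and the Content-Type header end up in each case.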
The code is open source and hosted on my GitHub and Gitee (碼雲):