To stay in the loop, I like to browse the campus forum for new posts when I have time. As a crawler enthusiast, I naturally came up with the idea of writing a crawler to download them automatically.
Well, this time I'll go with Java.
Jsoup is one of the better HTML parsers for Java. It can fetch and parse a URL directly, or parse raw HTML text, and it offers a very convenient API for extracting and manipulating data via the DOM, CSS selectors, and jQuery-like methods.
mysql-connector-java is the MySQL driver for Java's JDBC, providing a convenient, uniform interface for working with MySQL databases.
I crawled Zhejiang University's cc98 forum, which is only reachable from the campus intranet. New posts appear on one particular board, which looks roughly like this:
After some analysis, I found that the 100 newest posts span 5 pages, with the content presented in a table.
Inspecting the page with Firefox shows a table with class="tableborder1" containing 20 rows, each with 4 td cells; the crawler only needs to extract those four cells.
Since the site requires a username and password to log in, I started out using POST requests. Only after capturing traffic with Firefox did I realize that a GET request carrying the right cookies is enough to log in.
```java
private void getDoc(String url, String page) {
    try {
        // Fetch the page
        this.doc = Jsoup.connect(url)
                .cookie("aspsky", "***")
                .cookie("BoardList", "BoardID=Show")
                .cookie("owaenabled", "True")
                .cookie("autoplay", "True")
                .cookie("ASPSESSIONIDSSCBSSCR", "***")
                .data("stype", "3")
                .data("page", page) // page runs from 1 to 5
                .get();
    } catch (IOException e) {
        e.printStackTrace();
        throw new RuntimeException(e);
    }
}
```
Jsoup is quite powerful here; headers and other request attributes can be added as well, but I found that the cookies alone were enough to access the page.
Based on the analysis above, we can now parse the HTML document. Jsoup conveniently supports jQuery-style selectors:
```java
private void parse() {
    // Select rows with a jQuery-style CSS selector
    Elements rows = doc.select(".tableborder1 tr");
    // Drop the header row
    rows.remove(0);
    for (Element row : rows) {
        String theme = row.select("td:eq(0) a:eq(1)").text().trim();
        String url = "http://www.cc98.org/" + row.select("td:eq(0) a:eq(1)").attr("href");
        String part = row.select("td:eq(1) a").text().trim();
        String author = row.select("td:eq(2) a").text().trim();
        if (author.isEmpty()) {
            author = "匿名"; // anonymous poster
        }
        String rawTime = row.select("td:eq(3)").text()
                .replace("\n", "")
                .replace("\t", "");
        try {
            Date publishTime = sdf.parse(rawTime);
            System.out.println(publishTime + " " + theme);
            System.out.println("---------------------------------------------------------");
            storage.store(theme, publishTime, part, author, url);
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
```
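The time cell comes back with padding around it, which is why the `sdf` pattern in the full source keeps spaces around the date format. A minimal sketch of that parsing step, isolated from the crawler (the sample string is made up):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeParseDemo {
    // Mirrors the spider's format: the spaces in the pattern
    // absorb the padding around the cell text.
    private static final SimpleDateFormat SDF =
            new SimpleDateFormat(" yyyy/MM/dd HH:mm ");

    public static Date parsePostTime(String raw) {
        try {
            // Strip newlines/tabs the way the spider does, then parse
            return SDF.parse(raw.replace("\n", "").replace("\t", ""));
        } catch (ParseException e) {
            throw new RuntimeException("Unexpected time format: " + raw, e);
        }
    }

    public static void main(String[] args) {
        Date d = parsePostTime(" 2017/10/04 12:30 ");
        System.out.println(d);
    }
}
```

Note that `SimpleDateFormat` is not thread-safe, which is fine here since the spider runs single-threaded.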
To make later analysis easier (I also plan to occasionally look at how active each board is), the data needs to be stored on disk; I chose JDBC with MySQL for this.
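The post never shows the news table itself. A plausible schema matching the INSERT and the UNIQUE constraint the code relies on might look like this (column types and the choice of url as the unique key are my guesses):

```sql
CREATE TABLE news (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    theme       VARCHAR(255) NOT NULL,
    publishTime TIMESTAMP    NOT NULL,
    part        VARCHAR(64),
    author      VARCHAR(64),
    url         VARCHAR(255) NOT NULL,
    UNIQUE KEY uniq_url (url)  -- lets re-runs of the crawler skip already-stored posts
) DEFAULT CHARSET = utf8;
```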
```java
public void store(String theme, Date publishTime, String part, String author, String url) {
    try {
        String sql = "INSERT INTO news (theme,"
                + "publishTime,part,author,url) VALUES (?,?,?,?,?)";
        // Use a prepared statement
        PreparedStatement ps = conn.prepareStatement(sql);
        // Fill in the parameters in order
        ps.setString(1, theme);
        // Store the publish time as a database timestamp
        ps.setTimestamp(2, new Timestamp(publishTime.getTime()));
        ps.setString(3, part);
        ps.setString(4, author);
        ps.setString(5, url);
        ps.executeUpdate();
    } catch (SQLException e) {
        // Mostly duplicate rows; the table already has a UNIQUE constraint in MySQL
        // e.printStackTrace();
    }
}
```
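Swallowing the SQLException works, but it also hides genuine errors. Since duplicates are the only expected failure given the UNIQUE constraint, an alternative is to skip them at the SQL level and let other exceptions surface (a sketch, assuming the same table):

```sql
INSERT IGNORE INTO news (theme, publishTime, part, author, url)
VALUES (?, ?, ?, ?, ?);
```

With `INSERT IGNORE`, MySQL silently drops rows that would violate the unique key, so the catch block could then log real errors again.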
The results can be inspected with a MySQL GUI tool. Since MySQL Workbench isn't very pleasant to use, I went with DBeaver instead; the results look like this:
Quite satisfying.
Spider.java
```java
package com.company;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class Spider {

    private Document doc;
    // Date format of the post-time column; the surrounding spaces in the
    // pattern match the padding around the cell text
    private SimpleDateFormat sdf = new SimpleDateFormat(" yyyy/MM/dd HH:mm ");
    Storage storage = new Storage();

    private void getDoc(String url, String page) {
        try {
            // Fetch the page
            this.doc = Jsoup.connect(url)
                    .cookie("aspsky", "***")
                    .cookie("BoardList", "BoardID=Show")
                    .cookie("owaenabled", "True")
                    .cookie("autoplay", "True")
                    .cookie("ASPSESSIONIDSSCBSSCR", "***")
                    .data("stype", "3")
                    .data("page", page)
                    .get();
        } catch (IOException e) {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    private void parse() {
        // Select rows with a jQuery-style CSS selector
        Elements rows = doc.select(".tableborder1 tr");
        // Drop the header row
        rows.remove(0);
        for (Element row : rows) {
            String theme = row.select("td:eq(0) a:eq(1)").text().trim();
            String url = "http://www.cc98.org/" + row.select("td:eq(0) a:eq(1)").attr("href");
            String part = row.select("td:eq(1) a").text().trim();
            String author = row.select("td:eq(2) a").text().trim();
            if (author.isEmpty()) {
                author = "匿名"; // anonymous poster
            }
            String rawTime = row.select("td:eq(3)").text()
                    .replace("\n", "")
                    .replace("\t", "");
            try {
                Date publishTime = sdf.parse(rawTime);
                System.out.println(publishTime + " " + theme);
                System.out.println("---------------------------------------------------------");
                storage.store(theme, publishTime, part, author, url);
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

    public void run(String url) {
        // The 100 newest posts span 5 pages
        for (int i = 1; i <= 5; i++) {
            getDoc(url, Integer.toString(i));
            parse();
        }
        storage.close();
    }

    public static void main(String[] args) {
        Spider spider = new Spider();
        spider.run("http://www.cc98.org/queryresult.asp?stype=3");
    }
}
```
Storage.java
```java
package com.company;

import java.sql.*;
import java.util.Date;

public class Storage {

    // JDBC connection string; cc98 is the database name
    private static final String URL = "jdbc:mysql://localhost:3306/cc98?characterEncoding=utf8&useSSL=false";
    // Login name
    private static final String NAME = "***";
    // Password
    private static final String PASSWORD = "***";

    private Connection conn = null;

    public Storage() {
        // Load the JDBC driver
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            System.out.println("Failed to load the JDBC driver; check that it is on the classpath!");
            e.printStackTrace();
        }
        try {
            conn = DriverManager.getConnection(URL, NAME, PASSWORD);
            // System.out.println("Database connection established!");
        } catch (SQLException e) {
            // System.out.println("Failed to connect to the database!");
            e.printStackTrace();
        }
    }

    public void close() {
        // Close the database connection
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    public void store(String theme, Date publishTime, String part, String author, String url) {
        try {
            String sql = "INSERT INTO news (theme,"
                    + "publishTime,part,author,url) VALUES (?,?,?,?,?)";
            // Use a prepared statement
            PreparedStatement ps = conn.prepareStatement(sql);
            // Fill in the parameters in order
            ps.setString(1, theme);
            // Store the publish time as a database timestamp
            ps.setTimestamp(2, new Timestamp(publishTime.getTime()));
            ps.setString(3, part);
            ps.setString(4, author);
            ps.setString(5, url);
            ps.executeUpdate();
        } catch (SQLException e) {
            // Mostly duplicate rows; the table already has a UNIQUE constraint in MySQL
            // e.printStackTrace();
        }
    }
}
```
This project actually touches on quite a few topics. I've only just started learning Java and haven't fully digested all the details; both JDBC and Jsoup are well worth studying properly. Comments and corrections are welcome.
Also, happy Mid-Autumn Festival and National Day to everyone~