Scraping and Reading Novels Online with HttpClient + jsoup

Preface

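  Anyone who read novels in an older version of the UC browser will remember that, back when copyright enforcement was looser, you could search UC for a novel across several sources and read it right there. How did it do that? Let's build our own online novel scraper and reader. (Note: for technical study and research only.)

  The most annoying thing about reading novels online is the ads: some are put there by the site owner to make money, others are injected maliciously. My previous post, "HttpClients+Jsoup: scraping Biquge (筆趣閣) novels and saving them to a local TXT file", scraped a novel into a TXT file that can be imported into a reading app on a phone. This time we build something that lets us read online.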

 

Without further ado, here is the result

 

  Home page:

  The page is clean and simple; three sources are currently available.

 

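   Search results page:

  Three different sources. Paging uses layui's laypage and is logical paging, that is, the full result list is loaded once and then split into pages. (Biquge's search results have no cover images.)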
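The slicing step behind logical paging can be sketched in a few lines. This is only an illustration of the arithmetic, assuming 1-based page numbers; in the project the paging is driven by laypage in the browser, and `PageSlicer` and `page` are hypothetical names that do not appear in the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: slices an already loaded list into pages (logical paging). */
final class PageSlicer {

    /**
     * Returns the items of one page.
     *
     * @param items    the complete, already fetched result list
     * @param pageNo   1-based page number
     * @param pageSize number of items per page, must be positive
     */
    static <T> List<T> page(List<T> items, int pageNo, int pageSize) {
        if (pageNo < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNo and pageSize must be >= 1");
        }
        // long arithmetic so a very large page number cannot overflow int
        long from = (long) (pageNo - 1) * pageSize;
        if (from >= items.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, items.size());
        // copy the view so callers cannot modify the source list through it
        return new ArrayList<>(items.subList((int) from, to));
    }

    public static void main(String[] args) {
        List<String> books = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(books, 2, 2)); // second page, two items per page
    }
}
```

The trade-off is the one the article runs into later: every result has to be collected before the first page can be shown.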

 

 

   

  Paging:

 

 

 

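  Zongheng's search also matches the tokenized keyword against fields such as the synopsis, so it returns far too many results and collecting them all is too slow. The handler therefore stops following pages once the check books.size() < 888 fails.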
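The cap can be pictured as a loop that keeps requesting pages until either the pages run out or enough results have been collected. The sketch below is an assumption-laden illustration, not the handler's code: `fetchPage` stands in for "download and parse result page n", and `CappedCollector` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical sketch: page-by-page collection with an upper bound on results. */
final class CappedCollector {

    /**
     * @param fetchPage stand-in for "download and parse page n"; returns an empty list when the page does not exist
     * @param cap       no further page is requested once this many items have been collected
     */
    static <T> List<T> collect(IntFunction<List<T>> fetchPage, int cap) {
        List<T> all = new ArrayList<>();
        int pageNo = 1;
        while (all.size() < cap) {
            List<T> page = fetchPage.apply(pageNo++);
            if (page.isEmpty()) {
                break; // no more pages
            }
            all.addAll(page);
        }
        return all;
    }

    public static void main(String[] args) {
        // pretend there are 100 pages with 10 results each
        List<String> books = collect(
                n -> n <= 100 ? Collections.nCopies(10, "book") : Collections.<String>emptyList(), 25);
        System.out.println(books.size());
    }
}
```

As in the handler, the cap is only checked between pages, so the final list can exceed it by up to one page.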

 

 

  Book details page:

 

  Reading page:

 

  Previous / next chapter:

 

Code and analysis

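  The project is a Spring Boot application and the principle is very simple: HttpClient builds a request with browser-like headers and fetches the source URL, and jsoup parses the response.

  jsoup selectors pick out the data we want, which is stored in an entity and put into a ModelAndView; the front-end pages use Thymeleaf to read the values and iterate over the data.

  Some books can only be read by paying members. Scraping those would require simulating a login; since this is just a simple scraper, we skip that.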

 

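  Problems encountered while scraping:

  1. When collecting the book list from Qidian (起點中文網), the data we want is not in the page source

   Qidian is clever: the pagination links do not appear directly in the HTML.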

 

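  As the comparison shows, the pagination element is empty in the response HttpClient gets back, but it is populated when the page is opened in a browser.

 

  This is because HttpClient only requests the one URL and returns exactly what that URL responds with; it neither loads sub-resources nor executes JavaScript. A browser, on the other hand, parses the returned HTML, requests the resources referenced by img, script, link and similar tags, and runs the scripts, which can fire further Ajax requests and modify the page.
  From this we can infer that Qidian fetches the pagination data in JavaScript and appends it to the page; the Network panel eventually turned up some traces of it.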

 

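  Since the links are not in the HTML, we build them ourselves: the HTML does contain the current page number and the maximum page number.
 
Works perfectly.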
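Building the next-page link from those two numbers might look like the sketch below. It assumes the search URL carries a `page=` query parameter, as the `searchUrl` template in the Qidian handler does; `QidianPaging` and `nextPageUrl` are hypothetical names used only for illustration.

```java
import java.util.Optional;

/** Hypothetical sketch: derive the next result page from the current and maximum page numbers. */
final class QidianPaging {

    /**
     * @param currentUrl URL of the page just scraped, containing "page=<page>"
     * @param page       current page number, as read from the page container
     * @param pageMax    last page number, as read from the page container
     * @return the next page's URL, or empty when the current page is the last one
     */
    static Optional<String> nextPageUrl(String currentUrl, int page, int pageMax) {
        if (page >= pageMax) {
            return Optional.empty();
        }
        // \b keeps "page=1" from matching inside "page=12" or inside a name such as "homepage="
        String next = currentUrl.replaceFirst("\\bpage=" + page + "\\b", "page=" + (page + 1));
        return Optional.of(next);
    }

    public static void main(String[] args) {
        System.out.println(nextPageUrl("https://www.qidian.com/search?kw=abc&page=1", 1, 5).orElse("(last page)"));
    }
}
```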

 

 

 

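   2. Biquge book details: image hotlink protection

   Biquge protects its images against hotlinking: when our own HTML page references the image URL the image does not load, yet opening the same URL directly in the browser works fine.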

 



  
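  Compare the request headers of the two requests:

 

  First we need to know what image hotlink protection is; see the article "圖片防盜鏈原理及應對方法" (how image hotlink protection works and how to get around it). We take the workaround from that article as is and adapt it to our project: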
<div id="bookImg"></div>
    /**
     * Work around hotlink protection
     */
    function showImg(parentObj, url) {
        // Random suffix so every iframe gets its own id
        var frameid = 'frameimg' + Math.random();
        // Store the markup on the parent window. The original article also put a script tag in this markup that bound window.onload to size the iframe to the image: <script>window.onload = function() {  parent.document.getElementById(\'' + frameid + '\').height = document.getElementById(\'img\').height+\'px\'; }<' + '/script>
        window.img = '<img src=\'' + url + '?' + Math.random() + '\'/>';
        // The iframe evaluates parent.img and renders the returned markup
        $(parentObj).append('<iframe id="' + frameid + '" src="javascript:parent.img;" frameBorder="0" scrolling="no"></iframe>');
    }

    showImg($("#bookImg"), book.img);
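Hotlink protection of this kind is commonly a server-side check of the Referer request header. The sketch below is a guess at what such a check might look like, inferred only from the behaviour described above (direct access works, embedding in a foreign page fails); it is not Biquge's actual implementation, and `HotlinkCheck` and `isAllowed` are hypothetical names.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch of a Referer-based hotlink check, as an image server might apply it. */
final class HotlinkCheck {

    /**
     * @param referer value of the Referer request header, may be null
     * @param ownHost the site's own domain, e.g. "biquge.com.tw"
     */
    static boolean isAllowed(String referer, String ownHost) {
        if (referer == null || referer.isEmpty()) {
            return true; // URL opened directly, or the Referer was not sent
        }
        try {
            String host = URI.create(referer).getHost();
            if (host == null) {
                return false;
            }
            String h = host.toLowerCase(Locale.ROOT);
            String own = ownHost.toLowerCase(Locale.ROOT);
            // the site itself and its subdomains may embed the image
            return h.equals(own) || h.endsWith("." + own);
        } catch (IllegalArgumentException e) {
            return false; // malformed Referer
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://localhost:8080/book/details", "biquge.com.tw"));
    }
}
```

Under this model the iframe trick would work because a request started from a `javascript:` iframe presumably carries no Referer, which falls into the first branch.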

 

  Final result:

 

 

 

 

 

 

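  3. When collecting book details, Qidian's chapter list is not in the HTML

   Qidian's chapter list is not in the page HTML, nor does it live on a separate page.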

 

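  Set a breakpoint with "Break on" in the Elements panel of the browser's developer tools.
 
  
  The call stack shows that the page requests the data via Ajax in JavaScript when switching tabs; even the total chapter count is fetched by Ajax after the page has loaded.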

 


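  Look at the request headers and parameters:

 

  Once we understand the _csrfToken parameter we can construct the GET request ourselves:  https://book.qidian.com/ajax/book/category?_csrfToken=LosgUIe29G7LV04gdutbSqzKRb9XxoPyqtWBQ3hU&bookId=1209977 

  The browser shows that the link of the first chapter is: https://read.qidian.com/chapter/2R9G_ziBVg41/MyEcwtk5i8Iex0RJOkJclQ2
  which is exactly what we want:
  https://read.qidian.com/chapter/  + cU (the chapter link)
  cN is the chapter name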
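The mapping from `cU` and `cN` to a chapter list can be sketched without any JSON library. The sketch assumes the response has already been parsed into nested maps and lists with the layout the handler code further down reads (`data` → `vs` → `cs`); `ChapterLinks` and `toChapters` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turn the parsed "vs" (volumes) array into chapter name / URL pairs. */
final class ChapterLinks {

    static final String READ_BASE = "https://read.qidian.com/chapter/";

    /** Each volume is expected to hold a "cs" list whose entries carry "cN" (name) and "cU" (link part). */
    static List<Map<String, String>> toChapters(List<Map<String, Object>> volumes) {
        List<Map<String, String>> chapters = new ArrayList<>();
        for (Map<String, Object> volume : volumes) {
            Object cs = volume.get("cs");
            if (!(cs instanceof List)) {
                continue; // volume without a chapter list
            }
            for (Object entry : (List<?>) cs) {
                if (!(entry instanceof Map)) {
                    continue;
                }
                Map<?, ?> c = (Map<?, ?>) entry;
                Map<String, String> chapter = new LinkedHashMap<>();
                chapter.put("chapterName", String.valueOf(c.get("cN")));
                chapter.put("url", READ_BASE + c.get("cU"));
                chapters.add(chapter);
            }
        }
        return chapters;
    }

    public static void main(String[] args) {
        Map<String, Object> chapter = new LinkedHashMap<>();
        chapter.put("cN", "Chapter 1");
        chapter.put("cU", "2R9G_ziBVg41/MyEcwtk5i8Iex0RJOkJclQ2");
        Map<String, Object> volume = new LinkedHashMap<>();
        volume.put("cs", java.util.Collections.singletonList(chapter));
        System.out.println(toChapters(java.util.Collections.singletonList(volume)));
    }
}
```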

 

 
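   _csrfToken is a cookie, and it stays the same across page refreshes. An educated guess: Qidian generates the cookie for us, the page sends it along with the Ajax request, and the request fails when the token sent does not match the cookie Qidian issued.
  Every call to gather creates a new HttpClient object, so cookies do not carry over between calls. We therefore fetch the cookie first and then request the data with the same HttpClient (the full code is posted below, in BookHandler_qidian.book_details_qidian).
 
   In the end we get the response, which is JSON.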
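The two steps, reading the token from the cookies and putting it into the Ajax URL, can be illustrated with the JDK alone. This is a sketch under assumptions: it parses raw `Set-Cookie` header values with `java.net.HttpCookie` instead of using HttpClient's cookie store as the project does, and `CsrfToken`, `fromSetCookie` and `categoryUrl` are hypothetical names.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: pick the _csrfToken cookie out of Set-Cookie headers and build the Ajax URL. */
final class CsrfToken {

    /** Returns the value of the _csrfToken cookie, if any of the headers sets it. */
    static Optional<String> fromSetCookie(List<String> setCookieHeaders) {
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if ("_csrfToken".equals(cookie.getName())) {
                    return Optional.of(cookie.getValue());
                }
            }
        }
        return Optional.empty();
    }

    /** URL layout taken from the request observed in the browser. */
    static String categoryUrl(String csrfToken, String bookId) {
        return "https://book.qidian.com/ajax/book/category?_csrfToken=" + csrfToken + "&bookId=" + bookId;
    }

    public static void main(String[] args) {
        // made-up header values, for illustration only
        List<String> headers = Arrays.asList("other=1; Path=/", "_csrfToken=abc123; Path=/");
        fromSetCookie(headers).ifPresent(token -> System.out.println(categoryUrl(token, "1209977")));
    }
}
```

Whichever client is used, the token in the URL and the cookie sent with the request have to come from the same session, which is why the project keeps a single client for both requests.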

 

 
 
 

  As before, most of the logic is explained in the comments, which should be easy to follow:

 

  Maven dependencies:

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.9</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.4</version>
            <classifier>jdk15</classifier>
        </dependency>

 

  Book entity class:

import lombok.Data;

import java.util.List;
import java.util.Map;

/**
 * Book entity (@Data is Lombok's annotation; it generates the getters and setters)
 */
@Data
public class Book {

    /** Book URL */
    private String bookUrl;

    /** Title */
    private String bookName;

    /** Author */
    private String author;

    /** Synopsis */
    private String synopsis;

    /** Cover image URL */
    private String img;

    /** Chapter list; each entry holds chapterName and url */
    private List<Map<String,String>> chapters;

    /** Status */
    private String status;

    /** Category */
    private String type;

    /** Last update time */
    private String updateDate;

    /** First chapter */
    private String firstChapter;

    /** First chapter URL */
    private String firstChapterUrl;

    /** Previous chapter */
    private String prevChapter;

    /** Previous chapter URL */
    private String prevChapterUrl;

    /** Current chapter name */
    private String nowChapter;

    /** Current chapter content */
    private String nowChapterValue;

    /** Current chapter URL */
    private String nowChapterUrl;

    /** Next chapter */
    private String nextChapter;

    /** Next chapter URL */
    private String nextChapterUrl;

    /** Latest chapter */
    private String latestChapter;

    /** Latest chapter URL */
    private String latestChapterUrl;

    /** Size */
    private String magnitude;

    /** Source settings and source key */
    private Map<String,String> source;
    private String sourceKey;
}

 

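  Utility class:

import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * Small utility class
 */
public class BookUtil {

    /**
     * Fills the numbered placeholders of a URL template.
     * Example:
     *
     * @param src    http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=
     * @param params "鬥破蒼穹","1"
     * @return http://search.zongheng.com/s?keyword=鬥破蒼穹&pageNo=1&sort=
     */
    public static String insertParams(String src, String... params) {
        int i = 1;
        for (String param : params) {
            //replace() substitutes literally; replaceAll() would treat "$" or "\" in param as regex syntax
            src = src.replace("#" + i, param);
            i++;
        }
        return src;
    }

    /**
     * Fetches a URL and returns the complete response body as a string
     *
     * @param url        URL to fetch
     * @param refererUrl value sent in the Referer header
     * @return the response body, or null if the request failed
     */
    public static String gather(String url, String refererUrl) {
        String result = null;
        //A new client is created on every call, so cookies and sessions are NOT shared between calls.
        //try-with-resources closes the client and the response even when an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            //Build the GET request
            HttpGet httpGet = new HttpGet(url);
            httpGet.addHeader("Content-type", "application/json");
            //Pose as a regular browser
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpGet.addHeader("Referer", refererUrl);
            httpGet.addHeader("Connection", "keep-alive");

            //Execute the request
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    //GBK is only the fallback, used when the response does not declare a charset
                    result = EntityUtils.toString(response.getEntity(), "GBK");
                }
            }
        }
        //A timeout could also be caught here in order to retry the request
        catch (Exception e) {
            result = null;
            System.err.println("Error while fetching the page");
            e.printStackTrace();
        }
        return result;
    }
}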
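Two details of the template filling are easy to get wrong: a `#1` placeholder is also a prefix of `#10`, and each site expects the keyword in its own URL encoding (the source settings below use GB2312 for Biquge and UTF-8 for the other two). The following sketch shows one way to handle both; `UrlTemplate`, `fill` and `encode` are hypothetical names, not part of the project.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical sketch: literal placeholder substitution plus per-site keyword encoding. */
final class UrlTemplate {

    /** Replaces #1, #2, ... with the given parameters. */
    static String fill(String template, String... params) {
        String result = template;
        // highest index first, so "#1" cannot eat the beginning of "#10"
        for (int i = params.length; i >= 1; i--) {
            // String.replace substitutes literally, no regex involved
            result = result.replace("#" + i, params[i - 1]);
        }
        return result;
    }

    /** Encodes a keyword for use in a query string, in the charset the target site expects. */
    static String encode(String keyword, String charset) throws UnsupportedEncodingException {
        return URLEncoder.encode(keyword, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String template = "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=";
        System.out.println(fill(template, encode("\u6597\u7834", "UTF-8"), "1"));
        // the same keyword yields different bytes, and so a different query string, in GB2312
        System.out.println(encode("\u6597\u7834", "GB2312"));
    }
}
```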

 

   Controller layer:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.ModelAndView;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Book controller
 */
@RestController
@RequestMapping("book")
public class BookContrller {

    /**
     * Registered sources, keyed by source key
     */
    private static Map<String, Map<String, String>> source = new HashMap<>();

    static {
        //Biquge
        source.put("biquge", BookHandler_biquge.biquge);

        //Zongheng
        source.put("zongheng", BookHandler_zongheng.zongheng);

        //Qidian
        source.put("qidian", BookHandler_qidian.qidian);
    }

    /**
     * Home page
     */
    @GetMapping("/index")
    public ModelAndView index() {
        return new ModelAndView("book_index.html");
    }

    /**
     * Search by book title
     */
    @GetMapping("/search")
    public ModelAndView search(Book book) {
        //Result set
        ArrayList<Book> books = new ArrayList<>();
        //Keyword
        String keyWord = book.getBookName();
        //Source key; an empty or unknown key means "search every source"
        String sourceKey = book.getSourceKey();
        List<String> keys = source.containsKey(sourceKey)
                ? Collections.singletonList(sourceKey) : new ArrayList<>(source.keySet());

        for (String key : keys) {
            //Work on a copy: the handlers overwrite searchUrl/baseUrl while paging,
            //and the shared static map must keep its #1/#2 template for the next request
            Map<String, String> src = new HashMap<>(source.get(key));
            String encoded;
            try {
                //Each site expects the keyword in its own encoding
                encoded = URLEncoder.encode(keyWord, src.get("UrlEncode"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
                continue;
            }
            //searchUrl for page 1
            src.put("searchUrl", BookUtil.insertParams(src.get("searchUrl"), encoded, "1"));

            //Dispatch to the matching handler
            switch (key) {
                case "biquge":
                    BookHandler_biquge.book_search_biquge(books, src, encoded);
                    break;
                case "zongheng":
                    BookHandler_zongheng.book_search_zongheng(books, src, encoded);
                    break;
                case "qidian":
                    BookHandler_qidian.book_search_qidian(books, src, encoded);
                    break;
                default:
                    break;
            }
        }

        System.out.println(books.size());
        ModelAndView modelAndView = new ModelAndView("book_list.html", "books", books);
        modelAndView.addObject("keyWord", keyWord);
        modelAndView.addObject("sourceKey", sourceKey);
        return modelAndView;
    }

    /**
     * Book details
     */
    @GetMapping("/details")
    public ModelAndView details(String sourceKey,String bookUrl,String searchUrl) {
        //Copy, so that the shared source map is not modified
        Map<String, String> src = new HashMap<>(source.get(sourceKey));
        src.put("searchUrl",searchUrl);
        Book book = new Book();
        //Dispatch to the matching handler
        switch (sourceKey) {
            case "biquge":
                book = BookHandler_biquge.book_details_biquge(src, bookUrl);
                break;
            case "zongheng":
                book = BookHandler_zongheng.book_details_zongheng(src, bookUrl);
                break;
            case "qidian":
                book = BookHandler_qidian.book_details_qidian(src, bookUrl);
                break;
            default:
                break;
        }
        return new ModelAndView("book_details.html", "book", book);
    }

    /**
     * Read a chapter
     */
    @GetMapping("/read")
    public ModelAndView read(String sourceKey,String chapterUrl,String refererUrl) {
        Map<String, String> src = source.get(sourceKey);
        Book book = new Book();
        //Dispatch to the matching handler
        switch (sourceKey) {
            case "biquge":
                book = BookHandler_biquge.book_read_biquge(src, chapterUrl,refererUrl);
                break;
            case "zongheng":
                book = BookHandler_zongheng.book_read_zongheng(src, chapterUrl,refererUrl);
                break;
            case "qidian":
                book = BookHandler_qidian.book_read_qidian(src, chapterUrl,refererUrl);
                break;
            default:
                break;
        }
        return new ModelAndView("book_read.html", "book", book);
    }
}
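The controller repeats the same switch on the source key in every endpoint. One possible alternative, shown here purely as a design sketch and not as the author's approach, is to give the sources a common interface and look them up in a registry. `BookSource` and `SourceRegistry` are hypothetical names, and results are reduced to plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical common contract for one scraping source. */
interface BookSource {
    List<String> search(String keyword);
}

/** Hypothetical registry that replaces the per-endpoint switch statements. */
final class SourceRegistry {

    private final Map<String, BookSource> sources = new LinkedHashMap<>();

    SourceRegistry register(String key, BookSource source) {
        sources.put(key, source);
        return this;
    }

    /** A known key queries that source only; an empty or unknown key queries every registered source. */
    List<String> search(String key, String keyword) {
        List<String> results = new ArrayList<>();
        BookSource one = sources.get(key);
        if (one != null) {
            results.addAll(one.search(keyword));
        } else {
            for (BookSource source : sources.values()) {
                results.addAll(source.search(keyword));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        SourceRegistry registry = new SourceRegistry()
                .register("biquge", k -> java.util.Collections.singletonList("biquge:" + k))
                .register("qidian", k -> java.util.Collections.singletonList("qidian:" + k));
        System.out.println(registry.search("", "x"));
    }
}
```

Adding a fourth source would then mean one more `register` call instead of a new case in three switch statements.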

 

   Three handlers, one per source; each source has its own scraping rules:


 BookHandler_biquge
/**
 * Scraping rules for Biquge (筆趣閣)
 */
public class BookHandler_biquge {

    /**
     * Source settings
     */
    public static HashMap<String, String> biquge = new HashMap<>();

    static {
        //Biquge
        biquge.put("name", "筆趣閣");
        biquge.put("key", "biquge");
        biquge.put("baseUrl", "http://www.biquge.com.tw");
        biquge.put("baseSearchUrl", "http://www.biquge.com.tw/modules/article/soshu.php");
        biquge.put("UrlEncode", "GB2312");
        biquge.put("searchUrl", "http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+#1&page=#2");
    }

    /**
     * Collects the search result list (Biquge rules)
     *
     * @param books   result list to append to
     * @param src     source settings
     * @param keyWord search keyword
     */
    public static void book_search_biquge(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        //Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Results on the current page
            Elements resultList = doc.select("table.grid  tr#nr");
            for (Element result : resultList) {
                Book book = new Book();
                //Book URL
                book.setBookUrl(result.child(0).select("a").attr("href"));
                //Title
                book.setBookName(result.child(0).select("a").text());
                //Author
                book.setAuthor(result.child(2).text());
                //Last update time
                book.setUpdateDate(result.child(4).text());
                //Latest chapter
                book.setLatestChapter(result.child(1).select("a").text());
                book.setLatestChapterUrl(result.child(1).select("a").attr("href"));
                //Status
                book.setStatus(result.child(5).text());
                //Size
                book.setMagnitude(result.child(3).text());
                //Source
                book.setSource(src);
                books.add(book);
            }

            //Next page: recurse while the page has a "next" link
            Elements searchNext = doc.select("div.pages > a.ngroup");
            String href = searchNext.attr("href");
            if (!StringUtils.isEmpty(href)) {
                src.put("baseUrl", src.get("searchUrl"));
                src.put("searchUrl", href.contains("http") ? href : (src.get("baseSearchUrl") + href));
                book_search_biquge(books, src, keyWord);
            }

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
    }

    /**
     * Collects the book details (Biquge rules)
     * @param src source settings
     * @param bookUrl book URL
     * @return Book object
     */
    public static Book book_details_biquge(Map<String, String> src, String bookUrl) {
        Book book = new Book();
        //Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            //Book URL
            book.setBookUrl(doc.select("meta[property=og:url]").attr("content"));
            //Cover image
            book.setImg(doc.select("meta[property=og:image]").attr("content"));
            //Title
            book.setBookName(doc.select("div#info > h1").text());
            //Author
            book.setAuthor(doc.select("meta[property=og:novel:author]").attr("content"));
            //Last update time
            book.setUpdateDate(doc.select("meta[property=og:novel:update_time]").attr("content"));
            //Latest chapter
            book.setLatestChapter(doc.select("meta[property=og:novel:latest_chapter_name]").attr("content"));
            book.setLatestChapterUrl(doc.select("meta[property=og:novel:latest_chapter_url]").attr("content"));
            //Category
            book.setType(doc.select("meta[property=og:novel:category]").attr("content"));
            //Synopsis
            book.setSynopsis(doc.select("meta[property=og:description]").attr("content"));
            //Status
            book.setStatus(doc.select("meta[property=og:novel:status]").attr("content"));

            //Chapter list
            ArrayList<Map<String, String>> chapters = new ArrayList<>();
            for (Element result : doc.select("div#list dd")) {
                HashMap<String, String> map = new HashMap<>();
                map.put("chapterName", result.select("a").text());
                map.put("url", result.select("a").attr("href"));
                chapters.add(map);
            }
            book.setChapters(chapters);

            //Source
            book.setSource(src);

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * Collects the current chapter's name and full content plus the links to the previous and next chapter (Biquge rules)
     * @param src source settings
     * @param chapterUrl URL of the current chapter
     * @param refererUrl URL of the referring page
     * @return Book object
     */
    public static Book book_read_biquge(Map<String, String> src,String chapterUrl,String refererUrl) {
        Book book = new Book();

        //Current chapter URL; relative links are prefixed with the site's base URL
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));

        //Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Current chapter name
            book.setNowChapter(doc.select("div.box_con > div.bookname > h1").text());

            //Remove image ads
            doc.select("div.box_con > div#content img").remove();
            //Current chapter content
            book.setNowChapterValue(doc.select("div.box_con > div#content").outerHtml());

            //Previous / next chapter ("上一章" is the previous chapter link, "下一章" the next one)
            book.setPrevChapter(doc.select("div.bottem2 a:matches((?i)上一章)").text());
            book.setPrevChapterUrl(doc.select("div.bottem2 a:matches((?i)上一章)").attr("href"));
            book.setNextChapter(doc.select("div.bottem2 a:matches((?i)下一章)").text());
            book.setNextChapterUrl(doc.select("div.bottem2 a:matches((?i)下一章)").attr("href"));

            //Source
            book.setSource(src);

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
        return book;
    }
}


BookHandler_zongheng
/**
 * Scraping rules for Zongheng (縱橫中文網)
 */
public class BookHandler_zongheng {

    /**
     * Source settings
     */
    public static HashMap<String, String> zongheng = new HashMap<>();

    static {
        //Zongheng
        zongheng.put("name", "縱橫中文網");
        zongheng.put("key", "zongheng");
        zongheng.put("baseUrl", "http://www.zongheng.com");
        zongheng.put("baseSearchUrl", "http://search.zongheng.com/s");
        zongheng.put("UrlEncode", "UTF-8");
        zongheng.put("searchUrl", "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=");
    }

    /**
     * Collects the search result list (Zongheng rules)
     *
     * @param books   result list to append to
     * @param src     source settings
     * @param keyWord search keyword
     */
    public static void book_search_zongheng(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        //Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Results on the current page
            Elements resultList = doc.select("div.search-tab > div.search-result-list");
            for (Element result : resultList) {
                Book book = new Book();
                //Book URL
                book.setBookUrl(result.select("div.imgbox a").attr("href"));
                //Cover image
                book.setImg(result.select("div.imgbox img").attr("src"));
                //Title
                book.setBookName(result.select("h2.tit").text());
                //Author
                book.setAuthor(result.select("div.bookinfo > a").first().text());
                //Category
                book.setType(result.select("div.bookinfo > a").last().text());
                //Synopsis
                book.setSynopsis(result.select("p").text());
                //Status
                book.setStatus(result.select("div.bookinfo > span").first().text());
                //Size
                book.setMagnitude(result.select("div.bookinfo > span").last().text());
                //Source
                book.setSource(src);
                books.add(book);
            }

            //Next page
            Elements searchNext = doc.select("div.search_d_pagesize > a.search_d_next");
            String href = searchNext.attr("href");
            //Stop following pages once 888 books have been collected, otherwise it is too slow
            if (books.size() < 888 && !StringUtils.isEmpty(href)) {
                src.put("baseUrl", src.get("searchUrl"));
                src.put("searchUrl", href.contains("http") ? href : (src.get("baseSearchUrl") + href));
                book_search_zongheng(books, src, keyWord);
            }

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
    }

    /**
     * Collects the book details (Zongheng rules)
     * @param src source settings
     * @param bookUrl book URL
     * @return Book object
     */
    public static Book book_details_zongheng(Map<String, String> src, String bookUrl) {
        Book book = new Book();
        //Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Book URL
            book.setBookUrl(bookUrl);
            //Cover image
            book.setImg(doc.select("div.book-img > img").attr("src"));
            //Title
            book.setBookName(doc.select("div.book-info > div.book-name").text());
            //Author
            book.setAuthor(doc.select("div.book-author div.au-name").text());
            //Last update time
            book.setUpdateDate(doc.select("div.book-new-chapter div.time").text());
            //Latest chapter
            book.setLatestChapter(doc.select("div.book-new-chapter div.tit a").text());
            book.setLatestChapterUrl(doc.select("div.book-new-chapter div.tit a").attr("href"));
            //Category
            book.setType(doc.select("div.book-label > a").last().text());
            //Synopsis
            book.setSynopsis(doc.select("div.book-dec > p").text());
            //Status
            book.setStatus(doc.select("div.book-label > a").first().text());

            //Chapter list: it lives on a separate page that is linked from the details page
            String chaptersUrl = doc.select("a.all-catalog").attr("href");
            ArrayList<Map<String, String>> chapters = new ArrayList<>();
            //Fetch and parse the chapter list page
            for (Element result : Jsoup.parse(BookUtil.gather(chaptersUrl, bookUrl)).select("ul.chapter-list li")) {
                HashMap<String, String> map = new HashMap<>();
                map.put("chapterName", result.select("a").text());
                map.put("url", result.select("a").attr("href"));
                chapters.add(map);
            }
            book.setChapters(chapters);
            //Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * Collects the current chapter's name and full content plus the links to the previous and next chapter (Zongheng rules)
     * @param src source settings
     * @param chapterUrl URL of the current chapter
     * @param refererUrl URL of the referring page
     * @return Book object
     */
    public static Book book_read_zongheng(Map<String, String> src,String chapterUrl,String refererUrl) {
        Book book = new Book();

        //Current chapter URL; relative links are prefixed with the site's base URL
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));

        //Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Current chapter name
            book.setNowChapter(doc.select("div.title_txtbox").text());

            //Remove image ads
            doc.select("div.content img").remove();
            //Current chapter content
            book.setNowChapterValue(doc.select("div.content").outerHtml());

            //Previous / next chapter ("上一章" is the previous chapter link, "下一章" the next one)
            book.setPrevChapter(doc.select("div.chap_btnbox a:matches((?i)上一章)").text());
            book.setPrevChapterUrl(doc.select("div.chap_btnbox a:matches((?i)上一章)").attr("href"));
            book.setNextChapter(doc.select("div.chap_btnbox a:matches((?i)下一章)").text());
            book.setNextChapterUrl(doc.select("div.chap_btnbox a:matches((?i)下一章)").attr("href"));

            //Source
            book.setSource(src);

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
        return book;
    }
}

 

BookHandler_qidian
/**
 *  Scraping rules for Qidian (起點中文網)
 */
public class BookHandler_qidian {

    /**
     * Source settings
     */
    public static HashMap<String, String> qidian = new HashMap<>();

    static {
        //Qidian
        qidian.put("name", "起點中文網");
        qidian.put("key", "qidian");
        qidian.put("baseUrl", "http://www.qidian.com");
        qidian.put("baseSearchUrl", "https://www.qidian.com/search");
        qidian.put("UrlEncode", "UTF-8");
        qidian.put("searchUrl", "https://www.qidian.com/search?kw=#1&page=#2");
    }

    /**
     * Collects the search result list (Qidian rules)
     *
     * @param books   result list to append to
     * @param src     source settings
     * @param keyWord search keyword
     */
    public static void book_search_qidian(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        //Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Results on the current page
            Elements resultList = doc.select("li.res-book-item");
            for (Element result : resultList) {
                Book book = new Book();
                /*
                       Set a breakpoint here and you will see that Qidian's links look like
                       //book.qidian.com/info/1012786368

                       They are protocol-relative (two leading slashes). They are stored as is;
                       book_details_qidian prepends "https:" before requesting them.
                 */
                //Book URL
                book.setBookUrl(result.select("div.book-img-box a").attr("href"));
                //Cover image
                book.setImg(result.select("div.book-img-box img").attr("src"));
                //Title
                book.setBookName(result.select("div.book-mid-info > h4").text());
                //Author
                book.setAuthor(result.select("div.book-mid-info > p.author > a").first().text());
                //Category
                book.setType(result.select("div.book-mid-info > p.author > a").last().text());
                //Synopsis
                book.setSynopsis(result.select("div.book-mid-info > p.intro").text());
                //Status
                book.setStatus(result.select("div.book-mid-info > p.author > span").first().text());
                //Last update time
                book.setUpdateDate(result.select("div.book-mid-info > p.update > span").text());
                //Latest chapter
                book.setLatestChapter(result.select("div.book-mid-info > p.update > a").text());
                book.setLatestChapterUrl(result.select("div.book-mid-info > p.update > a").attr("href"));
                //Source
                book.setSource(src);
                books.add(book);
            }

            //Current page
            String page = doc.select("div#page-container").attr("data-page");

            //Last page
            String pageMax = doc.select("div#page-container").attr("data-pageMax");

            //Current page < last page: there is another page to fetch
            if (Integer.valueOf(page) < Integer.valueOf(pageMax)) {
                src.put("baseUrl", src.get("searchUrl"));
                //Build the link of the next page ourselves
                src.put("searchUrl", src.get("searchUrl").replaceAll("page=" + Integer.valueOf(page), "page=" + (Integer.valueOf(page) + 1)));
                book_search_qidian(books, src, keyWord);
            }

        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
    }

    /**
     * Collects the book details (Qidian rules)
     * @param src source settings
     * @param bookUrl book URL
     * @return Book object
     */
    public static Book book_details_qidian(Map<String, String> src, String bookUrl) {
        Book book = new Book();

        //Search results carry protocol-relative links (//book.qidian.com/...), so add the scheme
        if (bookUrl.startsWith("//")) {
            bookUrl = "https:" + bookUrl;
        }

        //Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            //Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            //Book URL
            book.setBookUrl(bookUrl);
            //Cover image (also protocol-relative)
            String img = doc.select("div.book-img > a#bookImg > img").attr("src");
            if (img.startsWith("//")) {
                img = "https:" + img;
            }
            book.setImg(img);
            //Title
            book.setBookName(doc.select("div.book-info > h1 > em").text());
            //Author
            book.setAuthor(doc.select("div.book-info > h1 a.writer").text());
            //Last update time
            book.setUpdateDate(doc.select("li.update em.time").text());
            //Latest chapter
            book.setLatestChapter(doc.select("li.update a").text());
            book.setLatestChapterUrl(doc.select("li.update a").attr("href"));
            //Category
            book.setType(doc.select("p.tag > span").first().text());
            //Synopsis
            book.setSynopsis(doc.select("div.book-intro > p").text());
            //Status
            book.setStatus(doc.select("p.tag > a").first().text());

            //Chapter list

            //One client with one cookie store serves both requests, so the cookies Qidian
            //issues on the first request are sent along with the second
            BasicCookieStore cookieStore = new BasicCookieStore();
            String chaptersJson = "";
            try (CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build()) {
                //First request: only there to receive the cookies
                HttpGet httpGet = new HttpGet("https://book.qidian.com/");
                httpGet.addHeader("Content-type", "application/json");
                //Pose as a regular browser
                httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
                httpGet.addHeader("Connection", "keep-alive");
                try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                    //The body is not needed; consume it so the connection can be reused
                    EntityUtils.consume(response.getEntity());
                }

                //Read the _csrfToken cookie
                String _csrfToken = "";
                for (Cookie cookie : cookieStore.getCookies()) {
                    if ("_csrfToken".equals(cookie.getName())) {
                        _csrfToken = cookie.getValue();
                    }
                }

                //Build the POST request
                String bookId = doc.select("div.book-img a#bookImg").attr("data-bid");
                HttpPost httpPost = new HttpPost(BookUtil.insertParams("https://book.qidian.com/ajax/book/category?_csrfToken=#1&bookId=#2",_csrfToken,bookId));
                httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
                httpPost.addHeader("Connection", "keep-alive");
                //Second request: the response body is a JSON string
                try (CloseableHttpResponse response1 = httpClient.execute(httpPost)) {
                    if (response1.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                        chaptersJson = EntityUtils.toString(response1.getEntity(), "UTF-8");
                    }
                }
            }

            //Parse the JSON: data.vs is the list of volumes, each with a list of chapters in cs
            ArrayList<Map<String, String>> chapters = new ArrayList<>();

            JSONObject json = JSONObject.fromObject(chaptersJson);
            Map<String,Object> objectMap = (Map<String, Object>) json;

            Map<String, Object> objectMap_data = (Map<String, Object>) objectMap.get("data");
            List<Map<String, Object>> objectMap_data_vs = (List<Map<String, Object>>) objectMap_data.get("vs");
            for(Map<String, Object> vs : objectMap_data_vs){
                List<Map<String, Object>> cs = (List<Map<String, Object>>) vs.get("cs");
                for(Map<String, Object> chapter : cs){
                    Map<String, String> map = new HashMap<>();
                    //cN is the chapter name, cU the chapter link
                    map.put("chapterName", (String) chapter.get("cN"));
                    map.put("url", "https://read.qidian.com/chapter/"+(String) chapter.get("cU"));
                    chapters.add(map);
                }
            }

            book.setChapters(chapters);

            //Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Error while scraping data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * 獲得當前章節名以及完整內容跟上、下一章的連接地址 起點中文網採集規則
     * @param src 源目標
     * @param chapterUrl 當前章節連接
     * @param refererUrl 來源連接
     * @return Book對象
     */
    public static Book book_read_qidian(Map<String, String> src,String chapterUrl,String refererUrl) {
        Book book = new Book();

        // Current chapter link
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));

        // Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);

            // Current chapter title
            book.setNowChapter(doc.select("h3.j_chapterName").text());

            // Remove image ads
            doc.select("div.read-content img").remove();
            // Current chapter content
            book.setNowChapterValue(doc.select("div.read-content").outerHtml());

            // Previous and next chapters.
            // Note: the "prev" fields are filled from the 下一章 (next chapter) link and the "next"
            // fields from the 上一章 (previous chapter) link. The book_read template renders
            // nextChapter first, so the page still lists the previous chapter before the next one.
            book.setPrevChapter(doc.select("div.chapter-control a:matches((?i)下一章)").text());
            String prev = doc.select("div.chapter-control a:matches((?i)下一章)").attr("href");
            prev = "https:"+prev;
            book.setPrevChapterUrl(prev);
            book.setNextChapter(doc.select("div.chapter-control a:matches((?i)上一章)").text());
            String next = doc.select("div.chapter-control a:matches((?i)上一章)").attr("href");
            next = "https:"+next;
            book.setNextChapterUrl(next);

            // Source
            book.setSource(src);

        } catch (Exception e) {
            System.err.println("Error while collecting data");
            e.printStackTrace();
        }
        return book;
    }
}
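
  book_read_qidian builds the neighbouring chapter URLs by prefixing "https:" to whatever href was found, which yields a bare "https:" when the link is missing. Below is a minimal sketch of a more defensive way to resolve such links with only the JDK. The ChapterLinks class and its resolve method are hypothetical names used for illustration, not part of the project:

```java
import java.net.URI;

class ChapterLinks {

    /**
     * Resolves an href taken from a page against that page's own URL.
     * Returns an empty string when the href is missing, so callers can
     * tell "no such chapter" apart from a real link.
     */
    public static String resolve(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return "";
        }
        // URI.resolve handles absolute, protocol-relative ("//host/path")
        // and site-relative ("/path") references alike.
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://read.qidian.com/chapter/abc/def";
        System.out.println(resolve(page, "//read.qidian.com/chapter/abc/xyz"));
        System.out.println(resolve(page, "/chapter/abc/xyz"));
        System.out.println(resolve(page, "https://example.com/other"));
        System.out.println(resolve(page, "").isEmpty());
    }
}
```

  Hrefs containing characters that are illegal in a URI, such as spaces, make URI.create throw IllegalArgumentException, so real scraped input may need cleaning first.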

 

 

  The four HTML pages:

 

  book_index

<!DOCTYPE html>
<!-- Suppresses IDEA's red squiggly underlines on Thymeleaf template expressions -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>MY BOOK</title>
    <!-- Bootstrap core CSS file -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <style>

        body{
            background-color: antiquewhite;
        }

        .main{
            margin: auto;
            width: 500px;
            margin-top: 150px;
        }

        #bookName{
            width: 300px;
        }

        #title{
            text-align: center;
        }
    </style>
</head>
<body>
    <div class="main">
        <h2 id="title">MY BOOK</h2>
        <form class="form-inline" method="get" th:action="@{/book/search}">
            Source
            <select class="form-control" id="source" name="sourceKey">
                <option value="">All</option>
                <option value="biquge">筆趣閣</option>
                <option value="zongheng">縱橫網</option>
                <option value="qidian">起點網</option>
            </select>
            <input type="text" id="bookName" name="bookName" class="form-control" placeholder="Please enter..."/>
            <button class="btn btn-info" type="submit">Search</button>
        </form>
    </div>
</body>
</html>

 

  book_list

<!DOCTYPE html>
<!-- Suppresses IDEA's red squiggly underlines on Thymeleaf template expressions -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK LIST</title>
    <!-- Bootstrap core CSS file -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <link rel="stylesheet" href="http://hanlei.online/Onlineaddress/layui/css/layui.css"/>
    <style>
        body {
            background-color: antiquewhite;
        }

        .main {
            margin: auto;
            width: 500px;
            margin-top: 50px;
        }

        .book {
            border-bottom: solid #428bca 1px;
        }

        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }

        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }

        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }
    </style>
</head>
<body>
<div class="main">
    <form class="form-inline" method="get" th:action="@{/book/search}">
        Source
        <select class="form-control" id="source" name="sourceKey">
            <option value="">All</option>
            <option value="biquge" th:selected="${sourceKey} == 'biquge'">筆趣閣</option>
            <option value="zongheng" th:selected="${sourceKey} == 'zongheng'">縱橫網</option>
            <option value="qidian" th:selected="${sourceKey} == 'qidian'">起點網</option>
        </select>
        <input type="text" id="bookName" name="bookName" class="form-control" placeholder="Please enter..."
               th:value="${keyWord}"/>
        <button class="btn btn-info" type="submit">Search</button>
    </form>
    <br/>
    <div id="books"></div>
    <div id="page"></div>
</div>
</body>
<!-- jQuery, loaded from a CDN -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script src="http://hanlei.online/Onlineaddress/layui/layui.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
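    var books = [[${books}]];// data passed in from the backend
    var nums = 10; // number of items per page
    var pages = books.length; // total number of items

    /**
     * Given the current page, uses nums to work out which slice of books to display
     */
    var thisDate = function (curr) {
        var str = "",// HTML to display for the current page
            first = (curr * nums - nums),// index of the first item displayed
            last = curr * nums - 1;// index of the last item displayed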
        last = last >= books.length ? (books.length - 1) : last;
        for (var i = first; i <= last; i++) {
            var book = books[i];
            str += "<div class='book'>" +
                "<img class='click-book-detail' data-bookurl='" + book.bookUrl + "' data-sourcekey='" + book.source.key + "' data-searchurl='" + book.source.searchUrl + "' src='" + book.img + "'></img>" +
                "<p class='click-book-detail' data-bookurl='" + book.bookUrl + "' data-sourcekey='" + book.source.key + "' data-searchurl='" + book.source.searchUrl + "'>Title: " + book.bookName + "</p>" +
                "<p>Author: " + book.author + "</p>" +
                "<p>Synopsis: " + book.synopsis + "</p>" +
                "<p class='click-book-read' data-chapterurl='" + book.latestChapterUrl + "' data-sourcekey='" + book.source.key + "' data-refererurl='" + book.source.refererurl + "'>Latest chapter: " + book.latestChapter + "</p>" +
                "<p>Updated: " + book.updateDate + "</p>" +
                "<p>Size: " + book.magnitude + "</p>" +
                "<p>Status: " + book.status + "</p>" +
                "<p>Type: " + book.type + "</p>" +
                "<p>Source: " + book.source.name + "</p>" +
                "</div><br/>";
        }
        return str;
    };

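    // Get a laypage instance
    layui.use('laypage', function () {
        var laypage = layui.laypage;

        // Call laypage for logical (client-side) paging
        laypage.render({
            elem: 'page',
            count: pages,
            limit: nums,
            jump: function (obj) {
                // obj holds all parameters of the current paging state, for example:
                // console.log(obj.curr); // the current page, e.g. for requesting that page from the server
                // console.log(obj.limit); // the number of items shown per page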
                document.getElementById('books').innerHTML = thisDate(obj.curr);
            },
            prev: '<',
            next: '>',
            theme: '#f9c357',
        })
    });

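    $("body").on("click", ".click-book-detail", function (event) {
        var bookUrl = $(this).data("bookurl");
        var searchUrl = $(this).data("searchurl");
        var sourceKey = $(this).data("sourcekey");
        // Encode each value so that characters such as & and # inside the URLs do not break the query string
        window.location.href = ctx + "/book/details?sourceKey=" + encodeURIComponent(sourceKey) + "&searchUrl=" + encodeURIComponent(searchUrl) + "&bookUrl=" + encodeURIComponent(bookUrl);
    });
    $("body").on("click", ".click-book-read", function (event) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        // Encode each value so that characters such as & and # inside the URLs do not break the query string
        window.location.href = ctx + "/book/read?sourceKey=" + encodeURIComponent(sourceKey) + "&refererUrl=" + encodeURIComponent(refererUrl) + "&chapterUrl=" + encodeURIComponent(chapterUrl);
    });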
</script>
</html>
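
  The thisDate function in book_list slices the full books array on the client, which is what "logical paging" means here. The same index arithmetic is sketched in Java below, for illustration only. LogicalPaging and page are hypothetical names; the project itself does this step in JavaScript:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LogicalPaging {

    /**
     * Returns the items shown on page curr (1-based) when each page holds nums items.
     * Pages beyond the end of the list come back empty.
     */
    public static <T> List<T> page(List<T> all, int curr, int nums) {
        if (curr < 1 || nums < 1) {
            return Collections.emptyList();
        }
        int first = (curr - 1) * nums;                // same value as curr * nums - nums
        if (first >= all.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(first + nums, all.size()); // exclusive upper bound
        return new ArrayList<>(all.subList(first, end));
    }

    public static void main(String[] args) {
        List<Integer> books = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            books.add(i);
        }
        System.out.println(page(books, 1, 10).size());
        System.out.println(page(books, 3, 10));
        System.out.println(page(books, 4, 10).isEmpty());
    }
}
```

  Because every result is already in memory, changing pages costs nothing on the server, at the price of sending the whole result list in the first response.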

 

book_details

<!DOCTYPE html>
<!-- Suppresses IDEA's red squiggly underlines on Thymeleaf template expressions -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK DETAILS</title>
    <!-- Bootstrap core CSS file -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <link rel="stylesheet" href="http://hanlei.online/Onlineaddress/layui/css/layui.css"/>
    <style>
        body {
            background-color: antiquewhite;
        }

        .main {
            margin: auto;
            width: 500px;
            margin-top: 150px;
        }

        .book {
            border-bottom: solid #428bca 1px;
        }

        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }

        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }

        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }

        a {
            color: #428bca;
        }

    </style>
</head>
<body>
<div class="main">
    <div class='book'>
        <div id="bookImg"></div>
        <p>Title: <span th:text="${book.bookName}"></span></p>
        <p>Author: <span th:text="${book.author}"></span></p>
        <p>Synopsis: <span th:text="${book.synopsis}"></span></p>
        <p>Latest chapter: <a th:href="${book.latestChapterUrl}" th:text="${book.latestChapter}"></a></p>
        <p>Updated: <span th:text="${book.updateDate}"></span></p>
        <p>Size: <span th:text="${book.magnitude}"></span></p>
        <p>Status: <span th:text="${book.status}"></span></p>
        <p>Type: <span th:text="${book.type}"></span></p>
        <p>Source: <span th:text="${book.source.name}"></span></p>
    </div>
    <br/>
    <div class="chapters" th:each="chapter,iterStat:${book.chapters}">
        <p class="click-book-read" th:attr="data-chapterurl=${chapter.url},data-sourcekey=${book.source.key},data-refererurl=${book.bookUrl}" th:text="${chapter.chapterName}"></p>
    </div>
</div>
</body>
<!-- jQuery, loaded from a CDN -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
    var book = [[${book}]];// data passed in from the backend

    /**
     * Works around the image hotlink protection
     */
    function showImg(parentObj, url) {
        // Generate a random id
        var frameid = 'frameimg' + Math.random();
        // Stored on the (parent page's) window. In the original technique, a script tag inside the iframe also bound window.onload to set the iframe's height and width: <script>window.onload = function() {  parent.document.getElementById(\'' + frameid + '\').height = document.getElementById(\'img\').height+\'px\'; }<' + '/script>
        window.img = '<img src=\'' + url + '?' + Math.random() + '\'/>';
        // The iframe evaluates parent.img
        $(parentObj).append('<iframe id="' + frameid + '" src="javascript:parent.img;" frameBorder="0" scrolling="no"></iframe>');
    }

    showImg($("#bookImg"), book.img);

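    $("body").on("click", ".click-book-read", function (event) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        // Encode each value so that characters such as & and # inside the URLs do not break the query string
        window.location.href = ctx + "/book/read?sourceKey=" + encodeURIComponent(sourceKey) + "&refererUrl=" + encodeURIComponent(refererUrl) + "&chapterUrl=" + encodeURIComponent(chapterUrl);
    });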

</script>
</html>
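
  The showImg approach in book_details runs entirely in the browser. A different, server-side option, which the original project does not use, would be to request the cover from the backend with a Referer that the image host accepts, much as gather sets a Referer for page requests. Below is a minimal JDK-only sketch. CoverFetch, prepare and the example.com URLs are hypothetical, and whether a given host accepts the header is an assumption:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class CoverFetch {

    /**
     * Prepares (but does not send) a GET request for an image, with the
     * Referer header set to a page on the image's own site.
     */
    public static HttpURLConnection prepare(String imageUrl, String refererUrl) {
        try {
            // openConnection() only creates the connection object; nothing is sent yet
            HttpURLConnection conn = (HttpURLConnection) URI.create(imageUrl).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Referer", refererUrl);
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare(
                "https://www.example.com/files/cover.jpg",
                "https://www.example.com/book/1/");
        // Shows the header that would be sent once the caller reads the response
        System.out.println(conn.getRequestProperty("Referer"));
    }
}
```

  A controller could then stream conn.getInputStream() back to the page, so the browser only ever talks to our own server. The trade-off is that every cover image passes through the backend.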

 

  book_read

<!DOCTYPE html>
<!-- Suppresses IDEA's red squiggly underlines on Thymeleaf template expressions -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK READ</title>
    <style>
        body {
            background-color: antiquewhite;
        }

        .main {
            padding: 10px 20px;
        }

        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }

        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }

        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }

        .float-left{
            float: left;
            margin-left: 70px;
        }
    </style>
</head>
<body>
<div class="main">
    <!-- Chapter title -->
    <h3 th:text="${book.nowChapter}"></h3>
    <!-- Chapter content -->
    <p th:utext="${book.nowChapterValue}"></p>
    <!-- Previous and next chapters -->
    <p class="click-book-read float-left"
       th:attr="data-chapterurl=${book.nextChapterUrl},data-sourcekey=${book.source.key},data-refererurl=${book.nowChapterUrl}"
       th:text="${book.nextChapter}"></p>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <p class="click-book-read float-left"
       th:attr="data-chapterurl=${book.prevChapterUrl},data-sourcekey=${book.source.key},data-refererurl=${book.nowChapterUrl}"
       th:text="${book.prevChapter}"></p>
</div>
</body>
<!-- jQuery, loaded from a CDN -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
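    $("body").on("click", ".click-book-read", function (event) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        // Encode each value so that characters such as & and # inside the URLs do not break the query string
        window.location.href = ctx + "/book/read?sourceKey=" + encodeURIComponent(sourceKey) + "&refererUrl=" + encodeURIComponent(refererUrl) + "&chapterUrl=" + encodeURIComponent(chapterUrl);
    });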
</script>
</html>

 

 

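  Addendum

  2019-07-17 addendum: the baseUrl of all three source sites used http, but the sites have since moved to https, for example 筆趣閣 (Biquge):

  As a result, scraping fails with: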

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
    at cn.huanzi.qch.spider.novelreading.util.BookUtil.gather(BookUtil.java:81)
    at cn.huanzi.qch.spider.novelreading.pojo.BookHandler_biquge.book_search_biquge(BookHandler_biquge.java:43)
    at cn.huanzi.qch.spider.novelreading.controller.BookContrller.search(BookContrller.java:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:215)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:142)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:998)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:890)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:875)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:770)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)

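  Solution: following http://www.javashuo.com/article/p-grzpkcic-ng.html, bypass certificate verification. This makes the client accept any certificate, so it only suits a study project like this one.

  Add a new method to BookUtil.java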

    /**
     * Bypasses SSL certificate verification
     */
    private static SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
        // Request "TLS" instead of the obsolete "SSLv3" so that the JDK can negotiate a current protocol version
        SSLContext sc = SSLContext.getInstance("TLS");

        // Implement X509TrustManager with empty check methods so that every certificate is accepted
        X509TrustManager trustManager = new X509TrustManager() {
            @Override
            public void checkClientTrusted(
                    java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
                    String paramString) throws CertificateException {
            }

            @Override
            public void checkServerTrusted(
                    java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
                    String paramString) throws CertificateException {
            }

            @Override
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                // The interface contract asks for a non-null (possibly empty) array
                return new java.security.cert.X509Certificate[0];
            }
        };

        sc.init(null, new TrustManager[]{trustManager}, null);
        return sc;
    }

  Then change the gather method to obtain httpClient like this

    /**
     * Fetches the given url and returns the full response entity as a string
     *
     * @param url url
     * @param refererUrl value sent in the Referer header
     * @return the response entity as a string, or null on failure
     */
    public static String gather(String url, String refererUrl) {
        String result = null;
        try {
            // Handle https requests with the context that bypasses verification
            SSLContext sslcontext = createIgnoreVerifySSL();

            // Register the connection socket factories for the http and https schemes
            Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register("http", PlainConnectionSocketFactory.INSTANCE)
                    .register("https", new SSLConnectionSocketFactory(sslcontext))
                    .build();
            PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(socketFactoryRegistry);

            // Create the custom httpclient object
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connManager).build();


            // Create the httpclient object (if this is made a global variable, the session and cookies are carried along across requests)
//            CloseableHttpClient httpClient = HttpClients.createDefault();

            // Create the GET request object
            HttpGet httpGet = new HttpGet(url);
            httpGet.addHeader("Content-type", "application/json");
            // Disguise the request as coming from a browser
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpGet.addHeader("Referer", refererUrl);
            httpGet.addHeader("Connection", "keep-alive");

            // Execute the request and obtain the response object
            CloseableHttpResponse response = httpClient.execute(httpGet);
            // Read the response entity
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                // "GBK" is only the fallback: EntityUtils uses the charset from the response's Content-Type header when one is declared
                result = EntityUtils.toString(response.getEntity(), "GBK");
            }

            // Release the connection
            response.close();
        }
        // A timeout exception could also be caught here in order to reconnect and retry
        catch (Exception e) {
            result = null;
            System.err.println("Error while fetching the page");
            e.printStackTrace();
        }
        return result;
    }

  With this change, scraping works again

 

  We used to get the project path with

var ctx = /*[[@{/}]]*/'';

  This suddenly stopped working: the redirect path began directly with /. It is now obtained like this

    //Project path
    var ctx = [[${#request.getContextPath()}]];

 

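  2019-08-01 addendum: if you see this error (connection reset), don't panic. The site may have changed its domain. For example, our program requests http://www.biquge.com.tw, which is no longer reachable; 筆趣閣 (Biquge) has moved to https://www.biqudu.net/. Updating the URL in the code solves the problem. Remember to check that each source URL is still reachable. The target site may also change its page layout so that our earlier rules no longer match any data, in which case the scraping rules have to be rewritten.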

 

  2019-08-02 addendum: found a bug. Our BookUtil.insertParams method works by replacing the # placeholders

    /**
     * Injects parameters automatically
     * Example:
     *
     * @param src    http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=
     * @param params "鬥破蒼穹","1"
     * @return http://search.zongheng.com/s?keyword=鬥破蒼穹&pageNo=1&sort=
     */
    public static String insertParams(String src, String... params) {
        int i = 1;
        for (String param : params) {
            // replace() substitutes literally; replaceAll() would interpret $ and \ in param as regex replacement syntax
            src = src.replace("#" + i, param);
            i++;
        }
        return src;
    }

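  However, when we search and call this parameter injection, the value of src comes from a static Map. At initialization it contains two # placeholders. After the first search those placeholders have been replaced, so later searches find nothing left to inject into, and every later search returns the results of the first one...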

 

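  Fix: when fetching the source, copy it instead of assigning it with =. All three methods need this change

  Before:

        // Get the source details
        Map<String, String> src = source.get(sourceKey);

  After:

        // Get the source details, as a copy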
        Map<String, String> src = new HashMap<>();
        src.putAll(source.get(sourceKey));
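
  The effect of the two variants can be reproduced in isolation. Below is a minimal sketch that assumes the search code writes the filled-in URL back into the map it received; that write-back is not shown in this article. SourceCopyDemo, template and search are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

class SourceCopyDemo {

    /** Builds a source entry whose search URL still contains the #1 and #2 placeholders. */
    public static Map<String, String> template() {
        Map<String, String> src = new HashMap<>();
        src.put("searchUrl", "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=");
        return src;
    }

    /** Placeholder substitution in the style of BookUtil.insertParams. */
    public static String insertParams(String src, String... params) {
        int i = 1;
        for (String param : params) {
            src = src.replace("#" + i, param);
            i++;
        }
        return src;
    }

    /** Assumed caller behaviour: fill in the URL and store it back into the map. */
    public static String search(Map<String, String> src, String keyword) {
        src.put("searchUrl", insertParams(src.get("searchUrl"), keyword, "1"));
        return src.get("searchUrl");
    }

    public static void main(String[] args) {
        // Shared map: the second search finds no placeholders left to replace
        Map<String, String> shared = template();
        System.out.println(search(shared, "first"));
        System.out.println(search(shared, "second"));

        // One copy per search: the template keeps its placeholders
        Map<String, String> original = template();
        System.out.println(search(new HashMap<>(original), "first"));
        System.out.println(search(new HashMap<>(original), "second"));
    }
}
```

  With the shared map, both printed URLs contain keyword=first. With a copy per search, the second URL contains keyword=second.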

 

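  Multi-platform development

  The company recently planned a mobile client, so I studied DCloud's uni-app (the development tool is HBuilderX) and used our novel scraper for practice, building an H5 mobile page

  DCloud official site:https://www.dcloud.io/

  uni-app official site:https://uniapp.dcloud.io/

  uni-app is a framework for building all front-end applications with Vue.js. Developers write one codebase that can be compiled to iOS, Android, H5, and various mini-program platforms.

  

  Screenshots: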

   

 

  Open source

  The code is open source and hosted on my GitHub and Gitee (碼雲):

  GitHub:https://github.com/huanzi-qch/spider

  Gitee:https://gitee.com/huanzi-qch/spider
