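Anyone who used older versions of the UC browser to read novels will remember that copyright enforcement was fairly lax back then: you could search UC for the same novel from several different sources and read it right there. So how did it do that? Below we build our own online novel collector and reader. (Note: for technical study and research only.)
The most annoying thing about reading novels online is the ads. Some are placed by site owners to make money, others are maliciously injected. My previous post, "HttpClients+Jsoup抓取筆趣閣小說,並保存到本地TXT文件" (scraping Biquge novels with HttpClients + Jsoup and saving them to a local TXT file), collected a novel and saved it to a local TXT file that can be imported into a phone's reading app. Here we build a version that lets us read online instead.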
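Home page:
The page is clean; there are currently three sources.
Search results page:
Three different sources. Paging uses layui's laypage with logical (client-side) pagination. (Biquge's search results page has no book cover images.)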
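Logical pagination here means the whole result list is sent to the browser and each page is cut out of it by index, which is what the `thisDate` function in the book_list page does. As a rough illustration, the same index arithmetic could look like this in Java; the class and method names (`LogicalPager`, `page`, `pageCount`) are hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: slices an in-memory list the way the front end does.
final class LogicalPager {

    /** Returns the items on the 1-based page `curr` when each page holds `pageSize` items. */
    static <T> List<T> page(List<T> all, int curr, int pageSize) {
        if (curr < 1 || pageSize < 1) {
            throw new IllegalArgumentException("curr and pageSize must be >= 1");
        }
        int first = (curr - 1) * pageSize;                  // index of the first item on the page
        if (first >= all.size()) {
            return new ArrayList<>();                       // past the end: empty page
        }
        int last = Math.min(first + pageSize, all.size());  // exclusive end index
        return new ArrayList<>(all.subList(first, last));
    }

    /** Number of pages needed for `total` items. */
    static int pageCount(int total, int pageSize) {
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        List<Integer> all = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            all.add(i);
        }
        System.out.println(page(all, 3, 10));   // the last, partial page
        System.out.println(pageCount(25, 10));
    }
}
```

The trade-off is that every result has to be collected before the first page can be shown, which is why a large result set makes the search slow.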
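Paging in action:
Zongheng's search also matches tokenized keywords against fields such as the synopsis, so the result set gets too large and collecting it is too slow; hence the cap books.size() < 888.
Book details page:
Reading page:
Previous / next chapter: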
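The project is a Spring Boot application and the principle is very simple: use HttpClient to build a request (with suitable headers) for the source's URL, parse the response with jsoup,
use jsoup selectors to pick out the data we want, store it in an entity, put the entity into a ModelAndView, and let the Thymeleaf front-end pages read and iterate over the data.
Some books can only be read by members. In that case we would need to simulate a login to keep collecting; since this is just a simple collector, it does not simulate logins.
Problems encountered during collection:
1. When collecting the book list from Qidian, the data we want is not in the page source.
2. When viewing book details on Biquge, the cover image is hotlink-protected.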
<div id="bookImg"></div>
/**
 * Work around hotlink protection
 */
function showImg(parentObj, url) {
    // Generate a random id
    var frameid = 'frameimg' + Math.random();
    // Store the markup on the (parent page's) window. A script tag inside the iframe can bind window.onload to set the iframe's height and width:
    // <script>window.onload = function() { parent.document.getElementById(\'' + frameid + '\').height = document.getElementById(\'img\').height+\'px\'; }<' + '/script>
    window.img = '<img src=\'' + url + '?' + Math.random() + '\'/>';
    // The iframe loads parent.img
    $(parentObj).append('<iframe id="' + frameid + '" src="javascript:parent.img;" frameBorder="0" scrolling="no"></iframe>');
}

showImg($("#bookImg"), book.img);
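3. When collecting book details, Qidian's table of contents is not in the HTML either.
Once we understand the _csrfToken parameter, we can construct a GET request: https://book.qidian.com/ajax/book/category?_csrfToken=LosgUIe29G7LV04gdutbSqzKRb9XxoPyqtWBQ3hU&bookId=1209977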
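The two steps involved are reading the `_csrfToken` cookie that an earlier response sets and then filling it into the catalog URL together with the book id. The following is a minimal standard-library sketch of just those two steps, assuming the raw `Set-Cookie` header values are already available as strings. The class and method names are hypothetical, the sample header strings are made up for illustration, and `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of the project.
final class QidianCatalogUrl {

    /** Finds the value of the cookie `name` in a list of raw Set-Cookie header values. */
    static Optional<String> cookieValue(List<String> setCookieHeaders, String name) {
        String prefix = name + "=";
        for (String header : setCookieHeaders) {
            // The name=value pair is everything before the first ';'
            String pair = header.split(";", 2)[0].trim();
            if (pair.startsWith(prefix)) {
                return Optional.of(pair.substring(prefix.length()));
            }
        }
        return Optional.empty();
    }

    /** Builds the catalog URL quoted in the text from a token and a book id. */
    static String categoryUrl(String csrfToken, String bookId) {
        return "https://book.qidian.com/ajax/book/category?_csrfToken="
                + URLEncoder.encode(csrfToken, StandardCharsets.UTF_8)
                + "&bookId=" + URLEncoder.encode(bookId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Illustrative header values only; real responses will differ.
        List<String> headers = Arrays.asList(
                "someOtherCookie=1; path=/",
                "_csrfToken=LosgUIe29G7LV04gdutbSqzKRb9XxoPyqtWBQ3hU; domain=.qidian.com; path=/");
        String token = cookieValue(headers, "_csrfToken").orElse("");
        System.out.println(categoryUrl(token, "1209977"));
    }
}
```

The token is tied to the cookie the site issued, so the request for the catalog presumably has to be sent by the same client (with the same cookie store) that received the cookie.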
As before, most of the logic is explained in the code comments, which should be easy to follow:
Maven dependencies:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.9</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>net.sf.json-lib</groupId>
    <artifactId>json-lib</artifactId>
    <version>2.4</version>
    <classifier>jdk15</classifier>
</dependency>
Book entity class:
import lombok.Data;

import java.util.List;
import java.util.Map;

/**
 * Book object
 */
@Data
public class Book {
    /** Link */
    private String bookUrl;
    /** Title */
    private String bookName;
    /** Author */
    private String author;
    /** Synopsis */
    private String synopsis;
    /** Cover image */
    private String img;
    /** Table of contents: each entry has chapterName and url */
    private List<Map<String,String>> chapters;
    /** Status */
    private String status;
    /** Type */
    private String type;
    /** Update time */
    private String updateDate;
    /** First chapter */
    private String firstChapter;
    /** First chapter link */
    private String firstChapterUrl;
    /** Previous chapter */
    private String prevChapter;
    /** Previous chapter link */
    private String prevChapterUrl;
    /** Current chapter name */
    private String nowChapter;
    /** Current chapter content */
    private String nowChapterValue;
    /** Current chapter link */
    private String nowChapterUrl;
    /** Next chapter */
    private String nextChapter;
    /** Next chapter link */
    private String nextChapterUrl;
    /** Latest chapter */
    private String latestChapter;
    /** Latest chapter link */
    private String latestChapterUrl;
    /** Size */
    private String magnitude;
    /** Source */
    private Map<String,String> source;
    private String sourceKey;
}
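Small utility class:
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * Small utility class
 */
public class BookUtil {

    /**
     * Fills positional parameters into a template.
     * For example:
     *
     * @param src    http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=
     * @param params "鬥破蒼穹","1"
     * @return http://search.zongheng.com/s?keyword=鬥破蒼穹&pageNo=1&sort=
     */
    public static String insertParams(String src, String... params) {
        int i = 1;
        for (String param : params) {
            // Literal replacement, so characters such as '$' in a parameter are not treated as regex syntax
            src = src.replace("#" + i, param);
            i++;
        }
        return src;
    }

    /**
     * Fetches the given url and returns the complete response entity as a string.
     *
     * @param url        url to fetch
     * @param refererUrl value sent as the Referer header
     * @return response entity as a string, or null on failure
     */
    public static String gather(String url, String refererUrl) {
        String result = null;
        // Create the httpclient object. A new client is created per call here;
        // make it a shared (global) instance if session and cookies should carry over between requests.
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Create the GET request
            HttpGet httpGet = new HttpGet(url);
            httpGet.addHeader("Content-type", "application/json");
            // Dress the request up as a browser
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpGet.addHeader("Referer", refererUrl);
            httpGet.addHeader("Connection", "keep-alive");
            // Execute the request; the response (and its connection) is released automatically
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Read the result entity
                if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    result = EntityUtils.toString(response.getEntity(), "GBK");
                }
            }
        }
        // A timeout exception could also be caught here to reconnect and retry
        catch (Exception e) {
            result = null;
            System.err.println("Collection failed");
            e.printStackTrace();
        }
        return result;
    }
}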
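`gather` decodes every response as GBK, while two of the three sources declare UTF-8 as their encoding in their source info. One possible refinement, shown here only as a sketch, is to read the charset from the response's `Content-Type` header and fall back to a default when none is given. `ResponseCharset` and `fromContentType` are hypothetical names; wiring this into `gather` (for example by passing the header value of the response entity) is left as an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project.
final class ResponseCharset {

    // Matches charset=xxx or charset="xxx" inside a Content-Type value
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or `fallback` if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name
                return fallback;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

Some pages declare their encoding only in an HTML meta tag rather than in the header, so a per-source default such as the existing `UrlEncode` entry would still be needed as the fallback.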
Controller layer:
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.ModelAndView;

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * Book controller layer
 */
@RestController
@RequestMapping("book")
public class BookContrller {

    /** Source registry */
    private static Map<String, Map<String, String>> source = new HashMap<>();

    static {
        // Biquge
        source.put("biquge", BookHandler_biquge.biquge);
        // Zongheng
        source.put("zongheng", BookHandler_zongheng.zongheng);
        // Qidian
        source.put("qidian", BookHandler_qidian.qidian);
    }

    /**
     * Home page
     */
    @GetMapping("/index")
    public ModelAndView index() {
        return new ModelAndView("book_index.html");
    }

    /**
     * Search by book title
     */
    @GetMapping("/search")
    public ModelAndView search(Book book) {
        // Result set
        ArrayList<Book> books = new ArrayList<>();
        // Keyword
        String keyWord = book.getBookName();
        // Source
        String sourceKey = book.getSourceKey();
        // Source details
        Map<String, String> src = source.get(sourceKey);
        // Encode the keyword
        try {
            keyWord = URLEncoder.encode(keyWord, src.get("UrlEncode"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        // searchUrl
        src.put("searchUrl", BookUtil.insertParams(src.get("searchUrl"), keyWord, "1"));
        // Dispatch to the matching handler
        switch (sourceKey) {
            case "biquge":
                BookHandler_biquge.book_search_biquge(books, src, keyWord);
                break;
            case "zongheng":
                BookHandler_zongheng.book_search_zongheng(books, src, keyWord);
                break;
            case "qidian":
                BookHandler_qidian.book_search_qidian(books, src, keyWord);
                break;
            default:
                // By default, search all sources
                BookHandler_biquge.book_search_biquge(books, src, keyWord);
                BookHandler_zongheng.book_search_zongheng(books, src, keyWord);
                BookHandler_qidian.book_search_qidian(books, src, keyWord);
                break;
        }
        System.out.println(books.size());
        ModelAndView modelAndView = new ModelAndView("book_list.html", "books", books);
        try {
            modelAndView.addObject("keyWord", URLDecoder.decode(keyWord, src.get("UrlEncode")));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        modelAndView.addObject("sourceKey", sourceKey);
        return modelAndView;
    }

    /**
     * Book details
     */
    @GetMapping("/details")
    public ModelAndView details(String sourceKey, String bookUrl, String searchUrl) {
        Map<String, String> src = source.get(sourceKey);
        src.put("searchUrl", searchUrl);
        Book book = new Book();
        // Dispatch to the matching handler
        switch (sourceKey) {
            case "biquge":
                book = BookHandler_biquge.book_details_biquge(src, bookUrl);
                break;
            case "zongheng":
                book = BookHandler_zongheng.book_details_zongheng(src, bookUrl);
                break;
            case "qidian":
                book = BookHandler_qidian.book_details_qidian(src, bookUrl);
                break;
            default:
                break;
        }
        return new ModelAndView("book_details.html", "book", book);
    }

    /**
     * Read a chapter
     */
    @GetMapping("/read")
    public ModelAndView read(String sourceKey, String chapterUrl, String refererUrl) {
        Map<String, String> src = source.get(sourceKey);
        Book book = new Book();
        // Dispatch to the matching handler
        switch (sourceKey) {
            case "biquge":
                book = BookHandler_biquge.book_read_biquge(src, chapterUrl, refererUrl);
                break;
            case "zongheng":
                book = BookHandler_zongheng.book_read_zongheng(src, chapterUrl, refererUrl);
                break;
            case "qidian":
                book = BookHandler_qidian.book_read_qidian(src, chapterUrl, refererUrl);
                break;
            default:
                break;
        }
        return new ModelAndView("book_read.html", "book", book);
    }
}
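One thing to watch in this controller: `search` writes the filled-in `searchUrl` back into the map it took from the static `source` registry, and the handlers overwrite `searchUrl` and `baseUrl` again while paging, so the `#1` / `#2` placeholders are no longer there for the next request. A possible way around this, sketched below under the assumption that each request may work on its own copy, is to keep the registered maps as read-only templates. `SourceRegistry`, `register` and `copyOf` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the project.
final class SourceRegistry {

    private final Map<String, Map<String, String>> sources = new HashMap<>();

    /** Stores an unmodifiable snapshot of the source configuration. */
    void register(String key, Map<String, String> config) {
        sources.put(key, Collections.unmodifiableMap(new HashMap<>(config)));
    }

    /** Returns a fresh mutable copy for one request, or null if the key is unknown. */
    Map<String, String> copyOf(String key) {
        Map<String, String> template = sources.get(key);
        return template == null ? null : new HashMap<>(template);
    }

    public static void main(String[] args) {
        SourceRegistry registry = new SourceRegistry();
        Map<String, String> zongheng = new HashMap<>();
        zongheng.put("searchUrl", "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=");
        registry.register("zongheng", zongheng);

        // A request mutates only its own copy
        Map<String, String> perRequest = registry.copyOf("zongheng");
        perRequest.put("searchUrl", "http://search.zongheng.com/s?keyword=abc&pageNo=1&sort=");

        // The template still contains the placeholders
        System.out.println(registry.copyOf("zongheng").get("searchUrl"));
    }
}
```

Returning null for an unknown key also makes the "all sources" case explicit: an empty `sourceKey` has no entry in the registry, so the caller has to decide which configuration each handler should receive.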
Handlers for the three sources; each source has its own collection rules:
BookHandler_biquge
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Assumption: the original does not show which StringUtils is used; Spring's is assumed here
import org.springframework.util.StringUtils;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * Biquge collection rules
 */
public class BookHandler_biquge {

    /** Source info */
    public static HashMap<String, String> biquge = new HashMap<>();

    static {
        // Biquge
        biquge.put("name", "筆趣閣");
        biquge.put("key", "biquge");
        biquge.put("baseUrl", "http://www.biquge.com.tw");
        biquge.put("baseSearchUrl", "http://www.biquge.com.tw/modules/article/soshu.php");
        biquge.put("UrlEncode", "GB2312");
        biquge.put("searchUrl", "http://www.biquge.com.tw/modules/article/soshu.php?searchkey=+#1&page=#2");
    }

    /**
     * Search list, Biquge collection rules
     *
     * @param books   result list
     * @param src     source
     * @param keyWord keyword
     */
    public static void book_search_biquge(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        // Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Results on the current page
            Elements resultList = doc.select("table.grid tr#nr");
            for (Element result : resultList) {
                Book book = new Book();
                // Book link
                book.setBookUrl(result.child(0).select("a").attr("href"));
                // Title
                book.setBookName(result.child(0).select("a").text());
                // Author
                book.setAuthor(result.child(2).text());
                // Update time
                book.setUpdateDate(result.child(4).text());
                // Latest chapter
                book.setLatestChapter(result.child(1).select("a").text());
                book.setLatestChapterUrl(result.child(1).select("a").attr("href"));
                // Status
                book.setStatus(result.child(5).text());
                // Size
                book.setMagnitude(result.child(3).text());
                // Source
                book.setSource(src);
                books.add(book);
            }
            // Next page
            Elements searchNext = doc.select("div.pages > a.ngroup");
            String href = searchNext.attr("href");
            if (!StringUtils.isEmpty(href)) {
                src.put("baseUrl", src.get("searchUrl"));
                src.put("searchUrl", href.contains("http") ? href : (src.get("baseSearchUrl") + href));
                book_search_biquge(books, src, keyWord);
            }
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
    }

    /**
     * Book details, Biquge collection rules
     *
     * @param src     source
     * @param bookUrl book link
     * @return Book object
     */
    public static Book book_details_biquge(Map<String, String> src, String bookUrl) {
        Book book = new Book();
        // Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Book link
            book.setBookUrl(doc.select("meta[property=og:url]").attr("content"));
            // Cover image
            book.setImg(doc.select("meta[property=og:image]").attr("content"));
            // Title
            book.setBookName(doc.select("div#info > h1").text());
            // Author
            book.setAuthor(doc.select("meta[property=og:novel:author]").attr("content"));
            // Update time
            book.setUpdateDate(doc.select("meta[property=og:novel:update_time]").attr("content"));
            // Latest chapter
            book.setLatestChapter(doc.select("meta[property=og:novel:latest_chapter_name]").attr("content"));
            book.setLatestChapterUrl(doc.select("meta[property=og:novel:latest_chapter_url]").attr("content"));
            // Type
            book.setType(doc.select("meta[property=og:novel:category]").attr("content"));
            // Synopsis
            book.setSynopsis(doc.select("meta[property=og:description]").attr("content"));
            // Status
            book.setStatus(doc.select("meta[property=og:novel:status]").attr("content"));
            // Table of contents
            ArrayList<Map<String, String>> chapters = new ArrayList<>();
            for (Element result : doc.select("div#list dd")) {
                HashMap<String, String> map = new HashMap<>();
                map.put("chapterName", result.select("a").text());
                map.put("url", result.select("a").attr("href"));
                chapters.add(map);
            }
            book.setChapters(chapters);
            // Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * Current chapter name and full content, plus the previous / next chapter links, Biquge collection rules
     *
     * @param src        source
     * @param chapterUrl current chapter link
     * @param refererUrl referer link
     * @return Book object
     */
    public static Book book_read_biquge(Map<String, String> src, String chapterUrl, String refererUrl) {
        Book book = new Book();
        // Current chapter link
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));
        // Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Current chapter name
            book.setNowChapter(doc.select("div.box_con > div.bookname > h1").text());
            // Remove image ads
            doc.select("div.box_con > div#content img").remove();
            // Current chapter content
            book.setNowChapterValue(doc.select("div.box_con > div#content").outerHtml());
            // Previous / next chapter.
            // Note: as written, the "prev" fields take the 下一章 (next chapter) link and the "next" fields
            // the 上一章 (previous chapter) link; book_read.html renders nextChapter first,
            // so the page still shows previous, then next.
            book.setPrevChapter(doc.select("div.bottem2 a:matches((?i)下一章)").text());
            book.setPrevChapterUrl(doc.select("div.bottem2 a:matches((?i)下一章)").attr("href"));
            book.setNextChapter(doc.select("div.bottem2 a:matches((?i)上一章)").text());
            book.setNextChapterUrl(doc.select("div.bottem2 a:matches((?i)上一章)").attr("href"));
            // Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }
}
BookHandler_zongheng
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Assumption: the original does not show which StringUtils is used; Spring's is assumed here
import org.springframework.util.StringUtils;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/**
 * Zongheng collection rules
 */
public class BookHandler_zongheng {

    /** Source info */
    public static HashMap<String, String> zongheng = new HashMap<>();

    static {
        // Zongheng
        zongheng.put("name", "縱橫中文網");
        zongheng.put("key", "zongheng");
        zongheng.put("baseUrl", "http://www.zongheng.com");
        zongheng.put("baseSearchUrl", "http://search.zongheng.com/s");
        zongheng.put("UrlEncode", "UTF-8");
        zongheng.put("searchUrl", "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=");
    }

    /**
     * Search list, Zongheng collection rules
     *
     * @param books   result list
     * @param src     source
     * @param keyWord keyword
     */
    public static void book_search_zongheng(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        // Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Results on the current page
            Elements resultList = doc.select("div.search-tab > div.search-result-list");
            for (Element result : resultList) {
                Book book = new Book();
                // Book link
                book.setBookUrl(result.select("div.imgbox a").attr("href"));
                // Cover image
                book.setImg(result.select("div.imgbox img").attr("src"));
                // Title
                book.setBookName(result.select("h2.tit").text());
                // Author
                book.setAuthor(result.select("div.bookinfo > a").first().text());
                // Type
                book.setType(result.select("div.bookinfo > a").last().text());
                // Synopsis
                book.setSynopsis(result.select("p").text());
                // Status
                book.setStatus(result.select("div.bookinfo > span").first().text());
                // Size
                book.setMagnitude(result.select("div.bookinfo > span").last().text());
                // Source
                book.setSource(src);
                books.add(book);
            }
            // Next page
            Elements searchNext = doc.select("div.search_d_pagesize > a.search_d_next");
            String href = searchNext.attr("href");
            // Stop once 888 books have been collected, otherwise it is too slow
            if (books.size() < 888 && !StringUtils.isEmpty(href)) {
                src.put("baseUrl", src.get("searchUrl"));
                src.put("searchUrl", href.contains("http") ? href : (src.get("baseSearchUrl") + href));
                book_search_zongheng(books, src, keyWord);
            }
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
    }

    /**
     * Book details, Zongheng collection rules
     *
     * @param src     source
     * @param bookUrl book link
     * @return Book object
     */
    public static Book book_details_zongheng(Map<String, String> src, String bookUrl) {
        Book book = new Book();
        // Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Book link
            book.setBookUrl(bookUrl);
            // Cover image
            book.setImg(doc.select("div.book-img > img").attr("src"));
            // Title
            book.setBookName(doc.select("div.book-info > div.book-name").text());
            // Author
            book.setAuthor(doc.select("div.book-author div.au-name").text());
            // Update time
            book.setUpdateDate(doc.select("div.book-new-chapter div.time").text());
            // Latest chapter
            book.setLatestChapter(doc.select("div.book-new-chapter div.tit a").text());
            book.setLatestChapterUrl(doc.select("div.book-new-chapter div.tit a").attr("href"));
            // Type
            book.setType(doc.select("div.book-label > a").last().text());
            // Synopsis
            book.setSynopsis(doc.select("div.book-dec > p").text());
            // Status
            book.setStatus(doc.select("div.book-label > a").first().text());
            // Table of contents (on a separate page)
            String chaptersUrl = doc.select("a.all-catalog").attr("href");
            ArrayList<Map<String, String>> chapters = new ArrayList<>();
            // Fetch the catalog page
            for (Element result : Jsoup.parse(BookUtil.gather(chaptersUrl, bookUrl)).select("ul.chapter-list li")) {
                HashMap<String, String> map = new HashMap<>();
                map.put("chapterName", result.select("a").text());
                map.put("url", result.select("a").attr("href"));
                chapters.add(map);
            }
            book.setChapters(chapters);
            // Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * Current chapter name and full content, plus the previous / next chapter links, Zongheng collection rules
     *
     * @param src        source
     * @param chapterUrl current chapter link
     * @param refererUrl referer link
     * @return Book object
     */
    public static Book book_read_zongheng(Map<String, String> src, String chapterUrl, String refererUrl) {
        Book book = new Book();
        // Current chapter link
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));
        // Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Current chapter name
            book.setNowChapter(doc.select("div.title_txtbox").text());
            // Remove image ads
            doc.select("div.content img").remove();
            // Current chapter content
            book.setNowChapterValue(doc.select("div.content").outerHtml());
            // Previous / next chapter.
            // Note: as written, the "prev" fields take the 下一章 (next chapter) link and the "next" fields
            // the 上一章 (previous chapter) link; book_read.html renders nextChapter first,
            // so the page still shows previous, then next.
            book.setPrevChapter(doc.select("div.chap_btnbox a:matches((?i)下一章)").text());
            book.setPrevChapterUrl(doc.select("div.chap_btnbox a:matches((?i)下一章)").attr("href"));
            book.setNextChapter(doc.select("div.chap_btnbox a:matches((?i)上一章)").text());
            book.setNextChapterUrl(doc.select("div.chap_btnbox a:matches((?i)上一章)").attr("href"));
            // Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }
}
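The search methods follow "next page" links recursively, and the Zongheng handler stops following them once 888 books have been collected. The same idea can be written as a loop with an explicit cap, which avoids deep recursion on long result lists. This is only a sketch: `BoundedCrawl`, `Page` and `collect` are hypothetical names, items are plain strings instead of `Book` objects, the page fetcher is passed in as a function so no network access is involved, and unlike the original check (which is made once per page) the cap here is exact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of the project.
final class BoundedCrawl {

    /** One fetched page: its items and the URL of the next page (null on the last page). */
    static final class Page {
        final List<String> items;
        final String nextUrl;

        Page(List<String> items, String nextUrl) {
            this.items = items;
            this.nextUrl = nextUrl;
        }
    }

    /** Follows next-page links starting at firstUrl until there are none left or maxItems is reached. */
    static List<String> collect(String firstUrl, Function<String, Page> fetch, int maxItems) {
        List<String> out = new ArrayList<>();
        String url = firstUrl;
        while (url != null && out.size() < maxItems) {
            Page page = fetch.apply(url);
            if (page == null) {
                break;                      // fetch failed: stop instead of looping forever
            }
            for (String item : page.items) {
                if (out.size() >= maxItems) {
                    break;
                }
                out.add(item);
            }
            url = page.nextUrl;
        }
        return out;
    }

    public static void main(String[] args) {
        // Fake pages standing in for fetched search results
        Map<String, Page> pages = new HashMap<>();
        pages.put("p1", new Page(Arrays.asList("a", "b", "c"), "p2"));
        pages.put("p2", new Page(Arrays.asList("d", "e", "f"), null));
        System.out.println(collect("p1", pages::get, 5));
    }
}
```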
BookHandler_qidian
import net.sf.json.JSONObject;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Qidian collection rules
 */
public class BookHandler_qidian {

    /** Source info */
    public static HashMap<String, String> qidian = new HashMap<>();

    static {
        // Qidian
        qidian.put("name", "起點中文網");
        qidian.put("key", "qidian");
        qidian.put("baseUrl", "http://www.qidian.com");
        qidian.put("baseSearchUrl", "https://www.qidian.com/search");
        qidian.put("UrlEncode", "UTF-8");
        qidian.put("searchUrl", "https://www.qidian.com/search?kw=#1&page=#2");
    }

    /**
     * Search list, Qidian collection rules
     *
     * @param books   result list
     * @param src     source
     * @param keyWord keyword
     */
    public static void book_search_qidian(ArrayList<Book> books, Map<String, String> src, String keyWord) {
        // Fetch the page
        String html = BookUtil.gather(src.get("searchUrl"), src.get("baseUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Results on the current page
            Elements resultList = doc.select("li.res-book-item");
            for (Element result : resultList) {
                Book book = new Book();
                /*
                    If you set a breakpoint here you will see that Qidian's links look like
                    //book.qidian.com/info/1012786368
                    starting with two slashes; that does not matter, httpClient can still request them
                 */
                // Book link
                book.setBookUrl(result.select("div.book-img-box a").attr("href"));
                // Cover image
                book.setImg(result.select("div.book-img-box img").attr("src"));
                // Title
                book.setBookName(result.select("div.book-mid-info > h4").text());
                // Author
                book.setAuthor(result.select("div.book-mid-info > p.author > a").first().text());
                // Type
                book.setType(result.select("div.book-mid-info > p.author > a").last().text());
                // Synopsis
                book.setSynopsis(result.select("div.book-mid-info > p.intro").text());
                // Status
                book.setStatus(result.select("div.book-mid-info > p.author > span").first().text());
                // Update time
                book.setUpdateDate(result.select("div.book-mid-info > p.update > span").text());
                // Latest chapter
                book.setLatestChapter(result.select("div.book-mid-info > p.update > a").text());
                book.setLatestChapterUrl(result.select("div.book-mid-info > p.update > a").attr("href"));
                // Source
                book.setSource(src);
                books.add(book);
            }
            // Current page
            String page = doc.select("div#page-container").attr("data-page");
            // Last page number
            String pageMax = doc.select("div#page-container").attr("data-pageMax");
            // Current page < last page
            if (Integer.valueOf(page) < Integer.valueOf(pageMax)) {
                src.put("baseUrl", src.get("searchUrl"));
                // Build the next-page link ourselves
                src.put("searchUrl", src.get("searchUrl").replaceAll("page=" + Integer.valueOf(page), "page=" + (Integer.valueOf(page) + 1)));
                book_search_qidian(books, src, keyWord);
            }
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
    }

    /**
     * Book details, Qidian collection rules
     *
     * @param src     source
     * @param bookUrl book link
     * @return Book object
     */
    public static Book book_details_qidian(Map<String, String> src, String bookUrl) {
        Book book = new Book();
        // The link is protocol-relative, so prepend https
        bookUrl = "https:" + bookUrl;
        // Fetch the page
        String html = BookUtil.gather(bookUrl, src.get("searchUrl"));
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            // Book link
            book.setBookUrl(bookUrl);
            // Cover image
            String img = doc.select("div.book-img > a#bookImg > img").attr("src");
            img = "https:" + img;
            book.setImg(img);
            // Title
            book.setBookName(doc.select("div.book-info > h1 > em").text());
            // Author
            book.setAuthor(doc.select("div.book-info > h1 a.writer").text());
            // Update time
            book.setUpdateDate(doc.select("li.update em.time").text());
            // Latest chapter
            book.setLatestChapter(doc.select("li.update a").text());
            book.setLatestChapterUrl(doc.select("li.update a").attr("href"));
            // Type
            book.setType(doc.select("p.tag > span").first().text());
            // Synopsis
            book.setSynopsis(doc.select("div.book-intro > p").text());
            // Status
            book.setStatus(doc.select("p.tag > a").first().text());

            // Table of contents
            // Create the httpclient object with a cookie store, so the cookies set by the first
            // request are sent along with the second one
            BasicCookieStore cookieStore = new BasicCookieStore();
            CloseableHttpClient httpClient = HttpClients.custom().setDefaultCookieStore(cookieStore).build();
            // Create the GET request
            HttpGet httpGet = new HttpGet("https://book.qidian.com/");
            httpGet.addHeader("Content-type", "application/json");
            // Dress the request up as a browser
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpGet.addHeader("Connection", "keep-alive");
            // Execute the request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            // Read the cookies
            String _csrfToken = "";
            List<Cookie> cookies = cookieStore.getCookies();
            for (int i = 0; i < cookies.size(); i++) {
                if ("_csrfToken".equals(cookies.get(i).getName())) {
                    _csrfToken = cookies.get(i).getValue();
                }
            }
            // Build the POST request
            String bookId = doc.select("div.book-img a#bookImg").attr("data-bid");
            HttpPost httpPost = new HttpPost(BookUtil.insertParams("https://book.qidian.com/ajax/book/category?_csrfToken=#1&bookId=#2", _csrfToken, bookId));
            httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
            httpPost.addHeader("Connection", "keep-alive");
            // Execute the request
            CloseableHttpResponse response1 = httpClient.execute(httpPost);
            // Read the result entity (a JSON string)
            String chaptersJson = "";
            if (response1.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                chaptersJson = EntityUtils.toString(response1.getEntity(), "UTF-8");
            }
            // Process the JSON in Java
            ArrayList<Map<String, String>> chapters = new ArrayList<>();
            JSONObject jsonArray = JSONObject.fromObject(chaptersJson);
            Map<String, Object> objectMap = (Map<String, Object>) jsonArray;
            Map<String, Object> objectMap_data = (Map<String, Object>) objectMap.get("data");
            List<Map<String, Object>> objectMap_data_vs = (List<Map<String, Object>>) objectMap_data.get("vs");
            for (Map<String, Object> vs : objectMap_data_vs) {
                List<Map<String, Object>> cs = (List<Map<String, Object>>) vs.get("cs");
                for (Map<String, Object> chapter : cs) {
                    Map<String, String> map = new HashMap<>();
                    map.put("chapterName", (String) chapter.get("cN"));
                    map.put("url", "https://read.qidian.com/chapter/" + (String) chapter.get("cU"));
                    chapters.add(map);
                }
            }
            book.setChapters(chapters);
            // Source
            book.setSource(src);
            // Release the connections
            response1.close();
            response.close();
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }

    /**
     * Current chapter name and full content, plus the previous / next chapter links, Qidian collection rules
     *
     * @param src        source
     * @param chapterUrl current chapter link
     * @param refererUrl referer link
     * @return Book object
     */
    public static Book book_read_qidian(Map<String, String> src, String chapterUrl, String refererUrl) {
        Book book = new Book();
        // Current chapter link
        book.setNowChapterUrl(chapterUrl.contains("http") ? chapterUrl : (src.get("baseUrl") + chapterUrl));
        // Fetch the page
        String html = BookUtil.gather(book.getNowChapterUrl(), refererUrl);
        try {
            // Parse the HTML string into a Document
            Document doc = Jsoup.parse(html);
            System.out.println(html);
            // Current chapter name
            book.setNowChapter(doc.select("h3.j_chapterName").text());
            // Remove image ads
            doc.select("div.read-content img").remove();
            // Current chapter content
            book.setNowChapterValue(doc.select("div.read-content").outerHtml());
            // Previous / next chapter.
            // Note: as written, the "prev" fields take the 下一章 (next chapter) link and the "next" fields
            // the 上一章 (previous chapter) link; book_read.html renders nextChapter first,
            // so the page still shows previous, then next.
            book.setPrevChapter(doc.select("div.chapter-control a:matches((?i)下一章)").text());
            String prev = doc.select("div.chapter-control a:matches((?i)下一章)").attr("href");
            prev = "https:" + prev;
            book.setPrevChapterUrl(prev);
            book.setNextChapter(doc.select("div.chapter-control a:matches((?i)上一章)").text());
            String next = doc.select("div.chapter-control a:matches((?i)上一章)").attr("href");
            next = "https:" + next;
            book.setNextChapterUrl(next);
            // Source
            book.setSource(src);
        } catch (Exception e) {
            System.err.println("Failed to collect data");
            e.printStackTrace();
        }
        return book;
    }
}
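The handlers deal with three kinds of links by hand: absolute ones, site-relative ones (prefixed with `baseUrl`), and Qidian's protocol-relative ones such as `//book.qidian.com/info/1012786368` (prefixed with `"https:"`). As an alternative sketch, the standard `java.net.URI.resolve` can handle all three against the URL of the page the link was found on. `LinkResolver` is a hypothetical name, and the example paths in `main` other than the Qidian one are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the project.
final class LinkResolver {

    /** Resolves href (absolute, site-relative, or protocol-relative) against the page it came from. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // Protocol-relative: takes the scheme of the page
        System.out.println(resolve("https://www.qidian.com/search?kw=x&page=1",
                "//book.qidian.com/info/1012786368"));
        // Site-relative: takes scheme and host of the page
        System.out.println(resolve("http://www.biquge.com.tw/0_1/", "/0_1/2.html"));
        // Absolute: returned unchanged
        System.out.println(resolve("http://www.zongheng.com/", "http://book.zongheng.com/chapter/1/2.html"));
    }
}
```

One caveat: `java.net.URI` follows the older RFC 2396 resolution rules, so a query-only link such as `?page=2` is resolved against the parent directory of the page path rather than the page itself. That case is worth testing separately before replacing the `baseSearchUrl + href` concatenation used for next-page links.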
Four HTML pages:
book_index
<!DOCTYPE html>
<!-- Stops IDEA from underlining Thymeleaf expressions in red -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>MY BOOK</title>
    <!-- Bootstrap core CSS -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <style>
        body {
            background-color: antiquewhite;
        }
        .main {
            margin: auto;
            width: 500px;
            margin-top: 150px;
        }
        #bookName {
            width: 300px;
        }
        #title {
            text-align: center;
        }
    </style>
</head>
<body>
<div class="main">
    <h2 id="title">MY BOOK</h2>
    <form class="form-inline" method="get" th:action="@{/book/search}">
        Source
        <select class="form-control" id="source" name="sourceKey">
            <option value="">All</option>
            <option value="biquge">Biquge</option>
            <option value="zongheng">Zongheng</option>
            <option value="qidian">Qidian</option>
        </select>
        <input type="text" id="bookName" name="bookName" class="form-control" placeholder="Enter a title..."/>
        <button class="btn btn-info" type="submit">Search</button>
    </form>
</div>
</body>
</html>
book_list
<!DOCTYPE html>
<!-- Stops IDEA from underlining Thymeleaf expressions in red -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK LIST</title>
    <!-- Bootstrap core CSS -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <link rel="stylesheet" href="http://hanlei.online/Onlineaddress/layui/css/layui.css"/>
    <style>
        body {
            background-color: antiquewhite;
        }
        .main {
            margin: auto;
            width: 500px;
            margin-top: 50px;
        }
        .book {
            border-bottom: solid #428bca 1px;
        }
        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }
        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }
        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }
    </style>
</head>
<body>
<div class="main">
    <form class="form-inline" method="get" th:action="@{/book/search}">
        Source
        <select class="form-control" id="source" name="sourceKey">
            <option value="">All</option>
            <option value="biquge" th:selected="${sourceKey} == 'biquge'">Biquge</option>
            <option value="zongheng" th:selected="${sourceKey} == 'zongheng'">Zongheng</option>
            <option value="qidian" th:selected="${sourceKey} == 'qidian'">Qidian</option>
        </select>
        <input type="text" id="bookName" name="bookName" class="form-control" placeholder="Enter a title..." th:value="${keyWord}"/>
        <button class="btn btn-info" type="submit">Search</button>
    </form>
    <br/>
    <div id="books"></div>
    <div id="page"></div>
</div>
</body>
<!-- jQuery (CDN) -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script src="http://hanlei.online/Onlineaddress/layui/layui.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
    var books = [[${books}]]; // data from the back end
    var nums = 10; // items per page
    var pages = books.length; // total number of items

    /**
     * Given the current page, compute the range from nums and slice the matching items out of books for display
     */
    var thisDate = function (curr) {
        var str = "", // HTML for the current page
            first = (curr * nums - nums), // index of the first item to show
            last = curr * nums - 1; // index of the last item to show
        last = last >= books.length ? (books.length - 1) : last;
        for (var i = first; i <= last; i++) {
            var book = books[i];
            str += "<div class='book'>" +
                "<img class='click-book-detail' data-bookurl='" + book.bookUrl + "' data-sourcekey='" + book.source.key + "' data-searchurl='" + book.source.searchUrl + "' src='" + book.img + "'></img>" +
                "<p class='click-book-detail' data-bookurl='" + book.bookUrl + "' data-sourcekey='" + book.source.key + "' data-searchurl='" + book.source.searchUrl + "'>Title: " + book.bookName + "</p>" +
                "<p>Author: " + book.author + "</p>" +
                "<p>Synopsis: " + book.synopsis + "</p>" +
                "<p class='click-book-read' data-chapterurl='" + book.latestChapterUrl + "' data-sourcekey='" + book.source.key + "' data-refererurl='" + book.source.refererurl + "'>Latest chapter: " + book.latestChapter + "</p>" +
                "<p>Updated: " + book.updateDate + "</p>" +
                "<p>Size: " + book.magnitude + "</p>" +
                "<p>Status: " + book.status + "</p>" +
                "<p>Type: " + book.type + "</p>" +
                "<p>Source: " + book.source.name + "</p>" +
                "</div><br/>";
        }
        return str;
    };

    // Get a laypage instance
    layui.use('laypage', function () {
        var laypage = layui.laypage;
        // Use laypage for logical pagination
        laypage.render({
            elem: 'page',
            count: pages,
            limit: nums,
            jump: function (obj) {
                // obj holds all parameters of the current pagination, for example:
                // console.log(obj.curr); // the current page, e.g. for requesting that page from the server
                // console.log(obj.limit); // the number of items per page
                document.getElementById('books').innerHTML = thisDate(obj.curr);
            },
            prev: '<',
            next: '>',
            theme: '#f9c357',
        })
    });

    $("body").on("click", ".click-book-detail", function (even) {
        var bookUrl = $(this).data("bookurl");
        var searchUrl = $(this).data("searchurl");
        var sourceKey = $(this).data("sourcekey");
        window.location.href = ctx + "/book/details?sourceKey=" + sourceKey + "&searchUrl=" + searchUrl + "&bookUrl=" + bookUrl;
    });
    $("body").on("click", ".click-book-read", function (even) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        window.location.href = ctx + "/book/read?sourceKey=" + sourceKey + "&refererUrl=" + refererUrl + "&chapterUrl=" + chapterUrl;
    });
</script>
</html>
book_details
<!DOCTYPE html>
<!-- Stops IDEA from underlining Thymeleaf expressions in red -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK DETAILS</title>
    <!-- Bootstrap core CSS -->
    <link rel="stylesheet" href="http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css">
    <link rel="stylesheet" href="http://hanlei.online/Onlineaddress/layui/css/layui.css"/>
    <style>
        body {
            background-color: antiquewhite;
        }
        .main {
            margin: auto;
            width: 500px;
            margin-top: 150px;
        }
        .book {
            border-bottom: solid #428bca 1px;
        }
        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }
        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }
        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }
        a {
            color: #428bca;
        }
    </style>
</head>
<body>
<div class="main">
    <div class='book'>
        <div id="bookImg"></div>
        <p>Title: <span th:text="${book.bookName}"></span></p>
        <p>Author: <span th:text="${book.author}"></span></p>
        <p>Synopsis: <span th:text="${book.synopsis}"></span></p>
        <p>Latest chapter: <a th:href="${book.latestChapterUrl}" th:text="${book.latestChapter}"></a></p>
        <p>Updated: <span th:text="${book.updateDate}"></span></p>
        <p>Size: <span th:text="${book.magnitude}"></span></p>
        <p>Status: <span th:text="${book.status}"></span></p>
        <p>Type: <span th:text="${book.type}"></span></p>
        <p>Source: <span th:text="${book.source.name}"></span></p>
    </div>
    <br/>
    <div class="chapters" th:each="chapter,iterStat:${book.chapters}">
        <p class="click-book-read" th:attr="data-chapterurl=${chapter.url},data-sourcekey=${book.source.key},data-refererurl=${book.bookUrl}" th:text="${chapter.chapterName}"></p>
    </div>
</div>
</body>
<!-- jQuery (CDN) -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
    var book = [[${book}]]; // data from the back end

    /**
     * Work around hotlink protection
     */
    function showImg(parentObj, url) {
        // Generate a random id
        var frameid = 'frameimg' + Math.random();
        // Store the markup on the (parent page's) window. A script tag inside the iframe can bind window.onload to set the iframe's height and width:
        // <script>window.onload = function() { parent.document.getElementById(\'' + frameid + '\').height = document.getElementById(\'img\').height+\'px\'; }<' + '/script>
        window.img = '<img src=\'' + url + '?' + Math.random() + '\'/>';
        // The iframe loads parent.img
        $(parentObj).append('<iframe id="' + frameid + '" src="javascript:parent.img;" frameBorder="0" scrolling="no"></iframe>');
    }

    showImg($("#bookImg"), book.img);

    $("body").on("click", ".click-book-read", function (even) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        window.location.href = ctx + "/book/read?sourceKey=" + sourceKey + "&refererUrl=" + refererUrl + "&chapterUrl=" + chapterUrl;
    });
</script>
</html>
book_read
<!DOCTYPE html>
<!-- Stops IDEA from underlining Thymeleaf expressions in red -->
<!--suppress ALL -->
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>BOOK READ</title>
    <style>
        body {
            background-color: antiquewhite;
        }
        .main {
            padding: 10px 20px;
        }
        .click-book-detail, .click-book-read {
            cursor: pointer;
            color: #428bca;
        }
        .click-book-detail:hover {
            color: rgba(150, 149, 162, 0.47);
        }
        .click-book-read:hover {
            color: rgba(150, 149, 162, 0.47);
        }
        .float-left {
            float: left;
            margin-left: 70px;
        }
    </style>
</head>
<body>
<div class="main">
    <!-- Chapter name -->
    <h3 th:text="${book.nowChapter}"></h3>
    <!-- Chapter content -->
    <p th:utext="${book.nowChapterValue}"></p>
    <!-- Previous / next chapter -->
    <p class="click-book-read float-left" th:attr="data-chapterurl=${book.nextChapterUrl},data-sourcekey=${book.source.key},data-refererurl=${book.nowChapterUrl}" th:text="${book.nextChapter}"></p>
    <p class="click-book-read float-left" th:attr="data-chapterurl=${book.prevChapterUrl},data-sourcekey=${book.source.key},data-refererurl=${book.nowChapterUrl}" th:text="${book.prevChapter}"></p>
</div>
</body>
<!-- jQuery (CDN) -->
<script src="http://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script th:inline="javascript">
    var ctx = /*[[@{/}]]*/'';
    $("body").on("click", ".click-book-read", function (even) {
        var chapterUrl = $(this).data("chapterurl");
        var refererUrl = $(this).data("refererurl");
        var sourceKey = $(this).data("sourcekey");
        window.location.href = ctx + "/book/read?sourceKey=" + sourceKey + "&refererUrl=" + refererUrl + "&chapterUrl=" + chapterUrl;
    });
</script>
</html>
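Update 2019-07-17: The baseUrl of all three source sites originally used http, but the sites have since been upgraded to https, for example Biquge:
As a result, fetching data now fails with this error: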
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) at sun.security.ssl.Handshaker.process_record(Handshaker.java:961) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at cn.huanzi.qch.spider.novelreading.util.BookUtil.gather(BookUtil.java:81) at cn.huanzi.qch.spider.novelreading.pojo.BookHandler_biquge.book_search_biquge(BookHandler_biquge.java:43) at cn.huanzi.qch.spider.novelreading.controller.BookContrller.search(BookContrller.java:78) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:215) at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:142) at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:102) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:895) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:800) at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1038) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:998) at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:890) at 
javax.servlet.http.HttpServlet.service(HttpServlet.java:634) at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:875) at javax.servlet.http.HttpServlet.service(HttpServlet.java:741) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:99) at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:92) at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:93) at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200) at 
org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:199) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:490) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343) at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:770) at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1415) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748)
Fix: following http://www.javashuo.com/article/p-grzpkcic-ng.html, bypass certificate verification.
Add this method to BookUtil.java:
/**
 * Bypass SSL certificate verification
 */
private static SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
    // "TLS" rather than "SSLv3": SSLv3 is obsolete and disabled by default in current JDKs
    SSLContext sc = SSLContext.getInstance("TLS");

    // Implement X509TrustManager with empty check methods so that every certificate is accepted
    X509TrustManager trustManager = new X509TrustManager() {
        @Override
        public void checkClientTrusted(
                java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
                String paramString) throws CertificateException {
        }

        @Override
        public void checkServerTrusted(
                java.security.cert.X509Certificate[] paramArrayOfX509Certificate,
                String paramString) throws CertificateException {
        }

        @Override
        public java.security.cert.X509Certificate[] getAcceptedIssuers() {
            // The interface contract asks for a non-null (possibly empty) array
            return new java.security.cert.X509Certificate[0];
        }
    };

    sc.init(null, new TrustManager[]{trustManager}, null);
    return sc;
}
Then change the gather method so that it obtains the httpClient like this:
/**
 * Fetch the given url and return the full response entity as a string
 *
 * @param url        the url to fetch
 * @param refererUrl value sent in the Referer header
 * @return the response entity as a string, or null if the request failed
 */
public static String gather(String url, String refererUrl) {
    String result = null;
    try {
        // Handle https requests by bypassing certificate verification
        SSLContext sslcontext = createIgnoreVerifySSL();

        // Register the socket factories that handle the http and https protocols
        Registry<ConnectionSocketFactory> socketFactoryRegistry = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.INSTANCE)
                .register("https", new SSLConnectionSocketFactory(sslcontext))
                .build();
        PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager(socketFactoryRegistry);

        // Previously: default httpclient (kept as a global variable, so session and cookies carry over between requests)
        // CloseableHttpClient httpClient = HttpClients.createDefault();

        // Create the GET request
        HttpGet httpGet = new HttpGet(url);
        httpGet.addHeader("Content-type", "application/json");
        // Disguise the request as a browser
        httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        httpGet.addHeader("Referer", refererUrl);
        httpGet.addHeader("Connection", "keep-alive");

        // Create the custom httpclient and execute the request;
        // try-with-resources releases the response and the client (and its connection manager)
        try (CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connManager).build();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Read the response entity
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                result = EntityUtils.toString(response.getEntity(), "GBK");
            }
        }
    }
    // Timeout exceptions could also be caught here to reconnect and fetch again
    catch (Exception e) {
        result = null;
        System.err.println("Fetch operation failed");
        e.printStackTrace();
    }
    return result;
}
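The comment in gather mentions catching timeout exceptions and fetching again. A minimal sketch of that idea, using only the JDK so it runs on its own: RetrySketch and withRetry are hypothetical names, not part of the project, and the lambda in main merely simulates a fetch that times out twice. It catches InterruptedIOException on the assumption that the timeouts you care about (java.net.SocketTimeoutException, and as far as I know HttpClient 4.x's ConnectTimeoutException) are subclasses of it. Timeouts only fire if they are configured on the request in the first place.

```java
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a task only when it fails with a timeout.
final class RetrySketch {

    /** Runs the task up to maxAttempts times in total, retrying on timeouts only. */
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        InterruptedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedIOException e) {
                // Timeout: remember it and try again
                last = e;
                System.err.println("Attempt " + attempt + " timed out: " + e.getMessage());
            }
        }
        // Every attempt timed out: surface the last timeout to the caller
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: times out twice, then succeeds
        String body = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "<html>ok</html>";
        }, 3);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

In the project, the task passed to withRetry would be the HttpClient call itself rather than gather, because gather swallows every exception and returns null.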
With this change, fetching works normally again.
We used to obtain the project path like this:
var ctx = /*[[@{/}]]*/'';
This suddenly stopped working: the redirect path started directly with /. We now obtain it like this instead:
// Project path
var ctx = [[${#request.getContextPath()}]];
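Update 2019-08-01: If you see this "connection reset" error, don't panic. The site may have changed its domain. For example, our program currently requests http://www.biquge.com.tw, but that address is no longer reachable; Biquge has moved to https://www.biqudu.net/. Updating the URL in the code fixes the problem. Make a habit of checking that every source URL is still reachable. The target site may also change its page layout so that our existing rules no longer match any data, in which case the only option is to rewrite the scraping rules.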
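Checking the source URLs can be automated. The sketch below is one possible approach, not part of the project: SourceHealthCheck and isReachable are hypothetical names, and it uses the JDK's HttpURLConnection instead of the project's HttpClient to stay self-contained. It treats any 2xx or 3xx status as reachable and any I/O failure (connection reset, timeout, failed TLS handshake, malformed URL) as unreachable.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: reports whether a source's base URL still answers.
final class SourceHealthCheck {

    static boolean isReachable(String baseUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(baseUrl).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            int status = conn.getResponseCode();
            // 2xx and 3xx both mean that something is still answering at this address
            return status >= 200 && status < 400;
        } catch (IOException e) {
            // Connection reset, timeout, TLS failure, malformed URL, ...
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // The two Biquge addresses mentioned in the text
        String[] baseUrls = {"http://www.biquge.com.tw", "https://www.biqudu.net/"};
        for (String url : baseUrls) {
            System.out.println(url + " -> " + (isReachable(url, 5000) ? "OK" : "unreachable"));
        }
    }
}
```

A reachable site can still break the scraper if its page layout changed, so a check like this only covers the domain-change case.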
Update 2019-08-02: Found a bug. Our BookUtil.insertParams method works by replacing the # placeholders in a string:
/**
 * Inject parameters automatically
 * For example:
 *
 * @param src    http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=
 * @param params "鬥破蒼穹","1"
 * @return http://search.zongheng.com/s?keyword=鬥破蒼穹&pageNo=1&sort=
 */
public static String insertParams(String src, String... params) {
    int i = 1;
    for (String param : params) {
        src = src.replaceAll("#" + i, param);
        i++;
    }
    return src;
}
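A side note on regex-based replacement such as String.replaceAll: the replacement string is interpreted too, so a keyword containing $ or \ can throw or be mangled, and with ten or more placeholders "#1" also matches the start of "#10". The sketch below shows a literal variant under those assumptions. InsertParamsSketch is a hypothetical class name and the keyword "US$1" is made up for illustration; this is not the project's code.

```java
// Hypothetical variant of insertParams that replaces placeholders literally.
final class InsertParamsSketch {

    static String insertParams(String src, String... params) {
        // Walk from the highest index down so "#1" never consumes the prefix of "#10"
        for (int i = params.length; i >= 1; i--) {
            // String.replace does plain text substitution: no regex, no special meaning for $ or \
            src = src.replace("#" + i, params[i - 1]);
        }
        return src;
    }

    public static void main(String[] args) {
        String template = "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=";
        // Prints http://search.zongheng.com/s?keyword=US$1&pageNo=1&sort=
        System.out.println(insertParams(template, "US$1", "1"));
    }
}
```

Passing the same "US$1" keyword as the replacement argument of replaceAll fails with an IndexOutOfBoundsException, because $1 is read as a reference to a capture group that does not exist.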
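The problem: when we search and call the parameter injection, the src argument comes from a static Map whose value contains two # placeholders at initialization. After the first search those placeholders have already been replaced, so later searches find no # placeholders left to inject into, and every subsequent search returns the results of the first one...
Fix: when fetching the source, don't assign it with =; make a copy instead. All three methods need this change.
Before: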
// Get the source details
Map<String, String> src = source.get(sourceKey);
After:
// Get the source details, as a copy
Map<String, String> src = new HashMap<>();
src.putAll(source.get(sourceKey));
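To see why the copy matters, here is a small self-contained reproduction. It assumes the search code writes the filled-in URL back into the map it was given, which is what the bug description implies. SharedMapSketch, newSource, search, and the keys "zongheng" and "searchUrl" are hypothetical names chosen for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reproduction of the shared-map bug and the copy-based fix.
final class SharedMapSketch {

    /** Stand-in for the static source map: source key -> settings of that source. */
    static Map<String, Map<String, String>> newSource() {
        Map<String, String> zongheng = new HashMap<>();
        zongheng.put("searchUrl", "http://search.zongheng.com/s?keyword=#1&pageNo=#2&sort=");
        Map<String, Map<String, String>> source = new HashMap<>();
        source.put("zongheng", zongheng);
        return source;
    }

    /** Fills in the placeholders and stores the result in src (assumed behaviour). */
    static String search(Map<String, String> src, String keyword) {
        String url = src.get("searchUrl").replace("#1", keyword).replace("#2", "1");
        src.put("searchUrl", url); // mutates whichever map was passed in
        return url;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> source = newSource();

        // With a copy, the template in the shared map keeps its placeholders
        System.out.println(search(new HashMap<>(source.get("zongheng")), "first"));
        System.out.println(search(new HashMap<>(source.get("zongheng")), "second"));

        // Without a copy, the first search overwrites the template,
        // so the second search still produces keyword=first
        System.out.println(search(source.get("zongheng"), "first"));
        System.out.println(search(source.get("zongheng"), "second"));
    }
}
```

A shallow copy is enough here because the map values are immutable Strings.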
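My company recently decided to build a mobile client, so I learned uni-app from DCloud (the development tool is HBuilderX) and, as practice, used our novel scraper to build an H5 mobile page.
DCloud official site: https://www.dcloud.io/
uni-app official site: https://uniapp.dcloud.io/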
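uni-app
is a framework for building all kinds of front-end applications with Vue.js. Developers write a single codebase that can be compiled to multiple platforms, including iOS, Android, H5, and various mini-programs.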
Screenshots:
The code is open source and hosted on my GitHub and Gitee: