First, an introduction to Jsoup (quoted from the official site):

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
Jsoup is commonly nicknamed the "big killer" (大殺器); for details on how to use it, see the jsoup Chinese documentation.
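To get a feel for the API described in that quote, here is a minimal sketch. The HTML snippet and the class/method names are made up for illustration; the selectors deliberately mirror the ones used later in this post:

```java
import org.jsoup.Jsoup;

public class JsoupDemo {
    // parse an in-memory HTML fragment and extract the chapter title via a CSS selector
    static String titleOf(String html) {
        return Jsoup.parse(html).select("div.bookname > h1").text();
    }

    // find the "next chapter" link by matching the anchor's text against a regex
    static String nextLinkOf(String html) {
        return Jsoup.parse(html).select("a:matches(下一章)").attr("href");
    }

    public static void main(String[] args) {
        String html = "<div class=\"bookname\"><h1>Chapter 1</h1></div>"
                    + "<a href=\"/next.html\">下一章</a>";
        System.out.println(titleOf(html));    // Chapter 1
        System.out.println(nextLinkOf(html)); // /next.html
    }
}
```

`Jsoup.parse` accepts a plain string, so no network access is needed to experiment with selectors.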
First, pull in the Maven dependencies:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.9</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
Next, wrap up a few helper methods (the reasoning is mostly in the comments, which should be easy to follow):
import org.apache.commons.lang3.StringUtils; // or another StringUtils with an isEmpty()
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import javax.swing.filechooser.FileSystemView;
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

/**
 * Create a .txt file
 *
 * @param fileName file name (the novel's title)
 * @return File object
 */
public static File createFile(String fileName) {
    // get the desktop path
    String comPath = FileSystemView.getFileSystemView().getHomeDirectory().getPath();
    // create an empty folder: networkNovel
    File file = new File(comPath + "\\networkNovel\\" + fileName + ".txt");
    try {
        // get the parent directory
        File fileParent = file.getParentFile();
        if (!fileParent.exists()) {
            fileParent.mkdirs();
        }
        // create the file
        if (!file.exists()) {
            file.createNewFile();
        }
    } catch (Exception e) {
        file = null;
        System.err.println("Error creating the file");
        e.printStackTrace();
    }
    return file;
}

/**
 * Write data to the file with a character stream
 *
 * @param file  File object
 * @param value the data to write
 */
public static void fileWriter(File file, String value) {
    // character stream
    try {
        FileWriter resultFile = new FileWriter(file, true); // true means append
        PrintWriter myFile = new PrintWriter(resultFile);
        // write
        myFile.println(value);
        myFile.println("\n");
        myFile.close();
        resultFile.close();
    } catch (Exception e) {
        System.err.println("Error writing to the file");
        e.printStackTrace();
    }
}

/**
 * Fetch the full response entity for the given url, as a String
 *
 * @param url url
 * @return the response entity as a String
 */
public static String gather(String url, String refererUrl) {
    String result = null;
    try {
        // create the httpclient object (made a shared/global instance, the session
        // and cookies would be carried along across requests)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create a GET request object
        HttpGet httpGet = new HttpGet(url);
        httpGet.addHeader("Content-type", "application/json");
        // dress the request up a bit
        httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        httpGet.addHeader("Referer", refererUrl);
        httpGet.addHeader("Connection", "keep-alive");
        // execute the request and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // read the result entity
        if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
            result = EntityUtils.toString(response.getEntity(), "GBK");
        }
        // release the connection
        response.close();
    }
    // a timeout exception could also be caught here, to reconnect and fetch again
    catch (Exception e) {
        result = null;
        System.err.println("Error fetching the page");
        e.printStackTrace();
    }
    return result;
}

/**
 * Use jsoup to process the html string and, according to the extraction rules,
 * get the current chapter's name, its full content, and the link to the next chapter.
 * Every site's markup is different, so the rules must be adapted per site.
 * For example, here the chapter body sits in a single div, while on some sites
 * each paragraph is wrapped in a p tag.
 *
 * @param html html string
 * @return Map<String, String>
 */
public static Map<String, String> processor(String html) {
    HashMap<String, String> map = new HashMap<>();
    String chapterName;    // chapter name
    String chapter = null; // full chapter (including the chapter name)
    String next = null;    // link to the next chapter
    try {
        // parse the html-format string into a Document
        Document doc = Jsoup.parse(html);
        // chapter name
        Elements bookname = doc.select("div.bookname > h1");
        chapterName = bookname.text().trim();
        chapter = chapterName + "\n";
        // chapter body
        Elements content = doc.select("div#content");
        String replaceText = content.text().replace(" ", "\n");
        chapter = chapter + replaceText;
        // next chapter
        Elements nextText = doc.select("a:matches((?i)下一章)");
        if (nextText.size() > 0) {
            next = nextText.attr("href");
        }
        map.put("chapterName", chapterName); // chapter name
        map.put("chapter", chapter);         // full chapter content
        map.put("next", next);               // link to the next chapter
    } catch (Exception e) {
        map = null;
        System.err.println("Error processing the data");
        e.printStackTrace();
    }
    return map;
}

/**
 * Recursively write out a complete book
 *
 * @param file       file
 * @param baseUrl    base url
 * @param url        current url
 * @param refererUrl refererUrl
 */
public static void mergeBook(File file, String baseUrl, String url, String refererUrl) {
    String html = gather(baseUrl + url, baseUrl + refererUrl);
    Map<String, String> map = processor(html);
    // append to the file
    fileWriter(file, map.get("chapter"));
    System.out.println(map.get("chapterName") + " --100%");
    if (!StringUtils.isEmpty(map.get("next"))) {
        // recurse
        mergeBook(file, baseUrl, map.get("next"), url);
    }
}
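As the comment in `processor()` notes, some sites wrap each paragraph in a `p` tag instead of putting the whole body in one div. A sketch of how the extraction rule could be adapted for that layout (the class name and the `div#content > p` structure are assumptions for illustration, not from the original post):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.stream.Collectors;

public class PTagProcessor {
    // variant rule for sites where each paragraph is a <p> child of the content div
    static String extractContent(String html) {
        Document doc = Jsoup.parse(html);
        // join one output line per <p>, instead of replacing spaces as processor() does
        return doc.select("div#content > p").stream()
                  .map(Element::text)
                  .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><p>First line.</p><p>Second line.</p></div>";
        System.out.println(extractContent(html)); // First line.\nSecond line.
    }
}
```

`Elements` extends `ArrayList<Element>`, so it can be streamed directly; this keeps paragraph boundaries exact rather than guessing them from whitespace.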
A main method to test it:
public static void main(String[] args) {
    // what you need to supply: the site; the novel's title; the link to the
    // first chapter; and the refererUrl
    String baseUrl = "http://www.biquge.com.tw";
    File file = createFile("鬥破蒼穹");
    mergeBook(file, baseUrl, "/1_1999/1179371.html", "/1_1999/");
}
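The comment in `gather()` mentions that a timeout exception could be caught in order to reconnect and re-fetch, which matters for a long unattended crawl. One way to sketch that, as a hypothetical generic helper (not part of the original post) that the `gather()` call could be wrapped in:

```java
import java.util.function.Supplier;

public class RetryDemo {
    // hypothetical helper: retry a task up to maxAttempts times with linear back-off
    static <T> T withRetry(Supplier<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) { // e.g. a wrapped SocketTimeoutException
                last = e;
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                try {
                    Thread.sleep(100L * attempt); // wait a bit longer after each failure
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        throw last; // give up after maxAttempts
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // simulate a flaky fetch that succeeds on the third attempt
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("read timed out");
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```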
Here is a look at the data I scraped earlier: running several processes and leaving them to crawl unattended, it came to roughly 7 GB, more than 780 novels.
The code is open source, hosted on my GitHub and Gitee (碼雲):
GitHub: https://github.com/huanzi-qch/spider