Java Crawler Series, Part 3: Parsing HTML with Jsoup

Date: 2019-11-08


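The previous post, "Java Crawler Series, Part 2: Fetching Page HTML with HttpClient", covered the first step of a crawler: fetching a page's HTML with HttpClient. Today we move on to the second step: parsing the HTML that was fetched.

Please welcome the star of step two: Jsoup. We now hand the stage over to Jsoup, who will present the rest of this post.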

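============ Fancy divider =============

1. Jsoup introduces itself

Hello everyone, I'm Jsoup.

I am an HTML parser for Java. I can parse a URL or a string of HTML directly, and I offer a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Most of my peers who write crawlers in Java have used me. Why? Because in this area I am powerful and easy to use. If you don't believe me, read on: code doesn't lie.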

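2. Parsing HTML with Jsoup

In the previous post, big brother HttpClient fetched the HTML of the cnblogs (博客園) home page, but that is just a wall of markup. How is anyone who isn't a programmer supposed to read it? That is where I, the HTML parsing specialist, come in.

The example below shows how to parse with Jsoup: it extracts the title of the cnblogs home page and the list of blog posts on its first page.

Here is the code (it builds on the code from the previous post; if you don't yet know how to use HttpClient, please read that post first):

Add the dependency: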

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

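Now the implementation. Before writing any code, analyze the HTML structure. The title is simply the <title> tag, but what about the post list? Press F12 in the browser and inspect the page elements: the list is one large div with id="post_list", and each post is a smaller div with class="post_item".

With that, we can start coding. The core Jsoup code is as follows (the complete source is given at the end of the post):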

/**
 * Below is where Jsoup gets to show what it can do
 */
// 6. Parse the HTML with Jsoup
Document document = Jsoup.parse(html);
// Like JavaScript: get the title by tag name
System.out.println(document.getElementsByTag("title").first());
// Like JavaScript: get the post list element by id
Element postList = document.getElementById("post_list");
// Like JavaScript: get every post under the list by class name
Elements postItems = postList.getElementsByClass("post_item");
// Process each post in turn
for (Element postItem : postItems) {
    // Like a jQuery selector: get the post title element
    Elements titleEle = postItem.select(".post_item_body a[class='titlelnk']");
    System.out.println("Post title: " + titleEle.text());
    System.out.println("Post URL: " + titleEle.attr("href"));
    // Like a jQuery selector: get the post author element
    Elements footEle = postItem.select(".post_item_foot a[class='lightblue']");
    System.out.println("Post author: " + footEle.text());
    System.out.println("Author page: " + footEle.attr("href"));
    System.out.println("*********************************");
}

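As the code above shows, I parse the HTML that HttpClient fetched with Jsoup.parse(String html), which gives me a Document. The document then offers two ways to reach its child elements: JavaScript-style getElementXXXX methods, and a jQuery-style select() method. Either works; personally I recommend select(). To read an attribute of an element, such as a link's address, use element.attr(String); to read an element's text content, use element.text().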
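To put the two lookup styles side by side, here is a minimal sketch. The class name SelectorSketch, its method names, and the SAMPLE_HTML markup are all made up for illustration; the markup only imitates the post_list / post_item structure described above and is not the real cnblogs page. The sketch also shows two details that are easy to miss: getElementById returns null when nothing matches, and attr() on an Elements collection reads from the first matching element.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Hypothetical markup that imitates the post_list / post_item structure.
    static final String SAMPLE_HTML =
            "<div id=\"post_list\">"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<a class=\"titlelnk\" href=\"https://example.com/p/1\">First post</a>"
            + "</div></div>"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<a class=\"titlelnk\" href=\"https://example.com/p/2\">Second post</a>"
            + "</div></div>"
            + "</div>";

    // DOM style: walk down step by step, guarding against a missing id.
    static List<String> titlesDomStyle(String html) {
        List<String> titles = new ArrayList<>();
        Element postList = Jsoup.parse(html).getElementById("post_list");
        if (postList == null) {
            return titles; // getElementById returns null when nothing matches
        }
        for (Element link : postList.getElementsByClass("titlelnk")) {
            titles.add(link.text());
        }
        return titles;
    }

    // Selector style: one CSS query; select() returns an empty Elements, never null.
    static List<String> titlesSelectorStyle(String html) {
        List<String> titles = new ArrayList<>();
        for (Element link : Jsoup.parse(html).select("#post_list .post_item a.titlelnk")) {
            titles.add(link.text());
        }
        return titles;
    }

    // On an Elements collection, attr() reads from the first match that has the attribute.
    static String firstHref(String html) {
        Elements links = Jsoup.parse(html).select("a.titlelnk");
        return links.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(titlesDomStyle(SAMPLE_HTML));
        System.out.println(titlesSelectorStyle(SAMPLE_HTML));
        System.out.println(firstHref(SAMPLE_HTML));
    }
}
```

Both methods return the same titles for this sample. The selector version needs no null check, which is one practical reason to prefer select() when the page structure may change.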

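Run the code and look at the result. (I have to admire how prolific the cnblogs community is: in the time between analyzing the home page's HTML structure above and finishing the run of the Jsoup code, that many new posts appeared on the home page.)
Because new posts are published so quickly, the screenshot above and the output here differ somewhat.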

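3. Other uses of Jsoup

I, Jsoup, can do more than build on big brother HttpClient's work: I can also work on my own, fetching a page and then analyzing it myself. You have already seen my analysis skills above, so next comes fetching a page by myself. It is simple; the only difference is that I get a Document directly, with no need to go through Jsoup.parse() again.

Besides accessing resources on the web directly, I can also parse local resources:

Code: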
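Fetching a page directly can be sketched roughly as follows. This is an illustration rather than the post's own listing: the class name FetchSketch, the helper names, the shortened User-Agent string, and the 10-second timeout are assumptions. Jsoup.connect(url).get() performs the request and returns a parsed Document in one step.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchSketch {

    // Fetches and parses a page in one step; no separate Jsoup.parse() call is needed.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumed value; some sites reject the default agent
                .timeout(10_000)          // assumed timeout, in milliseconds
                .get();
    }

    // Works on any Document, whether it came from connect() or from parse().
    static String titleOf(Document document) {
        return document.title();
    }

    public static void main(String[] args) throws IOException {
        // Requires network access.
        Document document = fetch("https://www.cnblogs.com/");
        System.out.println(titleOf(document));
    }
}
```

From here on, the Document can be queried with select() exactly as in the HttpClient-based example.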

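import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LocalFileDemo {
    public static void main(String[] args) {
        try {
            // Parse a local HTML file, decoding it as UTF-8
            Document document = Jsoup.parse(new File("d://1.html"), "utf-8");
            System.out.println(document);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}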
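One detail worth knowing when parsing a saved page: with the two-argument Jsoup.parse(File, String), the file's own location is used as the base URI, so relative links in the file do not resolve to the site they came from. The three-argument overload takes a base URI explicitly. The sketch below assumes a made-up class LocalFileSketch, a temporary sample file, and https://www.cnblogs.com/ as the base URI purely for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LocalFileSketch {

    // Writes a tiny hypothetical page containing a relative link to a temp file.
    static File writeSample() {
        try {
            File tmp = File.createTempFile("jsoup-sample", ".html");
            tmp.deleteOnExit();
            String html = "<html><body><a href=\"/p/1\">A post</a></body></html>";
            Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the file with an explicit base URI so relative hrefs can be resolved.
    static String firstAbsoluteLink(File file, String baseUri) {
        try {
            Document doc = Jsoup.parse(file, "UTF-8", baseUri);
            Element link = doc.selectFirst("a[href]");
            return link == null ? "" : link.absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File sample = writeSample();
        System.out.println(firstAbsoluteLink(sample, "https://www.cnblogs.com/"));
    }
}
```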

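4. Another Jsoup feature worth mentioning

You have surely run into this: if HTML elements are typed into a text box on your page, then after saving and viewing the page again the layout will most likely be a mess. It would be ideal to be able to filter that content.

As it happens, I, Jsoup, can do exactly that.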

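import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanDemo {
    public static void main(String[] args) {
        String unsafe = "<p><a href='網址' onclick='stealCookies()'>博客園</a></p>";
        System.out.println("unsafe: " + unsafe);
        // Keep only the tags and attributes that the basic whitelist allows
        String safe = Jsoup.clean(unsafe, Whitelist.basic());
        System.out.println("safe: " + safe);
    }
}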

The Jsoup.clean method filters the input against a whitelist. Output:

unsafe: <p><a href='網址' onclick='stealCookies()'>博客園</a></p>
safe: <p><a rel="nofollow">博客園</a></p>
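Note that in the output above the href attribute disappears along with onclick. Whitelist.basic() only keeps an href whose value is an absolute URL with an allowed protocol (ftp, http, https, or mailto), and the value shown here is a placeholder rather than such a URL. The sketch below, with a made-up class name CleanSketch and illustrative inputs, contrasts the two cases. It targets the jsoup 1.12.1 dependency used in this post; newer jsoup releases rename this class to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanSketch {

    // Filters the input against the basic whitelist (simple text tags plus links).
    static String cleanBasic(String html) {
        return Jsoup.clean(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        // Absolute https URL: href is kept, onclick is removed.
        System.out.println(cleanBasic(
                "<p><a href='https://www.cnblogs.com/' onclick='stealCookies()'>cnblogs</a></p>"));
        // Not an absolute URL with an allowed protocol: href is removed as well.
        System.out.println(cleanBasic(
                "<p><a href='not-a-url' onclick='stealCookies()'>cnblogs</a></p>"));
    }
}
```

In both cases the basic whitelist also adds rel="nofollow" to the link.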

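5. Closing remarks

By now I trust you agree that I am pretty capable: I can parse the HTML that HttpClient fetched, I can fetch a page's DOM myself, and I can load and parse an HTML file saved locally.

On top of that, I can filter a string against a whitelist to strip out unsafe tags and attributes.

Most important of all, the APIs for all of the features above are simple to call.

============ Fancy divider =============

Writing takes effort, so leave a like before you go~~

Finally, here is the complete source for the example that parses the post list on the cnblogs home page: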

package httpclient_learn;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpClientTest {
    
    public static void main(String[] args) {
        // 1. Create the HttpClient, which is like opening a browser
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = null;
        // 2. Create the GET request, which is like typing a URL into the browser's address bar
        HttpGet request = new HttpGet("https://www.cnblogs.com/");
        // Set a request header so the crawler presents itself as a browser
        request.setHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
//        HttpHost proxy = new HttpHost("60.13.42.232", 9999);
//        RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
//        request.setConfig(config);
        try {
            // 3. Execute the GET request, which is like pressing Enter in the address bar
            response = httpClient.execute(request);

            // 4. Process the response only if the status is 200
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                // 5. Read the response body
                HttpEntity httpEntity = response.getEntity();
                String html = EntityUtils.toString(httpEntity, "utf-8");
                System.out.println(html);

                /**
                 * Below is where Jsoup gets to show what it can do
                 */
                // 6. Parse the HTML with Jsoup
                Document document = Jsoup.parse(html);
                // Like JavaScript: get the title by tag name
                System.out.println(document.getElementsByTag("title").first());
                // Like JavaScript: get the post list element by id
                Element postList = document.getElementById("post_list");
                // Like JavaScript: get every post under the list by class name
                Elements postItems = postList.getElementsByClass("post_item");
                // Process each post in turn
                for (Element postItem : postItems) {
                    // Like a jQuery selector: get the post title element
                    Elements titleEle = postItem.select(".post_item_body a[class='titlelnk']");
                    System.out.println("Post title: " + titleEle.text());
                    System.out.println("Post URL: " + titleEle.attr("href"));
                    // Like a jQuery selector: get the post author element
                    Elements footEle = postItem.select(".post_item_foot a[class='lightblue']");
                    System.out.println("Post author: " + footEle.text());
                    System.out.println("Author page: " + footEle.attr("href"));
                    System.out.println("*********************************");
                }

            } else {
                // If the status is not 200, e.g. 404 (page not found), handle it as appropriate; omitted here
                System.out.println("Response status is not 200");
                System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // 7. Close the response and the client
            HttpClientUtils.closeQuietly(response);
            HttpClientUtils.closeQuietly(httpClient);
        }
    }
}