Scraping Web Page Data with Jsoup on Android

Date: 2019-11-12

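Another month has slipped by before I noticed. I have been fairly busy lately, so I will skip the small talk and go straight to today's topic.

"Jsoup: Java HTML Parser, with best of DOM, CSS, and jquery." As that tagline suggests, Jsoup is a library that makes it convenient to parse HTML from Java and Android.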

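HTML Tags

To scrape someone else's HTML, you first need some basic knowledge of HTML.

That means the commonly used tags and their attributes. I will not go into them here; any related questions can be looked up at www.w3school.com.cn.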

Loading a Web Page

The simplest case is to load a page directly:

Document document = Jsoup.connect("https://www.google.com").get();

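Seeing the get() call at the end, you have probably guessed that there is a corresponding post() method as well.

The details of the HTTP request are also configurable, including headers, request parameters, the request timeout, and so on. Beyond that, local files and input streams can be parsed directly too.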

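Document document = Jsoup.connect("https://android-arsenal.com")
        .timeout(5000)              // request timeout in milliseconds
        .cookie("cookie", "cxxx")   // placeholder cookie name and value
        .header("xx", "xx")         // placeholder request header
        .userAgent("")              // fill in the User-Agent string to send
        .get();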
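Parsing without a network request can be sketched roughly as follows. This is a minimal illustration, not the author's code: the class name, the sample markup, and the https://example.com/ base URI are all assumptions made for the example. It uses Jsoup.parse() on an in-memory String and on an InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LocalParseExample {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
            + "<body><p>Hello Jsoup</p></body></html>";

    // Parse HTML that is already in memory as a String.
    static String titleFromString() {
        Document doc = Jsoup.parse(HTML);
        return doc.title();
    }

    // Parse HTML from an InputStream; the base URI (assumed here) resolves relative links.
    static String paragraphFromStream() {
        InputStream in = new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8));
        try {
            Document doc = Jsoup.parse(in, "UTF-8", "https://example.com/");
            return doc.select("p").text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleFromString());
        System.out.println(paragraphFromStream());
    }
}
```

For a file on disk, Jsoup.parse(File, String charsetName) follows the same pattern.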

Basic Tag Parsing

At this point we have a Document object, which wraps the entire requested page. Everything we need can be obtained from it.

Suppose we have the following HTML snippet to parse:

<div class="project-info clearfix">
    <div class="header">
        <div class="title">
            <a href="/details/1/5442">RendererRecyclerViewAdapter</a>
            <a class="tags" href="/tag/199">Recycler Views</a>
        </div>
        <a class="badge free" href="/free">Free</a>
        <a class="badge new" href="/recent">New</a>
    </div>
    <div class="desc">
        <p>A single adapter for the whole project.</p>
        <ul>
        <li>Now you do not need to implement adapters for RecyclerView.</li>
        <li>You can easily use several types of cells in a single list.</li>
        <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
        </ul>
    </div>
    <div class="ftr l"><i class="fa fa-calendar"></i> Mar 17, 2017</div>
</div>

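Jsoup looks up tags with the select() method, which is remarkably powerful. Let's take it one step at a time.

Say we want to find <div class="project-info clearfix"> in a sea of tags. What we want is the equivalent of a findElementByClass() lookup (not an actual Jsoup method). So how does Jsoup express this?

Easy, surely it is document.select("div.project-info clearfix")? Of course not. What does the space inside the class attribute mean? It separates two class names, whereas a space in a selector means "descendant", so that query would look for a <clearfix> tag inside div.project-info. The correct form is document.select("div.project-info.clearfix"): the space is replaced with a dot.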

Elements select = document.select("div.project-info.clearfix");

What we get back is a collection (Elements).

Next we iterate over this collection and pull out each tag inside it.
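The difference between the two selectors can be checked with a small sketch. The class name and the trimmed-down markup are assumptions for illustration; only select(), size(), and className() from Jsoup are used.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MultiClassSelectExample {
    // Trimmed-down copy of the snippet above (illustrative only).
    static final String HTML =
            "<div class=\"project-info clearfix\">"
            + "<div class=\"title\">"
            + "<a href=\"/details/1/5442\">RendererRecyclerViewAdapter</a>"
            + "</div>"
            + "</div>";

    // Counts how many elements a CSS query matches in the sample markup.
    static int countMatches(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Space = descendant combinator: looks for a <clearfix> tag inside div.project-info.
        System.out.println(countMatches("div.project-info clearfix"));
        // Chained dots = a single element that carries both classes.
        System.out.println(countMatches("div.project-info.clearfix"));

        // Iterate over the matched collection.
        Elements projects = Jsoup.parse(HTML).select("div.project-info.clearfix");
        for (Element project : projects) {
            System.out.println(project.className());
        }
    }
}
```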

Now for the title part: here a <div> has <a> tags nested inside it, so we need to parse <a> tags. We want both the href and the text, and Jsoup provides two methods for that, attr() and text().

// e is one Element from the collection selected above
Elements elements = e.select("div.title");
for (Element titleDiv : elements) {
    // First link with an href: the project name and its detail URL
    Element first = titleDiv.select("a[href]").first();
    if (first != null) {
        String title = first.text();
        String titleUrl = first.attr("href");
        System.out.println("Name: " + title);
        System.out.println("Detail URL: " + titleUrl);
    }

    // Link with class "tags": the tag name and its URL
    Elements tagLinks = titleDiv.select("a.tags");
    if (!tagLinks.isEmpty()) {
        String tag = tagLinks.text();
        String tagUrl = tagLinks.attr("href");
        System.out.println("tags:" + tag);
        System.out.println("tagUrl:" + tagUrl);
    }
}

Nested Parsing

That more or less covers <div> and <a> tags. Next comes parsing <div class="desc">.

<div class="desc">
    <p>A single adapter for the whole project.</p>
    <ul>
    <li>Now you do not need to implement adapters for RecyclerView.</li>
    <li>You can easily use several types of cells in a single list.</li>
    <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
    </ul>
</div>

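Now we also have <ul> and <li>. The idea is much the same, but these tags have neither a class nor an id. So how should we parse them?

We come back to select() again, this time with a selector that specifies the nesting level: the > combinator matches direct children.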

Elements select1 = e.select("div.desc > p");
// toString() returns the matched elements' outer HTML; use text() for just the text
String s = select1.toString();
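The same child combinator can reach the <li> items as well. The following is a minimal sketch under assumptions: the class name and the shortened markup are made up for the example, and eachText() is used to collect the text of each match.

```java
import java.util.List;

import org.jsoup.Jsoup;

public class ChildSelectorExample {
    // Shortened copy of the <div class="desc"> snippet above (illustrative only).
    static final String HTML =
            "<div class=\"desc\">"
            + "<p>A single adapter for the whole project.</p>"
            + "<ul>"
            + "<li>Now you do not need to implement adapters for RecyclerView.</li>"
            + "<li>You can easily use several types of cells in a single list.</li>"
            + "</ul>"
            + "</div>";

    // Text of the <p> that is a direct child of div.desc.
    static String summary() {
        return Jsoup.parse(HTML).select("div.desc > p").text();
    }

    // Text of every <li> reached through div.desc > ul > li.
    static List<String> bulletPoints() {
        return Jsoup.parse(HTML).select("div.desc > ul > li").eachText();
    }

    public static void main(String[] args) {
        System.out.println(summary());
        for (String point : bulletPoints()) {
            System.out.println("- " + point);
        }
    }
}
```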

For tags such as <dt> and <dd>, you can use the + combinator. For example, if I only want the tag name and URL that sit under "Tag", how should that be written?

<dt>Tag</dt>
<dd><a href="/tag/9">Background Processing</a></dd>
<dt>License</dt>
<dd><a href="http://opensource.org/licenses/Apache-2.0" rel="nofollow" target="_blank">Apache License, Version 2.0</a>
</dd>

The code looks like this, and it happens to introduce a more advanced, combined form of the select() query.

Elements select4 = element.select("dt:contains(Tag) + dd");

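In short, dt:contains(Tag) + dd selects the <dd> that immediately follows a <dt> whose text contains "Tag". The screenshot describes these selectors clearly, so little more explanation is needed; the last one it lists supports regular-expression matching.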
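As a rough, self-contained illustration of the dt/dd lookup: the class name is hypothetical, and wrapping the snippet in a <dl> is an assumption, since the original fragment does not show its parent element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AdjacentDdExample {
    // The <dl> wrapper is assumed; the original fragment shows only the dt/dd pairs.
    static final String HTML =
            "<dl>"
            + "<dt>Tag</dt>"
            + "<dd><a href=\"/tag/9\">Background Processing</a></dd>"
            + "<dt>License</dt>"
            + "<dd><a href=\"http://opensource.org/licenses/Apache-2.0\">"
            + "Apache License, Version 2.0</a></dd>"
            + "</dl>";

    // First link inside the <dd> that directly follows <dt>Tag</dt>, or null if none.
    static Element tagLink() {
        return Jsoup.parse(HTML).select("dt:contains(Tag) + dd > a").first();
    }

    public static void main(String[] args) {
        Element link = tagLink();
        if (link != null) {
            System.out.println(link.text());
            System.out.println(link.attr("href"));
        }
    }
}
```

Only the link under "Tag" is returned; the License entry is skipped because its <dt> text does not contain "Tag".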

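Adjacent Sibling Parsing

Another case is when the tag we need has no specific id or class, and has no direct parent tag or fixed nesting relationship to rely on, as in the following: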

<a id="favoriteButton" href="#" class="fa fa-star-o favorite tshadow" title="Add to favorites"></a> 
<a href="/details/1/5244">ImmediateLooperScheduler</a> <div id="githubInfoValue">

Here we only want the second <a> tag. How do we get to it? This is where the nextElementSibling() method comes in.

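// h1 is the element that contains both links
Element ssa = h1.select("a#favoriteButton").first();
// first() returns null when nothing matches, so guard before using the result
Element element = ssa != null ? ssa.nextElementSibling() : null;
String title = element != null ? element.text() : "";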
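A self-contained version of the sibling lookup might look like the sketch below. The class name is hypothetical, and the enclosing <h1> is an assumption inferred from the h1 variable above; the original fragment does not show the parent element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingExample {
    // The <h1> wrapper is assumed for illustration.
    static final String HTML =
            "<h1>"
            + "<a id=\"favoriteButton\" href=\"#\" class=\"fa fa-star-o favorite tshadow\""
            + " title=\"Add to favorites\"></a> "
            + "<a href=\"/details/1/5244\">ImmediateLooperScheduler</a>"
            + "</h1>";

    // Text of the element that follows a#favoriteButton, or "" if either is missing.
    static String titleAfterFavoriteButton() {
        Document document = Jsoup.parse(HTML);
        Element button = document.select("a#favoriteButton").first();
        if (button == null) {
            return "";
        }
        // nextElementSibling() skips text nodes such as the whitespace between the links.
        Element next = button.nextElementSibling();
        return next != null ? next.text() : "";
    }

    public static void main(String[] args) {
        System.out.println(titleAfterFavoriteButton());
    }
}
```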

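Fuzzy Matching

Sometimes all we know is that an attribute starts with something, ends with something, or contains a certain word. That is when fuzzy matching is needed.

Jsoup's select() has syntax for each of these cases: a[href^=http] matches a prefix, a[href$=.jpg] matches a suffix, and a[href*=/search/] matches a substring.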
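The three attribute selectors can be tried against a few links. This is a minimal sketch: the class name, the link targets, and the link texts are invented for the example.

```java
import org.jsoup.Jsoup;

public class AttributeMatchExample {
    // Hypothetical links, one for each matching rule.
    static final String HTML =
            "<a href=\"http://example.com/page\">absolute</a>"
            + "<a href=\"/images/photo.jpg\">photo</a>"
            + "<a href=\"/search/?q=jsoup\">search</a>";

    // Text of all elements matched by the given CSS query.
    static String textOf(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("a[href^=http]"));     // href starts with "http"
        System.out.println(textOf("a[href$=.jpg]"));     // href ends with ".jpg"
        System.out.println(textOf("a[href*=/search/]")); // href contains "/search/"
    }
}
```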

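Parsing JavaScript

Everything so far has dealt with ordinary tags and their content. What if I want script tags and their content? That is not hard either; the only difference is that at the end you call data() instead of text().

In short, the most fundamental part of using Jsoup is writing the select() query well: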

final Elements script = document.select("script");

String js = script.first().data();
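The difference between data() and text() on a script tag can be seen in a short sketch. The class name and the inline script are assumptions for illustration, and a null check is added because first() returns null when the page has no script tag.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScriptDataExample {
    // Hypothetical page with one inline script.
    static final String HTML =
            "<html><head><script>var answer = 42;</script></head>"
            + "<body><p>text</p></body></html>";

    // First <script> element in the sample page, or null if there is none.
    static Element firstScript() {
        Document document = Jsoup.parse(HTML);
        return document.select("script").first();
    }

    public static void main(String[] args) {
        Element script = firstScript();
        if (script != null) {
            // data() returns the script source; text() ignores data nodes.
            System.out.println("data: " + script.data());
            System.out.println("text is empty: " + script.text().isEmpty());
        }
    }
}
```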