jsoup web scraping

Date: 2020-08-02


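Introduction to jsoup

jsoup is an HTML parser for Java. It can parse content fetched from a URL, HTML held in a string, and so on. Its API resembles jQuery: you locate data through the DOM and then read or modify it. To use it, add the jsoup jar to your project.

jsoup can load an HTML document from a string, a URL, or a local file and produce a Document object. All further work on the document's data goes through that Document.

For example: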

// From a URL
Document docFromUrl = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();


// From an HTML string
String html = "<html><head></head> <body><p>#####</p></body></html>";
Document docFromString = Jsoup.parse(html);

// From a file; the third argument is the base URI used to resolve relative links
File input = new File("D:/test.html");
Document docFromFile = Jsoup.parse(input, "UTF-8", "http://www.cnblogs.com/wishyouhappy");

Reading data, for example:

Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();
System.out.println(doc.title());
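
The same call works on a document parsed from a string, which makes it easy to try without network access. The following is a minimal sketch; the class name TitleDemo, the helper titleOf, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleDemo {
    // Parses an in-memory HTML string and returns the text of its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample page, not taken from the blog above.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, jsoup</p></body></html>";
        System.out.println(titleOf(html));                   // the title text
        System.out.println(Jsoup.parse(html).body().text()); // the visible body text
    }
}
```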

Commonly used functions

Parsing:

static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)   
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)
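
To illustrate two of the overloads listed above, here is a small sketch of what the baseUri argument is for (resolving relative links) and what parseBodyFragment does (wrapping a fragment in a full document). The class name BaseUriDemo, the helper names, and the example.com address are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Resolves the first link's href against the base URI passed to parse().
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    // Parses a fragment as body content and returns the body's inner HTML.
    static String fragmentBody(String fragment) {
        return Jsoup.parseBodyFragment(fragment).body().html();
    }

    public static void main(String[] args) {
        // A relative href becomes absolute once a base URI is known.
        System.out.println(firstAbsoluteLink("<a href=\"/about\">About</a>", "http://example.com/"));
        // The fragment ends up inside an automatically created <body>.
        System.out.println(fragmentBody("<p>Just a fragment</p>"));
    }
}
```

Without a base URI, absUrl("href") cannot resolve a relative link and returns an empty string.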

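URL connections:

static Connection connect(String url) // create a connection to the given URL (must be http or https)

Connection cookie(String name, String value) // add a cookie to the request
Connection data(Map<String,String> data) // add request parameters
Connection data(String... keyvals) // add request parameters as key/value pairs

Document get() // send the request as a GET and parse the response
Document post() // send the request as a POST and parse the response

Connection userAgent(String userAgent) // set the User-Agent header
Connection header(String name, String value) // add a request header
Connection referrer(String referrer) // set the referrer of the request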
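
The methods above return the Connection itself, so they can be chained. As a rough sketch, the following builds a configured connection and inspects the pending request without sending it; the class name ConnectionDemo, the header values, and the example.com URL are invented for this illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionDemo {
    // Builds a configured Connection; nothing is sent until get() or post() is called.
    static Connection build(String url) {
        return Jsoup.connect(url)
                .userAgent("demo-agent/1.0")        // hypothetical User-Agent value
                .header("Accept-Language", "en")    // extra request header
                .referrer("http://example.com/")    // referrer of the request
                .cookie("session", "abc123")        // hypothetical cookie
                .data("query", "Java")              // request parameter
                .timeout(3000);                     // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection conn = build("http://example.com/search");
        Connection.Request req = conn.request();
        System.out.println(req.header("User-Agent"));
        System.out.println(req.cookie("session"));
        System.out.println(req.timeout());
        // Calling conn.get() or conn.post() here would perform the HTTP request
        // and return the parsed Document.
    }
}
```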

Getting HTML elements:

getElementById(String id) // find an element by id
getElementsByTag(String tag) // find elements by tag name
getElementsByClass(String className) // find elements by class name
getElementsByAttribute(String key)  // find elements that have the given attribute

siblingElements()
firstElementSibling()
lastElementSibling()
nextElementSibling()
previousElementSibling()
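
A short sketch of how these lookup and sibling methods fit together, using a made-up list. The class name TraversalDemo and the sample markup are assumptions, not part of the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TraversalDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<ul id=\"menu\"><li class=\"item\">One</li>"
          + "<li class=\"item\">Two</li><li class=\"item\">Three</li></ul>";

    // Returns the second <li> of the #menu list.
    static Element secondItem() {
        Document doc = Jsoup.parse(HTML);
        Element menu = doc.getElementById("menu");     // lookup by id
        Elements items = menu.getElementsByTag("li");  // lookup by tag name
        return items.get(1);
    }

    public static void main(String[] args) {
        Element second = secondItem();
        System.out.println(second.text());
        System.out.println(second.previousElementSibling().text());
        System.out.println(second.nextElementSibling().text());
        System.out.println(second.firstElementSibling().text());
        System.out.println(second.lastElementSibling().text());
        System.out.println(second.siblingElements().size()); // siblings, excluding itself
    }
}
```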

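Getting and setting element values:

attr(String key)  // get an attribute's value
attr(String key, String value) // set an attribute's value
attributes() // get all attributes
id()
className()
classNames()
text() // get the text content
text(String value) // set the text content
html() // get the inner HTML
html(String value) // set the inner HTML
outerHtml()
data()
tag()  // get the Tag object
tagName() // get the tag name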
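
The getters and setters above can be tried on a small in-memory document. This is only a sketch; the class name AttrDemo and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AttrDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<div id=\"box\" class=\"note big\"><a href=\"/a\">Link <b>text</b></a></div>";

    // Parses the sample and returns the element with id "box".
    static Element box() {
        return Jsoup.parse(HTML).getElementById("box");
    }

    public static void main(String[] args) {
        Element box = box();
        Element link = box.getElementsByTag("a").first();

        System.out.println(box.id());                         // id attribute
        System.out.println(box.className());                  // raw class attribute
        System.out.println(box.classNames().contains("big")); // class names as a set
        System.out.println(link.attr("href"));                // attribute value
        System.out.println(link.text());                      // text without markup
        System.out.println(link.html());                      // inner HTML
        System.out.println(link.tagName());                   // tag name

        link.attr("href", "/b");  // overwrite an attribute
        link.text("Changed");     // replace the content with plain text
        System.out.println(link.outerHtml());                 // element including its own tag
    }
}
```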

Adding elements:

append(String html)
prepend(String html)
appendText(String text)
prependText(String text)
appendElement(String tagName)
prependElement(String tagName)
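
As an illustration of the difference between the HTML, text, and element variants, here is a minimal sketch. The class name AppendDemo, the helper names, and the sample markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AppendDemo {
    // Starts from a list containing only "b" and adds items around it.
    static Element buildList() {
        Document doc = Jsoup.parseBodyFragment("<ul id=\"list\"><li>b</li></ul>");
        Element list = doc.getElementById("list");
        list.prependElement("li").text("a"); // new first child, returned so text can be set
        list.appendElement("li").text("c");  // new last child
        list.append("<li>d</li>");           // HTML is parsed and appended after "c"
        return list;
    }

    // Surrounds the text of a paragraph with brackets using text nodes.
    static String bracket(String word) {
        Element p = Jsoup.parseBodyFragment("<p>" + word + "</p>").select("p").first();
        p.prependText("[").appendText("]");
        return p.text();
    }

    public static void main(String[] args) {
        for (Element item : buildList().children()) {
            System.out.println(item.text());
        }
        System.out.println(bracket("jsoup"));
    }
}
```

Note that appendElement and prependElement return the new child, while append, prepend, appendText, and prependText return the element they were called on.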

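Selectors:

tagname	Find elements by tag name, e.g. a
ns|tag	Find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id	Find elements by id, e.g. #logo
.class	Find elements by class name, e.g. .head
[attribute]	Find elements that have the attribute, e.g. [href] matches all elements with an href attribute
[^attr]	Find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]	Find elements by attribute value, e.g. [width=500] matches all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]	Find elements whose attribute value starts with, ends with, or contains value, respectively
[attr~=regex]	Filter attribute values with a regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*	Matches all elements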

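el#id	Elements of the given tag with the given id, e.g. a#logo -> <a id=logo href= … >
el.class	Elements of the given tag with the given class, e.g. div.head -> <div class="head">xxxx</div>
el[attr]	Elements of the given tag that define the attribute, e.g. a[href]
Any combination of the above three	e.g. a[href]#logo, a[name].outerlink
ancestor child	These five forms are combinators that relate elements to one another: descendant, parent and child, adjacent sibling, general sibling, and grouping.
parent > child
siblingA + siblingB
siblingA ~ siblingX
el, el, el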
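
To make the selector syntax above concrete, the following sketch counts matches for several selectors against a small made-up document via Document.select. The class name SelectorDemo, the count helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<div class=\"head\"><a id=\"logo\" href=\"/\">Home</a></div>"
          + "<div class=\"main\"><a href=\"/a.png\">A</a><a name=\"x\" class=\"outerlink\">B</a>"
          + "<img src=\"pic.PNG\"><img src=\"doc.pdf\"><p data-id=\"7\">text</p></div>";

    // Returns how many elements in the HTML match the CSS-like selector.
    static int count(String html, String selector) {
        return Jsoup.parse(html).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a",                              // by tag name
            "#logo",                          // by id
            "a[href]",                        // tag plus attribute
            "a[name].outerlink",              // tag plus attribute plus class
            "[^data-]",                       // attribute name prefix
            "img[src~=(?i)\\.(png|jpe?g)]",   // attribute value regex
            "div.head > a",                   // direct child
            "a + a",                          // adjacent sibling
            "a ~ img",                        // general sibling
            "a#logo, img"                     // grouping
        };
        for (String s : selectors) {
            System.out.println(s + " -> " + count(HTML, s));
        }
    }
}
```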

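:lt(n)	Elements whose sibling index is less than n (the index is zero-based), e.g. td:lt(3) matches the first three cells of each row
:gt(n)	Elements whose sibling index is greater than n, e.g. div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n)	Elements whose sibling index equals n, e.g. form input:eq(1) matches an input inside a form whose sibling index is 1
:has(selector)	Elements that contain a match for the selector, e.g. div:has(p) matches div elements that contain a p
:not(selector)	Elements that do not match the selector, e.g. div:not(.logo) matches all div elements without class="logo"
:contains(text)	Elements whose text, including that of descendants, contains the given text; case-insensitive, e.g. p:contains(oschina)
:containsOwn(text)	Elements whose own text, excluding descendants, contains the given text
:matches(regex)	Elements whose text matches the regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex)	Elements whose own text matches the regular expression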
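
Since the index-based and text-based pseudo selectors are easy to misread, here is a small sketch that applies them to made-up markup. The class name PseudoDemo, the texts helper, and the sample HTML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class PseudoDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div><p>Login here</p></div><div class=\"logo\">x</div>";

    // Returns the text of every match, joined with commas.
    static String texts(String selector) {
        StringBuilder sb = new StringBuilder();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            if (sb.length() > 0) sb.append(',');
            sb.append(e.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));               // cells with sibling index 0, 1, 2
        System.out.println(texts("td:gt(2)"));               // cells with sibling index above 2
        System.out.println(texts("td:eq(1)"));               // the cell with sibling index 1
        System.out.println(texts("div:has(p)"));             // div that contains a p
        System.out.println(texts("div:not(.logo)"));         // div without class "logo"
        System.out.println(texts("div:contains(login)"));    // text may be in a descendant
        System.out.println(texts("div:containsOwn(login)")); // only the div's own text counts
        System.out.println(texts("p:matchesOwn((?i)login)"));
    }
}
```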

Example:

package jsoup;

/**
 * Author: wish
 * Created: June 13, 2014, 1:22:49 PM
 */
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BlogCatch {
    /**
     * Entry point: posts a request to the blog and prints the page title.
     * @param args command-line arguments (unused)
     * @throws Exception if the request fails
     */
    public static void main(String[] args) throws Exception {

        // getArticleTitle("http://www.cnblogs.com/wishyouhappy");
        Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy")
                .data("query", "Java")   // request parameter
                .userAgent("I'm jsoup")  // set the User-Agent
                .cookie("auth", "token") // set a cookie
                .timeout(3000)           // set the request timeout in milliseconds
                .post();
        System.out.println(doc.title());
    }

    /**
     * Prints the body of the given HTML document.
     * @param html the HTML source as a string
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByString(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(doc.body());
    }
    
    /**
     * Fetches a document by URL and prints its body.
     * @param url the page to fetch
     * @throws IOException if the request fails
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByURL(String url) throws IOException {
        // Load the HTML document directly from the URL
        Document doc2 = Jsoup.connect(url).get();
        String body = doc2.body().toString();
        System.out.println(body);
    }

    /**
     * Prints the title and link of each article on the blog.
     * @param url the blog's address
     */
    public static void getArticleTitle(String url) {
        Document doc;
        try {
            doc = Jsoup.connect(url).get();
            // Each article title sits in an element whose class attribute is "postTitle"
            Elements listDiv = doc.getElementsByAttributeValue("class", "postTitle");
            for (Element element : listDiv) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    String linkHref = link.attr("href");
                    String linkText = link.text().trim();
                    System.out.println(linkHref);
                    System.out.println(linkText);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
    
    /**
     * Prints the content of the blog article at the given URL.
     * @param url the article's address
     */
    public static void getBlog(String url) {
        Document doc;
        try {
            doc = Jsoup.connect(url).get();
            // The article content sits in an element whose class attribute is "postBody"
            Elements listDiv = doc.getElementsByAttributeValue("class", "postBody");
            for (Element element : listDiv) {
                System.out.println(element.html());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}
