How Page Scraping Works

Date: 2019-11-29

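Page scraping works by locating and filtering nodes in the page's HTML, using a multi-level selector that walks from an outer container down to the target element.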

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class TestPreview {
    public static void main(String[] args) throws IOException {
        method1();
    }

    private static void method1() throws IOException {
        // User agent: to find one, open any page (for example Baidu) in Firefox, press F12,
        // go to Network > All, click any request, and copy the User-Agent request header.
        Document document = Jsoup
                .connect("http://www.cnblogs.com/yanan7890/")
                .timeout(10000)
                .ignoreContentType(true)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0")
                .get();
        // System.out.println(document); // prints the entire document
        Elements es = document.select("#centercontent > div.day > div.postTitle > a");
        Element e = es.first(); // first matching element, or null if nothing matched
        if (e == null) {
            System.out.println("No element matched the selector");
            return;
        }
        // text() returns "" when the element has no text content
        String text = e.text();
        String html = e.toString();
        System.out.println(text); // the element's text content
        System.out.println(html); // the element's outer HTML
    }
}
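
To see how the multi-level selector narrows the match without making a network request, the same selector can be run against an inline HTML string. The sketch below is only an illustration: the class name SelectorDemo, the helper selectTitles, and the sample markup are hypothetical and merely imitate the structure the selector expects. They are not copied from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    // Hypothetical markup (an assumption, not the real page source).
    static final String SAMPLE_HTML =
            "<div id=\"centercontent\">"
            + "<div class=\"day\">"
            + "<div class=\"postTitle\"><a href=\"/p/1.html\">First post</a></div>"
            + "<div class=\"postCon\"><a href=\"/p/1.html\">Read more</a></div>"
            + "</div>"
            + "<div class=\"day\">"
            + "<div class=\"postTitle\"><a href=\"/p/2.html\">Second post</a></div>"
            + "</div>"
            + "</div>";

    // Parses the HTML and applies the same multi-level selector as above.
    // Each ">" requires a direct parent-child relationship between the two levels.
    static Elements selectTitles(String html) {
        Document document = Jsoup.parse(html);
        return document.select("#centercontent > div.day > div.postTitle > a");
    }

    public static void main(String[] args) {
        for (Element link : selectTitles(SAMPLE_HTML)) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

With this sample markup the selector matches the two title links and skips the "Read more" link, because that link's parent is div.postCon rather than div.postTitle.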
