Fetching Page Content with HttpClient and with Jsoup's Built-in Fetch

Date: 2019-11-30


Fetching page content directly with HttpClient

Usage:

Sending a request and receiving a response with HttpClient is simple. It usually takes the following steps.

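1. Create an HttpClient object.

2. Create an instance of the request method and specify the request URL. To send a GET request, create an HttpGet object; to send a POST request, create an HttpPost object.

3. If request parameters need to be sent, call the setParams(HttpParams params) method shared by HttpGet and HttpPost to add them (this method is deprecated since HttpClient 4.3); for an HttpPost object, you can also call setEntity(HttpEntity entity) to set the request parameters.

4. Call the HttpClient object's execute(HttpUriRequest request) to send the request; this method returns an HttpResponse.

5. Call HttpResponse's getAllHeaders(), getHeaders(String name), and similar methods to get the server's response headers; call HttpResponse's getEntity() to get the HttpEntity object, which wraps the server's response content. The program reads the response content through this object.

6. Release the connection. The connection must be released whether or not the request succeeds.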

public String cawl(String url) {
    // try-with-resources releases the client and the response even if the request fails (step 6)
    try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();           // create the client
         CloseableHttpResponse httpResponse = httpClient.execute(new HttpGet(url))) {   // fetch the page
        // Convert the response entity to a string
        return EntityUtils.toString(httpResponse.getEntity());
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

// Parse the HTML string with parse()
Document doc = Jsoup.parse(result);

Using Jsoup's connect().get()

Document doc = Jsoup.connect(url).get(); // throws IOException, so wrap it in try...catch

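Notes

The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs while fetching the HTML from the URL, an IOException is thrown, which should be handled appropriately.

For more, see the Jsoup API documentation: http://www.open-open.com/jsoup/load-document-from-url.htm

Appendix: main class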

package test;

import imple.Impl;
import model.Model;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class TestMain {

    // Index of the next chapter to fetch; atomic so that no two threads claim the same chapter
    public static final AtomicInteger i = new AtomicInteger(0);
    Impl impl = new Impl();
    List<Model> models = impl.getNovel("http://www.biquge.tw/0_5/");

    public static void main(String[] args) {
        MyThread tw = new TestMain().new MyThread();
        // Start ten threads that share one Runnable and one chapter list
        for (int j = 0; j < 10; j++) new Thread(tw).start();
    }

    class MyThread implements Runnable {
        public void run() {
            int n;
            // Claim the next chapter index; stop once every chapter has been claimed
            while ((n = i.getAndIncrement()) < models.size()) {
                Model model = models.get(n);
                // String r = impl.cawl(model.getUrl());
                Document d;
                try {
                    d = Jsoup.connect(model.getUrl()).get();
                } catch (IOException e) {
                    e.printStackTrace();
                    continue; // skip this chapter instead of dereferencing a null Document
                }
                model.setContent(d.select("#content").text());

                File file = new File("E:\\完美世界\\" + model.getTitle() + ".txt");
                // try-with-resources closes the writer even if write() fails
                try (FileWriter writer = new FileWriter(file)) {
                    writer.write(model.getContent());
                } catch (IOException e) {
                    e.printStackTrace();
                }
                System.out.println("Chapter " + (n + 1) + " written");
            }
        }
    }
}
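The ten threads above all draw chapters from one shared index. As a separate, library-free illustration of that pattern, the sketch below hands out indices through an AtomicInteger inside a fixed thread pool, so each index is claimed exactly once. The names ChapterQueueSketch and processAll, and the empty worker body, are hypothetical and not part of the original program; this is one possible arrangement, not the only one.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

class ChapterQueueSketch {

    /**
     * Runs work once for every index in [0, total) using the given number of threads.
     * Returns the indices that completed, in completion order.
     */
    static List<Integer> processAll(int total, int threads, IntConsumer work) {
        AtomicInteger next = new AtomicInteger(0);          // next unclaimed index
        List<Integer> done = new CopyOnWriteArrayList<>();  // thread-safe result list
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                int n;
                // getAndIncrement is atomic, so no two workers receive the same index
                while ((n = next.getAndIncrement()) < total) {
                    try {
                        work.accept(n);
                        done.add(n);
                    } catch (RuntimeException e) {
                        // One failed item should not stop the worker
                        System.err.println("item " + n + " failed: " + e);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the workers to finish before returning the results
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }

    public static void main(String[] args) {
        // Placeholder work: a real crawler would fetch and save chapter n here
        List<Integer> done = processAll(25, 10, n -> { });
        System.out.println(done.size() + " items processed");
    }
}
```

Compared with a plain shared int, the atomic counter avoids two threads reading the same value and writing the same chapter twice, and the loop condition uses < rather than != so a worker cannot run past the end of the list.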

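Summary

I had not used Jsoup before; for making HTTP requests and fetching page content I had always used URLConnection. Today I used both approaches described here to crawl a novel, and ran into a few problems.

At first I wanted to store every URL, chapter title, and chapter body in an array list, but that took too long and the crawl was prone to errors. I then switched to a single-threaded crawl: a for loop stored each chapter in the array list and, at the same time, created a .txt file locally, printing "xxx written successfully" after each successful write. That ran into another problem: at around chapter 1100 the program reported a timeout. It seems that Jsoup's connect() method for connecting to a URL has a time limit. In the end, I only succeeded by using ten threads to crawl the novel concurrently.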
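A timeout partway through a long crawl is usually a transient failure, so one common response is to retry the request after a short pause rather than abort. The sketch below shows that idea using only the JDK; RetrySketch, IoCall, and withRetry are hypothetical names introduced for illustration, and the fetch is simulated rather than a real network call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.SocketTimeoutException;

class RetrySketch {

    /** A fetch that may fail with an IOException, for example a timed-out request. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /**
     * Calls action up to maxAttempts times, pausing between attempts
     * (the pause doubles each time). If every attempt fails, the last
     * IOException is rethrown wrapped in an UncheckedIOException.
     */
    static <T> T withRetry(IoCall<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no attempts left
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the thread is interrupted
                }
                delay *= 2;
            }
        }
        throw new UncheckedIOException(last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated fetch: times out twice, then succeeds
        String page = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "<html>ok</html>";
        }, 5, 10);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

With Jsoup, the action passed in could be something like () -> Jsoup.connect(url).timeout(10000).get(), where timeout(int millis) is the Connection setter that changes the request timeout. The values 5 attempts and 10000 ms are illustrative assumptions, not tuned recommendations.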
