Java Web Crawler in Practice (8)

Date: 2019-12-08

Tags: java, web crawler | Category: Java


Previous post: Java Web Crawler in Practice (7)

Hello everyone. This post introduces the downloader object in the NetDiscovery crawler framework.

1) Introduction

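Object-oriented design is still a core idea in programming today. The screenshot below shows the main objects in the crawler framework:

[Screenshot: main objects of the crawler framework]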

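The program assembles a request locally and hands it to the downloader, which fetches the data from the network. The parser then processes that data locally and finally produces usable information.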
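The flow just described can be sketched in a few lines. This is only an illustration, not NetDiscovery's actual API: the names `SimpleRequest`, `SimpleDownloader`, and `SimpleParser` are hypothetical stand-ins, and the downloader here returns canned HTML instead of touching the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins for the framework's objects; not NetDiscovery's real classes.
class PipelineSketch {

    // A request only carries the target URL in this sketch.
    static final class SimpleRequest {
        final String url;
        SimpleRequest(String url) { this.url = url; }
    }

    // The downloader turns a request into raw content.
    interface SimpleDownloader {
        String download(SimpleRequest request);
    }

    // The parser turns raw content into usable information.
    interface SimpleParser {
        List<String> parse(String content);
    }

    // Wires the three objects together: request -> downloader -> parser.
    static List<String> crawl(SimpleRequest request, SimpleDownloader downloader, SimpleParser parser) {
        String content = downloader.download(request);
        return parser.parse(content);
    }

    // A fake downloader that returns canned HTML so the sketch runs offline.
    static SimpleDownloader cannedDownloader() {
        return request -> "<html><title>Hello from " + request.url + "</title></html>";
    }

    // A parser that extracts the text of every <title> element.
    static SimpleParser titleParser() {
        Pattern title = Pattern.compile("<title>(.*?)</title>");
        return content -> {
            List<String> titles = new ArrayList<>();
            Matcher m = title.matcher(content);
            while (m.find()) {
                titles.add(m.group(1));
            }
            return titles;
        };
    }

    public static void main(String[] args) {
        List<String> result = crawl(new SimpleRequest("example.test"), cannedDownloader(), titleParser());
        System.out.println(result);
    }
}
```

Because `crawl` depends only on the two interfaces, a real network downloader or a different parser could be swapped in without changing it.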

2) Introducing the downloader

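The downloader's main job is to access the network and bring back the data we want, for example HTML pages, JSON/XML data, or binary streams (images, Office documents, and so on). NetDiscovery currently supports several downloader implementations.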

Programming to interfaces is one of the key design ideas behind this framework.

The following walks through some of the downloader code. What these classes have in common is that they all implement the Downloader interface.

As a developer, you can also implement the interface com.cv4j.netdiscovery.core.downloader.Downloader to create your own downloader class.
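As a rough idea of what a custom downloader might look like, here is a minimal sketch. The method signatures of the real Downloader interface are not shown in this post, so the sketch declares its own stand-in interface, `SketchDownloader`; that interface and both implementing classes are assumptions for illustration only. It wraps another downloader and adds a simple in-memory cache.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CustomDownloaderSketch {

    // Stand-in for the framework's Downloader interface. The real
    // com.cv4j.netdiscovery.core.downloader.Downloader may declare different methods.
    interface SketchDownloader {
        byte[] download(String url);
    }

    // A downloader that pretends to fetch pages and counts how often it is called.
    static final class FakeNetworkDownloader implements SketchDownloader {
        int calls = 0;

        @Override
        public byte[] download(String url) {
            calls++;
            return ("content of " + url).getBytes(StandardCharsets.UTF_8);
        }
    }

    // A custom downloader that adds caching on top of any other implementation.
    static final class CachingDownloader implements SketchDownloader {
        private final SketchDownloader delegate;
        private final Map<String, byte[]> cache = new HashMap<>();

        CachingDownloader(SketchDownloader delegate) { this.delegate = delegate; }

        @Override
        public byte[] download(String url) {
            // Only hit the delegate the first time a URL is requested.
            return cache.computeIfAbsent(url, delegate::download);
        }
    }

    public static void main(String[] args) {
        FakeNetworkDownloader network = new FakeNetworkDownloader();
        SketchDownloader downloader = new CachingDownloader(network);

        downloader.download("http://example.test/a");
        downloader.download("http://example.test/a"); // served from the cache
        downloader.download("http://example.test/b");

        System.out.println("delegate calls: " + network.calls);
    }
}
```

The calling code only sees `SketchDownloader`, which is the practical benefit of programming to an interface: behavior such as caching can be layered on without the caller knowing.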

UrlConnectionDownloader: this one uses only packages that ship with the JDK, java.io and java.net.

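// 1. Build a URL object
url = new URL(request.getUrl());
// 2. Obtain an HttpURLConnection (openConnection() returns URLConnection, so a cast is needed)
conn = (HttpURLConnection) url.openConnection();
// 3. Configure the connection
conn.setDoOutput(true);
conn.setDoInput(true);
conn.setRequestMethod("POST");
......
// 4. Connect to the remote service
conn.connect();
// 5. If the call succeeds, read the result
conn.getResponseCode();
conn.getInputStream();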
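To make the five steps concrete, here is a self-contained sketch that runs without Internet access. It is not the framework's UrlConnectionDownloader: the class name `UrlConnectionSketch`, the `/echo` path, and the timeout values are assumptions made for illustration. It starts a throwaway local server with the JDK's bundled com.sun.net.httpserver package and sends a POST to it.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class UrlConnectionSketch {

    // Sends a POST with the given body and returns "<status> <response body>".
    static String post(String address, String body) {
        try {
            URL url = new URL(address);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);            // we are going to write a request body
            conn.setRequestMethod("POST");
            conn.setConnectTimeout(5000);      // illustrative values, in milliseconds
            conn.setReadTimeout(5000);
            // getOutputStream() connects implicitly, so no explicit connect() call is needed here.
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            int status = conn.getResponseCode();
            try (InputStream in = conn.getInputStream()) {
                return status + " " + new String(readAll(in), StandardCharsets.UTF_8);
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a stream to the end and returns its bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Starts a local server that echoes the request body, posts "hello" to it, then stops it.
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/echo", exchange -> {
                byte[] received = readAll(exchange.getRequestBody());
                exchange.sendResponseHeaders(200, received.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(received);
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return post("http://127.0.0.1:" + port + "/echo", "hello");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that `getInputStream()` throws an IOException for error status codes such as 404 or 500; a real downloader would check the status first and read `getErrorStream()` in that case.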

HttpClientDownloader: this one is built on the open-source Apache HttpClient library, so the code is more concise and elegant.

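// 1. Get an HttpManager instance (a wrapper provided by the framework itself)
HttpManager httpManager = HttpManager.get();
// 2. Pass in the request and wait for the result; the request class is also provided by the framework
httpManager.getResponse(request)
// 3. Once the result arrives, process it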
            @Override
            public Response apply(CloseableHttpResponse closeableHttpResponse) throws Exception {
                String charset = null;
                if (Preconditions.isNotBlank(request.getCharset())) {
                    charset = request.getCharset();  // for pages that are still encoded in GB2312
                } else {
                    charset = "UTF-8";
                }
                String html = EntityUtils.toString(closeableHttpResponse.getEntity(), charset);
                Response response = new Response();
                response.setContent(html.getBytes());
                response.setStatusCode(closeableHttpResponse.getStatusLine().getStatusCode());
                if (closeableHttpResponse.containsHeader("Content-Type")) {
                    response.setContentType(closeableHttpResponse.getFirstHeader("Content-Type").getValue());
                }

                return response;
            }
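The charset choice in the excerpt above matters more than it may look: decoding GB2312 bytes as UTF-8 produces garbled text. The following sketch isolates just that decision so it can be tried on its own. The class and method names (`CharsetFallbackSketch`, `resolve`, `decode`) are hypothetical, and it assumes the JDK in use includes the GB2312 charset, which full JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFallbackSketch {

    // Returns the requested charset, or UTF-8 when none was given.
    static Charset resolve(String requested) {
        if (requested == null || requested.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(requested);
    }

    // Decodes raw response bytes with the resolved charset.
    static String decode(byte[] body, String requested) {
        return new String(body, resolve(requested));
    }

    public static void main(String[] args) {
        // Two Chinese characters ("crawler"), written as Unicode escapes so the
        // source file compiles the same way under any source encoding.
        String original = "\u722c\u866b";
        byte[] gbBytes = original.getBytes(Charset.forName("GB2312"));

        // With the charset stated on the request, the text survives the round trip.
        System.out.println(decode(gbBytes, "GB2312").equals(original));
        // Falling back to UTF-8 for GB2312 bytes does not reproduce the original text.
        System.out.println(decode(gbBytes, null).equals(original));
    }
}
```

One related detail worth checking in real code: `String.getBytes()` with no argument uses the platform default charset, so passing an explicit charset when converting back to bytes avoids results that vary between machines.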