A Breadth-First Java Crawler Example (Scraping Fudan University News)

I. Technologies Used

This crawler is a small example from when I started learning crawling techniques about half a month ago. It is fairly simple, and I am summarizing it here before I forget it. The main external Jar packages used are HttpClient 4.3.4 and HtmlParser 2.1; the IDE is IntelliJ IDEA 13.1 and the Jar packages are managed with Maven. Readers who prefer not to use IntelliJ can create the project in Eclipse instead.

II. Crawler Basics

1. What is a web crawler? (the basic principle)

Breaking the term apart: the "web" refers to the Internet, which is structured like a spider web, and the "crawler" is like a spider that can crawl all over that web, processing the data it collects along the way.

The encyclopedia definition: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

Basic principle: a traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, while fetching pages, keeps extracting new URLs from the current page and adding them to a queue, until some stop condition of the system is met (the original flowchart is omitted here). A focused crawler has a more complex workflow: it filters out links unrelated to its topic using a page-analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next page URL to fetch from the queue according to a search strategy, and repeats the process until some stop condition of the system is reached.
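A minimal sketch of that loop, assuming hypothetical fetchPage and extractUrls helpers (placeholders only; the full implementation with HttpClient and HtmlParser follows in part III):

import java.util.*;

public class CrawlLoopSketch {
    //Hypothetical helpers: download a page and pull its links out.
    static String fetchPage(String url) { return ""; }
    static List<String> extractUrls(String html) { return new ArrayList<String>(); }

    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new LinkedList<String>(); //URLs waiting to be fetched
        Set<String> visited = new HashSet<String>();       //URLs already fetched
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;               //skip URLs seen before
            String html = fetchPage(url);                  //download the current page
            for (String link : extractUrls(html)) {        //extract new URLs
                if (!visited.contains(link)) frontier.add(link); //enqueue them
            }
        }
    }
}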

2. What are the common crawling strategies?

Page-crawling strategies fall into three categories: depth-first, breadth-first, and best-first. Depth-first crawling often leads to the crawler getting trapped, so breadth-first and best-first are the common choices today.

2.1 Breadth-First

Breadth-first traversal is a strategy for traversing a connected graph. It gets its name from the idea of starting at a vertex V0 and traversing outward radially, covering the broad region around it first.

Its basic idea:

1) Start from some vertex V0 in the graph and visit it;
2) From V0, visit each of V0's unvisited adjacent vertices W1, W2, …, Wk; then, starting from W1, W2, …, Wk in turn, visit each of their own unvisited adjacent vertices;
3) Repeat step 2 until all vertices have been visited.

(Figure: breadth-first traversal example — image omitted. A code sketch of these steps follows.)
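A minimal sketch of breadth-first traversal, assuming an adjacency-list graph representation (an illustrative choice for this sketch, not part of the crawler code below):

import java.util.*;

public class BfsSketch {
    //Breadth-first traversal from vertex start; returns vertices in visit order.
    public static List<Integer> bfs(Map<Integer, List<Integer>> graph, int start) {
        List<Integer> order = new ArrayList<Integer>();
        Set<Integer> visited = new HashSet<Integer>();
        Queue<Integer> queue = new LinkedList<Integer>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            int v = queue.poll();              //take the vertex at the head of the queue
            order.add(v);                      //visit it
            List<Integer> neighbours = graph.get(v);
            if (neighbours == null) continue;
            for (int w : neighbours) {
                if (visited.add(w)) {          //enqueue each not-yet-visited neighbour
                    queue.add(w);
                }
            }
        }
        return order;
    }
}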

2.2 Depth-First

Assuming that initially no vertex in the graph has been visited, depth-first search proceeds as follows (see the sketch after the figure note below):
1) Pick some vertex Vi in the graph as the starting point; visit and mark it;
2) With Vi as the current vertex, search Vi's adjacent vertices Vj one by one. If Vj has not been visited, visit and mark it; if Vj has already been visited, move on to Vi's next adjacent vertex;
3) With Vj as the current vertex, repeat step 2 until every vertex connected to Vi by a path has been visited;
4) If unvisited vertices remain (the disconnected case), take any unvisited vertex as a new starting point and repeat the process until every vertex in the graph has been visited.

(Figures: a directed graph and an undirected graph as examples — images omitted.)
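A matching depth-first sketch, written recursively over the same illustrative adjacency-list representation:

import java.util.*;

public class DfsSketch {
    //Depth-first traversal: exhaust one branch before backtracking to the next.
    public static void dfs(Map<Integer, List<Integer>> graph, int v,
                           Set<Integer> visited, List<Integer> order) {
        visited.add(v);                        //mark the current vertex
        order.add(v);                          //visit it
        List<Integer> neighbours = graph.get(v);
        if (neighbours == null) return;
        for (int w : neighbours) {
            if (!visited.contains(w)) {
                dfs(graph, w, visited, order); //go deeper before trying siblings
            }
        }
    }
}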

The difference between breadth-first and depth-first:

Breadth-first traversal proceeds level by level: it searches every node on one level before moving down to the next. Depth-first traversal instead searches every node along one branch before turning to the nodes of another branch.

 

2.3 Best-First Search

The best-first strategy uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and fetches only the one or few URLs with the best scores. It visits only pages that the analysis algorithm predicts to be "useful". This kind of search suits crawling deep-web data, where only content matching certain requirements is wanted; a sketch follows.
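A minimal sketch of best-first URL selection with a priority queue; the score function here is a hypothetical relevance predictor, not something defined in this article:

import java.util.*;

public class BestFirstSketch {
    //Hypothetical relevance predictor: higher means more likely on-topic.
    static double score(String url) {
        return url.contains("news") ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        //URLs come out of the queue in descending order of predicted score.
        PriorityQueue<String> frontier = new PriorityQueue<String>(16,
                new Comparator<String>() {
                    public int compare(String a, String b) {
                        return Double.compare(score(b), score(a));
                    }
                });
        frontier.add("http://example.com/about");
        frontier.add("http://example.com/news/1");
        while (!frontier.isEmpty()) {
            System.out.println("fetch next: " + frontier.poll());
        }
    }
}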

 

3. The structure crawled in this example

The example in this article crawls news. On a news site, the important and recent items are generally placed on the front page, and the deeper a page sits in the link hierarchy the less important it usually is, so the breadth-first algorithm is the better fit. (Figure: structure of the pages to be crawled — image omitted.)

III. Breadth-First Crawler Example

1. Requirement: crawl Fudan University news (at most 100 pages)

Only 100 pages are fetched here, and each URL must start with news.fudan.edu.cn.

2. Implementation

Pull in the external jar packages with Maven:

       <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.3.4</version>
        </dependency>
        <dependency>
            <groupId>org.htmlparser</groupId>
            <artifactId>htmlparser</artifactId>
            <version>2.1</version>
        </dependency>        

 

The program's main entry point:

package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {
    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        //Define a filter that keeps only links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.startsWith("http://news.fudan.edu.cn");
            }
        };
        //Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        //Loop condition: the unvisited-URL queue is not empty and no more than 100 pages have been fetched
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() < 100) {

            System.out.println("count:" + (++count));

            //Dequeue the URL at the head of the unvisited queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();
            DownLoadFile downloader = new DownLoadFile();
            //Download the page
            downloader.downloadFile(visitURL);
            //Add this URL to the visited set
            LinkQueue.addVisitedUrl(visitURL);
            //Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

            //Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }

    }

    public static void main(String args[]) {
        //Program entry point
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }

}

 

Utility class: Tools.java

package com.amos.tool;

import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.Locale;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.*;
import org.apache.http.client.CircularRedirectException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.RedirectStrategy;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.client.utils.URIUtils;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.HttpClientConnectionManager;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.*;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.Args;
import org.apache.http.util.Asserts;
import org.apache.http.util.TextUtils;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {


    /**
     * Write the content of an HTTP entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {

        try {

            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    /**
     * Write a byte array to a local file
     *
     * @param bytes
     * @param filename
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {

        try {

            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            fileOutputStream.write(bytes);
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    /**
     * Print to standard output
     * @param string
     */
    public static void println(String string){
        System.out.println("string:" + string);
    }
    /**
     * Print to standard error
     * @param string
     */
    public static void printlnerr(String string){
        System.err.println("string:" + string);
    }


    /**
     * Create an HttpClient that uses an SSL channel and retries failed requests
     * @return
     */
    public static CloseableHttpClient createSSLClientDefault() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                //Trust all certificates
                public boolean isTrusted(X509Certificate[] chain,String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            //Retry handler: a failed request will be retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection timed out
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            //Request configuration: 20-second connection-request and connect timeouts, circular redirects disallowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

    /**
     * Same as createSSLClientDefault, but with a CookieStore attached
     * @param cookieStore
     * @return
     */

    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                //Trust all certificates
                public boolean isTrusted(X509Certificate[] chain,String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            //Retry handler: a failed request will be retried up to 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection timed out
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            //Request configuration: 20-second connection-request and connect timeouts, circular redirects disallowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();


            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultCookieStore(cookieStore)
                    .build();

        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

}

 

The download class that writes pages to local disk: DownLoadFile.java

package com.amos.crawl;

import com.amos.tool.Configuration;
import com.amos.tool.Tools;
import org.apache.http.*;
import org.apache.http.client.HttpClient;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.impl.client.AutoRetryHttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HttpContext;

import javax.net.ssl.SSLException;
import java.io.*;
import java.net.UnknownHostException;


/**
 * Created by amosli on 14-7-9.
 */
public class DownLoadFile {

    public String getFileNameByUrl(String url, String contentType) {
        //Strip the protocol prefix: 7 characters for "http://", 8 for "https://"
        url = url.contains("http://") ? url.substring(7) : url.substring(8);

        //text/html content type
        if (url.contains(".html")) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_");
        } else if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Write page bytes to a local file
     * @param data
     * @param filePath
     */
    private void saveToLocal(byte[] data, String filePath) {

        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            out.write(data);   //write the whole buffer at once
            out.flush();
            out.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write the content of an HTTP entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {

        try {

            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }


    public String downloadFile(String url)  {

        //Path of the saved file
        String filePath = null;

        //1. Create an HttpClient instance and configure it
        HttpClient httpClient = Tools.createSSLClientDefault();

        //2. Create an HttpGet instance and configure it
        HttpGet httpGet = new HttpGet(url);

        //Set the GET request connect timeout to 5 seconds
        //Option 1 (deprecated params API):
        //httpGet.getParams().setParameter("connectTimeout", 5000);
        //Option 2 (RequestConfig):
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
        httpGet.setConfig(requestConfig);

        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed:" + httpResponse.getStatusLine());
                return null;   //do not save pages that were not fetched successfully
            }

            filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
            saveToLocal(httpResponse.getEntity(), filePath);

        } catch (Exception e) {
            e.printStackTrace();
        }

        return filePath;

    }



    public static void main(String args[]) throws IOException {
        String url = "http://websearch.fudan.edu.cn/search_dep.html";
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        Header contentType = httpResponse.getEntity().getContentType();

        System.out.println("name:" + contentType.getName() + "value:" + contentType.getValue());
        System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));

    }


}

Create a filter interface: LinkFilter.java

package com.amos.crawl;

/**
 * Created by amosli on 14-7-10.
 */
public interface LinkFilter {

    public boolean accept(String url);

}

 

Filtering URLs with HtmlParser: HtmlParserTool.java

package com.amos.crawl;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class HtmlParserTool {
    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();

        try {
            Parser parser = new Parser(url);
            parser.setEncoding("GBK");
            //Filter for <frame> tags, used to extract the src attribute inside them
            NodeFilter framFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().contains("frame src=");
                }
            };

            //OrFilter that accepts both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), framFilter);
            //Get all nodes that match the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    String linkURL = ((LinkTag) tag).getLink();

                    //Add the URL if it passes the filter
                    if (filter.accept(linkURL)) {
                        links.add(linkURL);
                    }

                } else {//<frame> tag
                    //Link in the frame's src attribute, e.g. <frame src="test.html" />
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);

                    //The attribute value ends at the first space, or at '>' if there is none
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    //Skip the leading src=" (5 characters) and drop the trailing quote
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }

            }

        } catch (Exception e) {
            e.printStackTrace();
        }

        return links;
    }


}
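For reference, a small hypothetical demo of extractLinks (the seed URL matches the one used by MyCrawler above; the class name is mine, not part of the original project):

package com.amos.crawl;

import java.util.Set;

public class ExtractLinksDemo {
    public static void main(String[] args) {
        //Extract the qualifying links from the seed page
        Set<String> links = HtmlParserTool.extractLinks("http://news.fudan.edu.cn/news/",
                new LinkFilter() {
                    public boolean accept(String url) {
                        return url.startsWith("http://news.fudan.edu.cn");
                    }
                });
        for (String link : links) {
            System.out.println("link:" + link);
        }
    }
}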

 


The queue implementation that manages page URLs: Queue.java

package com.amos.crawl;

import java.util.LinkedList;

/**
 * Created by amosli on 14-7-9.
 */
public class Queue {

    //Implement the queue with a linked list
    private LinkedList queueList = new LinkedList();


    //Enqueue
    public void enQueue(Object object) {
        queueList.addLast(object);
    }

    //Dequeue
    public Object deQueue() {
        return queueList.removeFirst();
    }

    //Check whether the queue is empty
    public boolean isQueueEmpty() {
        return queueList.isEmpty();
    }

    //Check whether the queue contains the given element
    public boolean contains(Object object) {
        return queueList.contains(object);
    }

    //Check whether the queue is empty (duplicate of isQueueEmpty)
    public boolean empty() {
        return queueList.isEmpty();
    }

}
 

 

Managing links entering and leaving the queues: LinkQueue.java

package com.amos.crawl;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-9.
 */
public class LinkQueue {
    //The set of already-visited URLs
    private static Set visitedUrl = new HashSet();
    //The queue of unvisited URLs
    private static Queue unVisitedUrl = new Queue();

    //Get the unvisited-URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }
    public static Set getVisitedUrl() {
        return visitedUrl;
    }
    //Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    //Remove a URL from the visited set
    public static void removeVisitedUrl(String url){
        visitedUrl.remove(url);
    }
    //Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue(){
        return unVisitedUrl.deQueue();
    }
    //Guarantee each URL is visited at most once: the url must be non-empty, must not be in the visited set, and (since visited URLs have already been dequeued) must not already be waiting in the unvisited queue
    public static void addUnvisitedUrl(String url){
        if(url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }
    //Get the number of visited URLs
    public static int getVisitedUrlNum(){
        return visitedUrl.size();
    }

    //Check whether the unvisited-URL queue is empty
    public static boolean isUnvisitedUrlsEmpty(){
        return unVisitedUrl.empty();
    }
}

 

The crawling approach: start from the seed URL ==> find the URLs that pass the filter and add them to the queue ==> take URLs out of the queue in order, visit each one, and collect its qualifying URLs ==> download the pages of the URLs in the queue. In other words, explore level by level, capped at 100 pages.

 

3. Screenshots (original images omitted)
