Writing a Crawler in Java by Hand: Part 3, the Crawl Queue

Part 3: Implementing the Crawl Queue

Part 2 implemented depth crawling, but one obvious problem is that each fetch is not executed as an independent task; the links found on a page are crawled serially. This post therefore focuses on crawling pages concurrently.

The goal: every link is crawled as its own independent job.

Design

Division of responsibilities

  • Each job is an independent crawl task and fetches only its own URL
  • A blocking queue holds all the URLs waiting to be crawled
  • A controller takes pending links off the queue and spawns a new task for each one

(Figure: crawler architecture diagram, 爬蟲.png)

Diagram walkthrough

  • Fetcher: takes a CrawlMeta from the queue, then creates a Job to execute it

  • Job: crawls the page described by the CrawlMeta, then hands the result to the ResultSelector

  • ResultSelector: analyzes the crawl result, extracts every link that satisfies the rules, wraps each one in a CrawlMeta and pushes it onto the queue

These three form a loop, which is enough to give us automatic depth crawling.

1. CrawlMeta

The meta object holds the URL to crawl, its selector rules and its link-filter rules. We now add a current-depth field that records which level this URL was found at, used to decide whether to stop crawling any deeper.

/**
 * Current crawl depth
 */
@Getter
@Setter
private int currentDepth = 0;

2. FetchQueue

This is the queue holding the pages waiting to be crawled. It contains two data structures:

  • toFetchQueue: a queue of CrawlMeta, all of them URLs that still need to be fetched
  • urls: the set of every URL already fetched or enqueued, used for deduplication

The source is below; a few points worth noting:

  • tag: kept in case the system ends up holding several crawl queues at once; if it does, the tag identifies what each queue is for
  • addSeed: it first checks whether the URL has already entered the queue, and if so it is not added to the pending queue again (compare this deduplication with the approach in the previous post); taking the head element of the queue is done without a lock, since ArrayBlockingQueue guarantees thread safety internally
/**
 * Queue of pages waiting to be crawled
 * <p>
 * Created by yihui on 2017/7/6.
 */
public class FetchQueue {

    public static FetchQueue DEFAULT_INSTANCE = newInstance("default");

    /**
     * Tag identifying this crawl queue
     */
    private String tag;


    /**
     * Queue of pages waiting to be fetched
     */
    private Queue<CrawlMeta> toFetchQueue = new ArrayBlockingQueue<>(200);


    /**
     * Set of every URL already fetched or enqueued, used for deduplication
     */
    private Set<String> urls = ConcurrentHashMap.newKeySet();


    private FetchQueue(String tag) {
        this.tag = tag;
    }


    public static FetchQueue newInstance(String tag) {
        return new FetchQueue(tag);
    }


    /**
     * Only enqueue a URL that has not been seen before; this avoids crawling the same page twice
     *
     * @param crawlMeta
     */
    public void addSeed(CrawlMeta crawlMeta) {
        if (urls.contains(crawlMeta.getUrl())) {
            return;
        }

        synchronized (this) {
            if (urls.contains(crawlMeta.getUrl())) {
                return;
            }


            urls.add(crawlMeta.getUrl());
            toFetchQueue.add(crawlMeta);
        }
    }


    public CrawlMeta pollSeed() {
        return toFetchQueue.poll();
    }
}
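
A quick usage sketch of the queue (the seed URL, the no-arg CrawlMeta constructor and setUrl are taken from the test code further down):

// Create a tagged queue, feed it one seed, then pull the next task to crawl.
FetchQueue queue = FetchQueue.newInstance("demo");

CrawlMeta seed = new CrawlMeta();
seed.setUrl("http://chengyu.t086.com/gushi/1.htm");
queue.addSeed(seed);                 // adding the same URL a second time is a no-op

CrawlMeta next = queue.pollSeed();   // returns the seed; null once the queue is empty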

3. DefaultAbstractCrawlJob

The default abstract crawl job. In Part 2's depth crawl, this job performed the entire depth crawl by itself; here we pull that apart so that each job crawls only its own page. The links inside the page are simply parsed, wrapped and dropped onto the queue, not fetched by this job.

Two new member variables are needed first:

/**
 * Queue of pending crawl tasks
 */
private FetchQueue fetchQueue;


/**
 * Parsed result
 */
private CrawlResult crawlResult;

Then the fetch logic is adjusted. The core logic is basically unchanged; the previous recursive call is simply replaced with pushing onto the queue:

/**
 * Fetch the page
 */
void doFetchPage() throws Exception {
    HttpResponse response = HttpUtils.request(this.crawlMeta, httpConf);
    String res = EntityUtils.toString(response.getEntity(), httpConf.getCode());
    if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { // request failed
        this.crawlResult = new CrawlResult();
        this.crawlResult.setStatus(response.getStatusLine().getStatusCode(), response.getStatusLine().getReasonPhrase());
        this.crawlResult.setUrl(crawlMeta.getUrl());
        this.visit(this.crawlResult);
        return;
    }


    // Parse the page
    this.crawlResult = doParse(res, this.crawlMeta);

    // Invoke the user's page-content callback
    this.visit(this.crawlResult);



    // Extract links from the fetched page and push the qualifying ones onto the crawl queue
    int currentDepth = this.crawlMeta.getCurrentDepth();
    if (currentDepth > depth) {
        return;
    }


    Elements elements = crawlResult.getHtmlDoc().select("a[href]");
    String src;
    for (Element element : elements) {
        // make sure relative URLs are converted to absolute ones
        src = element.attr("abs:href");
        if (!matchRegex(src)) {
            continue;
        }

        CrawlMeta meta = new CrawlMeta(currentDepth + 1,
                src,
                this.crawlMeta.getSelectorRules(),
                this.crawlMeta.getPositiveRegex(),
                this.crawlMeta.getNegativeRegex());
        fetchQueue.addSeed(meta);
    }
}

String res = EntityUtils.toString(response.getEntity(), httpConf.getCode());

One line in the code above deserves attention. Previously the response was parsed without considering character encoding at all, so everything went through the default-encoding path; the relevant library source is shown below. With defaultCharset = null, the final charset may end up as ISO_8859_1 or whatever was parsed from the Content-Type header, so when no encoding is specified the result can come out garbled.

Charset charset = null;

try {
    ContentType contentType = ContentType.get(entity);
    if(contentType != null) {
        charset = contentType.getCharset();
    }
} catch (UnsupportedCharsetException var13) {
    throw new UnsupportedEncodingException(var13.getMessage());
}

if(charset == null) {
    charset = defaultCharset;
}

if(charset == null) {
    charset = HTTP.DEF_CONTENT_CHARSET;
}

To fix the garbled output, a new code parameter was added to HttpConf (the network-related configuration) to carry the response encoding. Since this tutorial has not yet reached the networking module, the simplest possible approach was taken: a method was added to DefaultAbstractCrawlJob (the test later shows how it is used).

protected void setResponseCode(String code) {
  httpConf.setCode(code);
}
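
HttpConf itself is not shown in this post; judging from how it is used above (httpConf.getCode() when decoding the entity, and setCode() here), the encoding-related part might look roughly like the sketch below. Only the code property is grounded in the snippets above; the rest is an assumption.

// Hypothetical sketch of the encoding part of HttpConf; the real class in the project may differ.
@Getter
@Setter
public class HttpConf {

    /**
     * Charset name used to decode the response body, e.g. "utf-8" or "gbk"
     * (the project's actual default is not shown in this post)
     */
    private String code = "utf-8";
}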

4. Fetcher

This is the new crawl controller class: it takes tasks from the queue and creates jobs to execute them.

Because its responsibility is so clear-cut, a minimal implementation looks like this:

public class Fetcher {

    private int maxDepth;

    private FetchQueue fetchQueue;


    public FetchQueue addFeed(CrawlMeta feed) {
        fetchQueue.addSeed(feed);
        return fetchQueue;
    }


    public Fetcher() {
        this(0);
    }


    public Fetcher(int maxDepth) {
        this.maxDepth = maxDepth;
        fetchQueue = FetchQueue.DEFAULT_INSTANCE;
    }


    public <T extends DefaultAbstractCrawlJob> void start(Class<T> clz) throws Exception {
        CrawlMeta crawlMeta;
        int i = 0;
        while (true) {
            crawlMeta = fetchQueue.pollSeed();
            if (crawlMeta == null) {
                Thread.sleep(200);
                if (++i > 300) { // exit after roughly one minute with no new data
                    break;
                }

                continue;
            }

            i = 0;

            DefaultAbstractCrawlJob job = clz.newInstance();
            job.setDepth(this.maxDepth);
            job.setCrawlMeta(crawlMeta);
            job.setFetchQueue(fetchQueue);

            new Thread(job, "crawl-thread-" + System.currentTimeMillis()).start();
        }
    }

}

5. Testing

The test code differs a little from before and is somewhat more concise:

public class QueueCrawlerTest {

    public static class QueueCrawlerJob extends DefaultAbstractCrawlJob {

        public void beforeRun() {
            // Set the encoding of the returned page
            super.setResponseCode("gbk");
        }

        @Override
        protected void visit(CrawlResult crawlResult) {
            System.out.println(Thread.currentThread().getName() + " ___ " + crawlResult.getUrl());
        }
    }


    public static void main(String[] rags) throws Exception {
        Fetcher fetcher = new Fetcher(1);

        String url = "http://chengyu.t086.com/gushi/1.htm";
        CrawlMeta crawlMeta = new CrawlMeta();
        crawlMeta.setUrl(url);
        crawlMeta.addPositiveRegex("http://chengyu.t086.com/gushi/[0-9]+\\.htm$");

        fetcher.addFeed(crawlMeta);


        fetcher.start(QueueCrawlerJob.class);
    }
}

The output is as follows:

crawl-thread-1499333696153 ___ http://chengyu.t086.com/gushi/1.htm
crawl-thread-1499333710801 ___ http://chengyu.t086.com/gushi/3.htm
crawl-thread-1499333711142 ___ http://chengyu.t086.com/gushi/7.htm
crawl-thread-1499333710801 ___ http://chengyu.t086.com/gushi/2.htm
crawl-thread-1499333710802 ___ http://chengyu.t086.com/gushi/6.htm
crawl-thread-1499333710801 ___ http://chengyu.t086.com/gushi/4.htm
crawl-thread-1499333710802 ___ http://chengyu.t086.com/gushi/5.htm

Improvements

As before, the next step is to analyze the shortcomings of the implementation above and improve on them.

1. Points to improve

  • In Fetcher, every task starts its own thread; a thread pool would manage this better
  • The Job still mixes fetching with result analysis, which is some way from "a job only crawls"
  • The program-exit logic is clumsy
  • An interval between page fetches could be added (a simple sketch follows this list)
  • Jobs are created and destroyed at a high rate; an object pool might reduce GC pressure
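
For the crawl-interval point, a crude sketch of the idea is a fixed pause at the top of each job's fetch, as below; this is not implemented in the project at this tag, and the interval value is arbitrary.

// Hypothetical politeness delay before each fetch; not part of the project at this point.
private static final long CRAWL_INTERVAL_MS = 500;

void doFetchPage() throws Exception {
    Thread.sleep(CRAWL_INTERVAL_MS);  // one pause per job, the simplest possible rate limiting
    HttpResponse response = HttpUtils.request(this.crawlMeta, httpConf);
    // ... rest of the method unchanged ...
}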

2. Thread pool

We use Java's built-in thread pool directly. Since a thread pool has quite a few configuration parameters, we first define a configuration class with a set of defaults. These defaults may not suit a real workload; the parameters need to be tuned to the actual crawl tasks to get the best behavior.

// Fetcher.java 

  @Getter
  @Setter
  @ToString
  @NoArgsConstructor
  public static class ThreadConf {
      private int coreNum = 6;
      private int maxNum = 10;
      private int queueSize = 10;
      private int aliveTime = 1;
      private TimeUnit timeUnit = TimeUnit.MINUTES;
      private String threadName = "crawl-fetch";


      public final static ThreadConf DEFAULT_CONF = new ThreadConf();
  }

Thread pool initialization:

private Executor executor;

@Setter
private ThreadConf threadConf;

/**
 * Initialize the thread pool
 */
private void initExecutor() {
    executor = new ThreadPoolExecutor(threadConf.getCoreNum(),
            threadConf.getMaxNum(),
            threadConf.getAliveTime(),
            threadConf.getTimeUnit(),
            new LinkedBlockingQueue<>(threadConf.getQueueSize()),
            new CustomThreadFactory(threadConf.getThreadName()),
            new ThreadPoolExecutor.CallerRunsPolicy());
}
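
CustomThreadFactory is referenced above but its source is not shown in this post. A minimal sketch of such a factory, which would produce the crawl-fetch-1, crawl-fetch-2, ... thread names seen in the output below, might look like this (an assumption, not necessarily the project's actual implementation):

import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names each new thread with the given prefix plus a running counter.
public class CustomThreadFactory implements ThreadFactory {

    private final String namePrefix;
    private final AtomicInteger counter = new AtomicInteger(0);

    public CustomThreadFactory(String namePrefix) {
        this.namePrefix = namePrefix;
    }

    @Override
    public Thread newThread(Runnable r) {
        return new Thread(r, namePrefix + "-" + counter.incrementAndGet());
    }
}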

To run a task, simply replace the original new-Thread call with submission to the thread pool:

// com.quick.hui.crawler.core.fetcher.Fetcher#start

executor.execute(job);

The test case is the same as before; the output differs mainly in the thread names. Note that crawl-fetch-1 appears twice: we configured coreNum = 6 while there are 7 crawl tasks, so one thread was reused. With a larger number of crawl tasks the benefit becomes much more obvious.

crawl-fetch-1 ___ http://chengyu.t086.com/gushi/1.htm
crawl-fetch-2 ___ http://chengyu.t086.com/gushi/2.htm
crawl-fetch-5 ___ http://chengyu.t086.com/gushi/5.htm
crawl-fetch-1 ___ http://chengyu.t086.com/gushi/7.htm
crawl-fetch-3 ___ http://chengyu.t086.com/gushi/3.htm
crawl-fetch-4 ___ http://chengyu.t086.com/gushi/4.htm
crawl-fetch-6 ___ http://chengyu.t086.com/gushi/6.htm

3. ResultFilter

The class that handles result analysis: it scans the links in a crawled page, wraps the qualifying ones and pushes them onto the pending queue.

The implementation itself is fairly simple; the tricky part is the logic for deciding whether the crawl has finished.

A simple approach:

  • Starting from level 0 (the seed), we know level 1 has count tasks
  • Starting from item 0 of level 1 there are count(1,0) tasks; from item 1 there are count(1,1) tasks
  • Starting from item 0 of level 2 there are count(2,0) tasks, and so on

When a job at the last level finishes, its parent's finished counter is incremented; if that counter now equals the parent's task count, the grandparent's counter is incremented in turn, and so on until the level-0 counter equals count, at which point the crawl is complete. For example, if the seed spawns 2 links and each of those spawns 3 last-level links, the crawl only ends after each group of 3 has finished and bumped its parent, and both parents have in turn bumped the seed.

Counting bookkeeping: JobCount

Every crawl job has a corresponding JobCount. Note its fields, and that the JobCount id must be globally unique.

@Getter
public class JobCount {

    public static int SEED_ID = 1;

    public static AtomicInteger idGen = new AtomicInteger(0);


    public static int genId() {
        return idGen.addAndGet(1);
    }


    /**
     * Unique ID of this job
     */
    private int id;


    /**
     * ID of this job's parent job
     */
    private int upperId;


    /**
     * Current depth (level)
     */
    private int currentDepth;


    /**
     * Number of child jobs found in this job's page
     */
    private AtomicInteger jobCount = new AtomicInteger(0);


    /**
     * Number of those child jobs that have finished
     */
    private AtomicInteger finishCount = new AtomicInteger(0);


    public boolean fetchOver() {
        return jobCount.get() == finishCount.get();
    }


    /**
     * Record that one child task has finished crawling
     */
    public synchronized boolean finishJob() {
        finishCount.addAndGet(1);
        return fetchOver();
    }


    public JobCount(int id, int upperId, int currentDepth, int jobCount, int finishCount) {
        this.id = id;
        this.upperId = upperId;
        this.currentDepth = currentDepth;
        this.jobCount.set(jobCount);
        this.finishCount.set(finishCount);
    }
}

To link each Job to its JobCount, two new properties are added to CrawlMeta:

/**
 * The {@link JobCount#id} of this task
 */
@Getter
@Setter
private int jobId;


/**
 * The {@link JobCount#upperId} of this task, i.e. its parent job's id
 */
@Getter
@Setter
private int parentJobId;

The crawl queue gets matching adjustments: a new isOver flag marks whether the crawl has finished, and a jobCountMap records the counters for each Job.

The modified FetchQueue code follows; pay particular attention to how the finishOneJob methods are implemented.

/**
 * JobCount map: key is {@link JobCount#id}, value is the corresponding JobCount
 */
public Map<Integer, JobCount> jobCountMap = new ConcurrentHashMap<>();


/**
 * Flag marking whether the crawl has finished
 */
public volatile boolean isOver = false;


/**
 * Only enqueue a URL that has not been seen before; this avoids crawling the same page twice
 *
 * @param crawlMeta
 */
public boolean addSeed(CrawlMeta crawlMeta) {
    if (urls.contains(crawlMeta.getUrl())) {
        return false;
    }

    synchronized (this) {
        if (urls.contains(crawlMeta.getUrl())) {
            return false;
        }


        urls.add(crawlMeta.getUrl());
        toFetchQueue.add(crawlMeta);
        return true;
    }
}


public CrawlMeta pollSeed() {
    return toFetchQueue.poll();
}


public void finishJob(CrawlMeta crawlMeta, int count, int maxDepth) {
    if (finishOneJob(crawlMeta, count, maxDepth)) {
        isOver = true;
        System.out.println("============ finish crawl! ======");
    }
}


/**
 * Finish one crawl task
 *
 * @param crawlMeta the finished crawl task
 * @param count     the number of links on the fetched page that qualify for further crawling
 * @return true if the whole crawl is now complete
 */
private boolean finishOneJob(CrawlMeta crawlMeta, int count, int maxDepth) {
    JobCount jobCount = new JobCount(crawlMeta.getJobId(),
            crawlMeta.getParentJobId(),
            crawlMeta.getCurrentDepth(),
            count, 0);
    jobCountMap.put(crawlMeta.getJobId(), jobCount);


    if (crawlMeta.getCurrentDepth() == 0) { // special-case the seed page
        return count == 0; // if there are no child links to crawl, the whole crawl is already done
    }


    if (count == 0 || crawlMeta.getCurrentDepth() == maxDepth) {
        // this job sits at the last level, so bump the parent's finished counter
        return finishOneJob(jobCountMap.get(crawlMeta.getParentJobId()));
    }


    return false;
}

/**
 * Recursively propagate a finished task (+1) up the chain
 *
 * @param jobCount
 * @return true when every task has finished
 */
private boolean finishOneJob(JobCount jobCount) {
    if (jobCount.finishJob()) {
        if (jobCount.getCurrentDepth() == 0) {
            return true; // done
        }

        return finishOneJob(jobCountMap.get(jobCount.getUpperId()));
    }

    return false;
}

The loop condition in the Fetcher class is then changed to use fetchQueue's isOver flag as its termination test:

public <T extends DefaultAbstractCrawlJob> void start(Class<T> clz) throws Exception {
    CrawlMeta crawlMeta;

    while (!fetchQueue.isOver) {
        crawlMeta = fetchQueue.pollSeed();
        if (crawlMeta == null) {
            Thread.sleep(200);
            continue;
        }


        DefaultAbstractCrawlJob job = clz.newInstance();
        job.setDepth(this.maxDepth);
        job.setCrawlMeta(crawlMeta);
        job.setFetchQueue(fetchQueue);

        executor.execute(job);
    }
}

That completes the termination condition. What remains is to split up the Job code, moving the link-filtering logic for a crawled page into ResultFilter; this is essentially just a code migration.

public class ResultFilter {


    public static void filter(CrawlMeta crawlMeta,
                              CrawlResult crawlResult,
                              FetchQueue fetchQueue,
                              int maxDepth) {
        int count = 0;
        try {
            // Extract links from the fetched page and push the qualifying ones onto the crawl queue
            int currentDepth = crawlMeta.getCurrentDepth();
            if (currentDepth >= maxDepth) {
                return;
            }


            // number of links on this page that can still be crawled

            Elements elements = crawlResult.getHtmlDoc().select("a[href]");
            String src;
            for (Element element : elements) {
                // make sure relative URLs are converted to absolute ones
                src = element.attr("abs:href");
                if (!matchRegex(crawlMeta, src)) {
                    continue;
                }

                CrawlMeta meta = new CrawlMeta(
                        JobCount.genId(),
                        crawlMeta.getJobId(),
                        currentDepth + 1,
                        src,
                        crawlMeta.getSelectorRules(),
                        crawlMeta.getPositiveRegex(),
                        crawlMeta.getNegativeRegex());
                if (fetchQueue.addSeed(meta)) {
                    count++;
                }
            }

        } finally { // record this job as finished, bumping the parent's counter
            fetchQueue.finishJob(crawlMeta, count, maxDepth);
        }

    }


    private static boolean matchRegex(CrawlMeta crawlMeta, String url) {
        Matcher matcher;
        for (Pattern pattern : crawlMeta.getPositiveRegex()) {
            matcher = pattern.matcher(url);
            if (matcher.find()) {
                return true;
            }
        }


        for (Pattern pattern : crawlMeta.getNegativeRegex()) {
            matcher = pattern.matcher(url);
            if (matcher.find()) {
                return false;
            }
        }


        return crawlMeta.getPositiveRegex().size() == 0;
    }

}

The test code changes slightly from before: the depth is set to 2 and the crawl regex is adjusted a little.

public class QueueCrawlerTest {

    public static class QueueCrawlerJob extends DefaultAbstractCrawlJob {

        public void beforeRun() {
            // Set the encoding of the returned page
            super.setResponseCode("gbk");
        }

        @Override
        protected void visit(CrawlResult crawlResult) {
            System.out.println(Thread.currentThread().getName() + "___" + crawlMeta.getCurrentDepth() + "___" + crawlResult.getUrl());
        }
    }


    @Test
    public void testCrawel() throws Exception {
        Fetcher fetcher = new Fetcher(2);

        String url = "http://chengyu.t086.com/gushi/1.htm";
        CrawlMeta crawlMeta = new CrawlMeta();
        crawlMeta.setUrl(url);
        crawlMeta.addPositiveRegex("http://chengyu.t086.com/gushi/[0-9]+\\.html$");

        fetcher.addFeed(crawlMeta);


        fetcher.start(QueueCrawlerJob.class);
    }
}

The output is as follows:

crawl-fetch-1___0___http://chengyu.t086.com/gushi/1.htm
crawl-fetch-7___1___http://chengyu.t086.com/gushi/673.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/683.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/687.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/672.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/686.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/688.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/684.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/670.html
main___1___http://chengyu.t086.com/gushi/669.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/685.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/671.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/679.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/677.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/682.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/681.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/676.html
main___1___http://chengyu.t086.com/gushi/660.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/680.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/675.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/678.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/674.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/668.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/667.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/666.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/665.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/662.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/661.html
main___1___http://chengyu.t086.com/gushi/651.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/657.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/658.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/663.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/664.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/659.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/656.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/655.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/653.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/652.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/654.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/650.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/648.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/649.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/647.html
main___1___http://chengyu.t086.com/gushi/640.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/644.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/645.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/643.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/646.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/641.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/642.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/639.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/635.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/637.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/634.html
main___1___http://chengyu.t086.com/gushi/629.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/638.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/633.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/632.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/636.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/630.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/631.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/627.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/628.html
main___1___http://chengyu.t086.com/gushi/617.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/625.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/622.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/624.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/626.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/623.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/621.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/620.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/614.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/618.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/612.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/611.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/619.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/616.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/615.html
main___1___http://chengyu.t086.com/gushi/605.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/613.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/610.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/609.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/608.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/606.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/607.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/603.html
main___1___http://chengyu.t086.com/gushi/594.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/604.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/600.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/602.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/599.html
crawl-fetch-3___1___http://chengyu.t086.com/gushi/601.html
crawl-fetch-5___1___http://chengyu.t086.com/gushi/598.html
crawl-fetch-6___1___http://chengyu.t086.com/gushi/596.html
crawl-fetch-1___1___http://chengyu.t086.com/gushi/597.html
crawl-fetch-4___1___http://chengyu.t086.com/gushi/593.html
crawl-fetch-8___1___http://chengyu.t086.com/gushi/591.html
crawl-fetch-9___1___http://chengyu.t086.com/gushi/595.html
crawl-fetch-7___1___http://chengyu.t086.com/gushi/592.html
main___2___http://chengyu.t086.com/gushi/583.html
crawl-fetch-3___2___http://chengyu.t086.com/gushi/588.html
crawl-fetch-10___1___http://chengyu.t086.com/gushi/590.html
crawl-fetch-2___1___http://chengyu.t086.com/gushi/589.html
crawl-fetch-5___2___http://chengyu.t086.com/gushi/579.html
crawl-fetch-1___2___http://chengyu.t086.com/gushi/581.html
crawl-fetch-7___2___http://chengyu.t086.com/gushi/584.html
crawl-fetch-4___2___http://chengyu.t086.com/gushi/582.html
crawl-fetch-3___2___http://chengyu.t086.com/gushi/587.html
crawl-fetch-6___2___http://chengyu.t086.com/gushi/580.html
crawl-fetch-9___2___http://chengyu.t086.com/gushi/585.html
crawl-fetch-8___2___http://chengyu.t086.com/gushi/586.html
crawl-fetch-10___2___http://chengyu.t086.com/gushi/578.html
crawl-fetch-1___2___http://chengyu.t086.com/gushi/575.html
crawl-fetch-2___2___http://chengyu.t086.com/gushi/577.html
crawl-fetch-5___2___http://chengyu.t086.com/gushi/576.html
crawl-fetch-7___2___http://chengyu.t086.com/gushi/574.html
============ finish crawl! ======

Summary

This post focused on using a crawl queue plus a thread pool to run crawl tasks concurrently, and also implemented a rather clumsy scheme for deciding when the crawl has finished.

Shortcomings

The implementation above has one very obvious flaw: it produces far too little log output. The next post will tackle exactly that, printing log information along the key paths, and will also knock out the remaining optimization points.

At this point the skeleton of a crawler framework is roughly complete. There are of course still plenty of problems: the queue depth, the fact that jobCountMap can grow without bound, and various basic crawler etiquette issues are all still unresolved, but those will be filled in gradually later.

Source code

Project: https://github.com/liuyueyi/quick-crawler

Tag before this optimization: v0.004

Tag after this optimization: v0.005
