Understanding Web Crawlers in One Article

Date: 2019-11-19

1. Overview

I. Crawler fundamentals

II. A look inside webmagic, an excellent open-source crawler framework from China

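2. Crawler basics

I. The essence of a crawler

At its core, a crawler sends a request to a target address over HTTP, receives the response, parses it, and stores the result.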

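II. HTTP requests

(1) Request headers: carry the basic information about an HTTP request. The important ones include user-agent, referer, cookie, and accept-language (the languages the client accepts). The request method (POST, GET) is sent alongside them in the request line.

(2) Response headers: carry the header information returned by the server, such as content-language (the language of the content), content-type (the type of the content, e.g. text/html), and server (the server software, e.g. tomcat, jetty, nginx). The response also carries a status code (e.g. 200, 302, 404).

(3) Response: the actual body returned by the server. It comes in many forms: HTML pages, JavaScript code, JSON strings, CSS stylesheets, streams, and so on.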
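To make the request headers above concrete, here is a minimal sketch that builds (but does not send) a GET request with the JDK's own `java.net.http` API, available since Java 11. The class name `RequestHeaderSketch`, the URL, and the header values are illustrative assumptions, not something from the original article.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RequestHeaderSketch {
    // Builds a GET request carrying headers that a browser would normally send.
    // Nothing goes over the network here; the request is only constructed.
    static HttpRequest build(String url, String userAgent, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Accept-Language", "en-US,en;q=0.8")
                .header("Cookie", cookie)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical target and values, for illustration only.
        HttpRequest request = build("https://example.com/list",
                "Mozilla/5.0 (X11; Linux x86_64)", "sessionid=hypothetical");
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Sending the request would be a separate step (for example with `HttpClient.send`); the sketch stops before that so it can be run offline.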

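III. Parsing

In most cases, what a website returns is either an HTML page or JSON.

(1) xpath

XPath, the XML Path Language, is a powerful way to parse documents. Chrome and Firefox both have tools that generate XPath expressions, which makes it easy to parse well-formed HTML files.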
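As an illustration of XPath parsing, the following sketch uses only the JDK's built-in `javax.xml` packages. It assumes the markup is well-formed XML, which real-world HTML often is not; the class name `XPathSketch` and the sample page are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Evaluates an XPath expression against well-formed markup and
    // returns the text content of every matching node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Malformed markup or an invalid expression ends up here.
            throw new IllegalArgumentException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><ul>"
                + "<li class=\"repo\"><a href=\"/a\">alpha</a></li>"
                + "<li class=\"repo\"><a href=\"/b\">beta</a></li>"
                + "</ul></body></html>";
        System.out.println(select(page, "//li[@class='repo']/a/@href"));
        System.out.println(select(page, "//li[@class='repo']/a"));
    }
}
```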

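(2) jsonpath

jsonpath is a handy tool for parsing JSON. Its syntax closely resembles XPath, and it extracts values from a JSON string with very concise expressions.

(3) CSS selectors

CSS selectors here work much like those in jquery: they locate elements through their CSS classes and related selector rules. The well-known jsoup library offers a rich set of CSS selectors.

(4) Regular expressions

(5) String splitting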
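The last two techniques need no extra library. The sketch below, under the assumption that the input is simple and regular, pulls `href` values out of HTML with a regular expression and reads one value out of a `key=value; key=value` string by splitting. The class name `RegexSketch` and both helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // Matches href="..." and captures the quoted value.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Regular expression approach: collect every double-quoted href value.
    static List<String> hrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // String splitting approach: find one value in "k1=v1; k2=v2" text.
    // Returns null when the key is absent.
    static String valueOf(String line, String key) {
        for (String part : line.split(";")) {
            String[] pair = part.trim().split("=", 2);
            if (pair.length == 2 && pair[0].equals(key)) {
                return pair[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(hrefs("<a href=\"/x\">x</a> <a href=\"/y\">y</a>"));
        System.out.println(valueOf("lang=en; page=3", "page"));
    }
}
```

Both approaches are brittle against markup changes (single-quoted attributes, for instance, are not matched), which is why they tend to be a fallback when xpath or CSS selectors do not fit.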

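IV. Difficulties

(1) Analyzing requests

With the spread of ajax, many sites render pages dynamically, so a request no longer simply returns HTML. This makes crawling much harder; usually the only option is to inspect the JSON returned by the asynchronous requests and convert it into the data format we need. Another class of sites renders the page through internal forwarding on the server side. These are the hardest: the content is not produced by a single request from the browser, but only after several hops on the server. In such cases you need a browser automation tool such as selenium to simulate the requests.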

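(2) Site restrictions

Cookie restrictions: many sites can only be accessed after logging in, because a filter blocks anonymous requests. In that case the crawler must send the cookies of a logged-in session.

user-agent: to keep crawlers out, some sites only serve requests that appear to come from a real browser. In that case the crawler can send a browser's user-agent string.

Request encryption: if a site's requests are encrypted, their real content is hidden and you have to guess. Usually a simple encoding such as base64 or urlEncode is used; if the scheme is more complex, the only option is exhaustive trial and error.

IP restrictions: some sites block the IP addresses of crawlers. The options are to switch to a different IP or to disguise the IP.

Workaround: sites often protect their PC (desktop) front end thoroughly. Sometimes it pays to change approach and request the services behind the mobile app instead, which often yields unexpected results.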
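For the "request encryption" case above, a first probe is usually to try the common encodings. This sketch decodes a Base64 value and a URL-encoded value with JDK classes only; the sample parameter strings and the class name `DecodeSketch` are invented for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DecodeSketch {
    // Reverses URL encoding, e.g. "%3D" becomes "=" and "%20" becomes a space.
    static String urlDecode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    // Reverses standard Base64; throws IllegalArgumentException on invalid input.
    static String base64Decode(String value) {
        byte[] raw = Base64.getDecoder().decode(value);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical opaque-looking parameters captured from a request.
        System.out.println(base64Decode("cGFnZT0yJnNpemU9MjA="));
        System.out.println(urlDecode("name%3Dweb%20crawler"));
    }
}
```

If neither decoding yields readable text, the value is probably produced by something site-specific, and that is where the guessing described above begins.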

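(3) Crawl depth

A website is typically a set of pages that link to one another, and in theory the chain of links can go on forever. The crawler must therefore set a maximum crawl depth rather than follow links endlessly.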
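One way to enforce a crawl depth is a breadth-first traversal that records the depth of each page and stops expanding links once the limit is reached. The sketch below runs over an in-memory link map instead of real pages, so the map, the page names, and the class name `DepthLimitSketch` are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthLimitSketch {
    // Visits pages breadth-first starting from `start`, each page at most once,
    // and never follows links from a page that is already at `maxDepth`.
    static List<String> crawl(Map<String, List<String>> links, String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        depthOf.put(start, 0);
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visited.add(url);
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // depth limit reached: do not expand this page
            }
            for (String next : links.getOrDefault(url, List.of())) {
                // putIfAbsent returns null only for pages not seen before.
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("a"),
                "d", List.of("e"));
        System.out.println(crawl(links, "a", 2));
    }
}
```

The depth map doubles as the "already seen" set, which also stops the cycle between `a` and `c` from looping forever.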

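V. Summary

A crawler essentially does only two things: send requests and parse the results. Developing one is nonetheless very hard. You have to keep analyzing the site's requests, keep updating your program as the target site changes, and probe, decode, and work around the site's restrictions. It is no exaggeration to call it a form of network attack and defense.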

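3. webmagic architecture

webmagic is an excellent crawler framework from China. It is simple to use, offers several kinds of selectors (CSS selectors, xpath, regular expressions, and so on), and exposes several extension points such as Pipeline, Scheduler, and Downloader.

As the architecture diagram in the official webmagic documentation shows, webmagic consists of four components: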

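Downloader: requests a URL and fetches the data (HTML page, JSON, etc.).

PageProcessor: parses the data fetched by the Downloader.

Pipeline: saves, that is persists, the data extracted by the PageProcessor.

Scheduler: usually responsible for URL deduplication and for holding the URL queue. URLs extracted by the PageProcessor can be pushed onto the Scheduler queue for later crawling.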
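The Scheduler's two duties, deduplication and queueing, can be pictured with a toy class. This is a simplified sketch with an invented name (`ToyScheduler`) and invented method signatures; it is not webmagic's Scheduler interface, which works with `Request` and `Task` objects as the source excerpts later in this article show.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class ToyScheduler {
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();

    // Queues the URL only if it has never been pushed before.
    // Returns true when the URL was new.
    synchronized boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate: already queued or already crawled
        }
        queue.add(url);
        return true;
    }

    // Returns the next URL to crawl, or null when the queue is empty.
    synchronized String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.push("https://example.com/a");
        scheduler.push("https://example.com/a"); // ignored as a duplicate
        scheduler.push("https://example.com/b");
        System.out.println(scheduler.poll());
        System.out.println(scheduler.poll());
        System.out.println(scheduler.poll());
    }
}
```

Both methods are `synchronized` because several crawl threads would push newly found URLs while the main loop polls.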

webmagic is very simple to use: implement the PageProcessor interface, then start the crawl with the Spider class.

Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the crawler
                .run();

The following walks through several important methods of the Spider class, including how they use locks.

I. addUrl

public Spider addUrl(String... urls) {
        for (String url : urls) {
            addRequest(new Request(url));
        }
        signalNewUrl();
        return this;
    }

private void addRequest(Request request) {
        if (site.getDomain() == null && request != null && request.getUrl() != null) {
            site.setDomain(UrlUtils.getDomain(request.getUrl()));
        }
        scheduler.push(request, this);
    }

scheduler.push(request, this) adds the URL to be crawled to the Scheduler queue.

II. initComponent

protected void initComponent() {
        if (downloader == null) {
            this.downloader = new HttpClientDownloader();
        }
        if (pipelines.isEmpty()) {
            pipelines.add(new ConsolePipeline());
        }
        downloader.setThread(threadNum);
        if (threadPool == null || threadPool.isShutdown()) {
            if (executorService != null && !executorService.isShutdown()) {
                threadPool = new CountableThreadPool(threadNum, executorService);
            } else {
                threadPool = new CountableThreadPool(threadNum);
            }
        }
        if (startRequests != null) {
            for (Request request : startRequests) {
                addRequest(request);
            }
            startRequests.clear();
        }
        startTime = new Date();
    }

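initComponent initializes the downloader, the pipelines, and the threadPool. Note that webmagic's default downloader is HttpClientDownloader and its default pipeline is ConsolePipeline.

III. run

The run method is the core of the whole crawl.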

public void run() {
        checkRunningStat();
        initComponent();
        logger.info("Spider {} started!",getUUID());
        while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
            final Request request = scheduler.poll(this);
            if (request == null) {
                if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                    break;
                }
                // wait until new url added
                waitNewUrl();
            } else {
                threadPool.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            processRequest(request);
                            onSuccess(request);
                        } catch (Exception e) {
                            onError(request);
                            logger.error("process request " + request + " error", e);
                        } finally {
                            pageCount.incrementAndGet();
                            signalNewUrl();
                        }
                    }
                });
            }
        }
        stat.set(STAT_STOPPED);
        // release some resources
        if (destroyWhenExit) {
            close();
        }
        logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());
    }

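(1) When the task ends

The task exits only when the queue is empty, all in-flight requests have finished, and exitWhenComplete is true. One caveat: if page requests are very slow, newly extracted URLs may not reach the queue in time, and the task exits before the crawl is complete. The usual fix is to set exitWhenComplete to false. That, however, causes trouble when two crawlers must run in sequence and the second has to wait for the first to finish, because the first no longer exits when its queue runs dry. Supporting that scenario requires modifying the webmagic source.

(2) Waiting for new requests: the default wait time is 30 s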

private void waitNewUrl() {
        newUrlLock.lock();
        try {
            //double check
            if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                return;
            }
            newUrlCondition.await(emptySleepTime, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            logger.warn("waitNewUrl - interrupted, error {}", e);
        } finally {
            newUrlLock.unlock();
        }
    }

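(3) If the scheduler queue contains a URL, the task is submitted to the thread pool. When the page downloads successfully, the process method of the pageProcessor is called; if there are pipelines, the process method of each pipeline in the chain is then called.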

private void onDownloadSuccess(Request request, Page page) {
        onSuccess(request);
        if (site.getAcceptStatCode().contains(page.getStatusCode())){
            pageProcessor.process(page);
            extractAndAddRequests(page, spawnUrl);
            if (!page.getResultItems().isSkip()) {
                for (Pipeline pipeline : pipelines) {
                    pipeline.process(page.getResultItems(), this);
                }
            }
        }
        sleep(site.getSleepTime());
        return;
    }

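One thing to note: implementations of the PageProcessor and Pipeline interfaces must be thread-safe, because they are called from multiple crawl threads. In particular, never add elements to a shared singleton collection without synchronization.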
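If results really do have to be gathered in one shared object, a concurrent collection is one safe option. The sketch below is a generic illustration, not webmagic code: several worker threads append to a `ConcurrentLinkedQueue`, and no item is lost. The class name `SharedResultsSketch` and the item strings are made up.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedResultsSketch {
    // Runs `tasks` jobs on `threads` workers; every job appends one item
    // to a single shared collection. Returns how many items arrived.
    static int collect(int threads, int tasks) {
        // Thread-safe queue: concurrent add() calls do not corrupt it.
        Queue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            pool.execute(() -> results.add("item-" + id));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results.size();
    }

    public static void main(String[] args) {
        System.out.println(collect(5, 1000));
    }
}
```

Swapping the queue for a plain `ArrayList` would make the count unreliable, and could even throw, since `ArrayList.add` is not safe under concurrent writers.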

(4) The execute method of the CountableThreadPool thread pool

public void execute(final Runnable runnable) {


        if (threadAlive.get() >= threadNum) {
            try {
                reentrantLock.lock();
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            } finally {
                reentrantLock.unlock();
            }
        }
        threadAlive.incrementAndGet();
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
    }

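When the number of running tasks reaches the configured thread count, a newly submitted task waits until a condition signal arrives and wakes the blocked thread. Note that await releases the lock associated with the condition, and by the time await returns, the thread is guaranteed to have reacquired that lock.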
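The await and signal behaviour described above can be tried in isolation. The following is a simplified sketch, not webmagic's CountableThreadPool: it keeps the counter under the lock at all times rather than checking an atomic counter first, and the class name `SlotLimiter` and its methods are hypothetical.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class SlotLimiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxSlots;
    private int inUse = 0;
    private int peak = 0;

    SlotLimiter(int maxSlots) {
        this.maxSlots = maxSlots;
    }

    // Blocks until fewer than maxSlots slots are taken, then takes one.
    void acquire() throws InterruptedException {
        lock.lock();
        try {
            while (inUse >= maxSlots) {
                // await() releases the lock while waiting and holds it again
                // on return; the loop re-checks the condition after waking.
                slotFreed.await();
            }
            inUse++;
            peak = Math.max(peak, inUse);
        } finally {
            lock.unlock();
        }
    }

    // Gives a slot back and wakes one waiting thread, if any.
    void release() {
        lock.lock();
        try {
            inUse--;
            slotFreed.signal();
        } finally {
            lock.unlock();
        }
    }

    // Starts `tasks` threads that each hold a slot briefly and
    // returns the highest number of slots held at the same time.
    static int demo(int maxSlots, int tasks) {
        SlotLimiter limiter = new SlotLimiter(maxSlots);
        Thread[] workers = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            workers[i] = new Thread(() -> {
                try {
                    limiter.acquire();
                    try {
                        Thread.sleep(5); // simulated page download
                    } finally {
                        limiter.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        limiter.lock.lock();
        try {
            return limiter.peak;
        } finally {
            limiter.lock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + demo(2, 8));
    }
}
```

The `while` loop around `await()` matters: a woken thread must re-check the counter, because another thread may have taken the freed slot first.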

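Overall, webmagic has a clear architecture, is easy to extend and convenient to use, and is a solid crawler framework.

Happiness comes from sharing.

This post is the author's original work. Please credit the source when reposting.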