Playing with the Gecco Crawler Framework

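Suppose you have just been handed a task: produce the category breakdown for some industry.

As someone who is not an expert in that field and has no deep operations experience, coming up with an industry taxonomy that is presentable and convincing is probably not so easy.

Not being able to find an expert is fine, because we can crawl. Peel the experts' hard work off the web layer by layer, then run the statistics on it.

With the approach settled, I spent a day getting to know the crawler framework this post is about: Gecco.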

###Introduction to Gecco

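Gecco is a lightweight, easy-to-use web crawler written in Java. It integrates jsoup, httpclient, fastjson, spring, htmlunit, Redisson and other excellent frameworks, so that you can write a crawler quickly just by configuring a few jQuery-style selectors. The framework is highly extensible and designed around the open/closed principle: closed for modification, open for extension. Gecco is also released under the very permissive MIT license, whether you are a user or a developer who wants to help improve it (quoted from the introduction on GitHub).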

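Main features

- Simple and easy to use, with jQuery-style selectors for extracting elements
- Supports dynamic configuration and loading of crawl rules
- Supports asynchronous ajax requests in pages
- Supports extracting javascript variables from pages
- Distributed crawling via Redis, see gecco-redis
- Can be combined with Spring to develop business logic, see gecco-spring
- htmlunit extension support, see gecco-htmlunit
- Plugin extension mechanism
- Random selection of the UserAgent when downloading
- Random selection of the download proxy server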

**GitHub:** https://github.com/xtuhcy/gecco

**Chinese reference manual:** http://www.geccocrawler.com/

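The GitHub repository also provides an example of using Gecco that crawls the JD.com categories and the product information under each category. The moment I saw that example, it was clear that Gecco is a very good fit for this kind of data: categories plus the detail lists beneath them.

The rest of this post walks through a real example and explains how to use Gecco along the way.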

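###Crawling category data with Gecco

####Crawling approach

First, fix the seed site to crawl: http://news.iresearch.cn/

The region to crawl is shown in the figure below.

**Approach:** start from the "Internet+" category at the top, then crawl each sub-category beneath it (mobile internet, e-commerce, internet, online marketing, online games), then crawl all the articles under each sub-category, and finally extract the text of every article. (After extraction the text needs to be segmented with IKAnalyzer or ansj and then run through a word-frequency count, which this post does not cover.)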

####Writing the crawler entry point

I created a Maven project, so the first step in using Gecco is to add the Maven dependency

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>1.0.8</version>
</dependency>

Then write a main function as the crawler's entry point

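import com.geccocrawler.gecco.GeccoEngine;
import com.geccocrawler.gecco.request.HttpGetRequest;

public class Main {

    public static void main(String[] args) {
        System.out.println("=======start========");
        HttpGetRequest startUrl = new HttpGetRequest("http://news.iresearch.cn/");
        startUrl.setCharset("GBK");
        GeccoEngine.create()
                // package that Gecco scans for annotated classes
                .classpath("com.crawler.gecco")
                // page to start crawling from
                .start(startUrl)
                // number of crawler threads to start
                .thread(1)
                // pause after each request a single crawler finishes
                .interval(2000)
                .run();
    }
}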

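HttpGetRequest wraps the seed URL and also lets you set the encoding, here "GBK" (before I set this parameter, all the crawled text came out garbled).
classpath is a scan path, similar to component-scan in Spring, used to find annotated classes. Here it mainly picks up the classes carrying the "@Gecco" annotation.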
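
To see why the charset matters, here is a small stand-alone sketch that uses only the JDK and no Gecco code. CharsetDemo and decode are names I made up for illustration. It decodes the same bytes twice, once with the charset the page was written in and once with the wrong one.

```java
import java.nio.charset.Charset;

/** Hypothetical helper, not part of Gecco: decodes a raw response body. */
final class CharsetDemo {

    /** Turns raw bytes into text using the named charset. */
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    /** Produces the bytes a GBK-encoded page would send for the given text. */
    static byte[] gbkBytes(String text) {
        return text.getBytes(Charset.forName("GBK"));
    }
}
```

Calling decode(gbkBytes(text), "GBK") returns the original text, while decode(gbkBytes(text), "UTF-8") returns replacement characters for any Chinese input, which is the kind of garbling I saw before calling setCharset("GBK").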

####Parsing out all the sub-categories

package com.crawler.gecco;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

import java.util.List;

/**
 * Created by jackie on 18/1/15.
 */
@Gecco(matchUrl="http://news.iresearch.cn/", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {
    private static final long serialVersionUID = 665662335318691818L;

    @Request
    private HttpRequest request;

    // Mobile internet
    @HtmlField(cssPath="#tab-list > div:nth-child(1)")
    private List<Category> mobileInternet;

    // E-commerce
    @HtmlField(cssPath="#tab-list > div:nth-child(2)")
    private List<Category> electric;

    // Internet
    @HtmlField(cssPath="#tab-list > div:nth-child(3)")
    private List<Category> internet;

    // Online marketing
    @HtmlField(cssPath="#tab-list > div:nth-child(4)")
    private List<Category> netMarket;

    // Online games
    @HtmlField(cssPath="#tab-list > div:nth-child(5)")
    private List<Category> netGame;

    public List<Category> getMobileInternet() {
        return mobileInternet;
    }

    public void setMobileInternet(List<Category> mobileInternet) {
        this.mobileInternet = mobileInternet;
    }

    public List<Category> getElectric() {
        return electric;
    }

    public void setElectric(List<Category> electric) {
        this.electric = electric;
    }

    public List<Category> getInternet() {
        return internet;
    }

    public void setInternet(List<Category> internet) {
        this.internet = internet;
    }

    public List<Category> getNetMarket() {
        return netMarket;
    }

    public void setNetMarket(List<Category> netMarket) {
        this.netMarket = netMarket;
    }

    public List<Category> getNetGame() {
        return netGame;
    }

    public void setNetGame(List<Category> netGame) {
        this.netGame = netGame;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

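The code is long, but apart from the setters and getters, what remains is the code that grabs the sub-category blocks.
The @Gecco annotation tells the crawler which URL format to match (matchUrl) and which bean-processing classes run after the content is extracted (pipelines; these follow the pipes-and-filters pattern, and you can define several). Here matchUrl is http://news.iresearch.cn/, meaning the bean is parsed from the page at that address.
The pipelines parameter can list several pipeline classes, i.e. which pipelines run next. Note consolePipeline, a pipeline dedicated to printing progress information to the console, which is covered later.
The @HtmlField annotation extracts an element from the HTML; cssPath selects the element with a jQuery-like CSS selector.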
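
The pipes-and-filters idea behind pipelines can be sketched in a few lines of plain Java. This is only my illustration of the pattern, assuming each listed pipeline receives the same extracted bean in turn. Stage and StageChain are hypothetical names, not Gecco's own types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one pipeline stage; not Gecco's interface. */
interface Stage<T> {
    void process(T bean);
}

/** Runs every registered stage against the same bean, in registration order. */
final class StageChain<T> {
    private final List<Stage<T>> stages = new ArrayList<>();

    /** Registers a stage and returns the chain so calls can be strung together. */
    StageChain<T> add(Stage<T> stage) {
        stages.add(stage);
        return this;
    }

    /** Hands the bean to each stage, first registered first. */
    void run(T bean) {
        for (Stage<T> stage : stages) {
            stage.process(bean);
        }
    }
}
```

With this sketch, a chain built from a console-printing stage followed by a URL-collecting stage plays the same role as pipelines={"consolePipeline", "allSortPipeline"} above.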

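As an example, suppose we need to parse every list under the "mobile internet" category and wrap the result as a list, so that the list contents can be parsed further later on

// Mobile internet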
@HtmlField(cssPath="#tab-list > div:nth-child(1)")
private List<Category> mobileInternet;

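Here cssPath specifies the CSS location of the target element to parse.

To see how to find the location of this block, look at the page first.

We want all the lists under "mobile internet", wrapped as a List. Open Chrome DevTools and you can see that this list module is wrapped in a div tag, so all we need is to locate that module.

Working out the cssPath by eye is hard on the eyes, so we can let Chrome's built-in tools produce the CSS path: right-click the element marked by the arrow in the screenshot above, follow the steps shown in the screenshot below, and paste to get the cssPath.

Repeat this to get the list blocks of the other four categories.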

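####Getting the URLs behind the category lists

The parsing above gives us the list module under each category. Chrome DevTools shows that each list item carries very little information, so we should not analyze just that small amount of text directly, since that would miss most of the text.

Instead, we should first locate and parse out all the href hyperlinks, i.e. the article detail address behind each list item, and then parse the full text of each article.

So the Category class here looks like this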

package com.crawler.gecco;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

import java.util.List;

/**
 * Created by jackie on 18/1/15.
 */
public class Category implements HtmlBean {
    private static final long serialVersionUID = 3018760488621382659L;

    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    @HtmlField(cssPath="ul li")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }
}

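categorys collects the URLs behind all the list items under a given category

Next, implement the AllSortPipeline class, which gathers the URLs under all categories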

package com.crawler.gecco;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

import java.util.ArrayList;
import java.util.List;

/**
 * Created by jackie on 18/1/15.
 */
@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {
    @Override
    public void process(AllSort allSort) {
        System.out.println("-=======-");
        List<Category> categorys = new ArrayList<Category>();
        categorys.addAll(allSort.getInternet());
        categorys.addAll(allSort.getElectric());
        categorys.addAll(allSort.getMobileInternet());
        categorys.addAll(allSort.getNetGame());
        categorys.addAll(allSort.getNetMarket());
        for(Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for(HrefBean href : hrefs) {
                System.out.println("title: " + href.getTitle() + " url: " + href.getUrl());
                String url = href.getUrl();
                HttpRequest currRequest = allSort.getRequest();
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }
}

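The categorys collection gathers the lists under all categories
Iterating over it yields each concrete url and the title that goes with it
Each url is stored in the SchedulerContext for the crawler to pick up later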
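
Conceptually, pushing URLs into the scheduler is just feeding a queue that the crawler threads drain later. The sketch below is my own simplified picture of that idea in plain Java. UrlFrontier is a hypothetical class, and the de-duplication is something I added for illustration; Gecco's real scheduler may behave differently.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical FIFO frontier with de-duplication; not Gecco's scheduler. */
final class UrlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Queues the URL unless it was queued before; returns true if it was added. */
    boolean into(String url) {
        if (!seen.add(url)) {
            return false;
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to fetch, or null when nothing is pending. */
    String out() {
        return pending.poll();
    }
}
```

Skipping URLs that were already queued is handy here because the same article can appear under more than one category list.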

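At this point we have the URLs behind every category list and have stored them in the context for subsequent crawler matching. Next comes the class that parses the detail page.

####Parsing article details

Create an annotated class ProductDetail to match the URLs obtained above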

package com.crawler.gecco;

import com.geccocrawler.gecco.annotation.*;
import com.geccocrawler.gecco.spider.HtmlBean;

/**
 * Created by jackie on 18/1/15.
 */
@Gecco(matchUrl="http://news.iresearch.cn/content/{year}/{month}/{code}.shtml", pipelines={"consolePipeline", "productDetailPipeline"})
public class ProductDetail implements HtmlBean {

    private static final long serialVersionUID = -377053120283382723L;

    /**
     * Article body text
     */
//    @Text
    @HtmlField(cssPath="body > div.g-content > div.g-bd.f-mt-auto > div > div.g-mn > div > div.g-article > div.m-article")
    private String content;

    @RequestParameter
    private String code;

    @RequestParameter
    private String year;

    @RequestParameter
    private String month;

    /**
     * Title
     */
    @Text
    @HtmlField(cssPath="body > div.g-content > div.g-main.f-mt-auto > div > div > div.title > h1")
    private String title;

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getYear() {
        return year;
    }

    public void setYear(String year) {
        this.year = year;
    }

    public String getMonth() {
        return month;
    }

    public void setMonth(String month) {
        this.month = month;
    }
}

matchUrl is the URL format of each article; year, month and code are the injected parameters
Likewise, we locate the cssPath of title and of content to parse out the actual title and content values
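
To make the placeholder idea concrete, here is a rough sketch of how a template with {name} placeholders could be matched against a URL and turned into named values. It uses only java.util.regex and is not how Gecco implements it. UrlTemplate is a hypothetical class, and the article URL in the usage note is a made-up example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical matcher for URL templates such as ".../{year}/{month}/{code}.shtml". */
final class UrlTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");
    private final Pattern regex;
    private final List<String> names = new ArrayList<>();

    UrlTemplate(String template) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text is quoted; each placeholder becomes one capture group.
            sb.append(Pattern.quote(template.substring(last, m.start())));
            sb.append("([^/]+?)");
            names.add(m.group(1));
            last = m.end();
        }
        sb.append(Pattern.quote(template.substring(last)));
        regex = Pattern.compile(sb.toString());
    }

    /** Returns the placeholder values by name, or null when the URL does not match. */
    Map<String, String> match(String url) {
        Matcher m = regex.matcher(url);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), m.group(i + 1));
        }
        return params;
    }
}
```

Matching a URL like http://news.iresearch.cn/content/2018/01/123456.shtml against the template yields values under the keys year, month and code, while the site's home page yields null. The keys come straight from the placeholder names, so a placeholder only fills a field if the two are spelled the same.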

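Next, implement the ProductDetailPipeline class, which takes the text of each article, extracts all the Chinese text with a regular expression, and stores it in result.txt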

package com.crawler.gecco;

import com.geccocrawler.gecco.annotation.*;
import com.geccocrawler.gecco.pipeline.Pipeline;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

/**
 * Created by jackie on 18/1/15.
 */
@PipelineName("productDetailPipeline")
public class ProductDetailPipeline  implements Pipeline<ProductDetail> {
    @Override
    public void process(ProductDetail productDetail) {
        System.out.println("~~~~~~~~~productDetailPipeline~~~~~~~~~~~");
        // FileWriter in append mode creates result.txt if it does not exist,
        // and try-with-resources closes the writer even when write() fails.
        try (FileWriter fileWriter = new FileWriter("result.txt", true)) {
            fileWriter.write(RegrexUtil.match(productDetail.getContent()));
        } catch (IOException e) {
            System.out.println("write to result.txt failed: " + e);
        }
    }
}
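
The RegrexUtil helper called above is not listed in this post. As a stand-in, here is a minimal sketch of what such a helper could look like, assuming it simply keeps the runs of Chinese characters and drops everything else (tags, Latin letters, digits, punctuation). ChineseTextExtractor is a hypothetical name; the real RegrexUtil in the project may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for RegrexUtil: keeps only Chinese characters. */
final class ChineseTextExtractor {
    // Common CJK ideograph range; an assumption about what "Chinese text" means here.
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]+");

    /** Concatenates every run of Chinese characters found in the input. */
    static String match(String html) {
        if (html == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        Matcher m = CHINESE.matcher(html);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }
}
```

Returning an empty string for null input keeps the pipeline from failing on pages where the content selector matched nothing.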

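With that, Gecco has fetched every article under each category of the internet industry and extracted all of the text.

The result looks like this

Project: https://github.com/DMinerJackie/tour-project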
