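## Scraping the product list pages

AllSortPipeline has already extracted the links to the product list pages that need further crawling. The links have the form `http://list.jd.com/list.html?cat=9987,653,659&delivery=1&JL=4_10_0&go=0`. We therefore create a bean for the product list, ProductList, with the following code: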
```java
import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="http://list.jd.com/list.html?cat={cat}&delivery={delivery}&page={page}&JL={JL}&go=0", pipelines={"consolePipeline", "productListPipeline"})
public class ProductList implements HtmlBean {

    private static final long serialVersionUID = 4369792078959596706L;

    @Request
    private HttpRequest request;

    /**
     * Details of each list item, including title, price, detail page URL, etc.
     */
    @HtmlField(cssPath="#plist .gl-item")
    private List<ProductBrief> details;

    /**
     * Current page of the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > b")
    private int currPage;

    /**
     * Total number of pages in the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > i")
    private int totalPage;

    public List<ProductBrief> getDetails() {
        return details;
    }

    public void setDetails(List<ProductBrief> details) {
        this.details = details;
    }

    public int getCurrPage() {
        return currPage;
    }

    public void setCurrPage(int currPage) {
        this.currPage = currPage;
    }

    public int getTotalPage() {
        return totalPage;
    }

    public void setTotalPage(int totalPage) {
        this.totalPage = totalPage;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}
```
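To make the `{cat}`, `{delivery}`, `{page}` and `{JL}` placeholders in the matchUrl above more concrete, here is a minimal sketch of how such a URL template could be turned into a regular expression and used to pull parameter values out of a link. This is not gecco's implementation: the class `UrlTemplate` and its strict matching rule (every placeholder must be present in the URL) are assumptions made purely for illustration, and gecco's own matching behavior may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of gecco: matches URLs against a
// template such as "http://host/list.html?cat={cat}&page={page}".
final class UrlTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private final Pattern regex;
    private final List<String> names = new ArrayList<>();

    UrlTemplate(String template) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so that
            // characters like '?' and '.' lose their regex meaning.
            sb.append(Pattern.quote(template.substring(last, m.start())));
            // A placeholder captures everything up to the next URL delimiter.
            sb.append("([^&/?#]*)");
            names.add(m.group(1));
            last = m.end();
        }
        sb.append(Pattern.quote(template.substring(last)));
        this.regex = Pattern.compile(sb.toString());
    }

    // Returns the placeholder values in template order, or empty if the
    // URL does not match the template exactly (an assumption of this sketch).
    Optional<Map<String, String>> match(String url) {
        Matcher m = regex.matcher(url);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            values.put(names.get(i), m.group(i + 1));
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        UrlTemplate t = new UrlTemplate(
            "http://list.jd.com/list.html?cat={cat}&delivery={delivery}&page={page}&JL={JL}&go=0");
        System.out.println(t.match(
            "http://list.jd.com/list.html?cat=9987,653,659&delivery=1&page=2&JL=4_10_0&go=0"));
    }
}
```

In this sketch a link without a `page` parameter does not match at all; a real crawler may well be more lenient about missing or reordered query parameters.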
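currPage and totalPage hold the pagination information from the page and support the paginated crawl described later. The ProductBrief object is a short summary of a product; it mainly contains the title, preview image, detail page URL, and so on.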
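```java
import com.geccocrawler.gecco.annotation.Attr;
import com.geccocrawler.gecco.annotation.Href;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Image;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

public class ProductBrief implements HtmlBean {

    private static final long serialVersionUID = -377053120283382723L;

    // SKU code, read from the data-sku attribute
    @Attr("data-sku")
    @HtmlField(cssPath=".j-sku-item")
    private String code;

    // Product title text
    @Text
    @HtmlField(cssPath=".p-name> a > em")
    private String title;

    // Preview image URL
    @Image({"data-lazy-img", "src"})
    @HtmlField(cssPath=".p-img > a > img")
    private String preview;

    // Detail page link; click=true asks gecco to keep crawling it
    @Href(click=true)
    @HtmlField(cssPath=".p-name > a")
    private String detailUrl;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getPreview() {
        return preview;
    }

    public void setPreview(String preview) {
        this.preview = preview;
    }

    public String getDetailUrl() {
        return detailUrl;
    }

    public void setDetailUrl(String detailUrl) {
        this.detailUrl = detailUrl;
    }

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }
}
```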
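The click attribute of @Href(click=true) deserves a note. As the name suggests, it marks a link that we want gecco to keep following and crawl. Links annotated with click=true are added to the download queue automatically, so there is no need to call SchedulerContext.into() by hand.

## Writing the business logic for ProductList

Once a ProductList has been scraped, it usually needs to be persisted, that is, the basic product information is stored in a database. There are many ways to do this, and this example does not cover them. gecco supports integration with Spring, so pipelines can be developed with Spring; see the gecco-spring project for reference. This example simply prints to the console. The ProductList pipeline has one more important job: handling pagination. List pages usually span many pages, and to crawl all of them we need to put the link to the next page into the crawl queue.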
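```java
import org.apache.commons.lang3.StringUtils;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;

@PipelineName("productListPipeline")
public class ProductListPipeline implements Pipeline<ProductList> {

    @Override
    public void process(ProductList productList) {
        HttpRequest currRequest = productList.getRequest();
        // Continue with the next page, if there is one
        int currPage = productList.getCurrPage();
        int nextPage = currPage + 1;
        int totalPage = productList.getTotalPage();
        if (nextPage <= totalPage) {
            String nextUrl;
            String currUrl = currRequest.getUrl();
            if (currUrl.contains("page=")) {
                // Replace the existing page parameter
                nextUrl = StringUtils.replaceOnce(currUrl, "page=" + currPage, "page=" + nextPage);
            } else {
                // No page parameter yet: append one
                nextUrl = currUrl + "&page=" + nextPage;
            }
            // Queue the next page as a sub-request of the current one
            SchedulerContext.into(currRequest.subRequest(nextUrl));
        }
    }
}
```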
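JD list pages use the page parameter to select the page number, so replacing that parameter is enough to crawl page by page. With this in place, the list information for all products can be crawled.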
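The page-parameter replacement can also be isolated into a small pure function, which makes it easy to check without running a crawl. The sketch below is one possible variant, not the code used by the pipeline above: the class `NextPageUrl` and its `build` method are hypothetical names, and it swaps the plain string replacement for a regular expression that only touches a query parameter named exactly `page`. It has no gecco dependency; the returned URL would still have to be queued via SchedulerContext.into() as shown earlier.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of gecco.
final class NextPageUrl {
    // Digits that directly follow "?page=" or "&page=". The lookbehind
    // keeps the prefix out of the match, so only the number is replaced
    // and parameters such as "mypage=" are left alone.
    private static final Pattern PAGE_VALUE = Pattern.compile("(?<=[?&]page=)\\d+");

    // Returns the URL of the next page, or empty when currPage is the last page.
    static Optional<String> build(String currUrl, int currPage, int totalPage) {
        int nextPage = currPage + 1;
        if (nextPage > totalPage) {
            return Optional.empty();
        }
        Matcher m = PAGE_VALUE.matcher(currUrl);
        if (m.find()) {
            // An existing page parameter: replace its value.
            return Optional.of(m.replaceFirst(String.valueOf(nextPage)));
        }
        // No page parameter yet: append one with the right separator.
        String separator = currUrl.contains("?") ? "&" : "?";
        return Optional.of(currUrl + separator + "page=" + nextPage);
    }

    public static void main(String[] args) {
        String url = "http://list.jd.com/list.html?cat=9987,653,659&delivery=1&page=1&JL=4_10_0&go=0";
        System.out.println(build(url, 1, 5).orElse("(last page reached)"));
    }
}
```

Because the replacement is driven by the parameter name rather than by the text `"page=" + currPage`, this variant still works if the page number reported by the page and the one in the URL ever disagree.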