Scrapy Framework: A CrawlSpider Crawler Example

Date: 2020-05-09

Tags: scrapy, framework, crawlspider, crawler example    Category: Python


Scrapy--CrawlSpider

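The two spider classes most commonly used in Scrapy are Spider and CrawlSpider.

This example implements the crawler with the CrawlSpider class.

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule objects) that provide a convenient mechanism for following links. That makes it the better fit for jobs that extract links from crawled pages and keep crawling them, such as crawling a large recruitment site.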

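Create the project

scrapy startproject tencent    # create the project

Create the spider from a template

scrapy genspider -t crawl tencent 'hr.tencent.com'    # tencent is the spider name; hr.tencent.com is the allowed domain

Once the template has been created, a file named tencent.py is generated: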

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

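The purpose of Link Extractors is simple: to extract links.
Each LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.
A LinkExtractor needs to be instantiated only once; its extract_links method is then called repeatedly, once per response, to extract links.

Main parameters:
allow: URLs that match the regular expression (or list of regular expressions) are extracted. If empty, every link matches.
deny: URLs that match this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
allow_domains: the domains whose links will be extracted.
deny_domains: the domains whose links will never be extracted.
restrict_xpaths: XPath expressions that select the regions of the response in which to look for links; it works together with allow to filter links.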

rules

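rules contains one or more Rule objects, each of which defines a specific behavior for crawling the site. If several rules match the same link, the first one, in the order in which the rules are defined in this collection, is used.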

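Parameters:
link_extractor: a Link Extractor object that defines which links to extract.

callback: the function called for every link obtained from link_extractor. The value given for this parameter is used as the callback, and the callback receives a response as its first argument.

follow: a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: names the spider method that is called with each list of links obtained from link_extractor. It is mainly used for filtering.

process_request: names the spider method that is called for every request extracted by this rule. (It is used to filter requests.)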

The example code follows:

The items file (items.py)

import scrapy


class TencentItem(scrapy.Item):
    # Job title
    name = scrapy.Field()
    # Link to the detail page
    positionlink = scrapy.Field()
    # Job category
    positiontype = scrapy.Field()
    # Number of openings
    peoplenum = scrapy.Field()
    # Work location
    worklocation = scrapy.Field()
    # Publication date
    publish = scrapy.Field()

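The pipeline file (pipelines.py)

import json


class TencentPipeline(object):
    def __init__(self):
        # Open in text mode with an explicit encoding so non-ASCII text is written as UTF-8
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line, followed by a comma
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # Write the string directly; encoding it to bytes first fails on Python 3
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()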
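
A pipeline like this can be exercised by hand before it is wired into a crawl. The sketch below is one possible variation, not the pipeline used in this project: the class name JsonLinesPipeline, the path parameter, and the sample items are assumptions made for illustration. It writes one JSON object per line with no trailing comma, so the output can be read back line by line.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Same shape as TencentPipeline, but the output path is a parameter (hypothetical)."""

    def __init__(self, path):
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand: process_item once per item, then close_spider at the end.
out_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
pipeline = JsonLinesPipeline(out_path)
samples = [
    {"name": "工程师", "peoplenum": "2"},
    {"name": "Designer", "peoplenum": "1"},
]
for sample in samples:
    pipeline.process_item(sample, spider=None)
pipeline.close_spider(spider=None)

# Read the file back to confirm that every line parses on its own.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```

The comma-terminated lines written by TencentPipeline do not form a valid JSON document on their own, which is why this sketch drops the comma.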

The settings file (settings.py)

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

LOG_FILE = 'tenlog.log'
LOG_LEVEL = 'DEBUG'
LOG_ENCODING = 'utf-8'

ROBOTSTXT_OBEY = True

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

The spider file

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor class, used to extract links that match a rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tencent.items import TencentItem


class TencentSpider(CrawlSpider):
    name = 'tencent1'
    # Optional; when set, it restricts the crawl to these domains
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']

    # Matching rule for the links found in a response; only matching links are extracted
    pagelink = LinkExtractor(allow=r'start=\d+')
    print(pagelink)

    # Several Rule objects can be listed here
    rules = [
        # Add follow=True when the extracted links should be followed.
        # When a callback is set, follow defaults to False, so it is passed explicitly.
        # Every link that matches the rule is requested, and the callback handles the response.
        # In effect, a rule processes requests in bulk.
        Rule(pagelink, callback='parse_item', follow=True),
    ]

    # Do not define a parse method: CrawlSpider already implements it,
    # and overriding it breaks the crawl
    def parse_item(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # Store the data in an item object, which is used like a dict
            item = TencentItem()
            # each.xpath('./td[1]/a/text()') returns a selector list; extract()
            # converts it to a list of Unicode strings and [0] takes the first one
            # Job title
            item['name'] = each.xpath('./td[1]/a/text()').extract()[0]
            # Link to the detail page
            item['positionlink'] = each.xpath('./td[1]/a/@href').extract()[0]
            # Job category
            item['positiontype'] = each.xpath("./td[2]/text()").extract()[0]
            # Number of openings
            item['peoplenum'] = each.xpath('./td[3]/text()').extract()[0]
            # Work location
            item['worklocation'] = each.xpath('./td[4]/text()').extract()[0]
            # Publication date
            item['publish'] = each.xpath('./td[5]/text()').extract()[0]
            # Hand the item over to the pipeline
            yield item