The Scrapy Framework: CrawlSpider

Date: 2019-12-11

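Problem: how do you crawl the data of an entire website?
Solutions:

- Sending requests manually: recursive crawling built on Scrapy's Spider class (the spider yields Request objects whose callback is the parse method itself).
- CrawlSpider: automatic crawling built on the CrawlSpider class (more concise and more efficient).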

1. Introduction to CrawlSpider

CrawlSpider is a subclass of Spider.

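1.1 CrawlSpider features

CrawlSpider is more powerful than Spider: besides inheriting Spider's features, it adds capabilities of its own.
The most notable of these are the link extractor (LinkExtractor) and the rule parser (Rule).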

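1.2 When to use Spider and when to use CrawlSpider

Spider is the base class of all spiders, and it was designed only to crawl the pages listed in start_urls. When the crawl has to continue through URLs extracted from the pages already fetched, CrawlSpider is the better fit.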

2. Using CrawlSpider

2.1 Creating the project and the CrawlSpider spider file

# Create the Scrapy project:
$ scrapy startproject crawlSpiderPro
$ cd crawlSpiderPro/

# Create a spider file based on CrawlSpider
$ scrapy genspider -t crawl chouti dig.chouti.com
Created spider 'chouti' using template 'crawl' in module:
  crawlSpiderPro.spiders.chouti

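Note: compared with the usual genspider command, this one adds "-t crawl", which means the generated spider file is based on the CrawlSpider class rather than on the Spider base class.

2.2 Examining the generated spider file: chouti.py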

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor   # class for the link extractor
from scrapy.spiders import CrawlSpider, Rule   # Rule is the class for the rule parser

class ChoutiSpider(CrawlSpider):   # the parent class here is CrawlSpider
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    rules = (
        # rules is a tuple of Rule (rule parser) objects
        # the first argument of a Rule is a link extractor object
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):   # parsing callback
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

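2.3 LinkExtractor: the link extractor

The link extractor pulls out of a page the links (URLs) that match a given regular expression.

LinkExtractor(
    allow=r'Items/',     # URLs matching this regular expression are extracted; if empty, every link matches
    deny=xxx,            # URLs matching this regular expression are not extracted
    restrict_xpaths=xxx, # only links inside the regions selected by this XPath expression are extracted
    restrict_css=xxx,    # only links inside the regions selected by this CSS selector are extracted
    deny_domains=xxx,    # links pointing to these domains are not extracted
)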

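The allow parameter takes a regular expression.
Once allow is set, the link extractor uses that regular expression to pick the matching links out of the page. Every extracted link is handed to the rule parser.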
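
To get a feel for how allow and deny filter links, the sketch below mimics the matching step with nothing but Python's re module. It is not Scrapy code: the helper link_is_extracted and the sample URLs are made up for illustration, and it assumes that patterns are searched anywhere in the absolute URL (not anchored at its start) and that deny takes precedence over allow.

```python
import re

def link_is_extracted(url, allow=(), deny=()):
    """Hypothetical helper: mimic the allow/deny filtering of a link extractor."""
    # An empty allow list means "match everything".
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny wins over allow: a URL matching any deny pattern is dropped.
    if any(re.search(p, url) for p in deny):
        return False
    return True

# Made-up sample links, as they might appear on a listing page.
urls = [
    "https://dig.chouti.com/all/hot/recent/2",
    "https://dig.chouti.com/all/hot/recent/3",
    "https://dig.chouti.com/about",
]
pattern = r"/all/hot/recent/\d+"
kept = [u for u in urls if link_is_extracted(u, allow=[pattern])]
print(kept)
```

Only the two pagination URLs survive the filter; the /about link is dropped because it does not match the allow pattern.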

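2.4 Rule: the rule parser

Once the rule parser receives the links sent by the link extractor, it issues a request for each of them and fetches the corresponding page.
It then parses the specified data out of that page according to the rule it was given.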

(1) Rule format

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

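(2) Parameters

Parameter 1: the link extractor to use.
Parameter 2: callback, the function that parses the data of each fetched page.
Parameter 3: follow, whether the link extractor is applied again to the pages reached through the links it extracted.

When callback is None, follow defaults to True; otherwise it defaults to False.
With follow=False, the link extractor only extracts the page-number URLs shown on the current page.
With follow=True, it keeps extracting page links from each newly fetched page until every page link has been found, and duplicates are removed automatically.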
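
The difference between the two follow settings can be simulated without Scrapy. The sketch below is only a model under stated assumptions: SITE is a made-up in-memory site (each page maps to the hrefs found on it), crawl is a hypothetical function, and the seen set stands in for Scrapy's duplicate filter.

```python
import re

# Hypothetical site: each page maps to the hrefs found in its HTML.
SITE = {
    "/": ["/all/hot/recent/1", "/all/hot/recent/2", "/about"],
    "/all/hot/recent/1": ["/all/hot/recent/2", "/all/hot/recent/3"],
    "/all/hot/recent/2": ["/all/hot/recent/1", "/all/hot/recent/3"],
    "/all/hot/recent/3": ["/all/hot/recent/4"],
    "/all/hot/recent/4": [],
}
ALLOW = re.compile(r"/all/hot/recent/\d+")

def crawl(start, follow):
    """Return the pages that would reach the callback, in visit order."""
    seen = {start}            # duplicate filter: each URL is requested once
    queue = [(start, True)]   # (url, extract links from this page?)
    parsed = []
    while queue:
        url, extract = queue.pop(0)
        if url != start:
            parsed.append(url)   # the callback runs on extracted links only
        if not extract:
            continue
        for link in SITE.get(url, []):
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append((link, follow))
    return parsed

print(crawl("/", follow=False))
print(crawl("/", follow=True))
```

With follow=False only the two page links visible on the start page are parsed; with follow=True the crawl also reaches pages 3 and 4 through the pagination, and each page is visited once.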

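2.5 The overall CrawlSpider crawl flow

a. The spider first fetches the page content of the start URL.
b. The link extractor extracts the links from the page fetched in step a according to its extraction rule.
c. The rule parser fetches the pages behind the extracted links and parses them according to its parsing rule.
d. The parsed data is wrapped in an item and submitted to the pipeline for persistent storage.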

3. Hands-on project: the Chouti site (dig.chouti.com)

(1) chouti.py

import scrapy
from scrapy.linkextractors import LinkExtractor   # class for the link extractor
from scrapy.spiders import CrawlSpider, Rule   # Rule is the class for the rule parser
from crawlSpiderPro.items import CrawlspiderproItem

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # define the link extractor and its extraction rule
    Link = LinkExtractor(allow=r'/all/hot/recent/\d+')    # matches the href values of the pagination <a> tags

    rules = (
        # define the rule parser; the parsing logic is supplied by the callback
        Rule(Link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):   # parsing callback
        """Parsing logic used by the rule parser."""
        div_list = response.xpath('//div[@id="content-list"]/div')

        for div in div_list:
            # create the item
            item = CrawlspiderproItem()
            # extract the news text; default='' avoids calling strip() on None when nothing matches
            item['content'] = div.xpath('.//div[@class="part1"]/a/text()').extract_first(default='').strip('\n')
            # extract the news author
            item['author'] = div.xpath('.//div[@class="part2"]/a[4]/b/text()').extract_first(default='').strip('\n')
            yield item  # submit the item to the pipeline

(2) items.py

import scrapy

class CrawlspiderproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

(3) pipelines.py

class CrawlspiderproPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
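        print('Spider started')
        # utf-8 is set explicitly so non-ASCII text is written correctly on any platform
        self.fp = open('./data.txt', 'w', encoding='utf-8')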

    def process_item(self, item, spider):
        # write each item submitted by the spider to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
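
The pipeline hooks can be checked outside Scrapy by calling them by hand in the order Scrapy uses: open_spider once, process_item once per item, close_spider once. The sketch below is an illustration under assumptions, not part of the project: FilePipeline is a hypothetical variant that takes the output path as an argument so it can write to a temporary directory, and plain dicts with made-up values stand in for CrawlspiderproItem.

```python
import os
import tempfile

class FilePipeline:
    """Same hooks as the pipeline above; the path argument is an assumption added for testing."""
    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

# Drive the hooks by hand in the order Scrapy calls them.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
pipeline = FilePipeline(path)
pipeline.open_spider(spider=None)
for item in [{'author': 'alice', 'content': 'first post'},
             {'author': 'bob', 'content': 'second post'}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Each item becomes one "author:content" line, which is the format the project's process_item writes to data.txt.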

(4) settings.py

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' # present the requests as coming from a browser

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # do not obey the site's robots.txt, so that pages it disallows can still be crawled

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawlSpiderPro.pipelines.CrawlspiderproPipeline': 300,
}

(5) Running the spider