Scrapy - CrawlSpider, a Built-in Generic Spider

Date: 2019-11-12

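Following the previous post, which collected articles from the Q&A channel with a basic spider, let's try out a few of the nicer tools in the Scrapy toolbox.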

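Because most scraping jobs are similar and follow consistent patterns, Scrapy provides several more highly encapsulated generic spider classes that help us build spiders faster and more efficiently.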

# List the generic spider (Generic Spiders) templates that Scrapy provides
scrapy genspider -l

CrawlSpider

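CrawlSpider is the most commonly used of the generic spiders.
Through a set of rules, it automatically discovers and follows links on each page, which covers tasks such as (but not limited to) collecting detail pages and following category or pagination URLs. In the end, we only need to implement the 'detail page parser' logic to complete the spider.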
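
To make the rule idea concrete, here is a minimal sketch using only the standard library. It is not Scrapy's implementation: `SimpleRule` and `dispatch` are hypothetical names, and the sample URLs are invented for illustration. It only shows the general behaviour, where each discovered link is checked against the rules in order and the first matching rule decides which parser runs and whether links on that page are followed further.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SimpleRule:
    """Hypothetical stand-in for a crawl rule (not Scrapy's Rule class)."""
    allow: str                      # regex that a link must match
    callback: Optional[str] = None  # name of the parser to run on matched pages
    follow: bool = False            # keep extracting links from matched pages?


def dispatch(links: List[str], rules: List[SimpleRule]) -> List[Tuple[str, Optional[str], bool]]:
    """Return (url, callback, follow) per link, using the first rule that matches."""
    decisions = []
    seen = set()
    for url in links:
        if url in seen:
            continue  # each link is handled at most once
        for rule in rules:
            if re.search(rule.allow, url):
                seen.add(url)
                decisions.append((url, rule.callback, rule.follow))
                break  # first matching rule wins
    return decisions


rules = [
    SimpleRule(allow=r'/xc/10065/\d+\.html', follow=True),            # pagination
    SimpleRule(allow=r'/schedule/\d+\.html', callback='parse_item'),  # detail pages
]
links = [  # invented sample URLs
    'http://www.mafengwo.cn/xc/10065/2.html',
    'http://www.mafengwo.cn/schedule/123456.html',
    'http://www.mafengwo.cn/about.html',
]
for decision in dispatch(links, rules):
    print(decision)
```

Links that match no rule, such as the last sample URL, are simply dropped, which is why only the detail page parser needs to be written by hand.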

Here we take crawling Mafengwo's Beijing itinerary listings ( http://www.mafengwo.cn/xc/10065/ ) as the example:

# Create a spider from the generic crawl template
scrapy genspider --template crawl xinchen www.mafengwo.cn/xc/10065

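Next we design the concrete spider logic below by editing mafengwo/mafengwo/spiders/xinchen.py.
To keep the demonstration simple, this example puts the main crawling logic, the persistence logic, the data modeling logic and so on all in this one spider file.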

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import pymongo


class XinchenSpider(CrawlSpider):
    name = 'xinchen'
    allowed_domains = ['www.mafengwo.cn']
    start_urls = ['http://www.mafengwo.cn/xc/10065/']

    rules = (
        # Extract the "next page" link and follow it
        Rule(
            LinkExtractor(allow=r'/xc/10065/(\d+)\.html', restrict_xpaths='//div[@class="page-hotel"]/a[@class="ti next"]'),
            callback=None, follow=True
        ),

        # Extract detail page links and scrape them with the parse_item parser
        Rule(
            LinkExtractor(allow=r'/schedule/(\d+)\.html', restrict_xpaths='//div[@class="post-list"]/ul/li/dl/dt/a'),
            callback='parse_item', follow=False
        ),
    )

    def __init__(self, *args, **kwargs):
        super(XinchenSpider, self).__init__(*args, **kwargs)  # call the parent constructor
        # MongoDB configuration
        self.client = pymongo.MongoClient('localhost')
        self.collection = self.client['mafengwo']['xinchen_pages']

    def closed(self, reason):
        # Called by Scrapy when the spider closes; release the MongoDB connection
        self.client.close()

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        item['author'] = response.xpath('//dl[@class="flt1 show_from clearfix"]/dd/p/a[@class="name"]/text()').extract_first()
        item['title'] = response.xpath('//p[@class="dd_top"]/a/text()').extract_first()
        item['content'] = response.xpath('//div[@class="guide"]').extract_first()
        # Upsert into MongoDB keyed by URL so that re-crawled pages do not create duplicates
        self.collection.replace_one({'url': item['url']}, item, upsert=True)
        yield item
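
The upsert in parse_item is what keeps repeated crawls from piling up duplicate records. As a rough illustration of that behaviour only, the sketch below uses a plain dict in place of the MongoDB collection; `upsert_by_url` and the sample items are hypothetical and not part of the spider.

```python
def upsert_by_url(store, item):
    """Insert item, or replace the stored record that has the same 'url'.

    `store` is a plain dict standing in for the MongoDB collection.
    Returns True if a new record was inserted, False if one was replaced.
    """
    is_new = item['url'] not in store
    store[item['url']] = dict(item)  # store a copy so later edits do not leak in
    return is_new


store = {}
first = {'url': 'http://www.mafengwo.cn/schedule/123456.html', 'title': 'draft title'}
second = {'url': 'http://www.mafengwo.cn/schedule/123456.html', 'title': 'final title'}

print(upsert_by_url(store, first))   # first time this URL is seen
print(upsert_by_url(store, second))  # same URL again, so the record is replaced
print(len(store))
```

Keying on the URL means the most recent crawl of a page overwrites the older copy instead of adding a second one.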

Run the spider