Python Web Scraping: The Scrapy Framework (CrawlSpider)

Date: 2020-05-09


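Requirement: crawl the data of the entire Qiushibaike site.

Approaches:

(1) Recursive crawling based on the Spider class of the Scrapy framework

(2) Automatic crawling based on the CrawlSpider class of the Scrapy framework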

So what is CrawlSpider, and how does it crawl automatically?

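Introduction to CrawlSpider

1. Overview

CrawlSpider is a subclass of Spider. Besides inheriting Spider's functionality, it adds more powerful features of its own, the most notable being the LinkExtractor (link extractor). Spider is the base class of all spider classes.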

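2. Usage

Steps:

(1) Create a Scrapy project: scrapy startproject projectName

(2) Create the spider file: scrapy genspider -t crawl spidername www.xxx.com

Note that this command has an extra -t crawl compared with the one used before. It means the generated spider file is based on the CrawlSpider class rather than on the Spider base class.

(3) The generated spider file differs from one based on the Spider base class.

Requirement: crawl the pagination URLs of the Chouti site.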

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Crawl the pagination URLs of the Chouti site
# Note: the class inherited here is CrawlSpider, not Spider
class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/r/scoff/hot/1']
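    # allow is the regular expression the link extractor uses to select links
    rules = (
        # Rule: parses, in the specified way, the pages behind the links that the link extractor extracted
        # follow=True: keep applying the link extractor to the pages reached through the extracted links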
        Rule(LinkExtractor(allow=r'/r/scoff/hot/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
        item = {}
        item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        item['name'] = response.xpath('//div[@id="name"]').get()
        item['description'] = response.xpath('//div[@id="description"]').get()
        return item

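Requirement: crawl the pagination URLs of the Qiushibaike site.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Crawl the pagination URLs of the Qiushibaike site
class QiubaiSpider(CrawlSpider):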
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']
    # allow is the regular expression the link extractor uses to select links
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
    link1 = LinkExtractor(allow=r'/pic/$')
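    # Note: more than one rule can be defined here
    rules = (
        # Rule: parses, in the specified way, the pages behind the links that the link extractor extracted
        # follow=True: keep applying the link extractor to the pages reached through the extracted links
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),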
    )

    def parse_item(self, response):
        print(response)
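
As far as I understand, LinkExtractor applies each allow pattern as a regular-expression search against the link's absolute URL. Under that assumption, the two patterns used by link and link1 can be checked with the standard re module alone. The candidate URLs below are hypothetical and serve only to exercise the patterns.

```python
import re

# The two `allow` patterns from the rules above.
PAGE_PATTERN = re.compile(r"/pic/page/\d+\?s=\d+")   # used by `link`
FIRST_PAGE_PATTERN = re.compile(r"/pic/$")           # used by `link1`

# Hypothetical URLs, used only to exercise the patterns.
candidates = [
    "https://www.qiushibaike.com/pic/page/2?s=5194037",
    "https://www.qiushibaike.com/pic/",
    "https://www.qiushibaike.com/pic/page/2",
    "https://www.qiushibaike.com/text/",
]


def matching_rule(url):
    """Return the name of the first extractor whose pattern is found in url."""
    if PAGE_PATTERN.search(url):
        return "link"
    if FIRST_PAGE_PATTERN.search(url):
        return "link1"
    return None


for url in candidates:
    print(url, "->", matching_rule(url))
```

The \? in the first pattern matters: an unescaped ? would make the preceding character optional instead of matching the literal question mark of the query string. The second pattern is anchored with $ so that it matches only the first page, whose URL ends in /pic/.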