Web Crawlers (14): The Scrapy Framework (1), Getting to Know Scrapy and a First Example

Date: 2019-12-29

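1. The Scrapy Framework

Scrapy is a powerful and efficient crawling framework with many extension components and a high degree of configurability and extensibility. It can cope with many of the anti-crawling measures that websites deploy, and it is one of the most widely used crawling frameworks in Python.

1.1 Introduction to Scrapy

1.1.1 Architecture

Scrapy is an asynchronous framework built on Twisted and implemented in pure Python. Its architecture is clean, its modules are loosely coupled, and it is highly extensible, so it adapts to a wide range of requirements. Writing a crawler usually means customizing only a few modules.

It consists of the following components: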

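Engine: handles the data flow of the whole system and triggers events. It is the core of the framework.

Item: defines the data structure of the scraped results. Scraped data is stored in Item objects.

Scheduler: accepts requests from the Engine, adds them to a queue, and hands them back when the Engine asks for the next request.

Downloader: downloads web page content and returns it to the Spiders.

Spiders: define the crawling logic and the parsing rules for pages. They are mainly responsible for parsing responses and producing extracted results and new requests.

Item Pipeline: processes the Items that the Spiders extract from pages. Its main tasks are cleaning, validating, and storing data.

Downloader Middlewares: a hook framework between the Engine and the Downloader that processes the requests and responses passing between them.

Spider Middlewares: a hook framework between the Engine and the Spiders that processes the responses going into the Spiders and the results and new requests coming out of them.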

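1.1.2 Data Flow

The data flow in Scrapy is controlled by the Engine and proceeds as follows:

(1) The Engine opens a website, finds the Spider that handles that site, and asks the Spider for the first URL to crawl.

(2) The Engine obtains the first URL to crawl from the Spider and schedules it with the Scheduler as a Request.

(3) The Engine asks the Scheduler for the next URL to crawl.

(4) The Scheduler returns the next URL to crawl to the Engine, and the Engine sends it through the Downloader Middlewares to the Downloader.

(5) Once the page has been downloaded, the Downloader generates a Response for it and sends it through the Downloader Middlewares to the Engine.

(6) The Engine receives the Response from the Downloader and sends it through the Spider Middlewares to the Spider for processing.

(7) The Spider processes the Response and returns the scraped Items and new Requests to the Engine.

(8) The Engine passes the Items returned by the Spider to the Item Pipeline and the new Requests to the Scheduler.

(9) Steps (2) through (8) repeat until the Scheduler has no more Requests. The Engine then closes the website and the crawl ends.

Because the components cooperate, each handles a distinct job, and all of them support asynchronous processing, Scrapy makes the most of the available network bandwidth and greatly improves the efficiency of crawling and processing data.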
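The loop described above can be sketched as a toy, synchronous model in plain Python. This is only an illustration of how the components hand data to one another, not how Scrapy is implemented: the real framework is asynchronous and routes everything through middlewares. FAKE_SITE and every function name below are made up for the example.

```python
from collections import deque

# Toy stand-in for the web: URL -> (items on the page, links to follow).
# These URLs and values are invented for illustration.
FAKE_SITE = {
    "page1": (["a", "b"], ["page2"]),
    "page2": (["c"], ["page3", "page1"]),
    "page3": (["d"], []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return {"url": url, "body": FAKE_SITE[url]}


def parse(response):
    """Spider: yield scraped items and new requests from a response."""
    items, links = response["body"]
    for value in items:
        yield ("item", {"value": value, "source": response["url"]})
    for link in links:
        yield ("request", link)


def crawl(start_urls):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice
    stored = []                     # Item Pipeline: here it only collects items
    while scheduler:                # stop once no requests remain
        url = scheduler.popleft()
        response = download(url)
        for kind, payload in parse(response):
            if kind == "item":
                stored.append(payload)
            elif payload not in seen:
                seen.add(payload)
                scheduler.append(payload)
    return stored


results = crawl(["page1"])
print([item["value"] for item in results])
```

The crawl ends on its own once the queue is empty, which mirrors step (9).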

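1.1.3 Project Structure

Scrapy projects are created from the command line, while the code itself is still written in an IDE or editor. Once a project has been created, its files are laid out as follows: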

scrapy.cfg
project/
    __init__.py
    items.py
    pipelines.py
    settings.py
    middlewares.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

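The files serve the following purposes:

scrapy.cfg: the configuration file of the Scrapy project. It defines the path to the project's settings module, deployment information, and similar details.

items.py: defines the Item data structures. All Item definitions can go here.

pipelines.py: defines the Item Pipeline implementations. All Item Pipelines can go here.

settings.py: defines the project-wide settings.

middlewares.py: defines the Spider Middlewares and Downloader Middlewares implementations.

spiders: contains the Spider implementations, one file per Spider.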
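To make the role of scrapy.cfg more concrete, here is a small sketch that reads a typical generated file with the standard library's configparser. The file contents are an assumption: they show what a new project named tutorial (the name used in section 1.2) usually looks like, and the exact text may differ between Scrapy versions.

```python
import configparser

# Typical contents of a freshly generated scrapy.cfg. Treat the exact
# text as an assumption, since it can differ between Scrapy versions.
SCRAPY_CFG = """
[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
"""

config = configparser.ConfigParser()
config.read_string(SCRAPY_CFG)

# Path of the settings module, and the project name used for deployment.
settings_module = config["settings"]["default"]
deploy_project = config["deploy"]["project"]
print(settings_module)
print(deploy_project)
```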

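1.2 Getting Started with Scrapy

Next, let's write a simple project to get a general feel for Scrapy's basic usage and how it works.

1.2.1 Creating a Project

First install the Scrapy package, then create a Scrapy project.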

pip install scrapy -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

To create a Scrapy project, generate the project files directly with the scrapy command:

scrapy startproject tutorial

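This command can be run in any folder.

It creates a folder named tutorial whose layout matches the structure shown in section 1.1.3, with a tutorial/ package in place of project/.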

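1.2.2 Creating a Spider

A Spider is a class you define yourself. Scrapy uses it to fetch content from web pages and parse the results. The class must inherit from scrapy.Spider, the Spider class that Scrapy provides, and it must define the Spider's name, its initial requests, and a method that handles the downloaded responses.

A Spider can also be created from the command line. For example, to generate a Spider named top250: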

cd tutorial

scrapy genspider top250 movie.douban.com

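Enter the tutorial folder created earlier and run the genspider command. The first argument is the Spider's name and the second is the domain of the target website. When the command finishes, the spiders folder contains a new file, top250.py, which is the Spider just created.

The generated Top250Spider class has three attributes, name, allowed_domains and start_urls, and one method, parse.

name: the Spider's name. It must be unique within the project and is used to tell Spiders apart.

allowed_domains: the domains the Spider is allowed to crawl. Initial or follow-up requests whose links fall outside these domains are filtered out.

start_urls: the list of URLs the Spider crawls when it starts. The initial requests are built from it.

parse: a method of the Spider. By default, once the requests built from start_urls have been downloaded, each response is passed to this method as its only argument. The method parses the response, extracts data, and may generate further requests.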
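The effect of allowed_domains can be illustrated with a few lines of standard-library Python. This is a simplified model of the domain check, not Scrapy's own code, and is_allowed is a hypothetical helper written for this example.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed_domains = ["movie.douban.com"]
print(is_allowed("https://movie.douban.com/top250?start=25&filter=", allowed_domains))
print(is_allowed("https://www.douban.com/", allowed_domains))
```

In this model an entry such as movie.douban.com/250 would never match any host, which is why allowed_domains should hold bare domain names rather than URLs or paths.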

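1.2.3 Creating an Item

An Item is a container for scraped data and is used much like a dictionary. Compared with a dictionary, however, an Item adds a layer of protection that guards against misspelled or undeclared field names.

To create an Item, inherit from scrapy.Item and define fields of type scrapy.Field. Looking at the target website, the data available for each movie is serial_number, movie_name, introduce, star, evaluate and describe.

To define the Item, edit items.py: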

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()
    movie_name = scrapy.Field()
    introduce = scrapy.Field()
    star = scrapy.Field()
    evaluate = scrapy.Field()
    describe = scrapy.Field()

Six fields are defined here. We will use this Item in the crawl that follows.
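The protection an Item adds over a plain dictionary can be modelled in a few lines. The classes below are a toy stand-in written for this article, not Scrapy's implementation; StrictItem and MovieRecord are hypothetical names.

```python
class StrictItem(dict):
    """Toy dictionary that only accepts a fixed set of keys.

    This is a simplified model of the protection an Item offers,
    not Scrapy's actual implementation.
    """

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class MovieRecord(StrictItem):
    # Hypothetical class mirroring the fields declared in DoubanItem.
    fields = ("serial_number", "movie_name", "introduce",
              "star", "evaluate", "describe")


record = MovieRecord()
record["movie_name"] = "Example title"
try:
    record["movie_nmae"] = "typo"   # misspelled key is rejected
except KeyError as error:
    print("rejected:", error)
```

With a plain dict the misspelled key would be stored silently and the mistake would only surface later, in the exported data.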

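1.2.4 Parsing the Response

As mentioned earlier, the response argument of parse() is the result of fetching the links in start_urls. Inside parse() we can therefore work on the contents of response directly: inspect the page source of the result, analyze that source further, or find links in it that lead to the next request. The page contains both the data we want and a link to the next page, and both need to be handled.

Looking at the page structure, each page contains several blocks with class item, and each block holds serial_number, movie_name, introduce, star, evaluate and describe. So we first select all the item blocks and then extract the contents of each one.

Data can be extracted with either CSS selectors or XPath selectors. Using XPath, top250.py becomes: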

# -*- coding: utf-8 -*-
import scrapy

class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for i_item in movie_list:
            serial_number = i_item.xpath('.//em/text()').extract_first()
            movie_name = i_item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            content = i_item.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            # Join every text node and strip all whitespace, so no line of the introduction is lost
            introduce = "".join("".join(content).split())
            star = i_item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            evaluate = i_item.xpath('.//div[@class="star"]//span[4]/text()').extract_first()
            describe = i_item.xpath('.//p[@class="quote"]//span/text()').extract_first()

Here we first use a selector to pick out all the item blocks and assign them to the movie_list variable, then loop over the blocks and parse the contents of each one.
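The same select-then-extract pattern can be tried without running a crawl. The sketch below uses the limited XPath subset of the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, on a made-up, heavily simplified fragment. The real page is far larger and is not well-formed XML, so this only illustrates the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment shaped like one page of results.
PAGE = """
<body>
  <div class="item">
    <em>1</em>
    <div class="hd"><a><span>Movie A</span><span>/ Alt title</span></a></div>
    <div class="bd">
      <p>
        Director: X
        1994 / US / Drama
      </p>
      <span class="rating_num">9.7</span>
    </div>
  </div>
  <div class="item">
    <em>2</em>
    <div class="hd"><a><span>Movie B</span></a></div>
    <div class="bd">
      <p>Director: Y</p>
      <span class="rating_num">9.6</span>
    </div>
  </div>
</body>
"""

root = ET.fromstring(PAGE)
movies = []
for block in root.findall(".//div[@class='item']"):
    movies.append({
        "serial_number": block.findtext(".//em"),
        "movie_name": block.findtext(".//div[@class='hd']/a/span[1]"),
        # Same idiom as the spider: split on whitespace, then join without it.
        "introduce": "".join(block.findtext(".//div[@class='bd']/p").split()),
        "star": block.findtext(".//span[@class='rating_num']"),
    })

for movie in movies:
    print(movie)
```

Note that "".join(text.split()) removes every whitespace character, including the spaces between words, which is why the introduction comes out as one compact string.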

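1.2.5 Using the Item

Now it is time to use the Item defined earlier. An Item can be thought of as a dictionary, but it has to be instantiated before use. We then assign each parsed value to the corresponding field of the Item and finally yield the Item.

Edit top250.py: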

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import DoubanItem

class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath('.//em/text()').extract_first()
            douban_item['movie_name'] = i_item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            content = i_item.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            # Join every text node and strip all whitespace, so no line of the introduction is lost
            douban_item['introduce'] = "".join("".join(content).split())
            douban_item['star'] = i_item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            douban_item['evaluate'] = i_item.xpath('.//div[@class="star"]//span[4]/text()').extract_first()
            douban_item['describe'] = i_item.xpath('.//p[@class="quote"]//span/text()').extract_first()
            yield douban_item

In this way, all the content on the first page is parsed and stored in DoubanItem instances, one per movie.

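1.2.6 Follow-up Requests

The steps so far scrape the content of the initial page. How do we scrape the next page? We need to find information on the current page to build the next request, then find information on that page to build the request after it. Repeating this process lets us crawl the whole listing.

Scroll to the bottom of the page, where there is a Next button. Its source shows that the link is ?start=25&filter=, so the full link is https://movie.douban.com/top250?start=25&filter=. We can build the next request from this link.

Requests are built with scrapy.Request. Here we pass two arguments, url and callback.

Request arguments:

url: the link to request.

callback: the callback function. When a request that specifies this callback completes and a response is received, the Engine passes the response to the callback as an argument. The callback parses the response or generates the next request, as parse() does above.

Since parse() is the method that extracts serial_number, movie_name, introduce, star, evaluate and describe, and the next page has the same structure as the page already parsed, we can reuse parse() to parse it.

What remains is to use a selector to get the next-page link and generate a request. Append the code to the end of the parse() method: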

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import DoubanItem

class Top250Spider(scrapy.Spider):
    name = 'top250'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath('//div[@class="item"]')
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath('.//em/text()').extract_first()
            douban_item['movie_name'] = i_item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            content = i_item.xpath('.//div[@class="bd"]/p[1]/text()').extract()
            # Join every text node and strip all whitespace, so no line of the introduction is lost
            douban_item['introduce'] = "".join("".join(content).split())
            douban_item['star'] = i_item.xpath('.//span[@class="rating_num"]/text()').extract_first()
            douban_item['evaluate'] = i_item.xpath('.//div[@class="star"]//span[4]/text()').extract_first()
            douban_item['describe'] = i_item.xpath('.//p[@class="quote"]//span/text()').extract_first()
            yield douban_item
        next_link = response.xpath('//span[@class="next"]/link/@href').extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request('https://movie.douban.com/top250' + next_link, callback=self.parse)
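
The spider builds the next URL by concatenating a fixed prefix with the relative link. As a small standalone sketch of a common alternative, the standard library's urljoin resolves a relative link against the current page URL. Scrapy responses also offer a response.urljoin() helper that resolves against the response's own URL. The URLs below are the ones quoted in this section.

```python
from urllib.parse import urljoin

current_url = "https://movie.douban.com/top250"
next_href = "?start=25&filter="   # relative link taken from the Next button

# Resolve the relative link against the page it was found on.
next_url = urljoin(current_url, next_href)
print(next_url)

# The same call keeps working on later pages, where the current URL
# already carries a query string.
later_url = urljoin(next_url, "?start=50&filter=")
print(later_url)
```

Resolving against the current URL avoids hard-coding the site prefix inside the spider.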

1.2.7 Running the Spider

From the project directory, run the following command:

scrapy crawl top250

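The log scrolls by quickly and the early lines soon leave the screen. As long as no errors appear in the output, the run succeeded. Try running it yourself.

First, Scrapy prints its version number and the name of the project being started. Next it prints the settings that settings.py overrides, followed by the Middlewares and Pipelines in use. The built-in Middlewares are enabled by default and can be changed in settings.py. No Pipelines are enabled by default; they are likewise configured in settings.py.

settings.py: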

# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'


# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
DOWNLOAD_DELAY = 0.5

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300,
}


# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

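After that comes the output for each scraped page. The spider parses and follows the next-page links as it goes, until all the content has been fetched, and then stops.

Finally, Scrapy prints statistics for the whole crawl, such as the number of bytes requested, the number of requests, the number of responses, and the finish reason.

The Scrapy program has now run successfully. With very little code we scraped the content of a website, which is much more concise than writing every step by hand as before.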

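1.2.8 Saving to a File

After the run, the results only appeared in the console. What if we want to save them?

This needs no extra code. Scrapy's Feed Exports can write the scraped results out directly. For example, to save the results above as a JSON file, run: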

scrapy crawl top250 -o top250.json

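After the command runs, a top250.json file appears in the project (refresh the file view if it does not show up). It contains everything just scraped, in JSON format.

We can also write one JSON object per Item on each line. The output file uses the extension jl, short for JSON Lines: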

scrapy crawl top250 -o top250.jl

or

scrapy crawl top250 -o top250.jsonlines
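
The difference between the two formats is easy to see with the standard json module. This sketch does not call Scrapy's exporters. It only reproduces the shape of the two outputs, using two made-up records.

```python
import json

# Two made-up records standing in for scraped items.
items = [
    {"serial_number": "1", "movie_name": "Movie A"},
    {"serial_number": "2", "movie_name": "Movie B"},
]

# .json style: one document holding an array of all items.
as_json = json.dumps(items, ensure_ascii=False)

# .jl / .jsonlines style: one standalone JSON object per line.
as_json_lines = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

print(as_json)
print(as_json_lines)

# JSON Lines can be read back one record at a time.
parsed = [json.loads(line) for line in as_json_lines.splitlines()]
```

Because every line is a complete object, a JSON Lines file can be appended to and processed as a stream, which suits large crawls.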

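Many other output formats are supported, such as csv, xml, pickle and marshal, as well as remote outputs such as ftp and s3. Other outputs can be implemented with a custom ItemExporter.

For example, the following commands produce csv, xml, pickle and marshal output, and a remote ftp output: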

scrapy crawl top250 -o top250.csv

scrapy crawl top250 -o top250.xml

scrapy crawl top250 -o top250.pickle

scrapy crawl top250 -o top250.marshal

scrapy crawl top250 -o ftp://user:pass@ftp.example.com/path/to/top250.csv

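The ftp output requires a correctly configured user name, password, address and output path; otherwise an error is reported.

With Scrapy's Feed Exports we can easily write scraped results to a file, which should be enough for small projects. For more complex output, such as writing to a database, we can use an Item Pipeline.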
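
As a rough idea of what such a pipeline could look like, here is a minimal sketch that stores items in SQLite. An Item Pipeline is an ordinary Python class with a process_item method, plus optional open_spider and close_spider hooks. The class name SQLitePipeline, the movies table and the choice of columns are assumptions made for this example.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each item in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.connection = None

    def open_spider(self, spider):
        # Called when the spider starts: open the database and create the table.
        self.connection = sqlite3.connect(self.path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS movies "
            "(serial_number TEXT, movie_name TEXT, star TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item; return the item so later pipelines receive it.
        self.connection.execute(
            "INSERT INTO movies VALUES (?, ?, ?)",
            (item.get("serial_number"), item.get("movie_name"), item.get("star")),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.connection.close()


# Exercise the pipeline by hand, without running a crawl.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"serial_number": "1", "movie_name": "Movie A", "star": "9.7"}, spider=None
)
rows = pipeline.connection.execute("SELECT movie_name, star FROM movies").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

To use a class like this in the project, it would go in pipelines.py and be registered in ITEM_PIPELINES in settings.py, in the same way as the TutorialPipeline entry shown earlier.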