Python Web Crawler 4 - Getting Started with scrapy

This post was first published at www.litreily.top

As a powerful crawling framework, scrapy is well worth learning properly. This article is a summary of my own experience learning and using scrapy. The content is fairly basic; think of it as introductory notes covering scrapy's core concepts and how to use them.

scrapy framework

First, here is the classic scrapy architecture diagram:

scrapy framework

The scrapy framework consists of the following components:

  1. Scrapy Engine
  2. Spiders
  3. Scheduler
  4. Downloader
  5. Item Pipeline
  6. Downloader Middlewares
  7. Spider Middlewares

spider process

The crawling process can be summarized as follows:

  1. The Engine gets the initial URL(s) to crawl from the Spider and passes them to the Scheduler.
  2. The Scheduler stores the URLs in a queue.
  3. The Engine asks the Scheduler for the next URL to crawl and forwards the resulting request to the Downloader through the Downloader Middlewares.
  4. The Downloader fetches the page from the web and returns the downloaded response to the Engine through the Downloader Middlewares.
  5. The Engine passes the response to the Spider through the Spider Middlewares.
  6. The Spider parses the page and hands the resulting Items and new links back to the Engine through the Spider Middlewares.
  7. The Engine sends the Items received from the Spider to the Item Pipeline, and sends the new requests to the Scheduler.
  8. Steps 2-7 repeat until there are no more URLs left to crawl.

Note that the Item Pipeline is mainly responsible for data cleaning, validation, and persistence; the Downloader Middlewares act as hooks between the Downloader and the Engine, used to monitor or modify download requests and downloaded pages, for example rewriting request headers; the Spider Middlewares act as hooks between the Spider and the Engine, handling the Spider's input and output, namely page responses and the Items and requests produced when the Spider parses a page.
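
For example, a minimal sketch of a custom downloader middleware that rewrites request headers might look like this (the class name and User-Agent string are made up, and the class would still need to be enabled in the project's DOWNLOADER_MIDDLEWARES setting to take effect):

# middlewares.py (sketch): a hypothetical downloader middleware that overrides
# the User-Agent header of every outgoing request
class CustomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # called for every request before it is handed to the Downloader
        request.headers['User-Agent'] = 'Mozilla/5.0 (demo crawler)'
        return None  # returning None lets scrapy continue handling the request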

Items

As for what Items are: I think of an Item as a unit of data produced when the Spider parses a page, containing a group of fields. For example, if you are crawling product information from some site, each crawled page may yield several products, and each product has a name, price, production date, style, and so on. We can then define an Item like this:

from scrapy.item import Item
from scrapy.item import Field

class GoodsItem(Item):
    name = Field()
    price = Field()
    date = Field()
    types = Field()

Field() is essentially an extension of the dict type. As shown above, one Item instance corresponds to one product, and a single page may contain one or more products. All Item fields are assigned in the Spider and then handed to the Item Pipeline via the Engine. Concrete implementations will appear in the examples of later posts; this article only covers scrapy's basic concepts and usage.
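
As a rough sketch of that hand-off (not the implementation from the later posts), a minimal pipeline that validates the items yielded by a spider might look like this; the validation rule is purely illustrative, and the pipeline only takes effect once it is listed in ITEM_PIPELINES in settings.py:

# pipelines.py (sketch): drop items that are missing a name, keep the rest
from scrapy.exceptions import DropItem

class GoodsPipeline(object):

    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('item has no name, discarding it')
        return item  # returned items continue to the next pipeline / feed export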

Install

with pip

pip install scrapy

or conda

conda install -c conda-forge scrapy

The basic commands are as follows:

D:\WorkSpace>scrapy --help
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

If you want to use a virtual environment, install virtualenv first:

pip install virtualenv
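
A typical workflow is then to create and activate an environment and install scrapy inside it; this is just a sketch, and venv is an arbitrary environment name:

virtualenv venv
# Windows:
venv\Scripts\activate
# Linux / macOS:
# source venv/bin/activate
pip install scrapy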

scrapy startproject

scrapy startproject <project-name> [project-dir]

This command generates a new scrapy project. Taking demo as an example:

$ scrapy startproject demo
...
You can start your first spider with:
    cd demo
    scrapy genspider example example.com

$ cd demo
$ tree
.
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

As you can see, startproject automatically generates a set of folders and files, where:

  1. scrapy.cfg: the project configuration file; usually there is no need to modify it
  2. items.py: the file in which items are defined, such as the GoodsItem above
  3. middlewares.py: middleware code, which by default contains a downloader middleware and a spider middleware
  4. pipelines.py: the item pipelines, which process the items returned by the spider (cleaning, validation, persistence, and so on)
  5. settings.py: the global configuration file, containing all kinds of project-wide settings (a short sketch follows this list)
  6. spiders: the folder that holds all spider files; note that one project can contain multiple spiders
  7. __init__.py: marks the containing folder as a Python package
  8. __pycache__: stores the .pyc files (a cross-platform byte code) generated by the interpreter; in Python 2 such files are kept in the same folder as the .py files
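
As an illustration, a few commonly adjusted options in settings.py might look like the following; these particular values are assumptions rather than what startproject writes out:

# settings.py (sketch): a few frequently tuned options
ROBOTSTXT_OBEY = True      # respect robots.txt
DOWNLOAD_DELAY = 1         # wait one second between requests to the same site
USER_AGENT = 'demo (+http://www.litreily.top)'
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,   # enable a pipeline; lower numbers run first
}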

scrapy genspider

Once the project has been generated, you can use the scrapy genspider command to automatically create a spider file. For example, to crawl the homepage of Huaban (花瓣網), run the following:

$ cd demo
$ scrapy genspider huaban www.huaban.com

The generated spider file huaban.py looks like this by default:

# -*- coding: utf-8 -*-
import scrapy


class HuabanSpider(scrapy.Spider):
    name = 'huaban'
    allowed_domains = ['www.huaban.com']
    start_urls = ['http://www.huaban.com/']

    def parse(self, response):
        pass
  • The spider class inherits from scrapy.Spider
  • name is a required attribute that identifies the spider
  • allowed_domains lists the domains the spider is allowed to crawl; links outside these domains are discarded
  • start_urls stores the spider's starting URLs; since it is a list, it can hold multiple URLs at once

If you want to customize the initial requests, you can also override the start_requests method of scrapy.Spider; we won't go into detail here.
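
A minimal sketch of such an override might look like this; the second URL is made up:

import scrapy


class HuabanSpider(scrapy.Spider):
    name = 'huaban'

    def start_requests(self):
        # build the initial requests ourselves instead of relying on start_urls
        urls = ['http://huaban.com/', 'http://huaban.com/discovery/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass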

parse is the default callback: once the Downloader has fetched a page, this function is called to parse it, and response holds the response data for the request. As for parsing page content, scrapy has several built-in selectors (Selector), including XPath selectors, CSS selectors, and regular-expression matching. Below are some selector usage examples to give you a more intuitive feel for how they are used.

# xpath selector
response.xpath('//a')
response.xpath('./img').extract()
response.xpath('//*[@id="huaban"]').extract_first()
response.xpath('//*[@id="Profile"]/div[1]/a[2]/text()').extract_first()

# css selector
response.css('a').extract()
response.css('#Profile > div.profile-basic').extract_first()
response.css('a[href="test.html"]::text').extract_first()

# re selector
response.xpath('.').re('id:\s*(\d+)')
response.xpath('//a/text()').re_first('username: \s(.*)')

Note that response cannot call re or re_first directly; as shown above, they are chained onto the result of an xpath or css selection.
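
To tie the selectors back to the Items discussed earlier, a parse implementation might look like the following sketch; the XPath expressions and the next-page link are hypothetical and only show how items are yielded and how new requests get sent back to the scheduler:

# sketch: assumes `from demo.items import GoodsItem` at the top of the spider file
def parse(self, response):
    # one item per product block on the page (hypothetical page structure)
    for product in response.xpath('//div[@class="product"]'):
        item = GoodsItem()
        item['name'] = product.xpath('./h2/text()').extract_first()
        item['price'] = product.xpath('./span[@class="price"]/text()').extract_first()
        yield item

    # follow a "next page" link if one exists; the engine hands it to the scheduler
    next_page = response.xpath('//a[@class="next"]/@href').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)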

scrapy crawl

Once the spider is written, you can use the scrapy crawl command to start the crawl.

When you run scrapy -h inside an existing scrapy project directory, you get more help information than before the project was created, including the scrapy crawl command used to launch a crawl:

$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

As the help for scrapy crawl shows, the command accepts many optional arguments, but there is only one required argument: spider, the name of the spider to run, which corresponds to each spider's name attribute.

scrapy crawl huaban
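
The -o option listed in the help above can be combined with this to dump the scraped items straight to a file; the file name here is arbitrary, and the output format is inferred from the extension:

scrapy crawl huaban -o items.json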

That covers creating and running a scrapy crawl task. Concrete examples will follow in later posts.

scrapy shell

Finally, a brief note on the scrapy shell command. It is an interactive shell, similar to the Python command line. When you are just starting to learn scrapy, or just starting to crawl an unfamiliar site, you can use it to get familiar with the various functions and selectors, experimenting and correcting mistakes until you have mastered scrapy's features.

$ scrapy shell www.huaban.com
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct  3
2017, 17:26:49) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-05-29 23:58:49 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-05-29 23:58:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-29 23:58:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-29 23:58:50 [scrapy.core.engine] INFO: Spider opened
2018-05-29 23:58:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://huaban.com/> from <GET http://www.huaban.com>
2018-05-29 23:58:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://huaban.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x03385CB0>
[s]   item       {}
[s]   request    <GET http://www.huaban.com>
[s]   response   <200 http://huaban.com/>
[s]   settings   <scrapy.settings.Settings object at 0x04CC4D10>
[s]   spider     <DefaultSpider 'default' at 0x4fa6bf0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: view(response)
Out[1]: True

In [2]: response.xpath('//a')
Out[2]:
[<Selector xpath='//a' data='<a id="elevator" class="off" onclick="re'>,
 <Selector xpath='//a' data='<a class="plus"></a>'>,
 <Selector xpath='//a' data='<a onclick="app.showUploadDialog();">添加採'>,
 <Selector xpath='//a' data='<a class="add-board-item">添加畫板<i class="'>,
 <Selector xpath='//a' data='<a href="/about/goodies/">安裝採集工具<i class'>,
 <Selector xpath='//a' data='<a class="huaban_security_oauth" logo_si'>]

In [3]: response.xpath('//a').extract()
Out[3]:
['<a id="elevator" class="off" onclick="return false;" title="回到頂部"></a>',
 '<a class="plus"></a>',
 '<a onclick="app.showUploadDialog();">添加採集<i class="upload"></i></a>',
 '<a class="add-board-item">添加畫板<i class="add-board"></i></a>',
 '<a href="/about/goodies/">安裝採集工具<i class="goodies"></i></a>',
 '<a class="huaban_security_oauth" logo_size="124x47" logo_type="realname" href="//www.anquan.org" rel="nofollow"> <script src="//static.anquan.org/static/outer/js/aq_auth.js"></script> </a>']

In [4]: response.xpath('//img')
Out[4]: [<Selector xpath='//img' data='<img src="https://d5nxst8fruw4z.cloudfro'>]

In [5]: response.xpath('//a/text()')
Out[5]:
[<Selector xpath='//a/text()' data='添加採集'>,
 <Selector xpath='//a/text()' data='添加畫板'>,
 <Selector xpath='//a/text()' data='安裝採集工具'>,
 <Selector xpath='//a/text()' data=' '>,
 <Selector xpath='//a/text()' data=' '>]

In [6]: response.xpath('//a/text()').extract()
Out[6]: ['添加採集', '添加畫板', '安裝採集工具', ' ', ' ']

In [7]: response.xpath('//a/text()').extract_first()
Out[7]: '添加採集'