Learning the Scrapy Crawler, Part 1: Tutorial

Date: 2019-12-07


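Preface

I plan to write a series of articles recording what I learn while studying and using Scrapy. I will use Python 3.6 as the base runtime environment for Scrapy.

This article is my own original work; please credit the source when reposting. It is reposted from my blog, 傷神的博客: http://www.shangyang.me/2017/...html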

Installing Scrapy

I have two versions of Python installed locally, 2.7 and 3.6. As mentioned in the preface, I intend to set up Scrapy on Python 3.6.

$ pip install Scrapy

On my machine, this installs Scrapy for Python 2.7 by default.

$ pip3 install Scrapy

This installs Scrapy for Python 3.x. During installation, however, I ran into a problem: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. The fix is to extend the timeout during installation:

$ pip3 install -U --timeout 1000 Scrapy

Scrapy Tutorial

Creating the tutorial project

Run:

$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/mac/workspace/scrapy/tutorial

As you can see, Python 2.7 is used by default. What if you need to create a tutorial project that targets Python 3.x? Use python3 -m, as follows:

$ python3 -m scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/workspace/scrapy/tutorial

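Importing into PyCharm

Simply open the project at /Users/mac/workspace/scrapy/tutorial. Note that PyCharm's default interpreter is my local Python 2.7, so it needs to be changed to Python 3.6. The steps are:

Click PyCharm Community Edition in the top-left corner and open Preferences.
Click Project:tutorial, select Project Interpreter, and set the interpreter version.

Project structure

The project skeleton generated by the command is shown in the figure.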

The first Spider

Let's create a new Spider in a file named quotes_spider.py and place it in the tutorial/spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

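As you can see, the new QuotesSpider class inherits from scrapy.Spider. Here is what its attributes and methods mean:

name
The identifier of the Spider. It must be unique within the whole project.
start_requests()
Must return an iterable of Requests that the Spider will start crawling from. A Request object consists of a URL and a callback function.
parse()
The callback of the Request objects, called to handle the Response downloaded for each Request. It parses the returned content; URLs found while parsing can in turn be used to create new Requests and continue crawling.

Now let's look at the implementation in detail:

start_requests(self) creates two Request objects to be crawled, one for http://quotes.toscrape.com/pa... and one for http://quotes.toscrape.com/pa..., and returns them one by one via yield. Note that using yield makes the method a generator function.
parse(self, response) parses the Response returned for each Request. The logic here is simple: it creates one local file per page and writes the contents of response.body into it.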
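To make the file-naming logic in parse() concrete, here is a minimal standalone sketch in plain Python (no Scrapy required). It shows how the page number is derived from the URL and how yield produces values lazily. The helper names page_filename and iter_requests are made up for illustration and are not part of Scrapy.

```python
def page_filename(url):
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


def iter_requests(urls):
    # Because of `yield`, calling this function returns a generator;
    # each (url, filename) pair is produced one at a time, on demand.
    for url in urls:
        yield url, page_filename(url)


urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
pairs = list(iter_requests(urls))
print(pairs)
```

The trailing slash matters here: without it, index -2 would pick up 'page' instead of the page number.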

How to run it

The spider is run from the command line, using the scrapy command:

$ cd /Users/mac/workspace/scrapy/tutorial
$ python3 -m scrapy crawl quotes

The output looks roughly like this:

... 
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

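As you can see, the crawl produced two local HTML files, quotes-1.html and quotes-2.html.

How to extract data

Extracting from the command line

Scrapy provides a command-line tool for efficiently debugging the content you want to scrape: start the Scrapy shell, and you can quickly try out extractions interactively.

Entering the Scrapy shell

Let's try extracting data from "http://quotes.toscrape.com/page/1/" with the Scrapy shell. Run the following command to enter the shell: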

$ scrapy shell "http://quotes.toscrape.com/page/1/"

Output:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

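We are now inside the Scrapy shell. The output above shows information about the request and its response; the response status code 200 means the page was returned successfully.

Extracting with CSS selectors

Here elements are extracted from the page mainly according to the CSS selectors standard, https://www.w3.org/TR/selectors/.

Use css() to select the elements you want. The following shows how to select the <title/> element: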
```
>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
```
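As you can see, it returns a list-like SelectorList object that captures the <title/> of the http://quotes.toscrape.com/pa... page; the content is held in the data attribute of the Selector object.
There are generally two ways to extract the text content of a Selector:
- Use extract() or extract_first(). The following extracts the text of the <title/> element selected above: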
```
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
```
  extract_first() extracts the first Selector in the returned list. The following is equivalent:
```
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
```
  However, extract_first() returns None when nothing on the page matches, which avoids an IndexError.
- Use re() to extract the text with a regular expression:
```
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
```
  Note that the last regular expression returns two matched groups.
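The way re() returns a flat list, even when the pattern contains groups, can be approximated with the standard re module. This is only a sketch of the observable behaviour shown above, not Scrapy's actual implementation; the helper name re_flat is hypothetical.

```python
import re

title = 'Quotes to Scrape'


def re_flat(pattern, text):
    """Roughly mimic Selector.re(): return matches as a flat list of strings."""
    results = []
    for match in re.finditer(pattern, text):
        if match.groups():
            # One entry per capture group.
            results.extend(match.groups())
        else:
            # No groups in the pattern: keep the whole match.
            results.append(match.group(0))
    return results


whole = re_flat(r'Quotes.*', title)
first_word = re_flat(r'Q\w+', title)
groups = re_flat(r'(\w+) to (\w+)', title)
print(whole, first_word, groups)
```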

Using XPath

Besides CSS selectors, we can also use XPath expressions to select elements, for example:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

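XPath is more powerful than CSS selectors: it not only navigates the structure of the HTML but can also match on text content. It therefore supports cases that CSS cannot handle, such as finding a link element whose text is "Next Page". This makes building crawlers with XPath simpler.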
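As a rough illustration of text-based matching, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and requires well-formed markup (the `[.='text']` predicate needs Python 3.7 or later). The markup is invented for the example; Scrapy's own selectors support far more of XPath than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager markup used only for this example.
markup = (
    '<ul class="pager">'
    '<li><a href="/page/1/">Previous Page</a></li>'
    '<li><a href="/page/3/">Next Page</a></li>'
    '</ul>'
)

root = ET.fromstring(markup)

# Select <a> elements by their text content, something a plain
# CSS selector cannot express.
next_links = root.findall(".//a[.='Next Page']")
hrefs = [a.get('href') for a in next_links]
print(hrefs)
```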

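Extracting quotes and authors

Next we demonstrate how to extract the content of the http://quotes.toscrape.com home page. First, look at the structure of the page.

Each block contains a quote from a famous person. How do we extract all of this information?

Let's start with the first quote, which is from Einstein, and try to extract its text, author, and tags: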

<div class="quote">
    <span class="text">「The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.」</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's extract this information step by step.

First, enter the Scrapy shell:

$ scrapy shell 'http://quotes.toscrape.com'

Then get the list of <div class="quote" /> elements:

>>> response.css("div.quote")

This returns a list of Selectors. Since we are only parsing the first quote for now, we take the first element and store it in the variable quote:

>>> quote = response.css("div.quote")[0]

Then we extract the title, author, and tags in turn.

Extract the title:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'「The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.」'

Extract the author:

>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Extract the tags. Note that the tags are a list of strings:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

OK, that completes the extraction for one quote. How do we extract every quote on the page?

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '「The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.」'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '「It is our choices, Harry, that show what we truly are, far more than our abilities.」'}
    ... a few more of these, omitted for brevity

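In other words, we write a loop and put the information for each quote into a Python dictionary.

Extracting with a Python program

This section continues the example from the "Extracting quotes and authors" section and shows how to perform the same scraping with a Python program.

Extracting data

Modify the earlier quotes_spider.py as follows: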

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Run the spider named quotes:

$ scrapy crawl quotes

The result looks like this:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '「It is better to be hated for what you are than to be loved for what you are not.」'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "「I have not failed. I've just found 10,000 ways that won't work.」"}

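As you can see, the quotes spider we wrote in Python returns the scraped items one by one.

Saving data

The simplest way to save scraped data is Feed exports, using the following commands.

Using the JSON format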

$ scrapy crawl quotes -o quotes.json

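This command generates a file quotes.json containing all the scraped data. For historical reasons, however, Scrapy appends to the file rather than overwriting it, so if you run the command twice you end up with a broken JSON file.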
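Why appending breaks a JSON file can be seen with the standard json module alone. The sketch below simulates two export runs as strings rather than real files; the sample items are placeholders, and is_valid_json is a hypothetical helper.

```python
import json

# Each run exports one complete JSON array.
first_run = json.dumps([{"author": "Albert Einstein"}])
second_run = json.dumps([{"author": "J.K. Rowling"}])

# Appending the second export to the first gives "[...][...]",
# which is two JSON documents back to back, not one valid document.
appended = first_run + second_run


def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(first_run))  # a single run parses fine
print(is_valid_json(appended))   # two appended runs do not
```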

Using the JSON Lines format

$ scrapy crawl quotes -o quotes.jl

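Now the file is saved in JSON Lines format. The only change is the file extension, .jl.

As a side note, JSON Lines is another JSON-based format whose basic design is that each line is a valid JSON value. For example, it is a friendlier format than CSV: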

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]

It also supports nested data:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

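JSON Lines is well suited to files containing large amounts of data, since you can iterate and process one object per line. Note that with JSON Lines, Scrapy still appends to the file; but because the scraped data is added line by line, appending does not corrupt the file the way it does with the JSON format.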
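A small sketch with the standard json module shows why appended JSON Lines output stays readable. The two runs are simulated as strings, and to_json_lines and iter_json_lines are hypothetical helpers written for this example.

```python
import json

run_one = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]
run_two = [{"author": "André Gide"}]


def to_json_lines(items):
    # One JSON value per line, each line terminated by a newline.
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


# Simulate running the export twice against the same file.
appended = to_json_lines(run_one) + to_json_lines(run_two)


def iter_json_lines(text):
    # Each non-empty line is parsed independently, so the whole
    # file never has to be loaded as one JSON document.
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


authors = [item["author"] for item in iter_json_lines(appended)]
print(authors)
```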

For a small project, JSON Lines is enough. If you face a more complex project with more complex data to scrape, you can use an Item Pipeline. A placeholder pipelines file has already been created for you at tutorial/pipelines.py.
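As a rough idea of what could go into tutorial/pipelines.py, here is a minimal sketch of a pipeline. A pipeline is a plain class with a process_item(self, item, spider) method; the class name StripQuoteMarksPipeline and the clean-up it performs are assumptions made for this example, not something from the original project.

```python
class StripQuoteMarksPipeline:
    """Hypothetical pipeline: remove curly quotation marks around 'text'."""

    def process_item(self, item, spider):
        text = item.get('text')
        if text is not None:
            # Strip the leading and trailing curly double quotes.
            item['text'] = text.strip('\u201c\u201d')
        # Return the item so that any later pipeline can process it.
        return item


# Called directly here for demonstration; in a project, Scrapy calls it
# once the class is listed in the ITEM_PIPELINES setting, e.g.
# ITEM_PIPELINES = {'tutorial.pipelines.StripQuoteMarksPipeline': 300}
pipeline = StripQuoteMarksPipeline()
cleaned = pipeline.process_item(
    {'text': '\u201cIt is our choices.\u201d', 'author': 'J.K. Rowling'},
    spider=None,
)
print(cleaned)
```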

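Following the next page (extracting links)

The "How to extract data" section described in detail how to scrape information from a page. But how do we scrape the whole site? For that we must extract and follow links. Again using http://quotes.toscrape.com as the example, let's see how to extract link information.

The HTML element for the next-page link looks like this: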

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can grab it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gives us the anchor element, but what we want is its href attribute. Scrapy supports a CSS extension for this, so we can extract the attribute value directly:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Good, we now know how to get the relative URL of the next page. How do we modify our Python program so that it automatically crawls every page?

使用 scrapy.Request

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

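Briefly, the logic is as follows. The for loop finishes scraping the current page, then the code handles the next page: it first obtains the next page's relative path and stores it in next_page, then calls response.urljoin(next_page) to get the absolute URL. Finally, it builds a scrapy.Request from that absolute URL and yields it; the request is added to the crawl queue to await crawling. In this way you can dynamically crawl all the related pages.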
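The relative-to-absolute step can be tried out with the standard library's urllib.parse.urljoin, which resolves a relative path against a base URL in the same spirit as response.urljoin does against the response's URL. This is a standalone sketch, not a call into Scrapy.

```python
from urllib.parse import urljoin

current_url = 'http://quotes.toscrape.com/page/1/'

# A path starting with '/' is resolved against the site root.
absolute = urljoin(current_url, '/page/2/')

# A path without a leading '/' is resolved against the current "directory".
sibling = urljoin(current_url, '../2/')

print(absolute)
print(sibling)
```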

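On this basis you can build quite complex crawlers. Likewise, you can write different parsers for different kinds of links, so that different kinds of pages are handled separately.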

使用 response.follow

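Unlike scrapy.Request, which needs an absolute URL built from the relative path, response.follow accepts relative URLs directly, so there is no need to call urljoin. Note that response.follow returns a Request instance, which you can yield directly. The code above can therefore be simplified to: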

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

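In addition, when response.follow is given a selector for an <a> element, it uses the element's href attribute automatically. So the pagination part of the code can be simplified further: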

        # Pass the <a> selector itself (not an extracted HTML string);
        # response.follow reads its href attribute.
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

So there is no need to state the <a> attribute explicitly in the selector.

Defining more parsers

import scrapy

class AuthorSpider(scrapy.Spider):

    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

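This example defines two parsing methods, parse() and parse_author(): one drives the overall crawl and the other parses author information. Let's walk through the flow:

In parse(), all the author href values on the current page are extracted; each is a link. For each link, response.follow creates a new Request, and the parse_author() callback parses the downloaded page, extracting the author's information.
Once the requests for all authors on the current page have been issued in step 1, parse() goes on to request the next page.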

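The point to note in this example is that while processing the current page we can still create child Requests to crawl its sub-links, and the next-page request is only yielded after the author requests for the current page. Scrapy schedules requests asynchronously, so the exact order in which the responses arrive is not guaranteed.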
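The interplay of the two callbacks can be sketched with a toy, synchronous crawl loop in plain Python. Everything here is an assumption made for illustration: the in-memory SITE table stands in for the website, requests are simple (url, callback) tuples, and the FIFO queue is far simpler than Scrapy's real asynchronous scheduler.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (author links, next-page link or None).
SITE = {
    '/page/1/': (['/author/Albert-Einstein', '/author/J-K-Rowling'], '/page/2/'),
    '/page/2/': (['/author/Albert-Einstein'], None),
}


def parse(url):
    authors, next_page = SITE[url]
    for href in authors:
        yield href, parse_author   # request to be handled by parse_author
    if next_page is not None:
        yield next_page, parse     # request to be handled by parse again


def parse_author(url):
    yield {'author_url': url}      # a scraped item


def crawl(start_url):
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)   # items are collected
            else:
                queue.append(result)   # requests are queued for later
    return items


items = crawl('/page/1/')
print(items)
```

Because this toy loop has no duplicate filtering, the same author page is visited once per page that links to it.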

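Also note that when crawling the whole site there will be many quotes by the same author, so the same author pages would be requested many times. That would add crawling load and produce a lot of redundant data. For this reason, Scrapy by default filters out requests to URLs it has already visited. You can configure this behavior through the DUPEFILTER_CLASS setting.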
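The idea behind duplicate filtering can be sketched with a set of already-seen URLs. This is a deliberately simplified stand-in: the class SeenUrlFilter is hypothetical and keys on the raw URL string, whereas Scrapy's default filter works on request fingerprints.

```python
class SeenUrlFilter:
    """Toy duplicate filter keyed on the URL string (not Scrapy's class)."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        # Return True if the URL was already requested; otherwise remember it.
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


dupefilter = SeenUrlFilter()
requested = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',  # same author linked from another quote
]
to_crawl = [url for url in requested if not dupefilter.request_seen(url)]
print(to_crawl)
```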

Using spider arguments

You can pass arguments to your Spider from the command line:

$ scrapy crawl quotes -o quotes-humor.json -a tag=humor

The argument is passed to the Spider's __init__ method and by default becomes an attribute of the quotes Spider. In the spider's Python code you can read it as self.tag:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

getattr(self, 'tag', None) retrieves the tag argument passed on the command line, which is then used to build the URL to crawl, http://quotes.toscrape.com/ta...
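The getattr pattern can be exercised without Scrapy. In the sketch below, ArgsHolder is a hypothetical stand-in for a spider whose keyword arguments become attributes, and start_url repeats the URL-building logic from start_requests() above.

```python
class ArgsHolder:
    """Hypothetical stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def start_url(spider):
    url = 'http://quotes.toscrape.com/'
    # Falls back to None when no tag argument was supplied.
    tag = getattr(spider, 'tag', None)
    if tag is not None:
        url = url + 'tag/' + tag
    return url


with_tag = start_url(ArgsHolder(tag='humor'))   # like passing -a tag=humor
without_tag = start_url(ArgsHolder())           # no -a argument given
print(with_tag)
print(without_tag)
```

Using getattr with a default keeps the spider usable both with and without the argument.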