A Concise Scrapy Tutorial

Date: 2020-12-07

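This article uses a worked example to introduce the basic approach and workflow for scraping website content with Scrapy.

Before reading on, make sure Scrapy is installed.

The basic installation command is: pip install scrapy

We gave a first introduction to Scrapy in an earlier article; this article builds on it.

This article covers the following parts:

1. Creating a Scrapy project

2. Writing a spider class (called a crawler or a spider; the two terms mean the same thing here) that crawls the site's pages and extracts data

3. Exporting the scraped data from the command line

4. Recursively crawling sub-pages

5. Understanding and using spider arguments

The site we test against is quotes.toscrape.com, which collects quotes from famous people. Let's go!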

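Creating a Crawler Project

Scrapy organizes a crawler's code modules and configuration as a project. Once Scrapy is installed, you can create a new project by running the following command in a shell: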

scrapy startproject tutorial

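This creates a tutorial directory with the following file structure:

    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module; import your code from here
        __init__.py
        items.py          # item definitions
        middlewares.py    # middleware definitions
        pipelines.py      # pipeline definitions
        settings.py       # project settings
        spiders/          # directory that holds the spiders
            __init__.py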

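Writing a Spider Class

Spiders are the classes you define in Scrapy to implement the crawling logic.

Every spider inherits from the Spider base class. A spider mainly defines some start URLs and is responsible for parsing page elements and extracting the required data from them. It can also generate new URL requests.

The code below is the spider we define. Save it as quotes_spider.py in the project's tutorial/spiders/ directory.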

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

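In this code, QuotesSpider inherits from scrapy.Spider and defines some attributes and methods:

name: uniquely identifies a spider within a project. A project can contain several spiders, and their names must be unique.

start_requests(): produces the initial requests from which the spider starts crawling. It must return an iterable of Request objects, which can be a list or a generator. Our example returns a generator.

parse(): a callback that parses the page fetched for a URL. Its response argument holds the page content and offers many methods for extracting data from it. In parse() we typically package the extracted data as dicts, look for new URLs, and create new Requests for those URLs so the crawl continues.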
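
The filename logic in parse() depends on the shape of the URL. As a small standalone sketch (the helper name page_filename is ours and is not part of Scrapy), the same two lines can be tried without running a crawl:

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so the second-to-last piece is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


first = page_filename('http://quotes.toscrape.com/page/1/')
second = page_filename('http://quotes.toscrape.com/page/2/')
print(first, second)
```

Note that this relies on the trailing slash: for a URL written as .../page/1 the second-to-last piece would be 'page' rather than the page number.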

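Running the Spider

Once the spider is defined, go to the project's top-level directory, that is, the outermost tutorial directory, and run the spider with the following command: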

 
    scrapy crawl quotes
  

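This command looks in the project's spiders directory for the spider whose name is quotes and runs it. The spider sends HTTP requests to quotes.toscrape.com and produces output like the following: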

    ...
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
    ...

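This output tells us that the spider successfully visited the URLs and saved their content as HTML files, which is exactly what we defined in parse().

What Happens Under the Hood

Scrapy schedules the Requests produced by the spider's start_requests() method. Whenever a Request completes, Scrapy creates the corresponding Response and passes it as an argument to the callback associated with that Request. The callback parses the response page, extracts data from it, or issues new HTTP requests.

Scrapy implements this flow internally. All we have to do in the spider is define the URLs to visit and how to handle the page responses.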
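
To make the request/response/callback cycle concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Scrapy's engine: ToyRequest, ToyResponse, FAKE_SITE and run are hypothetical names, nothing is downloaded, and the real scheduler is asynchronous.

```python
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Hypothetical stand-in for a downloaded page."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


# Fake "website": URL -> (page text, link to the next page or None).
FAKE_SITE = {
    "/page/1/": ("quote A", "/page/2/"),
    "/page/2/": ("quote B", None),
}


def parse(response):
    """Callback: yield one item, then a follow-up request if a link exists."""
    yield {"url": response.url, "text": response.body}
    next_url = FAKE_SITE[response.url][1]
    if next_url is not None:
        yield ToyRequest(next_url, callback=parse)


def run(start_requests):
    """Pop requests, 'download' them, and route whatever the callback yields."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        body = FAKE_SITE[request.url][0]
        response = ToyResponse(request.url, body)
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)  # a new request: schedule it
            else:
                items.append(result)  # scraped data: collect it
    return items


items = run([ToyRequest("/page/1/", callback=parse)])
print(items)
```

The spider author only writes the equivalent of parse and the start requests; the loop in run is the part the framework takes care of.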

A Shortcut for start_requests

Besides using start_requests() to produce Request objects, there is a simpler way to generate them: define a start_urls list in the spider and put the URLs to visit first in it, as shown below:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)

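In fact, the spider still calls the default start_requests() method, which reads start_urls and generates the Requests.

This shorter form also does not explicitly associate the parse callback with the Request objects. As you might guess, Scrapy makes that association for us: parse is Scrapy's default Request callback.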
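
The two conventions just described (a default start_requests() that reads start_urls, and parse as the fallback callback) can be modeled in a few lines. This is a sketch under our own assumptions, not Scrapy's source: ToyRequest, ToySpider and callback_for are hypothetical names.

```python
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class ToySpider:
    """Hypothetical base class that mimics the default behavior."""

    start_urls = []

    def start_requests(self):
        # Default implementation: one request per start URL, no explicit callback.
        for url in self.start_urls:
            yield ToyRequest(url)

    def parse(self, response):
        raise NotImplementedError


def callback_for(spider, request):
    """Fall back to spider.parse when the request names no callback."""
    return request.callback or spider.parse


class QuotesSpider(ToySpider):
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # The URL string stands in for a real response object here.
        return "parsed %s" % response


spider = QuotesSpider()
requests = list(spider.start_requests())
handled = [callback_for(spider, r)(r.url) for r in requests]
print(handled)
```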

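Extracting Data

Once we have the page response, the most important task is extracting data from it. First, a word about the Scrapy shell: it is an interactive debugging tool built into Scrapy that makes it easy to fetch a page and test whether an extraction approach works.

The Scrapy shell is started like this: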

 
    scrapy shell 'http://quotes.toscrape.com/page/1/'
  

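Just append the URL of the page you want to debug. Note that the URL must be enclosed in quotes.

After pressing Enter you get output like the following: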

    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com/page/1/>
    [s]   response   <200 http://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

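Now we can test in the shell how to extract page elements.

Use the response.css() method to select page elements:

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

css() returns a list of selectors. Each selector wraps a page element and provides methods for getting that element's data.

We can get the content of the HTML title like this:

    >>> response.css('title::text').getall()
    ['Quotes to Scrape']

Here we added ::text to title in the CSS query, so only the text inside the <title> tag is returned. Without it we get the whole <title> element, tags included:

    >>> response.css('title').getall()
    ['<title>Quotes to Scrape</title>']

That is the result when ::text is left out.

Also note that getall() returns a list, because a CSS query may match several elements. If you only want the first one, use get():

    >>> response.css('title::text').get()
    'Quotes to Scrape'

You can also index into the selector list returned by css():

    >>> response.css('title::text')[0].get()
    'Quotes to Scrape'

If the CSS selector matches no page element, get() returns None, whereas indexing an empty selector list with [0] raises an IndexError.

Besides get() and getall(), we can use re() to extract with regular expressions:

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']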
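
The three re() results above can be reproduced outside the shell with Python's standard re module applied to the title text. This is only an approximation of what the selector method does; flat_findall is a hypothetical helper that flattens capture groups into one list, matching the shape of the output shown above.

```python
import re

title = "Quotes to Scrape"


def flat_findall(pattern, text):
    """Return regex matches as one flat list of strings."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            flat.extend(match)  # several capture groups: unpack them
        else:
            flat.append(match)  # no group or a single group: already a string
    return flat


whole = flat_findall(r"Quotes.*", title)
word = flat_findall(r"Q\w+", title)
groups = flat_findall(r"(\w+) to (\w+)", title)
print(whole, word, groups)
```

Plain re.findall would return the last case as a list holding one tuple, which is why the helper unpacks tuples.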

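So the key to data extraction is finding the right CSS selector.

A common approach is to analyze the page with the browser's developer tools. In Chrome, press F12 to open them.

A Brief Introduction to XPath

Besides CSS, Scrapy also supports selecting page elements with XPath: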

    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').get()
    'Quotes to Scrape'

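XPath expressions are powerful. They are the foundation of Scrapy's selectors, and CSS selectors are converted to XPath under the hood.

Compared with CSS selectors, XPath can not only navigate the page structure but also look at element content. For example, XPath makes it easy to select the link whose text is "Next", which suits crawling well. We will look at XPath in more detail in later material on Scrapy selectors, and plenty of resources are available online.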
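
The difference between a structural query and a content-based query can be tried with Python's standard library alone. The sketch below uses xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, so it is a stand-in for illustration and not Scrapy's selector engine. The fragment is our own simplified reduction of the quotes page.

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed fragment modeled on the quotes page (an assumption).
FRAGMENT = """
<div class="quote">
  <span class="text">It is our choices, Harry, that show what we truly are, far more than our abilities.</span>
  <span>by <small class="author">J.K. Rowling</small></span>
  <ul class="pager">
    <li class="previous"><a href="/page/1/">Previous</a></li>
    <li class="next"><a href="/page/3/">Next</a></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(FRAGMENT)

# Structural query: find an element by tag name and attribute value.
author = root.find(".//small[@class='author']").text

# Content-based query: pick the <li> whose <a> child has the text 'Next'.
next_link = root.find(".//li[a='Next']/a")
next_href = next_link.get("href")

print(author, next_href)
```

In a Scrapy shell the comparable content-based query would be written with full XPath against response.xpath(), which copes with real-world HTML.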

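Extracting Quotes and Authors

With the introduction above we have a basic idea of how to select page elements and extract data. Next we keep improving the spider so that it gets more information from the test site's pages.

Open http://quotes.toscrape.com/ and inspect a single quote in the developer tools. Its source looks like this: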

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Now let's open the Scrapy shell to test how to extract these elements.

    $ scrapy shell 'http://quotes.toscrape.com'

Once the shell has fetched the page, a CSS selector gives us the list of quotes on the page:

    >>> response.css("div.quote")
    [<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
     ...]

Because the page contains many quotes, the result is a list of many selector objects. We can take the first selector by index and then call its methods to get the element content.

    >>> quote = response.css("div.quote")[0]

With the quote object we can extract its text, author, and tags, again using CSS selectors.

    >>> text = quote.css("span.text::text").get()
    >>> text
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").get()
    >>> author
    'Albert Einstein'

Each quote on the page carries several tags. We can get the tag strings with getall():

    >>> tags = quote.css("div.tags a.tag::text").getall()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

Now that we can extract the content of the first quote, we can loop over the selectors to get the content of every quote on the current page:

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").get()
    ...     author = quote.css("small.author::text").get()
    ...     tags = quote.css("div.tags a.tag::text").getall()
    ...     print(dict(text=text, author=author, tags=tags))
    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
    {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
    ...

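Extracting Data in the Spider Code

Now that we know how to extract page elements in the Scrapy shell, let's return to the spider we wrote earlier.

So far the spider simply dumps the whole response body into an HTML file. We need to improve it so that it saves meaningful data.

A Scrapy spider typically returns dicts containing the extracted data after parsing a page. These dicts can then be used by later processing steps.

We return these dicts by using yield in the callback.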

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }

Running this spider produces log output like the following:

    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

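Storing Data

Scrapy can store data in a variety of storage systems. The simplest approach is to save it to a file, which can be done with the following command:

    scrapy crawl quotes -o quotes.json

This saves the extracted data in JSON format, and it writes to the file in append mode. If you run the command more than once against the same file, the new output is appended after the old output and the file is no longer valid JSON!

Scrapy also offers the JSON Lines format, which avoids this problem.

    scrapy crawl quotes -o quotes.jl

A file in this format stores one JSON object per line, so records appended later do not break the ones already there.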
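
The difference between the two formats under append mode can be checked with the standard json module. The sketch below simulates two runs appending to the same file by concatenating strings; the sample items and helper names are made up for illustration, and Scrapy's real exporters format their output somewhat differently.

```python
import json

# Items "scraped" by two separate runs (sample data for illustration).
items_run1 = [{"author": "Albert Einstein"}]
items_run2 = [{"author": "J.K. Rowling"}]


def append_json(existing, items):
    """Append a complete JSON array after whatever is already in the file."""
    return existing + json.dumps(items)


def append_jsonlines(existing, items):
    """Append one JSON object per line."""
    return existing + "".join(json.dumps(item) + "\n" for item in items)


def is_valid_json(text):
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


json_file = append_json(append_json("", items_run1), items_run2)
jl_file = append_jsonlines(append_jsonlines("", items_run1), items_run2)

print(is_valid_json(json_file))  # two arrays back to back are not one document
parsed_lines = [json.loads(line) for line in jl_file.splitlines()]
print(parsed_lines)
```

Each line of the JSON Lines file parses on its own, which is why appending is safe there.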

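Besides JSON, Scrapy also supports storage formats such as CSV and XML.

If the storage logic is more complex, you can break it up with Scrapy's item pipelines: wrap each storage step in a pipeline and let the Scrapy engine schedule them. We will cover this in a later article.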

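Following Links

So far our spider only gets data from two pages. To collect data from the whole site automatically, we also need to extract other links from each page and generate new crawl requests.

Let's look at how to follow links on a page.

First, find the link to crawl next. On the test site's pages there is a "Next" link at the bottom right of the list. Its HTML source is: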

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>

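Let's test in the Scrapy shell how to extract this link:

    >>> response.css('li.next a').get()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

We used css('li.next a') to get the selector for this link and get() to obtain the whole link element. That is more than we need: what we want is the value of the link's href attribute. It can be obtained with the CSS extension syntax that Scrapy provides:

    >>> response.css('li.next a::attr(href)').get()
    '/page/2/'

It can also be read from the selector's attrib property:

    >>> response.css('li.next a').attrib['href']
    '/page/2/'

Next, we integrate this extraction step into the spider code to follow page links recursively.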

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }

            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

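The start URL is now just the first page.

After parse() has extracted all the quotes on the first page, it looks for the "Next" link on the page. If it finds one, it creates a new request and registers itself as that Request's callback. In this way the spider visits the whole site recursively until it reaches the last page.

This is Scrapy's mechanism for following page links: you parse the links, yield new Requests, and attach a callback to each Request. Scrapy schedules these Requests, sends them automatically, and hands each response to its callback.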
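
The href on the page is relative ('/page/2/'), which is why the spider calls response.urljoin() before building the Request. The resolution step itself can be tried with the standard library's urllib.parse.urljoin. This sketch assumes the current page's URL is used as the base; a real response may also take the page's declared base URL into account.

```python
from urllib.parse import urljoin

current_page = "http://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "Next" link.
next_page = urljoin(current_page, "/page/2/")

# A path-relative href is resolved against the current directory.
sibling = urljoin(current_page, "../3/")

# An absolute URL is returned unchanged.
absolute = urljoin(current_page, "http://quotes.toscrape.com/tag/humor/")

print(next_page, sibling, absolute)
```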

A Shortcut for Creating Requests

Besides creating a scrapy.Request object directly, we can use response.follow to simplify Request creation.

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('span small::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }

            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

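follow accepts a relative path directly, so there is no need to call urljoin(). This matches how href values are written on the page, which is convenient.

follow also accepts the selector for a URL, so you do not need to call get() to extract the URL string.

    for href in response.css('ul.pager a::attr(href)'):
        yield response.follow(href, callback=self.parse)

For <a> elements this can be simplified further:

    for a in response.css('ul.pager a'):
        yield response.follow(a, callback=self.parse)

This works because follow automatically uses the href attribute of an <a> element.

We can also use follow_all to create Requests in bulk from an iterable:

    # anchors holds several <a> selectors
    anchors = response.css('ul.pager a')
    yield from response.follow_all(anchors, callback=self.parse)

follow_all supports a shorter form as well:

    yield from response.follow_all(css='ul.pager a', callback=self.parse)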

More Examples

Let's look at one more spider. Its job is to collect information about the authors of all the quotes.

    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            author_page_links = response.css('.author + a')
            yield from response.follow_all(author_page_links, self.parse_author)

            pagination_links = response.css('li.next a')
            yield from response.follow_all(pagination_links, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).get(default='').strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }

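This spider starts crawling from the test site's home page. It extracts all the author links on the page and generates a new Request for each, and it extracts the page's Next link and generates the corresponding Request. The author information is extracted by parse_author.

In parse_author we define a helper function that is used to extract the data.

Note that an author may have several quotes, and each quote carries its own link to that author's page, so we may extract the same author URL several times. Scrapy does not send multiple requests to the same URL, however. It filters out duplicates automatically, which reduces the load the crawler puts on the site to some extent.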
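
The effect of duplicate filtering can be pictured with a "seen" set. This is a deliberately simplified sketch: unique_urls is a hypothetical helper that compares plain URL strings, while Scrapy's own filter works on requests rather than bare strings.

```python
def unique_urls(urls):
    """Yield each URL the first time it appears and skip later repeats."""
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


# The same author is linked from more than one quote on the page.
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
]

to_fetch = list(unique_urls(author_links))
print(to_fetch)
```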

Using Spider Arguments

We can pass arguments to a spider with the -a option of the scrapy command line. For example:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

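Here, after -a, tag is the argument name and humor is the argument value.

These arguments are passed to the spider's __init__ method and become attributes of the spider. We can read these attributes in the spider and act differently depending on their values.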

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = 'http://quotes.toscrape.com/'
            tag = getattr(self, 'tag', None)
            if tag is not None:
                url = url + 'tag/' + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }

            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)
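
How a command-line argument ends up as a spider attribute can be modeled without Scrapy. In this sketch ToySpider is a hypothetical class whose __init__ copies keyword arguments onto the instance, which is the behavior the text above describes; start_url is our own helper that repeats the URL-building logic from start_requests().

```python
class ToySpider:
    """Hypothetical spider: keyword arguments become instance attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default keeps the spider usable when -a tag=... is omitted.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


with_tag = ToySpider(tag='humor').start_url()
without_tag = ToySpider().start_url()
print(with_tag, without_tag)
```

Values passed with -a arrive as strings, so a numeric argument would need an explicit conversion in the spider.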