The Scrapy Framework: Selectors

Date: 2019-11-15

Selectors

When you're scraping web pages, the most common task you need to perform is extracting data from the HTML source. Several libraries can do this:

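BeautifulSoup is a very popular web scraping library among Python programmers. It builds a Python object based on the structure of the HTML code and copes reasonably well with bad markup, but it has one drawback: it is slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified by either XPath or CSS expressions.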

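XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.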

Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.

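This page explains how selectors work and describes their API, which is very small and simple. The lxml API is much larger, because the lxml library can be used for many other tasks besides selecting parts of markup documents.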

Constructing selectors

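Scrapy selectors are instances of the Selector class, constructed by passing either text or a TextResponse object. The selector automatically chooses the best parsing rules (XML vs. HTML) based on the input type: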

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector on the .selector attribute. It is perfectly fine to use this shortcut whenever possible:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Using selectors

To explain how to use selectors, we'll use the Scrapy shell (which provides interactive testing) and an example page hosted on the Scrapy documentation server:

http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Here is its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let's open the shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, once the shell has loaded, the response is available as the response shell variable, and its attached selector as the response.selector attribute.

Since we're dealing with HTML, the selector will automatically use an HTML parser.

So, looking at the HTML code of that page, let's construct an XPath that selects the text inside the title tag:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Querying responses with XPath and CSS is so common that responses include two convenience shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

To actually extract the textual data, you must call the selector's .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

If you only want the first matched element, you can call the selector's .extract_first() method:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

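Note that CSS selectors can select text or attribute nodes using the pseudo-elements ::text and ::attr(name). These follow CSS3 pseudo-element syntax but are extensions specific to Scrapy selectors: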

>>> response.css('title::text').extract()
[u'Example website']

Now we're going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Nesting selectors

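The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here's an example: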

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

Using selectors with regular expressions

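Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() and .css() methods, .re() returns a list of unicode strings, so you can't construct nested .re() calls.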

Here's an example that extracts the image names from the HTML code above:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

There is an additional helper, .re_first(), which is to .re() what .extract_first() is to .extract(). Use it to extract only the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

Working with relative XPaths

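Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document and not relative to the Selector you're calling it from.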

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

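At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the document, not only those inside <div> elements: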

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is extracting all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

Variables in XPath expressions

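XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders such as ?, which are then substituted with the values passed along with the query.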

Here's an example that matches an element based on its "id" attribute value without hard-coding it (as was done previously):

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()  
u'Name: My image 1 '

Here's another example, which finds the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

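All variable references must have a binding value when calling .xpath(); otherwise you'll get a ValueError whose message starts with "XPath error:". Bindings are supplied by passing as many named arguments as necessary.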
