Selectors in Scrapy 1.5.0

Constructing selectors

Scrapy selectors are instances of the Selector class, constructed from text or from a TextResponse object.

The selector automatically chooses the best parsing rules (XML vs. HTML) based on the input type:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response (when the body is a str, pass an encoding, since HtmlResponse otherwise expects bytes):

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector via the .selector attribute; you can use this shortcut at any time:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Using selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Since querying responses with XPath and CSS is so common, Scrapy provides two convenient shortcuts: response.xpath() and response.css():

>>>response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]

>>>response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors.

This API makes it easy to quickly extract nested data.

To actually extract the text data, you must call the selector's .extract() method, as shown below:

>>> response.xpath('//title/text()').extract()
['Example website']

If you only want to extract the first matched element, you can call .extract_first():

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
'Name: My image 1 '

If there is no matching element, None is returned; you can also supply a default return value as an argument to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'
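The first-or-default behavior above can be mimicked in plain Python. The following is a stdlib-only sketch of the .extract_first() semantics, not Scrapy's actual implementation:

```python
# A stdlib-only sketch (not Scrapy's actual code) of the
# .extract_first() semantics shown above: first result, or a default
# when nothing matched.
def extract_first(results, default=None):
    """Return the first extracted result, or `default` if empty."""
    return results[0] if results else default

print(extract_first(['Name: My image 1 ']))    # 'Name: My image 1 '
print(extract_first([]))                       # None
print(extract_first([], default='not-found'))  # 'not-found'
```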

Getting the URL of the base tag and some image links:

In [16]: response.css('base::attr(href)')
Out[16]: [<Selector xpath='descendant-or-self::base/@href' data='http://example.com/'>]

In [17]: response.xpath('//base/@href')
Out[17]: [<Selector xpath='//base/@href' data='http://example.com/'>]
In [19]: response.css('a::attr(href)')
Out[19]: 
[<Selector xpath='descendant-or-self::a/@href' data='image1.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image2.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image3.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image4.html'>,
 <Selector xpath='descendant-or-self::a/@href' data='image5.html'>]


In [21]: response.xpath('//a/@href')
Out[21]: 
[<Selector xpath='//a/@href' data='image1.html'>,
 <Selector xpath='//a/@href' data='image2.html'>,
 <Selector xpath='//a/@href' data='image3.html'>,
 <Selector xpath='//a/@href' data='image4.html'>,
 <Selector xpath='//a/@href' data='image5.html'>]
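For comparison, the same attribute extraction can be done with the standard library's html.parser. This is a hedged, Scrapy-free sketch of what a::attr(href) collects:

```python
# A stdlib-only sketch of what response.css('a::attr(href)') collects,
# using html.parser instead of Scrapy's selector machinery.
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            self.hrefs.extend(v for k, v in attrs if k == 'href')

parser = HrefCollector()
parser.feed('<div id="images">'
            '<a href="image1.html">Name: My image 1</a>'
            '<a href="image2.html">Name: My image 2</a>'
            '</div>')
print(parser.hrefs)  # ['image1.html', 'image2.html']
```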

Nesting selectors

The selector methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selector methods on those selectors too.

CSS and XPath selectors can be mixed and nested.

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings, so you cannot construct nested .re() calls.

In [34]: response.xpath('//a[contains(@href, "image")]/text()').re('Name:\s(.*?)image\s(\d)')
Out[34]: ['My ', '1', 'My ', '2', 'My ', '3', 'My ', '4', 'My ', '5']

There is also an additional .extract_first()-style helper for .re(), named .re_first(). Use it to extract only the first matching string:

In [35]: response.xpath('//a[contains(@href, "image")]/text()').re_first('Name:\s(.*?)image\s(\d)')
Out[35]: 'My '
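As the interleaved output above suggests, .re() behaves as if it applied the pattern to each extracted text and flattened all capture groups into one list. A stdlib sketch of that behavior with the re module:

```python
import re

# Texts as extracted by the XPath above; the pattern has two groups.
texts = ['Name: My image 1 ', 'Name: My image 2 ']
pattern = re.compile(r'Name:\s(.*?)image\s(\d)')

# .re()-like behavior: flatten every group of every match, in order.
flat = [g for t in texts for m in pattern.finditer(t) for g in m.groups()]
print(flat)     # ['My ', '1', 'My ', '2']

# .re_first()-like behavior: just the first of those.
print(flat[0])  # 'My '
```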

Working with relative XPaths

If you nest selectors and use an XPath that starts with /, that XPath is absolute to the document, not relative to the Selector you are calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()

Another common case is to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()
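The same scoping distinction exists in the standard library's xml.etree, which gives a Scrapy-free illustration (a sketch using a made-up document, not the sample page above):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<p>outside</p>'
    '<div><p>inside 1</p><p>inside 2</p></div>'
    '</body></html>'
)
div = doc.find('.//div')

# Relative search, like divs.xpath('.//p'): only <p> inside the div.
print([p.text for p in div.findall('.//p')])  # ['inside 1', 'inside 2']

# Direct children only, like divs.xpath('p').
print([p.text for p in div.findall('p')])     # ['inside 1', 'inside 2']

# Searching from the document root finds every <p>, which is what the
# absolute '//p' mistakenly does in the nested case above.
print([p.text for p in doc.findall('.//p')])  # ['outside', 'inside 1', 'inside 2']
```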

Using EXSLT extensions

Being built on top of lxml, Scrapy selectors also support some EXSLT extensions, which can be used in XPath expressions through these pre-registered namespaces:

Prefix   Namespace                              Usage
re       http://exslt.org/regular-expressions   http://exslt.org/regexp/index.html
set      http://exslt.org/sets                  http://exslt.org/set/index.html

Regular expressions

For example, the test() function can prove quite useful when XPath's starts-with() or contains() are not sufficient.

Example: selecting the links in a list whose "class" attribute ends with a digit:

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>
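The re:test() filter above behaves like a search with an end-anchored pattern in Python's re module. A small stdlib sketch of the same class filtering, using (class, href) pairs taken from the document above:

```python
import re

# (class, href) pairs as in the list above.
items = [
    ('item-0', 'link1.html'),
    ('item-1', 'link2.html'),
    ('item-inactive', 'link3.html'),
    ('item-1', 'link4.html'),
    ('item-0', 'link5.html'),
]

# Equivalent of re:test(@class, "item-\d$"): class ends with a digit.
pattern = re.compile(r'item-\d$')
links = [href for cls, href in items if pattern.search(cls)]
print(links)  # ['link1.html', 'link2.html', 'link4.html', 'link5.html']
```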

Set operations

Set operations are handy for excluding parts of the document tree before extracting text elements.

Example: extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and their corresponding itemprops:

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print "current scope:", scope.xpath('@itemtype').extract()
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print "    properties:", props.extract()
...     print

current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

>>>

In props = scope.xpath('''set:difference(./descendant::*/@itemprop, .//*[@itemscope]/*/@itemprop)'''), the descendant:: axis selects descendant nodes, and set:difference(a, b) returns the nodes that are in a but not in b.
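Note that set:difference keeps node order and duplicates, unlike Python's built-in set type. A stdlib sketch of its semantics, using an abbreviated itemprop list modeled on the output above:

```python
# A sketch of EXSLT set:difference(a, b): items of a not present in b,
# preserving a's order and duplicates (so a plain set() would not do).
def set_difference(a, b):
    exclude = set(b)
    return [x for x in a if x not in exclude]

# itemprops under the Product scope (abbreviated), minus the props that
# belong to nested itemscopes.
all_props = ['name', 'aggregateRating', 'ratingValue', 'reviewCount',
             'offers', 'price', 'description', 'review', 'review']
nested_props = ['ratingValue', 'reviewCount', 'price']
print(set_difference(all_props, nested_props))
# ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']
```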

Note the difference between //node[1] and (//node)[1]

  • //node[1] selects all nodes occurring first under their respective parents.
  • (//node)[1] selects all nodes in the document, and then takes only the first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()

This gets all first <li> elements under whatever their parent is:

>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
[u'<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
[u'<li>1</li>']
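xml.etree supports the per-parent positional predicate too, so the distinction can be reproduced without Scrapy. A stdlib sketch using the same two lists:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root>'
    '<ul class="list"><li>1</li><li>2</li><li>3</li></ul>'
    '<ul class="list"><li>4</li><li>5</li><li>6</li></ul>'
    '</root>'
)

# Like //ul/li[1]: the first <li> of each <ul> parent.
print([li.text for li in doc.findall('ul/li[1]')])  # ['1', '4']

# Like (//ul/li)[1]: collect all matches, then take the first overall.
print(doc.findall('.//li')[0].text)  # '1'
```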

When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use @class='someclass' you may miss elements that have other classes as well, and if you just use contains(@class, 'someclass') to make up for that, you may end up with more elements than you want, if they have a different class name that contains the string someclass. As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . to make the XPath expressions that follow relative.
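The concat/normalize-space idiom encodes whole-token matching on the class attribute. The same rule in plain Python (a hedged sketch, not CSS-engine code):

```python
# Whole-token class matching, the rule the verbose XPath idiom encodes:
# 'shout' matches class="hero shout" but not class="shoutbox".
def has_class(class_attr, name):
    return name in class_attr.split()

print(has_class('hero shout', 'shout'))  # True
print(has_class('shoutbox', 'shout'))    # False
print(has_class('hero shout', 'hero'))   # True
```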
