Using Selector in Scrapy

Scrapy's Selector can also be used on its own to parse HTML. Today I'll mainly summarize how to use css and xpath; personally, I like css best.

As an example, take a web page from teacher Song Tian's MOOC tutorial: python123.io/ws/demo.html

Parsing is a means of extracting information. The main things to extract are tag nodes, attributes, and text; each is covered in turn below.

1. Extracting tag nodes

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

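The above is the page's HTML. Suppose I want to extract the <p> tags.

Using the css selector

from scrapy import Selector

# extract() returns a list with every matching node
selector = Selector(text=response)
p = selector.css('p').extract()
print(p)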
#['<p class="title"><b>The demo python introduces several python courses.</b></p>', '<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>']

This returns every p node, as a list. If you only want the first one, use extract_first() instead of extract().

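For simple lookups this is enough. But if there are many nodes of the same kind and the one I want is not the first, this approach no longer works. The fix is to add a condition, such as class or id.

For example, to extract the p node with class=course, use p[class="course"]. Of course, if the node has other attributes, those can serve as the condition too.

selector = Selector(text=response)
result = selector.css('p[class="course"]').extract_first()
print(result)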

#<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>

Using xpath

The overall idea with xpath is the same; only the syntax differs slightly.

The first example above, with xpath

selector = Selector(text=response)
result = selector.xpath('//p').extract_first()
print(result)

The second example above, with xpath

selector = Selector(text=response)
result = selector.xpath('//p[@class="course"]').extract_first()
print(result)

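Attentive readers may notice that, for selecting tag nodes, xpath only adds // and @ compared with css. // selects matching nodes at any depth in the document, and @ is followed by an attribute name.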

2. Extracting attributes

Sometimes we need to extract attribute values, such as src or href.

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

It's the same example, copied here for easier reading.

Say I now want the href of the first a tag.

Using css

Append ::attr(href) directly after the tag. attr means an attribute is being extracted, and the href in parentheses names which attribute I want.

selector = Selector(text=response)
result = selector.css('a::attr(href)').extract_first()
print(result)
#http://www.icourse163.org/course/BIT-268001

To get the href of a specific a tag, such as the second one, a condition works here too.

selector = Selector(text=response)
result = selector.css('a[class="py2"]::attr(href)').extract_first()
print(result)
#http://www.icourse163.org/course/BIT-1001870001

Using xpath

The first example above

selector = Selector(text=response)
result = selector.xpath('//a/@href').extract_first()
print(result)

The second example above

selector = Selector(text=response)
result = selector.xpath('//a[@class="py2"]/@href').extract_first()
print(result)

3. Extracting text

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

Extracting the text of the first a tag

Using the css selector

Just append ::text after the tag; selecting the tag itself works as described above.

selector = Selector(text=response)
result = selector.css('a::text').extract_first()
print(result)
#Basic Python

To get the text of a specific tag, such as the second a tag, again just add a condition.

selector = Selector(text=response)
result = selector.css('a[class="py2"]::text').extract_first()
print(result)
#Advanced Python

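Using xpath

First example: //a selects the a nodes, then /text() selects their text.

selector = Selector(text=response)
result = selector.xpath('//a/text()').extract_first()
print(result)

Second example: when adding a condition in xpath, never forget the @ in front of the attribute name, and text must be followed by ().

selector = Selector(text=response)
result = selector.xpath('//a[@class="py2"]/text()').extract_first()
print(result)

To sum up: for extraction, xpath needs the extra / and @ symbols, and even a condition requires @ before the attribute name. That's why I like css: I'm lazy.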
