Notes - scrapy selector

Scrapy version: 1.5.0

 

1. Overview

 

Scrapy's built-in selectors are built on top of lxml.

2. Usage

Both the xpath and css methods can be used to parse a document; both return a list, for example:

sel = Selector(text=body).xpath('//div[@class="ip_list"]/text()').extract()

The re() method can also be used on a selector for regex-based extraction; its usage is similar to the re library.
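
A minimal sketch of the three extraction styles (the HTML snippet and variable names below are illustrative, not from the original post):

from scrapy.selector import Selector

body = '<html><body><div class="ip_list">127.0.0.1</div></body></html>'  # illustrative HTML
sel = Selector(text=body)

# xpath and css both return a SelectorList
ips_xpath = sel.xpath('//div[@class="ip_list"]/text()').extract()  # ['127.0.0.1']
ips_css = sel.css('div.ip_list::text').extract()                   # ['127.0.0.1']

# re() applies a regular expression and returns a list of strings
ips_re = sel.xpath('//div[@class="ip_list"]/text()').re(r'\d+\.\d+\.\d+\.\d+')  # ['127.0.0.1']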

3. Classes and commonly used attributes

Selector objects

class scrapy.selector.Selector(response=None, text=None, type=None)

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.

type defines the selector type, it can be "html", "xml" or None (default).

If type is None, the selector automatically chooses the best type based on response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type
"xml" for XmlResponse type
"html" for anything else
Otherwise, if type is set, the selector type will be forced and no detection will occur.
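
A small sketch of how the type argument interacts with text (the XML snippet is illustrative):

from scrapy.selector import Selector

xml_body = '<items><item id="1">a</item></items>'  # illustrative XML

# with only text given, type defaults to "html"
html_sel = Selector(text=xml_body)

# forcing type="xml" skips detection and parses the text as XML
xml_sel = Selector(text=xml_body, type='xml')
print(xml_sel.xpath('//item/@id').extract())  # ['1']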

 

re(regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)
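
A brief sketch showing that both a pattern string and a pre-compiled pattern are accepted (the HTML is illustrative):

import re
from scrapy.selector import Selector

sel = Selector(text='<p>price: 42 USD</p>')  # illustrative HTML
print(sel.xpath('//p/text()').re(r'\d+'))              # ['42'], string pattern
print(sel.xpath('//p/text()').re(re.compile(r'\d+')))  # ['42'], pre-compiled pattern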

 

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
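
A short sketch of how extract() serializes matched nodes (the HTML is illustrative):

from scrapy.selector import Selector

sel = Selector(text='<ul><li>a</li><li>b</li></ul>')  # illustrative HTML
print(sel.xpath('//li').extract())         # ['<li>a</li>', '<li>b</li>'] - full node markup
print(sel.xpath('//li/text()').extract())  # ['a', 'b'] - text nodes only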

 

remove_namespaces()

Remove all namespaces, allowing the document to be traversed using namespace-less xpaths. See the example below.
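
A hedged sketch of remove_namespaces(), modeled on the namespaced-feed pattern from the Scrapy docs (the XML below is illustrative):

from scrapy.selector import Selector

xml_body = '<feed xmlns="http://www.w3.org/2005/Atom"><link href="http://example.com/"/></feed>'  # illustrative XML
sel = Selector(text=xml_body, type='xml')

print(sel.xpath('//link'))  # [] - namespaced nodes do not match plain element names
sel.remove_namespaces()
print(sel.xpath('//link/@href').extract())  # ['http://example.com/']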

 

SelectorList objects

The SelectorList class is a subclass of the built-in list and can be understood as a collection of Selector objects. Calling xpath, css, extract, or re on a SelectorList can be understood as applying that method to each object in the list and then combining the return values into a single list (note: the return value is not inserted as a single nested whole), as in the sketch below.
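
A minimal sketch of how a SelectorList flattens per-element results (the HTML is illustrative):

from scrapy.selector import Selector

sel = Selector(text='<div><p>a</p><p>b</p></div><div><p>c</p></div>')  # illustrative HTML
divs = sel.xpath('//div')  # SelectorList holding two Selector objects
# xpath is applied to each element, and the results are combined into one flat list
print(divs.xpath('./p/text()').extract())  # ['a', 'b', 'c'], not [['a', 'b'], ['c']]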
