With XPath we need to be able to:
- get the text and the relevant attribute values of a given tag;
- locate one specific item in a list precisely (by id or class);
- filter elements by tag name or by the value of an attribute.
response.xpath('//*[@id="resultList"]/div[4]/span[1]/a/@href').extract_first()
This selects the element whose id is resultList, goes down to its 4th div, then to that div's 1st span, then to the a tag below it, and extracts the a tag's href attribute.
With CSS selectors we fetch a specific element, or a list of elements, by their CSS classes. Again we need to be able to (a short sketch follows below):
- get the text and the relevant attribute values of a given tag;
- locate one specific item in a list precisely;
- handle the case where a tag carries several CSS classes.
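A rough sketch of how those CSS requirements map onto Scrapy's selector API; the element id resultList comes from the XPath example above, while the class names sl_btn and active are placeholders rather than values from a real page:

```python
# Text and an attribute of a tag
response.css('#resultList a::text').extract_first()         # text of the first matching link
response.css('#resultList a::attr(href)').extract_first()   # its href attribute

# Locating one specific item in a list (the 4th child, counting from 1)
response.css('#resultList div:nth-child(4) span a::attr(href)').extract_first()

# A tag that carries several CSS classes: chain the class names without spaces
response.css('div.sl_btn.active a::text').extract()
```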
Scrapy XPath
| Expression | Description |
|---|---|
| nodename | Selects all child nodes of this node. |
| / | Selects from the root node. |
| // | Selects nodes in the document from the current node that match the selection, no matter where they are. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects attributes. |
| Path expression | Result |
|---|---|
| /bookstore/book[1] | Selects the first book element that is a child of bookstore. |
| /bookstore/book[last()] | Selects the last book element that is a child of bookstore. |
| /bookstore/book[last()-1] | Selects the last but one book element that is a child of bookstore. |
| /bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore. |
| //title[@lang] | Selects all title elements that have an attribute named lang. |
| //title[@lang='eng'] | Selects all title elements that have a lang attribute with the value eng. |
| /bookstore/book[price>35.00] | Selects all book elements of bookstore whose price element has a value greater than 35.00. |
| /bookstore/book[price>35.00]/title | Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00. |
A few simple examples:
- /html/head/title: selects the <title> tag inside the <head> element of the HTML document. Alternatively, response.xpath('//title') also gets the page title, but // is less efficient and not recommended here.
- /html/head/title/text(): selects the text inside the <title> element mentioned above.
- //td: selects all <td> elements.
- //div[@class="mine"]: selects all div elements that have the attribute class="mine".
Scrapy uses CSS and XPath selectors to locate elements, and they offer four basic methods:
- xpath(): returns a list of selectors, each representing a node selected by the XPath expression
- css(): returns a list of selectors, each representing a node selected by the CSS expression
- extract(): returns the selected elements as unicode strings
- re(): returns a list of unicode strings extracted with a regular expression
    >>> response.xpath('//title/text()')
    [<Selector (text) xpath=//title/text()>]
    >>> response.css('title::text')
    [<Selector (text) xpath=//title/text()>]
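Continuing in the same style, a quick sketch of the remaining two methods, using the example page that appears later in this article, whose title is "Example website":

```python
>>> response.xpath('//title/text()').extract()              # every matched string
['Example website']
>>> response.xpath('//title/text()').extract_first()        # only the first match, or None
'Example website'
>>> response.xpath('//title/text()').re(r'(\w+) website')   # regex with a capture group
['Example']
```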
Problem: Scrapy did not perform the expected looped crawl.
Cause: the domain in allowed_domains was wrong and did not match the URLs being crawled.
Solution: change the domain in allowed_domains so that it matches the crawl URLs.
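For instance, with the Zhilian start URL used below, follow-up requests only get through when allowed_domains covers their domain (the 'example.com' value here is only an illustration of a mismatched setting):

```python
# Mismatched: requests to sou.zhaopin.com are filtered out as "offsite",
# so the follow-up page requests are never crawled ('example.com' is illustrative).
allowed_domains = ['example.com']

# Matching the crawled URLs, as in the corrected spider below:
allowed_domains = ['zhaopin.com']
start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=1']
```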
The corrected code is as follows:
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_demo7.items import ScrapyDemo7Item
    from scrapy.http import Request


    class ZhilianSpider(scrapy.Spider):
        name = 'zhilian'
        allowed_domains = ['zhaopin.com']
        start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=1']

        def parse(self, response):
            tables = response.xpath('//*[@id="newlist_list_content_table"]/table')
            for table in tables:
                item = ScrapyDemo7Item()
                first = table.xpath('./tbody/tr[1]/td[1]/div/a/@href').extract_first()
                print("first", first)
                tableRecord = table.xpath("./tr[1]")
                jobInfo = tableRecord.xpath("./td[@class='zwmc']/div/a")
                item["job_name"] = jobInfo.xpath("./text()").extract_first()
                item["company_name"] = tableRecord.xpath("./td[@class='gsmc']/a[@target='_blank']/text()").extract_first()
                item["job_provide_salary"] = tableRecord.xpath("./td[@class='zwyx']/text()").extract_first()
                item["job_location"] = tableRecord.xpath("./td[@class='gzdd']/text()").extract_first()
                item["job_release_date"] = tableRecord.xpath("./td[@class='gxsj']/span/text()").extract_first()
                item["job_url"] = jobInfo.xpath("./@href").extract_first()
                yield item
            for i in range(1, 21):
                url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=" + str(i)
                print(url)
                yield Request(url, callback=self.parse)
C:\Users\user>pip3 install scrapy Collecting scrapy Using cached Scrapy-1.4.0-py2.py3-none-any.whl Collecting parsel>=1.1 (from scrapy) Using cached parsel-1.2.0-py2.py3-none-any.whl Requirement already satisfied: service-identity in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: w3lib>=1.17.0 in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: cssselect>=0.9 in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: queuelib in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: lxml in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: six>=1.5.2 in d:\python362\lib\site-packages (from scrapy) Collecting Twisted>=13.1.0 (from scrapy) Using cached Twisted-17.9.0.tar.bz2 Requirement already satisfied: pyOpenSSL in d:\python362\lib\site-packages (from scrapy) Requirement already satisfied: attrs in d:\python362\lib\site-packages (from service-identity->scrapy) Requirement already satisfied: pyasn1-modules in d:\python362\lib\site-packages (from service-identity->scrapy) Requirement already satisfied: pyasn1 in d:\python362\lib\site-packages (from service-identity->scrapy) Requirement already satisfied: zope.interface>=4.0.2 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy) Requirement already satisfied: constantly>=15.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy) Requirement already satisfied: incremental>=16.10.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy) Requirement already satisfied: Automat>=0.3.0 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy) Requirement already satisfied: hyperlink>=17.1.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy) Requirement already satisfied: cryptography>=2.1.4 in d:\python362\lib\site-packages (from pyOpenSSL->scrapy) Requirement already satisfied: setuptools in d:\python362\lib\site-packages (from zope.interface>=4.0.2->Twisted>=13.1.0->scrapy) Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy) Requirement already satisfied: asn1crypto>=0.21.0 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy) Requirement already satisfied: idna>=2.1 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy) Requirement already satisfied: pycparser in d:\python362\lib\site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->scrapy) Installing collected packages: parsel, Twisted, scrapy Running setup.py install for Twisted ... 
done Successfully installed Twisted-17.9.0 parsel-1.2.0 scrapy-1.4.0 C:\Users\user>scrapy Scrapy 1.4.0 - no active project Usage: scrapy <command> [options] [args] Available commands: bench Run quick benchmark test fetch Fetch a URL using the Scrapy downloader genspider Generate new spider using pre-defined templates runspider Run a self-contained spider (without creating a project) settings Get settings values shell Interactive scraping console startproject Create new project version Print Scrapy version view Open URL in browser, as seen by Scrapy [ more ] More commands available when run from project directory Use "scrapy <command> -h" to see more info about a command C:\Users\user>scrapy bench 2017-12-13 15:41:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot) 2017-12-13 15:41:49 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'} 2017-12-13 15:41:50 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats'] Unhandled error in Deferred: 2017-12-13 15:41:50 [twisted] CRITICAL: Unhandled error in Deferred: 2017-12-13 15:41:50 [twisted] CRITICAL: Traceback (most recent call last): File "d:\python362\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks result = g.send(result) File "d:\python362\lib\site-packages\scrapy\crawler.py", line 77, in crawl self.engine = self._create_engine() File "d:\python362\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine return ExecutionEngine(self, lambda _: self.stop()) File "d:\python362\lib\site-packages\scrapy\core\engine.py", line 69, in __init__ self.downloader = downloader_cls(crawler) File "d:\python362\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__ self.middleware = DownloaderMiddlewareManager.from_crawler(crawler) File "d:\python362\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler return cls.from_settings(crawler.settings, crawler) File "d:\python362\lib\site-packages\scrapy\middleware.py", line 34, in from_settings mwcls = load_object(clspath) File "d:\python362\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object mod = import_module(module) File "d:\python362\lib\importlib\__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 978, in _gcd_import File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 655, in _load_unlocked File "<frozen importlib._bootstrap_external>", line 678, in exec_module File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed File "d:\python362\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module> from twisted.web.client import ResponseFailed File "d:\python362\lib\site-packages\twisted\web\client.py", line 42, in <module> from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS File "d:\python362\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module> from twisted.internet.stdio import StandardIO, PipeAddress File "d:\python362\lib\site-packages\twisted\internet\stdio.py", line 30, in <module> from twisted.internet import _win32stdio File "d:\python362\lib\site-packages\twisted\internet\_win32stdio.py", 
line 9, in <module> import win32api ModuleNotFoundError: No module named 'win32api'
C:\Users\user>pip3 install pypiwin32 Collecting pypiwin32 Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB) 100% |████████████████████████████████| 8.3MB 34kB/s Installing collected packages: pypiwin32 Successfully installed pypiwin32-220 C:\Users\user> C:\Users\user>scrapy bench 2017-12-13 15:49:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot) 2017-12-13 15:49:05 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'} 2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.closespider.CloseSpider', 'scrapy.extensions.logstats.LogStats'] 2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled item pipelines: [] 2017-12-13 15:49:06 [scrapy.core.engine] INFO: Spider opened 2017-12-13 15:49:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:07 [scrapy.extensions.logstats] INFO: Crawled 85 pages (at 5100 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:08 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 4320 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:09 [scrapy.extensions.logstats] INFO: Crawled 229 pages (at 4320 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:10 [scrapy.extensions.logstats] INFO: Crawled 293 pages (at 3840 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:11 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 3840 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:12 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 3360 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:13 [scrapy.extensions.logstats] INFO: Crawled 469 pages (at 3360 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:14 [scrapy.extensions.logstats] INFO: Crawled 517 pages (at 2880 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:15 [scrapy.extensions.logstats] INFO: Crawled 573 pages (at 3360 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:16 [scrapy.core.engine] INFO: Closing spider (closespider_timeout) 2017-12-13 15:49:16 [scrapy.extensions.logstats] INFO: Crawled 621 pages (at 2880 pages/min), scraped 0 items (at 0 items/min) 2017-12-13 15:49:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 284168, 'downloader/request_count': 629, 'downloader/request_method_count/GET': 629, 'downloader/response_bytes': 1976557, 'downloader/response_count': 629, 'downloader/response_status_count/200': 629, 'finish_reason': 'closespider_timeout', 'finish_time': datetime.datetime(2017, 12, 13, 7, 49, 17, 78107), 'log_count/INFO': 17, 'request_depth_max': 21, 'response_received_count': 629, 'scheduler/dequeued': 629, 'scheduler/dequeued/memory': 629, 'scheduler/enqueued': 12581, 'scheduler/enqueued/memory': 12581, 'start_time': datetime.datetime(2017, 12, 13, 7, 49, 6, 563037)} 2017-12-13 15:49:17 [scrapy.core.engine] INFO: Spider closed (closespider_timeout) C:\Users\user>
In this tutorial, we assume that Scrapy is already installed on your system. If that's not the case, see the installation guide.
We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through the following tasks:
Scrapy is written in Python. If you are new to Python, you might want to get a feel for the language first, to get the most out of Scrapy.
If you're already familiar with other languages and want to learn Python quickly, we recommend reading Dive Into Python 3. Alternatively, you can follow the Python Tutorial.
If you're new to programming and want to start with Python, the online book Learn Python The Hard Way is quite useful. You can also have a look at this list of Python resources for non-programmers.
Before you start scraping, you have to set up a new Scrapy project. Enter the directory where you'd like to store your code and run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:

    tutorial/
        scrapy.cfg            # project configuration file
        tutorial/             # the project's Python module, where you'll put your code
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, how to follow links to further pages, and how to parse the downloaded pages to extract data.
This is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of your project:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)
As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods:
The parse() method usually parses the response, extracting the scraped data as dicts; it can also find new URLs to follow and create new requests (Request) from them.
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl quotes
This command runs the spider named quotes that we've just added; it will send some requests to quotes.toscrape.com. You will get an output similar to this:
... (omitted for brevity) 2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened 2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None) 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None) 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None) 2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html 2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html 2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished) ...
Now, check the files in the current directory. You should notice that two new files have been created, quotes-1.html and quotes-2.html, with the content for the respective URLs, just as our parse method instructs.
Note
If you are wondering why we haven't parsed the HTML yet, hold on, we will cover that soon.
The spider's start_requests method returns scrapy.Request objects, which Scrapy schedules and sends. Each response received is then instantiated as a Response object and passed to the callback defined on the request (here, the parse method), with the response as its argument.
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list is then used by the default implementation of start_requests() to create the initial requests for your spider:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
Scrapy will call the parse() method to handle the request for each of those URLs, even though we haven't explicitly told it to do so. This happens because parse() is Scrapy's default callback method: any request without an explicitly assigned callback ends up there.
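A minimal sketch of that default behaviour, using the same tutorial URLs:

```python
def start_requests(self):
    # No callback given, so Scrapy falls back to self.parse for this response.
    yield scrapy.Request('http://quotes.toscrape.com/page/1/')
    # Spelling the callback out explicitly ends up in the same method.
    yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse)
```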
The best way to learn how to extract data with Scrapy is to try out the selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Note
Remember to quote the URL when running the Scrapy shell from the command line; otherwise URLs containing arguments (for example, an & character) will not work.
On Windows, use double quotes instead:
scrapy shell "http://quotes.toscrape.com/page/1/"
You will see something like:
[ ... Scrapy log here ... ] 2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None) [s] Available Scrapy objects: [s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc) [s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90> [s] item {} [s] request <GET http://quotes.toscrape.com/page/1/> [s] response <200 http://quotes.toscrape.com/page/1/> [s] settings <scrapy.settings.Settings object at 0x7fa91d888c10> [s] spider <DefaultSpider 'default' at 0x7fa91c8af990> [s] Useful shortcuts: [s] shelp() Shell help (print this help) [s] fetch(req_or_url) Fetch request (or URL) and update local objects [s] view(response) View response in a browser >>>
Using the shell, you can try selecting elements with CSS selectors:
>>> response.css('title') [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList: a list of Selector objects that wrap XML/HTML elements and let you run further queries to fine-grain the selection or extract the data.
To extract the text from the title above, you can do:
>>> response.css('title::text').extract() ['Quotes to Scrape']
There are two things to note here. One is that we've added ::text to the CSS query, which means we want only the text inside the <title> element. If we don't specify ::text, we get the full title element, including its tags:
>>> response.css('title').extract() ['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .extract() is a list, because we are dealing with a SelectorList. When you know you just want the first result, you can do this instead:
>>> response.css('title::text').extract_first() 'Quotes to Scrape'
Alternatively, you could write:
>>> response.css('title::text')[0].extract() 'Quotes to Scrape'
However, if no element matches the selection, .extract_first() returns None instead of raising an IndexError.
There's a lesson here: for most scraping code you want it to be resilient to errors, so that even when some parts fail to extract because an element is missing from the page, you can at least still get the other data.
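A small sketch of that resilient style inside a parse() callback, reusing the quote selectors introduced later in this tutorial; the default='unknown' fallback is purely illustrative:

```python
for quote in response.css('div.quote'):
    yield {
        # extract_first() returns None instead of raising an IndexError,
        # and it also accepts an explicit fallback value.
        'text': quote.css('span.text::text').extract_first(),
        'author': quote.css('small.author::text').extract_first(default='unknown'),
        'tags': quote.css('div.tags a.tag::text').extract(),  # simply an empty list if absent
    }
```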
Besides the extract() and extract_first() methods, you can also use the re() method to extract with regular expressions:
    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']
In order to find the proper CSS selectors to use, you can open the page in your browser and view the source code. You can also use your browser's developer tools or extensions such as Firebug (see the sections about Using Firebug for scraping and Using Firefox for scraping).
Selector Gadget is also a nice tool for quickly finding the CSS selector of an element, and it works in many browsers.
Besides CSS, Scrapy selectors also support XPath expressions:
    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').extract_first()
    'Quotes to Scrape'
XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, if you look at the relevant source code, you'll see that CSS selectors are converted to XPath under the hood.
While perhaps not as popular as CSS selectors, XPath expressions offer more power: besides navigating the structure, they can also look at the content. With XPath you can select things like "the link that contains the text Next Page". This makes XPath very fitting for scraping tasks, and we encourage you to learn XPath even if you already know how to write CSS selectors; it will make scraping much easier.
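For example, the pager markup shown later in this tutorial has a link whose text starts with "Next", so a sketch like the following would pull out its href purely by text content:

```python
>>> response.xpath('//a[contains(., "Next")]/@href').extract_first()
'/page/2/'
```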
We won't cover much of XPath here, but you can read more in using XPath with Scrapy Selectors. We also recommend the tutorial to learn XPath through examples, and the tutorial "how to think in XPath".
Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.
Each quote on http://quotes.toscrape.com is represented by HTML elements that look like this:
    <div class="quote">
        <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>
Let's open up the scrapy shell and play around a bit to find out how to extract the data we want:
$ scrapy shell 'http://quotes.toscrape.com'
We get a list of selectors for the quote HTML elements with:
>>> response.css("div.quote")
Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
>>> quote = response.css("div.quote")[0]
Now, using the quote object we just created, let's extract the title, the author and the tags from that quote:
    >>> title = quote.css("span.text::text").extract_first()
    >>> title
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").extract_first()
    >>> author
    'Albert Einstein'
Given that the tags are a list of strings, we can use the .extract() method to get all of them:
    >>> tags = quote.css("div.tags a.tag::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']
Having figured out how to extract each piece of information, we can now iterate over all the quote elements and put them together into Python dictionaries:
    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").extract_first()
    ...     author = quote.css("small.author::text").extract_first()
    ...     tags = quote.css("div.tags a.tag::text").extract()
    ...     print(dict(text=text, author=author, tags=tags))
    {'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
    >>>
Let's get back to our spider. Until now, it hasn't extracted any data; it has only saved whole HTML pages to local files. Let's integrate the extraction logic above into it.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }
If you run this spider, it will output the extracted data along with the log:
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '「It is better to be hated for what you are than to be loved for what you are not.」'} 2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/> {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "「I have not failed. I've just found 10,000 ways that won't work.」"}
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all the scraped items, serialized in JSON.
For historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you'll end up with a broken JSON file. You can also use other formats, such as JSON Lines:
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it is stream-like: you can easily append new records to the file, and running the command twice doesn't cause the problem JSON has. Also, since each record is on a separate line, you can process big files without having to fit everything into memory, and there are tools like JQ that help you do that at the command line.
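As a small illustration of why the one-record-per-line layout is convenient, the following sketch reads the quotes.jl file produced by the command above back in, one record at a time:

```python
import json

with open('quotes.jl', encoding='utf-8') as f:
    for line in f:                    # one JSON object per line, no need to load the whole file
        item = json.loads(line)
        print(item['author'], '-', item['text'][:40])
```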
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was already set up for you when the project was created, in tutorial/pipelines.py, although you don't need to implement any pipeline if you just want to store the scraped items.
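For reference, a minimal sketch of what could go into tutorial/pipelines.py; the drop-items-without-an-author rule is purely illustrative, and the pipeline would also have to be enabled via the ITEM_PIPELINES setting in settings.py before Scrapy uses it:

```python
from scrapy.exceptions import DropItem


class ValidateQuotePipeline(object):
    """Illustrative pipeline: discard scraped items that are missing an author."""

    def process_item(self, item, spider):
        if not item.get('author'):
            raise DropItem('Missing author in %r' % item)
        return item
```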
Maybe you want to scrape quotes from all the pages of the website, instead of just the first two pages of http://quotes.toscrape.com.
Now that you know how to extract data from pages, let's see how to follow links from them.
The first thing to do is extract the link to the page we want to follow. Inspecting our page, we can see that the link to the next page sits in the following markup:
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
We can try extracting it in the shell:
>>> response.css('li.next a').extract_first() '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:
>>> response.css('li.next a::attr(href)').extract_first() '/page/2/'
Let's now modify our spider so that it recursively follows the link to the next page and extracts data from it:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the link can be relative) and yields a new request to the next page, registering itself (parse) as the callback.
What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when it finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and that extract different kinds of data depending on the page being visited.
In our example, this creates a sort of loop that follows all the links to the next page until there is none left, which is handy for crawling blogs, forums and other sites with pagination.
As a shortcut for creating Request objects you can use response.follow:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that Request.
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, callback=self.parse)
For <a> elements there is a further shortcut: response.follow uses their href attribute automatically, so the code can be shortened even more:
    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)
Note
response.follow(response.css('li.next a')) does not work, because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the examples above, or response.follow(response.css('li.next a')[0]), works fine.
Here is another spider that illustrates callbacks and following links, this time scraping author information:
    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)'):
                yield response.follow(href, self.parse_author)

            # follow pagination links
            for href in response.css('li.next a::attr(href)'):
                yield response.follow(href, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }
This spider starts from the main page; it follows all the links to author pages, calling the parse_author callback for each of them, and it follows the pagination links with the parse callback, as we saw before.
Here we pass the callbacks to response.follow directly as positional arguments to make the code shorter; this also works for scrapy.Request.
The parse_author callback defines a helper function to extract and clean up the data from a CSS query, and then yields a Python dict with the author data.
Another interesting thing this spider demonstrates is that, even though there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, avoiding the problem of hammering servers because of a programming mistake. This behaviour can be configured through the DUPEFILTER_CLASS setting.
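A sketch of what changing that setting looks like in settings.py; the first value is the documented default, and swapping in scrapy.dupefilters.BaseDupeFilter is an example of a class that effectively disables the duplicate check:

```python
# settings.py

# Default: fingerprint-based duplicate filtering of requests.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Effectively disables duplicate filtering (its request_seen() never reports
# a request as already seen). Uncomment to try it out:
# DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
```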
Hopefully by now you have a good feel for how to use Scrapy's mechanism of following links and callbacks.
As another example of a spider that leverages the link-following mechanism, check out the CrawlSpider class: a small generic crawling engine on top of which you can implement your own crawler by tweaking how it follows links.
Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data along to the callbacks; a sketch of one such trick follows.
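In Scrapy 1.x the usual trick is the Request.meta dict: data attached to a request in one callback is available again on response.meta in the next. A rough sketch, borrowing the item fields and the author-link selector from the examples above:

```python
def parse(self, response):
    for quote in response.css('div.quote'):
        item = {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
        }
        author_href = quote.css('.author + a::attr(href)').extract_first()
        if author_href:
            # Carry the partially built item over to the next callback.
            yield response.follow(author_href, self.parse_author, meta={'item': item})

def parse_author(self, response):
    item = response.meta['item']   # the partial item built in parse()
    item['birthdate'] = response.css('.author-born-date::text').extract_first()
    yield item
```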
You can provide command line arguments to your spider by using the -a option when running it:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider build its URL based on the argument, so that it fetches only quotes with a specific tag:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = 'http://quotes.toscrape.com/'
            tag = getattr(self, 'tag', None)
            if tag is not None:
                url = url + 'tag/' + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                yield response.follow(next_page, self.parse)
If you pass tag=humor to this spider, you'll notice that it only visits URLs for the humor tag, such as http://quotes.toscrape.com/tag/humor. You can learn more about handling spider arguments here.
This tutorial covered only the basics of Scrapy; there is a lot of other functionality not mentioned here. Check the "What else?" section of the Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the table of contents to learn more about the command-line tool, spiders, selectors and the other things this tutorial hasn't covered. The next chapter is an example project.
http://www.cnblogs.com/-E6-/p/7213872.html
Original English documentation: https://docs.scrapy.org/en/latest/topics/commands.html
Source code on GitHub: https://github.com/scrapy/scrapy/tree/1.4
XPath and selectors:
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:
- BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
- lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they "select" certain parts of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
Scrapy selectors are built over the lxml library, which means they’re very similar in speed and parsing accuracy.
This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.
For a complete reference of the selectors API see Selector reference
Scrapy selectors are instances of the Selector class, constructed by passing text or a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on the input type:
>>> from scrapy.selector import Selector >>> from scrapy.http import HtmlResponse
Constructing from text:
>>> body = '<html><body><span>good</span></body></html>' >>> Selector(text=body).xpath('//span/text()').extract() [u'good']
Constructing from response:
>>> response = HtmlResponse(url='http://example.com', body=body) >>> Selector(response=response).xpath('//span/text()').extract() [u'good']
For convenience, response objects expose a selector on .selector attribute, it’s totally OK to use this shortcut when possible:
>>> response.selector.xpath('//span/text()').extract() [u'good']
To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:
Here’s its HTML code:
    <html>
     <head>
      <base href='http://example.com/' />
      <title>Example website</title>
     </head>
     <body>
      <div id='images'>
       <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
       <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
       <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
       <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
       <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
      </div>
     </body>
    </html>
First, let’s open the shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, after the shell loads, you’ll have the response available as the response shell variable, and its attached selector in the response.selector attribute.
Since we’re dealing with HTML, the selector will automatically use an HTML parser.
So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:
>>> response.selector.xpath('//title/text()') [<Selector (text) xpath=//title/text()>]
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():
>>> response.xpath('//title/text()') [<Selector (text) xpath=//title/text()>] >>> response.css('title::text') [<Selector (text) xpath=//title/text()>]
As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:
>>> response.css('img').xpath('@src').extract() [u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
To actually extract the textual data, you must call the selector .extract() method, as follows:
>>> response.xpath('//title/text()').extract() [u'Example website']
If you want to extract only the first matched element, you can call the selector .extract_first():
>>> response.xpath('//div[@id="images"]/a/text()').extract_first() u'Name: My image 1 '
It returns None if no element was found:
>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None True
A default return value can be provided as an argument, to be used instead of None:
>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found') 'not-found'
Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:
>>> response.css('title::text').extract() [u'Example website']
Now we’re going to get the base URL and some image links:
    >>> response.xpath('//base/@href').extract()
    [u'http://example.com/']
    >>> response.css('base::attr(href)').extract()
    [u'http://example.com/']
    >>> response.xpath('//a[contains(@href, "image")]/@href').extract()
    [u'image1.html', u'image2.html', u'image3.html', u'image4.html', u'image5.html']
    >>> response.css('a[href*=image]::attr(href)').extract()
    [u'image1.html', u'image2.html', u'image3.html', u'image4.html', u'image5.html']
    >>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
    [u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
    >>> response.css('a[href*=image] img::attr(src)').extract()
    [u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:
>>> links = response.xpath('//a[contains(@href, "image")]') >>> links.extract() [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>'] >>> for index, link in enumerate(links): ... args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract()) ... print 'Link number %d points to url %s and image %s' % args Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg'] Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg'] Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg'] Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg'] Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
Selector also has a .re() method for extracting data using regular expressions. However, unlike using the .xpath() or .css() methods, .re() returns a list of unicode strings, so you can’t construct nested .re() calls.
Here’s an example used to extract image names from the HTML code above:
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)') [u'My image 1', u'My image 2', u'My image 3', u'My image 4', u'My image 5']
There’s an additional helper reciprocating .extract_first() for .re(), named .re_first(). Use it to extract just the first matching string:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)') u'My image 1'
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
>>> divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
    >>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
    ...     print p.extract()
This is the proper way to do it (note the dot prefixing the .//p XPath):
    >>> for p in divs.xpath('.//p'):  # extracts all <p> inside
    ...     print p.extract()
Another common case would be to extract all direct <p> children:
    >>> for p in divs.xpath('p'):
    ...     print p.extract()
For more details about relative XPaths see the Location Paths section in the XPath specification.
XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.
Here’s an example to match an element based on its "id" attribute value, without hard-coding it (that was shown previously):
    >>> # `$val` used in the expression, a `val` argument needs to be passed
    >>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
    u'Name: My image 1 '
Here’s another example, to find the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first() u'images'
All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.
parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.
Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:
| prefix | namespace | usage |
|---|---|---|
| re | http://exslt.org/regular-expressions | regular expressions |
| set | http://exslt.org/sets | set manipulation |
The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.
Example selecting links in a list item with a "class" attribute ending with a digit:
>>> from scrapy import Selector >>> doc = """ ... <div> ... <ul> ... <li class="item-0"><a href="link1.html">first item</a></li> ... <li class="item-1"><a href="link2.html">second item</a></li> ... <li class="item-inactive"><a href="link3.html">third item</a></li> ... <li class="item-1"><a href="link4.html">fourth item</a></li> ... <li class="item-0"><a href="link5.html">fifth item</a></li> ... </ul> ... </div> ... """ >>> sel = Selector(text=doc, type="html") >>> sel.xpath('//li//@href').extract() [u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html'] >>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract() [u'link1.html', u'link2.html', u'link4.html', u'link5.html'] >>>
Warning
The C library libxslt doesn’t natively support EXSLT regular expressions, so lxml’s implementation uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.
These can be handy for excluding parts of a document tree before extracting text elements for example.
Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:
>>> doc = """ ... <div itemscope itemtype="http://schema.org/Product"> ... <span itemprop="name">Kenmore White 17" Microwave</span> ... <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' /> ... <div itemprop="aggregateRating" ... itemscope itemtype="http://schema.org/AggregateRating"> ... Rated <span itemprop="ratingValue">3.5</span>/5 ... based on <span itemprop="reviewCount">11</span> customer reviews ... </div> ... ... <div itemprop="offers" itemscope itemtype="http://schema.org/Offer"> ... <span itemprop="price">$55.00</span> ... <link itemprop="availability" href="http://schema.org/InStock" />In stock ... </div> ... ... Product description: ... <span itemprop="description">0.7 cubic feet countertop microwave. ... Has six preset cooking categories and convenience features like ... Add-A-Minute and Child Lock.</span> ... ... Customer reviews: ... ... <div itemprop="review" itemscope itemtype="http://schema.org/Review"> ... <span itemprop="name">Not a happy camper</span> - ... by <span itemprop="author">Ellie</span>, ... <meta itemprop="datePublished" content="2011-04-01">April 1, 2011 ... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> ... <meta itemprop="worstRating" content = "1"> ... <span itemprop="ratingValue">1</span>/ ... <span itemprop="bestRating">5</span>stars ... </div> ... <span itemprop="description">The lamp burned out and now I have to replace ... it. </span> ... </div> ... ... <div itemprop="review" itemscope itemtype="http://schema.org/Review"> ... <span itemprop="name">Value purchase</span> - ... by <span itemprop="author">Lucas</span>, ... <meta itemprop="datePublished" content="2011-03-25">March 25, 2011 ... <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> ... <meta itemprop="worstRating" content = "1"/> ... <span itemprop="ratingValue">4</span>/ ... <span itemprop="bestRating">5</span>stars ... </div> ... <span itemprop="description">Great microwave for the price. It is small and ... fits in my apartment.</span> ... </div> ... ... ... </div> ... """ >>> sel = Selector(text=doc, type="html") >>> for scope in sel.xpath('//div[@itemscope]'): ... print "current scope:", scope.xpath('@itemtype').extract() ... props = scope.xpath(''' ... set:difference(./descendant::*/@itemprop, ... .//*[@itemscope]/*/@itemprop)''') ... print " properties:", props.extract() ... print current scope: [u'http://schema.org/Product'] properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review'] current scope: [u'http://schema.org/AggregateRating'] properties: [u'ratingValue', u'reviewCount'] current scope: [u'http://schema.org/Offer'] properties: [u'price', u'availability'] current scope: [u'http://schema.org/Review'] properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description'] current scope: [u'http://schema.org/Rating'] properties: [u'worstRating', u'ratingValue', u'bestRating'] current scope: [u'http://schema.org/Review'] properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description'] current scope: [u'http://schema.org/Rating'] properties: [u'worstRating', u'ratingValue', u'bestRating'] >>>
Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude those that are themselves inside another itemscope.
Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not much familiar with XPath yet, you may want to take a look first at this XPath tutorial.
When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.
This is because the expression .//text() yields a collection of text elements, a node-set. And when a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), it results in the text for the first element only.
Example:
>>> from scrapy import Selector >>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
Converting a node-set to string:
>>> sel.xpath('//a//text()').extract() # take a peek at the node-set [u'Click here to go to the ', u'Next Page'] >>> sel.xpath("string(//a[1]//text())").extract() # convert it to string [u'Click here to go to the ']
A node converted to a string, however, puts together the text of itself plus of all its descendants:
>>> sel.xpath("//a[1]").extract() # select the first node [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>'] >>> sel.xpath("string(//a[1])").extract() # convert it to string [u'Click here to go to the Next Page']
So, using the .//text() node-set won’t select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract() []
But using the . to mean the node, it works:
>>> sel.xpath("//a[contains(., 'Next Page')]").extract() [u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
Example:
    >>> from scrapy import Selector
    >>> sel = Selector(text="""
    ....:     <ul class="list">
    ....:         <li>1</li>
    ....:         <li>2</li>
    ....:         <li>3</li>
    ....:     </ul>
    ....:     <ul class="list">
    ....:         <li>4</li>
    ....:         <li>5</li>
    ....:         <li>6</li>
    ....:     </ul>""")
    >>> xp = lambda x: sel.xpath(x).extract()
This gets all first <li> elements under whatever it is its parent:
>>> xp("//li[1]") [u'<li>1</li>', u'<li>4</li>']
And this gets the first <li> element in the whole document:
>>> xp("(//li)[1]") [u'<li>1</li>']
This gets all first <li> elements under an <ul> parent:
>>> xp("//ul/li[1]") [u'<li>1</li>', u'<li>4</li>']
And this gets the first <li> element under an <ul> parent in the whole document:
>>> xp("(//ul/li)[1]") [u'<li>1</li>']
Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements than you want, if they have a different class name that shares the string someclass.
As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:
    >>> from scrapy import Selector
    >>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
    >>> sel.css('.shout').xpath('./time/@datetime').extract()
    [u'2014-07-23 19:00']
This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that will follow.
class scrapy.selector.Selector(response=None, text=None, type=None)
An instance of Selector is a wrapper over response to select certain parts of its content.
response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.
text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.
type defines the selector type, it can be "html", "xml" or None (default).
If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.
If type is None and a response is passed, the selector type is inferred from the response type as follows:
- "html" for HtmlResponse type
- "xml" for XmlResponse type
- "html" for anything else
Otherwise, if type is set, the selector type will be forced and no detection will occur.
xpath(query)
Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.
query is a string containing the XPATH query to apply.
Note
For convenience, this method can be called as response.xpath()
css(query)
Apply the given CSS selector and return a SelectorList instance.
query is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the cssselect library and run with the .xpath() method.
Note
For convenience this method can be called as response.css()
extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
re(regex)
Apply the given regex and return a list of unicode strings with the matches.
regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)
Note
Note that re() and re_first() both decode HTML entities (except < and &).
register_namespace(prefix, uri)
Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See the examples below.
remove_namespaces()
Remove all namespaces, allowing to traverse the document using namespace-less xpaths. See the example below.
__nonzero__()
Returns True if there is any real content selected or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.
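In practice the most common way this shows up is checking whether a query matched anything at all; a minimal sketch with ad-hoc markup:

```python
>>> from scrapy.selector import Selector
>>> sel = Selector(text='<html><body><span>good</span></body></html>')
>>> bool(sel.xpath('//span'))   # the query matched something
True
>>> bool(sel.xpath('//table'))  # nothing matched: the empty SelectorList is falsy
False
```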
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
xpath(query)
Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.
query is the same argument as the one in Selector.xpath()
css(query)
Call the .css() method for each element in this list and return their results flattened as another SelectorList.
query is the same argument as the one in Selector.css()
extract()
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
re()
Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.
Here’s a couple of Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with an HtmlResponse object like this:
sel = Selector(html_response)
Select all <h1> elements from an HTML response body, returning a list of Selector objects (ie. a SelectorList object):
sel.xpath("//h1")
Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:
    sel.xpath("//h1").extract()         # this includes the h1 tag
    sel.xpath("//h1/text()").extract()  # this excludes the h1 tag
Iterate over all <p> tags and print their class attribute:
    for node in sel.xpath("//p"):
        print node.xpath("@class").extract()
Here’s a couple of examples to illustrate several concepts. In both cases we assume there is already a Selector instantiated with an XmlResponse object like this:
sel = Selector(xml_response)
Select all <product> elements from an XML response body, returning a list of Selector objects (ie. a SelectorList object):
sel.xpath("//product")
Extract all prices from a Google Base XML feed which requires registering a namespace:
    sel.register_namespace("g", "http://base.google.com/ns/1.0")
    sel.xpath("//g:price").extract()
When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write more simple/convenient XPaths. You can use the Selector.remove_namespaces() method for that.
Let’s show an example that illustrates this with GitHub blog atom feed.
First, we open the shell with the url we want to scrape:
$ scrapy shell https://github.com/blog.atom
Once in the shell we can try selecting all <link> objects and see that it doesn’t work (because the Atom XML namespace is obfuscating those nodes):
>>> response.xpath("//link") []
But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:
>>> response.selector.remove_namespaces() >>> response.xpath("//link") [<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>, <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>, ...
If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are: first, removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy; second, there could be cases where using namespaces is actually required, when some element names clash between namespaces (those cases are very rare, though).
https://docs.scrapy.org/en/latest/topics/selectors.html
Scrapy learning series (1): querying page elements with CSS selectors and XPath selectors
This article mainly covers creating a simple spider, and along the way introduces the two ways of selecting page elements (CSS selectors and XPath selectors).
Open a command line and run the following command:
scrapy startproject homelink_selling_index
The generated project structure looks like this:
    │  scrapy.cfg
    │
    └─lianjia_shub
        │  items.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        └─spiders
                __init__.py
The page elements that need to be scraped are shown in the screenshot in the original post (not reproduced here).
Import the namespace:
import scrapy
Define the spider:
    class homelink_selling_index_spider(scrapy.Spider):
        # The spider's name; it is used when invoking the spider for crawling:
        #   scrapy crawl <spider.name>
        name = "homelink_selling_index"

        # Unless other URLs are specified, the spider starts crawling from the links in start_urls
        start_urls = ["http://bj.lianjia.com/ershoufang/pg1tt2/"]

        # parse is scrapy.Spider's default entry point for handling the http response
        # parse processes every link in start_urls one by one
        def parse(self, response):
            # Get the list of houses on the current page
            #house_lis = response.css('.house-lst .info-panel')
            house_lis = response.xpath('//ul[@class="house-lst"]/li/div[@class="info-panel"]')

            # Write the results to a file (in the console the house titles may show up
            # as garbled text because of encoding issues)
            with open("homelink.log", "wb") as f:
                ## Using CSS selectors
                #average_price = response.css('.secondcon.fl li:nth-child(1)').css('.botline a::text').extract_first()
                #f.write("Average Price: " + str(average_price) + "\r\n")
                #yesterday_count = response.css('.secondcon.fl li:last-child').css('.botline strong::text').extract_first()
                #f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
                #for house_li in house_lis:
                #    link = house_li.css('a::attr("href")').extract_first()     # the house's link
                #    title = house_li.css('a::text').extract_first()            # the house's title
                #    price = house_li.css('.price .num::text').extract_first()  # the house's price

                # Using XPath selectors
                average_price = response.xpath('//div[@class="secondcon fl"]//li[1]/span[@class="botline"]//a/text()').extract_first()
                f.write("Average Price: " + str(average_price) + "\r\n")
                yesterday_count = response.xpath('//div[@class="secondcon fl"]//li[last()]//span[@class="botline"]/strong/text()').extract_first()
                f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
                for house_li in house_lis:
                    # Note the "." at the start of these XPaths; without it the query would start
                    # from the document root instead of the current node
                    link = house_li.xpath('.//a/@href').extract_first()
                    title = house_li.xpath('.//a/text()').extract_first()
                    price = house_li.xpath('.//div[@class="price"]/span[@class="num"]/text()').extract_first()
                    f.write("Title: {0}\tPrice:{1}\r\n\tLink: {2}\r\n".format(title.encode('utf-8'), price, link))
    Average Price: 44341
    Yesterday Count: 33216
    Title: 萬科假日風景全明格局 南北精裝三居 滿五惟一	Price:660
    	Link: http://bj.lianjia.com/ershoufang/xxx.html
    Title: 南北通透精裝三居 免稅帶車位 先後對花園 有鑰匙	Price:910
    	Link: http://bj.lianjia.com/ershoufang/xxx.html
    Title: 西直門 時代之光名苑 西南四居 滿五惟一 誠心出售	Price:1200
    	Link: http://bj.lianjia.com/ershoufang/xxx.html
......
With the three steps above we can already perform simple scraping of page elements. However, we have not yet made real use of the many convenient and powerful features Scrapy provides, such as ItemLoader and Pipeline. Those will be covered in later articles.
https://www.cnblogs.com/silverbullet11/p/scrapy_series_1.html