Python web scraping with Scrapy: the Scrapy shell

Date: 2019-11-13


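The Scrapy shell is an interactive console where you can try out and debug your scraping code without running a spider. It is meant for testing data extraction code, but since it is also a regular Python console you can test any Python code in it.

The shell is used for testing XPath or CSS expressions, to see how they work and what data they extract from the pages you are scraping. While you write a spider, it lets you test your expressions interactively, so you do not have to rerun the spider after every change.

Once you are familiar with the Scrapy shell, you will find it an invaluable tool for developing and debugging spiders.

If you have IPython installed, the Scrapy shell uses it instead of the standard Python console. The IPython console is more powerful than the alternatives and provides smart auto-completion and colorized output, among other features.

We strongly recommend installing IPython, especially if you work on a Unix system (where IPython works very well). See the IPython installation guide for details.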

Launching the shell

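Use the shell command to launch the Scrapy shell:

scrapy shell <url>

<url> is the address of the page you want to scrape. Note that this only drops you into Scrapy's shell for debugging; once inside, you can call fetch(url) to download any other page you want. To check which page you are currently looking at, inspect response.url.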

With log output:

scrapy shell 'http://scrapy.org'

Without log output:

scrapy shell 'http://scrapy.org' --nolog

Using the shell

D:\項目\小項目\scrapy_day6_httpbin\httpbin>scrapy shell "https://dig.chouti.com"  --nolog
https://www.zhihu.com/captcha.gif?r=1512028381914&type=login
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x04E60090>
[s]   item       {}
[s]   request    <GET https://dig.chouti.com>
[s]   response   <200 https://dig.chouti.com>
[s]   settings   <scrapy.settings.Settings object at 0x04E60390>
[s]   spider     <DefaultSpider 'default' at 0x5a23f70>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

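The Scrapy shell is just a regular Python console (or IPython console) that provides a few additional shortcuts.

Available shortcuts

shelp() - prints a help listing of the available objects and shortcuts

fetch(request_or_url) - fetches a new response from the given request or URL and updates the related objects accordingly

view(response) - opens the given response in your local web browser. It adds a <base> tag to the response body so that external links (such as images and CSS) display correctly. Note that this creates a temporary file on your machine, and that file is not removed automatically.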
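To make the <base> tag behaviour of view(response) concrete, here is a rough sketch using only the standard library. It is not Scrapy's actual implementation; the helper names add_base_tag and save_for_viewing are hypothetical, and the sample page is made up.

```python
import re
import tempfile


def add_base_tag(html: str, url: str) -> str:
    """Insert <base href=url> right after the opening <head> tag, if absent."""
    if re.search(r"<base\s", html, flags=re.IGNORECASE):
        return html  # the page already declares a base URL
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + f'<base href="{url}">',
        html,
        count=1,
        flags=re.IGNORECASE,
    )


def save_for_viewing(html: str, url: str) -> str:
    """Write the patched HTML to a temporary file and return its path."""
    patched = add_base_tag(html, url)
    # delete=False mirrors the behaviour described above: the file stays on disk.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(patched)
        return fh.name


# Made-up page with a relative image link that needs the base URL to resolve.
page = "<html><head><title>t</title></head><body><img src='/logo.png'></body></html>"
path = save_for_viewing(page, "https://dig.chouti.com")
print(path)  # passing this path to webbrowser.open() would display the page
```

With the <base> tag in place, the browser resolves /logo.png against the original site instead of the local temporary file.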

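Available Scrapy objects

The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and the Selector object (for both HTML and XML content).

These objects are:

crawler - the current Crawler object.

spider - the spider that handles the URL, or a plain Spider object if no spider is found for the current URL.

request - the Request object of the last fetched page. You can modify this request with replace(), or fetch a new request with the fetch shortcut.

response - a Response object containing the last fetched page.

sel - a Selector object built from the last fetched response. Newer Scrapy releases deprecate this shortcut in favor of response.selector, response.xpath() and response.css().

settings - the current Scrapy settings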

Print the status code of the current response, then its headers:

>>> response
<200 https://dig.chouti.com>

>>> response.headers
{b'Date': [b'Thu, 30 Nov 2017 09:45:06 GMT'], b'Content-Type': [b'text/html; charset=UTF-8'], b'Server': [b'Tengine'], b'Content-Language': [b'en'], b'X-Via': [b'1.1 bd157:10 (Cdn Cache Server V2.0)']}

Try out an XPath expression to extract content:

>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first()
'\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t【迅雷嘉獎維護公司利益員工 每人獎10萬】11月30日訊，迅雷與迅雷大數據近日發生「內訌」，雙方屢次發佈公告互相指責。對此，迅雷發佈內部郵件，嘉獎在關鍵時刻維護公司利益的5名員工，並給予每人10萬元的獎勵。\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t'
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first().strip()
'【迅雷嘉獎維護公司利益員工 每人獎10萬】11月30日訊，迅雷與迅雷大數據近日發生「內訌」，雙方屢次發佈公告互相指責。對此，迅雷發佈內部郵件，嘉獎在關鍵時刻維護公司利益的5名員工，並給予每人10萬元的獎勵。'

CSS selectors work here as well:

>>> sel.css('.part1 a::text').extract_first().strip()
'Netflix買下《白夜追兇》海外發行權，將在全球190多個國家和地區播出'

view is the interesting one: what it actually does is save the downloaded HTML to a local file and open it in the browser.

>>> view(response)
True

Print the URL of the current response:

>>> response.url
'https://dig.chouti.com'

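One question did come to mind, though: sites such as Zhihu will not return usable content unless the request carries header information. How do you handle that in the shell? Like this:

　　1. First, import the Request class with from scrapy import Request.

　　2. Assign the request to data. I naively assumed I could call data.xpath and data.css on it directly, but that does not work: data is a Request, so data.url and data.headers are available, but to query the page content you first have to call fetch(data), which performs the request and updates the response object. Only then can sel.xpath or sel.css extract the information you need.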

>>> data = Request("https://www.taobao.com",headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"})
>>> fetch(data)
2017-11-30 22:24:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.taobao.com> (referer: None)
>>> response.url
'https://www.taobao.com'
>>> sel.xpath('/html/body/div[4]/div[1]/div[1]/div[1]/div/ul/li[1]/a[1]')
[<Selector xpath='/html/body/div[4]/div[1]/div[1]/div[1]/div/ul/li[1]/a[1]' data='<a href="https://www.taobao.com/markets/'>]
>>> sel.xpath('/html/body/div[4]/div[1]/div[1]/div[1]/div/ul/li[1]/a[1]').extract_first()
'<a href="https://www.taobao.com/markets/nvzhuang/taobaonvzhuang" data-cid="1" data-dataid="222887">女裝</a>'
>>> data.headers
{b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}

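　　3. Attentive readers will notice that the request headers contain little more than the browser User-Agent we supplied, while response.headers, which holds the headers the server sent back, contains far more entries.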

>>> data.headers
{b'User-Agent': [b'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}


>>> response.headers
{b'Timing-Allow-Origin': [b'*'], b'Eagleid': [b'7583cc6515120518627751757e'], b'Age': [b'48'], b'Cache-Control': [b'max-age=0, s-maxage=90'], b'X-Cache': [b'HIT TCP_MEM_HIT dirn:-2:-2 mlen:-1'], b'Vary': [b'Accept-Encoding', b'Ali-Detector-Type, X-CIP-PT'], b'Server': [b'Tengine'], b'Content-Type': [b'text/html; charset=utf-8'], b'X-Swift-Cachetime': [b'90'], b'Set-Cookie': [b'thw=cn; Path=/; Domain=.taobao.com; Expires=Fri, 30-Nov-18 14:24:22 GMT;'], b'Via': [b'cache10.l2cn416[351,200-0,M], cache29.l2cn416[352,0], cache1.cn338[0,200-0,H], cache6.cn338[0,0]'], b'Strict-Transport-Security': [b'max-age=31536000'], b'X-Swift-Savetime': [b'Thu, 30 Nov 2017 14:23:34 GMT'], b'Date': [b'Thu, 30 Nov 2017 14:24:22 GMT']}