Scrapy Shell

Date: 2020-05-13

Tags: scrapy shell. Category: Python


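The Scrapy shell is an interactive console where you can try out and debug scraping code without starting a spider. It is also useful for testing XPath or CSS expressions to see how they behave, which makes it easier to extract data from the pages you crawl.

If IPython is installed, the Scrapy shell uses it in place of the standard Python console. IPython is more powerful than the alternatives, offering smart auto-completion, colorized output, and other features, so installing it is recommended.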

Starting the Scrapy Shell

Go to the project's root directory and run the following command to start the shell:

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"


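Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as a Response object and a Selector object (for both HTML and XML content).

Once the shell has loaded, you have a local variable named response that holds the response data. Entering response.body prints the response body, and entering response.headers shows the response headers.

Entering response.selector returns a Selector object initialized from the response. You can then query the response with response.selector.xpath() or response.selector.css().

Scrapy also provides shortcuts: response.xpath() and response.css() work the same way (as in the earlier examples).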

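Selectors

Scrapy Selectors have built-in support for XPath and CSS selector expressions.

A Selector has four basic methods, of which xpath() is the most commonly used:

xpath(): takes an XPath expression and returns a selector list of all nodes that match it
extract(): serializes the node(s) to Unicode strings and returns them as a list
css(): takes a CSS expression and returns a selector list of all nodes that match it; the syntax is the same as in BeautifulSoup4
re(): extracts data using the given regular expression and returns a list of Unicode strings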

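Examples of XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"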

Trying Out Selectors

We use Tencent's recruitment site, http://hr.tencent.com/positio..., as an example:

# Start the shell (the output below was recorded under Python 2, hence the u'' prefixes)
scrapy shell "http://hr.tencent.com/position.php?&start=0#a"

# Returns a list of XPath selector objects
response.xpath('//title')
[<Selector xpath='//title' data=u'<title>\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058</title'>]

# extract() returns a list of Unicode strings
response.xpath('//title').extract()
[u'<title>\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058</title>']

# Print the first element of the list, displayed in the terminal's encoding
print(response.xpath('//title').extract()[0])
<title>职位搜索 | 社会招聘 | Tencent 腾讯招聘</title>

# Returns a list of XPath selector objects
response.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058'>]

# Returns the Unicode string of the first element of the list
response.xpath('//title/text()')[0].extract()
u'\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058'

# Displayed in the terminal's encoding
print(response.xpath('//title/text()')[0].extract())
职位搜索 | 社会招聘 | Tencent 腾讯招聘

# Select every element with class="even" and keep the result for the queries below
site = response.xpath('//*[@class="even"]')

Job title:

print(site[0].xpath('./td[1]/a/text()').extract()[0])
TEG15-运营开发工程师（深圳）

Link to the job detail page:

print(site[0].xpath('./td[1]/a/@href').extract()[0])
position_detail.php?id=20744&keywords=&tid=0&lid=0

Job category:

print(site[0].xpath('./td[2]/text()').extract()[0])
技术类

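From now on, when you need to extract data, you can test your expressions in the Scrapy shell first and move them into your code once they work.

The Scrapy shell can do more than this, but that is outside the focus of this course, so it is not covered in detail here.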

Official documentation: http://scrapy-chs.readthedocs...Spider
