2. Python Scrapy shell study notes

Date: 2020-07-07

Tags: Python, Scrapy, shell, study notes


Scrapy shell

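The Scrapy shell is an interactive shell in which you can try out and debug your scraping code very quickly, without having to run the spider. It is meant for testing data extraction code, but because it is also a regular Python shell you can use it to test any kind of code.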

Configuration

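From the official docs: if IPython is installed, the Scrapy shell uses it instead of the standard Python console. The IPython console is more powerful and provides smart auto-completion and colorized output, among other things.

We strongly recommend installing IPython, especially if you work on a Unix system (where IPython excels). See the IPython installation guide for details.

Scrapy also supports bpython, and will try to use it where IPython is unavailable.

To choose the shell explicitly, set it in your project's scrapy.cfg, for example for ipython:

Using the E:\pythoncode project from the first set of notes, add: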

[settings]
shell = ipython
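
To double-check what the file says, the setting can be read back with Python's standard configparser module. This is only a sketch: the file content is inlined as a string rather than read from a real scrapy.cfg, and the helper name configured_shell is made up for illustration (it is not part of Scrapy).

```python
import configparser

# Inline stand-in for the [settings] section of scrapy.cfg (assumed content).
SCRAPY_CFG = """
[settings]
shell = ipython
"""


def configured_shell(cfg_text, default="python"):
    """Return the shell named in the [settings] section, or a default."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback covers both a missing section and a missing option
    return parser.get("settings", "shell", fallback=default)


print(configured_shell(SCRAPY_CFG))
```

With a real project, the same parser could read the file itself via parser.read("scrapy.cfg").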

Launching the shell

From the command line:
scrapy shell <url>

Scrapy can also fetch local files:

scrapy shell X:///XXX/XXX/XXX/XXX.html
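
A file URI for a local page can be built with pathlib instead of being typed by hand. A minimal sketch; both paths below are hypothetical examples and do not refer to files from these notes.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical absolute paths, used only for illustration.
posix_page = PurePosixPath("/tmp/sample/page.html")
windows_page = PureWindowsPath("E:/pythoncode/page.html")

# as_uri() requires an absolute path and returns a file:// URI.
posix_uri = posix_page.as_uri()
windows_uri = windows_page.as_uri()

# Print the command lines one could then run in a terminal.
print(f"scrapy shell {posix_uri}")
print(f"scrapy shell {windows_uri}")
```

The pure path classes are used here so the sketch behaves the same on any operating system.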

Usage

The Scrapy shell is just a regular Python console (or an IPython console, if you have it available) that provides a few extra shortcut functions for convenience.

Available shortcuts

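shelp(): print help listing the available objects and shortcuts.

fetch(url[, redirect=True]): fetch the given URL and update the local objects (redirects are followed by default).

fetch(request): fetch the given scrapy.Request and update the local objects.

view(response): open the given response in a browser.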

Available Scrapy objects

The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and the Selector object.

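crawler - the current Crawler object.
spider - the Spider known to handle the URL, or a generic Spider object if no spider was found for the current URL.
request - the Request object of the last fetched page. You can modify it with replace(), or fetch a new request (without leaving the shell) with the fetch shortcut.
response - the Response object containing the last fetched page.
settings - the current Scrapy settings.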

Example of a shell session

First go to E:\pythoncode, then launch the shell:

scrapy shell "https://www.baidu.com" --nolog

The shell prints the available objects and shortcuts:

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000000000B27390>
[s] item {}
[s] request <GET https://www.baidu.com>
[s] settings <scrapy.settings.Settings object at 0x0000000004BA03C8>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser

Next, enter:
response.css("div.celltop a b::text").extract_first()
'Information'

fetch("http://www.guoxuedashi.com/")
Note: the URL must include its scheme prefix (http:// or https://).
Note: if scrapy shell was started without --nolog, a log line such as DEBUG: Crawled (200)XXXXXXXXXXXXXXXXXX is printed as well.
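
Since fetch() needs an explicit scheme, a small helper can add one before the call. This is a sketch only: ensure_scheme is a hypothetical helper, not part of Scrapy, and defaulting to https is an assumption.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="https"):
    """Return url unchanged if it has a known scheme, else prepend one."""
    if urlparse(url).scheme in ("http", "https", "file"):
        return url
    # No recognised scheme: assume the caller meant a web address.
    return f"{default}://{url}"


print(ensure_scheme("www.guoxuedashi.com/"))
print(ensure_scheme("http://www.guoxuedashi.com/"))
```

Inside the shell one could then write fetch(ensure_scheme(url)) for URLs copied without their prefix.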

response.css("a[target=_blank]::text").extract_first()
'四庫全書'

request = request.replace(method="POST")

fetch(request)

Note: "POST", "GET", "PUT", "HEAD" and so on are HTTP request methods (GET is the usual one; POST is used here only as an example).

response.status
200

Note: 200 is the HTTP status code of the response.

from pprint import pprint

pprint(response.headers)

Note: pprint is a "pretty" version of print.
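
The difference from plain print shows up once a mapping is wider than one line. A minimal sketch with an ordinary dict standing in for response.headers; the header values are made up, and bytes are used because Scrapy stores header names and values as bytes.

```python
from pprint import pformat, pprint

# Made-up headers; a plain dict stands in for response.headers.
headers = {
    b"Server": [b"nginx"],
    b"Content-Type": [b"text/html"],
    b"Connection": [b"keep-alive"],
    b"Cache-Control": [b"no-cache"],
}

print(headers)   # everything on one long line
pprint(headers)  # one key per line, keys sorted

# pformat returns the same text that pprint would print.
formatted = pformat(headers)
```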

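Invoking the shell from a spider

Sometimes you want to inspect the response being processed at a particular point in your spider, if only to check that the response you expect is actually getting there.

This can be done with the scrapy.shell.inspect_response function.

Create the spider in E:\pythoncode\myproject\spiders: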

import scrapy


class MySpider(scrapy.Spider):
    name = "scrapy_sh"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # Open the Scrapy shell only for the .org response
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

Note: when the spider reaches that point, the shell opens.
response.url
'http://example.org'

response.css("p::text").extract()
["This domain is established to be used for illustrative examples in doc..........."]

view(response)
True

Note: press Ctrl+D (or Ctrl+Z on Windows) to exit the shell.

Source: https://docs.scrapy.org/en/latest/topics/shell.html