What Commands Does the New Scrapy Actually Have? | Python Theme Month

This article is taking part in the "Python Theme Month" event; see the event link for details.

A-Chen is still a Scrapy beginner, so if anything here is wrong, I hope the experts won't hold back: leave a comment below and let me know. Much appreciated!

At the time of writing, the latest Scrapy release is 2.5.0.

Alright, enough chatter. Let's get started!

Command-line help

Any command-line tool generally ships with built-in usage help; it's practically an industry default, and Scrapy is naturally no exception.

$ scrapy -h
Scrapy 2.5.0 - project: scrapybot

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

startproject

Creates a new Scrapy project, automatically generating the standard project structure.

Command format

$ scrapy startproject <project_name> [project_dir] # project_dir is optional; if omitted, a folder named after the project is created as the project directory

Command example

$ scrapy startproject example
New Scrapy project 'example', using template directory 'd:\devtools\python\python39\lib\site-packages\scrapy\templates\project', created in:
    D:\WorkSpace\Personal\my-scrapy\example

You can start your first spider with:
    cd example
    scrapy genspider example example.com

$ ls
example  readme.md  venv

The generated project structure looks roughly as follows.

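This layout comes from Scrapy 2.5's default project template (annotations are mine):

example/
├── scrapy.cfg            # deploy/config file
└── example/              # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider and downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # your spiders go here
        └── __init__.py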

genspider

Generates a new spider from a predefined template. This is extremely useful and can speed up spider development considerably, provided you have a good set of predefined templates.

Command format

$ scrapy genspider [-t template_name] <spider_name> <domain> # the template name is optional; the default template is used if omitted

Command example

$ cd example/example/spiders/
$ scrapy genspider exampleSpider example.spider.com
Created spider 'exampleSpider' using template 'basic' in module:
  example.spiders.exampleSpider

The newly created spider looks as follows.

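A sketch of what Scrapy 2.5's 'basic' template generates (the exact class name Scrapy derives may differ):

import scrapy


class ExamplespiderSpider(scrapy.Spider):
    name = 'exampleSpider'
    allowed_domains = ['example.spider.com']
    start_urls = ['http://example.spider.com/']

    def parse(self, response):
        # fill in your parsing logic here
        pass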

crawl

Runs a spider. It looks a bit like runspider, but this command requires the spider being run to live inside a Scrapy-recognized project structure.

Command format

$ scrapy crawl <spider_name> # the name of the spider we created above with genspider

Command example

$ scrapy crawl exampleSpider
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:50:59 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:50:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-07-25 01:50:59 [scrapy.crawler] INFO: Overridden settings:
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 51, 0, 206683)}
2021-07-25 01:51:02 [scrapy.core.engine] INFO: Spider closed (finished)
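
Incidentally, a project spider can also be launched from Python instead of the CLI. A minimal sketch using Scrapy's CrawlerProcess, run from the project directory:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py, just as `scrapy crawl` does
process = CrawlerProcess(get_project_settings())
process.crawl('exampleSpider')  # the spider's `name` attribute
process.start()                 # blocks until the crawl finishes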

runspider

This command also runs a spider, but it can execute a standalone spider file, i.e. a single Spider outside any project.

Command format

$ scrapy runspider <spider_file.py>

Command example

$ scrapy runspider exampleSpider.py
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:54:24 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 01:54:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 54, 24, 908097)}
2021-07-25 01:54:31 [scrapy.core.engine] INFO: Spider closed (finished)
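
All the file needs is a Spider subclass. A minimal self-contained sketch (URL and selector are illustrative):

# exampleSpider.py: run with `scrapy runspider exampleSpider.py`
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # yield one item containing the page title
        yield {'title': response.css('title::text').get()}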

bench

Benchmark test. It runs a simple example spider as fast as it can against a locally generated dummy site, which gives a rough measure of Scrapy's raw crawl throughput (pages per minute) on your hardware.

If I've misread what it's for, I'd welcome a comment from the experts setting me straight!

Command example

$ scrapy bench
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: example)
2021-07-25 01:58:17 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 01:58:19 [scrapy.extensions.logstats] INFO: Crawled 90 pages (at 5400 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:20 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 6360 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:21 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 5340 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:22 [scrapy.extensions.logstats] INFO: Crawled 369 pages (at 5040 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:23 [scrapy.extensions.logstats] INFO: Crawled 433 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:24 [scrapy.extensions.logstats] INFO: Crawled 513 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:25 [scrapy.extensions.logstats] INFO: Crawled 593 pages (at 4800 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:26 [scrapy.extensions.logstats] INFO: Crawled 657 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:27 [scrapy.extensions.logstats] INFO: Crawled 721 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2021-07-25 01:58:28 [scrapy.extensions.logstats] INFO: Crawled 785 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
...
 'start_time': datetime.datetime(2021, 7, 24, 17, 58, 18, 691354)}
2021-07-25 01:58:29 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

check

Checks spider contracts, a bit like a static check for spider code: it lets you catch mistakes in a spider before launching a full crawl.

Command format

$ scrapy check [-l] <spider>

Command example

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
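
The contracts being checked are the @-annotations in a callback's docstring. A minimal sketch (spider name, URL, and scraped field are illustrative):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse(self, response):
        """`scrapy check` fetches @url and verifies the assertions below.

        @url https://example.com
        @returns items 1 1
        @returns requests 0 0
        @scrapes title
        """
        yield {'title': response.css('title::text').get()}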

list

Lists all spiders available in the current project.

Command format

$ scrapy list

Command example

$ scrapy list
hotList

edit

Edits a spider. Fine for a quick temporary change: it opens an editor on the spider's code.

Command format

$ scrapy edit <spider>

Command example

$ scrapy edit hotList
'%s' 不是內部或外部命令,也不是可運行的程序
或批處理文件。
# The Windows message above reads "'%s' is not recognized as an internal or external
# command, operable program or batch file": my machine apparently has no default
# editor configured, hence the error.
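
Per the official docs, Scrapy picks the editor from the EDITOR environment variable (or, failing that, the EDITOR setting), so configuring one fixes the error above. For example:

$ set EDITOR=notepad     # Windows cmd
$ export EDITOR=vim      # bash / zsh
$ scrapy edit hotList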

fetch

Fetches a page using the Scrapy downloader.

Command format

$ scrapy fetch <url>
# Supported options
--spider=SPIDER: fetch the page with the given spider, useful for confirming the spider takes effect
--headers: print the response headers, not the response body
--no-redirect: do not follow redirects

Command example

$ scrapy fetch https://www.baidu.com
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:16:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
...
2021-07-25 02:16:17 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<html><!--STATUS OK--><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="Cache-control" content="no-cache" /><meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no"/><style type="text/css">body {margin: 0;text-align: center;font-size: 14px;font-family: Arial,Helvetica,LiHei Pro Medium;color: #262626;}form {position: relative;margin: 12px 15px 91px;height: 41px;}img {border: 0}.wordWrap{margin-right: 85px;}#word {background-color: #FFF;border: 1px solid #6E6E6E;color: #000;font-size: 18px;height: 27px;padding: 6px;width: 100%;-webkit-appearance: none;-webkit-border-radius: 0;border-radius: 0;}.bn {background-color: #F5F5F5;border: 1px solid #787878;font-size: 16px;
# prints the HTML source of the Baidu homepage
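
When you only want to check reachability or redirects, the --headers flag listed above skips the body. For example:

$ scrapy fetch --headers https://www.baidu.com
# prints the request and response headers instead of the body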

view

Has Scrapy download a page and then open it in the browser, so you see the page the way Scrapy sees it.

The official docs mention that a spider and an ordinary user sometimes see different versions of a page, so this is a way to confirm whether a page can actually be crawled.

Command format

$ scrapy view <url>
# Supported options
--spider=SPIDER: fetch the page with the given spider, useful for confirming the spider takes effect
--no-redirect: do not follow redirects

Command example

$ scrapy view https://www.baidu.com
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: csdnHot)
2021-07-25 02:20:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19043-SP0
2021-07-25 02:20:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
# ...

It downloaded the page to a local file, and it really doesn't look quite like the Baidu we normally see!

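The same check is available programmatically, which is handy inside a spider callback or a shell session. A minimal sketch using scrapy.utils.response.open_in_browser:

# inside a spider callback (or a `scrapy shell` session):
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # dumps the response to a temp file and opens it in the
    # default browser, the same thing `scrapy view` does
    open_in_browser(response)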

shell

Another command you'll use constantly while developing spiders. The difference from fetch is that you get an interactive console (or can pass a snippet of code with -c) to test parsing a page.

Command format

$ scrapy shell [url]

Command example

$ scrapy shell --nolog -c '(response.status, response.url)' https://blog.csdn.net/phoenix/web/blog/hotRank?page=0&pageSize=25
(200, 'https://blog.csdn.net/phoenix/web/blog/hotRank?page=0')
# Note: the unquoted & is consumed by the shell, which is why pageSize is
# missing from the fetched URL; quote the URL to keep the full query string.
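
Without -c you land in an interactive console with the response already fetched. A typical session might look like this (output illustrative):

$ scrapy shell https://example.com
>>> response.status
200
>>> response.css('title::text').get()
'Example Domain'
>>> fetch('https://example.com/')  # fetch a different URL in the same session
>>> view(response)                 # open the current response in a browser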

parse

Fetches the given URL and parses it with the spider that handles it; commonly used to test whether your parsing code works.

Command format

$ scrapy parse <url> [options]
# Supported options
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as callback for parsing the response
--meta or -m: additional request meta passed to the callback request; must be a valid JSON string, e.g. --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments passed to the callback; must be a valid JSON string, e.g. --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file

Command example

$ scrapy parse https://blog.csdn.net/rank/list --spider=hotList
...
2021-07-25 02:27:06 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 0 <<<
# Scraped Items ------------------------------------------------------------
[]

# Requests -----------------------------------------------------------------
[]
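
Combining a callback with a crawl depth is a common way to exercise one specific parsing method. For instance (the parse_item method name is illustrative):

$ scrapy parse --spider=hotList -c parse_item -d 2 https://blog.csdn.net/rank/list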

settings

Reads out the crawler's settings values.

Command format

$ scrapy settings [options]

Command example

$ scrapy settings --get BOT_NAME
csdnHot
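
Besides --get, the command also offers typed getters (--getbool, --getint, --getfloat, --getlist) for settings that aren't plain strings. For example, with default settings:

$ scrapy settings --getbool HTTPCACHE_ENABLED
False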
