The official Scrapy documentation says nothing about unit testing; it only mentions a Contracts feature. But trust me, Contracts are genuinely painful to use, to the point that one of Scrapy's authors said in an issue that he would like the feature removed.
So how should a Scrapy project be tested?
First, let's be clear about what we actually want to test:
- We are not testing whether the spider can reach the site! That should already be verified while you are writing the spider, and if your code suddenly cannot reach the site in production, you should catch it with an error-monitoring service such as Sentry.
- We want to test whether parse() and the parse_xx() methods yield the items and requests we expect (see the sketch after this list).
- We want to test whether the fields of the items returned by parse() have the correct types, especially when you use Scrapy's processor system.
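To make the first point concrete, here is a minimal offline sketch (not from the original project; the canned HTML and test name are made up) of feeding parse() a hand-built HtmlResponse so that no network access is involved:

```python
# tests/test_parse_offline.py -- a sketch; the HTML fixture below is made up
from scrapy.http import HtmlResponse, Request

from src.spider import IpData5uSpider, IPItem


def test_parse_offline():
    # A tiny hand-written page shaped like the markup parse() expects.
    html = b'<div class="wlist"><ul class="l2"><span><li>1.2.3.4</li></span></ul></div>'
    response = HtmlResponse(url='http://www.data5u.com/free/index.shtml',
                            body=html, encoding='utf-8')
    results = list(IpData5uSpider().parse(response))
    # Everything parse() yields must be either an item or a follow-up request.
    assert all(isinstance(r, (IPItem, Request)) for r in results)
```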
Unit testing with betamax
For an introduction to betamax, see my earlier blog post about it.
What we are actually writing is not just a unit test[1] but also an integration test[2]. We do not want to repeat real HTTP requests on every run, and we do not want to write verbose mocks either.
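If you have not used betamax before, the core idea fits in a few lines: wrap a requests session in a recorder; the first run records every HTTP exchange to a "cassette" file, and later runs replay it instead of hitting the network. A minimal sketch (the cassette directory and name are made up):

```python
import requests
from betamax import Betamax

session = requests.Session()
recorder = Betamax(session, cassette_library_dir='tests/fixture/cassettes')

# First run: performs a real GET and records it to example.json.
# Later runs: replays the recorded response, no network access needed.
with recorder.use_cassette('example'):
    resp = session.get('http://www.data5u.com/free/index.shtml')
```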
The spider code
Below is our spider. It crawls an IP-proxy site and collects the most recently published proxies:
```python
# src/spider.py
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join


class IPItem(scrapy.Item):
    ip = scrapy.Field(
        input_processor=MapCompose(str, str.strip),
        output_processor=TakeFirst()
    )
    port = scrapy.Field(
        input_processor=MapCompose(str, str.strip),
        output_processor=TakeFirst()
    )
    protocol = scrapy.Field(
        input_processor=MapCompose(str, str.strip, str.lower),
        output_processor=TakeFirst()
    )
    remark = scrapy.Field(
        input_processor=MapCompose(str, str.strip),
        output_processor=Join(separator=', ')
    )
    source = scrapy.Field(
        input_processor=MapCompose(str, str.strip),
        output_processor=TakeFirst()
    )


class IpData5uSpider(scrapy.Spider):
    name = 'ip-data5u'
    allowed_domains = ['data5u.com']
    start_urls = [
        'http://www.data5u.com/free/index.shtml',
        'http://www.data5u.com/free/gngn/index.shtml',
    ]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'DOWNLOAD_DELAY': 1
    }

    def parse(self, response):
        for row in response.css('div.wlist ul.l2'):
            loader = ItemLoader(item=IPItem(), selector=row)
            loader.add_value('source', 'data5u')
            loader.add_css('ip', 'span:nth-child(1) li::text')
            loader.add_css('port', 'span:nth-child(2) li::text')
            loader.add_css('protocol', 'span:nth-child(4) li::text')
            loader.add_css('remark', 'span:nth-child(5) li::text')
            yield loader.load_item()
```
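As an aside, the input/output processors above are easy to check in isolation, which is exactly why the second test goal matters. A quick illustration (the values are made up) of what MapCompose, TakeFirst, and Join do to loaded values:

```python
from scrapy.loader import ItemLoader

from src.spider import IPItem

loader = ItemLoader(item=IPItem())
loader.add_value('ip', ['  1.2.3.4 ', ' 5.6.7.8 '])  # MapCompose strips each value
loader.add_value('protocol', [' HTTP '])             # ...and lowercases the protocol
loader.add_value('remark', ['fast', 'anonymous'])    # Join concatenates all remarks
item = loader.load_item()

assert item['ip'] == '1.2.3.4'          # TakeFirst keeps only the first value
assert item['protocol'] == 'http'
assert item['remark'] == 'fast, anonymous'
```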
The test code
We write the project's tests with pytest. First, a few fixture functions:
```python
# tests/conftest.py
import pathlib

import pytest
from scrapy.http import HtmlResponse, Request

import betamax
from betamax.fixtures.pytest import _betamax_recorder

# betamax configuration: set where betamax stores its cassettes
cassette_dir = pathlib.Path(__file__).parent / 'fixture' / 'cassettes'
cassette_dir.mkdir(parents=True, exist_ok=True)
with betamax.Betamax.configure() as config:
    config.cassette_library_dir = str(cassette_dir.resolve())
    config.preserve_exact_body_bytes = True


@pytest.fixture
def betamax_recorder(request):
    """Override the default betamax pytest fixture so that it plays nicely
    with the pytest.mark.parametrize decorator and records a separate
    cassette per parameter set. This may come in handy in some tests.
    """
    return _betamax_recorder(request, parametrized=True)


@pytest.fixture
def resource_get(betamax_session):
    """A pytest fixture that returns an HTTP request function, roughly equivalent to:

    with Betamax(session) as vcr:
        vcr.use_cassette('<qualname of the test function>')
        resp = session.get(url, *args, **kwargs)
        # wrap the requests Response into a scrapy HtmlResponse
        return HtmlResponse(body=resp.content)
    """
    def get(url, *args, **kwargs):
        request = kwargs.pop('request', None)
        resp = betamax_session.get(url, *args, **kwargs)
        response = HtmlResponse(body=resp.content, url=url, request=request)
        return response

    return get
```
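The patched betamax_recorder fixture is useful when a test is parametrized: each parameter set then records to its own cassette instead of overwriting a shared one. A hypothetical usage sketch (the test name and URL paths are illustrative):

```python
import pytest


@pytest.mark.parametrize('path', ['free/index.shtml', 'free/gngn/index.shtml'])
def test_listing_pages(betamax_recorder, path):
    # Thanks to parametrized=True, each value of `path` gets its own cassette.
    resp = betamax_recorder.session.get(f'http://www.data5u.com/{path}')
    assert resp.status_code == 200
```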
Then the test function itself:
```python
# tests/test_spider/test_ip_spider.py
from scrapy.http import Request

from src.spider import IpData5uSpider, IPItem


def test_proxy_data5u_spider(resource_get):
    spider = IpData5uSpider()
    headers = {
        'user-agent': spider.custom_settings['USER_AGENT']
    }

    for url in spider.start_urls:
        req = Request(url, headers=headers)
        response = resource_get(url, headers=headers, request=req)

        result = spider.parse(response)
        for item in result:
            if isinstance(item, IPItem):
                assert isinstance(item['port'], str)
                assert item['ip']
                assert item['protocol'] in ('http', 'https')
            elif isinstance(item, Request):
                assert item.url.startswith(req.url)
            else:
                raise ValueError('parse() yielded an unexpected item')
```
Now let's run it:
```shell
>>> pytest
...
Results (2.12s):
       1 passed
```
A new file now appears under the fixture directory, with a name like xxx.tests.test_spiders.test_ip_spider.test_proxy_data5u_spider.json.
Run it again:
```shell
>>> pytest
...
Results (0.56s):
       1 passed
```
The run is noticeably faster, because this time betamax replays the responses saved under the fixture directory instead of performing real HTTP requests.
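Whether betamax records or replays is controlled by the cassette's record mode. If you ever need to refresh the saved responses, you can tune it in the configure() block; a sketch, assuming betamax's standard record modes:

```python
import betamax

with betamax.Betamax.configure() as config:
    # 'once' (default): record if the cassette is missing, replay otherwise.
    # 'all': always hit the network and re-record. 'none': never record.
    config.default_cassette_options['record_mode'] = 'once'
```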
We can also take a look at the contents of the JSON cassette in the fixture directory:
{"http_interactions": [{"request": {"body": {"encoding": "utf-8", "base64_string": ""}, "headers": {"user-agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"], "Accept-Encoding": ["gzip, deflate"], "Accept": ["*/*"], "Connection": ["keep-alive"]}, "method": "GET", "uri": "http://www.data5u.com/free/index.shtml"}, "response": {"body": {"encoding": "UTF-8", "base64_string": "H4sIAAAAAAx..."}]}
As you can see, the cassette stores the complete response. Reconstructing a requests.Response from it is not hard at all, and that is exactly how betamax works.
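To see why this is plausible, here is a rough sketch (illustrative only, not betamax's actual deserialization code) of turning the cassette above back into a usable HtmlResponse. Note that the "H4sI..." prefix suggests the recorded body is gzip-compressed:

```python
import base64
import gzip
import json

from scrapy.http import HtmlResponse


def response_from_cassette(path):
    """Rebuild a scrapy HtmlResponse from the first interaction in a cassette."""
    with open(path) as f:
        cassette = json.load(f)
    interaction = cassette['http_interactions'][0]
    raw = base64.b64decode(interaction['response']['body']['base64_string'])
    body = gzip.decompress(raw)  # the recorded body above is gzip-compressed
    return HtmlResponse(url=interaction['request']['uri'], body=body,
                        encoding=interaction['response']['body']['encoding'])
```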