Scrapy achieves concurrent crawling internally through an event-loop mechanism.
Before (blocking, with requests):
import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']
for item in url_list:
    response = requests.get(item)
    print(response.text)
Now (non-blocking, with Twisted):
from twisted.web.client import getPage
from twisted.internet import reactor, defer

# Part 1: the "agent" starts taking tasks
def callback(contents):
    print(contents)

deferred_list = []
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

# Part 2: once every task is finished, stop
dlist = defer.DeferredList(deferred_list)

def all_done(arg):
    reactor.stop()

dlist.addBoth(all_done)

# Part 3: tell the agent to start working
reactor.run()
What is Twisted?
Non-blocking: no waiting; all requests go out together. When I open connections for request A, request B and request C, I do not wait for one connection to come back before starting the next; I send one and immediately send the next.
import socket

for ip in ['1.1.1.1', '1.1.1.2', '1.1.1.3']:
    sk = socket.socket()
    sk.setblocking(False)           # non-blocking mode
    try:
        sk.connect((ip, 80))        # returns immediately instead of waiting
    except BlockingIOError:
        pass                        # expected for a non-blocking connect
Asynchronous: callbacks. As soon as I find the A, B, C that callback_A, callback_B and callback_C are waiting for, I actively notify them.
def callback(contents):
    print(contents)
Event loop: I keep looping over the three socket tasks (request A, request B, request C), checking two things about each: has the connection succeeded, and has the result come back.
How does it differ from requests?
requests is a Python module that imitates a browser to send HTTP requests
    - wraps a socket to send the request
twisted is an asynchronous, non-blocking network framework built on an event loop
    - wraps a socket to send the request
    - handles concurrent requests in a single thread
PS: three key terms (see the standard-library sketch below)
    - non-blocking: no waiting
    - asynchronous: callbacks
    - event loop: keep looping to check states
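To make the three terms concrete, here is a minimal sketch using only the standard library (plain sockets and select, not Twisted or Scrapy). The host list, buffer size and single recv are arbitrary simplifications for illustration; error handling is omitted.

import select
import socket

def callback(host, data):
    print(host, len(data), 'bytes')

hosts = ['www.bing.com', 'stackoverflow.com']
sockets = {}
for host in hosts:
    sk = socket.socket()
    sk.setblocking(False)              # non-blocking: don't wait for connect
    try:
        sk.connect((host, 80))
    except BlockingIOError:
        pass                           # expected for a non-blocking connect
    sockets[sk] = host

pending_send = set(sockets)
while sockets:                         # event loop: keep checking socket states
    readable, writable, _ = select.select(list(sockets), list(pending_send), [], 1)
    for sk in writable:                # connected: send the request once
        req = 'GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % sockets[sk]
        sk.send(req.encode())
        pending_send.discard(sk)
    for sk in readable:                # a response arrived: fire the callback
        data = sk.recv(8096)           # sketch: reads only the first chunk
        callback(sockets[sk], data)
        sk.close()
        del sockets[sk]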
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, such as data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has broad uses, including data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:
Scrapy mainly consists of the following components:
Scrapy Engine: handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler: accepts requests sent over by the engine, pushes them into a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
Downloader: downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
Spiders: do the main work, extracting the information you need from specific pages, i.e. the so-called items. You can also extract links from those pages so that Scrapy keeps crawling the next page.
Item Pipeline: processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and cleaning out unneeded data. After a page is parsed by a spider, its items are sent to the item pipeline and pass through several processors in a fixed order.
Downloader Middlewares: a hook framework sitting between the engine and the downloader; it mainly processes the requests and responses passing between them.
Spider Middlewares: a hook framework sitting between the engine and the spiders; its main job is processing the spiders' response input and request output.
Scheduler Middlewares: middleware between the engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.
The Scrapy run flow is roughly as follows: the engine takes the start requests from the spider and puts them into the scheduler; the engine asks the scheduler for the next request and sends it through the downloader middlewares to the downloader; the downloader fetches the page and returns a response back through the same middlewares; the engine passes the response through the spider middlewares to the spider's callback; the spider yields items and new requests, the items go to the item pipeline and the requests go back to the scheduler; the loop repeats until the scheduler is empty.
Basic commands
# Create a project (this builds a project skeleton in the current directory, similar to Django)
scrapy startproject <project_name>

# Create a spider
cd <project_name>
scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic oldboy oldboy.com
scrapy genspider -t crawl weisuen sohu.com
# PS:
#   list the available spider templates:  scrapy genspider -l
#   show a template's content:            scrapy genspider -d <template_name>

# List the spiders in the project
scrapy list

# Run a spider
scrapy crawl <spider_name>
scrapy crawl quotes
scrapy runspider quote                         # run a spider file directly
scrapy crawl lagou -s JOBDIR=job_info/001      # pause and resume
scrapy crawl quotes -o quotes.json             # save the output to a file

# Test selectors in the interactive shell
scrapy shell 'http://scrapy.org' --nolog
Project structure
project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            spider3.py
File descriptions: scrapy.cfg is the project's configuration entry point; items.py defines the structured data models; pipelines.py handles item processing and persistence; settings.py holds the project settings; the spiders/ directory contains the spider files.
Note: spider files are usually named after the domain of the site they crawl.
1. start_urls
How it works internally
The Scrapy engine pulls the start URLs from the spider:
1. Call start_requests and take its return value
2. v = iter(return value)
3. req1 = v.__next__()
   req2 = v.__next__()
   req3 = v.__next__()
   ...
4. Put all of the requests into the scheduler
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        # Option 1: yield each request (a generator)
        for url in self.start_urls:
            yield Request(url=url)

        # Option 2: return a list of requests
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list

# Customization: the start URLs can also be fetched from redis
2. Responses
# The response object wraps everything related to the reply:
- response.text
- response.encoding
- response.body
- response.meta['depth']   # crawl depth
- response.request         # the request that produced this response; a request wraps the URL to visit and the callback to run once the download finishes
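A minimal sketch of a hypothetical spider callback that reads the attributes listed above (nothing here is project-specific):

def parse(self, response):
    # basic response data
    print(response.url, response.encoding, response.meta.get('depth', 0))
    print(len(response.body), 'bytes')
    # response.request is the Request object that produced this response
    print(response.request.url, response.request.callback)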
3. Selectors
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <ul>
        <li class="item-"><a id='i1' href="link.html">first item</a></li>
        <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
        <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(text=html).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)
response.css('...')                  returns a SelectorList of selector objects
response.css('...').extract()        returns a list of strings
response.css('...').extract_first()  returns the first element of that list, or None if there is no match (see the sketch below)
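A small standalone sketch of the difference between extract() and extract_first(), using a made-up HTML snippet:

from scrapy.http import HtmlResponse

html = '<ul><li><a href="a.html">A</a></li><li><a href="b.html">B</a></li></ul>'
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

print(response.css('li a::attr(href)').extract())         # ['a.html', 'b.html']
print(response.css('li a::attr(href)').extract_first())   # 'a.html'
print(response.css('li img::attr(src)').extract_first())  # None (no match, no exception)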
def parse_detail(self, response):
    # items = JobboleArticleItem()
    # title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
    # create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·', '').strip()
    # praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
    # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
    # try:
    #     if re.match('.*?(\d+).*', fav_nums).group(1):
    #         fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    #     else:
    #         fav_nums = 0
    # except:
    #     fav_nums = 0
    # comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
    # try:
    #     if re.match('.*?(\d+).*', comment_nums).group(1):
    #         comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
    #     else:
    #         comment_nums = 0
    # except:
    #     comment_nums = 0
    # contente = response.xpath('//div[@class="entry"]').extract()[0]
    # tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]
    # tags = ",".join(tag_list)
    # items['title'] = title
    # try:
    #     create_date = datetime.datetime.strptime(create_date, '%Y/%m/%d').date()
    # except:
    #     create_date = datetime.datetime.now()
    # items['date'] = create_date
    # items['url'] = response.url
    # items['url_object_id'] = get_md5(response.url)
    # items['img_url'] = [img_url]
    # items['praise_nums'] = praise_nums
    # items['fav_nums'] = fav_nums
    # items['comment_nums'] = comment_nums
    # items['content'] = contente
    # items['tags'] = tags
    # title = response.css('.entry-header h1::text')[0].extract()
    # create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·', '').strip()
    # praise_nums = int(response.css(".vote-post-up h10::text").extract_first())
    # fav_nums = response.css(".bookmark-btn::text").extract_first()
    # if re.match('.*?(\d+).*', fav_nums).group(1):
    #     fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
    # else:
    #     fav_nums = 0
    # comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
    # if re.match('.*?(\d+).*', comment_nums).group(1):
    #     comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
    # else:
    #     comment_nums = 0
    # content = response.css('.entry').extract()[0]
    # tag_list = response.css('p.entry-meta-hide-on-mobile a::text')
    # tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]
    # tags = ",".join(tag_list)
    # the xpath equivalents of ::attr(href) and ::text are /@href and /text()
def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
    item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value('url', response.url)
    item_loader.add_value('url_object_id', get_md5(response.url))
    item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
    item_loader.add_value("img_url", [img_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    item_loader.add_css("content", "div.entry")
    items = item_loader.load_item()
    yield items
4. Issuing further requests
yield Request(url='xxxx',callback=self.parse)
yield Request(url=parse.urljoin(response.url,post_url), meta={'img_url':img_url}, callback=self.parse_detail)
5. Carrying cookies
Option 1: extract the cookies yourself and pass them along
cookie_dict = {}
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)
# Walk the CookieJar object and copy the cookies into a plain dict
for k, v in cookie_jar._cookies.items():
    for i, j in v.items():
        for m, n in j.items():
            cookie_dict[m] = n.value
yield Request(
    url='https://dig.chouti.com/login',
    method='POST',
    body='phone=8615735177116&password=zyf123&oneMonth=1',
    headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
    # cookies=cookie_obj._cookies,
    cookies=self.cookies_dict,
    callback=self.check_login,
)
Option 2: the cookiejar meta key
yield Request(url=url, callback=self.login, meta={'cookiejar': True})
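A sketch of option 2, assuming the built-in CookiesMiddleware is enabled (COOKIES_ENABLED = True). Requests that share the same 'cookiejar' value share one cookie session, and the middleware stores and resends the cookies for you; the spider name and the placeholder credentials here are hypothetical.

from scrapy import Request, Spider

class LoginSpider(Spider):              # hypothetical spider
    name = 'login_demo'
    start_urls = ['https://dig.chouti.com/']

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            # each distinct 'cookiejar' value is a separate cookie session
            yield Request(url=url, callback=self.login, meta={'cookiejar': i})

    def login(self, response):
        # pass the same key on follow-up requests so the session cookies follow along
        yield Request(url='https://dig.chouti.com/login', method='POST',
                      body='phone=...&password=...&oneMonth=1',
                      headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.check_login)

    def check_login(self, response):
        print(response.text)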
6. Passing values to a callback: meta
def parse(self, response):
    yield scrapy.Request(url=parse.urljoin(response.url, post_url),
                         meta={'img_url': img_url},
                         callback=self.parse_detail)

def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
from urllib.parse import urljoin

import scrapy
from scrapy import Request
from scrapy.http.cookies import CookieJar


class SpiderchoutiSpider(scrapy.Spider):
    name = 'choutilike'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookies_dict = {}

    def parse(self, response):
        # Grab the cookies from the response; they are stored on the CookieJar object
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, response.request)

        # Copy the cookies out of the CookieJar into a dict
        for k, v in cookie_obj._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookies_dict[m] = n.value
        # self.cookies_dict = cookie_obj._cookies

        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8615735177116&password=zyf123&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # cookies=cookie_obj._cookies,
            cookies=self.cookies_dict,
            callback=self.check_login,
        )

    def check_login(self, response):
        # print(response.text)
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/1',
            cookies=self.cookies_dict,
            callback=self.good,
        )

    def good(self, response):
        id_list = response.css('div.part2::attr(share-linkid)').extract()
        for id in id_list:
            url = 'https://dig.chouti.com/link/vote?linksId={}'.format(id)
            yield Request(
                url=url,
                method='POST',
                cookies=self.cookies_dict,
                callback=self.show,
            )
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin('https://dig.chouti.com/', page)
            yield Request(url=url, callback=self.good)

    def show(self, response):
        print(response.text)
1. Order of steps: define the Item class, register the pipeline in settings, then yield the Item object
import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'chouti.pipelines.ChoutiPipeline': 300,
    # 'chouti.pipelines.Chouti2Pipeline': 301,
}
yield the Item object (see the sketch below)
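A minimal sketch of the last step: a hypothetical spider that builds and yields ChoutiItem objects, which then flow through every pipeline registered in ITEM_PIPELINES. The CSS selectors mirror the ones used later in this post.

import scrapy
from ..items import ChoutiItem   # the Item defined above


class ChoutiItemSpider(scrapy.Spider):      # hypothetical minimal spider
    name = 'chouti_items'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        for new in response.css('.content-list .item'):
            # each yielded item is handed to the item pipeline
            yield ChoutiItem(
                title=new.css('.show-content::text').extract_first('').strip(),
                href=new.css('.show-content::attr(href)').extract_first(),
            )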
2. Writing a pipeline
Execution flow in the source code
1. Check whether the pipeline class (XdbPipeline here) defines from_crawler
       yes: obj = XdbPipeline.from_crawler(...)
       no:  obj = XdbPipeline()
2. obj.open_spider()
3. obj.process_item() is called once for every item
4. obj.close_spider()
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ChoutiPipeline(object):
    def __init__(self, conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at start-up to build the pipeline object.
        :param crawler:
        :return:
        """
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        self.conn = open(self.conn_str, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'spiderchouti':
            self.conn.write('{}\n{}\n'.format(item['title'], item['href']))
        # hand the item to the next pipeline
        return item
        # or drop the item so no later pipeline sees it
        # raise DropItem()

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        self.conn.close()
Note: pipelines are shared by all spiders in the project; if you want behavior specific to one spider, branch on the spider argument yourself (as process_item does above).
JSON files
import codecs
import json

from scrapy.exporters import JsonItemExporter


class JsonExporterPipeline(object):
    # use the JsonItemExporter that scrapy provides to export a json file
    def __init__(self):
        self.file = open('articleexpoter.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()    # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()   # stop exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class JsonWithEncodingPipeline(object):
    # export the json file by hand
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def spider_closed(self):
        self.file.close()
Storing images
# -*- coding: utf-8 -*-
from urllib.parse import urljoin

import scrapy

from ..items import XiaohuaItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['www.xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/list-1-{}.html'.format(i) for i in range(11)]

    def parse(self, response):
        items = response.css('.item_list .item')
        for item in items:
            url = item.css('.img img::attr(src)').extract()[0]
            url = urljoin('http://www.xiaohuar.com', url)
            title = item.css('.title span a::text').extract()[0]
            obj = XiaohuaItem(img_url=[url], title=title)
            yield obj
class XiaohuaItem(scrapy.Item):
    img_url = scrapy.Field()
    title = scrapy.Field()
    img_path = scrapy.Field()
class XiaohuaImagesPipeline(ImagesPipeline):
    # use the ImagesPipeline that scrapy provides to download images

    def item_completed(self, results, item, info):
        if "img_url" in item:
            for ok, value in results:
                print(ok, value)
                img_path = value['path']
            item['img_path'] = img_path
        return item

    def get_media_requests(self, item, info):
        # download the images
        if "img_url" in item:
            for img_url in item['img_url']:
                # meta carries the item and the index so file_path below can rename the file
                yield scrapy.Request(img_url, meta={'item': item, 'index': item['img_url'].index(img_url)})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']          # the item passed through meta above
        if "img_url" in item:
            index = request.meta['index']    # index of the image currently being downloaded
            # file name: item['title'] plus the original extension (jpg, png, ...) taken from the URL
            image_guid = item['title'] + '.' + request.url.split('/')[-1].split('.')[-1]
            filename = u'full/{0}'.format(image_guid)
            return filename
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
MySQL database
import pymysql
from twisted.enterprise import adbapi


class MysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect('localhost', 'root', '0000', 'crawed', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """insert into article(title,url,create_date,fav_nums) values (%s,%s,%s,%s)"""
        self.cursor.execute(insert_sql, (item['title'], item['url'], item['date'], item['fav_nums']))
        self.conn.commit()


class MysqlTwistePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to make the mysql insert asynchronous
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)   # handle exceptions

    def handle_error(self, failure):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        try:
            cursor.execute(insert_sql, params)
            print('insert succeeded')
        except Exception as e:
            print('insert failed')
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '0000'
MYSQL_DB = 'crawed'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
RANDOM_UA_TYPE = "random"
ES_HOST = "127.0.0.1"
Scrapy's default deduplication rule:
from scrapy.dupefilter import RFPDupeFilter
from __future__ import print_function
import os
import logging

from scrapy.utils.job import job_dir
from scrapy.utils.request import request_fingerprint


class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass


class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
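A small sketch of what RFPDupeFilter actually stores: request_fingerprint hashes the method, the canonicalized URL and the body, so two URLs that differ only in query-string order count as the same request.

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

fp1 = request_fingerprint(Request('http://www.example.com/?a=1&b=2'))
fp2 = request_fingerprint(Request('http://www.example.com/?b=2&a=1'))
print(fp1)
print(fp1 == fp2)   # True: the query-string order does not matter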
Custom deduplication rules
1. Write the class
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/8/31
@Author: Zhang Yafei
"""
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RepeatFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('open')

    def close(self, reason):  # can return a deferred
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
2. Configuration
# replace the default dedup rule
# DUPEFILTER_CLASS = "chouti.duplication.RepeatFilter"
DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"
3. Using it in a spider
from urllib.parse import urljoin

import scrapy
from scrapy.http import Request

from ..items import ChoutiItem


class SpiderchoutiSpider(scrapy.Spider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # print the URL of the page currently being parsed
        print(response.request.url)

        # news = response.css('.content-list .item')
        # for new in news:
        #     title = new.css('.show-content::text').extract()[0].strip()
        #     href = new.css('.show-content::attr(href)').extract()[0]
        #     item = ChoutiItem(title=title, href=href)
        #     yield item

        # collect every page-number link
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin(self.start_urls[0], page)
            # hand the new URL to the scheduler
            yield Request(url=url, callback=self.parse)
Note:
Downloader middleware
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request', request)

        # 1. Return a Response
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)

        # 2. Return a Request
        # return Request('https://dig.chouti.com/r/tec/hot/1')

        # 3. Raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest

        # 4. Modify the request (*)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
DOWNLOADER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware': 751,
    'xdb.md.Md1': 666,
    'xdb.md.Md2': 667,
}
Typical uses: setting the user-agent and setting a proxy (see the sketch below).
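A minimal sketch of the user-agent use case: a downloader middleware that picks a random User-Agent for every request. The USER_AGENTS list and the class name are made up for illustration; register it in DOWNLOADER_MIDDLEWARES like the classes above.

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36',
]


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # overwrite the header before the request reaches the downloader
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None   # let the request continue through the remaining middlewares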
Spider middleware
class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r
SPIDER_MIDDLEWARES = {
    # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
Typical uses: tracking depth and adjusting priority (see the sketch below).
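A sketch of the depth use case: a spider middleware that drops requests generated from responses deeper than a hypothetical MAX_DEPTH. The built-in DepthMiddleware does this properly via the DEPTH_LIMIT setting; this is only to show where such logic would live.

from scrapy.http import Request

MAX_DEPTH = 3   # hypothetical limit


class DepthFilterMiddleware(object):
    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request) and response.meta.get('depth', 0) >= MAX_DEPTH:
                continue            # too deep: silently drop the new request
            yield obj               # items and shallow requests pass through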
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called after the download finishes, before the response is handed to parse.
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider has finished processing and returns its results.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable of Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on exceptions.
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middlewares handle the exception, or an iterable
                 of Response or Item objects to hand to the scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called through every downloader middleware when a request needs to be downloaded.
        :param request:
        :param spider:
        :return: None to keep going and download the request;
                 a Response object stops process_request and starts process_response;
                 a Request object stops the middleware chain and puts the request back into the scheduler;
                 raising IgnoreRequest stops process_request and starts process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the download has produced a response.
        :param request:
        :param response:
        :param spider:
        :return: a Response object is handed to the other middlewares' process_response;
                 a Request object stops the middleware chain and reschedules the request for download;
                 raising IgnoreRequest calls Request.errback
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request() (downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return: None to let later middlewares handle the exception;
                 a Response object stops further process_exception methods;
                 a Request object stops the middleware chain and reschedules the request for download
        """
        return None
Setting a proxy
Option 1: set the proxy in os.environ before the spider starts.

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = 'http://19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

Option 2: pass the proxy through meta.

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})
import base64
import random
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class XdbProxyMiddleware(object):
    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)), encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds


class DdbProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
Running a single spider: main.py
from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(__file__))

# execute(['scrapy', 'crawl', 'spiderchouti', '--nolog'])
# os.system('scrapy crawl spiderchouti')
# os.system('scrapy crawl xiaohua')
os.system('scrapy crawl choutilike --nolog')
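An alternative sketch for running a spider from a script without os.system, using Scrapy's documented CrawlerProcess API (it must be run from inside the project so get_project_settings finds settings.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('choutilike')     # spider name, same as `scrapy crawl choutilike`
process.start()                 # blocks until the crawl finishes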
Running all spiders: a custom command
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/9/1
@Author: Zhang Yafei
"""
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        print(type(self.crawler_process))
        from scrapy.crawler import CrawlerProcess
        # 1. Run the CrawlerProcess constructor
        # 2. The CrawlerProcess object (which holds the settings) exposes .spiders
        #    2.1 create one Crawler per spider
        #    2.2 run d = Crawler.crawl(...)   # ************************ #
        #        d.addBoth(_done)
        #    2.3 CrawlerProcess object._active = {d,}
        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)   # self._stop_reactor ==> reactor.stop()
        #    reactor.run

        # find the names of all spiders in the project
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ['choutilike', 'xiaohua']   # or crawl only selected spiders
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
A signal is a hook point the framework reserves for you, so you can attach custom behavior at predefined moments.
Built-in signals
# engine started / stopped
engine_started = object()
engine_stopped = object()
# spider opened
spider_opened = object()
# spider is idle
spider_idle = object()
# spider closed
spider_closed = object()
# spider raised an error
spider_error = object()
# a request was put into the scheduler
request_scheduled = object()
# a request was dropped
request_dropped = object()
# a response was received
response_received = object()
# a response finished downloading
response_downloaded = object()
# an item was scraped
item_scraped = object()
# an item was dropped
item_dropped = object()
Custom extensions
from scrapy import signals


class MyExtend(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # hang our handlers on the hooks:
        # register callbacks for the chosen signals
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.engine_stopped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started start')

    def close(self):
        print('signals.engine_stopped close')
from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)
        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Configuration
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'chouti.extensions.MyExtend': 200,
}
""" This module contains the default values for all settings used by Scrapy. For more information about these settings you can read the settings documentation in docs/topics/settings.rst Scrapy developers, if you add a setting here remember to: * add it in alphabetical order * group similar settings without leaving blank lines * add its documentation to the available settings documentation (docs/topics/settings.rst) """ import sys from importlib import import_module from os.path import join, abspath, dirname import six AJAXCRAWL_ENABLED = False AUTOTHROTTLE_ENABLED = False AUTOTHROTTLE_DEBUG = False AUTOTHROTTLE_MAX_DELAY = 60.0 AUTOTHROTTLE_START_DELAY = 5.0 AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 BOT_NAME = 'scrapybot' CLOSESPIDER_TIMEOUT = 0 CLOSESPIDER_PAGECOUNT = 0 CLOSESPIDER_ITEMCOUNT = 0 CLOSESPIDER_ERRORCOUNT = 0 COMMANDS_MODULE = '' COMPRESSION_ENABLED = True CONCURRENT_ITEMS = 100 CONCURRENT_REQUESTS = 16 CONCURRENT_REQUESTS_PER_DOMAIN = 8 CONCURRENT_REQUESTS_PER_IP = 0 COOKIES_ENABLED = True COOKIES_DEBUG = False DEFAULT_ITEM_CLASS = 'scrapy.item.Item' DEFAULT_REQUEST_HEADERS = { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', } DEPTH_LIMIT = 0 DEPTH_STATS = True DEPTH_PRIORITY = 0 DNSCACHE_ENABLED = True DNSCACHE_SIZE = 10000 DNS_TIMEOUT = 60 DOWNLOAD_DELAY = 0 DOWNLOAD_HANDLERS = {} DOWNLOAD_HANDLERS_BASE = { 'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler', 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', 'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', 'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', 's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler', 'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler', } DOWNLOAD_TIMEOUT = 180 # 3mins DOWNLOAD_MAXSIZE = 1024*1024*1024 # 1024m DOWNLOAD_WARNSIZE = 32*1024*1024 # 32m DOWNLOAD_FAIL_ON_DATALOSS = True DOWNLOADER = 'scrapy.core.downloader.Downloader' DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory' DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory' DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform, # also allowing negotiation DOWNLOADER_MIDDLEWARES = {} DOWNLOADER_MIDDLEWARES_BASE = { # Engine side 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300, 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350, 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400, 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500, 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550, 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560, 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590, 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600, 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700, 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750, 'scrapy.downloadermiddlewares.stats.DownloaderStats': 850, 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900, # Downloader side } DOWNLOADER_STATS = True DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter' EDITOR = 'vi' if sys.platform == 'win32': EDITOR = '%s -m idlelib.idle' EXTENSIONS = {} EXTENSIONS_BASE = { 
'scrapy.extensions.corestats.CoreStats': 0, 'scrapy.extensions.telnet.TelnetConsole': 0, 'scrapy.extensions.memusage.MemoryUsage': 0, 'scrapy.extensions.memdebug.MemoryDebugger': 0, 'scrapy.extensions.closespider.CloseSpider': 0, 'scrapy.extensions.feedexport.FeedExporter': 0, 'scrapy.extensions.logstats.LogStats': 0, 'scrapy.extensions.spiderstate.SpiderState': 0, 'scrapy.extensions.throttle.AutoThrottle': 0, } FEED_TEMPDIR = None FEED_URI = None FEED_URI_PARAMS = None # a function to extend uri arguments FEED_FORMAT = 'jsonlines' FEED_STORE_EMPTY = False FEED_EXPORT_ENCODING = None FEED_EXPORT_FIELDS = None FEED_STORAGES = {} FEED_STORAGES_BASE = { '': 'scrapy.extensions.feedexport.FileFeedStorage', 'file': 'scrapy.extensions.feedexport.FileFeedStorage', 'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage', 's3': 'scrapy.extensions.feedexport.S3FeedStorage', 'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage', } FEED_EXPORTERS = {} FEED_EXPORTERS_BASE = { 'json': 'scrapy.exporters.JsonItemExporter', 'jsonlines': 'scrapy.exporters.JsonLinesItemExporter', 'jl': 'scrapy.exporters.JsonLinesItemExporter', 'csv': 'scrapy.exporters.CsvItemExporter', 'xml': 'scrapy.exporters.XmlItemExporter', 'marshal': 'scrapy.exporters.MarshalItemExporter', 'pickle': 'scrapy.exporters.PickleItemExporter', } FEED_EXPORT_INDENT = 0 FILES_STORE_S3_ACL = 'private' FTP_USER = 'anonymous' FTP_PASSWORD = 'guest' FTP_PASSIVE_MODE = True HTTPCACHE_ENABLED = False HTTPCACHE_DIR = 'httpcache' HTTPCACHE_IGNORE_MISSING = False HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' HTTPCACHE_EXPIRATION_SECS = 0 HTTPCACHE_ALWAYS_STORE = False HTTPCACHE_IGNORE_HTTP_CODES = [] HTTPCACHE_IGNORE_SCHEMES = ['file'] HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = [] HTTPCACHE_DBM_MODULE = 'anydbm' if six.PY2 else 'dbm' HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy' HTTPCACHE_GZIP = False HTTPPROXY_ENABLED = True HTTPPROXY_AUTH_ENCODING = 'latin-1' IMAGES_STORE_S3_ACL = 'private' ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager' ITEM_PIPELINES = {} ITEM_PIPELINES_BASE = {} LOG_ENABLED = True LOG_ENCODING = 'utf-8' LOG_FORMATTER = 'scrapy.logformatter.LogFormatter' LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s' LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S' LOG_STDOUT = False LOG_LEVEL = 'DEBUG' LOG_FILE = None LOG_SHORT_NAMES = False SCHEDULER_DEBUG = False LOGSTATS_INTERVAL = 60.0 MAIL_HOST = 'localhost' MAIL_PORT = 25 MAIL_FROM = 'scrapy@localhost' MAIL_PASS = None MAIL_USER = None MEMDEBUG_ENABLED = False # enable memory debugging MEMDEBUG_NOTIFY = [] # send memory debugging report by mail at engine shutdown MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0 MEMUSAGE_ENABLED = True MEMUSAGE_LIMIT_MB = 0 MEMUSAGE_NOTIFY_MAIL = [] MEMUSAGE_WARNING_MB = 0 METAREFRESH_ENABLED = True METAREFRESH_MAXDELAY = 100 NEWSPIDER_MODULE = '' RANDOMIZE_DOWNLOAD_DELAY = True REACTOR_THREADPOOL_MAXSIZE = 10 REDIRECT_ENABLED = True REDIRECT_MAX_TIMES = 20 # uses Firefox default setting REDIRECT_PRIORITY_ADJUST = +2 REFERER_ENABLED = True REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy' RETRY_ENABLED = True RETRY_TIMES = 2 # initial response + 2 retries = 3 requests RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408] RETRY_PRIORITY_ADJUST = -1 ROBOTSTXT_OBEY = False SCHEDULER = 'scrapy.core.scheduler.Scheduler' SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue' SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue' SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue' 
SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' SPIDER_LOADER_WARN_ONLY = False SPIDER_MIDDLEWARES = {} SPIDER_MIDDLEWARES_BASE = { # Engine side 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500, 'scrapy.spidermiddlewares.referer.RefererMiddleware': 700, 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800, 'scrapy.spidermiddlewares.depth.DepthMiddleware': 900, # Spider side } SPIDER_MODULES = [] STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector' STATS_DUMP = True STATSMAILER_RCPTS = [] TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates')) URLLENGTH_LIMIT = 2083 USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__ TELNETCONSOLE_ENABLED = 1 TELNETCONSOLE_PORT = [6023, 6073] TELNETCONSOLE_HOST = '127.0.0.1' SPIDER_CONTRACTS = {} SPIDER_CONTRACTS_BASE = { 'scrapy.contracts.default.UrlContract': 1, 'scrapy.contracts.default.ReturnsContract': 2, 'scrapy.contracts.default.ScrapesContract': 3, }
1. Depth and priority
- Depth
    - starts at 0
    - each time a new request is yielded, its depth is the parent request's depth + 1
    - DEPTH_LIMIT caps the depth
- Priority
    - the priority with which a request gets downloaded
    - priority -= depth * DEPTH_PRIORITY
    - configured via DEPTH_PRIORITY
def parse(self, response):
    # print the URL of the current page and its depth
    print(response.request.url, response.meta.get('depth', 0))
Reading through the settings file
# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Crawler name
BOT_NAME = 'step8_king'

# 2. Path of the spider modules
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client user-agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey the site's robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay, in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Concurrency per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are supported (handled through the cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console lets you inspect and control the running crawler.
#    Connect with `telnet ip port`, then run commands, e.g.
#       engine.pause()     pause
#       engine.unpause()   resume
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Pipelines that process the yielded items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }

# 12. Custom extensions, driven by signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. Maximum crawl depth; the current depth can be read from meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first/LIFO (default); 1 means breadth-first/FIFO

# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in, first out: breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. Scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. The auto-throttle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    auto-throttle settings:
    1. read the minimum delay            DOWNLOAD_DELAY
    2. read the maximum delay            AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay    AUTOTHROTTLE_START_DELAY
    4. when a request finishes downloading, take its "latency": the time between opening
       the connection and receiving the response headers
    5. value used in the calculation:    AUTOTHROTTLE_TARGET_CONCURRENCY
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
"""
# turn auto-throttling on
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. Enable caching
    Caches requests/responses that have already been fetched so they can be reused later.
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether the cache is enabled
# HTTPCACHE_ENABLED = True

# cache policy: cache every request; the next identical request is answered straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: cache according to the HTTP response headers (Cache-Control, Last-Modified, ...)
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0

# cache directory
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. Proxies: set them in the environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option 1: use the defaults via os.environ
        os.environ
        {
            http_proxy: http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
"""

"""
20. HTTPS access
    There are two cases when crawling over HTTPS:
    1. the target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. the target site uses a self-signed (custom) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,      # the pKey object
                    certificate=v2,     # the X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. Spider middleware
    (the SpiderMiddleware class shown in the spider-middleware section above goes here)

    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. Downloader middleware
    (the DownMiddleware1 class shown in the downloader-middleware section above goes here)

    Default downloader middlewares:
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }

# 23. Logging
# Scrapy provides logging through the standard logging module.
# Add these two lines anywhere in settings.py:
#     LOG_FILE = "mySpider.log"
#     LOG_LEVEL = "INFO"
#
# Scrapy has five logging levels:
#     CRITICAL - critical errors
#     ERROR    - regular errors
#     WARNING  - warning messages
#     INFO     - informational messages
#     DEBUG    - debugging messages
#
# logging-related settings in settings.py:
#     LOG_ENABLED   default: True, enables logging
#     LOG_ENCODING  default: 'utf-8', encoding used for logging
#     LOG_FILE      default: None, file name for the logging output in the current directory
#     LOG_LEVEL     default: 'DEBUG', the minimum level to log
#     LOG_STDOUT    default: False; if True, all standard output (and errors) of the process is
#                   redirected to the log, e.g. print "hello" will show up in the Scrapy log
Option 1: fully custom
class RedisFilter(BaseDupeFilter):
    def __init__(self):
        from redis import Redis, ConnectionPool
        pool = ConnectionPool(host='127.0.0.1', port='6379')
        self.conn = Redis(connection_pool=pool)

    def request_seen(self, request):
        """
        Check whether the current request has already been visited.
        :param request:
        :return: True if it has been visited; False if not
        """
        fd = request_fingerprint(request=request)
        added = self.conn.sadd('visited_urls', fd)
        return added == 0
Option 2: rely entirely on scrapy-redis
# ############### scrapy-redis connection ####################
REDIS_HOST = '127.0.0.1'                    # host
REDIS_PORT = 6379                           # port
# REDIS_PARAMS = {'password': 'beta'}       # redis connection parameters
#   default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"                    # redis encoding, default 'utf-8'
# REDIS_URL = 'redis://user:pass@hostname:9001'   # connection URL (takes precedence over the settings above)

# ############### scrapy-redis deduplication ####################
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
# replace the default dedup rule
# DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
Option 3: subclass scrapy-redis and customize
class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.
        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'test_scrapy_redis'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)
2. The scheduler
Redis connection settings:
    REDIS_HOST = '127.0.0.1'                # host
    REDIS_PORT = 6073                       # port
    # REDIS_PARAMS = {'password': 'xxx'}    # redis connection parameters
    #   default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
    REDIS_ENCODING = "utf-8"                # redis encoding, default 'utf-8'

Deduplication settings:
    DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

Scheduler settings:
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    DEPTH_PRIORITY = 1                      # breadth-first
    # DEPTH_PRIORITY = -1                   # depth-first
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'   # default: PriorityQueue (sorted set); alternatives: FifoQueue (list), LifoQueue (list)
    # breadth-first
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    # depth-first
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

    SCHEDULER_QUEUE_KEY = '%(spider)s:requests'          # redis key under which the scheduler stores requests
    SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"   # serializer for data saved to redis, default pickle
    SCHEDULER_PERSIST = False               # keep the scheduler queue and dedup records on close (True = keep, False = flush)
    SCHEDULER_FLUSH_ON_START = True         # flush the scheduler queue and dedup records on start (True = flush, False = keep)
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10      # when the queue is empty, how long to wait for data before giving up
    SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'   # redis key under which the dedup fingerprints are stored
    # DUPEFILTER_CLASS takes precedence; SCHEDULER_DUPEFILTER_CLASS is only used if it is not set
    SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'   # class implementing the dedup rule
1. scrapy crawl chouti --nolog

2. Find the SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting and instantiate the scheduler
    - run Scheduler.from_crawler
    - run Scheduler.from_settings
        - read the settings:
            SCHEDULER_PERSIST            # keep the queue and dedup records on close (True = keep, False = flush)
            SCHEDULER_FLUSH_ON_START     # flush the queue and dedup records on start (True = flush, False = keep)
            SCHEDULER_IDLE_BEFORE_CLOSE  # when the queue is empty, how long to wait for data before giving up
        - read the settings:
            SCHEDULER_QUEUE_KEY          # %(spider)s:requests
            SCHEDULER_QUEUE_CLASS        # scrapy_redis.queue.FifoQueue
            SCHEDULER_DUPEFILTER_KEY     # '%(spider)s:dupefilter'
            DUPEFILTER_CLASS             # 'scrapy_redis.dupefilter.RFPDupeFilter'
            SCHEDULER_SERIALIZER         # "scrapy_redis.picklecompat"
        - read the settings:
            REDIS_HOST = '140.143.227.206'          # host
            REDIS_PORT = 8888                       # port
            REDIS_PARAMS = {'password': 'beta'}     # redis connection parameters (defaults as above)
            REDIS_ENCODING = "utf-8"
    - instantiate the Scheduler object

3. The spider starts with the start URLs
    - scheduler.enqueue_request() is called

    def enqueue_request(self, request):
        # Does the request need filtering?
        # Is it already in the dedup records? (has it been visited; if not, add it to the records)
        if not request.dont_filter and self.df.request_seen(request):
            # it needs filtering and has already been visited: do not visit it again
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        # print('not visited yet, adding it to the scheduler', request)
        self.queue.push(request)
        return True

4. The downloader asks the scheduler for a task to download
    - scheduler.next_request() is called

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request
Persisting items: when the spider yields an Item, RedisPipeline runs.
    a. When persisting items to redis, the key and the serializer can be configured:
        REDIS_ITEMS_KEY = '%(spider)s:items'
        REDIS_ITEMS_SERIALIZER = 'json.dumps'
    b. Items are stored in a redis list (see the sketch below for reading them back).
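A small sketch of reading back the items that RedisPipeline stored, assuming the defaults above ('%(spider)s:items' key, JSON serialization) and that 'scrapy_redis.pipelines.RedisPipeline' is registered in ITEM_PIPELINES; 'spiderchouti' is the spider name used later in this post.

import json
from redis import Redis, ConnectionPool

conn = Redis(connection_pool=ConnectionPool(host='127.0.0.1', port=6379))
for raw in conn.lrange('spiderchouti:items', 0, -1):   # the items live in a redis list
    print(json.loads(raw))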
""" 起始URL相關 a. 獲取起始URL時,去集合中獲取仍是去列表中獲取?True,集合;False,列表 REDIS_START_URLS_AS_SET = False # 獲取起始URL時,若是爲True,則使用self.server.spop;若是爲False,則使用self.server.lpop b. 編寫爬蟲時,起始URL從redis的Key中獲取 REDIS_START_URLS_KEY = '%(name)s:start_urls' """ # If True, it uses redis' ``spop`` operation. This could be useful if you # want to avoid duplicates in your start urls list. In this cases, urls must # be added via ``sadd`` command or you will get a type error from redis. # REDIS_START_URLS_AS_SET = False # Default start urls key for RedisSpider and RedisCrawlSpider. # REDIS_START_URLS_KEY = '%(name)s:start_urls'
from scrapy_redis.spiders import RedisSpider


class SpiderchoutiSpider(RedisSpider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    # no start_urls needed: they are read from redis
from redis import Redis, ConnectionPool

pool = ConnectionPool(host='127.0.0.1', port=6379)
conn = Redis(connection_pool=pool)
conn.lpush('spiderchouti:start_urls', 'https://dig.chouti.com/')
1. What is depth-first? What is breadth-first?
   Think of a tree: depth-first finishes every node of one subtree before moving on to the next subtree;
   breadth-first finishes one whole level before moving down to the next level.
2. How does scrapy implement depth-first and breadth-first?
   With a stack or a queue:
       FIFO (queue) -> breadth-first
       LIFO (stack) -> depth-first
   With a sorted set, as a priority queue:
       DEPTH_PRIORITY = 1    # breadth-first
       DEPTH_PRIORITY = -1   # depth-first
3. How do the scheduler, the queue and the dupefilter relate in scrapy?
   The scheduler decides which request to add or hand out next.
   The queue stores the requests: FIFO (breadth-first), LIFO (depth-first), or a priority queue.
   The dupefilter keeps the visit records, i.e. the dedup rule.
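A tiny sketch of point 2: the same set of pending requests comes back in a different order depending on whether the scheduler's queue is LIFO (depth-first) or FIFO (breadth-first); the request names are placeholders.

from collections import deque

requests = ['page1', 'page2', 'page3']

stack = list(requests)              # LIFO -> depth-first
while stack:
    print('DFS order:', stack.pop())

queue = deque(requests)             # FIFO -> breadth-first
while queue:
    print('BFS order:', queue.popleft())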
from twisted.internet import reactor    # event loop (ends once every socket has been removed)
from twisted.web.client import getPage  # socket-like object (removed from the loop automatically when the download finishes)
from twisted.internet import defer      # defer.Deferred: a special socket-like object (sends no request; removed by hand)
from queue import Queue


class Request(object):
    """
    Wraps the information about a user request; used by the spider to issue requests.
    """
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class HttpResponse(object):
    """
    Wraps the downloaded data together with the request that produced it.
    The goal is to expose more than the raw content: the request URL, headers,
    cookies and so on, so the callback can parse the useful data conveniently.
    """
    def __init__(self, content, request):
        self.content = content
        self.request = request
        self.url = request.url
        self.text = str(content, encoding='utf-8')


class Scheduler(object):
    """
    Task scheduler:
    1. create a queue
    2. next_request: read the next request from the queue
    3. enqueue_request: put a request into the queue
    4. size: number of requests currently queued
    5. open: does nothing and returns None, so the engine's @defer.inlineCallbacks
       open_spider method has something to yield
    """
    def __init__(self):
        self.q = Queue()

    def open(self):
        pass

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception as e:
            req = None
        return req

    def enqueue_request(self, req):
        self.q.put(req)

    def size(self):
        return self.q.qsize()


class ExecutionEngine(object):
    """
    The engine: drives all scheduling.
    1. open_spider puts every request from start_requests into the scheduler's queue
    2. it runs the callback for each completed response (get_response_callback) and
       schedules the next request (_next_request)
    """
    def __init__(self):
        self._close = None
        self.scheduler = None
        self.max = 5
        self.crawlling = []

    def get_response_callback(self, content, request):
        self.crawlling.remove(request)
        response = HttpResponse(content, request)
        result = request.callback(response)
        import types
        if isinstance(result, types.GeneratorType):
            for req in result:
                self.scheduler.enqueue_request(req)

    def _next_request(self):
        """
        1. schedule the spider's requests
        2. termination condition for the event loop: the scheduler queue is empty
           and no request is in flight
        3. respect the maximum concurrency: only pull requests from the scheduler
           while fewer than self.max requests are in flight
        4. download each request and run its callback on the returned data
        5. then schedule the next round
        """
        if self.scheduler.size() == 0 and len(self.crawlling) == 0:
            self._close.callback(None)
            return

        # cap concurrency at 5
        while len(self.crawlling) < self.max:
            req = self.scheduler.next_request()
            if not req:
                return
            self.crawlling.append(req)
            d = getPage(req.url.encode('utf-8'))
            d.addCallback(self.get_response_callback, req)
            d.addCallback(lambda _: reactor.callLater(0, self._next_request))

    @defer.inlineCallbacks
    def open_spider(self, start_requests):
        """
        1. create a scheduler
        2. put every request from start_requests into the scheduler's queue
        3. then start scheduling the next request in the event loop
        Note: every function decorated with @defer.inlineCallbacks must yield something, even None.
        """
        self.scheduler = Scheduler()
        yield self.scheduler.open()
        while True:
            try:
                req = next(start_requests)
            except StopIteration as e:
                break
            self.scheduler.enqueue_request(req)
        reactor.callLater(0, self._next_request)

    @defer.inlineCallbacks
    def start(self):
        """Sends no request and must be fired by hand; its purpose is to hold the event loop open."""
        self._close = defer.Deferred()
        yield self._close


class Crawler(object):
    """
    1. wraps a scheduler and an engine for the user
    2. builds the spider object from its import path
    3. has the engine open the spider, push each of its requests into the scheduler's
       queue, and drive their execution
    """
    def _create_engine(self):
        return ExecutionEngine()

    def _create_spider(self, spider_cls_path):
        """
        :param spider_cls_path: e.g. spider.chouti.ChoutiSpider
        :return:
        """
        module_path, cls_name = spider_cls_path.rsplit('.', maxsplit=1)
        import importlib
        m = importlib.import_module(module_path)
        cls = getattr(m, cls_name)
        return cls()

    @defer.inlineCallbacks
    def crawl(self, spider_cls_path):
        engine = self._create_engine()
        spider = self._create_spider(spider_cls_path)
        start_requests = iter(spider.start_requests())
        # put every start request into the scheduler's queue and let the engine drive them
        yield engine.open_spider(start_requests)
        # create a Deferred that holds the event loop open until stopped by hand
        yield engine.start()


class CrawlerProcess(object):
    """
    1. creates a Crawler object
    2. passes each spider's import path to Crawler.crawl
    3. collects the returned deferred objects into a set
    4. starts the event loop
    """
    def __init__(self):
        self._active = set()

    def crawl(self, spider_cls_path):
        """
        :param spider_cls_path:
        :return:
        """
        crawler = Crawler()
        d = crawler.crawl(spider_cls_path)
        self._active.add(d)

    def start(self):
        dd = defer.DeferredList(self._active)
        dd.addBoth(lambda _: reactor.stop())
        reactor.run()


class Command(object):
    """
    1. builds the run command
    2. passes each spider's import path to crawl_process.crawl
    3. crawl_process.crawl creates a Crawler, which via Crawler.crawl creates an engine and a spider
    4. the engine's open_spider creates a scheduler, enqueues each spider request, and
       schedules the next request through its own _next_request method
    """
    def run(self):
        crawl_process = CrawlerProcess()
        spider_cls_path_list = ['spider.chouti.ChoutiSpider', 'spider.cnblogs.CnblogsSpider']
        for spider_cls_path in spider_cls_path_list:
            crawl_process.crawl(spider_cls_path)
        crawl_process.start()


if __name__ == '__main__':
    cmd = Command()
    cmd.run()