1. Scrapy overall architecture
Scrapy uses the Twisted asynchronous networking library to handle requests. The overall architecture is as follows:
Scrapy Engine: coordinates the data flow between all the components of the framework; it is the core of the framework.
Scheduler: accepts requests sent over by the engine, pushes them onto a queue, and hands them back when the engine asks for them again. It can be thought of as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs (essentially the set of URLs still to be crawled).
Downloader: downloads the page text for a given URL and passes it to the spiders for processing.
Spiders: process the downloaded page text and extract the required information. They can extract data as Items, which are passed to the Item Pipeline to be saved, or extract URLs, which are handed back to the Scheduler's URL queue.
Item Pipeline: receives the Items passed on by the spiders and persists them, e.g. by writing them to a file or a database.
Scheduler Middleware: handles the interaction between the engine and the scheduler.
Spider Middleware: handles the interaction between the engine and the spiders.
Downloader Middleware: handles the interaction between the engine and the downloader.
A complete cycle can be summarized as follows:
1. The Spiders hand the URLs to be requested (requests) to the Scheduler via the Scrapy Engine.
2. After the Scheduler has processed them (sorting, enqueueing), they pass through the Scrapy Engine and the Downloader Middlewares (optional; mainly User-Agent and proxy handling) to the Downloader.
3. The Downloader sends the requests to the internet and receives the responses. Each response passes back through the Scrapy Engine and the Spider Middlewares (optional) to the Spiders.
4. The Spiders process the responses: extracted data is handed via the Scrapy Engine to the Item Pipeline to be saved (to a local file or a database), and extracted URLs are handed via the Scrapy Engine back to the Scheduler for the next cycle. This repeats until no URL requests remain and the program stops.
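To make the cycle concrete, here is a minimal sketch of a spider; the site, selectors and field names are invented for illustration, and on older Scrapy versions an Item object would be yielded instead of a plain dict.

import scrapy

class MinimalSpider(scrapy.Spider):
    # hypothetical spider: example.com and the XPath expressions are placeholders
    name = "minimal"
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        # step 4: extracted data is yielded and travels through the engine to the Item Pipeline
        for row in response.xpath('//div[@class="entry"]'):
            yield {
                "title": row.xpath('./h3/a/text()').extract_first(),
                "page": response.url,
            }
        # step 4 (continued): an extracted URL is yielded as a Request and goes back to the Scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)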
2. Common commands
Official docs: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/commands.html
1 scrapy startproject project_name : create a crawler project in the current directory
2 scrapy genspider [-t template] <spider_name> <domain> : create a spider from a template (run it inside the project directory created above)
(available templates are basic, crawl, csvfeed and xmlfeed; basic is used by default, i.e. scrapy genspider -t basic; a sketch of what the crawl template roughly produces follows this list)
scrapy genspider -l : list all available templates
scrapy genspider -d template_name : print the content of the given template
3 scrapy list : list all spiders created in the project
4 scrapy crawl spider_name : run a single spider
scrapy crawl spider_name --nolog : run it without printing any log output
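For reference, running scrapy genspider -t crawl example example.com generates roughly the spider below (the exact output and import paths depend on the Scrapy version; older releases import from scrapy.contrib.*, and the name, domain and rule here are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # follow every link matched by the extractor and pass each response to parse_item
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]/text()').extract_first()
        return item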
3. Project structure
The directory created for a project looks like this:
scrapy.cfg : the project's top-level configuration (the settings that actually matter for crawling live in settings.py)
items.py : defines the data models used to structure the scraped data, similar to Django's Model
pipelines.py : defines how the data is processed, e.g. persisting the structured data
settings.py : configuration file, e.g. recursion depth, concurrency, download delay, etc.
spiders/ : the directory holding all spiders created in the project (e.g. cnblog.py)
The generated cnblog.py contains the following code:
# -*- coding: utf-8 -*-
import scrapy

class CnblogSpider(scrapy.Spider):
    name = "cnblog"                      # spider name
    allowed_domains = ["cnblogs.com"]    # restrict crawling to this domain; other domains are not crawled
    start_urls = (
        'http://www.cnblogs.com/',       # start URL
    )

    def parse(self, response):
        # callback invoked after the start URL has been fetched;
        # response is the result returned by the downloader, response.text is the page text
        pass
If the output is garbled on Windows (UnicodeEncodeError: 'gbk' codec can't encode character u'\xbb'): the Windows console uses GBK, while the page text returned by the downloader is a unicode string. Possible fixes:
Python 3: add the following lines at the top of the code
import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
# gb18030 is compatible with the whole GB family of encodings and avoids the few characters that GBK cannot encode
Python 2: set the encoding when printing
Python 2 does not have sys.stdout.buffer, so encode the content to be printed instead:
print response.text.encode('gb18030')
4. Selectors
Official docs: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html
Constructing a selector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# via the Selector class
response = HtmlResponse(url='http://example.com', body=html_body)
Selector(response=response).xpath()

# via the response's selector attribute and its xpath()/css() shortcut methods
response.selector.xpath()
response.xpath()
response.css()
Meaning of the selection expressions:
https://www.jianshu.com/p/2391950137a4
https://blog.csdn.net/manongpengzai/article/details/77109600
* matches any element node
@* matches any attribute node
node() matches a node of any kind
text() matches text values
extract() returns the strings held by the selector objects
string() returns the string value of a node (including the text of its descendants)
# hxs = Selector(response=response).xpath('//a')                        # all <a> elements in the document
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')                     # the second <a> element
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')                   # <a> elements that have an id attribute
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')              # <a> elements with id="i1"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')     # href attribute contains "link"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href attribute starts with "link"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')        # regular expression: id attribute matches "i\d+"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()   # text of the matching <a> elements
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()    # href attribute of the matching <a> elements
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()           # search level by level
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()         # return only the first match
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)
5. Practical examples
1. Crawl the article titles on the cnblogs home page, with automatic paging
import scrapy
from scrapy.http.request import Request

class CnblogSpider(scrapy.Spider):
    name = "cnblog"
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'https://www.cnblogs.com/',
    )
    has_request_set = {}

    def parse(self, response):
        # print response.text.encode("gb18030")
        # print dir(response)
        page_title = response.xpath('//div[@class="post_item"]//h3/a/text()').extract_first()
        print response.url, page_title
        pager_list = response.xpath('//div[@class="pager"]/a/@href').extract()
        for item in pager_list:
            url = 'https://www.cnblogs.com/%s' % item
            import hashlib
            hash = hashlib.md5()
            hash.update(url)
            key = hash.hexdigest()          # hash the url so duplicates are easy to compare and are not visited again
            if key in self.has_request_set:
                print u"already downloaded"  # printing a unicode literal avoids garbled output
            else:
                self.has_request_set[key] = url
                yield Request(url=url, method='GET')
                # no callback= is set on Request(), so self.parse() handles the returned response, i.e. a recursive call
                # set DEPTH_LIMIT = 1 in settings.py to limit the recursion depth
2. Log in to the Chouti hot list with cookies and upvote posts in bulk
import scrapy
from scrapy.http.cookies import CookieJar
from scrapy.http.request import Request

# Before running this spider for bulk upvoting, set DEPTH_LIMIT = 4 in the settings file;
# otherwise the recursion goes far too deep and hammers the site.
class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = (
        'https://dig.chouti.com/',
    )
    cookies_dict = {}
    has_request_set = {}

    # visit the home page and obtain the cookie
    def parse(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookies_dict[m] = n.value   # n is a Cookie() instance
                    # print n.value, type(n)
        # print self.cookies_dict

        # log in with the cookie so that the cookie gets authorized
        url = "https://dig.chouti.com/login"
        yield Request(
            url=url,
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # Content-Type must be set, otherwise the POSTed data is not processed correctly
            body='oneMonth=1&password=19930624&phone=8618626429847',
            cookies=self.cookies_dict,
            callback=self.check_login
        )

    # visit again with the authorized cookie
    def check_login(self, response):
        yield Request(
            url="https://dig.chouti.com/",
            method='GET',
            cookies=self.cookies_dict,
            callback=self.do_favor
        )

    # upvote in bulk
    def do_favor(self, response):
        linkid_list = response.xpath('//div[@share-linkid]/@share-linkid').extract()
        # print linkid_list
        user = response.xpath('//span[@id="userProNick"]/text()').extract()
        # print user
        for id in linkid_list:
            url = "https://dig.chouti.com/link/vote?linksId=%s" % id
            yield Request(
                url=url,
                method='POST',
                cookies=self.cookies_dict,
                callback=self.show_favor
            )

        # collect the page links and turn the pages automatically
        pager_list = response.xpath('//div[@id="dig_lcpage"]/ul/li/a/@href').extract()
        # print pager_list
        for page in pager_list:
            page_url = "https://dig.chouti.com%s" % page
            import hashlib
            hash = hashlib.md5()
            hash.update(page_url)
            key = hash.hexdigest()
            if key in self.has_request_set.keys():
                pass
            else:
                self.has_request_set[key] = page_url
                # print page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    cookies=self.cookies_dict,
                    callback=self.do_favor   # recursive call, so every page gets upvoted
                )

    # print the response returned after upvoting: "推薦成功" (recommendation succeeded)
    def show_favor(self, response):
        print response.text
6. Formatting the data
The data in the examples above can be processed directly inside parse, but when the data needs to be structured and persisted, it is better to model it with Items and hand it over to a pipeline.
Items
Items official docs: https://doc.scrapy.org/en/latest/topics/items.html
An Item definition is similar to a Django model: each Item object has a number of fields, behaves much like a dict, and can be converted to and from a dict.
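For the usage examples below, a Product item would first be declared along these lines (adapted from the Scrapy documentation; the field names are only illustrative):

import scrapy

class Product(scrapy.Item):
    # each attribute is declared as a Field; the fields are then accessed like dict keys
    name = scrapy.Field()
    price = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)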
Creating Item
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

Getting Field
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

Setting Field
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today

Creating dicts from items:
>>> dict(product)  # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
Pipelines
Pipeline official docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
The statement yield item passes the item to the process_item() method defined in each pipeline; the process_item() methods of the pipeline classes run one after another, ordered by the weights configured in settings (if process_item() does not return the item, the item is dropped and is not passed on to the next pipeline class). Besides process_item(), a pipeline can implement other methods, as follows:
from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    def process_item(self, item, spider):
        # process the item and persist it
        # returning the item lets the following pipelines keep processing it
        return item
        # raising DropItem() discards the item so no later pipeline sees it
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at start-up to create the pipeline instance
        :param crawler:
        :return:
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts
        :param spider:
        :return:
        """
        print('000000')

    def close_spider(self, spider):
        """
        Called when the spider is closed
        :param spider:
        :return:
        """
        print('111111')
Crawl Lianjia housing listings and save them:
# -*- coding: utf-8 -*-
import scrapy
from ..items import LianjiaItem
from scrapy.http.request import Request
import json

class LianjiaSpider(scrapy.Spider):
    name = "lianjia"
    allowed_domains = ["lianjia.com"]
    start_urls = (
        'http://wh.lianjia.com/ershoufang/',
    )
    has_request_set = {}

    def parse(self, response):
        sell_list = response.xpath('//ul[@class="sellListContent"]/li')
        # print sell_list
        for item in sell_list:
            # read data-original instead of the src attribute; src only holds a placeholder image
            img_src = item.xpath('./a/img[@class="lj-lazy"]/@data-original').extract_first()
            house_name = item.xpath('.//div[@class="houseInfo"]/a/text()').extract_first()
            house_desc = item.xpath('.//div[@class="houseInfo"]/text()').extract_first()
            total_price = item.xpath('.//div[@class="totalPrice"]/span/text()').extract_first()
            unit_price = item.xpath('.//div[@class="unitPrice"]/span/text()').extract_first()
            house_item = LianjiaItem(img_src=img_src, house_name=house_name,
                                     house_desc=house_desc, total_price=total_price, unit_price=unit_price)
            yield house_item

        # the individual page links cannot be read from the returned page, only the total page count
        pager_data = response.xpath('//div[@comp-module="page"]/@page-data').extract()
        # print pager_data
        total_page = json.loads(pager_data[0])["totalPage"]
        # for i in range(2, total_page):
        for i in range(2, 4):   # crawl only pages 2 and 3
            page_url = "https://wh.lianjia.com/ershoufang/pg%s/" % i
            yield Request(url=page_url, callback=self.parse)
import scrapy

class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_src = scrapy.Field()
    house_name = scrapy.Field()
    house_desc = scrapy.Field()
    total_price = scrapy.Field()
    unit_price = scrapy.Field()
import json
import requests
import os

class LianjiaPipeline(object):
    def __init__(self):
        self.file = open('lianjia.txt', 'a')   # create the file in the current directory and append to it

    def process_item(self, item, spider):
        if item['house_name']:
            data = json.dumps(dict(item), ensure_ascii=False).encode("utf8") + "\n"
            self.file.write(data)
        return item

    def close_spider(self, spider):
        # close the file once the spider finishes, so every item can still be written
        self.file.close()

class ImgPipeline(object):
    def __init__(self):
        if not os.path.exists('images'):   # create the folder if it does not exist yet
            os.mkdir('images')

    def process_item(self, item, spider):
        # stream=True writes the download to disk in chunks instead of reading it all into memory
        response = requests.get(item['img_src'], stream=True)
        file_name = u'%s_%s萬.jpg' % (item['house_name'], item['total_price'])
        with open(os.path.join('images', file_name), 'wb') as f:
            f.write(response.content)
        return item
ITEM_PIPELINES = {
    'mySpider.pipelines.LianjiaPipeline': 100,
    'mySpider.pipelines.ImgPipeline': 200,
}
# values range from 0 to 1000; the smaller the number, the higher the priority and the earlier its process_item() runs
7. Middleware
Spider Middleware sits between the engine and the spiders. Define your own spider-middleware class, implement the relevant methods, and register it in settings. The smaller the number, the closer the middleware is to the engine and the earlier its process_spider_input() runs; the larger the number, the closer it is to the spider and the earlier its process_spider_output() runs. Setting the value to None disables the middleware.
Official docs: https://scrapy.readthedocs.io/en/latest/topics/spider-middleware.html
https://zhuanlan.zhihu.com/p/42498126
class SpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        """
        Responses coming from the engine are processed here first, then handed to the spider
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider has finished and returns its results
        (the results are processed here before being passed on to the engine)
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on exceptions
        :param response:
        :param exception:
        :param spider:
        :return: None to let the following middlewares handle the exception;
                 an iterable containing Response or Item objects to hand to the scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        """
        return start_requests
SPIDER_MIDDLEWARES = {
    'mySpider.middlewares.MyCustomSpiderMiddleware': 543,
}
# merged with the middlewares in SPIDER_MIDDLEWARES_BASE and executed in order of the weights
'''
SPIDER_MIDDLEWARES_BASE = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
'''
Downloader Middleware sits between the engine and the downloader; it also has to be defined and registered. The smaller the number, the closer the middleware is to the engine; the larger the number, the closer it is to the downloader. The smaller the number, the earlier its process_request() runs; the larger the number, the earlier its process_response() runs. To disable a middleware, simply set its value to None.
DOWNLOADER_MIDDLEWARES = {
    'mySpider.middlewares.MyCustomDownloaderMiddleware': 543,
}
# merged with DOWNLOADER_MIDDLEWARES_BASE and executed in order of the weights
'''
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
'''
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Requests coming from the engine pass through the process_request of every downloader middleware
        :param request:
        :param spider:
        :return: None: continue to the following middlewares and download;
                 Response object: stop running process_request and start running process_response;
                 Request object: stop running the middlewares and hand the Request back to the scheduler;
                 raise IgnoreRequest: stop running process_request and start running process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        The response returned by the downloader passes through the process_response of every downloader middleware
        :param request:
        :param response:
        :param spider:
        :return: Response object: handed on to the process_response of the other middlewares;
                 Request object: stop the middlewares, the request is rescheduled for download;
                 raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or process_request() (of a downloader middleware) raises an exception
        :param request:
        :param exception:
        :param spider:
        :return: None: let the following middlewares handle the exception;
                 Response object: stop the following process_exception methods;
                 Request object: stop the middlewares, the request is rescheduled for download
        """
        return None

    @classmethod
    def from_crawler(cls, crawler):
        # create the middleware instance from the crawler
        return cls()
8. Custom commands
Official docs: https://doc.scrapy.org/en/latest/topics/commands.html?highlight=COMMANDS_MODULE
Add the setting COMMANDS_MODULE = 'project_name.directory_name' to settings.py.
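As a sketch of what such a module could contain (the module path mySpider/commands and the command name crawlall are made-up examples; check the ScrapyCommand API of your Scrapy version), a command that runs every spider in the project might look roughly like this:

# mySpider/commands/crawlall.py  -- the file name becomes the command name
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return "Run all spiders in the project"

    def run(self, args, opts):
        # schedule every spider known to the project, then start crawling once
        for name in self.crawler_process.spider_loader.list():
            self.crawler_process.crawl(name)
        self.crawler_process.start()

With COMMANDS_MODULE = 'mySpider.commands' in settings.py, the command would then be invoked as scrapy crawlall.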
9. Signals
Official docs: https://scrapy.readthedocs.io/en/latest/topics/signals.html
Scrapy defines many signals that are fired when particular events happen; you can register your own handler functions for them.
from scrapy import signals

class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
10. URL deduplication settings
Official docs: https://doc.scrapy.org/en/latest/topics/settings.html?highlight=DUPEFILTER_CLASS
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'   # the class that handles duplicate requests by default
DUPEFILTER_DEBUG = False   # the RFPDupeFilter default; False logs only the first duplicate request, True logs all of them
Request(..., dont_filter=True)   # a Request created with dont_filter=True is not deduplicated
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited
        :param request:
        :return: True if it has been visited; False otherwise
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when crawling finishes
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log the duplicate request
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
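To wire such a filter in, point DUPEFILTER_CLASS at it in settings.py (the module path below is a placeholder); individual requests can still opt out of deduplication with dont_filter=True:

# settings.py
DUPEFILTER_CLASS = 'mySpider.dupefilters.RepeatUrl'   # placeholder path to the custom filter

# inside a spider callback: this request is never dropped as a duplicate
yield Request(url=response.url, callback=self.parse, dont_filter=True)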
11. What the settings mean
# 1. bot name
BOT_NAME = 'step8_king'

# 2. module paths of the spiders
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. whether to obey robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. download delay in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. whether cookies are enabled (handled through a cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. the Telnet console can be used to inspect and control the running crawler;
#    connect with: telnet ip port, then operate it with commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]

# 10. default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. pipelines that process the items
# ITEM_PIPELINES = {
#     'step8_king.pipelines.JsonPipeline': 700,
#     'step8_king.pipelines.FilePipeline': 500,
# }

# 12. custom extensions, invoked through signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }

# 13. maximum allowed crawl depth; the current depth can be read from meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. crawl order: 0 means depth-first LIFO (the default); 1 means breadth-first FIFO
# last in first out, depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in first out, breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. scheduler queue
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. auto-throttling algorithm
    from scrapy.contrib.throttle import AutoThrottle
    auto-throttle settings:
    1. read the minimum delay            DOWNLOAD_DELAY
    2. read the maximum delay            AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay    AUTOTHROTTLE_START_DELAY
    4. when a request finishes downloading, take its "connection" time latency,
       i.e. the time between connecting and receiving the response headers
    5. used for the calculation: AUTOTHROTTLE_TARGET_CONCURRENCY
       target_delay = latency / self.target_concurrency
       new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
       new_delay = max(target_delay, new_delay)
       new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
       slot.delay = new_delay
"""
# enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
"""
18. HTTP caching
    caches requests/responses that have already been sent so they can be reused later
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether the cache is enabled
# HTTPCACHE_ENABLED = True
# caching policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# caching policy: cache according to the HTTP response headers Cache-Control, Last-Modified, etc.
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
# cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0
# directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are not cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# storage backend of the cache
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
19. proxies; need to be set via environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    option 1: use the default middleware and os.environ
        {
            http_proxy: http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }
    option 2: use a custom downloader middleware

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
        'step8_king.middlewares.ProxyMiddleware': 500,
    }
"""

"""
20. HTTPS access
    there are two cases when crawling over HTTPS:
    1. the target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. the target site uses a custom (self-signed) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # pKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    other related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
21. spider middleware
    class SpiderMiddleware(object):
        def process_spider_input(self, response, spider):
            '''
            called after the download has finished, before the response is handed to parse
            '''
            pass

        def process_spider_output(self, response, result, spider):
            '''
            called when the spider has finished and returns its results
            :return: must return an iterable containing Request or Item objects
            '''
            return result

        def process_spider_exception(self, response, exception, spider):
            '''
            called on exceptions
            :return: None to let the following middlewares handle the exception;
                     an iterable containing Response or Item objects to hand to the scheduler or pipeline
            '''
            return None

        def process_start_requests(self, start_requests, spider):
            '''
            called when the spider starts
            :return: an iterable containing Request objects
            '''
            return start_requests

    built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

"""
22. downloader middleware
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            called for every downloader middleware when a request is about to be downloaded
            :return: None: continue to the following middlewares and download;
                     Response object: stop running process_request and start running process_response;
                     Request object: stop running the middlewares and hand the Request back to the scheduler;
                     raise IgnoreRequest: stop running process_request and start running process_exception
            '''
            pass

        def process_response(self, request, response, spider):
            '''
            called with the response returned by the downloader
            :return: Response object: handed on to the process_response of the other middlewares;
                     Request object: stop the middlewares, the request is rescheduled for download;
                     raise IgnoreRequest: Request.errback is called
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            called when the download handler or process_request() (of a downloader middleware) raises an exception
            :return: None: let the following middlewares handle the exception;
                     Response object: stop the following process_exception methods;
                     Request object: stop the middlewares, the request is rescheduled for download
            '''
            return None

    default downloader middlewares:
        {
            'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
            'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
            'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
            'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
            'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
            'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
            'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        }
"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'step8_king.middlewares.DownMiddleware1': 100,
#     'step8_king.middlewares.DownMiddleware2': 500,
# }
12. A simple hand-rolled version of the Scrapy framework
Prerequisites: Twisted's reactor, defer, DeferredList, inlineCallbacks and getPage; see https://www.cnblogs.com/silence-cho/p/9898984.html
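Before reading the framework code, the following minimal sketch shows how those Twisted pieces fit together (the URLs are placeholders; getPage is the old-style Twisted API that the framework below also relies on):

#coding:utf-8
from twisted.web.client import getPage
from twisted.internet import reactor, defer

@defer.inlineCallbacks
def task(url):
    # getPage returns a Deferred; inlineCallbacks lets us wait for it with yield
    content = yield getPage(url.encode('utf-8'))
    print(len(content), url)

def done(arg):
    reactor.stop()

if __name__ == '__main__':
    urls = ['https://www.cnblogs.com/', 'https://www.baidu.com/']   # placeholder URLs
    d_list = [task(url) for url in urls]
    # DeferredList fires once every Deferred in the list has fired; then stop the reactor
    dd = defer.DeferredList(d_list)
    dd.addBoth(done)
    reactor.run()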
Project layout and code:
#coding:utf-8
from twisted.web.client import getPage
from twisted.internet import reactor, defer
from Queue import Queue

class Request(object):
    # wraps a url together with the callback that will handle its response
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class HttpResponse(object):
    def __init__(self, content, request):
        self.response = content
        self.request = request

    @property
    def text(self):
        return self.response

class Scheduler(object):
    # a plain FIFO queue of Request objects
    def __init__(self):
        self.q = Queue()

    def open(self):
        pass

    def enqueue_request(self, req):
        self.q.put(req)

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception as e:
            req = None
        return req

    def size(self):
        return self.q.qsize()

class ExecutionEngine(object):
    def __init__(self):
        self._close = None
        self.scheduler = None
        self.max = 5          # maximum number of requests downloading at the same time
        self.crawling = []

    def get_response_callback(self, content, request):
        print request.url
        # print self.crawling
        self.crawling.remove(request)
        # print self.crawling
        response = HttpResponse(content, request)
        result = request.callback(response)
        import types
        if isinstance(result, types.GeneratorType):
            # the spider callback yielded new Request objects; enqueue them
            for req in result:
                self.scheduler.enqueue_request(req)

    def _next_request(self):
        # nothing left to schedule and nothing in flight: fire the Deferred that keeps start() alive
        if self.scheduler.size() == 0 and len(self.crawling) == 0:
            self._close.callback(None)
            return
        while len(self.crawling) < self.max:
            req = self.scheduler.next_request()
            if not req:
                return
            # print req.url
            self.crawling.append(req)
            # print self.crawling
            d = getPage(req.url.encode('utf-8'))
            d.addCallback(self.get_response_callback, req)
            d.addCallback(lambda _: reactor.callLater(0, self._next_request))

    @defer.inlineCallbacks
    def open_spider(self, start_requests):
        self.scheduler = Scheduler()
        yield self.scheduler.open()
        while True:
            try:
                req = next(start_requests)
                self.scheduler.enqueue_request(req)
            except StopIteration as e:
                break
        reactor.callLater(0, self._next_request)

    @defer.inlineCallbacks
    def start(self):
        self._close = defer.Deferred()
        yield self._close

class Crawler(object):
    def __init__(self, spider_cls_path):
        self.spider_cls_path = spider_cls_path

    def _create_engine(self):
        return ExecutionEngine()

    def _create_spider(self):
        # import the spider class from its dotted path and instantiate it
        module_path, cls_name = self.spider_cls_path.rsplit('.', 1)
        import importlib
        module = importlib.import_module(module_path)
        cls = getattr(module, cls_name)
        # print cls, '----'
        return cls()

    @defer.inlineCallbacks
    def crawl(self):
        spider = self._create_spider()
        start_requests = iter(spider.start_request())
        engine = self._create_engine()
        yield engine.open_spider(start_requests)
        yield engine.start()

class CrawlProcess(object):
    def __init__(self):
        self.active = set()

    def crawl(self, spider_cls_path):
        crawler = Crawler(spider_cls_path)
        d = crawler.crawl()
        self.active.add(d)

    def start(self):
        # stop the reactor once every crawler's Deferred has fired
        dd = defer.DeferredList(self.active)
        dd.addBoth(lambda _: reactor.stop())
        reactor.run()

class Command(object):
    def run(self):
        spider_cls_paths = ['spider.chouti.ChoutiSpider', 'spider.cnblogs.CnblogsSpider']
        crawlProcess = CrawlProcess()
        for spider_cls_path in spider_cls_paths:
            crawlProcess.crawl(spider_cls_path)
        crawlProcess.start()

if __name__ == '__main__':
    c = Command()
    c.run()
#coding:utf-8
from engine import Request

class CnblogsSpider(object):
    name = 'Cnblogs'

    def start_request(self):
        start_url = ['https://www.cnblogs.com/', 'https://www.baidu.com/']
        for url in start_url:
            yield Request(url, self.parse)

    def parse(self, response):
        print response
        # print response.text
#coding:utf-8
from engine import Request

class ChoutiSpider(object):
    name = 'chouti'

    def start_request(self):
        start_url = ['https://dig.chouti.com/', 'https://www.baidu.com/']
        for url in start_url:
            yield Request(url, self.parse)

    def parse(self, response):
        # print response
        yield Request('https://www.sina.com.cn/', self.call)
        # print response.text

    def call(self, response):
        print 'crawling sina'
Reference blog: http://www.cnblogs.com/wupeiqi/articles/6229292.html