Scrapy is a framework for crawling websites and extracting structured data. Note the key word: framework. What does being a framework imply? It ships with a rich set of components, and you can tell Scrapy's design took inspiration from Django. Unlike Django, though, Scrapy is also highly extensible. So if you say you can write crawlers in Python but don't know a bit of Scrapy...
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is shown in the diagram below:
Scrapy mainly consists of the following components:
That's it for the introduction. Now for an interview question: talk about your understanding of the Scrapy architecture.
The above is my understanding; yours may differ from mine, so go with your own if you're asked in an interview. Just be sure to say something, because almost any reasonable answer works here. If you stay quiet, the interviewer will definitely follow up with: "Anything else you'd like to add?"...
Installation:
pass
Let's jump straight into the code.
Hands-on project: use Scrapy to crawl Chouti (dig.chouti.com). Reading this, are you a little disappointed? Why not Taobao? Why not Tencent Video? Why not Zhihu? Let me explain: Chouti may be small, but it has all the pieces we need. The goal is to demonstrate Scrapy's basic usage, so there's no point spending energy fighting anti-crawling measures.
Before starting the project, we need to settle two things: what data are we crawling, and how do we store it? This example uses MySQL as the storage backend; the table schema simply mirrors the item fields defined below.
With those fields in place we can get started on the spider. Step one is writing the item class.
```python
import datetime
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose, TakeFirst


def add_title(value):
    """Append a signature after the field."""
    return value + 'Pontoon'


class ArticleItemLoader(ItemLoader):
    """Custom ItemLoader: take the first element of each extracted list."""
    default_output_processor = TakeFirst()


class ChouTiArticleItem(scrapy.Item):
    """Item field definitions."""
    title = scrapy.Field(
        input_processor=MapCompose(add_title)
    )
    article_url = scrapy.Field()
    article_url_id = scrapy.Field()
    font_img_url = scrapy.Field()
    author = scrapy.Field()
    sign_up = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = '''
            insert into chouti(title, article_url, article_url_id, font_img_url, author, sign_up)
            VALUES (%s, %s, %s, %s, %s, %s)
        '''
        params = (self["title"], self["article_url"], self["article_url_id"], self["font_img_url"],
                  self["author"], self["sign_up"],)
        return insert_sql, params
```
Next comes parsing the data (pagination isn't handled here, to keep things simple).
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from ..items import ChouTiArticleItem, ArticleItemLoader
from ..utils.common import get_md5


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']  # domain only, no trailing slash
    start_urls = ['https://dig.chouti.com/']
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/71.0.3578.98 Safari/537.36"
    }

    def parse(self, response):
        article_list = response.xpath("//div[@class='item']")
        for article in article_list:
            # note: the loader works on a selector here, not on the response
            item_load = ArticleItemLoader(item=ChouTiArticleItem(), selector=article)
            article_url = article.xpath(".//a[@class='show-content color-chag']/@href").extract_first("")
            item_load.add_xpath("title", ".//a[@class='show-content color-chag']/text()")
            item_load.add_value("article_url", article_url)
            item_load.add_value("article_url_id", get_md5(article_url))
            item_load.add_xpath("font_img_url", ".//div[@class='news-pic']/img/@src")
            item_load.add_xpath("author", ".//a[@class='user-a']//b/text()")
            item_load.add_xpath("sign_up", ".//a[@class='digg-a']//b/text()")
            article_item = item_load.load_item()  # populate the fields defined above
            yield article_item
```
OK, at this point the parsed data is yielded as items into the pipeline.
```python
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi  # makes the database writes asynchronous


class MysqlTwistedPipleline(object):
    """Chouti pipeline."""

    def __init__(self, db_pool):
        self.db_pool = db_pool

    @classmethod
    def from_settings(cls, settings):
        """Built-in hook: called automatically with the project settings."""
        db_params = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        db_pool = adbapi.ConnectionPool("MySQLdb", **db_params)

        return cls(db_pool)

    def process_item(self, item, spider):
        """Insert the item into the database asynchronously via Twisted."""
        query = self.db_pool.runInteraction(self.do_insert, item)  # runInteraction() runs the insert asynchronously
        query.addErrback(self.handle_error, item, spider)          # addErrback() handles any failure asynchronously
        return item  # keep passing the item along to any later pipelines

    def handle_error(self, failure, item, spider):
        """Custom handler for errors raised by the async insert."""
        print(failure)

    def do_insert(self, cursor, item):
        """Perform the actual insert."""
        insert_sql, params = item.get_insert_sql()
        # insert the chouti row
        cursor.execute(insert_sql, (item["title"], item["article_url"], item["article_url_id"],
                                    item["font_img_url"], item["author"], item["sign_up"]))
```
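A note on the design choice here: MySQLdb calls are blocking, and Scrapy runs inside the Twisted reactor, so the pipeline hands each insert to adbapi.ConnectionPool via runInteraction, which executes it in a Twisted-managed thread pool and returns a Deferred. That way database writes don't stall the crawl.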
Then just add the configuration in settings.py.
```python
ITEM_PIPELINES = {
    'ChoutiSpider.pipelines.MysqlTwistedPipleline': 3,
}
...

MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"  # database name
MYSQL_USER = "xxxxxx"
MYSQL_PASSWORD = "xxxxx"
```
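A quick note on the value 3 assigned to the pipeline: it is the pipeline's order. Pipelines with lower numbers run first, and by convention the values stay in the 0-1000 range.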
OK, the warm-up project is done; now on to the main topic. Following Scrapy's architecture diagram, I'll give a brief overview of each of Scrapy's components, along with a look at the source code.
Scrapy provides five kinds of spiders for building requests, parsing data, and returning items. The two you'll use most are scrapy.Spider and scrapy.CrawlSpider. Below are the attributes and methods a spider commonly uses.
| Attribute / method | Purpose | Notes |
| --- | --- | --- |
| name | The spider's name | Used when launching the spider |
| start_urls | The starting URLs | A list; consumed by start_requests by default |
| allowed_domains | Simple URL filtering | When a requested URL isn't matched by allowed_domains you get a really nasty error; see my earlier post on the distributed crawler |
| start_requests() | The initial requests | Can be overridden in your own spider to get past some simple anti-crawling measures |
| custom_settings | Per-spider settings | Lets you tailor the settings configuration for each spider |
| from_crawler | The instantiation entry point | In the source of every Scrapy component, this is the first thing executed |
With a spider we can customize start_requests, set custom_settings individually, and also set request headers, proxies, and cookies. These basics were covered in earlier posts; the extra thing I want to mention is page parsing.
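As a rough sketch of those basics (the spider name, URLs, proxy address and settings values below are placeholders, not part of the Chouti project), overriding start_requests and custom_settings looks roughly like this:

```python
import scrapy
from scrapy.http import Request


class ExampleSpider(scrapy.Spider):
    # all names and URLs below are placeholders for illustration
    name = 'example'
    start_urls = ['https://example.com/']

    # per-spider settings that override the project-wide settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 8,
    }

    def start_requests(self):
        # overriding start_requests lets us attach headers, cookies or a proxy
        # to the very first requests instead of relying on the defaults
        for url in self.start_urls:
            yield Request(
                url=url,
                headers={'User-Agent': 'Mozilla/5.0'},
                meta={'proxy': 'http://127.0.0.1:8888'},  # placeholder proxy address
                callback=self.parse,
            )

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```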
Page parsing generally comes in two flavors. In one, the article list you request carries little information in the titles (news and info sites), so you only need to visit the actual articles; in the loop you extract the URL from the href attribute of each a tag. In the other, typical of e-commerce sites, the product list itself already carries a lot of information, so you need to extract both the product data from the list and, say, the user reviews from the detail page; there you loop over the concrete li tags and then parse the a tag inside each li to follow to the next request.
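A minimal sketch of those two loop shapes (every XPath here is an invented placeholder, not taken from a real site):

```python
import scrapy


class ListPatternSpider(scrapy.Spider):
    """Illustrates the two list-parsing shapes; all XPaths are placeholders."""
    name = 'list_patterns'

    def parse_news_list(self, response):
        # pattern 1: the list page only gives us links, so just follow each article URL
        for href in response.xpath("//div[@class='list']//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_goods_list(self, response):
        # pattern 2: each li already carries data; extract it, then follow the detail page
        for li in response.xpath("//ul[@class='goods']/li"):
            price = li.xpath(".//span[@class='price']/text()").extract_first()
            detail_url = li.xpath(".//a/@href").extract_first()
            yield scrapy.Request(response.urljoin(detail_url),
                                 meta={'price': price},  # carry list-page data to the detail callback
                                 callback=self.parse_detail)

    def parse_article(self, response):
        pass

    def parse_detail(self, response):
        pass
```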
As for CrawlSpider, the documentation explains it: essentially you define your own rules that regex-match every requested URL, supplementing allowed_domains; this is commonly used for whole-site crawls. You can also do a whole-site crawl by working out the site's URL patterns yourself; the Sina crawler and the Dangdang books crawler on this blog both use URL matching rules I defined myself.
For detailed usage, see the examples in the documentation and my blog post on crawling Lagou with CrawlSpider.
OK, that's it for spiders for now; if I discover anything else I'll come back and add it.
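For reference, a minimal CrawlSpider sketch, assuming a made-up site whose article URLs look like /article/<id>.html (the domain and patterns are placeholders):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteWideSpider(CrawlSpider):
    name = 'site_wide'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['https://example.com/']

    rules = (
        # follow list pages but don't parse them
        Rule(LinkExtractor(allow=r'/list/\d+')),
        # parse article pages and keep following links found on them
        Rule(LinkExtractor(allow=r'/article/\d+\.html'), callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}
```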
Scrapy's built-in deduplication strategy is essentially this: hash each request URL with SHA1, then store the digest in a set(), which gives you deduplication. (Later on, the distributed crawler uses a Redis set for URL dedup in the same way.)
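Stripped of the Scrapy specifics, the idea is just "hash the URL, keep the digest in a set"; a minimal standalone sketch of that idea (the distributed version simply swaps the local set for a Redis set):

```python
import hashlib

seen = set()


def request_seen(url):
    """Return True if this URL was already scheduled, otherwise remember it."""
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    if fp in seen:
        return True
    seen.add(fp)
    return False


print(request_seen("https://dig.chouti.com/"))  # False: first time we see it
print(request_seen("https://dig.chouti.com/"))  # True: duplicate, would be dropped
```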
```python
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/71.0.3578.98 Safari/537.36"
    }

    def parse(self, response):
        """
        article_list = response.xpath("//div[@class='item']")
        for article in article_list:
            item_load = ArticleItemLoader(item=ChouTiArticleItem(), selector=article)  # loader works on a selector, not the response
            article_url = article.xpath(".//a[@class='show-content color-chag']/@href").extract_first("")
            item_load.add_xpath("title", ".//a[@class='show-content color-chag']/text()")
            item_load.add_value("article_url", article_url)
            item_load.add_value("article_url_id", get_md5(article_url))
            item_load.add_xpath("font_img_url", ".//div[@class='news-pic']/img/@src")
            item_load.add_xpath("author", ".//a[@class='user-a']//b/text()")
            item_load.add_xpath("sign_up", ".//a[@class='digg-a']//b/text()")
            article_item = item_load.load_item()  # populate the fields defined above
            yield article_item
        """
        print(response.request.url)

        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        # page_list = response.xpath("//div[@id='dig_lcpage']//li[last()]/a/@href").extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)
```

Output:

```
https://dig.chouti.com/
https://dig.chouti.com/all/hot/recent/2
https://dig.chouti.com/all/hot/recent/3
https://dig.chouti.com/all/hot/recent/7
https://dig.chouti.com/all/hot/recent/6
https://dig.chouti.com/all/hot/recent/8
https://dig.chouti.com/all/hot/recent/9
https://dig.chouti.com/all/hot/recent/4
https://dig.chouti.com/all/hot/recent/5
```
From the output you can see that Scrapy actually deduplicates URLs for us automatically. Let's go look at its source.

```python
from scrapy.dupefilters import RFPDupeFilter  # source entry point
```

Start with this method:
```python
def request_seen(self, request):
    fp = self.request_fingerprint(request)  # hash the request (essentially its URL) with hashlib.sha1()
    if fp in self.fingerprints:             # self.fingerprints = set()
        return True                         # already seen, so it won't be visited again
    self.fingerprints.add(fp)               # otherwise add the hashed fingerprint to the set
    if self.file:
        self.file.write(fp + os.linesep)    # the fingerprint can also be persisted to a file
```
Once you understand how the source works, you can override the URL deduplication rules.
```python
from scrapy.dupefilters import BaseDupeFilter


class ChouTiDupeFilter(BaseDupeFilter):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False  # never treat a URL as a duplicate

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass
```
Then add the following setting in settings.py:

```python
# custom URL dedup filter
DUPEFILTER_CLASS = 'JobboleSpider.chouti_dupefilters.ChouTiDupeFilter'
```
Run the spider now and you'll see URLs being visited repeatedly. To bring deduplication back, implement request_seen yourself on top of request_fingerprint:
```python
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint  # canonical home of request_fingerprint


class ChouTiDupeFilter(BaseDupeFilter):
    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()  # instantiate

    def request_seen(self, request):
        # return False  # would disable dedup entirely
        fd = request_fingerprint(request)  # request_fingerprint deserves a closer look, see below
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('start')

    def close(self, reason):  # can return a deferred
        print('close')

    def log(self, request, spider):  # log that a request has been filtered
        pass
```
```python
import hashlib

from scrapy.utils.request import request_fingerprint
from scrapy.http import Request

# note: include the query string (after the ?) when writing the URLs
url1 = "http://www.baidu.com/?k1=v1&k2=v2"
url2 = "http://www.baidu.com/?k2=v2&k1=v1"
ret1 = hashlib.sha1()
fd1 = Request(url=url1)
ret2 = hashlib.sha1()
fd2 = Request(url=url2)
ret1.update(url1.encode('utf-8'))
ret2.update(url2.encode('utf-8'))

print("fd1:", ret1.hexdigest())
print("fd2:", ret2.hexdigest())
print("re1:", request_fingerprint(request=fd1))
print("re2:", request_fingerprint(request=fd2))
```

Output:

```
fd1: 1864fb5c5b86b058577fb94714617f3c3a226448
fd2: ba0922c651fc8f3b7fb07dfa52ff24ed05f6475e
re1: de8c206cf21aab5a0c6bbdcdaf277a9f71758525
re2: de8c206cf21aab5a0c6bbdcdaf277a9f71758525
```

(SHA1 gives a 40-character hex digest; MD5 gives 32.)
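The two plain SHA1 digests differ because the query parameters appear in a different order, while request_fingerprint canonicalizes the URL (sorting the query string) and also factors in the method and body, so both requests produce the same fingerprint and Scrapy treats them as one request.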
Depth and priority are two concepts you barely notice in everyday use, depth especially. You can read a request's depth like this:
```python
response.request.meta.get("depth", None)
```
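As a small self-contained sketch (the spider name is made up for illustration), here is where those values come from in practice; DEPTH_LIMIT, DEPTH_PRIORITY and DEPTH_STATS_VERBOSE are exactly the settings read by the DepthMiddleware source shown further below:

```python
import scrapy


class DepthDemoSpider(scrapy.Spider):
    # placeholder spider to show the depth-related settings in one place
    name = 'depth_demo'
    start_urls = ['https://dig.chouti.com/']

    custom_settings = {
        'DEPTH_LIMIT': 3,             # drop requests deeper than 3 levels
        'DEPTH_PRIORITY': 1,          # deeper requests get a lower priority
        'DEPTH_STATS_VERBOSE': True,  # collect per-depth request counts
    }

    def parse(self, response):
        self.logger.info("url=%s depth=%s priority=%s",
                         response.url,
                         response.meta.get("depth", 0),
                         response.request.priority)
```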
But depth and priority matter a great deal in the internal execution flow, because they travel with the spider output to the engine and are ultimately passed on to the scheduler to be stored. Let's look at the source.

```python
from scrapy.spidermiddlewares import depth  # source entry point, located in the spider middlewares
```
```python
import logging

from scrapy.http import Request

logger = logging.getLogger(__name__)


class DepthMiddleware(object):

    def __init__(self, maxdepth, stats=None, verbose_stats=False, prio=1):
        self.maxdepth = maxdepth        # maximum depth
        self.stats = stats              # stats collector
        self.verbose_stats = verbose_stats
        self.prio = prio                # priority adjustment

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        maxdepth = settings.getint('DEPTH_LIMIT')
        verbose = settings.getbool('DEPTH_STATS_VERBOSE')
        prio = settings.getint('DEPTH_PRIORITY')
        return cls(maxdepth, crawler.stats, verbose, prio)

    def process_spider_output(self, response, result, spider):
        def _filter(request):
            """Closure applied to each request yielded by the spider."""
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
                # priority handling below
                if self.prio:
                    # priority decreases with depth, i.e. requests generated earlier
                    # (at a shallower depth) keep a higher priority than later ones
                    request.priority -= depth * self.prio
                if self.maxdepth and depth > self.maxdepth:
                    logger.debug(
                        "Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                        {'maxdepth': self.maxdepth, 'requrl': request.url},
                        extra={'spider': spider}
                    )
                    return False
                elif self.stats:
                    if self.verbose_stats:
                        self.stats.inc_value('request_depth_count/%s' % depth,
                                             spider=spider)
                    self.stats.max_value('request_depth_max', depth,
                                         spider=spider)
            return True

        # base case (depth=0)
        if self.stats and 'depth' not in response.meta:
            response.meta['depth'] = 0
            if self.verbose_stats:
                self.stats.inc_value('request_depth_count/0', spider=spider)
        # result is the iterator of Request() objects yielded by the spider
        return (r for r in result or () if _filter(r))  # if _filter(r) is True, request r goes on to the scheduler
```
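One detail worth calling out in _filter: the line request.priority -= depth * self.prio means that with a positive DEPTH_PRIORITY, deeper requests get a lower priority (nudging the crawl towards breadth-first; the Scrapy FAQ pairs this with FIFO queues), a negative value favours depth-first, and the default of 0 leaves priority untouched.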
As the architecture diagram shows, the scheduler receives requests from the engine and enqueues them, so that it can hand them back to the engine when the engine asks for them later.
The Scheduler source:
```python
import os
import json
import logging
from os.path import join, exists

from scrapy.utils.reqser import request_to_dict, request_from_dict
from scrapy.utils.misc import load_object
from scrapy.utils.job import job_dir

logger = logging.getLogger(__name__)


class Scheduler(object):

    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None):
        self.df = dupefilter
        self.dqdir = self._dqdir(jobdir)
        self.pqclass = pqclass
        self.dqclass = dqclass
        self.mqclass = mqclass
        self.logunser = logunser
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        dupefilter = dupefilter_cls.from_settings(settings)
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS', settings.getbool('SCHEDULER_DEBUG'))
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass, mqclass=mqclass)

    def has_pending_requests(self):
        return len(self) > 0

    def open(self, spider):
        self.spider = spider
        self.mqs = self.pqclass(self._newmq)
        self.dqs = self._dq() if self.dqdir else None
        return self.df.open()

    def close(self, reason):
        if self.dqs:
            prios = self.dqs.close()
            with open(join(self.dqdir, 'active.json'), 'w') as f:
                json.dump(prios, f)
        return self.df.close(reason)

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True

    def next_request(self):
        request = self.mqs.pop()
        if request:
            self.stats.inc_value('scheduler/dequeued/memory', spider=self.spider)
        else:
            request = self._dqpop()
            if request:
                self.stats.inc_value('scheduler/dequeued/disk', spider=self.spider)
        if request:
            self.stats.inc_value('scheduler/dequeued', spider=self.spider)
        return request

    def __len__(self):
        return len(self.dqs) + len(self.mqs) if self.dqs else len(self.mqs)

    def _dqpush(self, request):
        if self.dqs is None:
            return
        try:
            reqd = request_to_dict(request, self.spider)
            self.dqs.push(reqd, -request.priority)
        except ValueError as e:  # non serializable request
            if self.logunser:
                msg = ("Unable to serialize request: %(request)s - reason:"
                       " %(reason)s - no more unserializable requests will be"
                       " logged (stats being collected)")
                logger.warning(msg, {'request': request, 'reason': e},
                               exc_info=True, extra={'spider': self.spider})
                self.logunser = False
            self.stats.inc_value('scheduler/unserializable',
                                 spider=self.spider)
            return
        else:
            return True

    def _mqpush(self, request):
        self.mqs.push(request, -request.priority)

    def _dqpop(self):
        if self.dqs:
            d = self.dqs.pop()
            if d:
                return request_from_dict(d, self.spider)

    def _newmq(self, priority):
        return self.mqclass()

    def _newdq(self, priority):
        return self.dqclass(join(self.dqdir, 'p%s' % priority))

    def _dq(self):
        activef = join(self.dqdir, 'active.json')
        if exists(activef):
            with open(activef) as f:
                prios = json.load(f)
        else:
            prios = ()
        q = self.pqclass(self._newdq, startprios=prios)
        if q:
            logger.info("Resuming crawl (%(queuesize)d requests scheduled)",
                        {'queuesize': len(q)}, extra={'spider': self.spider})
        return q

    def _dqdir(self, jobdir):
        if jobdir:
            dqdir = join(jobdir, 'requests.queue')
            if not exists(dqdir):
                os.makedirs(dqdir)
            return dqdir
```
First, from_crawler runs to load the objects:

```python
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
    dupefilter = dupefilter_cls.from_settings(settings)  # if DUPEFILTER_CLASS isn't set, the default dedup rules are used
    pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])  # priority queue
    dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])      # disk queue
    mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])    # memory queue
    logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS', settings.getbool('SCHEDULER_DEBUG'))
    return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
               stats=crawler.stats, pqclass=pqclass, dqclass=dqclass, mqclass=mqclass)
```
Then __init__ instantiates the scheduler:

```python
def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
             logunser=False, stats=None, pqclass=None):
    self.df = dupefilter
    self.dqdir = self._dqdir(jobdir)
    self.pqclass = pqclass
    self.dqclass = dqclass
    self.mqclass = mqclass
    self.logunser = logunser
    self.stats = stats
```
Then the open method runs:

```python
def open(self, spider):
    self.spider = spider
    self.mqs = self.pqclass(self._newmq)
    self.dqs = self._dq() if self.dqdir else None
    return self.df.open()
```
Then enqueue_request runs to put the request into the queue:

```python
def enqueue_request(self, request):
    """Enqueue a request."""
    # request.dont_filter defaults to False; self.df.request_seen(request) returns True
    # when the URL has already been visited, so for a fresh URL the whole condition is False
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    dqok = self._dqpush(request)  # jump to self._dqpush(request): enqueue the request, last-in first-out
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)     # otherwise use the memory queue, likewise last-in first-out
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
```
The _dqpush method:

```python
def _dqpush(self, request):
    if self.dqs is None:
        return
    try:
        reqd = request_to_dict(request, self.spider)  # serialize the request object into a dict
        self.dqs.push(reqd, -request.priority)        # push the request dict onto the disk queue with priority -request.priority
    except ValueError as e:  # non serializable request
        if self.logunser:
            msg = ("Unable to serialize request: %(request)s - reason:"
                   " %(reason)s - no more unserializable requests will be"
                   " logged (stats being collected)")
            logger.warning(msg, {'request': request, 'reason': e},
                           exc_info=True, extra={'spider': self.spider})
            self.logunser = False
        self.stats.inc_value('scheduler/unserializable',
                             spider=self.spider)
        return
    else:
        return True
```
Then next_request runs to pop a request off the queue and hand it to the downloader:

```python
def next_request(self):
    """Dequeue a request."""
    request = self.mqs.pop()  # try the memory queue first
    if request:
        self.stats.inc_value('scheduler/dequeued/memory', spider=self.spider)
    else:
        request = self._dqpop()
        if request:
            self.stats.inc_value('scheduler/dequeued/disk', spider=self.spider)
    if request:
        self.stats.inc_value('scheduler/dequeued', spider=self.spider)
    return request

def __len__(self):
    return len(self.dqs) + len(self.mqs) if self.dqs else len(self.mqs)
```
OK, so the scheduler is essentially just a queue; Scrapy relies on the third-party queue library queuelib, and by default requests are handed to the downloader last-in, first-out.
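If that default LIFO (depth-first-ish) order isn't what you want, the queue classes can be swapped in settings.py; a sketch based on the breadth-first recipe from the Scrapy FAQ:

```python
# crawl roughly breadth-first instead of the default depth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```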
In Scrapy, cookies can be passed directly to Request().

```python
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
```

The detailed usage is best shown with an example:
```python
import scrapy
from scrapy.http import Request
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def parse(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        print(cookie_jar)
        print(self.cookie_dict)
        # log in
        yield Request(url="https://dig.chouti.com/login", method="POST",
                      body="phone=xxxxxxxxxx&password=xxxxxxxx&oneMonth=1",
                      cookies=self.cookie_dict,
                      headers={
                          "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                                        "Chrome/71.0.3578.98 Safari/537.36",
                          "content-type": "application/x-www-form-urlencoded;charset=UTF-8",
                      },
                      callback=self.check_login,
                      )

        # parse the fields
        """
        article_list = response.xpath("//div[@class='item']")
        for article in article_list:
            item_load = ArticleItemLoader(item=ChouTiArticleItem(), selector=article)  # loader works on a selector, not the response
            article_url = article.xpath(".//a[@class='show-content color-chag']/@href").extract_first("")
            item_load.add_xpath("title", ".//a[@class='show-content color-chag']/text()")
            item_load.add_value("article_url", article_url)
            item_load.add_value("article_url_id", get_md5(article_url))
            item_load.add_xpath("font_img_url", ".//div[@class='news-pic']/img/@src")
            item_load.add_xpath("author", ".//a[@class='user-a']//b/text()")
            item_load.add_xpath("sign_up", ".//a[@class='digg-a']//b/text()")
            article_item = item_load.load_item()  # populate the fields defined above
            yield article_item
        """
        # fetch every page
        """
        print(response.request.url, response.request.meta.get("depth", 0))

        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        # page_list = response.xpath("//div[@id='dig_lcpage']//li[last()]/a/@href").extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)
        """

    def check_login(self, response):
        """Check whether login succeeded; the follow-up request to the home page must carry the cookie."""
        print(response.text)

        yield Request(url="https://dig.chouti.com/all/hot/recent/1",
                      cookies=self.cookie_dict,
                      callback=self.index
                      )

    def index(self, response):
        """Upvote posts on the home page."""
        news_list = response.xpath("//div[@id='content-list']/div[@class='item']")
        for news in news_list:
            link_id = news.xpath(".//div[@class='part2']/@share-linkid").extract_first()
            print(link_id)
            yield Request(
                url="https://dig.chouti.com/link/vote?linksId=%s" % (link_id, ),
                method="POST",
                cookies=self.cookie_dict,
                callback=self.do_favs
            )

        # pagination
        page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
        # page_list = response.xpath("//div[@id='dig_lcpage']//li[last()]/a/@href").extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, cookies=self.cookie_dict, callback=self.index)

    def do_favs(self, response):
        """Print the upvote response."""
        print(response.text)
```
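One last note: Scrapy's built-in CookiesMiddleware already carries cookies across requests automatically (and the 'cookiejar' meta key lets a single spider keep several separate cookie sessions), so manually harvesting cookies with CookieJar as above is mainly useful when you need the raw cookie values yourself, as in this login flow.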