First, let's update Scrapy. The latest version is 1.3.
One more time for the Windows folks: pip alone often fails to install Scrapy on Windows. Use Anaconda instead — or just bite the bullet and use Linux.
conda install scrapy==1.3 or pip install scrapy==1.3
Install Scrapy-Redis:
conda install scrapy-redis or pip install scrapy-redis
Python 2.7, 3.4, or 3.5 is required (I use 3.6 myself with no problems). Also note the version requirements:
Redis >= 2.8
Scrapy >= 1.0
redis-py >= 2.1
If you don't have redis-py yet, install it with pip install redis.
Before we start, we need to know the scrapy-redis settings. Note: these go in the Scrapy project's settings.py!
# Enable the Redis scheduler for storing the requests queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Requests are serialized with pickle by default, but you can swap in
# something similar. Note: this works on Python 2.x but NOT on 3.x.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't clear the Redis queues, which allows pausing/resuming crawls
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue (the default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the distributed spider from closing while waiting.
# This only works when the queue class above is SpiderQueue or SpiderStack,
# and it may also block spiders from starting at the same time when they
# first launch (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store the scraped items in Redis for post-processing
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The Redis key used to store serialized items
#REDIS_ITEMS_KEY = '%(spider)s:items'

# Items are serialized with ScrapyJSONEncoder by default.
# You can use any importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Host and port to use when connecting to Redis (optional)
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Full URL to use when connecting to Redis (optional).
# If set, this takes precedence over REDIS_HOST and REDIS_PORT.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom Redis parameters (connection timeout and the like)
#REDIS_PARAMS = {}

# Custom Redis client class
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, use Redis' 'spop' operation. Useful if you need to avoid duplicates
# in the start URLs list; with this enabled, URLs must be added via 'sadd',
# otherwise you'll get a type error.
#REDIS_START_URLS_AS_SET = False

# Default start_urls key for RedisSpider and RedisCrawlSpider
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use a Redis encoding other than utf-8
#REDIS_ENCODING = 'latin1'
If that's too much to wade through, see: http://scrapy-redis.readthedocs.io/en/stable/readme.html — pick the settings you need and add them to your project's settings.py.
Now let's continue modifying the spider from the previous post.
First, add the Redis settings we need to settings.py.
If you configured your Redis database as in the earlier post, you need at least these three:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://root:password@host:port'
Nice, that's it for the settings (adjust the third one to match your own setup). Now let's add some basic anti-bot countermeasures.
The most basic one: rotating the User-Agent!
First, create a useragent.py in the project to hold a pile of User-Agent strings (you can find more online, or just use these ready-made ones):
agents = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
Now we'll override one of Scrapy's downloader middlewares. (What?! Override a middleware?! Sounds advanced! Will it be hard? Relax — it's easy! Follow along; you'll get it, and if not, well, you can't punch me through the network cable anyway.)
For the details of overriding middlewares, see the official docs:
http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/downloader-middleware.html#scrapy.contrib.downloadermiddleware.DownloaderMiddleware
Create a file named middlewares.py in the project (newer versions of Scrapy generate this file when you create a project — just use it).
Start by importing UserAgentMiddleware, since that's what we're about to override:
import json    # JSON handling
import redis   # the Python Redis client
import random  # for picking at random

from .useragent import agents  # the list we built above
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware  # the User-Agent middleware
from scrapy.downloadermiddlewares.retry import RetryMiddleware  # the retry middleware
Now write it:
class UserAgentmiddleware(UserAgentMiddleware):

    def process_request(self, request, spider):
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent
Line one defines a class UserAgentmiddleware that inherits from UserAgentMiddleware.
Line two defines the method process_request(request, spider). Why this method? Because Scrapy calls it for every request that passes through the middleware.
Line three picks a random User-Agent from our list.
Line four sets the request's User-Agent header to that random pick.
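If you want to see this process_request logic run outside of Scrapy, here is a minimal stand-in sketch. Everything in it is made up for illustration: FakeRequest stands in for Scrapy's Request (just a headers dict), UserAgentRotator mirrors the middleware's logic, and the agents list is shortened:

```python
import random

# Shortened stand-in for the agents list in useragent.py
agents = ["UA-one", "UA-two", "UA-three"]

class FakeRequest:
    """Made-up stand-in for scrapy.http.Request: just a headers dict."""
    def __init__(self):
        self.headers = {}

class UserAgentRotator:
    """Mirrors the process_request logic of the middleware."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(agents)

req = FakeRequest()
UserAgentRotator().process_request(req, spider=None)
print(req.headers["User-Agent"] in agents)  # True
```

Every request that goes through gets a freshly chosen agent, which is the whole trick.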
^_^ Y(^o^)Y And that's one middleware done! Haha — so easy, right?
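One step worth spelling out: Scrapy only runs a downloader middleware that is registered in DOWNLOADER_MIDDLEWARES in settings.py. A sketch of that registration — the haoduofuli.middlewares path is an assumption based on the project used in this series, so adjust it to your own project:

```python
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ...and plug in ours at the same priority
    'haoduofuli.middlewares.UserAgentmiddleware': 400,
}
```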
Next comes the login. This time we won't log in with FormRequest as in the previous post — we'll log in with cookies instead. That means overriding the cookie middleware. This is a distributed crawler: you can't hand-maintain a cookie for every spider, and you wouldn't even know when a cookie has expired. So we need to maintain a cookie pool (backed by Redis).
OK, let's think this through: what does a cookie pool minimally need to be able to do?
We'll start with the first three operations on cookies.
First, create a cookies.py file in the project to hold everything we do with cookies:
haoduofuli/haoduofuli/cookies.py
Start with the usual imports:
import requests
import json
import redis
import logging

from .settings import REDIS_URL  # pull REDIS_URL out of settings.py
First, store the login accounts and passwords in Redis as key:value pairs. Don't put them in db 0 — that's what scrapy-redis uses by default; give accounts and passwords their own db.
The layout is simply the account as the key and its password as the value.
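(The original post shows a database screenshot at this point.) The layout it illustrates is nothing more than account → password pairs, which this stdlib sketch mimics — a plain dict standing in for Redis db 2, with made-up account names:

```python
# A plain dict standing in for Redis db 2:
# each key is an account, each value is that account's password.
account_db = {
    "user1@example.com": "password-one",
    "user2@example.com": "password-two",
}

for user, pwd in account_db.items():
    print(user, "->", pwd)
```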
First problem — getting a cookie:
import requests
import json
import redis
import logging

from .settings import REDIS_URL

logger = logging.getLogger(__name__)

# Connect to Redis via REDIS_URL. decode_responses=True is required;
# without it everything comes back as bytes, which is completely unusable here.
reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True)
login_url = 'http://haoduofuli.pw/wp-login.php'

# Fetch a cookie by logging in
def get_cookie(account, password):
    s = requests.Session()
    payload = {
        'log': account,
        'pwd': password,
        'rememberme': "forever",
        'wp-submit': "登陆",  # the "Log In" button value posted by the form
        'redirect_to': "http://www.haoduofuli.pw/wp-admin/",
        'testcookie': "1"
    }
    response = s.post(login_url, data=payload)
    cookies = response.cookies.get_dict()
    logger.warning("Got cookie successfully! (account: %s)" % account)
    return json.dumps(cookies)
This part should be easy to follow.
We use the requests module to submit the login form and obtain the cookies, then return them serialized with JSON (skip the serialization and what lands in Redis becomes a plain-text blob that's useless as a cookie when you read it back later).
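The serialize-then-store step is worth seeing in isolation. This sketch (with a made-up cookie dict shaped like what response.cookies.get_dict() returns) shows the round trip a cookie takes through Redis:

```python
import json

# A made-up cookie dict for illustration
cookies = {"wordpress_logged_in": "abc123", "wp-settings-time-1": "1484361600"}

stored = json.dumps(cookies)    # the string that gets written to Redis
restored = json.loads(stored)   # what a spider reads back out

print(restored == cookies)  # True
```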
Second problem — writing the cookie into Redis (it's distributed, so of course the other spiders need to be able to use this cookie too):
def init_cookie(red, spidername):
    redkeys = reds.keys()
    for user in redkeys:
        password = reds.get(user)
        if red.get("%s:Cookies:%s--%s" % (spidername, user, password)) is None:
            cookie = get_cookie(user, password)
            red.set("%s:Cookies:%s--%s" % (spidername, user, password), cookie)
Using the reds connection created above, this grabs every key in Redis db 2 (the account names, as we set them up) and each key's value (the password). For each account it checks whether a cookie for this spider/account pair already exists; if not, it calls get_cookie with the account and password pulled from Redis.
The result is saved back to Redis with the spider name plus account--password as the key and the cookie as the value.
Note that the handle used for the cookie store is red, not the reds connection created above — red gets passed in later (we're working with two different dbs, and I couldn't find a way to switch dbs on one connection in the docs, so this is the workaround; leave a comment if you know a better way).
How spidername gets obtained will also be covered later.
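To see init_cookie's key scheme in action without a running Redis server, here is a self-contained sketch. Everything in it is mocked: FakeRedis is an in-memory stand-in for both Redis handles, get_cookie is stubbed out, and — unlike the real function — reds is passed in explicitly so the sketch carries no module-level state:

```python
class FakeRedis:
    """Minimal in-memory stand-in for a redis.Redis handle."""
    def __init__(self, data=None):
        self.data = dict(data or {})
    def keys(self):
        return list(self.data)
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

def get_cookie(account, password):
    # Stub for the real login function in cookies.py
    return '{"fake": "cookie"}'

def init_cookie(red, spidername, reds):
    # Same key scheme as the real init_cookie: spider:Cookies:account--password
    for user in reds.keys():
        password = reds.get(user)
        key = "%s:Cookies:%s--%s" % (spidername, user, password)
        if red.get(key) is None:
            red.set(key, get_cookie(user, password))

reds = FakeRedis({"user1": "pw1"})   # accounts db (db 2 in the real setup)
red = FakeRedis()                    # cookie store
init_cookie(red, "haoduofuli", reds)
print(red.get("haoduofuli:Cookies:user1--pw1"))  # {"fake": "cookie"}
```

Running it twice is harmless: the `is None` check means an existing cookie is never fetched again.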
The remaining pieces — refreshing cookies, removing accounts that no longer work, and so on — you can try writing yourself (no worries if you can't; they aren't needed for basic use).
Done! So easy!!!!