Several strategies Scrapy recommends to keep a crawler from getting banned

1. Rotate the User-Agent at random

In the same directory as the configuration file settings.py, add a new downloader middleware, rotate_useragent.py:

# -*- coding: utf-8 -*-
import random

# Older Scrapy releases exposed this class as
# scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware.
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Downloader middleware that picks a random User-Agent for each request."""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault leaves any User-Agent already set on the request untouched.
        request.headers.setdefault('User-Agent', random.choice(self.user_agent_list))

    # UA pool; more User-Agent strings are listed at http://www.useragentstring.com/pages/useragentstring.php
    # Adjacent string literals are concatenated, so each pair of lines is one User-Agent.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

Edit the configuration file settings.py to enable the downloader middleware:

DOWNLOADER_MIDDLEWARES = {
    # rotate_useragent.py sits next to settings.py, so the module path has no "spiders" segment.
    'WebCrawler.rotate_useragent.RotateUserAgentMiddleware': 1,
}

2. IP pool

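Spreading requests across several IPs keeps any single IP from being banned by the server for requesting too frequently.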
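
One way to use an IP pool is a second downloader middleware that assigns a proxy to each request. The following is a minimal sketch, not a definitive implementation: the class name RandomProxyMiddleware and the proxy addresses are hypothetical placeholders, and it relies on Scrapy sending a request through the proxy named in request.meta['proxy'].

```python
import random


class RandomProxyMiddleware:
    """Assign a randomly chosen proxy to every outgoing request."""

    # Hypothetical proxy endpoints; replace them with proxies you actually control.
    proxy_list = [
        "http://127.0.0.1:8001",
        "http://127.0.0.1:8002",
    ]

    def process_request(self, request, spider):
        # Scrapy routes the request through the proxy stored in request.meta['proxy'].
        request.meta["proxy"] = random.choice(self.proxy_list)
```

Like the User-Agent middleware, it would be enabled through DOWNLOADER_MIDDLEWARES in settings.py.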


3. Request delay

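Limiting the crawl rate helps avoid bans from anti-crawler measures and also reduces the load on the server.
Edit the configuration file settings.py and add the following line: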

DOWNLOAD_DELAY = 3
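
A fixed delay can itself look mechanical. As a sketch of how the delay behaves, Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting (enabled by default) makes the actual wait vary between 0.5 and 1.5 times DOWNLOAD_DELAY. The lowercase names below are illustrative only and are ignored by Scrapy.

```python
# settings.py fragment (a sketch; the values are illustrative)
DOWNLOAD_DELAY = 3               # base delay in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy default: wait 0.5x to 1.5x of DOWNLOAD_DELAY

# Resulting range of waits between requests to the same site, in seconds.
# Lowercase names are not read as Scrapy settings; they only show the arithmetic.
min_wait, max_wait = 0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY
```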

4. Disable cookies

Disabling cookies prevents the site from tracking the crawler's behavior.
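
The original does not show the setting for this step. In Scrapy, cookies are normally turned off with the COOKIES_ENABLED setting, so a likely settings.py line is:

```python
# settings.py fragment: disable Scrapy's cookie middleware so no cookies are stored or sent back
COOKIES_ENABLED = False
```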
