UA pool (each request uses a random User-Agent drawn from the pool)
a) Import the base class in the middleware module
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware  # scrapy.contrib was removed in Scrapy 1.x
b) Define a subclass of UserAgentMiddleware and override its process_request method
Example:

middlewares.py
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random

# Note: the header value must not include the "User-Agent:" prefix itself.
ua_list = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
]

ip_http_list = ['90.229.216.218:46796', '110.235.250.7:49341', '81.163.62.136:41258', '195.34.207.47:60878']
ip_https_list = ['140.227.207.211:60088', '140.227.209.210:60088', '185.132.133.102:1080']

class UserAgentRandom(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Assign directly: setdefault would keep any User-Agent already set
        # by an earlier middleware instead of replacing it.
        request.headers['User-Agent'] = random.choice(ua_list)
settings.py
DOWNLOADER_MIDDLEWARES = {
    'handle5.middlewares.Handle5DownloaderMiddleware': 543,
    'handle5.middlewares.UserAgentRandom': 542,
    'handle5.middlewares.IpRandom': 541,
    # Disable the built-in middleware so it does not set a User-Agent first.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
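The reason the middleware above assigns the header directly rather than calling `setdefault` can be shown outside Scrapy with a plain dict. This is a minimal sketch; `pick_user_agent` and the short `ua_list` entries are hypothetical stand-ins, not part of the project above.

```python
import random

# Hypothetical stand-ins for real User-Agent strings.
ua_list = ['agent-a', 'agent-b']

def pick_user_agent(headers, pool, override=True):
    """Pick a random UA; with override=False, mimic setdefault semantics."""
    ua = random.choice(pool)
    if override:
        headers['User-Agent'] = ua          # always replaces
    else:
        headers.setdefault('User-Agent', ua)  # keeps an existing value
    return headers
```

With `override=False`, a User-Agent set earlier in the chain survives untouched, which is exactly why `setdefault` in a pool middleware silently does nothing once a default UA is already present.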
Proxy pool (each request's proxy IP is drawn at random from the pool)
middlewares.py
class IpRandom:
    def process_request(self, request, spider):
        # Pick the proxy pool that matches the request's URL scheme.
        scheme = request.url.split(":")[0]
        if scheme == "http":
            request.meta["proxy"] = "http://" + random.choice(ip_http_list)
        else:
            request.meta["proxy"] = "https://" + random.choice(ip_https_list)
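The scheme-matching logic above can be sketched as a standalone function, which also makes it easy to test without a running crawl. `pick_proxy` is a hypothetical helper, assuming the same two pools; `urlsplit` replaces the manual `split(":")`.

```python
import random
from urllib.parse import urlsplit

ip_http_list = ['90.229.216.218:46796', '110.235.250.7:49341', '81.163.62.136:41258', '195.34.207.47:60878']
ip_https_list = ['140.227.207.211:60088', '140.227.209.210:60088', '185.132.133.102:1080']

def pick_proxy(url):
    """Return a proxy URL whose pool matches the request URL's scheme."""
    scheme = urlsplit(url).scheme
    # Mirror the middleware's else-branch: anything non-http uses the https pool.
    pool = ip_http_list if scheme == 'http' else ip_https_list
    prefix = 'http://' if scheme == 'http' else 'https://'
    return prefix + random.choice(pool)
```

Mixing the pools matters because an HTTPS request tunneled through a plain-HTTP-only proxy will typically fail the CONNECT step, so keeping the two lists separate is deliberate.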