Downloader Middleware has three core methods:

- process_request(request, spider)
- process_response(request, response, spider)
- process_exception(request, exception, spider)
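To make the hook contract concrete, here is a minimal sketch (plain objects stand in for scrapy.Request / scrapy.Response, and SketchMiddleware is an illustrative name, not a Scrapy class):

```python
from types import SimpleNamespace

class SketchMiddleware:
    """Illustrative only -- shows the three hook signatures and their return contract."""

    def process_request(self, request, spider):
        # Return None to let Scrapy continue handling the request;
        # returning a Response instead would short-circuit the download.
        request.headers['User-Agent'] = 'sketch-agent'
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (possibly modified) or a new Request.
        response.status = 201
        return response

    def process_exception(self, request, exception, spider):
        # Return None to pass the exception on to other middlewares.
        return None

# Plain namespaces stand in for scrapy.Request / scrapy.Response here.
mw = SketchMiddleware()
req = SimpleNamespace(url='http://httpbin.org/get', headers={})
resp = SimpleNamespace(status=200)
mw.process_request(req, spider=None)
out = mw.process_response(req, resp, spider=None)
print(req.headers['User-Agent'], out.status)  # sketch-agent 201
```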
Option 1: set the USER_AGENT variable in settings.py — adding a single line USER_AGENT = '....' is enough.
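For example (the UA string below is just one of the ones used later in this post; any browser UA string works):

```python
# settings.py -- one fixed User-Agent for the whole project
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'
```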
Option 2: implement a random User-Agent in middlewares.py by defining a RandomUserAgentMiddleware class with a process_request() method.
A process_response() method is also defined in middlewares.py, to demonstrate that responses can be modified on the way back.
First create a test project:

```shell
scrapy startproject httpbintest
cd httpbintest && scrapy genspider httpbin httpbin.org
```
```python
# httpbintest/spiders/httpbin.py
# -*- coding: utf-8 -*-
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # print(response.text)
        self.logger.debug(response.text)
        self.logger.debug('status code: ' + str(response.status))
```
In the middleware below, process_request() assigns a random User-Agent to each outgoing request, and process_response() rewrites the response status code to 201 (purely to demonstrate that responses can be modified).
```python
# httpbintest/middlewares.py
import random


class RandomUserAgentMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        ]

    def process_request(self, request, spider):
        # Pick a random UA for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        # Demo only: force the status code to 201
        response.status = 201
        return response
```
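The request-side logic can be exercised without running a crawl. The sketch below isolates the same uniform random choice (pick_user_agent is a hypothetical helper, not part of Scrapy) and checks that every UA in the pool gets picked over many draws:

```python
import random

# Same UA pool as the middleware above
USER_AGENTS = [
    'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
]

def pick_user_agent(pool):
    # The core of process_request: one uniform random choice per request
    return random.choice(pool)

# Over many draws, every UA in the pool should appear at least once.
seen = {pick_user_agent(USER_AGENTS) for _ in range(1000)}
print(len(seen))  # 3
```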
Finally, enable the middleware in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'httpbintest.middlewares.RandomUserAgentMiddleware': 543,
}
```

Run scrapy crawl httpbin: since httpbin.org/get echoes the request headers, the debug log shows the User-Agent changing between runs, and the logged status code is 201.