Response and Its Subclasses in Scrapy (TextResponse, HtmlResponse, XmlResponse)

Date: 2019-11-10

Today, while using Scrapy to crawl wallpapers (url: http://pic.netbian.com/4kmein...), I ran into a few problems. I am recording them here so that others can learn from my mistakes.

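Because the site is rendered dynamically, I chose to hook Scrapy up to Selenium. (Scrapy fetches pages much like the requests library does: both issue plain HTTP requests directly, so Scrapy on its own cannot scrape pages rendered by JavaScript.)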

The downloader middleware therefore has to take the Request and return a Response, and the Response is where things went wrong. The official documentation lists class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported Response with from scrapy.http import Response.

Running scrapy crawl girl
produced the following error:
results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text
Checking the relevant code:

# middlewares.py
from scrapy import signals
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
import selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class Pic4KgirlDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            self.browser=selenium.webdriver.Chrome()
            self.wait=WebDriverWait(self.browser,10)
            
            self.browser.get(request.url)
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
            return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
        #except:
            #raise IgnoreRequest()
        finally:
            self.browser.close()

My guess was that the problem lay in this line:
return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
Looking at the definition of the Response class shows:

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

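This shows that the base Response class does not support text-based methods such as text, css() and xpath(). They only work on subclasses that override them.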

Response subclasses:

**TextResponse objects**
class scrapy.http.TextResponse(url[, encoding[, ...]])
**HtmlResponse objects**
class scrapy.http.HtmlResponse(url[, ...])
**XmlResponse objects**
class scrapy.http.XmlResponse(url[, ...])

Take the definition of TextResponse as an example. After importing it with
from scrapy.http import TextResponse
its source reads:

class TextResponse(Response):

    _DEFAULT_ENCODING = 'ascii'

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super(TextResponse, self).__init__(*args, **kwargs)

Here the xpath method has been overridden:

    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)

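So when you build a response by hand and want to use selectors on it, construct one of the subclasses (for an HTML page, HtmlResponse) instead of the base Response class. The subclasses already override these methods.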