Some pitfalls of the Python Requests library when handling responses

Date: 2019-11-12


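The Python Requests library (http://docs.python-requests.org/en/latest/) makes handling HTTP/HTTPS requests fairly convenient and is widely used.
However, a few aspects of its response handling need special attention. In short, it comes down to the difference between the content and text properties of the Response object. The relevant code is as follows: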

    @property
    def content(self):
        """Content of the response, in bytes."""

        if self._content is False:
            # Read the contents.
            try:
                if self._content_consumed:
                    raise RuntimeError(
                        'The content for this response was already consumed')

                if self.status_code == 0:
                    self._content = None
                else:
                    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

            except AttributeError:
                self._content = None

        self._content_consumed = True
        # don't need to release the connection; that's been handled by urllib3
        # since we exhausted the data.
        return self._content

    @property
    def text(self):
        """Content of the response, in unicode.

        if Response.encoding is None and chardet module is available, encoding
        will be guessed.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return str('')

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the lovely Charade library
        (Thanks, Ian!)."""
        return chardet.detect(self.content)['encoding']

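As the code shows, the text property decodes the raw bytes into a string, whereas content returns the bytes unchanged. When encoding is None, text falls back to apparent_encoding, which runs chardet over the whole body.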
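The decoding step inside text can be tried in isolation. The following is a minimal sketch using only the standard library. The name decode_body is hypothetical and not part of Requests; it only mirrors the try/except fallback quoted above.

```python
from typing import Optional


def decode_body(raw: bytes, encoding: Optional[str]) -> str:
    """Hypothetical helper mirroring the fallback logic of Response.text."""
    if not raw:
        return ""
    try:
        # Same call shape as in the quoted source.
        return str(raw, encoding, errors="replace")
    except (LookupError, TypeError):
        # LookupError: unknown codec name. TypeError: encoding is None.
        # Fall back to the default codec (UTF-8), replacing bad bytes.
        return str(raw, errors="replace")


body = "中文".encode("utf-8")
print(type(body).__name__, type(decode_body(body, "utf-8")).__name__)
print(decode_body(b"abc", "no-such-codec"))
```

The raw body stays bytes; only the decoded result is a str, and a wrong or missing codec name silently ends up in the fallback branch.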
The encoding attribute of the response is assigned in build_response of HTTPAdapter in adapters.py. The code is as follows:

    def build_response(self, req, resp):
        """Builds a :class:`Response <requests.Response>` object from a urllib3
        response. This should not be called from user code, and is only exposed
        for use when subclassing the
        :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`

        :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
        :param resp: The urllib3 response object.
        """
        response = Response()

        # Fallback to None if there's no status_code, for whatever reason.
        response.status_code = getattr(resp, 'status', None)

        # Make headers case-insensitive.
        response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))

        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)
        response.raw = resp
        response.reason = response.raw.reason

        if isinstance(req.url, bytes):
            response.url = req.url.decode('utf-8')
        else:
            response.url = req.url

        # Add new cookies from the server.
        extract_cookies_to_jar(response.cookies, req, resp)

        # Give the Response some context.
        response.request = req
        response.connection = self

        return response

As the line response.encoding = get_encoding_from_headers(response.headers) above shows, the encoding is obtained by parsing the headers:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
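To see what this logic yields for typical Content-Type values, here is a stand-alone sketch. The function name guess_encoding_from_content_type is hypothetical, and it parses the parameters with email.message.Message from the standard library instead of cgi.parse_header, so it is an approximation of the quoted function and not the library's own code.

```python
from email.message import Message
from typing import Optional


def guess_encoding_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for the header logic quoted above."""
    if not content_type:
        return None

    # Let the stdlib parse the header parameters (charset=...).
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")

    # No charset given: text/* types default to ISO-8859-1.
    main_type = content_type.split(";", 1)[0].strip().lower()
    if "text" in main_type:
        return "ISO-8859-1"
    return None


for value in ("text/html; charset=utf-8", "text/html", "application/json", None):
    print(repr(value), "->", guess_encoding_from_content_type(value))
```

Two cases matter for the rest of the post: a text type without charset yields ISO-8859-1, and a missing header or a non-text type without charset yields None, which is the case that sends text to chardet.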

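To keep Requests from using chardet to guess the response encoding (which happens whenever encoding is None), use the text property with caution. Use the content property directly and decode it yourself as needed.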
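Decoding the bytes yourself looks roughly like the sketch below. The variable raw is a stand-in for response.content, and the sample string is made up for illustration. It also shows what goes wrong when UTF-8 bytes are decoded as ISO-8859-1.

```python
# Stand-in for response.content: UTF-8 bytes, as a server might send them.
raw = "新浪首页".encode("utf-8")

# Decoding with ISO-8859-1 never fails, but every byte becomes one character.
garbled = raw.decode("ISO-8859-1")

# Decoding with the encoding the page really uses gives the original text.
correct = raw.decode("utf-8")

print(garbled == correct)
print(correct)
```

Because ISO-8859-1 maps every byte to a character, the wrong decode raises no error, so the problem only shows up as garbled output.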
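For a response whose encoding is None, the difference between using text and content is shown below. The code resets r.encoding to None to force this case, so that text has to fall back to chardet.
Code: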

    import time

    import requests

    print(time.time())
    print('begin request')
    r = requests.get('http://www.sina.com.cn')
    # Erase the response encoding so that text must guess it with chardet.
    r.encoding = None
    r.text
    # r.content
    print('request end')
    print(time.time())

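Elapsed time when using text: (timing screenshot not preserved)

Elapsed time when using content: (timing screenshot not preserved)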
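The measurement above also includes the network round trip, since requests.get reads the whole body before returning by default. A sketch that times only the attribute access could look like this. The helper time_access is hypothetical, and the final lines use a plain bytes decode as a stand-in so the block runs without a network.

```python
import time


def time_access(getter):
    """Return (value, seconds) for one call of a zero-argument callable."""
    start = time.perf_counter()
    value = getter()
    return value, time.perf_counter() - start


# Hypothetical usage with a response r obtained from requests.get(...):
#   _, text_seconds = time_access(lambda: r.text)
#   _, content_seconds = time_access(lambda: r.content)
value, seconds = time_access(lambda: b"abc".decode("utf-8"))
print(value, seconds >= 0)
```

Timing r.text and r.content this way on the same response separates the cost of the chardet guess from the cost of the download.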
