Python + Requests: Encoding Detection Bug

Date: 2019-11-21

Tags: python+requests, python, requests, encoding detection, bug. Category: Python


Requests is an HTTP library released under the Apache2 license. It is written in Python and aims to be friendlier and easier to use.

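Requests is built on urllib3, so it inherits all of urllib3's features. Requests supports HTTP keep-alive and connection pooling, sessions persisted with cookies, file uploads, automatic detection of the response content's encoding, and automatic encoding of internationalized URLs and POST data. Modern, international, and human-friendly.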

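While using Requests recently, I ran into a problem: when fetching certain Chinese web pages, the text comes out garbled, and printing encoding shows ISO-8859-1. Why does this happen? After reading the source, I found that the default encoding detection is quite simple. It reads the encoding directly from the Content-Type response header. If a charset is present, the encoding is identified correctly. If there is no charset but the content type contains text, the encoding is assumed to be ISO-8859-1. See utils.py.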

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
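
The header-only rule above can be reproduced without Requests. The following is a minimal sketch, not the library's code: the function name encoding_from_content_type is hypothetical, and it parses the header with the standard library's email.message.Message, since the cgi module used in the excerpt was deprecated and later removed from the Python standard library.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Mimic the header-only rule: charset if given, else ISO-8859-1 for text."""
    if not content_type:
        return None

    # Let the standard library split "type/subtype; key=value" parameters.
    msg = Message()
    msg['content-type'] = content_type

    charset = msg.get_param('charset')
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")

    # No charset given: any text type falls back to ISO-8859-1.
    if 'text' in msg.get_content_type():
        return 'ISO-8859-1'
    return None


print(encoding_from_content_type('text/html; charset=gb2312'))  # gb2312
print(encoding_from_content_type('text/html'))                  # ISO-8859-1
```

The second call is the case that causes trouble: the server sends no charset, so the fallback is used regardless of what the page body actually contains.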

In fact, Requests does provide a way to get the encoding from the content; it is just not used by default. See utils.py:

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
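
To illustrate what content-based sniffing does, here is a minimal standalone sketch that reuses the first regular expression from the excerpt. The helper name sniff_meta_charset and the 4096-byte scan window are assumptions made for illustration. The bytes are decoded as latin-1 before matching, because a str pattern cannot be applied to bytes on Python 3 and latin-1 maps every byte to a character.

```python
import re

# Same pattern as charset_re in the excerpt above.
META_CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)


def sniff_meta_charset(body, window=4096):
    """Return the charset declared in a <meta> tag near the top of body, if any."""
    # latin-1 never fails to decode, and the ASCII markup survives unchanged.
    head = body[:window].decode('latin-1')
    match = META_CHARSET_RE.search(head)
    return match.group(1) if match else None


page = b'<html><head><meta charset="gb2312"></head><body></body></html>'
print(sniff_meta_charset(page))                     # gb2312
print(sniff_meta_charset(b'<html>no meta</html>'))  # None
```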

It also provides encoding detection based on chardet. See models.py:

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']
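
chardet guesses the encoding statistically from the raw bytes. As a rough, dependency-free stand-in, the sketch below tries a list of candidate encodings in strict mode and returns the first one that decodes cleanly. This is a different and much cruder technique than chardet's, and both the name guess_by_trial and the candidate list are hypothetical.

```python
def guess_by_trial(body, candidates=('utf-8', 'gb18030')):
    """Return the first candidate encoding that decodes body without errors."""
    for name in candidates:
        try:
            body.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_by_trial('中文网页'.encode('utf-8')))   # utf-8
print(guess_by_trial('中文网页'.encode('gb2312')))  # gb18030
```

UTF-8 is tried first because it is strict enough that text in other encodings rarely decodes as valid UTF-8 by accident. gb18030 is listed instead of gb2312 because it is a superset that also decodes GB2312 and GBK data.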

So how can this be fixed? First, let's look at some examples:

>>> r = requests.get('http://cn.python-requests.org/en/latest/')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> requests.utils.get_encodings_from_content(r.content)
['utf-8']

>>> r = requests.get('http://reader.360duzhe.com/2013_24/index.html')
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'gb2312'
>>> requests.utils.get_encodings_from_content(r.content)
['gb2312']
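
Both sessions show the same mismatch: the header-based guess is ISO-8859-1, while the page itself declares a different encoding. The short sketch below uses a made-up four-character sample string rather than a real page to show why that mismatch produces garbled text.

```python
body = '中文网页'.encode('gb2312')   # sample bytes standing in for a page body

wrong = body.decode('ISO-8859-1')   # what the header-only guess yields
right = body.decode('gb2312')       # what the page's declared charset yields

print(len(body), len(wrong), len(right))  # 8 8 4
print(right == '中文网页')                # True
```

Decoding as ISO-8859-1 never raises an error, because every byte maps to some character. Each two-byte Chinese character simply turns into two unrelated Latin characters, so the failure is silent.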

After some digging, this problem can be solved with a monkey patch like this:

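import requests

def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        # ISO-8859-1 is only the header fallback, so treat it as "unknown".
        if _content and self.encoding == 'ISO-8859-1':
            # Clear the marker first: apparent_encoding reads self.content,
            # which would otherwise re-enter this branch and recurse.
            self.encoding = None
            # The helper uses str patterns, so decode first on Python 3;
            # latin-1 maps every byte to a character and cannot fail.
            encodings = requests.utils.get_encodings_from_content(
                _content.decode('latin-1'))
            detected = encodings[0] if encodings else self.apparent_encoding
            # Re-encode the body as UTF-8 and record that as the encoding,
            # so that r.text decodes the rewritten bytes correctly.
            _content = _content.decode(detected or 'utf-8', 'replace').encode('utf-8')
            self._content = _content
            self.encoding = 'utf-8'
        return _content

    requests.models.Response.content = property(content)

monkey_patch()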
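
If patching Response.content feels too invasive, a per-response helper is one possible alternative. The sketch below is an illustration, not part of Requests: use_detected_encoding is a hypothetical name, and it relies only on the encoding and apparent_encoding attributes shown earlier. A SimpleNamespace stands in for a real response so the example runs without network access.

```python
from types import SimpleNamespace


def use_detected_encoding(response):
    """Replace the ISO-8859-1 fallback (or a missing encoding) with the detected one."""
    current = response.encoding
    if current is None or current.upper() == 'ISO-8859-1':
        detected = response.apparent_encoding
        if detected:
            response.encoding = detected
    return response


# Stand-in object carrying the two attributes the helper touches.
fake = SimpleNamespace(encoding='ISO-8859-1', apparent_encoding='gb2312')
print(use_detected_encoding(fake).encoding)  # gb2312
```

With a real response, the helper would be called before reading r.text, since r.text is decoded using r.encoding. Like the monkey patch, it cannot tell a genuine ISO-8859-1 page from the fallback case, so it overrides both.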