Fixing Garbled Chinese Pages with Python's requests HTTP Library!

Chinese mojibake in Python is a huge pitfall; I've lost count of how many problems it has caused me. Fortunately, after repeatedly summarizing what I learned, I now run into garbled text less and less often, and when it does appear I can usually fix it quickly. I actually solved this particular problem back in July; today I'm writing it up to share with everyone.

I've written several crawlers recently and got familiar with the requests library, which is far more pleasant to use than Python's built-in urllib API. Its one flaw: the library doesn't seem very friendly to Chinese, and some pages come back garbled, while switching to urllib makes the problem disappear. Since requests ultimately uses urllib3 as its underlying transport adapter and merely wraps the raw data urllib3 reads in a friendlier interface, the problem has to lie in requests itself! So I decided to read the library's source code and fix this mojibake problem, and, along the way, strengthen my own understanding of HTTP and Python.

Starting from the public API, I read the code line by line, trying to see where the problem was. The whole process was slow; it took me roughly five days to read through the library's source. Wherever I didn't understand something, I printed my own log messages to help me follow along.

Here is how I finally found the problem:

>>> req = requests.get('http://www.jd.com')
>>> req
<Response [200]>
>>> print req.text[:100]
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 770 <==> ISO-8859-1
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 781 <==> ISO-8859-1
<!DOCTYPE html>
<html>
<head>
<meta charset="gbk" />
<title>¾©¶«(JD.COM)-×ÛºÏÍø¹ºÊ×Ñ¡-ÕýÆ·µÍ¼Û¡¢Æ·ÖÊ    # garbled output here
>>> dir(req)
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

(The FILE/LINE lines come from the debug logging I added to models.py while reading the source; note that they already reveal the encoding in use: ISO-8859-1.)

req has a content attribute as well as a text attribute. Let's look at content first:

>>> print req.content[:100]
<!DOCTYPE html>
<html>
<head>
<meta charset="gbk" />
<title>¾©¶«(JD.COM)-؛ºЍ닗ѡ-ֽƷµͼۡ¢Ʒ׊
>>>
>>> print req.content.decode('gbk')[:100]
<!DOCTYPE html>
<html>
<head>
<meta charset="gbk" />
<title>京東(JD.COM)-綜合網購首選-正品低價、品質保障、配送及時、輕鬆購物!</
# The page is gbk-encoded while the Linux terminal is utf-8, so printing the raw
# bytes is of course garbled; decode them first and the title displays correctly.

However, the same approach does not work on the text attribute:

>>> print req.text[:100]
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 770 <==> ISO-8859-1
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 781 <==> ISO-8859-1
<!DOCTYPE html>
<html>
<head>
<meta charset="gbk" />
<title>¾©¶«(JD.COM)-×ÛºÏÍø¹ºÊ×Ñ¡-ÕýÆ·µÍ¼Û¡¢Æ·ÖÊ
>>> print req.text.decode('gbk')[:100]
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 770 <==> ISO-8859-1
FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc,  LINE: 781 <==> ISO-8859-1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-63: ordinal not in range(128)
# Decoding the text attribute raises an error: text is already a unicode object
# (wrongly decoded as ISO-8859-1), and in Python 2 calling .decode() on unicode
# first implicitly encodes it with the ascii codec, which fails on non-ascii characters.
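Incidentally, since ISO-8859-1 maps every byte to exactly one code point, that wrong decode is lossless and can simply be reversed. A small sketch of my own (not part of the original session):

>>> # undo the wrong ISO-8859-1 decode to recover the raw bytes, then decode as gbk
>>> fixed = req.text.encode('ISO-8859-1').decode('gbk')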

Let's look at the source code of these two properties:

# requests/models.py
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

        except AttributeError:
            self._content = None

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content
# requests/models.py
@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

From the docstrings and the source we can see that content is the raw byte string read back by urllib3, while text merely tries to decode content into unicode using some encoding. The jd.com page is gbk-encoded, and that is exactly where things go wrong.

>>> req.apparent_encoding; req.encoding
'GB2312'
'ISO-8859-1'

Neither of these two encodings matches the page's actual encoding, yet text goes ahead and decodes content with the wrong one. Of course the result is broken!
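The mismatch is easy to reproduce by hand. This illustration of mine (printed on a utf-8 terminal) shows how the gbk bytes of the title's first two characters turn into exactly the mojibake seen above when read as ISO-8859-1:

>>> title = u'京东'.encode('gbk')       # gbk bytes of the title's first two characters
>>> print title.decode('ISO-8859-1')   # read the very same bytes as ISO-8859-1
¾©¶«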

Let's see how req's two encodings are obtained:

# requests/models.py
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library"""
    return chardet.detect(self.content)['encoding']

A side note: the encoding chardet detects is not guaranteed to be correct; it only comes with a certain confidence. For the jd.com page the encoding is gbk, but chardet reports GB2312, and although the two encodings are compatible, decoding a gbk-encoded page's byte string as GB2312 can still hit runtime errors!
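For reference, running the detection by hand looks roughly like this (the confidence value here is illustrative):

>>> import chardet
>>> chardet.detect(req.content)
{'confidence': 0.99, 'encoding': 'GB2312'}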

The code that obtains encoding lives here:

# requests/adapters.py
def build_response(self, req, resp):
    """Builds a :class:`Response <requests.Response>` object from a urllib3
    response. This should not be called from user code, and is only exposed
    for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`

    :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
    :param resp: The urllib3 response object.
    """
    response = Response()

    # Fallback to None if there's no status_code, for whatever reason.
    response.status_code = getattr(resp, 'status', None)

    # Make headers case-insensitive.
    response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))

    # Set encoding.
    response.encoding = get_encoding_from_headers(response.headers)
    # .......

The encoding comes from the get_encoding_from_headers(response.headers) function, so let's look at that next!

# requests/utils.py
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

See it? The code derives the encoding solely from the HTTP response headers; if the response specifies no charset, it simply returns 'ISO-8859-1'.
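A quick check of that behavior, in a session of my own, importing the function straight from the source shown above:

>>> from requests.utils import get_encoding_from_headers
>>> get_encoding_from_headers({'content-type': 'text/html; charset=gbk'})
'gbk'
>>> get_encoding_from_headers({'content-type': 'text/html'})
'ISO-8859-1'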

Let's capture the traffic and see what the HTTP response actually contains:

As you can see, the response header specifies only the content type, with no encoding (nowadays the page encoding is usually declared directly in the HTML). So the function returns 'ISO-8859-1' straight away.
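You don't even need a packet sniffer: the same header can be read off the response object. Assuming a charset-less jd.com response like the one captured, it would look something like this:

>>> req.headers['content-type']
'text/html'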

You may ask: why does the author handle it this way by default? Is this a bug? In fact, the author wrote the library in strict accordance with the HTTP standard. Chapter 16 on internationalization in "HTTP: The Definitive Guide" notes that if the Content-Type field of an HTTP response specifies no charset, the page defaults to 'ISO-8859-1'. That is certainly fine for English pages, but Chinese pages end up garbled!

Solutions:

Having located the problem, we now have two ways to fix it.

1. Modify the get_encoding_from_headers function so that it detects the page encoding via a regular-expression match. Since pages nowadays declare their charset in the HTML code itself, an encoding obtained by regex match is reliable (see the first sketch after this list).

2. Since content is the raw byte string of the HTTP response, we can simply use it directly: decode content into unicode with the page's own encoding (see the second sketch below)!
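Here is a minimal sketch of option 1, assuming we add a body-scanning fallback; the function name get_encoding_from_content and the regex are my own illustration, not the library's actual patch:

# Option 1 (sketch): when the headers carry no charset, fall back to the
# charset declared inside the HTML itself.
import re

def get_encoding_from_content(content):
    """Scan raw HTML for a <meta ... charset=...> declaration."""
    charset_re = re.compile(r'<meta[^>]*?charset=["\']?([\w-]+)', re.I)
    match = charset_re.search(content)
    if match:
        return match.group(1)   # e.g. 'gbk' for the jd.com page
    return None

And option 2 in practice, a short sketch assuming we already know the page is gbk-encoded; either decode content by hand, or set the encoding before touching text, as the text docstring itself recommends:

# Option 2 (sketch): bypass the wrong guess and handle the raw bytes yourself.
import requests

req = requests.get('http://www.jd.com')

# a) decode content directly with the page's real encoding
page = req.content.decode('gbk')

# b) or tell requests the correct encoding before accessing text
req.encoding = 'gbk'
page = req.text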
