Garbled Text When Scraping Web Pages with Requests in Python 3

Date: 2019-11-06


1. Problem 1

import requests
r = requests.get(url)  # url: address of the page being scraped
print(r.text)

The output is garbled!

Analysis

with open('a.html', 'wb') as f:
    f.write(r.content)

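Opened in an editor, the file turns out not to be text. Running the command file a.html identifies it as gzip data, so the response body was gzip-compressed.
Do I really have to detect the format and decompress it myself?
A quick search shows that requests supports automatic gzip decompression, so why does it not work here? Is the encoding reported by the site wrong?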

print(response.headers['content-encoding'])

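The value printed is indeed "gzip", so why was the body not decompressed automatically?
After a painful round of digging, I finally found that the value of response.headers['content-encoding'] is actually "gzip " with several trailing spaces, which is why it is not recognized as gzip.
Who is to blame here, the requests library or the website?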
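The trailing spaces are easy to miss with a plain print, because they are invisible in the output. Here is a minimal sketch of a more defensive check, assuming the header value may carry stray whitespace or mixed case. The helper name is_gzip_encoded and the sample dict are hypothetical; a plain dict stands in for response.headers.

```python
def is_gzip_encoded(headers):
    """Return True if Content-Encoding names gzip, ignoring whitespace and case."""
    value = headers.get("content-encoding", "")
    return value.strip().lower() == "gzip"


# Hypothetical sample data imitating the misbehaving site's header.
sample_headers = {"content-encoding": "gzip   "}

# repr() shows the quotes, which makes the trailing spaces visible.
print(repr(sample_headers["content-encoding"]))
print(is_gzip_encoded(sample_headers))
```

Printing header values with repr() while debugging is a cheap way to spot this kind of problem early.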

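Solutions

1. Remove "Accept-Encoding": "gzip, deflate" from the request headers

This returns the plain, uncompressed data directly. Drawbacks: traffic increases considerably, and the site may not support it.

2. Decompress and decode the body yourself to get readable text

This is the approach used in this article.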
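Both options can be sketched with the standard library alone. The helper names identity_headers and decode_body are hypothetical. For option 1 the sketch sends "Accept-Encoding: identity" rather than removing the header, which is a substitution on my part: it is the usual way to ask a server for an uncompressed body, and the result would be passed as requests.get(url, headers=identity_headers()). For option 2 the sketch checks the gzip magic bytes instead of trusting the header.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def identity_headers(base_headers=None):
    """Option 1: copy the headers and ask the server not to compress the body."""
    headers = dict(base_headers or {})
    headers["Accept-Encoding"] = "identity"
    return headers


def decode_body(raw, encoding="utf-8"):
    """Option 2: decompress the raw bytes if they look like gzip, then decode."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # errors="replace" keeps going on bad bytes instead of raising.
    return raw.decode(encoding, errors="replace")


# Simulate a body that the HTTP client failed to decompress.
compressed = gzip.compress("中文網頁".encode("utf-8"))
print(decode_body(compressed))
```

The encoding argument still has to be right for the text to be readable, which is the subject of the second problem.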

2. Problem 2

print(response.encoding)

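The page encoding is reported as 'ISO-8859-1'. What on earth is that?
Chapter 16 (Internationalization) of HTTP: The Definitive Guide notes that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as 'ISO-8859-1'. requests applies the same default to text responses whose Content-Type carries no charset. That is fine for English pages, but Chinese pages come out garbled!

Analysis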
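The effect of the wrong default can be reproduced without any network access. This small sketch, using a made-up sample string, encodes Chinese text as UTF-8, decodes it as ISO-8859-1 the way a client would under the default, and then reverses the mistake. The reversal works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost.

```python
original = "中文"

# What the server actually sends: UTF-8 bytes.
raw = original.encode("utf-8")

# What the client produces when it assumes ISO-8859-1: one character per byte.
garbled = raw.decode("iso-8859-1")

# The mistake is reversible: re-encode with the wrong codec, decode with the right one.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered == original)
```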

1. print(r.apparent_encoding)  
2. get_encodings_from_content(r.content)

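The first reportedly relies on chardet, a third-party library for detecting character sets, and detection can be slow. It was not installed here, so the result was None.
The second appears in many answers online, but those are presumably written for Python 2: in Python 3, r.content is of type bytes, so the call should be changed to get_encodings_from_content(r.text), and that only works once the gzip body has been decompressed.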
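To make the bytes versus str point concrete, here is a simplified stand-in for the kind of lookup get_encodings_from_content performs: searching the page text for a charset declared in a meta tag. This is not the requests implementation; the regular expression and the name sniff_meta_charset are assumptions for illustration. The bytes are first decoded as ISO-8859-1, which never fails, so the ASCII markup can be searched as text.

```python
import re

# Simplified pattern: a charset value inside a <meta ...> tag.
CHARSET_RE = re.compile(r'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(html_text):
    """Return every charset name declared in <meta> tags of the given text."""
    return CHARSET_RE.findall(html_text)


# Hypothetical page body as raw bytes, as r.content would deliver it.
raw = b'<html><head><meta charset="gb2312"></head><body>...</body></html>'

# A str pattern cannot search bytes, so decode first; ISO-8859-1 accepts any byte.
text = raw.decode("iso-8859-1")
print(sniff_meta_charset(text))
```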

3. Source code

The code is as follows, based on Python 3.5.

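import gzip
from requests.utils import get_encodings_from_content

# Guess the page encoding
def guess_response_encoding(response):
    # ISO-8859-1 is the fallback used when the header declares no charset
    if response.encoding == 'ISO-8859-1':
        if response.content[:2] == b"\x1f\x8b":  # gzip magic number: body was not auto-decompressed
            content = gzip.decompress(response.content)
            content = str(content, response.encoding)
        else:
            content = response.text
        # Prefer a charset declared inside the page itself
        encoding = get_encodings_from_content(content)
        if encoding:
            response.encoding = encoding[0]
        # Otherwise fall back to detection from the raw bytes
        elif response.apparent_encoding:
            response.encoding = response.apparent_encoding
    print("guess encoding: ", response.encoding)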