Python 2.7.3 urllib2.urlopen: fixing garbled text when fetching a web page

Date: 2019-12-11


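The garbled text appears because of a bug on the web server: it always uses one fixed encoding instead of sending the content in the encoding requested by the client's request headers.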
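To make the symptom concrete, here is a minimal sketch of how such garbling arises. The sample text and the choice of GBK and Latin-1 are assumptions made for illustration only; the point is that bytes produced with one codec and decoded with another give unreadable text.

```python
# -*- coding: utf-8 -*-
# Illustrative only: the sample text and the GBK/Latin-1 pairing are assumptions.
original = u"你好"

# Bytes as a server that always answers in GBK would send them.
raw = original.encode("gbk")

# Decoding with the wrong codec does not fail, but the result is garbled.
garbled = raw.decode("latin-1")

# Decoding with the codec the server actually used restores the text.
restored = raw.decode("gbk")

print(garbled == original)
print(restored == original)
```

The client therefore has to find out which codec the server really used, which is what the rest of this post is about.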

Solution: use chardet to guess the page's encoding.
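As a rough sketch of the idea, the helper below guesses the encoding and decodes in one step. The name `guess_and_decode` and its `detect` and `fallback` parameters are hypothetical and not part of chardet; the only chardet call used is `chardet.detect`, which returns a dictionary whose `"encoding"` entry may be `None` when no guess could be made. Falling back to UTF-8 and decoding with `"replace"` are choices made here, not something the original post prescribes.

```python
# -*- coding: utf-8 -*-
def guess_and_decode(raw, detect=None, fallback="utf-8"):
    """Decode raw bytes using a guessed encoding (hypothetical helper).

    detect is any callable with the same shape as chardet.detect;
    it defaults to chardet.detect itself.
    """
    if detect is None:
        import chardet  # third-party package, imported only when needed
        detect = chardet.detect

    guess = detect(raw)
    # The guessed encoding can be None, so fall back to an assumed default.
    encoding = guess.get("encoding") or fallback
    # "replace" keeps going on undecodable bytes instead of raising.
    return raw.decode(encoding, "replace"), encoding
```

Passing the detector in as a parameter keeps the helper easy to test without network access or chardet installed; in normal use it would be called as `guess_and_decode(response.read())`.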

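1. Download the chardet Python source package from the official chardet site.

2. Extract the chardet directory from the source package into the project folder.

3. Bring it in with import chardet, and then: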

# -*- coding: utf-8 -*-
import sys
import urllib2

import chardet


# Wrapped in a function so that the return statements are valid;
# the name fetch_html and the url parameter are arbitrary.
def fetch_html(url="http://www.baidu.com"):
    # Try to download the page.
    try:
        response = urllib2.urlopen(url)
    except Exception as e:
        print "Error: failed to download the page: " + str(e)
        return None

    if response.code != 200:
        print "Error: expected status code 200 but got " + str(response.code)
        return None

    if response.msg != "OK":
        print "Error: expected status message OK but got " + response.msg
        return None

    # Read the HTML as raw bytes.
    try:
        htmlCode = response.read()
    except Exception as e:
        print "Error: failed to read the page from the response stream: " + str(e)
        return None

    # Handle the page encoding.
    try:
        # Guess the encoding.
        htmlCharsetGuess = chardet.detect(htmlCode)
        htmlCharsetEncoding = htmlCharsetGuess["encoding"]
        # Decode to unicode.
        htmlCode_decode = htmlCode.decode(htmlCharsetEncoding)
        # Get the system encoding.
        currentSystemEncoding = sys.getfilesystemencoding()
        # Re-encode in the system encoding so that the result can be
        # processed in Python, for example:
        #     key = "你好"
        #     text = "xxxx你好yyyy"
        #     keyPos = text.find(key)
        # Without re-encoding, this step may raise an error.
        htmlCode_encode = htmlCode_decode.encode(currentSystemEncoding)
    except Exception as e:
        print "Error: failed while handling the page encoding: " + str(e)
        return None

    # htmlCode_encode is the result.
    return htmlCode_encode
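The comment in the listing about searching for a key inside the page text can be illustrated with a small sketch. The sample strings come from that comment; the choice of UTF-8 and GBK is an assumption made for illustration. A search only behaves as expected when the key and the text are the same type and, for byte strings, in the same encoding.

```python
# -*- coding: utf-8 -*-
key = u"你好"
text = u"xxxx你好yyyy"

# Both sides are unicode: the key is found after the four leading characters.
pos_unicode = text.find(key)

# Both sides are bytes in the same encoding: the key is found as well.
pos_bytes = text.encode("utf-8").find(key.encode("utf-8"))

# Bytes in different encodings: the byte sequences differ, so nothing matches.
pos_mismatch = text.encode("gbk").find(key.encode("utf-8"))

print(pos_unicode, pos_bytes, pos_mismatch)
```

This is why the listing brings the downloaded page into one known encoding before any further string processing.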