Python Encoding Conversion and Chinese Text Handling

Date: 2019-11-24

Unicode in Python is confusing and rather hard to get a grip on. Unicode, GBK, and GB2312 are coded character sets; UTF-8 is one of the ways Unicode is implemented as bytes.

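decode interprets an ordinary (byte) string according to the encoding named in its argument and produces the corresponding unicode object. encode goes the other way: it turns a unicode object into a byte string in the named encoding.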
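To make the direction of each call concrete, here is a minimal sketch. It assumes Python 3, where the text type is str (the unicode of Python 2) and the byte string type is bytes; the variable names are illustrative only.

```python
# Assumes Python 3: str is Unicode text, bytes is encoded data.
text = "天氣預報"

# encode: Unicode text -> bytes in a specific encoding
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same text yields different byte sequences per encoding.
print(len(utf8_bytes))  # these characters take 3 bytes each in UTF-8
print(len(gbk_bytes))   # and 2 bytes each in GBK

# decode: bytes -> Unicode text, using the encoding the bytes were written in
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text
```

Decoding with the wrong codec either raises an error or silently produces garbled text, which is why the encoding of the input has to be known.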

A Chinese encoding problem I ran into while writing Python:

➜  /test sudo vim test.py
#!/usr/bin/python
#-*- coding:utf-8 -*-
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Pretend to be a browser
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"    # weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'    # regex that captures the content we want from the page
    # Build an opener and install it as the global opener
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode the UTF-8 bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')    # the result is discarded, so data is still unicode
    b = '天氣預報'    # a byte string (str), UTF-8 encoded per the coding line
    print type(b)
    c = b + '\n' + data    # str + unicode: Python 2 implicitly decodes b as ASCII
    print c
weather()

➜  /test sudo python test.py
<type 'unicode'>
<type 'str'>
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    weather()
  File "test.py", line 28, in weather
    c = b + '\n' + data
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
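The traceback comes from an implicit conversion: when Python 2 concatenates a str with a unicode object, it first decodes the str with the default codec, ASCII, and the UTF-8 bytes of the Chinese label fall outside that range. The sketch below performs that same decode step explicitly so the failure can be seen in isolation. It assumes Python 3 syntax, and the variable names are illustrative.

```python
# Assumes Python 3. In Python 2, the literal '天氣預報' in a UTF-8 source
# file is already a byte string like this one.
label_bytes = "天氣預報".encode("utf-8")

try:
    # This is the decode that Python 2 attempts implicitly for str + unicode.
    label_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were actually written in succeeds,
# and the result can be joined with other text safely.
label_text = label_bytes.decode("utf-8")
combined = label_text + "\n" + "汕頭市今天實況"
print(combined)
```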

Solution:

➜  /test sudo vim test.py
#!/usr/bin/python
#-*- coding:utf-8 -*-
import sys
reload(sys)
# Python 2.5 removes sys.setdefaultencoding after initialization, so sys has to be reloaded first
sys.setdefaultencoding('utf-8')
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Pretend to be a browser
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"    # weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'    # regex that captures the content we want from the page
    # Build an opener and install it as the global opener
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode the UTF-8 bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')    # the result is discarded, so data is still unicode
    b = '天氣預報'    # a byte string (str), UTF-8 encoded per the coding line
    print type(b)
    c = b + '\n' + data    # b is now implicitly decoded as UTF-8 instead of ASCII
    print c
weather()

Output after the change:

➜  /test sudo python test.py
<type 'unicode'>
<type 'str'>

天氣預報
汕頭市今天實況：20度多雲，溼度：57%，東風：2級。白天：20度,多雲。夜間：晴，13度，天氣偏涼了，墨跡天氣建議您穿上厚些的外套或是保暖的羊毛衫，年老體弱者能夠選擇保暖的搖粒絨外套。

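My own impression is that the "universal fixes" for garbled Chinese text found online are wrong: the right fix depends on the types involved. I ran into exactly this problem recently, tried many suggestions from the web without success, and only after reading up on it myself did I realize that the remedy has to match the cause.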
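One way to match the remedy to the type, without changing a process-wide default, is to convert each value explicitly at the point where it enters the program. The sketch below assumes Python 3; to_text is a hypothetical helper introduced only for illustration, not something from the scripts above.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Each input is handled according to what it actually is.
title = to_text("天氣預報".encode("utf-8"))            # UTF-8 bytes
body = to_text("多雲".encode("gbk"), encoding="gbk")   # GBK bytes
note = to_text("20度")                                 # already text

# Once everything is text, joining is safe.
report = "\n".join([title, body, note])
print(report)
```

The design choice is to decode once at the boundary, keep everything as text inside the program, and encode again only when writing output.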

This is a Python script that fetches the source of a web page:

➜  /test sudo cat file.py
#!/usr/bin/python
#_*_ coding:UTF-8 _*_
import urllib,urllib2
import re
url = 'http://sports.sohu.com/nba.shtml'  # URL to fetch
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()  # raw bytes, still in the page's own encoding
#response = unicode(response,'GBK').encode('UTF-8')
print type(response)
print response

Problem encountered:

When fetching a Chinese web page, the Chinese text comes out garbled when printed:

➜  /test sudo python file.py
special.wait({
itemspaceid : 99999,
form:"bigView",
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function(){
},
onAfterRender: function(){
},
isCloseBtn:true//�Ƿ��йرհ�ť
}
});

Solution:

The page source declares charset=GBK, so the bytes have to be converted in Python:

➜  /test sudo cat file.py
#!/usr/bin/python
#_*_ coding:UTF-8 _*_
import urllib,urllib2
import re
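url = 'http://sports.sohu.com/nba.shtml'  # URL to fetch
par = '20180125.*\">(.*?)</a></li>'
req = urllib2.Request(url)
response = urllib2.urlopen(req).read()  # raw bytes, GBK encoded
# Decode the GBK bytes to unicode, then re-encode as UTF-8 for the terminal
response = unicode(response,'GBK').encode('UTF-8')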
print type(response)
print response

➜  /test sudo python file.py
special.wait({
itemspaceid : 99999,
form:"bigView",
adsrc : 200,
order : 1,
max_turn : 1,
spec :{
onBeforeRender: function(){
},
onAfterRender: function(){
},
isCloseBtn:true//是否有關閉按鈕
}
});

The garbled Chinese is now fixed.
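Rather than reading the charset off the page source by hand, the declared charset can be looked up in the downloaded bytes. The following is a minimal sketch under stated assumptions: it is written for Python 3, decode_page is a hypothetical helper, and the sample bytes stand in for a real response so that no network access is needed.

```python
import re

def decode_page(raw, fallback="utf-8"):
    """Decode raw HTML bytes using the charset named in the page, if any."""
    # Search the start of the document for charset=...; the name is ASCII.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else fallback
    # errors="replace" substitutes a marker for invalid bytes instead of raising.
    return raw.decode(encoding, errors="replace")

# Hypothetical page content standing in for the downloaded response.
sample = '<meta charset="GBK"><p>是否有關閉按鈕</p>'.encode("gbk")
html = decode_page(sample)
print(html)
```

If the page names a charset that Python does not know, raw.decode raises LookupError, so a real script would need to decide how to handle that case.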

# Printing dicts and lists that contain Chinese text (Python 2)
import json

# Print a dict
person = {'name': '張三'}
print json.dumps(person, encoding="UTF-8", ensure_ascii=False)
# Output: {"name": "張三"}

# Print a list
people = [{'name': '張三'}]
print json.dumps(people, encoding="UTF-8", ensure_ascii=False)
# Output: [{"name": "張三"}]
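For comparison, here is a sketch of the same idea assuming Python 3, where strings are already Unicode and json.dumps is called without an encoding argument. The variable names are illustrative.

```python
import json

people = [{"name": "張三"}]

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(people)
# ensure_ascii=False keeps the characters as-is in the returned str.
readable = json.dumps(people, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms parse back to the same data.
assert json.loads(escaped) == json.loads(readable) == people
```

The escaped form is safe to send over any ASCII-only channel, while the readable form is easier to inspect in logs and on the terminal.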