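Python 2 encoding issues

After reading the two blog posts below, I feel I understand this a little better, so I am writing it down.

http://www.cnblogs.com/yuanchenqi/articles/5938733.html

http://www.cnblogs.com/yuanchenqi/articles/5956943.html

A few points about encoding and decoding in Python 2: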

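(1) There are two string types, unicode and str.

unicode --> str is encoding. It can be written as encode('utf8'), or str() can be used to turn a unicode object into a str.

In the other direction, str --> unicode is decoding. Either unicode(s) or s.decode('utf8') works.

However, the system default encoding in Python 2 is ASCII. The coding declaration in a .py file only tells the interpreter how to read the source file, so even if the file declares utf8, str(u'\uXXXX') and unicode() still encode and decode with ASCII by default,

unless code changes the system default encoding.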
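The two directions described above can be sketched with explicit codecs. This is only an illustration, not part of the original posts: it is written with b'' and u'' literals so that it runs on both Python 2.7 and Python 3, and try_ascii_encode is a hypothetical helper name.

```python
# Text object: unicode in Python 2, str in Python 3.
text = u'I am \u718a\u732b'

# Encoding: text -> bytes (str in Python 2).
data = text.encode('utf-8')

# Decoding: bytes -> text, using the same codec.
round_trip = data.decode('utf-8')


def try_ascii_encode(value):
    """Hypothetical helper: encode with ASCII, return None on failure."""
    try:
        return value.encode('ascii')
    except UnicodeEncodeError:
        return None


ascii_ok = try_ascii_encode(u'hello world')   # pure ASCII text encodes fine
ascii_fail = try_ascii_encode(text)           # non-ASCII text cannot be encoded

print(repr(data))
print(round_trip == text)
print(repr(ascii_ok), repr(ascii_fail))
```

The ASCII failure is the same one that Python 2 hits when str() is called on a unicode object containing Chinese characters while the default encoding is still ASCII.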

If the variable contains only English text, the corresponding encode/decode test looks like this:

import sys
print(sys.getdefaultencoding())

# a1 is an English str
a1 = "hello world"
print "-----a1--------"
print a1
print type(a1)
print repr(a1)

# str(str) returns the str unchanged
print "-----b1--------"
b1 = str(a1)
print b1
print type(b1)
print repr(b1)

# unicode(str) decodes with ASCII and returns unicode
print "-----c1--------"
c1 = unicode(a1)
print c1
print type(c1)
print repr(c1)

# decode() turns the str into unicode, same result as c1
print "-----d1--------"
d1 = a1.decode('utf8')
print d1
print type(d1)
print repr(d1)

# a2 is an English unicode
print "-----a2--------"
a2 = u'hello world'
print a2
print type(a2)
print repr(a2)

# str(unicode) is equivalent to encoding to str with the default ASCII codec
print "-----b2--------"
b2 = str(a2)
print b2
print type(b2)
print repr(b2)

# unicode(unicode) leaves the value unchanged
print "-----c2--------"
c2 = unicode(a2)
print c2
print type(c2)
print repr(c2)

# encode() encodes to str, same result as b2
print "-----d2--------"
d2 = a2.encode('utf8')
print d2
print type(d2)
print repr(d2)

Output:

ascii
-----a1--------
hello world
<type 'str'>
'hello world'
-----b1--------
hello world
<type 'str'>
'hello world'
-----c1--------
hello world
<type 'unicode'>
u'hello world'
-----d1--------
hello world
<type 'unicode'>
u'hello world'
-----a2--------
hello world
<type 'unicode'>
u'hello world'
-----b2--------
hello world
<type 'str'>
'hello world'
-----c2--------
hello world
<type 'unicode'>
u'hello world'
-----d2--------
hello world
<type 'str'>
'hello world'


Correspondingly, if the variable contains Chinese, the results are as follows:

# -*- coding:utf-8 -*-
import sys
print(sys.getdefaultencoding())

# a1 is a str containing Chinese
a1 = "I am 熊猫"
print "-----a1--------"
print a1
print type(a1)
print repr(a1)

# str(str) returns the str unchanged
print "-----b1--------"
b1 = str(a1)
print b1
print type(b1)
print repr(b1)

# unicode(str) tries to decode with ASCII and fails
# print "-----c1--------"
# c1 = unicode(a1)
# Result: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 5: ordinal not in range(128)
# print c1
# print type(c1)
# print repr(c1)

# decode() turns the str into unicode, which is what c1 was meant to be
print "-----d1--------"
d1 = a1.decode('utf8')
print d1
print type(d1)
print repr(d1)

# a2 is a unicode containing Chinese
print "-----a2--------"
a2 = u'I am 熊猫'
print a2
print type(a2)
print repr(a2)

# str(unicode) tries to encode with the default ASCII codec and fails
# print "-----b2--------"
# b2 = str(a2)
# Result: UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128)
# print b2
# print type(b2)
# print repr(b2)

# unicode(unicode) leaves the value unchanged
print "-----c2--------"
c2 = unicode(a2)
print c2
print type(c2)
print repr(c2)

# encode() encodes to str, which is what b2 was meant to be
print "-----d2--------"
d2 = a2.encode('utf8')
print d2
print type(d2)
print repr(d2)

Output:

ascii
-----a1--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
-----b1--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
-----d1--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----a2--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----c2--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----d2--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'


For the Chinese case, if the system default encoding is changed, str() and unicode() no longer fail:

# -*- coding:utf-8 -*-
import sys
print(sys.getdefaultencoding())
reload(sys)
sys.setdefaultencoding('utf8')
print(sys.getdefaultencoding())

# a1 is a str containing Chinese
a1 = "I am 熊猫"
print "-----a1--------"
print a1
print type(a1)
print repr(a1)

# str(str) returns the str unchanged
print "-----b1--------"
b1 = str(a1)
print b1
print type(b1)
print repr(b1)

# unicode(str) decodes with the default encoding, which is now utf8
print "-----c1--------"
c1 = unicode(a1)
print c1
print type(c1)
print repr(c1)

# decode() turns the str into unicode, same result as c1
print "-----d1--------"
d1 = a1.decode('utf8')
print d1
print type(d1)
print repr(d1)

# a2 is a unicode containing Chinese
print "-----a2--------"
a2 = u'I am 熊猫'
print a2
print type(a2)
print repr(a2)

# str(unicode) encodes with the default encoding, which is now utf8
print "-----b2--------"
b2 = str(a2)
print b2
print type(b2)
print repr(b2)

# unicode(unicode) leaves the value unchanged
print "-----c2--------"
c2 = unicode(a2)
print c2
print type(c2)
print repr(c2)

# encode() encodes to str, same result as b2
print "-----d2--------"
d2 = a2.encode('utf8')
print d2
print type(d2)
print repr(d2)

Output:

ascii
utf8
-----a1--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
-----b1--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
-----c1--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----d1--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----a2--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----b2--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
-----c2--------
I am 熊猫
<type 'unicode'>
u'I am \u718a\u732b'
-----d2--------
I am 熊猫
<type 'str'>
'I am \xe7\x86\x8a\xe7\x8c\xab'
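Another way to get the same conversions, without touching the process-wide default, is to always name the codec at the point of conversion. The following is a minimal sketch under that assumption. The helper names to_text and to_bytes are hypothetical and do not come from the original posts, and the code is written to run on both Python 2.7 and Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with an explicit codec."""
    # In Python 2, bytes is an alias for str, so this check covers both versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding text with an explicit codec."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


decoded = to_text(b'I am \xe7\x86\x8a\xe7\x8c\xab')   # plays the role of c1
encoded = to_bytes(u'I am \u718a\u732b')              # plays the role of b2

print(repr(decoded))
print(repr(encoded))
```

Because the codec is passed explicitly, the result does not depend on what sys.getdefaultencoding() returns.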


(2) In Python 2, when a unicode and a str are concatenated, Python 2 first converts the str to unicode using the default encoding and then joins them.

>>> print u'hello' + '熊猫'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)


Here Python tries to decode '熊猫' to unicode with ASCII by default, and fails. Stating how '熊猫' should be decoded makes it succeed.

>>> print u'hello' + '熊猫'.decode('utf8')
hello熊猫

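The implicit conversion in point (2) can be imitated by decoding the byte string by hand before joining. This is only a sketch of the behaviour, not Python 2's actual implementation. The function name coerce_and_join is hypothetical, and the code runs on both Python 2.7 and Python 3.

```python
# UTF-8 bytes of the two Chinese characters used in the examples above.
raw = b'\xe7\x86\x8a\xe7\x8c\xab'


def coerce_and_join(prefix, raw_bytes, codec):
    """Decode raw_bytes with codec, then join it to the text prefix."""
    try:
        return prefix + raw_bytes.decode(codec)
    except UnicodeDecodeError as exc:
        # exc.reason carries the short message, e.g. "ordinal not in range(128)".
        return 'failed: %s' % exc.reason


with_ascii = coerce_and_join(u'hello', raw, 'ascii')   # what the default does
with_utf8 = coerce_and_join(u'hello', raw, 'utf-8')    # the explicit fix

print(with_ascii)
print(repr(with_utf8))
```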

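(3) When print is used, print A is roughly equivalent to sys.stdout.write(str(A)).

When print is called on a variable var in Python 2.7, some character processing happens before the text is displayed:
If var is a str, its bytes are handed to the terminal for display as they are.
If var is unicode, it is first encoded into a str (the encoding is determined by the encoding of stdout) and then handed to the terminal for display.
When the terminal displays a str whose encoding does not match the encoding configured for the terminal, garbled text is very likely.

The following experiments cover each case.
With a str variable, the results in different terminals: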

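File 1: py2_utf8.py, printing a str in a UTF-8 encoded file
File 2: py2_gbk.py, printing a str in a GBK encoded file
Both files are run in two terminals:
1) In PyCharm, whose console behaves as a terminal with sys.stdout.encoding set to UTF-8.
2) In Windows cmd, whose encoding is GBK.

# -*- coding:utf8 -*-
# py2_utf8.py
a = "I am 熊猫"
print a
print repr(a)
print type(a)

In PyCharm the output is correct. The variable a defined in the file is a UTF-8 encoded byte stream in memory. It is passed through unchanged to the terminal, and the PyCharm terminal decodes it as UTF-8. Encoding and decoding match, so the output is correct.

I am 熊猫
'I am \xe7\x86\x8a\xe7\x8c\xab'
<type 'str'>

In Windows cmd:

c:\Python27>python D:\python_project\for_python2\py2_utf8.py
I am 鐔婄尗
'I am \xe7\x86\x8a\xe7\x8c\xab'
<type 'str'>

The variable in the file is a UTF-8 encoded byte stream, and the terminal decodes it as GBK, so the decoded text is garbled.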

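The garbling seen in cmd can be reproduced without a terminal by decoding the same bytes with two different codecs. This is a sketch that stands in for what the two terminals do. It assumes Python's 'gbk' codec is a fair stand-in for the cmd code page, and it runs on both Python 2.7 and Python 3.

```python
# The six bytes that repr(a) shows after "I am " in the output above.
utf8_bytes = b'\xe7\x86\x8a\xe7\x8c\xab'

# A UTF-8 terminal reads three bytes per character and recovers the text.
as_utf8 = utf8_bytes.decode('utf-8')

# A GBK terminal reads the same bytes two at a time and shows other characters.
as_gbk = utf8_bytes.decode('gbk')

print(repr(as_utf8))
print(repr(as_gbk))
print(len(as_utf8), len(as_gbk))
```

The bytes themselves never change. Only the interpretation differs, which is why two characters turn into three unrelated ones.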

With a unicode variable, the results in different terminals:

1. File stored as UTF-8: py2_utf8_unicode.py
2. File stored as GBK: py2_gbk_unicode.py

# -*- coding:utf8 -*-
# py2_utf8_unicode.py
b = u'I am 熊猫'
print b
print repr(b)
print type(b)

PyCharm result:

I am 熊猫
u'I am \u718a\u732b'
<type 'unicode'>

CMD result:

c:\Python27>python D:\python_project\for_python2\py2_utf8_unicode.py
I am 熊猫
u'I am \u718a\u732b'
<type 'unicode'>

# -*- coding:gbk -*-
# py2_gbk_unicode.py
b = u'I am 熊猫'
print b
print repr(b)
print type(b)

In PyCharm:

C:\Python27\python.exe D:/python_project/for_python2/py2_gbk_unicode.py
I am 熊猫
u'I am \u718a\u732b'
<type 'unicode'>

In CMD:

c:\Python27>python D:\python_project\for_python2\py2_gbk_unicode.py
I am 熊猫
u'I am \u718a\u732b'
<type 'unicode'>

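The unicode experiments print correctly in both terminals because the text is encoded separately for each one. A rough sketch of that step follows, with the two stdout encodings hard-coded as assumptions rather than read from sys.stdout.encoding. It runs on both Python 2.7 and Python 3.

```python
text = u'I am \u718a\u732b'

# What would be sent to a terminal whose stdout encoding is UTF-8 (PyCharm).
for_utf8_terminal = text.encode('utf-8')

# What would be sent to a terminal whose stdout encoding is GBK (Windows cmd).
for_gbk_terminal = text.encode('gbk')

# Each terminal decodes with its own codec and recovers the same text.
shown_in_utf8_terminal = for_utf8_terminal.decode('utf-8')
shown_in_gbk_terminal = for_gbk_terminal.decode('gbk')

print(repr(for_utf8_terminal))
print(repr(for_gbk_terminal))
print(shown_in_utf8_terminal == shown_in_gbk_terminal == text)
```

The byte strings differ, but each matches what its terminal expects. A str variable skips this per-terminal encoding step, which is why it only displays correctly when its bytes already match the terminal's encoding.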

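(4) Chinese display when printing a list or dict

#-*-coding:utf-8 -*-
a = {'name': 'english'}
b = {'name': '中文'}
print "a=", a
print "b=", b

import json
result = json.dumps(b, encoding='UTF-8', ensure_ascii=False)
print "b=", result

Output in PyCharm:

a= {'name': 'english'}
b= {'name': '\xe4\xb8\xad\xe6\x96\x87'}
b= {"name": "中文"}

Printing a dict shows the repr() of each value, so the Chinese str appears as escaped bytes. json.dumps with ensure_ascii=False produces a string that keeps the characters readable.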
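The effect of ensure_ascii can be seen in isolation with text values. This sketch leaves out the encoding= argument used above, because that argument exists only in Python 2's json module. With u'' values the snippet runs on both Python 2.7 and Python 3.

```python
import json

b = {u'name': u'\u4e2d\u6587'}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(b)

# ensure_ascii=False keeps the characters themselves in the result.
readable = json.dumps(b, ensure_ascii=False)

print(repr(escaped))
print(repr(readable))
```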
