Key Points of Character Encoding Conversion in Python

Date: 2019-11-19

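Python 2 has two kinds of strings: str objects and unicode objects. Both can hold text, but they are different types. A str holds the encoded bytes of the characters, while a unicode holds the characters themselves. This distinction matters, and it is the reason encode and decode exist.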

The meaning of encode and decode in Python can be pictured as follows:

                  encode
    unicode -----------------> str
    unicode <----------------- str
                  decode
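
As a minimal sketch of the diagram, the round trip below encodes a unicode string to UTF-8 bytes and decodes it back. The variable names are illustrative. The explicit u'' and b'' prefixes are an assumption made so that the snippet also runs on Python 3, where the same two roles are played by str and bytes.

```python
# Explicit u'' and b'' prefixes keep this runnable on Python 2.7 and Python 3.
text = u'\u957f\u57ce'             # unicode object holding two Chinese characters

encoded = text.encode('utf-8')     # unicode -> byte string
decoded = encoded.decode('utf-8')  # byte string -> unicode

print(encoded == b'\xe9\x95\xbf\xe5\x9f\x8e')  # True
print(decoded == text)                          # True
print(type(text) is type(encoded))              # False: two distinct types
```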
Common usages:
str_string.decode('codec') converts str_string to a unicode_string; codec is the encoding of the source str_string.
unicode_string.encode('codec') converts unicode_string to a str_string; codec is the encoding of the target str_string.
str_string.decode('from_codec').encode('to_codec') converts a str_string from one encoding to another.
For example (here the console encoding is GB2312, so the literal is stored as GB2312 bytes):

>>> t = '长城'
>>> t
'\xb3\xa4\xb3\xc7'
>>> t.decode('gb2312').encode('utf-8')
'\xe9\x95\xbf\xe5\x9f\x8e'
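
The decode-then-encode chain can be wrapped in a small helper. This is only a sketch: `transcode` is a hypothetical name, not a standard library function, and the byte literals are the GB2312 and UTF-8 values from the session above.

```python
def transcode(data, from_codec, to_codec):
    """Re-encode a byte string from one codec to another, going through unicode."""
    return data.decode(from_codec).encode(to_codec)


gb_bytes = b'\xb3\xa4\xb3\xc7'   # GB2312 bytes of the two characters
utf8_bytes = transcode(gb_bytes, 'gb2312', 'utf-8')

print(utf8_bytes == b'\xe9\x95\xbf\xe5\x9f\x8e')  # True
```

Going through unicode as the intermediate form means the helper needs no per-pair conversion tables; any two codecs that Python knows can be combined.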

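str_string.encode('codec') first uses the system default codec to convert str_string to a unicode_string, then uses the codec argument of encode to produce the final str_string. It is equivalent to str_string.decode('sys_codec').encode('codec'). Because the default codec is normally ASCII, this raises UnicodeDecodeError as soon as str_string contains non-ASCII bytes.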
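
The hidden decode step can be made visible by writing it out explicitly. The sketch below assumes the default codec is ASCII (the stock Python 2 setting) and spells out the equivalent decode('ascii') call itself, so it behaves the same on any interpreter.

```python
gb_bytes = b'\xb3\xa4\xb3\xc7'   # non-ASCII GB2312 bytes

# Explicit form of what str_string.encode('utf-8') does under an ASCII default.
try:
    gb_bytes.decode('ascii').encode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)   # True: the decode step rejects byte 0xb3

# Naming the real source codec avoids relying on the default at all.
utf8_bytes = gb_bytes.decode('gb2312').encode('utf-8')
```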

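unicode_string.decode('codec') is essentially meaningless. Inside Python a unicode object uses a single Unicode representation, either UTF-16 or UTF-32 (fixed when Python is compiled), so there is no encoding to convert from.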
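
One way to check which internal width an interpreter was built with is `sys.maxunicode`. This is a rough sketch; the names `is_wide` and `kind` are illustrative, and on recent Python 3 releases the value is always 0x10FFFF regardless of build options.

```python
import sys

# 0xFFFF indicates a narrow (UTF-16 style) build,
# 0x10FFFF indicates a wide (UTF-32 style) build.
is_wide = sys.maxunicode > 0xFFFF
kind = 'wide' if is_wide else 'narrow'
print('%s (%s build)' % (hex(sys.maxunicode), kind))
```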

Note: the default codec is set in the sitecustomize.py file under site-packages, for example:

import sys
# Only callable during startup: site.py removes sys.setdefaultencoding afterwards.
sys.setdefaultencoding('utf-8')
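
To see which default codec is currently in effect, `sys.getdefaultencoding()` can be queried. This is a read-only check, and the variable name `default_codec` is illustrative.

```python
import sys

# Read-only query; it does not change any interpreter setting.
default_codec = sys.getdefaultencoding()
print(default_codec)   # e.g. 'ascii' on a stock Python 2, 'utf-8' on Python 3
```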

Reference: http://blog.csdn.net/lf8289/article/details/2465196