Python Unicode and Chinese Text Handling (repost)

Date: 2019-11-11


Unicode in Python is confusing and fairly hard to understand. This article tries to settle these problems once and for all.

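1. The relationship between Unicode, GBK, GB2312, and UTF-8

The article at http://www.pythonclub.org/python-basic/encode-detail explains this well. UTF-8 is one way of implementing Unicode, that is, one way of turning Unicode code points into bytes. Unicode, GBK, and GB2312 are coded character sets.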
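As a rough illustration of that distinction, the sketch below shows one character serialized under different encodings. It is written for Python 3, where `str` holds Unicode text and `bytes` holds encoded data, unlike the Python 2 types used in the rest of this article. The variable names are illustrative only.

```python
# One abstract character, several byte representations.
ch = '中'

code_point = ord(ch)                 # position in the Unicode character set
utf8_bytes = ch.encode('utf-8')      # UTF-8 serialization of that code point
gbk_bytes = ch.encode('gbk')         # GBK assigns its own, unrelated byte values
gb2312_bytes = ch.encode('gb2312')   # GB2312 characters keep the same bytes in GBK

print(hex(code_point))   # 0x4e2d
print(utf8_bytes)        # b'\xe4\xb8\xad'
print(gbk_bytes)         # b'\xd6\xd0'
```

The same character needs three bytes in UTF-8 and two in GBK, which is why bytes written under one encoding cannot be read back under the other.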

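2. Chinese encoding issues in Python

2.1 Encoding declarations in .py files

By default Python treats script files as ASCII. When a file contains characters outside the ASCII range, an "encoding declaration" is needed to correct this. If the .py file that defines a module contains Chinese characters (strictly speaking, any non-ASCII characters), the encoding must be declared on the first or second line:

# -*- coding=utf-8 -*- or #coding=utf-8. Other encodings such as gbk and gb2312 also work. Without the declaration you get an error like this: SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details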
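To see which encoding a declaration actually selects, Python 3's standard library exposes the same lookup through `tokenize.detect_encoding`. The sketch below is only an illustration: `declared_encoding` is a hypothetical helper, and note that Python 3 falls back to UTF-8 rather than ASCII when no declaration is present.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding named in the first two lines of a source file."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# Declaration on the first line.
print(declared_encoding(b"# -*- coding=gbk -*-\ns = 1\n"))   # gbk

# Declaration on the second line, after a shebang.
print(declared_encoding(b"#!/usr/bin/env python\n# coding: gb2312\ns = 1\n"))   # gb2312
```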

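2.2 Encoding and decoding in Python

First, a word about string types. Python has two string types, str and unicode, both derived from basestring. A str is a sequence in which "Characters represent (at least) 8-bit bytes"; each unit of a unicode object is a Unicode character. Therefore:

len(u'中国') is 2, and len('ab') is also 2. By contrast, len('中国') is 6 in a UTF-8 encoded source file, because a str counts bytes.

The str documentation contains this sentence: The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file. In other words, when you read the contents of a file, or read data from the network, the object you get is a str. To convert a str to a particular encoding, first convert the str to unicode, then convert the unicode object to the target encoding, such as utf-8 or gb2312.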
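The "decode to Unicode, then encode to the target" route can be sketched as follows. This is Python 3 code, where the byte string is `bytes` and the text type is `str`; `transcode` is a hypothetical helper name, not a library function.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding to text first, then encoding again."""
    text = data.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(target_encoding)    # Unicode text -> bytes

utf8_data = '中国'.encode('utf-8')
gb_data = transcode(utf8_data, 'utf-8', 'gb2312')

# Two characters, six UTF-8 bytes, four GB2312 bytes.
print(len('中国'), len(utf8_data), len(gb_data))   # 2 6 4
```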

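Conversion functions provided by Python:

Converting unicode to gb2312, utf-8, and so on:

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    # Simplified characters are used here: gb2312 cannot encode the traditional form 國
    s = u'中国'
    s_gb = s.encode('gb2312')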

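To convert utf-8 or GBK data to unicode, use unicode(s, encoding) or s.decode(encoding):

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = u'中国'
    # s is unicode; encode it to utf-8 first
    s_utf8 = s.encode('UTF-8')
    # decoding the utf-8 bytes gives back the original unicode object
    assert(s_utf8.decode('utf-8') == s)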
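For readers on Python 3, the two spellings still exist in slightly different form: `unicode(s, encoding)` becomes `str(b, encoding)`, and `decode` lives on `bytes`. A minimal sketch, with illustrative variable names:

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'      # UTF-8 bytes for '中文'

via_constructor = str(raw, 'utf-8')    # Python 3 analogue of unicode(s, encoding)
via_method = raw.decode('utf-8')       # same method name as in Python 2

print(via_constructor == via_method == '中文')   # True
```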

Converting a plain str to unicode:

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = '中国'
    su = u'中国'

    # s is a str holding utf-8 bytes, because the .py file it lives in
    # is utf-8 encoded (# -*- coding=UTF-8 -*-)
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)

    # to convert s to gb2312, decode it to unicode first, then encode as gb2312
    s.decode('utf-8').encode('gb2312')

    # what happens if s.encode('gb2312') is called directly?
    s.encode('gb2312')

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = '中国'
    # what happens if s.encode('gb2312') is called directly?
    s.encode('gb2312')

This raises an exception:

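Python automatically decodes s to unicode first and then encodes it as gb2312. Because the decoding is implicit and we did not name a codec, Python uses the default encoding, the one reported by sys.getdefaultencoding(). In most cases the default encoding is ASCII, and if s is not ASCII the call fails.
In the example above, my default encoding is ASCII while s has the same encoding as the file, utf-8, so it fails: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
There are two ways to correct this.
The first is to state the encoding of s explicitly: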
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')
The second is to change the default encoding to match the file's encoding:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import sys
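reload(sys)  # Python 2.5 deletes sys.setdefaultencoding after startup, so sys has to be reloaded to get it back
sys.setdefaultencoding('utf-8')

s = '中文'  # named s rather than str, to avoid shadowing the built-in type
s.encode('gb2312')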
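The hidden step behind that error can be written out explicitly. The sketch below uses Python 3, which never decodes implicitly, so the ASCII decode that Python 2 performs inside `str.encode()` has to be spelled out by hand. Variable names are illustrative.

```python
utf8_bytes = '中文'.encode('utf-8')
message = None

try:
    # What Python 2 does behind the scenes: decode with ASCII, then encode.
    utf8_bytes.decode('ascii').encode('gb2312')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Naming the real source encoding avoids the failure (the first fix above).
gb_bytes = utf8_bytes.decode('utf-8').encode('gb2312')
print(gb_bytes)   # b'\xd6\xd0\xce\xc4'
```

Of the two fixes, naming the encoding at the call site keeps the assumption visible where the conversion happens, whereas changing the process-wide default affects every implicit conversion in the program.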

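File encoding and the print function
Create a file test.txt, saved in ANSI format (GBK on a Chinese Windows system), with this content:
abc中文
Read it with Python: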
# coding=gbk
print open("Test.txt").read()
Result: abc中文
Now change the file format to UTF-8:
Result: abc涓枃
Clearly, the content needs to be decoded here:
# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")
Result: abc中文
I edited the test.txt above with EditPlus. But when I edited it with Windows Notepad and saved it as UTF-8,
running the script failed:
Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

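It turns out that some programs, such as Notepad, insert three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file when saving it as UTF-8.
So we have to strip these bytes ourselves when reading. The codecs module in Python defines a constant for them: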
# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]  # drop the three BOM bytes
print data.decode("utf-8")
Result: abc中文
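In Python 3 the same cleanup is usually delegated to the `utf-8-sig` codec, which drops a leading BOM when one is present. The sketch below writes a throwaway file to imitate Notepad's output (the temporary file is an assumption made for the demonstration) and also reproduces the reason for the traceback above: the BOM decodes to U+FEFF, which GBK cannot encode.

```python
import codecs
import os
import tempfile

# Imitate a file saved by Notepad as UTF-8: BOM followed by the text.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + 'abc中文'.encode('utf-8'))

with open(path, encoding='utf-8') as f:
    with_bom = f.read()         # the BOM survives as the character U+FEFF
with open(path, encoding='utf-8-sig') as f:
    without_bom = f.read()      # 'utf-8-sig' strips a leading BOM
os.remove(path)

try:
    with_bom.encode('gbk')      # what printing to a GBK console has to do
    gbk_error = None
except UnicodeEncodeError as exc:
    gbk_error = exc             # U+FEFF has no GBK representation

print(repr(with_bom[0]), without_bom)
```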

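(4) A remaining question
In part two, we used the unicode function and the decode method to convert str to unicode. Why did those two calls take "gbk" as their argument?
My first guess was that it was because our encoding declaration used gbk (# coding=gbk), but is that really so?
Modify the source file: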
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
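Running it fails:
Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
Clearly, if the earlier version worked because both sides used gbk, then keeping both sides consistent at utf-8 should work too and should not raise an error.
Going one step further, what if the conversion here still uses gbk: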
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
I looked up an English article that roughly explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

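Put simply, print hands the string straight to the operating system, so the str has to be decoded with the encoding that the operating system uses. Windows uses CP936 (almost identical to gbk), so gbk works here. Note also that the coding declaration only tells Python how to read the source file and does not re-encode it: the "invalid data" error above means the bytes of s were not valid UTF-8, which is what happens when the file itself is still saved as ANSI (CP936).
A final test: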
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
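The point that the display program decides which bytes are acceptable can be simulated without a real console. In the Python 3 sketch below, a `TextIOWrapper` stands in for the console stream, and `bytes_sent_to_console` is a hypothetical helper. In a real session the encoding in use is available as `sys.stdout.encoding`.

```python
import io

def bytes_sent_to_console(text, console_encoding):
    """Return the bytes a text stream with this encoding emits for text."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=console_encoding, errors='replace')
    print(text, file=stream, end='')
    stream.flush()
    return raw.getvalue()

cp936_output = bytes_sent_to_console('中文', 'cp936')
utf8_output = bytes_sent_to_console('中文', 'utf-8')

print(cp936_output)   # b'\xd6\xd0\xce\xc4'
print(utf8_output)    # b'\xe4\xb8\xad\xe6\x96\x87'
```

The same text yields different bytes for each console, so bytes prepared for one display encoding show up as garbage on another.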

Recommended:

Encoding detection in Python

chardet makes it easy to detect the encoding of a string or file.

An example:

>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
>>>

chardet download page: http://chardet.feedparser.org/
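When installing chardet is not an option, a much cruder fallback is to try a short list of candidate encodings in order. To be clear, this is not how chardet works (chardet scores byte statistics); it is only a trial-decoding sketch in Python 3, and `guess_encoding` and its candidate list are assumptions chosen for Chinese text.

```python
def guess_encoding(data, candidates=('ascii', 'utf-8', 'gb18030')):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None   # none of the candidates fit

print(guess_encoding(b'plain text'))             # ascii
print(guess_encoding('中文'.encode('utf-8')))     # utf-8
print(guess_encoding('中文'.encode('gbk')))       # gb18030
```

Order matters: strict encodings such as ascii and utf-8 go first, because a permissive multibyte codec accepts many byte sequences that were never written in it.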

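Special note:

At work I often read a file, or fetch a web page, that clearly looks like gb2312, yet decode keeps failing. In that case decode('gb18030') can solve the problem. If there is still trouble, remember that decode takes a second parameter. For example, to convert a string object s from gbk to UTF-8 you can write s.decode('gbk').encode('utf-8'). In real development, however, I found that this often raises an exception: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence

This happens because the data contains illegal characters. In particular, programs written in C/C++ often produce the full-width space in several different ways, such as \xa3\xa0 or \xa4\x57. These all look like full-width spaces, but they are not "legal" full-width spaces (the real one is \xa1\xa1), so the conversion raises an exception. Problems like this are a real headache, because a single illegal character makes the whole string, sometimes a whole article, impossible to convert.

The fix: s.decode('gbk', 'ignore').encode('utf-8'). The signature of decode is decode([encoding], [errors='strict']), and the second parameter controls the error-handling policy. The default, strict, raises an exception on an illegal character. ignore skips illegal characters. replace substitutes a replacement marker (U+FFFD when decoding, ? when encoding). xmlcharrefreplace, which applies only when encoding, emits XML character references.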
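The error-handling policies can be compared side by side. The sketch below is Python 3 and uses a deliberately damaged byte string (a stray 0xff, which is never a valid GBK lead byte) rather than the specific sequences mentioned above. The variable names are illustrative.

```python
# GBK bytes for '中文' with a stray 0xff inserted in the middle.
damaged = b'\xd6\xd0\xff\xce\xc4'

try:
    damaged.decode('gbk')                     # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = damaged.decode('gbk', 'ignore')     # drops the bad byte
replaced = damaged.decode('gbk', 'replace')   # marks it with U+FFFD

# xmlcharrefreplace is an encoding-side handler.
as_ascii = '中'.encode('ascii', 'xmlcharrefreplace')

print(strict_failed, ignored, as_ascii)   # True 中文 b'&#20013;'
```

ignore silently loses data, so replace is often the safer choice when the damaged positions should stay visible for later inspection.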