Encoding Problems in Python 2

Date: 2019-12-04


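Let's start with an exception message:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 51-52: ordinal not in range(128)

Every Python programmer is all too familiar with this error. If its root cause and its fix are still not quite clear to you, this article tries to clear up the confusion.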

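What Is a String

Everything you thought you knew about strings is wrong.

Handling strings on a computer looks simple but is in fact extremely complex. I recommend an earlier article of mine, 《字符串，那些你不知道的事》 ("Strings: Things You Didn't Know").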

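String Types in Python 2

Python 2 has two string types, str and unicode. The difference is:

str is text representation in bytes, unicode is text representation in characters.

A string literal has type str. In other words, the assignment foo = "你好" binds foo to the binary bytes that correspond to 你好 (the bytes the Python interpreter read from the source file). The str type in Python 2 is the equivalent of the byte type in other languages.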

>>> "你好"
'\xe4\xbd\xa0\xe5\xa5\xbd'

A unicode object stores the code points of its characters. In Python 2 there are three ways to obtain a unicode value:

>>> u"你好"
u'\u4f60\u597d'
>>> "你好".decode("utf8")
u'\u4f60\u597d'
>>> unicode("你好", "utf8")
u'\u4f60\u597d'
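
The difference between the two types can be made concrete by comparing lengths. The following is a minimal sketch, not taken from the original article; it uses only \u escapes and b/u literals so that it should run unchanged on Python 2.7 and Python 3, and the variable names are illustrative.

```python
# The same text, once as code points and once as UTF-8 bytes.
text = u"\u4f60\u597d"            # 你好 as a unicode value (two code points)
data = text.encode("utf-8")       # 你好 as bytes (what Python 2 calls str)

char_count = len(text)            # counts characters
byte_count = len(data)            # counts bytes
round_trip = data.decode("utf-8") # bytes back to code points

print(char_count, byte_count)
```

Each of the two characters takes three bytes in UTF-8, which is why the byte count is three times the character count.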

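The Default Encoding in Python 2

sys.getdefaultencoding() returns the default encoding of the current Python environment, which is ascii in Python 2. Whenever a value is converted between str and unicode without an explicitly specified encoding, this default encoding is used.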
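
To see what an implicit conversion amounts to, the sketch below spells it out as an explicit decode with the ascii codec. The helper implicit_decode is a hypothetical name introduced only for illustration; it imitates what Python 2 does behind the scenes and is not an actual interpreter API. The code should run on Python 2.7 and Python 3 alike.

```python
import sys

def implicit_decode(data, encoding="ascii"):
    """Imitate Python 2's implicit str -> unicode conversion (illustrative helper)."""
    return data.decode(encoding)

# 'ascii' on Python 2; Python 3 reports 'utf-8' here.
default_encoding = sys.getdefaultencoding()

ascii_ok = implicit_decode(b"hello")          # pure ASCII bytes convert fine

try:
    implicit_decode(b"\xe4\xbd\xa0\xe5\xa5\xbd")  # UTF-8 bytes of 你好
    failed = False
except UnicodeDecodeError:
    failed = True                              # byte 0xe4 is not in range(128)
```

Only bytes below 128 survive the conversion, which is exactly the "ordinal not in range(128)" part of the error message.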

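The Root Cause of Encoding Problems in Python 2

With the two string types and the default encoding in mind, we can now analyze why encoding-related problems occur.

In the Python 2 world, many APIs are inconsistent about which of the two string types they use: some accept both, others accept only one. If you pass the wrong string type to an API, Python 2 automatically converts it to the expected type, and the problem lies in that automatic conversion: it uses the default encoding, ascii. That is why UnicodeDecodeError and UnicodeEncodeError show up so often.

As Unicode has become widespread, more and more Python 2 APIs take and return unicode strings, and when designing our own APIs we should use the unicode type wherever possible. Does that mean a program is safe as long as every string in it is unicode? Not quite. The usual guidelines are: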

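For text processing (counting the characters in a string, splitting a string, and so on), use the unicode type.
For I/O (reading and writing files on disk, printing a string, network communication, and so on), use the str type.

This is easy to understand once you recall that str in Python 2 is the equivalent of the byte type in other languages, and I/O operates on individual bytes.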
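
The two guidelines can be sketched as "decode at the input boundary, work on unicode, encode at the output boundary". The function names below (count_characters, write_text) and the choice of UTF-8 are assumptions made for this illustration, and an in-memory io.BytesIO stands in for a real file or socket.

```python
import io

def count_characters(raw_bytes, encoding="utf-8"):
    """Text processing: decode the incoming bytes first, then work on characters."""
    text = raw_bytes.decode(encoding)
    return len(text)

def write_text(stream, text, encoding="utf-8"):
    """I/O: encode the text explicitly right before it leaves the program."""
    stream.write(text.encode(encoding))

buf = io.BytesIO()                          # stand-in for a file opened in binary mode
write_text(buf, u"\u4f60\u597d, world")     # 你好, world
stored = buf.getvalue()                     # what actually went "to disk"
n = count_characters(stored)
```

Inside the program the text has 9 characters, while the stream holds 13 bytes; keeping the two views apart is the point of the guidelines.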

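In Practice

Now that we know where the problems come from, here are some common encoding-related mistakes, together with the correct usage.

String Concatenation and Comparison

When Python 2 concatenates or compares two strings and one is a str while the other is a unicode, it implicitly converts the str to unicode using the default ascii encoding.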

>>> print "%s, %s" % (u"你好", "中國")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>>

The fix is simple and follows the guideline above: as long as no I/O is involved, use the unicode type throughout.

>>> print u"%s, %s" % (u"你好", u"中國")
你好, 中國
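
When some of the values arrive as bytes (for example from a file or a socket), one possible pattern is to normalize them explicitly before formatting. The helper to_unicode below is a hypothetical name, and the assumption that incoming bytes are UTF-8 is mine, not the article's. The sketch should run on Python 2.7 and Python 3.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as text, decoding it explicitly if it is bytes (illustrative helper)."""
    if isinstance(value, bytes):        # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

name = u"\u4f60\u597d"                  # 你好, already text
place = b"\xe4\xb8\xad\xe5\x9c\x8b"     # 中國 as UTF-8 bytes

# Both arguments are text by the time they are formatted,
# so no implicit ascii conversion takes place.
greeting = u"%s, %s" % (to_unicode(name), to_unicode(place))
```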

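Reading and Writing Files

The built-in function open(name[, mode[, buffering]]) returns a file object that operates on str values. We can convert what we read to unicode by hand, but there is a catch: in a multi-byte encoding, different characters may be represented by different numbers of bytes, so if we read a chunk of some arbitrary fixed size (say 1K or 4K), the last few bytes of the chunk may well be only the first bytes of a character, and we have to handle that case ourselves. A crude workaround is to read all the data into memory and decode it afterwards, which clearly does not suit large data sets. A better approach is the codecs module's open(filename, mode='rb', encoding=None, errors='strict', buffering=1) function, which returns a file object that operates on unicode values: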

$ cat /tmp/debug.log
你好

>>> with open('/tmp/debug.log') as f:
...     s = f.read(1)    # read one byte
...     print type(s)    # str
...     print s          # a meaningless symbol (the first byte of a multi-byte character)
...
>>> import codecs
>>>
>>> with codecs.open('/tmp/debug.log', encoding='utf-8') as f:
...     s = f.read(1)    # read one character
...     print type(s)    # unicode
...     print s          # 你
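
The chunk-boundary problem described above can also be handled by hand with an incremental decoder from the codecs module, which buffers an incomplete trailing byte sequence until the next chunk arrives. This is a sketch of that alternative, not the article's own method; iter_decoded_chunks is a hypothetical helper, and the 4-byte chunk size is chosen only so that a chunk ends in the middle of a character.

```python
import codecs
import io

def iter_decoded_chunks(stream, chunk_size, encoding="utf-8"):
    """Yield text decoded from fixed-size byte chunks without splitting characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)    # incomplete trailing bytes are kept for later
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if the data ends mid-character
    if tail:
        yield tail

raw = io.BytesIO(u"\u4f60\u597d".encode("utf-8"))   # 6 bytes: two 3-byte characters
pieces = list(iter_decoded_chunks(raw, chunk_size=4))
```

The first 4-byte chunk contains all of 你 plus the first byte of 好; the decoder emits 你 and holds the extra byte until the remaining two bytes arrive.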

If we write a file with the built-in open, we must first convert unicode values to str; otherwise writing non-ASCII text raises an error.

>>> with open('/tmp/debug.log', 'w') as f:
...     f.write(u'你好')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

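This error is typical: it happens because 你好 is encoded with the default ascii codec, and 你好 is obviously outside the ASCII character set. The correct way: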

>>> with open('/tmp/debug.log', 'w') as f:
...     f.write(u'你好'.encode('utf-8'))
...

$ cat /tmp/debug.log
你好
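
Another standard-library option, not mentioned in the original text, is io.open, which takes an encoding argument and encodes unicode values on the way out. The sketch below writes to a temporary directory rather than /tmp/debug.log; the path and variable names are illustrative, and the code should run on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "debug.log")   # illustrative location

# Text mode with an explicit encoding: we pass unicode, the file object encodes it.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u4f60\u597d")          # 你好

# Binary mode shows the bytes that actually ended up on disk.
with io.open(path, "rb") as f:
    raw = f.read()

# Reading back in text mode decodes them again.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()
```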

print

First, note that in Python 2 print is a statement (on the same level as if and return), not a function.
print has two syntactic forms:

print_stmt ::=  "print" ([expression ("," expression)* [","]]
                | ">>" expression [("," expression)+ [","]])

By default print writes to standard output, sys.stdout. Using >> followed by a file-like object (one that has a write method) redirects the output. For example:

with open('/tmp/debug.log', 'w') as f:
    print >> f, '你好'

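Because what print emits is str data, printing to standard output (usually a terminal, for example iTerm2 on a Mac) involves an implicit encoding step. On Unix-like systems the encoding used for this step is determined by the locale, for example through the LC_ALL environment variable; on Windows the terminal can by default display only 256 characters (those of code page cp437).

Since Python 2.6, the encoding can also be set when the interpreter starts through the PYTHONIOENCODING environment variable.
Inside a program, it can be inspected through the read-only attribute sys.stdout.encoding.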

$ cat encode.py
# coding: utf-8
import sys
print sys.stdout.encoding
print u"你好"

$ python encode.py
UTF-8
你好

$ LC_ALL=C python encode.py
US-ASCII
Traceback (most recent call last):
  File "encode.py", line 21, in <module>
    print u"你好"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

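When the output of print is redirected so that standard output is no longer a terminal, Python does not know the locale of the target file, so it falls back to the default ascii encoding.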

$ python encode.py > abc
Traceback (most recent call last):
  File "encode.py", line 21, in <module>
    print u"你好"
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

$ cat abc
None

$ PYTHONIOENCODING=UTF-8 python encode.py > abc
$ cat abc
UTF-8
你好

As you can see, when PYTHONIOENCODING is not set, sys.stdout.encoding is None, and print u"你好" raises an error.
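
One way to cope with a missing stream encoding is to choose a fallback explicitly instead of relying on ascii. The sketch below is an assumption-laden illustration: encode_for_stream and the two fake stream classes are hypothetical names, and UTF-8 as the fallback is my choice, not something the interpreter does for you.

```python
def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or with a fallback if it has none."""
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)

class FakeRedirectedStdout(object):
    encoding = None        # what Python 2 reports when stdout is redirected to a file

class FakeTerminal(object):
    encoding = "UTF-8"     # what a UTF-8 terminal typically reports

redirected_bytes = encode_for_stream(u"\u4f60\u597d", FakeRedirectedStdout())
terminal_bytes = encode_for_stream(u"\u4f60\u597d", FakeTerminal())
```

The resulting bytes can then be written to the underlying stream without any implicit conversion taking place.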

To solve the problem of printing unicode values, we can wrap the sys.stdout object in a codecs.StreamWriter. For example:

$ cat encode2.py
# coding: utf-8
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'你好'

$ python encode2.py > abc
$ cat abc
你好

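Note that when print goes through a codecs.StreamWriter wrapper and is given a str value, the value is first converted to unicode and then encoded back to str. The first, implicit conversion uses the default ascii encoding, so it is quite likely to fail again.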

$ cat encode3.py
# coding: utf-8
import codecs
import sys

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print '你好'

$ python encode3.py > abc
Traceback (most recent call last):
  File "encode3.py", line 7, in <module>
    print '你好'
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

You may ask whether there is a once-and-for-all solution. The third-party module kitchen can solve this problem.

$ pip install kitchen
$ cat encode4.py
# coding: utf-8
import sys
from kitchen.text.converters import getwriter
UTF8Writer = getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print u'你好'
print '你好'

$ python encode4.py > abc
$ cat abc
你好
你好

As you can see, both kinds of 你好 are redirected to the file correctly.
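
The idea behind such a tolerant writer can be sketched in a few lines. To be clear, this is not kitchen's implementation: TolerantWriter is a hypothetical class written for this illustration, and it simply assumes that any bytes it receives are already in the target encoding.

```python
import io

class TolerantWriter(object):
    """Hypothetical wrapper that accepts text or bytes and always writes bytes."""

    def __init__(self, stream, encoding="utf-8"):
        self.stream = stream
        self.encoding = encoding

    def write(self, obj):
        if isinstance(obj, bytes):
            # Assumption: bytes are already encoded correctly, pass them through.
            self.stream.write(obj)
        else:
            # Text is encoded explicitly, never via the default ascii codec.
            self.stream.write(obj.encode(self.encoding))

sink = io.BytesIO()                         # stand-in for a redirected stdout
writer = TolerantWriter(sink)
writer.write(u"\u4f60\u597d\n")             # unicode 你好
writer.write(b"\xe4\xbd\xa0\xe5\xa5\xbd\n") # str (bytes) 你好
written = sink.getvalue()
```

Passing bytes through untouched avoids the implicit ascii decode that made the plain StreamWriter fail.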

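Miscellaneous

Above I focused on common encoding errors in input and output. The remaining encoding errors mostly come down to passing the wrong string type to an API. In your own code these are fairly easy to fix; in third-party modules they are hard to debug, and if you run into one, the only option is to hack the module's source code.

A good practice is to prefix the names of str variables with b_, for example b_search_hits, to indicate that the returned search results are of type str.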

Never reload(sys)

A fix for encoding problems that is commonly seen on the Internet is:

reload(sys)
sys.setdefaultencoding("utf-8")

This approach does far more harm than good. Here is a simple example:

# coding: utf-8
import sys

print "你好" == u"你好"
# False

reload(sys)
sys.setdefaultencoding("utf-8")

print "你好" == u"你好"
# True

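As you can see, after the default encoding is changed the logic of the program has changed. Most importantly, changing the default encoding also changes it for every third-party module we import, and, as in the example here, the logic of those modules may well change too. For a detailed explanation of this issue, see Dangers of sys.setdefaultencoding('utf-8').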
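
If the goal is simply to compare a str with a unicode value, an explicit decode achieves it without touching any global state. The helper same_text is a hypothetical name, and treating the bytes as UTF-8 is an assumption of this sketch, which should run on Python 2.7 and Python 3.

```python
def same_text(left, right, encoding="utf-8"):
    """Compare two values as text, decoding bytes explicitly (illustrative helper)."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left == right

as_bytes = b"\xe4\xbd\xa0\xe5\xa5\xbd"   # 你好 as UTF-8 bytes
as_text = u"\u4f60\u597d"                # 你好 as unicode

equal = same_text(as_bytes, as_text)     # the encoding is named at the call site
```

The comparison gives the same answer regardless of the interpreter's default encoding, so third-party code is unaffected.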

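Summary

After the analysis above, I trust you now have some understanding of why Python 2 produces so many encoding errors. The root cause is that early Python conflated the byte type with the str type; fortunately, Python 3 fixed this design mistake.
On the other hand, these encoding problems are very helpful for understanding how computers work, and they also show the danger of copy & paste. I hope that after reading this article you will strictly avoid the reload(sys) approach.

If you still have any questions about encoding problems in Python 2, feel free to leave a comment.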