UnicodeEncodeError Caused by Python's str()

Date: 2019-11-22

Background

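As everyone knows, UnicodeEncodeError and UnicodeDecodeError are among the trickier problems in Python 2. When one of them shows up, it often leaves you completely baffled. The author of Fluent Python even proposes a so-called "sandwich model" to help with this kind of problem (it really doesn't have to be that much trouble, as discussed later).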

Today I ran into a small related problem in production. I found it rather interesting, so here is a quick post to record it.

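When the bug was handed to me, the symptom was naturally a baffling message along the lines of UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). I went through the logs, quickly found the corresponding line of code, and it looked roughly like this: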

thrift_obj = ThriftKeyValue(key=str(xx_obj.name))  # the failing line; xx_obj.name is a string

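When I first saw str(xx_obj.name), I couldn't tell whether it was a slip or deliberate. Either way, it is not something I would ever write (every project probably has some magical code like this, more or less).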

Analysis

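Taken literally, the exception says that some string was being encoded by the ASCII codec, that the string contains characters outside the range ASCII can represent, and that the encoding therefore failed. From that I inferred: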

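There must be a Unicode string somewhere (which one doesn't matter, as long as it goes beyond the ASCII range); here it should be xx_obj.name.
An encoding step is happening somewhere, and happening implicitly (I really dislike these implicit conversions, and Python 2 has plenty of them); the code doesn't show where.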

After looking it over, the culprit had to be the built-in str(), so I quickly tried the following:

In [5]: u = u'中国'

In [6]: str(u)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-6-b3b94fb7b5a0> in <module>()
----> 1 str(u)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In [7]: b = u.encode('utf-8')

In [8]: str(b)
Out[8]: '\xe4\xb8\xad\xe5\x9b\xbd'

Sure enough. I checked the documentation and found nothing useful; the description is too vague:

class str(object='')
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.

For more information on strings see Sequence Types — str, unicode, list, tuple, bytearray, buffer, xrange which describes sequence functionality (strings are sequences), and also the string-specific methods described in the String Methods section. To output formatted strings use template strings or the % operator described in the String Formatting Operations section. In addition see the String Services section. See also unicode().
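What the quoted description leaves out is that, in Python 2, calling str() on a unicode object encodes it with the default encoding, which is ASCII unless the interpreter has been reconfigured. The sketch below imitates that implicit step with an explicit encode call. The helper name py2_style_str is hypothetical and only illustrates the behaviour; it is not part of any library.

```python
# -*- coding: utf-8 -*-

def py2_style_str(text, default_encoding='ascii'):
    """Hypothetical stand-in for what Python 2's str() does with a unicode object."""
    return text.encode(default_encoding)

# ASCII-only text encodes without complaint, so the implicit step goes unnoticed.
ascii_result = py2_style_str(u'abc')

# Anything outside ASCII fails with the same message seen in production.
try:
    py2_style_str(u'中国')
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)
```

Written this way, the hidden encode is visible in the code, and the failure is no longer mysterious.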

In our code (Python 2), every py file contains this line:

from __future__ import unicode_literals, absolute_import

So I guessed that xx_obj.name was a unicode string. I added a log statement to check, and sure enough it was.

Fix

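At this point, one option was to convert xx_obj.name into something str() can handle: it must at least not be unicode, and should be bytes. I didn't do that, though, because it is too ugly. Instead I changed the code to this: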

thrift_obj = ThriftKeyValue(key=xx_obj.name)  # no need to call str() here; it probably worked before only because name happened to always be ASCII

Bug fixed, and everything else behaves normally.
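For comparison, the alternative that was rejected (handing over bytes instead of unicode) would amount to an explicit encode rather than a call to str(). A minimal sketch follows; the helper to_utf8_bytes is hypothetical, and whether the Thrift field would accept bytes is an assumption not verified here.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(text):
    """Hypothetical helper: encode text explicitly instead of relying on str()."""
    return text.encode('utf-8')

name = u'中国'
key = to_utf8_bytes(name)  # the UTF-8 bytes of the name, with no implicit ASCII step
print(repr(key))
```

This works for any name, ASCII or not, but it scatters encode calls through the code, which is the ugliness the fix above avoids.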

Summary

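As mentioned earlier, Python 2 has quite a few of these implicit conversions, with little documentation to explain them. Add a Windows environment and print into the mix, and the error messages become even more bewildering. Fluent Python describes a so-called "sandwich model" for dealing with this, which is quite enlightening.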

Still, the principle I generally follow is: use Unicode only, and make everything Unicode everywhere. It works like this:

Every py file must start with the following header:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#

from __future__ import unicode_literals, absolute_import

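Bytes received from the outside world (from the network, from files, and so on) are converted to Unicode first. It is better to extract this into a function, to avoid writing the same code over and over: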

The API names are a bit verbose, mainly so that each name explains itself.


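class DecodeBytesFailedException(Exception):
    """Raised when no candidate encoding can decode the bytes.

    Defined here so the snippet is self-contained; assumed to be a plain Exception subclass.
    """


class UnicodeUtils(object):
    @classmethod
    def get_unicode_str(cls, bytes_str, try_decoders=('utf-8', 'gbk', 'utf-16')):
        """Convert to a text string (normally unicode)."""

        if not bytes_str:
            return u''

        if isinstance(bytes_str, unicode):
            return bytes_str

        # Try each candidate encoding in order; return the first that decodes cleanly.
        for decoder in try_decoders:
            try:
                unicode_str = bytes_str.decode(decoder)
            except UnicodeDecodeError:
                pass
            else:
                return unicode_str

        raise DecodeBytesFailedException('decode bytes failed. tried decoders: %s' % list(try_decoders))

    @classmethod
    def encode_to_bytes(cls, unicode_str, encoder='utf-8'):
        """Convert to a byte string."""

        if unicode_str is None:
            return b''

        if isinstance(unicode_str, unicode):
            return unicode_str.encode(encoding=encoder)
        else:
            # Not unicode yet: decode first, then re-encode with the requested codec.
            u = cls.get_unicode_str(unicode_str)
            return u.encode(encoding=encoder)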
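The decoding loop is the part most worth trying out in isolation. Below is a minimal standalone sketch of the same try-each-encoding idea; decode_with_fallback is a hypothetical name, and it raises a plain ValueError in place of the custom exception so that it also runs outside Python 2.

```python
# -*- coding: utf-8 -*-

def decode_with_fallback(raw, encodings=('utf-8', 'gbk', 'utf-16')):
    """Return raw decoded with the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode with any of %r' % (encodings,))

utf8_bytes = u'中国'.encode('utf-8')
gbk_bytes = u'中国'.encode('gbk')

# UTF-8 input is handled by the first candidate; GBK input fails UTF-8 and falls through to gbk.
from_utf8 = decode_with_fallback(utf8_bytes)
from_gbk = decode_with_fallback(gbk_bytes)
print(from_utf8 == from_gbk)
```

The order of the candidates matters. UTF-8 is strict and rejects most bytes that were not written as UTF-8, whereas GBK and UTF-16 accept many arbitrary byte sequences, so trying them first could return wrong text without raising anything.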

Everything sent to the outside world is converted to UTF-8 encoded bytes; see the code above.
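Taken together, the two rules (decode on the way in, encode on the way out) give a flow that can be sketched as below. The handler process_payload and the in-memory stream standing in for a network or file source are assumptions made for illustration only.

```python
# -*- coding: utf-8 -*-
import io

def process_payload(raw_bytes, source_encoding='utf-8'):
    """Hypothetical handler: bytes in, Unicode in the middle, UTF-8 bytes out."""
    text = raw_bytes.decode(source_encoding)  # input boundary: bytes become Unicode
    greeting = u'hello, ' + text              # all processing happens on Unicode
    return greeting.encode('utf-8')           # output boundary: Unicode becomes UTF-8 bytes

# A GBK-encoded payload stands in for data arriving from a legacy system.
incoming = io.BytesIO(u'中国'.encode('gbk'))
outgoing = process_payload(incoming.read(), source_encoding='gbk')
print(repr(outgoing))
```

Because encodings are only handled at the two boundaries, nothing in the middle has a reason to call str() on text, which removes the opportunity for the implicit ASCII encode that caused the original bug.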
