Python Encodings and Unicode

Date: 2019-11-12


Original article: Eric Moritz. Translated by: 伯樂在線 - 賤聖OMG
Translation link: http://blog.jobbole.com/50345/html

I'm sure there are plenty of explanations of Unicode and Python out there, but for my own understanding and reference I'm going to write one more.

Byte Strings vs. Unicode Objects

Let's start by defining a string in Python. When you use the str type, what you actually store is a byte string.

    [ a ][ b ][ c ] = "abc"
    [ 97 ][ 98 ][ 99 ] = "abc"

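In this example, the string abc is a byte string, and 97, 98 and 99 are its ASCII codes. One shortcoming of Python 2.x is that it treats every string as ASCII by default. Unfortunately, ASCII is only the lowest common denominator of the Latin-style character sets.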

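ASCII is a character mapping for the values 0 through 127. Character mappings such as windows-1252 and UTF-8 share those same first 128 characters. As long as every byte in your strings has a value of 127 or lower, it is safe to mix string encodings. Relying on that assumption is dangerous, though, as we'll see below.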
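The overlap is easy to check. The following is a small sketch in Python 3 (where byte strings are written `b"..."`; the article's own transcripts are Python 2), and the variable names are only illustrative: the same pure-ASCII bytes decode to identical text under all three codecs.

```python
# Python 3 sketch: bytes below 128 mean the same thing in ASCII,
# windows-1252 (cp1252) and UTF-8.
raw = b"abc"  # byte values 97, 98, 99

decoded = {codec: raw.decode(codec) for codec in ("ascii", "cp1252", "utf-8")}

# All three codecs agree because every byte is in the ASCII range.
print(list(raw))  # the integer value of each byte
print(decoded)
```

The agreement only holds while every byte stays in the ASCII range, which is exactly the assumption the text warns about.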

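Problems show up when a byte in your string has a value greater than 127. Let's look at a string encoded with windows-1252. Windows-1252 is an 8-bit character mapping, so it has 256 possible values. The first 128 are the same as ASCII, and the remaining 128 are assigned by windows-1252 to other characters.

    A windows-1252 encoded string looks like this:
    [ 97 ] [ 98 ] [ 99 ] [ 150 ] = "abc–"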

A windows-1252 string is still a byte string, but notice that the last byte has a value greater than 127. If Python tries to decode this byte string with its default ASCII codec, it raises an error. Let's see what happens when Python decodes this string:

    >>> x = "abc" + chr(150)
    >>> print repr(x)
    'abc\x96'
    >>> u"Hello" + x
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 3: ordinal not in range(128)

Now let's encode the same text with UTF-8:

    A UTF-8 encoded string looks like this:
    [ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"
    [0x61] [0x62] [0x63] [0xe2] [0x80] [0x93] = "abc–"

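If you pull out your handy Unicode chart, you'll find that the en dash has Unicode code point 8211 (0x2013). That value is greater than the ASCII maximum of 127, and greater than anything a single byte can store. Since 8211 (0x2013) needs two bytes on its own, UTF-8 has to use a few tricks to tell the system that this one character is stored in three bytes. Let's see what happens when Python uses its default ASCII codec to decode a UTF-8 byte string that contains bytes greater than 127.

    >>> x = "abc\xe2\x80\x93"
    >>> print repr(x)
    'abc\xe2\x80\x93'
    >>> u"Hello" + x
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)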

As you can see, Python always defaults to ASCII. When it reaches the 4th byte, whose value 226 is greater than 127, Python raises an error. This is the problem with mixing encodings.
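The same failure can be reproduced in Python 3, with the difference that Python 3 never decodes implicitly, so the codec has to be named. The sketch below (variable names are mine, not the article's) decodes both byte strings with the right codec and with the wrong one.

```python
# Python 3 sketch: the codec used to decode must match the one used to encode.
cp1252_bytes = b"abc\x96"        # "abc" plus an en dash, encoded as windows-1252
utf8_bytes = b"abc\xe2\x80\x93"  # the same text, encoded as UTF-8

# Decoding either one as ASCII fails on the first byte above 127.
for data in (cp1252_bytes, utf8_bytes):
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        print("ascii failed at position", exc.start)

# With the matching codec, both give the same text.
assert cp1252_bytes.decode("cp1252") == utf8_bytes.decode("utf-8") == "abc\u2013"

# With a wrong but permissive codec there is no error, just garbage.
mojibake = utf8_bytes.decode("cp1252")
print(ascii(mojibake))  # ascii() keeps the output printable on any terminal
```

The last case is arguably the worst one: nothing is raised, and the damaged text travels on silently.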

Decoding Byte Strings

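When you first start learning about Unicode in Python, the term "decode" can be confusing. You decode a byte string into a Unicode object, and you encode a Unicode object into a byte string.

Python needs to know how to decode a byte string into a Unicode object. When you have a byte string, you call its decode method to create a Unicode object from it.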

It's best to decode byte strings to Unicode as early as possible.

    >>> x = "abc\xe2\x80\x93"
    >>> x = x.decode("utf-8")
    >>> print type(x)
    <type 'unicode'>
    >>> y = "abc" + chr(150)
    >>> y = y.decode("windows-1252")
    >>> print type(y)
    <type 'unicode'>
    >>> print x + y
    abc–abc–
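For readers on Python 3, roughly the same session looks like this. It is only a sketch: `bytes` plays the role of the Python 2 `str`, and `str` plays the role of `unicode`.

```python
# Python 3 sketch of "decode early".
x = b"abc\xe2\x80\x93".decode("utf-8")        # UTF-8 bytes -> text
y = (b"abc" + bytes([150])).decode("cp1252")  # windows-1252 bytes -> text

print(type(x).__name__, type(y).__name__)

combined = x + y  # safe: both sides are text now
print(ascii(combined))  # ascii() keeps the output printable on any terminal
```

Once both values are text, the encodings they arrived in no longer matter.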
       
 
    
 
   
 

Encoding Unicode to Byte Strings

A Unicode object is an encoding-agnostic representation of text. You can't simply output a Unicode object; it has to be turned into a byte string first. Python will happily do that for you, but it defaults to ASCII when it encodes Unicode to a byte string, and that default is the cause of a lot of headaches.

    >>> u = u"abc\u2013"
    >>> print u
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
    >>> print u.encode("utf-8")
    abc–
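A Python 3 sketch of the same point follows. The `errors="replace"` option is not discussed in the article; it is shown only to illustrate that the codec, not the text, decides what can be represented.

```python
# Python 3 sketch: encoding is explicit, and the codec decides what is possible.
text = "abc\u2013"

try:
    text.encode("ascii")  # the en dash has no ASCII representation
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# errors="replace" trades correctness for robustness: the dash becomes "?".
lossy = text.encode("ascii", errors="replace")
print(lossy)
```

Replacing characters loses information, so it is usually a last resort rather than a fix.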
       
 
    
 
   
 

Using the codecs Module

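The codecs module is a big help when you work with byte streams. You can open a file with a given encoding, and whatever you read from it is automatically converted to Unicode objects.

Try this:

    >>> import codecs
    >>> fh = codecs.open("/tmp/utf-8.txt", "w", "utf-8")
    >>> fh.write(u"\u2013")
    >>> fh.close()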

What this does is take a Unicode object and write it to the file encoded as utf-8. You can use it in the other direction as well.


When you read data from a file, codecs.open creates a file object that automatically converts the utf-8 encoded file contents into Unicode objects.
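A round trip makes both directions concrete. The sketch below is Python 3 and makes two assumptions of its own: it uses the built-in `open` with an `encoding` argument, which in Python 3 does the job the article gives to `codecs.open`, and it writes to a temporary directory instead of `/tmp/utf-8.txt`.

```python
import os
import tempfile

# Hypothetical scratch file; the article uses /tmp/utf-8.txt.
path = os.path.join(tempfile.mkdtemp(), "utf-8.txt")

# Write text through an encoding-aware file object.
with open(path, "w", encoding="utf-8") as fh:
    fh.write("\u2013")

# Read it back: the file object decodes the bytes for us.
with open(path, "r", encoding="utf-8") as fh:
    text = fh.read()

# The raw bytes on disk are the three-byte UTF-8 sequence.
with open(path, "rb") as fh:
    raw = fh.read()

print(ascii(text), raw)
```

One character went in, three bytes landed on disk, and one character came back out.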

Let's continue the example above, this time with a urllib stream.

    >>> import urllib
    >>> stream = urllib.urlopen("http://www.google.com")
    >>> Reader = codecs.getreader("utf-8")
    >>> fh = Reader(stream)
    >>> type(fh.read(1))
    <type 'unicode'>
    >>> Reader
    <class encodings.utf_8.StreamReader at 0xa6f890>

The one-line version:

    >>> fh = codecs.getreader("utf-8")(urllib.urlopen("http://www.google.com"))
    >>> type(fh.read(1))
    <type 'unicode'>
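The reader wrapper can be tried without a network connection. In this Python 3 sketch an in-memory `io.BytesIO` object stands in for the urllib stream; that substitution is my assumption, chosen only because any object with a bytes-returning `read()` method can be wrapped the same way.

```python
import codecs
import io

# Stand-in for a network stream: any object whose read() returns bytes works.
stream = io.BytesIO(b"abc\xe2\x80\x93")

Reader = codecs.getreader("utf-8")  # returns a StreamReader class
fh = Reader(stream)

first = fh.read(1)  # text, not bytes
rest = fh.read()
print(type(first).__name__, ascii(first + rest))
```

The wrapper also buffers incomplete multi-byte sequences, so a read that stops partway through a character does not corrupt it.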

You do have to be careful with the codecs module. Whatever you pass in must be a Unicode object; otherwise it will automatically try to decode the byte string as ASCII.

    >>> x = "abc\xe2\x80\x93" # our "abc–" utf-8 string
    >>> fh = codecs.open("/tmp/foo.txt", "w", "utf-8")
    >>> fh.write(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/codecs.py", line 638, in write
        return self.writer.write(data)
      File "/usr/lib/python2.5/codecs.py", line 303, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)

Ouch, Python is decoding everything as ASCII again.
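Python 3 handles this case differently: instead of guessing ASCII, a text-mode file refuses bytes outright. The sketch below assumes the built-in `open` in place of `codecs.open` and a temporary file in place of `/tmp/foo.txt`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo.txt")  # hypothetical scratch file
x = b"abc\xe2\x80\x93"  # UTF-8 bytes, not text

with open(path, "w", encoding="utf-8") as fh:
    try:
        fh.write(x)  # bytes into a text file: refused with TypeError
        refused = False
    except TypeError:
        refused = True
    fh.write(x.decode("utf-8"))  # decode first, then write

with open(path, "rb") as fh:
    on_disk = fh.read()

print(refused, on_disk)
```

The fix is the same in both versions: decode the bytes yourself before handing them to an encoding-aware writer.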

The Problem with Slicing UTF-8 Byte Strings

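Because a UTF-8 encoded string is just a list of bytes, len() and slicing don't work the way you'd expect. Start with the string we used earlier, then do the following:

    >>> my_utf8 = "abc–"
    >>> print len(my_utf8)
    6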

What? It looks like 4 characters, but len says 6. That's because len counts bytes, not characters.

    >>> print repr(my_utf8)
    'abc\xe2\x80\x93'

Now let's slice this string.

    >>> my_utf8[-1] # Get the last char
    '\x93'

Ouch, the result is the last byte, not the last character.

To slice UTF-8 correctly, decode the byte string into a Unicode object first. Then you can safely count and manipulate it.

    >>> my_unicode = my_utf8.decode("utf-8")
    >>> print repr(my_unicode)
    u'abc\u2013'
    >>> print len(my_unicode)
    4
    >>> print my_unicode[-1]
    –
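The byte-versus-character difference carries over to Python 3 unchanged. This sketch (names are illustrative) also shows a consequence the transcript only hints at: a slice that cuts through a multi-byte sequence leaves bytes that can no longer be decoded.

```python
my_utf8 = "abc\u2013".encode("utf-8")
my_text = my_utf8.decode("utf-8")

print(len(my_utf8), len(my_text))  # bytes vs. characters
print(my_utf8[-1:], ascii(my_text[-1]))

# Cutting a byte string in the middle of a multi-byte sequence leaves
# bytes that no longer decode.
try:
    my_utf8[:4].decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
print(broken)
```

This is why truncating user-visible text should happen on the decoded string, never on the encoded bytes.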

When Python Encodes/Decodes Automatically

There are several situations in which Python raises an error because it automatically encodes or decodes with ASCII.

The first case is when it tries to concatenate a Unicode object and a byte string.

    >>> u"" + u"\u2019".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The same thing happens when you join a list. When the list contains both str and Unicode objects, Python automatically decodes the byte strings to Unicode.

    >>> ",".join([u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8")])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

Or when you format a byte string:

    >>> "%s\n%s" % (u"This string\u2019s unicode", u"This string\u2019s utf-8".encode("utf-8"),)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

Basically, whenever you mix Unicode objects and byte strings, you are asking for errors.
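One way to avoid the mix is to normalize everything to text at the boundary. The sketch below is Python 3, and `to_text` is a hypothetical helper of mine, not something from the article or the standard library. Note that Python 3 raises a TypeError on mixed input rather than attempting an ASCII decode.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


parts = ["This string\u2019s unicode",
         "This string\u2019s utf-8".encode("utf-8")]

# Python 3 refuses to join text and bytes instead of guessing ASCII.
try:
    ",".join(parts)
    mixed_ok = True
except TypeError:
    mixed_ok = False

joined = ",".join(to_text(p) for p in parts)
print(mixed_ok, ascii(joined))
```

A helper like this only works if you actually know the encoding of the incoming bytes; guessing it wrong produces garbled text without any error.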

In this example, you read a utf-8 file into a buffer as a byte string and then append some Unicode text to the buffer. Joining the buffer raises a UnicodeDecodeError.

    >>> buffer = []
    >>> fh = open("utf-8-sample.txt")
    >>> buffer.append(fh.read())
    >>> fh.close()
    >>> buffer.append(u"This string\u2019s unicode")
    >>> print repr(buffer)
    ['This file\xe2\x80\x99s got utf-8 in it\n', u'This string\u2019s unicode']
    >>> print "\n".join(buffer)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

You can solve this problem by using the codecs module to load the file as Unicode.

    >>> import codecs
    >>> buffer = []
    >>> fh = codecs.open("utf-8-sample.txt", "r", "utf-8")
    >>> buffer.append(fh.read())
    >>> fh.close()
    >>> buffer.append(u"This string\u2019s unicode")
    >>> print repr(buffer)
    [u'This file\u2019s got utf-8 in it\n', u'This string\u2019s unicode']
    >>> print "\n".join(buffer)
    This file’s got utf-8 in it

    This string’s unicode

As you can see, the stream created by codecs.open automatically converts the byte string to Unicode as the data is read.

Best Practices

1. Decode first, encode last

2. Use utf-8 encoding by default

3. Use codecs and Unicode objects to simplify processing

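Decode first means that whenever byte input comes in, you decode it to Unicode as early as possible. This prevents the problems with len() and with slicing utf-8 byte strings.

Encode last means that you encode text to a byte string only when you are about to output it somewhere. That output might be a file, a database, a socket, and so on. Only encode Unicode objects once you are done processing them. Encode last also means not letting Python encode Unicode objects for you: Python will use ASCII, and your program will crash.

Defaulting to UTF-8 means that, because UTF-8 can represent any Unicode character, you are better off using it instead of windows-1252 or ASCII.

The codecs module helps us avoid a lot of pitfalls when dealing with streams such as files or sockets. Without the tools codecs provides, you would have to read the file contents as a byte string and then decode that byte string into a Unicode object yourself.

The codecs module lets you turn byte streams into Unicode objects quickly and saves a lot of trouble.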
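The first rule can be sketched as a tiny pipeline. This is Python 3, and the function name, the processing step, and the default encodings are all made up for illustration: bytes are decoded on the way in, only text is handled in the middle, and bytes are produced again on the way out.

```python
def shout(raw, in_encoding="cp1252", out_encoding="utf-8"):
    """Hypothetical pipeline: bytes in, text inside, bytes out."""
    text = raw.decode(in_encoding)    # 1. decode first
    text = text.upper() + "!"         # 2. work on text only
    return text.encode(out_encoding)  # 3. encode last


result = shout(b"abc\x96")  # windows-1252 input
print(result)               # UTF-8 output
```

Because the encodings appear only at the two edges, changing the input or output format never touches the processing code in the middle.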

UTF-8 Explained

This last part is a quick primer on UTF-8. If you're a super geek, feel free to skip it.

In UTF-8, any byte between 128 and 255 is special. These bytes tell the system that they are part of a multi-byte sequence.

    Our UTF-8 encoded string looks like this:
    [ 97 ] [ 98 ] [ 99 ] [ 226 ] [ 128 ] [ 147 ] = "abc–"

The last 3 bytes are a UTF-8 multi-byte sequence. If you convert the first of those three bytes to binary, you get the following:

11100010

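The three leading 1 bits, followed by a 0, tell the system that this byte starts a 3-byte sequence: 226, 128, 147.

So the full byte sequence in binary is 11100010 10000000 10010011.

Then you apply the following mask to the three-byte sequence.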

 
    1110xxxx 10xxxxxx 10xxxxxx
    XXXX0010 XX000000 XX010011    Remove the X's
    0010     000000   010011      Collapse the numbers
    00100000 00010011             Get Unicode number 0x2013, 8211: the "–"
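The masking steps above can be written out with bit operations. This is a minimal Python 3 sketch with a helper name of my own choosing. It handles only the three-byte form and skips the extra validity checks a real decoder performs, such as rejecting overlong encodings and surrogates.

```python
def decode_three_byte(b1, b2, b3):
    """Decode one three-byte UTF-8 sequence by hand (illustrative helper)."""
    # The lead byte must look like 1110xxxx, the others like 10xxxxxx.
    assert b1 & 0b11110000 == 0b11100000
    assert b2 & 0b11000000 == 0b10000000
    assert b3 & 0b11000000 == 0b10000000
    # Strip the marker bits and concatenate the 4 + 6 + 6 payload bits.
    return ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)


code_point = decode_three_byte(226, 128, 147)
print(hex(code_point), code_point)
```

The 4 + 6 + 6 payload bits give 16 bits in total, which is why the three-byte form can reach code points up to 0xFFFF.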