Python "Black Magic": Encoding & Decoding

Date: 2019-11-06

Tags: python, encoding, decoding


First published on my personal blog. Please credit the source when reposting.

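Before We Start

This is an introductory, explanatory article.

The examples in this article ran successfully on Ubuntu 14.04 / Python 2.7.11. The Python 3+ interfaces differ slightly, so readers will need to adapt the code themselves.

Introduction

Let's look at a piece of code first:

example.py: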

# -*- coding=yi -*-

從 math 導入 sin, pi

打印 'sin(pi) =', sin(pi)

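What is this?! Is it Python? Can it even run? I imagine that's what you are asking.

I can tell you plainly: this is not Python, but the Python interpreter can run it. Of course, if you like, you can call it "Yython" (易語言, the Easy Programming Language, + Python).

How is it done? You may have noticed the odd comment on the first line. That's right, the whole secret is there.

This black magic goes back to PEP 263.

The Ancient PEP 263

I believe 99% of Chinese Python developers have had a headache over one problem: character encoding. It is every beginner's nightmare.

Remember that day? You tried to greet Python with a bit of code: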

print '你好'

and it hit you over the head with:

SyntaxError: Non-ASCII character '\xe4' in file chi.py on line 1, but no encoding declared

[Utterly baffled]

So you searched online for a solution, and soon enough you had the answer:

# -*- coding=utf-8 -*-

print '你好'

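The comment on the first line specifies the encoding used to parse the file.

This feature comes from PEP 263 -- Defining Python Source Code Encodings, written in 2001. It was introduced to solve a widely reported problem: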

In Python 2.1, Unicode literals can only be written using the Latin-1 based encoding "unicode-escape". This makes the programming environment rather unfriendly to Python users who live and work in non-Latin-1 locales such as many of the Asian countries. Programmers can write their 8-bit strings using the favorite encoding, but are bound to the "unicode-escape" encoding for Unicode literals.

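Python 2 parses source files as ASCII by default, which caused no small amount of trouble for developers outside the English-speaking world some 15 years ago. It seems Guido was being a little individualistic and only had the English-speaking world in mind when designing it.

The proposers' idea was to use a special comment at the top of the file (on its first or second line) to specify the encoding of the source code. The regular-expression prototype of this comment looks like this: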

^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)

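In other words, # -*- coding=utf-8 -*- is not the only way to write it; it is merely the style recommended for Emacs. Forms such as # coding=utf-8 and # encoding: utf-8 are all valid, so there is no need to be surprised when someone else's encoding declaration looks different from yours.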
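To make this concrete, here is a small sketch that runs the pattern quoted above against a few spellings. It is written for Python 3, and the helper name declared_encoding is my own invention for illustration, not something from PEP 263.

```python
import re

# The pattern quoted above, copied verbatim.
CODING_RE = re.compile(r'^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(line):
    """Return the encoding named by a header comment, or None (hypothetical helper)."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


samples = [
    '# -*- coding=utf-8 -*-',
    '# coding=utf-8',
    '# encoding: utf-8',
    '# vim: set fileencoding=gbk :',
    'print "hello"  # no declaration here',
]
results = [declared_encoding(line) for line in samples]
print(results)
```

The last sample yields None: the pattern only matches a line that consists of a comment, and that comment must contain "coding" followed by ":" or "=".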

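The capture group ([-_.a-zA-Z0-9]+) is used as the name of the encoding to look up, and the encoding that is found is then used to decode the file. In other words, import example is roughly equivalent to the following conversion behind the scenes: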

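# Conceptual sketch only (Python 2); extract_encoding_info is a hypothetical helper
with open('example.py', 'rb') as f:
    content = f.read()                     # raw bytes of the source file
encoding = extract_encoding_info(content)  # parse the header comment
exec(content.decode(encoding))             # decode to text, then execute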
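For readers on Python 3, the same pipeline can be tried end to end with the standard library. The sketch below uses tokenize.detect_encoding (a Python 3 function that the original example does not use) in place of the hypothetical extract_encoding_info. The GBK-encoded source string is made up for the demonstration.

```python
import io
import tokenize

# A made-up source file, stored as GBK bytes with a matching declaration.
source = "# coding=gbk\ngreeting = '你好'\n".encode('gbk')

# Step 1: read the header comment to find the declared encoding.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# Step 2: decode the bytes with that encoding, then execute the text.
namespace = {}
exec(source.decode(encoding), namespace)
print(encoding, namespace['greeting'])

# Decoding with the wrong codec fails, which is the beginner's nightmare above.
try:
    source.decode('utf-8')
    wrong_codec_worked = True
except UnicodeDecodeError:
    wrong_codec_worked = False
```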

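So the question really comes back to our old friends str.encode and str.decode.

But how is Python so powerful?! It recognizes almost every encoding! How does it do that? Is it the standard library? Or is it built into the interpreter?

It is all the work of the codecs module.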

codecs

codecs is a fairly obscure module; the encode/decode methods of str are used far more often. Underneath, though, they are calls into codecs.
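A quick way to convince yourself, sketched here in Python 3 (where text is str and encoded data is bytes): the method, the module-level function, and the codec object found by codecs.lookup all produce the same bytes.

```python
import codecs

text = '你好'

via_method = text.encode('utf-8')          # the familiar str method
via_codecs = codecs.encode(text, 'utf-8')  # the same operation through codecs

# codecs.lookup returns the CodecInfo object that does the actual work.
info = codecs.lookup('utf-8')
via_lookup, consumed = info.encode(text)   # (output bytes, characters consumed)

print(via_method == via_codecs == via_lookup, consumed)
```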

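Open the /path/to/your/python/lib/encodings/ directory and you will find many .py files named after encodings, such as utf_8.py and latin_1.py. These are the codecs predefined by the system, each implementing the logic for one encoding. In other words, a codec is really just an ordinary module.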
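Since these are ordinary modules, they can be imported like any other. The sketch below (Python 3) imports encodings.utf_8 directly and calls its getregentry function, which is the hook the standard encodings package uses to obtain a CodecInfo from each of its modules.

```python
import codecs
import importlib

# Import the codec implementation the same way as any other module.
module = importlib.import_module('encodings.utf_8')

# Each module in the encodings package exposes getregentry().
info = module.getregentry()
text, consumed = info.decode(b'\xe4\xbd\xa0')  # the UTF-8 bytes of one character

print(info.name, isinstance(info, codecs.CodecInfo), text, consumed)
```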

Besides the built-in encodings, users can also define their own. codecs exposes a register function for registering custom encodings. Its documentation reads:

codecs.register(search_function)
Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a CodecInfo object having the following attributes:

name: The name of the encoding;

encode: The stateless encoding function;

decode: The stateless decoding function;

incrementalencoder: An incremental encoder class or factory function;

incrementaldecoder: An incremental decoder class or factory function;

streamwriter: A stream writer class or factory function;

streamreader: A stream reader class or factory function.

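encode and decode are the stateless encoding/decoding functions. Put simply, each string they process has no connection to the one before it. If you want to use the codecs system for something like syntax-tree parsing, it is best not to put the parsing logic here, because the continuity of the code cannot be guaranteed. The incremental* entries are stateful classes and can make up for this shortcoming of encode and decode. The stream* entries are stream-oriented classes, and they usually behave the same as encode/decode.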
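The difference is easiest to see when the input arrives in chunks that split a character. Here is a minimal Python 3 sketch using the built-in UTF-8 codec. The chunk boundary is chosen by hand to land in the middle of the first character.

```python
import codecs

data = '你好'.encode('utf-8')    # 6 bytes, 3 per character
chunks = [data[:2], data[2:]]   # split in the middle of the first character

# Stateless: every chunk stands alone, so the partial character is an error.
try:
    chunks[0].decode('utf-8')
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False

# Incremental: the decoder remembers the unfinished bytes between calls.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))  # flush; errors if bytes remain
result = ''.join(pieces)

print(stateless_ok, pieces, result)
```

The first call returns an empty string because the decoder is still waiting for the third byte of the character. The second call then returns both characters.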

For how to write these six objects, see /path/to/your/python/lib/encodings/rot_13.py, which implements a simple cipher.
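If you want to see that codec in action before reading its source, it can be called through codecs.encode and codecs.decode. This sketch is for Python 3, where rot_13 is a text-to-text codec and therefore goes through the codecs functions rather than through str.encode.

```python
import codecs

# rot_13 shifts each ASCII letter 13 places and leaves other characters alone.
secret = codecs.encode('Hello, codecs!', 'rot_13')
plain = codecs.decode(secret, 'rot_13')

print(secret)
print(plain)
```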

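So, it is time to reveal the truth.

The So-Called "Yython"

The black magic is not really mysterious: just follow the existing examples and define the corresponding interfaces. As a demonstration, only the keywords actually used are handled here:

yi.py: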

# encoding=utf8

import codecs

yi_map = {
    u'從': 'from',
    u'導入': 'import',
    u'打印': 'print'
}


def encode(input):
    for key, value in yi_map.items():
        input = input.replace(value, key)

    return input.encode('utf8')


def decode(input):
    input = input.decode('utf8')
    for key, value in yi_map.items():
        input = input.replace(key, value)

    return input


class Codec(codecs.Codec):

    def encode(self, input, errors="strict"):
        # Return (output, length of input consumed), as codecs expects
        return (encode(input), len(input))

    def decode(self, input, errors="strict"):
        # Return (output, length of input consumed), as codecs expects
        return (decode(input), len(input))


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return encode(input)


class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return decode(input)


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def register_entry(encoding):
    return codecs.CodecInfo(
        name='yi',
        encode=Codec().encode,
        decode=Codec().decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader
    ) if encoding == 'yi' else None

Register it at the interactive prompt, and you can see the exciting result:

>>> import codecs, yi
>>> codecs.register(yi.register_entry)
>>> import example
sin(pi) = 1.22464679915e-16
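The session above is Python 2. As a rough idea of how the same trick might look on Python 3, here is a minimal sketch that registers only the stateless encode/decode pair and decodes a bytes object in memory instead of importing a file. The names YI_MAP, yi_encode, yi_decode and search are my own, and the two-line source string is made up for the demonstration.

```python
import codecs

# Same keyword table as yi.py above.
YI_MAP = {'從': 'from', '導入': 'import', '打印': 'print'}


def yi_decode(data, errors='strict'):
    """UTF-8 bytes -> text, with the Chinese keywords replaced."""
    raw = bytes(data)  # the codec machinery may pass a memoryview
    text = raw.decode('utf-8', errors)
    for key, value in YI_MAP.items():
        text = text.replace(key, value)
    return text, len(raw)  # (output, input bytes consumed)


def yi_encode(text, errors='strict'):
    """Text -> UTF-8 bytes, with the Python keywords replaced."""
    consumed = len(text)
    for key, value in YI_MAP.items():
        text = text.replace(value, key)
    return text.encode('utf-8', errors), consumed


def search(name):
    """Search function for codecs.register; answers only for 'yi'."""
    if name == 'yi':
        return codecs.CodecInfo(name='yi', encode=yi_encode, decode=yi_decode)
    return None


codecs.register(search)

source = "從 math 導入 sin, pi\nresult = sin(pi)\n".encode('utf-8')
decoded = source.decode('yi')
namespace = {}
exec(decoded, namespace)
print(decoded)
print(namespace['result'])
```

Like the original, this is a plain string replacement, so it would also rewrite those words inside string literals and comments.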

Conclusion

Sometimes, taking a deeper look at the things we take for granted can lead to surprising discoveries.

References

codecs - Codec registry and base classes
