Brain teaser: how do you represent a list with a single integer?

Date: 2020-12-23

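Original title | Storing a list in an int (https://iantayler.com/2020/12/07/storing-a-list-in-an-int)

Author | Computer Wit

Translator | 豌豆花下貓 (author of the "Python貓" WeChat official account)

Note | This translation is published with the original author's permission. The content has been lightly edited for readability.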

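Overview

Unlike C, Rust, and Go, Python's default int has arbitrary size. [Note 1], [Note 2]

This means a single integer can store an arbitrarily large value, as long as there is enough memory.

For example, you can open Python 3 and run the following commands: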

>>> import math
>>> math.factorial(2020)
[number omitted]  # Python貓 note: this computes the factorial of 2020; the result is a very long number, so it is omitted
>>> math.log2(math.factorial(2020))
19272.453841606068
>>> type(math.factorial(2020))
<class 'int'>

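In other words, the everyday int in Python can easily hold a value that would need a 19273-bit C-style fixed-size unsigned int. In a language like Python, where convenience outranks speed and memory efficiency, this is really useful.

This unlimited precision also means we can store any amount of information in a single int. With the right encoding, a whole book, a whole database, or anything else can be stored in one Python int.

(Python貓 note: there is an article that analyzes in depth how Python integers avoid overflow; it makes good related reading.)

So we can imagine a dialect of Python that only has integers and needs to use int to represent every other type (dicts, lists, and so on). We would also have some special functions and methods that treat an int as a list, a dict, and so on.

This will be a fun and entertaining exercise, and that is what this post sets out to do.

There is an obvious way to implement it: all data structures are just bit-arrays in memory. In the worst case, a structure is a set of related bit-arrays (for example, the nodes of a linked list or a tree), and that collection is itself just a bit-array. Bit-arrays can be interpreted as binary numbers. So it must be possible. But that is a bit boring.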
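As a rough illustration of this "boring" approach, here is a minimal sketch that treats the UTF-8 bytes of a string as one binary number and converts it back. The function names `bytes_to_int` and `int_to_bytes` are made up for this example, and the leading sentinel byte is an assumption added here so that leading zero bytes survive the round trip.

```python
def bytes_to_int(data: bytes) -> int:
    """Interpret a byte string as one big-endian unsigned integer."""
    # The leading 0x01 byte is a sentinel: without it, leading zero
    # bytes in `data` would vanish when converting back.
    return int.from_bytes(b"\x01" + data, byteorder="big")


def int_to_bytes(n: int) -> bytes:
    """Invert bytes_to_int by dropping the sentinel byte."""
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, byteorder="big")[1:]


text = "a whole book, in principle"
as_int = bytes_to_int(text.encode("utf-8"))
print(as_int.bit_length())
print(int_to_bytes(as_int).decode("utf-8") == text)
```

Any serializable structure could be stored this way, which is exactly why the rest of the post looks for something more interesting.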

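In this post and in later posts in this series, I'll present some ways to represent complex data structures with ints. They are not necessarily the most compact, sensible, or efficient; the common goal is to find interesting representations of these data structures. [Note 3]

Introduction to Gödel numbering

The first data structure we'll represent is the list. We will use Gödel numbering, named after the logician Kurt Gödel. For convenience, we only deal with lists of unsigned integers (that is, natural numbers).

Gödel numbering relies on the fact that every natural number greater than 1 has a unique prime factorization. This is the fundamental theorem of arithmetic.

(Python貓 note: prime factorization means writing a number as a product of primes.)

Let's look at some examples:

A number can be uniquely identified by the list of exponents of its prime factors (up to its last non-zero exponent). So we can use 126 to represent the list [1, 2, 0, 1]. The first number in the list is the exponent of 2 in the prime factorization of 126, the second is the exponent of 3, and so on.

A few more examples: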
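Examples of this kind can be generated with a short sketch. The helpers `is_prime` and `exponents` below are hypothetical names, not part of the original post; `exponents` reads the list off a number by repeated division, and it uses an ordinary Python list for output purely for readability.

```python
def is_prime(n: int) -> bool:
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 1
    return True


def exponents(n: int) -> list:
    """Return the exponents of 2, 3, 5, 7, ... in n, for n >= 1."""
    result = []
    candidate = 2
    while n > 1:
        count = 0
        while n % candidate == 0:
            n //= candidate
            count += 1
        # Only primes correspond to list positions; composites are skipped.
        if is_prime(candidate):
            result.append(count)
        candidate += 1
    return result


print(exponents(126))  # [1, 2, 0, 1], since 126 = 2**1 * 3**2 * 5**0 * 7**1
print(exponents(18))   # [1, 2], since 18 = 2**1 * 3**2
print(exponents(10))   # [1, 0, 1], since 10 = 2**1 * 3**0 * 5**1
print(exponents(7))    # [0, 0, 0, 1]
print(exponents(1))    # [], the empty list
```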

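What if the list ends with 0s? Well, with this encoding that cannot happen.

In our prime factorization there may be infinitely many primes with exponent 0, so we need to stop somewhere. [Note 4] We choose to stop at the last non-zero exponent.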
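The consequence is easy to see in code. In this sketch, `encode` is a hypothetical helper (not from the original post) that multiplies the i-th prime raised to the i-th element; lists that differ only in trailing 0s collapse to the same number.

```python
def is_prime(n: int) -> bool:
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 1
    return True


def encode(items: list) -> int:
    """Return 2**items[0] * 3**items[1] * 5**items[2] * ..."""
    result = 1
    candidate = 2
    for item in items:
        # Move forward to the next prime.
        while not is_prime(candidate):
            candidate += 1
        result *= candidate ** item
        candidate += 1
    return result


print(encode([1, 2, 0, 1]))  # 126
print(encode([1, 2]))        # 18
print(encode([1, 2, 0]))     # 18 as well: the trailing 0 is lost
print(encode([0, 0]))        # 1, indistinguishable from the empty list
```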

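When the list contains large numbers, this representation also produces very large numbers. That is because the numbers in the list become exponents, so the size of the int grows exponentially with them. For example, [50, 1000, 250] needs a number 2216 bits long.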
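The size of such an encoding is easy to check in a Python session with `int.bit_length()`. The snippet below builds the Gödel number for `[50, 1000, 250]` directly from the definition (exponents of 2, 3, and 5) and prints how many bits it occupies; the variable names are only illustrative.

```python
import math

# Gödel number of the list [50, 1000, 250].
encoded = 2**50 * 3**1000 * 5**250

# Exact size in bits.
print(encoded.bit_length())

# The same figure estimated from logarithms:
# 50 * log2(2) + 1000 * log2(3) + 250 * log2(5), rounded down, plus one.
estimate = 50 + 1000 * math.log2(3) + 250 * math.log2(5)
print(math.floor(estimate) + 1)
```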

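On the other hand, compared with other ways of encoding lists as ints, long lists made up of many small integers, and especially large sparse lists (where most values are 0), get a very compact representation.

As a reminder, encoding a list as an int is not good programming practice; it is just a fun experiment.

Python implementation

Let's look at the Python implementation. A few things to note:

We will use functions with yield, because it greatly simplifies things. [Note 5]
You will see a lot of while loops. That is because list comprehensions, range, and most of the things you would normally use in a for loop are banned in the int-only dialect. All of them are replaced by while loops.

Prime generator

The first function we'll write is an iterator that yields the primes in order. It is critical throughout. The implementation here is the simplest version that works.

I may soon write a full post on algorithms for generating primes, since it is a cool topic and an old field of research in its own right. The best-known algorithm is the Sieve of Eratosthenes, but that is just the tip of the iceberg. [Note 6]

Here, a very naive implementation is enough: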

def primes(starting: int = 2):
    """Yield the primes in order.
     
    Args:
        starting: sets the minimum number to consider.
     
    Note: `starting` can be used to get all prime numbers
    _larger_ than some number. By default it doesn't skip
    any candidate primes.
    """
    candidate_prime = starting
    while True:
        candidate_factor = 2
        is_prime = True
        # We'll try all the numbers between 2 and
        # candidate_prime / 2. If any of them divide
        # our candidate_prime, then it's not a prime!
        while candidate_factor <= candidate_prime // 2:
            if candidate_prime % candidate_factor == 0:
                is_prime = False
                break
            candidate_factor += 1
        if is_prime:
            yield candidate_prime
        candidate_prime += 1
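One cheap improvement, sketched here under the assumption that we still want to stay within while loops and plain ints, is to stop trial division at the square root of the candidate instead of at half of it. The name `primes_sqrt` is hypothetical; this is a variation on the function above, not the original author's code. Unlike the original, it also clamps `starting` to 2 so that 0 and 1 are never yielded.

```python
def primes_sqrt(starting: int = 2):
    """Yield the primes in order, beginning at `starting` (at least 2)."""
    candidate_prime = starting
    if candidate_prime < 2:
        candidate_prime = 2
    while True:
        candidate_factor = 2
        is_prime = True
        # A composite number always has a factor no larger than its
        # square root, so we can stop once factor * factor exceeds it.
        while candidate_factor * candidate_factor <= candidate_prime:
            if candidate_prime % candidate_factor == 0:
                is_prime = False
                break
            candidate_factor += 1
        if is_prime:
            yield candidate_prime
        candidate_prime += 1


gen = primes_sqrt()
print(next(gen), next(gen), next(gen), next(gen))  # 2 3 5 7
```

For a candidate near one million this checks at most about a thousand factors rather than half a million.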

Creating an empty list

def empty_list() -> int:
    """Create a new empty list."""
    # 1 is the empty list. It isn't divisible by any prime.
    return 1

Iterating over elements

def iter_list(l: int):
    """Yields elements in the list, from first to last."""
    # We go through each prime in order. The next value of
    # the list is equal to the number of times the list is
    # divisible by the prime.
    for p in primes():
        # We decided we will have no trailing 0s, so when
        # the list is 1, it's over.
        if l <= 1:
            break
        # Count the number of divisions until the list is
        # not divisible by the prime number.
        num_divisions = 0
        while l % p == 0:
            num_divisions += 1
            l = l // p  # exact, since p divides l; `/` would return a float
        yield num_divisions

Accessing elements

def access(l: int, i: int) -> int:
    """Return i-th element of l."""
    # First we iterate over all primes until we get to the
    # ith prime.
    j = 0
    for p in primes():
        if j == i:
            ith_prime = p
            break
        j += 1
    # Now we divide the list by the ith prime until we
    # can't divide it any more.
    num_divisions = 0
    while l % ith_prime == 0:
        num_divisions += 1
        l = l // ith_prime
    return num_divisions

Appending elements

def append(l: int, elem: int) -> int:
    # The first step is finding the largest prime factor.
    # We look at all primes until l.
    # The next prime after the last prime factor is going
    # to be the base we need to use to append.
    # E.g. if the list is 18 -> 2**1 * 3**2 -> [1, 2]
    # then the largest prime factor is 3, and we will
    # multiply by the _next_ prime factor to some power to
    # append to the list.
    last_prime_factor = 1  # Just a placeholder
    for p in primes():
        if p > l:
            break
        if l % p == 0:
            last_prime_factor = p
    # Now get the _next_ prime after the last in the list.
    for p in primes(starting=last_prime_factor + 1):
        next_prime = p
        break
    # Now finally we append an item by multiplying the list
    # by the next prime to the `elem` power.
    return l * next_prime ** elem
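The `append` above tests every prime up to `l`, which gets slow quickly as the encoded list grows. A possible alternative, offered as a sketch rather than as the author's method, divides the prime factors out as it goes and stops as soon as nothing is left. `append_fast` and `is_prime` are hypothetical names introduced for this example.

```python
def is_prime(n: int) -> bool:
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 1
    return True


def append_fast(l: int, elem: int) -> int:
    """Append elem to the list l (l >= 1) without scanning all primes up to l."""
    remaining = l
    candidate = 2
    # Strip factors in increasing order. Composite candidates never
    # divide `remaining`, because their prime factors are already gone.
    while remaining > 1:
        while remaining % candidate == 0:
            remaining //= candidate
        candidate += 1
    # `candidate` is now one past the largest prime factor of l
    # (or still 2 for the empty list). Move on to the next prime.
    while not is_prime(candidate):
        candidate += 1
    return l * candidate ** elem


l = append_fast(1, 2)   # [2]
l = append_fast(l, 5)   # [2, 5]
print(l)                # 972
```

As with the original function, appending 0 returns the list unchanged, because trailing 0s are not represented in this encoding.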

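Trying out the functions

You can open a Python, IPython, or bpython session and try these functions!

It is best to use numbers between 1 and 10 as list elements. With larger numbers, append and access may take a long time.

To some extent, using Gödel numbering to represent lists is not practical, although the range of usable values can be greatly extended by optimizing the prime generation and factorization algorithms.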

In [16]: l = empty_list()
 
In [17]: l = append(l, 2)
 
In [18]: l = append(l, 5)
 
In [19]: list(iter_list(l))
Out[19]: [2, 5]
 
In [20]: access(l, 0)
Out[20]: 2
 
In [21]: access(l, 1)
Out[21]: 5
 
In [22]: l
Out[22]: 972  # Python貓 note: 2^2 * 3^5 = 972

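Other int encodings

We have seen one way to represent a list of natural numbers as an int. There are other, more practical methods that rely on subdividing the binary form of the number into chunks of varying sizes. I'm sure you can come up with such a scheme yourself.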
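One possible scheme of that kind, given here only as an illustration and not as the encoding the author has in mind, borrows the variable-length integer idea used by formats such as LEB128: each element is split into 7-bit groups, and the eighth bit of every chunk says whether more groups follow. The names `pack` and `unpack` are hypothetical, the final sentinel bit is an assumption added so that zero-valued elements are not lost, and ordinary Python lists are used for input and output for brevity.

```python
def pack(items: list) -> int:
    """Pack natural numbers into one int using 8-bit chunks."""
    result = 0
    shift = 0
    for item in items:
        while True:
            chunk = item & 0x7F          # low 7 bits of the element
            item >>= 7
            if item > 0:
                chunk |= 0x80            # continuation flag: more groups follow
            result |= chunk << shift
            shift += 8
            if item == 0:
                break
    # Sentinel bit above the last chunk marks the end of the data.
    return result | (1 << shift)


def unpack(n: int) -> list:
    """Invert pack, reading chunks from the low end of n."""
    items = []
    value = 0
    group_shift = 0
    while n > 1:                         # only the sentinel is left at 1
        chunk = n & 0xFF
        n >>= 8
        value |= (chunk & 0x7F) << group_shift
        group_shift += 7
        if (chunk & 0x80) == 0:          # last group of this element
            items.append(value)
            value = 0
            group_shift = 0
    return items


packed = pack([50, 1000, 250])
print(packed.bit_length())
print(unpack(packed))                    # [50, 1000, 250]
print(unpack(pack([1, 2, 0])))           # [1, 2, 0]: trailing 0s survive
```

Here the size grows with the number of digits of each element rather than with its value, which is the main practical advantage over the prime-exponent encoding.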

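I may write other posts in the future on better algorithms for generating primes and factoring numbers, and on int representations of other complex data structures.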

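Footnotes

[Note 1] I assume the program would break down before actually running out of memory, but the documentation does explicitly say that ints have unlimited precision.
[Note 2] Note that this is true for Python 3 but not for Python 2, where int is fixed-size. I think it is fine to say Python to mean Python 3 in 2020, but I also think the detail deserves a footnote.
[Note 3] It is easy to argue that Gödel numbering is a bad representation for lists. In later posts we will discuss the trade-offs between representations.
[Note 4] We could store the length of the list in a separate int, and from that know how many 0s to consider at the end of the list. (Python貓 note: the following sentences were left in the original English.) If we don’t want to have a whole separate int, we can always write the length of the list as the exponent of 2 and start the actual list with the exponent of 3. This has some redundant information, though. The way to avoid redundant information is to store the number of final 0s in the list, instead of the entire length. We won’t be worrying about any of this, though.
[Note 5] Note that using yield is no different from using return and passing a state variable as an argument (usually it is enough to pass the last returned element). This is a bit like Continuation Passing Style. It is also similar to the usual accumulator trick for making a non-tail-recursive function tail-recursive. If you have never heard of the accumulator trick, here are some links [1], [2]. In the future I may write something that imitates iterators in languages that do not have them.
[Note 6] See also the paper "The Genuine Sieve of Eratosthenes", which clarifies how the algorithm is defined.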
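The length-prefix variant mentioned in the footnote on trailing 0s (length as the exponent of 2, list elements starting at the exponent of 3) could look roughly like this. It is a sketch under that reading of the footnote; `encode_with_length`, `decode_with_length`, and `is_prime` are hypothetical names, and Python lists are used for input and output only to keep the example short.

```python
def is_prime(n: int) -> bool:
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 1
    return True


def encode_with_length(items: list) -> int:
    """Store len(items) as the exponent of 2, elements from 3 onward."""
    result = 2 ** len(items)
    candidate = 3
    for item in items:
        while not is_prime(candidate):
            candidate += 1
        result *= candidate ** item
        candidate += 1
    return result


def decode_with_length(n: int) -> list:
    """Invert encode_with_length, for n >= 1."""
    length = 0
    while n % 2 == 0:
        n //= 2
        length += 1
    items = []
    candidate = 3
    while len(items) < length:
        while not is_prime(candidate):
            candidate += 1
        count = 0
        while n % candidate == 0:
            n //= candidate
            count += 1
        items.append(count)
        candidate += 1
    return items


print(encode_with_length([1, 2, 0]))   # 600 = 2**3 * 3**1 * 5**2 * 7**0
print(decode_with_length(600))         # [1, 2, 0]
```

With the length stored explicitly, `[1, 2]` and `[1, 2, 0]` get different numbers, at the cost of the redundancy the footnote points out.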

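Python貓 note: That is the whole translation, but I want to add one more interesting point. In "Hackers & Painters", Paul Graham makes a striking prediction: he argues that, logically, an integer type is not needed, because the integer n can be represented by a list of n elements. That is exactly the reverse of the idea in this post! Imagine a programming language that has only integers and no lists, and another that has only lists and no integers: which of the two is more likely to appear in the future?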