String Objects in Python (Notes on 《Python源碼剖析》, Part 3)

Date: 2019-11-16

This is the third of my notes on the book 《Python源碼剖析》. Learn Python by Analyzing Python Source Code · GitBook

String Objects in Python

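In Python 3, the str type is what Python 2 called unicode, and the old str type became the new bytes type. We could analyze the implementation of bytes, which is what 《Python源碼剖析》 covers, but since we use str so often and understand it only superficially, let's dissect this considerably more complex type instead.

In the earlier analysis, the integer object in Python 2 was a fixed-length object, whereas a string object is a variable-length object. A string object is also immutable: once created, its value can no longer be changed.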

The four forms of Unicode strings

In Python 3, a unicode string takes one of four forms:

compact ascii
compact
legacy string, not ready
legacy string, ready

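Compact means that the string object uses a single memory block for everything: in memory, the characters immediately follow the struct. A non-compact object, that is, a PyUnicodeObject, uses one memory block for the PyUnicodeObject struct and a second block for the characters.

An ASCII-only string is created with PyUnicode_New and stored in a PyASCIIObject struct. Because ASCII text is byte-for-byte identical to its UTF-8 encoding, the character data doubles as the UTF-8 representation; the two are equivalent.

A legacy string is stored in a PyUnicodeObject.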
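The single-block layout can be observed from Python with sys.getsizeof. The following is a minimal sketch, assuming CPython 3.3 or later; the helper names per_char_bytes and header_bytes are made up for illustration, and the header sizes it reports differ between versions and platforms.

```python
import sys


def per_char_bytes(ch: str) -> int:
    """Bytes added to a string when one more copy of `ch` is appended."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


def header_bytes(ch: str) -> int:
    """Approximate fixed overhead: total size minus the character buffer.

    The buffer holds len + 1 units because of the trailing NUL.
    """
    n = 100
    return sys.getsizeof(ch * n) - (n + 1) * per_char_bytes(ch)


if __name__ == "__main__":
    samples = [("ASCII", "a"), ("Latin-1", "\u00e9"),
               ("BMP", "\u4e2d"), ("astral", "\U0001F600")]
    for label, ch in samples:
        print(label, per_char_bytes(ch), header_bytes(ch))
```

On CPython the ASCII header should come out smaller than the non-ASCII one, which matches the difference between PyASCIIObject and PyCompactUnicodeObject.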

Let's look at the source code first and discuss the rest afterwards.

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;       
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible surrogates count as two code points. */
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

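As you can see, the whole string object machinery is built on PyASCIIObject, so let's look at it first. length stores the number of code points in the string. hash stores the string's hash value: because a string object is immutable, its hash never changes, so Python caches it in the hash field to avoid the cost of recomputing it (-1 means it has not been computed yet). The state struct holds information about the object and is related to the four forms introduced above. wstr points to an optional wchar_t representation of the string. For a compact string the character data is not in wstr but directly follows the struct; for a legacy string that is not ready, wstr is where the data initially lives.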

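What do the fields of the state struct mean? I removed the comments to save space, so let's go through them one by one. interned relates to the string interning mechanism and takes one of three values: SSTATE_NOT_INTERNED (0), SSTATE_INTERNED_MORTAL (1), and SSTATE_INTERNED_IMMORTAL (2), meaning not interned, interned but removable, and permanently interned. The mechanism itself is covered later. kind indicates how many bytes are used to store each character. compact has already been explained, and ascii is self-explanatory. ready indicates whether the object's layout has been initialized: if it is 1, either the object is compact or its data pointer has been filled in.

As mentioned above, an ASCII string is created with PyUnicode_New and stored in a PyASCIIObject struct. A non-ASCII string created with PyUnicode_New is stored in a PyCompactUnicodeObject struct. A PyUnicodeObject is created with PyUnicode_FromUnicode(NULL, len); the actual string data initially lives in the wstr block and is then copied into the data block by _PyUnicode_Ready.

Now let's look at PyUnicode_Type: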

PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",              /* tp_name */
    sizeof(PyUnicodeObject),        /* tp_size */
    ……
    unicode_repr,           /* tp_repr */
    &unicode_as_number,         /* tp_as_number */
    &unicode_as_sequence,       /* tp_as_sequence */
    &unicode_as_mapping,        /* tp_as_mapping */
    (hashfunc) unicode_hash,        /* tp_hash*/
    ……
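};

As you can see, str in Python 3 really is the former unicode: the type is named "str", but its slots are filled with unicode_* functions.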

Creating string objects

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size) {
    PyObject *unicode;
    Py_UCS4 maxchar = 0;
    Py_ssize_t num_surrogates;

    if (u == NULL)
        return (PyObject*)_PyUnicode_New(size);

    /* If the Unicode data is known at construction time, we can apply some optimizations which share commonly used objects. */

    /* Optimization for empty strings */
    if (size == 0)
        _Py_RETURN_UNICODE_EMPTY();

    /* Single character Unicode objects in the Latin-1 range are shared when using this constructor */
    if (size == 1 && (Py_UCS4)*u < 256)
        return get_latin1_char((unsigned char)*u);

    /* If not empty and not single character, copy the Unicode data into the new object */
    if (find_maxchar_surrogates(u, u + size,
                                &maxchar, &num_surrogates) == -1)
        return NULL;

    unicode = PyUnicode_New(size - num_surrogates, maxchar);
    if (!unicode)
        return NULL;

    switch (PyUnicode_KIND(unicode)) {
    case PyUnicode_1BYTE_KIND:
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, unsigned char,
                                u, u + size, PyUnicode_1BYTE_DATA(unicode));
        break;
    case PyUnicode_2BYTE_KIND:
#if Py_UNICODE_SIZE == 2
        memcpy(PyUnicode_2BYTE_DATA(unicode), u, size * 2);
#else
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, Py_UCS2,
                                u, u + size, PyUnicode_2BYTE_DATA(unicode));
#endif
        break;
    case PyUnicode_4BYTE_KIND:
#if SIZEOF_WCHAR_T == 2
        /* This is the only case which has to process surrogates, thus a simple copy loop is not enough and we need a function. */
        unicode_convert_wchar_to_ucs4(u, u + size, unicode);
#else
        assert(num_surrogates == 0);
        memcpy(PyUnicode_4BYTE_DATA(unicode), u, size * 4);
#endif
        break;
    default:
        assert(0 && "Impossible state");
    }

    return unicode_result(unicode);
}

PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) {
    PyObject *obj;
    PyCompactUnicodeObject *unicode;
    void *data;
    enum PyUnicode_Kind kind;
    int is_sharing, is_ascii;
    Py_ssize_t char_size;
    Py_ssize_t struct_size;

    /* Optimization for empty strings */
    if (size == 0 && unicode_empty != NULL) {
        Py_INCREF(unicode_empty);
        return unicode_empty;
    }

    is_ascii = 0;
    is_sharing = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }
    else if (maxchar < 65536) {
        kind = PyUnicode_2BYTE_KIND;
        char_size = 2;
        if (sizeof(wchar_t) == 2)
            is_sharing = 1;
    }
    else {
        if (maxchar > MAX_UNICODE) {
            PyErr_SetString(PyExc_SystemError,
                            "invalid maximum character passed to PyUnicode_New");
            return NULL;
        }
        kind = PyUnicode_4BYTE_KIND;
        char_size = 4;
        if (sizeof(wchar_t) == 4)
            is_sharing = 1;
    }

    /* Ensure we won't overflow the size. */
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to PyUnicode_New");
        return NULL;
    }
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();

    /* Duplicated allocation code from _PyObject_New() instead of a call to PyObject_New() so we are able to allocate space for the object and its data buffer. */
    obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
    if (obj == NULL)
        return PyErr_NoMemory();
    obj = PyObject_INIT(obj, &PyUnicode_Type);
    if (obj == NULL)
        return NULL;

    unicode = (PyCompactUnicodeObject *)obj;
    if (is_ascii)
        data = ((PyASCIIObject*)obj) + 1;
    else
        data = unicode + 1;
    _PyUnicode_LENGTH(unicode) = size;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = kind;
    _PyUnicode_STATE(unicode).compact = 1;
    _PyUnicode_STATE(unicode).ready = 1;
    _PyUnicode_STATE(unicode).ascii = is_ascii;
    if (is_ascii) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
    }
    else if (kind == PyUnicode_1BYTE_KIND) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
        _PyUnicode_WSTR_LENGTH(unicode) = 0;
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
    }
    else {
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
        if (kind == PyUnicode_2BYTE_KIND)
            ((Py_UCS2*)data)[size] = 0;
        else /* kind == PyUnicode_4BYTE_KIND */
            ((Py_UCS4*)data)[size] = 0;
        if (is_sharing) {
            _PyUnicode_WSTR_LENGTH(unicode) = size;
            _PyUnicode_WSTR(unicode) = (wchar_t *)data;
        }
        else {
            _PyUnicode_WSTR_LENGTH(unicode) = 0;
            _PyUnicode_WSTR(unicode) = NULL;
        }
    }
#ifdef Py_DEBUG
    unicode_fill_invalid((PyObject*)unicode, 0);
#endif
    assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
    return obj;
}

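Let's first walk through PyUnicode_FromUnicode. If the u passed in is a null pointer, it calls _PyUnicode_New(size) and directly returns a PyUnicodeObject of the requested size whose value has not been filled in yet. If size == 0, it returns the shared empty string through _Py_RETURN_UNICODE_EMPTY(). If the input is a single character in the Latin-1 range, it returns the cached object for that character; this resembles the small-integer pool described in the previous chapter, except that here it is a pool of characters. If none of these cases applies, it scans the input for the largest character and the number of surrogates, creates a new object with PyUnicode_New, and copies the data into it at the width given by the object's kind.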
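The two caches described above can be observed from Python. This is a small sketch that assumes CPython, where these are implementation details rather than language guarantees; shares_object is a hypothetical helper, and as far as I know chr() reaches the same Latin-1 cache through a different C constructor than PyUnicode_FromUnicode.

```python
def shares_object(make):
    """Return True if two separate calls to `make` yield the same object."""
    return make() is make()


if __name__ == "__main__":
    # Empty string built at run time: typically the shared empty singleton.
    print(shares_object(lambda: "".join([])))
    # Single character in the Latin-1 range (code point < 256).
    print(shares_object(lambda: chr(0xE9)))
    # Single character outside Latin-1: usually a fresh object on each call.
    print(shares_object(lambda: chr(0x4E2D)))
```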

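The flow of PyUnicode_New is easy to follow. It takes the string's size and maxchar. maxchar decides both the kind (1, 2, or 4 bytes per character) and whether the header is a PyASCIIObject (maxchar < 128) or a PyCompactUnicodeObject; size then determines how much memory is allocated. Note that the result is always compact: the function never creates a legacy PyUnicodeObject, since it allocates the header and the character buffer in a single block and sets compact and ready to 1.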
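The branching on maxchar can be restated as a short Python model. This is only a sketch of the decision logic in the listing above, not CPython code: Layout, choose_layout, and allocation_size are hypothetical names, and header_size is left as a parameter because the sizeof values of the C structs depend on the platform and the Python version.

```python
from typing import NamedTuple

MAX_UNICODE = 0x10FFFF


class Layout(NamedTuple):
    kind: int        # bytes per character: 1, 2 or 4
    is_ascii: bool   # True selects the smaller PyASCIIObject header
    struct: str      # name of the C struct used as the header


def choose_layout(maxchar: int) -> Layout:
    """Mirror the maxchar branches of PyUnicode_New."""
    # The C parameter is unsigned; the negative check exists only in this model.
    if maxchar < 0 or maxchar > MAX_UNICODE:
        raise ValueError("invalid maximum character")
    if maxchar < 128:
        return Layout(1, True, "PyASCIIObject")
    if maxchar < 256:
        return Layout(1, False, "PyCompactUnicodeObject")
    if maxchar < 65536:
        return Layout(2, False, "PyCompactUnicodeObject")
    return Layout(4, False, "PyCompactUnicodeObject")


def allocation_size(size: int, maxchar: int, header_size: int) -> int:
    """Bytes requested in one block: header plus (size + 1) characters."""
    if size < 0:
        raise ValueError("negative size")
    return header_size + (size + 1) * choose_layout(maxchar).kind
```

For example, choose_layout(ord("中")) selects a 2-byte kind with the PyCompactUnicodeObject header, and the extra slot in (size + 1) is the trailing NUL that the C code writes at data[size].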

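The intern mechanism

We mentioned the intern mechanism earlier. It means that when a string is interned and a string object with the same value already exists, Python returns a reference to the existing object instead of the newly created one. Where does Python look? It maintains a dictionary named interned whose keys are the string objects themselves. The mechanism does not apply to every string object. Roughly speaking, Python applies it to strings that follow the naming rules for Python identifiers, that is, strings made up only of letters, digits, and underscores. The standard library also provides a function that forces interning of a given string, sys.intern(). Here is its documentation: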

Enter string in the table of "interned" strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
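A short usage sketch of sys.intern follows. The strings are assembled at run time so that they are not compile-time constants; build is a hypothetical helper, and whether the two uninterned strings happen to be the same object is an implementation detail.

```python
import sys


def build(parts):
    """Assemble a string at run time so it is not a compile-time constant."""
    return " ".join(parts)


a = build(["hello", "world"])
b = build(["hello", "world"])

# Equal values; identity before interning is an implementation detail.
print(a == b, a is b)

# After interning, both names refer to the single entry in the interned table.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)
```

Keeping the return value matters: as the documentation quoted above says, an interned string is not immortal, so the benefit lasts only while a reference to it is held.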

The mechanism itself is implemented by the following C functions:

PyObject *PyUnicode_InternFromString(const char *cp) {
    PyObject *s = PyUnicode_FromString(cp);
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);
    return s;
}

void PyUnicode_InternInPlace(PyObject **p) {
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt. The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
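}

When Python calls PyUnicode_InternFromString, it gets back an interned object; the actual work is done by PyUnicode_InternInPlace.

In fact, even when Python is going to intern a string, it first creates a PyUnicodeObject and only then checks whether an object with the same value already exists. If one does, the object stored in interned is returned, and the newly created object is reclaimed because its reference count drops to zero.

Interned objects fall into two classes, mortal and immortal. The former can be reclaimed; the latter are never reclaimed and live as long as the Python virtual machine.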
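The lookup step of PyUnicode_InternInPlace can be modeled in a few lines of Python. This is a sketch under assumptions, not the real implementation: _interned stands in for CPython's private interned dictionary, intern_in_place is a hypothetical name, and the reference-count adjustment and the mortal/immortal state are not modeled.

```python
# Hypothetical stand-in for CPython's private `interned` dictionary.
_interned = {}


def intern_in_place(s):
    """Return the canonical object for the value of `s`.

    dict.setdefault stores `s` when no equal key exists and otherwise
    returns the object stored earlier, like PyDict_SetDefault in C.
    """
    if type(s) is not str:      # subclasses are left alone, as in the C code
        return s
    return _interned.setdefault(s, s)


first = "".join(["inter", "ned", " ", "value"])
second = "".join(["inter", "ned", " ", "value"])
canonical = intern_in_place(first)    # first call: `first` itself is stored
same = intern_in_place(second)        # later call: the stored object returns
print(canonical is first, same is first)
```

As in the C code, the second string is created before the lookup happens; it simply becomes garbage once the caller switches to the canonical object.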

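Efficiency issues related to PyUnicodeObject

The original book 《Python源碼剖析》 says that concatenating strings with + is extremely inefficient, because every concatenation creates a new string object, and it recommends the string join method instead. In my own tests on Python 3.6, concatenating with + took about as long as using join. Of course, this was only a test in my particular environment, and I do not know the real answer yet.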
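A comparison of this kind can be reproduced with timeit. The sketch below is one possible setup, not the test the measurement above came from; concat_plus, concat_join, and compare are hypothetical names, and the input size and repeat count are arbitrary. One likely reason the gap is small is that CPython can sometimes extend the left operand of += in place when nothing else references it, though this is an implementation detail and not something to rely on.

```python
import timeit


def concat_plus(parts):
    """Build the result with repeated +=."""
    result = ""
    for part in parts:
        result += part
    return result


def concat_join(parts):
    """Build the result with a single str.join call."""
    return "".join(parts)


def compare(n_parts=10000, repeats=20):
    """Time both approaches on the same input; returns seconds for each."""
    parts = ["x"] * n_parts
    t_plus = timeit.timeit(lambda: concat_plus(parts), number=repeats)
    t_join = timeit.timeit(lambda: concat_join(parts), number=repeats)
    return t_plus, t_join


if __name__ == "__main__":
    t_plus, t_join = compare()
    print(f"+=: {t_plus:.4f}s  join: {t_join:.4f}s")
```

The numbers depend heavily on the interpreter version, the platform, and the shape of the input, so a single run says little about the general case.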

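Summary

In Python 3, str is implemented as unicode under the hood, which neatly resolves the many awkward problems Python 2 had with non-ASCII strings. At the same time, Python treats ASCII and non-ASCII strings differently internally, and because UTF-8 is compatible with ASCII, this balances performance against simplicity. Immutable objects in Python often come with something similar to the intern mechanism, which lets Python avoid unnecessary memory use. The real implementation is a compromise, though: applying interning indiscriminately could cause extra computation and lookups, which would defeat the purpose of optimizing performance.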