Understanding the Memory Layout and Design Philosophy of NumPy's ndarray

Date: 2020-02-11


The main goal of this article is to understand the memory structure of numpy.ndarray and the design philosophy behind it.

What is an ndarray

NumPy provides an N-dimensional array type, the ndarray, which describes a collection of "items" of the same type. The items can be indexed using for example N integers.

-- from https://docs.scipy.org/doc/numpy-1.17.0/reference/arrays.html

An ndarray is NumPy's multidimensional array; its elements all have the same type and can be indexed.

For example:

>>> import numpy as np
>>> a = np.array([[0,1,2,3],[4,5,6,7],[8,9,10,11]])
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> type(a)
<class 'numpy.ndarray'>
>>> a.dtype   
dtype('int32')
>>> a[1,2]
6
>>> a[:,1:3]
array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

>>> a.ndim    
2
>>> a.shape   
(3, 4)        
>>> a.strides 
(16, 4)

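Note: np.array is not a class but one of the functions used to create np.ndarray objects; the class of NumPy's multidimensional array is np.ndarray.

The design philosophy of ndarray

The design philosophy of ndarray is to separate the stored data from the way it is interpreted, or in other words to separate copies from views, so that as many operations as possible act on the interpretation (the view) and as few as possible touch the memory that actually holds the data.

As shown below, reshape returns a new object b; a and b have different shapes but share the same data block. Likewise c = b.T is the transpose of b, yet the two still share the same data block. The data does not change; only its interpretation does.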

>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b = a.reshape(4, 3)
>>> b
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

# reshape returns a view here: only the interpretation of the data changes, the data address stays the same
>>> a.ctypes.data
80831392
>>> b.ctypes.data
80831392
>>> id(a) == id(b)
False

# the data is stored contiguously in memory
>>> from ctypes import string_at
>>> string_at(b.ctypes.data, b.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'

# c is the transpose of b; it still shares the same data block and only the interpretation changes: row-major storage read in column-major order
>>> c = b.T
>>> c
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> c.ctypes.data
80831392
>>> string_at(c.ctypes.data, c.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

# copy duplicates the data; the new buffer lives at a different address
>>> c = b.copy()
>>> c
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> c.ctypes.data
80831456
>>> string_at(c.ctypes.data, c.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'

# slicing also returns a view that still points into the original data block
>>> d = b[1:3, :]
>>> d
array([[3, 4, 5],
       [6, 7, 8]])
>>> d.ctypes.data
80831404
>>> print('data buff address from {0} to {1}'.format(b.ctypes.data, b.ctypes.data + b.nbytes))
data buff address from 80831392 to 80831440

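A copy is a complete duplicate of the data. Modifying a copy does not affect the original data, because the two live at different physical memory locations.

A view is an alias of, or reference to, the data; the original data can be accessed and manipulated through it without being copied. Modifying a view does affect the original data, because both refer to the same physical memory.

Views typically arise when:

1. A NumPy slicing operation returns a view of the original data.

2. The view() method of an ndarray is called.

Copies typically arise when:

Slicing a Python sequence (which yields a shallow copy), or calling copy.deepcopy().

Calling the copy() method of an ndarray.

-- from "NumPy Copies and Views"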

The benefits of the view mechanism are obvious: it saves memory and it is fast.

The memory layout of ndarray
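The difference between views and copies can be checked directly. The following is a minimal sketch, not taken from the original article; the variable names are illustrative, and the dtype is fixed to int32 only to match the examples above.

```python
import copy
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)

view = a[:, 1:3]         # basic slicing returns a view
dup = a[:, 1:3].copy()   # copy() allocates a separate buffer

view_shares = np.shares_memory(a, view)   # the view overlaps a's buffer
copy_shares = np.shares_memory(a, dup)    # the copy does not

view[0, 0] = 100         # writes through to a[0, 1]
dup[0, 1] = -1           # leaves a[0, 2] untouched

# Plain Python lists behave differently: slicing makes a shallow copy,
# copy.deepcopy() also duplicates the nested objects.
nested = [[1, 2], [3, 4]]
shallow = nested[:]
deep = copy.deepcopy(nested)
shallow[0][0] = 99       # the inner list is shared, so nested sees 99
deep[1][1] = -4          # fully independent, nested is unaffected

print(view_shares, copy_shares)
print(a[0])
print(nested)
```

np.shares_memory is a convenient way to tell whether two arrays overlap in memory without comparing raw addresses by hand.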

NumPy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. NumPy also contains a significant set of data that describes how to interpret the data in the data buffer.

-- from NumPy internals

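[Figure: schematic of the ndarray memory layout]

The layout can be roughly divided into two parts, corresponding to the data and the interpretation in the design philosophy:

raw array data: a contiguous memory block that stores the raw data, laid out contiguously like an array in C or Fortran
metadata: describes how the memory block above is to be interpreted

What does the metadata contain?

dtype: the data type, which says how many bytes each item occupies and how those bytes are interpreted, e.g. int32 or float32;
ndim: the number of dimensions;
shape: the number of items along each dimension;
strides: the step per dimension, i.e. the number of bytes to advance to reach the next adjacent item along that dimension; because of memory alignment it is not necessarily an integer multiple of the item size;

These four pieces of information make up the indexing scheme of an ndarray: how to locate the item at a given index, and how to interpret it.

Other information includes byte order (big-endian or little-endian), read/write permission, and C-order (row-major) versus Fortran-order (column-major) storage, as shown below: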
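The split between raw data and metadata can be illustrated by attaching several different interpretations to one buffer. This is a small sketch under the assumption of a 1-D int32 source array; the names buf, as_3x4, as_4x3 and as_bytes are made up for the example.

```python
import numpy as np

buf = np.arange(12, dtype=np.int32)   # one contiguous 48-byte data buffer

as_3x4 = buf.reshape(3, 4)            # same bytes read as 3 rows of 4 items
as_4x3 = buf.reshape(4, 3)            # same bytes read as 4 rows of 3 items
as_bytes = buf.view(np.uint8)         # same bytes read as 48 single bytes

# All three views start at the same address; only the metadata differs.
same_start = (as_3x4.ctypes.data == as_4x3.ctypes.data == as_bytes.ctypes.data)

print(as_3x4.shape, as_3x4.strides)
print(as_4x3.shape, as_4x3.strides)
print(as_bytes.shape, as_bytes.strides)
print(same_start)
```

None of these operations moves or duplicates the 48 bytes; each one only creates a new small header describing how to read them.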
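The indexing scheme can be reproduced by hand. The sketch below is illustrative only: item_at is a hypothetical helper, not a NumPy function, and it assumes the view starts at the beginning of the buffer (no slicing offset). It computes a byte offset from an index and the strides, then reads one item from a byte copy of the buffer.

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)
raw = a.tobytes()   # byte copy of the 48-byte buffer, in C order


def item_at(raw, dtype, strides, index):
    """Hypothetical helper: locate one item using only the metadata."""
    # byte offset = sum over dimensions of index[k] * strides[k]
    offset = sum(i * s for i, s in zip(index, strides))
    return np.frombuffer(raw, dtype=dtype, count=1, offset=offset)[0]


# a has strides (16, 4): a[1, 2] sits at byte 1*16 + 2*4 = 24
x = item_at(raw, a.dtype, a.strides, (1, 2))

# a.T has strides (4, 16): a.T[2, 1] sits at byte 2*4 + 1*16 = 24,
# i.e. the very same item, reached through different metadata
y = item_at(raw, a.dtype, a.T.strides, (2, 1))

print(a.strides, a.T.strides)
print(x, y)
```

Both lookups land on byte 24 of the same buffer, which is why a transpose needs no data movement.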

>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

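The core of ndarray is implemented in C, and the attributes above map onto fields in its source code; see the PyArrayObject and PyArray_Descr structs for details.

Why this design is possible

Why can ndarray be designed this way?

Because ndarray serves matrix computation, all of its items share one type, such as int32 or float64. Each item occupies the same number of bytes and is interpreted the same way, so the items can be packed densely. When an item is read out, its bytes are copied and wrapped into a scalar object on the fly according to the dtype. This saves a great deal of space, since the non-data fields of a scalar object need not be stored once per item, and because the memory is contiguous, items can be accessed directly by index, which is also much faster.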
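The flags differ between an array and its views, which can be inspected as follows. This is a minimal sketch; the copy() call is only there so that a is guaranteed to own its buffer, and write_blocked is an illustrative variable name.

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4).copy()  # a owns its buffer
t = a.T          # transpose: same buffer, read column-major
s = a[:, 1:3]    # column slice: same buffer, with gaps between rows

for name, arr in (("a", a), ("a.T", t), ("a[:, 1:3]", s)):
    f = arr.flags
    print(name, f.c_contiguous, f.f_contiguous, f.owndata)

# WRITEABLE is part of the metadata too: clearing it blocks assignment.
a.setflags(write=False)
try:
    a[0, 0] = 42
    write_blocked = False
except ValueError:
    write_blocked = True
print(write_blocked)
```

The transpose reports F_CONTIGUOUS instead of C_CONTIGUOUS, and neither view reports OWNDATA, since both borrow the buffer of a.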

>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1,1]
5
>>> i,j = a[1,1], a[1,1]

# i and j are different objects: each access "assembles" a new scalar object
>>> id(i)
102575536
>>> id(j)
102575584
>>> a[1,1] = 4
>>> i
5
>>> j
5
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  4,  6,  7],
       [ 8,  9, 10, 11]])

# isinstance(val, np.generic) will return True if val is an array scalar object. Alternatively, what kind of array scalar is present can be determined using other members of the data type hierarchy.
>>> isinstance(i, np.generic)
True

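It is instructive to compare ndarray with Python's list. A list can hold objects of different types: strings, ints, tuples and so on can all sit in the same list. A list therefore stores references to objects, and each object is found by following its reference; the objects themselves are not stored at contiguous physical addresses.

Compared with ndarray, a list needs one extra indirection to reach the data: it offers direct access by index only to the object references, not to the data itself. In terms of speed, ndarray is therefore much faster than list. In terms of space, ndarray stores only the tightly packed data, whereas a list has to keep every object with all of its fields, so ndarray is also more compact.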
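The space argument can be made concrete with a rough measurement. The sketch below assumes CPython, where every int is a full object; the exact byte counts vary by Python version and platform, so none are quoted here.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int32)

# The list object itself holds only an array of references...
list_refs_bytes = sys.getsizeof(py_list)
# ...and each referenced int is a separate object with its own header.
list_items_bytes = sum(sys.getsizeof(x) for x in py_list)
# The ndarray buffer holds just the packed payload: n items * 4 bytes.
array_data_bytes = arr.nbytes

print(list_refs_bytes, list_items_bytes, array_data_bytes)

# A list hands back the stored object; an ndarray builds a new scalar
# object on every access.
same_list_obj = py_list[5] is py_list[5]
same_array_obj = arr[5] is arr[5]
print(same_list_obj, same_array_obj)
```

Note that arr.nbytes counts only the data buffer; the ndarray header adds a small fixed overhead that does not grow with the number of items.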