[NumPy] Memory Analysis: Advanced Indexing and Interpreting In-Memory Data

Date: 2019-11-06

Tags: NumPy, memory analysis, advanced indexing, data parsing


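In a computer, no piece of data has an intrinsic type; its type depends entirely on how the underlying memory region is interpreted.

numpy.ndarray.view lets you carve up the same memory region in a different way and so change the data type without making an extra copy of the data, which saves memory. You can think of a view as one way of presenting a block of memory.

For example: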

import numpy as np

# np.int and np.float were removed in NumPy 1.24; use explicit 64-bit types
x = np.arange(10, dtype=np.int64)

print('An integer array:', x)
print('A float array:', x.view(np.float64))

An integer array: [0 1 2 3 4 5 6 7 8 9]

A float array: [  0.00000000e+000   4.94065646e-324
   9.88131292e-324   1.48219694e-323   1.97626258e-323
   2.47032823e-323   2.96439388e-323   3.45845952e-323
   3.95252517e-323   4.44659081e-323]

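In practice we often use a more complex dtype to read out the values in memory (that is, view can be combined with a dtype). A more involved use of views on structured arrays is demonstrated later in this article.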
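As a small illustration of combining view with a compound dtype, the sketch below reinterprets four 16-bit integers as two records. The record layout `pair` and its field names `lo` and `hi` are made up for this example and do not come from the original article.

```python
import numpy as np

# Four little-endian 16-bit integers occupy 8 bytes in total.
raw = np.array([1, 2, 3, 4], dtype='<i2')

# Hypothetical record layout (illustration only): two 16-bit fields per record.
pair = np.dtype([('lo', '<i2'), ('hi', '<i2')])

records = raw.view(pair)   # the same 8 bytes, now read as 2 records
print(records.shape)       # (2,)
print(records['lo'])       # [1 3]
print(records['hi'])       # [2 4]

# The view shares memory with raw, so a write through it shows up in raw.
records['hi'][0] = 20
print(raw.tolist())        # [1, 20, 3, 4]
```

No bytes are moved at any point; only the description of how to read them changes.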

1. Views and copies

We start with the numpy.reshape() function. Its documentation describes the return value as follows:

    Returns
    -------
    reshaped_array : ndarray
        This will be a new view object if possible; otherwise, it will
        be a copy. Note there is no guarantee of the *memory layout* (C- or
        Fortran- contiguous) of the returned array.

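The return value may be either a view or a copy. The conditions are:

    1. A view is returned when the new shape can be described by strides over the existing buffer; a contiguous array always qualifies.

    2. Otherwise, a copy is returned.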

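This introduces a new concept: whether an array's memory region is contiguous. A NumPy array reports this through flags['C_CONTIGUOUS'], and np.may_share_memory checks whether two arrays might occupy overlapping memory (it only compares address ranges; np.shares_memory performs the exact check):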
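The difference between the two checks can be sketched with two interleaved slices of one buffer. The names `evens` and `odds` are only for this illustration.

```python
import numpy as np

a = np.arange(10)
evens = a[::2]    # elements 0, 2, 4, 6, 8
odds = a[1::2]    # elements 1, 3, 5, 7, 9

# The two views interleave inside one buffer but never touch the same element.
print(np.may_share_memory(evens, odds))  # True: their address ranges overlap
print(np.shares_memory(evens, odds))     # False: no element is actually shared

print(np.shares_memory(a, evens))        # True: evens is a view of a
print(np.shares_memory(a, a.copy()))     # False: a copy has its own buffer
```

may_share_memory is cheap and conservative, so a True result means "possibly"; shares_memory gives the exact answer at a potentially higher cost.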

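a = np.zeros([2, 10], dtype=np.int32)
b = a.T  # transposing breaks C-contiguity

a.flags['C_CONTIGUOUS']  # True
b.flags['C_CONTIGUOUS']  # False

np.may_share_memory(a, b)  # True
b.base is a  # True
id(b) == id(a)  # False


a.shape = 20  # changes the shape of a in place
a.flags['C_CONTIGUOUS']  # True

# b.shape = 20
# AttributeError: incompatible shape for a non-contiguous array
# Assigning to .shape never copies, so it only works when the new shape fits the
# existing memory layout (for example a contiguous array). reshape() is not
# affected, because it leaves the original array unchanged and can return a copy.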

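Does slicing an array copy data?

A slice of an array is often not contiguous, yet reshaping it does not necessarily copy the data into a new object. Below, a[:, 0].reshape(4) is still a view, while a.T.reshape(16) has to copy: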

a = np.arange(16).reshape(4, 4)

print(a.T.flags['C_CONTIGUOUS'], a[:, 0].flags['C_CONTIGUOUS'])
# False False

print(np.may_share_memory(a, a.T.reshape(16)),
      np.may_share_memory(a, a[:, 0].reshape(4)))
# False True
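Looking at the strides helps explain why one reshape stays a view and the other copies. The following is a minimal sketch; it pins the dtype to int64 so the byte counts are the same on every platform, and `col` and `col2` are names chosen for this illustration.

```python
import numpy as np

a = np.arange(16, dtype=np.int64).reshape(4, 4)

print(a.strides)      # (32, 8): 32 bytes to the next row, 8 to the next column
col = a[:, 0]
print(col.strides)    # (32,): evenly spaced, but not contiguous
print(a.T.strides)    # (8, 32): the two steps are swapped

# A column is still regularly spaced, so a new shape can be expressed with
# new strides over the same buffer and no copy is needed.
col2 = col.reshape(2, 2)
print(col2.strides)                      # (64, 32)
print(np.shares_memory(a, col2))         # True

# Flattening the transpose in C order would need the element order
# 0, 4, 8, 12, 1, 5, ... which no single stride can describe, so NumPy copies.
flat_t = a.T.reshape(16)
print(np.shares_memory(a, flat_t))       # False
```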

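However, as the next section shows, advanced (fancy) indexing does copy the array and allocates new memory.

2. NumPy structured arrays

np.dtype can be used to build structured arrays, and numpy.ndarray.base returns the object that owns the array's memory. Its documentation reads: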

Help on getset descriptor numpy.ndarray.base:

base
    Base object if memory is from some other object.

    Examples
    --------
    The base of an array that owns its memory is None:

    >>> x = np.array([1,2,3,4])
    >>> x.base is None
    True

    Slicing creates a view, whose memory is shared with x:

    >>> y = x[2:]
    >>> y.base is x
    True

2.1 Creating a structured array

persontype = np.dtype({
    'names':['name','age','weight','height'],
    'formats':['S30','i','f','f']}, align=True)
a = np.array([('Zhang',32,72.5,167),
              ('Wang',24,65,170)],dtype=persontype)
a['age'].base

array([(b'Zhang', 32, 72.5, 167.),
       (b'Wang', 24, 65. , 170.)],
      dtype={'names':['name','age','weight','height'],
             'formats':['S30','<i4','<f4','<f4'],
             'offsets':[0,32,36,40],
             'itemsize':44,
             'aligned':True})
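The layout in that output can also be inspected programmatically. This is a small sketch that rebuilds the same array and prints each field's dtype and byte offset; it assumes the C int type 'i' is 4 bytes, as the '<i4' in the output above indicates.

```python
import numpy as np

persontype = np.dtype({
    'names': ['name', 'age', 'weight', 'height'],
    'formats': ['S30', 'i', 'f', 'f']}, align=True)
a = np.array([('Zhang', 32, 72.5, 167),
              ('Wang', 24, 65, 170)], dtype=persontype)

# dtype.fields maps each field name to (field dtype, byte offset in the record).
for name in a.dtype.names:
    field_dtype, offset = a.dtype.fields[name][:2]
    print(name, field_dtype, offset)

print(a.dtype.itemsize)   # 44 bytes per record

# Selecting one field yields a view that steps one whole record at a time.
age = a['age']
print(age.strides)        # (44,)
print(age.base is a)      # True
```

The 'S30' name field ends at byte 30, and align=True pads it so the 4-byte integer starts at offset 32.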

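2.2 How advanced indexing differs from basic slicing

In the session below, a is a plain one-dimensional integer array (its contents appear in Out[28]), not the structured array from the previous subsection.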

In [26]: a.base
In [27]: a[0].base
In [28]: a[:1].base
Out[28]: array([123,   4,   5,   6,  78])
In [29]: a[[0,1]].base

In [30]: a.base is None
Out[30]: True
In [31]: a[0].base is None
Out[31]: True
In [32]: a[:1].base is None
Out[32]: False
In [33]: a[[0,1]].base is None
Out[33]: True

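As the session shows, advanced indexing allocates new memory and copies the selected elements. A view can only describe memory as a start offset plus a fixed stride per axis, which is exactly what a basic slice produces and why it is cheap. An arbitrary list of indices generally has no such regular layout (logically adjacent elements are not evenly spaced in memory), so NumPy copies them out into a new contiguous array.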
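One practical consequence is that writes behave differently in the two cases. The sketch below uses the same values as the session above; `basic` and `fancy` are names chosen for this illustration.

```python
import numpy as np

a = np.array([123, 4, 5, 6, 78])

basic = a[:2]        # basic slice: a view onto a's memory
fancy = a[[0, 1]]    # index list: a fresh copy

basic[0] = -1        # writes through to a
fancy[1] = -2        # only changes the copy

print(a.tolist())      # [-1, 4, 5, 6, 78]
print(fancy.tolist())  # [123, -2]
print(basic.base is a, fancy.base is None)  # True True
```

Note that this concerns reading with an index list. Assigning through one, as in a[[0, 1]] = 0, modifies a in place, because no intermediate array is created.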

2.3 Selecting multiple fields without allocating new memory

Returning to the structured array from section 2.1:

print(a['age'].base is a)
print(a[['age', 'height']].base is None)

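True

True

(These results come from the NumPy version in use when this article was written. Since NumPy 1.16, indexing with a list of field names returns a view rather than a copy, so on current versions the second line prints False.)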

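By spelling out how the memory should be interpreted, we can avoid allocating new memory: the original buffer is reinterpreted as a structured array that contains only the selected fields.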

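def fields_view(arr, fields):
    """Return a view of structured array `arr` limited to `fields`, without copying."""
    # arr.dtype.fields maps each name to (dtype, byte offset), so the new dtype
    # keeps every selected field at its original offset inside the record.
    dtype2 = np.dtype({name: arr.dtype.fields[name] for name in fields})
    # print(dtype2)
    # {'names':['age','weight'], 'formats':['<i4','<f4'], 'offsets':[32,36], 'itemsize':40}
    # print([(name,arr.dtype.fields[name]) for name in fields])
    # [('age', (dtype('int32'), 32)), ('weight', (dtype('float32'), 36))]
    # print(arr.strides)
    # (44,)
    # Reuse arr's buffer and strides, so consecutive records stay 44 bytes apart.
    return np.ndarray(arr.shape, dtype2, arr, 0, arr.strides)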
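'''
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

Parameter  Type                         Purpose
shape      tuple of ints                Shape of the array
dtype      data-type                    Type of the array elements
buffer     object exposing a buffer     Buffer used to fill the array with data
offset     int                          Byte offset of the first array element within the buffer
strides    tuple of ints                Bytes the data pointer advances when the index on each axis increases by 1
order      'C' or 'F'                   'C': row-major; 'F': column-major
'''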

v = fields_view(a, ['age', 'weight'])
print(v.base is a)

v['age'] += 10
print('+++'*10)
print(v)
print(v.dtype)
print(v.dtype.fields)
print('+++'*10)
print(a)
print(a.dtype)
print(a.dtype.fields)

True
++++++++++++++++++++++++++++++
[(42,  72.5) (34,  65. )]
{'names':['age','weight'], 'formats':['<i4','<f4'], 'offsets':[32,36], 'itemsize':40}
{'age': (dtype('int32'), 32), 'weight': (dtype('float32'), 36)}
++++++++++++++++++++++++++++++
[(b'Zhang', 42,  72.5,  167.) (b'Wang', 34,  65. ,  170.)]
{'names':['name','age','weight','height'], 'formats':['S30','<i4','<f4','<f4'], 'offsets':[0,32,36,40], 'itemsize':44, 'aligned':True}
{'name': (dtype('S30'), 0), 'age': (dtype('int32'), 32), 'weight': (dtype('float32'), 36), 'height': (dtype('float32'), 40)}

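Note the 'itemsize' entry of the dtype, which for an ordinary packed array is the number of bytes each additional record (row) adds. Because the 'offsets' are kept, the dtype we generated describes a sparse layout: the fields sit at bytes 32 and 36, so itemsize is 40 and bytes 0 to 31 are an unused gap. There is no padding after the last field, because the empty space comes only from the gaps implied by the recorded offsets of the real fields. The step from one record to the next is still given by the strides, (44,).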
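The relationship between itemsize and strides can be checked directly. This is a self-contained sketch that rebuilds the array from section 2.1 and constructs the same sparse dtype by hand. It spells the dtype in the names/formats/offsets form, which is an equivalent way of writing what fields_view builds; `sub` is a name chosen for this illustration.

```python
import numpy as np

persontype = np.dtype({
    'names': ['name', 'age', 'weight', 'height'],
    'formats': ['S30', 'i', 'f', 'f']}, align=True)
a = np.array([('Zhang', 32, 72.5, 167),
              ('Wang', 24, 65, 170)], dtype=persontype)

# Sparse dtype holding only 'age' and 'weight' at their original offsets.
sub = np.dtype({
    'names': ['age', 'weight'],
    'formats': [a.dtype['age'], a.dtype['weight']],
    'offsets': [a.dtype.fields['age'][1], a.dtype.fields['weight'][1]]})

# Same buffer, same strides, narrower description of each record.
v = np.ndarray(a.shape, sub, a, 0, a.strides)

print(sub.itemsize)   # 40: last offset (36) plus the 4-byte float
print(v.strides)      # (44,): records are still 44 bytes apart

v['age'] += 10        # writes land in a's memory
print(a['age'].tolist())   # [42, 34]
```

The view is therefore not contiguous: each record is described by 40 bytes, but the next one begins 44 bytes later.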

The article "[NumPy] Memory Analysis: Interpreting In-Memory Data with numpy.dtype" covers the ways of interpreting array memory in more detail.