[Pandas] 01 - A guy based on NumPy

The main goal: understand why NumPy is fast.


1. Learning Resources

  

 

2. Efficiency Evolution

NumPy is well optimized at the low level.

In [11]:
loops = 25000000
from math import *
# Strategy 1: compute each element one at a time
a = range(1, loops)
def f(x):
    return 3 * log(x) + cos(x) ** 2
%timeit r = [f(x) for x in a]
 
1 loop, best of 3: 14.4 s per loop
In [12]:
# Strategy 2: use NumPy
import numpy as np
a = np.arange(1, loops)
%timeit r = 3 * np.log(a) + np.cos(a) ** 2
 
10 loops, best of 3: 1.94 s per loop
In [13]:
# Strategy 3: numexpr, which avoids copying arrays around in memory
import numexpr as ne
ne.set_num_threads(1)
f = '3 * log(a) + cos(a) ** 2'
%timeit r = ne.evaluate(f)
 
10 loops, best of 3: 373 ms per loop
In [14]:
# Strategy 4: the same expression, now with four threads
ne.set_num_threads(4)  # enable multi-threading
%timeit r = ne.evaluate(f)
 
100 loops, best of 3: 195 ms per loop

 

 

3. Advanced Topics

(a) numexpr

Although NumPy's heavily optimized low-level routines achieve near-C performance, it can still feel a bit slow when the computation is complex and the data volume is large. Numexpr is a very simple, easy-to-use tool for boosting NumPy performance.

Goto: NumExpr: Fast numerical expression evaluator for NumPy

 

(b) CuPy: GPU acceleration

First install the CUDA driver: [CUDA] 00 - GPU Driver Installation & Concurrency Programming

Then install CuPy: CuPy Installation
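
Once both are installed, a minimal sketch of the idea (assuming a CUDA-capable GPU and the cupy package; the array size is only illustrative): CuPy mirrors most of the NumPy API, so the same expression as in Strategy 2 above can be evaluated on the GPU.

# A minimal sketch, assuming a CUDA-capable GPU and an installed `cupy` package.
import numpy as np
import cupy as cp

a_cpu = np.arange(1, 25_000_000, dtype=np.float64)
a_gpu = cp.asarray(a_cpu)                        # copy the host array into GPU memory

r_gpu = 3 * cp.log(a_gpu) + cp.cos(a_gpu) ** 2   # same expression as Strategy 2, run on the GPU
r_cpu = cp.asnumpy(r_gpu)                        # copy the result back to the host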

  

(c) PyPy vs. CPython

Background reading: the differences between PyPy and CPython

A simple explanation: why is PyPy faster than CPython?

CPython's approach

def add(x, y):
    return x + y

if instance_has_method(x, '__add__') {
    return call(x, '__add__', y)   // x.__add__ itself contains another pile of checks for different types of y
}
else if instance_has_method(super_class(x), '__add__') {
    return call(super_class(x), '__add__', y)
}
else if isinstance(x, str) and isinstance(y, str) {
    return concat_str(x, y)
}
else if isinstance(x, float) and isinstance(y, float) {
    return add_float(x, y)
}
else if isinstance(x, int) and isinstance(y, int) {
    return add_int(x, y)
}
else ...

Just-In-Time compilation

Why does this speed things up? While running, PyPy notices that the add function has been executed a hundred times and that every single call passes in just two integers, so there is no point in redoing all of that dispatch work each time. It therefore generates, on the spot, a C-like specialized function:

int add_int_int(int x, int y) {
    return x + y;
}
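
To see the effect yourself, a minimal sketch (the file name hot_loop.py is only an illustration): run the same script under CPython and under PyPy; the tight integer loop is exactly the kind of code the JIT specializes.

# hot_loop.py -- a minimal sketch; run it with both `python hot_loop.py` and `pypy hot_loop.py`
import time

def add(x, y):
    return x + y

start = time.time()
total = 0
for i in range(10_000_000):
    total = add(total, i)   # always two ints, so PyPy's JIT can emit a specialized integer add
print(total, time.time() - start)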

 

  

 

System-Level Performance Profiling


1. Baseline Experiment

The original code, used as the reference point for the optimizations that follow.

import numpy as np
import time

grid_size = (512,)


def laplacian(grid, out):
    np.copyto(out, grid)
    np.multiply(out, -2.0, out)
    np.add(out, np.roll(grid, +1), out)
    np.add(out, np.roll(grid, -1), out)


def evolve(grid, dt, out, D=1):
    laplacian(grid, out)
    np.multiply(out, D * dt, out)
    np.add(out, grid, out)

def run_experiment(num_iterations):
    scratch = np.zeros(grid_size)
    grid = np.zeros(grid_size)
    block_low = int(grid_size[0] * .4)
    block_high = int(grid_size[0] * .5)
    grid[block_low:block_high] = 0.005
    start = time.time()
    for i in range(num_iterations):
        evolve(grid, 0.1, scratch)
        grid, scratch = scratch, grid
    return time.time() - start


if __name__ == "__main__":
    run_experiment(500)

Performance analysis 1:

$ sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python diffusion_python_memory.py
 Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory.py' (3 runs):

         358139881      cycles                    #    3.605 GHz                      ( +-  2.30% )  (65.11%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         453383936      instructions              #    1.27  insn per cycle           ( +-  1.22% )  (82.56%)
          22986206      cache-references          #  231.400 M/sec                    ( +-  0.56% )  (82.71%)
           2022121      cache-misses              #    8.797 % of all cache refs      ( +- 15.29% )  (83.90%)
         100438152      branches                  # 1011.101 M/sec                    ( +-  1.59% )  (83.90%)
           3004796      branch-misses             #    2.99% of all branches          ( +-  0.49% )  (84.38%)
             99.34 msec task-clock                #    0.996 CPUs utilized            ( +-  4.31% )
              4984      faults                    #    0.050 M/sec                    ( +-  0.07% )
              4984      minor-faults              #    0.050 M/sec                    ( +-  0.07% )
                 1      cs                        #    0.007 K/sec                    ( +- 50.00% )
                 0      migrations                #    0.000 K/sec                  

           0.09976 +- 0.00448 seconds time elapsed  ( +-  4.49% )

 

 

2. Leveraging Built-in Optimizations

The benefit of writing the whole computation as one expression: shorter code often means higher performance, because it is easier for the built-in optimization machinery to take advantage of it.

def laplacian(grid):
    return np.roll(grid, +1) + np.roll(grid, -1) - 2 * grid

def evolve(grid, dt, D=1):
    return grid + dt * D * laplacian(grid)

Performance analysis 2: cache-references are still very high, but the key point is that the instruction count dropped to about 0.97x of the baseline.

 Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy.py' (3 runs):

         345131518      cycles                    #    3.368 GHz                      ( +-  2.05% )  (63.87%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         441624002      instructions              #    1.28  insn per cycle           ( +-  0.41% )  (82.69%)
          22383742      cache-references          #  218.428 M/sec                    ( +-  0.44% )  (84.39%)
           2334617      cache-misses              #   10.430 % of all cache refs      ( +-  4.03% )  (84.39%)
          98647251      branches                  #  962.634 M/sec                    ( +-  1.17% )  (84.39%)
           2950862      branch-misses             #    2.99% of all branches          ( +-  0.24% )  (82.97%)
            102.48 msec task-clock                #    0.998 CPUs utilized            ( +-  2.02% )
              4978      faults                    #    0.049 M/sec                    ( +-  0.04% )
              4978      minor-faults              #    0.049 M/sec                    ( +-  0.04% )
                 1      cs                        #    0.007 K/sec                    ( +- 50.00% )
                 0      migrations                #    0.000 K/sec                  

           0.10270 +- 0.00209 seconds time elapsed  ( +-  2.04% )

 

 

3. Reduce Memory Allocations, Improve Instruction Efficiency

Find the "right thing" to optimize

Using numpy we reduced the CPU burden; next we reduce the number of memory allocations needed to solve the problem.

In fact, the evolve function spends 93% of its time inside the laplacian function.

See: [Optimized Python] 17 - Bottle-neck in performance
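
That 93% figure comes from line-level profiling (covered in the linked article). A minimal sketch of how such a measurement can be reproduced, assuming the line_profiler package (kernprof) is installed:

# a minimal sketch, assuming line_profiler/kernprof is installed
@profile                           # this decorator is injected by kernprof at run time
def evolve(grid, dt, out, D=1):
    laplacian(grid, out)           # expected to dominate the per-line timings
    np.multiply(out, D * dt, out)
    np.add(out, grid, out)

# run: kernprof -l -v diffusion_numpy_memory.py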

 

Further optimization

The overhead of making a request to the operating system is much larger than that of simply refilling the cache.

Servicing a cache miss is a hardware operation that has been optimized on the motherboard, whereas a memory allocation requires talking to another process, the kernel, to complete.
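
A minimal sketch of that difference (the array size is arbitrary): the first form allocates a fresh output array on every call, while the second reuses a pre-allocated buffer.

import numpy as np

a = np.random.random(5_000_000)
out = np.empty_like(a)

%timeit a * 2.0                       # allocates a new result array on every call
%timeit np.multiply(a, 2.0, out)      # writes into the pre-allocated buffer instead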

(1) Using the "+=" operator removes unnecessary allocations of temporary variables.

(2) roll_add replaces the overly general np.roll, removing unnecessary instructions.

def roll_add(rollee, shift, out):
    if shift == 1:
        out[1:] += rollee[:-1]
        out[0] += rollee[-1]
    elif shift == -1:
        out[:-1] += rollee[1:]
        out[-1] += rollee[0]


def laplacian(grid, out):
    np.copyto(out, grid)
    np.multiply(out, -2.0, out)
    roll_add(grid, +1, out)
    roll_add(grid, -1, out)


def evolve(grid, dt, out, D=1):
    laplacian(grid, out)
    np.multiply(out, D * dt, out)
    np.add(out, grid, out)

Performance analysis 3: fewer page faults, higher instruction efficiency.

 Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2.py' (3 runs):

         314575986      cycles                    #    3.670 GHz                      ( +-  1.89% )  (62.67%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         402037134      instructions              #    1.28  insn per cycle           ( +-  0.19% )  (81.33%)
          18138094      cache-references          #  211.611 M/sec                    ( +-  2.65% )  (81.94%)
           2298500      cache-misses              #   12.672 % of all cache refs      ( +-  6.84% )  (84.68%)
          88677913      branches                  # 1034.575 M/sec                    ( +-  0.68% )  (86.00%)
           2885039      branch-misses             #    3.25% of all branches          ( +-  1.19% )  (84.71%)
             85.71 msec task-clock                #    0.997 CPUs utilized            ( +-  1.86% )
              4974      faults                    #    0.058 M/sec                    ( +-  0.10% )
              4974      minor-faults              #    0.058 M/sec                    ( +-  0.10% )
                 0      cs                        #    0.004 K/sec                    ( +-100.00% )
                 0      migrations                #    0.000 K/sec                  

           0.08596 +- 0.00163 seconds time elapsed  ( +-  1.89% )

 

 

4. Evaluate the Expression as a Whole

numexpr 的一個重要特色是它考慮到了 CPU 緩存。它特意移動數據來讓各級CPU 緩存可以擁有正確的數據讓緩存失效最小化。

(a) As we increase the size of the matrix, numexpr makes better use of our caches than native numpy. In addition, numexpr uses multiple cores for the computation and tries to keep each core's cache full.

(b) When the matrix is small, the overhead of managing multiple cores outweighs any possible speedup. A small sketch to observe (a) and (b) follows below.
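
A minimal sketch, assuming numexpr is installed: sweep the array size and compare both backends; the crossover point depends on cache sizes and core count.

# a minimal sketch: compare numpy and numexpr as the data outgrows the CPU caches
import timeit
import numpy as np
import numexpr as ne

for n in (10_000, 1_000_000, 50_000_000):
    a = np.random.random(n)
    t_np = timeit.timeit(lambda: 3 * np.log(a) + np.cos(a) ** 2, number=3)
    t_ne = timeit.timeit(lambda: ne.evaluate("3 * log(a) + cos(a) ** 2",
                                             local_dict={"a": a}), number=3)
    print(f"n={n:>10,}  numpy={t_np:.3f}s  numexpr={t_ne:.3f}s")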

import numexpr as ne

def evolve(grid, dt, out, D=1):
    laplacian(grid, out)
    ne.evaluate("out*D*dt+grid", out=out)

Performance analysis 4: at this small grid size, numexpr's threading shows up as extra page faults and context switches, so there is no gain here (see point (b) above).

 Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2_numexpr.py' (3 runs):

         389532813      cycles                    #    3.888 GHz                      ( +-  1.01% )  (65.40%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         478293442      instructions              #    1.23  insn per cycle           ( +-  1.70% )  (82.72%)
          23279515      cache-references          #  232.372 M/sec                    ( +-  0.91% )  (82.89%)
           2447425      cache-misses              #   10.513 % of all cache refs      ( +- 10.84% )  (83.77%)
         103451737      branches                  # 1032.639 M/sec                    ( +-  1.21% )  (84.97%)
           3324954      branch-misses             #    3.21% of all branches          ( +-  0.61% )  (82.97%)
            100.18 msec task-clock                #    1.000 CPUs utilized            ( +-  5.75% )
              8591      faults                    #    0.086 M/sec                    ( +-  0.01% )
              8591      minor-faults              #    0.086 M/sec                    ( +-  0.01% )
                12      cs                        #    0.123 K/sec                    ( +-  2.70% )
                 3      migrations                #    0.030 K/sec                  

           0.10018 +- 0.00582 seconds time elapsed  ( +-  5.81% )

 

 

5. Use an Off-the-Shelf Solution

Use the interface provided by scipy directly.

from scipy.ndimage import laplace  # the scipy.ndimage.filters namespace is deprecated in newer SciPy

def laplacian(grid, out):
    laplace(grid, out, mode='wrap')

Performance analysis 5: probably too general-purpose, so no obvious improvement.

 Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_scipy.py' (3 runs):

         346206970      cycles                    #    3.647 GHz                      ( +-  0.37% )  (63.49%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         441742003      instructions              #    1.28  insn per cycle           ( +-  1.10% )  (81.74%)
          21243859      cache-references          #  223.757 M/sec                    ( +-  0.37% )  (82.77%)
           2092136      cache-misses              #    9.848 % of all cache refs      ( +-  1.46% )  (84.55%)
          96943942      branches                  # 1021.088 M/sec                    ( +-  0.80% )  (85.19%)
           3120156      branch-misses             #    3.22% of all branches          ( +-  1.27% )  (84.00%)
             94.94 msec task-clock                #    0.997 CPUs utilized            ( +- 11.09% )
              5156      faults                    #    0.054 M/sec                    ( +-  0.10% )
              5156      minor-faults              #    0.054 M/sec                    ( +-  0.10% )
                 0      cs                        #    0.004 K/sec                    ( +-100.00% )
                 0      migrations                #    0.000 K/sec                  

            0.0952 +- 0.0106 seconds time elapsed  ( +- 11.14% )

 

 

6. Summary

It seems scipy is not really necessary here; "tailor-made" optimizations are more dependable.

 

 

 

 

NumPy Data Structures


Using Native Lists as Arrays

In [1]:
v = [0.5, 0.75, 1.0, 1.5, 2.0]  # vector of numbers
In [2]:
m = [v, v, v]  # matrix of numbers
m
Out[2]:
[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]
In [3]:
m[1]
Out[3]:
[0.5, 0.75, 1.0, 1.5, 2.0]
In [4]:
m[1][0]
Out[4]:
0.5
In [5]:
v1 = [0.5, 1.5]
v2 = [1, 2]
m = [v1, v2]
c = [m, m]  # cube of numbers
c
Out[5]:
[[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]
In [6]:
c[1][1][0]
Out[6]:
1
In [7]:
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v]
m
Out[7]:
[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]
In [8]:
v[0] = 'Python'
m
Out[8]:
[['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0]]
In [9]:
from copy import deepcopy
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = 3 * [deepcopy(v), ]
m
Out[9]:
[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]
In [10]:
v[0] = 'Python'
m
Out[10]:
[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]
 

 

 

Using NumPy Arrays

In [11]:
import numpy as np
In [12]:
a = np.array([0, 0.5, 1.0, 1.5, 2.0])
type(a)
Out[12]:
numpy.ndarray
In [13]:
a[:2]  # indexing as with list objects in 1 dimension
Out[13]:
array([ 0. ,  0.5])
In [14]:
a.sum()  # sum of all elements
Out[14]:
5.0
In [15]:
a.std()  # standard deviation
Out[15]:
0.70710678118654757
In [16]:
a.cumsum()  # running cumulative sum
Out[16]:
array([ 0. ,  0.5,  1.5,  3. ,  5. ])
In [17]:
a * 2
Out[17]:
array([ 0.,  1.,  2.,  3.,  4.])
In [18]:
a ** 2
Out[18]:
array([ 0.  ,  0.25,  1.  ,  2.25,  4.  ])
In [19]:
np.sqrt(a)
Out[19]:
array([ 0.        ,  0.70710678,  1.        ,  1.22474487,  1.41421356])
In [20]:
b = np.array([a, a * 2])
b
Out[20]:
array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
       [ 0. ,  1. ,  2. ,  3. ,  4. ]])
In [21]:
b[0]  # first row
Out[21]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ])
In [22]:
b[0, 2]  # third element of first row
Out[22]:
1.0
In [23]:
b.sum()
Out[23]:
15.0
In [24]:
b.sum(axis=0)
  # sum along axis 0, i.e. column-wise sum
Out[24]:
array([ 0. ,  1.5,  3. ,  4.5,  6. ])
In [25]:
b.sum(axis=1)
  # sum along axis 1, i.e. row-wise sum
Out[25]:
array([  5.,  10.])
In [26]:
c = np.zeros((2, 3, 4), dtype='i', order='C')  # also: np.ones()
c
Out[26]:
array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]], dtype=int32)
In [27]:
d = np.ones_like(c, dtype='f16', order='C')  # also: np.zeros_like()
d
Out[27]:
array([[[ 1.0,  1.0,  1.0,  1.0],
        [ 1.0,  1.0,  1.0,  1.0],
        [ 1.0,  1.0,  1.0,  1.0]],

       [[ 1.0,  1.0,  1.0,  1.0],
        [ 1.0,  1.0,  1.0,  1.0],
        [ 1.0,  1.0,  1.0,  1.0]]], dtype=float128)
In [28]:
import random
I = 5000 
In [29]:
%time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]
  # a nested list comprehension
 
CPU times: user 21.3 s, sys: 238 ms, total: 21.5 s
Wall time: 21.5 s
In [30]:
%time mat = np.random.standard_normal((I, I))
 
CPU times: user 1.36 s, sys: 55.9 ms, total: 1.42 s
Wall time: 1.42 s
In [31]:
%time mat.sum()
 
CPU times: user 15.2 ms, sys: 0 ns, total: 15.2 ms
Wall time: 14.1 ms
Out[31]:
1137.43504265149
 

 

 

Structured (Custom) Arrays

In [32]:
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
               ('Height', 'f'), ('Children/Pets', 'i4', 2)])  # like a database table: define each column's attributes; the dtype within a column must be consistent
s = np.array([('Smith', 45, 1.83, (0, 1)),
              ('Jones', 53, 1.72, (2, 2))], dtype=dt)
s
Out[32]:
array([(b'Smith', 45,  1.83000004, [0, 1]),
       (b'Jones', 53,  1.72000003, [2, 2])],
      dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/Pets', '<i4', (2,))])
In [33]:
s['Name']
Out[33]:
array([b'Smith', b'Jones'],
      dtype='|S10')
In [34]:
s['Height'].mean()
Out[34]:
1.7750001
In [35]:
s[1]['Age']
Out[35]:
53
 
 
 
 

Vectorization of Code


Basic Vectorization

In [36]:
r = np.random.standard_normal((4, 3))
s = np.random.standard_normal((4, 3))
In [37]:
r + s
Out[37]:
array([[  1.33110811e+00,   1.26931718e+00,   4.43935711e-01],
       [ -1.06905167e+00,   1.38422425e+00,  -1.99126769e+00],
       [  3.24832612e-01,  -6.63041986e-01,  -1.11329038e+00],
       [  6.53171734e-01,  -1.69497518e-05,   3.48108030e-01]])
In [38]:
2 * r + 3
Out[38]:
array([[ 6.23826763,  3.4056893 ,  4.93537977],
       [ 3.41201018,  3.18058875,  0.07823276],
       [ 5.21407609,  2.93499984,  2.62703963],
       [ 6.23041052,  6.13968592,  2.783421  ]])
In [39]:
s = np.random.standard_normal(3)
r + s
Out[39]:
array([[ 1.43696477, -0.89451367, -0.07713104],
       [ 0.02383604, -1.00706394, -2.50570454],
       [ 0.924869  , -1.12985839, -1.23130111],
       [ 1.43303621,  0.47248465, -1.15311042]])
In [40]:
# causes intentional error
# s = np.random.standard_normal(4)
# r + s
In [41]:
# r.transpose() + s
In [42]:
np.shape(r.T)
Out[42]:
(3, 4)
In [43]:
def f(x):
    return 3 * x + 5
In [44]:
f(0.5)  # float object
Out[44]:
6.5
In [45]:
f(r)  # NumPy array
Out[45]:
array([[ 9.85740144,  5.60853394,  7.90306966],
       [ 5.61801527,  5.27088312,  0.61734914],
       [ 8.32111413,  4.90249976,  4.44055944],
       [ 9.84561578,  9.70952889,  4.6751315 ]])
In [46]:
# causes intentional error
# import math
# math.sin(r)
In [47]:
np.sin(r)  # array as input
Out[47]:
array([[ 0.99883197,  0.20145647,  0.82357761],
       [ 0.2045511 ,  0.09017173, -0.99396568],
       [ 0.89437769, -0.03249436, -0.18540126],
       [ 0.99901409,  0.99999955, -0.10807798]])
In [48]:
np.sin(np.pi)  # float as input
Out[48]:
1.2246467991473532e-16
 

 

 

Memory Layout: C vs. Fortran Order

In [49]:
x = np.random.standard_normal((5, 10000000))
y = 2 * x + 3  # linear equation y = a * x + b
C = np.array((x, y), order='C')          # row-major: row-wise operations are faster
F = np.array((x, y), order='F')          # column-major: column-wise operations are faster
x = 0.0; y = 0.0  # memory clean-up
In [50]:
C[:2].round(2)  # round to 2 decimal places
Out[50]:
array([[[ 1.51,  0.67, -0.96, ..., -0.71, -1.55,  2.49],
        [ 0.65, -0.15,  0.02, ...,  0.49, -0.26,  1.37],
        [-0.07,  0.74, -2.31, ..., -0.51,  0.52,  0.49],
        [ 0.52, -0.43, -0.18, ..., -0.06,  0.58,  1.46],
        [-0.32,  0.59, -0.75, ..., -0.03,  0.23, -0.75]],

       [[ 6.01,  4.35,  1.07, ...,  1.58, -0.11,  7.97],
        [ 4.3 ,  2.71,  3.04, ...,  3.99,  2.48,  5.75],
        [ 2.86,  4.48, -1.62, ...,  1.97,  4.04,  3.97],
        [ 4.03,  2.14,  2.63, ...,  2.88,  4.16,  5.92],
        [ 2.36,  4.19,  1.51, ...,  2.94,  3.46,  1.5 ]]])
In [51]:
%timeit C.sum()
 
10 loops, best of 3: 45.2 ms per loop
In [52]:
%timeit F.sum()
 
10 loops, best of 3: 48.2 ms per loop
In [53]:
%timeit C[0].sum(axis=0)
 
10 loops, best of 3: 66.4 ms per loop
In [54]:
%timeit C[0].sum(axis=1)
 
10 loops, best of 3: 22.8 ms per loop
In [55]:
%timeit F.sum(axis=0)
 
1 loop, best of 3: 712 ms per loop
In [56]:
%timeit F.sum(axis=1)
 
1 loop, best of 3: 1.64 s per loop
In [57]:
F = 0.0; C = 0.0  # memory clean-up
 
End.