NumPy 底層進行了不錯的優化。html
loops = 25000000
from math import *
# 策略1:對每個元素依次進行計算
a = range(1, loops)
def f(x):
return 3 * log(x) + cos(x) ** 2
%timeit r = [f(x) for x in a]
# 策略2:使用np
import numpy as np
a = np.arange(1, loops)
%timeit r = 3 * np.log(a) + np.cos(a) ** 2
# 策略3:numexpr方法,避免了數組在內存中複製
import numexpr as ne
ne.set_num_threads(1)
f = '3 * log(a) + cos(a) ** 2'
%timeit r = ne.evaluate(f)
# 策略4:再併發四線程
ne.set_num_threads(4) # 支持併發
%timeit r = ne.evaluate(f)
NumPy 雖然經過底層高度優化過的計算庫能夠實現接近C的高效計算,但在計算複雜且計算量龐大的時候多少仍是有些嫌慢。Numexpr 庫是最近發現的一個很是簡單易用的 Numpy性能提高工具。python
Goto: NumExpr: Fast numerical expression evaluator for NumPygit
先安裝cuda驅動:[CUDA] 00 - GPU Driver Installation & Concurrency Programminggithub
再安裝CuPy驅動:CuPy Installationweb
須要瞭解:PyPy和CPython的不一樣點。數據庫
簡單理解:PyPy 爲何會比 CPython 還要快?express
CPython的思路編程
def add(x, y): return x + y if instance_has_method(x, '__add__') { return call(x, '__add__', y) // x.__add__ 裏面又有一大堆針對不一樣類型的 y 的判斷
} else if isinstance_has_method(super_class(x), '__add__' { return call(super_class, '__add__', y)
} else if isinstance(x, str) and isinstance(y, str) { return concat_str(x, y)
} else if isinstance(x, float) and isinstance(y, float) { return add_float(x, y)
} else if isinstance(x, int) and isinstance(y, int) { return add_int(x, y)
} else ...
Just In Time 編譯數組
爲什麼加速了呢?PyPy 執行的時候,發現執行了一百遍 add 函數,發現你 TM 每次都只傳兩個整數進來,那我何苦每次還給你作這麼多計算。因而當場生成了一個相似 C 的函數:緩存
int add_int_int(int x, int y) {
return x + y; }
原始代碼,做爲以後的優化參照。
import numpy as np import time grid_size = (512,) def laplacian(grid, out): np.copyto(out, grid) np.multiply(out, -2.0, out) np.add(out, np.roll(grid, +1), out) np.add(out, np.roll(grid, -1), out) def evolve(grid, dt, out, D=1): laplacian(grid, out) np.multiply(out, D * dt, out) np.add(out, grid, out)
def run_experiment(num_iterations):
scratch = np.zeros(grid_size) grid = np.zeros(grid_size) block_low = int(grid_size[0] * .4) block_high = int(grid_size[0] * .5) grid[block_low:block_high] = 0.005 start = time.time() for i in range(num_iterations): evolve(grid, 0.1, scratch) grid, scratch = scratch, grid return time.time() - start
if __name__ == "__main__": run_experiment(500)
性能分析一:
$ sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python diffusion_python_memory.py
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory.py' (3 runs): 358139881 cycles # 3.605 GHz ( +- 2.30% ) (65.11%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 453383936 instructions # 1.27 insn per cycle ( +- 1.22% ) (82.56%) 22986206 cache-references # 231.400 M/sec ( +- 0.56% ) (82.71%) 2022121 cache-misses # 8.797 % of all cache refs ( +- 15.29% ) (83.90%) 100438152 branches # 1011.101 M/sec ( +- 1.59% ) (83.90%) 3004796 branch-misses # 2.99% of all branches ( +- 0.49% ) (84.38%) 99.34 msec task-clock # 0.996 CPUs utilized ( +- 4.31% ) 4984 faults # 0.050 M/sec ( +- 0.07% ) 4984 minor-faults # 0.050 M/sec ( +- 0.07% ) 1 cs # 0.007 K/sec ( +- 50.00% ) 0 migrations # 0.000 K/sec 0.09976 +- 0.00448 seconds time elapsed ( +- 4.49% )
寫在一塊兒的好處是:短代碼每每意味着高性能,由於容易利用內建優化機制。
def laplacian(grid): return np.roll(grid, +1) + np.roll(grid, -1) - 2 * grid
def evolve(grid, dt, D=1): return grid + dt * D * laplacian(grid)
性能分析二:cache-references很是大,但0.97指令減小纔是重點。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy.py' (3 runs): 345131518 cycles # 3.368 GHz ( +- 2.05% ) (63.87%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 441624002 instructions # 1.28 insn per cycle ( +- 0.41% ) (82.69%) 22383742 cache-references # 218.428 M/sec ( +- 0.44% ) (84.39%) 2334617 cache-misses # 10.430 % of all cache refs ( +- 4.03% ) (84.39%) 98647251 branches # 962.634 M/sec ( +- 1.17% ) (84.39%) 2950862 branch-misses # 2.99% of all branches ( +- 0.24% ) (82.97%) 102.48 msec task-clock # 0.998 CPUs utilized ( +- 2.02% ) 4978 faults # 0.049 M/sec ( +- 0.04% ) 4978 minor-faults # 0.049 M/sec ( +- 0.04% ) 1 cs # 0.007 K/sec ( +- 50.00% ) 0 migrations # 0.000 K/sec 0.10270 +- 0.00209 seconds time elapsed ( +- 2.04% )
咱們使用 numpy下降了 CPU 負擔,咱們下降了解決問題所必要的內存分配次數。
事實上 evolve 函數有 93%的時間花費在 laplacian 函數中。
參見:[Optimized Python] 17 - Bottle-neck in performance
向操做系統進行請求所需的開銷比簡單填充緩存大不少。
填充一次緩存失效是一個在主板上優化過的硬件行爲,而內存分配則須要跟另外一個進程、內核打交道來完成。
(1)"+="運算符的使用減小了「沒必要要的變量內存分配」。
(2)roll_add替代過於全能的np.roll,以減小沒必要要的指令。
def roll_add(rollee, shift, out): if shift == 1: out[1:] += rollee[:-1] out[0] += rollee[-1] elif shift == -1: out[:-1] += rollee[1:] out[-1] += rollee[0] def laplacian(grid, out): np.copyto(out, grid) np.multiply(out, -2.0, out) roll_add(grid, +1, out) roll_add(grid, -1, out) def evolve(grid, dt, out, D=1): laplacian(grid, out) np.multiply(out, D * dt, out) np.add(out, grid, out
性能分析三:減低了內存缺頁,提升了指令效率。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2.py' (3 runs): 314575986 cycles # 3.670 GHz ( +- 1.89% ) (62.67%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 402037134 instructions # 1.28 insn per cycle ( +- 0.19% ) (81.33%) 18138094 cache-references # 211.611 M/sec ( +- 2.65% ) (81.94%) 2298500 cache-misses # 12.672 % of all cache refs ( +- 6.84% ) (84.68%) 88677913 branches # 1034.575 M/sec ( +- 0.68% ) (86.00%) 2885039 branch-misses # 3.25% of all branches ( +- 1.19% ) (84.71%) 85.71 msec task-clock # 0.997 CPUs utilized ( +- 1.86% ) 4974 faults # 0.058 M/sec ( +- 0.10% ) 4974 minor-faults # 0.058 M/sec ( +- 0.10% ) 0 cs # 0.004 K/sec ( +-100.00% ) 0 migrations # 0.000 K/sec 0.08596 +- 0.00163 seconds time elapsed ( +- 1.89% )
numexpr 的一個重要特色是它考慮到了 CPU 緩存。它特意移動數據來讓各級CPU 緩存可以擁有正確的數據讓緩存失效最小化。
(a)當咱們增長矩陣的大小時,咱們會發現 numexpr 比原生 numpy 更好地利用了咱們的緩存。並且,numexpr 利用了多核來進行計算並嘗試填滿每一個核心的緩存。
(b)當矩陣較小時,管理多核的開銷蓋過了任何可能的性能提高。
import numexpr as ne
def evolve(grid, dt, out, D=1): laplacian(grid, out) ne.evaluate("out*D*dt+grid", out=out)
性能分析四:減低了內存缺頁,提升了指令效率。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2_numexpr.py' (3 runs): 389532813 cycles # 3.888 GHz ( +- 1.01% ) (65.40%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 478293442 instructions # 1.23 insn per cycle ( +- 1.70% ) (82.72%) 23279515 cache-references # 232.372 M/sec ( +- 0.91% ) (82.89%) 2447425 cache-misses # 10.513 % of all cache refs ( +- 10.84% ) (83.77%) 103451737 branches # 1032.639 M/sec ( +- 1.21% ) (84.97%) 3324954 branch-misses # 3.21% of all branches ( +- 0.61% ) (82.97%) 100.18 msec task-clock # 1.000 CPUs utilized ( +- 5.75% ) 8591 faults # 0.086 M/sec ( +- 0.01% ) 8591 minor-faults # 0.086 M/sec ( +- 0.01% ) 12 cs # 0.123 K/sec ( +- 2.70% ) 3 migrations # 0.030 K/sec 0.10018 +- 0.00582 seconds time elapsed ( +- 5.81% )
直接使用scipy提供的接口。
from scipy.ndimage.filters import laplace def laplacian(grid, out): laplace(grid, out, mode='wrap')
性能分析五:可能過於通用而沒有明顯提高。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_scipy.py' (3 runs): 346206970 cycles # 3.647 GHz ( +- 0.37% ) (63.49%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 441742003 instructions # 1.28 insn per cycle ( +- 1.10% ) (81.74%) 21243859 cache-references # 223.757 M/sec ( +- 0.37% ) (82.77%) 2092136 cache-misses # 9.848 % of all cache refs ( +- 1.46% ) (84.55%) 96943942 branches # 1021.088 M/sec ( +- 0.80% ) (85.19%) 3120156 branch-misses # 3.22% of all branches ( +- 1.27% ) (84.00%) 94.94 msec task-clock # 0.997 CPUs utilized ( +- 11.09% ) 5156 faults # 0.054 M/sec ( +- 0.10% ) 5156 minor-faults # 0.054 M/sec ( +- 0.10% ) 0 cs # 0.004 K/sec ( +-100.00% ) 0 migrations # 0.000 K/sec 0.0952 +- 0.0106 seconds time elapsed ( +- 11.14% )
看來,scipy沒有大的必要,「定製的」 優化比較靠譜。
v = [0.5, 0.75, 1.0, 1.5, 2.0] # vector of numbers
m = [v, v, v] # matrix of numbers
m
m[1]
m[1][0]
v1 = [0.5, 1.5]
v2 = [1, 2]
m = [v1, v2]
c = [m, m] # cube of numbers
c
c[1][1][0]
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v]
m
v[0] = 'Python'
m
from copy import deepcopy
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = 3 * [deepcopy(v), ]
m
v[0] = 'Python'
m
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0])
type(a)
a[:2] # indexing as with list objects in 1 dimension
a.sum() # sum of all elements
a.std() # standard deviation
a.cumsum() # running cumulative sum
a * 2
a ** 2
np.sqrt(a)
b = np.array([a, a * 2])
b
b[0] # first row
b[0, 2] # third element of first row
b.sum()
b.sum(axis=0)
# sum along axis 0, i.e. column-wise sum
b.sum(axis=1)
# sum along axis 1, i.e. row-wise sum
c = np.zeros((2, 3, 4), dtype='i', order='C') # also: np.ones()
c
d = np.ones_like(c, dtype='f16', order='C') # also: np.zeros_like()
d
import random
I = 5000
%time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]
# a nested list comprehension
%time mat = np.random.standard_normal((I, I))
%time mat.sum()
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
('Height', 'f'), ('Children/Pets', 'i4', 2)]) # 相似與數據庫的表,定義列的屬性;每列數據的類型是要確保一致的
s = np.array([('Smith', 45, 1.83, (0, 1)),
('Jones', 53, 1.72, (2, 2))], dtype=dt)
s
s['Name']
s['Height'].mean()
s[1]['Age']
r = np.random.standard_normal((4, 3))
s = np.random.standard_normal((4, 3))
r + s
2 * r + 3
s = np.random.standard_normal(3)
r + s
# causes intentional error
# s = np.random.standard_normal(4)
# r + s
# r.transpose() + s
np.shape(r.T)
def f(x):
return 3 * x + 5
f(0.5) # float object
f(r) # NumPy array
# causes intentional error
# import math
# math.sin(r)
np.sin(r) # array as input
np.sin(np.pi) # float as input
x = np.random.standard_normal((5, 10000000))
y = 2 * x + 3 # linear equation y = a * x + b
C = np.array((x, y), order='C') # 按行統計快些
F = np.array((x, y), order='F') # 按列統計快些
x = 0.0; y = 0.0 # memory clean-up
C[:2].round(2) # 四捨五入
%timeit C.sum()
%timeit F.sum()
%timeit C[0].sum(axis=0)
%timeit C[0].sum(axis=1)
%timeit F.sum(axis=0)
%timeit F.sum(axis=1)
F = 0.0; C = 0.0 # memory clean-up