讓代碼飛起來——高性能Julia學習筆記(二)

首發於 https://magicly.me/hpc-julia-2/python

上一篇:讓代碼飛起來——高性能 Julia 學習筆記(一), 繼續整理高性能 Julia 學習筆記。git

數字

Julia 中 Number 的 size 就跟 C 語言裏面同樣, 直接依賴於底層的 CPU/OS, 32 位 OS 上 integer 默認是 32 位, 64 位 OS 上 integer 默認是 64 位。github

能夠用bitstring查看 number 的二進制表示:算法

julia > bitstring(3)
"0000000000000000000000000000000000000000000000000000000000000011"

julia > bitstring(-3)
"1111111111111111111111111111111111111111111111111111111111111101"

有時候數字可能會被 boxed,Julia 的 compiler 會自動 boxing/unboxing。數組

Int64 和 Int32 都只能表示特定的整數範圍, 超過範圍會 overflow。安全

julia> typemax(Int64)
9223372036854775807

julia> bitstring(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"

julia> typemin(Int64)
-9223372036854775808

julia> bitstring(typemin(Int64))
"1000000000000000000000000000000000000000000000000000000000000000"

julia> typemax(Int64) + 1
-9223372036854775808

julia> typemin(Int64) - 1
9223372036854775807

若是要表示任意精度的話, 能夠用BitInt分佈式

julia> big(typemax(Int64)) + 1
9223372036854775808

float point 遵循IEEE 754標準。函數

julia> bitstring(1.5)
"0011111111111000000000000000000000000000000000000000000000000000"

julia> bitstring(-1.5)
"1011111111111000000000000000000000000000000000000000000000000000"

無符號整數 Unsigned integer 能夠用 UInt64 和 UInt32 表示, 不一樣類型能夠轉, 可是若是超出可表示範圍會報錯。oop

julia> UInt64(UInt32(1))
0x0000000000000001

julia> UInt32(UInt64(1))
0x00000001

julia> UInt32(typemax(UInt64))
ERROR: InexactError: trunc(UInt32, 18446744073709551615)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Any, ::UInt64) at ./boot.jl:567
 [2] checked_trunc_uint at ./boot.jl:597 [inlined]
 [3] toUInt32 at ./boot.jl:686 [inlined]
 [4] UInt32(::UInt64) at ./boot.jl:721
 [5] top-level scope at none:0

有時候不須要考慮是否溢出, 能夠直接用%符號取低位, 速度還更快:佈局

julia> (typemax(UInt64) - 1) % UInt32
0xfffffffe

精度和效率平衡
有時候爲了更高的效率, 可使用@fastmath宏, 它會放寬一些限制, 包括檢查 NaN 或 Infinity 等, 相似於 GCC 中的-ffast-math編譯選項。

julia> function sum_diff(x)
             n = length(x); d = 1/(n-1)
             s = zero(eltype(x))
             s = s +  (x[2] - x[1]) / d
             for i = 2:length(x)-1
                 s =  s + (x[i+1] - x[i+1]) / (2*d)
             end
             s = s + (x[n] - x[n-1])/d
         end
sum_diff (generic function with 1 method)

julia> function sum_diff_fast(x)
              n=length(x); d = 1/(n-1)
              s = zero(eltype(x))
              @fastmath s = s +  (x[2] - x[1]) / d
              @fastmath for i = 2:n-1
                  s =  s + (x[i+1] - x[i+1]) / (2*d)
              end
              @fastmath s = s + (x[n] - x[n-1])/d
          end
sum_diff_fast (generic function with 1 method)

julia> t=rand(2000);

julia> sum_diff(t)
1124.071808538789

julia> sum_diff_fast(t)
1124.0718085387887

julia> using BenchmarkTools

julia> @benchmark sum_diff(t)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.447 μs (0.00% GC)
  median time:      4.504 μs (0.00% GC)
  mean time:        4.823 μs (0.00% GC)
  maximum time:     17.318 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark sum_diff_fast(t)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     574.951 ns (0.00% GC)
  median time:      580.831 ns (0.00% GC)
  mean time:        621.044 ns (1.04% GC)
  maximum time:     65.017 μs (99.06% GC)
  --------------
  samples:          10000
  evals/sample:     183

差別仍是蠻大的, 差很少 8 倍差距! 咱們能夠用macroexpand看看宏展開的代碼:

julia> ex = :(@fastmath for i in 2:n-1
         s =  s + (x[i+1] - x[i+1]) / (2*d)
       end)
:(#= REPL[57]:1 =# @fastmath for i = 2:n - 1
          #= REPL[57]:2 =#
          s = s + (x[i + 1] - x[i + 1]) / (2d)
      end)

julia> macroexpand(Main, ex)
:(for i = 2:(Base.FastMath).sub_fast(n, 1)
      #= REPL[57]:2 =#
      s = (Base.FastMath).add_fast(s, (Base.FastMath).div_fast((Base.FastMath).sub_fast(x[(Base.FastMath).add_fast(i, 1)], x[(Base.FastMath).add_fast(i, 1)]), (Base.FastMath).mul_fast(2, d)))
  end)

能夠看到, 主要是把普通的加減乘除換成了Base.FastMath中的方法。

本章最後介紹了K-B-N求和減小偏差,以及 Subnormal numbers, 感受都是科學計算裏面纔會用到的, 暫時沒管。

數組

科學計算以及人工智能裏面有大量向量、矩陣運算, 在 Julia 裏都直接對應到 Array。 Vector 和 Matrix 其實就是 Array 的特例:

julia> Vector
Array{T,1} where T

julia> Matrix
Array{T,2} where T

即 Vector 是一維數組, Matrix 是二維數組。從這裏也能夠看出, Julia 中類型的參數類型能夠是具體的 value, 好比這裏的 1 和 2。

Julia 中 Array 數據是連續存放的, 以下圖:

Julia Array Storage

跟存放指針相比好處是少了一次內存訪問, 而且能夠更好地利用 CPU 的 pipeline 和 cache,以及 SIMD。

Julia 中多維數組是按列優先存儲的(相似 Matlab 和 Fortran),這點跟 C 語言中不同:
Julia 2d array storage

按照數據在內存中的佈局來讀取數據, 能顯著提升效率。

julia> function col_iter(x)
         s=zero(eltype(x))
         for i = 1:size(x, 2)
           for j = 1:size(x, 1)
             s = s + x[j, i] ^ 2
             x[j, i] = s
           end
         end
       end
col_iter (generic function with 1 method)

julia> function row_iter(x)
         s=zero(eltype(x))
         for i = 1:size(x, 1)
           for j = 1:size(x, 2)
             s = s + x[i, j] ^ 2
             x[i, j] = s
           end
         end
       end
row_iter (generic function with 1 method)

julia> a = rand(1000, 1000);

julia> @benchmark row_iter(a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.217 ms (0.00% GC)
  median time:      2.426 ms (0.00% GC)
  mean time:        2.524 ms (0.00% GC)
  maximum time:     11.723 ms (0.00% GC)
  --------------
  samples:          1974
  evals/sample:     1

julia> @benchmark col_iter(a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     815.984 μs (0.00% GC)
  median time:      910.121 μs (0.00% GC)
  mean time:        917.850 μs (0.00% GC)
  maximum time:     1.292 ms (0.00% GC)
  --------------
  samples:          5416
  evals/sample:     1

Julia runtime 會對 array 訪問作 bound 判斷, 看是否超出邊界。 超出邊界的訪問會致使不少 bugs,甚至是安全問題。 另外一方面, bound check 會帶來額外的開銷,若是你能很肯定不會越界, 那能夠用@inbound 宏告訴 Julia 不要作 bound check, 效率會提升很多。

julia> function prefix_bounds(a, b)
         for i = 2:size(a, 1)
           a[i] = b[i-1] + b[i]
         end
       end
prefix_bounds (generic function with 1 method)

julia> function prefix_inbounds(a, b)
         @inbounds for i = 2:size(a, 1)
           a[i] = b[i-1] + b[i]
         end
       end
prefix_inbounds (generic function with 1 method)

julia> a = rand(1000);

julia> b = rand(1000);

julia> @benchmark prefix_bounds(a, b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     742.360 ns (0.00% GC)
  median time:      763.264 ns (0.00% GC)
  mean time:        807.497 ns (0.00% GC)
  maximum time:     1.968 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     125

julia> @benchmark prefix_inbounds(a, b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     179.294 ns (0.00% GC)
  median time:      181.826 ns (0.00% GC)
  mean time:        185.124 ns (0.00% GC)
  maximum time:     635.909 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     690

慎用@inbound!!! 最好是限制 loop 直接依賴於 array 長度, 好比:for i in 1:length(array)這種形式。

能夠在啓動的時候加上--check-bounds=yes/no來所有開啓或者關閉(優先級高於@inbound 宏)bound check! 再說一次, 慎用!

Julia 內置了不少操做 Array 的函數, 通常都提供兩個版本, 注意可變和不可變版本差別很大!!!

julia> a = rand(1000);

julia> @benchmark sort(a)
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     16.050 μs (0.00% GC)
  median time:      17.493 μs (0.00% GC)
  mean time:        18.933 μs (0.00% GC)
  maximum time:     726.416 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark sort!(a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.868 μs (0.00% GC)
  median time:      4.997 μs (0.00% GC)
  mean time:        5.282 μs (0.00% GC)
  maximum time:     13.772 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

咱們能夠看到, 不可變版本sort不管時間仍是內存佔用和分配上都比可變版本sort!高。 這很容易理解, 不可變版本須要 clone 一份數據出來,而不是直接修改參數。 根據須要選擇最適合的版本。

相似的,合理利用預分配,能夠有效下降時間和內存佔用。

julia> function xpow(x)
         return [x x^2 x^3 x^4]
       end
xpow (generic function with 1 method)

julia> function xpow_loop(n) s= 0
          for i = 1:n
            s = s + xpow(i)[2]
         end
         s
       end
xpow_loop (generic function with 1 method)

julia> @benchmark xpow_loop(1_000_000)
BenchmarkTools.Trial:
  memory estimate:  167.84 MiB
  allocs estimate:  4999441
  --------------
  minimum time:     68.342 ms (4.11% GC)
  median time:      70.378 ms (5.21% GC)
  mean time:        71.581 ms (6.04% GC)
  maximum time:     120.430 ms (39.92% GC)
  --------------
  samples:          70
  evals/sample:     1

julia> function xpow!(result::Array{Int, 1}, x)
         @assert length(result) == 4
         result[1] = x
         result[2] = x^2
         result[3] = x^3
         result[4] = x^4
       end
xpow! (generic function with 1 method)

julia> function xpow_loop_noalloc(n)
         r = [0, 0, 0, 0]
         s= 0
         for i = 1:n
           xpow!(r, i)
           s = s + r[2]
         end
         s
       end
xpow_loop_noalloc (generic function with 1 method)

julia> @benchmark xpow_loop_noalloc(1_000_000)
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     7.314 ms (0.00% GC)
  median time:      7.486 ms (0.00% GC)
  mean time:        7.599 ms (0.00% GC)
  maximum time:     9.580 ms (0.00% GC)
  --------------
  samples:          658
  evals/sample:     1

Julia 中作 Array slicing 很容易,相似 python 的語法:

julia> a = collect(1:100);

julia> a[1:10]
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

語法容易使用很容易形成濫用, 致使性能問題, 由於:array slicing 會 copy 一個副本! 咱們來計算一個矩陣的每列的和, 簡單實現以下:

julia> function sum_vector(x::Array{Float64, 1})
         s = 0.0
         for i = 1:length(x)
           s = s + x[i]
         end
         return s
       end
sum_vector (generic function with 1 method)

julia> function sum_cols_matrix(x::Array{Float64, 2})
         num_cols = size(x, 2)
         s = zeros(num_cols)
         for i = 1:num_cols
           s[i] = sum_vector(x[:, i])
         end
         return s
       end
sum_cols_matrix (generic function with 1 method)

julia> t = rand(1000, 1000);

julia> @benchmark sum_cols_matrix(t)
BenchmarkTools.Trial:
  memory estimate:  7.76 MiB
  allocs estimate:  1001
  --------------
  minimum time:     1.703 ms (0.00% GC)
  median time:      2.600 ms (0.00% GC)
  mean time:        2.902 ms (19.10% GC)
  maximum time:     40.978 ms (94.27% GC)
  --------------
  samples:          1719
  evals/sample:     1

x[:, j]是取第 j 列的全部元素。 算法很簡單, 定義一個函數sum_vector計算向量的和, 而後在sum_cols_matrix中取每一列傳過去。

因爲x[:, j]這樣 slicing 會 copy 元素, 因此內存佔用和分配都比較大。 Julia 提供了view函數,能夠複用父數組的元素而不是 copy, 具體用法能夠參考https://docs.julialang.org/en...

因爲view返回的是SubArray類型, 因此咱們先修改一下sum_vector的參數類型爲AbstractArray

julia> function sum_vector2(x::AbstractArray)
         s = 0.0
         for i = 1:length(x)
           s = s + x[i]
         end
         return s
       end
sum_vector2 (generic function with 1 method)

julia> function sum_cols_matrix_views(x::Array{Float64, 2})
         num_cols = size(x, 2)
         s = zeros(num_cols)
         for i = 1:num_cols
            s[i] = sum_vector2(view(x, :, i))
         end
         return s
       end
sum_cols_matrix_views (generic function with 1 method)

julia>

julia> @benchmark sum_cols_matrix_views(t)
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     812.209 μs (0.00% GC)
  median time:      883.884 μs (0.00% GC)
  mean time:        897.888 μs (0.00% GC)
  maximum time:     6.552 ms (0.00% GC)
  --------------
  samples:          5474
  evals/sample:     1

能夠看到,提高仍是蠻大的。

SIMD全稱Single Instruction Multiple Data, 是現代 CPU 的特性, 能夠一條指令操做多條數據。 來加兩個向量試試:

julia> x = zeros(1_000_000); y = rand(1_000_000); z = rand(1_000_000);

julia> function sum_vectors!(x, y, z)
         n = length(x)
         for i = 1:n
           x[i] = y[i] + z[i]
         end
       end
sum_vectors! (generic function with 1 method)

julia> function sum_vectors_inbounds!(x, y, z)
         n = length(x)
         @inbounds for i = 1:n
           x[i] = y[i] + z[i]
         end
       end
sum_vectors_inbounds! (generic function with 1 method)

julia> function sum_vectors_inbounds_simd!(x, y, z)
         n = length(x)
         @inbounds @simd for i = 1:n
           x[i] = y[i] + z[i]
         end
       end
sum_vectors_inbounds_simd! (generic function with 1 method)

julia> @benchmark sum_vectors!(x, y, z)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.117 ms (0.00% GC)
  median time:      1.156 ms (0.00% GC)
  mean time:        1.174 ms (0.00% GC)
  maximum time:     1.977 ms (0.00% GC)
  --------------
  samples:          4234
  evals/sample:     1

julia> @benchmark sum_vectors_inbounds!(x, y, z)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.118 ms (0.00% GC)
  median time:      1.148 ms (0.00% GC)
  mean time:        1.158 ms (0.00% GC)
  maximum time:     1.960 ms (0.00% GC)
  --------------
  samples:          4294
  evals/sample:     1

julia> @benchmark sum_vectors_inbounds_simd!(x, y, z)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.118 ms (0.00% GC)
  median time:      1.145 ms (0.00% GC)
  mean time:        1.155 ms (0.00% GC)
  maximum time:     2.080 ms (0.00% GC)
  --------------
  samples:          4305
  evals/sample:     1

這個測試結果, 不管用@inbounds仍是@simd,都沒有提速(難道 Julia 如今默認會使用 SIMD?), 測試了幾回都不行。 另外, 我在測試https://github.com/JuliaCompu... 的時候, 也發現有沒有 simd 差異不大, 若有知道緣由的童鞋歡迎留言告知, 很是謝謝。

另外說Yeppp這個包也能提升速度, 尚未測試。

若是咱們設計的函數給別人用, 那麼 API 會設計的比較通用(好比參數設計成 AbstractArray), 可能會接受各類類型的 Array,這時候須要當心如何遍歷數組

有兩種 indexing 方式。 一種是 linear indexing, 好比是一個三維 array, 每維度 10 個元素, 則能夠用 x[1], x[2], ..., x[1000]連續地訪問數組。 第二種叫 cartesian indexing, 訪問方式爲 x[i, j, k]。 某些數組不是連續的(好比 view 生成的 SubArray),這時候用 cartesian indexing 訪問會比用 linear indexing 訪問快, 由於 linear indexing 須要除法轉化成每一維的下標。 通用代碼通常會用 linear indexing, 這就有可能致使性能問題。

julia> function mysum_linear(a::AbstractArray)
         s=zero(eltype(a))
         for i = 1:length(a)
           s=s+a[i]
         end
         return s
       end
mysum_linear (generic function with 1 method)

julia> mysum_linear(1:1000000)
500000500000

julia> mysum_linear(reshape(1:1000000, 100, 100, 100))
500000500000

julia> mysum_linear(reshape(1:1000000, 1000, 1000))
500000500000

julia> mysum_linear(view(reshape(1:1000000, 1000, 1000), 1:500, 1:500) )
62437625000

julia> @benchmark mysum_linear(1:1000000)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.380 ns (0.00% GC)
  median time:      1.426 ns (0.00% GC)
  mean time:        1.537 ns (0.00% GC)
  maximum time:     13.475 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark mysum_linear(reshape(1:1000000, 1000, 1000))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.379 ns (0.00% GC)
  median time:      1.392 ns (0.00% GC)
  mean time:        1.482 ns (0.00% GC)
  maximum time:     23.089 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark mysum_linear(view(reshape(1:1000000, 1000, 1000), 1:500, 1:500))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.016 ms (0.00% GC)
  median time:      1.047 ms (0.00% GC)
  mean time:        1.071 ms (0.00% GC)
  maximum time:     2.211 ms (0.00% GC)
  --------------
  samples:          4659
  evals/sample:     1

能夠看到最後一次測試, 元素總數更少了, 可是時間更長, 緣由就是數組不是連續的, 可是又用了 linear indexing。 最簡單的解決方法是直接迭代元素, 而不是迭代下標。

julia> function mysum_in(a::AbstractArray)
         s = zero(eltype(a))
         for i in a
           s=s+ i
         end
       end
mysum_in (generic function with 1 method)

julia> @benchmark mysum_in(view(reshape(1:1000000, 1000, 1000), 1:500, 1:500))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     224.538 μs (0.00% GC)
  median time:      238.493 μs (0.00% GC)
  mean time:        246.847 μs (0.00% GC)
  maximum time:     413.477 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

能夠看到, 跟 linear indexing 相比, 效率是 4 倍多。 可是若是咱們須要下標怎麼辦呢?能夠用eachindex()方法,每一種 array 都會定義一個優化過的eachindex()方法:

julia> function mysum_eachindex(a::AbstractArray)
         s = zero(eltype(a))
         for i in eachindex(a)
           s = s + a[i]
         end
       end
mysum_eachindex (generic function with 1 method)

julia> @benchmark mysum_eachindex(view(reshape(1:1000000, 1000, 1000), 1:500, 1:500))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     243.295 μs (0.00% GC)
  median time:      273.168 μs (0.00% GC)
  mean time:        285.002 μs (0.00% GC)
  maximum time:     4.191 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

經過這兩篇文章介紹, 咱們基本上掌握了 Julia 高性能的方法。 若是還不夠, 那就只能求助於並行和分佈式了, 等着咱們下一篇介紹吧。

歡迎加入知識星球一塊兒分享討論有趣的技術話題。

星球jsforfun

相關文章
相關標籤/搜索