Using the GPU
English version: http://deeplearning.net/software/theano/tutorial/using_gpu.html
For an introductory discussion of GPUs and their use for intensive parallel computation, see: GPGPU.
One of Theano's design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility in how it carries them out. One way this flexibility is exploited is by carrying out calculations on a graphics card.
There are currently two ways to use a GPU: one that only supports NVIDIA cards (the CUDA backend), and a second one, still under development, that should support any OpenCL device as well as NVIDIA cards (the GpuArray backend).
If you have not done so already, you will need to install Nvidia's GPU-programming toolchain (CUDA) and configure Theano to use it. We provide installation instructions for Linux, MacOS and Windows that walk through the process.
To check whether your GPU is being used, cut and paste the following program into a file and run it.
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.
If you run this program (saved as check1.py) with device=cpu, it takes about 3 seconds, whereas on the GPU it takes only about 0.64 seconds. Note that the GPU will not always produce exactly the same floating-point numbers as the CPU. As a reference point, a loop calling numpy.exp(x.get_value()) takes about 46 seconds.
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
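For reference, the NumPy baseline mentioned above can be timed with a sketch along the following lines, reusing x, iters and the imports from check1.py (the exact loop body is an assumption):

# A rough sketch of the NumPy reference timing mentioned above;
# assumes x, iters, numpy and time are already defined as in check1.py.
t0 = time.time()
for i in xrange(iters):
    numpy.exp(x.get_value())
t1 = time.time()
print 'NumPy looping %d times took' % iters, t1 - t0, 'seconds'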
Note that, for now, GPU operations in Theano only support floatX set to float32.
The speedup is not greater in the preceding example because the function returns its result as a NumPy ndarray, which has already been copied from the device back to the host for your convenience. This is what makes it so easy to swap in device=gpu, but if you don't mind less portability, you can get a bigger speedup by changing the graph to express a computation whose result is stored on the GPU. The gpu_from_host op means "copy the input from the host to the GPU"; it is then optimized away after T.exp(x) is replaced by a GPU version of exp().
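Here is a sketch of the modified script (saved as check2.py), reconstructed to match the output shown below; the only changes from check1.py are wrapping the graph output in sandbox.cuda.basic_ops.gpu_from_host() and printing a NumPy copy of the result:

# Sketch of check2.py: keep the result on the GPU instead of copying it back.
from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
# gpu_from_host tells Theano we want the result left on the GPU
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'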
The output is:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
Here we have shaved off about 50% of the run time simply by not copying the resulting array back to the host. The object returned by each function call is now not a NumPy array but a "CudaNdarray", which can be converted to a NumPy ndarray through the normal NumPy casting mechanism, for example numpy.asarray().
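As a tiny illustration (reusing the f compiled in the check2.py sketch above), the conversion back to a host array might look like this:

# Minimal sketch: converting the CudaNdarray returned by f() above
# back to a regular NumPy ndarray on the host.
r = f()                    # result stays on the GPU as a CudaNdarray
r_host = numpy.asarray(r)  # the normal NumPy cast triggers the device-to-host copy
print type(r), type(r_host)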
For even more speed you can play with the borrow flag. See Borrowing when Constructing Function Objects.
The performance characteristics will change as we keep optimizing our implementations, and they vary from device to device, but to give a rough idea of what to expect right now: only float32 computations can be accelerated, and large operations such as matrix multiplication or big element-wise operations benefit the most, while frequent data transfers between host and device can cancel out the gains.
Since Theano 0.6 we make use of the asynchronous capabilities of GPUs. This makes us faster, but it means that some errors may be raised later than the point where they actually occurred, which makes it harder to profile Theano apply nodes. There is an NVIDIA driver feature that helps with this: if you set the environment variable CUDA_LAUNCH_BLOCKING=1, all kernel calls are automatically synchronized. This reduces performance, but gives good profiling results and sensibly placed error messages.
This feature interacts with Theano's garbage collection of intermediate results. To get the most out of it, you need to disable the gc, because it inserts synchronization points in the graph. Set the Theano flag allow_gc=False to get even faster speed; be aware that this increases memory usage.
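A rough sketch of how these two settings might be applied from Python (the THEANO_FLAGS / environment-variable route shown in the listings works just as well; treat the exact placement below as an assumption):

# A rough sketch, not the only way to set these flags.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before the CUDA context is created

import theano
theano.config.allow_gc = False            # faster, at the cost of higher memory usage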
To change the value of a shared variable, e.g. to provide new data to a process, use shared_variable.set_value(new_value). For more about this, see Understanding Memory Aliasing for Speed and Correctness.
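A minimal sketch of that call, assuming a hypothetical shared variable named data holding floatX values:

# Minimal sketch: pushing new data into a shared variable with set_value().
import numpy
import theano

data = theano.shared(numpy.zeros(8, dtype=theano.config.floatX), name="data")
data.set_value(numpy.arange(8).astype(theano.config.floatX))  # copies the new array to the device
print data.get_value()                                        # copies it back to the host for inspection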
Exercise: consider again the following logistic regression example.

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)  # Cross-entropy
cost = xent.mean() + 0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]

print "prediction on D"
print predict(D[0])
Modify and execute this example to run on the GPU with floatX=float32, and time the execution using the command line time python file.py. (See: Configuration Settings and Compiling Mode.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ProfileMode)
How can you increase the speedup on the GPU even further?
Solution
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'


# 1. Raw results


from __future__ import print_function
import numpy
import theano
import theano.tensor as tt

from theano import sandbox, Out

theano.config.floatX = 'float32'

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = theano.shared(D[0], name="x")
y = theano.shared(D[1], name="y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
       0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = tt.grad(cost, [w, b])

"""
# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
                     Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x],
            outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
            name="predict")
"""

# Compile expressions to functions
train = theano.function(
            inputs=[],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train()
#print "Final model:"
#print w.get_value(), b.get_value()

print("target values for D")
print(D[1])

print("prediction on D")
print(predict())

"""

# 2. Profiling


# 2.1 Profiling for CPU computations

# In your terminal, type:
$ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D

# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:

Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 1.590916e+00s
  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
  Time in thunks: 1.408159e+00s (88.512%)
  Total compile time: 6.309664e+00s
    Number of Apply nodes: 25
    Theano Optimizer time: 4.848340e-01s
    Theano validate time: 5.454302e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  59.6%   59.6%   0.839s   4.19e-05s   C   20001   3   theano.tensor.blas_c.CGemv
  30.1%   89.7%   0.424s   4.71e-06s   C   90001  10   theano.tensor.elemwise.Elemwise
   5.5%   95.2%   0.078s   7.79e-02s  Py       1   1   theano.tensor.blas.Gemv
   1.9%   97.1%   0.026s   1.30e-06s   C   20001   3   theano.tensor.basic.Alloc
   1.3%   98.4%   0.018s   1.85e-06s   C   10000   1   theano.tensor.elemwise.Sum
   1.0%   99.4%   0.014s   4.78e-07s   C   30001   4   theano.tensor.elemwise.DimShuffle
   0.6%  100.0%   0.008s   4.23e-07s   C   20001   3   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  59.6%   59.6%   0.839s   4.19e-05s   C   20001   3   CGemv{inplace}
  15.8%   75.4%   0.223s   2.23e-05s   C   10000   1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
   7.7%   83.1%   0.109s   1.09e-05s   C   10000   1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
   5.5%   88.7%   0.078s   7.79e-02s  Py       1   1   Gemv{no_inplace}
   4.3%   92.9%   0.060s   6.00e-06s   C   10000   1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
   1.9%   94.8%   0.026s   1.30e-06s   C   20001   3   Alloc
   1.3%   96.1%   0.018s   1.85e-06s   C   10000   1   Sum{acc_dtype=float64}
   0.7%   96.8%   0.009s   4.73e-07s   C   20001   3   InplaceDimShuffle{x}
   0.6%   97.4%   0.009s   8.52e-07s   C   10000   1   Elemwise{sub,no_inplace}
   0.6%   98.0%   0.008s   4.23e-07s   C   20001   3   Shape_i{0}
   0.5%   98.5%   0.007s   7.06e-07s   C   10000   1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   0.5%   98.9%   0.007s   6.57e-07s   C   10000   1   Elemwise{neg,no_inplace}
   0.3%   99.3%   0.005s   4.88e-07s   C   10000   1   InplaceDimShuffle{1,0}
   0.3%   99.5%   0.004s   3.78e-07s   C   10000   1   Elemwise{inv,no_inplace}
   0.2%   99.8%   0.003s   3.44e-07s   C   10000   1   Elemwise{Cast{float32}}
   0.2%  100.0%   0.003s   3.01e-07s   C   10000   1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.0%  100.0%   0.000s   8.11e-06s   C       1   1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  31.6%   31.6%   0.445s   4.45e-05s   10000    7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
  27.9%   59.6%   0.393s   3.93e-05s   10000   17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
  15.8%   75.4%   0.223s   2.23e-05s   10000   14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
   7.7%   83.1%   0.109s   1.09e-05s   10000   15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
   5.5%   88.7%   0.078s   7.79e-02s       1    0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
   4.3%   92.9%   0.060s   6.00e-06s   10000   13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
   1.3%   94.2%   0.018s   1.85e-06s   10000   16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
   1.0%   95.2%   0.013s   1.34e-06s   10000    5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.9%   96.1%   0.013s   1.27e-06s   10000   12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
   0.6%   96.7%   0.009s   8.52e-07s   10000    4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
   0.5%   97.2%   0.007s   7.06e-07s   10000    9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
   0.5%   97.6%   0.007s   6.57e-07s   10000   11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   0.4%   98.1%   0.006s   6.27e-07s   10000    0   InplaceDimShuffle{x}(b)
   0.4%   98.5%   0.006s   5.90e-07s   10000    1   Shape_i{0}(x)
   0.3%   98.9%   0.005s   4.88e-07s   10000    2   InplaceDimShuffle{1,0}(x)
   0.3%   99.1%   0.004s   3.78e-07s   10000   10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
   0.2%   99.4%   0.003s   3.44e-07s   10000    8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
   0.2%   99.6%   0.003s   3.19e-07s   10000    6   InplaceDimShuffle{x}(Shape_i{0}.0)
   0.2%   99.8%   0.003s   3.01e-07s   10000   18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
   0.2%  100.0%   0.003s   2.56e-07s   10000    3   Shape_i{0}(y)
   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)



# 2.2 Profiling for GPU computations

# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D

Results were produced using a GeForce GTX TITAN

# Profiling summary for all functions:

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 3.535239e+00s
  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
  Time in thunks: 2.865905e+00s (81.067%)
  Total compile time: 4.728150e-01s
    Number of Apply nodes: 36
    Theano Optimizer time: 4.283385e-01s
    Theano validate time: 7.687330e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  45.7%   45.7%   1.308s   1.64e-05s   C   80001   9   theano.sandbox.cuda.basic_ops.GpuElemwise
  17.2%   62.8%   0.492s   2.46e-05s   C   20002   4   theano.sandbox.cuda.blas.GpuGemv
  15.1%   77.9%   0.433s   2.17e-05s   C   20001   3   theano.sandbox.cuda.basic_ops.GpuAlloc
   8.2%   86.1%   0.234s   1.17e-05s   C   20002   4   theano.sandbox.cuda.basic_ops.HostFromGpu
   7.2%   93.3%   0.207s   2.07e-05s   C   10000   1   theano.sandbox.cuda.basic_ops.GpuCAReduce
   4.4%   97.7%   0.127s   1.27e-05s   C   10003   4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%   98.6%   0.025s   8.23e-07s   C   30001   4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.7%   99.3%   0.020s   9.88e-07s   C   20001   3   theano.tensor.elemwise.Elemwise
   0.5%   99.8%   0.014s   7.18e-07s   C   20001   3   theano.compile.ops.Shape_i
   0.2%  100.0%   0.006s   5.78e-07s   C   10000   1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  17.2%   17.2%   0.492s   2.46e-05s   C   20001   3   GpuGemv{inplace}
   8.2%   25.3%   0.234s   1.17e-05s   C   20002   4   HostFromGpu
   8.0%   33.3%   0.228s   2.28e-05s   C   10001   2   GpuAlloc{memset_0=True}
   7.4%   40.7%   0.211s   2.11e-05s   C   10000   1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
   7.2%   47.9%   0.207s   2.07e-05s   C   10000   1   GpuCAReduce{add}{1}
   7.1%   55.0%   0.205s   2.05e-05s   C   10000   1   GpuAlloc
   6.9%   62.0%   0.198s   1.98e-05s   C   10000   1   GpuElemwise{sub,no_inplace}
   6.9%   68.9%   0.198s   1.98e-05s   C   10000   1   GpuElemwise{inv,no_inplace}
   6.2%   75.1%   0.178s   1.78e-05s   C   10000   1   GpuElemwise{neg,no_inplace}
   5.6%   80.6%   0.159s   1.59e-05s   C   10000   1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
   4.4%   85.1%   0.127s   1.27e-05s   C   10003   4   GpuFromHost
   4.3%   89.4%   0.124s   1.24e-05s   C   10000   1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   4.2%   93.6%   0.121s   1.21e-05s   C   10000   1   GpuElemwise{ScalarSigmoid}[(0, 0)]
   4.2%   97.7%   0.119s   1.19e-05s   C   10000   1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.5%   98.2%   0.014s   7.18e-07s   C   20001   3   Shape_i{0}
   0.5%   98.7%   0.013s   1.33e-06s   C   10001   2   Elemwise{gt,no_inplace}
   0.3%   99.0%   0.010s   9.81e-07s   C   10000   1   GpuDimShuffle{1,0}
   0.3%   99.3%   0.008s   7.90e-07s   C   10000   1   GpuDimShuffle{0}
   0.2%   99.6%   0.007s   6.97e-07s   C   10001   2   GpuDimShuffle{x}
   0.2%   99.8%   0.006s   6.50e-07s   C   10000   1   Elemwise{Cast{float32}}
   ... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   8.8%    8.8%   0.251s   2.51e-05s   10000   22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
   8.4%   17.2%   0.241s   2.41e-05s   10000    7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
   8.0%   25.1%   0.228s   2.28e-05s   10000    5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
   7.4%   32.5%   0.211s   2.11e-05s   10000   13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
   7.2%   39.7%   0.207s   2.07e-05s   10000   21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
   7.1%   46.9%   0.205s   2.05e-05s   10000   17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
   6.9%   53.8%   0.198s   1.98e-05s   10000    4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
   6.9%   60.7%   0.198s   1.98e-05s   10000   12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
   6.2%   66.9%   0.178s   1.78e-05s   10000   11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   5.6%   72.5%   0.159s   1.59e-05s   10000   19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
   4.8%   77.3%   0.138s   1.38e-05s   10000   18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
   4.4%   81.7%   0.126s   1.26e-05s   10000   10   GpuFromHost(Elemwise{Cast{float32}}.0)
   4.3%   86.0%   0.124s   1.24e-05s   10000    9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
   4.2%   90.2%   0.121s   1.21e-05s   10000   15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
   4.2%   94.4%   0.119s   1.19e-05s   10000   23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
   3.4%   97.7%   0.096s   9.61e-06s   10000   16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
   0.5%   98.2%   0.013s   1.33e-06s   10000   20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
   0.3%   98.5%   0.010s   9.81e-07s   10000    2   GpuDimShuffle{1,0}(x)
   0.3%   98.8%   0.008s   8.27e-07s   10000    1   Shape_i{0}(x)
   0.3%   99.1%   0.008s   7.90e-07s   10000   14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)


# 3. Conclusions

Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.

"""
The following paragraphs cover the newer GpuArray backend. If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. See the installation instructions for libgpuarray.
When using OpenCL, all device types are supported; in the remainder of this discussion, whatever compute device you are using is referred to as the GPU.
Warning: we want to fully support OpenCL, but as of May 2014 that support is still only a vision. Some useful ops still do not support it, because the backend was ported from the old one with as few changes as possible.
To see whether the GPU is being used, cut and paste the following program into a file and run it:
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
The program just computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.
$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu
By default, functions executed on the GPU still return a standard numpy ndarray. A transfer operation is inserted just before the results are returned, to ensure compatibility with CPU code. This lets you change the device the code runs on with nothing but the device flag, without touching the source code.
If you don't mind losing some flexibility, you can ask Theano to return the GPU object directly. The following code does exactly that:
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
The call to theano.sandbox.gpuarray.basic_ops.gpu_from_host() here means "copy the input to the GPU". However, during the optimization phase it is removed, because the result is already on the GPU; it is used here simply to tell Theano that we want the result on the GPU.
The output is:
$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu
The time per call here appears to be even lower than in the two previous invocations (and it really is lower, since we avoid the data transfer), but the very large speedup is partly due to the asynchronous nature of execution on the GPU: the work has not been completed yet, only "launched".
The object returned is a GpuArray from pygpu. It behaves almost like a numpy ndarray, with some exceptions because its data lives on the GPU. You can copy it to the host and convert it to a regular ndarray with the usual NumPy casting, such as numpy.asarray().
For even more speed, you can play with the borrow flag. See Borrowing when Constructing Function Objects.
Of course, the performance characteristics still differ from device to device, and they will also change as we improve our implementation.
The backend supports all the regular Theano data types (float32, float64, int, ...), but GPU support varies, and some units cannot handle double (float64) or smaller (less than 32 bits, such as int16) data types. You will get an error at compile time or run time if that is the case.
Support for complex numbers is untested, and most of it will not work.
一般來講,大的操做,好比矩陣乘法或者有着大量輸入的逐元素操做將會明顯更快的。
By default, all operations on the GPU are run asynchronously; the underlying libgpuarray makes this transparent.
A synchronization point is introduced when memory is transferred between the device and the host. Another one is introduced when an active memory buffer is freed on the GPU (active buffers are buffers that are still in use by a kernel).
You can force synchronization of a particular GpuArray by calling its sync() method. This is useful for getting accurate timings when benchmarking.
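A rough sketch of a timing measurement with an explicit synchronization point, assuming f is a compiled function that returns a GpuArray as in the listing above:

# Rough benchmarking sketch; f is assumed to return a GpuArray as above.
import time

t0 = time.time()
r = f()      # kernels are launched asynchronously
r.sync()     # block until the result has actually been computed
t1 = time.time()
print 'One call took', t1 - t0, 'seconds'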
These forced synchronization points interact with the garbage collection of intermediate results. To get the fastest speed possible, disable the garbage collector with the Theano flag allow_gc=False; be aware that this raises memory usage.
Leaving aside Theano, which is a kind of meta-programmer, the software for directly programming a GPU includes:
CUDA: a GPU programming API from NVIDIA, based on an extension to C (CUDA C)
OpenCL: the multi-vendor counterpart of CUDA
PyCUDA: Python bindings to the CUDA driver interface, which allow access to Nvidia's CUDA parallel computation API from Python
Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).
A GPU memory buffer (pycuda.gpuarray.GPUArray).
Helpful documentation.
Completeness: bindings to all of CUDA's driver API.
Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
Speed: PyCUDA's base layer is written in C++.
Good memory management of GPU objects:
Object cleanup is tied to the lifetime of the objects (RAII, 'Resource Acquisition Is Initialization').
This makes it much easier to write correct, leak-free and crash-free code.
PyCUDA knows about dependencies (for example, it will not detach from a context before all memory allocated in it has been freed).
(See PyCUDA's documentation and Andreas Kloeckner's website on PyCUDA.)
PyOpenCL: PyCUDA for OpenCL
If you are already proficient in C, you can easily leverage that knowledge by first learning to program the GPU with the CUDA extension to C (CUDA C) and then using PyCUDA to access the CUDA API from Python.
There are a number of resources online that will assist you in this learning process.
The following examples give a foretaste of programming the GPU with PyCUDA. Once you feel competent enough, you can try the corresponding exercises.
Example: PyCUDA
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

assert numpy.allclose(dest, a*b)
print dest
Run the preceding example.
Modify and execute it to work for a matrix of shape (20, 10).
Example: Theano + PyCUDA
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda


class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i<size){
            o0[i] = i0[i]*2;
        }
    }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
Test it with this code:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
Run the preceding example.
Modify and execute it to multiply two matrices: x * y.
Modify and execute it to return two outputs: x + y and x - y.
(Note that Theano's current element-wise fusion optimization only applies to computations involving a single output; hence, to gain efficiency over the basic solution asked for here, the two operations would have to be optimized jointly and explicitly in the code.)
Modify and execute it to support strides (i.e. to avoid being limited to C-contiguous input).
See Other Implementations to learn how to handle random numbers on the GPU.
References:
[1] Official documentation: http://deeplearning.net/software/theano/tutorial/using_gpu.html
[2] Chinese translation: https://www.cnblogs.com/shouhuxianjian/p/4590224.html