Paper 167: Using the GPU (Theano tutorial)

Using the GPU with Theano

English version: http://deeplearning.net/software/theano/tutorial/using_gpu.html ("Using the GPU")

For an introductory discussion of graphical processing units (GPUs) and their use for intensive parallel computation, see GPGPU.

    One of Theano's design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry them out. One of the ways it takes advantage of this flexibility is by carrying out calculations on a graphics card.

    There are currently two ways to use the GPU: one that only supports NVIDIA cards (the CUDA backend), and another, still under development, that should support any OpenCL device as well as NVIDIA cards (the GpuArray backend).

 

1. CUDA backend

    If you have not done so already, you will need to install Nvidia's GPU programming toolchain (CUDA) and configure Theano to use it. We provide installation instructions for Linux, MacOS and Windows (each guide walks through the concrete installation steps).

1.1 Testing Theano with GPU

    To check whether your GPU is being used, cut and paste the following program into a file, save it, and run it.

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

The program just computes exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.

    If you run this program (saved as check1.py) with device=cpu, the computation takes about 3 seconds, whereas on the GPU it takes only about 0.64 seconds. Note that the GPU will not always produce exactly the same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

Note that, for the time being, GPU operations in Theano only support computations where floatX is float32.

1.2 Returning a handle to device-allocated data

    The speed-up was not greater in the preceding example because the function was returning its result as a NumPy ndarray, which had, for convenience, already been copied from the device back to the host. That is what makes it so easy to switch with device=gpu, but if you do not mind less portability, you can get a bigger speed-up by changing the graph to express the computation with a GPU-stored result. The gpu_from_host op means "copy the input from the host to the GPU"; it is then optimized away after T.exp(x) is replaced by a GPU version of exp().

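A sketch of the corresponding check2.py, reconstructed here from the gpu_from_host description above and from the output below (it assumes the CUDA backend's theano.sandbox.cuda.basic_ops.gpu_from_host), is:

from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
# wrap the output in gpu_from_host so the result stays on the GPU
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'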

The output is:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

Here we have shaved off about 50% of the run time simply by not copying the resulting array back to the host. The object returned by each function call is now not a NumPy array but a "CudaNdarray", which can be converted to a NumPy ndarray by the normal NumPy casting mechanism (for example numpy.asarray()).
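For example, continuing with the f defined in check2.py above:

r = f()                    # a CudaNdarray that lives on the GPU
r_host = numpy.asarray(r)  # copy it back to the host as a regular NumPy ndarray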

    For more on how to squeeze out additional speed with the borrow flag, see Borrowing when Constructing Function Objects.

1.3 What can be accelerated on the GPU?

     The performance characteristics will change over time as we continue to optimize our implementation, and they vary from device to device, but to give a rough idea of what to expect right now:

  • Only computations with float32 data types can be accelerated. Better support for float64 is expected in upcoming hardware, but float64 computations are still relatively slow (as of January 2010).
  • Matrix multiplication, convolution and large element-wise operations can be accelerated a lot (roughly 5-50x) when the arguments are large enough to keep 30 processors busy.
  • Indexing, dimension shuffling and constant-time reshaping are equally fast on the GPU as on the CPU.
  • Summation over rows/columns of tensors can be a bit slower on the GPU than on the CPU.
  • Copying large quantities of data to and from the device is relatively slow, and often cancels out most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making the cost of transferring data to the device pay off.

1.4 Tips for improving performance on the GPU

  • Consider adding floatX=float32 to your .theanorc file.
  • Use the Theano flag allow_gc=False. See GPU Async capabilities.
  • Prefer the constructors matrix, vector and scalar to dmatrix, dvector and dscalar, because the former will give you float32 variables when floatX=float32 (see the sketch after this list).
  • Make sure your output variables have a float32 dtype and not float64. The more float32 variables there are in your graph, the more work the GPU can do for you.
  • Minimize transfers to the GPU device by using shared float32 variables to store frequently-accessed data (see shared()). When using the GPU, float32 tensor shared variables are stored on the GPU by default, which eliminates the transfer time for those variables. (In other words, the data is placed on the GPU when the shared variable is created, so it does not need to be sent from the CPU on every call; it stays resident on the GPU and repeated transfers are avoided.)
  • If you are not happy with the performance you get, try building your functions with mode='ProfileMode'. This should print some timing information at program termination. If an op or Apply node is taking more time than its fair share, and you know some GPU programming, have a look at how it is implemented in theano.sandbox.cuda. Check the lines reporting time spent in CPU ops, Xs(X%), in GPU ops, Xs(X%), and in transfer ops, Xs(X%); they tell you whether your graph spends its time on the GPU or on memory transfers.
  • Use nvcc options. nvcc supports options that can speed up certain computations: -ftz=true flushes denormal values to zero, and --prec-div=false and --prec-sqrt=false speed up division and square root by being less precise. You can enable all of these at once with the Theano flag nvcc.flags=--use_fast_math, or enable them individually as in nvcc.flags=-ftz=true --prec-div=false.
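As an illustration of the constructor and shared-variable tips above, here is a small sketch (the variable names and shapes are made up for this example):

import numpy
import theano
import theano.tensor as T

# matrix/vector/scalar follow theano.config.floatX, so with floatX=float32 they
# create float32 symbolic variables; dmatrix/dvector/dscalar always force float64.
x = T.matrix('x')
s = T.scalar('s')

# Cast the data before storing it in a shared variable; with the GPU enabled the
# float32 tensor then lives on the device and is not re-transferred on every call.
data = numpy.random.rand(400, 784)
data_shared = theano.shared(numpy.asarray(data, dtype=theano.config.floatX),
                            name='data')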

1.5 GPU async capabilities

     Since Theano 0.6 we make use of the asynchronous capabilities of the GPU. This lets us run faster, but it may also cause some errors to be raised later than they would otherwise appear, which makes it hard to profile Theano Apply nodes. There is an NVIDIA driver feature that helps with these problems: if you set the environment variable CUDA_LAUNCH_BLOCKING=1, all kernel calls are automatically synchronized. This reduces performance but provides good profiling output and sensibly placed error messages.

    This feature interacts with Theano's garbage collection of intermediate results. To get the most out of it, you need to disable the GC, since it inserts synchronization points into the graph. Set the Theano flag allow_gc=False for an even bigger speed-up, but be aware that this will increase memory usage.

1.6 Changing the value of shared variables

    To change the value of a shared variable, i.e. to provide new data to the process, use shared_variable.set_value(new_value). For more details, see Understanding Memory Aliasing for Speed and Correctness.
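A minimal sketch of set_value() (the name and shape here are just for illustration):

import numpy
import theano

w = theano.shared(numpy.zeros(5, dtype=theano.config.floatX), name='w')
new_value = numpy.ones(5, dtype=theano.config.floatX)
w.set_value(new_value)   # with a GPU-resident shared variable, this uploads the new data to the device
print w.get_value()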

Exercise: consider the logistic regression example again

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])

# Compile expressions to functions
train = theano.function(
            inputs=[x,y],
            outputs=[prediction, xent],
            updates={w:w-0.01*gw, b:b-0.01*gb},
            name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
            name = "predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]

print "prediction on D"
print predict(D[0])

 Modify and execute this example to run on the GPU with floatX=float32, and time the execution with time python file.py (for help, see Configuration Settings and Compiling Modes).

Is there an increase in speed from CPU to GPU?

Where does it come from? (Use ProfileMode)

How can you get an even better speed-up on the GPU?

Note:

  • Only 32-bit floats are currently supported (development is in progress for the rest).
  • Shared variables with a float32 dtype are, by default, moved to GPU memory space.
  • There is currently a limit of one GPU per process.
  • Use the Theano flag device=gpu to require the use of the GPU device.
  • Use device=gpu{0, 1, ...} to specify which GPU to use when you have more than one.
  • Apply the Theano flag floatX=float32 (through theano.config.floatX) in your code.
  • Cast inputs before storing them into a shared variable.
  • Avoid having int32 values that should be cast to float32 silently upcast to float64:
    • Insert manual casts in your code or use [u]int{8,16}.
    • Insert manual casts around the mean operator (this involves a division by length, which is an int64); see the sketch after this list.
    • Note: a new casting mechanism is being developed.
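As an illustration of the last two casting points (a sketch only; compare the tt.cast call in the solution below):

import theano
import theano.tensor as T

y_int = T.ivector('y_int')               # int32 labels as provided
y = T.cast(y_int, theano.config.floatX)  # explicit cast instead of a silent float64 upcast

# mean() divides by the length (an int64), so cast the result back to float32
xent = T.vector('xent')
mean_xent = T.cast(xent.mean(), theano.config.floatX)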

Solution

  1 #!/usr/bin/env python
  2 # Theano tutorial
  3 # Solution to Exercise in section 'Using the GPU'
  4 
  5 
  6 # 1. Raw results
  7 
  8 
  9 from __future__ import print_function
 10 import numpy
 11 import theano
 12 import theano.tensor as tt
 13 
 14 from theano import sandbox, Out
 15 
 16 theano.config.floatX = 'float32'
 17 
 18 rng = numpy.random
 19 
 20 N = 400
 21 feats = 784
 22 D = (rng.randn(N, feats).astype(theano.config.floatX),
 23 rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
 24 training_steps = 10000
 25 
 26 # Declare Theano symbolic variables
 27 x = theano.shared(D[0], name="x")
 28 y = theano.shared(D[1], name="y")
 29 w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
 30 b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
 31 x.tag.test_value = D[0]
 32 y.tag.test_value = D[1]
 33 #print "Initial model:"
 34 #print w.get_value(), b.get_value()
 35 
 36 # Construct Theano expression graph
 37 p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
 38 prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
 39 xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
 40 cost = tt.cast(xent.mean(), 'float32') + \
 41        0.01 * (w ** 2).sum()  # The cost to optimize
 42 gw, gb = tt.grad(cost, [w, b])
 43 
 44 """
 45 # Compile expressions to functions
 46 train = theano.function(
 47             inputs=[x, y],
 48             outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
 49             updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
 50             name="train")
 51 predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
 52             name="predict")
 53 """
 54 
 55 # Compile expressions to functions
 56 train = theano.function(
 57             inputs=[],
 58             outputs=[prediction, xent],
 59             updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
 60             name="train")
 61 predict = theano.function(inputs=[], outputs=prediction,
 62             name="predict")
 63 
 64 if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
 65 train.maker.fgraph.toposort()]):
 66     print('Used the cpu')
 67 elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
 68 train.maker.fgraph.toposort()]):
 69     print('Used the gpu')
 70 else:
 71     print('ERROR, not able to tell if theano used the cpu or the gpu')
 72     print(train.maker.fgraph.toposort())
 73 
 74 for i in range(training_steps):
 75     pred, err = train()
 76 #print "Final model:"
 77 #print w.get_value(), b.get_value()
 78 
 79 print("target values for D")
 80 print(D[1])
 81 
 82 print("prediction on D")
 83 print(predict())
 84 
 85 """
 86 
 87 # 2. Profiling
 88 
 89 
 90 # 2.1 Profiling for CPU computations
 91 
 92 # In your terminal, type:
 93 $ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py
 94 
 95 # You'll see first the output of the script:
 96 Used the cpu
 97 target values for D
 98 prediction on D
 99 
100 # Followed by the output of profiling.. You'll see profiling results for each function
101 # in the script, followed by a summary for all functions.
102 # We'll show here only the summary:
103 
104 Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
105 
106 Function profiling
107 ==================
108   Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
109   Time in 10002 calls to Function.__call__: 1.590916e+00s
110   Time in Function.fn.__call__: 1.492365e+00s (93.805%)
111   Time in thunks: 1.408159e+00s (88.512%)
112   Total compile time: 6.309664e+00s
113     Number of Apply nodes: 25
114     Theano Optimizer time: 4.848340e-01s
115        Theano validate time: 5.454302e-03s
116     Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
117 
118 Class
119 ---
120 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
121   59.6%    59.6%       0.839s       4.19e-05s     C    20001       3   theano.tensor.blas_c.CGemv
122   30.1%    89.7%       0.424s       4.71e-06s     C    90001      10   theano.tensor.elemwise.Elemwise
123    5.5%    95.2%       0.078s       7.79e-02s     Py       1       1   theano.tensor.blas.Gemv
124    1.9%    97.1%       0.026s       1.30e-06s     C    20001       3   theano.tensor.basic.Alloc
125    1.3%    98.4%       0.018s       1.85e-06s     C    10000       1   theano.tensor.elemwise.Sum
126    1.0%    99.4%       0.014s       4.78e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
127    0.6%   100.0%       0.008s       4.23e-07s     C    20001       3   theano.compile.ops.Shape_i
128    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
129 
130 Ops
131 ---
132 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
133   59.6%    59.6%       0.839s       4.19e-05s     C     20001        3   CGemv{inplace}
134   15.8%    75.4%       0.223s       2.23e-05s     C     10000        1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
135    7.7%    83.1%       0.109s       1.09e-05s     C     10000        1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
136    5.5%    88.7%       0.078s       7.79e-02s     Py       1        1   Gemv{no_inplace}
137    4.3%    92.9%       0.060s       6.00e-06s     C     10000        1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
138    1.9%    94.8%       0.026s       1.30e-06s     C     20001        3   Alloc
139    1.3%    96.1%       0.018s       1.85e-06s     C     10000        1   Sum{acc_dtype=float64}
140    0.7%    96.8%       0.009s       4.73e-07s     C     20001        3   InplaceDimShuffle{x}
141    0.6%    97.4%       0.009s       8.52e-07s     C     10000        1   Elemwise{sub,no_inplace}
142    0.6%    98.0%       0.008s       4.23e-07s     C     20001        3   Shape_i{0}
143    0.5%    98.5%       0.007s       7.06e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
144    0.5%    98.9%       0.007s       6.57e-07s     C     10000        1   Elemwise{neg,no_inplace}
145    0.3%    99.3%       0.005s       4.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
146    0.3%    99.5%       0.004s       3.78e-07s     C     10000        1   Elemwise{inv,no_inplace}
147    0.2%    99.8%       0.003s       3.44e-07s     C     10000        1   Elemwise{Cast{float32}}
148    0.2%   100.0%       0.003s       3.01e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
149    0.0%   100.0%       0.000s       8.11e-06s     C        1        1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
150    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
151 
152 Apply
153 ------
154 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
155   31.6%    31.6%       0.445s       4.45e-05s   10000     7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
156   27.9%    59.6%       0.393s       3.93e-05s   10000    17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
157   15.8%    75.4%       0.223s       2.23e-05s   10000    14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
158    7.7%    83.1%       0.109s       1.09e-05s   10000    15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
159    5.5%    88.7%       0.078s       7.79e-02s      1     0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
160    4.3%    92.9%       0.060s       6.00e-06s   10000    13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
161    1.3%    94.2%       0.018s       1.85e-06s   10000    16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
162    1.0%    95.2%       0.013s       1.34e-06s   10000     5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
163    0.9%    96.1%       0.013s       1.27e-06s   10000    12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
164    0.6%    96.7%       0.009s       8.52e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
165    0.5%    97.2%       0.007s       7.06e-07s   10000     9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
166    0.5%    97.6%       0.007s       6.57e-07s   10000    11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
167    0.4%    98.1%       0.006s       6.27e-07s   10000     0   InplaceDimShuffle{x}(b)
168    0.4%    98.5%       0.006s       5.90e-07s   10000     1   Shape_i{0}(x)
169    0.3%    98.9%       0.005s       4.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
170    0.3%    99.1%       0.004s       3.78e-07s   10000    10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
171    0.2%    99.4%       0.003s       3.44e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
172    0.2%    99.6%       0.003s       3.19e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
173    0.2%    99.8%       0.003s       3.01e-07s   10000    18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
174    0.2%   100.0%       0.003s       2.56e-07s   10000     3   Shape_i{0}(y)
175    ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
176 
177 
178 
179 # 2.2 Profiling for GPU computations
180 
181 # In your terminal, type:
182 $ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
183 
184 # You'll see first the output of the script:
185 Used the gpu
186 target values for D
187 prediction on D
188 
189 Results were produced using a GeForce GTX TITAN
190 
191 # Profiling summary for all functions:
192 
193 Function profiling
194 ==================
195   Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
196   Time in 10002 calls to Function.__call__: 3.535239e+00s
197   Time in Function.fn.__call__: 3.420863e+00s (96.765%)
198   Time in thunks: 2.865905e+00s (81.067%)
199   Total compile time: 4.728150e-01s
200     Number of Apply nodes: 36
201     Theano Optimizer time: 4.283385e-01s
202        Theano validate time: 7.687330e-03s
203     Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
204 
205 Class
206 ---
207 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
208   45.7%    45.7%       1.308s       1.64e-05s     C    80001       9   theano.sandbox.cuda.basic_ops.GpuElemwise
209   17.2%    62.8%       0.492s       2.46e-05s     C    20002       4   theano.sandbox.cuda.blas.GpuGemv
210   15.1%    77.9%       0.433s       2.17e-05s     C    20001       3   theano.sandbox.cuda.basic_ops.GpuAlloc
211    8.2%    86.1%       0.234s       1.17e-05s     C    20002       4   theano.sandbox.cuda.basic_ops.HostFromGpu
212    7.2%    93.3%       0.207s       2.07e-05s     C    10000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
213    4.4%    97.7%       0.127s       1.27e-05s     C    10003       4   theano.sandbox.cuda.basic_ops.GpuFromHost
214    0.9%    98.6%       0.025s       8.23e-07s     C    30001       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
215    0.7%    99.3%       0.020s       9.88e-07s     C    20001       3   theano.tensor.elemwise.Elemwise
216    0.5%    99.8%       0.014s       7.18e-07s     C    20001       3   theano.compile.ops.Shape_i
217    0.2%   100.0%       0.006s       5.78e-07s     C    10000       1   theano.tensor.elemwise.DimShuffle
218    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
219 
220 Ops
221 ---
222 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
223   17.2%    17.2%       0.492s       2.46e-05s     C     20001        3   GpuGemv{inplace}
224    8.2%    25.3%       0.234s       1.17e-05s     C     20002        4   HostFromGpu
225    8.0%    33.3%       0.228s       2.28e-05s     C     10001        2   GpuAlloc{memset_0=True}
226    7.4%    40.7%       0.211s       2.11e-05s     C     10000        1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
227    7.2%    47.9%       0.207s       2.07e-05s     C     10000        1   GpuCAReduce{add}{1}
228    7.1%    55.0%       0.205s       2.05e-05s     C     10000        1   GpuAlloc
229    6.9%    62.0%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
230    6.9%    68.9%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{inv,no_inplace}
231    6.2%    75.1%       0.178s       1.78e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
232    5.6%    80.6%       0.159s       1.59e-05s     C     10000        1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
233    4.4%    85.1%       0.127s       1.27e-05s     C     10003        4   GpuFromHost
234    4.3%    89.4%       0.124s       1.24e-05s     C     10000        1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
235    4.2%    93.6%       0.121s       1.21e-05s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]
236    4.2%    97.7%       0.119s       1.19e-05s     C     10000        1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
237    0.5%    98.2%       0.014s       7.18e-07s     C     20001        3   Shape_i{0}
238    0.5%    98.7%       0.013s       1.33e-06s     C     10001        2   Elemwise{gt,no_inplace}
239    0.3%    99.0%       0.010s       9.81e-07s     C     10000        1   GpuDimShuffle{1,0}
240    0.3%    99.3%       0.008s       7.90e-07s     C     10000        1   GpuDimShuffle{0}
241    0.2%    99.6%       0.007s       6.97e-07s     C     10001        2   GpuDimShuffle{x}
242    0.2%    99.8%       0.006s       6.50e-07s     C     10000        1   Elemwise{Cast{float32}}
243    ... (remaining 3 Ops account for   0.20%(0.01s) of the runtime)
244 
245 Apply
246 ------
247 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
248    8.8%     8.8%       0.251s       2.51e-05s   10000    22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
249    8.4%    17.2%       0.241s       2.41e-05s   10000     7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
250    8.0%    25.1%       0.228s       2.28e-05s   10000     5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
251    7.4%    32.5%       0.211s       2.11e-05s   10000    13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
252    7.2%    39.7%       0.207s       2.07e-05s   10000    21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
253    7.1%    46.9%       0.205s       2.05e-05s   10000    17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
254    6.9%    53.8%       0.198s       1.98e-05s   10000     4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
255    6.9%    60.7%       0.198s       1.98e-05s   10000    12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
256    6.2%    66.9%       0.178s       1.78e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
257    5.6%    72.5%       0.159s       1.59e-05s   10000    19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
258    4.8%    77.3%       0.138s       1.38e-05s   10000    18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
259    4.4%    81.7%       0.126s       1.26e-05s   10000    10   GpuFromHost(Elemwise{Cast{float32}}.0)
260    4.3%    86.0%       0.124s       1.24e-05s   10000     9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
261    4.2%    90.2%       0.121s       1.21e-05s   10000    15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
262    4.2%    94.4%       0.119s       1.19e-05s   10000    23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
263    3.4%    97.7%       0.096s       9.61e-06s   10000    16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
264    0.5%    98.2%       0.013s       1.33e-06s   10000    20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
265    0.3%    98.5%       0.010s       9.81e-07s   10000     2   GpuDimShuffle{1,0}(x)
266    0.3%    98.8%       0.008s       8.27e-07s   10000     1   Shape_i{0}(x)
267    0.3%    99.1%       0.008s       7.90e-07s   10000    14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
268    ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
269 
270 
271 # 3. Conclusions
272 
273 Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
274 consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
275 Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
276 if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.
277 
278 """

2. GpuArray Backend

    If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. See the installation instructions for libgpuarray.

    While all types of devices are supported if you use OpenCL, for the remainder of this section whatever compute device you are using will be referred to as the GPU.

Warning: we want to fully support OpenCL, but as of May 2014 this support is still a work in progress. Some very useful ops are still not supported, because they were ported from the old backend with minimal changes.

2.1 Testing Theano with GPU

    To check whether the GPU is being used, cut and paste the following program into a file and run it:

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

The program just computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure the input x is stored on the GPU.

$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

2.2 Returning a handle to device-allocated data

    By default, functions that execute on the GPU still return a standard NumPy ndarray. A transfer operation is inserted just before the results are returned, to copy the data back to the CPU and ensure compatibility with CPU code. This lets you change where the code runs by only changing the device flag, without touching the source code.

    If you do not mind losing a little flexibility, you can ask Theano to return the GPU object directly. The following code does just that:

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

 Here the call to theano.sandbox.gpuarray.basic_ops.gpu_from_host() means "copy the input to the GPU". However, during the optimization phase it is removed (i.e. the op becomes a no-op), since the result is already on the GPU. It is used here only to tell Theano that we want the result on the GPU.

The output is:

$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

While the time per call appears to be much lower than in the two previous invocations (and it should indeed be lower, since we avoid the transfer of r), most of this large speed-up is due to the asynchronous nature of execution on the GPU: the work has not actually finished, it has only been "launched".

    The object returned is a GpuArray from pygpu. It mostly behaves like a NumPy ndarray, with some exceptions, because its data lives on the GPU. You can copy it to the host and convert it to a regular ndarray using the usual NumPy casting, such as numpy.asarray().

For even more speed you can play with the borrow flag; see Borrowing when Constructing Function Objects.

 

2.3 What can be accelerated on the GPU?

    The performance characteristics will of course vary from device to device, and they will also change as we refine our implementation.

    This backend supports all the regular Theano data types (float32, float64, int, ...); however, GPU support varies, and some units cannot deal with double (float64) or small (less than 32 bits, such as int16) data types. You will get an error at compile time or at run time if that is the case.

    Support for complex numbers is untested and mostly broken.

    In general, large operations such as matrix multiplication, or element-wise operations with large inputs, will be significantly faster.

 

2.4 GPU async capabilities

    By default, all operations on the GPU are run asynchronously; the underlying libgpuarray makes this transparent.

    A synchronization point is introduced when memory is transferred between the device and the host. Another synchronization point is introduced when an active memory buffer is freed on the GPU (an active buffer is a buffer that is still being used by a kernel).

    You can force synchronization of a particular GpuArray by calling its sync() method. This is useful for getting accurate timings when benchmarking.

    These forced synchronization points interact with the garbage collection of intermediate results. To get the fastest possible speed, disable the garbage collector with the Theano flag allow_gc=False, but be aware that this will increase memory usage.
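A small benchmarking sketch of sync(), reusing the same kind of function as the check2.py listing for this backend above (the timing printed is purely illustrative):

import numpy
import time
from theano import function, config, shared, sandbox, tensor

x = shared(numpy.asarray(numpy.random.rand(10 * 30 * 768), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))

t0 = time.time()
r = f()      # returns almost immediately: the kernel has only been launched
r.sync()     # block until the GPU has actually finished computing r
t1 = time.time()
print 'One synchronized call took', t1 - t0, 'seconds'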

 

3. Software for directly programming the GPU

 

Leaving aside Theano, which is a meta-programmer, there are:

  • CUDA: GPU programming API, NVIDIA's extension of C (CUDA C)

    • Vendor-specific
    • Mature numeric libraries (BLAS, RNG, FFT)
  • OpenCL: the multi-vendor counterpart of CUDA

    • More general and standardized.
    • Fewer libraries, less widespread.
  • PyCUDA: Python bindings to the CUDA driver interface, allowing access to Nvidia's CUDA parallel computation API from Python

    • Convenience:

      Makes it easy to do GPU meta-programming from within Python.

      Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).

      A GPU memory buffer (pycuda.gpuarray.GPUArray).

      Helpful documentation.

    • Completeness: binds all of CUDA's driver API.

    • Automatic error checking: all CUDA errors are automatically translated into Python exceptions.

    • Speed: PyCUDA's base layer is written in C++.

    • Good memory management of GPU objects:

      Object clean-up is tied to the lifetime of the objects (RAII, 'Resource Acquisition Is Initialization').

      Makes it much easier to write correct, leak-free and crash-free code.

      PyCUDA knows about dependencies (for example, it will not detach from a context before all memory allocated in it has been freed).

    (See PyCUDA's documentation and Andreas Kloeckner's website on PyCUDA.)

  • PyOpenCL: PyCUDA for OpenCL

4. Learning to program with PyCUDA

 

    If you already have a good proficiency with the C programming language, you can easily leverage that knowledge by first learning to program the GPU with CUDA C, and then using PyCUDA to access the CUDA API from Python.

The following resources will assist you in this learning process:

    The following example is intended as a preview of GPU programming with PyCUDA. Once you feel comfortable enough with it, you can try the corresponding exercise.

Example: PyCUDA

# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

assert numpy.allclose(dest, a*b)
print dest

Exercise

     Run the preceding example.

    Modify and execute it to work for a matrix of shape (20, 10).

Example: Theano + PyCUDA

import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
           cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<size){
        o0[i] = i0[i]*2;
    }
  }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk

Use this code to test it:

>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))

Exercise

    Run the preceding example.

    Modify and execute it to multiply two matrices: x * y.

    Modify and execute it to return two outputs: x + y and x - y.

    (Note that Theano's current element-wise fusion optimization only applies to computations with a single output. So, to keep the basic solution efficient, the code needs to compute the two operations explicitly in one op.)

    Modify and execute it to support strides (i.e. to avoid being limited to C-contiguous inputs).

5. Note

   See Other Implementations to learn how to handle random numbers on the GPU.

 

References:

[1] Official site: http://deeplearning.net/software/theano/tutorial/using_gpu.html

[2] Personal blog: https://www.cnblogs.com/shouhuxianjian/p/4590224.html
