So far, every kernel has been launched from the host, and the GPU works entirely under the CPU's control. CUDA Dynamic Parallelism allows a GPU kernel to create new kernel launches on the device. Dynamic Parallelism makes recursion easier to implement and to reason about, because the launch configuration can be decided at runtime by threads on the device; it also reduces the data transfer and execution-control traffic between host and device. In what follows we work through how Dynamic Parallelism is used.
The syntax for launching a kernel on the device is exactly the same as on the host. Kernel executions are divided into two types: parent and child. A parent thread, parent block, or parent grid can launch a new grid, the child grid. A child grid must complete before its parent does; in other words, the parent must wait for all of its children to finish.
When a parent launches a child grid, the child is not guaranteed to start executing until the parent explicitly synchronizes on it. Parent and child share the same global and constant memory but have distinct shared and local memory. Naturally, there are only two points at which the child's view of global memory is guaranteed to be fully consistent with the parent's: when the child starts and when the child completes. All of the parent's global memory operations performed before the launch are visible to the child, while the child's global memory operations become visible to the parent only after the parent synchronizes (a minimal sketch follows).
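As a minimal sketch of this visibility rule (not part of the original example; the kernel and variable names here are made up), the parent below writes to global memory before the launch and reads the child's result only after an explicit device-side synchronization. It assumes the same compile flags shown later in this post (-rdc=true, -lcudadevrt):

#include <cstdio>

__global__ void childIncrement(int *data)
{
    // the child sees the value the parent wrote before launching it
    data[0] += 1;
}

__global__ void parentWrite(int *data)
{
    if (threadIdx.x == 0)
    {
        data[0] = 42;                         // visible to the child at launch time
        childIncrement<<<1, 1>>>(data);
        cudaDeviceSynchronize();              // child's writes now visible to the parent
        printf("parent sees %d\n", data[0]);  // expected: 43
    }
}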
To illustrate Dynamic Parallelism more clearly, we adapt the hello-world program written at the very beginning. The figure below shows the execution flow with Dynamic Parallelism: the host launches the parent grid (eight threads per block), thread 0 launches a child grid (four threads per block), thread 0 of that child grid launches another child grid (two threads per block), and so on.
Here is the code. Each thread first prints Hello World; then each thread checks whether it should stop.
__global__ void nestedHelloWorld(int const iSize, int iDepth)
{
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid, blockIdx.x);

    // condition to stop recursive execution
    if (iSize == 1) return;

    // reduce block size to half
    int nthreads = iSize >> 1;

    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0)
    {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
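The post does not show the host-side driver, so here is a hedged sketch of what it might look like. The names (igrid, blocksize) and the grid arithmetic are assumptions, chosen so that "./nestedHelloWorld 2" later launches a parent grid of two blocks:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int size = 8;
    int blocksize = 8;                    // threads per parent block
    int igrid = 1;
    if (argc > 1) igrid = atoi(argv[1]);  // e.g. ./nestedHelloWorld 2

    dim3 block(blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x * igrid, 1);
    printf("%s Execution Configuration: grid %d block %d\n", argv[0], grid.x, block.x);

    nestedHelloWorld<<<grid, block>>>(block.x, 0);

    cudaDeviceSynchronize();              // wait for the whole launch tree to finish
    cudaDeviceReset();
    return 0;
}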
Compile:
$ nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt
-lcudadevrt links against the device runtime library, just like linking any library with gcc. -rdc=true generates relocatable device code, which is required for Dynamic Parallelism; the reason is a larger topic that we will explore later.
The output is:
./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
The 0 1 2 3 4 … output order here looks oddly tidy; for now let us just assume CUDA handles printf specially. What is stranger, given experience with CPU recursion, is the interleaving of parent and child output. This is certainly not a compiler error or a CUDA bug: add a cudaDeviceSynchronize after the child kernel launch and you will see the "expected" order, which also makes the reason clear (see the sketch below).
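As a sketch of that experiment (an assumption about where the call goes, not code from the original post), the launch block inside nestedHelloWorld would become:

    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0)
    {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        cudaDeviceSynchronize();   // parent thread 0 now waits for the child grid
        printf("-------> nested execution depth: %d\n", iDepth);
    }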
nvvp can be used to inspect the execution; the blank gaps in the timeline show the parent waiting for its child to finish:
$ nvvp ./nestedHelloWorld
Next, we try launching with two blocks instead of one:
$ ./nestedHelloWorld 2
The output is:
./nestedHelloWorld 2
Execution Configuration: grid 2 block 8
Recursion=0: Hello World from thread 0 block 1
Recursion=0: Hello World from thread 1 block 1
Recursion=0: Hello World from thread 2 block 1
Recursion=0: Hello World from thread 3 block 1
Recursion=0: Hello World from thread 4 block 1
Recursion=0: Hello World from thread 5 block 1
Recursion=0: Hello World from thread 6 block 1
Recursion=0: Hello World from thread 7 block 1
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
Recursion=3: Hello World from thread 0 block 0
The first thing to notice in these results is that every child block has block ID 0. The figure below shows the call tree: the parent now has two blocks, but every child grid still has only one block, because the child launch is:
nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
Note: Dynamic Parallelism is only supported on devices of compute capability 3.5 and above. Kernels launched through Dynamic Parallelism cannot run on a different (physical) device. The maximum nesting depth is 24, but in practice kernels are limited by memory resources, including the extra memory needed to synchronize parent and child grids (a configuration sketch follows).
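Both of these device-runtime limits can be inspected and raised from the host before the first launch. The sketch below is illustrative, not required by the example; the values 4 and 4096 are arbitrary assumptions:

#include <cstdio>
#include <cuda_runtime.h>

void configureDeviceRuntime()
{
    // reserve state for parents that call cudaDeviceSynchronize()
    // down to 4 nesting levels (the default is 2)
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    // allow up to 4096 child grids to be queued before they are drained
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    size_t depth = 0, pending = 0;
    cudaDeviceGetLimit(&depth,   cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth limit: %zu, pending launch limit: %zu\n", depth, pending);
}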
Anyone who has worked through an algorithms text such as Introduction to Algorithms knows that recursion is relatively expensive, so when possible it is better to unroll it. Here we do exactly the opposite and implement recursion, mainly to demonstrate once more the benefit of Dynamic Parallelism: with it we can write recursive GPU code much like we would in C.
The code below is one such implementation. As before, each child grid has a single block and the first thread of a block launches the kernel; the difference is that the parent grid has many blocks. The first step is still to convert the global memory address g_idata into a block-local address. Then an if statement checks whether the thread should exit; if so, the result is written back to global memory. If not, a local reduction is performed: half of the threads do an in-place reduction, and the block is then synchronized to guarantee that all partial sums are complete. Thread 0 then launches a child grid with a single block containing half as many threads.
__global__ void gpuRecursiveReduce(int *g_idata, int *g_odata, unsigned int isize)
{
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invocation
    int istride = isize >> 1;
    if (istride > 1 && tid < istride)
    {
        // in place reduction
        idata[tid] += idata[tid + istride];
    }

    // sync at block level
    __syncthreads();

    // nested invocation to generate child grids
    if (tid == 0)
    {
        gpuRecursiveReduce<<<1, istride>>>(idata, odata, istride);
        // sync all child grids launched in this block
        cudaDeviceSynchronize();
    }

    // sync at block level again
    __syncthreads();
}
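The post only lists the kernel, so the host driver below is a hedged reconstruction; the array size and block size (1048576 elements, 512 threads per block) come from the printed log, but the rest of the driver is an assumption:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    int const nElem = 1 << 20;            // 1048576 elements, as in the log below
    int const blocksize = 512;
    dim3 block(blocksize, 1);
    dim3 grid((nElem + block.x - 1) / block.x, 1);

    size_t nBytes = nElem * sizeof(int);
    int *h_idata = (int *)malloc(nBytes);
    int *h_odata = (int *)malloc(grid.x * sizeof(int));
    for (int i = 0; i < nElem; i++) h_idata[i] = 1;   // so the sum should equal nElem

    int *d_idata = NULL, *d_odata = NULL;
    cudaMalloc((void **)&d_idata, nBytes);
    cudaMalloc((void **)&d_odata, grid.x * sizeof(int));
    cudaMemcpy(d_idata, h_idata, nBytes, cudaMemcpyHostToDevice);

    gpuRecursiveReduce<<<grid, block>>>(d_idata, d_odata, block.x);
    cudaDeviceSynchronize();

    cudaMemcpy(h_odata, d_odata, grid.x * sizeof(int), cudaMemcpyDeviceToHost);
    int gpu_sum = 0;
    for (unsigned int i = 0; i < grid.x; i++) gpu_sum += h_odata[i];
    printf("gpu nested sum: %d <<<grid %d block %d>>>\n", gpu_sum, grid.x, block.x);

    free(h_idata); free(h_odata);
    cudaFree(d_idata); cudaFree(d_odata);
    cudaDeviceReset();
    return 0;
}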
Compile and run; the results below were obtained on a Kepler K40:
$ nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce -lcudadevrt
./nestedReduce starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce      elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored  elapsed 0.000532 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested      elapsed 0.172036 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
Compared with the neighbored version, the nested result is very poor.
From the results above, 2048 blocks are launched initially. Each block then recurses 8 times (the block size is halved from 512 down to 2), so 2048 × 8 = 16384 child grids are created and __syncthreads is also invoked 16384 times. This is the main cause of the low efficiency.
When a child grid is launched, the memory it sees is exactly the same as the parent's. Because each child only needs a portion of the parent's data, the block-level synchronization before each child grid launch is unnecessary. After removing it:
__global__ void gpuRecursiveReduceNosync(int *g_idata, int *g_odata, unsigned int isize)
{
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invoke
    int istride = isize >> 1;
    if (istride > 1 && tid < istride)
    {
        idata[tid] += idata[tid + istride];
        if (tid == 0)
        {
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);
        }
    }
}
Running it, the time drops to roughly a third of the original:
./nestedReduceNoSync starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce      elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored  elapsed 0.000532 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested      elapsed 0.172036 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
Still, this is slower than the neighbor-paired implementation. The next change follows the idea illustrated below: the kernel gains an extra argument, iDim, because the child block size is halved at every recursion level, so the parent's blockDim must be passed down to the child grid for every thread to compute the correct global memory offset. Note that all idle threads are removed: in the previous implementations, half of the threads sat idle at each level; removing them frees half of the compute resources.
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;

    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }
}
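Because the kernel now carries the stride and the original block dimension explicitly, the host launch has to seed them. A sketch of the initial call, reusing the d_idata/d_odata/grid/block names from the driver sketch above (which are assumptions, not the post's code):

    // first level: half the threads, stride = blockDim/2, iDim = the full blockDim
    gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2, block.x);
    cudaDeviceSynchronize();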
Compile and run:
./nestedReduce2 starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce      elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored  elapsed 0.000532 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested      elapsed 0.172036 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested2     elapsed 0.000797 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
These numbers look much better. We can guess that this is mainly because far fewer child grids are launched, and we can verify it with nvprof:
$ nvprof ./nestedReduce2
Part of the output is shown below. The second column gives the number of device-side kernel launches: the first and second kernels each created 16384 child grids, while gpuRecursiveReduce2, with its 8 levels of nested parallelism, created only 8 children.
Calls (host)  Calls (device)  Avg       Min       Max       Name
1             16384           441.48us  2.3360us  171.34ms  gpuRecursiveReduce
1             16384           51.140us  2.2080us  57.906ms  gpuRecursiveReduceNosync
1             8               56.195us  22.048us  100.74us  gpuRecursiveReduce2
1             0               352.67us  352.67us  352.67us  reduceNeighbored
For a given algorithm there are many possible implementations, and avoiding a large number of nested launches can improve performance considerably. Synchronization is critical for correctness but is also an expensive operation, and here the block-level synchronization could be removed entirely. Because running nested programs on the device requires extra resources, nested launches should be used with care.