CUDA ---- Dynamic Parallelism

Dynamic Parallelism

So far, every kernel has been launched from the host, with the GPU working entirely under CPU control. CUDA Dynamic Parallelism allows a GPU kernel to create and launch new kernels directly on the device. Dynamic Parallelism makes recursive algorithms easier to implement and understand, and because launch configurations can be decided at runtime by threads on the device, it also reduces the data transfer and execution control that must pass between host and device. In what follows we analyze and learn how to use Dynamic Parallelism.

Nested Execution

The syntax for launching a kernel on the device is exactly the same as launching one from the host. Kernel executions fall into two types: parent and child. A parent thread, parent block, or parent grid can launch a new grid, called a child grid. A child grid must complete before its parent grid completes; in other words, a parent must wait for all of its children to finish before it is considered complete.

When a parent launches a child grid, the child is not guaranteed to begin executing until the parent explicitly synchronizes on it. Parent and child share the same global and constant memory, but have distinct shared and local memory. There are only two points in time at which the child's and the parent's views of global memory are guaranteed to be fully consistent: when the child starts and when the child completes. All global memory operations performed by the parent before the launch are visible to the child, while the child's global memory operations become visible to the parent only after the parent has synchronized on the child's completion.
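As a minimal sketch of these visibility rules (the kernel names and the data array are made up for illustration, not taken from the book's code): the parent writes to global memory before the launch, the child sees that write, and the parent sees the child's update only after the device-side synchronize.

__global__ void childKernel(int *data) {
    // the child sees every global-memory write the parent made before the launch
    data[0] += 1;
}

__global__ void parentKernel(int *data) {
    if (threadIdx.x == 0) {
        data[0] = 100;                        // visible to the child when it starts
        childKernel<<<1, 1>>>(data);          // device-side launch
        cudaDeviceSynchronize();              // after this, the child's writes are visible
        printf("parent sees %d\n", data[0]);  // prints 101
    }
}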

 

Nested Hello World on the GPU

To illustrate Dynamic Parallelism more clearly, we adapt the hello-world program written at the very beginning. The figure below shows the execution flow with Dynamic Parallelism: the host launches a parent grid (eight threads per block); thread 0 launches a child grid (four threads per block); thread 0 of that child grid launches another child grid (two threads per block), and so on.

[Figure: nested Hello World launch hierarchy]

The code is shown below. Each thread first prints Hello World; then each thread checks whether the recursion should stop.

__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid, blockIdx.x);

    // condition to stop recursive execution
    if (iSize == 1) return;

    // reduce block size to half
    int nthreads = iSize >> 1;

    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
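For completeness, a host-side driver consistent with the output shown below might look like the following sketch; the argument handling and the omission of error-checking macros are assumptions, not the book's exact code.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int igrid = 1;                        // number of parent blocks, optionally from argv[1]
    if (argc > 1) igrid = atoi(argv[1]);

    dim3 block(8, 1);                     // eight threads per block, as in the figure
    dim3 grid(igrid, 1);
    printf("%s Execution Configuration: grid %d block %d\n", argv[0], grid.x, block.x);

    nestedHelloWorld<<<grid, block>>>(block.x, 0);

    cudaGetLastError();                   // check for launch errors
    cudaDeviceReset();                    // implicitly synchronizes and cleans up
    return 0;
}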

Compile:

$ nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt

-lcudadevrt links against the device runtime library, just like linking any other library with gcc. -rdc=true enables relocatable device code, which is required for Dynamic Parallelism; the reasons behind this are a larger topic that we will explore later.

The output of the program is:

./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0

The 0, 1, 2, 3, 4, ... ordering of this output looks oddly regular; for now let us just put it down to how CUDA buffers printf output. Also, judging by experience with recursive CPU programs, the ordering of the recursion levels looks even stranger. This is certainly not a compiler error or a CUDA bug: add a cudaDeviceSynchronize after the child kernel launch and you will see the "normal" ordering, at which point the reason becomes clear.
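Concretely, the suggested experiment is to synchronize on the child grid inside the parent, as in this sketch of the modified launch branch (only the device-side synchronize is new):

    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        // wait for the child grid to finish before the parent thread continues;
        // the output then appears in the "expected" depth-first recursive order
        cudaDeviceSynchronize();
        printf("-------> nested execution depth: %d\n", iDepth);
    }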

You can inspect the execution with nvvp; the gaps in the timeline show the parent waiting for its children to finish:

$ nvvp ./nestedHelloWorld

Next, we try using two blocks instead of one:

$ ./nestedHelloWorld 2

The output is:

./nestedHelloWorld 2 Execution Configuration: grid 2 block 8
Recursion=0: Hello World from thread 0 block 1
Recursion=0: Hello World from thread 1 block 1
Recursion=0: Hello World from thread 2 block 1
Recursion=0: Hello World from thread 3 block 1
Recursion=0: Hello World from thread 4 block 1
Recursion=0: Hello World from thread 5 block 1
Recursion=0: Hello World from thread 6 block 1
Recursion=0: Hello World from thread 7 block 1
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
Recursion=3: Hello World from thread 0 block 0

Looking at the results above, the first thing to notice is that every child block has block ID 0. The figure below shows the call tree: the parent grid now has two blocks, but every child grid still has only one block, because the child launch hard-codes a single block:

nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);

[Figure: call tree with two parent blocks; each child grid has a single block]

Note: Dynamic Parallelism is supported only on devices of compute capability 3.5 or higher. A kernel launched through Dynamic Parallelism cannot execute on a different (physical) device. The maximum nesting depth is 24, but in practice kernels are limited by memory resources, including the extra memory needed to synchronize parent and child grids.
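If deeper nesting combined with device-side cudaDeviceSynchronize is really needed, the host can reserve the extra memory up front through the device runtime limits. A minimal sketch, with the limit values chosen arbitrarily for illustration:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // reserve backing store so that device-side synchronization works
    // down to 8 levels of nesting (the default reservation is smaller)
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);

    // optionally enlarge the buffer that holds pending child launches
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    size_t depth = 0, pending = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth limit: %zu, pending launch limit: %zu\n", depth, pending);
    return 0;
}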

Nested Reduction

Anyone who has worked through an algorithms text such as Introduction to Algorithms knows that recursion is relatively resource-hungry, so it is usually better to unroll it when possible. Here we do exactly the opposite and implement the reduction recursively; the point of this part is mainly to demonstrate once more the benefit of Dynamic Parallelism: with it you can write recursive code on the GPU just as you would in C.

The code below is one implementation. As before, each child grid has a single block and the first thread in the block launches the child kernel; the difference is that the parent grid now has many blocks. The first step is still to convert the global memory pointer g_idata into a block-local pointer. Then an if statement decides whether to stop; if so, the result is written back to global memory. Otherwise a local reduction is performed: half of the threads do an in-place reduction, and the block then synchronizes to make sure all partial sums are complete. Thread 0 then launches a child grid with a single block and half the current number of threads.

__global__ void gpuRecursiveReduce(int *g_idata, int *g_odata, unsigned int isize) {
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invocation
    int istride = isize >> 1;
    if (istride > 1 && tid < istride) {
        // in place reduction
        idata[tid] += idata[tid + istride];
    }

    // sync at block level
    __syncthreads();

    // nested invocation to generate child grids
    if (tid == 0) {
        gpuRecursiveReduce<<<1, istride>>>(idata, odata, istride);
        // sync all child grids launched in this block
        cudaDeviceSynchronize();
    }

    // sync at block level again
    __syncthreads();
}
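The host presumably launches the parent grid once over the whole array, along these lines (a sketch only; size, d_idata, and d_odata are assumed to be the array length and device buffers of size and grid.x elements respectively):

    dim3 block(512, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);   // 2048 blocks for 1048576 elements

    // one parent grid; each block reduces its blockDim.x elements recursively
    gpuRecursiveReduce<<<grid, block>>>(d_idata, d_odata, block.x);
    cudaDeviceSynchronize();
    // the per-block partial sums left in d_odata are then summed on the host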

Compile and run; the results below were obtained on a Kepler K40:

$ nvcc -arch=sm_35 -rdc=true nestedReduce.cu -o nestedReduce -lcudadevrt
./nestedReduce starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

Compared with the neighbored implementation, the nested result is very poor.

From the results above, 2048 blocks were launched. Each block performs 8 levels of recursion, so 16384 child grids are created and __syncthreads is also called 16384 times. All of this is what makes the efficiency so low.

When a child grid is invoked, the memory it sees is exactly the same as the parent's. Since each child only needs part of the parent's data, the block-level synchronization before each child grid launch is unnecessary. After removing it:

__global__ void gpuRecursiveReduceNosync(int *g_idata, int *g_odata, unsigned int isize) {
    // set thread ID
    unsigned int tid = threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;
    int *odata = &g_odata[blockIdx.x];

    // stop condition
    if (isize == 2 && tid == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // nested invoke
    int istride = isize >> 1;
    if (istride > 1 && tid < istride) {
        idata[tid] += idata[tid + istride];
        if (tid == 0) {
            gpuRecursiveReduceNosync<<<1, istride>>>(idata, odata, istride);
        }
    }
}

Running it, the time drops to roughly one third of the original:

./nestedReduceNoSync starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

However, performance is still slower than the neighbor-paired implementation. Next we make a further change; the main idea is shown in the figure below. The kernel gains an extra parameter, iDim: since the child block size is halved on every recursive call, the parent's original blockDim must be passed down to the child grid so that every thread can compute the correct global memory offset. Note that all idle threads are removed. In the previous implementations half of the threads sat idle at every level; those threads are now simply not launched, which frees half of the compute resources.

[Figure: recursive reduction with idle threads removed; iDim passed to each child grid]

__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride, int const iDim) {
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;

    // stop condition
    if (iStride == 1 && threadIdx.x == 0) {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }

    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];

    // nested invocation to generate child grids
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata, iStride / 2, iDim);
    }
}
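Because the kernel signature changed, the host launch changes too: only half of the block's threads are started, and the initial stride and the original block dimension are passed in. A sketch consistent with the kernel above (variable names are assumptions):

    dim3 block(512, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);

    // start with half the threads; iStride = block.x / 2, iDim = block.x
    gpuRecursiveReduce2<<<grid, block.x / 2>>>(d_idata, d_odata, block.x / 2, block.x);
    cudaDeviceSynchronize();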

Compile and run:

./nestedReduce2 starting reduction at device 0: Tesla K40c
array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000689 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000532 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested elapsed 0.172036 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.059125 sec gpu_sum: 1048576<<<grid 2048 block 512>>>
gpu nested2 elapsed 0.000797 sec gpu_sum: 1048576<<<grid 2048 block 512>>>

These numbers look much better again. We can guess that this is mostly because far fewer child grids are launched, and we can verify that with nvprof:

$ nvprof ./nestedReduce2

Part of the output is shown below. The second column shows the number of device-side kernel invocations: the first two kernels each created 16384 child grids, while gpuRecursiveReduce2, with its eight levels of nesting, created only 8 child grids, because only thread 0 of block 0 launches a child at each level.

Calls (host)  Calls (device)  Avg       Min       Max       Name
1             16384           441.48us  2.3360us  171.34ms  gpuRecursiveReduce
1             16384           51.140us  2.2080us  57.906ms  gpuRecursiveReduceNosync
1             8               56.195us  22.048us  100.74us  gpuRecursiveReduce2
1             0               352.67us  352.67us  352.67us  reduceNeighbored

For a given algorithm there are many possible implementations, and avoiding large numbers of nested launches can improve performance considerably. Synchronization is critical for correctness but is also an expensive operation; the block-level synchronization, at least, can be removed here. Because running nested programs on the device requires additional resources, nested launches should be used sparingly.
