【CUDA開發】CUDA從入門到精通

時間 2019-11-30

標籤 CUDA開發 cuda 入門精通简体版

原文原文鏈接

在老闆的要求下，本博主從2012年上高性能計算課程開始接觸CUDA編程，隨後將該技術應用到了實際項目中，使處理程序加速超過1K，可見基於圖形顯示器的並行計算對於追求速度的應用來講無疑是一個理想的選擇。還有不到一年畢業，怕是畢業後這些技術也就隨畢業而去，準備這個暑假開闢一個CUDA專欄，從入門到精通，步步爲營，順便分享設計的一些經驗教訓，但願能給學習CUDA的童鞋提供必定指導。我的能力所及，錯誤不免，歡迎討論。html

PS：申請專欄好像須要先發原創帖超過15篇。。。算了，先寫夠再申請吧，到時候一併轉過去。算法

CUDA從入門到精通（一）：環境搭建編程

NVIDIA於2006年推出CUDA（Compute Unified Devices Architecture），能夠利用其推出的GPU進行通用計算，將並行計算從大型集羣擴展到了普通顯卡，使得用戶只須要一臺帶有Geforce顯卡的筆記本就能跑較大規模的並行處理程序。windows

使用顯卡的好處是，和大型集羣相比功耗很是低，成本也不高，但性能很突出。以個人筆記本爲例，Geforce 610M，用DeviceQuery程序測試，可獲得以下硬件參數：服務器

計算能力達48X0.95 = 45.6 GFLOPS。而筆記本的CPU參數以下：多線程

CPU計算能力爲（4核）：2.5G*4 = 10GFLOPS，可見，顯卡計算性能是4核i5 CPU的4~5倍，所以咱們能夠充分利用這一資源來對一些耗時的應用進行加速。架構

好了，工欲善其事必先利其器，爲了使用CUDA對GPU進行編程，咱們須要準備如下必備工具：併發

1. 硬件平臺，就是顯卡，若是你用的不是NVIDIA的顯卡，那麼只能說抱歉，其餘都不支持CUDA。app

2. 操做系統，我用過windows XP，Windows 7都沒問題，本博客用Windows7。

3. C編譯器，建議VS2008，和本博客一致。

4. CUDA編譯器NVCC，能夠免費免註冊免license從官網下載CUDA ToolkitCUDA下載，最新版本爲5.0，本博客用的就是該版本。

5. 其餘工具（如Visual Assist，輔助代碼高亮）

準備完畢，開始安裝軟件。VS2008安裝比較費時間，建議安裝完整版（NVIDIA官網說Express版也能夠），過程沒必要詳述。CUDA Toolkit 5.0裏面包含了NVCC編譯器、設計文檔、設計例程、CUDA運行時庫、CUDA頭文件等必備的原材料。

安裝完畢，咱們在桌面上發現這個圖標：

不錯，就是它，雙擊運行，能夠看到一大堆例程。咱們找到Simple OpenGL這個運行看看效果：

點右邊黃線標記處的Run便可看到美妙的三維正弦曲面，鼠標左鍵拖動能夠轉換角度，右鍵拖動能夠縮放。若是這個運行成功，說明你的環境基本搭建成功。

出現問題的可能：

1. 你使用遠程桌面鏈接登陸到另外一臺服務器，該服務器上有顯卡支持CUDA，但你遠程終端不能運行CUDA程序。這是由於遠程登陸使用的是你本地顯卡資源，在遠程登陸時看不到服務器端的顯卡，因此會報錯：沒有支持CUDA的顯卡！解決方法：1. 遠程服務器裝兩塊顯卡，一塊只用於顯示，另外一塊用於計算；2.不要用圖形界面登陸，而是用命令行界面如telnet登陸。

2.有兩個以上顯卡都支持CUDA的狀況，如何區分是在哪一個顯卡上運行？這個須要你在程序裏控制，選擇符合必定條件的顯卡，如較高的時鐘頻率、較大的顯存、較高的計算版本等。詳細操做見後面的博客。

好了，先說這麼多，下一節咱們介紹如何在VS2008中給GPU編程。

CUDA從入門到精通（二）：第一個CUDA程序

書接上回，咱們既然直接運行例程成功了，接下來就是了解如何實現例程中的每一個環節。固然，咱們先從簡單的作起，通常編程語言都會找個helloworld例子，而咱們的顯卡是不會說話的，只能作一些簡單的加減乘除運算。因此，CUDA程序的helloworld，我想應該最合適不過的就是向量加了。

打開VS2008，選擇File->New->Project，彈出下面對話框，設置以下：

以後點OK，直接進入工程界面。

工程中，咱們看到只有一個.cu文件，內容以下：

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
   
 #include   
   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
   
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  
   
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
   
     // Add vectors in parallel.  
     cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
   
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",  
         c[0], c[1], c[2], c[3], c[4]);  
   
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
   
     return 0;  
 }  
   
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
   
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
     // Launch a kernel on the GPU with one thread for each element.  
     addKernel<<<1, size>>>(dev_c, dev_a, dev_b);  
   
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
   
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);  
       
     return cudaStatus;  
 }  

能夠看出，CUDA程序和C程序並沒有區別，只是多了一些以"cuda"開頭的一些庫函數和一個特殊聲明的函數：

[cpp]view plaincopy 
    
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  

這個函數就是在GPU上運行的函數，稱之爲核函數，英文名Kernel Function，注意要和操做系統內核函數區分開來。

咱們直接按F7編譯，能夠獲得以下輸出：

[html]view plaincopy 
    
 1>------ Build started: Project: cuda_helloworld, Configuration: Debug Win32 ------    
 1>Compiling with CUDA Build Rule...    
 1>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\nvcc.exe"  -G   -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 32 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\include" -maxrregcount=0   --compile -o "Debug/kernel.cu.obj" kernel.cu      
 1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.gpu    
 1>tmpxft_000000ec_00000000-14_kernel.compute_10.cudafe2.gpu    
 1>tmpxft_000000ec_00000000-5_kernel.compute_20.cudafe1.gpu    
 1>tmpxft_000000ec_00000000-17_kernel.compute_20.cudafe2.gpu    
 1>kernel.cu    
 1>kernel.cu    
 1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.cpp    
 1>tmpxft_000000ec_00000000-24_kernel.compute_10.ii    
 1>Linking...    
 1>Embedding manifest...    
 1>Performing Post-Build Event...    
 1>copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart*.dll" "C:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\Debug"    
 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart32_50_35.dll    
 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart64_50_35.dll    
 1>已複製         2 個文件。    
 1>Build log was saved at "file://c:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\cuda_helloworld\Debug\BuildLog.htm"    
 1>cuda_helloworld - 0 error(s), 105 warning(s)    
 ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========    

可見，編譯.cu文件須要利用nvcc工具。該工具的詳細使用見後面博客。

直接運行，能夠獲得結果圖以下：

若是顯示正確，那麼咱們的第一個程序宣告成功！

CUDA從入門到精通（三）：必備資料

剛入門CUDA，跑過幾個官方提供的例程，看了看人家的代碼，以爲並不難，但本身動手寫代碼時，老是不知道要先幹什麼，後幹什麼，也不知道從哪一個知識點學起。這時就須要有一本能提供指導的書籍或者教程，一步步跟着作下去，直到真正掌握。

通常講述CUDA的書，我認爲不錯的有下面這幾本：

初學者能夠先看美國人寫的這本《GPU高性能編程CUDA實戰》，可操做性很強，但不要指望能全看懂（Ps：裏面有些概念其實我如今仍是不怎麼懂），但不影響你進一步學習。若是想更全面地學習CUDA，《GPGPU編程技術》比較客觀詳細地介紹了通用GPU編程的策略，看過這本書，能夠對顯卡有更深刻的瞭解，揭開GPU的神祕面紗。後面《OpenGL編程指南》徹底是爲了體驗圖形交互帶來的樂趣，能夠有選擇地看；《GPU高性能運算之CUDA》這本是師兄給的，適合快速查詢（感受是將官方編程手冊翻譯了一遍）一些關鍵技術和概念。

有了這些指導材料還不夠，咱們在作項目的時候，遇到的問題在這些書上確定找不到，因此還須要有下面這些利器：

這裏面有不少工具的使用手冊，如CUDA_GDB，Nsight，CUDA_Profiler等，方便調試程序；還有一些有用的庫，如CUFFT是專門用來作快速傅里葉變換的，CUBLAS是專用於線性代數（矩陣、向量計算）的，CUSPASE是專用於稀疏矩陣表示和計算的庫。這些庫的使用能夠下降咱們設計算法的難度，提升開發效率。另外還有些入門教程也是值得一讀的，你會對NVCC編譯器有更近距離的接觸。

好了，前言就這麼多，本博主計劃按以下順序來說述CUDA：

1.瞭解設備

2.線程並行

3.塊並行

4.流並行

5.線程通訊

6.線程通訊實例：規約

7.存儲模型

8.常數內存

9.紋理內存

10.主機頁鎖定內存

11.圖形互操做

12.優化準則

13.CUDA與MATLAB接口

14.CUDA與MFC接口

CUDA從入門到精通（四）：加深對設備的認識

前面三節已經對CUDA作了一個簡單的介紹，這一節開始真正進入編程環節。

首先，初學者應該對本身使用的設備有較爲紮實的理解和掌握，這樣對後面學習並行程序優化頗有幫助，瞭解硬件詳細參數能夠經過上節介紹的幾本書和官方資料得到，但若是仍然以爲不夠直觀，那麼咱們能夠本身動手得到這些內容。

以第二節例程爲模板，咱們稍加改動的部分代碼以下：

[cpp]view plaincopy 
    
 // Add vectors in parallel.  
 cudaError_t cudaStatus;  
 int num = 0;  
 cudaDeviceProp prop;  
 cudaStatus = cudaGetDeviceCount(&num);  
 for(int i = 0;i<num;i++)  < span="" style="word-wrap: break-word;">
 {  
     cudaGetDeviceProperties(&prop,i);  
 }  
 cudaStatus = addWithCuda(c, a, b, arraySize);  

這個改動的目的是讓咱們的程序自動經過調用cuda API函數得到設備數目和屬性，所謂「知己知彼，百戰不殆」。

cudaError_t 是cuda錯誤類型，取值爲整數。

cudaDeviceProp爲設備屬性結構體，其定義能夠從cuda Toolkit安裝目錄中找到，個人路徑爲：C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h，找到定義爲：

[cpp]view plaincopy 
    
 /** 
  * CUDA device properties 
  */  
 struct __device_builtin__ cudaDeviceProp  
 {  
     char   name[256];                  /**< ASCII string identifying device */  
     size_t totalGlobalMem;             /**< Global memory available on device in bytes */  
     size_t sharedMemPerBlock;          /**< Shared memory available per block in bytes */  
     int    regsPerBlock;               /**< 32-bit registers available per block */  
     int    warpSize;                   /**< Warp size in threads */  
     size_t memPitch;                   /**< Maximum pitch in bytes allowed by memory copies */  
     int    maxThreadsPerBlock;         /**< Maximum number of threads per block */  
     int    maxThreadsDim[3];           /**< Maximum size of each dimension of a block */  
     int    maxGridSize[3];             /**< Maximum size of each dimension of a grid */  
     int    clockRate;                  /**< Clock frequency in kilohertz */  
     size_t totalConstMem;              /**< Constant memory available on device in bytes */  
     int    major;                      /**< Major compute capability */  
     int    minor;                      /**< Minor compute capability */  
     size_t textureAlignment;           /**< Alignment requirement for textures */  
     size_t texturePitchAlignment;      /**< Pitch alignment requirement for texture references bound to pitched memory */  
     int    deviceOverlap;              /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */  
     int    multiProcessorCount;        /**< Number of multiprocessors on device */  
     int    kernelExecTimeoutEnabled;   /**< Specified whether there is a run time limit on kernels */  
     int    integrated;                 /**< Device is integrated as opposed to discrete */  
     int    canMapHostMemory;           /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */  
     int    computeMode;                /**< Compute mode (See ::cudaComputeMode) */  
     int    maxTexture1D;               /**< Maximum 1D texture size */  
     int    maxTexture1DMipmap;         /**< Maximum 1D mipmapped texture size */  
     int    maxTexture1DLinear;         /**< Maximum size for 1D textures bound to linear memory */  
     int    maxTexture2D[2];            /**< Maximum 2D texture dimensions */  
     int    maxTexture2DMipmap[2];      /**< Maximum 2D mipmapped texture dimensions */  
     int    maxTexture2DLinear[3];      /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */  
     int    maxTexture2DGather[2];      /**< Maximum 2D texture dimensions if texture gather operations have to be performed */  
     int    maxTexture3D[3];            /**< Maximum 3D texture dimensions */  
     int    maxTextureCubemap;          /**< Maximum Cubemap texture dimensions */  
     int    maxTexture1DLayered[2];     /**< Maximum 1D layered texture dimensions */  
     int    maxTexture2DLayered[3];     /**< Maximum 2D layered texture dimensions */  
     int    maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */  
     int    maxSurface1D;               /**< Maximum 1D surface size */  
     int    maxSurface2D[2];            /**< Maximum 2D surface dimensions */  
     int    maxSurface3D[3];            /**< Maximum 3D surface dimensions */  
     int    maxSurface1DLayered[2];     /**< Maximum 1D layered surface dimensions */  
     int    maxSurface2DLayered[3];     /**< Maximum 2D layered surface dimensions */  
     int    maxSurfaceCubemap;          /**< Maximum Cubemap surface dimensions */  
     int    maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */  
     size_t surfaceAlignment;           /**< Alignment requirements for surfaces */  
     int    concurrentKernels;          /**< Device can possibly execute multiple kernels concurrently */  
     int    ECCEnabled;                 /**< Device has ECC support enabled */  
     int    pciBusID;                   /**< PCI bus ID of the device */  
     int    pciDeviceID;                /**< PCI device ID of the device */  
     int    pciDomainID;                /**< PCI domain ID of the device */  
     int    tccDriver;                  /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */  
     int    asyncEngineCount;           /**< Number of asynchronous engines */  
     int    unifiedAddressing;          /**< Device shares a unified address space with the host */  
     int    memoryClockRate;            /**< Peak memory clock frequency in kilohertz */  
     int    memoryBusWidth;             /**< Global memory bus width in bits */  
     int    l2CacheSize;                /**< Size of L2 cache in bytes */  
     int    maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */  
 };  

後面的註釋已經說明了其字段表明意義，可能有些術語對於初學者理解起來仍是有必定困難，不要緊，咱們如今只須要關注如下幾個指標：

name：就是設備名稱；

totalGlobalMem：就是顯存大小；

major,minor：CUDA設備版本號，有1.1, 1.2, 1.3, 2.0, 2.1等多個版本；

clockRate：GPU時鐘頻率；

multiProcessorCount：GPU大核數，一個大核（專業點稱爲流多處理器，SM，Stream-Multiprocessor）包含多個小核（流處理器，SP，Stream-Processor）

編譯，運行，咱們在VS2008工程的cudaGetDeviceProperties()函數處放一個斷點，單步執行這一函數，而後用Watch窗口，切換到Auto頁，展開+，在個人筆記本上獲得以下結果：

能夠看到，設備名爲GeForce 610M，顯存1GB，設備版本2.1（比較高端了，哈哈），時鐘頻率爲950MHz（注意950000單位爲kHz），大核數爲1。在一些高性能GPU上（如Tesla，Kepler系列），大核數可能達到幾十甚至上百，能夠作更大規模的並行處理。

PS：今天看SDK代碼時發如今help_cuda.h中有個函數實現從CUDA設備版本查詢相應大核中小核的數目，以爲頗有用，之後編程序能夠借鑑，摘抄以下：

[cpp]view plaincopy 
    
 // Beginning of GPU Architecture definitions  
 inline int _ConvertSMVer2Cores(int major, int minor)  
 {  
     // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM  
     typedef struct  
     {  
         int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version  
         int Cores;  
     } sSMtoCores;  
   
     sSMtoCores nGpuArchCoresPerSM[] =  
     {  
         { 0x10,  8 }, // Tesla Generation (SM 1.0) G80 class  
         { 0x11,  8 }, // Tesla Generation (SM 1.1) G8x class  
         { 0x12,  8 }, // Tesla Generation (SM 1.2) G9x class  
         { 0x13,  8 }, // Tesla Generation (SM 1.3) GT200 class  
         { 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class  
         { 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class  
         { 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class  
         { 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class  
         {   -1, -1 }  
     };  
   
     int index = 0;  
   
     while (nGpuArchCoresPerSM[index].SM != -1)  
     {  
         if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))  
         {  
             return nGpuArchCoresPerSM[index].Cores;  
         }  
   
         index++;  
     }  
   
     // If we don't find the values, we default use the previous one to run properly  
     printf("MapSMtoCores for SM %d.%d is undefined.  Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);  
     return nGpuArchCoresPerSM[7].Cores;  
 }  
 // end of GPU Architecture definitions  

可見，設備版本2.1的一個大核有48個小核，而版本3.0以上的一個大核有192個小核！

前文說到過，當咱們用的電腦上有多個顯卡支持CUDA時，怎麼來區分在哪一個上運行呢？這裏咱們看一下addWithCuda這個函數是怎麼作的。

[cpp]view plaincopy 
    
 cudaError_t cudaStatus;  
   
 // Choose which GPU to run on, change this on a multi-GPU system.  
 cudaStatus = cudaSetDevice(0);  
 if (cudaStatus != cudaSuccess) {  
     fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
     goto Error;  
 }  

使用了cudaSetDevice(0)這個操做，0表示能搜索到的第一個設備號，若是有多個設備，則編號爲0,1,2...。

再看咱們本節添加的代碼，有個函數cudaGetDeviceCount(&num)，這個函數用來獲取設備總數，這樣咱們選擇運行CUDA程序的設備號取值就是0,1,...num-1，因而能夠一個個枚舉設備，利用cudaGetDeviceProperties(&prop)得到其屬性,而後利用必定排序、篩選算法，找到最符合咱們應用的那個設備號opt，而後調用cudaSetDevice(opt)便可選擇該設備。選擇標準能夠從處理能力、版本控制、名稱等各個角度出發。後面講述流併發過程時，還要用到這些API。

若是但願瞭解更多硬件內容能夠結合http://www.geforce.cn/hardware獲取。

CUDA從入門到精通（五）：線程並行

多線程咱們應該都不陌生，在操做系統中，進程是資源分配的基本單元，而線程是CPU時間調度的基本單元（這裏假設只有1個CPU）。

將線程的概念引伸到CUDA程序設計中，咱們能夠認爲線程就是執行CUDA程序的最小單元，前面咱們創建的工程代碼中，有個核函數概念不知各位童鞋還記得沒有，在GPU上每一個線程都會運行一次該核函數。

但GPU上的線程調度方式與CPU有很大不一樣。CPU上會有優先級分配，從高到低，一樣優先級的能夠採用時間片輪轉法實現線程調度。GPU上線程沒有優先級概念，全部線程機會均等，線程狀態只有等待資源和執行兩種狀態，若是資源未就緒，那麼就等待；一旦就緒，當即執行。當GPU資源很充裕時，全部線程都是併發執行的，這樣加速效果很接近理論加速比；而GPU資源少於總線程個數時，有一部分線程就會等待前面執行的線程釋放資源，從而變爲串行化執行。

代碼仍是用上一節的吧，改動不多，再貼一遍：

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"           //CUDA運行時API  
 #include "device_launch_parameters.h"     
 #include   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  < span="" style="word-wrap: break-word;">
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // 重點理解這個函數  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0; //GPU設備端數據指針  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;     //狀態指示  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  //選擇運行平臺  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // 分配GPU設備端內存  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // 拷貝數據到GPU  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // 運行核函數  
 <span style="BACKGROUND-COLOR: #ff6666">    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();   //同步線程  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);      //拷貝結果回主機  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);    //釋放GPU設備端內存  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

紅色部分即啓動核函數的調用過程，這裏看到調用方式和C不太同樣。<<<>>>表示運行時配置符號，裏面1表示只分配一個線程組（又稱線程塊、Block），size表示每一個線程組有size個線程（Thread）。本程序中size根據前面傳遞參數個數應該爲5，因此運行的時候，核函數在5個GPU線程單元上分別運行了一次，總共運行了5次。這5個線程是如何知道本身「身份」的？是靠threadIdx這個內置變量，它是個dim3類型變量，接受<<<>>>中第二個參數，它包含x,y,z 3維座標，而咱們傳入的參數只有一維，因此只有x值是有效的。經過核函數中int i = threadIdx.x;這一句，每一個線程能夠得到自身的id號，從而找到本身的任務去執行。

CUDA從入門到精通（六）：塊並行

同一版本的代碼用了這麼屢次，有點過意不去，因而此次我要作較大的改動，你們要擦亮眼睛，拭目以待。

塊並行至關於操做系統中多進程的狀況，上節說到，CUDA有線程組（線程塊）的概念，將一組線程組織到一塊兒，共同分配一部分資源，而後內部調度執行。線程塊與線程塊之間，毫無瓜葛。這有利於作更粗粒度的並行。咱們將上一節的代碼改成塊並行版本以下：

下節咱們介紹塊並行。

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
 <span style="BACKGROUND-COLOR: #ff0000">    int i = blockIdx.x;  
     c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  < span="" style="word-wrap: break-word;">
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // Launch a kernel on the GPU with one thread for each element.  
  <span style="BACKGROUND-COLOR: #ff0000">   addKernel<<>>(dev_c, dev_a, dev_b);  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

和上一節相比，只有這兩行有改變，<<<>>>裏第一個參數改爲了size，第二個改爲了1，表示咱們分配size個線程塊，每一個線程塊僅包含1個線程，總共仍是有5個線程。這5個線程相互獨立，執行核函數獲得相應的結果，與上一節不一樣的是，每一個線程獲取id的方式變爲int i = blockIdx.x；這是線程塊ID。

因而有童鞋提問了，線程並行和塊並行的區別在哪裏？

線程並行是細粒度並行，調度效率高；塊並行是粗粒度並行，每次調度都要從新分配資源，有時資源只有一份，那麼全部線程塊都只能排成一隊，串行執行。

那是否是咱們全部時候都應該用線程並行，儘量不用塊並行？

固然不是，咱們的任務有時能夠採用分治法，將一個大問題分解爲幾個小規模問題，將這些小規模問題分別用一個線程塊實現，線程塊內能夠採用細粒度的線程並行，而塊之間爲粗粒度並行，這樣能夠充分利用硬件資源，下降線程並行的計算複雜度。適當分解，下降規模，在一些矩陣乘法、向量內積計算應用中能夠獲得充分的展現。

實際應用中，經常是兩者的結合。線程塊、線程組織圖以下所示。

多個線程塊組織成了一個Grid，稱爲線程格（經歷了從一位線程，二維線程塊到三維線程格的過程，立體感很強啊）。

好了，下一節咱們介紹流並行，是更高層次的並行。

CUDA從入門到精通（七）：流並行

前面咱們沒有講程序的結構，我想有些童鞋可能火燒眉毛想知道CUDA程序究竟是怎麼一個執行過程。好的，這一節在介紹流以前，先把CUDA程序結構簡要說一下。

CUDA程序文件後綴爲.cu，有些編譯器可能不認識這個後綴的文件，咱們能夠在VS2008的Tools->Options->Text Editor->File Extension裏添加cu後綴到VC++中，以下圖：

一個.cu文件內既包含CPU程序（稱爲主機程序），也包含GPU程序（稱爲設備程序）。如何區分主機程序和設備程序？根據聲明，凡是掛有「__global__」或者「__device__」前綴的函數，都是在GPU上運行的設備程序，不一樣的是__global__設備程序可被主機程序調用，而__device__設備程序則只能被設備程序調用。

沒有掛任何前綴的函數，都是主機程序。主機程序顯示聲明能夠用__host__前綴。設備程序須要由NVCC進行編譯，而主機程序只須要由主機編譯器（如VS2008中的cl.exe，Linux上的GCC）。主機程序主要完成設備環境初始化，數據傳輸等必備過程，設備程序只負責計算。

主機程序中，有一些「cuda」打頭的函數，這些都是CUDA Runtime API，即運行時函數，主要負責完成設備的初始化、內存分配、內存拷貝等任務。咱們前面第三節用到的函數cudaGetDeviceCount()，cudaGetDeviceProperties()，cudaSetDevice()都是運行時API。這些函數的具體參數聲明咱們沒必要一一記下來，拿出第三節的官方利器就能夠輕鬆查詢，讓咱們打開這個文件：

打開後，在pdf搜索欄中輸入一個運行時函數，例如cudaMemcpy，查到的結果以下：

能夠看到，該API函數的參數形式爲，第一個表示目的地，第二個表示來源地，第三個參數表示字節數，第四個表示類型。若是對類型不瞭解，直接點擊超連接，獲得詳細解釋以下：

可見，該API能夠實現從主機到主機、主機到設備、設備到主機、設備到設備的內存拷貝過程。同時能夠發現，利用該API手冊能夠很方便地查詢咱們須要用的這些API函數，因此之後編CUDA程序必定要把它打開，隨時準備查詢，這樣能夠大大提升編程效率。

好了，進入今天的主題：流並行。

前面已經介紹了線程並行和塊並行，知道了線程並行爲細粒度的並行，而塊並行爲粗粒度的並行，同時也知道了CUDA的線程組織狀況，即Grid-Block-Thread結構。一組線程並行處理能夠組織爲一個block，而一組block並行處理能夠組織爲一個Grid，很天然地想到，Grid只是一個網格，咱們是否能夠利用多個網格來完成並行處理呢？答案就是利用流。

流能夠實如今一個設備上運行多個核函數。前面的塊並行也好，線程並行也好，運行的核函數都是相同的（代碼同樣，傳遞參數也同樣）。而流並行，能夠執行不一樣的核函數，也能夠實現對同一個核函數傳遞不一樣的參數，實現任務級別的並行。

CUDA中的流用cudaStream_t類型實現，用到的API有如下幾個：cudaStreamCreate(cudaStream_t * s)用於建立流，cudaStreamDestroy(cudaStream_t s)用於銷燬流，cudaStreamSynchronize()用於單個流同步，cudaDeviceSynchronize()用於整個設備上的全部流同步，cudaStreamQuery()用於查詢一個流的任務是否已經完成。具體的含義能夠查詢API手冊。

下面咱們將前面的兩個例子中的任務改用流實現，仍然是{1,2,3,4,5}+{10,20,30,40,50} = {11,22,33,44,55}這個例子。代碼以下：

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = blockIdx.x;  
     c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  < span="" style="word-wrap: break-word;">
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 <span style="BACKGROUND-COLOR: #ff6666">  cudaStream_t stream[5];  
     for(int i = 0;i<5;i++)  
     {  
         cudaStreamCreate(&stream[i]);   //建立流  
     }  
     // Launch a kernel on the GPU with one thread for each element.  
 <span style="BACKGROUND-COLOR: #ff6666">  for(int i = 0;i<5;i++)  
     {  
         addKernel<<<1,1,0,stream[i]>>>(dev_c+i, dev_a+i, dev_b+i);    //執行流  
     }  
     cudaDeviceSynchronize();  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
 <span style="BACKGROUND-COLOR: #ff6666">  for(int i = 0;i<5;i++)  
     {  
         cudaStreamDestroy(stream[i]);   //銷燬流  
     }  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

注意到，咱們的核函數代碼仍然和塊並行的版本同樣，只是在調用時作了改變，<<<>>>中的參數多了兩個，其中前兩個和塊並行、線程並行中的意義相同，仍然是線程塊數（這裏爲1）、每一個線程塊中線程數（這裏也是1）。第三個爲0表示每一個block用到的共享內存大小，這個咱們後面再講；第四個爲流對象，表示當前核函數在哪一個流上運行。咱們建立了5個流，每一個流上都裝載了一個核函數，同時傳遞參數有些不一樣，也就是每一個核函數做用的對象也不一樣。這樣就實現了任務級別的並行，當咱們有幾個互不相關的任務時，能夠寫多個核函數，資源容許的狀況下，咱們將這些核函數裝載到不一樣流上，而後執行，這樣能夠實現更粗粒度的並行。

好了，流並行就這麼簡單，咱們處理任務時，能夠根據須要，選擇最適合的並行方式。

UDA從入門到精通（八）：線程通訊

咱們前面幾節主要介紹了三種利用GPU實現並行處理的方式：線程並行，塊並行和流並行。在這些方法中，咱們一再強調，各個線程所進行的處理是互不相關的，即兩個線程不回產生交集，每一個線程都只關注本身的一畝三分地，對其餘線程毫無興趣，就當不存在。。。。

固然，實際應用中，這樣的例子太少了，也就是遇到向量相加、向量對應點乘這類纔會有如此高的並行度，而其餘一些應用，如一組數求和，求最大（小）值，各個線程再也不是相互獨立的，而是產生必定關聯，線程2可能會用到線程1的結果，這時就須要利用本節的線程通訊技術了。

線程通訊在CUDA中有三種實現方式：

1. 共享存儲器；

2. 線程同步；

3. 原子操做；

最經常使用的是前兩種方式，共享存儲器，術語Shared Memory，是位於SM中的特殊存儲器。還記得SM嗎，就是流多處理器，大核是也。一個SM中不只包含若干個SP（流處理器，小核），還包括一部分高速Cache，寄存器組，共享內存等，結構如圖所示：

從圖中可看出，一個SM內有M個SP，Shared Memory由這M個SP共同佔有。另外指令單元也被這M個SP共享，即SIMT架構（單指令多線程架構），一個SM中全部SP在同一時間執行同一代碼。

爲了實現線程通訊，僅僅靠共享內存還不夠，須要有同步機制才能使線程之間實現有序處理。一般狀況是這樣：當線程A須要線程B計算的結果做爲輸入時，須要確保線程B已經將結果寫入共享內存中，而後線程A再從共享內存中讀出。同步必不可少，不然，線程A可能讀到的是無效的結果，形成計算錯誤。同步機制能夠用CUDA內置函數：__syncthreads()；當某個線程執行到該函數時，進入等待狀態，直到同一線程塊（Block）中全部線程都執行到這個函數爲止，即一個__syncthreads()至關於一個線程同步點，確保一個Block中全部線程都達到同步，而後線程進入運行狀態。

綜上兩點，咱們能夠寫一段線程通訊的僞代碼以下：

[cpp]view plaincopy 
    
 //Begin  
 if this is thread B  
      write something to Shared Memory;  
 end if  
 __syncthreads();  
 if this is thread A  
     read something from Shared Memory;  
 end if  
 //End  

上面代碼在CUDA中實現時，因爲SIMT特性，全部線程都執行一樣的代碼，因此在線程中須要判斷本身的身份，以避免誤操做。

注意的是，位於同一個Block中的線程才能實現通訊，不一樣Block中的線程不能經過共享內存、同步進行通訊，而應採用原子操做或主機介入。

對於原子操做，若是感興趣能夠翻閱《GPU高性能編程CUDA實戰》第九章「原子性」。

本節完。下節咱們給出一個實例來看線程通訊的代碼怎麼設計。

CUDA從入門到精通（九）：線程通訊實例

接着上一節，咱們利用剛學到的共享內存和線程同步技術，來作一個簡單的例子。先看下效果吧：

很簡單，就是分別求出1~5這5個數字的和，平方和，連乘積。相信學過C語言的童鞋都能用for循環作出同上面同樣的效果，但爲了學習CUDA共享內存和同步技術，咱們仍是要把簡單的東西複雜化(^_^)。

簡要分析一下，上面例子的輸入都是同樣的，1,2,3,4,5這5個數，但計算過程有些變化，並且每一個輸出和全部輸入都相關，不是前幾節例子中那樣，一個輸出只和一個輸入有關。因此咱們在利用CUDA編程時，須要針對特殊問題作些讓步，把一些步驟串行化實現。

輸入數據本來位於主機內存，經過cudaMemcpy API已經拷貝到GPU顯存（術語爲全局存儲器，Global Memory），每一個線程運行時須要從Global Memory讀取輸入數據，而後完成計算，最後將結果寫回Global Memory。當咱們計算須要屢次相同輸入數據時，你們可能想到，每次都分別去Global Memory讀數據好像有點浪費，若是數據很大，那麼反覆屢次讀數據會至關耗時間。索性咱們把它從Global Memory一次性讀到SM內部，而後在內部進行處理，這樣能夠節省反覆讀取的時間。

有了這個思路，結合上節看到的SM結構圖，看到有一片存儲器叫作Shared Memory，它位於SM內部，處理時訪問速度至關快（差很少每一個時鐘週期讀一次），而全局存儲器讀一次須要耗費幾十甚至上百個時鐘週期。因而，咱們就制定A計劃以下：

線程塊數：1，塊號爲0；（只有一個線程塊內的線程才能進行通訊，因此咱們只分配一個線程塊，具體工做交給每一個線程完成）

線程數：5，線程號分別爲0~4；（線程並行，前面講過）

共享存儲器大小：5個int型變量大小（5 * sizeof(int））。

步驟一：讀取輸入數據。將Global Memory中的5個整數讀入共享存儲器，位置一一對應，和線程號也一一對應，因此能夠同時完成。

步驟二：線程同步，確保全部線程都完成了工做。

步驟三：指定線程，對共享存儲器中的輸入數據完成相應處理。

代碼以下：

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
   
 #include   
   
 cudaError_t addWithCuda(int *c, const int *a, size_t size);  
   
 __global__ void addKernel(int *c, const int *a)  
 {  
     int i = threadIdx.x;  
 <span style="font-size:24px;">  extern __shared__ int smem[];  
    smem[i] = a[i];  
     __syncthreads();  
     if(i == 0)  // 0號線程作平方和  
     {  
         c[0] = 0;  
         for(int d = 0; d < 5; d++)  
         {  
             c[0] += smem[d] * smem[d];  
         }  
     }  
     if(i == 1)//1號線程作累加  
     {  
         c[1] = 0;  
         for(int d = 0; d < 5; d++)  
         {  
             c[1] += smem[d];  
         }  
     }  
     if(i == 2)  //2號線程作累乘  
     {  
         c[2] = 1;  
         for(int d = 0; d < 5; d++)  
         {  
             c[2] *= smem[d];  
         }  
     }  
 }  
   
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus = addWithCuda(c, a, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("\t1+2+3+4+5 = %d\n\t1^2+2^2+3^2+4^2+5^2 = %d\n\t1*2*3*4*5 = %d\n\n\n\n\n\n", c[1], c[0], c[2]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
   
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a,  size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
   
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // Launch a kernel on the GPU with one thread for each element.  
 <span style="font-size:24px;">    addKernel<<<1, size, size * sizeof(int), 0>>>(dev_c, dev_a);  
   
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
   
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);      
     return cudaStatus;  
 }  

從代碼中看到執行配置<<<>>>中第三個參數爲共享內存大小（字節數），這樣咱們就知道了所有4個執行配置參數的意義。恭喜，你的CUDA終於入門了！

CUDA從入門到精通（十）：性能剖析和Visual Profiler

入門後的進一步學習的內容，就是如何優化本身的代碼。咱們前面的例子沒有考慮任何性能方面優化，是爲了更好地學習基本知識點，而不是其餘細節問題。從本節開始，咱們要從性能出發考慮問題，不斷優化代碼，使執行速度提升是並行處理的惟一目的。

測試代碼運行速度有不少方法，C語言裏提供了相似於SystemTime()這樣的API得到系統時間，而後計算兩個事件之間的時長從而完成計時功能。在CUDA中，咱們有專門測量設備運行時間的API，下面一一介紹。

翻開編程手冊《CUDA_Toolkit_Reference_Manual》，隨時準備查詢不懂得API。咱們在運行核函數先後，作以下操做：

[cpp]view plaincopy 
    
 cudaEvent_t start, stop;<span style=< span="" style="word-wrap: break-word;">"white-space:pre">  //事件對象  
 cudaEventCreate(&start);<span style="white-space:pre">  //建立事件  
 cudaEventCreate(&stop);<span style="white-space:pre">       //建立事件  
 cudaEventRecord(start, stream);<span style="white-space:pre">   //記錄開始  
 myKernel<<>>(parameter list);//執行核函數  
   
 cudaEventRecord(stop,stream);<span style="white-space:pre"> //記錄結束事件  
 cudaEventSynchronize(stop);<span style="white-space:pre">   //事件同步，等待結束事件以前的設備操做均已完成  
 float elapsedTime;  
 cudaEventElapsedTime(&elapsedTime,start,stop);//計算兩個事件之間時長（單位爲ms）  

核函數執行時間將被保存在變量elapsedTime中。經過這個值咱們能夠評估算法的性能。下面給一個例子，來看怎麼使用計時功能。

前面的例子規模很小，只有5個元素，處理量過小不足以計時，下面將規模擴大爲1024，此外將反覆運行1000次計算總時間，這樣估計不容易受隨機擾動影響。咱們經過這個例子對比線程並行和塊並行的性能如何。代碼以下：

[cpp]view plaincopy 
    
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel_blk(int *c, const int *a, const int *b)  
 {  
     int i = blockIdx.x;  
     c[i] = a[i]+ b[i];  
 }  
 __global__ void addKernel_thd(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i]+ b[i];  
 }  
 int main()  
 {  
     const int arraySize = 1024;  
     int a[arraySize] = {0};  
     int b[arraySize] = {0};  
     for(int i = 0;i<arraysize;i++)  < span="" style="word-wrap: break-word;">
     {  
         a[i] = i;  
         b[i] = arraySize-i;  
     }  
     int c[arraySize] = {0};  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  < span="" style="word-wrap: break-word;">
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
   
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     for(int i = 0;i<arraysize;i++)  < span="" style="word-wrap: break-word;">
     {  
         if(c[i] != (a[i]+b[i]))  
         {  
             printf("Error in %d\n",i);  
         }  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaEvent_t start,stop;  
     cudaEventCreate(&start);  
     cudaEventCreate(&stop);  
     cudaEventRecord(start,0);  
     for(int i = 0;i<1000;i++)  
     {  
 //      addKernel_blk<<>>(dev_c, dev_a, dev_b);  
         addKernel_thd<<<1,size>>>(dev_c, dev_a, dev_b);  
     }  
     cudaEventRecord(stop,0);  
     cudaEventSynchronize(stop);  
     float tm;  
     cudaEventElapsedTime(&tm,start,stop);  
     printf("GPU Elapsed time:%.6f ms.\n",tm);  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

addKernel_blk是採用塊並行實現的向量相加操做，而addKernel_thd是採用線程並行實現的向量相加操做。分別運行，獲得的結果以下圖所示：

線程並行：

塊並行：

可見性能居然相差近16倍！所以選擇並行處理方法時，若是問題規模不是很大，那麼採用線程並行是比較合適的，而大問題分多個線程塊處理時，每一個塊內線程數不要太少，像本文中的只有1個線程，這是對硬件資源的極大浪費。一個理想的方案是，分N個線程塊，每一個線程塊包含512個線程，將問題分解處理，效率每每比單一的線程並行處理或單一塊並行處理高不少。這也是CUDA編程的精髓。

上面這種分析程序性能的方式比較粗糙，只知道大概運行時間長度，對於設備程序各部分代碼執行時間沒有一個深刻的認識，這樣咱們就有個問題，若是對代碼進行優化，那麼優化哪一部分呢？是將線程數調節呢，仍是改用共享內存？這個問題最好的解決方案就是利用Visual Profiler。下面內容摘自《CUDA_Profiler_Users_Guide》

「Visual Profiler是一個圖形化的剖析工具，能夠顯示你的應用程序中CPU和GPU的活動狀況，利用分析引擎幫助你尋找優化的機會。」

其實除了可視化的界面，NVIDIA提供了命令行方式的剖析命令：nvprof。對於初學者，使用圖形化的方式比較容易上手，因此本節使用Visual Profiler。

打開Visual Profiler，能夠從CUDA Toolkit安裝菜單處找到。主界面以下：

咱們點擊File->New Session，彈出新建會話對話框，以下圖所示：

其中File一欄填入咱們須要進行剖析的應用程序exe文件，後面能夠都不填（若是須要命令行參數，能夠在第三行填入），直接Next，見下圖：

第一行爲應用程序執行超時時間設定，可不填；後面三個單選框都勾上，這樣咱們分別使能了剖析，使能了併發核函數剖析，而後運行分析器。

點Finish，開始運行咱們的應用程序並進行剖析、分析性能。

上圖中，CPU和GPU部分顯示了硬件和執行內容信息，點某一項則將時間條對應的部分高亮，便於觀察，同時右邊詳細信息會顯示運行時間信息。從時間條上看出，cudaMalloc佔用了很大一部分時間。下面分析器給出了一些性能提高的關鍵點，包括：低計算利用率（計算時間只佔總時間的1.8%，也難怪，加法計算複雜度原本就很低呀！）；低內存拷貝/計算交疊率（一點都沒有交疊，徹底是拷貝——計算——拷貝）；低存儲拷貝尺寸（輸入數據量過小了，至關於你淘寶買了個日記本，運費比實物價格還高！）；低存儲拷貝吞吐率（只有1.55GB/s）。這些對咱們進一步優化程序是很是有幫助的。

咱們點一下Details，就在Analysis窗口旁邊。獲得結果以下所示：

經過這個窗口能夠看到每一個核函數執行時間，以及線程格、線程塊尺寸，佔用寄存器個數，靜態共享內存、動態共享內存大小等參數，以及內存拷貝函數的執行狀況。這個提供了比前面cudaEvent函數測時間更精確的方式，直接看到每一步的執行時間，精確到ns。

在Details後面還有一個Console，點一下看看。

這個其實就是命令行窗口，顯示運行輸出。看到加入了Profiler信息後，總執行時間變長了（原來線程並行版本的程序運行時間只需4ms左右）。這也是「測不許定理」決定的，若是咱們但願測量更細微的時間，那麼總時間確定是不許的；若是咱們但願測量總時間，那麼細微的時間就被忽略掉了。

後面Settings就是咱們創建會話時的參數配置，再也不詳述。

經過本節，咱們應該能對CUDA性能提高有了一些想法，好，下一節咱們將討論如何優化CUDA程序。

http://blog.csdn.net/kkk584520/article/details/9413973

http://blog.csdn.net/kkk584520/article/details/9414191

http://blog.csdn.net/kkk584520/article/details/9415199

http://blog.csdn.net/kkk584520/article/details/9417251

http://blog.csdn.net/kkk584520/article/details/9420793

http://blog.csdn.net/kkk584520/article/details/9428389