CUDA從入門到精通

時間 2019-11-16

標籤 cuda 入門精通简体版

原文原文鏈接

http://blog.csdn.net/augusdi/article/details/12833235html

CUDA從入門到精通（零）：寫在前面
算法

在老闆的要求下。本博主從2012年上高性能計算課程開始接觸CUDA編程，隨後將該技術應用到了實際項目中。使處理程序加速超過1K，可見基於圖形顯示器的並行計算對於追求速度的應用來講無疑是一個理想的選擇。編程

還有不到一年畢業。怕是畢業後這些技術也就隨畢業而去，準備這個暑假開闢一個CUDA專欄，從入門到精通。步步爲營。順便分享設計的一些經驗教訓。但願能給學習CUDA的童鞋提供必定指導。windows

我的能力所及，錯誤不免。歡迎討論。多線程

PS：申請專欄好像需要先發原創帖超過15篇。架構

。。算了，先寫夠再申請吧。到時候一併轉過去。併發

CUDA從入門到精通（一）：環境搭建app

NVIDIA於2006年推出CUDA（Compute Unified Devices Architecture），可以利用其推出的GPU進行通用計算。將並行計算從大型集羣擴展到了普通顯卡，使得用戶僅僅需要一臺帶有Geforce顯卡的筆記本就能跑較大規模的並行處理程序。dom

使用顯卡的優勢是，和大型集羣相比功耗很是低。成本也不高。但性能很是突出。以個人筆記本爲例，Geforce 610M，用DeviceQuery程序測試。可獲得例如如下硬件參數：async

計算能力達48X0.95 = 45.6 GFLOPS。而筆記本的CPU參數例如如下：

CPU計算能力爲（4核）：2.5G*4 = 10GFLOPS，可見，顯卡計算性能是4核i5 CPU的4~5倍，所以咱們可以充分利用這一資源來對一些耗時的應用進行加速。

好了，工欲善其事必先利其器，爲了使用CUDA對GPU進行編程，咱們需要準備下面必備工具：

1. 硬件平臺，就是顯卡，假設你用的不是NVIDIA的顯卡。那麼僅僅能說抱歉，其它都不支持CUDA。

2. 操做系統，我用過windows XP，Windows 7都沒問題。本博客用Windows7。

3. C編譯器，建議VS2008，和本博客一致。

4. CUDA編譯器NVCC。可以免費免註冊免license從官網下載CUDA ToolkitCUDA下載，最新版本號爲5.0。本博客用的就是該版本號。

5. 其它工具（如Visual Assist，輔助代碼高亮）

準備完成。開始安裝軟件。VS2008安裝比較費時間，建議安裝完整版（NVIDIA官網說Express版也可以）。過程沒必要詳述。CUDA Toolkit 5.0裏面包括了NVCC編譯器、設計文檔、設計例程、CUDA執行時庫、CUDA頭文件等必備的原材料。

安裝完成，咱們在桌面上發現這個圖標：

不錯，就是它。雙擊執行。可以看到一大堆例程。

咱們找到Simple OpenGL這個執行看看效果：

點右邊黃線標記處的Run就能夠看到美妙的三維正弦曲面，鼠標左鍵拖動可以轉換角度，右鍵拖動可以縮放。假設這個執行成功，說明你的環境基本搭建成功。

出現故障的可能：

1. 你使用遠程桌面鏈接登陸到還有一臺server。該server上有顯卡支持CUDA。但你遠程終端不能執行CUDA程序。這是因爲遠程登陸使用的是你本地顯卡資源，在遠程登陸時看不到server端的顯卡，因此會報錯：沒有支持CUDA的顯卡！解決方法：1. 遠程server裝兩塊顯卡。一塊僅僅用於顯示，還有一塊用於計算。2.不要用圖形界面登陸，而是用命令行界面如telnet登陸。

2.有兩個以上顯卡都支持CUDA的狀況。怎樣區分是在哪一個顯卡上執行？這個需要你在程序裏控制，選擇符合必定條件的顯卡。如較高的時鐘頻率、較大的顯存、較高的計算版本號等。

具體操做見後面的博客。

好了。先說這麼多。下一節咱們介紹怎樣在VS2008中給GPU編程。

CUDA從入門到精通（二）：第一個CUDA程序

書接上回，咱們既然直接執行例程成功了。接下來就是了解怎樣實現例程中的每個環節。固然。咱們先從簡單的作起，通常編程語言都會找個helloworld樣例，而咱們的顯卡是不會說話的，僅僅能作一些簡單的加減乘除運算。因此，CUDA程序的helloworld。我想應該最合適只是的就是向量加了。

打開VS2008，選擇File->New->Project，彈出如下對話框，設置例如如下：

以後點OK，直接進入project界面。

project中。咱們看到僅僅有一個.cu文件。內容例如如下：

[cpp]view plaincopy 
     
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
   
 #include <stdio.h>  
   
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
   
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  
   
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
   
     // Add vectors in parallel.  
     cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
   
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",  
         c[0], c[1], c[2], c[3], c[4]);  
   
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
   
     return 0;  
 }  
   
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
   
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
     // Launch a kernel on the GPU with one thread for each element.  
     addKernel<<<1, size>>>(dev_c, dev_a, dev_b);  
   
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
   
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess) {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);  
       
     return cudaStatus;  
 }  

可以看出，CUDA程序和C程序並沒有差異，僅僅是多了一些以"cuda"開頭的一些庫函數和一個特殊聲明的函數：

[cpp]view plaincopy 
     
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  

這個函數就是在GPU上執行的函數，稱之爲核函數。英文名Kernel Function，注意要和操做系統內核函數區分開來。

咱們直接按F7編譯。可以獲得例如如下輸出：

[html]view plaincopy 
     
 1>------ Build started: Project: cuda_helloworld, Configuration: Debug Win32 ------    
 1>Compiling with CUDA Build Rule...    
 1>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\nvcc.exe"  -G   -gencode=arch=compute_10,code=\"sm_10,compute_10\" -gencode=arch=compute_20,code=\"sm_20,compute_20\"  --machine 32 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\include" -maxrregcount=0   --compile -o "Debug/kernel.cu.obj" kernel.cu      
 1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.gpu    
 1>tmpxft_000000ec_00000000-14_kernel.compute_10.cudafe2.gpu    
 1>tmpxft_000000ec_00000000-5_kernel.compute_20.cudafe1.gpu    
 1>tmpxft_000000ec_00000000-17_kernel.compute_20.cudafe2.gpu    
 1>kernel.cu    
 1>kernel.cu    
 1>tmpxft_000000ec_00000000-8_kernel.compute_10.cudafe1.cpp    
 1>tmpxft_000000ec_00000000-24_kernel.compute_10.ii    
 1>Linking...    
 1>Embedding manifest...    
 1>Performing Post-Build Event...    
 1>copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart*.dll" "C:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\Debug"    
 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart32_50_35.dll    
 1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\\bin\cudart64_50_35.dll    
 1>已複製         2 個文件。    
 1>Build log was saved at "file://c:\Users\DongXiaoman\Documents\Visual Studio 2008\Projects\cuda_helloworld\cuda_helloworld\Debug\BuildLog.htm"    
 1>cuda_helloworld - 0 error(s), 105 warning(s)    
 ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========    

可見，編譯.cu文件需要利用nvcc工具。該工具的具體使用見後面博客。

直接執行，可以獲得結果圖例如如下：

假設顯示正確，那麼咱們的第一個程序宣告成功。

CUDA從入門到精通（三）：必備資料

剛入門CUDA，跑過幾個官方提供的例程，看了看人家的代碼。認爲並不難。但本身動手寫代碼時，老是不知道要先幹什麼。後幹什麼，也不知道從哪一個知識點學起。這時就需要有一本能提供指導的書籍或者教程，一步步跟着作下去。直到真正掌握。

通常講述CUDA的書，我以爲不錯的有如下這幾本：

剛開始學習的人可以先看美國人寫的這本《GPU高性能編程CUDA實戰》。可操做性很是強，但不要指望能全看懂（Ps：裏面有些概念事實上我現在仍是不怎麼懂），但不影響你進一步學習。假設想更全面地學習CUDA，《GPGPU編程技術》比較客觀具體地介紹了通用GPU編程的策略。看過這本書，可以對顯卡有更深刻的瞭解，揭開GPU的神奇面紗。後面《OpenGL編程指南》全然是爲了體驗圖形交互帶來的樂趣，可以有選擇地看；《GPU高性能運算之CUDA》這本是師兄給的，適合高速查詢（感受是將官方編程手冊翻譯了一遍）一些關鍵技術和概念。

有了這些指導材料還不夠，咱們在作項目的時候，遇到的問題在這些書上確定找不到，因此還需要有如下這些利器：

這裏面有很是多工具的使用手冊，如CUDA_GDB，Nsight。CUDA_Profiler等，方便調試程序；另外一些實用的庫，如CUFFT是專門用來作高速傅里葉變換的。CUBLAS是專用於線性代數（矩陣、向量計算）的，CUSPASE是專用於稀疏矩陣表示和計算的庫。這些庫的使用可以減小咱們設計算法的難度，提升開發效率。另外還有些新手教程也是值得一讀的，你會對NVCC編譯器有更近距離的接觸。

好了，前言就這麼多，本博主計劃按例如如下順序來說述CUDA：

1.瞭解設備

2.線程並行

3.塊並行

4.流並行

5.線程通訊

6.線程通訊實例：規約

7.存儲模型

8.常數內存

9.紋理內存

10.主機頁鎖定內存

11.圖形互操做

12.優化準則

13.CUDA與MATLAB接口

14.CUDA與MFC接口

CUDA從入門到精通（四）：加深對設備的認識

前面三節已經對CUDA作了一個簡單的介紹，這一節開始真正進入編程環節。

首先，剛開始學習的人應該對本身使用的設備有較爲紮實的理解和掌握。這樣對後面學習並行程序優化很是有幫助。瞭解硬件具體參數可以經過上節介紹的幾本書和官方資料得到。但假設仍然認爲不夠直觀，那麼咱們可以本身動手得到這些內容。

以第二節例程爲模板，咱們稍加修改的部分代碼例如如下：

[cpp]view plaincopy 
     
 // Add vectors in parallel.  
 cudaError_t cudaStatus;  
 int num = 0;  
 cudaDeviceProp prop;  
 cudaStatus = cudaGetDeviceCount(&num);  
 for(int i = 0;i<num;i++)  
 {  
     cudaGetDeviceProperties(&prop,i);  
 }  
 cudaStatus = addWithCuda(c, a, b, arraySize);  

這個修改的目的是讓咱們的程序本身主動經過調用cuda API函數得到設備數目和屬性，所謂「知己知彼，百戰不殆」。

cudaError_t 是cuda錯誤類型，取值爲整數。

cudaDeviceProp爲設備屬性結構體，其定義可以從cuda Toolkit安裝文件夾中找到。個人路徑爲：C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h。找到定義爲：

[cpp]view plaincopy 
     
 /** 
  * CUDA device properties 
  */  
 struct __device_builtin__ cudaDeviceProp  
 {  
     char   name[256];                  /**< ASCII string identifying device */  
     size_t totalGlobalMem;             /**< Global memory available on device in bytes */  
     size_t sharedMemPerBlock;          /**< Shared memory available per block in bytes */  
     int    regsPerBlock;               /**< 32-bit registers available per block */  
     int    warpSize;                   /**< Warp size in threads */  
     size_t memPitch;                   /**< Maximum pitch in bytes allowed by memory copies */  
     int    maxThreadsPerBlock;         /**< Maximum number of threads per block */  
     int    maxThreadsDim[3];           /**< Maximum size of each dimension of a block */  
     int    maxGridSize[3];             /**< Maximum size of each dimension of a grid */  
     int    clockRate;                  /**< Clock frequency in kilohertz */  
     size_t totalConstMem;              /**< Constant memory available on device in bytes */  
     int    major;                      /**< Major compute capability */  
     int    minor;                      /**< Minor compute capability */  
     size_t textureAlignment;           /**< Alignment requirement for textures */  
     size_t texturePitchAlignment;      /**< Pitch alignment requirement for texture references bound to pitched memory */  
     int    deviceOverlap;              /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */  
     int    multiProcessorCount;        /**< Number of multiprocessors on device */  
     int    kernelExecTimeoutEnabled;   /**< Specified whether there is a run time limit on kernels */  
     int    integrated;                 /**< Device is integrated as opposed to discrete */  
     int    canMapHostMemory;           /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */  
     int    computeMode;                /**< Compute mode (See ::cudaComputeMode) */  
     int    maxTexture1D;               /**< Maximum 1D texture size */  
     int    maxTexture1DMipmap;         /**< Maximum 1D mipmapped texture size */  
     int    maxTexture1DLinear;         /**< Maximum size for 1D textures bound to linear memory */  
     int    maxTexture2D[2];            /**< Maximum 2D texture dimensions */  
     int    maxTexture2DMipmap[2];      /**< Maximum 2D mipmapped texture dimensions */  
     int    maxTexture2DLinear[3];      /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */  
     int    maxTexture2DGather[2];      /**< Maximum 2D texture dimensions if texture gather operations have to be performed */  
     int    maxTexture3D[3];            /**< Maximum 3D texture dimensions */  
     int    maxTextureCubemap;          /**< Maximum Cubemap texture dimensions */  
     int    maxTexture1DLayered[2];     /**< Maximum 1D layered texture dimensions */  
     int    maxTexture2DLayered[3];     /**< Maximum 2D layered texture dimensions */  
     int    maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */  
     int    maxSurface1D;               /**< Maximum 1D surface size */  
     int    maxSurface2D[2];            /**< Maximum 2D surface dimensions */  
     int    maxSurface3D[3];            /**< Maximum 3D surface dimensions */  
     int    maxSurface1DLayered[2];     /**< Maximum 1D layered surface dimensions */  
     int    maxSurface2DLayered[3];     /**< Maximum 2D layered surface dimensions */  
     int    maxSurfaceCubemap;          /**< Maximum Cubemap surface dimensions */  
     int    maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */  
     size_t surfaceAlignment;           /**< Alignment requirements for surfaces */  
     int    concurrentKernels;          /**< Device can possibly execute multiple kernels concurrently */  
     int    ECCEnabled;                 /**< Device has ECC support enabled */  
     int    pciBusID;                   /**< PCI bus ID of the device */  
     int    pciDeviceID;                /**< PCI device ID of the device */  
     int    pciDomainID;                /**< PCI domain ID of the device */  
     int    tccDriver;                  /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */  
     int    asyncEngineCount;           /**< Number of asynchronous engines */  
     int    unifiedAddressing;          /**< Device shares a unified address space with the host */  
     int    memoryClockRate;            /**< Peak memory clock frequency in kilohertz */  
     int    memoryBusWidth;             /**< Global memory bus width in bits */  
     int    l2CacheSize;                /**< Size of L2 cache in bytes */  
     int    maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */  
 };  

後面的凝視已經說明了其字段表明意義，可能有些術語對於剛開始學習的人理解起來仍是有必定困難。不要緊。咱們現在僅僅需要關注下面幾個指標：

name：就是設備名稱；

totalGlobalMem：就是顯存大小；

major,minor：CUDA設備版本號號，有1.1, 1.2, 1.3, 2.0, 2.1等多個版本號；

clockRate：GPU時鐘頻率。

multiProcessorCount：GPU大核數。一個大核（專業點稱爲流多處理器，SM。Stream-Multiprocessor）包括多個小核（流處理器。SP，Stream-Processor）

編譯。執行。咱們在VS2008project的cudaGetDeviceProperties()函數處放一個斷點，單步執行這一函數，而後用Watch窗體。切換到Auto頁，展開+。在個人筆記本上獲得例如如下結果：

可以看到，設備名爲GeForce 610M。顯存1GB。設備版本號2.1（比較高端了。哈哈），時鐘頻率爲950MHz（注意950000單位爲kHz），大核數爲1。

在一些高性能GPU上（如Tesla，Kepler系列）。大核數可能達到幾十甚至上百。可以作更大規模的並行處理。

PS：今天看SDK代碼時發現在help_cuda.h中有個函數實現從CUDA設備版本號查詢對應大核中小核的數目。認爲很是實用，之後編程序可以借鑑，摘抄例如如下：

[cpp]view plaincopy 
     
 // Beginning of GPU Architecture definitions  
 inline int _ConvertSMVer2Cores(int major, int minor)  
 {  
     // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM  
     typedef struct  
     {  
         int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version  
         int Cores;  
     } sSMtoCores;  
   
     sSMtoCores nGpuArchCoresPerSM[] =  
     {  
         { 0x10,  8 }, // Tesla Generation (SM 1.0) G80 class  
         { 0x11,  8 }, // Tesla Generation (SM 1.1) G8x class  
         { 0x12,  8 }, // Tesla Generation (SM 1.2) G9x class  
         { 0x13,  8 }, // Tesla Generation (SM 1.3) GT200 class  
         { 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class  
         { 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class  
         { 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class  
         { 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class  
         {   -1, -1 }  
     };  
   
     int index = 0;  
   
     while (nGpuArchCoresPerSM[index].SM != -1)  
     {  
         if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))  
         {  
             return nGpuArchCoresPerSM[index].Cores;  
         }  
   
         index++;  
     }  
   
     // If we don't find the values, we default use the previous one to run properly  
     printf("MapSMtoCores for SM %d.%d is undefined.  Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);  
     return nGpuArchCoresPerSM[7].Cores;  
 }  
 // end of GPU Architecture definitions  

可見。設備版本號2.1的一個大核有48個小核，而版本號3.0以上的一個大核有192個小核！

前文說到過。當咱們用的電腦上有多個顯卡支持CUDA時。怎麼來區分在哪一個上執行呢？這裏咱們看一下addWithCuda這個函數是怎麼作的。

[cpp]view plaincopy 
     
 cudaError_t cudaStatus;  
   
 // Choose which GPU to run on, change this on a multi-GPU system.  
 cudaStatus = cudaSetDevice(0);  
 if (cudaStatus != cudaSuccess) {  
     fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?
");  
     goto Error;  
 }  

使用了cudaSetDevice(0)這個操做，0表示能搜索到的第一個設備號，假設有多個設備，則編號爲0,1,2...。

再看咱們本節加入的代碼。有個函數cudaGetDeviceCount(&num)，這個函數用來獲取設備總數，這樣咱們選擇執行CUDA程序的設備號取值就是0,1,...num-1，因而可以一個個枚舉設備，利用cudaGetDeviceProperties(&prop)得到其屬性,而後利用必定排序、篩選算法，找到最符合咱們應用的那個設備號opt，而後調用cudaSetDevice(opt)就能夠選擇該設備。

選擇標準可以從處理能力、版本號控制、名稱等各個角度出發。

後面講述流併發過程時，還要用到這些API。

假設但願瞭解不少其它硬件內容可以結合http://www.geforce.cn/hardware獲取。

CUDA從入門到精通（五）：線程並行

多線程咱們應該都不陌生，在操做系統中，進程是資源分配的基本單元，而線程是CPU時間調度的基本單元（這裏若是僅僅有1個CPU）。

將線程的概念引伸到CUDA程序設計中，咱們可以以爲線程就是執行CUDA程序的最小單元，前面咱們創建的project代碼中，有個核函數概念不知各位童鞋還記得沒有，在GPU上每個線程都會執行一次該核函數。

但GPU上的線程調度方式與CPU有很是大不一樣。CPU上會有優先級分配，從高到低，相同優先級的可以採用時間片輪轉法實現線程調度。GPU上線程沒有優先級概念。所有線程機會均等，線程狀態僅僅有等待資源和運行兩種狀態。假設資源未就緒。那麼就等待；一旦就緒。立刻運行。

當GPU資源很是充裕時。所有線程都是併發運行的。這樣加速效果很是接近理論加速比；而GPU資源少於總線程個數時。有一部分線程就會等待前面運行的線程釋放資源，從而變爲串行化運行。

代碼仍是用上一節的吧。修改很是少，再貼一遍：

[cpp]view plaincopy 
     
 #include "cuda_runtime.h"           //CUDA執行時API  
 #include "device_launch_parameters.h"     
 #include <stdio.h>  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // 重點理解這個函數  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0; //GPU設備端數據指針  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;     //狀態指示  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  //選擇執行平臺  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?
");  
         goto Error;  
     }  
     // 分配GPU設備端內存  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // 拷貝數據到GPU  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // 執行核函數  
 <span style="BACKGROUND-COLOR: #ff6666"><strong>    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);</strong>  
 </span>    // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();   //同步線程  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);      //拷貝結果回主機  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);    //釋放GPU設備端內存  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

紅色部分即啓動核函數的調用過程。這裏看到調用方式和C不太同樣。<<<>>>表示執行時配置符號。裏面1表示僅僅分配一個線程組（又稱線程塊、Block）。size表示每個線程組有size個線程（Thread）。本程序中size依據前面傳遞參數個數應該爲5，因此執行的時候。核函數在5個GPU線程單元上分別執行了一次，總共執行了5次。

這5個線程是怎樣知道本身「身份」的？是靠threadIdx這個內置變量，它是個dim3類型變量。接受<<<>>>中第二個參數，它包括x,y,z 3維座標，而咱們傳入的參數僅僅有一維。因此僅僅有x值是有效的。經過核函數中int i = threadIdx.x;這一句，每個線程可以得到自身的id號，從而找到本身的任務去運行。

CUDA從入門到精通（六）：塊並行

同一版本號的代碼用了這麼屢次。有點過意不去，因而此次我要作較大的修改。你們要擦亮眼睛，拭目以待。

塊並行至關於操做系統中多進程的狀況。上節說到。CUDA有線程組（線程塊）的概念，將一組線程組織到一塊兒，共同分配一部分資源，而後內部調度運行。線程塊與線程塊之間。毫無瓜葛。這有利於作更粗粒度的並行。

咱們將上一節的代碼改成塊並行版本號例如如下：

下節咱們介紹塊並行。

[cpp]view plaincopy 
     
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include <stdio.h>  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
 <span style="BACKGROUND-COLOR: #ff0000">    int i = blockIdx.x;  
 </span>    c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // Launch a kernel on the GPU with one thread for each element.  
  <span style="BACKGROUND-COLOR: #ff0000">   addKernel<<<size,1 >>>(dev_c, dev_a, dev_b);  
 </span>    // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

和上一節相比。僅僅有這兩行有改變。<<<>>>裏第一個參數改爲了size，第二個改爲了1，表示咱們分配size個線程塊，每個線程塊僅包括1個線程，總共仍是有5個線程。這5個線程相互獨立，運行核函數獲得對應的結果。與上一節不一樣的是，每個線程獲取id的方式變爲int i = blockIdx.x。這是線程塊ID。

因而有童鞋提問了。線程並行和塊並行的差異在哪裏？

線程並行是細粒度並行，調度效率高；塊並行是粗粒度並行，每次調度都要又一次分配資源。有時資源僅僅有一份。那麼所有線程塊都僅僅能排成一隊，串行運行。

那是否是咱們所有時候都應該用線程並行，儘量不用塊並行？

固然不是，咱們的任務有時可以採用分治法，將一個大問題分解爲幾個小規模問題，將這些小規模問題分別用一個線程塊實現，線程塊內可以採用細粒度的線程並行。而塊之間爲粗粒度並行。這樣可以充分利用硬件資源，減小線程並行的計算複雜度。

適當分解。減小規模，在一些矩陣乘法、向量內積計算應用中可以獲得充分的展現。

實際應用中。常常是兩者的結合。線程塊、線程組織圖例如如下所看到的。

多個線程塊組織成了一個Grid，稱爲線程格（經歷了從一位線程，二維線程塊到三維線程格的過程，立體感很是強啊）。

好了，下一節咱們介紹流並行。是更高層次的並行。

CUDA從入門到精通（七）：流並行

前面咱們沒有講程序的結構。我想有些童鞋可能火燒眉毛想知道CUDA程序到底是怎麼一個運行過程。好的，這一節在介紹流以前。先把CUDA程序結構簡要說一下。

CUDA程序文件後綴爲.cu，有些編譯器可能不認識這個後綴的文件，咱們可以在VS2008的Tools->Options->Text Editor->File Extension里加入cu後綴到VC++中，例如如下圖：

一個.cu文件內既包括CPU程序（稱爲主機程序），也包括GPU程序（稱爲設備程序）。怎樣區分主機程序和設備程序？依據聲明。凡是掛有「__global__」或者「__device__」前綴的函數，都是在GPU上執行的設備程序。不一樣的是__global__設備程序可被主機程序調用，而__device__設備程序則僅僅能被設備程序調用。

沒有掛不論什麼前綴的函數。都是主機程序。主機程序顯示聲明可以用__host__前綴。設備程序需要由NVCC進行編譯。而主機程序僅僅需要由主機編譯器（如VS2008中的cl.exe，Linux上的GCC）。

主機程序主要完畢設備環境初始化。傳輸數據等必備過程，設備程序僅僅負責計算。

主機程序中，有一些「cuda」打頭的函數，這些都是CUDA Runtime API，即執行時函數，主要負責完畢設備的初始化、內存分配、內存拷貝等任務。

咱們前面第三節用到的函數cudaGetDeviceCount()。cudaGetDeviceProperties()。cudaSetDevice()都是執行時API。這些函數的詳細參數聲明咱們沒必要一一記下來。拿出第三節的官方利器就可以輕鬆查詢，讓咱們打開這個文件：

打開後，在pdf搜索欄中輸入一個執行時函數，好比cudaMemcpy，查到的結果例如如下：

可以看到，該API函數的參數形式爲，第一個表示目的地。第二個表示來源地。第三個參數表示字節數。第四個表示類型。假設對類型不瞭解，直接點擊超連接，獲得詳解例如如下：

可見，該API可以實現從主機到主機、主機到設備、設備到主機、設備到設備的內存拷貝過程。同一時候可以發現。利用該API手冊可以很是方便地查詢咱們需要用的這些API函數，因此之後編CUDA程序必定要把它打開，隨時準備查詢，這樣可以大大提升編程效率。

好了，進入今天的主題：流並行。

前面已經介紹了線程並行和塊並行。知道了線程並行爲細粒度的並行，而塊並行爲粗粒度的並行，同一時候也知道了CUDA的線程組織狀況，即Grid-Block-Thread結構。

一組線程並行處理可以組織爲一個block，而一組block並行處理可以組織爲一個Grid，很是天然地想到。Grid僅僅是一個網格，咱們可否夠利用多個網格來完畢並行處理呢？答案就是利用流。

流可以實現在一個設備上執行多個核函數。前面的塊並行也好。線程並行也好。執行的核函數都是一樣的（代碼同樣，傳遞參數也同樣）。而流並行，可以執行不一樣的核函數，也可以實現對同一個核函數傳遞不一樣的參數，實現任務級別的並行。

CUDA中的流用cudaStream_t類型實現。用到的API有下面幾個：cudaStreamCreate(cudaStream_t * s)用於建立流，cudaStreamDestroy(cudaStream_t s)用於銷燬流，cudaStreamSynchronize()用於單個流同步。cudaDeviceSynchronize()用於整個設備上的所有流同步，cudaStreamQuery()用於查詢一個流的任務是否已經完畢。

詳細的含義可以查詢API手冊。

如下咱們將前面的兩個樣例中的任務改用流實現。仍然是{1,2,3,4,5}+{10,20,30,40,50} = {11,22,33,44,55}這個樣例。代碼例如如下：

[cpp]view plaincopy 
     
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include <stdio.h>  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel(int *c, const int *a, const int *b)  
 {  
     int i = blockIdx.x;  
     c[i] = a[i] + b[i];  
 }  
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     const int b[arraySize] = { 10, 20, 30, 40, 50 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",c[0],c[1],c[2],c[3],c[4]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 <span style="BACKGROUND-COLOR: #ff6666">  cudaStream_t stream[5];  
     for(int i = 0;i<5;i++)  
     {  
         cudaStreamCreate(&stream[i]);   //建立流  
     }  
 </span>    // Launch a kernel on the GPU with one thread for each element.  
 <span style="BACKGROUND-COLOR: #ff6666">  for(int i = 0;i<5;i++)  
     {  
         addKernel<<<1,1,0,stream[i]>>>(dev_c+i, dev_a+i, dev_b+i);    //運行流  
     }  
     cudaDeviceSynchronize();  
 </span>    // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
 <span style="BACKGROUND-COLOR: #ff6666">  for(int i = 0;i<5;i++)  
     {  
         cudaStreamDestroy(stream[i]);   //銷燬流  
     }  
 </span>    cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }  

注意到，咱們的核函數代碼仍然和塊並行的版本號同樣，僅僅是在調用時作了改變，<<<>>>中的參數多了兩個，當中前兩個和塊並行、線程並行中的意義一樣，仍然是線程塊數（這裏爲1）、每個線程塊中線程數（這裏也是1）。第三個爲0表示每個block用到的共享內存大小。這個咱們後面再講；第四個爲流對象，表示當前核函數在哪一個流上執行。咱們建立了5個流，每個流上都裝載了一個核函數。同一時候傳遞參數有些不一樣，也就是每個核函數做用的對象也不一樣。這樣就實現了任務級別的並行，當咱們有幾個互不相關的任務時，可以寫多個核函數。資源贊成的狀況下，咱們將這些核函數裝載到不一樣流上，而後執行，這樣可以實現更粗粒度的並行。

好了，流並行就這麼簡單，咱們處理任務時。可以依據需要。選擇最適合的並行方式。

UDA從入門到精通（八）：線程通訊

咱們前面幾節主要介紹了三種利用GPU實現並行處理的方式：線程並行，塊並行和流並行。

在這些方法中。咱們一再強調，各個線程所進行的處理是互不相關的，即兩個線程不回產生交集，每個線程都僅僅關注本身的一畝三分地，對其它線程毫無興趣，就當不存在。。。。

固然。實際應用中，這種樣例太少了，也就是遇到向量相加、向量相應點乘這類纔會有如此高的並行度，而其它一些應用，如一組數求和，求最大（小）值，各個線程再也不是相互獨立的，而是產生必定關聯，線程2可能會用到線程1的結果，這時就需要利用本節的線程通訊技術了。

線程通訊在CUDA中有三種實現方式：

1. 共享存儲器；

2. 線程同步。

3. 原子操做；

最常用的是前兩種方式，共享存儲器，術語Shared Memory，是位於SM中的特殊存儲器。還記得SM嗎，就是流多處理器，大核是也。

一個SM中不只包括若干個SP（流處理器，小核），還包括一部分快速Cache，寄存器組。共享內存等，結構如圖所看到的：

從圖中可看出，一個SM內有M個SP，Shared Memory由這M個SP共同佔有。

另外指令單元也被這M個SP共享，即SIMT架構（單指令多線程架構）。一個SM中所有SP在同一時間運行同一代碼。

爲了實現線程通訊，只靠共享內存還不夠，需要有同步機制才幹使線程之間實現有序處理。

一般狀況是這樣：當線程A需要線程B計算的結果做爲輸入時。需要確保線程B已經將結果寫入共享內存中，而後線程A再從共享內存中讀出。同步不可缺乏，不然。線程A可能讀到的是無效的結果。形成計算錯誤。同步機制可以用CUDA內置函數：__syncthreads()；當某個線程運行到該函數時，進入等待狀態。直到同一線程塊（Block）中所有線程都運行到這個函數爲止，即一個__syncthreads()至關於一個線程同步點，確保一個Block中所有線程都達到同步，而後線程進入運行狀態。

綜上兩點，咱們可以寫一段線程通訊的僞代碼例如如下：

[cpp]view plaincopy 
     
 //Begin  
 if this is thread B  
      write something to Shared Memory;  
 end if  
 __syncthreads();  
 if this is thread A  
     read something from Shared Memory;  
 end if  
 //End  

上面代碼在CUDA中實現時，由於SIMT特性。所有線程都運行相同的代碼，因此在線程中需要推斷本身的身份，以避免誤操做。

注意的是，位於同一個Block中的線程才幹實現通訊，不一樣Block中的線程不能經過共享內存、同步進行通訊，而應採用原子操做或主機介入。

對於原子操做，假設感興趣可以翻閱《GPU高性能編程CUDA實戰》第九章「原子性」。

本節完。下節咱們給出一個實例來看線程通訊的代碼怎麼設計。

CUDA從入門到精通（九）：線程通訊實例

接着上一節，咱們利用剛學到的共享內存和線程同步技術，來作一個簡單的樣例。

先看下效果吧：

很是easy。就是分別求出1~5這5個數字的和，平方和，連乘積。相信學過C語言的童鞋都能用for循環作出同上面同樣的效果。但爲了學習CUDA共享內存和同步技術，咱們仍是要把簡單的東西複雜化(^_^)。

簡要分析一下，上面樣例的輸入都是同樣的，1,2,3,4,5這5個數，但計算過程有些變化。而且每個輸出和所有輸入都相關，不是前幾節樣例中那樣，一個輸出僅僅和一個輸入有關。

因此咱們在利用CUDA編程時。需要針對特殊問題作些讓步，把一些步驟串行化實現。

輸入數據本來位於主機內存，經過cudaMemcpy API已經複製到GPU顯存（術語爲全局存儲器，Global Memory）。每個線程執行時需要從Global Memory讀取輸入數據。而後完畢計算。最後將結果寫回Global Memory。當咱們計算需要屢次一樣輸入數據時，你們可能想到，每次都分別去Global Memory讀數據好像有點浪費，假設數據很是大。那麼重複屢次讀數據會至關耗時間。

索性咱們把它從Global Memory一次性讀到SM內部，而後在內部進行處理，這樣可以節省重複讀取的時間。

有了這個思路，結合上節看到的SM結構圖。看到有一片存儲器叫作Shared Memory，它位於SM內部。處理時訪問速度至關快（差點兒相同每個時鐘週期讀一次），而全局存儲器讀一次需要耗費幾十甚至上百個時鐘週期。因而，咱們就制定A計劃例如如下：

線程塊數：1，塊號爲0。（僅僅有一個線程塊內的線程才幹進行通訊。因此咱們僅僅分配一個線程塊。詳細工做交給每個線程完畢）

線程數：5，線程號分別爲0~4。（線程並行。前面講過）

共享存儲器大小：5個int型變量大小（5 * sizeof(int））。

步驟一：讀取輸入數據。將Global Memory中的5個整數讀入共享存儲器，位置一一相應，和線程號也一一相應，因此可以同一時候完畢。

步驟二：線程同步。確保所有線程都完畢了工做。

步驟三：指定線程，對共享存儲器中的輸入數據完畢對應處理。

代碼例如如下：

[cpp]view plaincopy 
     
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
   
 #include <stdio.h>  
   
 cudaError_t addWithCuda(int *c, const int *a, size_t size);  
   
 __global__ void addKernel(int *c, const int *a)  
 {  
     int i = threadIdx.x;  
 <span style="font-size:24px;"><strong>  extern __shared__ int smem[];</strong>  
 </span>   smem[i] = a[i];  
     __syncthreads();  
     if(i == 0)  // 0號線程作平方和  
     {  
         c[0] = 0;  
         for(int d = 0; d < 5; d++)  
         {  
             c[0] += smem[d] * smem[d];  
         }  
     }  
     if(i == 1)//1號線程作累加  
     {  
         c[1] = 0;  
         for(int d = 0; d < 5; d++)  
         {  
             c[1] += smem[d];  
         }  
     }  
     if(i == 2)  //2號線程作累乘  
     {  
         c[2] = 1;  
         for(int d = 0; d < 5; d++)  
         {  
             c[2] *= smem[d];  
         }  
     }  
 }  
   
 int main()  
 {  
     const int arraySize = 5;  
     const int a[arraySize] = { 1, 2, 3, 4, 5 };  
     int c[arraySize] = { 0 };  
     // Add vectors in parallel.  
     cudaError_t cudaStatus = addWithCuda(c, a, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
     printf("\t1+2+3+4+5 = %d\n\t1^2+2^2+3^2+4^2+5^2 = %d\n\t1*2*3*4*5 = %d\n\n\n\n\n\n", c[1], c[0], c[2]);  
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     return 0;  
 }  
   
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a,  size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
   
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
   
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     // Launch a kernel on the GPU with one thread for each element.  
 <span style="font-size:24px;"><strong>    addKernel<<<1, size, size * sizeof(int), 0>>>(dev_c, dev_a);</strong>  
 </span>  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
   
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
   
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);      
     return cudaStatus;  
 }  

從代碼中看到運行配置<<<>>>中第三個參數爲共享內存大小（字節數），這樣咱們就知道了全部4個運行配置參數的意義。恭喜，你的CUDA最終入門了！

CUDA從入門到精通（十）：性能剖析和Visual Profiler

入門後的進一步學習的內容。就是怎樣優化本身的代碼。咱們前面的樣例沒有考慮不論什麼性能方面優化，是爲了更好地學習基本知識點，而不是其它細節問題。

從本節開始，咱們要從性能出發考慮問題，不斷優化代碼，使運行速度提升是並行處理的惟一目的。

測試代碼執行速度有很是多方法。C語言裏提供了相似於SystemTime()這種API得到系統時間，而後計算兩個事件之間的時長從而完畢計時功能。在CUDA中，咱們有專門測量設備執行時間的API。如下一一介紹。

翻開編程手冊《CUDA_Toolkit_Reference_Manual》，隨時準備查詢不懂得API。咱們在執行核函數先後，作例如如下操做：

[cpp]view plaincopy 
      
 cudaEvent_t start, stop;<span style="white-space:pre">  </span>//事件對象  
 cudaEventCreate(&start);<span style="white-space:pre">  </span>//建立事件  
 cudaEventCreate(&stop);<span style="white-space:pre">       </span>//建立事件  
 cudaEventRecord(start, stream);<span style="white-space:pre">   </span>//記錄開始  
 myKernel<<<dimg,dimb,size_smem,stream>>>(parameter list);//運行核函數  
   
 cudaEventRecord(stop,stream);<span style="white-space:pre"> </span>//記錄結束事件  
 cudaEventSynchronize(stop);<span style="white-space:pre">   </span>//事件同步，等待結束事件以前的設備操做均已完畢  
 float elapsedTime;  
 cudaEventElapsedTime(&elapsedTime,start,stop);//計算兩個事件之間時長（單位爲ms）  

核函數運行時間將被保存在變量elapsedTime中。經過這個值咱們可以評估算法的性能。

如下給一個樣例。來看怎麼使用計時功能。

前面的樣例規模很是小，僅僅有5個元素，處理量過小不足以計時。如下將規模擴大爲1024。此外將重複執行1000次計算總時間，這樣預計不easy受隨機擾動影響。咱們經過這個樣例對照線程並行和塊並行的性能怎樣。代碼例如如下：

[cpp]view plaincopy 
      
 #include "cuda_runtime.h"  
 #include "device_launch_parameters.h"  
 #include <stdio.h>  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);  
 __global__ void addKernel_blk(int *c, const int *a, const int *b)  
 {  
     int i = blockIdx.x;  
     c[i] = a[i]+ b[i];  
 }  
 __global__ void addKernel_thd(int *c, const int *a, const int *b)  
 {  
     int i = threadIdx.x;  
     c[i] = a[i]+ b[i];  
 }  
 int main()  
 {  
     const int arraySize = 1024;  
     int a[arraySize] = {0};  
     int b[arraySize] = {0};  
     for(int i = 0;i<arraySize;i++)  
     {  
         a[i] = i;  
         b[i] = arraySize-i;  
     }  
     int c[arraySize] = {0};  
     // Add vectors in parallel.  
     cudaError_t cudaStatus;  
     int num = 0;  
     cudaDeviceProp prop;  
     cudaStatus = cudaGetDeviceCount(&num);  
     for(int i = 0;i<num;i++)  
     {  
         cudaGetDeviceProperties(&prop,i);  
     }  
     cudaStatus = addWithCuda(c, a, b, arraySize);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "addWithCuda failed!");  
         return 1;  
     }  
   
     // cudaThreadExit must be called before exiting in order for profiling and  
     // tracing tools such as Nsight and Visual Profiler to show complete traces.  
     cudaStatus = cudaThreadExit();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadExit failed!");  
         return 1;  
     }  
     for(int i = 0;i<arraySize;i++)  
     {  
         if(c[i] != (a[i]+b[i]))  
         {  
             printf("Error in %d\n",i);  
         }  
     }  
     return 0;  
 }  
 // Helper function for using CUDA to add vectors in parallel.  
 cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size)  
 {  
     int *dev_a = 0;  
     int *dev_b = 0;  
     int *dev_c = 0;  
     cudaError_t cudaStatus;  
   
     // Choose which GPU to run on, change this on a multi-GPU system.  
     cudaStatus = cudaSetDevice(0);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");  
         goto Error;  
     }  
     // Allocate GPU buffers for three vectors (two input, one output)    .  
     cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMalloc failed!");  
         goto Error;  
     }  
     // Copy input vectors from host memory to GPU buffers.  
     cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
     cudaEvent_t start,stop;  
     cudaEventCreate(&start);  
     cudaEventCreate(&stop);  
     cudaEventRecord(start,0);  
     for(int i = 0;i<1000;i++)  
     {  
 //      addKernel_blk<<<size,1>>>(dev_c, dev_a, dev_b);  
         addKernel_thd<<<1,size>>>(dev_c, dev_a, dev_b);  
     }  
     cudaEventRecord(stop,0);  
     cudaEventSynchronize(stop);  
     float tm;  
     cudaEventElapsedTime(&tm,start,stop);  
     printf("GPU Elapsed time:%.6f ms.\n",tm);  
     // cudaThreadSynchronize waits for the kernel to finish, and returns  
     // any errors encountered during the launch.  
     cudaStatus = cudaThreadSynchronize();  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);  
         goto Error;  
     }  
     // Copy output vector from GPU buffer to host memory.  
     cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  
     if (cudaStatus != cudaSuccess)   
     {  
         fprintf(stderr, "cudaMemcpy failed!");  
         goto Error;  
     }  
 Error:  
     cudaFree(dev_c);  
     cudaFree(dev_a);  
     cudaFree(dev_b);      
     return cudaStatus;  
 }