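This post draws on CUDA by Example (CUDA範例精解) and CUDA parallel programming (CUDA並行編程). I am learning CUDA and use it a lot, so I am translating the chapters and sentences I consider most important as a study exercise. The programs were run on an NVIDIA K40 server. If you want to learn CUDA programming from this book but do not know much about CUDA or GPU architecture, you can treat this blog as an introduction. I do hope you have some background, though, because my coverage is not comprehensive: I only picked out the points that confused me for a long time before I finally understood them. If you know nothing about GPUs, I suggest reading the whole book together with a book on GPU architecture. I will try to finish posting the (partial) Chinese translation of the book within a month and add some of my own understanding. Comments and questions are welcome; please credit the source when reposting.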
Unlike previous generations that partitioned computing resources into vertex and pixel shaders, the CUDA Architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. Because NVIDIA intended this new family of graphics processors to be used for general-purpose computing, these ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics.
Furthermore, the execution units on the GPU were allowed arbitrary read and write access to memory as well as access to a software-managed cache known as shared memory. All of these features of the CUDA Architecture were added in order to create a GPU that would excel at computation in addition to performing well at traditional graphics tasks.
#include "../common/book.h"

int main() {
    printf("hello world!\n");
    return 0;
}
This program makes two notable additions to the original "Hello, World!" example:
• An empty function named kernel() qualified with __global__
• A call to the empty function, embellished with <<<1,1>>>
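In my words: compared with the "hello world" we started with, the kernel version of the program has two extra things worth noticing.
• An empty kernel() function that takes no arguments, prefixed with the __global__ keyword
• A call that connects main() to the kernel function through <<<1,1>>>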
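The program those two bullets describe is not reproduced above. A minimal sketch of it, reconstructed from the description, might look like the following. The standard `<cstdio>` include stands in for the book's own header, and the `cudaDeviceSynchronize()` call is my addition rather than part of the quoted description.

```cuda
#include <cstdio>

// Empty function qualified with __global__: compiled as device code,
// but launched from the host.
__global__ void kernel(void) {
}

int main(void) {
    // Launch the empty kernel; <<<1,1>>> asks for 1 block of 1 thread.
    kernel<<<1, 1>>>();

    // Assumption (not in the description above): wait for the device so that
    // any launch problem surfaces before the program exits.
    cudaDeviceSynchronize();

    printf("Hello, World!\n");
    return 0;
}
```

Saved as test.cu, this should build with `nvcc test.cu -o test`. Apart from the launch line, main() is ordinary C.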
As we saw in the previous section, code is compiled by your system’s standard C compiler by default. For example, GNU gcc might compile your host code on Linux operating systems, while Microsoft Visual C compiles it on Windows systems. The NVIDIA tools simply feed this host compiler your code, and everything behaves as it would in a world without CUDA.
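In my words: as the previous section showed, by default your code is compiled by the system's standard C compiler. GNU gcc might compile your host code on Linux, for example, while Microsoft Visual C does the job on Windows. The NVIDIA tools simply feed your code to that host compiler, and what happens from there involves no CUDA at all.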
Now we see that CUDA C adds the __global__ qualifier to standard C. This mechanism alerts the compiler that a function should be compiled to run on a device instead of the host. In this simple example, nvcc gives the function kernel() to the compiler that handles device code, and it feeds main() to the host compiler as it did in the previous example. So, what is the mysterious call to kernel(), and why must we vandalize our standard C with angle brackets and a numeric tuple? Brace yourself, because this is where the magic happens.
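In my words: CUDA C adds the __global__ keyword to qualify a standard C function (somewhat like marking the function as rewritten for the device). This tells the compiler that the function should be compiled to run on the device rather than on the host. In this simple example, nvcc hands kernel() to the compiler that handles device code and feeds main() to the host compiler, just as in the previous example. Roughly speaking, the kernel runs on the device and all other code runs on the CPU; isn't the original rather awkward to read?! So what is this mysterious call to kernel(), and why must we deface our standard C with angle brackets and a tuple of numbers? Brace yourself, because this is where the magic happens.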
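To make the host/device split concrete, here is a small sketch in the spirit of the parameter-passing example the book moves on to. The names `add` and `dev_c` and the values 2 and 7 are illustrative choices, and the error handling is my own addition. The `__global__` function runs on the GPU, while main() runs on the CPU and has to copy the result back.

```cuda
#include <cstdio>

// Device code: nvcc routes this function to the compiler for device code.
__global__ void add(int a, int b, int *c) {
    *c = a + b;  // executed by a single GPU thread in this launch
}

// Host code: nvcc passes this to the host compiler.
int main(void) {
    int c = 0;
    int *dev_c = NULL;

    // Allocate one int in device memory to hold the result.
    if (cudaMalloc((void **)&dev_c, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // First number: blocks in the grid. Second number: threads per block.
    add<<<1, 1>>>(2, 7, dev_c);

    // Copy the result back to the host; this call waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("2 + 7 = %d\n", c);
    return 0;
}
```

The host never dereferences `dev_c` itself. The pointer refers to device memory, so the value only becomes visible on the CPU after the cudaMemcpy.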
Each SMX has:
• 192 cores, all of them SIMT cores (Single Instruction Multiple Threads), and 64K registers. SIMT on the GPU corresponds to SIMD (Single Instruction Multiple Data) on the CPU.
• 64KB of shared memory / L1 cache
• 8KB cache for constants
• 48KB texture cache for read-only arrays
• up to 2K threads per SMX
The Tesla K40 is based on the NVIDIA Kepler™ architecture, shown in the figure below (see the Kepler architecture description on NVIDIA's official site).
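In the figure, L2 Cache is the level-2 cache. The K40 has 15 SMX units in total, and the contents of each SMX were listed above. With 192 cores per SMX, the total number of CUDA cores is 2880 (15 × 192), and the memory size is 12 GB. The figure below lists the individual parameters: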
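Several of these numbers can be checked on your own machine with cudaGetDeviceProperties(). The sketch below is only an illustration: as far as I know the runtime does not report cores per SMX, so the constant 192 is hard-coded as an assumption that holds for Kepler only, and `total_cores` is a helper name I made up.

```cuda
#include <cstdio>

// Assumption: 192 CUDA cores per SMX, the Kepler figure quoted above.
// cudaDeviceProp does not expose this value, so it is hard-coded here.
const int kCoresPerSmxKepler = 192;

// Hypothetical helper: total cores = number of SMX units * cores per SMX.
int total_cores(int sm_count, int cores_per_sm) {
    return sm_count * cores_per_sm;
}

int main(void) {
    cudaDeviceProp prop;
    // Query device 0; on a multi-GPU machine another index may be wanted.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }

    printf("name:                    %s\n", prop.name);
    printf("multiprocessors (SMX):   %d\n", prop.multiProcessorCount);
    printf("global memory:           %zu bytes\n", prop.totalGlobalMem);
    printf("L2 cache:                %d bytes\n", prop.l2CacheSize);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    printf("max threads per SMX:     %d\n", prop.maxThreadsPerMultiProcessor);
    printf("warp size:               %d\n", prop.warpSize);

    // Only meaningful if the device really is a Kepler part.
    printf("cores (assuming Kepler): %d\n",
           total_cores(prop.multiProcessorCount, kCoresPerSmxKepler));
    return 0;
}
```

On a K40 the multiprocessor count should come back as 15, which gives the 2880 cores computed above.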
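The internal structure of an SMX is shown in the figure below. A warp is the smallest unit of execution for CUDA threads: the 32 threads of a warp execute in parallel. The register file holds 65536 × 32 bits. Each SMX also has 32 special function units (SFU), 32 load/store units (LD/ST), and a 48 KB read-only data cache. Shared memory is 64 KB, or 128 KB; the figure differs between platforms. Tex is the texture unit.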
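To see how threads fall into warps of 32, the sketch below has every thread in a block record which warp it belongs to and its position (lane) inside that warp. The helper names `warp_index` and `lane_index` and the block size of 64 are my own illustrative choices. `warpSize` is the built-in device variable.

```cuda
#include <cstdio>

// Hypothetical helpers, usable on both host and device.
__host__ __device__ int warp_index(int tid, int warp_size) { return tid / warp_size; }
__host__ __device__ int lane_index(int tid, int warp_size) { return tid % warp_size; }

// Each thread records its warp and its lane within that warp.
__global__ void record_warps(int *warp_ids, int *lane_ids) {
    int tid = threadIdx.x;
    warp_ids[tid] = warp_index(tid, warpSize);
    lane_ids[tid] = lane_index(tid, warpSize);
}

int main(void) {
    const int kThreads = 64;  // one block of 64 threads, i.e. two warps of 32
    int warp_ids[kThreads], lane_ids[kThreads];
    int *dev_warp = NULL, *dev_lane = NULL;
    size_t bytes = kThreads * sizeof(int);

    if (cudaMalloc((void **)&dev_warp, bytes) != cudaSuccess ||
        cudaMalloc((void **)&dev_lane, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    record_warps<<<1, kThreads>>>(dev_warp, dev_lane);

    // Copy both arrays back; cudaMemcpy waits for the kernel to finish.
    cudaError_t e1 = cudaMemcpy(warp_ids, dev_warp, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(lane_ids, dev_lane, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_warp);
    cudaFree(dev_lane);
    if (e1 != cudaSuccess || e2 != cudaSuccess) {
        fprintf(stderr, "CUDA error while copying results\n");
        return 1;
    }

    // Show the threads around the boundary between the first and second warp.
    for (int tid = 30; tid < 34; ++tid) {
        printf("thread %d -> warp %d, lane %d\n", tid, warp_ids[tid], lane_ids[tid]);
    }
    return 0;
}
```

With a warp size of 32, threads 0 to 31 of the block form the first warp and threads 32 to 63 form the second.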
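1. Save the finished program with a .cu extension, never .cpp or anything else. Compile it with $ nvcc test.cu -o test, which produces an executable named test.
2. Understanding the GPU's underlying architecture makes it much clearer how threads execute. I will not go into much detail about GPU architecture and only cover the architectural features of the K40. If things still feel confusing, see the article "顯卡帝教你讀懂GPU架構圖 輕鬆作達人" as well as the official PDF below.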