1. On-board memory
2. On-chip memory
1. An intra-block thread communication channel (a channel for threads within a block to exchange data)
2. A program-managed cache for global memory data
3. Scratch pad memory for transforming data to improve global memory access patterns
__shared__ float tile[size_y][size_x];
If shared memory is declared inside a kernel, its scope is local to that kernel; otherwise the declaration is visible to all kernels. If the size of the shared memory is unknown at compile time, it can be declared with the extern keyword, for example a 1D array of unknown size:
extern __shared__ int tile[];
Because the size is unknown at compile time, the shared memory must be allocated dynamically at each kernel launch, via the third execution-configuration parameter mentioned at the beginning:
kernel<<<grid, block, isize * sizeof(int)>>>(...)
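For instance, a minimal sketch (the kernel name and sizes here are illustrative, not from the original post):

// Hypothetical kernel: stage n ints in dynamically sized shared memory,
// then write them back to global memory in reverse order.
__global__ void reverseInts(int *d_out, const int *d_in, int n) {
    extern __shared__ int smem[];              // size supplied at launch time
    int t = threadIdx.x;
    if (t < n) smem[t] = d_in[t];
    __syncthreads();
    if (t < n) d_out[t] = smem[n - 1 - t];
}

// The third launch parameter supplies the shared memory size in bytes:
// reverseInts<<<1, n, n * sizeof(int)>>>(d_out, d_in, n);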
Note: only 1D arrays can be declared dynamically this way.
Depending on how the addresses requested by a warp map onto banks, shared memory accesses exhibit the following patterns:
· Parallel access: multiple addresses spread across multiple banks.
· Serial access: multiple addresses fall within the same bank.
· Broadcast access: a single-address read within one bank.
· Conflict-free broadcast access if threads access the same address within a bank
· Bank conflict access if threads access different addresses within a bank
The bank width depends on compute capability:
· 4 bytes for devices of compute capability 2.x (Fermi)
· 8 bytes for devices of compute capability 3.x (Kepler)
For a 4-byte bank width, the mapping from byte address to bank is:
bank index = (byte address ÷ 4 bytes/bank) % 32 banks
For example, byte addresses 0, 4, 8, …, 124 map to banks 0 through 31, and byte address 128 wraps back to bank 0. In Fermi's address mapping (see the figure in the original post), successive addresses within one bank are 32 words apart: adjacent words are assigned to different banks so that a warp can serve more of its memory accesses in parallel (when reading contiguous memory, consecutive addresses land in different banks).
Kepler's shared memory additionally supports two bank modes:
· 64-bit mode
· 32-bit mode
In 64-bit mode, the mapping is:
bank index = (byte address ÷ 8 bytes/bank) % 32 banks
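A back-of-envelope sketch of the two mappings as plain C helpers (function names are mine, not a CUDA API):

// Bank index under 4-byte (Fermi) mode
unsigned bank_4byte(unsigned byte_addr) { return (byte_addr / 4) % 32; }
// Bank index under 8-byte (Kepler 64-bit) mode
unsigned bank_8byte(unsigned byte_addr) { return (byte_addr / 8) % 32; }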
Here, if two threads access any sub-words within the same 64-bit word, the access does not cause a bank conflict, because a single 64-bit read (the bank bandwidth is 64 bits per cycle) satisfies both requests. In other words, all else being equal, 64-bit mode generally produces fewer bank conflicts than 32-bit mode. The current bank configuration can be queried with:
cudaError_t cudaDeviceGetSharedMemConfig(cudaSharedMemConfig *pConfig);
The result is returned in pConfig and is one of:
cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
On devices that support it, the bank size can be set with:
cudaError_t cudaDeviceSetSharedMemConfig(cudaSharedMemConfig config);
where config can be one of:
cudaSharedMemBankSizeDefault
cudaSharedMemBankSizeFourByte
cudaSharedMemBankSizeEightByte
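A minimal usage sketch of these two calls (error handling reduced to a single check):

#include <cstdio>

int main() {
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);  // query the current bank size
    if (cfg == cudaSharedMemBankSizeFourByte) {
        // e.g. switch to 8-byte banks before a kernel that works on doubles
        cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        if (err != cudaSuccess)
            printf("set config failed: %s\n", cudaGetErrorString(err));
    }
    return 0;
}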
The split of on-chip memory between L1 cache and shared memory can be configured at two granularities:
· Per-device configuration
· Per-kernel configuration
The per-device setting is made with:
cudaError_t cudaDeviceSetCacheConfig(cudaFuncCache cacheConfig);
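For the per-kernel granularity, the runtime also provides cudaFuncSetCacheConfig. A sketch of both (the kernel name and body are illustrative):

// Hypothetical kernel that makes heavy use of shared memory
__global__ void smemHeavyKernel(float *data) { /* ... */ }

void configure() {
    // device-wide default: prefer a larger L1 cache
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    // override for this one kernel: prefer more shared memory instead
    cudaFuncSetCacheConfig(smemHeavyKernel, cudaFuncCachePreferShared);
}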
Synchronization within a kernel comes in two basic forms:
· Barriers
· Memory fences
void __syncthreads();
if (threadID % 2 == 0)
    __syncthreads();
else
    __syncthreads();
Calling __syncthreads() inside divergent branches like this is invalid: every thread of a block must reach the same barrier, otherwise the behavior is undefined. Note also that __syncthreads() does not synchronize across blocks; thread blocks may be executed in any order, in parallel or serially, on any SM. If a CUDA kernel requires global synchronization, the desired behavior can be achieved by splitting the kernel at the synchronization point and launching multiple kernels, as sketched below.
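A minimal sketch of that kernel-splitting pattern (phase1/phase2 are hypothetical names):

__global__ void phase1(float *d) { /* work before the global sync point */ }
__global__ void phase2(float *d) { /* work after the global sync point */ }

void run(float *d_data, dim3 grid, dim3 block) {
    phase1<<<grid, block>>>(d_data);
    // Launches on the same stream execute in order: no block of phase2
    // starts before every block of phase1 has finished, which acts as an
    // implicit grid-wide barrier between the two phases.
    phase2<<<grid, block>>>(d_data);
}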
void __threadfence_block();
void __threadfence();
void __threadfence_system();
The first function has block scope: it guarantees that values a thread has written before the fence are visible to the other threads of the same block afterwards; unlike a barrier, it does not require all threads to execute it. The second function has grid scope, and the third has system scope, covering the entire system, both device and host.
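A minimal sketch of the classic fence use, ordering a payload write before a flag write so that any thread that sees the flag also sees the data (variable names are mine):

__device__ int payload;            // data produced by one thread
__device__ volatile int ready;     // flag observed by other threads

__global__ void producer() {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 123;     // (1) write the data
        __threadfence();   // make (1) visible grid-wide before (2)
        ready = 1;         // (2) publish the flag
    }
}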
When laying data out in shared memory, two mappings matter:
· Mapping data elements across memory banks
· Mapping from thread index to shared memory offset
__shared__ int tile[N][N];
Because the tile is square, a 2D thread block can access it so that neighboring threads touch neighboring elements, using either indexing order:
tile[threadIdx.y][threadIdx.x]
tile[threadIdx.x][threadIdx.y]
With the first form, threads with consecutive threadIdx.x values access consecutive words of one row, which fall in different banks; with the second form, consecutive threads access one column, i.e. the same bank, which causes conflicts.
#define BDIMX 32
#define BDIMY 32
dim3 block(BDIMX, BDIMY);
dim3 grid(1, 1);
The kernel performs the following two operations:
· Write the thread indices into a 2D shared memory array in row-major order;
· Read those values from shared memory and write them to global memory.
__global__ void setRowReadRow(int *out) {
// declare static 2D shared memory
__shared__ int tile[BDIMY][BDIMX];
// mapping from thread index to global memory index (the grid has only one block)
unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
// shared memory store operation
tile[threadIdx.y][threadIdx.x] = idx;
// synchronize so that the shared memory loads below execute after all stores,
// avoiding the case where some threads are still writing while others already read
__syncthreads();
// shared memory load operation
out[idx] = tile[threadIdx.y][threadIdx.x];
}
__global__ void setColReadCol(int *out) {
// static shared memory
__shared__ int tile[BDIMX][BDIMY];
// mapping from thread index to global memory index
unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
// shared memory store operation
tile[threadIdx.x][threadIdx.y] = idx;
// wait for all threads to complete
__syncthreads();
// shared memory load operation
out[idx] = tile[threadIdx.x][threadIdx.y];
}
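The full checkSmemSquare.cu also needs a host driver; a minimal sketch of what the listing implies (error checks omitted):

int main() {
    int *d_out;
    cudaMalloc(&d_out, BDIMX * BDIMY * sizeof(int));
    dim3 block(BDIMX, BDIMY);
    dim3 grid(1, 1);
    setRowReadRow<<<grid, block>>>(d_out);
    setColReadCol<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}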
Compile and run (on a K40 in 4-byte bank mode):
$ nvcc checkSmemSquare.cu -o smemSquare
$ nvprof ./smemSquare
./smemSquare at device 0 of Tesla K40c with Bank Mode:4-byte
<<< grid (1,1) block (32,32)>>>
Time(%)      Time  Calls       Avg       Min       Max  Name
13.25%  2.6880us      1  2.6880us  2.6880us  2.6880us  setColReadCol(int*)
11.36%  2.3040us      1  2.3040us  2.3040us  2.3040us  setRowReadRow(int*)
The timing shows that the row-major kernel performs better.
To see why, check the following two nvprof metrics:
shared_load_transactions_per_request
shared_store_transactions_per_request
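They can be collected with nvprof's --metrics option, for example:

$ nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./smemSquare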
The results follow (K40, 8-byte mode): the row-major kernel needs only one transaction per request, while the column-major kernel needs 16 (in 4-byte mode it could need 32):
Kernel: setColReadCol(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 16.000000
Kernel: setRowReadRow(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
Writing Row-Major and Reading Column-Major
The figure in the original post illustrates the two memory operations with a simplified five-bank shared memory.
__global__ void setRowReadCol(int *out) {
// static shared memory
__shared__ int tile[BDIMY][BDIMX];
// mapping from thread index to global memory index
unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
// shared memory store operation
tile[threadIdx.y][threadIdx.x] = idx;
// wait for all threads to complete
__syncthreads();
// shared memory load operation
out[idx] = tile[threadIdx.x][threadIdx.y];
}
Kernel: setRowReadCol(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000
As the results show, the store operation is conflict-free, while the load operation requires 16 transactions per request.
As mentioned earlier, shared memory can be declared dynamically, either at file scope (visible to all kernels) or inside a kernel (local scope). Note that a dynamic declaration must be an unsized 1D array, so the 2D indices have to be recomputed by hand. Because we will write in row-major order and read in column-major order, two index values are needed:
· row_idx: the 1D row-major memory offset
· col_idx: the 1D column-major memory offset
The kernel code:
__global__ void setRowReadColDyn(int *out) {
    // dynamic shared memory
    extern __shared__ int tile[];
    // mapping from thread index to global memory index
    unsigned int row_idx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int col_idx = threadIdx.x * blockDim.y + threadIdx.y;
    // shared memory store operation
    tile[row_idx] = row_idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[row_idx] = tile[col_idx];
}
The shared memory size configured at kernel launch:
setRowReadColDyn<<<grid, block, BDIMX * BDIMY * sizeof(int)>>>(d_C);
Checking the transactions:
Kernel: setRowReadColDyn(int*)
1 shared_load_transactions_per_request 16.000000
1 shared_store_transactions_per_request 1.000000
The result matches the previous example, except that here the shared memory is declared dynamically.
Padding the shared memory array is another way to remove these conflicts. Going straight to the kernel code:
#define IPAD 1  // assumed value: one padding column (enough on Fermi; see below)
__global__ void setRowReadColPad(int *out) {
    // static shared memory, padded with IPAD extra columns
    __shared__ int tile[BDIMY][BDIMX + IPAD];
    // mapping from thread index to global memory offset
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
    // shared memory store operation
    tile[threadIdx.y][threadIdx.x] = idx;
    // wait for all threads to complete
    __syncthreads();
    // shared memory load operation
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
This code is the padded version of setRowReadCol; checking the results:
Kernel: setRowReadColPad(int*)
1 shared_load_transactions_per_request 1.000000
1 shared_store_transactions_per_request 1.000000
As expected, the bank conflicts on loads have disappeared. On Fermi, adding a single column is enough to remove the bank conflict, but on Kepler this is not guaranteed; it depends on the size of the 2D shared memory, so in 8-byte mode it may take several attempts to find the right amount of padding.
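A back-of-envelope check of why one extra column works in 4-byte mode (assuming 32 banks, one word wide):

#include <stdio.h>

// Print which bank each row of tile column 0 touches, for a given row
// stride (32 = unpadded, 33 = one pad column); assumes 4-byte banks.
void column_banks(int stride) {
    for (int y = 0; y < 32; y++)
        printf("%d ", (y * stride) % 32);
    printf("\n");
}

int main() {
    column_banks(32);  // prints 0 0 0 ... : every row hits bank 0 (32-way conflict)
    column_banks(33);  // prints 0 1 2 ... : each row hits a different bank (conflict-free)
    return 0;
}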
Original post: http://blog.163.com/[email protected]/blog/static/71988399201712735436357/