OpenCL C

時間 2019-11-16

標籤 opencl 欄目 C&C++ 简体版

原文原文鏈接

OpenCL C

OpenCL

簡介

opencl C是ISO C99的一個擴展，主要區別以下：html

去除了C99的一些特性，如：標準C99頭文件，函數指針，遞歸，變長數組，和位域
增長了一些特性用於並行計算，如：工做項和工做組，向量類型，同步，地址空間限定符(Address space qualifiers)

內建類型

標量數據類型ios

char , uchar, short, ushort, int, uint, long, ulong, float
bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
圖像類型
image2d_t, image3d_t, sampler_t
向量數據類型
Vector lengths 2, 4, 8, & 16 (char2, ushort4, int8, float16, double2, …)

向量操做

向量的n能夠選擇大小爲2, 3, 4, 8, and 16，能夠直接使用向量字面值，例如：
(float4)( float, float, float, float )
(float4)( float2, float, float )
(float4)( float, float2, float )
(float4)( float, float, float2 )
(float4)( float2, float2 )
(float4)( float3, float )
(float4)( float, float3 )
(float4)( float )只一個值則賦給全組算法

向量下標：
能夠用xyzw表示0123進行索引，如s.xy將索引0, 1位置的值。數組

使用.odd, .even索引偶數，奇數位置值，下標是從0開始。
int8 v = (int8)(1, 2, 3, 4, 5, 6, 7, 8);
int4 v1 = v.odd; // 奇數索引位置值{2, 4, 6, 8}緩存

使用.lo表示向量前半部分，.hi表示向量的後半部分。
int8 v = (int8)(1, 2, 3, 4, 5, 6, 7, 8);
int4 v1 = v.lo; // {1, 2, 3, 4}
int4 v2 = v.hi; // {5, 6, 7, 8}
對於3個元素的向量，v.hi, v.odd的第二個元素爲未定義。
int3 v = (int3)(1, 2, 3);
int2 v1 = v.lo; // {1, 2}
int2 v2 = v.hi; // {3, undefined}併發

在作賦值時，必須保證兩邊向量的元素個數相同：
float4 v = (float4)(1);
v.odd = (float2)(3, 3); //左邊是2個元素，右邊必需要是float2less

關係運算符

關係運算符返回值：
標量：specified relation is false返回0, true返回1
向量：specified relation is false返回0, true返回-1dom

NaN的狀況：異步

The equality operator equal(==) returns 0 if one or both arguments are not a number (NaN).
The equality operator not equal (!=) returns 1 (for scalar source operands) or -1 (for vector source operands) if one or botharguments are not a number (NaN)

相關函數：async

int isequal (float x, float y)
intn isequal (floatn x, floatn y)
int isless (float x, float y) intn isless (floatn x, floatn y)
int isless (double x, double y) longn isless (doublen x, doublen y)
int isnan (float) intn isnan (floatn)
int isnan (double) longn isnan (doublen)

bitwise operator

bitwise operators and (&), or (|), exclusive or (^), and not (~)

類型轉換

destType convert_destType<_sat><_roundingMode> (sourceType)
destTypen convert_destTypen<_sat><_roundingMode> (sourceTypen)

Modifier Rounding Mode Description
_rte Round to nearest even
_rtz Round toward zero
_rtp Round toward positive infinity
_rtn Round toward negative infinity

整型默認 _rtz，float是_rte;

標量支持顯示轉換，也能夠用convert_type函數。

char n = 3;
int m = (int)n;
或
int m = convert_int(n);

向量轉換，不支持顯示轉換，必須使用convert_type函數進行轉換。

float4 v1 = (float4)(1.0 1.0 1.0 1.0);
int4 v2 = convert_int4(v1);

as_type不改變元素bit位，從新使用新的類型解析，注意不一樣平臺字節序（Endianness）可能不一致，不具備可移植性：

float4 v1 = (float4)(1.0 1.0 1.0 1.0);
int4 v3 = as_int4(v1); //(int4)(0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000)，不是1

內存操做

返回(p + offset * n)處的值：

gentypen vloadn(size_t offset, const __global gentype *p)

將data寫到(p + offset *n)位置：

void vstoren (gentypen data, size_t offset, __global gentype *p)

判斷地址類型：

bool is_global (const void *ptr)
bool is_local (const void *ptr)
bool is_private (const void *ptr)
cl_mem_fence_flags get_fence (const void *ptr)：返回地址對應的cl_mem_fence_flags

從global memory 到 local memory，或local memory 到 global memory 的異步拷貝，可使用DMA實現，快速。
參數event是須要等待的事件
返回一個event，能夠給wait_group_events使用。

event_t async_work_group_copy(local gentype
*dst, const global gentype *src, size_t
num_gentypes, event_t event);

event_t async_work_group_strided_copy(__local gentype *dst, const __global gentype *src, size_t num_gentypes, size_t src_stride, event_t event);

將全局內存num_gentypes * sizeof(gentype)字節緩存到global cache中。

void prefetch(const _global gentype *_p, size_t num_gentypes)

同步

work_group_barrier之前的叫barrier函數，新標準仍然兼容barrier函數。一個工做組裏的全部線程必須都執行到這個函數，才能繼續往下執行。

void work_group_barrier (cl_mem_fence_flags flags)

cl_mem_fence_flags：
CLK_LOCAL_MEM_FENCE local內存操做對全部同組item可見
CLK_GLOBAL_MEM_FENCE global內存操做對同組可見

不論是CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE，都只能對相同的work-group裏的item進行同步，沒法同步全局item的內存操做。

若是真的須要進行全局全部item同步，那麼最好將同步先後拆分紅兩個kernel，在host端調用時進行同步。

原子操做

使用原子操做作同步開銷是至關大的，可是相對於使用更原始的阻塞當前線程執行的同步方式而言又是比較高效的。所以，當對某些特定數據作同步更新時，不須要使用柵欄（fence）等這種更低效的同步處理機制，咱們能夠直接對那些存儲地址採用原子操做。

在一個原子事務中執行。讀取 p 指向位置的內容（用做返回值），將 p 指向位置的內容加上 val 後再存入該位置。

int atomic_add (volatile __global int *p, int val)

原子加 1 操做。讀取 p 指向位置的內容（用做返回值），將 p 指向位置的內容加上常量值 1 後再存入該位置。原子減 1 操做 atomic_dec 和加 1 操做相似。

int atomic_inc(volatile __global int *p)

pipe

pipe能夠用於在不一樣kernel程序間傳遞數據。多個kernel程序(甚至是硬件許可)對同一pipe的同時訪問結果都是不肯定的。主機端沒法訪問pipe。

OpenCL2.0新增了一個主機API函數來建立pipe，再經過設置參數將pipe傳遞給不一樣的kernel使用：

cl_mem clCreatePipe ( cl_context context, cl_mem_flags flags, cl_uint pipe_packet_size, cl_uint pipe_max_packets,  
const cl_pipe_properties * properties,  cl_int *errcode_ret)

一個kernel進行寫入：

//reserve space in pipe for writing random numbers.  
reserve_id_t rid = work_group_reserve_write_pipe(rng_pipe, szgr);  
write_pipe(rng_pipe,rid,lid, &gfrn);  
work_group_commit_write_pipe(rng_pipe, rid);

一個kernel進行讀取：

//reserve pipe for reading  
reserve_id_t rid = work_group_reserve_read_pipe(rng_pipe, szgr);  
if(is_valid_reserve_id(rid)) {  
  //read random number from the pipe.  
  read_pipe(rng_pipe,rid,lid, &rn);  
  work_group_commit_read_pipe(rng_pipe, rid);  
  }

打印

printf常規：
%d
%x
%f
%s

打印向量vn, n取2, 3, 4, 8, 16:

int4 value = (int4)(1, 2, 3, 4);
printf("%v4d\n", value);

描述符

加下劃線不加下劃線均可以。

函數描述符:
__kernel and kernel

內存位置描述符：
__global, global,
__local, local,
__constant, constant,
__private and private

訪問權限描述符：
__read_only, read_only,
__write_only, write_only,
__read_write and read_write

work item函數

get_local_id: 返回當前thread在group中的位置
get_group_id: 返回當前group的位置
get_global_id: 返回當前thread在全局thread中的位置

get_local_size返回一個work-group的大小
get_global_size返回全局work-item的個數，NDRange中的global_work_size

整體上有：
get_global_id = get_group_id * get_local_size + get_local_id

wave

wave是線程調度的基本單位，相似cuda裏的warp(32), AMD的實現中，wave大小被定義爲64。

訪存合併

對於全局內存，一次訪問，須要幾百個cycles，咱們但願進行訪存合併，減小內存訪問次數。

不必定要全部thread要進行數據讀取，但要保證以下兩點才能進行合併訪問：

Aligned Memory access 對齊
Coalesced Memory access 連續

當要獲取的Memory首地址是cache line的倍數時，就是Aligned Memory Access，若是是非對齊的，就會致使浪費帶寬。至於Coalesced Memory Access則是warp的32個thread請求的是連續的內存塊。

L1爲128 byte，一次最小讀入128 byte大小。

如下二者方式均可以一次傳輸：

enter description here

下面落入兩個128-byte，因此須要兩次傳輸：

enter description here

下面落入更多的區域，因此須要更多的傳輸：

enter description here

Uncached Loads

這裏就是指不走L1可是仍是要走L2，也就是cache line從128-byte變爲32-byte了.
下圖是理想的對齊且連續情形，全部的128 bytes都落在四塊32 bytes的塊中

enter description here

下圖請求沒有對齊，請求落在了160-byte範圍內，bus有效使用率是百分之八十，相對使用L1，性能要好很多。

enter description here

下圖是全部thread都請求同一塊數據的情形，bus有效使用率爲4bytes/32bytes=12.5%，依然要比L1表現好。

enter description here

下圖是狀況最糟糕的，數據很是分散，可是因爲所請求的128 bytes落在了多個以32 bytes爲單位的segment中，所以無效的數據傳輸要少的多。

enter description here

收集來自： http://www.javashuo.com/article/p-gvervabt-cw.html

bank conflict

如今的warp通常是32個thread，在local memory中，存在32個bank，每一個bank是4 bytes，性能高的也多是8 bytes。

以下，一個local memory被映射到不一樣的bank中，在一個warp中若是thread 0訪問bank0，thread31訪問bank31，這樣就沒有conflict。

int lid = get_local_id(0);
int v = data[lid];

但若是是下面的訪問方法, thread 0, 8, 16, 24都會訪問bank0，這就是一個4 way conflict，致使性能降低爲原來的1/4。

int lid = get_local_id(0);
int v = data[lid*4];

bank

對於局部內存，一個warp中若是多個thread訪問到相同的bank的不一樣位置，便會產生bank conflict，這樣訪問會順序執行。

另外，若是全部thread都訪問到一個bank，會產生廣播，不會形成conflict，如你們都訪問data[0]，只會是一次訪問。

延時隱藏

若是warp中線程執行一條指令須要等待前面啓動的長延時操做的結果（就是該warp須要從全局存儲器中提取數值計算），那麼就不選擇該warp，而是選擇另外一個不須要等待結果的駐留的warp（這個warp已經獲得了本身須要的結果，因此已經無需等待了，能夠直接執行了），當多個warp準備執行的時候，採用優先機制選擇一個warp執行，這種機制不產生延時的線程先執行，這就是所謂的延時隱藏（latency hiding）。

同一個warp中的thread能夠以任意順序執行，active warps被sm資源限制。當一個warp空閒時，SM就能夠調度駐留在該SM中另外一個可用warp。在併發的warp之間切換是沒什麼消耗的，由於硬件資源早就被分配到全部thread和block，因此該新調度的warp的狀態已經存儲在SM中了。不一樣於CPU，CPU切換線程須要保存/讀取線程上下文（register內容），這是很是耗時的，而GPU爲每一個threads提供物理register，無需保存/讀取上下文。

Occupancy

要保證較高的CU資源利用率，如何保證呢，就是在進行內存訪問請求資源時，有足夠多的算術計算佔據這部分時間。

向量化

向量化容許一個線程同時執行多個操做。咱們能夠在kernel代碼中，使用向量數據類型，好比float4來得到加速。向量化在AMD的GPU上效果更爲明顯，這是由於AMD的顯卡的stream core是(x,y,z,w)這樣的向量運算單元。
下圖是在簡單的向量賦值運算中，使用float和float4的性能比較。

opencl優化方法

思路：

更好的算法思想，如對矩陣相乘進行分塊
使用本地內存（Local Memory）
- 本地內存的延遲比全局內存低，但可能會存在隱性開銷。例如，使用本地內存常常有一個本地內存屏障，這種屏障將致使同步延遲，抵銷了低延遲帶來的好處。
- 在您將多級算法合併至單一內核函數中時，本地內存對於存儲中間數據是有好處的，能夠節省 DDR 帶寬，從而下降功耗。
- 若是您但願在本地內存緩存數據，便於屢次訪問，一個好的經驗法則是保證緩存數據被訪問3次以上纔有必要這麼作。
避免本地內存的bank conflict
優化全局內存的訪存合併
對於work-group大小，最好是wave的整數倍，若是是非整數倍，有部分wave裏是空置的；若是小於wave的話，也會有一部分線程空操做
kernel要簡單些，複雜的話須要的寄存器數量會增多，而一個sm所擁有的寄存器個數是固定的（GTX 1080TI 個數爲: 65536）
儘可能按行操做,須要按列操做時能夠先對矩陣進行轉置
循環展開，減小分支（分支是分步執行的，好比說一個if (tid % 2)這樣的分支，先執行奇數線程，再執行偶數線程）
向量化操做，向量化容許一個線程同時執行多個操做。咱們能夠在 kernel 代碼中,使用向量數據類型,好比 float4 來得到加速。

圖像

採樣器對象描述了讀取圖像數據時如何對圖像進行採樣。圖像讀取函數 read_imageX 包含一個採樣器參數，該參數能夠在主機端經過調用 OpenCL API 函數建立，而後使用 clSetKernelArg 傳遞給內核；也能夠在內核程序中聲明，在內核程序中聲明的採樣器對象爲 sampler_t 類型的常量。採樣器對象包含了一些屬性，這些屬性描述了在讀取圖像對象的像素時如何採樣。分別是規格化浮點座標，尋址模式和過濾模式。

規格化座標：指定傳遞的 x、y 和 z 座標值是規格化浮點座標仍是非規格化座標值。能夠是 CLK_NORMALIZED_COORDS_TRUE 或者 CLK_NORMALIZED_COORDS_FALSE 枚舉類型的值；
尋址模式：指定圖像的尋址模式。即，當傳遞的座標值超過圖像座標區域時該如何處理。能夠是下面的枚舉類型的值：
- CLK_ADDRESS_MIRRORED_REPEAT：圖像區域外的座標設置爲區域內座標的反射值對應的顏色；
- CLK_ADDRESS_REPEAT：圖像區域外的座標重複區域內座標的顏色，只對規格化座標有效；
- CLK_ADDRESS_CLAMP_TO_EDGE：圖像區域外的座標返回圖像邊緣的顏色；
- CLK_ADDRESS_CLAMP：圖像區域外座標返回的顏色和邊框顏色保持一致；
過濾模式：指定使用的過濾模式。能夠是 CLK_FILTER_NEAREST 或 CLK_FILTER_LINEAR 枚舉類型值，分別表示最近鄰插值和雙線性插值。

Sample

1. vector add

每一個thread執行一個元素：

vector add

2. image scale

image scale

3. reduction

__kernel void reduce(__global uint4* input, __global uint4* output, int NUM) {
    NUM = NUM / 4;    //每四個數爲一個總體uint4。
    unsigned int tid = get_local_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int globalSize = get_global_size(0);

    uint4 res=(uint4){0,0,0,0};
    __local uint4 resArray[64];

    
    unsigned int i = get_global_id(0);
    while(i < NUM)
    {
        res+=input[i];
        i+=globalSize;
    }
    resArray[tid]=res;    //將每一個work-item計算結果保存到對應__local memory中
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            resArray[tid] += resArray[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) 
        output[get_group_id(0)] = resArray[0];
}

#include <CL/cl.h>
#include "tool.h"
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <fstream>
using namespace std;

int isVerify(int NUM,int groupNUM,int *res) //校驗結果 {
       int sum1 = (NUM+1)*NUM/2;
    int sum2 = 0;
    for(int i = 0;i < groupNUM*4; i++)
        sum2 += res[i];
    if(sum1 == sum2)
        return 0;
    return -1;
}

void isStatusOK(cl_int status) //判斷狀態碼 {
    if(status == CL_SUCCESS)
        cout<<"RIGHT"<<endl;
    else
        cout<<"ERROR"<<endl;
}

int main(int argc, char* argv[]) {
    cl_int    status;
    /**Step 1: Getting platforms and choose an available one(first).*/
    cl_platform_id platform;
    getPlatform(platform);

    /**Step 2:Query the platform and choose the first GPU device if has one.*/
    cl_device_id *devices=getCl_device_id(platform);

    /**Step 3: Create context.*/
    cl_context context = clCreateContext(NULL,1, devices,NULL,NULL,NULL);

    /**Step 4: Creating command queue associate with the context.*/
    cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);

    /**Step 5: Create program object */
    const char *filename = "Own_Reduction_Kernels.cl";
    string sourceStr;
    status = convertToString(filename, sourceStr);
    const char *source = sourceStr.c_str();
    size_t sourceSize[] = {strlen(source)};
    cl_program program = clCreateProgramWithSource(context, 1, &source, sourceSize, NULL);

    /**Step 6: Build program. */
    status=clBuildProgram(program, 1,devices,NULL,NULL,NULL);

    /**Step 7: Initial input,output for the host and create memory objects for the kernel*/
    int NUM=25600;    //6400*4
    size_t global_work_size[1] = {640};  ///
    size_t local_work_size[1]={64};    ///256 PE
    size_t groupNUM=global_work_size[0]/local_work_size[0];
    int* input = new int[NUM];
    for(int i=0;i<NUM;i++)
        input[i]=i+1;
    int* output = new int[(global_work_size[0]/local_work_size[0])*4];

    cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, (NUM) * sizeof(int),(void *) input, NULL);
    cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY , groupNUM*4* sizeof(int), NULL, NULL);

    /**Step 8: Create kernel object */
    cl_kernel kernel = clCreateKernel(program,"reduce", NULL);

    /**Step 9: Sets Kernel arguments.*/
    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputBuffer);
    status = clSetKernelArg(kernel, 2, sizeof(int), &NUM);

    /**Step 10: Running the kernel.*/
    cl_event enentPoint;
    status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, &enentPoint);
    clWaitForEvents(1,&enentPoint); ///wait
    clReleaseEvent(enentPoint);
    isStatusOK(status);
            
    /**Step 11: Read the cout put back to host memory.*/
    status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0,groupNUM*4 * sizeof(int), output, 0, NULL, NULL);
    isStatusOK(status);
    if(isVerify(NUM, groupNUM ,output) == 0)
        cout<<"The result is right!!!"<<endl;
    else
        cout<<"The result is wrong!!!"<<endl;

    /**Step 12: Clean the resources.*/
    status = clReleaseKernel(kernel);//*Release kernel.
    status = clReleaseProgram(program);    //Release the program object.
    status = clReleaseMemObject(inputBuffer);//Release mem object.
    status = clReleaseMemObject(outputBuffer);
    status = clReleaseCommandQueue(commandQueue);//Release Command queue.
    status = clReleaseContext(context);//Release context.

    free(input);
    free(output);
    free(devices);
    return 0;
}