CUDA is one flavor of general-purpose GPU computing, and the GPU back ends of today's hot deep learning frameworks almost all build on it. Let's first walk through a few of its concepts. To make them easier to grasp, we'll compare CUDA with DX11's compute shader, since both can be used for GPGPU work.
First, a diagram from Microsoft's MSDN.
Compute shader:
Grid of thread groups: Dispatch(x, y, z), indexed by SV_GroupID.
Threads per group: [numthreads(SIZE_X, SIZE_Y, 1)], indexed by SV_GroupThreadID.
In-group index: a compute shader works in units of groups, and shared memory is shared within one group (groupshared / GroupMemoryBarrierWithGroupSync). SV_GroupIndex is the flattened in-group index, and group-shared blocks are usually addressed by SV_GroupIndex or a multiple of it: SV_GroupIndex = SV_GroupThreadID.x + SV_GroupThreadID.y * SIZE_X (assuming 2D here).
Unique index across all threads: the 3D index over the whole dispatch is SV_DispatchThreadID, where SV_DispatchThreadID = SV_GroupThreadID + SV_GroupID * numthreads.
Given data of size width x height, the relation is width = x * SIZE_X and height = y * SIZE_Y. So once we know the data's dimensions and pick a group size, the dispatch counts follow directly; the division may not be exact, so it's worth passing the real dimensions into the shader and bounds-checking. SV_DispatchThreadID is then the index into the full width x height range, and in general it's all we need; only when groupshared / GroupMemoryBarrierWithGroupSync comes into play do we switch to SV_GroupIndex to compute group by group.
The same concepts in CUDA, e.g. increment_kernel<<<gridDim, blockDim, 0, 0>>>:
Grid of blocks: gridDim, indexed by blockIdx.
Threads per block: blockDim, indexed by threadIdx.
In-block index: used to index block-shared memory (__shared__ / __syncthreads); the flattened index plays the role of SV_GroupIndex:
int groupIndex = threadIdx.x; // assuming 1D
int groupIndex = threadIdx.x + threadIdx.y * blockDim.x; // assuming 2D
Unique index across all threads: CUDA only gives us threadIdx, the index within a block (unlike DX CS's flattened SV_GroupIndex, it is three-dimensional), so the position in the overall width x height data is computed by hand, exactly as SV_DispatchThreadID is above:
const int idx = threadIdx.x + blockIdx.x * blockDim.x;
const int idy = threadIdx.y + blockIdx.y * blockDim.y;
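As a minimal sketch of this mapping (the kernel name and buffers are hypothetical), the standard launch pattern looks like this:

// Each thread handles one pixel; idx/idy play the role of SV_DispatchThreadID.
__global__ void copyImage(const float* src, float* dst, int width, int height)
{
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    const int idy = threadIdx.y + blockIdx.y * blockDim.y;
    if (idx < width && idy < height)   // the grid may overshoot when sizes don't divide evenly
        dst[idy * width + idx] = src[idy * width + idx];
}

// Round the grid up so every pixel is covered, mirroring Dispatch(x, y, z) in DX11.
dim3 block(32, 8);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
copyImage<<<grid, block>>>(src, dst, width, height);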
Likewise, DX11's usual APIs for moving data between system memory and VRAM, such as Map/Unmap, correspond to cudaMemcpyAsync with cudaMemcpyDeviceToHost / cudaMemcpyHostToDevice on the CUDA side.
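For instance, a minimal round trip might look like this (the buffers, sizes, and stream are hypothetical):

// Round-trip a buffer between system memory and VRAM, the rough analog of DX11 Map/Unmap staging.
float *hData = nullptr, *dData = nullptr;
size_t bytes = width * height * sizeof(float);
cudaHostAlloc((void**)&hData, bytes, cudaHostAllocDefault); // pinned memory, so the async copy really is async
cudaMalloc((void**)&dData, bytes);

cudaMemcpyAsync(dData, hData, bytes, cudaMemcpyHostToDevice, stream);
// ... launch kernels on `stream` that read and write dData ...
cudaMemcpyAsync(hData, dData, bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);  // hData is valid once this returns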
Back to the main topic. The guided filter is an edge-preserving filter proposed by Kaiming He. I won't detail the algorithm itself here, and OpenCV already ships with an implementation; this article just builds a CUDA version of it, since in image processing most per-pixel work parallelizes naturally. The linked page includes the MATLAB implementation; let's look at the fast multi-channel version to see what it can do.
The results are impressive. I've pasted the corresponding MATLAB implementation below; it's worth a quick read first. This style of MATLAB is a lot like the array operations offered by the Python front ends of deep learning libraries, so it's easy to follow.
function q = fastguidedfilter_color(I, p, r, eps, s)
%   GUIDEDFILTER_COLOR   O(1) time implementation of guided filter using a color image as the guidance.
%
%   - guidance image: I (should be a color (RGB) image)
%   - filtering input image: p (should be a gray-scale/single channel image)
%   - local window radius: r
%   - regularization parameter: eps
%   - subsampling ratio: s (try s = r/4 to s=r)

I_sub = imresize(I, 1/s, 'nearest'); % NN is often enough
p_sub = imresize(p, 1/s, 'nearest');
r_sub = r / s; % make sure this is an integer

[hei, wid] = size(p_sub);
N = boxfilter(ones(hei, wid), r_sub); % the size of each local patch; N=(2r+1)^2 except for boundary pixels.

mean_I_r = boxfilter(I_sub(:, :, 1), r_sub) ./ N;
mean_I_g = boxfilter(I_sub(:, :, 2), r_sub) ./ N;
mean_I_b = boxfilter(I_sub(:, :, 3), r_sub) ./ N;

mean_p = boxfilter(p_sub, r_sub) ./ N;

mean_Ip_r = boxfilter(I_sub(:, :, 1).*p_sub, r_sub) ./ N;
mean_Ip_g = boxfilter(I_sub(:, :, 2).*p_sub, r_sub) ./ N;
mean_Ip_b = boxfilter(I_sub(:, :, 3).*p_sub, r_sub) ./ N;

% covariance of (I, p) in each local patch.
cov_Ip_r = mean_Ip_r - mean_I_r .* mean_p;
cov_Ip_g = mean_Ip_g - mean_I_g .* mean_p;
cov_Ip_b = mean_Ip_b - mean_I_b .* mean_p;

% variance of I in each local patch: the matrix Sigma in Eqn (14).
% Note the variance in each local patch is a 3x3 symmetric matrix:
%           rr, rg, rb
%   Sigma = rg, gg, gb
%           rb, gb, bb
var_I_rr = boxfilter(I_sub(:, :, 1).*I_sub(:, :, 1), r_sub) ./ N - mean_I_r .* mean_I_r;
var_I_rg = boxfilter(I_sub(:, :, 1).*I_sub(:, :, 2), r_sub) ./ N - mean_I_r .* mean_I_g;
var_I_rb = boxfilter(I_sub(:, :, 1).*I_sub(:, :, 3), r_sub) ./ N - mean_I_r .* mean_I_b;
var_I_gg = boxfilter(I_sub(:, :, 2).*I_sub(:, :, 2), r_sub) ./ N - mean_I_g .* mean_I_g;
var_I_gb = boxfilter(I_sub(:, :, 2).*I_sub(:, :, 3), r_sub) ./ N - mean_I_g .* mean_I_b;
var_I_bb = boxfilter(I_sub(:, :, 3).*I_sub(:, :, 3), r_sub) ./ N - mean_I_b .* mean_I_b;

a = zeros(hei, wid, 3);
for y=1:hei
    for x=1:wid
        Sigma = [var_I_rr(y, x), var_I_rg(y, x), var_I_rb(y, x);
                 var_I_rg(y, x), var_I_gg(y, x), var_I_gb(y, x);
                 var_I_rb(y, x), var_I_gb(y, x), var_I_bb(y, x)];

        cov_Ip = [cov_Ip_r(y, x), cov_Ip_g(y, x), cov_Ip_b(y, x)];

        a(y, x, :) = cov_Ip * inv(Sigma + eps * eye(3)); % very inefficient. Replace this in your C++ code.
    end
end

b = mean_p - a(:, :, 1) .* mean_I_r - a(:, :, 2) .* mean_I_g - a(:, :, 3) .* mean_I_b; % Eqn. (15) in the paper;

mean_a(:, :, 1) = boxfilter(a(:, :, 1), r_sub)./N;
mean_a(:, :, 2) = boxfilter(a(:, :, 2), r_sub)./N;
mean_a(:, :, 3) = boxfilter(a(:, :, 3), r_sub)./N;
mean_b = boxfilter(b, r_sub)./N;

mean_a = imresize(mean_a, [size(I, 1), size(I, 2)], 'bilinear'); % bilinear is recommended
mean_b = imresize(mean_b, [size(I, 1), size(I, 2)], 'bilinear');

q = mean_a(:, :, 1) .* I(:, :, 1)...
  + mean_a(:, :, 2) .* I(:, :, 2)...
  + mean_a(:, :, 3) .* I(:, :, 3)...
  + mean_b;
end
The approach is simple: all we have to do is translate that code into CUDA. With bare CUDA alone, displaying the resulting images is a hassle, so we bring in OpenCV and use its cv::cuda::GpuMat as a light wrapper. Let's see how CUDA's own libraries, our custom kernels, and OpenCV's CUDA functions combine into a CUDA implementation of the MATLAB code above.
inline __host__ __device__ void inverseMat3x3(const float3& col0, const float3& col1, const float3& col2,
    float3& invCol0, float3& invCol1, float3& invCol2)
{
    float det = col0.x*(col1.y*col2.z - col2.y*col1.z) -
                col0.y*(col1.x*col2.z - col1.z*col2.x) +
                col0.z*(col1.x*col2.y - col1.y*col2.x);
    if (det > 0.000000001f)
    {
        float invdet = 1.0f / det;
        invCol0.x = (col1.y*col2.z - col2.y*col1.z)*invdet;
        invCol0.y = (col0.z*col2.y - col0.y*col2.z)*invdet;
        invCol0.z = (col0.y*col1.z - col0.z*col1.y)*invdet;
        invCol1.x = (col1.z*col2.x - col1.x*col2.z)*invdet;
        invCol1.y = (col0.x*col2.z - col0.z*col2.x)*invdet;
        invCol1.z = (col1.x*col0.z - col0.x*col1.z)*invdet;
        invCol2.x = (col1.x*col2.y - col2.x*col1.y)*invdet;
        invCol2.y = (col2.x*col0.y - col0.x*col2.y)*invdet;
        invCol2.z = (col0.x*col1.y - col1.x*col0.y)*invdet;
    }
}

// Multiply a row vector by a column-major 3x3 matrix.
inline __host__ __device__ float3 mulMat(const float3 data, const float3& col0, const float3& col1, const float3& col2)
{
    float3 dest;
    dest.x = dot(data, make_float3(col0.x, col1.x, col2.x));
    dest.y = dot(data, make_float3(col0.y, col1.y, col2.y));
    dest.z = dot(data, make_float3(col0.z, col1.z, col2.z));
    return dest;
}

// Compute the per-pixel products needed later: I*p, then rr/rg/rb and gg/gb/bb.
inline __global__ void findMatrix(PtrStepSz<float4> source, PtrStepSz<float3> dest, PtrStepSz<float3> dest1, PtrStepSz<float3> dest2)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx < source.cols && idy < source.rows)
    {
        float4 scolor = source(idy, idx);// rgbauchar42float4(source(idy, idx));
        float3 color = make_float3(scolor);
        dest(idy, idx) = color*scolor.w;                 // I * p
        dest1(idy, idx) = color.x*color;                 // rr, rg, rb
        dest2(idy, idx) = make_float3(color.y*color.y, color.y*color.z, color.z*color.z); // gg, gb, bb
    }
}

// Solve for the guided filter coefficients (a, b). Paper: http://kaiminghe.com/publications/pami12guidedfilter.pdf
inline __global__ void guidedFilter(PtrStepSz<float4> source, PtrStepSz<float3> col1, PtrStepSz<float3> col2, PtrStepSz<float3> col3,
    PtrStepSz<float4> dest, float eps)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx < source.cols && idy < source.rows)
    {
        float4 color = source(idy, idx);// rgbauchar42float4(source(idy, idx));
        float3 mean_I = make_float3(color);
        float mean_p = color.w;
        float3 mean_Ip = col1(idy, idx);// rgbauchar32float3(col1(idy, idx));
        float3 var_I_r = col2(idy, idx) - mean_I.x*mean_I;// rgbauchar32float3(col2(idy, idx)) - mean_I.x*mean_I; // col0
        float3 var_I_gbxfv = col3(idy, idx);// rgbauchar32float3(col3(idy, idx));
        float gg = var_I_gbxfv.x - mean_I.y*mean_I.y;
        float gb = var_I_gbxfv.y - mean_I.y*mean_I.z;
        float bb = var_I_gbxfv.z - mean_I.z*mean_I.z;
        float3 cov_Ip = mean_Ip - mean_I*mean_p;
        // Sigma + eps * eye(3), stored as three columns (locals shadow the image parameters here).
        float3 col0 = var_I_r + make_float3(eps, 0.f, 0.f);
        float3 col1 = make_float3(var_I_r.y, gg + eps, gb);
        float3 col2 = make_float3(var_I_r.z, gb, bb + eps);
        float3 invCol0 = make_float3(1.f, 0.f, 0.f);
        float3 invCol1 = make_float3(0.f, 1.f, 0.f);
        float3 invCol2 = make_float3(0.f, 0.f, 1.f);
        inverseMat3x3(col0, col1, col2, invCol0, invCol1, invCol2);
        float3 a = mulMat(cov_Ip, invCol0, invCol1, invCol2);
        float b = mean_p - dot(a, mean_I);
        dest(idy, idx) = make_float4(a, b);
    }
}

// q = a' * I + b, clamped and packed into the two output images.
inline __global__ void guidedFilterResult(PtrStepSz<float4> source, PtrStepSz<float4> guid, PtrStepSz<uchar4> dest, PtrStepSz<uchar> destp)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx < source.cols && idy < source.rows)
    {
        float4 color = source(idy, idx);// rgbauchar42float4(source(idy, idx)); // I
        float4 mean = guid(idy, idx);
        float q = clamp(color.x*mean.x + color.y*mean.y + color.z*mean.z + mean.w, 0.f, 1.f);
        float3 rgb = make_float3(color*q);
        dest(idy, idx) = rgbafloat42uchar4(make_float4(rgb, q));
        destp(idy, idx) = (uchar)(__saturatef(q)*255.0f);
    }
}
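These kernels lean on a few helpers from cuda_help.h that aren't shown here (divUp, the combin merge kernel, rgbafloat42uchar4), plus the float3/float4 operators that usually come from the CUDA SDK's helper_math.h. Their exact contents aren't in this post, so the following is only a sketch consistent with how they're called; treat the details as assumptions:

#include <helper_math.h>   // float3/float4 operators, dot(), clamp() (CUDA SDK sample header)
using cv::cuda::PtrStepSz;

// Round up an integer division; used to size the grid from the image dimensions.
inline int divUp(int total, int grain) { return (total + grain - 1) / grain; }

// Pack a [0,1] float4 into an 8-bit 4-channel pixel.
inline __device__ uchar4 rgbafloat42uchar4(float4 c)
{
    return make_uchar4((unsigned char)(__saturatef(c.x) * 255.0f),
                       (unsigned char)(__saturatef(c.y) * 255.0f),
                       (unsigned char)(__saturatef(c.z) * 255.0f),
                       (unsigned char)(__saturatef(c.w) * 255.0f));
}

// Merge the 8UC3 guide image I and the 8UC1 input p into one normalized float4 (I in xyz, p in w).
__global__ void combin(PtrStepSz<uchar3> I, PtrStepSz<uchar> p, PtrStepSz<float4> dst)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    const int idy = blockDim.y * blockIdx.y + threadIdx.y;
    if (idx < dst.cols && idy < dst.rows)
    {
        uchar3 c = I(idy, idx);
        dst(idy, idx) = make_float4(c.x / 255.f, c.y / 255.f, c.z / 255.f, p(idy, idx) / 255.f);
    }
}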
The code above is the part that neither the CUDA libraries nor OpenCV provides, chiefly the multi-channel arithmetic of the guided filter, so we implement it ourselves. Below is the complete host-side implementation with OpenCV.
#include <cuda.h>
#include <cuda_runtime.h>
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/core/cuda_stream_accessor.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/cudafilters.hpp>
#include "cuda_help.h"
#include "fastguidedfilter.h"

extern "C" void test11()
{
#pragma region xxxx
    cv::cuda::setDevice(0);
    std::string windowNameIP = "vvvvvIP";
    namedWindow(windowNameIP);
    std::string windowNameP = "vvvvvP";
    namedWindow(windowNameP);
    Stream curentStream = {};
    cudaStream_t cudaStream = StreamAccessor::getStream(curentStream);
    int scale = 8;
    int width = 556;
    int height = 568;
    int scaleWidth = width / scale;
    int scaleHeight = height / scale;
    Mat frame(height, width, CV_8UC3);
    Mat resultFrame;  // (height, width, CV_8UC3);
    Mat cpuIP;        // (scaleHeight, scaleWidth, CV_8UC4);
    Mat cpuP;
    Mat I(height, width, CV_8UC3);
    Mat P(height, width, CV_8UC3);
    cv::cuda::GpuMat gpuI;
    cv::cuda::GpuMat gpuP;
    cv::cuda::GpuMat gpuKeying(height, width, CV_32FC4);
    cv::cuda::GpuMat gpuCvting;                                     // I_sub + p_sub
    cv::cuda::GpuMat gpuResize(scaleHeight, scaleWidth, CV_32FC4);  // I_sub + p_sub
    cv::cuda::GpuMat mean_I(scaleHeight, scaleWidth, CV_32FC4);     // box of I_sub + p_sub: mean_Irgb + mean_p
    cv::cuda::GpuMat mean_Ipv(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat var_I_rxv(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat var_I_gbxfv(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat mean_Ip(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat var_I_rx(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat var_I_gbxf(scaleHeight, scaleWidth, CV_32FC3);
    cv::cuda::GpuMat meanv(scaleHeight, scaleWidth, CV_32FC4);
    cv::cuda::GpuMat means(scaleHeight, scaleWidth, CV_32FC4);
    cv::cuda::GpuMat mean(scaleHeight, scaleWidth, CV_32FC4);
    cv::cuda::GpuMat resultIP(height, width, CV_8UC4);
    cv::cuda::GpuMat resultP(height, width, CV_8UC1);
    const char imgPathI[] = "D:\\下載\\fast-guided-filter-code-v1\\img_feathering\\toy.bmp";
    const char imgPathP[] = "D:\\下載\\fast-guided-filter-code-v1\\img_feathering\\toy-mask.bmp";
    I = cv::imread(imgPathI, IMREAD_COLOR);
    P = cv::imread(imgPathP, IMREAD_GRAYSCALE);
#pragma endregion
#pragma region paramet
    dim3 block(32, 4);
    dim3 grid(divUp(width, block.x), divUp(height, block.y));
    dim3 grid2(divUp(scaleWidth, block.x), divUp(scaleHeight, block.y));
    // Create a box (blur) filter; OpenCV wraps NPP's nppiFilterBox_8u_C4R (not used below, the loop calls NPP directly).
    auto filter = cv::cuda::createBoxFilter(CV_8UC4, CV_8UC4, Size(3, 3));
    int softness = 21;
    float eps = 0.000001f;
    NppiSize oSizeROI;  // NPP blur ROI
    oSizeROI.width = scaleWidth;
    oSizeROI.height = scaleHeight;
    NppiSize oMaskSize;
    oMaskSize.height = softness;
    oMaskSize.width = softness;
    NppiPoint oAnchor;
    oAnchor.x = oMaskSize.width / 2;
    oAnchor.y = oMaskSize.height / 2;
    NppiPoint oSrcOffset = { 0, 0 };
    setNppStream(cudaStream);
#pragma endregion
    while (int key = cv::waitKey(1))
    {
        gpuI.upload(I);
        gpuP.upload(P);
        // Merge the color channels with the guide channel: much of their processing is identical,
        // and a four-channel image is faster and aligns easily to 32/64/128-byte boundaries.
        combin<<<grid, block, 0, cudaStream>>>(gpuI, gpuP, gpuKeying);
        // An advantage of the guided filter: it can be made nearly independent of image size.
        // Here we work at 1/8 scale.
        cv::cuda::resize(gpuKeying, gpuResize, cv::Size(scaleWidth, scaleHeight), 0, 0, cv::INTER_NEAREST, curentStream);
        // Compute the matrix terms rr, rg, rb / rg, gg, gb / rb, gb, bb.
        findMatrix<<<grid2, block, 0, cudaStream>>>(gpuResize, mean_Ipv, var_I_rxv, var_I_gbxfv);
        // Box-blur the downscaled source (mean_I and mean_p).
        nppiFilterBoxBorder_32f_C4R((Npp32f*)gpuResize.ptr<float4>(), gpuResize.step, oSizeROI, oSrcOffset,
            (Npp32f*)mean_I.ptr<float4>(), mean_I.step, oSizeROI, oMaskSize, oAnchor, NPP_BORDER_REPLICATE);
        // Box-blur the matrix terms.
        nppiFilterBoxBorder_32f_C3R((Npp32f*)mean_Ipv.ptr<float3>(), mean_Ipv.step, oSizeROI, oSrcOffset,
            (Npp32f*)mean_Ip.ptr<float3>(), mean_Ip.step, oSizeROI, oMaskSize, oAnchor, NPP_BORDER_REPLICATE);
        nppiFilterBoxBorder_32f_C3R((Npp32f*)var_I_rxv.ptr<float3>(), var_I_rxv.step, oSizeROI, oSrcOffset,
            (Npp32f*)var_I_rx.ptr<float3>(), var_I_rx.step, oSizeROI, oMaskSize, oAnchor, NPP_BORDER_REPLICATE);
        nppiFilterBoxBorder_32f_C3R((Npp32f*)var_I_gbxfv.ptr<float3>(), var_I_gbxfv.step, oSizeROI, oSrcOffset,
            (Npp32f*)var_I_gbxf.ptr<float3>(), var_I_gbxf.step, oSizeROI, oMaskSize, oAnchor, NPP_BORDER_REPLICATE);
        // Solve for the (a, b) coefficients.
        guidedFilter<<<grid2, block, 0, cudaStream>>>(mean_I, mean_Ip, var_I_rx, var_I_gbxf, meanv, eps);
        // Box-blur the (a, b) result.
        nppiFilterBoxBorder_32f_C4R((Npp32f*)meanv.ptr<float4>(), meanv.step, oSizeROI, oSrcOffset,
            (Npp32f*)means.ptr<float4>(), means.step, oSizeROI, oMaskSize, oAnchor, NPP_BORDER_REPLICATE);
        // Resize back to the original image size.
        cv::cuda::resize(means, mean, cv::Size(width, height), 0, 0, cv::INTER_LINEAR, curentStream);
        // Compute the final output q.
        guidedFilterResult<<<grid, block, 0, cudaStream>>>(gpuKeying, mean, resultIP, resultP);
        // Display the results.
        resultIP.download(cpuIP);
        resultP.download(cpuP);
        cv::imshow(windowNameIP, cpuIP);
        cv::imshow(windowNameP, cpuP);
    }
}
As before, here is the output of this implementation.
And MATLAB's output for comparison. Honestly, my naked eye can't tell them apart.
A few thoughts on the process and the detours along the way. The main difference from the MATLAB version is that MATLAB processes the color image and the guide image separately, even though most of their processing is identical. To speed things up, the CUDA version first merges the guide and color images into a single four-channel image and processes them together; everything else is essentially the same. I originally planned to use the box filter OpenCV provides, but its CUDA wrapper turned out to be incomplete (blurs over float3 images, for example, aren't wrapped), so I used the native CUDA library (NPP) implementation instead. I also ran a few experiments along the way. At first, chasing performance, I stored the merged image and all intermediate matrices as char-encoded floats; my initial assumption was that the downscaling would hurt quality most, and since the blur kernel is usually set fairly large (by the algorithm's design, the larger the kernel, the stronger the extracted edges), a large kernel over float data easily fills up shared local memory, and once that's full the optimization gains are gone. But the results came out completely wrong, with semi-transparent blur values all around the edges, most likely because the covariance and variance terms are small differences of products that an 8-bit encoding simply can't represent. Switching the intermediate matrices to float made things much better, though some noise remained at the edges; after switching the merged image to float as well, the result became visually identical to MATLAB's. Happily, the downscale ratio turns out not to affect the result: shrinking the source 2x or 8x looks the same. At 8x, a GTX 1070 finishes the whole algorithm in under a millisecond at roughly 10% utilization. Partly that's because the image is small, but it's still seriously impressive.
On the 1070, then, the processing itself takes under a millisecond, but note that the transfers between VRAM and system memory take about another millisecond, and that's at this small resolution. At 1080p the processing still runs in roughly a millisecond, while the upload and download grow to about five milliseconds. So whenever you consider GPGPU, account for this transfer time. It also means the result is a great fit when it feeds straight into a rendering engine: you only pay for the one system-memory-to-VRAM hop, and if everything stays in VRAM the whole exchange is faster still.
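To see the split for yourself, bracket the uploads, the kernel pipeline, and the downloads with CUDA events. A sketch reusing the stream and images from the code above:

// Time uploads, kernels, and downloads separately with CUDA events.
cudaEvent_t start, afterUpload, afterKernels, afterDownload;
cudaEventCreate(&start);         cudaEventCreate(&afterUpload);
cudaEventCreate(&afterKernels);  cudaEventCreate(&afterDownload);

cudaEventRecord(start, cudaStream);
gpuI.upload(I, curentStream);                        // host -> device
gpuP.upload(P, curentStream);
cudaEventRecord(afterUpload, cudaStream);
// ... the combin / findMatrix / guidedFilter / guidedFilterResult pipeline from above ...
cudaEventRecord(afterKernels, cudaStream);
resultIP.download(cpuIP, curentStream);              // device -> host
resultP.download(cpuP, curentStream);
cudaEventRecord(afterDownload, cudaStream);
cudaEventSynchronize(afterDownload);

float upMs = 0.f, kernelMs = 0.f, downMs = 0.f;
cudaEventElapsedTime(&upMs, start, afterUpload);
cudaEventElapsedTime(&kernelMs, afterUpload, afterKernels);
cudaEventElapsedTime(&downMs, afterKernels, afterDownload);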
Using CUDA through OpenCV is genuinely convenient. First, GpuMat is a huge help: it stores the width, height, and pitch for us, and since almost every image-processing kernel needs exactly those three parameters, kernels become much nicer to write. Second, OpenCV supplies the basic image plumbing (loading, display, common processing), which raises productivity a lot.
Finally, a GPU has many streaming multiprocessors and excels at parallel work, so many techniques we use on the CPU change shape on the GPU. Let's briefly analyze some algorithms in OpenCV's CUDA module to get a feel for GPU programming patterns and optimizations.
The first is the common reduction (aggregation) pattern: the maximum of a list, the minimum, the sum of a list, and so on.
On the CPU this is typically one for loop that scans for, say, the maximum. On the GPU, the data is split across groups and reduced by repeatedly folding it in half; and once the active width drops to 32, notice that __syncthreads is no longer used to synchronize the block. The reason is that CUDA's smallest execution unit is the warp, and a warp, a group of 32 threads, executes together; within one warp the lockstep execution acts as a natural __syncthreads. A sketch of the pattern is below.
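A minimal block-sum version of that halving pattern, assuming 256 threads per block (the kernel name is hypothetical; the final-warp part is the classic pre-Volta idiom, and on Volta and later you should add __syncwarp() or use __shfl_down_sync instead):

// Each block reduces 256 elements to one partial sum; the caller reduces the partials again.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ volatile float smem[256];   // volatile: the final warp relies on uncached shared reads
    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 32; s >>= 1)  // fold in half until one warp remains
    {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid < 32)   // final warp: the classic pattern drops __syncthreads and relies on lockstep
    {
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid + 8];
        smem[tid] += smem[tid + 4];
        smem[tid] += smem[tid + 2];
        smem[tid] += smem[tid + 1];
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
}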
Next is a simple piece of code worth showing because the OpenCV/CUDA internals rely on it: your grid/block determines a total thread count N, and each thread then loops over the data in strides of N, so a fixed launch covers any data length. That said, for image processing we usually derive the grid/block from the image's width and height, so the outer loop isn't needed. The pattern, commonly called a grid-stride loop, is sketched below.
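A minimal sketch under the assumptions above (hypothetical kernel and data):

// Grid-stride loop: a fixed launch covers an array of any length n.
__global__ void scaleArray(float* data, int n, float k)
{
    const int stride = blockDim.x * gridDim.x;                 // total threads in the launch, N
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= k;                                          // each thread handles every N-th element
}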
There is also the shared local memory optimization; shared memory is much faster to access than global memory. Take a blur: each output pixel reads a neighborhood of surrounding pixels, so a naive kernel hammers global memory. Instead, each block can first stage the pixels it needs, sized by the blur radius, into shared memory, then read from shared memory while computing. If you're interested, look at implementations such as linearRowFilter in opencv_cudafilters. A simplified sketch of the staging idea follows.
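This is only a sketch of the idea, not OpenCV's actual implementation: a 1D row pass with hypothetical names, a block width fixed at 128, and borders reduced to simple clamping:

#define RADIUS 3
#define BLOCK_W 128

// Each block stages BLOCK_W + 2*RADIUS pixels of one row into shared memory,
// then every thread averages its (2*RADIUS+1) neighborhood from the staged tile.
// Launch: rowBoxBlur<<<dim3((width + BLOCK_W - 1) / BLOCK_W, height), dim3(BLOCK_W)>>>(src, dst, width);
__global__ void rowBoxBlur(const float* src, float* dst, int width)
{
    __shared__ float tile[BLOCK_W + 2 * RADIUS];
    const int x = blockIdx.x * BLOCK_W + threadIdx.x;
    const int row = blockIdx.y;
    const float* srcRow = src + row * width;

    // Stage the center pixel (clamped to the row).
    tile[threadIdx.x + RADIUS] = srcRow[min(max(x, 0), width - 1)];
    // The first RADIUS threads also stage the left and right aprons.
    if (threadIdx.x < RADIUS)
    {
        tile[threadIdx.x] = srcRow[max(x - RADIUS, 0)];
        tile[threadIdx.x + RADIUS + BLOCK_W] = srcRow[min(x + BLOCK_W, width - 1)];
    }
    __syncthreads();

    if (x < width)
    {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)   // all reads now hit shared memory
            sum += tile[threadIdx.x + RADIUS + k];
        dst[row * width + x] = sum / (2 * RADIUS + 1);
    }
}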
A few caveats. GpuMat allocates its storage with cudaMallocPitch, and 2D memory allocated this way guarantees that each row's first element is aligned to a multiple of 256 or 512 bytes, which improves access efficiency; but it also means the last element of one row and the first element of the next may not be contiguous, something to keep in mind when doing pointer arithmetic. For example, the step of a 960x540 CV_8UC4 image is not 960*4 = 3840 but 4096; at 1920x1080, CV_8UC3 gives 6144 (not 1920*3 = 5760), while CV_8UC4 gives 7680, which happens to equal 1920*4. To get the values contiguously in host memory, allocate with cudaHostAlloc and copy the pitched rows across with cudaMemcpy2D, which handles differing source and destination pitches. As for the error "an illegal memory access was encountered": in general it means a kernel indexed outside its data, usually a miscomputed index or a grid/block sized without thought for the edges.
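A minimal sketch of that pitched round trip (the 960x540 CV_8UC4 size is just the example above):

// Copy a pitched device image back to tightly packed host memory.
int width = 960, height = 540;        // CV_8UC4-style image, 4 bytes per pixel
size_t pitch = 0;
uchar4* dImg = nullptr;
cudaMallocPitch((void**)&dImg, &pitch, width * sizeof(uchar4), height);  // pitch comes back as 4096 here, not 3840

uchar4* hImg = nullptr;               // pinned, tightly packed host buffer
cudaHostAlloc((void**)&hImg, width * height * sizeof(uchar4), cudaHostAllocDefault);

// cudaMemcpy2D copies row by row, so the source and destination pitches may differ.
cudaMemcpy2D(hImg, width * sizeof(uchar4),        // dst, dst pitch (packed)
             dImg, pitch,                         // src, src pitch (aligned)
             width * sizeof(uchar4), height,      // row width in bytes, number of rows
             cudaMemcpyDeviceToHost);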
CUDA also has an excellent debugger, Nsight. For errors like the illegal memory access above, it makes it easy to see exactly which block and thread index went wrong, whether the corresponding data looks sane, and the offending code; inspecting data feels much like ordinary Visual Studio debugging. Get familiar with this tool and you'll get twice the results for half the effort.