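Programming is not only about making a program work correctly in every possible situation; runtime efficiency matters too. The previous section covered optimizing reads and writes. This section analyzes how to optimize computation.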
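Writing an efficient program requires two things: choosing suitable algorithms and data structures, and writing source code that the compiler can optimize effectively.
The first point is usually what people think of first when writing a program, while the second is often overlooked. As far as code optimization goes, this section mainly discusses how to write source code that the compiler can optimize effectively, and for that it is important to understand both the capabilities and the limitations of optimizing compilers.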
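Apart from the difference between reads/writes and computation, the biggest difference from the previous section is that the optimization example here hurts the readability of the program.
This is a situation that comes up regularly in programming. When no better optimization is available but the program urgently needs to run faster, trading space for time, or trading readability for speed, is a legitimate choice.
When you write a small tool for a one-off task (one you may never reuse), or want to check whether an idea is feasible (for example, testing whether an algorithm is correct), a readable but very slow program often wastes a lot of time unnecessarily. In such cases you can care less about readability and focus on runtime performance so that you get the result sooner.
Below we optimize a common matrix operation as an example.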
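Requirements of the smoothing operation:
Schematic:
Positions 1, 2 and 3 show the neighboring pixels of a corner pixel, an edge pixel and an interior pixel, respectively.
We use the following struct to represent one pixel of an image: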
typedef struct {
    unsigned short red;   /* R value */
    unsigned short green; /* G value */
    unsigned short blue;  /* B value */
} pixel;
red, green and blue are the red, green and blue channels of a color image.
The original smoothing function is as follows:
/* pixel_sum, initialize_pixel_sum, RIDX, min and max are used below
   but are not shown in this excerpt. */

/* Add one pixel's channels to the running sum and count it. */
static void accumulate_sum(pixel_sum *sum, pixel p)
{
    sum->red += (int) p.red;
    sum->green += (int) p.green;
    sum->blue += (int) p.blue;
    sum->num++;
}

/* Store the per-channel average of the accumulated pixels. */
static void assign_sum_to_pixel(pixel *current_pixel, pixel_sum sum)
{
    current_pixel->red = (unsigned short) (sum.red/sum.num);
    current_pixel->green = (unsigned short) (sum.green/sum.num);
    current_pixel->blue = (unsigned short) (sum.blue/sum.num);
}

/* Average of pixel (i, j) and its neighbors, clamped to the image bounds. */
static pixel avg(int dim, int i, int j, pixel *src)
{
    int ii, jj;
    pixel_sum sum;
    pixel current_pixel;

    initialize_pixel_sum(&sum);
    for (ii = max(i-1, 0); ii <= min(i+1, dim-1); ii++)
        for (jj = max(j-1, 0); jj <= min(j+1, dim-1); jj++)
            accumulate_sum(&sum, src[RIDX(ii, jj, dim)]);

    assign_sum_to_pixel(&current_pixel, sum);
    return current_pixel;
}

void naive_smooth(int dim, pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}
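The listing relies on a few helpers that this excerpt does not show. Here is a minimal sketch of what they might look like, inferred only from how they are used above. The field names follow accumulate_sum; the exact definitions in the original project are an assumption.

```c
/* Assumed accumulator type: per-channel sums plus a pixel count,
   matching the fields that accumulate_sum touches. */
typedef struct {
    int red;
    int green;
    int blue;
    int num;
} pixel_sum;

/* Assumed initializer: reset all sums and the count to zero. */
static void initialize_pixel_sum(pixel_sum *sum)
{
    sum->red = 0;
    sum->green = 0;
    sum->blue = 0;
    sum->num = 0;
}

/* Assumed integer min/max helpers used to clamp the neighborhood. */
static int min(int a, int b) { return a < b ? a : b; }
static int max(int a, int b) { return a > b ? a : b; }
```

Using int for the sums leaves plenty of headroom: nine unsigned short values add up to at most 9 * 65535, far below the int range on common platforms.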
The image is a square stored as a one-dimensional array. Pixel (i, j) is I[RIDX(i,j,n)], where n is the side length of the image.
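As an illustration of that indexing, here is a small sketch. The macro body is an assumption: plain row-major order, which matches the i*dim + j arithmetic in the optimized code later on. cell_at is a hypothetical helper, and a plain int image stands in for the pixel array.

```c
/* Assumed definition: row-major index of element (i, j) in an n x n image. */
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Hypothetical helper: read cell (i, j) of a square image stored in 1-D. */
static int cell_at(const int *img, int n, int i, int j)
{
    return img[RIDX(i, j, n)];
}
```

With n = 3, for example, pixel (1, 2) sits at offset 1*3 + 2 = 5 in the array.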
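Parameters:
We have a driver.c file that tests both the original function and our optimized function and reports CPE (cycles per element), a measure of the program's runtime performance.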
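To make the metric concrete, the sketch below treats CPE as total cycles divided by the number of pixels processed. This is an assumption about the definition; how driver.c actually measures cycles is not shown here, and cycles_per_element is a hypothetical helper.

```c
/* Hypothetical helper: cycles per element for a dim x dim image,
   assuming every pixel counts as one element. */
static double cycles_per_element(double cycles, int dim)
{
    return cycles / ((double) dim * (double) dim);
}
```

Under this definition, 1024 cycles spent on a 16 x 16 image gives a CPE of 4. Lower is better.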
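Our task is to write the optimized code, run it alongside the original code, and compare the reported numbers to see how much the optimization helps.
The loop body of the original function consists of a single statement. That statement does a large amount of averaging work and goes through several layers of function calls, so many stack frames are set up and torn down at run time.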
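Analysis shows that the optimization in this section is more direct than the matrix read/write optimization in the previous one. The program currently has two main performance bottlenecks: multiple layers of function calls and repeated computation.
The optimization in this section addresses these two points.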
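The multiple layers of function calls are easy to deal with: move the work of the called functions into the smoothing function itself. (The original code lowered coupling, but at the cost of performance.)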
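A minimal sketch of that first step, assuming the pixel struct shown earlier: the helper calls are folded into a single function body and nothing else changes. smooth_inlined is a hypothetical name. This is not the author's final version, which also removes repeated computation.

```c
typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

/* Same result as the naive version, but without calls to avg,
   accumulate_sum, assign_sum_to_pixel, min or max. */
static void smooth_inlined(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++) {
        /* Clamp the row range once per row. */
        int i0 = (i > 0) ? i - 1 : 0;
        int i1 = (i < dim - 1) ? i + 1 : dim - 1;
        for (int j = 0; j < dim; j++) {
            int j0 = (j > 0) ? j - 1 : 0;
            int j1 = (j < dim - 1) ? j + 1 : dim - 1;
            int r = 0, g = 0, b = 0, num = 0;

            /* Accumulate the clamped neighborhood directly. */
            for (int ii = i0; ii <= i1; ii++) {
                const pixel *row = src + ii * dim;
                for (int jj = j0; jj <= j1; jj++) {
                    r += row[jj].red;
                    g += row[jj].green;
                    b += row[jj].blue;
                    num++;
                }
            }

            pixel *out = &dst[i * dim + j];
            out->red = (unsigned short) (r / num);
            out->green = (unsigned short) (g / num);
            out->blue = (unsigned short) (b / num);
        }
    }
}
```

An optimizing compiler may inline small static helpers on its own, so the gain from doing this by hand depends on the compiler and flags in use.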
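The rest of this section analyzes the repeated computation, as shown in the figure:
When the averages of the red region and the yellow region are computed, two rows are summed twice. The optimization is to compute sums over small 1×3 groups first. Each interior average then only needs three already-known sums added together and divided by 9, which cuts down the amount of arithmetic.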
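The 1×3 grouping can be sketched on a single channel as follows. This is an illustration under simplifying assumptions: a plain int image instead of the pixel struct, interior pixels only, and hypothetical names row_sums and interior_avg.

```c
/* hsum has n rows and (n - 2) columns, stored row-major:
   hsum[i][j] = img[i][j] + img[i][j+1] + img[i][j+2]. */
static void row_sums(int n, const int *img, int *hsum)
{
    int w = n - 2;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < w; j++)
            hsum[i * w + j] = img[i * n + j]
                            + img[i * n + j + 1]
                            + img[i * n + j + 2];
}

/* Average of the 3x3 block centered at interior pixel (i, j),
   1 <= i, j <= n - 2, built from three precomputed row sums. */
static int interior_avg(int n, const int *hsum, int i, int j)
{
    int w = n - 2;
    return (hsum[(i - 1) * w + (j - 1)]
          + hsum[i * w + (j - 1)]
          + hsum[(i + 1) * w + (j - 1)]) / 9;
}
```

For a 4 x 4 image holding the values 0 to 15 in row-major order, the row sums feeding pixel (1, 1) are 3, 15 and 27, so its average is 45 / 9 = 5. Each row sum is computed once and then reused by up to three rows of output pixels.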
The corresponding optimized code is as follows:
/* Per-channel 1x3 row sums: Xsum[i][j] covers row i, columns j..j+2.
   The fixed array size limits dim to 4096. */
int rsum[4096][4096];
int gsum[4096][4096];
int bsum[4096][4096];

void smooth(int dim, pixel *src, pixel *dst)
{
    int dim2 = dim * dim;

    /* Precompute the 1x3 row sums */
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim-2; j++) {
            int z = i*dim;
            rsum[i][j] = 0;
            gsum[i][j] = 0;
            bsum[i][j] = 0;
            for (int k = j; k < j + 3; k++) {
                rsum[i][j] += src[z+k].red;
                gsum[i][j] += src[z+k].green;
                bsum[i][j] += src[z+k].blue;
            }
        }
    }

    // Four corners (4 neighbors each)
    dst[0].red = (src[0].red + src[1].red + src[dim].red + src[dim+1].red) / 4;
    dst[0].green = (src[0].green + src[1].green + src[dim].green + src[dim+1].green) / 4;
    dst[0].blue = (src[0].blue + src[1].blue + src[dim].blue + src[dim+1].blue) / 4;

    dst[dim-1].red = (src[dim-2].red + src[dim-1].red + src[dim+dim-2].red + src[dim+dim-1].red) / 4;
    dst[dim-1].green = (src[dim-2].green + src[dim-1].green + src[dim+dim-2].green + src[dim+dim-1].green) / 4;
    dst[dim-1].blue = (src[dim-2].blue + src[dim-1].blue + src[dim+dim-2].blue + src[dim+dim-1].blue) / 4;

    dst[dim2-dim].red = (src[dim2-dim-dim].red + src[dim2-dim-dim+1].red + src[dim2-dim].red + src[dim2-dim+1].red) / 4;
    dst[dim2-dim].green = (src[dim2-dim-dim].green + src[dim2-dim-dim+1].green + src[dim2-dim].green + src[dim2-dim+1].green) / 4;
    dst[dim2-dim].blue = (src[dim2-dim-dim].blue + src[dim2-dim-dim+1].blue + src[dim2-dim].blue + src[dim2-dim+1].blue) / 4;

    dst[dim2-1].red = (src[dim2-dim-2].red + src[dim2-dim-1].red + src[dim2-2].red + src[dim2-1].red) / 4;
    dst[dim2-1].green = (src[dim2-dim-2].green + src[dim2-dim-1].green + src[dim2-2].green + src[dim2-1].green) / 4;
    dst[dim2-1].blue = (src[dim2-dim-2].blue + src[dim2-dim-1].blue + src[dim2-2].blue + src[dim2-1].blue) / 4;

    // Four edges (6 neighbors each)
    /* Top edge (row 0): two row sums */
    for (int j = 1; j < dim-1; j++) {
        dst[j].red = (rsum[0][j-1]+rsum[1][j-1]) / 6;
        dst[j].green = (gsum[0][j-1]+gsum[1][j-1]) / 6;
        dst[j].blue = (bsum[0][j-1]+bsum[1][j-1]) / 6;
    }
    /* Left edge (column 0): columns 0 and 1 of rows i-1, i, i+1 */
    for (int i = 1; i < dim-1; i++) {
        int a = (i-1)*dim, b = (i-1)*dim+1, c = i*dim, d = i*dim+1, e = (i+1)*dim, f = (i+1)*dim+1;
        dst[c].red = (src[a].red + src[b].red + src[c].red + src[d].red + src[e].red + src[f].red) / 6;
        dst[c].green = (src[a].green + src[b].green + src[c].green + src[d].green + src[e].green + src[f].green) / 6;
        dst[c].blue = (src[a].blue + src[b].blue + src[c].blue + src[d].blue + src[e].blue + src[f].blue) / 6;
    }
    /* Right edge (column dim-1): columns dim-2 and dim-1 of rows i-1, i, i+1 */
    for (int i = 1; i < dim-1; i++) {
        int a = i*dim-2, b = i*dim-1, c = (i+1)*dim-2, d = (i+1)*dim-1, e = (i+2)*dim-2, f = (i+2)*dim-1;
        dst[d].red = (src[a].red + src[b].red + src[c].red + src[d].red + src[e].red + src[f].red) / 6;
        dst[d].green = (src[a].green + src[b].green + src[c].green + src[d].green + src[e].green + src[f].green) / 6;
        dst[d].blue = (src[a].blue + src[b].blue + src[c].blue + src[d].blue + src[e].blue + src[f].blue) / 6;
    }
    /* Bottom edge (row dim-1): two row sums */
    for (int j = 1; j < dim-1; j++) {
        dst[dim2-dim+j].red = (rsum[dim-1][j-1]+rsum[dim-2][j-1]) / 6;
        dst[dim2-dim+j].green = (gsum[dim-1][j-1]+gsum[dim-2][j-1]) / 6;
        dst[dim2-dim+j].blue = (bsum[dim-1][j-1]+bsum[dim-2][j-1]) / 6;
    }

    // Interior (9 neighbors each): three row sums
    for (int i = 1; i < dim-1; i++) {
        int k = i*dim;
        for (int j = 1; j < dim-1; j++) {
            dst[k+j].red = (rsum[i-1][j-1]+rsum[i][j-1]+rsum[i+1][j-1]) / 9;
            dst[k+j].green = (gsum[i-1][j-1]+gsum[i][j-1]+gsum[i+1][j-1]) / 9;
            dst[k+j].blue = (bsum[i-1][j-1]+bsum[i][j-1]+bsum[i+1][j-1]) / 9;
        }
    }
}
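The measured performance is as follows:
The original function has a speedup of 10.5, and after optimization the speedup rises to 24.5. Some readability is lost, but we get the runtime efficiency we were after.
The optimization reduces the repeated computation to some extent but does not eliminate it completely. If you know a better approach, feel free to share it.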