SSE圖像算法優化系列十二:多尺度的圖像細節提高。

  無心中瀏覽一篇文章,中間提到了基於多尺度的圖像的細節提高算法,嘗試了一下,仍是有必定的效果的,結合最近一直研究的SSE優化,把算法的步驟和優化過程分享給你們。html

  論文的全名是DARK IMAGE ENHANCEMENT BASED ON PAIRWISE TARGET CONTRAST AND MULTI-SCALE DETAIL BOOSTING,好像在百度上搜索不到,因爲博客的空間很少了,這裏就不上傳了, 我貼出論文核心的字段。算法

   

  論文的核心思想相似於Retinex,使用了三個尺度的高斯模糊,再和原圖作減法,得到不一樣程度的細節信息,而後經過必定的組合方式把這些細節信息融合到原圖中,從而獲得增強原圖信息的能力。less

  值得一提的就是對D1的係數作了特殊的處理,這個是值得學習的。ide

  這個算法的編碼實在是簡單,一個簡單的C語言代碼以下:函數

int IM_MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius)
{
    int Channel = Stride / Width;
    if ((Src == NULL) || (Dest == NULL))                                return IM_STATUS_NULLREFRENCE;
    if ((Width <= 0) || (Height <= 0))                                    return IM_STATUS_INVALIDPARAMETER;
    if ((Channel != 1) && (Channel != 3) && (Channel != 4))                return IM_STATUS_INVALIDPARAMETER;

    int Status = IM_STATUS_OK;

    unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    若是最後的大圖在這裏內存有問題,一種解決辦法就是下面的模糊用BoxBlur
    unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    而後不要調用現有的BoxBlur函數,而是直接在本函數實現三種不一樣半徑的模糊
    unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    運算速度(由於有些循環是公共的)會有點點提升,額外的內存佔用則能夠忽略

    if ((B1 == NULL) || (B2 == NULL) || (B3 == NULL))
    {
        if (B1 != NULL) free(B1);
        if (B2 != NULL) free(B2);
        if (B3 != NULL) free(B3);
        return IM_STATUS_OUTOFMEMORY;
    }

    Status = IM_ExpBlur(Src, B1, Width, Height, Stride, Radius);                        
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    Status = IM_ExpBlur(Src, B2, Width, Height, Stride, Radius * 2);
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    Status = IM_ExpBlur(Src, B3, Width, Height, Stride, Radius * 4);
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    for (int Y = 0; Y < Height * Stride; Y++)
    {
        int DiffB1 = Src[Y] - B1[Y];
        int DiffB2 = B1[Y] - B2[Y];
        int DiffB3 = B2[Y] - B3[Y];
        Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);
    }

FreeMemory:
    free(B1);
    free(B2);
    free(B3);
    return Status;
}

  爲避免浮點計算,咱們將W一、W二、W3放大四倍,累加玩後在除以4就能夠了,整個的書寫過程就是按照公式(13)進行的。post

  其中IM_Sign函數定義以下:學習

inline int IM_Sign(int X)
{
    return (X >> 31) | (unsigned(-X)) >> 31;        
}

 

               處理前                                      處理後(半徑5)測試

  處理效果由上面兩幅圖的比較來講,仍是至關明顯的。優化

  上面的代碼中我用的ExpBlur代替了高斯模糊,關於指數模糊能夠參考:SSE圖像算法優化系列五:超高速指數模糊算法的實現和優化(10000*10000在100ms左右實現) 一文,他的效果和高斯模糊差很少,速度要快很多。編碼

  當指數模糊使用SSE優化後,剩下的代碼用純C實現,對於1080P的24位彩色圖,測試時間爲73毫秒,若是除去3次指數模糊,純C部分代碼耗時約40 ms,因此有很大的優化空間,咱們就用SSE來處理。

  第一,咱們須要來處理IM_Sign這個函數,這個函數當參數大於0時,返回1,參數小於0時,返回-1,參數等於0時,返回0,SSE沒有直接這樣的函數,幸虧以前收集過一個這個的自定義函數,寫的也很巧妙:

inline __m128i _mm_sgn_epi16(__m128i v)
{
#ifdef __SSSE3__
    v = _mm_sign_epi16(_mm_set1_epi16(1), v); // use PSIGNW on SSSE3 and later
#else
    v = _mm_min_epi16(v, _mm_set1_epi16(1));  // use PMINSW/PMAXSW on SSE2/SSE3.
    v = _mm_max_epi16(v, _mm_set1_epi16(-1));
    //_mm_set1_epi16(1) = _mm_srli_epi16(_mm_cmpeq_epi16(v, v), 15);
    //_mm_set1_epi16(-1) = _mm_cmpeq_epi16(v, v);

#endif
    return v;
}

  如上所示,當系統只支持SSE2以及其下的版本時,直接用_mm_min_epi16和_mm_max_epi16這樣的函數硬實現,這個邏輯也很好理解。

  若是系統支持SSE3及其以上的版本,系統提供了_mm_sign_epi16這個函數,關於這個函數其做用解釋以下:

//  extern __m128i _mm_sign_epi16 (__m128i a, __m128i b); 
//  Negate packed words in a if corresponding sign in b is less than zero. 
//  Interpreting a, b, and r as arrays of signed 16-bit integers: 
for (i = 0; i < 8; i++)
{
    if (b[i] < 0)
    {
        r[i] = -a[i];
    }
    else if (b[i] == 0)
    {
        r[i] = 0;
    }
    else
    {
        r[i] = a[i];
    }
}

  若是參數a傳值爲1,不就能實現IM_Sign這個效果了嗎,真好玩。

  解決了IM_Sign這個函數,其餘的部分都很簡單了,考慮到數據範圍,把字節數據擴展爲signed short類型處理就能夠了,這樣一次性能夠處理8個字節的數據,修改後的代碼以下所示:

int IM_MultiScaleSharpen(unsigned char *Src, unsigned char *Dest, int Width, int Height, int Stride, int Radius)
{
    int Channel = Stride / Width;
    if ((Src == NULL) || (Dest == NULL))                                return IM_STATUS_NULLREFRENCE;
    if ((Width <= 0) || (Height <= 0))                                    return IM_STATUS_INVALIDPARAMETER;
    if ((Channel != 1) && (Channel != 3) && (Channel != 4))                return IM_STATUS_INVALIDPARAMETER;
    int Status = IM_STATUS_OK;
    unsigned char *B1 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    若是最後的大圖在這裏內存有問題,一種解決辦法就是下面的模糊用BoxBlur
    unsigned char *B2 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    而後不要調用現有的BoxBlur函數,而是直接在本函數實現三種不一樣半徑的模糊
    unsigned char *B3 = (unsigned char *)malloc(Height * Stride * sizeof(unsigned char));        //    運算速度(由於有些循環是公共的)會有點點提升,額外的內存佔用則能夠忽略
    if ((B1 == NULL) || (B2 == NULL) || (B3 == NULL))
    {
        if (B1 != NULL) free(B1);
        if (B2 != NULL) free(B2);
        if (B3 != NULL) free(B3);
        return IM_STATUS_OUTOFMEMORY;
    }
    Status = IM_ExpBlur(Src, B1, Width, Height, Stride, Radius);                        
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    Status = IM_ExpBlur(Src, B2, Width, Height, Stride, Radius * 2);
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    Status = IM_ExpBlur(Src, B3, Width, Height, Stride, Radius * 4);
    if (Status != IM_STATUS_OK)    goto FreeMemory;
    int BlockSize = 8, Block = (Height * Stride) / BlockSize;
    __m128i Zero = _mm_setzero_si128();
    __m128i Four = _mm_set1_epi16(4);
    for (int Y = 0; Y < Block * BlockSize; Y += BlockSize)
    {
        __m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero);
        __m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero);
        __m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero);
        __m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero);
        __m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1);
        __m128i DiffB2 = _mm_sub_epi16(SrcB1, SrcB2);
        __m128i DiffB3 = _mm_sub_epi16(SrcB2, SrcB3);
        __m128i Offset = _mm_srai_epi16(_mm_add_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_sub_epi16(Four, _mm_slli_epi16(_mm_sgn_epi16(DiffB1), 1)), DiffB1), _mm_slli_epi16(DiffB2, 1)), DiffB3), 2);
        _mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero));
    }
    for (int Y = Block * BlockSize; Y < Height * Stride; Y++)
    {
        int DiffB1 = Src[Y] - B1[Y];
        int DiffB2 = B1[Y] - B2[Y];
        int DiffB3 = B2[Y] - B3[Y];
        Dest[Y] = IM_ClampToByte(((4 - 2 * IM_Sign(DiffB1)) * DiffB1 + 2 * DiffB2 + DiffB3) / 4 + Src[Y]);
    }
FreeMemory:
    free(B1);
    free(B2);
    free(B3);
    return Status;
}

  基本上就是按照C語言或者公式(13)所示的流程一步一步的編寫,只不過把有些乘法變成了移位。

   對於1080P的彩色圖像,上述改動後處理時間變爲了35ms,純C語言部分的耗時約在11ms左右,同以前的相比速度提升了4倍多,提速仍是至關的明顯的。

  在仔細觀察,以爲在IM_Sign這個的處理上仍是有問題,雖然上述代碼能完美解決問題,可是總以爲有改進空間,當咱們把IM_Sign(DiffB1) * DiffB1放在一塊兒觀察時,就會發現這個總體不是能夠直接使用_mm_sign_epi16予以實現嗎,好比 _mm_sign_epi16(DiffB1, DiffB1) 難道不是嗎? 這樣就少了一次乘法。

  最後,咱們把DiffB2, DiffB3展開,合併掉一些同類項,而後把乘以相同係數的變量在合併,又能夠優化掉一些計算,最終的SSE部分代碼以下:

    int BlockSize = 8, Block = (Height * Stride) / BlockSize;
    __m128i Zero = _mm_setzero_si128();
    for (int Y = 0; Y < Block * BlockSize; Y += BlockSize)
    {
        __m128i SrcV = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(Src + Y)), Zero);
        __m128i SrcB1 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B1 + Y)), Zero);
        __m128i SrcB2 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B2 + Y)), Zero);
        __m128i SrcB3 = _mm_unpacklo_epi8(_mm_loadl_epi64((__m128i *)(B3 + Y)), Zero);
        __m128i DiffB1 = _mm_sub_epi16(SrcV, SrcB1);
        __m128i Offset = _mm_add_epi16(_mm_srai_epi16(_mm_sub_epi16(_mm_slli_epi16(_mm_sub_epi16(SrcB1, _mm_sign_epi16(DiffB1, DiffB1)), 1), _mm_add_epi16(SrcB2, SrcB3)), 2), DiffB1);
        _mm_storel_epi64((__m128i *)(Dest + Y), _mm_packus_epi16(_mm_add_epi16(SrcV, Offset), Zero));
    }

  雖然測試代表,速度沒有提升多少(主要是計算量太少),但這樣寫明顯合理不少。

  測試工程的地址:http://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar

 

相關文章
相關標籤/搜索