ARM NEON編程初探——一個簡單的BGR888轉YUV444實例詳解

原文請猛戳這裏app

敲黑板劃重點——順求異構計算/高性能計算/CUDA/ARM優化類開發職位ide


最近在學習ARM的SIMD指令集NEON,發現這方面的資料真是太少了,我便來給NEON湊湊人氣,姑且以這篇入門文章來分享一些心得吧。函數

學習一門新技術,老是有一些經典是繞不開的,對於NEON來講,這份必備的武林祕籍天然就是ARM官方的《NEON Programmer's Guide》(如下簡稱Guide)啦。別看這份Guide有四百多頁,其實只有一百來頁是正文,後面都是供查閱的手冊,通讀一番仍是不難的。因此這裏我也就不打算把Guide裏的內容翻譯過來敷衍了事了。在此我想借一個簡單例子,展現我是如何把一個沒采用NEON的普通程序改寫爲NEON程序、中間又是如何debug、如何調優的。固然,做爲一枚ARM小白,我接觸NEON指令集畢竟也才兩週左右時間,錯誤在所不免,還請各位方家多多指正。oop

BGR888ToYUV444

在衆多並行操做模式(Map, Reduce, Scatter, Stencil等)中,最簡單的也許要算Map了,因此我選了一個Map的例子——BGR888轉YUV444。這二者都是顏色空間的格式:前者表示一個像素有B、G、R三個顏色通道,每一個通道佔8-bit,在內存中按照B G R的順序從高位到低位排列;後者表示一個像素有Y、U、V三個通道,每一個通道也是8-bit(444僅指Y、U、V的採樣率比值爲4:4:4,其餘類型的採樣率還有YUV42二、YUV420),咱們也假設它在內存中按照V U Y的順序從高位到低位排列。如何把BGR888格式轉成YUV444呢?根據wiki上的轉換公式,咱們能夠寫出以下代碼(很顯然,這是一一對應,典型的Map模式):性能

void BGR888ToYUV444(unsigned char *yuv, unsigned char *bgr, int pixel_num)
{
    int i;
    for (i = 0; i < pixel_num; ++i) {
        uint8_t r = bgr[i * 3];
        uint8_t g = bgr[i * 3 + 1];
        uint8_t b = bgr[i * 3 + 2];

        uint8_t y = 0.299 * r + 0.587 * g + 0.114 * b;
        uint8_t u = -0.169 * r - 0.331 * g + 0.5 * b + 128;
        uint8_t v = 0.5 * r - 0.419 * g - 0.081 * b + 128;

        yuv[i * 3] = y;
        yuv[i * 3 + 1] = u;
        yuv[i * 3 + 2] = v;
    }
}

這段代碼的意思很簡單,咱們遍歷全部像素,每次把B、G、R三個通道的值從內存中加載進來,再作浮點數乘法和加減法,獲得Y、U、V的值,寫入相應的內存中。那麼,使用NEON能夠怎樣幫助這段程序跑得更快呢?學習

前面提到,NEON是一種SIMD(Single Instruction Multiple Data)指令,也就是說,NEON能夠把若干源(source)操做數(operand)打包放到一個源寄存器中,對他們執行相同的操做,產生若干目的(dest)操做數,這種方式也叫向量化(vectorization)。這樣的話,能打包多少數據同時作運算就取決於寄存器位寬,在ARMv7的NEON unit中,register file總大小是1024-bit,能夠劃分爲16個128-bit的Q-register(Quadword register)或者32個64-bit的D-register(Dualword register),也就是說,最長的寄存器位寬是128-bit(詳見Guide第一章)。以上面的R888ToYUV444函數爲例,假設咱們採用32-bit單精度浮點數float來作浮點運算,那麼咱們能夠 把最多128/32=4個浮點數打包放到Q-register中作SIMD運算,一次拿4個BGR算出4個YUV,從而提升吞吐量,減小loop次數。優化

(細心的看官可能會問到雙精度浮點數double的運算吧?遺憾的是,根據Guide,NEON並不支持double,你能夠考慮使用VFP/Vector Floating Point,但VFP並不是SIMD單元)。ui

從浮點運算到整型運算

那麼,咱們還能夠繼續提升向量化程度嗎?若是咱們回頭看wiki,咱們會發如今早期不支持浮點操做的SIMD處理器中,使用了以下整型運算來把BGR轉成YUV:scala

// Refer to: https://en.wikipedia.org/wiki/YUV (Full swing for BT.601)
void BGR888ToYUV444(unsigned char *yuv, unsigned char *bgr, int pixel_num)
{
    int i;
    for (i = 0; i < pixel_num; ++i) {
        uint8_t r = bgr[i * 3];
        uint8_t g = bgr[i * 3 + 1];
        uint8_t b = bgr[i * 3 + 2];

        // 1. Multiply transform matrix (Y′: unsigned, U/V: signed)
        uint16_t y_tmp = 76 * r + 150 * g + 29 * b;
        int16_t u_tmp = -43 * r - 84 * g + 127 * b;
        int16_t v_tmp = 127 * r - 106 * g - 21 * b;

        // 2. Scale down (">>8") to 8-bit values with rounding ("+128") (Y′: unsigned, U/V: signed)
        y_tmp = (y_tmp + 128) >> 8;
        u_tmp = (u_tmp + 128) >> 8;
        v_tmp = (v_tmp + 128) >> 8;

        // 3. Add an offset to the values to eliminate any negative values (all results are 8-bit unsigned)
        yuv[i * 3] = (uint8_t) y_tmp;
        yuv[i * 3 + 1] = (uint8_t) (u_tmp + 128);
        yuv[i * 3 + 2] = (uint8_t) (v_tmp + 128);
    }
}

從這段代碼咱們不難發現,32-bit的float運算被16-bit的加減、乘法和移位運算所代替。這樣的話,咱們能夠把最多128/16=8個整型數放到Q-register中作SIMD運算,一次拿8個BGR算出8個YUV,把向量化程度再提一倍。使用整型運算還有一個好處:通常而言,整型運算指令所須要的時鐘週期少於浮點運算的時鐘週期。因此,我以這段代碼爲基準(baseline),用NEON來加速它。(細心的看官也許已經看到我說法中的紕漏:雖然單個整型指令的週期小於單個相同運算的浮點指令的週期,但整型版本的BGR888ToYUV444比起浮點版本的多了移位和加法的overhead,指令數目是不一樣的,總的時鐘週期不必定就短。Good point! 看官不妨看完本文後也用NEON改寫浮點版本練練手,兩相比較就不可貴出最終的結論啦XD)翻譯

NEON Intrinsics

接下來能夠着手寫NEON代碼了。我的推薦新手先別急着一上來就寫彙編,NEON提供了intrinsics,其實就是一套可在C/C++中調用的函數API,編譯器會代勞把這些intrinsics轉成NEON指令(詳見Guide的第四章),這樣就方便一些。我用NEON intrinsics改寫的BGR888ToYUV444函數以下:

void BGR888ToYUV444(unsigned char * __restrict__ yuv, unsigned char * __restrict__ bgr, int pixel_num)
{
    const uint8x8_t u8_zero = vdup_n_u8(0);
    const uint16x8_t u16_rounding = vdupq_n_u16(128);
    const int16x8_t s16_rounding = vdupq_n_s16(128);
    const int8x16_t s8_rounding = vdupq_n_s8(128);

    int count = pixel_num / 16;

    int i;
    for (i = 0; i < count; ++i) {
        // Load bgr
        uint8x16x3_t pixel_bgr = vld3q_u8(bgr);

        uint8x8_t high_r = vget_high_u8(pixel_bgr.val[0]);
        uint8x8_t low_r = vget_low_u8(pixel_bgr.val[0]);
        uint8x8_t high_g = vget_high_u8(pixel_bgr.val[1]);
        uint8x8_t low_g = vget_low_u8(pixel_bgr.val[1]);
        uint8x8_t high_b = vget_high_u8(pixel_bgr.val[2]);
        uint8x8_t low_b = vget_low_u8(pixel_bgr.val[2]);
        int16x8_t signed_high_r = vreinterpretq_s16_u16(vaddl_u8(high_r, u8_zero));
        int16x8_t signed_low_r = vreinterpretq_s16_u16(vaddl_u8(low_r, u8_zero));
        int16x8_t signed_high_g = vreinterpretq_s16_u16(vaddl_u8(high_g, u8_zero));
        int16x8_t signed_low_g = vreinterpretq_s16_u16(vaddl_u8(low_g, u8_zero));
        int16x8_t signed_high_b = vreinterpretq_s16_u16(vaddl_u8(high_b, u8_zero));
        int16x8_t signed_low_b = vreinterpretq_s16_u16(vaddl_u8(low_b, u8_zero));

        // NOTE:
        // declaration may not appear after executable statement in block
        uint16x8_t high_y;
        uint16x8_t low_y;
        uint8x8_t scalar = vdup_n_u8(76);
        int16x8_t high_u;
        int16x8_t low_u;
        int16x8_t signed_scalar = vdupq_n_s16(-43);
        int16x8_t high_v;
        int16x8_t low_v;
        uint8x16x3_t pixel_yuv;
        int8x16_t u;
        int8x16_t v;

        // 1. Multiply transform matrix (Y′: unsigned, U/V: signed)
        high_y = vmull_u8(high_r, scalar);
        low_y = vmull_u8(low_r, scalar);

        high_u = vmulq_s16(signed_high_r, signed_scalar);
        low_u = vmulq_s16(signed_low_r, signed_scalar);

        signed_scalar = vdupq_n_s16(127);
        high_v = vmulq_s16(signed_high_r, signed_scalar);
        low_v = vmulq_s16(signed_low_r, signed_scalar);

        scalar = vdup_n_u8(150);
        high_y = vmlal_u8(high_y, high_g, scalar);
        low_y = vmlal_u8(low_y, low_g, scalar);

        signed_scalar = vdupq_n_s16(-84);
        high_u = vmlaq_s16(high_u, signed_high_g, signed_scalar);
        low_u = vmlaq_s16(low_u, signed_low_g, signed_scalar);

        signed_scalar = vdupq_n_s16(-106);
        high_v = vmlaq_s16(high_v, signed_high_g, signed_scalar);
        low_v = vmlaq_s16(low_v, signed_low_g, signed_scalar);

        scalar = vdup_n_u8(29);
        high_y = vmlal_u8(high_y, high_b, scalar);
        low_y = vmlal_u8(low_y, low_b, scalar);

        signed_scalar = vdupq_n_s16(127);
        high_u = vmlaq_s16(high_u, signed_high_b, signed_scalar);
        low_u = vmlaq_s16(low_u, signed_low_b, signed_scalar);

        signed_scalar = vdupq_n_s16(-21);
        high_v = vmlaq_s16(high_v, signed_high_b, signed_scalar);
        low_v = vmlaq_s16(low_v, signed_low_b, signed_scalar);

        // 2. Scale down (">>8") to 8-bit values with rounding ("+128") (Y′: unsigned, U/V: signed)
        // 3. Add an offset to the values to eliminate any negative values (all results are 8-bit unsigned)

        high_y = vaddq_u16(high_y, u16_rounding);
        low_y = vaddq_u16(low_y, u16_rounding);

        high_u = vaddq_s16(high_u, s16_rounding);
        low_u = vaddq_s16(low_u, s16_rounding);

        high_v = vaddq_s16(high_v, s16_rounding);
        low_v = vaddq_s16(low_v, s16_rounding);

        pixel_yuv.val[0] = vcombine_u8(vqshrn_n_u16(low_y, 8), vqshrn_n_u16(high_y, 8));

        u = vcombine_s8(vqshrn_n_s16(low_u, 8), vqshrn_n_s16(high_u, 8));

        v = vcombine_s8(vqshrn_n_s16(low_v, 8), vqshrn_n_s16(high_v, 8));

        u = vaddq_s8(u, s8_rounding);
        pixel_yuv.val[1] = vreinterpretq_u8_s8(u);

        v = vaddq_s8(v, s8_rounding);
        pixel_yuv.val[2] = vreinterpretq_u8_s8(v);

        // Store
        vst3q_u8(yuv, pixel_yuv);

        bgr += 3 * 16;
        yuv += 3 * 16;
    }

    // Handle leftovers
    for (i = count * 16; i < pixel_num; ++i) {
        uint8_t r = bgr[i * 3];
        uint8_t g = bgr[i * 3 + 1];
        uint8_t b = bgr[i * 3 + 2];

        uint16_t y_tmp = 76 * r + 150 * g + 29 * b;
        int16_t u_tmp = -43 * r - 84 * g + 127 * b;
        int16_t v_tmp = 127 * r - 106 * g - 21 * b;

        y_tmp = (y_tmp + 128) >> 8;
        u_tmp = (u_tmp + 128) >> 8;
        v_tmp = (v_tmp + 128) >> 8;

        yuv[i * 3] = (uint8_t) y_tmp;
        yuv[i * 3 + 1] = (uint8_t) (u_tmp + 128);
        yuv[i * 3 + 2] = (uint8_t) (v_tmp + 128);
    }
}

這個函數中大部分都是很經常使用的NEON intrinsics,看官不妨結合查閱Guide的附錄D自行熟悉,這裏我僅針對幾個點解釋一下:

  • 我用vld3q_u8指令從內存一次加載16個像素(也就是uint8_t類型的B、G、R三個通道的數值),將各個通道的16個數值放到一個Q-register中(也就是用了三個Q-register,每一個分別存放16個B、16個G和16個R),vst3q_u8的寫入操做也是相似的,這充分利用128-bit的帶寬,使內存存取(memory access)的次數儘量少。可是,後邊運算的向量化程度其實仍然不變,只能同時進行8個16-bit整型運算,也就是說,對於運算的部分,理想加速比(speedup)是8而非16。(固然,一次從內存加載多少數據是一個design choice,沒有絕對的答案,看官也能夠嘗試別的方式XD)
  • 對於像素個數不能被16整除的狀況,可能會有剩下的1到15個像素沒被上述NEON代碼處理到,這裏我把它們稱做leftovers,簡單粗暴地跑了以前的for循環。其實leftover的狀況若是佔比高的話,還有更高明的處理方式,請各位看官參見Guide的6.2 Handling non-multiple array lengths一節。
  • 我把Y通道計算的一部分放到一塊兒,稍做解釋,其餘的不少運算都是相似的:
// 複製8個值爲128的uint16_t類型整數到某個Q-register,該Q-register對應的C變量是u16_rounding
const uint16x8_t u16_rounding = vdupq_n_u16(128);
// 複製8個值爲128的uint8_t類型整數到某個D-register,該D-register對應的C變量是scalar
uint8x8_t scalar = vdup_n_u8(76);
// pixel_bgr.val[0]爲一個Q-register,16個uint8_t類型的數,對應R通道
// pixel_bgr.val[1]爲一個Q-register,16個uint8_t類型的數,對應G通道
// pixel_bgr.val[2]爲一個Q-register,16個uint8_t類型的數,對應B通道
uint8x16x3_t pixel_bgr = vld3q_u8(bgr);
// 一個Q-register對應有兩個D-register
// 這是拿到對應R通道的Q-register高位的D-register,有8個R值
// 參見Guide中Register overlap一節(這部份內容很重要!)
// 其餘如low_r、high_g的狀況類似,這裏從略
uint8x8_t high_r = vget_high_u8(pixel_bgr.val[0]);
// 對8個R值執行乘76操做
// vmull是變長指令(常以單詞末尾額外的L做爲標記),源操做數是兩個uint8x8_t的向量,目的操做數是uint16x8_t的向量
// 到這一步:Y = R * 76
// low_y的狀況相似,從略
high_y = vmull_u8(high_r, scalar);
// 把scalar的8個128改成8個150
scalar = vdup_n_u8(150);
// 執行乘加運算
// 到這一步:Y = R * 76 + G * 150
high_y = vmlal_u8(high_y, high_g, scalar);
// 把scalar的8個150改成8個29
scalar = vdup_n_u8(29);
// 執行乘加運算
// 到這一步:Y = R * 76 + G * 150 + B * 29
high_y = vmlal_u8(high_y, high_b, scalar);
// 到這一步:Y = (R * 76 + G * 150 + B * 29) + 128
high_y = vaddq_u16(high_y, u16_rounding);
// vqshrn_n_u16是變窄指令(常以單詞末尾額外的N做爲標記,N爲Narrow),將uint16x8_t的向量壓縮爲uint8x8_t
// 到這一步:Y = ((R * 76 + G * 150 + B * 29) + 128) >> 8
// vcombine_u8用於將兩個D-register的值組裝到一個Q-register中
pixel_yuv.val[0] = vcombine_u8(vqshrn_n_u16(low_y, 8), vqshrn_n_u16(high_y, 8));
  • 看官也許已經注意到了,在前面的BGR888ToYUV444函數中,我並不是像上面的代碼同樣,將一個通道的運算放到一塊,而是有意將Y、U、V三個通道的運算打散攪在一塊兒。這是爲了儘量減小data dependency所引發的stall,這也是優化NEON代碼須要着重考慮的方向之一。
  • 對NEON稍有了解的細心看官可能會問:乘法和加法所須要的12八、76等常數爲什麼不用當即數——例如NEON的vmul_n——而是採用先批量複製常數到向量再作向量運算?這是由於咱們不少運算的源操做數是int8x8_tuint8x8_t的向量,vmul_n等API很不幸不支持這種格式。這裏也良心提醒看官,寫NEON intrinsics或者彙編必定要對照Guide後面的附錄列出的格式,不然編譯器經常會報一些風馬牛不相及的錯誤,把人往坑裏帶——我當初但是各類踩坑各類在不相關的地方糾結啊555...

交叉編譯及debug

NEON彙編

GCC內聯彙編

從intrinsics到彙編

如何debug彙編代碼

繼續優化

相關文章
相關標籤/搜索