SIMD: The AVX Family

Date: 2019-12-14

Tags: simd avx series


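AVX stands for Advanced Vector Extensions. It is the successor to SSE and comes in three main generations: AVX, AVX2, and AVX-512. Most machines in common use today support nothing beyond the AVX family, so we will leave the other SIMD extensions aside for now.

1. The AVX family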

1.1 AVX

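AVX provides 16 YMM registers (in 64-bit mode) and mainly targets floating-point computation, supporting 32-bit single-precision and 64-bit double-precision values. It widens the packed data width from SSE's 128 bits to 256 bits.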

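AVX brings two main improvements:

256-bit packed floating-point data.
Three-operand instructions: a computation that previously had the form A = A + B can now be written as A = B + C, so neither source operand is overwritten.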
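
The two points above can be sketched with intrinsics. This is a minimal illustration rather than code from the original post: the function name add8 and the AVX_FN macro are hypothetical, and the macro assumes GCC or Clang, where a per-function target attribute avoids compiling the whole file with -mavx.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical helper macro: enables AVX code generation for one function
// on GCC/Clang. Other compilers are assumed to need no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Computes c[i] = a[i] + b[i] for 8 floats with a single 256-bit addition.
AVX_FN void add8(const float* a, const float* b, float* c)
{
    assert(a != nullptr && b != nullptr && c != nullptr);
    __m256 va = _mm256_loadu_ps(a);     // unaligned load of 8 floats
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // three-operand form: va and vb stay intact
    _mm256_storeu_ps(c, vc);
}
```

The result goes to a third array, which mirrors the A = B + C form: the inputs remain available for later use without an extra copy.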

AVX reuses the 128-bit SSE registers: the low 128 bits of each YMM register are the corresponding XMM register.
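
The register overlap is visible at the intrinsic level. The following sketch assumes GCC or Clang; split_halves and AVX_FN are hypothetical names introduced only for this illustration.

```cpp
#include <immintrin.h>
#include <cassert>

// Hypothetical helper macro: enables AVX code generation for one function
// on GCC/Clang. Other compilers are assumed to need no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Splits 8 floats held in one 256-bit value into its low and high 128-bit halves.
AVX_FN void split_halves(const float* src8, float* low4, float* high4)
{
    assert(src8 != nullptr && low4 != nullptr && high4 != nullptr);
    __m256 ymm = _mm256_loadu_ps(src8);
    __m128 lo = _mm256_castps256_ps128(ymm);    // reinterprets the low half as an XMM value
    __m128 hi = _mm256_extractf128_ps(ymm, 1);  // extracts the upper 128 bits
    _mm_storeu_ps(low4, lo);
    _mm_storeu_ps(high4, hi);
}
```

The cast intrinsic only changes the type of the value, which reflects the fact that the XMM register is simply the lower part of the YMM register.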

1.2 AVX2

AVX2 extends the AVX instructions, mainly filling in support for integer data:

256-bit packed integer data.
More complete arithmetic support.
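
As a rough illustration of 256-bit integer arithmetic, the sketch below adds eight 32-bit integers at once and falls back to a scalar loop when AVX2 is unavailable. It assumes GCC or Clang (for the target attribute and __builtin_cpu_supports); all function names are hypothetical.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// AVX2 path: one 256-bit integer addition over 8 lanes (GCC/Clang attribute assumed).
__attribute__((target("avx2")))
static void add8_i32_avx2(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), _mm256_add_epi32(va, vb));
}

// Scalar fallback for machines without AVX2 (inputs are assumed not to overflow).
static void add8_i32_scalar(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
    for (int i = 0; i < 8; ++i)
    {
        out[i] = a[i] + b[i];
    }
}

// Chooses the implementation at run time.
void add8_i32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
    assert(a != nullptr && b != nullptr && out != nullptr);
    if (__builtin_cpu_supports("avx2"))
    {
        add8_i32_avx2(a, b, out);
    }
    else
    {
        add8_i32_scalar(a, b, out);
    }
}
```

The run-time dispatch keeps the example usable on a machine without AVX2.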

1.3 AVX-512

The AVX-512 extension widens the data from 256 bits to 512 bits, a further step in data-level parallelism. AVX-512 consists of several parts:

AVX-512 Foundation
AVX-512 Conflict Detection Instructions (CD)
AVX-512 Exponential and Reciprocal Instructions (ER)
AVX-512 Prefetch Instructions (PF)
AVX-512 Vector Length Extensions (VL)
AVX-512 Byte and Word Instructions (BW)
AVX-512 Doubleword and Quadword Instructions (DQ)
AVX-512 Integer Fused Multiply Add (IFMA)
AVX-512 Vector Byte Manipulation Instructions (VBMI)
AVX-512 Vector Neural Network Instructions Word Variable Precision (4VNNIW)
AVX-512 Fused Multiply Accumulation Packed Single Precision (4FMAPS)
AVX-512 Vector Neural Network Instructions (VNNI)
AVX-512 Galois Field New Instructions (GFNI)
AVX-512 Vector AES instructions (VAES)
AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2)
AVX-512 Bit Algorithms (BITALG)

However, only the Foundation part is guaranteed to be supported by every AVX-512 implementation.

2. Detecting AVX support
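
Because the other parts are optional, code that depends on one of them would need to test for it separately. A minimal sketch, assuming GCC or Clang and their __builtin_cpu_supports built-in; the Avx512Features struct and query_avx512 function are hypothetical names, and only a few of the subsets are queried:

```cpp
#include <cassert>

// Hypothetical record of a few AVX-512 subsets reported by the running CPU.
struct Avx512Features
{
    bool foundation;  // AVX-512 F
    bool cd;          // Conflict Detection
    bool vl;          // Vector Length Extensions
    bool bw;          // Byte and Word
    bool dq;          // Doubleword and Quadword
};

// Queries each subset individually through the compiler built-in.
Avx512Features query_avx512()
{
    Avx512Features r;
    r.foundation = __builtin_cpu_supports("avx512f") != 0;
    r.cd = __builtin_cpu_supports("avx512cd") != 0;
    r.vl = __builtin_cpu_supports("avx512vl") != 0;
    r.bw = __builtin_cpu_supports("avx512bw") != 0;
    r.dq = __builtin_cpu_supports("avx512dq") != 0;
    // Every subset builds on Foundation, so none should appear without it.
    assert(r.foundation || !(r.cd || r.vl || r.bw || r.dq));
    return r;
}
```

A caller would then gate, for example, byte-granular 512-bit code on both foundation and bw being true.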

Instruction sets that are not available on every machine have to be detected with the cpuid instruction:

    push ecx

    mov eax, 0
    cpuid               //  leaf 0: eax = highest supported leaf (ebx, ecx, edx are overwritten too)
    cmp eax, 1
    jb notSupported     //  leaf 1 is not available

    mov eax, 1
    cpuid               //  leaf 1: feature flags in ecx and edx
    and ecx, 0x18000000 //  keep only bits 27 (OSXSAVE) and 28 (AVX)
    cmp ecx, 0x18000000 //  both must be set
    jne notSupported

    mov ecx, 0
    xgetbv              //  read XCR0 into edx:eax
    and eax, 0x6
    cmp eax, 0x6        //  bits 1 (XMM state) and 2 (YMM state) must be set
    jne notSupported

    mov eax, 1          //  result: AVX is usable
    jmp done

notSupported:
    mov eax, 0          //  result: AVX is not usable

done:
    pop ecx

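According to the Intel developer manual, four features have to be checked: OSXSAVE, AVX, XMM state, and YMM state. cpuid takes its argument implicitly in the eax register. With eax = 0, it returns in eax the largest value that may be passed in eax. With eax = 1, it returns the feature flags, and bits 27 and 28 of ecx (counting from bit 0) indicate support for OSXSAVE and AVX respectively. Next, ecx is set to 0 as the argument to the XGETBV instruction, which reads the XCR0 register; bits 1 and 2 of the result indicate whether XMM and YMM state have been enabled.

3. Using AVX for optimization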
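
The same sequence of checks can be sketched in C++. This version assumes GCC or Clang on x86: it relies on the __get_cpuid helper from their <cpuid.h> header and on inline assembly for XGETBV. The function names read_xcr0_low and avx_supported are hypothetical.

```cpp
#include <cpuid.h>   // GCC/Clang header providing __get_cpuid
#include <cassert>
#include <cstdint>

// Reads the low 32 bits of XCR0 via XGETBV with ecx = 0.
// Only valid after the OSXSAVE bit has been confirmed.
static std::uint32_t read_xcr0_low()
{
    std::uint32_t eax = 0;
    std::uint32_t edx = 0;
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return eax;
}

// Returns true when both the CPU and the operating system support AVX.
bool avx_supported()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 when leaf 1 is above the highest supported leaf.
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        return false;
    }
    const unsigned osxsaveAndAvx = (1u << 27) | (1u << 28);  // 0x18000000
    if ((ecx & osxsaveAndAvx) != osxsaveAndAvx)
    {
        return false;
    }
    const std::uint32_t xmmAndYmmState = 0x6;  // XCR0 bits 1 and 2
    return (read_xcr0_low() & xmmAndYmmState) == xmmAndYmmState;
}
```

The order matters: XGETBV is only executed once OSXSAVE has been confirmed, since the instruction is not usable otherwise.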

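As in the previous posts, we add a constant to 10,000,000 single-precision floats. My machine does not support AVX2, so the integer optimizations of the AVX family cannot be demonstrated here: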

// Broadcast 10.0f into all eight lanes.
__m256 step = _mm256_set1_ps(10.0f);
// Assumes data is 32-byte aligned and count is a multiple of 8.
__m256* dst = reinterpret_cast<__m256*>(data);
for (unsigned i = 0; i < count; i += 8)
{
    __m256 sum = _mm256_add_ps(*dst, step);
    *dst++ = sum;
}
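
The loop above dereferences __m256 pointers directly, which requires 32-byte-aligned data, and it steps in blocks of 8, which requires count to be a multiple of 8. A variant that drops both requirements might look like the sketch below. It assumes GCC or Clang for the per-function target attribute; add_constant and AVX_FN are hypothetical names.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper macro: enables AVX code generation for one function
// on GCC/Clang. Other compilers are assumed to need no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Adds `value` to every element of data[0..count).
AVX_FN void add_constant(float* data, std::size_t count, float value)
{
    assert(data != nullptr || count == 0);
    const __m256 step = _mm256_set1_ps(value);
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        __m256 v = _mm256_loadu_ps(data + i);  // unaligned load, no alignment requirement
        _mm256_storeu_ps(data + i, _mm256_add_ps(v, step));
    }
    for (; i < count; ++i)                     // scalar tail, at most 7 elements
    {
        data[i] += value;
    }
}
```

Unaligned loads and stores trade the alignment requirement for a possible cost when an access crosses a cache line, so aligned buffers are still worth preferring where the allocation is under the caller's control.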

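4. Results

The measured running time shows that simply switching to AVX does not always make a program faster; a deeper analysis is needed. See the link for the complete code.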
