Game Engine Architecture 5


1、Memory Ordering Semantics

  These mysterious and vexing problems can only occur on a multicore machine with a multilevel cache.

    

  A cache coherency protocol is a communication mechanism that permits cores to share data between their local L1 caches. Most CPUs use either the MESI or MOESI protocol.

2、The MESI Protocol

  • Modified. This cache line has been modified (written to) locally.

  • Exclusive. The main RAM memory block corresponding to this cache line exists only in this core’s L1 cache—no other core has a copy of it.

  • Shared. The main RAM memory block corresponding to this cache line exists in more than one core’s L1 cache, and all cores have an identical copy of it.

  • Invalid. This cache line no longer contains valid data—the next read will need to obtain the line either from another core’s L1 cache, or from main RAM.
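The four states above can be sketched as a small enum with one illustrative transition (a hypothetical model for intuition only, not part of any real protocol implementation):

```cpp
#include <cassert>

// Hypothetical sketch (not from the book): the four MESI states as
// an enum, with one illustrative transition -- what happens to a
// line we hold when another core reads the same memory block.
enum class MesiState { Modified, Exclusive, Shared, Invalid };

MesiState OnRemoteRead(MesiState s)
{
    switch (s)
    {
    case MesiState::Modified:   // our copy must be shared/written back
    case MesiState::Exclusive:  // another core now has a copy too
        return MesiState::Shared;
    default:                    // Shared stays Shared; Invalid stays Invalid
        return s;
    }
}
```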

   

  The MOESI protocol adds another state named Owned, which allows cores to share modified data without writing it back to main RAM first.

  Under the MESI protocol, all cores’ L1 caches are connected via a special bus called the interconnect bus (ICB).

2.1、How MESI Can Go Wrong

  As with compiler optimizations and CPU out-of-order execution optimizations, MESI optimizations are carefully crafted so as to be undetectable by a single thread.

  As with compiler and CPU out-of-order optimizations, MESI is carefully designed so that a single-threaded program cannot detect it.

 

  Under certain circumstances, optimizations in the MESI protocol can cause the new value of g_ready to become visible to other cores within the cache coherency domain before the updated value of g_data becomes visible.

  Because of MESI optimizations, a value written earlier may propagate to other caches later than a value written after it; a thread on another core can then run ahead prematurely and read stale data.
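To make the failure concrete, here is the unfenced producer/consumer pair that the book's g_data/g_ready example is built around (a sketch; on a weakly ordered CPU the consumer's assert can fire):

```cpp
#include <cassert>
#include <cstdint>

int32_t g_data  = 0;
int32_t g_ready = 0;

// Producer: nothing orders these two writes as seen from other
// cores, so g_ready == 1 may become visible before g_data == 42.
void ProducerThread_broken()
{
    g_data  = 42;
    g_ready = 1;
}

// Consumer: can spin past the flag and still read a stale g_data
// on a weakly ordered CPU (the assert below is NOT guaranteed).
void ConsumerThread_broken()
{
    while (!g_ready) { /* spin */ }
    assert(g_data == 42); // may fire without fences
}
```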

3、Memory Fences / Memory Barriers

  A memory fence (or memory barrier) instruction prevents the memory effects of a read or write instruction from passing other reads and/or writes.

  There are four ways in which one instruction can pass another:

    1. A read can pass another read,
    2. a read can pass a write,
    3. a write can pass another write, or
    4. a write can pass a read.

  We could imagine a CPU that provides twelve distinct fence instructions—a bidirectional, forward, and reverse variant of each of the four basic fence types listed above.

  All fence instructions have two very useful side-effects:

    1) They serve as compiler barriers, and

    2) they prevent the CPU’s out-of-order logic from reordering instructions across the fence.

  

4、Acquire and Release Semantics

  Memory ordering semantics are really properties of read or write instructions:

  • Release semantics. This semantic guarantees that a write to shared memory can never be passed by any other read or write that precedes it in program order.

            The reads and writes that precede the write are guaranteed to stay before it; reads and writes after the write are unconstrained.

  • Acquire semantics. This semantic guarantees that a read from shared memory can never be passed by any other read or write that occurs after it in program order.

             The reads and writes that follow the read are guaranteed to stay after it; reads and writes before the read are unconstrained.

  • Full fence semantics. This semantic combines acquire and release: no read or write can pass the fenced instruction in either direction.

5、When to Use Acquire and Release Semantics

  A write-release is most often used in a producer scenario, in which a thread performs two consecutive writes (e.g., writing to g_data and then g_ready) and we need to ensure that all other threads will see the two writes in the correct order. We can enforce this ordering by making the second of these two writes a write-release.

  A read-acquire is typically used in a consumer scenario—in which a thread performs two consecutive reads in which the second is conditional on the first (e.g., only reading g_data after a read of the flag g_ready comes back true). We enforce this ordering by making sure that the first read is a read-acquire.

int32_t g_data = 0;
int32_t g_ready = 0;
void ProducerThread() // running on Core 1
{
    g_data = 42;
    
    // make the write to g_ready into a write-release
    // by placing a release fence *before* it
    RELEASE_FENCE();
    
    g_ready = 1;
}
void ConsumerThread() // running on Core 2
{
    // make the read of g_ready into a read-acquire
    // by placing an acquire fence *after* it
    while (!g_ready)
        PAUSE();
    
    ACQUIRE_FENCE();
    
    // we can now read g_data safely...
    ASSERT(g_data == 42);
}

6、CPU Memory Models

  The DEC Alpha has notoriously weak memory ordering semantics by default, requiring careful fencing in almost all situations. At the other end of the spectrum, an Intel x86 CPU has quite strong memory ordering semantics by default.

7、Fence Instructions on Real CPUs

  The Intel x86 ISA specifies three fence instructions: sfence provides release semantics, lfence provides acquire semantics, and mfence acts as a full fence. 

  Certain x86 instructions may also be prefixed by a lock modifier to make them behave atomically, and to provide a memory fence prior to execution of the instruction.

  The x86 ISA is strongly ordered by default, meaning that fences aren’t actually required in many cases where they would be on CPUs with weaker default ordering semantics.  
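When an explicit fence is needed, the x86 fence instructions above are exposed as compiler intrinsics. A minimal sketch (the g_payload/g_flag names and the publish pattern are illustrative, not from the book; note that on x86 ordinary stores are already ordered, so sfence mainly matters for non-temporal stores):

```cpp
#include <xmmintrin.h>  // _mm_sfence (SSE)
#include <cassert>

int g_payload = 0;
int g_flag    = 0;

// Publish a value using an explicit store fence: the sfence
// guarantees the payload store is globally visible before the
// flag store that follows it.
void PublishWithFence(int value)
{
    g_payload = value;
    _mm_sfence();
    g_flag = 1;
}
```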

8、Atomic Variables

  In C++11, the class template std::atomic<T> allows virtually any data type to be converted into an atomic variable. Using these facilities, our producer-consumer example can be written as follows:

  The std::atomic family of templates provides its variables with "full fence" memory ordering semantics by default (although weaker semantics can be specified if desired). This frees us to write lock-free code without having to worry about any of the three causes of data race bugs.

std::atomic<float> g_data;
std::atomic<bool> g_ready = false;

void ProducerThread()
{
    // produce some data
    g_data = 42;
    
    // inform the consumer
    g_ready = true;
}

void ConsumerThread()
{
    // wait for the data to be ready
    while (!g_ready)
        PAUSE();
        
    // consume the data
    ASSERT(g_data == 42);
}

  An atomic<> variable not only guarantees atomicity, it also guards against bugs caused by compiler reordering, CPU out-of-order execution, and memory ordering.

  

  The implementations of std::atomic<T> and std::atomic_flag are, of course, complex. The standard C++ library has to be portable, so its implementation makes use of whichever atomic and barrier machine language instructions happen to be available on the target platform.

  Depending on the target platform, atomic<> is compiled into different machine language instructions.

  For a small T (at or below the native 32/64-bit word size), std::atomic<T> uses atomic instructions; for a large T (above that size), it falls back to a mutex. You can call is_lock_free() on any atomic variable to find out whether its implementation really is lock-free on your target hardware.
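A quick runtime check of this (a sketch; which sizes are lock-free is platform-dependent, though a 4-byte int is lock-free on mainstream desktop targets):

```cpp
#include <atomic>
#include <cassert>

// Query the implementation strategy at runtime: a small native-width
// type is typically lock-free, while a larger user-defined type may
// silently fall back to an internal mutex.
bool IntIsLockFree()
{
    std::atomic<int> a{0};
    return a.is_lock_free();
}
```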

9、C++ Memory Order

  By default, C++ atomic variables make use of full memory barriers, to ensure that they will work correctly in all possible use cases. However, it is possible to relax these guarantees by passing a memory order semantic.

  1) Relaxed. Guarantees only atomicity. No barrier or fence is used.

  2) Consume. No other read or write within the same thread can be reordered before this read. In other words, this semantic only serves to prevent compiler optimizations and out-of-order execution from reordering the instructions—it does nothing to ensure any particular memory ordering semantics within the cache coherency domain.

        Similar to acquire, but it does not address cache-coherency memory ordering (acquire does).

  3) Release. A write performed with release semantics guarantees that no other read or write in this thread can be reordered after it, and the write is guaranteed to be visible to other threads reading the same address. It employs a release fence in the CPU’s cache coherency domain to accomplish this.

        Equivalent to the release semantics described earlier.

  4) Acquire. A read performed with acquire semantics guarantees that no other read or write in this thread can be reordered before it. It employs an acquire fence in the CPU’s cache coherency domain.

        Exactly equivalent to acquire semantics.

  5) Acquire/Release. Combines acquire and release semantics, acting as a full memory fence.

  

  Using these memory order specifiers requires switching from std::atomic’s overloaded assignment and cast operators to explicit calls to store() and load().

std::atomic<float> g_data;
std::atomic<bool> g_ready = false;

void ProducerThread()
{
    // produce some data
    g_data.store(42, std::memory_order_relaxed);
    
    // inform the consumer
    g_ready.store(true, std::memory_order_release);
}

void ConsumerThread()
{
    // wait for the data to be ready
    while (!g_ready.load(std::memory_order_acquire))
        PAUSE();
    
    // consume the data
    ASSERT(g_data.load(std::memory_order_relaxed) == 42);
}

 

10、Concurrency in Interpreted Programming Languages

  C++ compiles down to raw machine code that is executed directly by the CPU. As such, atomic operations and locks must be implemented with the aid of special machine language instructions that provide atomic operations and cache coherent memory barriers, with additional help from the kernel (which makes sure threads are put to sleep and awoken appropriately) and the compiler (which respects barrier instructions when optimizing the code).

  In C/C++ a volatile variable is not atomic. However, in Java and C#, the volatile type qualifier does guarantee atomicity. 
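A sketch of the distinction in C++ (names are illustrative; the volatile counter is shown only for contrast, and would lose updates under real multi-threaded contention):

```cpp
#include <atomic>
#include <cassert>

volatile int     g_volatileCounter = 0;   // NOT atomic in C/C++
std::atomic<int> g_atomicCounter{0};      // atomic read-modify-write

// Each volatile increment is a separate load/add/store sequence and
// can lose updates when two threads interleave; the atomic increment
// is a single indivisible RMW operation.
void IncrementBoth(int times)
{
    for (int i = 0; i < times; ++i)
    {
        g_volatileCounter = g_volatileCounter + 1; // load, add, store
        ++g_atomicCounter;                          // one atomic RMW
    }
}
```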

11、Basic Spin Lock

  std::memory_order_acquire: guarantees that subsequent reads will not be reordered before this read.

  std::memory_order_release: guarantees that all preceding writes have completed before this write.

class SpinLock
{
    std::atomic_flag m_atomic = ATOMIC_FLAG_INIT;
public:
    SpinLock() { }
    
    bool TryAcquire()
    {
        // use an acquire fence to ensure all subsequent
        // reads by this thread will be valid
        bool alreadyLocked = m_atomic.test_and_set(std::memory_order_acquire);
        return !alreadyLocked;
    }
    
    void Acquire()
    {
        // spin until successful acquire
        while (!TryAcquire())
        {
            // reduce power consumption on Intel CPUs
            // (can substitute with std::this_thread::yield()
            // on platforms that don't support CPU pause, if
            // thread contention is expected to be high)
            PAUSE();
        }
    }
    
    void Release()
    {
        // use release semantics to ensure that all prior
        // writes have been fully committed before we unlock
        m_atomic.clear(std::memory_order_release);
    }
};

12、Scoped Locks

  Analogous to a smart pointer: a scoped lock automatically acquires the lock on construction and releases it on destruction.
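A minimal sketch of such a scoped lock, assuming the SpinLock interface from the previous section (the ScopedLock name is illustrative):

```cpp
#include <atomic>
#include <cassert>

// Assumed lock interface, matching the SpinLock shown earlier.
class SpinLock
{
    std::atomic_flag m_atomic = ATOMIC_FLAG_INIT;
public:
    void Acquire()
    {
        while (m_atomic.test_and_set(std::memory_order_acquire)) {}
    }
    void Release() { m_atomic.clear(std::memory_order_release); }
};

// RAII "janitor": acquires in the constructor and releases in the
// destructor, so the lock is freed on every exit path (including
// early returns and exceptions).
template<class LOCK>
class ScopedLock
{
    LOCK* m_pLock;
public:
    explicit ScopedLock(LOCK& lock) : m_pLock(&lock) { m_pLock->Acquire(); }
    ~ScopedLock() { m_pLock->Release(); }
};
```

Usage: `{ ScopedLock<SpinLock> janitor(g_lock); /* critical op */ }` — the lock is released when the scope ends.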

13、Reentrant Locks

  We can relax the restriction that a thread may not re-acquire a lock it already holds by arranging for our spin lock class to cache the id of the thread that has locked it.

class ReentrantLock32
{
    std::atomic<std::size_t> m_atomic;
    std::int32_t m_refCount;
    
public:
    ReentrantLock32() : m_atomic(0), m_refCount(0) { }
    
    void Acquire()
    {
        std::hash<std::thread::id> hasher;
        std::size_t tid = hasher(std::this_thread::get_id());
        
        // if this thread doesn't already hold the lock...
        if (m_atomic.load(std::memory_order_relaxed) != tid)
        {
            // ... spin wait until we do hold it
            std::size_t unlockValue = 0;
            while (!m_atomic.compare_exchange_weak(
                                unlockValue,
                                tid,
                                std::memory_order_relaxed, // fence below!
                                std::memory_order_relaxed))
            {
                unlockValue = 0;
                PAUSE();
            }
        }
        
        // increment reference count so we can verify that
        // Acquire() and Release() are called in pairs
        ++m_refCount;
        
        // use an acquire fence to ensure all subsequent
        // reads by this thread will be valid
        std::atomic_thread_fence(std::memory_order_acquire);
    }

    void Release()
    {
        // use release semantics to ensure that all prior
        // writes have been fully committed before we unlock
        std::atomic_thread_fence(std::memory_order_release);
        
        std::hash<std::thread::id> hasher;
        std::size_t tid = hasher(std::this_thread::get_id());
        std::size_t actual = m_atomic.load(std::memory_order_relaxed);
        assert(actual == tid);
        
        --m_refCount;

        if (m_refCount == 0)
        {
            // release lock, which is safe because we own it
            m_atomic.store(0, std::memory_order_relaxed);
        }
    }

    bool TryAcquire()
    {
        std::hash<std::thread::id> hasher;
        std::size_t tid = hasher(std::this_thread::get_id());
        bool acquired = false;
        if (m_atomic.load(std::memory_order_relaxed) == tid)
        {
            acquired = true;
        }
        else
        {
            std::size_t unlockValue = 0;
            acquired = m_atomic.compare_exchange_strong(
                                unlockValue,
                                tid,
                                std::memory_order_relaxed, // fence below!
                                std::memory_order_relaxed);
        }
        if (acquired)
        {
            ++m_refCount;
            std::atomic_thread_fence(std::memory_order_acquire);
        }
        
        return acquired;
    }
    
};

14、Readers-Writer Locks

  Whenever a writer thread attempts to acquire the lock, it should wait until all readers are finished.

  We’ll store a reference count indicating how many readers currently hold the lock.

 

  A readers-writer lock suffers from starvation problems: A writer that holds the lock too long can cause all of the readers to starve, and likewise a lot of readers can cause the writers to starve.
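A minimal spin-based readers-writer lock along these lines (a sketch, not the book's implementation; a reserved count value marks "held by a writer"):

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

class ReadersWriterSpinLock
{
    // 0 = free, kWriter = held by a writer, otherwise = reader count.
    static constexpr uint32_t kWriter = 0xFFFFFFFFu;
    std::atomic<uint32_t> m_count{0};

public:
    void AcquireRead()
    {
        for (;;)
        {
            uint32_t expected = m_count.load(std::memory_order_relaxed);
            // wait while a writer holds the lock, then bump the count
            if (expected != kWriter &&
                m_count.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
                return;
        }
    }
    void ReleaseRead()  { m_count.fetch_sub(1, std::memory_order_release); }

    void AcquireWrite()
    {
        // wait until there are no readers, then claim exclusively
        uint32_t expected = 0;
        while (!m_count.compare_exchange_weak(expected, kWriter,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
            expected = 0;
    }
    void ReleaseWrite() { m_count.store(0, std::memory_order_release); }
};
```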

15、Lock-Not-Needed Assertions

  Use assertions to check whether critical operations running on multiple threads ever overlap.

class UnnecessaryLock
{
    volatile bool m_locked;
    
public:
    void Acquire()
    {
        // assert no one already has the lock
        assert(!m_locked);
        
        // now lock (so we can detect overlapping
        // critical operations if they happen)
        m_locked = true;
    }

    void Release()
    {
        // assert correct usage (that Release()
        // is only called after Acquire())
        assert(m_locked);
        
        // unlock
        m_locked = false;
    }

};

#if ASSERTIONS_ENABLED
#define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L) (L).Acquire()
#define END_ASSERT_LOCK_NOT_NECESSARY(L) (L).Release()
#else
#define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L)
#define END_ASSERT_LOCK_NOT_NECESSARY(L)
#endif

// Example usage...
UnnecessaryLock g_lock;
void EveryCriticalOperation()
{
    BEGIN_ASSERT_LOCK_NOT_NECESSARY(g_lock);
    printf("perform critical op...\n");
    END_ASSERT_LOCK_NOT_NECESSARY(g_lock);
}

  We could also wrap the locks in a janitor.

class UnnecessaryLockJanitor
{
    UnnecessaryLock* m_pLock;
public:
    explicit
    UnnecessaryLockJanitor(UnnecessaryLock& lock)
    : m_pLock(&lock) { m_pLock->Acquire(); }
    
    ~UnnecessaryLockJanitor() { m_pLock->Release(); }
};

#if ASSERTIONS_ENABLED
    #define ASSERT_LOCK_NOT_NECESSARY(J,L) \
    UnnecessaryLockJanitor J(L)
#else
    #define ASSERT_LOCK_NOT_NECESSARY(J,L)
#endif

// Example usage...
UnnecessaryLock g_lock;
void EveryCriticalOperation()
{
    ASSERT_LOCK_NOT_NECESSARY(janitor, g_lock);
    printf("perform critical op...\n");
}

  We implemented this at Naughty Dog and it successfully caught a number of cases of critical operations overlapping when the programmers had assumed they never could do so.

16、Lock-Free Transactions

  To perform a critical operation in a lock-free manner, we need to think of each such operation as a transaction that can either succeed in its entirety, or fail in its entirety. If it fails, the transaction is simply retried until it does succeed.

  To implement any transaction, no matter how complex, we perform the majority of the work locally (i.e., using data that’s visible only to the current thread, rather than operating directly on the shared data). When all of our ducks are in a row and the transaction is ready to commit, we execute a single atomic instruction, such as CAS or LL/SC. If this atomic instruction succeeds, we have successfully "published" our transaction globally—it becomes a permanent part of the shared data structure on which we’re operating. But if the atomic instruction fails, that means some other thread was attempting to commit a transaction at the same time we were.

  This fail-and-retry approach works because whenever one thread fails to commit its transaction, we know that it was because some other thread managed to succeed. As a result, one thread in the system is always making forward progress (just maybe not us). And that is the definition of lock-free.

17、A Lock-Free Linked List

    

template< class T >
class SList
{
    struct Node
    {
        T m_data;
        Node* m_pNext;
    };
    
    std::atomic< Node* > m_head { nullptr };
    
public:

    void push_front(T data)
    {
        // prep the transaction locally
        auto pNode = new Node();
        pNode->m_data = data;
        pNode->m_pNext = m_head;
        
        // commit the transaction atomically
        // (retrying until it succeeds)
        while (!m_head.compare_exchange_weak(
                pNode->m_pNext, pNode))
        { }
    }
};

  Note that compare_exchange_weak is not quite the same as a bare CAS: when a hardware CAS fails it does not modify the expected value, but compare_exchange_weak writes the observed value back into its expected argument on failure. That is why pNode->m_pNext is updated automatically in the code above, ready for the next retry.
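This failure behavior is easy to verify directly in a small single-threaded check (not from the book; compare_exchange_strong is used so a spurious failure cannot confuse the result):

```cpp
#include <atomic>
#include <cassert>

// Returns the value left in 'expected' after a failed compare-exchange.
// On failure, the atomic's observed value is written back into 'expected'.
int FailedCasUpdatesExpected()
{
    std::atomic<int> value{10};
    int expected = 5;  // deliberately wrong guess

    bool ok = value.compare_exchange_strong(expected, 99);
    assert(!ok);               // the exchange fails: value stays 10
    assert(value.load() == 10);
    return expected;           // now holds the observed value
}
```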

  

18、SIMD/Vector Processing

  SIMD and multithreading are combined into a form of parallelism known as single instruction, multiple thread (SIMT), which forms the basis of all modern GPUs.

  Intel first introduced its MMX instruction set with their Pentium line of CPUs in 1994. These instructions permitted SIMD calculations to be performed on eight 8-bit integers, four 16-bit integers, or two 32-bit integers packed into special 64-bit MMX registers.

  MMX registers are 64 bits wide.

  Intel later introduced the streaming SIMD extensions, or SSE, the first version of which appeared in the Pentium III processor. The SSE instruction set utilizes 128-bit registers that can contain integer or IEEE floating-point data.

  SSE registers are 128 bits wide.

  The SSE mode most commonly used by game engines is called packed 32-bit floating-point mode. In this mode, four 32-bit float values are packed into a single 128-bit register. An operation such as addition or multiplication can thus be performed on four pairs of floats in parallel by taking two of these 128-bit registers as its inputs. Intel has since introduced various upgrades to the SSE instruction set, named SSE2, SSE3, SSSE3 and SSE4.

  In 2011, Intel introduced a new, wider SIMD register file and accompanying instruction set named advanced vector extensions (AVX). AVX registers are 256 bits wide, permitting a single instruction to operate on pairs of up to eight 32-bit floating-point operands in parallel. The AVX2 instruction set is an extension to AVX. Some Intel CPUs now support AVX-512, an extension to AVX permitting 16 32-bit floats to be packed into a 512-bit register.

  AVX registers are 256 bits wide; AVX-512 registers are 512 bits wide.

  

19、The SSE Instruction Set and Its Registers

  These instructions are denoted with a ps suffix, indicating that we’re dealing with packed (p) single-precision (s) data. The SSE registers are named XMMi, where i is an integer between 0 and 15. In packed 32-bit floating-point mode, each 128-bit XMMi register contains four 32-bit floats.

  In AVX, the registers are 256 bits wide and are named YMMi; in AVX-512, they are 512 bits in width and are named ZMMi.  

20、The __m128 Data Type

  Declaring automatic variables and function arguments to be of type __m128 often results in the compiler treating those values as direct proxies for SSE registers.

  Automatic variables and function arguments of type __m128 are compiled into direct use of SSE registers.

  But using the __m128 type to declare global variables, structure/class members, and sometimes local variables results in the data being stored as a 16-byte aligned array of float in memory.

  Global variables and struct/class members of type __m128 are compiled into 16-byte-aligned data in memory.

  Using a memory-based __m128 variable in an SSE calculation will cause the compiler to implicitly emit instructions for loading the data from memory into an SSE register prior to performing the calculation on it, and likewise to emit instructions for storing the results of the calculation back into the memory that "backs" each such variable.

21、Alignment of SSE Data

  Whenever data that is intended for use in an XMMi register is stored in memory, it must be 16-byte (128-bit) aligned.

  The compiler ensures that global and local variables of type __m128 are aligned automatically. It also pads struct and class members so that any __m128 members are aligned properly relative to the start of the object, and ensures that the alignment of the entire struct or class is equal to the worst-case alignment of its members.
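These guarantees can be checked at compile time; a minimal sketch (the Particles struct is illustrative):

```cpp
#include <xmmintrin.h>
#include <cassert>

// A struct mixing a scalar member with an SSE member: the compiler
// pads 'count' so 'values' stays 16-byte aligned relative to the
// start of the object, and the whole struct inherits the worst-case
// (16-byte) alignment of its members.
struct Particles
{
    int    count;
    __m128 values;
};

static_assert(alignof(__m128) == 16,       "SSE data is 16-byte aligned");
static_assert(alignof(Particles) == 16,    "struct inherits member alignment");
static_assert(sizeof(Particles) % 16 == 0, "struct size padded to alignment");
```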

22、SSE Compiler Intrinsics

  Modern compilers provide intrinsics—special syntax that looks and behaves like a regular C function, but is actually boiled down to inline assembly code by the compiler.

  In order to use SSE and AVX intrinsics, your .cpp file must #include <xmmintrin.h> in Visual Studio, or <x86intrin.h> when compiling with Clang or gcc.

23、Some Useful SSE Intrinsics

  • __m128 _mm_set_ps(float w, float z, float y, float x);

  • __m128 _mm_load_ps(const float* pData);

  • void _mm_store_ps(float* pData, __m128 v);

  • __m128 _mm_add_ps(__m128 a, __m128 b);

  • __m128 _mm_mul_ps(__m128 a, __m128 b);

 

#include <xmmintrin.h>

void TestAddSSE()
{
    alignas(16) float A[4];
    alignas(16) float B[4] = { 2.0f, 4.0f, 6.0f, 8.0f };
    
    // set a = (1, 2, 3, 4) from literal values, and
    // load b = (2, 4, 6, 8) from a floating-point array
    // just to illustrate the two ways of doing this
    // (remember that _mm_set_ps() is backwards!)
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_load_ps(&B[0]);
    
    // add the vectors
    __m128 r = _mm_add_ps(a, b);
    
    // store '__m128 a' into a float array for printing
    _mm_store_ps(&A[0], a);
    
    // store result into a float array for printing
    alignas(16) float R[4];
    _mm_store_ps(&R[0], r);
    
    // inspect results
    printf("a = %.1f %.1f %.1f %.1f\n", A[0], A[1], A[2], A[3]);
    printf("b = %.1f %.1f %.1f %.1f\n", B[0], B[1], B[2], B[3]);
    printf("r = %.1f %.1f %.1f %.1f\n", R[0], R[1], R[2], R[3]);
}

 

24、Using SSE to Vectorize a Loop

void AddArrays_ref(int count, 
                    float* results,
                    const float* dataA,
                    const float* dataB)
{
    for (int i = 0; i < count; ++i)
    {
        results[i] = dataA[i] + dataB[i];
    }
}

  With SSE, the loop above can be optimized as follows.

void AddArrays_sse(int count,
                    float* results,
                    const float* dataA,
                    const float* dataB)
{
    // NOTE: the caller needs to ensure that the size of
    // all 3 arrays are equal, and a multiple of four!
    assert(count % 4 == 0);
    for (int i = 0; i < count; i += 4)
    {
        __m128 a = _mm_load_ps(&dataA[i]);
        __m128 b = _mm_load_ps(&dataB[i]);
        __m128 r = _mm_add_ps(a, b);
        _mm_store_ps(&results[i], r);
    }
}

   This is called vectorizing our loop.

  This particular example isn’t quite four times faster, due to the overhead of having to load the values in groups of four and store the results on each iteration; but when I measured these functions running on very large arrays of floats, the nonvectorized loop took roughly 3.8 times as long to do its work as the vectorized loop.

25、Vectorized Dot Products

void DotArrays_ref(int count,
                    float r[],
                    const float a[],
                    const float b[])
{
    for (int i = 0; i < count; ++i)
    {
        // treat each block of four floats as a
        // single four-element vector
        const int j = i * 4;
        r[i] = a[j+0]*b[j+0] // ax*bx
             + a[j+1]*b[j+1] // ay*by
             + a[j+2]*b[j+2] // az*bz
             + a[j+3]*b[j+3]; // aw*bw
    }
}

 

  _mm_hadd_ps() (horizontal add) operates on a register (x, y, z, w) and calculates two sums: s = x + y and t = z + w. It stores these two sums into the four slots of the destination register as (t, s, t, s). Performing this operation twice allows us to calculate the sum d = x + y + z + w. This is called adding across a register.

  Here is a first attempt at using SSE intrinsics:

void DotArrays_sse_horizontal(int count,
                                float r[],
                                const float a[],
                                const float b[])
{
    for (int i = 0; i < count; ++i)
    {
        // treat each block of four floats as a
        // single four-element vector
        const int j = i * 4;
        __m128 va = _mm_load_ps(&a[j]); // ax,ay,az,aw
        __m128 vb = _mm_load_ps(&b[j]); // bx,by,bz,bw
        __m128 v0 = _mm_mul_ps(va, vb);
        
        // add across the register...
        __m128 v1 = _mm_hadd_ps(v0, v0);
        
        // (v0w+v0z, v0y+v0x, v0w+v0z, v0y+v0x)
        __m128 vr = _mm_hadd_ps(v1, v1);
        
        // (v0w+v0z+v0y+v0x, v0w+v0z+v0y+v0x,
        // v0w+v0z+v0y+v0x, v0w+v0z+v0y+v0x)
        _mm_store_ss(&r[i], vr); // extract vr.x as a float
    }
}

  In this first attempt, adding across a register is not usually a good idea because it’s a very slow operation. Profiling the DotArrays_sse_horizontal() implementation shows that it actually takes a little bit more time than the reference implementation. Using SSE here has actually slowed us down!

 

  A better approach optimizes away the costly _mm_hadd_ps() by storing the input vectors in transposed order.

void DotArrays_sse(int count,
                    float r[],
                    const float a[],
                    const float b[])
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 vaX = _mm_load_ps(&a[(i+0)*4]); // a[0,4,8,12]
        __m128 vaY = _mm_load_ps(&a[(i+1)*4]); // a[1,5,9,13]
        __m128 vaZ = _mm_load_ps(&a[(i+2)*4]); // a[2,6,10,14]
        __m128 vaW = _mm_load_ps(&a[(i+3)*4]); // a[3,7,11,15]
        __m128 vbX = _mm_load_ps(&b[(i+0)*4]); // b[0,4,8,12]
        __m128 vbY = _mm_load_ps(&b[(i+1)*4]); // b[1,5,9,13]
        __m128 vbZ = _mm_load_ps(&b[(i+2)*4]); // b[2,6,10,14]
        __m128 vbW = _mm_load_ps(&b[(i+3)*4]); // b[3,7,11,15]
        
        __m128 result;
        result = _mm_mul_ps(vaX, vbX);
        result = _mm_add_ps(result, _mm_mul_ps(vaY, vbY));
        result = _mm_add_ps(result, _mm_mul_ps(vaZ, vbZ));
        result = _mm_add_ps(result, _mm_mul_ps(vaW, vbW));
        
        _mm_store_ps(&r[i], result);
    }
}

 

26、The MADD Instruction

  It’s interesting to note that a multiply followed by an add is such a common operation that it has its own name—madd. 

  The _mm_add_ps(result, _mm_mul_ps(vaY, vbY)) pattern in the preceding code can be replaced by a single fused multiply-add intrinsic on CPUs that support FMA (e.g., _mm_fmadd_ps()).

27、Transpose as We Go

  The final example in Section 25 requires the inputs a and b to be stored in transposed order, but normally the data is laid out row by row. To transpose it on the fly, we can use the _MM_TRANSPOSE4_PS() macro.

void DotArrays_sse_transpose(int count,
                                float r[],
                                const float a[],
                                const float b[])
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 vaX = _mm_load_ps(&a[(i+0)*4]); // a[0,1,2,3]
        __m128 vaY = _mm_load_ps(&a[(i+1)*4]); // a[4,5,6,7]
        __m128 vaZ = _mm_load_ps(&a[(i+2)*4]); // a[8,9,10,11]
        __m128 vaW = _mm_load_ps(&a[(i+3)*4]); // a[12,13,14,15]
        __m128 vbX = _mm_load_ps(&b[(i+0)*4]); // b[0,1,2,3]
        __m128 vbY = _mm_load_ps(&b[(i+1)*4]); // b[4,5,6,7]
        __m128 vbZ = _mm_load_ps(&b[(i+2)*4]); // b[8,9,10,11]
        __m128 vbW = _mm_load_ps(&b[(i+3)*4]); // b[12,13,14,15]
        
        _MM_TRANSPOSE4_PS(vaX, vaY, vaZ, vaW);
        // vaX = a[0,4,8,12]
        // vaY = a[1,5,9,13]
        // ...
        
        _MM_TRANSPOSE4_PS(vbX, vbY, vbZ, vbW);
        // vbX = b[0,4,8,12]
        // vbY = b[1,5,9,13]
        // ...
        
        __m128 result;
        result = _mm_mul_ps(vaX, vbX);
        result = _mm_add_ps(result, _mm_mul_ps(vaY, vbY));
        result = _mm_add_ps(result, _mm_mul_ps(vaZ, vbZ));
        result = _mm_add_ps(result, _mm_mul_ps(vaW, vbW));
        _mm_store_ps(&r[i], result);
    }
}

  Profiling all three implementations of DotArrays() shows that our final version (the one that transposes the vectors as it goes) is roughly 3.5 times faster than the reference implementation.

28、Shuffle and Transpose

   Here’s what the _MM_TRANSPOSE4_PS() macro looks like:

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
{ __m128 tmp3, tmp2, tmp1, tmp0; \
    \
    tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \
    tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \
    tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \
    tmp3 = _mm_shuffle_ps((row2), (row3), 0xEE); \
    \
    (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88); \
    (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD); \
    (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88); \
    (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD); }
    

  The shuffle mask is an 8-bit immediate packing four 2-bit lane selectors: the low two fields select result lanes 0-1 from the first operand, and the high two fields select result lanes 2-3 from the second operand. (The original post illustrated the mask encoding and the full semantics of _mm_shuffle_ps() with figures here.)
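A small concrete check of these semantics (a sketch; the lane values are illustrative):

```cpp
#include <xmmintrin.h>
#include <cassert>

// _MM_SHUFFLE(d, c, b, a) packs four 2-bit lane selectors into the
// 8-bit immediate. For _mm_shuffle_ps(lo, hi, mask), the result is
// (lo[a], lo[b], hi[c], hi[d]): lanes 0-1 come from the first
// operand, lanes 2-3 from the second.
void ShuffleDemo(float out[4])
{
    __m128 lo = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes: (1,2,3,4)
    __m128 hi = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f); // lanes: (5,6,7,8)

    // select lo lanes 0,1 and hi lanes 0,1 -> (1, 2, 5, 6);
    // this is mask 0x44, the same one _MM_TRANSPOSE4_PS uses for tmp0
    __m128 r = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(1, 0, 1, 0));
    _mm_storeu_ps(out, r); // unaligned store: 'out' needn't be 16-byte aligned
}
```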

  

29、Vector-Matrix Multiplication with SSE

   Vector-matrix multiplication is essentially the same computation as the vector dot product, so the implementation follows the same principle.

union Mat44
{
    float c[4][4]; // components
    __m128 row[4]; // rows
};

__m128 MulVecMat_sse(const __m128& v, const Mat44& M)
{
    // splat (broadcast) each component of v across a register
    __m128 vX = _mm_shuffle_ps(v, v, 0x00); // (vx,vx,vx,vx)
    __m128 vY = _mm_shuffle_ps(v, v, 0x55); // (vy,vy,vy,vy)
    __m128 vZ = _mm_shuffle_ps(v, v, 0xAA); // (vz,vz,vz,vz)
    __m128 vW = _mm_shuffle_ps(v, v, 0xFF); // (vw,vw,vw,vw)

    __m128 r = _mm_mul_ps(vX, M.row[0]);
    
    r = _mm_add_ps(r, _mm_mul_ps(vY, M.row[1]));
    r = _mm_add_ps(r, _mm_mul_ps(vZ, M.row[2]));
    r = _mm_add_ps(r, _mm_mul_ps(vW, M.row[3]));
    
    return r;
}

30、Matrix-Matrix Multiplication with SSE

  To multiply two matrices, compute each row of the first matrix separately: each row, treated as a vector, is multiplied by the second matrix using the vector-matrix multiplication above.

void MulMatMat_sse(Mat44& R, const Mat44& A, const Mat44& B)
{
    R.row[0] = MulVecMat_sse(A.row[0], B);
    R.row[1] = MulVecMat_sse(A.row[1], B);
    R.row[2] = MulVecMat_sse(A.row[2], B);
    R.row[3] = MulVecMat_sse(A.row[3], B);
}

31、Generalized Vectorization

   Interestingly, most optimizing compilers can vectorize some kinds of single-lane loops automatically. In fact, when writing the above examples, it took some doing to force the compiler not to vectorize my single-lane code, so that I could compare its performance to my SIMD implementation!

32、Vector Predication

  Below is an ordinary scalar function that computes square roots.

#include <cmath>

void SqrtArray_ref(float* __restrict__ r,
                   const float* __restrict__ a,
                   int count)
{
    for (int i = 0; i < count; ++i)
    {
        if (a[i] >= 0.0f)
            r[i] = std::sqrt(a[i]);
        else
            r[i] = 0.0f;
    }
}

  Below is a first SSE attempt, but the handling of negative inputs is broken: _mm_cmpge_ps() returns a vector of per-lane masks, which cannot be used as the condition of an if statement.

#include <xmmintrin.h>

void SqrtArray_sse_broken(float* __restrict__ r,
                          const float* __restrict__ a,
                          int count)
{
    assert(count % 4 == 0);
    __m128 vz = _mm_set1_ps(0.0f); // all zeros
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vr;
        if (_mm_cmpge_ps(va, vz)) // ???
            vr = _mm_sqrt_ps(va);
        else
            vr = vz;
        _mm_store_ps(r + i, vr);
    }
}

 

  _mm_cmpge_ps() produces a four-component result, stored in an SSE register. The result consists of four 32-bit masks: each mask contains all binary 1s (0xFFFFFFFFU) if the corresponding component in the input register passed the test, and all binary 0s (0x0U) if that component failed the test.

  ANDing the value returned by _mm_cmpge_ps() with the sqrt result yields the correct answer. Note that the selection code below is somewhat verbose.

#include <xmmintrin.h>

void SqrtArray_sse(float* __restrict__ r,
                   const float* __restrict__ a,
                   int count)
{
    assert(count % 4 == 0);
    __m128 vz = _mm_set1_ps(0.0f);
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        
        // always take the square root, but it may
        // end up producing QNaN in some or all lanes
        __m128 vq = _mm_sqrt_ps(va);
        
        // now select between vq and vz, depending
        // on whether the input was greater than
        // or equal to zero or not
        __m128 mask = _mm_cmpge_ps(va, vz);
        
        // (vq & mask) | (vz & ~mask)
        __m128 qmask = _mm_and_ps(mask, vq);
        __m128 znotmask = _mm_andnot_ps(mask, vz);
        __m128 vr = _mm_or_ps(qmask, znotmask);
        _mm_store_ps(r + i, vr);
    }
}

   SSE4 provides a vector select instruction, emitted by the intrinsic _mm_blendv_ps(). Let’s take a look at how we might implement a vector select operation ourselves. We can write it like this:

__m128 _mm_select_ps(const __m128 a,
                    const __m128 b,
                    const __m128 mask)
{
    // (b & mask) | (a & ~mask)
    __m128 bmask = _mm_and_ps(mask, b);
    __m128 anotmask = _mm_andnot_ps(mask, a);
    return _mm_or_ps(bmask, anotmask);
}

  The select function can also be implemented with XOR.

__m128 _mm_select_ps(const __m128 a,
                        const __m128 b,
                        const __m128 mask)
{
    // (((a ^ b) & mask) ^ a)
    __m128 diff = _mm_xor_ps(a, b);
    return _mm_xor_ps(a, _mm_and_ps(mask, diff));
}

 
