I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from `unsigned` to `uint64_t` made the performance drop by 50% on my PC.
```cpp
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // buffer was allocated with new[]
}
```
As you see, we create a buffer of random data, with the size being `x` megabytes, where `x` is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 `popcount` intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the time for the popcount. In the upper case, the inner loop variable is `unsigned`; in the lower case, the inner loop variable is `uint64_t`. I thought that this should make no difference, but the opposite is the case.
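To make the crux explicit, here is a minimal, self-contained sketch of the two variants reduced to one templated function. The name `popcount_unrolled` and the template parameter `Index` are only illustrative, and whether the compiler treats this templated loop exactly like the hand-written ones would have to be verified:

```cpp
#include <cstdint>
#include <x86intrin.h>

// The two benchmarked loops differ only in the type of the index i:
// Index = unsigned corresponds to the upper case, Index = uint64_t to the lower case.
// 'words' is assumed to be a multiple of 4, as in the benchmark (size/8).
template <typename Index>
uint64_t popcount_unrolled(const uint64_t* buf, uint64_t words) {
    uint64_t count = 0;
    for (Index i = 0; i < words; i += 4) {
        count += _mm_popcnt_u64(buf[i]);
        count += _mm_popcnt_u64(buf[i+1]);
        count += _mm_popcnt_u64(buf[i+2]);
        count += _mm_popcnt_u64(buf[i+3]);
    }
    return count;
}
```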
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

```
g++ -O3 -march=native -std=c++11 test.cpp -o test
```
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running `test 1` (so 1 MB of random data): the throughput of the `uint64_t` version (about 13 GB/s) is only half that of the `unsigned` version (about 26 GB/s)!

The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried `clang++` (Ubuntu Clang version 3.4-1ubuntu3):
```
clang++ -O3 -march=native -std=c++11 test.cpp -o test
```
Running `test 1` gives almost the same result, and it is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant `1`, so I change:
```cpp
uint64_t size = atol(argv[1]) << 20;
```

to

```cpp
uint64_t size = 1 << 20;
```
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for `g++`: now both versions are equally fast. However, the `unsigned` version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant with a constant value led to a deoptimization. Seriously, I have no clue what is going on here!

But now to `clang++` with the new version:
Wait, what? Now both versions dropped to the slow speed of 15 GB/s. Thus, replacing a non-constant with a constant value even leads to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
Take the first example (the one with `atol(argv[1])`) and put a `static` before the variable, i.e.:

```cpp
static uint64_t size = atol(argv[1]) << 20;
```
Here are my results with g++: yet another alternative! We still have the fast 26 GB/s with `u32`, but we managed to get `u64` at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the `u64` version became even faster than the `u32` version, yielding the fastest result of all. Sadly, this only works for `g++`; `clang++` does not seem to care about `static`.
Can you explain these results? Especially:

- Where does this difference between `u32` and `u64` come from?
- How can inserting the `static` keyword make the `u64` loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is a tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?
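For illustration, here is a minimal sketch of what the inline-assembly route could look like, assuming GCC/Clang extended asm. The wrapper name `popcnt64_asm` is hypothetical, and whether this actually reproduces the fast code path would still have to be measured:

```cpp
#include <cstdint>

// Minimal sketch: emit the popcnt instruction directly via extended asm, so the
// instruction itself is fixed; the surrounding loop structure and register
// allocation are still left to the compiler.
static inline uint64_t popcnt64_asm(uint64_t x) {
    uint64_t result;
    __asm__("popcnt %1, %0" : "=r"(result) : "r"(x));
    return result;
}

// Hypothetical use inside the unrolled loop of the benchmark:
//   count += popcnt64_asm(buffer[i]);
//   count += popcnt64_asm(buffer[i+1]);
//   count += popcnt64_asm(buffer[i+2]);
//   count += popcnt64_asm(buffer[i+3]);
```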
Here is the disassembly for the various results:
26 GB/s version from g++ / u32 / non-const bufsize:

```asm
0x400af8:
    lea     0x1(%rdx),%eax
    popcnt  (%rbx,%rax,8),%r9
    lea     0x2(%rdx),%edi
    popcnt  (%rbx,%rcx,8),%rax
    lea     0x3(%rdx),%esi
    add     %r9,%rax
    popcnt  (%rbx,%rdi,8),%rcx
    add     $0x4,%edx
    add     %rcx,%rax
    popcnt  (%rbx,%rsi,8),%rcx
    add     %rcx,%rax
    mov     %edx,%ecx
    add     %rax,%r14
    cmp     %rbp,%rcx
    jb      0x400af8
```
13 GB/s version from g++ / u64 / non-const bufsize:

```asm
0x400c00:
    popcnt  0x8(%rbx,%rdx,8),%rcx
    popcnt  (%rbx,%rdx,8),%rax
    add     %rcx,%rax
    popcnt  0x10(%rbx,%rdx,8),%rcx
    add     %rcx,%rax
    popcnt  0x18(%rbx,%rdx,8),%rcx
    add     $0x4,%rdx
    add     %rcx,%rax
    add     %rax,%r12
    cmp     %rbp,%rdx
    jb      0x400c00
```
15 GB/s version from clang++ / u64 / non-const bufsize:

```asm
0x400e50:
    popcnt  (%r15,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r15,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r15,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r15,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     %rbp,%rcx
    jb      0x400e50
```
20 GB/s version from g++ / u32&u64 / const bufsize:

```asm
0x400a68:
    popcnt  (%rbx,%rdx,1),%rax
    popcnt  0x8(%rbx,%rdx,1),%rcx
    add     %rax,%rcx
    popcnt  0x10(%rbx,%rdx,1),%rax
    add     %rax,%rcx
    popcnt  0x18(%rbx,%rdx,1),%rsi
    add     $0x20,%rdx
    add     %rsi,%rcx
    add     %rcx,%rbp
    cmp     $0x100000,%rdx
    jne     0x400a68
```
15 GB/s version from clang++ / u32&u64 / const bufsize:

```asm
0x400dd0:
    popcnt  (%r14,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r14,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r14,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r14,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     $0x20000,%rcx
    jb      0x400dd0
```
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only one that uses `lea`. Some versions use `jb` to jump, others use `jne`. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage of the size variable can make a huge difference, as we saw with the insertion of the `static` keyword in front of it! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
The interesting thing is also that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.