I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from `unsigned` to `uint64_t` made the performance drop by 50% on my PC.
```cpp
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // buffer was allocated with new[]
}
```
As you see, we create a buffer of random data, with the size being `x` megabytes, where `x` is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 `popcount` intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the time for the popcount. In the upper case, the inner loop variable is `unsigned`; in the lower case, the inner loop variable is `uint64_t`. I thought that this should make no difference, but the opposite is the case.
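To make the crux explicit, here is a minimal, self-contained sketch of the two variants reduced to one templated function. The name `popcount_unrolled` and the template parameter `Index` are only illustrative, and whether the compiler treats this templated loop exactly like the hand-written ones would have to be verified:

```cpp
#include <cstdint>
#include <x86intrin.h>

// The two benchmarked loops differ only in the type of the index i:
// Index = unsigned corresponds to the upper case, Index = uint64_t to the lower case.
// 'words' is assumed to be a multiple of 4, as in the benchmark (size/8).
template <typename Index>
uint64_t popcount_unrolled(const uint64_t* buf, uint64_t words) {
    uint64_t count = 0;
    for (Index i = 0; i < words; i += 4) {
        count += _mm_popcnt_u64(buf[i]);
        count += _mm_popcnt_u64(buf[i+1]);
        count += _mm_popcnt_u64(buf[i+2]);
        count += _mm_popcnt_u64(buf[i+3]);
    }
    return count;
}
```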
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

```
g++ -O3 -march=native -std=c++11 test.cpp -o test
```
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running `test 1` (so 1 MB of random data): the throughput of the `uint64_t` version (about 13 GB/s) is only half that of the `unsigned` version (about 26 GB/s)!

The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried `clang++` (Ubuntu Clang version 3.4-1ubuntu3):
```
clang++ -O3 -march=native -std=c++11 test.cpp -o test
```
Running `test 1` gives almost the same result, and it is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant `1`, so I change:
```cpp
uint64_t size = atol(argv[1]) << 20;
```

to

```cpp
uint64_t size = 1 << 20;
```
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for `g++`: now both versions are equally fast. However, the `unsigned` version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant with a constant value led to a deoptimization. Seriously, I have no clue what is going on here!

But now to `clang++` with the new version:
Wait, what? Now both versions dropped to the slow speed of 15 GB/s. Thus, replacing a non-constant with a constant value even leads to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
Take the first example (the one with `atol(argv[1])`) and put a `static` before the variable, i.e.:

```cpp
static uint64_t size = atol(argv[1]) << 20;
```
Here are my results with g++: yet another alternative! We still have the fast 26 GB/s with `u32`, but we managed to get `u64` at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the `u64` version became even faster than the `u32` version, yielding the fastest result of all. Sadly, this only works for `g++`; `clang++` does not seem to care about `static`.
Can you explain these results? Especially:

- Where does this difference between `u32` and `u64` come from?
- How can inserting the `static` keyword make the `u64` loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is a tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?
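For illustration, here is a minimal sketch of what the inline-assembly route could look like, assuming GCC/Clang extended asm. The wrapper name `popcnt64_asm` is hypothetical, and whether this actually reproduces the fast code path would still have to be measured:

```cpp
#include <cstdint>

// Minimal sketch: emit the popcnt instruction directly via extended asm, so the
// instruction itself is fixed; the surrounding loop structure and register
// allocation are still left to the compiler.
static inline uint64_t popcnt64_asm(uint64_t x) {
    uint64_t result;
    __asm__("popcnt %1, %0" : "=r"(result) : "r"(x));
    return result;
}

// Hypothetical use inside the unrolled loop of the benchmark:
//   count += popcnt64_asm(buffer[i]);
//   count += popcnt64_asm(buffer[i+1]);
//   count += popcnt64_asm(buffer[i+2]);
//   count += popcnt64_asm(buffer[i+3]);
```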
Here is the disassembly for the various results:
26 GB/s version from g++ / u32 / non-const bufsize:

```asm
0x400af8:
    lea     0x1(%rdx),%eax
    popcnt  (%rbx,%rax,8),%r9
    lea     0x2(%rdx),%edi
    popcnt  (%rbx,%rcx,8),%rax
    lea     0x3(%rdx),%esi
    add     %r9,%rax
    popcnt  (%rbx,%rdi,8),%rcx
    add     $0x4,%edx
    add     %rcx,%rax
    popcnt  (%rbx,%rsi,8),%rcx
    add     %rcx,%rax
    mov     %edx,%ecx
    add     %rax,%r14
    cmp     %rbp,%rcx
    jb      0x400af8
```
13 GB/s version from g++ / u64 / non-const bufsize:

```asm
0x400c00:
    popcnt  0x8(%rbx,%rdx,8),%rcx
    popcnt  (%rbx,%rdx,8),%rax
    add     %rcx,%rax
    popcnt  0x10(%rbx,%rdx,8),%rcx
    add     %rcx,%rax
    popcnt  0x18(%rbx,%rdx,8),%rcx
    add     $0x4,%rdx
    add     %rcx,%rax
    add     %rax,%r12
    cmp     %rbp,%rdx
    jb      0x400c00
```
15 GB/s version from clang++ / u64 / non-const bufsize:

```asm
0x400e50:
    popcnt  (%r15,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r15,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r15,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r15,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     %rbp,%rcx
    jb      0x400e50
```
20 GB/s version from g++ / u32&u64 / const bufsize:

```asm
0x400a68:
    popcnt  (%rbx,%rdx,1),%rax
    popcnt  0x8(%rbx,%rdx,1),%rcx
    add     %rax,%rcx
    popcnt  0x10(%rbx,%rdx,1),%rax
    add     %rax,%rcx
    popcnt  0x18(%rbx,%rdx,1),%rsi
    add     $0x20,%rdx
    add     %rsi,%rcx
    add     %rcx,%rbp
    cmp     $0x100000,%rdx
    jne     0x400a68
```
15 GB/s version from clang++ / u32&u64 / const bufsize:

```asm
0x400dd0:
    popcnt  (%r14,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r14,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r14,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r14,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     $0x20000,%rcx
    jb      0x400dd0
```
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only one that uses `lea`. Some versions use `jb` to jump, others use `jne`. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage of the size variable can make a huge difference, as we saw with the insertion of the `static` keyword in front of it! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
The interesting thing is also that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.