Why is processing a sorted array faster than an unsorted array?

時間 2019-12-11

標籤 processing sorted array faster unsorted 简体版

原文原文鏈接

這是我在逛 Stack Overflow 時碰見的一個高分問題：Why is processing a sorted array faster than an unsorted array?，我以爲這是一個很是好的用來說分支預測（Branch Prediction）的例子，分享給你們看看ios

1、問題引入

先看這個代碼：數組

#include <algorithm>
#include <ctime>
#include <iostream>
#include <stdint.h>

int main() {
    uint32_t arraySize = 20000;
    uint32_t data[arraySize];

    for (uint32_t i = 0; i < arraySize; ++ i) {
        data[i] = std::rand() % 256;
    }
    
    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    clock_t start = clock();
    uint64_t sum = 0;
    for (uint32_t cnt = 0; cnt < 100000; ++ cnt) {
        for (uint32_t i = 0; i < arraySize; ++ i) {
            if (data[i] > 128) {
                sum += data[i];
            }
        }
    }

    double processTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << "processTime: " << processTime << std::endl;
    std::cout << "sum: " << sum << std::endl;

    return 0;
};

注意：這裏特意沒有加隨機數種子是爲了確保 data 數組中的僞隨機數始終不變，爲接下來的對比分析作準備，儘量減小實驗中的變量bash

咱們編譯並運行這段代碼（gcc 版本 4.1.2，過高的話會被優化掉）：ide

$ g++ a.cpp -o a -O3
$ ./a
processTime: 1.78
sum: 191444000000

下面，把下面的這一行註釋掉，而後再編譯並運行：oop

std::sort(data, data + arraySize);

$ g++ a.cpp -o b -O3
$ ./b
processTime: 10.06
sum: 191444000000

注意到了嗎？去掉那一行排序的代碼後，整個計算時間被延長了十倍！佈局

2、是 Cache Miss 致使的嗎？

答案顯然是否認的。cache miss 率並不會由於數組是否排序而改變，由於兩份代碼取數據的順序是同樣的，數據量大小是同樣的，數據佈局也是同樣的，而且在同一臺機器上運行，並無任何差異，因此能夠確定的是：和 cache miss 無任何關係優化

爲了驗證咱們的分析，能夠用 valgrind 提供的 cachegrind tool 查看 cache miss 率：ui

$ valgrind --tool=cachegrind ./a
==26548== Cachegrind, a cache and branch-prediction profiler
==26548== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==26548== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==26548== Command: ./a
==26548==
--26548-- warning: L3 cache found, using its data for the LL simulation.
--26548-- warning: specified LL cache: line_size 64  assoc 20  total_size 15,728,640
--26548-- warning: simulated LL cache: line_size 64  assoc 30  total_size 15,728,640
processTime: 68.57
sum: 191444000000
==26548==
==26548== I   refs:      14,000,637,620
==26548== I1  misses:             1,327
==26548== LLi misses:             1,293
==26548== I1  miss rate:           0.00%
==26548== LLi miss rate:           0.00%
==26548==
==26548== D   refs:       2,001,434,596  (2,000,993,511 rd   + 441,085 wr)
==26548== D1  misses:       125,115,133  (  125,112,303 rd   +   2,830 wr)
==26548== LLd misses:             7,085  (        4,770 rd   +   2,315 wr)
==26548== D1  miss rate:            6.3% (          6.3%     +     0.6%  )
==26548== LLd miss rate:            0.0% (          0.0%     +     0.5%  )
==26548==
==26548== LL refs:          125,116,460  (  125,113,630 rd   +   2,830 wr)
==26548== LL misses:              8,378  (        6,063 rd   +   2,315 wr)
==26548== LL miss rate:             0.0% (          0.0%     +     0.5%  )

$ valgrind --tool=cachegrind ./b
==13898== Cachegrind, a cache and branch-prediction profiler
==13898== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==13898== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==13898== Command: ./b
==13898==
--13898-- warning: L3 cache found, using its data for the LL simulation.
--13898-- warning: specified LL cache: line_size 64  assoc 20  total_size 15,728,640
--13898-- warning: simulated LL cache: line_size 64  assoc 30  total_size 15,728,640
processTime: 76.7
sum: 191444000000
==13898==
==13898== I   refs:      13,998,930,559
==13898== I1  misses:             1,316
==13898== LLi misses:             1,281
==13898== I1  miss rate:           0.00%
==13898== LLi miss rate:           0.00%
==13898==
==13898== D   refs:       2,000,938,800  (2,000,663,898 rd   + 274,902 wr)
==13898== D1  misses:       125,010,958  (  125,008,167 rd   +   2,791 wr)
==13898== LLd misses:             7,083  (        4,768 rd   +   2,315 wr)
==13898== D1  miss rate:            6.2% (          6.2%     +     1.0%  )
==13898== LLd miss rate:            0.0% (          0.0%     +     0.8%  )
==13898==
==13898== LL refs:          125,012,274  (  125,009,483 rd   +   2,791 wr)
==13898== LL misses:              8,364  (        6,049 rd   +   2,315 wr)
==13898== LL miss rate:             0.0% (          0.0%     +     0.8%  )

對比能夠發現，他們倆的 cache miss rate 和 cache miss 數幾乎相同，所以確實和 cache miss 無關this

3、Branch Prediction

使用到 valgrind 提供的 callgrind tool 能夠查看分支預測失敗率：code

$ valgrind --tool=callgrind --branch-sim=yes ./a
==29373== Callgrind, a call-graph generating cache profiler
==29373== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==29373== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==29373== Command: ./a
==29373==
==29373== For interactive control, run 'callgrind_control -h'.
processTime: 288.68
sum: 191444000000
==29373==
==29373== Events    : Ir Bc Bcm Bi Bim
==29373== Collected : 14000637633 4000864744 293254 23654 395
==29373==
==29373== I   refs:      14,000,637,633
==29373==
==29373== Branches:       4,000,888,398  (4,000,864,744 cond + 23,654 ind)
==29373== Mispredicts:          293,649  (      293,254 cond +    395 ind)
==29373== Mispred rate:             0.0% (          0.0%     +    1.7%   )

能夠看到，在計算 sum 以前對數組排序，分支預測失敗率很是低，幾乎至關於沒有失敗

$ valgrind --tool=callgrind --branch-sim=yes ./b
==23202== Callgrind, a call-graph generating cache profiler
==23202== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==23202== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==23202== Command: ./b
==23202==
==23202== For interactive control, run 'callgrind_control -h'.
processTime: 287.12
sum: 191444000000
==23202==
==23202== Events    : Ir Bc Bcm Bi Bim
==23202== Collected : 13998930783 4000477534 1003409950 23654 395
==23202==
==23202== I   refs:      13,998,930,783
==23202==
==23202== Branches:       4,000,501,188  (4,000,477,534 cond + 23,654 ind)
==23202== Mispredicts:    1,003,410,345  (1,003,409,950 cond +    395 ind)
==23202== Mispred rate:            25.1% (         25.1%     +    1.7%   )

而這個未排序的就不一樣了，分支預測失敗率達到了 25%。所以能夠肯定的是：兩份代碼在運行時 CPU 分支預測失敗率不一樣致使了運行時間的不一樣

4、分支預測

那麼到底什麼是分支預測，分支預測的策略是什麼呢？這兩個問題我以爲 Mysticial 的回答解釋的很是好：

假設咱們如今處於 1800 年代，那會長途通訊或者無線通訊尚未出現。你是某個鐵路分叉口的操做員，當你正在打盹的時候，遠方傳來了火車轟隆隆的聲音。你知道又有一輛列車開過來了，可是你不知道它要走哪條路，所以列車不得不停下來，在得知它要去哪一個方向後，你把開關撥向正確的位置，列車緩緩啓動駛向遠方。

可是列車很重，自身的慣性很大，中止和啓動都須要花很長很長的時間。有什麼方法能讓列車更快的到達目的地嗎？有：你來猜想列車將駛向哪一個方向。

若是你猜中了，列車繼續前進；若是沒有猜中：司機發現路不對後剎車、倒車、衝你發一頓火，最後你把開關撥到另外一邊，而後司機啓動列車，走另外一條路。

如今讓咱們來看看那條 if 語句：

if (data[i] >= 128) {
    sum += data[i]
}

如今假設你是 CPU，當遇到這個 if 語句時，接下來該作什麼：把 data[i] 累加到 sum 上面仍是什麼都不作？

怎麼辦？難道是暫停下來，等待 if 表達式算出結果，若是是 true 就執行 sum += data[i]，不然什麼也不作？

通過幾十年的發展，現代處理器異常複雜並擁有者超長的 pipeline，它須要花費很長的時間「暫停」和從新執行命令，爲了加快執行速度，處理器須要猜想接下來要作什麼，也就是說：你先忽略 if 表達式的結果，讓它一邊算去，你選擇其中一個分支繼續執行下去。

若是你猜對了，程序繼續執行；若是猜錯了，須要 flush pipeline、回滾到分支判斷那、選擇另外一個分支執行下去。

若是每次都猜中：程序執行過程當中永遠不會出現中途暫停的狀況
若是大多數都猜錯了：你將消耗大量的時間在「暫停、回滾、從新執行」上面

這就是分支預測。那麼 CPU 在猜想接下來要執行哪一個分支時有什麼策略嗎？固然是根據已有的經驗啦：根據歷史經驗尋找一個模式

若是過去 99% 的火車都走了左邊，你就猜想下次火車到來仍是會走左邊；若是是左右交替着走，那麼每次火車來的時候你把開關撥向另外一邊就能夠了；若是每三輛車走右邊後會有一輛車走左邊，那麼你也對應的猜想並操做開關...

也就是說：從火車的行進方向歷史中找到一個固有的模式，而後按照這個模式猜想下次火車將走哪一個方向。這種工做方式和處理器的分支預測器很是類似

大多數應用程序都有表現良好的分支選擇（讓 CPU 有跡可循）模式，所以現代分支預測器基本上都有着 90% 以上的命中率。可是當面臨有着沒法識別的分支選擇模式時，分支預測器的命中率極度低下，毫無可用性可言，好比上面未排序的隨機數組 data

關於分支預測的更多解釋，感興趣的話你們能夠看看維基百科的解釋：Branch predictor

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。