Checking cache hit rates with valgrind

Date: 2019-12-05


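Valgrind is a toolkit for debugging and profiling; checking for memory problems is only its best-known use. Today I will introduce cachegrind, one of the tools in the valgrind suite. For details on cachegrind, see valgrind's online documentation: http://www.valgrind.org/docs/manual/cg-manual.html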

The following uses a classic cache example: traversing a two-dimensional array.

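#include <stdio.h>
#include <stdlib.h>

#define SIZE 100

int main(int argc, char **argv)
{
    int array[SIZE][SIZE] = {0};
    int i, j;

#if 1
    /* Case 1 (test1): row-major order, the inner loop walks along a row. */
    for (i = 0; i < SIZE; ++i) {
        for (j = 0; j < SIZE; ++j) {
            array[i][j] = i + j;
        }
    }
#else
    /* Case 2 (test2): column-major order, the inner loop walks down a column.
       Change "#if 1" to "#if 0" to build this variant. */
    for (j = 0; j < SIZE; ++j) {
        for (i = 0; i < SIZE; ++i) {
            array[i][j] = i + j;
        }
    }
#endif

    return 0;
}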

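This example has long been used to show how exploiting locality raises the cache hit rate. The conventional wisdom is that the first for loop (row by row) performs better than the second (column by column).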

I used conditional compilation with no optimization flags enabled. The first case was built as test1 and the second as test2.
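One possible way to build both variants from a single source file without editing the `#if` line is to select the loop order with a macro. This is only a sketch: `ROW_MAJOR` and `fill_array` are hypothetical names that do not appear in the original listing, and the post does not show the commands the author actually used.

```c
#define SIZE 100

/* ROW_MAJOR is a hypothetical macro, not part of the original listing.
   It defaults to 1 (case 1) and can be overridden at compile time. */
#ifndef ROW_MAJOR
#define ROW_MAJOR 1
#endif

/* Fills a local SIZE x SIZE array in the selected order and returns the
   last element so the result can be checked. */
int fill_array(void)
{
    int array[SIZE][SIZE] = {0};
    int i, j;

#if ROW_MAJOR
    /* Case 1: the inner loop walks along a row. */
    for (i = 0; i < SIZE; ++i)
        for (j = 0; j < SIZE; ++j)
            array[i][j] = i + j;
#else
    /* Case 2: the inner loop walks down a column. */
    for (j = 0; j < SIZE; ++j)
        for (i = 0; i < SIZE; ++i)
            array[i][j] = i + j;
#endif

    return array[SIZE - 1][SIZE - 1];
}
```

Assuming this lives in a file named test.c together with a `main` that calls `fill_array()`, commands along the lines of `gcc -O0 -DROW_MAJOR=1 -o test1 test.c` and `gcc -O0 -DROW_MAJOR=0 -o test2 test.c` would produce the two binaries.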

Here is the output:

[fgao@fgao-vm-fc13 test]$ valgrind --tool=cachegrind ./test1

==2079== Cachegrind, a cache and branch-prediction profiler

==2079== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info

==2079== Command: ./test1

==2079==

==2079== I refs: 219,767

==2079== I1 misses: 614

==2079== L2i misses: 608

==2079== I1 miss rate: 0.27%

==2079== L2i miss rate: 0.27%

==2079==

==2079== D refs: 124,402 (95,613 rd + 28,789 wr)

==2079== D1 misses: 2,041 ( 621 rd + 1,420 wr)

==2079== L2d misses: 1,292 ( 537 rd + 755 wr)

==2079== D1 miss rate: 1.6% ( 0.6% + 4.9% )

==2079== L2d miss rate: 1.0% ( 0.5% + 2.6% )

==2079==

==2079== L2 refs: 2,655 ( 1,235 rd + 1,420 wr)

==2079== L2 misses: 1,900 ( 1,145 rd + 755 wr)

==2079== L2 miss rate: 0.5% ( 0.3% + 2.6% )

[fgao@fgao-vm-fc13 test]$ valgrind --tool=cachegrind ./test2

==2080== Cachegrind, a cache and branch-prediction profiler

==2080== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info

==2080== Command: ./test2

==2080==

==2080== I refs: 219,767

==2080== I1 misses: 614

==2080== L2i misses: 608

==2080== I1 miss rate: 0.27%

==2080== L2i miss rate: 0.27%

==2080==

==2080== D refs: 124,402 (95,613 rd + 28,789 wr)

==2080== D1 misses: 1,788 ( 621 rd + 1,167 wr)

==2080== L2d misses: 1,292 ( 537 rd + 755 wr)

==2080== D1 miss rate: 1.4% ( 0.6% + 4.0% )

==2080== L2d miss rate: 1.0% ( 0.5% + 2.6% )

==2080==

==2080== L2 refs: 2,402 ( 1,235 rd + 1,167 wr)

==2080== L2 misses: 1,900 ( 1,145 rd + 755 wr)

==2080== L2 miss rate: 0.5% ( 0.3% + 2.6% )

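The result is a little surprising: the first case actually has a lower D1 hit rate than the second (a D1 miss rate of 1.6% versus 1.4%).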
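As a quick check on the percentages cachegrind prints, the D1 miss rate is the D1 miss count divided by the D refs count. A minimal sketch in C follows; `miss_rate_percent` and `print_d1_rates` are hypothetical helper names, and the inputs are the counts from the two runs above.

```c
#include <stdio.h>

/* Hypothetical helper: miss rate as a percentage of references.
   Returns 0.0 when there are no references, to avoid dividing by zero. */
double miss_rate_percent(unsigned long misses, unsigned long refs)
{
    if (refs == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)refs;
}

/* Prints the D1 miss rates for the two SIZE = 100 runs,
   using the "D1 misses" and "D refs" totals from the output. */
void print_d1_rates(void)
{
    printf("test1 D1 miss rate: %.2f%%\n", miss_rate_percent(2041UL, 124402UL));
    printf("test2 D1 miss rate: %.2f%%\n", miss_rate_percent(1788UL, 124402UL));
}
```

Calling `print_d1_rates()` from a `main` gives about 1.64% for test1 and 1.44% for test2, consistent with the 1.6% and 1.4% that cachegrind reports.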

This result is actually understandable.

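1. Modern CPU caches work in units of cache lines. So when the array is small, the second loop, although it does not follow the locality principle, does not lower the cache hit rate as a result, and may even fill the data into the cache quickly.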

2. Modern CPU caches are fairly large. So when the array is small, ignoring the locality principle still does not cause frequent cache replacement.
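Both points depend on how large the array is relative to a cache line and to the cache itself. The sketch below works out the array's footprint. The 64-byte line size used in the usage note is an assumption for illustration (the post does not state the cache configuration that cachegrind simulated), and the helper names are hypothetical.

```c
#include <stddef.h>

/* Bytes occupied by an n x n array of int. */
size_t array_bytes(size_t n)
{
    return n * n * sizeof(int);
}

/* Number of cache lines needed to hold `bytes`, rounding up.
   line_bytes must be nonzero. */
size_t lines_needed(size_t bytes, size_t line_bytes)
{
    return (bytes + line_bytes - 1) / line_bytes;
}
```

With a 4-byte `int`, `array_bytes(100)` is 40,000 bytes, which `lines_needed(40000, 64)` puts at 625 lines, while `array_bytes(1000)` is 4,000,000 bytes, or 62,500 lines. Whether either footprint fits in the cache depends on the simulated cache sizes, which the output above does not show.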

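My own understanding of caches is fairly shallow, so I cannot pinpoint the root cause of this result. Given the two points above, though, it is broadly understandable why the second case does better here.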

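To make cachegrind's result match the conventional expectation, we need to break the two conditions above. So let's increase SIZE from 100 to 1000 and look at the output again: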

[fgao@fgao-vm-fc13 test]$ valgrind --tool=cachegrind ./test1

==2094== Cachegrind, a cache and branch-prediction profiler

==2094== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info

==2094== Command: ./test1

==2094==

==2094== I refs: 11,519,463

==2094== I1 misses: 617

==2094== L2i misses: 611

==2094== I1 miss rate: 0.00%

==2094== L2i miss rate: 0.00%

==2094==

==2094== D refs: 7,305,498 (6,038,310 rd + 1,267,188 wr)

==2094== D1 misses: 125,791 ( 621 rd + 125,170 wr)

==2094== L2d misses: 125,763 ( 595 rd + 125,168 wr)

==2094== D1 miss rate: 1.7% ( 0.0% + 9.8% )

==2094== L2d miss rate: 1.7% ( 0.0% + 9.8% )

==2094==

==2094== L2 refs: 126,408 ( 1,238 rd + 125,170 wr)

==2094== L2 misses: 126,374 ( 1,206 rd + 125,168 wr)

==2094== L2 miss rate: 0.6% ( 0.0% + 9.8% )

[fgao@fgao-vm-fc13 test]$ valgrind --tool=cachegrind ./test2

==2095== Cachegrind, a cache and branch-prediction profiler

==2095== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info

==2095== Command: ./test2

==2095==

==2095== I refs: 11,519,463

==2095== I1 misses: 617

==2095== L2i misses: 611

==2095== I1 miss rate: 0.00%

==2095== L2i miss rate: 0.00%

==2095==

==2095== D refs: 7,305,498 (6,038,310 rd + 1,267,188 wr)

==2095== D1 misses: 1,063,300 ( 621 rd + 1,062,679 wr)

==2095== L2d misses: 116,261 ( 595 rd + 115,666 wr)

==2095== D1 miss rate: 14.5% ( 0.0% + 83.8% )

==2095== L2d miss rate: 1.5% ( 0.0% + 9.1% )

==2095==

==2095== L2 refs: 1,063,917 ( 1,238 rd + 1,062,679 wr)

==2095== L2 misses: 116,872 ( 1,206 rd + 115,666 wr)

==2095== L2 miss rate: 0.6% ( 0.0% + 9.1% )

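Compare the two "D1 miss rate" lines (highlighted in red in the original post): the miss rate is 1.7% in the first case, while in the second case it is as high as 14.5%. Now the result matches the conventional expectation.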
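The D1 write-miss counts for SIZE = 1000 can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not cachegrind's actual model: a 4-byte `int`, 64-byte cache lines, an array much larger than the D1 cache, and the `= {0}` initializer writing every line of the array once before the loops run. The function names are hypothetical.

```c
#include <stddef.h>

#define LINE_BYTES 64u /* assumed cache line size, not stated in the post */
#define ELEM_BYTES 4u  /* assumed sizeof(int) */

/* Cache lines spanned by an n x n array of 4-byte elements, rounding up. */
size_t array_lines(size_t n)
{
    return (n * n * ELEM_BYTES + LINE_BYTES - 1) / LINE_BYTES;
}

/* Row-major estimate: the zero-initialization misses once per line, and
   the loop misses once per line again because the array was evicted. */
size_t est_row_major_write_misses(size_t n)
{
    return 2 * array_lines(n);
}

/* Column-major estimate: the zero-initialization misses once per line,
   then every store in the loop misses, assuming each line is evicted
   before the next column reaches it. */
size_t est_col_major_write_misses(size_t n)
{
    return array_lines(n) + n * n;
}
```

For n = 1000 these give 125,000 and 1,062,500, close to the 125,170 and 1,062,679 D1 write misses reported above. The small remainder presumably comes from the rest of the program, and the close match suggests the assumptions are reasonable but does not prove them.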

To summarize:

1. We can use cachegrind to check the cache hit rate and improve program performance;

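2. Believing everything in books is worse than having no books at all. Some results in books may well be wrong in today's environments; after all, IT moves quickly. It is better to try things out yourself!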
