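When evaluating an if condition, you can use GCC's __builtin_expect to help the compiler optimize the code, which can make it run faster. The built-in itself is described in the GCC manual under "Other Built-in Functions Provided by GCC"; the optimization levels used below are described in "3.10 Options That Control Optimization".

The principle is this: a CPU executes instructions in a pipeline, and each instruction goes roughly through "fetch --> decode --> execute". If the CPU finds during execution that it has to jump, it flushes the pipeline and restarts "fetch --> decode --> execute" from the new address, which lowers execution efficiency. Reducing how often jumps are taken (that is, how often the pipeline is flushed) therefore improves execution efficiency. A simple program is used below as an example.

    #include <stdio.h>

    #define likely(x)    __builtin_expect(!!(x), 1)
    #define unlikely(x)  __builtin_expect(!!(x), 0)

    void func1(int a)
    {
        int b;

        if (unlikely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    void func2(int a)
    {
        int b;

        if (likely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    int main(int argc, const char *argv[])
    {
        int a = 0;

        /* The input must be typed in the form "a = 5". */
        scanf("a = %d", &a);

        func1(a);
        func2(a);

        return 0;
    }

likely(x) is for conditions where x is more likely to be true, and unlikely(x) is for conditions where x is more likely to be false. The purpose of both macros is to minimize taken jumps on the common path, because every taken jump flushes the pipeline and lowers efficiency.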
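As a rough illustration of where these macros are typically applied, the sketch below marks an argument check as the rare case. The fallback definitions for compilers without __builtin_expect and the helper sum_bytes are assumptions made for this example; neither appears in the original program.

```c
#include <stddef.h>

/* Assumption: on compilers without __builtin_expect, fall back to a plain test. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)
#else
#define likely(x)    (!!(x))
#define unlikely(x)  (!!(x))
#endif

/* Hypothetical helper: sums a buffer and treats a NULL pointer as the rare case. */
long sum_bytes(const unsigned char *buf, size_t len)
{
    long total = 0;
    size_t i;

    if (unlikely(buf == NULL))
        return -1;              /* rare error path, hinted as unlikely */

    for (i = 0; i < len; i++)
        total += buf[i];        /* common path */

    return total;
}
```

The hint only tells the compiler which path to favor when it lays out the code; the function returns the same values with or without it.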
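For the hints above to take effect, an optimization level has to be specified, because the default is -O0, which performs essentially no optimization. Here is the disassembly at -O0: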
00000000004005bc <func1>:
  4005bc:  a9bd7bfd  stp   x29, x30, [sp, #-48]!
  4005c0:  910003fd  mov   x29, sp
  4005c4:  b9001fa0  str   w0, [x29, #28]
  4005c8:  b9401fa0  ldr   w0, [x29, #28]
  4005cc:  2a2003e0  mvn   w0, w0
  4005d0:  531f7c00  lsr   w0, w0, #31
  4005d4:  12001c00  and   w0, w0, #0xff
  4005d8:  92401c00  and   x0, x0, #0xff
  4005dc:  f100001f  cmp   x0, #0x0
  4005e0:  54000120  b.eq  400604 <func1+0x48>  // b.none
  4005e4:  b9401fa0  ldr   w0, [x29, #28]
  4005e8:  11000400  add   w0, w0, #0x1
  4005ec:  b9002fa0  str   w0, [x29, #44]
  4005f0:  90000000  adrp  x0, 400000 <_init-0x430>
  4005f4:  911e4000  add   x0, x0, #0x790
  4005f8:  b9402fa1  ldr   w1, [x29, #44]
  4005fc:  97ffffad  bl    4004b0 <printf@plt>
  400600:  14000008  b     400620 <func1+0x64>
  400604:  b9401fa0  ldr   w0, [x29, #28]
  400608:  11000800  add   w0, w0, #0x2
  40060c:  b9002fa0  str   w0, [x29, #44]
  400610:  90000000  adrp  x0, 400000 <_init-0x430>
  400614:  911e4000  add   x0, x0, #0x790
  400618:  b9402fa1  ldr   w1, [x29, #44]
  40061c:  97ffffa5  bl    4004b0 <printf@plt>
  400620:  d503201f  nop
  400624:  a8c37bfd  ldp   x29, x30, [sp], #48
  400628:  d65f03c0  ret

000000000040062c <func2>:
  40062c:  a9bd7bfd  stp   x29, x30, [sp, #-48]!
  400630:  910003fd  mov   x29, sp
  400634:  b9001fa0  str   w0, [x29, #28]
  400638:  b9401fa0  ldr   w0, [x29, #28]
  40063c:  2a2003e0  mvn   w0, w0
  400640:  531f7c00  lsr   w0, w0, #31
  400644:  12001c00  and   w0, w0, #0xff
  400648:  92401c00  and   x0, x0, #0xff
  40064c:  f100001f  cmp   x0, #0x0
  400650:  54000120  b.eq  400674 <func2+0x48>  // b.none
  400654:  b9401fa0  ldr   w0, [x29, #28]
  400658:  11000400  add   w0, w0, #0x1
  40065c:  b9002fa0  str   w0, [x29, #44]
  400660:  90000000  adrp  x0, 400000 <_init-0x430>
  400664:  911e4000  add   x0, x0, #0x790
  400668:  b9402fa1  ldr   w1, [x29, #44]
  40066c:  97ffff91  bl    4004b0 <printf@plt>
  400670:  14000008  b     400690 <func2+0x64>
  400674:  b9401fa0  ldr   w0, [x29, #28]
  400678:  11000800  add   w0, w0, #0x2
  40067c:  b9002fa0  str   w0, [x29, #44]
  400680:  90000000  adrp  x0, 400000 <_init-0x430>
  400684:  911e4000  add   x0, x0, #0x790
  400688:  b9402fa1  ldr   w1, [x29, #44]
  40068c:  97ffff89  bl    4004b0 <printf@plt>
  400690:  d503201f  nop
  400694:  a8c37bfd  ldp   x29, x30, [sp], #48
  400698:  d65f03c0  ret
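As you can see, the disassembly follows the C logic literally, step by step, and the hint macros had no effect at all.
Now let's look at the effect of -O1. GCC describes -O and -O1 as follows: the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.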
aarch64-linux-gnu-gcc predict.c -o predict -O1
aarch64-linux-gnu-objdump -D predict > predict.S
Below is the resulting disassembly of func1 and func2:
00000000004005bc <func1>:
  4005bc:  a9bf7bfd  stp   x29, x30, [sp, #-16]!
  4005c0:  910003fd  mov   x29, sp
  4005c4:  36f800e0  tbz   w0, #31, 4005e0 <func1+0x24>
  4005c8:  11000801  add   w1, w0, #0x2
  4005cc:  90000000  adrp  x0, 400000 <_init-0x430>
  4005d0:  911c6000  add   x0, x0, #0x718
  4005d4:  97ffffb7  bl    4004b0 <printf@plt>
  4005d8:  a8c17bfd  ldp   x29, x30, [sp], #16
  4005dc:  d65f03c0  ret
  4005e0:  11000401  add   w1, w0, #0x1
  4005e4:  90000000  adrp  x0, 400000 <_init-0x430>
  4005e8:  911c6000  add   x0, x0, #0x718
  4005ec:  97ffffb1  bl    4004b0 <printf@plt>
  4005f0:  17fffffa  b     4005d8 <func1+0x1c>

00000000004005f4 <func2>:
  4005f4:  a9bf7bfd  stp   x29, x30, [sp, #-16]!
  4005f8:  910003fd  mov   x29, sp
  4005fc:  37f800e0  tbnz  w0, #31, 400618 <func2+0x24>
  400600:  11000401  add   w1, w0, #0x1
  400604:  90000000  adrp  x0, 400000 <_init-0x430>
  400608:  911c6000  add   x0, x0, #0x718
  40060c:  97ffffa9  bl    4004b0 <printf@plt>
  400610:  a8c17bfd  ldp   x29, x30, [sp], #16
  400614:  d65f03c0  ret
  400618:  11000801  add   w1, w0, #0x2
  40061c:  90000000  adrp  x0, 400000 <_init-0x430>
  400620:  911c6000  add   x0, x0, #0x718
  400624:  97ffffa3  bl    4004b0 <printf@plt>
  400628:  17fffffa  b     400610 <func2+0x1c>
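In the -O1 listing above, the hints now shape the code layout. In func1, where a >= 0 is marked unlikely, tbz tests the sign bit and jumps away to the a + 1 code only when a is non-negative, so the expected case (a + 2) runs straight through and returns without a taken jump. In func2, where a >= 0 is marked likely, tbnz jumps away only when a is negative, so the expected a + 1 case is the fall-through path. Of course, if likely and unlikely do not match what actually happens at run time, the common case ends up on the out-of-line path and execution efficiency gets worse instead of better.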
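A wrong hint costs speed, not correctness. The minimal sketch below makes that visible: it mirrors func1 and func2 but returns b instead of printing it, so the two results can be compared directly. The function names and the fallback macro definitions are hypothetical and are not part of the original program.

```c
/* Assumption: on compilers without __builtin_expect, fall back to a plain test. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)
#else
#define likely(x)    (!!(x))
#define unlikely(x)  (!!(x))
#endif

/* Same logic as func1: the a >= 0 case is hinted as rare. */
int step_hint_unlikely(int a)
{
    if (unlikely(a >= 0))
        return a + 1;
    return a + 2;
}

/* Same logic as func2: the a >= 0 case is hinted as common. */
int step_hint_likely(int a)
{
    if (likely(a >= 0))
        return a + 1;
    return a + 2;
}
```

Both functions return identical values for every input. Only the order in which the compiler places the two branches may differ, which is why a mismatched hint shows up as lost performance rather than as a bug.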
00000000004005f8 <func1>:
  4005f8:  90000002  adrp  x2, 400000 <_init-0x430>
  4005fc:  36f80080  tbz   w0, #31, 40060c <func1+0x14>
  400600:  11000801  add   w1, w0, #0x2
  400604:  911ba040  add   x0, x2, #0x6e8
  400608:  17ffffaa  b     4004b0 <printf@plt>
  40060c:  11000401  add   w1, w0, #0x1
  400610:  911ba040  add   x0, x2, #0x6e8
  400614:  17ffffa7  b     4004b0 <printf@plt>

0000000000400618 <func2>:
  400618:  90000002  adrp  x2, 400000 <_init-0x430>
  40061c:  37f80080  tbnz  w0, #31, 40062c <func2+0x14>
  400620:  11000401  add   w1, w0, #0x1
  400624:  911ba040  add   x0, x2, #0x6e8
  400628:  17ffffa2  b     4004b0 <printf@plt>
  40062c:  11000801  add   w1, w0, #0x2
  400630:  911ba040  add   x0, x2, #0x6e8
  400634:  17ffff9f  b     4004b0 <printf@plt>
-O3: Optimize yet more. '-O3' turns on all optimizations specified by '-O2' and also turns on more optimization flags.
00000000004005f8 <func1>:
  4005f8:  90000002  adrp  x2, 400000 <_init-0x430>
  4005fc:  36f80080  tbz   w0, #31, 40060c <func1+0x14>
  400600:  11000801  add   w1, w0, #0x2
  400604:  911ba040  add   x0, x2, #0x6e8
  400608:  17ffffaa  b     4004b0 <printf@plt>
  40060c:  11000401  add   w1, w0, #0x1
  400610:  911ba040  add   x0, x2, #0x6e8
  400614:  17ffffa7  b     4004b0 <printf@plt>

0000000000400618 <func2>:
  400618:  90000002  adrp  x2, 400000 <_init-0x430>
  40061c:  37f80080  tbnz  w0, #31, 40062c <func2+0x14>
  400620:  11000401  add   w1, w0, #0x1
  400624:  911ba040  add   x0, x2, #0x6e8
  400628:  17ffffa2  b     4004b0 <printf@plt>
  40062c:  11000801  add   w1, w0, #0x2
  400630:  911ba040  add   x0, x2, #0x6e8
  400634:  17ffff9f  b     4004b0 <printf@plt>
00000000004005f4 <func1>:
  4005f4:  90000002  adrp  x2, 400000 <_init-0x430>
  4005f8:  37f80080  tbnz  w0, #31, 400608 <func1+0x14>
  4005fc:  11000401  add   w1, w0, #0x1
  400600:  911b8040  add   x0, x2, #0x6e0
  400604:  17ffffab  b     4004b0 <printf@plt>
  400608:  11000801  add   w1, w0, #0x2
  40060c:  17fffffd  b     400600 <func1+0xc>

0000000000400610 <func2>:
  400610:  90000002  adrp  x2, 400000 <_init-0x430>
  400614:  37f80080  tbnz  w0, #31, 400624 <func2+0x14>
  400618:  11000401  add   w1, w0, #0x1
  40061c:  911b8040  add   x0, x2, #0x6e0
  400620:  17ffffa4  b     4004b0 <printf@plt>
  400624:  11000801  add   w1, w0, #0x2
  400628:  17fffffd  b     40061c <func2+0xc>
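-Os mainly optimizes for code size (as you can see, this is where both functions disassemble to the fewest instructions), but execution efficiency is somewhat worse. At this level likely and unlikely have no effect on the generated code: func1 and func2 both use tbnz and both keep the a + 1 path as the fall-through.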