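The memory that global_timer_upper and global_timer_low point to holds the upper 32 bits and the lower 32 bits of the global timer, and both change constantly. The program below reads the current global timer value. It runs into a common error caused by compiler optimization.
The source program is as follows. The bracketed items marked #0 to #4 are optional additions; the experiments below enable them one at a time: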
#0 [volatile] unsigned long *global_timer_upper, *global_timer_low;

inline unsigned long long read_cycle(unsigned int *re_low, unsigned int *re_upper)
{
    #1 [volatile] unsigned int __low, __upper_old, __upper_new;
    unsigned long long value;

    __upper_new = *global_timer_upper;
    do {
        __upper_old = __upper_new;
        #2 [asm volatile("":::"memory");]
        __low = *global_timer_low;
        #3 [asm volatile("":::"memory");]
        __upper_new = *global_timer_upper;
        #4 [asm volatile("":::"memory");]
    } while (__upper_new != __upper_old);

    *re_low = __low;
    *re_upper = __upper_new;
    value = (unsigned long long)__low | (((unsigned long long)__upper_new) << 32);
    return value;
}
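Compiling the source program with none of the optional items enabled gives the result below. After optimization the loop has disappeared: the compiler sees nothing in the execution flow that modifies *global_timer_upper, so it treats the comparison in the while loop as redundant.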
00000000 <read_cycle>:
   0: b470       push {r4, r5, r6}
   2: f240 0400  movw r4, #0
   6: f2c0 0400  movt r4, #0
   a: 2200       movs r2, #0
   c: 6865       ldr r5, [r4, #4]
   e: 6826       ldr r6, [r4, #0]
  10: 682d       ldr r5, [r5, #0]
  12: 6834       ldr r4, [r6, #0]
  14: 432a       orrs r2, r5
  16: 6005       str r5, [r0, #0]
  18: 4610       mov r0, r2
  1a: 600c       str r4, [r1, #0]
  1c: 4621       mov r1, r4
  1e: bc70       pop {r4, r5, r6}
  20: 4770       bx lr
  22: bf00       nop
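To make the consequence of the missing loop concrete, here is a rough C sketch of what the optimized code amounts to. It is an illustration only, not compiler output: the names fake_upper, fake_low, timer_upper, timer_low and read_cycle_collapsed are hypothetical, and ordinary variables stand in for the timer registers so that the sketch can run on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two timer registers (assumption: on the
 * real target these would be the memory-mapped global timer words). */
static uint32_t fake_upper = 0x00000001u;
static uint32_t fake_low   = 0xfffffff0u;

static uint32_t *timer_upper = &fake_upper;
static uint32_t *timer_low   = &fake_low;

/* Roughly what the optimizer reduced read_cycle() to: one read of each half
 * and no comparison. On real hardware, if low wraps and carries into upper
 * between the two loads, the combined value is torn and nothing detects it. */
uint64_t read_cycle_collapsed(uint32_t *re_low, uint32_t *re_upper)
{
    assert(re_low && re_upper);

    uint32_t low   = *timer_low;    /* single read, never repeated */
    uint32_t upper = *timer_upper;  /* single read, never compared */

    *re_low = low;
    *re_upper = upper;
    return ((uint64_t)upper << 32) | low;
}
```

With static test values the result looks fine, which is exactly why this kind of bug is easy to miss: it only shows up when the timer carries between the two reads.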
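Adding the volatile qualifier only at #1 gives the result below. The loop is now clearly present, but the program is still wrong. The compiler no longer optimizes reads and writes of __upper_new, __upper_old and __low, and keeps them on the stack. However, it still assumes that the value of *global_timer_upper never changes, so it keeps that value in register r4: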
00000000 <read_cycle>:
   0: f240 0300  movw r3, #0
   4: f2c0 0300  movt r3, #0
   8: b4f0       push {r4, r5, r6, r7}
   a: b084       sub sp, #16
   c: cb0c       ldmia r3, {r2, r3}   @ load the addresses of upper and low into r2, r3
   e: 6814       ldr r4, [r2, #0]     @ read upper from memory into r4 (__upper_new = *global_timer_upper)
  10: 9401       str r4, [sp, #4]     @ store upper on the stack at [sp, #4]
  12: 681d       ldr r5, [r3, #0]     @ read low from memory into r5
  14: 9b01       ldr r3, [sp, #4]     @ loop start; corresponds to __upper_old = __upper_new
  16: 9302       str r3, [sp, #8]     @ store __upper_old on the stack at [sp, #8]
  18: 9503       str r5, [sp, #12]    @ store low on the stack at [sp, #12]
  1a: 9401       str r4, [sp, #4]     @ store __upper_new from r4 at [sp, #4]; r4 still holds the upper value loaded at e and never changes
  1c: 9a01       ldr r2, [sp, #4]     @ read __upper_new from the stack
  1e: 9b02       ldr r3, [sp, #8]     @ read __upper_old from the stack
  20: 429a       cmp r2, r3
  22: d1f7       bne.n 14 <read_cycle+0x14>
  24: 9f03       ldr r7, [sp, #12]
  26: 2200       movs r2, #0
  28: 9d01       ldr r5, [sp, #4]
  2a: 9c03       ldr r4, [sp, #12]
  2c: 9e01       ldr r6, [sp, #4]
  2e: 4322       orrs r2, r4
  30: 6007       str r7, [r0, #0]
  32: 600d       str r5, [r1, #0]
  34: 4610       mov r0, r2
  36: 4631       mov r1, r6
  38: b004       add sp, #16
  3a: bcf0       pop {r4, r5, r6, r7}
  3c: 4770       bx lr
  3e: bf00       nop
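Adding a memory barrier only at #2 gives the result below. The data is now read from the global timer's memory on every iteration, but one problem remains: instruction reordering. The design requires reading low first and then upper, but the compiled code reads upper first and then low, which makes the result wrong. This suggests putting the memory barrier between those two statements instead.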
00000000 <read_cycle>:
   0: f240 0300  movw r3, #0
   4: f2c0 0300  movt r3, #0
   8: b430       push {r4, r5}
   a: 681a       ldr r2, [r3, #0]      @ load the address of upper
   c: 6814       ldr r4, [r2, #0]      @ read upper from memory: __upper_new = *global_timer_upper
   e: e000       b.n 12 <read_cycle+0x12>
  10: 4614       mov r4, r2            @ __upper_old = __upper_new
  12: e893 0024  ldmia.w r3, {r2, r5}  @ load the addresses of upper and low
  16: 6812       ldr r2, [r2, #0]      @ read upper from memory: __upper_new = *global_timer_upper
  18: 682d       ldr r5, [r5, #0]      @ read low from memory
  1a: 42a2       cmp r2, r4            @ compare
  1c: d1f8       bne.n 10 <read_cycle+0x10>
  1e: 2200       movs r2, #0
  20: 6005       str r5, [r0, #0]
  22: 432a       orrs r2, r5
  24: 600c       str r4, [r1, #0]
  26: 4610       mov r0, r2
  28: 4621       mov r1, r4
  2a: bc30       pop {r4, r5}
  2c: 4770       bx lr
  2e: bf00       nop
Adding a memory barrier only at #4 gives the result below. The effect is the same as adding nothing: the loop is optimized away again.
00000000 <read_cycle>:
   0: f240 0300  movw r3, #0
   4: f2c0 0300  movt r3, #0
   8: b430       push {r4, r5}
   a: cb0c       ldmia r3, {r2, r3}
   c: 6815       ldr r5, [r2, #0]
   e: 681c       ldr r4, [r3, #0]
  10: 2200       movs r2, #0
  12: 6004       str r4, [r0, #0]
  14: 4322       orrs r2, r4
  16: 600d       str r5, [r1, #0]
  18: 4610       mov r0, r2
  1a: 4629       mov r1, r5
  1c: bc30       pop {r4, r5}
  1e: 4770       bx lr
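Adding a memory barrier only at #3 compiles to correct code: the instructions are not reordered, and the read of low is not optimized away either. This result follows from how the barrier works and deserves a closer look. In short, the "memory" clobber tells the compiler that memory may have changed at that point, so it can neither move the two loads across the barrier nor reuse values loaded before it: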
00000000 <read_cycle>:
   0: f240 0300  movw r3, #0
   4: f2c0 0300  movt r3, #0
   8: b430       push {r4, r5}
   a: 681a       ldr r2, [r3, #0]   @ load the address of upper
   c: 6814       ldr r4, [r2, #0]   @ read upper from memory: __upper_new = *global_timer_upper;
   e: e000       b.n 12 <read_cycle+0x12>
  10: 4614       mov r4, r2         @ __upper_old = __upper_new
  12: 685a       ldr r2, [r3, #4]   @ load the address of low
  14: 6815       ldr r5, [r2, #0]   @ read low from memory
  16: 681a       ldr r2, [r3, #0]   @ load the address of upper
  18: 6812       ldr r2, [r2, #0]   @ read upper from memory
  1a: 42a2       cmp r2, r4         @ compare
  1c: d1f8       bne.n 10 <read_cycle+0x10>
  1e: 2200       movs r2, #0
  20: 6005       str r5, [r0, #0]
  22: 432a       orrs r2, r5
  24: 600c       str r4, [r1, #0]
  26: 4610       mov r0, r2
  28: 4621       mov r1, r4
  2a: bc30       pop {r4, r5}
  2c: 4770       bx lr
  2e: bf00       nop
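The barrier placement at #3 can be written out as a small standalone sketch. This is a minimal illustration, assuming a GCC or Clang toolchain for the inline asm syntax. The macro COMPILER_BARRIER, the function read_cycle_barrier and the fake_* variables are hypothetical names, and ordinary variables stand in for the timer registers so the code can run on a host. Note that this barrier only constrains the compiler and emits no hardware barrier instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax). It emits no instruction; the
 * "memory" clobber makes the compiler assume any memory may have changed. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-ins for the timer registers (assumption for host runs). */
static uint32_t fake_upper = 0x01234567u;
static uint32_t fake_low   = 0x89abcdefu;

static uint32_t *timer_upper = &fake_upper;
static uint32_t *timer_low   = &fake_low;

uint64_t read_cycle_barrier(void)
{
    uint32_t low, upper_old, upper_new;

    assert(timer_low && timer_upper);

    upper_new = *timer_upper;
    do {
        upper_old = upper_new;
        low = *timer_low;
        COMPILER_BARRIER();        /* position #3: low is read before this point */
        upper_new = *timer_upper;  /* must be reloaded after the barrier */
    } while (upper_new != upper_old);  /* retry if upper changed meanwhile */

    return ((uint64_t)upper_new << 32) | low;
}
```

Because the barrier sits inside the loop body, it also separates the read of upper in one iteration from the read of low in the next, which is why this single position is enough in the experiment above.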
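Adding volatile only at #0 also gives a correct result. At first glance this looks odd: it seems the qualifier would only prevent optimization of accesses to the pointers global_timer_upper and global_timer_low themselves, whereas what we need is to prevent optimization of accesses to the memory they point to (*global_timer_upper and *global_timer_low). The explanation lies in the pointer type. In the declaration volatile unsigned long *p, the qualifier applies to the pointed-to object, not to the pointer, so every dereference must actually be performed. A volatile pointer would be written unsigned long * volatile p instead. The listing confirms this: the pointer values are loaded once before the loop, while the pointed-to words are reloaded on every iteration: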
00000000 <read_cycle>:
   0: f240 0300  movw r3, #0
   4: f2c0 0300  movt r3, #0
   8: b470       push {r4, r5, r6}
   a: e893 0044  ldmia.w r3, {r2, r6}
   e: 6814       ldr r4, [r2, #0]
  10: e000       b.n 14 <read_cycle+0x14>
  12: 461c       mov r4, r3
  14: 6835       ldr r5, [r6, #0]
  16: 6813       ldr r3, [r2, #0]
  18: 42a3       cmp r3, r4
  1a: d1fa       bne.n 12 <read_cycle+0x12>
  1c: 2200       movs r2, #0
  1e: 6005       str r5, [r0, #0]
  20: 432a       orrs r2, r5
  22: 600c       str r4, [r1, #0]
  24: 4610       mov r0, r2
  26: 4621       mov r1, r4
  28: bc70       pop {r4, r5, r6}
  2a: 4770       bx lr
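The volatile variant at #0 can be sketched the same way. This is a minimal illustration rather than the original program: read_cycle_volatile and the fake_* variables are hypothetical names, fixed-width types replace unsigned long, and ordinary variables stand in for the timer registers so the code can run on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the timer registers (assumption for host runs). */
static uint32_t fake_upper = 0x00000002u;
static uint32_t fake_low   = 0x00000010u;

/* Pointer to volatile data: the qualifier applies to *timer_upper and
 * *timer_low, so every dereference must be performed, in program order.
 * A volatile pointer to plain data would instead be declared as:
 *     uint32_t *volatile p;
 * which protects accesses to p itself, not to *p. */
static volatile uint32_t *timer_upper = &fake_upper;
static volatile uint32_t *timer_low   = &fake_low;

uint64_t read_cycle_volatile(void)
{
    uint32_t low, upper_old, upper_new;

    assert(timer_low && timer_upper);

    upper_new = *timer_upper;
    do {
        upper_old = upper_new;
        low = *timer_low;          /* volatile read: cannot be hoisted or dropped */
        upper_new = *timer_upper;  /* volatile read: stays after the read of low */
    } while (upper_new != upper_old);  /* retry if upper changed meanwhile */

    return ((uint64_t)upper_new << 32) | low;
}
```

Qualifying the pointed-to type keeps the ordering requirement attached to the data itself, so it does not depend on a barrier being placed at exactly the right statement.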
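One more observation that may look odd: if the function is declared both inline and static, and no other function in the file calls it, the obj file may contain no symbol for it at all. You can check this with nm or objdump. This is expected behavior: a static function has internal linkage, so the compiler is free to discard it when nothing in the translation unit uses it.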