ARM Linux kernel panic and cache coherency: Cortex-A9 multicore cache and TLB coherency broadcasts


 

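A Cortex-A9 multicore CPU can receive and execute coherency broadcast operations when broadcasting is enabled and the core is in SMP mode. Taking a kernel panic as the example, this article first gives the real cause of the panic and then discusses how cache and TLB coherency broadcasting on a multicore Cortex-A9 should be configured in practice.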

 

1 Multicore Android boot fails

Kernel version: 3.0.15           CPU: Freescale i.MX6Q (quad-core Cortex-A9)

Chip feature: supports ARM TrustZone

 

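Steps: the primary core, CPU0, boots in Secure (S) mode, switches to Non-secure (NS) mode, and then starts the kernel. The kernel brings up the other three CPUs, which also switch to NS mode, and finally Android is started.

The boot failed, but it later turned out that the kernel had only panicked and had not hung completely. To confirm the state after the panic, I printed the CPU ID and the clock tick from the do_local_timer function in arch/arm/kernel/smp.c. After the panic this interrupt still fires and the messages are still printed.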

 

The raw log is as follows:

[   24.707074] request_suspend_state: wakeup (3->0) at 24564020006 (1970-01-02 00:14:26.233322336 UTC)

[   24.719704] in panic, line:75     cpu:3

[   24.726012] Kernel panic - not syncing: Attempted to kill init!

[   24.732338] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c05633a0>] (panic+0x88/0x1b4)

[   24.740586] [<c05633a0>] (panic+0x88/0x1b4) from [<c0076c60>] (do_exit+0x664/0x710)

[   24.748278] [<c0076c60>] (do_exit+0x664/0x710) from [<c0076d48>] (do_group_exit+0x3c/0xbc)

[   24.756581] [<c0076d48>] (do_group_exit+0x3c/0xbc) from [<c0082918>] (get_signal_to_deliver+0x1f8/0x430)

[   24.766107] [<c0082918>] (get_signal_to_deliver+0x1f8/0x430) from [<c0048fec>] (do_signal+0x94/0x534)

[   24.775440] [<c0048fec>] (do_signal+0x94/0x534) from [<c00494c4>] (do_notify_resume+0x38/0x44)

[   24.784162] [<c00494c4>] (do_notify_resume+0x38/0x44) from [<c0046698>] (work_pending+0x24/0x28)

[   24.792971] CPU1: stopping

[   24.795700] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

[   24.804064] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

[   24.811895] Exception stack(0xd7551d78 to 0xd7551dc0)

[   24.816948] 1d60:                                                       4657775f 00000001

[   24.825130] 1d80: 00000101 00000101 cbf48cc0 d6c4a758 40464000 cbecaee0 4657775f d6c0d190

[   24.833313] 1da0: c07dde00 0004a466 c0771c80 d7551dc0 c00de290 c0050688 60000113 ffffffff

[   24.841505] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c0050688>] (__sync_icache_dcache+0x14/0xa0)

[   24.850385] [<c0050688>] (__sync_icache_dcache+0x14/0xa0) from [<d6c0d000>] (0xd6c0d000)

[   24.858482] CPU0: stopping

[   24.861209] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

[   24.869573] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

[   24.877405] Exception stack(0xd752bec8 to 0xd752bf10)

[   24.882460] bec0:                   d7610180 00100073 00000000 00000000 d7610180 00000000

[   24.890642] bee0: 4ae957df d76cf208 cbaed9ec 40083000 d6739000 40082fff 00000001 d752bf10

[   24.898821] bf00: c00e5d80 c00e5d80 60000013 ffffffff

[   24.903884] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c00e5d80>] (mprotect_fixup+0x318/0x410)

[   24.912417] [<c00e5d80>] (mprotect_fixup+0x318/0x410) from [<c00e5f94>] (sys_mprotect+0x11c/0x1c0)

[   24.921385] [<c00e5f94>] (sys_mprotect+0x11c/0x1c0) from [<c0046640>] (ret_fast_syscall+0x0/0x30)

[   24.930264] CPU2: stopping

[   24.932988] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

[   24.941349] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c0046328>] (__irq_usr+0x48/0xe0)

[   24.949181] Exception stack(0xd75e9fb0 to 0xd75e9ff8)

[   24.954236] 9fa0:                                     405923ec 401d9688 01010101 07000000

[   24.962418] 9fc0: 78635f5f 5f5f0076 4058ecb4 4058e344 40590bd4 401d9686 401d9686 40238a60

[   24.970598] 9fe0: 00005f5f becb84a8 b0003c5b b0001774 60000010 ffffffff

[   25.073049] in do_local_timer, line:453  cpu:3

[   26.073044] in do_local_timer, line:453    cpu:3

 

Tracing through the kernel shows that the panic follows this execution path:

work_pending -> do_notify_resume -> do_signal -> get_signal_to_deliver -> do_group_exit -> 

do_exit -> exit_notify -> forget_original_parent -> find_new_reaper -> panic("Attempted to kill init!");
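The last step of this chain is where the death of init turns into a panic. The following is a minimal user-space C model of that decision, not the kernel's actual code. The names model_task and model_exit_outcome are hypothetical, and locking, non-initial PID namespaces, and the actual reparenting are left out. It only illustrates that the panic fires when the exiting task is the child reaper (init) and no other live thread in its thread group can take over.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's task_struct. */
struct model_task {
    bool exiting;                   /* models the PF_EXITING flag */
    struct model_task *next_thread; /* circular list of threads in the group */
};

enum model_outcome {
    MODEL_REPARENT_TO_THREAD, /* a live sibling thread adopts the children */
    MODEL_REPARENT_TO_REAPER, /* children are handed to the child reaper (init) */
    MODEL_PANIC_KILL_INIT     /* corresponds to panic("Attempted to kill init!") */
};

/* Decide what happens to the children of `father` when it exits.
 * `child_reaper` models the init task of the initial PID namespace. */
enum model_outcome model_exit_outcome(const struct model_task *father,
                                      const struct model_task *child_reaper)
{
    const struct model_task *t;

    /* Look for another thread in the same group that is not exiting. */
    for (t = father->next_thread; t != father; t = t->next_thread) {
        if (!t->exiting)
            return MODEL_REPARENT_TO_THREAD;
    }

    /* No live sibling is left: if the exiting task is init itself,
     * nothing can adopt orphans any more, which is the panic case. */
    if (father == child_reaper)
        return MODEL_PANIC_KILL_INIT;

    return MODEL_REPARENT_TO_REAPER;
}
```

In the log above the path enters through do_signal and get_signal_to_deliver, which suggests that init was killed by a fatal signal and not by a voluntary exit. The model says nothing about why that signal was raised.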

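This path involves thread and process exit as well as the parent-child relationships between threads, and I could not work out the cause from it for the time being.

Why would execution ever reach the kill-init step? Since the problem appeared in a multicore environment, I tried booting the system on a single core and then bringing up the other CPUs manually, as described in the next section.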

 

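2 Bringing up the other CPUs manually

Android boots on a single core without crashing. The other CPUs are then brought up manually with a command: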

echo 1 > /sys/devices/system/cpu/cpu1/online
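When the same step has to be driven from a test program instead of the shell, it can be sketched in C as below. This is only an illustration: the function name set_cpu_online is hypothetical, and the sysfs path is taken as a parameter so that nothing beyond the path in the command above is assumed.

```c
#include <stdio.h>

/* Write "1" (online) or "0" (offline) to a CPU's sysfs "online" file,
 * e.g. /sys/devices/system/cpu/cpu1/online.
 * Returns 0 on success and -1 on failure. */
int set_cpu_online(const char *online_path, int online)
{
    FILE *f = fopen(online_path, "w");
    int ok;

    if (f == NULL)
        return -1;

    /* Same payload that `echo 1 > file` would write. */
    ok = fputs(online ? "1\n" : "0\n", f) >= 0;

    /* The write may only reach the kernel on flush, so check fclose too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as set_cpu_online("/sys/devices/system/cpu/cpu1/online", 1) would correspond to the echo command and, like it, needs root privileges on the target.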

CPU1 comes up this way, but after a while the kernel panics again. The log is as follows.

[   88.604151] XXXXXXXXXX  in panic, line:75        cpu:0

[   88.610321] Kernel panic - not syncing: Attempted to kill init!    

[   88.619172] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c05633a0>] (panic+0x88/0x1b4)

[   88.627741] [<c05633a0>] (panic+0x88/0x1b4) from [<c0076c60>] (do_exit+0x664/0x710)

[   88.635424] [<c0076c60>] (do_exit+0x664/0x710) from [<c0076d48>] (do_group_exit+0x3c/0xbc)

[   88.643713] [<c0076d48>] (do_group_exit+0x3c/0xbc) from [<c0082918>] (get_signal_to_deliver+0x1f8/0x430)

root@android:/ # [   88.653215] [<c0082918>] (get_signal_to_deliver+0x1f8/0x430) from [<c0048fec>] (do_signal+0x94/0x534)

[   88.663905] [<c0048fec>] (do_signal+0x94/0x534) from [<c00494c4>] (do_notify_resume+0x38/0x44)

[   88.672545] [<c00494c4>] (do_notify_resume+0x38/0x44) from [<c0046698>] (work_pending+0x24/0x28)

[   88.681352] CPU1: stopping

[   88.684082] [<c004c65c>] (unwind_backtrace+0x0/0xfc) from [<c00402d4>] (do_IPI+0x188/0x1bc)

[   88.692449] [<c00402d4>] (do_IPI+0x188/0x1bc) from [<c004608c>] (__irq_svc+0x4c/0xe8)

[   88.700281] Exception stack(0xd2cf1f90 to 0xd2cf1fd8)

[   88.705337] 1f80:                                     00000020 c0771aa4 d2cf1fd8 00000000

[   88.713520] 1fa0: d2cf0000 c07d0624 c0567c74 c077a0f4 1000406a 412fc09a 00000000 00000000

[   88.721702] 1fc0: 00000001 d2cf1fd8 c0053aec c00471dc 60000013 ffffffff

[   88.728328] [<c004608c>] (__irq_svc+0x4c/0xe8) from [<c00471dc>] (default_idle+0x24/0x28)

[   88.736514] [<c00471dc>] (default_idle+0x24/0x28) from [<c00475b4>] (cpu_idle+0xbc/0xfc)

[   88.744612] [<c00475b4>] (cpu_idle+0xbc/0xfc) from [<10560094>] (0x10560094)

[   89.321214] in do_local_timer, line:453    cpu:0

[   90.291213] in do_local_timer, line:453    cpu:0

[   91.291213] in do_local_timer, line:453    cpu:0

 

The panic information is the same as in the previous section: execution follows the same path and ends up at the kill-init step.

 

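3 Why does multicore SMP panic?

Since the problem can be narrowed down to multicore operation, the only option is to go through the multicore-related registers carefully.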

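3.1 Non-Secure Access Control Register (NSACR)

The NSACR register is described in the figure below. It is read/write in S mode and read-only in NS mode.

Its NS_SMP bit determines whether the SMP bit of the Auxiliary Control Register can be modified in NS mode.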
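As a small illustration of the bit in question, the helpers below operate on a plain 32-bit value standing in for NSACR. The bit position (bit 18) and the CP15 encoding in the comment follow my reading of the Cortex-A9 TRM. The helper names are hypothetical, and the code does not touch the real register, which can only be written from Secure privileged code.

```c
#include <stdbool.h>
#include <stdint.h>

/* NSACR.NS_SMP: bit 18 on Cortex-A9. */
#define NSACR_NS_SMP (UINT32_C(1) << 18)

/* On the target, Secure privileged code accesses NSACR with:
 *   mrc p15, 0, <Rt>, c1, c1, 2   @ read NSACR
 *   mcr p15, 0, <Rt>, c1, c1, 2   @ write NSACR
 * The helpers below only compute values; they perform no CP15 access. */

/* True if Non-secure code is allowed to change ACTLR.SMP. */
static inline bool nsacr_ns_can_set_smp(uint32_t nsacr)
{
    return (nsacr & NSACR_NS_SMP) != 0;
}

/* Return the NSACR value with NS_SMP set, leaving all other bits unchanged. */
static inline uint32_t nsacr_allow_ns_smp(uint32_t nsacr)
{
    return nsacr | NSACR_NS_SMP;
}
```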

 

 

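3.2 Auxiliary Control Register

The Auxiliary Control Register is described below. The relevant parts are:

Coherency mode, SMP or AMP;

Broadcasting of cache, branch predictor, and TLB coherency operations.

 

Read/write in S mode;

Read-only in NS mode if NSACR.NS_SMP is 0; if that bit is set to 1, the register is read/write in NS mode, but in that case all bits except the SMP bit are write-ignored.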
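The access rules above can be written down as a small model. This is a sketch of the rules as stated in this section, not of the hardware: the function name actlr_after_ns_write is hypothetical, the bit positions (ACTLR.SMP at bit 6, NSACR.NS_SMP at bit 18) follow my reading of the Cortex-A9 TRM, and the model does not say whether a disallowed write is silently dropped or raises an exception on real silicon.

```c
#include <stdint.h>

#define NSACR_NS_SMP (UINT32_C(1) << 18) /* NSACR bit 18 */
#define ACTLR_SMP    (UINT32_C(1) << 6)  /* ACTLR bit 6  */

/* Model: the ACTLR value that results when Non-secure code writes
 * `requested` while the register currently holds `current`. */
uint32_t actlr_after_ns_write(uint32_t current, uint32_t requested,
                              uint32_t nsacr)
{
    /* NS_SMP == 0: ACTLR is read-only from NS, so nothing changes. */
    if ((nsacr & NSACR_NS_SMP) == 0)
        return current;

    /* NS_SMP == 1: only the SMP bit takes the requested value,
     * every other bit is write-ignored. */
    return (current & ~ACTLR_SMP) | (requested & ACTLR_SMP);
}
```

Under this model, a Non-secure write that requests both SMP and FW (0x41) on a register that holds 0 ends up setting only SMP (0x40). Any other bit, FW included, has to be set from the Secure side beforehand.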

 

 

 

 

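According to the description of this register, regardless of whether its FW bit is set, a core can send coherency requests to, and receive them from, the other CPUs in the same cluster for Inner Shareable, write-back, write-allocate memory.

That statement covers ordinary coherency requests only, while broadcasting of cache and TLB maintenance operations is what the FW bit controls. The implication, as I understand it, is that if the SMP bit is set, the FW bit must be set as well.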

 

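Based on this inference and the register description above, the CPU is configured as follows.

In S mode, first set the NS_SMP bit of NSACR to 1, then set both the SMP and FW bits of the Auxiliary Control Register to 1. After the switch to NS mode, the SMP bit of the Auxiliary Control Register can still be modified, and its FW bit is already 1.

With this configuration, multicore Android booted successfully and the panic did not occur again.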
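The sequence can be sketched as plain value computations. This is an illustration under stated assumptions, not the boot code actually used: the struct and function names are hypothetical, the bit positions (NSACR.NS_SMP bit 18, ACTLR.SMP bit 6, ACTLR.FW bit 0) and CP15 encodings follow my reading of the Cortex-A9 TRM, and on the target the two values would be written with the mcr instructions shown in the comments while still in Secure privileged mode.

```c
#include <stdint.h>

#define NSACR_NS_SMP (UINT32_C(1) << 18) /* allow NS writes to ACTLR.SMP */
#define ACTLR_FW     (UINT32_C(1) << 0)  /* cache/TLB maintenance broadcast */
#define ACTLR_SMP    (UINT32_C(1) << 6)  /* core takes part in coherency */

/* Hypothetical container for the two register values to program. */
struct secure_smp_config {
    uint32_t nsacr;
    uint32_t actlr;
};

/* Compute the values Secure code would write before switching to NS mode,
 * starting from the values currently read from NSACR and ACTLR. */
struct secure_smp_config secure_smp_setup(uint32_t nsacr, uint32_t actlr)
{
    struct secure_smp_config cfg;

    /* Step 1: mcr p15, 0, <Rt>, c1, c1, 2   @ NSACR, set NS_SMP */
    cfg.nsacr = nsacr | NSACR_NS_SMP;

    /* Step 2: mcr p15, 0, <Rt>, c1, c0, 1   @ ACTLR, set SMP and FW */
    cfg.actlr = actlr | ACTLR_SMP | ACTLR_FW;

    return cfg;
}
```

Setting FW on the Secure side matters because, by the access rules in section 3.2, a later write from NS mode can change only the SMP bit.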

 

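4 How to approach the remaining question

The problem above was solved by modifying registers, after it had already been narrowed down to multicore operation.

As for how to start from the "kill init" panic message, trace it, and deduce that mishandled cache coherency is what finally crashed the kernel, I have no good approach.

In other words, when Kernel panic - not syncing: Attempted to kill init! appears, I still have not found a fundamental way to track the problem down to its cause.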
