An analysis of a deadlock in NUMA load balancing

Background: this is a crash we hit on kernel 3.10.0-957.el7.x86_64. Below is how we tracked it down and resolved it.

1. Symptom

OPPO Cloud's monitoring detected that a machine had gone down:

  KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux
    ....
   PANIC: "Kernel panic - not syncing: Hard LOCKUP"
     PID: 14
 COMMAND: "migration/1"
    TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
     CPU: 1
   STATE: TASK_INTERRUPTIBLE (PANIC)
crash> bt
PID: 14  TASK: ffff8f1bf6bb9040  CPU: 1  COMMAND: "migration/1"
 #0 [ffff8f4afbe089f0] machine_kexec at ffffffff83863674
 #1 [ffff8f4afbe08a50] __crash_kexec at ffffffff8391ce12
 #2 [ffff8f4afbe08b20] panic at ffffffff83f5b4db
 #3 [ffff8f4afbe08ba0] nmi_panic at ffffffff8389739f
 #4 [ffff8f4afbe08bb0] watchdog_overflow_callback at ffffffff83949241
 #5 [ffff8f4afbe08bc8] __perf_event_overflow at ffffffff839a1027
 #6 [ffff8f4afbe08c00] perf_event_overflow at ffffffff839aa694
 #7 [ffff8f4afbe08c10] intel_pmu_handle_irq at ffffffff8380a6b0
 #8 [ffff8f4afbe08e38] perf_event_nmi_handler at ffffffff83f6b031
 #9 [ffff8f4afbe08e58] nmi_handle at ffffffff83f6c8fc
#10 [ffff8f4afbe08eb0] do_nmi at ffffffff83f6cbd8
#11 [ffff8f4afbe08ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: native_queued_spin_lock_slowpath+462]
    RIP: ffffffff839121ae  RSP: ffff8f1bf6bc7c50  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: 0000000000000082  RCX: 0000000000000001
    RDX: 0000000000000101  RSI: 0000000000000001  RDI: ffff8f1afdf55fe8   <-- the lock being acquired
    RBP: ffff8f1bf6bc7c50   R8: 0000000000000101   R9: 0000000000000400
    R10: 000000000000499e  R11: 000000000000499f  R12: ffff8f1afdf55fe8
    R13: ffff8f1bf5150000  R14: ffff8f1afdf5b488  R15: ffff8f1bf5187818
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8f1bf6bc7c50] native_queued_spin_lock_slowpath at ffffffff839121ae
#13 [ffff8f1bf6bc7c58] queued_spin_lock_slowpath at ffffffff83f5bf4b
#14 [ffff8f1bf6bc7c68] _raw_spin_lock_irqsave at ffffffff83f6a487
#15 [ffff8f1bf6bc7c80] cpu_stop_queue_work at ffffffff8392fc70
#16 [ffff8f1bf6bc7cb0] stop_one_cpu_nowait at ffffffff83930450
#17 [ffff8f1bf6bc7cc0] load_balance at ffffffff838e4c6e
#18 [ffff8f1bf6bc7da8] idle_balance at ffffffff838e5451
#19 [ffff8f1bf6bc7e00] __schedule at ffffffff83f67b14
#20 [ffff8f1bf6bc7e88] schedule at ffffffff83f67bc9
#21 [ffff8f1bf6bc7e98] smpboot_thread_fn at ffffffff838ca562
#22 [ffff8f1bf6bc7ec8] kthread at ffffffff838c1c31
#23 [ffff8f1bf6bc7f50] ret_from_fork_nospec_begin at ffffffff83f74c1d
crash>

2. Analysis

A hard lockup usually means a CPU has been running with interrupts disabled for too long; the NMI watchdog fires when a CPU stops servicing interrupts for several seconds. From the stack above, the "migration/1" task is contending for a spinlock: _raw_spin_lock_irqsave disables local interrupts first (via local_irq_save()) and only then spins for the lock, so the entire wait happens with interrupts off. The next step is to find out who is holding the lock this task wants.
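As a reminder of why spinning here keeps interrupts off, this is roughly what the irqsave lock path does (a simplified sketch of the code in include/linux/spinlock_api_smp.h, with lockdep and contention accounting stripped):

/*
 * Simplified sketch of the irqsave spin-lock path.  Interrupts are
 * disabled before the spin, so a long wait below is a long stretch
 * with IRQs off, which is exactly what the hard-lockup detector
 * fires on.
 */
static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
{
        unsigned long flags;

        local_irq_save(flags);          /* interrupts off from here on          */
        preempt_disable();
        do_raw_spin_lock(lock);         /* may spin indefinitely on a held lock */
        return flags;
}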

On x86, RDI in native_queued_spin_lock_slowpath is the first argument, i.e. the address of the lock being acquired:

crash> arch_spinlock_t ffff8f1afdf55fe8
struct arch_spinlock_t {
  val = {
    counter = 257
  }
}
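counter = 257 is 0x101. For the qspinlock layout used here, the low byte is the locked byte, the next bit is the pending bit, and the upper bits encode the MCS wait-queue tail. A small standalone helper (hypothetical, just to decode the value pulled from the vmcore) makes this explicit:

/* Hypothetical helper (not kernel code): decode the qspinlock value read
 * from the vmcore.  Layout per the generic qspinlock: bits 0-7 locked
 * byte, bit 8 pending, bits 16+ MCS wait-queue tail. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t val     = 257;                 /* arch_spinlock_t.val.counter */
        uint32_t locked  = val & 0xff;          /* owner holds the lock        */
        uint32_t pending = (val >> 8) & 1;      /* one waiter spinning         */
        uint32_t tail    = val >> 16;           /* further waiters queued      */

        printf("locked=%u pending=%u tail=0x%x\n", locked, pending, tail);
        /* 0x101 -> locked=1, pending=1, tail=0: the lock is held and one
         * CPU (here migration/1) is spinning waiting for it. */
        return 0;
}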

Next we need to figure out which lock this actually is. The call chain is idle_balance --> load_balance --> stop_one_cpu_nowait --> cpu_stop_queue_work. Disassembling cpu_stop_queue_work at the address where it is blocked taking the lock:

crash> dis -l ffffffff8392fc70
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 91
0xffffffff8392fc70 <cpu_stop_queue_work+48>: cmpb $0x0,0xc(%rbx)
 85 static void cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 86 {
 87         struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 88         unsigned long flags;
 89
 90         spin_lock_irqsave(&stopper->lock, flags);   <-- so it is stuck acquiring this lock
 91         if (stopper->enabled)
 92                 __cpu_stop_queue_work(stopper, work);
 93         else
 94                 cpu_stop_signal_done(work->done, false);
 95         spin_unlock_irqrestore(&stopper->lock, flags);
 96 }
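For reference, the per-cpu cpu_stopper that this lock lives in looks like this (field layout reconstructed to match the `crash> p cpu_stopper:1` output shown later in this post; the comments are ours):

struct cpu_stopper {
        struct task_struct      *thread;        /* the per-cpu "migration/N" kthread    */
        spinlock_t              lock;           /* <-- the lock CPU 1 is spinning on    */
        bool                    enabled;        /* is this stopper accepting work?      */
        struct list_head        works;          /* list of pending cpu_stop_work items  */
        struct cpu_stop_work    stop_work;      /* pre-allocated work for stop_cpus()   */
};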

So the lock is the lock of the per-cpu variable cpu_stopper for some CPU, selected by the cpu argument. That argument comes from load_balance(), which picks the busiest rq and passes its CPU number down:

6545 static int load_balance(int this_cpu, struct rq *this_rq,
6546                         struct sched_domain *sd, enum cpu_idle_type idle,
6547                         int *should_balance)
6548 {
....
6735         if (active_balance) {
6736                 stop_one_cpu_nowait(cpu_of(busiest),
6737                         active_load_balance_cpu_stop, busiest,
6738                         &busiest->active_balance_work);
6739         }
....
6781 }
crash> dis -l load_balance | grep stop_one_cpu_nowait -B 6
0xffffffff838e4c4d <load_balance+2045>: callq 0xffffffff83f6a0e0 <_raw_spin_unlock_irqrestore>
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/fair.c: 6736
0xffffffff838e4c52 <load_balance+2050>: mov 0x930(%rbx),%edi   <-- rbx is the busiest rq; this loads its cpu number
0xffffffff838e4c58 <load_balance+2056>: lea 0x908(%rbx),%rcx
0xffffffff838e4c5f <load_balance+2063>: mov %rbx,%rdx
0xffffffff838e4c62 <load_balance+2066>: mov $0xffffffff838de690,%rsi
0xffffffff838e4c69 <load_balance+2073>: callq 0xffffffff83930420 <stop_one_cpu_nowait>
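cpu_of() is trivial, it simply returns rq->cpu, which is why the CPU number is read from a fixed offset (0x930 in this build) of the busiest rq:

/* kernel/sched/sched.h */
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
        return rq->cpu;         /* the cpu this runqueue belongs to */
#else
        return 0;
#endif
}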

We then pulled the relevant data off the stack:

The busiest rq is:

crash> rq.cpu ffff8f1afdf5ab80
  cpu = 26

In other words, CPU 1 is waiting for CPU 26's lock in the per-cpu variable cpu_stopper.

Next we searched the other tasks' stacks for this lock address and found:

ffff8f4957fbfab0: ffff8f1afdf55fe8      <-- found in the stack of PID 355608
crash> kmem ffff8f4957fbfab0
     PID: 355608
 COMMAND: "custom_exporter"
    TASK: ffff8f4aea3a8000  [THREAD_INFO: ffff8f4957fbc000]
     CPU: 26                             <-- which happens to be running on CPU 26
   STATE: TASK_RUNNING (ACTIVE)

Now we need to understand why custom_exporter, running on CPU 26, has been holding ffff8f1afdf55fe8 for so long.

Let's look at the stack of the task on CPU 26:

crash> bt -f 355608
PID: 355608  TASK: ffff8f4aea3a8000  CPU: 26  COMMAND: "custom_exporter"
.....
 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff8f1bf6bb9040  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80
    ffff8f4957fbfa78: ffffffff838d6705
 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
.....
    ffff8f4957fbfbb8: ffffffff838d3867
 #8 [ffff8f4957fbfbb8] migrate_swap at ffffffff838d3867
    ffff8f4957fbfbc0: ffff8f4aea3a8000 ffff8f1ae77dc100   <-- the migration_swap_arg on the stack
    ffff8f4957fbfbd0: 000000010000001a 0000000080490f7c
    ffff8f4957fbfbe0: ffff8f4aea3a8000 ffff8f4957fbfc30
    ffff8f4957fbfbf0: 0000000000000076 0000000000000076
    ffff8f4957fbfc00: 0000000000000371 ffff8f4957fbfce8
    ffff8f4957fbfc10: ffffffff838dd0ba
 #9 [ffff8f4957fbfc10] task_numa_migrate at ffffffff838dd0ba
    ffff8f4957fbfc18: ffff8f1afc121f40 000000000000001a
    ffff8f4957fbfc28: 0000000000000371 ffff8f4aea3a8000   <-- ffff8f4957fbfc30 is where the task_numa_env sits on the stack
    ffff8f4957fbfc38: 000000000000001a 000000010000003f
    ffff8f4957fbfc48: 000000000000000b 000000000000022c
    ffff8f4957fbfc58: 00000000000049a0 0000000000000012
    ffff8f4957fbfc68: 0000000000000001 0000000000000003
    ffff8f4957fbfc78: 000000000000006f 000000000000499f
    ffff8f4957fbfc88: 0000000000000012 0000000000000001
    ffff8f4957fbfc98: 0000000000000070 ffff8f1ae77dc100
    ffff8f4957fbfca8: 00000000000002fb 0000000000000001
    ffff8f4957fbfcb8: 0000000080490f7c ffff8f4aea3a8000   <-- rbx was pushed here, so this value is current
    ffff8f4957fbfcc8: 0000000000017a48 0000000000001818
    ffff8f4957fbfcd8: 0000000000000018 ffff8f4957fbfe20
    ffff8f4957fbfce8: ffff8f4957fbfcf8 ffffffff838dd4d3
#10 [ffff8f4957fbfcf0] numa_migrate_preferred at ffffffff838dd4d3
    ffff8f4957fbfcf8: ffff8f4957fbfd88 ffffffff838df5b0
.....
crash>

So CPU 26 is itself in the middle of a NUMA balancing operation. A quick digression on what NUMA balancing does: it starts from task_tick_fair():

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (numabalancing_enabled)      /* if NUMA balancing is enabled, call task_tick_numa */
                task_tick_numa(rq, curr);

        update_rq_runnable_avg(rq, 1);
}

task_tick_numa, based on its scan bookkeeping, queues a piece of work against the current task whenever its next NUMA scan is due. That work calls change_prot_numa() to change the PTEs mapping the task's VMAs to PAGE_NONE, so the next time the task touches those pages it takes a page fault; handle_pte_fault() then uses that fault as the opportunity to pick a better NUMA node. We will not expand on the details here, but the fault-side half of that machinery is exactly what is on CPU 26's stack: task_numa_migrate --> migrate_swap --> stop_two_cpus.
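As a userspace analogy of the hinting-fault trick (this is not kernel code, just a self-contained illustration of "revoke access so the very next touch faults and tells you who touched what"):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static long  pagesz;

/* The "handle_pte_fault" moment: the fault proves the task really did
 * touch this page.  The kernel would use this to update NUMA placement
 * and maybe migrate the page or the task; we just restore access. */
static void on_fault(int sig, siginfo_t *si, void *uctx)
{
        (void)sig; (void)si; (void)uctx;
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
        struct sigaction sa = { .sa_sigaction = on_fault, .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        pagesz = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 1;                            /* ordinary access            */

        mprotect(page, pagesz, PROT_NONE);      /* like change_prot_numa()     */
        page[0] = 2;                            /* faults; handler re-enables  */
        printf("page[0] = %d\n", page[0]);
        return 0;
}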

In CPU 26's call chain, stop_two_cpus --> cpu_stop_queue_two_works --> __cpu_stop_queue_work, the function cpu_stop_queue_two_works has been inlined into stop_two_cpus, and it calls __cpu_stop_queue_work twice, so we have to use the return address pushed on the stack to tell which of the two calls we are currently in.

227 static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
228                                     int cpu2, struct cpu_stop_work *work2)
229 {
230         struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
231         struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
232         int err;
233
234         lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
235         spin_lock_irq(&stopper1->lock);     <-- note that stopper1's lock is already held here
236         spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
.....
243         __cpu_stop_queue_work(stopper1, work1);
244         __cpu_stop_queue_work(stopper2, work2);
.....
251 }

From the return address pushed on the stack:

 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
    ffff8f4957fbfaa8: 0000000100000001 ffff8f1afdf55fe8
crash> dis -l ffffffff8393037a 2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 244
0xffffffff8393037a <stop_two_cpus+394>: lea 0x48(%rsp),%rsi
0xffffffff8393037f <stop_two_cpus+399>: mov %r15,%rdi

The pushed return address corresponds to source line 244, which means the call currently in progress is the one on line 243, i.e. __cpu_stop_queue_work(stopper1, work1).

Next, look at the arguments involved:

crash> task_numa_env ffff8f4957fbfc30
struct task_numa_env {
  p = 0xffff8f4aea3a8000,
  src_cpu = 26,
  src_nid = 0,
  dst_cpu = 63,
  dst_nid = 1,
  src_stats = {
    nr_running = 11,
    load = 556,                    <-- high load
    compute_capacity = 18848,      <-- comparable capacity
    task_capacity = 18,
    has_free_capacity = 1
  },
  dst_stats = {
    nr_running = 3,
    load = 111,                    <-- low load with comparable capacity, so worth migrating to
    compute_capacity = 18847,      <-- comparable capacity
    task_capacity = 18,
    has_free_capacity = 1
  },
  imbalance_pct = 112,
  idx = 0,
  best_task = 0xffff8f1ae77dc100,  <-- the task to swap with, found via task_numa_find_cpu --> task_numa_compare --> task_numa_assign
  best_imp = 763,
  best_cpu = 1                     <-- the best CPU to swap with is CPU 1
}
crash> migration_swap_arg ffff8f4957fbfbc0
struct migration_swap_arg {
  src_task = 0xffff8f4aea3a8000,
  dst_task = 0xffff8f1ae77dc100,
  src_cpu = 26,
  dst_cpu = 1        <-- the chosen destination cpu is 1
}

So, per the cpu_stop_queue_two_works code above, CPU 26 calls try_to_wake_up() while holding CPU 26's cpu_stopper lock, and the task being woken is the per-cpu stopper thread that will carry out the migration:

static void __cpu_stop_queue_work(struct cpu_stopper *stopper, struct cpu_stop_work *work)
{
        list_add_tail(&work->list, &stopper->works);
        wake_up_process(stopper->thread);   /* in practice this wakes the per-cpu migration kthread */
}

Since the best swap target is CPU 1, it is CPU 1's migration thread that has to be woken to pull the task over.

crash> p cpu_stopper:1
per_cpu(cpu_stopper, 1) = $33 = {
  thread = 0xffff8f1bf6bb9040,   <-- the task that needs to be woken
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 1
          }
        }
      }
    }
  },
  enabled = true,
  works = {
    next = 0xffff8f4957fbfac0,
    prev = 0xffff8f4957fbfac0
  },
  stop_work = {
    list = {
      next = 0xffff8f4afbe16000,
      prev = 0xffff8f4afbe16000
    },
    fn = 0xffffffff83952100,
    arg = 0x0,
    done = 0xffff8f1ae3647c08
  }
}
crash> kmem 0xffff8f1bf6bb9040
CACHE            NAME         OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff8eecffc05f00 task_struct     4152       1604   2219    317    32k
  SLAB             MEMORY           NODE  TOTAL  ALLOCATED  FREE
  fffff26501daee00 ffff8f1bf6bb8000    1      7          7     0
  FREE / [ALLOCATED]
  [ffff8f1bf6bb9040]

     PID: 14
 COMMAND: "migration/1"            <-- the target task is CPU 1's migration kthread
    TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
     CPU: 1
   STATE: TASK_INTERRUPTIBLE (PANIC)

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff26501daee40 3076bb9000          0            0   0 6fffff00008000 tail

At this point we know that the task on CPU 26 is waking CPU 1's migration thread while holding the stopper lock. The remaining question: why does it hold that lock for so long that CPU 1's wait on it triggers the hard-lockup panic?

Let's look at why the lock is held for so long:

 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff8f1bf6bb9040  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80
crash> dis -l ffffffff838d63d2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/core.c: 1790
0xffffffff838d63d2 <try_to_wake_up+114>: mov 0x28(%r13),%eax
1721 static int
1722 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1723 {
.....
1787          * If the owning (remote) cpu is still in the middle of schedule() with
1788          * this task as prev, wait until its done referencing the task.
1789          */
1790         while (p->on_cpu)      <-- so this is where it is looping
1791                 cpu_relax();
.....
1814         return success;
1815 }
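For completeness: p->on_cpu is cleared only by the CPU that is switching away from p, and only after that context switch has fully completed. A simplified sketch of finish_lock_switch() from kernels of this era (the block comment is the kernel's own; debug and arch-specific details dropped):

/* Simplified from finish_lock_switch() in kernel/sched/core.c of this era:
 * prev->on_cpu is cleared only here, i.e. only after the CPU running prev
 * has fully switched to the next task. */
static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
{
#ifdef CONFIG_SMP
        /*
         * After ->on_cpu is cleared, the task can be moved to a different
         * CPU.  We must ensure this doesn't happen until the switch is
         * completely finished.
         */
        smp_wmb();
        prev->on_cpu = 0;
#endif
        raw_spin_unlock_irq(&rq->lock);
}

CPU 1, however, is stuck much earlier in __schedule(): idle_balance() runs before any context switch and is blocked on CPU 26's stopper lock. So migration/1->on_cpu never drops back to 0, and CPU 26 spins here forever while holding exactly the lock CPU 1 needs.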

Here is a simple diagram of the hard lockup:

 CPU 1                                        CPU 26

 schedule(.prev=migration/1)                  <numa hinting fault>
   pick_next_task()                             task_numa_migrate()
     idle_balance()                               migrate_swap()
       load_balance()                               stop_two_cpus()
         active balance                               spin_lock(stopper1->lock)
           stop_one_cpu_nowait(26)                    spin_lock(stopper26->lock)
             cpu_stop_queue_work(26)                  __cpu_stop_queue_work()
               spin_lock(stopper26->lock)               wake_up_process(migration/1)
                 -- waits: held by CPU 26                 while (p->on_cpu) cpu_relax()
                                                            -- waits for CPU 1 to finish
                                                               schedule()

Each CPU is waiting, with interrupts disabled, for something only the other can provide: CPU 1 cannot finish schedule() until CPU 26 drops the stopper lock, and CPU 26 will not drop it until CPU 1 finishes schedule(). CPU 1's side is a spin inside _raw_spin_lock_irqsave, so after long enough the NMI watchdog declares a hard lockup there.

Now compare with the upstream fix (commit 0b26351b910f, "stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock"):

 static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
-                                  struct cpu_stop_work *work)
+                                  struct cpu_stop_work *work,
+                                  struct wake_q_head *wakeq)
 {
        list_add_tail(&work->list, &stopper->works);
-       wake_up_process(stopper->thread);
+       wake_q_add(wakeq, stopper->thread);
 }
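The other half of the fix (abridged here from the upstream commit; the RHEL backport may differ in detail) is that the queued wakeups are only issued after every stopper lock has been dropped, via wake_up_q(), so try_to_wake_up() can no longer spin on p->on_cpu while a stopper lock is held:

/* Abridged sketch of cpu_stop_queue_two_works() after upstream commit
 * 0b26351b910f; locking and error-handling details elided.  The point
 * is that wake_up_q() runs with no stopper lock held. */
static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
                                    int cpu2, struct cpu_stop_work *work2)
{
        struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
        struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
        DEFINE_WAKE_Q(wakeq);                   /* wakeups are only collected here */
        int err;

        lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
        spin_lock_irq(&stopper1->lock);
        spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
        ...
        __cpu_stop_queue_work(stopper1, work1, &wakeq);
        __cpu_stop_queue_work(stopper2, work2, &wakeq);
        ...
        spin_unlock(&stopper2->lock);
        spin_unlock_irq(&stopper1->lock);
        lg_double_unlock(&stop_cpus_lock, cpu1, cpu2);

        wake_up_q(&wakeq);                      /* wake migration/N with no locks held */
        return err;
}

cpu_stop_queue_work() gets the same treatment in that commit: the wakeup is queued under the lock and wake_up_q() is called after spin_unlock_irqrestore().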

3. Reproduction

Since this is a hard lockup caused by a race condition, and the logical analysis was already conclusive, we did not spend time reproducing it. The affected machine was running a DPDK workload that, for performance reasons, was configured to run on a single NUMA node, which frequently creates NUMA imbalance. If you want to reproduce the problem, running DPDK pinned to one NUMA node in the same way should raise the odds of hitting it.

4. Workarounds and fixes

Our options were:

1. Turn off automatic NUMA balancing (for example via the kernel.numa_balancing sysctl).

2. Backport upstream Linux commit 0b26351b910f by hand.

3. Pick up the fix from CentOS: the patch was merged in kernel 3.10.0-974.el7:

 [kernel] stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock (Phil Auld) [1557061]

Red Hat also backported it to 3.10.0-957.27.2.el7.x86_64, so upgrading the CentOS kernel to 3.10.0-957.27.2.el7.x86_64 is another option.

