Problems caused by an incorrect NUMA configuration on a virtual machine

Problem description

On October 20, 2018, a virtual machine on one of our hosts triggered an OOM and was killed by the kernel, even though the host still had plenty of free memory at the time. The log in /var/log/messages is shown below:

Note

The order value in the log indicates how much memory was being requested: order=0 means a request for 2^0 pages, i.e. a single 4 KB page.
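As a quick illustration of that arithmetic (assuming the standard 4 KB page size):

# order=N is a request for 2^N contiguous 4 KB pages
echo $(( (1 << 0) * 4 ))   # order=0 -> 4 (KB)
echo $(( (1 << 3) * 4 ))   # order=3 -> 32 (KB), for comparison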

Oct 20 00:43:07  kernel: qemu-kvm invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Oct 20 00:43:07  kernel: qemu-kvm cpuset=emulator mems_allowed=1
Oct 20 00:43:07  kernel: CPU: 7 PID: 1194284 Comm: qemu-kvm Tainted: G           OE  ------------   3.10.0-327.el7.x86_64 #1
Oct 20 00:43:07  kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.5.5 08/16/2017
Oct 20 00:43:07  kernel: ffff882e328f0b80 000000008b0f4108 ffff882f6f367b00 ffffffff816351f1
Oct 20 00:43:07  kernel: ffff882f6f367b90 ffffffff81630191 ffff882e32a91980 0000000000000001
Oct 20 00:43:07  kernel: 000000000000420f 0000000000000010 ffffffff8197d740 00000000b922b922
Oct 20 00:43:07  kernel: Call Trace:
Oct 20 00:43:07  kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Oct 20 00:43:07  kernel: [<ffffffff81630191>] dump_header+0x8e/0x214
Oct 20 00:43:07  kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
Oct 20 00:43:07  kernel: [<ffffffff8116c956>] ? find_lock_task_mm+0x56/0xc0
Oct 20 00:43:07  kernel: [<ffffffff8116d616>] out_of_memory+0x4b6/0x4f0
Oct 20 00:43:07  kernel: [<ffffffff811737f5>] __alloc_pages_nodemask+0xa95/0xb90
Oct 20 00:43:07  kernel: [<ffffffff811b78ca>] alloc_pages_vma+0x9a/0x140
Oct 20 00:43:07  kernel: [<ffffffff81197655>] handle_mm_fault+0xb85/0xf50
Oct 20 00:43:07  kernel: [<ffffffff8122bb37>] ? eventfd_ctx_read+0x67/0x210
Oct 20 00:43:07  kernel: [<ffffffff81640e22>] __do_page_fault+0x152/0x420
Oct 20 00:43:07  kernel: [<ffffffff81641113>] do_page_fault+0x23/0x80
Oct 20 00:43:07  kernel: [<ffffffff8163d408>] page_fault+0x28/0x30
Oct 20 00:43:07  kernel: Mem-Info:

Oct 20 00:43:07  kernel: active_anon:87309259 inactive_anon:444334 isolated_anon:0#012 active_file:101827 inactive_file:1066463 isolated_file:0#012 unevictable:0 dirty:16777 writeback:0 unstable:0#012 free:8521193 slab_reclaimable:179558 slab_unreclaimable:138991#012 mapped:14804 shmem:1180357 pagetables:195678 bounce:0#012 free_cma:0
Oct 20 00:43:07  kernel: Node 1 Normal free:44244kB min:45096kB low:56368kB high:67644kB active_anon:194740280kB inactive_anon:795780kB active_file:80kB inactive_file:100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198168156kB mlocked:0kB dirty:4kB writeback:0kB mapped:2500kB shmem:2177236kB slab_reclaimable:158548kB slab_unreclaimable:199088kB kernel_stack:109552kB pagetables:478460kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:301 all_unreclaimable? yes
Oct 20 00:43:07  kernel: lowmem_reserve[]: 0 0 0 0
Oct 20 00:43:07  kernel: Node 1 Normal: 10147*4kB (UEM) 22*8kB (UE) 3*16kB (U) 11*32kB (UR) 8*64kB (R) 6*128kB (R) 2*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 44492kB
Oct 20 00:43:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07  kernel: 2349178 total pagecache pages
Oct 20 00:43:07  kernel: 0 pages in swap cache
Oct 20 00:43:07  kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 20 00:43:07  kernel: Free swap  = 0kB
Oct 20 00:43:07  kernel: Total swap = 0kB
Oct 20 00:43:07  kernel: 100639322 pages RAM
Oct 20 00:43:07  kernel: 0 pages HighMem/MovableOnly
Oct 20 00:43:07  kernel: 1646159 pages reserved
Oct 20 00:43:07  kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

Oct 20 00:43:07  kernel: Out of memory: Kill process 1409878 (qemu-kvm) score 666 or sacrifice child
Oct 20 00:43:07  kernel: Killed process 1409878 (qemu-kvm) total-vm:136850144kB, anon-rss:133909332kB, file-rss:4724kB
Oct 20 00:43:30  libvirtd: 2018-10-19 16:43:30.303+0000: 81546: error : qemuMonitorIO:705 : internal error: End of file from qemu monitor
Oct 20 00:43:30  systemd-machined: Machine qemu-7-c2683281-6cbd-4100-ba91-e221ed06ee60 terminated.
Oct 20 00:43:30  kvm: 6 guests now active

The log above omits the detailed meminfo output and the per-process memory usage table.

The log shows that free memory in the Node 1 Normal zone was down to only about 44 MB, which is what triggered the OOM, while node 0 actually still had plenty of unused memory at the time. The process that triggered the OOM was the qemu-kvm process with PID 1194284. By searching the logs we traced the problem to the VM 25913bd0-d869-4310-ab53-8df6855dd258, and inspecting that VM's XML file showed the following NUMA memory configuration:

<numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>

The information obtained through the virsh client is as follows:

virsh # numatune 25913bd0-d869-4310-ab53-8df6855dd258
numa_mode      : strict
numa_nodeset   : 1

It turns out that when the mode is strict and placement is auto, the process works out a "suitable" NUMA node for the VM and pins it there. As a result this VM's memory was confined to node 1, and once node 1's memory was exhausted the OOM killer was triggered.
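The binding can also be verified from the kernel's point of view. A quick check (a sketch; the PID is the qemu-kvm PID from the log above):

# NUMA nodes this qemu-kvm process is allowed to allocate from
grep -i mems_allowed_list /proc/1194284/status

# cpuset cgroup that libvirt placed the process in
cat /proc/1194284/cpuset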

The Red Hat documentation describes the numatune mode options as follows:

See the link to the official documentation.

  • strict

A strict policy means that the allocation fails if the memory cannot be allocated on the target node. When a NUMA nodeset is specified without defining a memory mode, strict is the default.

  • interleave

Memory pages are allocated across the specified set of nodes, following a round-robin method.

  • preferred

Memory is allocated from a single preferred node; if the preferred node does not have enough memory, memory is allocated from other nodes.

Important note

If memory is overcommitted in strict mode and the guest does not have enough swap space, the kernel will kill guest processes to obtain additional memory. Red Hat therefore recommends using preferred with a single node (for example, nodeset='0') to avoid this situation.
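Following that recommendation, the numatune block would look something like the sketch below (nodeset='0' is only an example value; it should be whichever node you actually want the guest to prefer):

<numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>

For an already-defined domain, the same setting can be written into the persistent configuration with virsh numatune <domain> --mode preferred --nodeset 0 --config; the mode change takes effect the next time the domain is started.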

Reproducing the problem

We took a new host, created a virtual machine on it, and modified the VM's numatune configuration to test how the strict and preferred modes behave under the following three configurations:

The interleave mode allocates memory across nodes and will certainly perform worse than the other two, and what we mainly wanted to test is whether strict and preferred trigger an OOM when a single node's memory is completely used up, so interleave was left out of the test.

The three configurations

Configuration 1

mode is strict, placement is auto

<numatune>
    <memory mode='strict' placement='auto'/>
  </numatune>

Configuration 2

mode is preferred, placement is auto

<numatune>
    <memory mode='preferred' placement='auto'/>
  </numatune>

Configuration 3

mode is strict, nodeset is 0-1

<numatune>
    <memory mode='strict' nodeset='0-1'/>
  </numatune>

具體操做

Fill up the memory of a single NUMA node on the host with memholder (a tool shipped in the ssplatform2-tools rpm package); the exact command was numactl -i 0 memholder 64000 &. Then run memholder inside the VM as well, and watch how the VM's memory gets distributed across the NUMA nodes as its usage keeps climbing.
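memholder is an internal tool; if it is not available, a rough substitute (an assumption on our part, not what was used in this test) is to bind any memory hog to a single node with numactl, for example with stress-ng:

# Fill most of node 1's memory, allocating from that node only
# (assumes stress-ng is installed; size --vm-bytes to the node's capacity)
numactl --membind=1 stress-ng --vm 1 --vm-bytes 60G --vm-keep &

# Watch per-node free memory while it runs
watch -n 5 'numactl --hardware | grep free'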

Test results

Configuration 1

The information obtained on the virsh client side is shown below. Even though placement is auto, the qemu-kvm process still ended up settled on a single node:

virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode      : strict
numa_nodeset   : 1

The initial memory usage of the VMs:

# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      0    693   694
1764062 (qemu-kv      0    366   366
---------------  ------ ------ -----
Total                 1   1060  1060

Host memory usage after filling node 1 with memholder:

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58476 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 64 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

After memholder was started inside the VM and began consuming memory, the VM's memory usage looked like this:

numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      6    685   692
1764062 (qemu-kv      7   4670  4677
---------------  ------ ------ -----
Total                13   5355  5368

Host memory usage:

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58650 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 52181 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

At this point we found that the kvm process had already triggered an OOM, and the memholder process occupying memory on the host had been killed by the kernel, so the host memory was freed up again.
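If it is not obvious which process the kernel sacrificed, its own record can be pulled straight out of the log (assuming the default /var/log/messages location):

grep 'Killed process' /var/log/messages | tail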

The log in /var/log/messages:

Nov 13 21:07:07  kernel: qemu-kvm invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Nov 13 21:07:07  kernel: qemu-kvm cpuset=emulator mems_allowed=1
Nov 13 21:07:07  kernel: CPU: 28 PID: 1332894 Comm: qemu-kvm Not tainted 4.4.36-1.el7.elrepo.x86_64 #1

Nov 13 21:07:07  kernel: Mem-Info:
Nov 13 21:07:07  kernel: active_anon:1986423 inactive_anon:403229 isolated_anon:0#012 active_file:116773 inactive_file:577075 isolated_file:0#012 unevictable:14364416 dirty:142 writeback:0 unstable:0#012 slab_reclaimable:61182 slab_unreclaimable:296489#012 mapped:14400991 shmem:15542531 pagetables:35749 bounce:0#012 free:14983912 free_pcp:0 free_cma:0
Nov 13 21:07:07  kernel: Node 1 Normal free:44952kB min:45120kB low:56400kB high:67680kB active_anon:5485032kB inactive_anon:1571408kB active_file:308kB inactive_file:0kB unevictable:57286820kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66044484kB mlocked:57286820kB dirty:48kB writeback:0kB mapped:57330444kB shmem:61948048kB slab_reclaimable:143752kB slab_unreclaimable:1107004kB kernel_stack:16592kB pagetables:129312kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2248 all_unreclaimable? yes
Nov 13 21:07:07  kernel: lowmem_reserve[]: 0 0 0 0
Nov 13 21:07:07  kernel: Node 1 Normal: 1018*4kB (UME) 312*8kB (UE) 155*16kB (UE) 34*32kB (UE) 293*64kB (UM) 53*128kB (U) 5*256kB (U) 1*512kB (U) 1*1024kB (E) 2*2048kB (UM) 2*4096kB (M) = 50776kB
Nov 13 21:07:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07  kernel: 16236582 total pagecache pages
Nov 13 21:07:07  kernel: 0 pages in swap cache
Nov 13 21:07:07  kernel: Swap cache stats: add 0, delete 0, find 0/0
Nov 13 21:07:07  kernel: Free swap  = 0kB
Nov 13 21:07:07  kernel: Total swap = 0kB
Nov 13 21:07:07  kernel: 33530456 pages RAM
Nov 13 21:07:07  kernel: 0 pages HighMem/MovableOnly
Nov 13 21:07:07  kernel: 551723 pages reserved
Nov 13 21:07:07  kernel: 0 pages hwpoisoned

The process we were testing was 1764062, but the process that triggered the OOM was 1332894. The VM behind that process also uses configuration 1, and the nodeset reported by the virsh client is also 1:

virsh # numatune c11a155a-95b0-4593-9ce5-f2a42dc0ccca
numa_mode      : strict
numa_nodeset   : 1
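To map a qemu-kvm PID such as 1332894 back to the libvirt domain it belongs to, one possible approach (a sketch; /var/run/libvirt/qemu/ is the typical pidfile directory on RHEL 7 and may differ elsewhere) is:

# Find which domain owns a given qemu-kvm PID via libvirt's pidfiles
PID=1332894
for f in /var/run/libvirt/qemu/*.pid; do
    if [ "$(cat "$f")" = "$PID" ]; then
        echo "domain: $(basename "$f" .pid)"
    fi
done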

Configuration 2

The VM's numatune as reported by the virsh client:

virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode      : preferred
numa_nodeset   : 1

The initial memory usage of the VMs:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      6    691   698
1897916 (qemu-kv     17    677   694
---------------  ------ ------ -----
Total                24   1368  1392

Host memory usage after filling node 1 with memholder:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58403 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 56 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

After memholder was started inside the VM and began consuming memory, the VM's memory usage:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      7    690   697
1897916 (qemu-kv   4012    682  4695
---------------  ------ ------ -----
Total              4019   1372  5391

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54395 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

From this behavior we can see that although the preferred node was node 1, once node 1 ran short of memory the process allocated memory from node 0 instead, and no OOM was triggered.

Configuration 3

Note: process 1308480 is the qemu-kvm process of the VM under test.

The initial memory usage of the VMs:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1308480 (qemu-kv    141    584   725
1332894 (qemu-kv      0    707   708
---------------  ------ ------ -----
Total               141   1291  1432

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58241 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 131 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 


After memholder was started inside the VM and began consuming memory, the VM's memory usage:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1308480 (qemu-kv   4017    682  4699
1332894 (qemu-kv      7    681   688
---------------  ------ ------ -----
Total              4024   1363  5387

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54410 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

Summary

Judging from these tests, neither configuration 2 nor configuration 3 leads to an OOM caused by unbalanced memory usage between the two NUMA nodes, but which configuration performs better still needs follow-up testing.
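As a starting point for that follow-up, one way to compare the configurations (a sketch; the perf events and the PID are assumptions, not something we have measured) is to check how much guest memory ends up off-node and how often memory accesses miss the local node while a memory-heavy workload runs in the guest:

# Per-node memory footprint of each qemu-kvm process
numastat -c qemu-kvm

# Local vs. remote node memory accesses for one VM over 60 seconds
perf stat -e node-loads,node-load-misses -p 1308480 -- sleep 60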

