I used to run the development environment with Docker Compose, with MySQL also running in Docker, and everything worked fine. After the development environment moved to K3s (a lightweight Kubernetes), MySQL was killed by the OOM killer as soon as it started, so MySQL was never migrated.
Running a MySQL instance directly with kubectl is enough to reproduce the problem:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:5.7
          imagePullPolicy: IfNotPresent
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: root
          resources:
            limits:
              memory: 4G
              cpu: 500m
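Applying the manifest and watching the pod is enough to see the failure. The commands below are a suggested reproduction (assuming the manifest is saved as mysql.yaml), not copied from my original notes:

kubectl apply -f mysql.yaml
kubectl get pods -l app=mysql -w                              # pod keeps restarting
kubectl describe pod -l app=mysql | grep -A3 'Last State'     # shows Reason: OOMKilled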
dmesg shows that mysqld allocated more than 3.7 GB of memory and was then killed:
[  839.399262] Tasks state (memory values in pages):
[  839.399263] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[  839.399278] [  34888]     0 34888  4208240   974177  7962624        0          -998 mysqld
[  839.399280] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=...,mems_allowed=0,oom_memcg=/kubepods/pod..,task_memcg=/kubepods/pod../56..,task=mysqld,pid=34888,uid=0
[  839.399294] Memory cgroup out of memory: Killed process 34888 (mysqld) total-vm:16832960kB, anon-rss:3895388kB, file-rss:1320kB, shmem-rss:0kB
[  839.496988] oom_reaper: reaped process 34888 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
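As a quick sanity check on those numbers: rss is counted in 4 KiB pages, so the 974177 pages line up with the anon-rss in the kill message, while total-vm is already around 16 GB:

echo "$((974177 * 4 / 1024)) MiB resident"       # ≈ 3805 MiB, the ~3.7 GB that got killed
echo "$((16832960 / 1024 / 1024)) GiB virtual"   # total-vm ≈ 16 GiB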
My first reaction was that the MySQL configuration was at fault, so I shrank the various buffers:
[mysqld]
innodb_buffer_pool_size = 32M
innodb_buffer_pool_instances = 1
innodb_log_file_size = 64M
innodb_log_buffer_size = 8M
key_buffer_size = 16k
myisam_sort_buffer_size = 16k
max_connections = 50
open_files_limit = 4096
max_allowed_packet = 1M
table_open_cache = 16
sort_buffer_size = 512k
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
thread_cache_size = 64
query_cache_size = 0
tmp_table_size = 12M
thread_stack = 256K
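For reference, one way to get such a config into the pod (not necessarily how I wired it up at the time) is a ConfigMap mounted under /etc/mysql/conf.d:

# Hypothetical wiring; names and paths are illustrative
kubectl create configmap mysql-config --from-file=my.cnf
# then add a volume for the ConfigMap and a volumeMount at /etc/mysql/conf.d
# to the Deployment above (e.g. with kubectl edit deployment mysql)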
The problem persisted, however; none of the configuration changes made any difference, so it was time to trace or debug the process.
I did not use strace here, mainly because the process is killed right at startup: that would mean changing the image's start command and then installing and running strace inside the container. bpftrace, on the other hand, can trace system-wide from outside the container, which is far more convenient.
Memory is allocated mainly through the brk and mmap system calls, so I traced these two syscalls with bpftrace:
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_mmap
/ comm == "mysqld" /
{
  printf("%d %s addr=%ld len=%ld flags=%ld\n", pid, probe, args->addr, args->len, args->flags);
  /* printf("%s\n", ustack(perf, 10)); */
}

tracepoint:syscalls:sys_enter_brk
/ comm == "mysqld" /
{
  printf("%d %s brk %d\n", pid, probe, args->brk);
}
sudo ./mysql-oom.bt
Attaching 2 probes...
57950 tracepoint:syscalls:sys_enter_brk brk 0
57950 tracepoint:syscalls:sys_enter_mmap addr=0 len=8740 flags=2
...
...
57950 tracepoint:syscalls:sys_enter_brk brk 1699086336
57950 tracepoint:syscalls:sys_enter_mmap addr=0 len=17179869184 flags=34
We can see that the final mmap requested 16 GB in a single allocation, and the process was killed right after.
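That length is exactly 16 GiB, which also matches the total-vm figure from the dmesg output:

echo "$((17179869184 / 1024 / 1024 / 1024)) GiB"   # the len of the final mmap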
I tried to capture the user-space call stack (ustack), but it never produced anything useful; MySQL was probably built without frame pointers:
97694 tracepoint:syscalls:sys_enter_mmap addr=0 len=12288 flags=34
        7f84c662730a 0x7f84c662730a ([unknown])
97694 tracepoint:syscalls:sys_enter_mmap addr=0 len=17179869184 flags=34
        7f84c4c4064a 0x7f84c4c4064a ([unknown])
With no usable stack, the next attempt was to trace all system calls instead:
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_*
/ comm == "mysqld" /
{
  printf("%d %s\n", pid, probe);
}
...
Output:
Attaching 331 probes...
...
115490 tracepoint:syscalls:sys_enter_close
115490 tracepoint:syscalls:sys_enter_brk
115490 tracepoint:syscalls:sys_enter_newstat
115490 tracepoint:syscalls:sys_enter_getrlimit
115490 tracepoint:syscalls:sys_enter_mmap addr=0 len=17179869184 flags=34
We can see that getrlimit is called right before the final mmap, so my guess was that MySQL sizes this allocation based on a system resource limit.
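To check what limits containers actually inherit under K3s, a throwaway pod will do; this is a suggested verification step rather than something from my original notes:

# Inspect the open-file limit inside a K3s-managed container
kubectl run nofile-check --rm -it --restart=Never --image=busybox \
  -- sh -c 'ulimit -n; cat /proc/1/limits'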
There are not many places in the MySQL source code that call getrlimit directly; after ruling out ndb, innodb_memcached and libevent, only one direct call remains:
static uint set_max_open_files(uint max_file_limit)
{
  struct rlimit rlimit;
  uint old_cur;
  DBUG_ENTER("set_max_open_files");
  DBUG_PRINT("enter",("files: %u", max_file_limit));

  if (!getrlimit(RLIMIT_NOFILE,&rlimit))
  {
    old_cur= (uint) rlimit.rlim_cur;
    DBUG_PRINT("info", ("rlim_cur: %u  rlim_max: %u",
                        (uint) rlimit.rlim_cur,
                        (uint) rlimit.rlim_max));
    if (rlimit.rlim_cur == (rlim_t) RLIM_INFINITY)
      rlimit.rlim_cur = max_file_limit;
    if (rlimit.rlim_cur >= max_file_limit)
      DBUG_RETURN(rlimit.rlim_cur);             /* purecov: inspected */
    rlimit.rlim_cur= rlimit.rlim_max= max_file_limit;
    if (setrlimit(RLIMIT_NOFILE, &rlimit))
      max_file_limit= old_cur;                  /* Use original value */
    else
    {
      rlimit.rlim_cur= 0;                       /* Safety if next call fails */
      (void) getrlimit(RLIMIT_NOFILE,&rlimit);
      DBUG_PRINT("info", ("rlim_cur: %u", (uint) rlimit.rlim_cur));
      if (rlimit.rlim_cur)                      /* If call didn't fail */
        max_file_limit= (uint) rlimit.rlim_cur;
    }
  }
  DBUG_PRINT("exit",("max_file_limit: %u", max_file_limit));
  DBUG_RETURN(max_file_limit);
}
The logic here: if the system's open-file limit is RLIM_INFINITY, or is already larger than the requested max_file_limit, the function returns the system's limit rather than the requested value.
And this function is in turn called directly from only one place:
uint my_set_max_open_files(uint files)
{
  struct st_my_file_info *tmp;
  DBUG_ENTER("my_set_max_open_files");
  DBUG_PRINT("enter",("files: %u  my_file_limit: %u", files, my_file_limit));

  files+= MY_FILE_MIN;
  files= set_max_open_files(MY_MIN(files, OS_FILE_LIMIT));  // get the maximum number of open files
  if (files <= MY_NFILE)
    DBUG_RETURN(files);

  // allocate memory
  if (!(tmp= (struct st_my_file_info*) my_malloc(key_memory_my_file_info,
                                                 sizeof(*tmp) * files,
                                                 MYF(MY_WME))))
    DBUG_RETURN(MY_NFILE);

  // initialize
  /* Copy any initialized files */
  memcpy((char*) tmp, (char*) my_file_info,
         sizeof(*tmp) * MY_MIN(my_file_limit, files));
  memset((tmp + my_file_limit), 0,
         MY_MAX((int) (files - my_file_limit), 0) * sizeof(*tmp));
  my_free_open_file_info();                     /* Free if already allocated */
  my_file_info= tmp;
  my_file_limit= files;
  DBUG_PRINT("exit",("files: %u", files));
  DBUG_RETURN(files);
}
So MySQL allocates and initializes a bookkeeping entry up front for every file it could possibly open, based on the maximum number of open files. With a huge file limit, this allocation alone can request far too much memory and trigger the OOM kill.
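As a back-of-the-envelope check: the allocation in my_set_max_open_files is sizeof(st_my_file_info) * files. Assuming the container inherited an open-file limit of about 2^30 (1073741824, a typical value when no NOFILE cap is applied on a host with a raised fs.nr_open) and a 16-byte st_my_file_info, the math comes out to exactly the 16 GiB mmap seen in the trace; both figures here are assumptions, not measurements:

echo "$((1073741824 * 16 / 1024 / 1024 / 1024)) GiB"   # assumed limit × assumed entry size = 16 GiB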
Because Kubernetes currently has no way to set ulimits for containers, many Helm charts set the ulimit themselves before starting the process:
command: ["sh","-c", "ulimit -n 4096 && exec /usr/local/bin/docker-entrypoint.sh mysqld"]
K3s is started by systemd, so another fix is to edit k3s.service and cap the number of files K3s may open; the limit is passed down to containerd, and from there to the containers:
LimitNOFILE=1048576
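A systemd drop-in keeps that change separate from the packaged unit file; something along these lines (standard systemd paths, not taken from my original setup) should do it:

# Apply the limit via a drop-in instead of editing k3s.service in place
sudo mkdir -p /etc/systemd/system/k3s.service.d
printf '[Service]\nLimitNOFILE=1048576\n' | sudo tee /etc/systemd/system/k3s.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart k3s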