docker run hangs: troubleshooting notes

I. Problem Description

  Over the past couple of days I ran into a very strange problem. The complete description of the failure is as follows:

1) It started with a colleague reporting that a worker node in the k8s cluster had gone NotReady. The kubelet error log on that node was full of entries like these:

E0603 01:50:51.455117   76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.456039   76268 remote_runtime.go:332] ExecSync e86c1b8d460ae2dfbb3fa0369e1ba6308962561f6c7b1076da35ff1db229ebc6 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523473   76268 remote_runtime.go:332] ExecSync dfddd3a462cf2d81e10385c6d30a1b6242961496db59b9d036fda6c477725c6a '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523491   76268 remote_runtime.go:332] ExecSync a6e8011a7f4a32d5e733ae9c0da58a310059051feb4d119ab55a387e46b3e7cd '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523494   76268 remote_runtime.go:332] ExecSync 0f85e0370a366a4ea90f7f21db2fc592a7e4cf817293097b36607a748191e195 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.935857   76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053326   76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053328   76268 remote_runtime.go:332] ExecSync a944b50db75702b200677511b8e44d839fa185536184812145010859fe4dbe57 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:53.035958   76268 remote_runtime.go:332] ExecSync 5bca3245ed12b9c470cce5b48490839761a021640e7cf97cbf3e749c3a81f488 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:54.438308   76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478244   76268 remote_runtime.go:332] ExecSync c09247eb9167dfc9f0956a5de23f5371c95a030b0eaafdf8518bc494c41bea9f 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478529   76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.955916   76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:04.668234   76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:07.306240   76268 remote_runtime.go:332] ExecSync 08807433ab5376c75501f9330a168a87734c0f738708e1c423ff4de69245d604 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:17.296389   76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:37.267301   76268 remote_runtime.go:332] ExecSync e5e029786289b2efe8c0ddde19283e0e36fc85c235704b2bbe9133fb520cb57c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:49.835358   76268 remote_runtime.go:332] ExecSync ee846bc29ffbd70e5a7231102e5fd85929cdac9019d97303b12510a89f0743d8 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:52.468602   76268 remote_runtime.go:332] ExecSync 4ca67d88a771ef0689c206a2ea706770b75889fddedf0d38e0ce016ac54c243d '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:05.470375   76268 remote_runtime.go:332] ExecSync 165d53f51c0e611e95882cd2019ef6893de63eaab652df77e055d8f3b17e161e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:07.475034   76268 remote_runtime.go:115] StopPodSandbox "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:07.475126   76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59"}
E0603 01:52:07.475208   76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:07.475270   76268 pod_workers.go:186] Error syncing pod 1b4efdb0-82c5-11e9-bae1-005056a23aab ("app-2034f7b2f71a91f71d2ac3115ba33a4afe9dfe27-1-59747f99cf-zv75k_maxhub-fat-fat(1b4efdb0-82c5-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:20.880257   76268 remote_runtime.go:115] StopPodSandbox "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:20.880367   76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348"}
E0603 01:52:20.880455   76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:20.880472   76268 pod_workers.go:186] Error syncing pod 98adf988-840f-11e9-bae1-005056a23aab ("app-f8a857f59f6784bb87ed44c2cd13d86e0663bd29-2-68dd78fc7f-h7qq4_project-394f23ca5e64aad710030c7c78981ec294a1bf59(98adf988-840f-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
, failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
]
E0603 01:52:21.672344   76268 remote_runtime.go:332] ExecSync cdb69e42aa1c2f261c1b30a9d4e511ec2be2f50050938f943fd714bfad71f44b 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.132342   76268 remote_runtime.go:332] ExecSync c1e134e598dae5dcd439c036b13d289add90726b32fe90acda778b524b68f01c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.362812   76268 remote_runtime.go:332] ExecSync 8881290b09a1f88d8b323a9be1236533ac6750a58463a438a45a1cd9c44aa7b3 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.649141   76268 remote_runtime.go:332] ExecSync ba1af801f817bc3cba324b5d14af7215acbff2f79e5b204bd992a3203c288d9e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.875760   76268 remote_runtime.go:332] ExecSync 3a04819fc488f5bb1d7954a00e33a419286accadc0c7aa739c7b81f264d7c3c0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.876992   76268 remote_runtime.go:332] ExecSync f61dfa21713d74f9f8c72df9a13b96a662feb1582f84b910204870c05443cfe0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

2) The messages log contained the following dockerd errors:

jun  4 11:10:16 k8s-node145 dockerd: time="2019-06-04T11:10:16.894554055+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.453079842+08:00" level=info msg="shim reaped" id=19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458578126+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458628597+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.500849138+08:00" level=error msg="19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd cleanup: failed to delete container from containerd: no such container"
Jun  4 11:15:27 k8s-node145 dockerd: time="2019-06-04T11:15:27.809076915+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.252794583+08:00" level=info msg="shim reaped" id=226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257559564+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257611410+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.291278605+08:00" level=error msg="226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3 cleanup: failed to delete container from containerd: no such container"
Jun  4 11:15:39 k8s-node145 dockerd: time="2019-06-04T11:15:39.794587143+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.077775311+08:00" level=info msg="shim reaped" id=e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0
Jun  4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.079700724+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  4 11:16:57 k8s-node145 dockerd: time="2019-06-04T11:16:57.262180392+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/16ea66bd6a288acaf44b98179f5d1533ae0e5df683d8e6bcfff9b19d8840b6c5/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:04 k8s-node145 dockerd: time="2019-06-04T11:17:04.279961690+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.634709458+08:00" level=info msg="shim reaped" id=f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea
Jun  4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.636388105+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.241859584+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.980239680+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/5cdd5bf269b7b08e2a8f971e386dd52b398fd7f4d8a7c5b70276e8386a980343/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:25:31 k8s-node145 dockerd: time="2019-06-04T11:25:31.821280121+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0/shim.sock" debug=false pid=k8s-node145 
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.330601768+08:00" level=info msg="shim reaped" id=b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868161+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868997+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun  4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.374385142+08:00" level=error msg="b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0 cleanup: failed to delete container from containerd: no such container"
Jun  4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.918871781+08:00" level=info msg="shim reaped" id=e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554
Jun  4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.926022215+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

3) The docker service itself was up. Checking container status with docker ps -a showed that newly created containers were stuck in the Created state, meaning creation was failing.
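
A quick way to isolate just the stuck containers is to filter docker ps by status. This was not part of the original transcript, just a minimal sketch:

# list only containers stuck in the Created state, with name, image and status
docker ps -a --filter status=created --format 'table {{.ID}}\t{{.Names}}\t{{.Image}}\t{{.Status}}'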

4) Creating a container by hand produced the output below. It seemed to get stuck reading the docker daemon's response, even though the docker service was in the running state and docker ps worked normally.

# strace docker run --rm registry.gz.cvte.cn/egg-demo/dev:dev-635f82b ls
futex(0x56190f2b6490, FUTEX_WAKE, 1) = 1
read(3, "HTTP/1.1 201 Created\r\nApi-Versio"..., 4096) = 297
futex(0xc4204d6548, FUTEX_WAKE, 1) = 1
read(3, 0xc420639000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x56190f2b70e8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc4204ef548, FUTEX_WAKE, 1) = 1
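
The strace output shows the create call itself coming back with HTTP/1.1 201 Created, after which the client blocks waiting for the next response, which points at the start step rather than the HTTP API itself. One way to confirm the API endpoint still answers is to hit the unix socket directly (a sketch only, not from the original session; it needs a curl build with --unix-socket support, and v1.39 is the API version matching docker 18.09):

# query the daemon API directly over the socket; a prompt reply means the API loop is alive
curl -s --unix-socket /var/run/docker.sock 'http://localhost/v1.39/containers/json?all=1' | head -c 300; echo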

5) Looking at the messages log again, the following error kept repeating:

Jun  4 10:42:01 k8s-node145 systemd-logind: Failed to start session scope session-413369.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:43:01 k8s-node145 systemd-logind: Failed to start session scope session-413370.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:44:01 k8s-node145 systemd-logind: Failed to start session scope session-413371.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413372.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413373.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:46:01 k8s-node145 systemd-logind: Failed to start session scope session-413374.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:47:01 k8s-node145 systemd-logind: Failed to start session scope session-413375.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:48:01 k8s-node145 systemd-logind: Failed to start session scope session-413376.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:49:01 k8s-node145 systemd-logind: Failed to start session scope session-413377.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413378.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413379.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:51:01 k8s-node145 systemd-logind: Failed to start session scope session-413380.scope: The maximum number of pending replies per connection has been reached
Jun  4 10:52:01 k8s-node145 systemd-logind: Failed to start session scope session-413381.scope: The maximum number of pending replies per connection has been reached
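
These errors indicate that dbus-daemon is refusing new method calls because some connection has too many replies outstanding. A quick way to check whether the system bus still answers at all is a minimal dbus-send probe (a sketch, not something run during the original investigation):

# ask the bus daemon for its list of connected names; a hang or error here points at dbus rather than docker
dbus-send --system --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | head -n 20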

II. Troubleshooting

1) Searching on the dockerd error from the messages log (msg="stream copy error: reading from a closed fifo"), some people have hit something similar because the container's resource limits were set too low and the container process got OOM-killed. That should not cause the docker run hang I was seeing here, though, so this lead went nowhere.
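
For completeness, the OOM theory is easy to rule out (a sketch; the container ID is a placeholder):

# check whether the kernel OOM killer has fired recently
dmesg -T | grep -iE 'out of memory|oom-killer|killed process' | tail
# check whether docker recorded an OOM kill for a suspect container
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container-id>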

2) Others reported it as a docker bug, so I checked the docker version installed on this node:

docker-ce.x86_64                        3:18.09.2-3.el7                installed
docker-ce-cli.x86_64                    1:18.09.5-3.el7                installed

  That is not a version from the docker-ce-stable repo. For comparison, the docker version on the master node:

docker-ce-17.03.2.ce-1.el7.centos.x86_64
docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch

  The versions don't even match...

3) I restarted the docker service on the faulty worker node and tested again. The kubelet service came back to normal, but docker ps -a showed three DaemonSet pods still stuck in Created, so something was clearly still wrong.

4) Four hours into the incident and out of better ideas, I decided to switch docker versions and test, picking the following from the docker-ce-stable repo:

docker-ce-18.09.6-3.el7.x86_64
docker-ce-cli-18.09.6-3.el7.x86_64

  docker run still failed, and containers remained in the Created state.
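
For the record, the version swap itself was roughly the following (a sketch; confirm the exact package names against the repo first):

# list the builds available in docker-ce-stable
yum list docker-ce --showduplicates | sort -r
# replace the mismatched packages with a matching pair
yum remove -y docker-ce docker-ce-cli
yum install -y docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6-3.el7
systemctl restart docker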

5) At this point docker info on the node reported:

Containers: 36
 Running: 31
 Paused: 0
 Stopped: 5
Images: 17
Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.14.4.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 251.4GiB
Name: k8s-172-17-84-144
ID: XQYD:6IMZ:IGRL:L4TO:J53F:GYMA:VCWL:2DCT:YZVA:RHAQ:MT2D:F6Q7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

6) Could the storage driver be involved? I edited the docker.service file to drop the -s overlay2 --storage-opt overlay2.override_kernel_check=true startup options and started docker with the overlay storage driver instead. Fortune-telling, as expected, is not a reliable debugging technique.
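
The change amounted to editing the ExecStart line in the unit file and reloading, roughly like this (a sketch; the unit file path and the remaining flags on this node are assumptions):

# in /usr/lib/systemd/system/docker.service (or a drop-in), change
#   ExecStart=/usr/bin/dockerd -s overlay2 --storage-opt overlay2.override_kernel_check=true ...
# to
#   ExecStart=/usr/bin/dockerd ...
systemctl daemon-reload
systemctl restart docker
docker info | grep -A1 'Storage Driver'    # confirm which driver is actually in use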

7) With nothing else left to try, I went for the blunt approach: uninstall docker, rename the /var/lib/docker, /var/lib/docker-engine and /var/run/docker directories, and reinstall docker. The problem was still there. More fortune-telling.

8) After steps 6 and 7, my gut said the problem lay with the system itself. So I went back to the earlier "The maximum number of pending replies per connection has been reached" error. It comes from systemd-logind, but could the two be related? I had never seen this error before, so I googled it.

9) The search turned up the following:

On 15/06/16 19:05, marcin at saepia.net wrote:
> I have recently started to get the error response
>
> "The maximum number of pending replies per connection has been reached"
>
> to my method calls.

The intention of this maximum is to prevent denial-of-service by a bus client. The dbus-daemon allows exactly one reply to each message that expects a reply, therefore it must allocate memory every time it receives a message that expects a reply, to record that fact. That memory can be freed when it sees the reply, or when the process from which it expects a reply disconnects (therefore there can be no reply and there is no longer any point in tracking/allowing it). To avoid denial of service, the dbus-daemon limits the amount of memory that it is prepared to allocate on behalf of any particular client. The limit is relatively small for the system bus, very large for the session bus, and configurable (look for max_replies_per_connection in /etc/dbus-1/session.conf).

So this looks like a limit the system imposes to stop a client from tying up too much memory and causing a denial of service. Let's see which package /etc/dbus-1/session.conf belongs to and which files that package ships:

[root@k8s-node-145 eden]# rpm -qf /etc/dbus-1/session.conf
dbus-1.10.24-7.el7.x86_64
[root@k8s-172-17-84-144 eden]# rpm -ql dbus-1.10.24-7.el7.x86_64
/etc/dbus-1
/etc/dbus-1/session.conf
/etc/dbus-1/session.d
/etc/dbus-1/system.conf
/etc/dbus-1/system.d
/run/dbus
/usr/bin/dbus-cleanup-sockets
/usr/bin/dbus-daemon
/usr/bin/dbus-monitor
/usr/bin/dbus-run-session
/usr/bin/dbus-send
/usr/bin/dbus-test-tool
/usr/bin/dbus-update-activation-environment
/usr/bin/dbus-uuidgen
/usr/lib/systemd/system/dbus.service
/usr/lib/systemd/system/dbus.socket
/usr/lib/systemd/system/messagebus.service
/usr/share/dbus-1/session.conf

The end of /usr/share/dbus-1/session.conf contains a max_replies_per_connection parameter that matches the error in the messages log. Could this limit be the cause? The default is 50000; I changed it to 100000 and restarted dbus.service to find out.
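
Concretely, the change was along these lines (a sketch, assuming the limit is declared directly in /usr/share/dbus-1/session.conf as it was on this node):

# show the current limit, keep a backup, then bump the value to 100000
grep max_replies_per_connection /usr/share/dbus-1/session.conf
cp /usr/share/dbus-1/session.conf /usr/share/dbus-1/session.conf.bak
sed -i 's|\(<limit name="max_replies_per_connection">\)[0-9]*\(</limit>\)|\1100000\2|' /usr/share/dbus-1/session.conf
systemctl restart dbus.service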

10) I ran docker run again and it succeeded. Which raises a question: the other nodes also have max_replies_per_connection set to 50000, so what triggered the problem on this one? I changed max_replies_per_connection back to 50000, restarted dbus.service again, and docker run still worked fine. All that's left is to wait and see whether the problem comes back.
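
In the meantime, something as simple as the following can be run periodically to catch a recurrence early (a sketch, assuming rsyslog still writes to /var/log/messages):

# count occurrences of the dbus error so far, and check the last hour of systemd-logind output
grep -c 'maximum number of pending replies' /var/log/messages
journalctl -u systemd-logind --since '1 hour ago' | grep -i 'pending replies' || echo 'no recurrence in the last hour'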
