I ran into a really bizarre problem over the past couple of days. Here is the full description of the failure:
1) It started with a colleague reporting that a worker node in our k8s cluster had gone into the NotReady state. The kubelet error log on that node was full of entries like these:
E0603 01:50:51.455117 76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.456039 76268 remote_runtime.go:332] ExecSync e86c1b8d460ae2dfbb3fa0369e1ba6308962561f6c7b1076da35ff1db229ebc6 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523473 76268 remote_runtime.go:332] ExecSync dfddd3a462cf2d81e10385c6d30a1b6242961496db59b9d036fda6c477725c6a '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523491 76268 remote_runtime.go:332] ExecSync a6e8011a7f4a32d5e733ae9c0da58a310059051feb4d119ab55a387e46b3e7cd '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.523494 76268 remote_runtime.go:332] ExecSync 0f85e0370a366a4ea90f7f21db2fc592a7e4cf817293097b36607a748191e195 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:51.935857 76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053326 76268 remote_runtime.go:332] ExecSync 45dab41f28be2b8c789a789774d0b8d1117c95e5e3ccbe8f0144146409239e03 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:52.053328 76268 remote_runtime.go:332] ExecSync a944b50db75702b200677511b8e44d839fa185536184812145010859fe4dbe57 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:53.035958 76268 remote_runtime.go:332] ExecSync 5bca3245ed12b9c470cce5b48490839761a021640e7cf97cbf3e749c3a81f488 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:50:54.438308 76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478244 76268 remote_runtime.go:332] ExecSync c09247eb9167dfc9f0956a5de23f5371c95a030b0eaafdf8518bc494c41bea9f 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.478529 76268 remote_runtime.go:332] ExecSync 95341ccee3fa0ba35923d5e7cda051dd395e328ff0b7bdd8c392395e212f7b6b 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:00.955916 76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:04.668234 76268 remote_runtime.go:332] ExecSync 1f0e3ac13faf224129bc48a35d515700403e46b094242867ce8f2b7ab981f74e 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:07.306240 76268 remote_runtime.go:332] ExecSync 08807433ab5376c75501f9330a168a87734c0f738708e1c423ff4de69245d604 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:17.296389 76268 remote_runtime.go:332] ExecSync 3cbb0f53c0f2f8cfe320f54a6f94527b31664465df68c6df16ab269ce16e3871 'ls' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:37.267301 76268 remote_runtime.go:332] ExecSync e5e029786289b2efe8c0ddde19283e0e36fc85c235704b2bbe9133fb520cb57c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:49.835358 76268 remote_runtime.go:332] ExecSync ee846bc29ffbd70e5a7231102e5fd85929cdac9019d97303b12510a89f0743d8 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:51:52.468602 76268 remote_runtime.go:332] ExecSync 4ca67d88a771ef0689c206a2ea706770b75889fddedf0d38e0ce016ac54c243d '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:05.470375 76268 remote_runtime.go:332] ExecSync 165d53f51c0e611e95882cd2019ef6893de63eaab652df77e055d8f3b17e161e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:07.475034 76268 remote_runtime.go:115] StopPodSandbox "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:07.475126 76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "c3fe3fbdae2ef09fff929878050d46852126100017a299a5bf9f2c7d7aaf0f59"}
E0603 01:52:07.475208 76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded" ]
E0603 01:52:07.475270 76268 pod_workers.go:186] Error syncing pod 1b4efdb0-82c5-11e9-bae1-005056a23aab ("app-2034f7b2f71a91f71d2ac3115ba33a4afe9dfe27-1-59747f99cf-zv75k_maxhub-fat-fat(1b4efdb0-82c5-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillPodSandbox" for "1b4efdb0-82c5-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded" ]
E0603 01:52:20.880257 76268 remote_runtime.go:115] StopPodSandbox "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0603 01:52:20.880367 76268 kuberuntime_manager.go:799] Failed to stop sandbox {"docker" "d84fd54b92406166ae162712e40139f6a7a898c9f8d8c8297c69f569b9542348"}
E0603 01:52:20.880455 76268 kubelet.go:1540] error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded" ]
E0603 01:52:20.880472 76268 pod_workers.go:186] Error syncing pod 98adf988-840f-11e9-bae1-005056a23aab ("app-f8a857f59f6784bb87ed44c2cd13d86e0663bd29-2-68dd78fc7f-h7qq4_project-394f23ca5e64aad710030c7c78981ec294a1bf59(98adf988-840f-11e9-bae1-005056a23aab)"), skipping: error killing pod: [failed to "KillContainer" for "container" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillContainer" for "logtail" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded" , failed to "KillPodSandbox" for "98adf988-840f-11e9-bae1-005056a23aab" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded" ]
E0603 01:52:21.672344 76268 remote_runtime.go:332] ExecSync cdb69e42aa1c2f261c1b30a9d4e511ec2be2f50050938f943fd714bfad71f44b 'ps' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.132342 76268 remote_runtime.go:332] ExecSync c1e134e598dae5dcd439c036b13d289add90726b32fe90acda778b524b68f01c '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:22.362812 76268 remote_runtime.go:332] ExecSync 8881290b09a1f88d8b323a9be1236533ac6750a58463a438a45a1cd9c44aa7b3 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.649141 76268 remote_runtime.go:332] ExecSync ba1af801f817bc3cba324b5d14af7215acbff2f79e5b204bd992a3203c288d9e '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.875760 76268 remote_runtime.go:332] ExecSync 3a04819fc488f5bb1d7954a00e33a419286accadc0c7aa739c7b81f264d7c3c0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
E0603 01:52:23.876992 76268 remote_runtime.go:332] ExecSync f61dfa21713d74f9f8c72df9a13b96a662feb1582f84b910204870c05443cfe0 '/etc/init.d/ilogtaild status' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
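For anyone checking a node in this state, a few commands confirm the symptom quickly. This is a minimal sketch, assuming the node name is k8s-node145 (the hostname seen later in the messages log) and that kubelet runs under systemd:

# from a machine with kubectl access: confirm the node condition
kubectl get nodes
kubectl describe node k8s-node145 | grep -A 10 'Conditions:'
# on the affected node: check kubelet itself and count the runtime timeouts
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | grep -c 'context deadline exceeded'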
2) Checking the messages log, the dockerd entries contained the following errors:
Jun 4 11:10:16 k8s-node145 dockerd: time="2019-06-04T11:10:16.894554055+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd/shim.sock" debug=false pid=k8s-node145
Jun 4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.453079842+08:00" level=info msg="shim reaped" id=19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd
Jun 4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458578126+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.458628597+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:10:17 k8s-node145 dockerd: time="2019-06-04T11:10:17.500849138+08:00" level=error msg="19f6f6b5c883112a0e8501364e282127b419524872665c6ad148d0973f9a46fd cleanup: failed to delete container from containerd: no such container"
Jun 4 11:15:27 k8s-node145 dockerd: time="2019-06-04T11:15:27.809076915+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3/shim.sock" debug=false pid=k8s-node145
Jun 4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.252794583+08:00" level=info msg="shim reaped" id=226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3
Jun 4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257559564+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.257611410+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:15:28 k8s-node145 dockerd: time="2019-06-04T11:15:28.291278605+08:00" level=error msg="226c09d6f3cee649e3b1a912990b2d79cc4f8dcdd75751aa53906fe151e314a3 cleanup: failed to delete container from containerd: no such container"
Jun 4 11:15:39 k8s-node145 dockerd: time="2019-06-04T11:15:39.794587143+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0/shim.sock" debug=false pid=k8s-node145
Jun 4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.077775311+08:00" level=info msg="shim reaped" id=e9e91349ffaf0b89bf35740e3af34cb4e922e0af7d6559e9e1a4387943ae0fd0
Jun 4 11:16:31 k8s-node145 dockerd: time="2019-06-04T11:16:31.079700724+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 4 11:16:57 k8s-node145 dockerd: time="2019-06-04T11:16:57.262180392+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/16ea66bd6a288acaf44b98179f5d1533ae0e5df683d8e6bcfff9b19d8840b6c5/shim.sock" debug=false pid=k8s-node145
Jun 4 11:17:04 k8s-node145 dockerd: time="2019-06-04T11:17:04.279961690+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea/shim.sock" debug=false pid=k8s-node145
Jun 4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.634709458+08:00" level=info msg="shim reaped" id=f051aa4bdb94080d887466a926054c560216aa293c0ca8058e8479616fbcfcea
Jun 4 11:17:05 k8s-node145 dockerd: time="2019-06-04T11:17:05.636388105+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.241859584+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554/shim.sock" debug=false pid=k8s-node145
Jun 4 11:17:07 k8s-node145 dockerd: time="2019-06-04T11:17:07.980239680+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/5cdd5bf269b7b08e2a8f971e386dd52b398fd7f4d8a7c5b70276e8386a980343/shim.sock" debug=false pid=k8s-node145
Jun 4 11:25:31 k8s-node145 dockerd: time="2019-06-04T11:25:31.821280121+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0/shim.sock" debug=false pid=k8s-node145
Jun 4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.330601768+08:00" level=info msg="shim reaped" id=b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0
Jun 4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868161+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.335868997+08:00" level=error msg="stream copy error: reading from a closed fifo"
Jun 4 11:25:32 k8s-node145 dockerd: time="2019-06-04T11:25:32.374385142+08:00" level=error msg="b99289ee12a554ab3d2a1fece92979c2d02dcc31411f614694a49872d4baa8e0 cleanup: failed to delete container from containerd: no such container"
Jun 4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.918871781+08:00" level=info msg="shim reaped" id=e3414b19ea4332ff3faab7ef17926172a31177acd9e2ca2ba4e2cc11f679b554
Jun 4 11:26:16 k8s-node145 dockerd: time="2019-06-04T11:26:16.926022215+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
3) The docker service itself looked fine, but checking container state with docker ps -a showed that newly created containers were stuck in the created state; in other words, container creation was failing.
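docker ps can filter on that state directly, so the stuck containers are easy to list; a quick sketch (the container ID in the second command is a placeholder):

docker ps -a --filter status=created --format '{{.ID}}\t{{.CreatedAt}}\t{{.Names}}'
# inspect one of them for any recorded error
docker inspect --format '{{.State.Status}} {{.State.Error}}' <container-id>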
4) Creating a container by hand produced the trace below. It appeared to hang while reading the docker daemon's response, even though the docker service was in the running state and docker ps worked normally.
# strace docker run --rm registry.gz.cvte.cn/egg-demo/dev:dev-635f82b ls
futex(0x56190f2b6490, FUTEX_WAKE, 1) = 1
read(3, "HTTP/1.1 201 Created\r\nApi-Versio"..., 4096) = 297
futex(0xc4204d6548, FUTEX_WAKE, 1) = 1
read(3, 0xc420639000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
pselect6(0, NULL, NULL, NULL, {0, 3000}, NULL) = 0 (Timeout)
futex(0x56190f2b70e8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc420696948, FUTEX_WAKE, 1) = 1
futex(0xc4204ef548, FUTEX_WAKE, 1) = 1
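The trace shows the create API call coming back with 201 Created and the client then blocking, which points at the daemon side. If memory serves, dockerd dumps its goroutine stack traces when sent SIGUSR1, which can show where it is stuck; a hedged sketch:

# ask dockerd for a goroutine stack dump; the daemon logs where it wrote the file
kill -USR1 "$(pidof dockerd)"
grep 'goroutine stacks written' /var/log/messages | tail -n 1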
5) Looking at the messages log again, the following error kept showing up:
Jun 4 10:42:01 k8s-node145 systemd-logind: Failed to start session scope session-413369.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:43:01 k8s-node145 systemd-logind: Failed to start session scope session-413370.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:44:01 k8s-node145 systemd-logind: Failed to start session scope session-413371.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413372.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:45:01 k8s-node145 systemd-logind: Failed to start session scope session-413373.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:46:01 k8s-node145 systemd-logind: Failed to start session scope session-413374.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:47:01 k8s-node145 systemd-logind: Failed to start session scope session-413375.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:48:01 k8s-node145 systemd-logind: Failed to start session scope session-413376.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:49:01 k8s-node145 systemd-logind: Failed to start session scope session-413377.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413378.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:50:01 k8s-node145 systemd-logind: Failed to start session scope session-413379.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:51:01 k8s-node145 systemd-logind: Failed to start session scope session-413380.scope: The maximum number of pending replies per connection has been reached
Jun 4 10:52:01 k8s-node145 systemd-logind: Failed to start session scope session-413381.scope: The maximum number of pending replies per connection has been reached
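A few commands to see how far back these failures go and whether dbus/logind themselves look healthy (assuming the systemd journal is available on the node):

journalctl -u systemd-logind --since today | grep -c 'maximum number of pending replies'
systemctl status dbus
# cron/sshd logins that logind keeps failing to turn into session scopes
loginctl list-sessions | wc -l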
Troubleshooting then went as follows:

1) Searching on the dockerd error from the messages log (msg="stream copy error: reading from a closed fifo"), some people had hit a similar problem because the container's resource limits were set too low and the container process got OOM-killed. But that should not produce the docker run hang I was seeing here, so this lead went nowhere.
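For completeness, a quick way to rule the OOM theory in or out on the node (the container ID is again a placeholder):

# kernel OOM killer activity
dmesg -T | grep -i 'killed process' | tail
# whether a specific container was OOM-killed and what memory limit it had
docker inspect --format '{{.State.OOMKilled}} {{.HostConfig.Memory}}' <container-id>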
2) Others reported it as a docker bug, so I checked the docker version installed on this node:
docker-ce.x86_64        3:18.09.2-3.el7    installed
docker-ce-cli.x86_64    1:18.09.5-3.el7    installed
That is not a version from the docker-ce-stable repo. Next, the docker version on the master node:
docker-ce-17.03.2.ce-1.el7.centos.x86_64
docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
The versions didn't even match...
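A quick way to compare the installed docker packages across nodes; the node names here are placeholders for whatever is in your inventory:

for h in k8s-node145 k8s-master01; do
  echo "== $h =="
  ssh "$h" "rpm -qa 'docker-ce*' | sort"
done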
3) I restarted the docker service on the failing worker node and retested. The kubelet service recovered, but docker ps -a still showed three daemonset pods in the created state, so something was clearly still wrong.
4) Four hours into this problem and out of ideas, I decided to try a different docker version, picking the following from the docker-ce-stable repo:
docker-ce-18.09.6-3.el7.x86_64
docker-ce-cli-18.09.6-3.el7.x86_64
docker run still failed, and the container stayed in the created state.
5) At this point, docker info on the node reported the following:
Containers: 36
 Running: 31
 Paused: 0
 Stopped: 5
Images: 17
Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.14.4.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 251.4GiB
Name: k8s-172-17-84-144
ID: XQYD:6IMZ:IGRL:L4TO:J53F:GYMA:VCWL:2DCT:YZVA:RHAQ:MT2D:F6Q7
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
6) Could it be the storage driver? I edited the docker.service file to drop the -s overlay2 --storage-opt overlay2.override_kernel_check=true startup flags and brought docker up on the overlay storage driver instead. As expected, guesswork like this got me nowhere.
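For reference, the change amounted to something like the following; the exact ExecStart contents and unit path are an assumption about this node's setup, not a recommendation. Note that switching storage drivers hides the images and containers created under the old driver until you switch back:

# swap the storage-driver flags in the unit file, then reload and restart
sed -i 's/-s overlay2 --storage-opt overlay2.override_kernel_check=true/-s overlay/' /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker
docker info | grep 'Storage Driver'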
7) With nothing else left to try, I went for the blunt approach: uninstall docker, rename the /var/lib/docker, /var/lib/docker-engine and /var/run/docker directories, and reinstall docker. The problem was still there. More guesswork.
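The reinstall itself was nothing special; roughly the steps below, with the data directories renamed rather than deleted so they could be restored (package versions as chosen above):

systemctl stop kubelet docker
yum remove -y docker-ce docker-ce-cli
mv /var/lib/docker /var/lib/docker.bak
mv /var/lib/docker-engine /var/lib/docker-engine.bak
mv /var/run/docker /var/run/docker.bak
yum install -y docker-ce-18.09.6-3.el7 docker-ce-cli-18.09.6-3.el7
systemctl start docker kubelet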
8) After steps 6 and 7, my gut told me the problem was with the system itself. So I went back to the earlier "The maximum number of pending replies per connection has been reached" error. Although it is produced by systemd-logind, could the two be related? I had never seen this error before, so I googled it.
9) Here is what turned up:
On 15/06/16 19:05, marcin at saepia.net wrote:
> I have recently started to get the error response
>
> "The maximum number of pending replies per connection has been reached"
>
> to my method calls.

The intention of this maximum is to prevent denial-of-service by a bus client. The dbus-daemon allows exactly one reply to each message that expects a reply, therefore it must allocate memory every time it receives a message that expects a reply, to record that fact. That memory can be freed when it sees the reply, or when the process from which it expects a reply disconnects (therefore there can be no reply and there is no longer any point in tracking/allowing it).

To avoid denial of service, the dbus-daemon limits the amount of memory that it is prepared to allocate on behalf of any particular client. The limit is relatively small for the system bus, very large for the session bus, and configurable (look for max_replies_per_connection in /etc/dbus-1/session.conf).
So this appears to be a limit the system imposes to keep a single program from consuming too many resources and causing a denial of service. Let's see which package /etc/dbus-1/session.conf belongs to, and what files that package ships:
[root@k8s-node-145 eden]# ls /var/lib/docker^C
[root@k8s-node-145 eden]# rpm -qf /etc/dbus-1/session.conf
dbus-1.10.24-7.el7.x86_64
[root@k8s-172-17-84-144 eden]# rpm -ql dbus-1.10.24-7.el7.x86_64
/etc/dbus-1
/etc/dbus-1/session.conf
/etc/dbus-1/session.d
/etc/dbus-1/system.conf
/etc/dbus-1/system.d
/run/dbus
/usr/bin/dbus-cleanup-sockets
/usr/bin/dbus-daemon
/usr/bin/dbus-monitor
/usr/bin/dbus-run-session
/usr/bin/dbus-send
/usr/bin/dbus-test-tool
/usr/bin/dbus-update-activation-environment
/usr/bin/dbus-uuidgen
/usr/lib/systemd/system/dbus.service
/usr/lib/systemd/system/dbus.socket
/usr/lib/systemd/system/messagebus.service
/usr/share/dbus-1/session.conf
At the end of /usr/share/dbus-1/session.conf there is a max_replies_per_connection parameter that matches the error in the messages log. Could this limit be the cause? The default is 50000; I bumped it to 100000 as a test and restarted the dbus service.
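Concretely, the check and the change were along these lines. The sed pattern assumes the stock element looks like the 50000 line below, and restarting dbus on a live node is itself disruptive, so treat this as a sketch rather than a recommendation:

grep -n 'max_replies_per_connection' /usr/share/dbus-1/session.conf /etc/dbus-1/session.conf
# raise the limit, i.e. change
#   <limit name="max_replies_per_connection">50000</limit>
# to 100000, then restart dbus
sed -i 's#<limit name="max_replies_per_connection">50000</limit>#<limit name="max_replies_per_connection">100000</limit>#' /usr/share/dbus-1/session.conf
systemctl restart dbus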
10) I ran docker run again, and this time it succeeded. Which raises a question: the other nodes also have max_replies_per_connection set to 50000, so what triggered the problem on this one? I changed max_replies_per_connection back to 50000, restarted dbus again, and docker run still worked fine. All I can do now is wait and see whether the problem comes back.
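To catch a recurrence early instead of waiting for the node to go NotReady again, a trivial periodic check can be dropped into cron; a sketch with an arbitrary path and interval:

#!/bin/bash
# hypothetical /etc/cron.hourly/check-dbus-replies: log a warning if the logind error reappears
count=$(journalctl -u systemd-logind --since "1 hour ago" | grep -c 'maximum number of pending replies')
if [ "$count" -gt 0 ]; then
  logger -t dbus-watch "systemd-logind hit the pending-replies limit ${count} times in the last hour"
fi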