一次在k8s集羣中建立實例發現etcd集羣狀態出現鏈接失敗情況,致使建立實例失敗。因而排查了一下緣由。docker
下面是etcd集羣健康狀態:bootstrap
[root@docker01 ~]# cd /opt/kubernetes/ssl/
[root@docker01 ssl]# /opt/kubernetes/bin/etcdctl \
> --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
> --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
> cluster-health
member 1bd4d12de986e887 is healthy: got healthy result from https://10.0.0.99:2379
member 45396926a395958b is healthy: got healthy result from https://10.0.0.100:2379
failed to check the health of member c2c5804bd87e2884 on https://10.0.0.111:2379: Get https://10.0.0.111:2379/health: net/http: TLS handshake timeout
member c2c5804bd87e2884 is unreachable: [https://10.0.0.111:2379] are all unreachable
cluster is healthy
[root@docker01 ssl]#
能夠明顯看到etcd節點03出現問題。bash
這個時候到節點03上來重啓etcd服務以下:app
[root@docker03 ~]# systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@docker03 ~]# journalctl -xe
Mar 24 22:24:32 docker03 etcd[1895]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 24 22:24:32 docker03 etcd[1895]: the server is already initialized as member before, starting as etcd member...
Mar 24 22:24:32 docker03 etcd[1895]: peerTLS: cert = /opt/kubernetes/ssl/server.pem, key = /opt/kubernetes/ssl/server-key.pem, ca = , trusted-ca = /opt/kubernetes/ssl
Mar 24 22:24:32 docker03 etcd[1895]: listening for peers on https://10.0.0.111:2380
Mar 24 22:24:32 docker03 etcd[1895]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 127.0.0.1:2379
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 10.0.0.111:2379
Mar 24 22:24:32 docker03 etcd[1895]: member c2c5804bd87e2884 has already been bootstrapped
Mar 24 22:24:32 docker03 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 24 22:24:32 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:32 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:32 docker03 systemd[1]: etcd.service failed.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 24 22:24:33 docker03 systemd[1]: start request repeated too quickly for etcd.service
Mar 24 22:24:33 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:33 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service failed.
並無成功啓動服務,能夠看到提示信息:member c2c5804bd87e2884 has already been bootstrappedui
查看資料說是:
One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
大概意思:
其中一個成員是經過discovery service引導的。必須刪除之前的數據目錄來清理成員信息。不然成員將忽略新配置,使用舊配置。這就是爲何你看到了不匹配。
看到了這裏,問題所在也就很明確了,啓動失敗的緣由在於data-dir (/var/lib/etcd/default.etcd)中記錄的信息與 etcd啓動的選項所標識的信息不太匹配形成的。url
第一種方式咱們能夠經過修改啓動參數解決這類錯誤。既然 data-dir 中已經記錄信息,咱們就不必在啓動項中加入多於配置。具體修改--initial-cluster-state參數:spa
[root@docker03 ~]# cat /usr/lib/systemd/system/etcd.service [Unit] Description=Etcd Server After=network.target After=network-online.target Wants=network-online.target [Service] Type=notify EnvironmentFile=-/opt/kubernetes/cfg/etcd ExecStart=/opt/kubernetes/bin/etcd \ --name=${ETCD_NAME} \ --data-dir=${ETCD_DATA_DIR} \ --listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \ --listen-client-urls=${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \ --advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \ --initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \ --initial-cluster=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-token=${ETCD_INITIAL_CLUSTER} \ --initial-cluster-state=existing \ # 將new這個參數修改爲existing,啓動正常! --cert-file=/opt/kubernetes/ssl/server.pem \ --key-file=/opt/kubernetes/ssl/server-key.pem \ --peer-cert-file=/opt/kubernetes/ssl/server.pem \ --peer-key-file=/opt/kubernetes/ssl/server-key.pem \ --trusted-ca-file=/opt/kubernetes/ssl/ca.pem \ --peer-trusted-ca-file=/opt/kubernetes/ssl/ca.pem Restart=on-failure LimitNOFILE=65536 [Install] WantedBy=multi-user.target
咱們將 --initial-cluster-state=new 修改爲 --initial-cluster-state=existing,再次從新啓動就ok了。rest
第二種方式刪除全部etcd節點的 data-dir 文件(不刪也行),重啓各個節點的etcd服務,這個時候,每一個節點的data-dir的數據都會被更新,就不會有以上故障了。code
第三種方式是複製其餘節點的data-dir中的內容,以此爲基礎上以 --force-new-cluster 的形式強行拉起一個,而後以添加新成員的方式恢復這個集羣。orm
這是目前的幾種解決辦法