In some cases the kubectl process hangs during deletion, and a subsequent get shows that half of the resources were deleted while the remainder cannot be deleted.
[root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
NAME                          DESIRED   CURRENT   READY     AGE
rc/elasticsearch-logging-v1   0         2         2         15h

NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/kibana-logging   0         1         1            1           15h

Error from server (NotFound): services "elasticsearch-logging" not found
Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
Error from server (NotFound): services "kibana-logging" not found
The commands to delete these deployments, services, or rcs are as follows:
kubectl delete deployment kibana-logging -n kube-system --cascade=false
kubectl delete deployment kibana-logging -n kube-system --ignore-not-found
kubectl delete rc elasticsearch-logging-v1 -n kube-system --force --grace-period=0

How to reset etcd when the resources still cannot be deleted:
rm -rf /var/lib/etcd/*
After deleting, reboot the master node. After resetting etcd, the network configuration has to be set up again:
etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'
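To verify the key was written, you can read it back with the same etcd v2 etcdctl:

etcdctl get /atomic.io/network/config
# should print: { "Network": "192.168.0.0/16" }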
Every startup attempt then reports the following problem:
start request repeated too quickly for kube-apiserver.service
But this is not actually a start-frequency problem. You need to check /var/log/messages; in my case, after enabling ServiceAccount, files such as ca.crt could not be found, which caused the startup failure.
May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
When deploying logging components such as fluentd, many problems stem from the security configuration required once the ServiceAccount option is enabled, so in the end you still have to configure ServiceAccount properly.
When configuring fluentd, the error "cannot create /var/log/fluentd.log: Permission denied" appears; this is because SELinux has not been turned off.
You can change SELINUX=enforcing to SELINUX=disabled in /etc/selinux/config, then reboot.
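If you prefer to script it, a minimal sketch (setenforce only switches to permissive mode until the next boot; the sed edit makes the change permanent):

setenforce 0    # temporary: permissive mode immediately, no reboot needed
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config    # permanent after reboot
getenforce      # confirm the current mode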
First generate the various required keys; replace k8s-master with the master's hostname.
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
openssl genrsa -out server.key 2048
echo subjectAltName=IP:10.254.0.1 > extfile.cnf
# The IP is obtained from the following command:
# kubectl get services --all-namespaces | grep 'default' | grep 'kubernetes' | grep '443' | awk '{print $3}'
openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000
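To confirm that the subjectAltName from extfile.cnf made it into the certificate, a quick check with standard openssl:

openssl x509 -in server.crt -noout -text | grep -A1 'Subject Alternative Name'
# expected output includes: IP Address:10.254.0.1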
If you modify the parameters in the /etc/kubernetes/apiserver configuration file, starting the service via systemctl start kube-apiserver fails with:
Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied
But the API Server can be started from the command line:
/usr/bin/kube-apiserver --logtostderr=true --v=0 \
  --etcd-servers=http://k8s-master:2379 \
  --address=0.0.0.0 --port=8080 --kubelet-port=10250 \
  --allow-privileged=true \
  --service-cluster-ip-range=10.254.0.0/16 \
  --admission-control=ServiceAccount \
  --insecure-bind-address=0.0.0.0 \
  --client-ca-file=/root/keys/ca.crt \
  --tls-cert-file=/root/keys/server.crt \
  --tls-private-key-file=/root/keys/server.key \
  --basic-auth-file=/root/keys/basic_auth.csv \
  --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &
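The --basic-auth-file flag above expects a CSV file with one line per user in the form password,user,uid; a minimal sketch with placeholder credentials:

# /root/keys/basic_auth.csv (credentials here are placeholders)
admin123,admin,1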
Start the Controller-manager from the command line:
/usr/bin/kube-controller-manager --logtostderr=true --v=0 \
  --master=http://k8s-master:8080 \
  --root-ca-file=/root/keys/ca.crt \
  --service-account-private-key-file=/root/keys/server.key \
  &>> /var/log/kubernetes/kube-controller-manager.log &
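Once both are running, a quick sanity check against the apiserver's insecure port (the standard /healthz endpoint):

curl http://k8s-master:8080/healthz
# should print: ok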
etcd is to a Kubernetes cluster what ZooKeeper is to other systems: almost every service depends on etcd being up, such as flanneld, apiserver, docker... When starting etcd, the error log was as follows:
May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
May 24 13:39:28 k8s-master etcd: name = master
May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
May 24 13:39:28 k8s-master etcd: election = 1000ms
May 24 13:39:28 k8s-master etcd: snapshot count = 10000
May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
May 24 13:39:29 k8s-master systemd: etcd.service failed.
May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.
The key line is:
raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
Go into that directory, delete 0.tmp, and etcd can start again.
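Concretely, using the path from the error message above:

rm -rf /var/lib/etcd/default.etcd/member/wal/0.tmp
systemctl restart etcd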
Problem background: three etcd nodes were deployed, and one day all three machines lost power at once. After they came back up, the K8S cluster itself worked, but a check of the components showed that etcd on one node would not start. Investigation found the system time was wrong; after correcting it with ntpdate ntp.aliyun.com and restarting etcd, it still would not come up, with the errors below:
Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member ...
Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pem, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file =
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: shadowed by corresponding flag
Solution: checking the log reveals no obviously fatal error (the most telling line is "member 9c166b8b7cb6ecb8 has already been bootstrapped"). From experience, one broken etcd node has little impact on the cluster, and the cluster was indeed already usable; the broken etcd node simply would not start. The fix is as follows:
cd /var/lib/etcd/default.etcd/member/
cp * /data/bak/
rm -rf /var/lib/etcd/default.etcd/member/*
# on the master node
systemctl stop etcd
systemctl restart etcd
# on node1
systemctl stop etcd
systemctl restart etcd
# on node2
systemctl stop etcd
systemctl restart etcd
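To confirm the member rejoined, you can query cluster health with the v2 etcdctl; the endpoint and certificate paths below are taken from the log above, so adjust them per node:

etcdctl --endpoints=https://192.168.25.226:2379 \
        --ca-file=/opt/etcd/ssl/ca.pem \
        --cert-file=/opt/etcd/ssl/server.pem \
        --key-file=/opt/etcd/ssl/server-key.pem \
        cluster-health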
On each server, run the following command as the user that needs mutual host trust to generate the public/private key pair; just press Enter through the defaults.
ssh-keygen -t rsa
You can see that a public key file has been generated. Copy the public keys between hosts; the first time requires a password, after that you are done.
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)
-p specifies the port: omit -p for the default port, but if the port has been changed you must add it. You can see that an authorized_keys file is created under .ssh/, recording the public keys of the other servers allowed to log in to this one. Test whether login works:
ssh 192.168.199.132 (-p 2222)
Set the hostname of the node, for example:

hostnamectl set-hostname k8s-master1
If a package is not installed, or a command produces no output, change update to install and run it again.
yum update
yum update kernel
yum update kernel-devel
yum install kernel-headers
yum install gcc
yum install gcc make
After these finish, run:
sh VBoxLinuxAdditions.run
能夠經過下面命令強制刪除
kubectl delete pod NAME --grace-period=0 --force
能夠經過如下腳本強制刪除
[root@k8s-master1 k8s]# cat delete-ns.sh
#!/bin/bash
set -e

usage(){
    echo "usage:"
    echo "  delns.sh NAMESPACE"
}

if [ $# -lt 1 ];then
    usage
    exit
fi

NAMESPACE=$1
JSONFILE=${NAMESPACE}.json

# Dump the namespace object, edit it (remove the entries under spec.finalizers),
# then PUT it back through the finalize subresource.
kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
vi "${JSONFILE}"
curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFILE}" \
    http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize
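Two assumptions in that script worth noting: the vi step is where you clear the spec.finalizers list by hand, and the curl call targets 127.0.0.1:8001, which means kubectl proxy must be running. A usage sketch (the namespace name is a placeholder):

kubectl proxy &
./delete-ns.sh stuck-namespace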
Next we create a corresponding container that has only requests set but no limits:
- name: busybox-cnt02
  image: busybox
  command: ["/bin/sh"]
  args: ["-c", "while true; do echo hello from cnt02; sleep 10;done"]
  resources:
    requests:
      memory: "100Mi"
      cpu: "100m"
What problem does such a container cause? In a normal environment, none. But under resource pressure, containers without a limit can have their resources grabbed by other pods, which may cause the application in the container to fail. You can use a LimitRange policy to match pods and have limits set automatically, provided the LimitRange rules are configured in advance.
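A minimal LimitRange sketch (the name and default values are illustrative, not from the original): once it exists in the namespace, any container created without limits gets these defaults filled in automatically.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits       # illustrative name
  namespace: default
spec:
  limits:
  - type: Container
    default:                 # applied as limits when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:          # applied as requests when a container sets none
      cpu: 100m
      memory: 100Mi
EOF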