最近咱們的集羣狀態異常,發現啓動zk的時候一直失敗,看了日誌是真的心酸,5s掛機,這要是在開黑,分分鐘被舉報了,一開始真的沒想懂啓動zk爲何這麼難,並且很穩定的在5s左右失敗,經過時間能夠判斷這個進程尚未徹底啓動,在這個時間段內,頂多就是在init
狀態java
[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper ● zookeeper.service - ZooKeeper Service Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2020-03-23 11:29:21 CST; 4s ago Docs: http://zookeeper.apache.org Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Main PID: 31138 (java) CGroup: /system.slice/zookeeper.service └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo... Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service... Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service. [root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper ● zookeeper.service - ZooKeeper Service Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2020-03-23 11:29:21 CST; 5s ago Docs: http://zookeeper.apache.org Process: 31011 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Main PID: 31138 (java) CGroup: /system.slice/zookeeper.service └─31138 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo... Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service... Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: ZooKeeper JMX enabled by default Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service. [root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper ● zookeeper.service - ZooKeeper Service Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Mon 2020-03-23 11:29:26 CST; 706ms ago Docs: http://zookeeper.apache.org Process: 31225 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Process: 31129 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Main PID: 31138 (code=exited, status=1/FAILURE) Mar 23 11:29:20 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31129]: Starting zookeeper ... STARTED Mar 23 11:29:21 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service. Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service: main process exited, code=exited, status=1/FAILURE Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: ZooKeeper JMX enabled by default Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: Stopping zookeeper ... /opt/zookeeper/zookeeper-prod/bin/zkServer.sh: 第 182 行:kill: (31138) - 沒有那個進程 Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[31225]: STOPPED Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Unit zookeeper.service entered failed state. Mar 23 11:29:26 ZYC3-AQGK-LJCL-SRV05 systemd[1]: zookeeper.service failed.
看到日誌中提到/opt/zookeeper/zookeeper-prod/conf/zoo.cfg
因此就去這個目錄下看看有沒有什麼值得挖掘的內容。畢竟conf
能夠猜想是一個配置文件夾,講道理應該是有log
或者output
之類的文件夾,裏面存放着運行日誌,特別是error
日誌,按照這個思路就能夠進行相關的排查了。
以後發現/opt/zookeeper/zookeeper-prod/bin
目錄中有個zookeeper.out
文件,這個是執行的細節,能夠看下內容,而後cat一下,問題就很明朗了數據庫
2020-03-23 11:36:58,799 [myid:] - INFO [main:QuorumPeerConfig@136] - Reading configuration from: /opt/zookeeper/zookeeper-prod/bin/../conf/zoo.cfg 2020-03-23 11:36:58,814 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.26 to address: /10.153.115.26 2020-03-23 11:36:58,815 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.25 to address: /10.153.115.25 2020-03-23 11:36:58,816 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.24 to address: /10.153.115.24 2020-03-23 11:36:58,816 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.29 to address: /10.153.115.29 2020-03-23 11:36:58,816 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.28 to address: /10.153.115.28 2020-03-23 11:36:58,816 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: 10.153.115.27 to address: /10.153.115.27 2020-03-23 11:36:58,816 [myid:] - WARN [main:QuorumPeerConfig@354] - Non-optimial configuration, consider an odd number of servers. 2020-03-23 11:36:58,816 [myid:] - INFO [main:QuorumPeerConfig@398] - Defaulting to majority quorums 2020-03-23 11:36:58,821 [myid:5] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3 2020-03-23 11:36:58,821 [myid:5] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24 2020-03-23 11:36:58,822 [myid:5] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started. 2020-03-23 11:36:58,837 [myid:5] - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed. 2020-03-23 11:36:58,839 [myid:5] - INFO [main:QuorumPeerMain@130] - Starting quorum peer 2020-03-23 11:36:58,849 [myid:5] - INFO [main:ServerCnxnFactory@117] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory 2020-03-23 11:36:58,856 [myid:5] - INFO [main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181 2020-03-23 11:36:58,861 [myid:5] - INFO [main:QuorumPeer@1158] - tickTime set to 2000 2020-03-23 11:36:58,861 [myid:5] - INFO [main:QuorumPeer@1204] - initLimit set to 10 2020-03-23 11:36:58,861 [myid:5] - INFO [main:QuorumPeer@1178] - minSessionTimeout set to -1 2020-03-23 11:36:58,862 [myid:5] - INFO [main:QuorumPeer@1189] - maxSessionTimeout set to -1 2020-03-23 11:36:58,871 [myid:5] - INFO [main:QuorumPeer@1467] - QuorumPeer communication is not secured! 2020-03-23 11:36:58,871 [myid:5] - INFO [main:QuorumPeer@1496] - quorum.cnxn.threads.size set to 20 2020-03-23 11:36:58,872 [myid:5] - INFO [main:FileSnap@86] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c 2020-03-23 11:36:59,290 [myid:5] - ERROR [main:QuorumPeer@692] - Unable to load database on disk java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87 at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) 2020-03-23 11:36:59,292 [myid:5] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally java.lang.RuntimeException: Unable to run quorum server at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:693) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) Caused by: java.io.IOException: The accepted epoch, ba86 is less than the current epoch, ba87 at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:689) ... 4 more
日誌中寫着,正在讀取zk的快照,而後就報錯了,沒法載入磁盤上的數據庫,那麼!我就把快照刪了snapshot.b91d0000003c
,讓其本身從新生成快照文件,完事了。apache
2020-03-23 11:36:58,872 [myid:5] - INFO [main:FileSnap@86] - Reading snapshot /data/zookeeper/data/version-2/snapshot.b91d0000003c 2020-03-23 11:36:59,290 [myid:5] - ERROR [main:QuorumPeer@692] - Unable to load database on disk
酸爽less
[root@ZYC3-AQGK-LJCL-SRV05 deployer]# systemctl status zookeeper ● zookeeper.service - ZooKeeper Service Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2020-03-23 12:12:08 CST; 5min ago Docs: http://zookeeper.apache.org Process: 25348 ExecStop=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh stop /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Process: 25658 ExecStart=/opt/zookeeper/zookeeper-prod/bin/zkServer.sh start /opt/zookeeper/zookeeper-prod/conf/zoo.cfg (code=exited, status=0/SUCCESS) Main PID: 25667 (java) CGroup: /system.slice/zookeeper.service └─25667 java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/zookeeper-prod/bin/../build/classes:/opt/zookeeper/zookeeper-prod/bin/../build/lib/*.jar:/opt/zookeeper/zoo... Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Starting ZooKeeper Service... Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: ZooKeeper JMX enabled by default Mar 23 12:12:07 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Using config: /opt/zookeeper/zookeeper-prod/conf/zoo.cfg Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 zkServer.sh[25658]: Starting zookeeper ... STARTED Mar 23 12:12:08 ZYC3-AQGK-LJCL-SRV05 systemd[1]: Started ZooKeeper Service.
由於zk每次運行的時候都會有個一快照文件,這個是狀態恢復用的,因爲這臺主機以前磁盤滿了,導入zk沒法及時寫入消息,以後咱們將設備進行重啓,應該就是這個時候,快照文件寫入失敗致使的。不過這個也不是萬能解法,也是要具體問題具體分析,由於在集羣中,把快照文件刪了,後續想恢復數據庫狀態就難了,不過幸好咱們的zk是6節點的,另外五個節點正常,因此這麼操做是容許的。
至此收工:)ide