Resolving ZooKeeper-Related Errors

Date: 2019-11-08


1. Error 1
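Copyright notice: this is an original article by CSDN blogger "AllInCode", published under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Original link: http://www.javashuo.com/article/p-glkpylpg-mh.html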
1.1 Error description
The ZooKeeper server logs (on both the "FOLLOWER" and the "LEADER") show the following error:
2016-05-14 15:33:01,818 [myid:2] - ERROR [CommitProcessor:2:NIOServerCnxn@178] - Unexpected Exception:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
1.2 Cause analysis
By the time the ZooKeeper server sent its response, the socket connection had already been closed.

1.3 Solution
Add an "sk.isValid()" check before the ZooKeeper server sends the response. This is in fact a bug, and it was fixed in ZooKeeper 3.4.8.

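1.4 Other notes
This error also occurred before the "use ZooKeeper to obtain the MQ address" scheme went live.
2. Error 2
2.1 Error description
The ZooKeeper server ("FOLLOWER") log shows the errors below. After they occur, that follower stops working for a period of time: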
2016-05-15 04:04:40,569 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 2243ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-05-14 15:32:50,764 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2016-05-14 15:32:50,764 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
The corresponding ZooKeeper server ("LEADER") log shows the following errors:
2016-05-14 15:32:42,605 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:3 took 3041ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-05-14 15:32:50,764 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
2016-05-14 15:32:50,764 [myid:3] - WARN [LearnerHandler-/10.110.20.23:39390:LearnerHandler@646] - *** GOODBYE /10.110.20.23:39390 ****
2016-05-14 15:32:50,764 [myid:3] - WARN [LearnerHandler-/10.110.20.23:39390:LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)

2.2 Cause analysis
While the "FOLLOWER" was synchronizing with the "LEADER", the fsync operation took too long and caused a timeout.
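To judge how often and how badly fsync stalls, the warning lines can be extracted from the server log. The sketch below is only an illustration and not part of ZooKeeper: the helper name `fsync_durations_ms` is hypothetical, the sample lines are copied from the logs quoted in this article, and the pattern matches only the warning wording shown there.

```python
import re

# Matches the FileTxnLog warning text as it appears in the logs quoted in this article.
FSYNC_WARNING = re.compile(
    r"fsync-ing the write ahead log in (?P<thread>\S+) took (?P<ms>\d+)ms"
)


def fsync_durations_ms(lines):
    """Return a (thread, milliseconds) pair for every fsync warning in `lines`."""
    results = []
    for line in lines:
        match = FSYNC_WARNING.search(line)
        if match:
            results.append((match.group("thread"), int(match.group("ms"))))
    return results


# Sample input: two warnings and one unrelated line, taken from this article.
SAMPLE_LOG = [
    "2016-05-15 04:04:40,569 [myid:1] - WARN [SyncThread:1:FileTxnLog@334] - "
    "fsync-ing the write ahead log in SyncThread:1 took 2243ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
    "2016-05-14 15:32:42,605 [myid:3] - WARN [SyncThread:3:FileTxnLog@334] - "
    "fsync-ing the write ahead log in SyncThread:3 took 3041ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
    "2016-05-14 15:32:50,764 [myid:1] - INFO "
    "[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called",
]

if __name__ == "__main__":
    for thread, ms in fsync_durations_ms(SAMPLE_LOG):
        print(f"{thread}: {ms} ms")
```

Feeding a real log file through the same function (for example, by passing an open file object, which iterates line by line) gives a quick picture of whether the stalls are rare spikes or a constant condition.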

2.3 Solution
Increase "tickTime", or "initLimit" and "syncLimit", or both.
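Because initLimit and syncLimit are counted in ticks, the effective timeouts are their product with tickTime. The sketch below works through that arithmetic for the two configurations that appear later in this article (tickTime=2000, initLimit=10, syncLimit=5, and the doubled values). The helper name `quorum_timeouts_ms` is hypothetical. The session-timeout bounds assume ZooKeeper's documented defaults of 2 and 20 times tickTime, which apply only when minSessionTimeout and maxSessionTimeout are not set explicitly.

```python
def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert tick-based limits from zoo.cfg into milliseconds."""
    return {
        # Time allowed for a follower's initial synchronization with the leader.
        "init_timeout_ms": tick_time_ms * init_limit,
        # Time allowed between sending a request and getting an acknowledgement.
        "sync_timeout_ms": tick_time_ms * sync_limit,
        # Default session-timeout bounds (assumes min/maxSessionTimeout are unset).
        "min_session_timeout_ms": 2 * tick_time_ms,
        "max_session_timeout_ms": 20 * tick_time_ms,
    }


if __name__ == "__main__":
    for label, args in (("original", (2000, 10, 5)), ("doubled", (4000, 20, 10))):
        print(label, quorum_timeouts_ms(*args))
```

Note that doubling both tickTime and the limits multiplies the init and sync timeouts by four, not two, while the default session-timeout bounds only double.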

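2.4 Other notes
This error also occurred before the "use ZooKeeper to obtain the MQ address" scheme went live, just not as frequently. After the scheme went live, the volume of data synchronized between ZooKeeper servers grew and the load on the servers increased, which eventually caused the errors above to appear at high frequency.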

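Some users online suggested a fix: increase the time unit in the ZooKeeper configuration so that connection timeouts become longer, which keeps the sync delay from exceeding the session timeout. So I tried changing the configuration: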

tickTime=4000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
tickTime is ZooKeeper's basic time unit; the other time settings are defined as multiples of it. Here it is 4 s. The original configuration was:

tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
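I doubled all of them. That way, as long as ZooKeeper's forced sync does not take especially long, it can still return before the session expires, so connections can just about be maintained. In practice, however, warnings about excessive sync latency kept appearing: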

fsync-ing the write ahead log in SyncThread:0 took 8001ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
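I checked the Storm and Kafka logs, and they still frequently reported disconnected, session timeout, and similar events. The services mostly stayed up, but the problem clearly was not solved.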

In the end, out of options, I took one user's suggestion and added a new setting to the zoo.cfg configuration file:

forceSync=no
This did solve the problem: the excessive sync latency no longer occurs, and the earlier warnings no longer appear in the logs.

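Of course, the meaning of this setting makes it clear that this is not a perfect solution, because it changes forceSync from its default of yes to no. It does eliminate the sync latency, but only because the forced sync is no longer performed at all, so an acknowledged update can be lost if the machine crashes before the operating system flushes it to disk.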

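It can be understood this way: forceSync defaults to yes, which means that each time ZooKeeper receives an update, it immediately syncs the current state to the log file on disk and only sends the acknowledgement once the sync has completed. Normally this is not a problem, but in my test environment, for a reason I have not identified, writing the log to disk was extremely slow, so during that period the ZooKeeper log showed: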

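fsync-ing the write ahead log in SyncThread:0 took 8001ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Because syncing the log took so long, connections received no reply. Once the delay exceeded the connection's timeout setting, the client (Kafka, for example) considered the connection dead and requested a new one. As a result, Kafka and Storm kept reporting errors and reconnecting, and occasionally crashed.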
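Since the underlying problem is slow disk syncing, it may help to measure fsync latency on the disk that holds the transaction log before deciding to turn off forceSync. The sketch below is a rough probe under stated assumptions: `measure_fsync_ms` is a hypothetical helper, the directory is whatever you pass in (for a real check, the directory configured as dataLogDir or dataDir), and small sequential appends only approximate ZooKeeper's actual write pattern.

```python
import os
import tempfile
import time


def measure_fsync_ms(directory, rounds=20, payload_size=4096):
    """Append `payload_size` bytes and fsync, `rounds` times; return each fsync time in ms."""
    payload = b"\0" * payload_size
    timings = []
    # Create the probe file inside the directory under test so the same disk is measured.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # Block until the data has been handed to the storage device.
            timings.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return timings


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for the ZooKeeper log directory.
    samples = measure_fsync_ms(tempfile.gettempdir())
    print(f"max fsync: {max(samples):.1f} ms, mean: {sum(samples) / len(samples):.1f} ms")
```

If the numbers on the real log directory approach ZooKeeper's fsync warning threshold (1000 ms by default, via fsync.warningthresholdms), the disk itself is the bottleneck, and moving the transaction log to a dedicated device is worth considering alongside any timeout changes.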
