zookeeper頻繁異常問題分析

Reference: https://blog.csdn.net/xjping0794/article/details/77784171java

 

1.1            操做系統信息
1.1.1          CPU信息apache

 


從上述能夠看出在問題點CPU使用率並不高,故排除CPU問題。緩存

1.1.2          內存信息
 網絡

CM監控界面並沒有內存信息,沒法提供圖片。session

但問題期間,監控機器內存,發現還有剩餘,故排除內存問題。socket

1.2            ZKFAILOVER日誌信息
 ide

2017-09-0109:45:57,390 INFO org.apache.zookeeper.ClientCnxn: Client session timed out,have not heard from server in 1668ms for sessionid 0x0, closing socketconnection and attempting reconnect工具

2017-09-01 09:45:58,224 ERRORorg.apache.hadoop.ha.ActiveStandbyElector: Connection timed out: couldn'tconnect to ZooKeeper in 5000 millisecondsoop

2017-09-0109:45:58,450 INFO org.apache.zookeeper.ZooKeeper: Session: 0x0 closed測試

2017-09-0109:45:58,450 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

2017-09-0109:45:58,451 WARN org.apache.hadoop.ha.ActiveStandbyElector:org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =ConnectionLoss

2017-09-0109:46:03,453 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,connectString=test-ssps-s-02:2181,test-ssps-s-03:2181,test-ssps-s-04:2181sessionTimeout=5000watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f0d36c

2017-09-0109:46:03,455 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection toserver test-ssps-s-04/10.117.210.216:2181. Will not attempt to authenticateusing SASL (unknown error)

2017-09-0109:46:03,463 INFO org.apache.zookeeper.ClientCnxn: Socket connectionestablished to test-ssps-s-04/10.117.210.216:2181, initiating session

2017-09-0109:46:05,131 INFO org.apache.zookeeper.ClientCnxn: Client session timed out,have not heard from server in 1668ms for sessionid 0x0, closing socketconnection and attempting reconnect

2017-09-0109:46:05,885 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection toserver test-ssps-s-03/10.51.20.155:2181. Will not attempt to authenticate usingSASL (unknown error)

2017-09-0109:46:06,626 INFO org.apache.zookeeper.ClientCnxn: Socket connectionestablished to test-ssps-s-03/10.51.20.155:2181, initiating session

2017-09-0109:46:08,293 INFO org.apache.zookeeper.ClientCnxn: Client session timed out,have not heard from server in 1667ms for sessionid 0x0, closing socketconnection and attempting reconnect

2017-09-01 09:46:08,454 ERROR org.apache.hadoop.ha.ActiveStandbyElector:Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds

2017-09-0109:46:09,157 INFO org.apache.zookeeper.ZooKeeper: Session: 0x0 closed

2017-09-01 09:46:09,157 INFO org.apache.zookeeper.ClientCnxn:EventThread shut down

2017-09-0109:46:09,158 WARN org.apache.hadoop.ha.ActiveStandbyElector:org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode =ConnectionLoss

標紅日誌,顯示沒法與zk創建鏈接,致使failover終止。

1.3            YARN RESOURCEMANAGER日誌信息
具體日誌信息未保留,但均是因爲ZK異常,致使服務掛掉。

1.4            ZOOKEEPER服務日誌
摘取leader和follower 服務日誌分析。

1.4.1           leader日誌
分析異常時間點9點40到9點50日誌,發現如下報錯信息:

 

2017-09-0109:45:49,754 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Acceptedsocket connection from /10.51.20.155:40713

2017-09-0109:45:49,754 INFO org.apache.zookeeper.server.ZooKeeperServer: Clientattempting to establish new session at /10.51.20.155:40713

2017-09-0109:45:51,090 INFO org.apache.zookeeper.server.PrepRequestProcessor: Gotuser-level KeeperException when processing sessionid:0x15e2a2922a213b5type:setData cxid:0x21 zxid:0x6d00050f33 txntype:-1 reqpath:n/a ErrorPath:/yarn-leader-election/yarnRM/ActiveBreadCrumb Error:KeeperErrorCode =BadVersion for /yarn-leader-election/yarnRM/ActiveBreadCrumb

2017-09-0109:45:52,000 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x35e2a2922671668, timeout of 5000ms exceeded

2017-09-0109:45:52,001 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processedsession termination for sessionid: 0x35e2a2922671668

2017-09-0109:45:52,482 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Acceptedsocket connection from /10.117.68.10:33589

2017-09-0109:45:52,484 INFO org.apache.zookeeper.server.ZooKeeperServer: Clientattempting to renew session 0x15e2a2922a21870 at /10.117.68.10:33589

2017-09-0109:45:52,484 INFO org.apache.zookeeper.server.ZooKeeperServer: Establishedsession 0x15e2a2922a21870 with negotiated timeout 5000 for client/10.117.68.10:33589

2017-09-0109:45:52,884 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end ofstream exception

2017-09-0109:45:52,884 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socketconnection for client /10.51.20.155:40712 which had sessionid 0x15de73379b60000

2017-09-0109:45:53,244 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Acceptedsocket connection from /10.51.20.155:40730

2017-09-0109:46:36,520 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Acceptedsocket connection from /10.51.20.155:40749

2017-09-01 09:46:29,773 WARN org.apache.zookeeper.server.persistence.FileTxnLog:fsync-ing the write ahead log in SyncThread:3 took53812ms which will adversely effect operation latency. See theZooKeeper troubleshooting guide

2017-09-01 09:46:17,239 INFOorg.apache.zookeeper.server.quorum.Leader: Shutting down

2017-09-0109:45:58,000 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x15e2a2922a21870, timeout of 5000ms exceeded

2017-09-01 09:45:53,439 ERRORorg.apache.zookeeper.server.NIOServerCnxn: Unexpected Exception:

java.nio.channels.CancelledKeyException atsun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) atsun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) atorg.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418) atorg.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)atorg.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)atorg.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x15e2a2922a21871, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x15e2a2922a213b5, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x15de73379b60000, timeout of 6000ms exceeded

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x25de7337bb00000, timeout of 6000ms exceeded

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x35e2a2922671890, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session0x15e2a2922a213b6, timeout of 10000ms exceeded

2017-09-01 09:46:36,647 INFOorg.apache.zookeeper.server.quorum.Leader: Shutdown called

 

從標紅日誌信息能夠看出,zookeerper 服務在同步日誌過程當中耗時太長,花了53812ms(正常應該在3秒內),同步日誌會致使ZK沒法響應外部請求,進而引起session過時,進而引起zk 服務端shut down。

另外,CancelledKeyException錯誤,是因爲session失效後,socket已關閉,但服務端仍往該session發送回覆信號,引起該錯誤,該錯誤並不致命,影響不大。是zookeeper版本(3.4.5)bug所致,已在ZK新版本中優化掉。

其餘點日誌報一樣問題,尤爲10點33分,zk刷新日誌居然花了將近2分鐘,致使服務異常退出。

 

1.4.2          Follower日誌
 

分析異常時間點13點00到13點10日誌,發現如下報錯信息:

 

2017-09-0113:08:55,526 INFO org.apache.zookeeper.server.ZooKeeperServer: Establishedsession 0x15e3b4962bc0000 with negotiated timeout 6000 for client/10.117.210.216:39709

2017-09-0113:08:57,259 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Acceptedsocket connection from /10.51.20.155:53325

2017-09-0113:08:57,260 INFO org.apache.zookeeper.server.ZooKeeperServer: Clientattempting to renew session 0x25e3b531e84006e at /10.51.20.155:53325

2017-09-0113:08:57,261 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session0x25e3b531e84006e with negotiated timeout 6000 for client /10.51.20.155:53325

2017-09-0113:08:57,261 INFO org.apache.zookeeper.server.PrepRequestProcessor: Gotuser-level KeeperException when processing sessionid:0x25e3b531e84006etype:delete cxid:0x11e3d zxid:0x7000006e8f txntype:-1 reqpath:n/a ErrorPath:/admin/preferred_replica_election Error:KeeperErrorCode = NoNode for/admin/preferred_replica_election

2017-09-0113:08:59,525 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end ofstream exception

2017-09-0113:08:59,526 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socketconnection for client /10.117.210.216:39709 which had sessionid0x15e3b4962bc0000

2017-09-01 13:09:00,302 WARNorg.apache.zookeeper.server.persistence.FileTxnLog: fsync-ing the write aheadlog in SyncThread:3 took 35356ms which will adversely effect operation latency.See the ZooKeeper troubleshooting guide

2017-09-0113:09:01,263 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end ofstream exception

2017-09-0113:09:01,264 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socketconnection for client /10.51.20.155:53325 which had sessionid 0x25e3b531e84006e

緣由同上。

1.5            磁盤

 


在問題點,發現IO使用很高。

9點46分,問題出現後,基於當時的集羣IO值分析,發現IO達到200M/s,一樣,10點33分IO也達到了將近200M/s,磁盤IO多是一個瓶頸。

1.6            網絡
 

       網絡正常。

      

 

2  總結及建議
 

從監控及分析結果來看,均屬ZK服務端在fsync-ing the write ahead log日誌時超長引發。從問題產生時間段的IO來分析,發現磁盤IO較正常點高了不少,達到200M/s,證實問題發生時,有大做業在進行磁盤IO,與開發人員楊儀軍確認,其今天確實較以往跑了大數據的做業。頸,需進一步經過專業工具測試磁盤IO能力。

2.1.1          官網建議
關於ZK日誌存放,官網給出以下建議:

Having a dedicated log devicehas a large impact on throughput and stable latencies. It is highly recommenedto dedicate a log device and set dataLogDir to point to a directory on thatdevice, and then make sure to point dataDir to a directory not residing on thatdevice.

     
      故爲避免此類問題,dataLogDir存放目錄應該與dataDir分開,可單獨採用一套存儲設備來存放ZK日誌。
 

2.1.2          磁盤IO
 

從今天觀察狀況來看,磁盤IO嚴重不行,200M/s即達到瓶頸,需進一步經過專業工具測試磁盤IO能力。

 

2.1.3          關閉forceSync參數
 

在ZOO.CFG中增長:

      forceSync=no

默認是開啓的,爲避免同步延遲問題,ZK接收到數據後會馬上去講當前狀態信息同步到磁盤日誌文件中,同步完成後纔會應答。將此項關閉後,客戶端鏈接能夠獲得快速響應。Zk涮日誌源碼以下圖:

 

關閉forceSync選項後,會存在潛在風險,雖然依舊會刷磁盤(log.flush()首先被執行),但由於操做系統爲提升寫磁盤效率,會先寫緩存,當機器異常後,可能致使一些zk狀態信息沒有同步到磁盤,從而帶來ZK先後信息不同問題。

2.1.4          解決CancelledKeyException
該問題,已在ZooKeeper 3.4.8版本中獲得修復。

如需解決該版本問題,可打補丁https://issues.apache.org/jira/browse/ZOOKEEPER-1237

而後從新編譯ZK並使用。

相關文章
相關標籤/搜索