The cluster was perfectly fine yesterday, but when I checked this morning it was down again. Luckily it is only the test server cluster at home...
First I checked the NameNode status and found that only one of the two NameNodes was still up, so I went straight to the one that had died and looked at the log under logs/:
2016-08-09 16:33:51,526 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:52,169 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2016-08-09 16:33:52,526 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:53,527 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:54,529 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:55,530 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:56,531 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:57,533 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:58,533 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:33:59,534 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:00,536 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:01,537 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16013 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:02,538 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:03,540 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:04,541 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.80.248.17:8486]
2016-08-09 16:34:05,525 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.80.248.17:8486, 10.80.248.18:8486, 10.80.248.19:8486], stream=QuorumOutputStream starting at txid 2947))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1221)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1158)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1238)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6344)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:933)
	at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
	at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2016-08-09 16:34:05,526 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 2947
2016-08-09 16:34:05,600 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-08-09 16:34:05,733 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ut07/10.80.248.17
************************************************************/
The above is the key part of hadoop-hadooptest-namenode-ut07.log at the moment the NameNode exited. It shows that the NameNode timed out while writing to the JournalNodes; the default timeout is 20 seconds. Once the timeout fires, the NameNode calls the terminate method of the ExitUtil class, which ends in System.exit() and kills the process.
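To make that failure mode a bit more concrete, here is a minimal, self-contained Java sketch of the pattern just described. It is my own illustration, not the actual Hadoop source; the class and method names (QuorumWriteSketch, flushToQuorum) are invented. Edits are pushed to every JournalNode in parallel, the writer waits up to the timeout for a majority of acknowledgements, and if the quorum is not reached the process terminates, just like ExitUtil.terminate() ending in System.exit():

// Sketch only (NOT the real Hadoop code): send edits to every JournalNode in
// parallel, wait up to the timeout for a majority of acks, otherwise give up
// and terminate the JVM.
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class QuorumWriteSketch {

    // default dfs.qjournal.write-txns.timeout.ms, matching the "timeout=20000 ms" in the log
    static final long TIMEOUT_MS = 20_000;

    // Each Callable stands in for "send this batch of edits to one JournalNode, return true on ack".
    static void flushToQuorum(List<Callable<Boolean>> journalWriters) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(journalWriters.size());
        try {
            // invokeAll cancels whatever has not finished when the timeout expires
            List<Future<Boolean>> acks =
                    pool.invokeAll(journalWriters, TIMEOUT_MS, TimeUnit.MILLISECONDS);

            int succeeded = 0;
            for (Future<Boolean> f : acks) {
                try {
                    if (!f.isCancelled() && Boolean.TRUE.equals(f.get())) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // this JournalNode failed; it simply does not count towards the quorum
                }
            }

            int quorum = journalWriters.size() / 2 + 1;   // majority, e.g. 2 out of 3
            if (succeeded < quorum) {
                // This is the point where the real NameNode logs the FATAL
                // "flush failed for required journal" message and calls ExitUtil.terminate(1).
                System.err.println("flush failed: only " + succeeded + " acks, need " + quorum);
                System.exit(1);
            }
        } finally {
            pool.shutdownNow();
        }
    }
}

The real code path through QuorumJournalManager and AsyncLoggerSet is of course more involved, but the quorum-or-die logic is the same idea.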
As for why a perfectly healthy cluster would suddenly hit write timeouts, I am not sure yet. Some people say it happens when a large HDFS file operation triggers a long Full GC on the NameNode, and the GC pause makes the write to the JournalNodes exceed the timeout. But I barely touched HDFS between yesterday and today, so the root cause still needs more digging...
In any case, let's get the cluster back up first; I still need it.
In a real production environment it is also quite easy to run into this kind of timeout, so it is worth raising the default of 20 s to something larger, for example 60 s.
To do that, add the following property to hdfs-site.xml under hadoop/etc/hadoop:
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
This is a configuration method I picked up from someone else's blog. Strangely, I could not find any mention of this property in the hdfs-default.xml reference on the official Hadoop site (presumably the key only exists as a hard-coded default in the source, which would explain the gap):
http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
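As a quick sanity check after editing the file, the following small Java snippet prints the value that Hadoop's Configuration API will pick up from the config files (an assumption here: hadoop/etc/hadoop has to be on the classpath so that hdfs-site.xml is actually loaded). The 20000 passed as the second argument is only the fallback used when the key is not set anywhere, which lines up with the "timeout=20000 ms" seen in the log; after the change it should print 60000:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class PrintQjmWriteTimeout {
    public static void main(String[] args) {
        // HdfsConfiguration pulls in hdfs-default.xml and hdfs-site.xml from the classpath
        Configuration conf = new HdfsConfiguration();
        int timeoutMs = conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000);
        System.out.println("dfs.qjournal.write-txns.timeout.ms = " + timeoutMs);
    }
}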
Finally, remember to restart the whole cluster so that the new configuration takes effect.
Friendly reminder: if you are also running Flume, remember to restart the Flume cluster as well~