[Original] Troubleshooting Notes (13): HBase Regions Frequently Going Offline

Symptom: a Hive SQL query fails

select count(*) from test_hive_table;

The error:

Error: java.io.IOException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:

Wed May 16 10:22:17 CST 2018, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68270: row '6495922803d09' on table 'test_hbase_table' at region=test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., hostname=server121,16020,1526398855988, seqNum=47468

Direct cause 1: region state

This Hive table is an external table whose underlying data lives in HBase. The state history of the problematic region:

2018-05-16 03:18:16  onlined on RegionServer server121;

2018-05-16 06:44:13  RegionServer server121 aborted and the region went offline (offline for roughly 15 hours);

2018-05-16 21:25:33  onlined on RegionServer server92;

2018-05-17 04:10:44  RegionServer server92 aborted and the region went offline (offline for roughly 10 hours);

2018-05-17 14:27:33  onlined on RegionServer server114;

 

The failing query at 2018-05-16 10:22:17 fell squarely inside a window when the region was offline, so the direct cause of the error is that one HBase region was unavailable at the time. The investigation therefore shifts to HBase; the detailed HBase logs follow:

 

HMaster server132 log:

--------------------------------------

2018-05-16 03:18:12,838 INFO  [ProcedureExecutor-15] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526370112734, server=server76,16020,1526343293931} to {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411892838, server=server76,16020,1526343293931}

2018-05-16 03:18:15,861 INFO  [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526411895787, server=server76,16020,1526343293931} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988}

2018-05-16 03:18:16,311 INFO  [AM.ZK.Worker-pool2-t9606] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526411895861, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988}

2018-05-16 03:18:16,445 INFO  [AM.ZK.Worker-pool2-t9610] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526411896311, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896445, server=server121,16020,1526398855988}

2018-05-16 03:18:16,480 INFO  [AM.ZK.Worker-pool2-t9618] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server76,16020,1526343293931

 

2018-05-16 14:21:44,565 INFO  [B.defaultRpcServer.handler=183,queue=3,port=16000] master.MasterRpcServices: Client=hadoop//client assign test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

2018-05-16 14:21:44,565 INFO  [B.defaultRpcServer.handler=183,queue=3,port=16000] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526411896480, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704565, server=server121,16020,1526398855988}

2018-05-16 14:21:44,566 INFO  [B.defaultRpcServer.handler=183,queue=3,port=16000] master.AssignmentManager: Skip assigning test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650., it is on a dead but not processed yet server: server121,16020,1526398855988

 

2018-05-16 21:25:32,030 INFO  [ProcedureExecutor-9] master.RegionStates: Transitioning {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} will be handled by ServerCrashProcedure for server121,16020,1526398855988

2018-05-16 21:25:32,489 INFO  [ProcedureExecutor-9] procedure.ServerCrashProcedure: Reassigning region {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526451704566, server=server121,16020,1526398855988} and clearing zknode if exists

2018-05-16 21:25:32,938 INFO  [server132,16000,1525800676562-GeneralBulkAssigner-9] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OFFLINE, ts=1526477132893, server=server121,16020,1526398855988} to {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596}

2018-05-16 21:25:33,078 INFO  [AM.ZK.Worker-pool2-t10955] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=PENDING_OPEN, ts=1526477132938, server=server92,16020,1526473170596} to {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,1526473170596}

2018-05-16 21:25:33,895 INFO  [AM.ZK.Worker-pool2-t10956] master.RegionStates: Transition {6060407a4aa5e23c8df09614dd3fe650 state=OPENING, ts=1526477133078, server=server92,16020,1526473170596} to {6060407a4aa5e23c8df09614dd3fe650 state=OPEN, ts=1526477133895, server=server92,16020,1526473170596}

2018-05-16 21:25:33,940 INFO  [AM.ZK.Worker-pool2-t10956] master.RegionStates: Offlined 6060407a4aa5e23c8df09614dd3fe650 from server121,16020,1526398855988

 

RegionServer server121 log:

--------------------------------------

2018-05-16 03:18:16,390 INFO  [RS_OPEN_REGION-server121:16020-0] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=47468

 

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

 

2018-05-16 06:44:13,218 INFO  [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p

2018-05-16 06:44:13,218 INFO  [RS_CLOSE_REGION-server121:16020-0] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

 

RegionServer server92 log:

--------------------------------------

2018-05-16 21:25:33,837 INFO  [RS_OPEN_REGION-server92:16020-1] regionserver.HRegion: Onlined 6060407a4aa5e23c8df09614dd3fe650; next sequenceid=48261

 

2018-05-17 04:10:44,590 FATAL [regionserver/server92/server92:16020.logRoller] regionserver.HRegionServer: ABORTING region server server92,16020,1526473170596: Failed log close in log roller

 

2018-05-17 04:10:44,716 INFO  [StoreCloserThread-test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.-1] regionserver.HStore: Closed cf_p

2018-05-17 04:10:44,716 INFO  [RS_CLOSE_REGION-server92:16020-1] regionserver.HRegion: Closed test_hbase_table,6495922803d09,1526207008130.6060407a4aa5e23c8df09614dd3fe650.

 

Direct cause 2: the RegionServer processes exit frequently

The RegionServers have also been very unstable recently. Taking RegionServer server121 as an example, the process has exited repeatedly over the past few days.

RegionServer abort log:

--------------------------------------

2018-05-15 00:09:26,412 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526299719028: Failed log close in log roller

2018-05-15 08:07:39,293 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526314169508: Failed log close in log roller

2018-05-15 12:58:12,082 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526342866588: Failed log close in log roller

2018-05-15 21:40:19,032 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526360301267: Failed log close in log roller

2018-05-15 23:40:47,297 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526391631046: Failed log close in log roller

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

2018-05-16 16:16:13,501 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526449759526: Failed log close in log roller

2018-05-16 21:54:02,776 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526458576939: Failed log close in log roller

2018-05-17 03:27:20,017 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526478847956: Failed log close in log roller

2018-05-17 16:03:16,978 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526498846730: Failed log close in log roller

 

RegionServer startup log:

--------------------------------------

Tue May 15 00:09:28 CST 2018 Starting regionserver on server121

Tue May 15 08:07:45 CST 2018 Starting regionserver on server121

Tue May 15 12:58:20 CST 2018 Starting regionserver on server121

Tue May 15 21:40:29 CST 2018 Starting regionserver on server121

Tue May 15 23:40:54 CST 2018 Starting regionserver on server121

Wed May 16 06:44:20 CST 2018 Starting regionserver on server121

Wed May 16 13:49:18 CST 2018 Starting regionserver on server121

Wed May 16 16:16:15 CST 2018 Starting regionserver on server121

Wed May 16 21:54:06 CST 2018 Starting regionserver on server121

Thu May 17 03:27:25 CST 2018 Starting regionserver on server121

Thu May 17 16:03:21 CST 2018 Starting regionserver on server121

 

Question 1: why do the HBase RegionServers exit so frequently?

2018-05-16 06:44:13,077 FATAL [regionserver/server121/server121:16020.logRoller] regionserver.HRegionServer: ABORTING region server server121,16020,1526398855988: Failed log close in log roller

org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184, unflushedEntries=133

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:907)

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:709)

        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)

        at java.lang.Thread.run(Thread.java:745)

Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:1650)

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1669)

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:858)

        ... 3 more

Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=10739149, requesting roll of WAL

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1970)

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1814)

        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1724)

        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        ... 1 more

Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).

 The current failed datanode replacement policy is ALWAYS, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1040)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1106)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1253)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)

        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

The abort happens because the WAL writer failed to obtain a replacement datanode; see http://www.javashuo.com/article/p-zurtcnjy-gc.html for a detailed walkthrough of the HDFS code.

Roughly: because there are too many racks, maxNodesPerRack works out to 1 (the calculation is shown under Question 2), i.e. at most one datanode may be chosen per rack. Since only one rack of this HDFS cluster is actually usable, numOfAvailableNodes ends up as 0 and a NotEnoughReplicasException is thrown.
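
As the exception message itself points out, this replacement behaviour is governed by client-side settings. Below is a minimal tuning sketch of my own (not from the original investigation) showing how an HDFS client such as the RegionServer could relax the policy; the keys are assumed to be available in this Hadoop 2.6 deployment, and changing them only masks the symptom rather than fixing the rack topology:

import org.apache.hadoop.conf.Configuration;

// Hypothetical client-side tuning sketch; in a real deployment these values would live in
// hdfs-site.xml / hbase-site.xml rather than being set in code.
public class WalDatanodeReplacementTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The log shows the policy is currently ALWAYS: the writer insists on replacing the bad
        // datanode, which cannot succeed while only one rack is usable, so the WAL roll fails.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // Best-effort keeps writing even when no replacement datanode can be found
        // (assumption: this key is supported by the HDFS client version in use).
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
        System.out.println(conf.get("dfs.client.block.write.replace-datanode-on-failure.policy"));
    }
}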

 

Question 2: why does the problematic HDFS file have only one replica?

hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988/server121%2C16020%2C1526398855988.default.1526421652184

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[server82:50010], original=[server82:50010]).

See http://www.javashuo.com/article/p-zurtcnjy-gc.html for a detailed walkthrough of the HDFS code.

The relevant code:

int maxNodesPerRack = (totalNumOfReplicas-1)/numOfRacks + 2;

Result: maxNodesPerRack = (2-1)/5 + 2 = 2;

    if (maxNodesPerRack == totalNumOfReplicas) {

      maxNodesPerRack--;

    }

Result: maxNodesPerRack = 1;

Because there are so many racks, at most one datanode can be allocated per rack; given the current HDFS rack topology, in which only one rack is usable, the file ends up with just one replica.
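
For completeness, a tiny self-contained sketch (added here for illustration, not HDFS source code) that replays the arithmetic above with this cluster's numbers, 2 replicas in the pipeline and 5 racks:

// Minimal sketch of the maxNodesPerRack arithmetic described above.
public class MaxNodesPerRackDemo {
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;  // (2-1)/5 + 2 = 2
        if (maxNodesPerRack == totalNumOfReplicas) {
            maxNodesPerRack--;  // 2 == 2, so the result drops to 1
        }
        return maxNodesPerRack;
    }

    public static void main(String[] args) {
        System.out.println(maxNodesPerRack(2, 5)); // prints 1
    }
}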

 

Question 3: why does it take so long for the region to go from OFFLINE back to ONLINE?

Take the problematic region's journey from going offline at 2018-05-16 06:44:13 to coming back online at 2018-05-16 21:25:33:

 

HMaster log:

2018-05-16 06:44:13,437 INFO  [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [server121,16020,1526398855988]

2018-05-16 06:44:13,633 INFO  [ProcedureExecutor-2] procedure.ServerCrashProcedure: Start processing crashed server121,16020,1526398855988

2018-05-16 06:44:13,895 INFO  [ProcedureExecutor-8] master.SplitLogManager: dead splitlog workers [server121,16020,1526398855988]

2018-05-16 06:44:13,896 INFO  [ProcedureExecutor-8] master.SplitLogManager: Started splitting 3 logs in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] for [server121,16020,1526398855988]

2018-05-16 06:44:13,911 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 acquired by server94,16020,1526360475758

2018-05-16 06:44:13,919 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526424253042 acquired by server124,16020,1526398058280

2018-05-16 06:44:13,919 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526421652184 acquired by server119,16020,1526383578672

 

Once the regionserver dies, the master starts server-crash processing and first splits the dead server's WALs. There are 3 logs in total, and 2 of them are split quickly:

2018-05-16 06:44:13,964 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526424253042

2018-05-16 06:46:01,875 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526421652184

 

But one log keeps being retried and keeps failing:

2018-05-16 07:56:32,848 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 09:08:56,761 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 10:21:29,120 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 11:47:19,093 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 12:59:40,008 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 14:12:18,594 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 15:24:46,758 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 16:37:17,790 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 18:06:36,864 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 19:18:50,573 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 20:37:30,438 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 21:17:56,062 WARN  [main-EventThread] coordination.SplitLogManagerCoordination: Error splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

 

So the split of this log file was stuck in an error-and-retry loop from 06:44 until after 21:00. The underlying error shows up in the log of the first split attempt:

2018-05-16 06:44:13,911 INFO  [SplitLogWorker-server94:16020] coordination.ZkSplitLogWorkerCoordination: worker server94,16020,1526360475758 acquired task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 06:44:13,932 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: Splitting wal: hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, length=83

2018-05-16 06:44:13,932 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] wal.WALSplitter: DistributedLogReplay = false

2018-05-16 06:44:13,932 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Recover lease on dfs file hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998

 

2018-05-16 06:44:13,934 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 1ms

2018-05-16 06:44:17,935 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=1 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 4002ms

2018-05-16 06:45:22,051 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=2 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 68118ms

2018-05-16 06:47:30,834 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=3 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 196901ms

2018-05-16 06:50:43,622 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=4 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 389689ms

2018-05-16 06:55:00,314 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=5 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 646381ms

2018-05-16 07:00:20,600 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] util.FSHDFSUtils: Failed to recover lease, attempt=6 on file=hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 after 966667ms

2018-05-16 07:01:09,660 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:01:59,868 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:28464, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:02:56,853 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:30766, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

2018-05-16 07:03:56,370 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:33214, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

 

It turns out reads of the HDFS file are timing out. The detailed error:

2018-05-16 07:01:09,660 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Failed to connect to /server28:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

java.io.IOException: Got error for OP_READ_BLOCK, self=/server94:26354, remote=/server28:50010, for file /hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998, for pool BP-436366437-server131-1493982655699 block 1361314257_288238951

        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)

        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)

        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:819)

        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)

        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:354)

        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)

        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)

        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:848)

        at java.io.DataInputStream.read(DataInputStream.java:100)

        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:299)

        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)

        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:853)

        at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:777)

        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:298)

        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:236)

        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)

        at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)

        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2018-05-16 07:01:09,661 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: Could not obtain BP-436366437-server131-1493982655699:blk_1361314257_288238951 from any node: java.io.IOException: No live nodes contain block BP-436366437-server131-1493982655699:blk_1361314257_288238951 after checking nodes = [server127:50010, server28:50010], ignoredNodes = null No live nodes contain current block Block locations: server127:50010 server28:50010 Dead nodes:  server127:50010 server28:50010. Will get new block locations from namenode and retry...

2018-05-16 07:01:09,661 WARN  [RS_LOG_REPLAY_OPS-server94:16020-0] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 1196.0361088580376 msec.

2018-05-16 07:01:14,861 INFO  [RS_LOG_REPLAY_OPS-server94:16020-0] ipc.Client: Retrying connect to server: server127/server127:50020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
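
At this point it helps to know where HDFS thinks the replicas of the stuck block are and whether the file's lease can be recovered at all. The following is a rough diagnostic sketch of my own (not part of the original investigation), assuming the cluster configuration is on the classpath; the path is the WAL file from the logs above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Diagnostic sketch: inspect block locations of the stuck WAL and ask the NameNode to recover its lease.
public class WalBlockCheck {
    public static void main(String[] args) throws Exception {
        Path wal = new Path("hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/"
                + "server121%2C16020%2C1526398855988.default.1526418989998");
        Configuration conf = new Configuration();
        FileSystem fs = wal.getFileSystem(conf);

        // Which datanodes hold each block replica, according to the NameNode?
        FileStatus st = fs.getFileStatus(wal);
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(loc);
        }

        // Trigger lease recovery so the last block can be finalized and become readable.
        if (fs instanceof DistributedFileSystem) {
            boolean closed = ((DistributedFileSystem) fs).recoverLease(wal);
            System.out.println("lease recovered, file closed: " + closed);
        }
    }
}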

 

The split finally completed at 2018-05-16 21:25:31:

2018-05-16 21:25:31,858 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998 entered state: DONE server76,16020,1526424280100

2018-05-16 21:25:31,868 INFO  [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting/server121%2C16020%2C1526398855988.default.1526418989998 to hdfs://hdfs_name/hbase/oldWALs/server121%2C16020%2C1526398855988.default.1526418989998

2018-05-16 21:25:31,869 INFO  [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase/splitWAL/WALs%2Fserver121%2C16020%2C1526398855988-splitting%2Fserver121%252C16020%252C1526398855988.default.1526418989998

2018-05-16 21:25:31,870 INFO  [ProcedureExecutor-11] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdfs_name/hbase/WALs/server121,16020,1526398855988-splitting] in 455631ms

 

Many regions are then assigned, including the problematic one, which comes online on RegionServer server92 (see the logs earlier). For the assign logic, see:

org.apache.hadoop.hbase.master.AssignmentManager.assign
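
Incidentally, the HMaster log above shows that a manual assign was attempted at 14:21:44 and skipped because server121 was a "dead but not processed yet server". For reference, a minimal sketch of my own of such a manual assign through the HBase 1.x client API; it can only take effect once the ServerCrashProcedure has finished splitting the dead server's WALs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: manually assign a region via the Admin API.
public class ManualAssign {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Encoded region name taken from the logs (assumption: the encoded form is accepted,
            // as it is with the hbase shell's assign command).
            admin.assign(Bytes.toBytes("6060407a4aa5e23c8df09614dd3fe650"));
        }
    }
}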

 

Summary

At the root, all of these problems come from the HDFS rack configuration: some servers have rack information configured and some do not, so there are far too many racks and the servers are spread very unevenly across them. Under the HDFS placement policy described above, at most one datanode can then be chosen per rack, and because the datanodes on many of those racks have no free space left, a lot of files end up with only one replica. A single replica is a single point of failure: as soon as one datanode misbehaves, large numbers of files can become unavailable, which cascades into other problems such as the HBase region above staying offline for hours.

The problem surfaced in Hive, its direct cause was in HBase, its root cause was in HDFS, and at bottom it is a rack misconfiguration.

The rack-related HDFS configuration:

  <property>

    <name>net.topology.script.file.name</name>

    <value>/export/App/hadoop-2.6.1/etc/hadoop/rack.py</value>

  </property>
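
To check which rack each host actually resolves to under this script, a small diagnostic sketch of my own using Hadoop's topology mapping class can be run with the cluster configuration on the classpath (the host names below are examples taken from this post):

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.ScriptBasedMapping;

// Diagnostic sketch: resolve hosts to racks through the configured net.topology.script.file.name script.
public class RackCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        ScriptBasedMapping mapping = new ScriptBasedMapping();
        mapping.setConf(conf);
        List<String> hosts = Arrays.asList(args.length > 0 ? args : new String[] {"server82", "server121"});
        List<String> racks = mapping.resolve(hosts);
        for (int i = 0; i < hosts.size(); i++) {
            // Print the rack each host maps to, e.g. "server82 -> /rack1".
            System.out.println(hosts.get(i) + " -> " + racks.get(i));
        }
    }
}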
