HBase 永久RIT(Region-In-Transition)問題

 

2018-02-28java

HBase 永久RIT(Region-In-Transition)問題:異常關機致使HBase表損壞和丟失,大量Regions 處於Offline狀態,沒法上線。node

  • 問題1:啓動HBase時,HBase Regionserver Web UI,一直停留在The RegionServer is initializing! 界面
 Initializing Master file system (since 10mins, 16sec ago) 
The RegionServer is initializing!
  •  問題1:故障排查及解決思路

查看HBase日誌ios

2018-02-27 17:59:43,114 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 17:59:53,116 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:03,119 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:13,121 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:23,123 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:33,125 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:43,128 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:00:53,130 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:03,132 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:13,135 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:23,137 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:33,139 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:43,141 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:01:53,144 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:03,146 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:13,148 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...
2018-02-27 18:02:23,151 INFO  [prd-bldb-hdp-name01:60000.activeMasterManager] util.FSUtils: Waiting for dfs to exit safe mode...

 

# 查看hdfs safe mode
hadoop dfsadmin -safemode get
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is ON in namenode02/172.31.132.72:9000
Safe mode is ON in namenode01/172.31.132.71:9000

 

# 退出hdfs safe mode
hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF in namenode02/172.31.132.72:9000
Safe mode is OFF in namenode01/172.31.132.71:9000

 

  • 問題2:Hadoop Namenode Web UI 界面的報錯提示,有missing block。

 

 

 

  • 問題3:HBase  Master Web UI界面的提示,有大量Offline Regions 和RIT(Region-In-Transition)

 

  •  問題2/3:故障排查及解決思路
# 查看dfs 狀態報告
hadoop dfsadmin -report

 

# 查看損壞文件、當前hdfs的副本數
hdfs fsck / 或者 hadoop fsck -locations

 

Under replicated blocks    副本數少於指定副本數的block數量
Blocks with corrupt replicas   存在損壞副本的block的數據
Missing blocks        丟失block數量session

觀察發現,662-53=609 ,正好和問題2中的RIT數量,610 吻合!能夠初步判斷,HBase Offline Regions 和RIT(Region-In-Transition)問題,是因爲Hadoop裏存在未修復的block形成。app

以前屢次經過hbase hbck 試圖修復hbase表無效,原來仍是由於hadoop的副本沒有徹底恢復好。oop

對以上截圖中紅框部分的理解是:hdfs上有662個block存在副本缺失問題(Under replicated blocks),有609  block只剩1-2個副本,有53個block副本所有丟失(Missing blocks)。this

 

核心修復步驟1:spa

# 更改已經上傳文件的副本數,修復Missing blocks
hadoop fs -setrep -R 3 /

經過該命令,對於存在副本缺失問題(Under replicated blocks)的662個block,能夠從剩下的1-2個副本,從新生成3個副本,從而找回了丟失的副本。3d

 

Under replicated blocks: 53
Blocks with corrupt replicas: 0
Missing blocks: 53
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0rest

從這裏能夠看到,如今還有53個缺失的blocks,這53個缺失的block,一個副本都沒有。存在1-2個副本的block,已經所有修復!

 

核心修復步驟2:

# 刪除損壞文件
hdfs fsck -delete 

經過屢次運行該命令,對於副本所有丟失(Missing blocks)或損壞的53個block,能夠從namenode節點刪除元信息和損壞文件。

Under replicated blocks: 33
Blocks with corrupt replicas: 0
Missing blocks: 33
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 244

 

此時再查看HBase Master 的Web UI,Offline Regions 和RIT(Region-In-Transition) 已經大大下降。

 

 

過了沒多久,發現datanode03的regionserver掛了

 

此時經過重啓HBase和Hadoop集羣,便可消除所有問題

# 在namenode01執行,關閉HBase
stop-hbase.sh

# 在namnode01執行,關閉Hadoop
stop-all.sh

# 在namenode01執行
start-all.sh

# 在namenode01和namenode02節點,分別執行
start-hbase.sh

 

 

 

  • 其餘問題:HBase相關故障排查修復思路
HBase的 Regions in Transition 問題
# 查看hbase中損壞的block
hbase hbck

# 修復hbase
hbase hbck -repair


The Load Balancer is not enabled which will eventually cause performance degradation in HBase as Regions will not be distributed across all RegionServers. The balancer is only expected to be disabled during rolling upgrade scenarios. 


關閉balance,防止在停掉服務後,原先節點上的分片會遷移到其餘節點上,到時候在移回來,浪費時間。 
hbase(main):001:0> balance_switch true


2018-02-27 21:14:54,236 INFO  [hbasefsck-pool1-t38] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e540df791e7fcdc93c118b8055d1c74f, NAME => 'pos_flow_summary_20170713,,1503656523513.e540df791e7fcdc93c118b8055d1c74f.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,236 INFO  [hbasefsck-pool1-t47] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e59b1015c6fed189cdb9ba8493024563, NAME => 'pos_flow_summary_20180111,,1515768771542.e59b1015c6fed189cdb9ba8493024563.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,241 INFO  [hbasefsck-pool1-t44] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => d22e214e72ff89e87b4df3eebd9603f9, NAME => 'pos_flow_summary_20180112,,1515855181051.d22e214e72ff89e87b4df3eebd9603f9.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,244 INFO  [hbasefsck-pool1-t23] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => e8667191e988db9d65b52cfdb5e83a4d, NAME => 'pos_flow_summary_20170310,,1504229353726.e8667191e988db9d65b52cfdb5e83a4d.', STARTKEY => '', ENDKEY => ''}
2018-02-27 21:14:54,245 INFO  [hbasefsck-pool1-t45] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => d05b759994d757b8fc857993e3351648, NAME => 'app_point,5000,1510910952310.d05b759994d757b8fc857993e3351648.', STARTKEY => '5000', ENDKEY => '5505|1dcfb8c9a44c4147acc823c2e463d536'}

# 修復 .META表
hbase hbck -fixMeta

ERROR: Region { meta => pos_flow,2012|dd12dceee69c56f6776154d02e49f840,1518061965154.71eb7d463708010bc2a3f1e96deca135., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/71eb7d463708010bc2a3f1e96deca135, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow_summary_20180115,,1516115199923.70df944adbd82c1422be8f7ee8c24f3e., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow_summary_20180115/70df944adbd82c1422be8f7ee8c24f3e, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,5215|249f79b383f5c144cdd95cd1c29fdec3,1518380260884.67bfa42b4c45ec847c7eb27bbd7d86e5., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/67bfa42b4c45ec847c7eb27bbd7d86e5, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow_summary_20170528,,1504142971183.679bcdecd0335c99d847374db34de31d., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow_summary_20170528/679bcdecd0335c99d847374db34de31d, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,4744|dcf7bccc75f738986e5db100f1f54473,1518489513549.673e899d577f6111b5699b3374ba6adc., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/673e899d577f6111b5699b3374ba6adc, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,1321|ab83f75ef25bdd0d2ecc363fe1fe0106,1518466350793.66b9622950bba42339f011ac745b080b., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/66b9622950bba42339f011ac745b080b, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => pos_flow,9449|1ed33683e675c3e9ddbecf4d9bd42183,1518041132081.66b11e69bc62f356b3f81f351b8a6c68., hdfs => hdfs://namenode01:9000/hbase/data/default/pos_flow/66b11e69bc62f356b3f81f351b8a6c68, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => access_log,1000,1517363823393.65c41f802af180f41af848f1fed8e725., hdfs => hdfs://namenode01:9000/hbase/data/default/access_log/65c41f802af180f41af848f1fed8e725, deployed => , replicaId => 0 } not deployed on any region server.


Table pos_flow_summary_20180222 is okay.
    Number of regions: 0
    Deployed on: 
Table pos_flow_summary_20180223 is okay.
    Number of regions: 0
    Deployed on: 
Table pos_flow_summary_20180224 is okay.
    Number of regions: 1
    Deployed on:  prd-bldb-hdp-data02,60020,1519734905071
Table pos_flow_summary_20180225 is okay.
    Number of regions: 1
    Deployed on:  prd-bldb-hdp-data02,60020,1519734905071
Table pos_flow_summary_20180226 is okay.
    Number of regions: 1
    Deployed on:  prd-bldb-hdp-data02,60020,1519734905071
Table hbase:namespace is okay.
    Number of regions: 1
    Deployed on:  prd-bldb-hdp-data02,60020,1519734905071
Table gb_app_active is inconsistent.
    Number of regions: 7
    Deployed on:  prd-bldb-hdp-data01,60020,1519734905393 prd-bldb-hdp-data02,60020,1519734905071 prd-bldb-hdp-data03,60020,1519734905043
Table app_point is inconsistent.
    Number of regions: 3
    Deployed on:  prd-bldb-hdp-data01,60020,1519734905393 prd-bldb-hdp-data03,60020,1519734905043
970 inconsistencies detected.
Status: INCONSISTENT
2018-02-27 21:40:59,644 INFO  [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2018-02-27 21:40:59,644 INFO  [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x161d70981710083
2018-02-27 21:40:59,646 INFO  [main] zookeeper.ZooKeeper: Session: 0x161d70981710083 closed
2018-02-27 21:40:59,646 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down


# 當出現漏洞  
hbase hbck -fixHdfsHoles 

# 缺乏regioninfo  
hbase hbck -fixHdfsOrphans

# hbase region 引用文件出錯
# Found lingering reference file hdfs:  
hbase hbck -fixReferenceFiles 

# 修復assignments問題
hbase hbck -fixAssignments

2018-02-28 14:07:57,814 INFO  [hbasefsck-pool1-t40] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0836651ac3e23c331ca049e2f333e19f, NAME => 'pos_flow,9139|62b0a7a92cb8c4d25cea82991856334e,1518205798951.0836651ac3e23c331ca049e2f333e19f.', STARTKEY => '9139|62b0a7a92cb8c4d25cea82991856334e', ENDKEY => '9159|441da161eba8d989493f9d2ca2a3e4a2'}
2018-02-28 14:07:57,814 INFO  [hbasefsck-pool1-t11] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0e4dbc902294c799db7029df118c61a4, NAME => 'app_point,9000,1510910950800.0e4dbc902294c799db7029df118c61a4.', STARTKEY => '9000', ENDKEY => '9509|05a93'}
2018-02-28 14:07:57,817 INFO  [hbasefsck-pool1-t10] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 03f6d27f36e4f73e8030cfa6454dfadf, NAME => 'pos_flow_summary_20170913,,1505387404710.03f6d27f36e4f73e8030cfa6454dfadf.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,817 INFO  [hbasefsck-pool1-t35] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0634fab1259b036a5fbd024fd8da4ba7, NAME => 'pos_flow_summary_20171213,,1513262830799.0634fab1259b036a5fbd024fd8da4ba7.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,818 INFO  [hbasefsck-pool1-t29] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0666757ecbb89b60a52613cba2dab2f0, NAME => 'pos_flow,4344|8226741aea1eb243789f87abd6e44318,1518152887266.0666757ecbb89b60a52613cba2dab2f0.', STARTKEY => '4344|8226741aea1eb243789f87abd6e44318', ENDKEY => '4404|5fe6f71832f527f173696d3570556461'}
2018-02-28 14:07:57,819 INFO  [hbasefsck-pool1-t42] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 037d3dff1101418ea3c3868dc9855ecc, NAME => 'pos_flow,7037|c0124cbc08feb233e745aae0d896195a,1518489580825.037d3dff1101418ea3c3868dc9855ecc.', STARTKEY => '7037|c0124cbc08feb233e745aae0d896195a', ENDKEY => '7057|9fae9bddd296a534155c02297532cd28'}
2018-02-28 14:07:57,820 INFO  [hbasefsck-pool1-t12] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 01ff32a4f85c2de9e3e16c9b6156afa2, NAME => 'pos_flow_summary_20170902,,1504437004424.01ff32a4f85c2de9e3e16c9b6156afa2.', STARTKEY => '', ENDKEY => ''}
2018-02-28 14:07:57,823 INFO  [hbasefsck-pool1-t41] util.HBaseFsckRepair: Region still in transition, waiting for it to become assigned: {ENCODED => 0761d39042e2ec7002dbf291ce23e209, NAME => 'pos_flow_summary_20170725,,1503624161916.0761d39042e2ec7002dbf291ce23e209.', STARTKEY => '', ENDKEY => ''}
  

hbase master stop
hbase master start
service hbase-master restart


2018-02-28 14:41:39,547 INFO  [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:39,547 INFO  [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:39,547 WARN  [main] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 251.87213750182957 msec.
2018-02-28 14:41:39,799 INFO  [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:39,799 INFO  [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:39,799 WARN  [main] hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 5083.97300871329 msec.
2018-02-28 14:41:44,883 INFO  [main] hdfs.DFSClient: No node available for BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 file=/hbase/data/default/pos_flow_summary_20170304/.tabledesc/.tableinfo.0000000001
2018-02-28 14:41:44,883 INFO  [main] hdfs.DFSClient: Could not obtain BP-1225127698-172.31.132.71-1516782893469:blk_1073741999_1175 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
2018-02-28 14:41:44,883 WARN  [main] hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 9836.488682902267 msec.
相關文章
相關標籤/搜索