Resolving a Permanent RIT During an HBase Region Merge

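What do you do when a Region gets permanently stuck in RIT (Region-In-Transition) during a merge? I ran into exactly this in production: while merging Regions in bulk, a Region ended up permanently in the MERGING_NEW state. This does not affect the cluster's ability to serve requests, but if a node in the cluster restarts, the Regions on that RegionServer may not be rebalanced afterwards. While any Region is in RIT, HBase does not run Region load balancing, and manually running the balancer command has no effect either.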

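If the RIT is left unresolved and more HBase nodes restart one after another, Region distribution across the cluster becomes severely unbalanced, which can badly hurt cluster performance. Searching the HBase JIRA shows that this permanent MERGING_NEW RIT is the bug tracked in HBASE-17682, and the fix is to apply that patch. The root cause is that the HBase source does not check for the MERGING_NEW state in this piece of logic, so execution falls straight through to the else branch. The original code is as follows: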

for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
        " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}

The code after the fix:

for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
        " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
      // New branch: MERGING_NEW is now cleaned up here instead of reaching the else.
      // (SPLITTING_NEW is already caught by the isSplittingNew() branch above.)
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}

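To make the difference between the two versions easier to see, here is a minimal, self-contained sketch of just the inner branch logic. It is not HBase code: the `State` and `Action` enums and the `classifyBeforeFix` / `classifyAfterFix` methods are hypothetical names introduced for illustration, and the state list is only a subset of the real one.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified stand-in for the branch logic above; not actual HBase code. */
class RitTriageSketch {

    /** Hypothetical subset of region states, named after the HBase ones. */
    enum State { OPENING, PENDING_OPEN, FAILED_CLOSE, OFFLINE, SPLITTING_NEW, MERGING_NEW, CLOSING }

    /** What the crash-handling code decides to do with a region in transition. */
    enum Action { REASSIGN, CLEAN_IF_NO_META_ENTRY, UNEXPECTED }

    /** States that the first inner branch hands over for reassignment. */
    private static final Set<State> REASSIGNABLE =
        EnumSet.of(State.OPENING, State.PENDING_OPEN, State.FAILED_CLOSE, State.OFFLINE);

    /** Branches before the patch: MERGING_NEW reaches the final else. */
    static Action classifyBeforeFix(State s) {
        if (REASSIGNABLE.contains(s)) return Action.REASSIGN;
        if (s == State.SPLITTING_NEW) return Action.CLEAN_IF_NO_META_ENTRY;
        return Action.UNEXPECTED; // the "THIS SHOULD NOT HAPPEN" path
    }

    /** Branches after the patch: MERGING_NEW is cleaned up like SPLITTING_NEW. */
    static Action classifyAfterFix(State s) {
        if (REASSIGNABLE.contains(s)) return Action.REASSIGN;
        if (s == State.SPLITTING_NEW || s == State.MERGING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        }
        return Action.UNEXPECTED;
    }

    public static void main(String[] args) {
        System.out.println("before: " + classifyBeforeFix(State.MERGING_NEW));
        System.out.println("after:  " + classifyAfterFix(State.MERGING_NEW));
    }
}
```

In this sketch a MERGING_NEW region is classified as UNEXPECTED before the fix and as CLEAN_IF_NO_META_ENTRY after it, while every other state is treated the same in both versions.
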
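There is a catch, though. The JIRA only says that the bug needs to be fixed by applying the patch. In a real production environment, you cannot stop the cluster for an extended period and disrupt application reads and writes just to deal with this RIT. So is there a temporary workaround that clears the current permanent MERGING_NEW RIT first, leaving the HBase version upgrade for later?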

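There is. After analyzing the merge flow, I found that when HBase merges Regions it first creates the new Region in an initial MERGING_NEW state. The overall Region merge flow is as follows: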

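As the flowchart shows, MERGING_NEW is an initial state that lives in the active Master's memory, whereas a Master in Backup state has no MERGING_NEW entry for the new Region in its memory. An active/backup Master switchover can therefore temporarily clear the permanent RIT. Because HBase is a highly available cluster, the switchover is transparent to user applications. So a permanent RIT in MERGING_NEW can be handled temporarily by switching the HBase Master, after which we fix the bug properly by applying the patch and upgrading HBase.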
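
The reasoning behind the workaround can be sketched with a toy model. This is only an illustration under the assumption stated above, namely that the stuck MERGING_NEW entry exists only in the active Master's memory and has no persisted counterpart. The class `MasterFailoverSketch`, the `Master.startFrom` method, and the region names are all hypothetical, and the persisted region table is reduced to a plain map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of why a Master switchover drops an in-memory-only RIT entry. */
class MasterFailoverSketch {

    /** A master whose regions-in-transition table lives only in its own heap. */
    static final class Master {
        final Map<String, String> regionsInTransition = new HashMap<>();

        /**
         * A master that becomes active rebuilds its view from the persisted
         * region table (region name -> state), not from the previous master's
         * memory. Only regions persisted in a non-OPEN state start out in RIT.
         */
        static Master startFrom(Map<String, String> persistedRegions) {
            Master m = new Master();
            for (Map.Entry<String, String> e : persistedRegions.entrySet()) {
                if (!"OPEN".equals(e.getValue())) {
                    m.regionsInTransition.put(e.getKey(), e.getValue());
                }
            }
            return m;
        }
    }

    public static void main(String[] args) {
        // Persisted table: the two parent regions are open; the merged region has no row.
        Map<String, String> persisted = new HashMap<>();
        persisted.put("regionA", "OPEN");
        persisted.put("regionB", "OPEN");

        // The active master creates the merged region in memory and gets stuck there.
        Master active = Master.startFrom(persisted);
        active.regionsInTransition.put("regionAB", "MERGING_NEW");

        // After a switchover, the former backup rebuilds state from the persisted table.
        Master newActive = Master.startFrom(persisted);

        System.out.println("old active RIT: " + active.regionsInTransition);
        System.out.println("new active RIT: " + newActive.regionsInTransition);
    }
}
```

In the model the old active Master still lists `regionAB` as MERGING_NEW, while the newly active Master starts with an empty RIT table, which is the effect the switchover relies on. It does not fix the underlying bug, so the patch is still needed.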