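What should you do when a Region gets stuck permanently in RIT (Region-In-Transition) during a merge? I ran into exactly this in a production environment: while merging Regions in bulk, a Region became stuck permanently in the MERGING_NEW state. This does not affect the cluster's ability to serve existing traffic, but if a node in the cluster restarts, the Regions on that RegionServer may not be rebalanced afterwards. The reason is that HBase does not run Region load balancing while a Region is in RIT; even running the balancer command manually has no effect.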
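If the RIT is left unresolved and more HBase nodes restart one after another, the cluster's Regions become severely unbalanced, which is critical and will significantly hurt cluster performance. Searching the HBase JIRA shows that this permanent MERGING_NEW RIT is caused by the bug tracked in HBASE-17682, and that the corresponding patch must be applied to fix it. The root cause is that the HBase source code does not check for the MERGING_NEW state in its branch logic, so execution falls straight through to the else branch. The source code is as follows: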
for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
        " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}
The code after the fix:
for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
        " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
      // SPLITTING_NEW is already caught by the branch above, so in practice
      // this branch is what adds handling for MERGING_NEW.
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}
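To make the difference between the two versions easier to see, here is a minimal, self-contained sketch of just the inner branch. It is not HBase code: the class RitBranchSketch, the trimmed-down State enum, the Action enum, and both classify methods are hypothetical names introduced only for illustration, and the real method works on RegionState objects rather than on a bare enum.

```java
// Simplified, hypothetical model of the inner branch shown above; not HBase code.
final class RitBranchSketch {
    // Only the states relevant to this branch are modeled.
    enum State { OFFLINE, PENDING_OPEN, OPENING, FAILED_CLOSE, SPLITTING_NEW, MERGING_NEW }

    // What the branch decides to do with a region in a given state.
    enum Action { REASSIGN, CLEAN_IF_NO_META_ENTRY, UNEXPECTED }

    static boolean isOneOfStates(State state, State... candidates) {
        for (State candidate : candidates) {
            if (candidate == state) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the branch before the patch: MERGING_NEW is not checked.
    static Action classifyBeforeFix(State state) {
        if (isOneOfStates(state, State.PENDING_OPEN, State.OPENING,
                State.FAILED_CLOSE, State.OFFLINE)) {
            return Action.REASSIGN;
        } else if (state == State.SPLITTING_NEW) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        } else {
            return Action.UNEXPECTED; // the "THIS SHOULD NOT HAPPEN" path
        }
    }

    // Mirrors the branch after the patch: MERGING_NEW is cleaned up too.
    static Action classifyAfterFix(State state) {
        if (isOneOfStates(state, State.PENDING_OPEN, State.OPENING,
                State.FAILED_CLOSE, State.OFFLINE)) {
            return Action.REASSIGN;
        } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
            return Action.CLEAN_IF_NO_META_ENTRY;
        } else {
            return Action.UNEXPECTED;
        }
    }

    public static void main(String[] args) {
        // prints "before fix: UNEXPECTED"
        System.out.println("before fix: " + classifyBeforeFix(State.MERGING_NEW));
        // prints "after fix: CLEAN_IF_NO_META_ENTRY"
        System.out.println("after fix: " + classifyAfterFix(State.MERGING_NEW));
    }
}
```

In this model, a MERGING_NEW region is only logged as unexpected before the fix and is never queued for cleanup, which matches the stuck RIT described above.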
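There is a catch, though. The JIRA issue only says that the bug needs to be fixed by applying the patch. In a real production environment, faced with this kind of RIT, it is not feasible to stop the cluster for a long time and disrupt application reads and writes. So is there a temporary workaround that clears the permanent MERGING_NEW RIT now and leaves the HBase version upgrade for later?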
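There is. Analyzing the merge flow shows that when HBase merges Regions, it first creates the new Region in an initial MERGING_NEW state. The overall Region merge flow is as follows:
[Figure: Region merge flow diagram]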
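As the flow diagram shows, MERGING_NEW is an initial state that lives in the active Master's memory, while a Master in Backup state has no MERGING_NEW state for the new Region in its memory. A Master failover (switching the active and backup Masters) can therefore clear the permanent RIT temporarily. Because HBase is a highly available cluster, the failover is transparent to user applications. So a permanent MERGING_NEW RIT can be handled in the short term with a Master failover, after which the bug should be fixed properly by applying the patch and upgrading HBase.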
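The reasoning behind the failover workaround can be illustrated with a toy model. This is only a sketch of the argument above, not of how the HBase Master actually starts up: ToyMaster, startFromMeta, beginMerge, and canBalance are hypothetical names, and the model assumes that the MERGING_NEW entry exists only in the active Master's memory and that balancing is refused while any Region is in transition.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of the failover argument; all names are hypothetical.
final class MasterFailoverSketch {
    static final class ToyMaster {
        // Regions this master learned about from the persisted meta information.
        private final Set<String> knownRegions;
        // Regions in transition, held only in this master's memory.
        private final Map<String, String> regionsInTransition = new HashMap<>();

        private ToyMaster(Set<String> regionsFromMeta) {
            this.knownRegions = new HashSet<>(regionsFromMeta);
        }

        // A master that becomes active builds its view from meta only.
        static ToyMaster startFromMeta(Set<String> metaRegions) {
            return new ToyMaster(metaRegions);
        }

        // Starting a merge records the new region as MERGING_NEW in memory only.
        void beginMerge(String mergedRegion) {
            regionsInTransition.put(mergedRegion, "MERGING_NEW");
        }

        // Assumption from the article: no balancing while anything is in RIT.
        boolean canBalance() {
            return regionsInTransition.isEmpty();
        }

        int knownRegionCount() {
            return knownRegions.size();
        }
    }

    public static void main(String[] args) {
        Set<String> meta = new HashSet<>();
        meta.add("regionA");
        meta.add("regionB");

        ToyMaster active = ToyMaster.startFromMeta(meta);
        active.beginMerge("regionAB"); // the merge gets stuck here
        System.out.println("active master can balance: " + active.canBalance());

        // Failover: the backup master takes over and never saw the stuck entry.
        ToyMaster promoted = ToyMaster.startFromMeta(meta);
        System.out.println("promoted master can balance: " + promoted.canBalance());
    }
}
```

In this model the stuck entry disappears only because the promoted master never held it in memory; the underlying bug is still present until the patch is applied.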