Apache Hadoop2.0之HDFS均衡操做分析

時間 2019-11-12

標籤 apache hadoop2.0 hadoop hdfs 均衡分析欄目 Apache 简体版

原文原文鏈接

1 HDFS均衡操做原理

HDFS默認的塊的副本存放策略是在發起請求的客戶端存放一個副本，若是這個客戶端在集羣之外，那就選擇一個不是太忙，存儲不是太滿的節點來存放，第二個副本放在與第一個副本相同的機架可是不一樣節點上，第三個放在與第二個和第一個副本不一樣的機架上，原則是儘可能避免在相同的機架上放太多的副本。java

隨着時間的推移，在各個DataNode節點上的數據塊會分佈的愈來愈不均衡。若是集羣不均衡的程度很嚴重，會下降Mapreduce的使用性能，致使部分DataNode節點相對而言變得更加繁忙。因此，應該儘可能的避免出現這種狀況。node

HDFS的 Balancer 類，是爲了實現HDFS的負載調整而存在的。Balancer類是以一個獨立的進程存在的，能夠獨立的運行和配置。它NameNode節點進行通訊，獲取各個DataNode節點的負載情況，從而進行調整DataNode上的Block的分佈。主要的調整其實就是一個操做，將一個數據塊從一個服務器搬遷到另外一個服務器上。Balancer會向相關的目標 DataNode 節點發出一個DataTransferProtocol.OP_REPLACE_BLOCK 消息，接收到這個消息的DataNode節點，會將從源DataNode節點傳輸來的數據塊寫入本地，寫成功後，通知NameNode，刪除源DataNode上的同一個數據塊，直到集羣達到均衡爲止，即每一個DataNode的使用率(該節點已使用的空間和空間容量之間的百分比值)和集羣的使用率(集羣中已使用的空間和集羣的空間容量之間的百分比值)很是接近，差距不超過均衡時給定的閾值。linux

其中，一個塊是否能夠被移動，要知足三個條件：web

（1）正在被移動或者已經被移動的塊，不會重複移動apache

（2）一個塊若是在源節點和目標節點上都有其副本，則此塊不會被移動；緩存

（3）移動不會減小一個塊所在的機架的數目；服務器

可見，因爲上述等條件的限制，均衡操做並不能使得HDFS達到真正意義上的均衡，它只能是儘可能的減小不均衡。網絡

均衡操做依靠一個均衡操做服務器、NameNode的代理和DataNode來實現，其邏輯流程以下：併發

其中，socket

Step1:Rebalance Server從Name Node中獲取全部的Data Node狀況，即每個Data Node磁盤使用狀況;

Step2: Rebalance Server計算哪些Dataode節點須要將數據移動，哪些Dataode節點能夠接受移動的塊數據，而且從NameNode中獲取須要移動的數據分佈狀況;

Step3: Rebalance Server計算出來能夠將哪一臺Dataode節點的block移動到另外一臺機器中去;

Step四、五、6：須要移動block的Dataode節點將數據移動到目標DataNode節點上去，同時刪除本身節點上的block數據；

Step7: Rebalance Server獲取到本次數據移動的執行結果，並繼續執行這個過程，一直到沒有數據能夠移動或者HDFS集羣以及達到了平衡的標準爲止；

在step2中，HDFS會把當前的DataNode節點根據閾值的設定狀況劃分到四個鏈表中：

(1)over組：此組中的DataNode的均知足

DataNode_usedSpace_percent > Cluster_usedSpace_percent + threshold；

(2)above組：此組中的DataNode的均知足

Cluster_usedSpace_percent + threshold > DataNode_ usedSpace _percent > Cluster_usedSpace_percent；

(3)below組：此組中的DataNode的均知足

Cluster_usedSpace_percent > DataNode_ usedSpace_percent > Cluster_ usedSpace_percent – threshold；

(4)under組：此組中的DataNode的均知足

Cluster_usedSpace_percent – threshold > DataNode_usedSpace_percent；

用一個示例圖表示：

在移動塊的時候，會把over組和above組中的塊向below組和under組移動，直到均衡狀態或者達到均衡退出的條件爲止。

總得來講，均衡操做的步驟能夠分爲4步：

(1)從namenode獲取datanode磁盤使用狀況;

(2)計算哪些節點須要把哪些數據移動到哪裏;

(3)分別移動，完成後刪除舊的block信息;

(4)循環執行，直到達到平衡標準;

2 HDFS均衡操做的啓動

使用HDFS的balancer命令，能夠配置一個Threshold來平衡每個DataNode磁盤利用率。命令以下：

start-balancer.sh -threshold 8

運行以後，會有Balancer進程出現：

上述命令設置了Threshold爲8%，那麼執行balancer命令的時候，首先統計全部DataNode的磁盤利用率的均值，而後判斷若是某一個DataNode的磁盤利用率超過這個均值Threshold，那麼將會把這個DataNode的block轉移到磁盤利用率低的DataNode，這對於新節點的加入來講十分有用。Threshold的值爲1到100之間，不顯示的進行參數設置的話，默認是10。

範圍超出以後，會有異常拋出：

java.lang.IllegalArgumentException: Number out of range: threshold = 0.07

at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.parse(Balancer.java:1535)

at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:1510)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1582)

2012-12-19 16:28:33,299 ERROR org.apache.hadoop.hdfs.server.balancer.Balancer: Exiting balancer due an exception

java.lang.IllegalArgumentException: Number out of range: threshold = 110.0

at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.parse(Balancer.java:1535)

at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:1510)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1582)

若是參數值設置的越小，花費的時間就越長。使用此命令時，會反覆的從磁盤使用率高的節點上，把塊轉移到磁盤使用率低的磁盤上，每次移動不超過10G大小，每次移動不超過20分鐘。

在作均衡的時候，會對網絡帶寬有影響，可在配置文件中對均衡操做的帶寬作限制：

<name>dfs.balance.bandwidthPerSec</name>

Specifies the maximum bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second.

</description>

</property>

若不設置，則balance操做時，速度默認爲1M/S大小。參數重啓時生效。不容許在集羣中使用多個均衡同時操做。

3 HDFS均衡操做的退出

除了在命令行直接使用stop-balancer.sh腳原本執行退出均衡操做以外，當發生如下幾種狀況時，當前執行的均衡操做也會退出：

(1)集羣已經達到均衡狀態；

(2)沒有塊能夠再被移動；

(3)連續五次迭代操做時沒有塊移動；

(4)和NameNode通訊時出現IOException；

(5)另一個均衡操做啓動；

4 實例分析

均衡操做以前DataNode節點塊分佈狀況：

在sbin目錄下執行命令./start-balancer.sh -threshold 5

均衡過程當中：

開始作均衡操做時，會有以下日誌打印出：

2012-12-28 10:08:55,667 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 5.0

2012-12-28 10:08:55,668 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: namenodes = [hdfs://goon]

2012-12-28 10:08:55,669 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: p = Balancer.Parameters[BalancingPolicy.Node, threshold=5.0]

其中會標明對哪些nameservice進行均衡，同時對輸入的參數進行說明，默認的策略是BalancingPolicy.Node，表示均衡的對象是DataNode，不然是對塊池作均衡。

以後，會計算須要移動的塊，移動的字節數，塊的源地址和目標地址：

2012-12-28 10:08:57,200 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[10.28.169.126:50010, utilization=25.807874799394025]]

2012-12-28 10:08:57,200 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 underutilized: [BalancerDatanode[10.28.169.122:50010, utilization=9.395091359283992]]

2012-12-28 10:08:57,201 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 2.75 GB to make the cluster balanced.

2012-12-28 10:08:57,202 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 2.46 GB bytes from 10.28.169.126:50010 to 10.28.169.122:50010

2012-12-28 10:08:57,203 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 2.46 GB in this iteration

此時：

剛開始均衡操做時，其進程所佔資源：

同時，能夠觀察到會有塊的移動，從datanode0移動到sdc2和Tdatanode0，在NN會有以下日誌(hadoop-hdfs-balancer-Tdatanode0.log)：

2012-12-28 10:08:57,724 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Moving block 365766470964968392 from 10.28.169.126:50010 to 10.28.169.122:50010 through 10.28.169.225:50010 is succeeded.

2012-12-28 10:08:57,724 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Moving block 7346867188539736438 from 10.28.169.126:50010 to 10.28.169.122:50010 through 10.28.169.225:50010 is succeeded.

2012-12-28 10:08:57,724 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Moving block -6882244909405313690 from 10.28.169.126:50010 to 10.28.169.122:50010 through 10.28.169.225:50010 is succeeded.

因爲在Balancer類中對線程池作了限制，

final static private int MOVER_THREAD_POOL_SIZE = 1000;

因此最多能夠有1000個線程併發塊移動操做，可是向一個DN進行移動塊時，最多5個塊併發移動。

因爲機器資源的限制，當均衡操做建立的線程數量達到900多的時候，就沒法再建立線程，並有以下錯誤：

2012-12-28 10:19:32,905 WARN org.apache.hadoop.hdfs.server.balancer.Balancer: Dispatcher thread failed

java.lang.OutOfMemoryError: unable to create new native thread

at java.lang.Thread.start0(Native Method)

at java.lang.Thread.start(Thread.java:640)

at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703)

at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:652)

at org.apache.hadoop.hdfs.server.balancer.Balancer$PendingBlockMove.scheduleBlockMove(Balancer.java:402)

at org.apache.hadoop.hdfs.server.balancer.Balancer$PendingBlockMove.access$3500(Balancer.java:236)

at org.apache.hadoop.hdfs.server.balancer.Balancer$Source.dispatchBlocks(Balancer.java:746)

at org.apache.hadoop.hdfs.server.balancer.Balancer$Source.access$2000(Balancer.java:591)

at org.apache.hadoop.hdfs.server.balancer.Balancer$Source$BlockMoveDispatcher.run(Balancer.java:598)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

at java.util.concurrent.FutureTask.run(FutureTask.java:138)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)

查看均衡操做所產生的線程會被阻塞：

Thread 15688: (state = BLOCKED)

- sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)

- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)

- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Interpreted frame)

- java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Interpreted frame)

- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=78, line=947 (Interpreted frame)

- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=18, line=907 (Interpreted frame)

- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

Thread 15687: (state = BLOCKED)

- sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)

- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)

- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Interpreted frame)

- java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Interpreted frame)

- java.util.concurrent.ThreadPoolExecutor.getTask() @bci=78, line=947 (Interpreted frame)

- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=18, line=907 (Interpreted frame)

- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)

同時，因爲沒法建立線程，RPC通訊也會阻塞，

2012-12-28 10:37:28,710 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB. Trying to fail over immediately.

2012-12-28 10:37:28,716 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB after 1 fail over attempts. Trying to fail over immediately.

2012-12-28 10:37:28,721 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB after 2 fail over attempts. Trying to fail over immediately.

。。。

2012-12-28 10:37:28,795 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB after 14 fail over attempts. Trying to fail over immediately.

2012-12-28 10:37:28,802 WARN org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease. Not retrying because failovers (15) exceeded maximum allowed (15)

java.io.IOException: Failed on local exception: java.io.IOException: Couldn’t set up IO streams; Host Details : local host is: 「Tdatanode0/10.28.169.126″; destination host is: 「sdc1″:9000;

at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)

at org.apache.hadoop.ipc.Client.call(Client.java:1168)

at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)

at $Proxy11.renewLease(Unknown Source)

at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:452)

at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)

at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)

at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)

at $Proxy12.renewLease(Unknown Source)

at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:613)

at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:411)

at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:436)

at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:70)

at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:297)

at java.lang.Thread.run(Thread.java:662)

Caused by: java.io.IOException: Couldn’t set up IO streams

at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:640)

at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:220)

at org.apache.hadoop.ipc.Client.getConnection(Client.java:1217)

at org.apache.hadoop.ipc.Client.call(Client.java:1144)

… 15 more

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

at java.lang.Thread.start0(Native Method)

at java.lang.Thread.start(Thread.java:640)

at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:633)

… 18 more

對每一個NameNode都會嘗試兩次鏈接 set up IO streams，週期性的進行鏈接請求，即NN1鏈接不成功(嘗試鏈接2次)，而後就去嘗試鏈接NN2(若嘗試鏈接2次，不成功，就再去鏈接NN1)，每次嘗試鏈接NN後，都會有15次的更新租約嘗試，固然，鏈接不上NN，租約更新也是失效的。

同時，因爲此均衡是在NN上操做的，此NN節點會出現僵死狀態(在NN上進行均衡僅爲測試用，真正使用時不會在NN上)，經過web不能訪問，由於操做不能再建立新的線程。而且，若是當前是在active節點上進行的操做，並且在配置文件中配置了容許HA自動切換，那麼此時會發生HA自動切換，即當前的standby節點變爲active節點。

能夠在linux下更改參數進行擴大線程建立的數量：

更改以前：

經過命令ulimit -u 65535臨時修改，或者經過修改配置文件永久修改：/etc/security/limits.conf

同時也能夠修改棧空間所佔的大小，默認是10240字節，能夠經過ulimit -s 1024修改成1024字節，減小每一個堆棧的使用空間，也能夠增長線程的數量。

另外，在均衡過程當中，Balancer的內存使用狀況以下：

所佔內存大概是140M左右。在運行過程當中，會產生一些垃圾文件，由於節點硬件的限制，因此必須對緩存進行清理：

若出現socket鏈接異常，則塊移動失敗，則會提示：

2012-12-21 13:10:42,915 WARN org.apache.hadoop.hdfs.server.balancer.Balancer: Error moving block -5503762968190078708 from 10.28.169.225:50010 to 10.28.169.122:50010 through 10.28.169.225:50010: Connection reset

均衡以後：

當出現日誌：

2012-12-28 12:20:00,430 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.28.169.122:50010

2012-12-28 12:20:00,431 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.28.169.225:50010

2012-12-28 12:20:00,431 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.28.169.126:50010

2012-12-28 12:20:00,431 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 over-utilized: []

2012-12-28 12:20:00,431 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []

均衡操做結束，由於Over組沒有DataNode，Under組也沒有DataNode，此時：

在均衡操做以前： clusterAvg1

=集羣dfs已使用空間/集羣總空間

=(12.7+12.19+4.62)/(49.22+49.22+49.22)

=19.99%

咱們在作均衡時所設定的閾值爲5，即百分之5，clusterAvg1-5%=14.99%，clusterAvg1+5%=24.99%，而Tdatanode0的空間使用率是25.81%，超過 24.99% ，屬於over節點，sdc2的空間使用率是9.4%，低於14.99%，屬於under節點，datanode0的空間使用率是24.76%，高於14.99%可是低於24.99%，屬於above節點，符合均衡操做的發生條件。

集羣的空間平均使用率爲： clusterAvg2

=集羣dfs已使用空間/集羣總空間

=(12.7+11.46+8.91)/(49.22+49.22+49.22)

=22.41%

咱們在作均衡時所設定的閾值爲5，即百分之5，clusterAvg2-5%=17.41%，clusterAvg2+5%=27.41%，全部的DN的空間使用率都在17.41%–27.41%中間，說明集羣已經達到均衡狀態。

轉載：http://www.tuicool.com/articles/uaaEve