This walkthrough is in two parts: the first covers NameNode HA, the second ResourceManager HA.
(ResourceManager HA was added in hadoop-2.4.1.)
1. Start ZooKeeper
zkServer.sh start
You can check each node's state with zkServer.sh status (to see whether the node is the leader or a follower).
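On a healthy three-node ensemble, one node should report itself as leader and the other two as follower; the status output has roughly this shape (the config path below is an assumption and will vary per install):
zkServer.sh status
JMX enabled by default
Using config: /data/zookeeper/bin/../conf/zoo.cfg
Mode: follower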
2. On hadoop001, format the ZooKeeper cluster; this creates the znodes HA needs on the ZooKeeper ensemble
hdfs zkfc -formatZK
...
15/07/17 14:50:08 INFO ha.ActiveStandbyElector: Successfully deleted /hadoop-ha/appcluster from ZK.
15/07/17 14:50:08 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/appcluster in ZK.
Verify with zkCli.sh:
...
Welcome to ZooKeeper!
2015-07-17 14:51:32,531 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-07-17 14:51:32,544 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@852] - Socket connection established to localhost/127.0.0.1:2181, initiating session
JLine support is enabled
2015-07-17 14:51:32,561 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14e9ac4b6a60001, negotiated timeout = 30000
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /
[rmstore, yarn-leader-election, hadoop-ha, zookeeper]
ls /hadoop-ha
[appcluster]
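For reference, the appcluster znode name comes from the nameservice ID set in the previous section's hdfs-site.xml. A minimal recap of the HA naming properties this walkthrough assumes (IDs and addresses are taken from the bootstrapStandby output in step 6; only nn1 is shown, nn2 is analogous):
<property>
 <name>dfs.nameservices</name>
 <value>appcluster</value>
</property>
<property>
 <name>dfs.ha.namenodes.appcluster</name>
 <value>nn1,nn2</value>
</property>
<property>
 <name>dfs.namenode.rpc-address.appcluster.nn1</name>
 <value>hadoop001:8020</value>
</property>
<property>
 <name>dfs.namenode.http-address.appcluster.nn1</name>
 <value>hadoop001:50070</value>
</property>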
3. Start the JournalNode (edit-log) daemons on hadoop001, hadoop002 and hadoop003
hadoop-daemon.sh start journalnode
starting journalnode, logging to /data/hadoop-2.6.0/logs/hadoop-root-journalnode-hadoop001.out
jps
14183 QuorumPeerMain
14680 Jps
14459 JournalNode
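The NameNodes locate these JournalNodes through dfs.namenode.shared.edits.dir in hdfs-site.xml. Assuming the default JournalNode RPC port 8485, the setting from the previous section would look like this (the exact value is an assumption, shown as a reminder of why all three JournalNodes must be up before formatting):
<property>
 <name>dfs.namenode.shared.edits.dir</name>
 <value>qjournal://hadoop001:8485;hadoop002:8485;hadoop003:8485/appcluster</value>
</property>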
4. Format the NameNode (the JournalNode processes must be running)
hdfs namenode -format
If this is not the first format, first delete the data under the NameNode and DataNode storage directories by hand; otherwise the IDs recorded by the DataNodes will no longer match the newly formatted NameNode:
rm -rf /data/hadoop/storage/hdfs/name/* && rm -rf /data/hadoop/storage/hdfs/data/*
(For HDFS federation, i.e. several HDFS namespaces working in the same cluster, use hdfs namenode -format -clusterId [clusterID].)
5. Start the NameNode
hadoop-daemon.sh start namenode
6. Sync the NameNode metadata from hadoop001 to hadoop002
Note: run this on hadoop002 (the standby namenode):
hdfs namenode -bootstrapStandby
...
=====================================================
About to bootstrap Standby ID nn2 from:
           Nameservice ID: appcluster
        Other Namenode ID: nn1
  Other NN's HTTP address: http://hadoop001:50070
  Other NN's IPC  address: hadoop001/**.**.**.**:8020
             Namespace ID: 1358416288
            Block pool ID: BP-503387195-**.**.**.**-1437119166865
               Cluster ID: CID-51e580f5-f003-463d-ae45-e109a7ec31d4
           Layout version: -60
=====================================================
...
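As a sanity check, the namespace ID and cluster ID that bootstrapStandby copied to hadoop002 can be compared against the values printed above; they live in the VERSION file under the name directory configured in the previous section:
grep -E 'namespaceID|clusterID' /data/hadoop/storage/hdfs/name/current/VERSION
namespaceID=1358416288
clusterID=CID-51e580f5-f003-463d-ae45-e109a7ec31d4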
7. Start all the DataNodes
hadoop-daemons.sh start datanode
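To confirm the DataNodes have registered with the NameNode, hdfs dfsadmin -report lists each live node and its capacity:
hdfs dfsadmin -report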
8. Start YARN
start-yarn.sh
9. Start the ZooKeeperFailoverController on hadoop001 and hadoop002 (no need to start it on hadoop003, since hadoop003 is a pure DataNode)
hadoop-daemon.sh start zkfc
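Once both zkfc processes are up, one NameNode should have been elected active. The state can also be checked from the shell (nn1 and nn2 are the NameNode IDs from the config recap above):
hdfs haadmin -getServiceState nn1
active
hdfs haadmin -getServiceState nn2
standby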
10. Verify that automatic HA failover works
Since this runs on the company's remote servers, I could not check whether a NameNode was standby or active through the web UI, and could only watch the modification times of the edits files under the NameNode's namespace storage directory.
The namenode namespace directory was configured in the previous section's cluster setup as follows:
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///data/hadoop/storage/hdfs/name</value>
</property>
Under this path each of the two namenodes has two fsimage files; fsimage stores the metadata checkpoint. The active NameNode also keeps edit logs, and the edit log is updated on every HDFS operation, which shows up in the file timestamps, while the standby NameNode's edit log is not updated. As soon as the active NameNode is killed, the latest edit log updates appear under the standby NameNode's name directory. All of this is owed to the JournalNodes: a complete backup of the edit logs can be seen under the JournalNode directory.
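A minimal way to run this check (a sketch; the pid lookup and paths are assumptions based on the config above):
# on hadoop001: kill the active NameNode
kill -9 $(jps | awk '$2 == "NameNode" {print $1}')
# on hadoop002: the newest edits files should start advancing within seconds
ls -lt /data/hadoop/storage/hdfs/name/current | head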
Takeaways:
Start the cluster with great care; getting the order of operations wrong easily breaks failover.
Earlier, killing hadoop001's NameNode took hadoop002's NameNode down with it, and every HDFS operation failed with connection refused. I kept chasing connection problems, such as ports and /etc/hosts, but after restarting everything following the procedure above it simply worked again. I never found out what the original problem was; it was baffling, exhausting, and cost a lot of time.
So check after every step: look at the processes, and at the edit log updates under the name directory.
After the NameNode HA steps are done, we find that only one node's ResourceManager (here hadoop001's) is running; the other node's (hadoop002's) resourcemanager has to be started by hand:
yarn-daemon.sh start resourcemanager
Then check the resourcemanager state with:
yarn rmadmin -getServiceState rm1
The result shows active,
while rm2 is standby.
Verifying this HA works the same way as for NameNode HA: kill the active resourcemanager, and the standby resourcemanager switches to active.
There is also a command to force a transition:
yarn rmadmin -transitionToStandby rm1
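Note that when automatic failover is enabled, as it is here via the ZooKeeper-based election, rmadmin refuses a manual transition unless it is forced, so the command needs the --forcemanual flag (use with care):
yarn rmadmin -transitionToStandby --forcemanual rm1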