官網_2.6.5_HDFS高可用性使用仲裁日誌管理器（HDFS HA QJM）

時間 2020-08-10

標籤 2.6.5 hdfs 用性使用仲裁日誌管理器 qjm 欄目 Hadoop 简体版

原文原文鏈接

目的（Purpose）java

使用仲裁日誌管理器（QJM）功能概述了HDFS高可用性（HA）功能以及如何配置和管理HA HDFS羣集。node

背景（Background）shell

在Hadoop 2.0.0以前，NameNode是HDFS集羣中的單點故障（SPOF）。每一個集羣都有一個NameNode，若是該機器或進程不可用，整個集羣將不可用，直到NameNode從新啓動或在單獨的計算機上啓動爲止。apache

這在兩個主要方面影響了HDFS集羣的整體可用性：bootstrap

1）在計劃外事件（例如機器崩潰）的狀況下，直到操做員從新啓動NameNode後，集羣纔可用。安全

2）計劃的維護事件（如NameNode計算機上的軟件或硬件升級）將致使集羣停機。服務器

HDFS高可用性功能經過提供在具備熱備份的Active/Passive配置中在同一集羣中運行兩個冗餘NameNode的選項來解決上述問題。這容許在計算機崩潰的狀況下快速故障轉移到新的NameNode，或者爲計劃維護目的而進行管理員啓動的正常故障轉移。網絡

體系結構（Architecture）session

在典型的HA羣集中，兩臺獨立的機器配置爲NameNode。在任什麼時候候，只有一個NameNode處於Active狀態，另外一個處於Standby狀態。Active NameNode負責集羣中的全部客戶端操做，而Standby僅充當從服務器，並保持足夠的狀態以在必要時提供快速故障轉移。less

爲了使備用節點保持其與主動節點的狀態同步，兩個節點都與一組稱爲「日誌節點」（JN）的獨立守護進程進行通訊。當活動節點執行任何名稱空間修改時，它會將修改記錄持久記錄到大多數這些JN中。備用節點可以讀取來自JN的編輯，並不斷監視它們以更改編輯日誌。當待機節點看到編輯時，它將它們應用到它本身的名稱空間。若是發生故障轉移，備用服務器將確保它在將本身提高爲活動狀態以前已經從JounalNodes中讀取全部編輯。這確保了在故障轉移發生以前命名空間狀態已徹底同步。

爲了提供快速故障切換，備用節點還須要有關於集羣中塊的位置的最新信息。爲了實現這一點，DataNode配置了兩個NameNode的位置，並將塊位置信息和心跳發送到二者。

一次只有一個NameNode處於活動狀態對於HA集羣的正確操做相當重要。不然，命名空間狀態將很快在二者之間發生分歧，從而可能致使數據丟失或其餘不正確的結果。爲了確保這個屬性並防止所謂的「裂腦場景」，JournalNodes將永遠只容許一個NameNode成爲一個Active。在故障轉移期間，要成爲活動狀態的NameNode將簡單地接管寫入JournalNodes的角色，這將有效地防止其餘NameNode繼續處於活動狀態，從而容許新的Active安全地進行故障轉移。

硬件資源（Hardware resources）

爲了部署HA羣集，應該準備如下內容：

1）NameNode計算機 - 運行Active和Standby NameNode的計算機應該具備彼此相同的硬件，以及與非HA集羣中使用的硬件相同的硬件。

2）JournalNode機器 - 您運行JournalNodes的機器。JournalNode守護進程相對輕量級，所以這些守護進程能夠合理地與具備其餘Hadoop守護進程的計算機並置，例如NameNodes，JobTracker或YARN ResourceManager。注意：必須至少有3個JournalNode守護進程，由於必須將編輯日誌修改寫入大多數JN。這將容許系統容忍單臺機器的故障。也能夠運行3個以上的JournalNodes，但爲了實際增長系統能夠容忍的故障次數，應該運行奇數個JN（即3，5，7等）。請注意，在運行N個JournalNodes時，系統最多能夠承受（N-1）/ 2次故障並繼續正常運行。

請注意，在HA羣集中，備用NameNode還執行名稱空間狀態的檢查點，所以不須要在HA羣集中運行Secondary NameNode，CheckpointNode或BackupNode。事實上，這樣作會是一個錯誤。

部署（Deployment）

配置概述（Configuration overview）

HA配置向後兼容，並容許現有的單一NameNode配置無需更改便可運行。新配置的設計使集羣中的全部節點均可以具備相同的配置，而無需根據節點的類型將不一樣的配置文件部署到不一樣的機器。

和HDFS聯合同樣，HA集羣重用nameservice ID來標識實際上可能由多個HA NameNode組成的單個HDFS實例。另外，HA中添加了名爲NameNode ID的新抽象。羣集中每一個不一樣的NameNode都有一個不一樣的NameNode ID來區分它。爲了支持全部NameNode的單個配置文件，相關配置參數後綴爲nameservice ID以及NameNode ID。

配置細節（Configuration details）

要配置HA NameNode，必須將多個配置選項添加到hdfs-site.xml配置文件。

設置這些配置的順序並不重要，可是爲dfs.nameservices和dfs.ha.namenodes.[nameservice ID] 選擇的值將決定後面的那些鍵。所以，應該在設置其他配置選項以前決定這些值。

配置屬性

描述

示例

dfs.nameservices

dfs.nameservices - the logical name for this new nameservice

Choose a logical name for this nameservice, for example "mycluster", and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.

dfs.nameservices - 這個新名稱服務的邏輯名稱

爲此名稱服務選擇一個邏輯名稱，例如「mycluster」，並使用此邏輯名稱做爲此配置選項的值。你選擇的名字是任意的。它將用於配置，也可用做羣集中絕對HDFS路徑的權威組件。

注意：若是還使用HDFS聯合身份驗證，則此配置設置還應該將其餘名稱服務（HA或其餘）的列表做爲逗號分隔列表。

<name>dfs.nameservices</name>

<value>mycluster</value>

</property>

dfs.ha.namenodes.[nameservice ID]

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice

Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used "mycluster" as the nameservice ID previously, and you wanted to use "nn1" and "nn2" as the individual IDs of the NameNodes, you would configure this。

Note: Currently, only a maximum of two NameNodes may be configured per nameservice.

名稱服務中每一個NameNode的惟一標識符

配置一個由逗號分隔的NameNode ID列表。這將由DataNode用於肯定羣集中的全部NameNode。例如，若是您之前使用「mycluster」做爲名稱服務標識，而且您但願使用「nn1」和「nn2」做爲NameNodes的單個標識，則這樣配置它。

注意：目前，每一個名稱服務最多隻能配置兩個NameNode。

<name>dfs.ha.namenodes.mycluster</name>

</property>

dfs.namenode.rpc-address.[nameservice ID].[name node ID]

dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on

For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode processs. Note that this results in two separate configuration options.

Note: You may similarly configure the "servicerpc-address" setting if you so desire.

每一個NameNode監聽的徹底限定的RPC地址

對於以前配置的NameNode ID，請設置NameNode進程的完整地址和IPC端口。請注意，這會致使兩個單獨的配置選項。

注意：若是您願意，您能夠相似地配置「 servicerpc-address 」設置。

<name>dfs.namenode.rpc-address.mycluster.nn1</name>

<value>machine1.example.com:8020</value>

</property>

<name>dfs.namenode.rpc-address.mycluster.nn2</name>

<value>machine2.example.com:8020</value>

</property>

dfs.namenode.http-address.[nameservice ID].[name node ID]

dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on

Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on.

Note: If you have Hadoop's security features enabled, you should also set the https-address similarly for each NameNode.

每一個NameNode監聽的徹底限定的HTTP地址

與上面的rpc-address相似，爲兩個NameNode的HTTP服務器設置偵聽地址。

注意：若是啓用了Hadoop的安全功能，則還應該爲每一個NameNode 設置相似的https地址。

<name>dfs.namenode.http-address.mycluster.nn1</name>

<value>machine1.example.com:50070</value>

</property>

<name>dfs.namenode.http-address.mycluster.nn2</name>

<value>machine2.example.com:50070</value>

</property>

dfs.namenode.shared.edits.dir

dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits

This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active nameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form: "qjournal://host1:port1;host2:port2;host3:port3/journalId". The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it's a good idea to reuse the nameservice ID for the journal identifier.

For example, if the JournalNodes for this cluster were running on the machines "node1.example.com", "node2.example.com", and "node3.example.com" and the nameservice ID were "mycluster", you would use the following as the value for this setting (the default port for the JournalNode is 8485):

標識NameNode將寫入/讀取編輯的JN組的URI

這是配置提供共享編輯存儲的JournalNode的地址，由Active NameNode編寫並由Standby NameNode讀取，以保持Active NameNode所作的全部文件系統更改的最新狀態。儘管必須指定多個JournalNode地址，但應該只配置其中一個URI。URI的格式應爲：「qjournal://host1:port1;host2:port2;host3:port3/journalId 」。日誌ID是此名稱服務的惟一標識符，它容許一組JournalNodes爲多個聯邦名稱系統提供存儲。雖然不是必需的，但重用日誌標識符的名稱服務ID是個好主意。

例如，若是此集羣的JournalNodes在計算機「node1.example.com」，「node2.example.com」和「node3.example.com」上運行而且名稱服務ID是「mycluster」，則可使用如下做爲此設置的值（JournalNode的默認端口爲8485）：

<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>

</property>

dfs.client.failover.proxy.provider.[nameservice ID]

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode

Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one. For example:

HDFS客戶端用於聯繫Active NameNode的Java類

配置將由DFS客戶端使用的Java類的名稱，以肯定哪一個NameNode是當前的Active，以及哪一個NameNode當前正在爲客戶端請求提供服務。目前與Hadoop一塊兒提供的惟一實現是ConfiguredFailoverProxyProvider，所以除非您使用自定義配置，不然請使用它。

<name>dfs.client.failover.proxy.provider.mycluster</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

</property>

dfs.ha.fencing.methods

dfs.ha.fencing.methods - It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example "shell(/bin/true)".

The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

在故障轉移期間將用於遏制活動NameNode的腳本或Java類的列表

系統的正確性是可取的，即在任何給定時間只有一個NameNode處於活動狀態。重要的是，在使用仲裁日誌管理器時，只有一個NameNode將被容許寫入JournalNodes，所以不會破壞裂腦場景中的文件系統元數據。可是，當發生故障轉移時，前面的Active NameNode仍可能向客戶端提供讀取請求，這可能會過期，直到NameNode在嘗試寫入JournalNodes時關閉。因爲這個緣由，即便在使用仲裁日誌管理器時，仍然須要配置一些屏蔽方法。可是，爲了在防禦機制發生故障的狀況下提升系統的可用性，建議配置一種防禦方法，該防禦方法可保證做爲列表中的最後一個防禦方法返回成功。請注意，若是您選擇不使用實際的防禦方法，則仍必須爲此設置配置一些內容，例如「 shell（/ bin / true）」。

故障切換期間使用的防禦方法將配置爲一個以回車分隔的列表，該列表將按順序嘗試，直到指示防禦成功爲止。Hadoop提供了兩種方法：shell和sshfence。有關實現本身的自定義fencing方法的信息，請參閱org.apache.hadoop.ha.NodeFencer類。

sshfence - SSH to the Active NameNode and kill the process

The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service's TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:

<name>dfs.ha.fencing.methods</name>

<value>sshfence</value>

</property>

<name>dfs.ha.fencing.ssh.private-key-files</name>

<value>/home/exampleuser/.ssh/id_rsa</value>

</property>

SSH到活動NameNode並殺死進程

該sshfence選項SSHes到目標節點，並殺死進程。爲了使此隔離選項正常工做，它必須可以在不提供密碼的狀況下經過SSH鏈接到目標節點。所以，還必須配置dfs.ha.fencing.ssh.private-key-files選項

Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:

或者，能夠配置非標準用戶名或端口來執行SSH。也能夠爲SSH配置超時（以毫秒爲單位），以後將認爲此防禦方法失敗。它能夠像這樣配置：

<name>dfs.ha.fencing.methods</name>

<value>sshfence([[username][:port]])</value>

</property>

<name>dfs.ha.fencing.ssh.connect-timeout</name>

</property>

shell - run an arbitrary shell command to fence the Active NameNode

運行一個任意的shell命令來隔離活動NameNode

<name>dfs.ha.fencing.methods</name>

<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>

</property>

或者

<name>dfs.ha.fencing.methods</name>

<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>

</property>

If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.

若是shell命令返回0的退出碼，則肯定防禦成功。若是它返回任何其餘退出代碼，則防禦未成功，並嘗試列表中的下一個防禦方法。

fs.defaultFS

fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given

Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

或者，您如今能夠將Hadoop客戶端的默認路徑配置爲使用新的啓用HA的邏輯URI。若是您以前使用「mycluster」做爲名稱服務標識，則這將是全部HDFS路徑的權限部分的值。這多是這樣配置的，在你的core-site.xml文件中。

<name>fs.defaultFS</name>

<value>hdfs://mycluster</value>

</property>

dfs.journalnode.edits.dir

dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state

This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array.

這是JournalNode計算機上絕對路徑，JN將使用編輯和其餘本地狀態進行存儲。您只能爲此配置使用單個路徑。經過運行多個單獨的JournalNode或經過在本地鏈接的RAID陣列上配置此目錄來提供此數據的冗餘。：

<name>dfs.journalnode.edits.dir</name>

<value>/path/to/journal/node/local/data</value>

</property>

NFS實現HA配置文件的區別：

註釋hdfs-site.xml這兩配置

<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://node2:8485;node3:8485;node4:8485/jiaozi</value>

</property>

<name>dfs.journalnode.edits.dir</name>

<value>/journaldata</value>

</property>

添加nfs讀取共享目錄

<name>dfs.namenode.shared.edits.dir</name>

</property>

其餘配置同樣

部署詳情（Deployment details）

在設置了全部必要的配置選項以後，您必須在它們將運行的機器集上啓動JournalNode守護進程。這能夠經過運行命令「 hadoop-daemon.sh start journalnode 」並等待守護進程在每臺相關機器上啓動來完成。

JournalNodes啓動後，必須首先同步兩個HA NameNodes的磁盤元數據。

1）若是您正在設置新的HDFS集羣，則應首先在NameNode之一上運行format命令（hdfs namenode -format）。

2）If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.

若是你已經格式化NameNode，或者將非HA啓用的集羣轉換爲啓用HA，應該如今把NameNode的元數據目錄中的內容複製到其它，未格式化的NameNode運行「hdfs namenode -bootstrapStandby」命令。運行這個命令還將確保JournalNodes（由dfs.namenode.shared.edits.dir配置）包含足夠的編輯事務，以便可以啓動兩個NameNode。

3）若是要將非HA NameNode轉換爲HA，則應運行命令「 hdfs -initializeSharedEdits 」，該命令將使用來自本地NameNode編輯目錄的編輯數據初始化JournalNodes。

此時，您能夠像啓動NameNode同樣啓動兩個HA NameNode。

您能夠經過瀏覽到其配置的HTTP地址來分別訪問每一個NameNode的網頁。您應該注意到，配置的地址旁邊將是NameNode的HA狀態（「待機」或「活動」）。每當HA NameNode啓動時，它最初都處於Standby狀態。

管理命令（Administrative commands）

如今您的HA NameNode已配置並啓動，您將能夠訪問一些其餘命令來管理HA HDFS集羣。具體來講，您應該熟悉「 hdfs haadmin 」命令的全部子命令。不帶任何附加參數運行此命令將顯示如下使用信息：

Usage: DFSHAAdmin [-ns <nameserviceId>]

[-transitionToActive <serviceId>]

[-transitionToStandby <serviceId>]

[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]

[-getServiceState <serviceId>]

[-checkHealth <serviceId>]

[-help <command>]

下面介紹了每一個子命令的高級用法。有關每一個子命令的具體使用信息，應運行「 hdfs haadmin -help <command>」。

transitionToActive and transitionToStandby - 將給定NameNode的狀態轉換爲Active或Standby

這些子命令會使給定的NameNode分別轉換到活動或待機狀態。這些命令不會嘗試執行任何防禦，所以應該不多使用。相反，人們應該老是喜歡使用「 hdfs haadmin -failover 」子命令。

failover - 在兩個NameNode之間啓動故障轉移

此子命令致使從第一個提供的NameNode到第二個的故障轉移。若是第一個NameNode處於Standby狀態，則此命令只是將第二個NameNode轉換爲Active狀態而不會出錯。若是第一個NameNode處於Active狀態，則會嘗試將其正常轉換到Standby狀態。若是失敗，則會嘗試按照順序嘗試防禦方法（由dfs.ha.fencing.methods配置），直到成功爲止。只有在這個過程以後，第二個NameNode纔會轉換到活動狀態。若是沒有防禦方法成功，則第二個NameNode不會轉換爲活動狀態，而且會返回錯誤。

getServiceState - 肯定給定的NameNode是Active仍是Standby

鏈接到提供的NameNode以肯定其當前狀態，適當地打印「standby」或「active」到STDOUT。此子命令可能由cron做業或監視腳本使用，這些腳本須要根據NameNode當前處於活動狀態仍是待機狀態而具備不一樣的行爲。

checkHealth - 檢查給定NameNode的健康情況

鏈接到提供的NameNode以檢查其健康情況。NameNode可以對自身執行一些診斷，包括檢查內部服務是否按預期運行。若是NameNode健康，該命令將返回0，不然返回非零值。有人可能會將此命令用於監視目的。注意：這尚未實現，而且目前將始終返回成功，除非給定的NameNode徹底關閉。

自動故障轉移（Automatic Failover）

介紹（Introduction）

以上各節介紹如何配置手動故障轉移。在該模式下，即便主動節點發生故障，系統也不會自動觸發從活動節點到備用節點的故障轉移。本節介紹如何配置和部署自動故障轉移。

組件（Components）

自動故障轉移爲HDFS部署添加了兩個新組件：一個ZooKeeper仲裁和ZKFailoverController進程（縮寫爲ZKFC）。

Apache ZooKeeper是一種高度可用的服務，用於維護少許的協調數據，通知客戶端數據發生變化，並監視客戶端的故障。自動HDFS故障轉移的實現依賴ZooKeeper進行如下操做：

1）失敗檢測 - 集羣中的每一個NameNode機器都在ZooKeeper中維護一個持久會話。若是機器崩潰，ZooKeeper會話將過時，並通知其餘NameNode應該觸發故障轉移。

2）活動NameNode選舉 - ZooKeeper提供了一種簡單的機制來獨佔選擇節點爲活動狀態。若是當前活動的NameNode崩潰，另外一個節點可能會在ZooKeeper中使用一個特殊的獨佔鎖，代表它應該成爲下一個活動。

ZKFailoverController（ZKFC）是一個新的組件，它是一個ZooKeeper客戶端，它也監視和管理NameNode的狀態。每一個運行NameNode的機器也運行一個ZKFC，ZKFC負責：

1）健康監控 - ZKFC按期使用健康檢查命令對其本地NameNode執行ping操做。只要NameNode及時響應並具備健康狀態，ZKFC就認爲節點健康。若是節點崩潰，凍結或以其餘方式進入不健康狀態，則健康監視器會將其標記爲不健康。

2）ZooKeeper會話管理 - 當本地NameNode健康時，ZKFC在ZooKeeper中保持會話打開狀態。若是本地NameNode處於活動狀態，則它還包含一個特殊的「鎖定」znode。該鎖使用ZooKeeper對「短暫」節點的支持; 若是會話過時，鎖定節點將被自動刪除。

3）基於ZooKeeper的選舉 - 若是本地NameNode健康，而且ZKFC發現當前沒有其餘節點持有鎖定znode，它將本身嘗試獲取鎖定。若是成功，則它「贏得選舉」，並負責運行故障轉移以使其本地NameNode處於活動狀態。故障切換過程與上述手動故障切換相似：首先，若是必要，先前的活動將被隔離，而後本地NameNode轉換爲活動狀態。

部署ZooKeeper（Deploying ZooKeeper）

在典型的部署中，ZooKeeper守護進程被配置爲在三個或五個節點上運行。因爲ZooKeeper自己具備輕量資源需求，所以能夠在與HDFS NameNode和Standby節點相同的硬件上配置ZooKeeper節點。許多運營商選擇在與YARN ResourceManager相同的節點上部署第三個ZooKeeper進程。建議將ZooKeeper節點配置爲將其數據存儲在HDFS元數據的單獨磁盤驅動器上，以得到最佳性能和隔離。

ZooKeeper的設置超出了本文檔的範圍。咱們假設您已經創建了一個運行在三個或更多節點上的ZooKeeper集羣，並經過使用ZK CLI進行鏈接來驗證其正確的操做。

在你開始以前（Before you begin）

在開始配置自動故障轉移以前，您應該關閉集羣。當集羣正在運行時，目前沒法從手動故障轉移設置轉換爲自動故障轉移設置。

配置自動故障轉移（Configuring automatic failover）

自動故障轉移的配置須要在配置中添加兩個新參數。在您的hdfs-site.xml文件中，添加：

<name>dfs.ha.automatic-failover.enabled</name>

</property>

這指定應將羣集設置爲自動故障轉移。在你的core-site.xml文件中，添加：

<name>ha.zookeeper.quorum</name>

<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>

</property>

這列出了運行ZooKeeper服務的主機端口。

與文檔中前面介紹的參數同樣，能夠經過在名稱服務的基礎上配置名稱服務ID後綴來配置這些設置。例如，在啓用了聯合的集羣中，您能夠經過設置dfs.ha.automatic-failover.enabled.my-nameservice-id爲其中一個名稱服務顯式啓用自動故障轉移。

用start-dfs.sh啓動集羣（Starting the cluster with start-dfs.sh）

因爲配置中啓用了自動故障轉移功能，所以start-dfs.sh腳本將自動在任何運行NameNode的計算機上啓動ZKFC守護程序。當ZKFC啓動時，他們將自動選擇一個NameNode變爲活動狀態。

手動啓動集羣（Starting the cluster manually）

若是您手動管理羣集上的服務，則須要在運行NameNode的每臺機器上手動啓動zkfc守護進程。您能夠運行如下命令來啓動守護進程：

$ hadoop-daemon.sh start zkfc

確保訪問ZooKeeper（Securing access to ZooKeeper）

若是您正在運行安全集羣，您可能須要確保存儲在ZooKeeper中的信息也是安全的。這能夠防止惡意客戶修改ZooKeeper中的元數據或者可能觸發錯誤的故障轉移。

爲了保護ZooKeeper中的信息，首先將如下內容添加到core-site.xml文件中：

<name>ha.zookeeper.auth</name>

</property>

<name>ha.zookeeper.acl</name>

</property>

請注意這些值中的'@'字符 - 它指定配置不是內聯的，而是指向磁盤上的文件。

第一個配置的文件指定了ZK CLI使用的相同格式的ZooKeeper認證列表。例如，您能夠指定以下所示的內容：

digest:hdfs-zkfcs:mypassword

...其中hdfs-zkfcs是ZooKeeper的惟一用戶名，mypassword是用做密碼的一些惟一字符串。

接下來，使用相似下面的命令生成與此驗證對應的ZooKeeper ACL：

$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword

output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=

將' - >'字符串後面的輸出部分複製並粘貼到文件zk-acls.txt中，並以字符串「 digest：」做爲前綴。例如：

digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda

爲了使這些ACL生效，您應該從新運行zkfc -formatZK命令，如上所述。

這樣作後，您能夠按以下方式驗證ZK CLI的ACL：

[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha

'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=

: cdrwa

驗證自動故障轉移（Verifying automatic failover）

一旦設置了自動故障轉移，您應該測試其操做。爲此，首先找到活動的NameNode。您能夠經過訪問NameNode Web界面來肯定哪一個節點處於活動狀態 - 每一個節點都會在頁面頂部報告其HA狀態。

找到活動的NameNode後，可能會在該節點上致使失敗。例如，可使用kill -9 <NN >的pid來模擬JVM崩潰。或者，您能夠從新啓動機器或拔下其網絡接口以模擬其餘類型的中斷。觸發您但願測試的中斷後，另外一個NameNode應在幾秒鐘內自動激活。檢測故障並觸發故障切換所需的時間取決於ha.zookeeper.session-timeout.ms的配置，但默認爲5秒。

若是測試不成功，則多是配置錯誤。檢查zkfc守護進程以及NameNode守護進程的日誌，以便進一步診斷問題。