YARN restart issues: RM Restart / RM HA / Timeline Server / NM Restart

Date: 2019-11-07


ResourceManager Restart

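The ResourceManager (RM) is responsible for resource management and application scheduling. It is the core component of YARN, which also makes it a potential single point of failure. ResourceManager Restart is the feature that lets a YARN cluster keep working across an RM restart and keeps RM downtime invisible to end users.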

The ResourceManager Restart feature is divided into two phases:

ResourceManager Restart Phase 1 (Non-work-preserving RM restart, since Hadoop 2.4.0): Enhance the RM to persist application/attempt state and other credentials information in a pluggable state-store. The RM reloads this information from the state-store upon restart and re-kicks the previously running applications. Users are not required to re-submit the applications.
ResourceManager Restart Phase 2 (Work-preserving RM restart, since Hadoop 2.6.0): Focus on re-constructing the running state of the ResourceManager by combining the container statuses from NodeManagers and the container requests from ApplicationMasters upon restart. The key difference from Phase 1 is that previously running applications are not killed after the RM restarts, so applications do not lose their work because of an RM outage.
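
As an illustration of what enabling both phases might look like, the following Python sketch renders a yarn-site.xml fragment. The property names are the ones given on the Hadoop 2.7.2 ResourceManager Restart page linked at the end of this post; the choice of the ZooKeeper-based state-store and the ZooKeeper host names are assumptions made only for this example.

```python
from xml.sax.saxutils import escape

# Property names follow the Hadoop 2.7.2 ResourceManager Restart documentation.
# The ZooKeeper hosts are placeholders (assumption), not values from this post.
RM_RESTART_PROPS = {
    # Phase 1: persist application/attempt state and reload it on restart.
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.zk-address":
        "zk1.example.com:2181,zk2.example.com:2181",
    # Phase 2: keep running containers alive across the restart.
    "yarn.resourcemanager.work-preserving-recovery.enabled": "true",
}


def render_yarn_site(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


YARN_SITE_XML = render_yarn_site(RM_RESTART_PROPS)

if __name__ == "__main__":
    print(YARN_SITE_XML)
```

Leaving out the last property would correspond to Phase 1 only: state is recovered, but running applications are restarted rather than preserved.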

ResourceManager High Availability

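Before Hadoop 2.4.0, the ResourceManager was a single point of failure. YARN HA (high availability) uses an Active/Standby architecture: at any point in time exactly one RM is Active and one or more RMs are Standby. In other words, the ResourceManager is given redundant peers, so the cluster contains both an Active RM and Standby RMs.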

Manual transitions and failover

Use the yarn rmadmin command to check an RM's state and to switch an RM between Active and Standby by hand.
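
A manual failover can be sketched as a short sequence of yarn rmadmin invocations. The Python snippet below only builds and prints the command lines; it does not run them. The options -getServiceState, -transitionToStandby and -transitionToActive are taken from the ResourceManager HA documentation linked at the end of this post, while the RM IDs rm1 and rm2 are assumed names that would have to match the IDs configured in yarn-site.xml.

```python
def rmadmin(*args):
    """Build the argument vector for one `yarn rmadmin` invocation."""
    return ["yarn", "rmadmin", *args]


# rm1 and rm2 are assumed RM IDs; replace them with the configured ones.
COMMANDS = [
    rmadmin("-getServiceState", "rm1"),      # is rm1 active or standby?
    rmadmin("-transitionToStandby", "rm1"),  # demote the current Active RM
    rmadmin("-transitionToActive", "rm2"),   # promote the chosen Standby RM
    rmadmin("-getServiceState", "rm2"),      # confirm the new state
]

if __name__ == "__main__":
    for command in COMMANDS:
        print(" ".join(command))
```

According to the same documentation, the transition commands are refused while automatic failover is enabled unless they are explicitly forced, so this sequence is mainly relevant to clusters that are switched over manually.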

Automatic failover

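When the Active RM fails or stops responding, the ZooKeeper-based ActiveStandbyElector elects a new Active RM. The elector is embedded in the RM, so there is no need to run a separate ZKFC daemon.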
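
A minimal sketch of an HA configuration with the embedded elector, written as a Python dict together with a small consistency check. The property names come from the ResourceManager HA documentation linked at the end of this post; the cluster ID, host names and ZooKeeper quorum are placeholders. The check is a simplification of this example: it assumes every RM is identified by a hostname property, although the documentation also allows setting each service address individually.

```python
# Property names follow the Hadoop 2.7.2 ResourceManager HA documentation.
# All values (cluster ID, hosts, ZooKeeper quorum) are placeholders.
HA_PROPS = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.cluster-id": "cluster1",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "master1.example.com",
    "yarn.resourcemanager.hostname.rm2": "master2.example.com",
    "yarn.resourcemanager.zk-address":
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
    # Use the elector embedded in the RM instead of a separate daemon.
    "yarn.resourcemanager.ha.automatic-failover.enabled": "true",
    "yarn.resourcemanager.ha.automatic-failover.embedded": "true",
}


def rm_ids_without_hostname(props):
    """Return the RM IDs listed in ha.rm-ids that have no hostname property."""
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [rm_id.strip() for rm_id in raw_ids.split(",") if rm_id.strip()]
    return [
        rm_id for rm_id in rm_ids
        if f"yarn.resourcemanager.hostname.{rm_id}" not in props
    ]


if __name__ == "__main__":
    for name, value in HA_PROPS.items():
        print(f"{name}={value}")
    print("RM IDs missing a hostname:", rm_ids_without_hostname(HA_PROPS))
```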

Client, ApplicationMaster and NodeManager on RM failover

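If there are multiple RMs, the yarn-site.xml file on every node must list all of them. Clients, AMs and NMs try to connect to the RMs in round-robin order until they reach the Active RM. If the Active RM fails, they resume round-robin polling until they find the new Active RM.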
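
The round-robin lookup can be illustrated with a toy model. This is not YARN's client code: the class name, the is_active callback and the in-memory cluster state are all hypothetical, and a real client would probe each RM over RPC and retry with some delay instead of giving up after one full pass.

```python
class RoundRobinRMFinder:
    """Toy model of a client cycling through the configured RMs."""

    def __init__(self, rm_ids):
        self.rm_ids = list(rm_ids)
        self.index = 0  # position of the RM the client currently talks to

    def find_active(self, is_active):
        """Probe RMs in round-robin order, starting from the current one."""
        for _ in range(len(self.rm_ids)):
            candidate = self.rm_ids[self.index]
            if is_active(candidate):
                return candidate
            # Not active (standby or down): move on to the next configured RM.
            self.index = (self.index + 1) % len(self.rm_ids)
        raise RuntimeError("no Active RM found among " + ", ".join(self.rm_ids))


# Hypothetical cluster state, standing in for real RPC probes.
cluster_state = {"rm1": "active", "rm2": "standby", "rm3": "standby"}


def is_active(rm_id):
    return cluster_state[rm_id] == "active"


finder = RoundRobinRMFinder(["rm1", "rm2", "rm3"])
before_failover = finder.find_active(is_active)

# rm1 fails and rm3 is elected; the client resumes polling where it left off.
cluster_state["rm1"] = "down"
cluster_state["rm3"] = "active"
after_failover = finder.find_active(is_active)

if __name__ == "__main__":
    print(before_failover, after_failover)
```

Because the finder remembers its position, a client that has already found the Active RM keeps using it without re-probing the others until that RM stops being active.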

The YARN Timeline Server

YARN uses the Timeline Server to store and retrieve both current and historical information about applications. The Timeline Server has two responsibilities:

Persisting Application Specific Information

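This information is collected and retrieved in a way that is specific to an application or framework. For example, the MapReduce framework can publish information such as the number of map tasks, the number of reduce tasks, counters, and so on. Application developers can publish application-specific information through a TimelineClient, either from within the Application Master or from the application's containers.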
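
To make the idea of application-specific information more concrete, the sketch below builds the kind of JSON document that the Timeline Server's REST interface accepts, as described in the Timeline Server documentation linked at the end of this post. The field names (entity, entitytype, starttime, events, primaryfilters, otherinfo) follow that documentation; the entity type, event type, user name and counter values are invented for illustration, and the snippet only constructs the payload without sending it anywhere.

```python
import json


def build_timeline_entity(entity_id, entity_type, start_ms, user, info):
    """Build one timeline entity in the JSON shape used by the REST API."""
    return {
        "entity": entity_id,
        "entitytype": entity_type,
        "starttime": start_ms,
        "events": [
            # Hypothetical event type chosen for this example.
            {"timestamp": start_ms, "eventtype": "JOB_SUBMITTED", "eventinfo": {}},
        ],
        # Primary filters are indexed for lookups; each value is a list.
        "primaryfilters": {"user": [user]},
        # Free-form, framework-specific data such as task counts.
        "otherinfo": info,
    }


# All values below are made up for illustration.
entity = build_timeline_entity(
    entity_id="job_0001",
    entity_type="EXAMPLE_MAPREDUCE_JOB",
    start_ms=1573084800000,
    user="alice",
    info={"numMaps": 12, "numReduces": 3},
)
PAYLOAD = json.dumps({"entities": [entity]}, indent=2)

if __name__ == "__main__":
    print(PAYLOAD)
```

In a Java application the same data would normally be published through TimelineClient rather than by hand-built JSON; the sketch only shows what kind of information ends up in the store.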

Persisting Generic Information about Completed Applications

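Generic information is application-level data such as the queue name and user information. The RM publishes this generic data to the timeline store, and the web UI uses it to display information about completed applications.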

NodeManager Restart

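The NodeManager Restart mechanism allows a NodeManager to be restarted without losing the active containers running on its node. While handling container-management requests, the NM stores the necessary state in a local state-store. When the NM restarts, it first loads the state for its various subsystems and then lets each subsystem recover from the loaded state.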

Enabling NM Restart:

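(1) Set yarn.nodemanager.recovery.enabled to true in /conf/yarn-site.xml. The default is false.

(2) Configure a path to the local file-system directory where the NodeManager can save its run state (yarn.nodemanager.recovery.dir).

(3) Configure a valid RPC address for the NodeManager (yarn.nodemanager.address). The address needs a fixed port rather than an ephemeral one, because an ephemeral port can change across a restart and clients would then be unable to reconnect.

(4) Auxiliary services: for NM restart to be fully functional, any auxiliary service configured on the NodeManager must also support recovery.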
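
The first three steps can be sketched as a small configuration check in Python. The property names are those used in the NodeManager Restart documentation linked at the end of this post; the recovery directory and the port are example values, and the checker itself is a hypothetical helper written for this post, not part of Hadoop. Insisting on an explicitly configured directory is a choice of this example.

```python
# Example values only; the directory and the port are assumptions.
NM_RESTART_PROPS = {
    "yarn.nodemanager.recovery.enabled": "true",
    "yarn.nodemanager.recovery.dir": "/var/lib/hadoop-yarn/nm-recovery",
    "yarn.nodemanager.address": "0.0.0.0:45454",
}


def nm_restart_problems(props):
    """Return a list of reasons why NM restart would not work as configured."""
    problems = []
    enabled = props.get("yarn.nodemanager.recovery.enabled", "false")
    if enabled.lower() != "true":
        problems.append("recovery is not enabled")
    if not props.get("yarn.nodemanager.recovery.dir"):
        problems.append("no recovery directory configured")
    # The text after the last ':' is the port; 0 means an ephemeral port.
    port = props.get("yarn.nodemanager.address", "").rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("NodeManager address needs a fixed, non-zero port")
    return problems


if __name__ == "__main__":
    print(nm_restart_problems(NM_RESTART_PROPS))
    print(nm_restart_problems({"yarn.nodemanager.address": "0.0.0.0:0"}))
```

Step (4) is not covered by the check, since whether an auxiliary service supports recovery depends on that service's own implementation.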

Links:

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/TimelineServer.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html
