A case of YARN ResourceManager (RM) crash

Today we received an alert from the production ResourceManager:

[Image: wKiom1O8B9XDyZhiAAJfMJa1DWk421.jpg]

The error log is as follows:

2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:xxxx:53356 Timed out after 600 secs
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node xxxx:53356 as it is now LOST
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: xxxx:53356 Node Transitioned from UNHEALTHY to LOST
2014-07-08 13:22:54,118 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:715)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:974)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:108)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:378)
        at java.lang.Thread.run(Thread.java:662)
2014-07-08 13:22:54,118 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
2014-07-08 13:22:54,119 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000

This is a known bug, tracked as https://issues.apache.org/jira/browse/YARN-502

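According to the bug description, the crash can be triggered when the RM removes a NodeManager (NM) that is marked UNHEALTHY: the node has already been removed from the scheduler once, so a later removal attempt for the same node raises an error.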
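The failure mode can be illustrated with a minimal sketch. This is not the FairScheduler code: the class DoubleRemoveSketch, the SchedulerNode type and the runningContainers field are hypothetical names chosen for illustration. It only assumes that an unguarded removal dereferences a node entry that an earlier removal has already deleted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real FairScheduler: shows why a second
// removal of the same node can end in a NullPointerException.
public class DoubleRemoveSketch {

    // Stand-in for the scheduler's per-node bookkeeping (hypothetical type).
    static final class SchedulerNode {
        final int runningContainers;

        SchedulerNode(int runningContainers) {
            this.runningContainers = runningContainers;
        }
    }

    private final Map<String, SchedulerNode> nodes = new HashMap<>();

    void addNode(String nodeId, int runningContainers) {
        nodes.put(nodeId, new SchedulerNode(runningContainers));
    }

    // Unguarded removal: assumes the node is still registered.
    int removeNode(String nodeId) {
        SchedulerNode node = nodes.remove(nodeId); // null on the second call
        return node.runningContainers;             // throws NPE when node is null
    }

    // Returns true if removing the same node twice raised an NPE.
    static boolean secondRemovalThrows() {
        DoubleRemoveSketch scheduler = new DoubleRemoveSketch();
        scheduler.addNode("xxxx:53356", 3);
        scheduler.removeNode("xxxx:53356");     // first removal (node became UNHEALTHY)
        try {
            scheduler.removeNode("xxxx:53356"); // second removal (node became LOST)
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second removal throws NPE: " + secondRemovalThrows());
    }
}
```

In the log above the same pattern appears: the node goes from UNHEALTHY to LOST, a NODE_REMOVED event reaches the scheduler again, and removeNode fails.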

Following the stack trace, let's look at the code:

org.apache.hadoop.yarn.server.resourcemanager.ResourceManager (nested SchedulerEventDispatcher.EventProcessor, as shown in the stack trace):
  protected ResourceScheduler scheduler;
    private final class EventProcessor implements Runnable { // a dedicated EventProcessor thread handles the events
      @Override
      public void run() {
        SchedulerEvent event;
        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take();  // take the next event off the event queue
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
          try {
            scheduler.handle(event); // handle the event
          } catch (Throwable t) { // catch anything thrown while handling the event
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type " + event.getType() // per the log, event.getType() is NODE_REMOVED here
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }

Here we can see that shouldExitOnError controls whether the RM exits when handling an event fails.
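The effect of the flag can be shown with a small, dependency-free sketch. It is not the ResourceManager code: EventLoopSketch, handle and run are hypothetical names, events are plain strings, and the loop simply stops at the point where the real EventProcessor would call System.exit(-1).

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of an event loop guarded by an exit-on-error flag.
// The loop stops where the real EventProcessor would exit the JVM.
public class EventLoopSketch {

    // Stand-in scheduler: fails on NODE_REMOVED, like the NPE in the log.
    static void handle(String eventType) {
        if ("NODE_REMOVED".equals(eventType)) {
            throw new NullPointerException("node already removed");
        }
    }

    // Returns how many events were handled successfully before the loop ended.
    static int run(boolean shouldExitOnError) {
        BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>(
                Arrays.asList("NODE_ADDED", "NODE_REMOVED", "NODE_UPDATE"));
        int handled = 0;
        String event;
        while ((event = eventQueue.poll()) != null) {
            try {
                handle(event);
                handled++;
            } catch (Throwable t) {
                System.err.println("Error in handling event type " + event + ": " + t);
                if (shouldExitOnError) {
                    break; // the real code calls System.exit(-1) at this point
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("exit-on-error=true  handled " + run(true));
        System.out.println("exit-on-error=false handled " + run(false));
    }
}
```

With the flag set, the failing event ends the loop and the remaining events are never processed; with it cleared, the error is logged and the loop moves on to the next event.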

private boolean shouldExitOnError = false; // initially false
    @Override
    public synchronized void init(Configuration conf) {  // during initialization the value is read from the configuration
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
            Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR); // both constants are defined in Dispatcher
      super.init(conf);
    }
The org.apache.hadoop.yarn.event.Dispatcher interface:
public interface Dispatcher {   
  // Configuration to make sure dispatcher crashes but doesn't do system-exit in
  // case of errors. By default, it should be false, so that tests are not
  // affected. For all daemons it should be explicitly set to true so that
  // daemons can crash instead of hanging around.
  public static final String DISPATCHER_EXIT_ON_ERROR_KEY =
      "yarn.dispatcher.exit-on-error"; // the controlling parameter
  public static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false; // defaults to false
  EventHandler getEventHandler();
  void register(Class<? extends Enum> eventType, EventHandler handler);
}

In the init method of the ResourceManager class:

  @Override
  public synchronized void init(Configuration conf) {
    this.conf = conf;
    this.conf.setBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY, true);  // the value is set to true here, overriding the DEFAULT in Dispatcher

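In other words, by default the RM exits when it hits a dispatcher error.
Whether to exit on error is governed by the configuration parameter yarn.dispatcher.exit-on-error. Changing it has a fairly broad impact, though, so it is best left alone; applying the patch is the better fix.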
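The precedence described above (a false default in Dispatcher, set to true by the daemon's init) can be sketched as follows. This is an illustration only: java.util.Properties stands in for Hadoop's Configuration to keep the example dependency-free, and ExitOnErrorConfigSketch, readFlag and daemonInit are hypothetical names. Only the key string and the default value are taken from the code above.

```java
import java.util.Properties;

// Hypothetical sketch: java.util.Properties stands in for Hadoop's
// Configuration; only the key and the default come from the source.
public class ExitOnErrorConfigSketch {
    static final String DISPATCHER_EXIT_ON_ERROR_KEY = "yarn.dispatcher.exit-on-error";
    static final boolean DEFAULT_DISPATCHER_EXIT_ON_ERROR = false;

    // Mirrors conf.getBoolean(key, default): fall back to the default when unset.
    static boolean readFlag(Properties conf) {
        String value = conf.getProperty(DISPATCHER_EXIT_ON_ERROR_KEY);
        return value == null ? DEFAULT_DISPATCHER_EXIT_ON_ERROR : Boolean.parseBoolean(value);
    }

    // Mirrors the daemon's init, which sets the key to true before use.
    static void daemonInit(Properties conf) {
        conf.setProperty(DISPATCHER_EXIT_ON_ERROR_KEY, "true");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("before daemon init: " + readFlag(conf));
        daemonInit(conf);
        System.out.println("after daemon init:  " + readFlag(conf));
    }
}
```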

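The official patch is fairly simple: it adds a check when the RM removes an NM, so that the node is not removed from the scheduler a second time: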

--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
@@ -501,8 +501,13 @@ public DeactivateNodeTransition(NodeState finalState) {
     public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
       // Inform the scheduler
       rmNode.nodeUpdateQueue.clear();
-      rmNode.context.getDispatcher().getEventHandler().handle(
-          new NodeRemovedSchedulerEvent(rmNode));
+      // If the current state is NodeState.UNHEALTHY
+      // Then node is already been removed from the
+      // Scheduler
+      if (!rmNode.getState().equals(NodeState.UNHEALTHY)) {
+        rmNode.context.getDispatcher().getEventHandler()
+          .handle( new NodeRemovedSchedulerEvent(rmNode));
+      }
       rmNode.context.getDispatcher().getEventHandler().handle(
           new NodesListManagerEvent(
               NodesListManagerEventType.NODE_UNUSABLE, rmNode));
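
The guard introduced by the patch can be reduced to a small standalone sketch. It does not use the YARN classes: DeactivateGuardSketch and deactivate are hypothetical names, the enum lists only the states needed here, and events are represented as strings collected in a list instead of being sent to a dispatcher.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the patched transition: the scheduler is told to
// remove the node only if the node was not already UNHEALTHY.
public class DeactivateGuardSketch {

    // Reduced set of states, for illustration only.
    enum NodeState { RUNNING, UNHEALTHY, LOST }

    // Returns the events emitted when a node in the given state is deactivated.
    static List<String> deactivate(NodeState currentState) {
        List<String> events = new ArrayList<>();
        // An UNHEALTHY node has already been removed from the scheduler,
        // so the removal event is skipped for it.
        if (currentState != NodeState.UNHEALTHY) {
            events.add("NODE_REMOVED");
        }
        // The node list manager is notified in both cases.
        events.add("NODE_UNUSABLE");
        return events;
    }

    public static void main(String[] args) {
        System.out.println("RUNNING   -> " + deactivate(NodeState.RUNNING));
        System.out.println("UNHEALTHY -> " + deactivate(NodeState.UNHEALTHY));
    }
}
```

With this check in place, the UNHEALTHY to LOST transition seen in the log no longer sends a second removal event to the scheduler.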