YARN: /cluster/nodes reports localhost for every node

Background:

  1. IPv6 is disabled.

  2. /etc/hosts is configured correctly on every node, and jobs are submitted from the ResourceManager host.

  3. yarn-site.xml sets:

    yarn.resourcemanager.hostname=Master
    yarn.nodemanager.aux-services=mapreduce_shuffle
    and each NodeManager sets its own yarn.nodemanager.hostname

  4. mapred-site.xml sets mapreduce.framework.name=yarn

 

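Symptom:

  Submitting an MR job fails with a connection-refused stack trace. The container address being connected to is localhost, not the worker address that is actually needed.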

User: root
Name: Bigdata-Hadoop-1.0-SNAPSHOT.jar
Application Type: MAPREDUCE
Application Tags:  
YarnApplicationState: FAILED
Queue: default
FinalStatus Reported by AM: FAILED
Started: Thu Nov 22 21:59:31 +0800 2018
Elapsed: 6mins, 1sec
Tracking URL: History
Diagnostics:
Application application_1542889591013_0006 failed 2 times due to Error launching appattempt_1542889591013_0006_000002. Got exception: java.net.ConnectException: Call From localhost/127.0.0.1 to localhost:33070 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy83.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy84.startContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:250)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
... 15 more
. Failing the application.

 

The two attempts listed at the bottom of the page also show localhost as the driver address.

 

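Querying YARN shows that, in the cluster node information it returns, every NodeManager address is localhost.

All of the above confirms that the NodeManager addresses obtained from YARN are wrong: the ResourceManager cannot make the remote call to a NodeManager to start a container, which directly causes the MR job to fail.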
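The same check can be made outside the web UI. The ResourceManager exposes the node list over REST at /ws/v1/cluster/nodes, and each entry carries the nodeHostName field that NodeInfo fills in. The sketch below only covers the inspection step: it pulls nodeHostName values out of a response body with a regular expression and reports whether they are all localhost. The sample body is hypothetical and trimmed (port 33070 is taken from the stack trace above, 41235 is made up), and fetching the real response from the ResourceManager is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Pulls nodeHostName values out of a /ws/v1/cluster/nodes JSON body. */
class NodeHostNameCheck {
    // Matches "nodeHostName":"<value>". Enough for a quick check; not a JSON parser.
    private static final Pattern HOST =
        Pattern.compile("\"nodeHostName\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns every nodeHostName value in the order it appears. */
    static List<String> hostNames(String json) {
        List<String> names = new ArrayList<>();
        Matcher m = HOST.matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** True when at least one node is listed and every one is named localhost. */
    static boolean allLocalhost(List<String> names) {
        if (names.isEmpty()) {
            return false;
        }
        for (String name : names) {
            if (!"localhost".equals(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical response body, reduced to the fields of interest.
        String sample = "{\"nodes\":{\"node\":["
            + "{\"id\":\"localhost:33070\",\"nodeHostName\":\"localhost\"},"
            + "{\"id\":\"localhost:41235\",\"nodeHostName\":\"localhost\"}]}}";
        List<String> names = hostNames(sample);
        System.out.println(names + " allLocalhost=" + allLocalhost(names));
    }
}
```

On a healthy cluster the list would be expected to contain the worker hostnames from /etc/hosts, so allLocalhost returning true is the symptom described here.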

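Approach:

  1. Searched blogs all over the web. No luck.

  2. Asked around in various chat groups. No luck.

  3. Fell back on reading the source myself. To be continued ...

Source code:

  If you cannot find the entry point, do not bother reading.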

  org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java:252

  @GET
  @Path("/nodes")
  @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML })
  public NodesInfo getNodes(@QueryParam("states") String states) {
    init();
    ResourceScheduler sched = this.rm.getResourceScheduler();
    if (sched == null) {
      throw new NotFoundException("Null ResourceScheduler instance");
    }
    
    EnumSet<NodeState> acceptedStates;
    if (states == null) {
      acceptedStates = EnumSet.allOf(NodeState.class);
    } else {
      acceptedStates = EnumSet.noneOf(NodeState.class);
      for (String stateStr : states.split(",")) {
        acceptedStates.add(
            NodeState.valueOf(StringUtils.toUpperCase(stateStr)));
      }
    }
    
    Collection<RMNode> rmNodes = RMServerUtils.queryRMNodes(
        this.rm.getRMContext(), acceptedStates);
    NodesInfo nodesInfo = new NodesInfo();
    for (RMNode rmNode : rmNodes) {
      NodeInfo nodeInfo = new NodeInfo(rmNode, sched);
      if (EnumSet.of(NodeState.LOST, NodeState.DECOMMISSIONED, NodeState.REBOOTED)
          .contains(rmNode.getState())) {
        nodeInfo.setNodeHTTPAddress(EMPTY);
      }
      nodesInfo.add(nodeInfo);
    }
    
    return nodesInfo;
  }

 

  The node information is built here, in new NodeInfo(rmNode, sched):

  org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeInfo.java:57

  public NodeInfo(RMNode ni, ResourceScheduler sched) {
    NodeId id = ni.getNodeID();
    SchedulerNodeReport report = sched.getNodeReport(id);
    this.numContainers = 0;
    this.usedMemoryMB = 0;
    this.availMemoryMB = 0;
    if (report != null) {
      this.numContainers = report.getNumContainers();
      this.usedMemoryMB = report.getUsedResource().getMemory();
      this.availMemoryMB = report.getAvailableResource().getMemory();
      this.usedVirtualCores = report.getUsedResource().getVirtualCores();
      this.availableVirtualCores = report.getAvailableResource().getVirtualCores();
    }
    this.id = id.toString();
    this.rack = ni.getRackName();
    this.nodeHostName = ni.getHostName();
    this.state = ni.getState();
    this.nodeHTTPAddress = ni.getHttpAddress();
    this.lastHealthUpdate = ni.getLastHealthReportTime();
    this.healthReport = String.valueOf(ni.getHealthReport());

  All three key fields come from this odd ni object, so the next step is to see where it comes from.

  org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java:63

  public static List<RMNode> queryRMNodes(RMContext context,
      EnumSet<NodeState> acceptedStates) {
    // nodes contains nodes that are NEW, RUNNING OR UNHEALTHY
    ArrayList<RMNode> results = new ArrayList<RMNode>();
    if (acceptedStates.contains(NodeState.NEW) ||
        acceptedStates.contains(NodeState.RUNNING) ||
        acceptedStates.contains(NodeState.UNHEALTHY)) {
      for (RMNode rmNode : context.getRMNodes().values()) {
        if (acceptedStates.contains(rmNode.getState())) {
          results.add(rmNode);
        }
      }
    }

  So this context holds something interesting. How the context itself is initialized is a topic for next time; first look at what is done to its RMNodes.

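  The time after that was spent wrestling with YARN, but I could not find how this hostname ended up as localhost instead of the expected worker node hostname. The code base is large and tangled and needs more time to untangle, so the source reading continues next time. Still, once you know some of the principles, a pass through the source does help in understanding them.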
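A cheaper probe than tracing the ResourceManager code is to ask the JVM on a worker node what it considers its own name to be. This is only a sketch, and it rests on an assumption the source walk above did not confirm: that the name a NodeManager registers with comes, directly or indirectly, from the JVM's local host lookup. LocalHostProbe is a name made up for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Prints how the JVM on this machine resolves its own host name and address. */
class LocalHostProbe {
    static String describe() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            // getHostName() is the name the lookup produced for this machine;
            // getCanonicalHostName() asks the resolver for the fully qualified name of the address.
            return "hostName=" + local.getHostName()
                + " canonical=" + local.getCanonicalHostName()
                + " address=" + local.getHostAddress();
        } catch (UnknownHostException e) {
            // The machine's own name does not resolve at all.
            return "unresolvable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run on each worker, a result mentioning localhost or 127.0.0.1 would point at name resolution on that machine rather than at YARN itself.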

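  Reading the source did not give the answer, but it prompted a bold guess: resolving a hostname from an IP takes the first hostname that matches that IP in the hosts file (to be confirmed). So I moved the worker node's IP and hostname entry to the first line of the hosts file and restarted the YARN cluster, and the MR job immediately ran through.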
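The guess can be written down as a small model. The sketch below is not the system resolver and not Hadoop code; it only implements the first-match rule as stated: scan the hosts lines top to bottom and return the first name on the first line whose address equals the IP. The addresses, the name Slave1, and the file layouts are hypothetical, since the actual hosts file is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Toy model of "first matching line wins" for IP-to-name lookup in a hosts file. */
class HostsFirstMatch {
    /** Returns the first name on the first line whose address field equals ip. */
    static Optional<String> firstNameFor(String ip, List<String> hostsLines) {
        for (String raw : hostsLines) {
            // Drop trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length >= 2 && fields[0].equals(ip)) {
                return Optional.of(fields[1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical layouts: the same IP appears on two lines, in different order.
        List<String> before = Arrays.asList(
            "192.168.1.11 localhost",
            "192.168.1.11 Slave1");
        List<String> after = Arrays.asList(
            "192.168.1.11 Slave1",
            "192.168.1.11 localhost");
        System.out.println("before: " + firstNameFor("192.168.1.11", before).orElse("(none)"));
        System.out.println("after:  " + firstNameFor("192.168.1.11", after).orElse("(none)"));
    }
}
```

Under this model only the order of the lines changes the answer, which is consistent with the fix working after the worker's entry was moved to the top. It does not prove that this is the rule the resolver actually follows.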
