The SparkSubmit class under the deploy package

The scripts discussed earlier (spark-submit, spark-class) and the launcher project mainly prepare the runtime environment: dependency jars, JVM parameters, and so on. The actual submission is handled by the SparkSubmit class under the deploy package in the Spark code.

As mentioned before, the main entry point of the SparkSubmit class under the deploy package is the runMain method.
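For orientation, the call chain inside SparkSubmit is roughly the following (a sketch based on the Spark 3.x source; details vary between versions):

// Rough call chain in SparkSubmit (Spark 3.x):
//   SparkSubmit.main(args)
//     -> new SparkSubmit().doSubmit(args)   // parses the arguments
//       -> submit(appArgs, uninitLog)       // may proxy as another user
//         -> runMain(args, uninitLog)       // loads and starts the wrapper class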

Let's first look at the other methods.

1. prepareSubmitEnvironment

This method prepares the environment and parameters for the submission.

It first determines the cluster manager (yarn, mesos, k8s, or standalone) and the deploy mode (client or cluster).
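The dispatch at the top of prepareSubmitEnvironment looks roughly like this (a simplified sketch of the Spark 3.x source; YARN, STANDALONE, MESOS, KUBERNETES, LOCAL, CLIENT, and CLUSTER are internal Int constants in SparkSubmit, and the real code reports errors through its own error helper rather than throwing directly):

val clusterManager: Int = args.master match {
  case "yarn" => YARN
  case m if m.startsWith("spark") => STANDALONE
  case m if m.startsWith("mesos") => MESOS
  case m if m.startsWith("k8s") => KUBERNETES
  case m if m.startsWith("local") => LOCAL
  case _ => throw new IllegalArgumentException(
    "Master must either be yarn or start with spark, mesos, k8s, or local")
}

val deployMode: Int = args.deployMode match {
  case "client" | null => CLIENT
  case "cluster" => CLUSTER
  case _ => throw new IllegalArgumentException(
    "Deploy mode must be either client or cluster")
}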

Based on this information, it then selects the appropriate backend and wrapper classes.

This part of the submission logic is hard to walk through, because it covers many kinds of deployment environments, each with its own specifics, so it takes patience to read.

For cluster mode we will only look at two cases: yarn cluster and standalone cluster. Once you understand yarn and standalone, the others are easy to follow.

This method returns a 4-tuple:

@return a 4-tuple:
        (1) the arguments for the child process,
        (2) a list of classpath entries for the child,
        (3) a map of system properties, and
        (4) the main class for the child
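In the Spark 3.x source the third element is actually a SparkConf carrying the Spark properties, so the signature is roughly (simplified, omitting an optional Hadoop configuration parameter):

private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
    : (Seq[String], Seq[String], SparkConf, String)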

The core code:

if (deployMode == CLIENT) {
  childMainClass = args.mainClass
  if (localPrimaryResource != null && isUserJar(localPrimaryResource)) {
    childClasspath += localPrimaryResource
  }
  if (localJars != null) { childClasspath ++= localJars.split(",") }
}
// Add the main application jar and any added jars to classpath in case YARN client
// requires these jars.
// This assumes both primaryResource and user jars are local jars, or already downloaded
// to local by configuring "spark.yarn.dist.forceDownloadSchemes", otherwise it will not be
// added to the classpath of YARN client.
if (isYarnCluster) {
  if (isUserJar(args.primaryResource)) {
    childClasspath += args.primaryResource
  }
  if (args.jars != null) { childClasspath ++= args.jars.split(",") }
}

if (deployMode == CLIENT) {
  if (args.childArgs != null) { childArgs ++= args.childArgs }
}

if (isStandaloneCluster) {
  if (args.useRest) {
    childMainClass = REST_CLUSTER_SUBMIT_CLASS
    childArgs += (args.primaryResource, args.mainClass)
  } else {
    // In legacy standalone cluster mode, use Client as a wrapper around the user class
    childMainClass = STANDALONE_CLUSTER_SUBMIT_CLASS
    if (args.supervise) { childArgs += "--supervise" }
    Option(args.driverMemory).foreach { m => childArgs += ("--memory", m) }
    Option(args.driverCores).foreach { c => childArgs += ("--cores", c) }
    childArgs += "launch"
    childArgs += (args.master, args.primaryResource, args.mainClass)
  }
  if (args.childArgs != null) {
    childArgs ++= args.childArgs
  }
}
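To make the legacy (non-REST) branch concrete, here is a hypothetical submission and the arguments it would produce; all paths, class names, and values below are made up for illustration:

// Hypothetical: spark-submit --master spark://host:7077 --deploy-mode cluster \
//   --supervise --driver-memory 4g --driver-cores 2 \
//   --class com.example.Main /path/app.jar arg1
// The legacy branch then builds roughly:
//   childMainClass = "org.apache.spark.deploy.ClientApp"
//   childArgs = Seq("--supervise", "--memory", "4g", "--cores", "2",
//                   "launch", "spark://host:7077", "/path/app.jar",
//                   "com.example.Main", "arg1")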

// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
if (isYarnCluster) {
  childMainClass = YARN_CLUSTER_SUBMIT_CLASS
  if (args.isPython) {
    childArgs += ("--primary-py-file", args.primaryResource)
    childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
  } else if (args.isR) {
    val mainFile = new Path(args.primaryResource).getName
    childArgs += ("--primary-r-file", mainFile)
    childArgs += ("--class", "org.apache.spark.deploy.RRunner")
  } else {
    if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
      childArgs += ("--jar", args.primaryResource)
    }
    childArgs += ("--class", args.mainClass)
  }
  if (args.childArgs != null) {
    args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
  }
}
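Similarly, for a hypothetical JVM application submitted in yarn-cluster mode (again, values are illustrative):

// Hypothetical: spark-submit --master yarn --deploy-mode cluster \
//   --class com.example.Main /path/app.jar arg1
// yields roughly:
//   childMainClass = "org.apache.spark.deploy.yarn.YarnClusterApplication"
//   childArgs = Seq("--jar", "/path/app.jar", "--class", "com.example.Main",
//                   "--arg", "arg1")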

This code is the heart of the method. For each combination of cluster manager and deploy mode, it decides which class will wrap our Spark program, so that the submission flow fits that particular cluster environment.

It is worth spending some time analyzing it.

First, look at childMainClass:

Standalone cluster with REST: REST_CLUSTER_SUBMIT_CLASS = classOf[RestSubmissionClientApp].getName()

Yarn cluster: YARN_CLUSTER_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"

Standalone cluster in legacy (non-REST) mode: STANDALONE_CLUSTER_SUBMIT_CLASS = classOf[ClientApp].getName()
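These constants are defined in the SparkSubmit companion object, roughly as follows (Spark 3.x, shown for reference):

private[deploy] val YARN_CLUSTER_SUBMIT_CLASS =
  "org.apache.spark.deploy.yarn.YarnClusterApplication"
private[deploy] val REST_CLUSTER_SUBMIT_CLASS =
  classOf[RestSubmissionClientApp].getName()
private[deploy] val STANDALONE_CLUSTER_SUBMIT_CLASS =
  classOf[ClientApp].getName()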

2. runMain

Once the 4-tuple has been obtained from the previous step, runMain takes over.

The core code first:

private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
  val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
  val loader = getSubmitClassLoader(sparkConf)
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }
  var mainClass: Class[_] = null
  try {
    mainClass = Utils.classForName(childMainClass)
  } catch {
    // error handling abridged here: the real code logs the failure and
    // throws a SparkUserAppException with an appropriate exit code
    case e: ClassNotFoundException => throw e
  }
  val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
    mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
  } else {
    // plain main-class applications get wrapped in JavaMainApplication
    new JavaMainApplication(mainClass)
  }

  try {
    app.start(childArgs.toArray, sparkConf)
  } catch {
    case t: Throwable =>
      throw findCause(t)
  }
}

Once prepareSubmitEnvironment is clear, runMain is simple: it loads childMainClass (an implementation of SparkApplication) and invokes its start method.
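SparkApplication itself is a tiny trait (org.apache.spark.deploy.SparkApplication, shown here for reference):

import org.apache.spark.SparkConf

private[spark] trait SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}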

Note that in client mode (as opposed to cluster mode), childMainClass is args.mainClass itself. In that case the class is wrapped in a JavaMainApplication:

new JavaMainApplication(mainClass)
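JavaMainApplication is a thin reflective adapter around an ordinary main class; its implementation is roughly the following (simplified from the Spark source):

import java.lang.reflect.Modifier
import org.apache.spark.SparkConf

private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    // look up the static main(String[]) method on the user's class
    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    // push the SparkConf entries into system properties, so a SparkConf
    // constructed inside the user code picks them up
    val sysProps = conf.getAll.toMap
    sysProps.foreach { case (k, v) => sys.props(k) = v }

    mainMethod.invoke(null, args)
  }
}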

What remains is to look at the implementation logic of RestSubmissionClientApp and org.apache.spark.deploy.yarn.YarnClusterApplication.
