Spark Source Code Analysis: spark-submit and spark-class

With the spark-shell experience from before, these two scripts are much easier to read. For background, refer to the earlier spark-shell analysis.

Spark-submit

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

Just like spark-shell, it first checks whether ${SPARK_HOME} is set, then launches spark-class, passing org.apache.spark.deploy.SparkSubmit as the first argument and forwarding all remaining arguments (whatever was passed to spark-submit, e.g. by spark-shell) to spark-class.
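For example, a submission like the one below (the class, master, and jar name are made up for illustration) ends up running spark-class with SparkSubmit prepended to the argument list:

# Illustrative invocation (the application jar and arguments are made up):
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[4] my-app.jar 100

# After the exec line above, the same process is effectively running:
./bin/spark-class org.apache.spark.deploy.SparkSubmit \
  --class org.apache.spark.examples.SparkPi --master local[4] my-app.jar 100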

Spark-class

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find assembly jar
SPARK_ASSEMBLY_JAR=
if [ -f "${SPARK_HOME}/RELEASE" ]; then
  ASSEMBLY_DIR="${SPARK_HOME}/lib"
else
  ASSEMBLY_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION"
fi

GREP_OPTIONS=
num_jars="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" | wc -l)"
if [ "$num_jars" -eq "0" -a -z "$SPARK_ASSEMBLY_JAR" -a "$SPARK_PREPEND_CLASSES" != "1" ]; then
  echo "Failed to find Spark assembly in $ASSEMBLY_DIR." 1>&2
  echo "You need to build Spark before running this program." 1>&2
  exit 1
fi
if [ -d "$ASSEMBLY_DIR" ]; then
  ASSEMBLY_JARS="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" || true)"
  if [ "$num_jars" -gt "1" ]; then
    echo "Found multiple Spark assembly jars in $ASSEMBLY_DIR:" 1>&2
    echo "$ASSEMBLY_JARS" 1>&2
    echo "Please remove all but one jar." 1>&2
    exit 1
  fi
fi

SPARK_ASSEMBLY_JAR="${ASSEMBLY_DIR}/${ASSEMBLY_JARS}"

LAUNCH_CLASSPATH="$SPARK_ASSEMBLY_JAR"

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

export _SPARK_ASSEMBLY="$SPARK_ASSEMBLY_JAR"

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"

This script is the real workhorse, so let's take a close look at where the real entry point is.

First, as before, it sets the project home directory:

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
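To see what this dirname/cd/pwd idiom does, here is a small self-contained sketch (the directory layout and script name are made up):

# Build a throwaway directory tree and a script that resolves its own "home", like spark-class does.
demo_home="$(mktemp -d)"; mkdir -p "$demo_home/bin"
cat > "$demo_home/bin/where-am-i.sh" <<'EOF'
home="$(cd "`dirname "$0"`"/..; pwd)"   # the directory one level above the script's own location
echo "resolved home: $home"
EOF
sh "$demo_home/bin/where-am-i.sh"        # prints the temp directory, regardless of the caller's cwd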

Then it loads some environment variables:

. "${SPARK_HOME}"/bin/load-spark-env.sh

load-spark-env.sh pulls in conf/spark-env.sh and sets up the assembly-related variables (such as SPARK_SCALA_VERSION) used later in this script.
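Roughly speaking, load-spark-env.sh behaves like the simplified sketch below (this is not the real file; details vary between Spark versions):

# Simplified sketch of load-spark-env.sh (not the actual file):
if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1                    # guard so spark-env.sh is only sourced once
  if [ -f "${SPARK_HOME}/conf/spark-env.sh" ]; then
    set -a                                     # export every variable defined while sourcing
    . "${SPARK_HOME}/conf/spark-env.sh"
    set +a
  fi
fi
# It also picks SPARK_SCALA_VERSION (e.g. 2.10 or 2.11), which spark-class uses
# below to locate the assembly and launcher build directories.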

Next it looks for the java binary and assigns it to the RUNNER variable:

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

The middle part is a long block that locates the Spark assembly jar and builds LAUNCH_CLASSPATH from it.
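On a packaged release this block simply looks for a single assembly jar under lib/. A quick way to see which jar would be picked up (the jar name in the comment is only illustrative):

# List candidate assembly jars exactly the way spark-class does; on a 1.x release this
# typically matches something like spark-assembly-1.6.x-hadoop2.y.z.jar
ls -1 "${SPARK_HOME}/lib" | grep "^spark-assembly.*hadoop.*\.jar$"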

The most critical part is the following:

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
exec "${CMD[@]}"

Here the script runs "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" via process substitution; this is the first Spark class actually executed. The while loop then reads the launcher's NUL-separated output argument by argument, appending each one to the CMD array.
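The NUL-separated protocol is worth a standalone look. Here is a minimal demo of the same read loop, unrelated to Spark (the paths and class name are made up):

# Produce NUL-separated "arguments" with printf, then read them back into an array.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s\0' /usr/bin/java -cp "/path with spaces/app.jar" org.example.Main)
printf '[%s]\n' "${CMD[@]}"   # every argument survives intact, even the one containing spaces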

This class lives in the launcher module; here is a quick look at the code:

public static void main(String[] argsArray) throws Exception {
    ...
    List<String> args = new ArrayList<String>(Arrays.asList(argsArray));
    String className = args.remove(0);
    ...
    // create the command builder for the requested class
    AbstractCommandBuilder builder;
    if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
      try {
        builder = new SparkSubmitCommandBuilder(args);
      } catch (IllegalArgumentException e) {
        ...
      }
    } else {
      builder = new SparkClassCommandBuilder(className, args);
    }

    List<String> cmd = builder.buildCommand(env); // the builder parses the arguments and builds the final command
    ...
    // hand the resulting command back to the calling script
    if (isWindows()) {
      System.out.println(prepareWindowsCommand(cmd, env));
    } else {
      List<String> bashCmd = prepareBashCommand(cmd, env);
      for (String c : bashCmd) {
        System.out.print(c);
        System.out.print('\0');
      }
    }
  }

The command that launcher.Main prints is what gets stored into CMD.
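If you want to inspect the command launcher.Main generates without running it, you can turn the NUL separators into newlines with tr (the trailing arguments here are only examples):

# Show the command launcher.Main would hand back to spark-class, one argument per line:
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main \
  org.apache.spark.deploy.SparkSubmit --master "local[2]" my-app.jar | tr '\0' '\n'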

Then the command is executed:

exec "${CMD[@]}"

This is where the actual Spark class starts running.
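For a SparkSubmit launch, the final CMD typically looks roughly like the line below (the JVM path, classpath, memory setting, and arguments are all illustrative):

# Roughly what exec "${CMD[@]}" ends up running (everything here is illustrative):
/usr/lib/jvm/java-8/bin/java -Xmx1g \
  -cp "${SPARK_HOME}/conf:${SPARK_HOME}/lib/spark-assembly-<version>-hadoop<version>.jar" \
  org.apache.spark.deploy.SparkSubmit --master "local[2]" my-app.jar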

Finally, a few words about the exec command; it is easiest to understand alongside a couple of related commands:

  • source: runs the target script directly in the current shell, as if its contents had been pasted into the calling script. In other words, the current shell takes on the target script's work itself.
  • exec: replaces the current process image with the new program. No new process is created and the PID stays the same, which is why nothing after the exec line in the original script ever runs.
  • sh: starts a new shell as a child process (a fork followed by an exec). The fork does not eagerly copy the parent's memory; pages are shared copy-on-write and only duplicated when one side modifies them.

Here is a simple example using the four scripts below:
xingoo-test-1.sh

exec -c sh /home/xinghl/test/xingoo-test-2.sh

xingoo-test-2.sh

while true
do
        echo "a2"
        sleep 3
done

xingoo-test-3.sh

sh /home/xinghl/test/xingoo-test-2.sh

xingoo-test-4.sh

source /home/xinghl/test/xingoo-test-2.sh

Running xingoo-test-1.sh and xingoo-test-4.sh has the same effect: only one process exists.
Running xingoo-test-3.sh produces two processes.
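To see this for yourself, here is a small self-contained demo (separate from the scripts above) that prints the PID in each case. Run it as its own script, e.g. sh pid-demo.sh, because the final exec replaces the running shell:

# pid-demo.sh: compare how sh, source (.), and exec affect the process ID.
cat > /tmp/child.sh <<'EOF'
echo "child  pid: $$"
EOF

echo    "parent pid: $$"
sh      /tmp/child.sh   # different PID: sh forks a new process for the child script
.       /tmp/child.sh   # same PID: the child script runs inside the current shell (source)
exec sh /tmp/child.sh   # same PID: the current process image is replaced; nothing after this line runs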

References

linux裏source、sh、bash、./有什麼區別 (What is the difference between source, sh, bash, and ./ on Linux)
