With the spark-shell analysis from the previous post under our belt, these two scripts are much easier to read; the earlier spark-shell write-up can be used as a reference. First, the spark-submit script:
if [ -z "${SPARK_HOME}" ]; then export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)" fi # disable randomized hash for string in Python 3.3+ export PYTHONHASHSEED=0 exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
Just like spark-shell, the script first checks whether ${SPARK_HOME} is set, then launches spark-class, passing org.apache.spark.deploy.SparkSubmit as the first argument and forwarding all of the original command-line arguments to spark-class.
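As a concrete illustration (the application class, jar and arguments below are hypothetical, not taken from the original post), a call like the first line below ends up being re-executed as the second:

# hypothetical invocation of spark-submit
./bin/spark-submit --class org.example.MyApp --master local[2] my-app.jar input.txt
# ...which the script turns into:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit --class org.example.MyApp --master local[2] my-app.jar input.txt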
Next is the spark-class script itself:
if [ -z "${SPARK_HOME}" ]; then export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)" fi . "${SPARK_HOME}"/bin/load-spark-env.sh # Find the java binary if [ -n "${JAVA_HOME}" ]; then RUNNER="${JAVA_HOME}/bin/java" else if [ `command -v java` ]; then RUNNER="java" else echo "JAVA_HOME is not set" >&2 exit 1 fi fi # Find assembly jar SPARK_ASSEMBLY_JAR= if [ -f "${SPARK_HOME}/RELEASE" ]; then ASSEMBLY_DIR="${SPARK_HOME}/lib" else ASSEMBLY_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION" fi GREP_OPTIONS= num_jars="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" | wc -l)" if [ "$num_jars" -eq "0" -a -z "$SPARK_ASSEMBLY_JAR" -a "$SPARK_PREPEND_CLASSES" != "1" ]; then echo "Failed to find Spark assembly in $ASSEMBLY_DIR." 1>&2 echo "You need to build Spark before running this program." 1>&2 exit 1 fi if [ -d "$ASSEMBLY_DIR" ]; then ASSEMBLY_JARS="$(ls -1 "$ASSEMBLY_DIR" | grep "^spark-assembly.*hadoop.*\.jar$" || true)" if [ "$num_jars" -gt "1" ]; then echo "Found multiple Spark assembly jars in $ASSEMBLY_DIR:" 1>&2 echo "$ASSEMBLY_JARS" 1>&2 echo "Please remove all but one jar." 1>&2 exit 1 fi fi SPARK_ASSEMBLY_JAR="${ASSEMBLY_DIR}/${ASSEMBLY_JARS}" LAUNCH_CLASSPATH="$SPARK_ASSEMBLY_JAR" # Add the launcher build dir to the classpath if requested. if [ -n "$SPARK_PREPEND_CLASSES" ]; then LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH" fi export _SPARK_ASSEMBLY="$SPARK_ASSEMBLY_JAR" # For tests if [[ -n "$SPARK_TESTING" ]]; then unset YARN_CONF_DIR unset HADOOP_CONF_DIR fi # The launcher library will print arguments separated by a NULL character, to allow arguments with # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating # an array that will be used to exec the final command. CMD=() while IFS= read -d '' -r ARG; do CMD+=("$ARG") done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@") exec "${CMD[@]}"
This script is where the real work happens, so let's walk through it and find the actual entry point.
First, as before, it sets the project home directory:
if [ -z "${SPARK_HOME}" ]; then export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)" fi
Then it loads some environment variables:
. "${SPARK_HOME}"/bin/load-spark-env.sh
spark-env is where the assembly-related settings used later in the script come from.
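Roughly speaking, load-spark-env.sh does something like the following (a simplified sketch, not the actual file): it sources the user's conf/spark-env.sh once and makes sure SPARK_SCALA_VERSION is set, which the assembly lookup below depends on.

# simplified sketch of what load-spark-env.sh does (not the real file)
if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1
  # pull in the user-provided spark-env.sh, if there is one
  if [ -f "${SPARK_HOME}/conf/spark-env.sh" ]; then
    . "${SPARK_HOME}/conf/spark-env.sh"
  fi
fi

# make sure the Scala version used to locate the assembly directory is set
if [ -z "$SPARK_SCALA_VERSION" ]; then
  export SPARK_SCALA_VERSION="2.10"
fi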
Next, it locates the java binary and assigns it to the RUNNER variable:
# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi
The middle of the script is a large chunk that locates the Spark assembly jar and builds LAUNCH_CLASSPATH from it.
The key part is this:
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")

exec "${CMD[@]}"
Here the process substitution runs "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@", and the while loop reads its output argument by argument into the CMD array. org.apache.spark.launcher.Main is therefore the first Spark class that actually gets executed.
This class lives in the launcher module; here is a quick look at the code:
public static void main(String[] argsArray) throws Exception {
  ...
  List<String> args = new ArrayList<String>(Arrays.asList(argsArray));
  String className = args.remove(0);
  ...
  // create the command builder
  AbstractCommandBuilder builder;
  if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
    try {
      builder = new SparkSubmitCommandBuilder(args);
    } catch (IllegalArgumentException e) {
      ...
    }
  } else {
    builder = new SparkClassCommandBuilder(className, args);
  }

  List<String> cmd = builder.buildCommand(env); // the builder parses the arguments
  ...
  // print the resulting command
  if (isWindows()) {
    System.out.println(prepareWindowsCommand(cmd, env));
  } else {
    List<String> bashCmd = prepareBashCommand(cmd, env);
    for (String c : bashCmd) {
      System.out.print(c);
      System.out.print('\0');
    }
  }
}
Whatever launcher.Main prints is read back and stored in CMD.
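The NULL separator matters because the arguments printed by launcher.Main may contain spaces or other characters the shell would otherwise mangle. A minimal standalone sketch (printf stands in for launcher.Main here) shows how the loop rebuilds them intact:

# printf plays the role of launcher.Main, emitting NULL-separated arguments;
# the loop reads them back into an array without splitting on spaces
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(printf '%s\0' java -Xmx1g org.apache.spark.deploy.SparkSubmit --name "my app")

printf '[%s]\n' "${CMD[@]}"   # "my app" comes back as a single element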
Then the command is executed:
exec "${CMD[@]}"
At this point a concrete Spark class is finally executed.
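For a spark-submit launch, the assembled command typically looks something like the line below; the JVM path, memory setting and application details are purely illustrative, not taken from the original post.

/usr/lib/jvm/java/bin/java -cp /opt/spark/conf:/opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar \
  -Xmx1g org.apache.spark.deploy.SparkSubmit --class org.example.MyApp --master local[2] my-app.jar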
Finally, a word about the exec command; it is easiest to understand alongside a couple of related commands.
The source command, when used to run a script, pulls the target script straight into the current shell and executes it there. In other words, the current shell takes over the target script's work and does it itself.

The exec command creates a new process, except that this process keeps the same PID as the previous one. As a result, the remainder of the original script can no longer run, since the process has effectively been swapped out. Also, creating the new process does not mean everything is copied outright; copy-on-write is used, so content is only copied when the new process actually uses it.

The sh command starts a new shell to run the script, which amounts to creating a new process.

A simple example: here are four small scripts.
xingoo-test-1.sh
exec -c sh /home/xinghl/test/xingoo-test-2.sh
xingoo-test-2.sh
while true
do
  echo "a2"
  sleep 3
done
xingoo-test-3.sh
sh /home/xinghl/test/xingoo-test-2.sh
xingoo-test-4.sh
source /home/xinghl/test/xingoo-test-2.sh
Running xingoo-test-1.sh and xingoo-test-4.sh gives the same result: in both cases there is only one process.
Running xingoo-test-3.sh, on the other hand, produces two processes.
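A quick way to verify this yourself (a small sketch, not part of the original scripts): print the PID inside xingoo-test-2.sh and compare it with the caller's PID.

# sketch: add this as the first line of xingoo-test-2.sh
echo "test-2 running in pid $$"

# then, from an interactive shell (Ctrl-C stops the infinite loop each time):
echo $$                                       # PID of the current shell
sh      /home/xinghl/test/xingoo-test-2.sh    # prints a new child PID
source  /home/xinghl/test/xingoo-test-2.sh    # prints the same PID as the current shell
exec sh /home/xinghl/test/xingoo-test-2.sh    # same PID again, but the current shell is replaced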