This article covers: submitting a Spark job to a Spark cluster, and using crontab to define a scheduled task that submits the Spark job on a regular schedule.
In the ${SPARK_HOME}/bin directory there is a file named spark-submit; this is the script through which Spark jobs are submitted.
Run vim ./spark-submit to see what it actually does:
if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi # disable randomized hash for string in Python 3.3+ export PYTHONHASHSEED=0 exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
It first checks the SPARK_HOME environment variable and sources find-spark-home if needed; nothing is actually submitted here. The key is the last line: it execs spark-class in ${SPARK_HOME}/bin, passing org.apache.spark.deploy.SparkSubmit as the first argument and forwarding all of the original arguments along with it. So spark-submit is clearly not where the job is really submitted.
vim ./spark-class
if [ -z "${SPARK_HOME}" ]; then source "$(dirname "$0")"/find-spark-home fi . "${SPARK_HOME}"/bin/load-spark-env.sh # Find the java binary if [ -n "${JAVA_HOME}" ]; then RUNNER="${JAVA_HOME}/bin/java" ... # Find Spark jars. if [ -d "${SPARK_HOME}/jars" ]; then SPARK_JARS_DIR="${SPARK_HOME}/jars" else SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars" fi ... # Add the launcher build dir to the classpath if requested. if [ -n "$SPARK_PREPEND_CLASSES" ]; then LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH" fi # command array and checks the value to see if the launcher succeeded. build_command() { "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" printf "%d\0" $? } # Turn off posix mode since it does not allow process substitution set +o posix CMD=() while IFS= read -d '' -r ARG; do CMD+=("$ARG") done < <(build_command "$@") ... CMD=("${CMD[@]:0:$LAST}") exec "${CMD[@]}"
Only part of the source is shown here, since the script is fairly long; the comments are enough to understand what it does:
Ultimately it invokes the class org.apache.spark.launcher.Main and hands the input arguments to its main(String[] argsArray) method. After this long detour we are back in Java: Main here is a Java class. Read its source yourself if you are interested (it is genuinely a bit complicated, because this Java code in turn calls into Scala...).
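If you only want to see the command that the launcher ends up building, without reading the Java/Scala source, setting the SPARK_PRINT_LAUNCH_COMMAND environment variable makes the launcher print the full java command (a "Spark Command: ..." line) before the job runs. A quick check, with the jar path left as a placeholder just like in the official examples:

# Print the final java command built by org.apache.spark.launcher.Main
# before the job output starts
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  /path/to/examples.jar \
  100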
While reading the source above we saw $@ several times; it expands to all of the input arguments. Now let's look at what those arguments actually contain.
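To make the behaviour of $@ concrete, here is a throwaway script (echo-args.sh is only an illustrative name, not part of Spark):

#!/usr/bin/env bash
# echo-args.sh: print each argument on its own line to show that "$@"
# forwards every original argument unchanged (spaces preserved when quoted)
for ARG in "$@"; do
  echo "arg: $ARG"
done

Running ./echo-args.sh --class Foo --master local[2] prints each option separately; this is exactly how spark-submit forwards everything you type on to spark-class, and spark-class in turn forwards it to the launcher. The general form of a spark-submit invocation is: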
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
The following are the examples given on the official Spark website:
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000
http://www.javashuo.com/article/p-xrkmoeet-dg.html
That article explains how to set up a scheduled task, which is exactly what we need here. Put the job-submission shell command from above into a script, then run crontab -e to create a new scheduled task and set the time at which the job should be submitted. A sketch of such a submit script follows.
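A minimal sketch of what the script might look like, assuming a standalone cluster; the class name, master URL, jar path, memory settings and log file are placeholders for illustration, not values from the original setup:

#!/usr/bin/env bash
# /usr/hdp/spark/tmp/submit-spark-job.sh
# Submit the job and append all output to a log file so each cron run can be inspected.
export SPARK_HOME=/usr/hdp/spark                 # adjust to your installation
LOG_FILE=/usr/hdp/spark/tmp/submit-spark-job.log

"${SPARK_HOME}"/bin/spark-submit \
  --class com.example.MySparkJob \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --executor-memory 4G \
  --total-executor-cores 8 \
  /usr/hdp/spark/tmp/my-spark-job.jar \
  >> "${LOG_FILE}" 2>&1

The crontab entry then points at this script: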
# Run at minute 0 of every second hour
0 */2 * * * . /etc/profile;/bin/sh /usr/hdp/spark/tmp/submit-spark-job.sh
Then restart the crond service:
systemctl restart crond.service
crontab -l    # list the current scheduled tasks
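If the job does not appear to run on schedule, it is worth confirming that crond actually fired the entry. The commands below are one way to check on a CentOS/RHEL-style system; log locations differ between distributions:

grep submit-spark-job /var/log/cron            # cron's own log of executed entries
journalctl -u crond.service --since today      # equivalent check via the systemd journal
tail /usr/hdp/spark/tmp/submit-spark-job.log   # the script's own log, if you redirect output as sketched above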