Why use YARN?
Data sharing, higher resource utilization, simpler cluster management, and more.
For details, see: http://www.cnblogs.com/luogankun/p/3887019.html
Building Spark with YARN support
Build a Spark distribution with YARN support that matches your Hadoop version:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn clean package -DskipTests -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.0 -Dprotobuf.version=2.5.0 -Pyarn -Phive
For details, see: http://www.cnblogs.com/luogankun/p/3798403.html
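Spark On YARN
Spark's cluster manager is responsible for launching and managing executor processes; the cluster can be Standalone, YARN, or Mesos.
Each SparkContext (in other words, each application) corresponds to one ApplicationMaster, which runs in the first container started for the application.
The ApplicationMaster talks to the ResourceManager and requests resources; once resources are granted, it asks the NodeManagers to start containers for it. Each container runs one ExecutorBackend.
The ResourceManager decides which applications may run, when they run, and on which NodeManagers. Executor processes run in containers on the NodeManagers.
Standalone mode has the concept of a Worker; Spark On YARN does not.
Because an executor runs inside a container, the container's memory must be larger than the executor's memory.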
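To make the container-versus-executor memory relationship concrete, here is a rough sizing sketch. It assumes the container must hold the executor heap plus an off-heap overhead of max(384 MB, 10% of executor memory); those two numbers are an assumption that differs between Spark versions and can be overridden in configuration, and the function name `container_mb` is made up for this example.

```shell
# Estimate the YARN container size (in MB) needed for one executor.
# ASSUMPTION: overhead = max(min_mb, exec_mb * pct / 100). The defaults
# used here (10 percent, 384 MB) are version-dependent, so check your release.
container_mb() {
  cm_exec_mb=$1
  cm_pct=${2:-10}
  cm_min_mb=${3:-384}
  cm_overhead=$(( cm_exec_mb * cm_pct / 100 ))
  if [ "$cm_overhead" -lt "$cm_min_mb" ]; then
    cm_overhead=$cm_min_mb
  fi
  echo $(( cm_exec_mb + cm_overhead ))
}

container_mb 20480   # --executor-memory 20G, prints 22528
container_mb 1024    # a 1 GB executor hits the 384 MB floor, prints 1408
```

YARN may additionally round the request up to its own allocation increment, so the container actually granted can be larger still.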
Spark On YARN has two modes:
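1. yarn-client
The Client and the Driver run together; the ApplicationMaster is only responsible for acquiring resources.
The Client communicates with the allocated containers to schedule work on them, so the Client must not exit.
Log output appears on the terminal console, which suits interactive use and debugging, that is, cases where you want to see the application's output quickly, such as Spark Streaming.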
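2. yarn-cluster
The Driver and the ApplicationMaster run together; they request resources from YARN and monitor the job's status. Executors run in containers.
After the application is submitted, the job keeps running on YARN even if the Client is closed.
Log output does not appear on the terminal console.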
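Since yarn-cluster output never reaches the submitting terminal, one way to read it afterwards is the YARN command line. The sketch below wraps `yarn logs -applicationId`; it assumes log aggregation is enabled on the cluster, and the helper name `fetch_app_logs` and the example ID are made up for illustration.

```shell
# Hypothetical helper: print the aggregated container logs of a YARN application.
# ASSUMPTION: yarn.log-aggregation-enable is true on the cluster; without it
# the logs stay on the individual NodeManager hosts.
fetch_app_logs() {
  if [ -z "${1:-}" ]; then
    echo "usage: fetch_app_logs <application-id>" >&2
    return 2
  fi
  yarn logs -applicationId "$1"
}

# Example (hypothetical ID, as reported by the ResourceManager):
#   fetch_app_logs application_1400000000000_0001 | less
```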
Note: to use Spark On YARN, set HADOOP_CONF_DIR or YARN_CONF_DIR in spark-env.sh to the directory that contains the Hadoop configuration files.
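A quick pre-flight check for this setting might look like the sketch below. The function name `resolve_conf_dir` and the path in the comment are hypothetical; the check only confirms that one of the two variables points at an existing directory, not that the files inside are valid.

```shell
# In conf/spark-env.sh one would typically have a line such as
# (the path is an assumption, use your cluster's own location):
#   export HADOOP_CONF_DIR=/etc/hadoop/conf

# Print the first of HADOOP_CONF_DIR / YARN_CONF_DIR that names an existing
# directory, or fail with a hint if neither does.
resolve_conf_dir() {
  for rcd_dir in "${HADOOP_CONF_DIR:-}" "${YARN_CONF_DIR:-}"; do
    if [ -n "$rcd_dir" ] && [ -d "$rcd_dir" ]; then
      echo "$rcd_dir"
      return 0
    fi
  done
  echo "set HADOOP_CONF_DIR or YARN_CONF_DIR in conf/spark-env.sh" >&2
  return 1
}
```

Running `resolve_conf_dir` before a submit gives an early, readable failure instead of a connection error later on.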
Submitting Spark jobs to YARN
Submit command:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
1. Submitting a local jar
Submit to yarn-cluster/yarn-client:
# --master can also be `yarn-client` for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
In yarn-cluster mode, stopping a running application means terminating it on the cluster nodes where it runs; in yarn-client mode, killing the application on the client is enough.
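As an alternative to hunting down processes on individual nodes, the YARN command line can ask the ResourceManager to stop an application. This is a minimal sketch, assuming the `yarn` CLI is on the PATH; `kill_yarn_app` and the example ID are illustrative names, not part of Spark or Hadoop.

```shell
# Hypothetical helper: stop an application through the ResourceManager.
# YARN application IDs look like application_<cluster-timestamp>_<sequence>.
kill_yarn_app() {
  case "${1:-}" in
    application_[0-9]*_[0-9]*) ;;
    *) echo "not a YARN application ID: ${1:-}" >&2; return 2 ;;
  esac
  yarn application -kill "$1"
}

# Typical use:
#   yarn application -list                          # find the ID of the running job
#   kill_yarn_app application_1400000000000_0001    # hypothetical ID
```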
Submit to standalone:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
2. Submitting a jar stored on HDFS
# --master can also be `yarn-client` for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  hdfs://hadoop000:8020/lib/examples.jar \
  1000
If HADOOP_CONF_DIR or YARN_CONF_DIR is not configured in spark-env.sh, you can set it before submitting the job, for example:
export HADOOP_CONF_DIR=XXX
# --master can also be `yarn-client` for client mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
For details, see: http://spark.apache.org/docs/latest/submitting-applications.html