[Original] Spark Standalone Mode

Date: 2019-11-22

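Installing a Spark Standalone Cluster
Starting a Cluster Manually
Cluster Launch Scripts
Connecting an Application to the Cluster
Launching Spark Applications
Resource Scheduling
Monitoring and Logging
Running Alongside Hadoop
Configuring Ports for Network Security
High Availability
- Standby Masters with ZooKeeper
- Single-Node Recovery with Local File System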

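In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or with the launch scripts that ship with Spark. These daemons can also all run on a single machine for testing.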

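1. Installing a Spark Standalone Cluster

To install Spark in standalone mode, you only need to place a compiled version of Spark on each node of the cluster. You can download a pre-built version from the official website, or build it yourself to suit your needs.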

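2. Starting a Cluster Manually

You can start a standalone master server by executing:

./sbin/start-master.sh

Once started, the master prints a spark://HOST:PORT URL for itself. You can use this URL to connect workers to it, or pass it as the "master" argument to SparkContext. You can also find this URL on the master's web UI, which is http://localhost:8080 by default. It is better to open http://<master-ip>:8080 instead, so that other machines on the same local network as the master can reach it too.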

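Similarly, you can start one or more workers and register them with the master by executing:

./sbin/start-slave.sh <master-spark-URL>

Once you have started a worker, look at the master's web UI. The new worker should be listed there, along with its number of CPU cores and its memory.

Finally, the following configuration options can be passed to the master and worker: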

Argument	Meaning
-h HOST, --host HOST	Hostname to listen on
-i HOST, --ip HOST	Hostname to listen on (deprecated, use -h or --host)
-p PORT, --port PORT	Port for service to listen on (default: 7077 for master, random for worker)
--webui-port PORT	Port for web UI (default: 8080 for master, 8081 for worker)
-c CORES, --cores CORES	Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM	Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
-d DIR, --work-dir DIR	Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
--properties-file FILE	Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
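
As a rough sketch of how these options combine, the snippet below builds a worker start command with explicit resource limits. The master address, the /opt/spark fallback for SPARK_HOME, and the chosen limits are illustrative assumptions rather than values from this article. It also assumes the script forwards trailing options to the worker. If the script is not found, the command is only printed.

```shell
# Hypothetical master URL; replace with the one your master printed.
MASTER_URL="spark://192.168.1.10:7077"

# Assumed install location, used only when SPARK_HOME is not already set.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Limit this worker to 4 cores and 8 GB, and move its web UI to port 8082.
WORKER_ARGS="--cores 4 --memory 8G --webui-port 8082"
CMD="${SPARK_HOME}/sbin/start-slave.sh ${MASTER_URL} ${WORKER_ARGS}"

if [ -x "${SPARK_HOME}/sbin/start-slave.sh" ]; then
    # Spark is installed here: actually start the worker.
    ${CMD}
else
    # Spark not found: only show what would be executed.
    echo "would run: ${CMD}"
fi
```

The same limits could instead be set once per machine through SPARK_WORKER_CORES and SPARK_WORKER_MEMORY in conf/spark-env.sh.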

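3. Cluster Launch Scripts

To launch a standalone cluster with the launch scripts, create a file called conf/slaves under SPARK_HOME. It must contain the hostnames of all the worker machines, one per line. If conf/slaves does not exist, the launch scripts default to a single node on the local machine, which is useful for testing. Note that the master machine accesses each of the worker machines via ssh.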
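
A minimal sketch of creating such a file is shown below. The hostnames are placeholders, and the file is written to a scratch directory so the snippet can be tried safely; in a real deployment it would live at SPARK_HOME/conf/slaves.

```shell
# Scratch conf directory for illustration only.
CONF_DIR="$(mktemp -d)/conf"
mkdir -p "${CONF_DIR}"

# One worker hostname per line (hypothetical hosts).
cat > "${CONF_DIR}/slaves" <<'EOF'
worker1.example.com
worker2.example.com
worker3.example.com
EOF

# Count the non-empty lines, i.e. the workers the launch scripts would start.
WORKER_COUNT=$(grep -c . "${CONF_DIR}/slaves")
echo "workers listed: ${WORKER_COUNT}"
```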

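Once you have set up this file, you can start or stop your cluster with the following shell scripts. They are modeled on Hadoop's deploy scripts and live in SPARK_HOME/sbin:

sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and the slaves as described above.
sbin/stop-master.sh - Stops the master on the machine the script is executed on.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.

Note that these scripts must be executed on the machine you want to run the Spark master on, not on your local machine.

You can optionally configure the following settings in conf/spark-env.sh. This file must be present on every machine in the cluster.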

Environment Variable	Meaning
SPARK_MASTER_IP	Bind the master to a specific IP address, for example a public one.
SPARK_MASTER_PORT	Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT	Port for the master web UI (default: 8080).
SPARK_MASTER_OPTS	Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.
SPARK_LOCAL_DIRS	Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
SPARK_WORKER_CORES	Total number of cores to allow Spark applications to use on the machine (default: all available cores).
SPARK_WORKER_MEMORY	Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
SPARK_WORKER_PORT	Start the Spark worker on a specific port (default: random).
SPARK_WORKER_WEBUI_PORT	Port for the worker web UI (default: 8081).
SPARK_WORKER_INSTANCES	Number of worker instances to run on each machine (default: 1). You can make this more than 1 if you have very large machines and would like multiple Spark worker processes. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.
SPARK_WORKER_DIR	Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_OPTS	Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.
SPARK_DAEMON_MEMORY	Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
SPARK_DAEMON_JAVA_OPTS	JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).
SPARK_PUBLIC_DNS	The public DNS name of the Spark master and workers (default: none).
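
For illustration, a conf/spark-env.sh that sets a few of the variables above might look like the following. Every value here (address, sizes, directories) is a made-up placeholder to adapt to your own machines.

```shell
# conf/spark-env.sh (sketch; all values are illustrative assumptions)

# Master settings
export SPARK_MASTER_IP=192.168.1.10
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

# Per-worker resource limits
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g

# Working and scratch directories, ideally on fast local disks
export SPARK_WORKER_DIR=/data/spark/work
export SPARK_LOCAL_DIRS=/data1/spark/tmp,/data2/spark/tmp
```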

SPARK_MASTER_OPTS supports the following system properties:

Property Name	Default	Meaning
spark.deploy.retainedApplications	200	The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
spark.deploy.retainedDrivers	200	The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
spark.deploy.spreadOut	true	Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
spark.deploy.defaultCores	(infinite)	Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
spark.worker.timeout	60	Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.

SPARK_WORKER_OPTS supports the following system properties:

Property Name	Default	Meaning
spark.worker.cleanup.enabled	false	Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval	1800 (30 minutes)	Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl	7 * 24 * 3600 (7 days)	The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
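
One possible way to turn on periodic cleanup through SPARK_WORKER_OPTS in conf/spark-env.sh is sketched below. The 3-day TTL is an arbitrary choice for illustration; the interval is simply the default from the table written out explicitly.

```shell
# Keep application work dirs for 3 days (an arbitrary, illustrative TTL).
APP_DATA_TTL=$((3 * 24 * 3600))

# Enable cleanup, check every 30 minutes, and apply the TTL computed above.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
-Dspark.worker.cleanup.interval=1800 \
-Dspark.worker.cleanup.appDataTtl=${APP_DATA_TTL}"

echo "${SPARK_WORKER_OPTS}"
```

Workers read this variable at startup, so they would need to be restarted for a change to take effect.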

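4. Connecting an Application to the Cluster

To run an application on the Spark cluster, pass the master's spark://IP:PORT URL to the SparkContext constructor.

To run an interactive Spark shell against the cluster, run the following command:

./bin/spark-shell --master spark://IP:PORT

You can also pass the option --total-executor-cores <numCores> to control the number of cores that spark-shell uses on the cluster.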

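5. Launching Spark Applications

The spark-submit script provides the most straightforward way to submit an application to the cluster. For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver runs on the machine that issues the submit command. In cluster mode, the driver is launched on one of the worker nodes inside the cluster.

If your application is launched through spark-submit, the application jar is automatically distributed to all worker nodes. Any additional jars that your application depends on should be specified through the --jars flag, separated by commas (for example, --jars jar1,jar2).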

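Additionally, standalone cluster mode supports restarting your application automatically if it exits with a non-zero exit code. To use this feature, pass the --supervise flag to spark-submit when launching your application. If you later need to kill an application that keeps failing, you can do so with:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>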
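
Putting the submission options from this section together, a cluster-mode submit might look like the sketch below. The master URL, class name, jar paths, and the /opt/spark fallback are all hypothetical placeholders. If spark-submit is not found, the command is only printed.

```shell
MASTER_URL="spark://192.168.1.10:7077"            # hypothetical master URL
MAIN_CLASS="com.example.MyApp"                    # hypothetical main class
APP_JAR="/path/to/my-app.jar"                     # hypothetical application jar
EXTRA_JARS="/path/to/dep1.jar,/path/to/dep2.jar"  # hypothetical dependencies
SPARK_HOME="${SPARK_HOME:-/opt/spark}"            # assumed install location

# Cluster deploy mode with automatic restart on failure (--supervise).
SUBMIT_ARGS="--class ${MAIN_CLASS} --master ${MASTER_URL} --deploy-mode cluster --supervise --jars ${EXTRA_JARS} ${APP_JAR}"

if [ -x "${SPARK_HOME}/bin/spark-submit" ]; then
    # Unquoted on purpose so the arguments are split on spaces.
    "${SPARK_HOME}/bin/spark-submit" ${SUBMIT_ARGS}
else
    echo "would run: ${SPARK_HOME}/bin/spark-submit ${SUBMIT_ARGS}"
fi
```

Dropping --deploy-mode cluster and --supervise would give the client-mode equivalent, where the driver runs on the submitting machine.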

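6. Resource Scheduling

Standalone cluster mode currently supports only a simple FIFO scheduler across applications. To allow multiple concurrent users, however, you can control the maximum amount of resources each application uses. By default, an application acquires all the cores in the cluster, which only makes sense if you run just one application at a time. You can cap the number of cores by setting spark.cores.max, for example: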

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.cores.max", "10")
val sc = new SparkContext(conf)

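In addition, you can configure spark.deploy.defaultCores on the cluster master to change the default for applications that do not set spark.cores.max themselves, as follows: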

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"

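7. Monitoring and Logging

Spark's standalone mode offers a web-based user interface for monitoring the cluster. The master and each worker have their own web UI. By default, you can access the master's web UI on port 8080. The port can be changed either in the configuration file or through command-line options.

In addition, detailed log output for each job is written to the work directory of each slave node (SPARK_HOME/work by default). There you will see two files for each job, stdout and stderr.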
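
A quick way to locate these log files on a worker is sketched below. It assumes the default work directory under SPARK_HOME, with /opt/spark as a hypothetical fallback, and makes no assumption about how the subdirectories are named.

```shell
# Default worker work directory; adjust if SPARK_WORKER_DIR is set differently.
WORK_DIR="${SPARK_HOME:-/opt/spark}/work"

if [ -d "${WORK_DIR}" ]; then
    # List every stdout and stderr file found under the work directory.
    find "${WORK_DIR}" -type f \( -name stdout -o -name stderr \)
else
    echo "no work directory at ${WORK_DIR}"
fi
```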

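8. Running Alongside Hadoop

You can run Spark alongside your existing Hadoop cluster by launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically hdfs://<namenode>:9000/path). Alternatively, you can set up a separate cluster for Spark and still have it access HDFS over the network, although this is likely to be slower than reading from local disk.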

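9. Configuring Ports for Network Security

Spark makes heavy use of the network, and some environments have strict firewall requirements. For the list of ports to configure, see the security page.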

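10. High Availability

By default, standalone scheduling clusters are resilient to worker failures. However, the scheduler relies on a master to make scheduling decisions, and by default there is only one. If that master crashes, no new applications can be created. To avoid this single point of failure, there are two high availability schemes, detailed below.

10.1 Standby Masters with ZooKeeper

Using ZooKeeper to provide leader election and some state storage, you can launch multiple masters in a ZooKeeper-backed cluster. One master is elected "leader" and the others remain in standby mode. If the current leader dies, another master is elected, recovers the old master's state, and resumes scheduling. The entire recovery process takes between 1 and 2 minutes. Note that this delay only affects scheduling new applications; applications that are already running are unaffected.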

Configuration

To enable this recovery mode, set SPARK_DAEMON_JAVA_OPTS in spark-env.sh with the following options:

System property	Meaning
spark.deploy.recoveryMode	Set to ZOOKEEPER to enable standby Master recovery mode (default: NONE).
spark.deploy.zookeeper.url	The ZooKeeper cluster url (e.g., 192.168.1.100:2181,192.168.1.101:2181).
spark.deploy.zookeeper.dir	The directory in ZooKeeper to store recovery state (default: /spark).
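
As a sketch, the corresponding spark-env.sh entry could look like the following. The ZooKeeper addresses are the example values from the table above, not a real ensemble, and the directory is the documented default written out explicitly.

```shell
# ZooKeeper ensemble (example addresses; replace with your own).
ZK_URL="192.168.1.100:2181,192.168.1.101:2181"

# Enable ZooKeeper-based recovery for the master daemons.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=${ZK_URL} \
-Dspark.deploy.zookeeper.dir=/spark"

echo "${SPARK_DAEMON_JAVA_OPTS}"
```

Every master that should take part in leader election would need the same settings.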

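Details

If you already have a ZooKeeper cluster set up, enabling high availability is straightforward. Simply start multiple master processes on different nodes. Masters can be added and removed at any time.

To schedule new applications or add workers to the cluster, they need to know the IP address of the current leader. This only requires passing in a list of masters instead of a single one. For example, if you start your application with spark://host1:port1,host2:port2 and host1 goes down, the configuration still works, because the cluster finds the new leader, host2.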
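
The multi-master URL can be assembled as in the sketch below. The host names are the placeholders from the paragraph above, and port 7077 is assumed for both masters.

```shell
# Hypothetical master hosts, both assumed to listen on the default port 7077.
MASTERS="host1:7077 host2:7077"

# Join the host:port pairs with commas to form a single master URL.
MASTER_URL="spark://$(echo ${MASTERS} | tr ' ' ',')"

# Show the resulting shell invocation (run it from SPARK_HOME to connect).
echo "./bin/spark-shell --master ${MASTER_URL}"
```

The same URL can be given to workers or to spark-submit, so that each of them can find whichever master is currently the leader.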

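10.2 Single-Node Recovery with Local File System

ZooKeeper is the best way to achieve high availability, but if you only want to be able to restart the master when it goes down, FILESYSTEM mode can take care of it. When applications and workers register with the master, enough state is written to the configured directory for it to be recovered when the master process restarts.

Configuration

To enable this recovery mode, set SPARK_DAEMON_JAVA_OPTS in spark-env.sh with the following options: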

System property	Meaning
spark.deploy.recoveryMode	Set to FILESYSTEM to enable single-node recovery mode (default: NONE).
spark.deploy.recoveryDirectory	The directory in which Spark will store recovery state, accessible from the Master's perspective.
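
A minimal sketch of the matching spark-env.sh entry follows. The recovery directory is a hypothetical path; any directory the master process can read and write should do.

```shell
# Hypothetical directory where the master persists its recovery state.
RECOVERY_DIR="/var/spark/recovery"

# Enable file-system-based single-node recovery for the master.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
-Dspark.deploy.recoveryDirectory=${RECOVERY_DIR}"

echo "${SPARK_DAEMON_JAVA_OPTS}"
```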