Source: the Spark official documentation

Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster:
1/ The SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
2/ Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
3/ Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
4/ Finally, SparkContext sends tasks to the executors to run.

 

Spark Standalone Mode

1/ Starting a Cluster Manually
You can start a standalone master server by executing:
./sbin/start-master.sh

Similarly, you can start one or more workers and connect them to the master via:
./sbin/start-slave.sh <master-spark-URL>
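For example, assuming the master runs on a host named master-host (a placeholder) with the default port 7077:
./sbin/start-slave.sh spark://master-host:7077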

2/ Cluster Launch Scripts
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/slaves in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line. If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful for testing. Note, the master machine accesses each of the worker machines via ssh. By default, ssh is run in parallel and requires password-less (using a private key) access to be set up. If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
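For illustration, a conf/slaves file for a three-worker cluster might look like this (hostnames are placeholders):
worker1.example.com
worker2.example.com
worker3.example.com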

Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin:

sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh - Starts both a master and a number of slaves as described above.
sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh - Stops both the master and the slaves as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.

You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by starting with the conf/spark-env.sh.template, and copy it to all your worker machines for the settings to take effect. The available settings are documented in the template file.
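As an illustrative sketch (the values are placeholders, not recommendations), conf/spark-env.sh might contain:
SPARK_MASTER_HOST=master-host   # hostname to bind the master to
SPARK_WORKER_CORES=4            # cores each worker offers to applications
SPARK_WORKER_MEMORY=8g          # memory each worker offers to applications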

3/ Connecting an Application to the Cluster
To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master to the SparkContext constructor.

To run an interactive Spark shell against the cluster, run the following command:

./bin/spark-shell --master spark://IP:PORT
You can also pass an option --total-executor-cores <numCores> to control the number of cores that spark-shell uses on the cluster.
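For example, to cap the shell at four cores across the cluster (the master URL is a placeholder):
./bin/spark-shell --master spark://IP:PORT --total-executor-cores 4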

4/ Launching Spark Applications
The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster. For standalone clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
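For illustration, a cluster-mode submission to a standalone master might look like the following (class name, jar, and master URL are placeholders):
./bin/spark-submit --class my.main.Class --master spark://IP:PORT --deploy-mode cluster my-app.jar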
If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2). To control the application’s configuration or execution environment, see Spark Configuration.

Additionally, standalone cluster mode supports restarting your application automatically if it exited with non-zero exit code. To use this feature, you may pass in the --supervise flag to spark-submit when launching your application. Then, if you wish to kill an application that is failing repeatedly, you may do so through:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
You can find the driver ID through the standalone Master web UI at http://<master url>:8080.
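For reference, a supervised cluster-mode launch (all names and the URL are placeholders) simply adds the flag to the submission:
./bin/spark-submit --class my.main.Class --master spark://IP:PORT --deploy-mode cluster --supervise my-app.jar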

5/ Resource Scheduling
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. You can cap the number of cores by setting spark.cores.max in your SparkConf. For example:

val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.cores.max", "10")
val sc = new SparkContext(conf)
In addition, you can configure spark.deploy.defaultCores on the cluster master process to change the default for applications that don’t set spark.cores.max to something less than infinite. Do this by adding the following to conf/spark-env.sh:

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"
This is useful on shared clusters where users might not have configured a maximum number of cores individually.

6/ Monitoring and Logging
Spark’s standalone mode offers a web-based user interface to monitor the cluster. The master and each worker has its own web UI that shows cluster and job statistics. By default you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.
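For example, either of the following (the port value is illustrative) moves the master web UI to port 8081:
SPARK_MASTER_WEBUI_PORT=8081   # set in conf/spark-env.sh
./sbin/start-master.sh --webui-port 8081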

In addition, detailed log output for each job is also written to the work directory of each slave node (SPARK_HOME/work by default). You will see two files for each job, stdout and stderr, with all output it wrote to its console.

7/ Running Alongside Hadoop
You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically hdfs://<namenode>:9000/path, but you can find the right URL on your Hadoop Namenode’s web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).

8/ Configuring Ports for Network Security
Spark makes heavy use of the network, and some environments have strict requirements for using tight firewall settings. For a complete list of ports to configure, see the security page.

9/ High Availability

Running Spark on YARN

1/ Launching Spark on YARN
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).
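For example (the path is a placeholder for wherever your Hadoop client configuration lives):
export HADOOP_CONF_DIR=/etc/hadoop/conf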

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.

To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
For example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10

The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the 「Debugging your Application」 section below for how to see driver and executor logs.

To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:

$ ./bin/spark-shell --master yarn --deploy-mode client

2/ Adding Other JARs
In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2

3/ Preparations
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
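As one sketch of this (the HDFS path is a placeholder), you can pre-upload the jars and point spark.yarn.jars at them in conf/spark-defaults.conf:
spark.yarn.jars  hdfs://namenode:9000/spark/jars/*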

4/ Configuration
Most of the configs are the same for Spark on YARN as for other deployment modes. See the configuration page for more information on those. There are also configs that are specific to Spark on YARN.
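For example (values are illustrative), YARN-specific properties such as spark.yarn.queue and spark.executor.instances can be passed with --conf at submit time:
./bin/spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.queue=thequeue --conf spark.executor.instances=4 --class my.main.Class my-app.jar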

5/ Debugging your Application
In YARN terminology, executors and application masters run inside 「containers」. YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.

yarn logs -applicationId <app ID>
will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.

When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab and doesn’t require running the MapReduce history server.

To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters).

To use a custom log4j configuration for the application master or executors, here are the options:

upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
update the $SPARK_CONF_DIR/log4j.properties file and it will be automatically uploaded along with the other configurations. Note that the other two options have higher priority than this one if multiple options are specified.
Note that for the first option, both executors and the application master will share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).
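As a sketch of the second option (the path is a placeholder and the file must exist locally on every node):
./bin/spark-submit --master yarn --deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
--class my.main.Class my-app.jar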

If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use spark.yarn.app.container.log.dir in your log4j.properties. For example, log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log. For streaming applications, configuring RollingFileAppender and setting file location to YARN’s log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN’s log utility.

To use a custom metrics.properties for the application master and executors, update the $SPARK_CONF_DIR/metrics.properties file. It will automatically be uploaded with other configurations, so you don’t need to specify it manually with --files.

6/ Important notes
Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.

In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode, only the Spark executors do.

The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.

The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
