When configuring the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?
First, let's look at their definitions:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
Their definitions look quite similar, but in practice they behave differently: spark.default.parallelism only takes effect for RDD operations and has no effect on Spark SQL, while spark.sql.shuffle.partitions applies only to Spark SQL.
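The difference is easy to observe by checking the partition count after a shuffle on an RDD versus on a DataFrame. Below is a minimal sketch, assuming a local session; the object name, app name, and the chosen values (10 and 20) are just for illustration, and AQE is disabled so the raw shuffle partition count is not coalesced:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    // Local session with both settings lowered so the difference is easy to see.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("parallelism-demo")
      .config("spark.default.parallelism", "10")
      .config("spark.sql.shuffle.partitions", "20")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // RDD shuffle: reduceByKey picks up spark.default.parallelism -> 10 partitions.
    val rdd = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .reduceByKey(_ + _)
    println(s"RDD partitions after reduceByKey: ${rdd.getNumPartitions}")

    // DataFrame shuffle: groupBy picks up spark.sql.shuffle.partitions -> 20 partitions.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
      .groupBy("k").sum("v")
    println(s"DataFrame partitions after groupBy: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```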
We can change both settings when submitting a job via --conf, like this:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
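Besides --conf at submit time, spark.sql.shuffle.partitions is a SQL configuration, so it can also be adjusted on a running session. The snippet below is a sketch for a spark-shell-style session (the master and app name are placeholders); spark.default.parallelism, by contrast, is read when the SparkContext is created and therefore has to be supplied up front:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell the `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conf-demo")
  .getOrCreate()

// spark.sql.shuffle.partitions is a SQL conf, so it can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "20")
println(spark.conf.get("spark.sql.shuffle.partitions"))  // 20

// spark.default.parallelism is fixed once the SparkContext starts; set it via
// --conf or SparkConf before the context is created.
println(spark.sparkContext.defaultParallelism)
```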