The difference between spark.sql.shuffle.partitions and spark.default.parallelism

When configuring the parallelism of Spark jobs, two parameters come up frequently: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?

First, let's look at their definitions:

spark.sql.shuffle.partitions
  Default: 200
  Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.

spark.default.parallelism
  Default:
  - For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD.
  - For operations like parallelize with no parent RDDs, it depends on the cluster manager:
    - Local mode: number of cores on the local machine
    - Mesos fine grained mode: 8
    - Others: total number of cores on all executor nodes, or 2, whichever is larger
  Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user.
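As a quick way to confirm what these defaults resolve to on a given deployment, both values can be read back at runtime. The following is a minimal sketch, not from the original post; the local[4] master is an arbitrary assumption for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")        // assumed 4-core local master, purely for illustration
  .appName("parallelism-check")
  .getOrCreate()

// Spark SQL shuffle setting: 200 unless overridden
println(spark.conf.get("spark.sql.shuffle.partitions"))

// RDD-side setting: in local mode this is the number of local cores (4 here)
println(spark.sparkContext.defaultParallelism)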

At first glance their definitions look very similar, but actual testing shows a clear difference:

 

spark.default.parallelism only takes effect when working with RDDs and has no effect on Spark SQL, whereas spark.sql.shuffle.partitions is a setting specific to Spark SQL.
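This contrast can be reproduced with a short experiment. The sketch below is not from the original post; it assumes a session launched with the hypothetical values spark.default.parallelism=20 and spark.sql.shuffle.partitions=30, chosen only so the two settings are distinguishable:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// RDD path: parallelize and reduceByKey with no explicit partition count
// fall back to spark.default.parallelism.
val rdd = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
println(s"RDD partitions: ${rdd.getNumPartitions}")           // expected: 20

// Spark SQL path: the shuffle introduced by groupBy uses
// spark.sql.shuffle.partitions and ignores spark.default.parallelism.
val agg = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
  .groupBy("k").sum("v")
println(s"DataFrame partitions: ${agg.rdd.getNumPartitions}") // expected: 30

// Note: with adaptive query execution enabled (the default in Spark 3.x),
// the observed DataFrame partition count may be coalesced to a smaller number.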

Both settings can be overridden at job submission time via --conf, for example:

spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
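Alternatively, the same values can be set in application code. A minimal sketch, assuming a Scala application: spark.default.parallelism is read when the SparkContext is created, so it has to be supplied before getOrCreate(), whereas spark.sql.shuffle.partitions can also be changed later within a session.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")                   // hypothetical application name
  .config("spark.default.parallelism", "20")     // must be set before the SparkContext exists
  .config("spark.sql.shuffle.partitions", "20")
  .getOrCreate()

// spark.sql.shuffle.partitions can still be adjusted at runtime, per session:
spark.conf.set("spark.sql.shuffle.partitions", "50")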