When configuring the parallelism of Spark jobs, two parameters come up again and again: spark.sql.shuffle.partitions and spark.default.parallelism. What exactly is the difference between them?
First, let's look at their definitions:
| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager. | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
Their definitions look quite similar, but in practice they behave differently: spark.default.parallelism only takes effect for RDD operations and has no effect on Spark SQL, while spark.sql.shuffle.partitions applies only to Spark SQL.
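The difference is easy to observe by checking the partition count after a shuffle on an RDD versus on a DataFrame. Below is a minimal sketch, assuming a local session; the object name, app name, and the chosen values (10 and 20) are just for illustration, and AQE is disabled so the raw shuffle partition count is not coalesced:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    // Local session with both settings lowered so the difference is easy to see.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("parallelism-demo")
      .config("spark.default.parallelism", "10")
      .config("spark.sql.shuffle.partitions", "20")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // RDD shuffle: reduceByKey picks up spark.default.parallelism -> 10 partitions.
    val rdd = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .reduceByKey(_ + _)
    println(s"RDD partitions after reduceByKey: ${rdd.getNumPartitions}")

    // DataFrame shuffle: groupBy picks up spark.sql.shuffle.partitions -> 20 partitions.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
      .groupBy("k").sum("v")
    println(s"DataFrame partitions after groupBy: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```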
We can change both settings when submitting a job via --conf, like this:
spark-submit --conf spark.sql.shuffle.partitions=20 --conf spark.default.parallelism=20
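Besides --conf at submit time, spark.sql.shuffle.partitions is a SQL configuration, so it can also be adjusted on a running session. The snippet below is a sketch for a spark-shell-style session (the master and app name are placeholders); spark.default.parallelism, by contrast, is read when the SparkContext is created and therefore has to be supplied up front:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell the `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conf-demo")
  .getOrCreate()

// spark.sql.shuffle.partitions is a SQL conf, so it can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "20")
println(spark.conf.get("spark.sql.shuffle.partitions"))  // 20

// spark.default.parallelism is fixed once the SparkContext starts; set it via
// --conf or SparkConf before the context is created.
println(spark.sparkContext.defaultParallelism)
```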