coalesce
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions an RDD. The first parameter is the target number of partitions; the second controls whether a shuffle is performed (default false). When shuffle = true, the data is redistributed using a HashPartitioner.

Consider the following example:
```scala
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at :21

scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)

scala> data.partitions.size
res38: Int = 2  // RDD data has two partitions by default

scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at :23

scala> rdd1.partitions.size
res1: Int = 1  // rdd1 now has a single partition

scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at :23

scala> rdd1.partitions.size
res2: Int = 2  // To increase the partition count beyond the original,
               // shuffle must be set to true; otherwise the count is unchanged

scala> var rdd1 = data.coalesce(4, true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at :23

scala> rdd1.partitions.size
res3: Int = 4
```
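The partition-count rule demonstrated above can be sketched in plain Scala, with no Spark required. The object and method names below are hypothetical, for illustration only: without a shuffle, coalesce can only merge existing partitions (a narrow dependency), so the count never grows; with shuffle = true any target count is reachable.

```scala
// Hypothetical sketch of the partition-count rule for coalesce.
// Not Spark code -- just models the behavior seen in the transcript above.
object CoalesceRule {
  def resultingPartitions(current: Int, requested: Int, shuffle: Boolean): Int =
    if (shuffle) requested                 // full shuffle: any target count works
    else math.min(current, requested)      // no shuffle: merge only, cannot grow

  def main(args: Array[String]): Unit = {
    println(resultingPartitions(2, 1, shuffle = false)) // 1
    println(resultingPartitions(2, 4, shuffle = false)) // 2, capped at the original
    println(resultingPartitions(2, 4, shuffle = true))  // 4
  }
}
```

This mirrors the transcript: coalesce(4) on a 2-partition RDD leaves 2 partitions, while coalesce(4, true) yields 4.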
repartition
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

This function is simply coalesce with its second parameter (shuffle) set to true.
```scala
scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at :23

scala> rdd2.partitions.size
res4: Int = 1

scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at :23

scala> rdd2.partitions.size
res5: Int = 4
```