Why sortByKey in Spark is classified as a transformation yet triggers an action

Date: 2019-11-12


Note: this could well come up as an interview question.

In Spark 1.4, the source code of sortByKey is as follows:

  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }

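In short, the comment says:

Sort the RDD by key so that each partition holds a sorted range of the elements. Calling collect or save on the resulting RDD returns or writes an ordered list of records (in the save case, they are written to multiple `part-X` files in the filesystem, in key order).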

When we call sortByKey, there are two optional parameters, both with default values:

ascending: the sort direction. It defaults to true, meaning ascending order.

ascending: Boolean = true,

numPartitions: the number of partitions. It defaults to the number of partitions of the original RDD.

numPartitions: Int = self.partitions.length
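
To make the effect of the ascending flag concrete, here is a minimal plain-Scala sketch that works on a local Seq rather than an RDD. The names OrderingSketch, keyOrdering and sortPairs are hypothetical and are not part of Spark. The only thing taken from the source above is the expression `if (ascending) ordering else ordering.reverse`.

```scala
// Illustrative sketch only: plain Scala collections, not Spark code.
// OrderingSketch, keyOrdering and sortPairs are hypothetical names.
object OrderingSketch {
  // Mirrors the expression used in sortByKey:
  //   if (ascending) ordering else ordering.reverse
  def keyOrdering[K](ascending: Boolean)(implicit ordering: Ordering[K]): Ordering[K] =
    if (ascending) ordering else ordering.reverse

  // Sort key-value pairs locally by key, in the requested direction.
  def sortPairs[K: Ordering, V](pairs: Seq[(K, V)], ascending: Boolean = true): Seq[(K, V)] =
    pairs.sortBy(_._1)(keyOrdering[K](ascending))
}
```

Calling `OrderingSketch.sortPairs(Seq(3 -> "c", 1 -> "a", 2 -> "b"), ascending = false)` sorts the pairs by key from largest to smallest. Leaving the flag at its default sorts them from smallest to largest.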

Inside the method there is this line:

val part = new RangePartitioner(numPartitions, self, ascending)

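This line is the reason. It instantiates a RangePartitioner, and to split the data into ranges the RangePartitioner has to scan the data in every partition, sample the keys, and sort the sample to work out the range boundaries (finding the range for a key then works much like a search in a binary tree). In this version that work runs as a job as soon as sortByKey is called, which is why a transformation appears to perform an action.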
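
The idea can be sketched in plain Scala. This is not Spark's RangePartitioner implementation. RangeSketch, rangeBounds and partitionFor are hypothetical names, the keys are assumed to be Int, and the sample is assumed to have been collected already. The point is that the boundaries cannot be chosen without first reading some of the data.

```scala
// Illustrative sketch only: NOT Spark's RangePartitioner implementation.
// RangeSketch, rangeBounds and partitionFor are hypothetical names.
object RangeSketch {
  // Pick up to numPartitions - 1 boundary keys from a sample of the keys.
  // This step has to read data, so it cannot be deferred like a lazy transformation.
  def rangeBounds(sampledKeys: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sampledKeys.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Vector.empty
    else (1 until numPartitions).map { i =>
      sorted(math.min(i * sorted.length / numPartitions, sorted.length - 1))
    }.distinct.toVector
  }

  // Binary search for the first bound that is >= key.
  // The index of that bound is used as the partition id.
  def partitionFor(key: Int, bounds: Vector[Int]): Int = {
    var lo = 0
    var hi = bounds.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (bounds(mid) < key) lo = mid + 1 else hi = mid
    }
    lo
  }
}
```

With the keys 1 to 9 as the sample and three partitions, rangeBounds picks the boundaries 4 and 7, so partitionFor sends key 5 to partition 1 and key 9 to partition 2.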

Even so, the official documentation classifies sortByKey as a transformation.

For the rest, read the source code yourself!
