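The previous article covered "Big Data Development: Spark Join Internals Explained" (大數據開發-Spark Join原理詳解). This article looks at the source code to see how join is implemented on top of cogroup.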

1. Analyze the following code

import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinDemo {
  def main(args: Array[String]): Unit = {
    // `.init` drops the trailing "$" from the Scala object's class name.
    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    val random = scala.util.Random
    // 49 users, each assigned a random key in 0..9.
    val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
    val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"), (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))
    val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
    val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

    // Case 1: neither input has a partitioner.
    val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
    println(rdd3.dependencies)

    // Case 2: both inputs are hash-partitioned into 3 partitions before the join.
    val rdd4: RDD[(Int, (String, String))] = rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
    println(rdd4.dependencies)

    sc.stop()
  }
}

Consider the code above: what does it print, is each join a wide or a narrow dependency, and why?

2. Check the run in the Spark UI

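On how stage boundaries relate to wide and narrow dependencies: section 2.1.3, "How to distinguish wide and narrow dependencies", explains that a stage boundary corresponds to a wide dependency, so the stage graphs of rdd3 and rdd4 are enough to tell the two cases apart. For rdd3, the join starts a new stage, so rdd3 is produced through a wide dependency. rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3))) has a different graph: no new stage is created after the partitionBy calls, so that join is a narrow dependency.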
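The same boundaries can be read without opening the UI. The following is a minimal sketch, assuming Spark 2.4.x on the classpath and a local master, that returns RDD.toDebugString for both kinds of join; in that text, the indented `+-` steps indicate shuffle boundaries. StageBoundaryDemo, lineages and the sample data are names made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object StageBoundaryDemo {
  /** Returns the lineage text of the plain join and of the pre-partitioned join. */
  def lineages(sc: SparkContext): (String, String) = {
    val users  = sc.makeRDD(Seq((0, "user1"), (1, "user2"), (2, "user3")))
    val cities = sc.makeRDD(Seq((0, "BJ"), (1, "SH"), (2, "GZ")))
    val part   = new HashPartitioner(3)

    val plain          = users.join(cities)
    val prePartitioned = users.partitionBy(part).join(cities.partitionBy(part))
    (plain.toDebugString, prePartitioned.toDebugString)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StageBoundaryDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    try {
      val (plain, prePartitioned) = lineages(sc)
      println(plain)
      println(prePartitioned)
    } finally sc.stop()
  }
}
```

In the second lineage the shuffles belong to the partitionBy steps (they show up as ShuffledRDD entries), not to the join itself.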

3. The join source code

The conclusion above was read off the UI graphs. Now let's see how join is implemented in the source (based on Spark 2.4.5).

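Start with the entry method. withScope can be thought of as a decorator whose purpose is to let the Spark UI show more information: every method that creates an RDD is wrapped in it, and RDDOperationScope records the operation history and relationships of the RDDs.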
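To make the decorator analogy concrete, here is a minimal pure-Scala sketch of the pattern: a wrapper that records a scope name around a block and restores the previous state afterwards. ScopeDemo, its withScope and the history buffer are hypothetical stand-ins written for illustration; this is not how RDDOperationScope itself is implemented.

```scala
import scala.collection.mutable.ArrayBuffer

object ScopeDemo {
  // Names of the scopes currently open, innermost last.
  private var current: List[String] = Nil
  // Every scope path that was entered, in order of entry.
  val history: ArrayBuffer[String] = ArrayBuffer.empty[String]

  /** Runs `body` inside a named scope and always restores the previous scope. */
  def withScope[T](name: String)(body: => T): T = {
    val saved = current
    current = current :+ name
    history += current.mkString("/")
    try body finally current = saved
  }

  // A wrapped method that delegates to another wrapped method.
  def join(a: Seq[Int], b: Seq[Int]): Seq[(Int, Int)] = withScope("join") {
    cogroup(a, b)
  }

  def cogroup(a: Seq[Int], b: Seq[Int]): Seq[(Int, Int)] = withScope("cogroup") {
    for (x <- a; y <- b if x == y) yield (x, y)
  }
}
```

The business logic stays inside the block, while the wrapper handles the bookkeeping; that separation is the point of the pattern.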

/**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    join(other, defaultPartitioner(self, other))
  }

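Next, look at defaultPartitioner. It decides between reusing an existing partitioner from the input RDDs and creating a new HashPartitioner with the default number of partitions, and returns the chosen partitioner.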

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    // Keep only the RDDs that already have a partitioner set
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    
    // If any RDD has a partitioner, pick the one with the most partitions
    val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
      Some(hasPartitioner.maxBy(_.partitions.length))
    } else {
      None
    }
 
    // If spark.default.parallelism is set, use it; otherwise use the largest partition count among the inputs
    val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
      rdd.context.defaultParallelism
    } else {
      rdds.map(_.partitions.length).max
    }
 
    // If the existing max partitioner is an eligible one, or its partitions number is larger
    // than the default number of partitions, use the existing partitioner.
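    // That is: some input RDD has a partitioner, and either that partitioner is eligible
    // or its partition count is larger than the default number of partitions.
    // If so, reuse that RDD's partitioner; otherwise build a HashPartitioner with the default count.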
    if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
        defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
      hasMaxPartitioner.get.partitioner.get
    } else {
      new HashPartitioner(defaultNumPartitions)
    }
  }

  private def isEligiblePartitioner(
     hasMaxPartitioner: RDD[_],
     rdds: Seq[RDD[_]]): Boolean = {
    val maxPartitions = rdds.map(_.partitions.length).max
    log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
  }
}
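The decision above can be traced on plain numbers. The following is a simplified pure-Scala model of the same branching, not Spark code: Input, Choice and choose are hypothetical names, an RDD is reduced to its partition count plus a flag for whether it has a partitioner, and defaultParallelism stands for the value of spark.default.parallelism when it is configured.

```scala
object PartitionerChoice {
  /** Stand-in for an RDD: its partition count and whether it has a partitioner. */
  final case class Input(numPartitions: Int, hasPartitioner: Boolean)

  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice
  final case class NewHash(numPartitions: Int) extends Choice

  def choose(inputs: Seq[Input], defaultParallelism: Option[Int]): Choice = {
    require(inputs.nonEmpty, "at least one input is required")
    val maxPartitions = inputs.map(_.numPartitions).max
    // Configured parallelism wins; otherwise fall back to the largest input.
    val defaultNum = defaultParallelism.getOrElse(maxPartitions)
    val withPartitioner = inputs.filter(i => i.hasPartitioner && i.numPartitions > 0)

    if (withPartitioner.isEmpty) NewHash(defaultNum)
    else {
      val best = withPartitioner.maxBy(_.numPartitions)
      // Eligible: less than one order of magnitude below the largest input.
      val eligible = math.log10(maxPartitions) - math.log10(best.numPartitions) < 1
      if (eligible || defaultNum < best.numPartitions) ReuseExisting(best.numPartitions)
      else NewHash(defaultNum)
    }
  }
}
```

With two inputs that both carry a 3-partition partitioner (the rdd4 case) the model reuses the existing partitioner; with no partitioner on either side (the rdd3 case) it falls through to a new HashPartitioner sized by the default.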

Now step into the overloaded join method. It calls cogroup, which in turn creates new CoGroupedRDD[K](Seq(self, other), partitioner).

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // partitioner is the one chosen by defaultPartitioner above; what matters most is its number of partitions
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}


  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }
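What cogroup followed by flatMapValues computes is easier to see on local collections. This is a minimal sketch on plain Scala Seq values, with no Spark and no partitioning involved; LocalJoin and its two methods are hypothetical names that only mirror the shape of the code above.

```scala
object LocalJoin {
  /** Groups both sides by key; every key from either side appears exactly once. */
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val r = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
    (l.keySet ++ r.keySet).map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  /** Inner join: the per-key cross product, as in the for-comprehension of join. */
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      // If either side is empty for this key, the comprehension yields nothing.
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

A key that exists on only one side still shows up in cogroup with an empty group on the other side, and it is the cross product that drops it. A key with several values on one side, such as key 0 in col2, produces one output pair per combination.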

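Finally, look at CoGroupedRDD, which is where wide versus narrow is decided. For each parent RDD, if its partitioner equals the partitioner chosen above, the dependency is narrow (OneToOneDependency); otherwise it is wide (ShuffleDependency).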

override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
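The check `rdd.partitioner == Some(part)` relies on partitioner equality. HashPartitioner defines equals in terms of numPartitions, which is why two separately constructed HashPartitioner(3) instances count as the same partitioner. A minimal sketch, assuming Spark 2.4.x on the classpath; DependencyKind and isNarrow are hypothetical names used only to isolate that comparison.

```scala
import org.apache.spark.{HashPartitioner, Partitioner}

object DependencyKind {
  /** Mirrors the comparison `rdd.partitioner == Some(part)` from getDependencies. */
  def isNarrow(parentPartitioner: Option[Partitioner], chosen: Partitioner): Boolean =
    parentPartitioner == Some(chosen)

  def main(args: Array[String]): Unit = {
    val chosen = new HashPartitioner(3)
    println(isNarrow(Some(new HashPartitioner(3)), chosen)) // same scheme, same count
    println(isNarrow(Some(new HashPartitioner(4)), chosen)) // same scheme, different count
    println(isNarrow(None, chosen))                         // parent has no partitioner
  }
}
```

Only the first case would take the OneToOneDependency branch; the other two fall through to ShuffleDependency.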

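To summarize: a join can be given a number of partitions or a partitioner. If both sides of the join are already partitioned with the same scheme and the same number of partitions as the join's partitioner, no shuffle occurs and the dependency is narrow; otherwise the join shuffles and the dependency is wide. The partitioning scheme and the number of partitions are both captured by the partitioner.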
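One caveat when checking this against the demo in section 1: join returns the RDD produced by flatMapValues, so calling dependencies on the join result describes only that outermost step, not the CoGroupedRDD underneath. A minimal sketch, assuming Spark 2.4.x on the classpath, that walks down the lineage to the CoGroupedRDD and reports its dependencies; JoinLineage, coGroupDependencies and describe are hypothetical helper names.

```scala
import org.apache.spark.Dependency
import org.apache.spark.rdd.{CoGroupedRDD, RDD}

object JoinLineage {
  /**
   * Follows the first parent of each RDD until a CoGroupedRDD is found and
   * returns that RDD's dependencies (empty if the lineage contains none).
   */
  def coGroupDependencies(rdd: RDD[_]): Seq[Dependency[_]] = rdd match {
    case cg: CoGroupedRDD[_] => cg.dependencies
    case other =>
      other.dependencies.headOption
        .map(dep => coGroupDependencies(dep.rdd))
        .getOrElse(Seq.empty)
  }

  /** Short class names such as "ShuffleDependency" or "OneToOneDependency". */
  def describe(rdd: RDD[_]): Seq[String] =
    coGroupDependencies(rdd).map(_.getClass.getSimpleName)
}
```

Printing JoinLineage.describe(rdd3) and JoinLineage.describe(rdd4) next to the two println calls in the demo should show two shuffle dependencies for the plain join and two one-to-one dependencies for the pre-partitioned one.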
