This chapter analyzes how data from a Kafka topic is processed when Spark 2.4 Structured Streaming reads from a Kafka source.
The analysis is built around the following example:
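import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;

SparkSession sparkSession = SparkSession.builder().getOrCreate();

// Source: read a Kafka topic as a streaming Dataset.
Dataset<Row> sourceDataset = sparkSession.readStream().format("kafka").option("", "").load();
sourceDataset.createOrReplaceTempView("tv_test");

// Aggregate with SQL over the temporary view.
Dataset<Row> aggResultDataset = sparkSession.sql("select ....");

// Sink: write the result to another Kafka topic with a continuous trigger.
StreamingQuery query = aggResultDataset.writeStream().format("kafka").option("", "")
    .trigger(Trigger.Continuous(1000))
    .start();

try {
    query.awaitTermination();
} catch (StreamingQueryException e1) {
    e1.printStackTrace();
}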
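In this example, Structured Streaming reads a Kafka topic, runs an aggregation, and sinks the result to another Kafka topic.
To understand how the Dataset returned by load is processed, we need to analyze the load method itself. Since sparkSession.readStream() returns a DataStreamReader, the method in question is DataStreamReader#load; the screenshot below shows its core code.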
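Before the analysis, here is the result of a small test:
package com.boco.broadcast

// Minimal stand-ins for the Spark traits, used only to test match order.
trait MicroBatchReadSupport {}
trait ContinuousReadSupport {}
trait DataSourceRegister {
  def shortName(): String
}

class KafkaSourceProvider extends DataSourceRegister
  with MicroBatchReadSupport
  with ContinuousReadSupport {
  override def shortName(): String = "kafka"
}

object KafkaSourceProvider {
  def main(args: Array[String]): Unit = {
    val ds = classOf[KafkaSourceProvider].newInstance()
    // Cases are tried in order; the first one that matches wins.
    ds match {
      case s: MicroBatchReadSupport => println("MicroBatchReadSupport")
      case s: ContinuousReadSupport => println("ContinuousReadSupport")
    }
  }
}
Running this prints only "MicroBatchReadSupport". A match expression tries its cases in order and stops at the first one that matches, so an object that mixes in both traits never reaches the ContinuousReadSupport case. This point is discussed separately later.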
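The same ordering effect can be reproduced in Java with an instanceof chain. This is a minimal sketch: the interfaces and provider classes below are hypothetical stand-ins, not the real Spark types, and the "v1 fallback" label is only illustrative.

```java
// Hypothetical stand-ins; these are NOT the real Spark interfaces.
final class MatchOrderSketch {
    interface MicroBatchReadSupport {}
    interface ContinuousReadSupport {}

    // Implements both capabilities, like the test class above.
    static final class BothProvider implements MicroBatchReadSupport, ContinuousReadSupport {}

    // Implements only the continuous capability.
    static final class ContinuousOnlyProvider implements ContinuousReadSupport {}

    // Mirrors a Scala match: branches are checked top to bottom, first hit wins.
    static String dispatch(Object ds) {
        if (ds instanceof MicroBatchReadSupport) {
            return "MicroBatchReadSupport";
        } else if (ds instanceof ContinuousReadSupport) {
            return "ContinuousReadSupport";
        }
        return "v1 fallback";
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new BothProvider()));
        System.out.println(dispatch(new ContinuousOnlyProvider()));
        System.out.println(dispatch(new Object()));
    }
}
```

The second branch is reachable only for a provider that lacks the first capability, which is why a provider with both never gets there.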
With this test result in mind, let us analyze the code of DataStreamReader's load method:
1) From the previous article, "Spark2.x (60): How does Structured Streaming look up Kafka's DataSourceProvider?", we know that DataSource.lookupDataSource() returns the KafkaSourceProvider class, so ds is an instance of KafkaSourceProvider;
2) From the screenshot above, we can see that KafkaSourceProvider (https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala) extends the interfaces DataSourceRegister, StreamSourceProvider, StreamSinkProvider, RelationProvider, CreatableRelationProvider, StreamWriteSupport, ContinuousReadSupport, and MicroBatchReadSupport;
3) v1DataSource is an instance of the DataSource class, so let us look at what DataSource does when it is initialized.
// We need to generate the V1 data source so we can pass it to the V2 relation as a shim.
// We can't be sure at this point whether we'll actually want to use V2, since we don't know the
// writer or whether the query is continuous.
val v1DataSource = DataSource(
  sparkSession,
  userSpecifiedSchema = userSpecifiedSchema,
  className = source,
  options = extraOptions.toMap)
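Initializing DataSource does only the following; no data is loaded:
1) it calls lookupDataSource on object DataSource to load the provider class;
2) it obtains the schema of the Kafka topic;
3) it stores the option parameters, that is, those passed via sparkSession.readStream().option;
4) it holds the sparkSession.
The schema is obtained as follows:
1) DataSource#sourceSchema internally calls KafkaSourceProvider#sourceSchema(...);
2) KafkaSourceProvider#sourceSchema returns a pair: (shortName(), KafkaOffsetReader.kafkaSchema).
Up to this point in the code, no data has been loaded.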
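The idea that initialization only describes the source can be sketched as follows. Everything here is an illustrative assumption: KafkaLikeProvider, SourceDescriptor, and the poll counter are hypothetical names and do not reproduce Spark's DataSource class. The column names follow the Kafka source's fixed schema, with the types written as plain strings.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class SourceDescriptorSketch {
    // Hypothetical provider; it counts polls so we can see that none happen.
    static final class KafkaLikeProvider {
        int polls = 0;

        String shortName() { return "kafka"; }

        // Fixed schema of the source; it does not depend on the topic's contents.
        Map<String, String> schema() {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("key", "binary");
            fields.put("value", "binary");
            fields.put("topic", "string");
            fields.put("partition", "int");
            fields.put("offset", "long");
            fields.put("timestamp", "timestamp");
            fields.put("timestampType", "int");
            return fields;
        }

        // The (name, schema) pair, analogous to what sourceSchema returns.
        Map.Entry<String, Map<String, String>> sourceSchema() {
            return new SimpleImmutableEntry<>(shortName(), schema());
        }

        // Would fetch a record; never called while the descriptor is built.
        String poll() { polls++; return "record"; }
    }

    // Hypothetical descriptor: provider, options, and schema, but no records.
    static final class SourceDescriptor {
        final KafkaLikeProvider provider;
        final Map<String, String> options;
        final Map.Entry<String, Map<String, String>> sourceSchema;

        SourceDescriptor(KafkaLikeProvider provider, Map<String, String> options) {
            this.provider = provider;
            this.options = Collections.unmodifiableMap(new LinkedHashMap<>(options));
            this.sourceSchema = provider.sourceSchema();
        }
    }

    public static void main(String[] args) {
        KafkaLikeProvider provider = new KafkaLikeProvider();
        SourceDescriptor d =
            new SourceDescriptor(provider, Collections.singletonMap("subscribe", "test"));
        System.out.println(d.sourceSchema.getKey() + " " + d.sourceSchema.getValue().keySet());
        System.out.println("polls = " + provider.polls);
    }
}
```

Building the descriptor touches the provider's name and schema only; the poll counter stays at zero.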
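ds is the provider instance.
v1DataSource bundles the source's provider, the source's properties (the spark.readStream.option parameters, such as the topic and maxOffsetsPerTrigger), and the source's schema; it is itself only a descriptor of the data.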
The two branches differ mainly in the tempReader they create:
MicroBatchReadSupport: calls KafkaSourceProvider's createMicroBatchReader to produce a KafkaMicroBatchReader object;
ContinuousReadSupport: calls KafkaSourceProvider's createContinuousReader to produce a KafkaContinuousReader object.
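The role of tempReader can be sketched as below. This assumes, as the name suggests, that the reader is created only to learn the schema and is stopped straight afterwards. The Reader class, the two interfaces, and schemaOf are hypothetical and far simpler than Spark's real signatures.

```java
import java.util.Arrays;
import java.util.List;

final class TempReaderSketch {
    // Hypothetical reader; only the schema and the lifecycle matter here.
    static final class Reader {
        final String kind;
        boolean stopped = false;

        Reader(String kind) { this.kind = kind; }

        List<String> readSchema() { return Arrays.asList("key", "value", "topic"); }

        void stop() { stopped = true; }
    }

    interface MicroBatchReadSupport { Reader createMicroBatchReader(); }
    interface ContinuousReadSupport { Reader createContinuousReader(); }

    // Supports both modes and remembers the last reader it handed out.
    static final class KafkaLikeProvider implements MicroBatchReadSupport, ContinuousReadSupport {
        Reader last;

        public Reader createMicroBatchReader() { return last = new Reader("micro-batch"); }

        public Reader createContinuousReader() { return last = new Reader("continuous"); }
    }

    // Creates a temporary reader, asks it for the schema, and always stops it.
    static List<String> schemaOf(Object ds) {
        Reader tempReader = null;
        try {
            if (ds instanceof MicroBatchReadSupport) {
                tempReader = ((MicroBatchReadSupport) ds).createMicroBatchReader();
            } else if (ds instanceof ContinuousReadSupport) {
                tempReader = ((ContinuousReadSupport) ds).createContinuousReader();
            } else {
                throw new IllegalArgumentException("no v2 read support");
            }
            return tempReader.readSchema();
        } finally {
            if (tempReader != null) {
                tempReader.stop();
            }
        }
    }

    public static void main(String[] args) {
        KafkaLikeProvider provider = new KafkaLikeProvider();
        System.out.println(schemaOf(provider));
        System.out.println(provider.last.kind + ", stopped = " + provider.last.stopped);
    }
}
```

Under that assumption, which branch creates the temporary reader affects only how the schema is obtained, not how the running query later reads data.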
Test code 1:
package com.boco.broadcast

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object TestContinuous {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    // Note: only options prefixed with "kafka." are forwarded to the Kafka consumer.
    val source = spark.readStream.format("kafka")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("failOnDataLoss", true)
      .option("retries", 2)
      .option("session.timeout.ms", 3000)
      .option("fetch.max.wait.ms", 500)
      .option("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      .option("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      .load()

    source.createOrReplaceTempView("tv_test")
    val aggResult = spark.sql("select * from tv_test")

    // Sink: CSV files, with a continuous trigger.
    val query = aggResult.writeStream
      .format("csv")
      .option("path", "E:\\test\\testdd")
      .option("checkpointLocation", "E:\\test\\checkpoint")
      .trigger(Trigger.Continuous(5, TimeUnit.MINUTES))
      .outputMode(OutputMode.Append())
      .start()

    query.awaitTermination()
  }
}
Test code 2:
package com.boco.broadcast

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object TestContinuous {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    // Note: only options prefixed with "kafka." are forwarded to the Kafka consumer.
    val source = spark.readStream.format("kafka")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("failOnDataLoss", true)
      .option("retries", 2)
      .option("session.timeout.ms", 3000)
      .option("fetch.max.wait.ms", 500)
      .option("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      .option("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      .load()

    source.createOrReplaceTempView("tv_test")
    val aggResult = spark.sql("select * from tv_test")

    // Sink: Kafka. The sink reads its target topic from the "topic" option
    // and needs its own bootstrap servers.
    val query = aggResult.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "test_sink")
      .option("checkpointLocation", "E:\\test\\checkpoint")
      .trigger(Trigger.Continuous(5, TimeUnit.MINUTES))
      .outputMode(OutputMode.Append())
      .start()

    query.awaitTermination()
  }
}
The POM file for the test code:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.boco.broadcast.test</groupId>
  <artifactId>broadcast_test</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.12</scala.version>
    <spark.version>2.4.0</spark.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <!--Scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-reflect</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <!--Scala -->
    <!--Spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
      <version>2.3.0</version>
    </dependency>
    <!--Spark -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
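Debugging result:
Whether "test code 1" or "test code 2" is run, ds match behaves the same way: only the case MicroBatchReadSupport branch is taken. This raises a question:
Why, when the trigger is Continuous, is the Kafka topic read with KafkaMicroBatchReader rather than KafkaContinuousReader?
In either branch, however, the result is wrapped in a StreamingRelationV2 (which extends LeafNode, a logical plan) and passed to the Dataset; when the Dataset later loads data, it is this logical plan that gets executed.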
package org.apache.spark.sql.execution.streaming

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
import org.apache.spark.sql.execution.LeafExecNode
import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.sources.v2.{ContinuousReadSupport, DataSourceV2}

object StreamingRelation {
  def apply(dataSource: DataSource): StreamingRelation = {
    StreamingRelation(
      dataSource, dataSource.sourceInfo.name, dataSource.sourceInfo.schema.toAttributes)
  }
}

/**
 * Used to link a streaming [[DataSource]] into a
 * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]. This is only used for creating
 * a streaming [[org.apache.spark.sql.DataFrame]] from [[org.apache.spark.sql.DataFrameReader]].
 * It should be used to create [[Source]] and converted to [[StreamingExecutionRelation]] when
 * passing to [[StreamExecution]] to run a query.
 */
case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])
  extends LeafNode with MultiInstanceRelation {
  override def isStreaming: Boolean = true
  override def toString: String = sourceName

  // There's no sensible value here. On the execution path, this relation will be
  // swapped out with microbatches. But some dataframe operations (in particular explain) do lead
  // to this node surviving analysis. So we satisfy the LeafNode contract with the session default
  // value.
  override def computeStats(): Statistics = Statistics(
    sizeInBytes = BigInt(dataSource.sparkSession.sessionState.conf.defaultSizeInBytes)
  )

  override def newInstance(): LogicalPlan = this.copy(output = output.map(_.newInstance()))
}

// ...

// We have to pack in the V1 data source as a shim, for the case when a source implements
// continuous processing (which is always V2) but only has V1 microbatch support. We don't
// know at read time whether the query is conntinuous or not, so we need to be able to
// swap a V1 relation back in.
/**
 * Used to link a [[DataSourceV2]] into a streaming
 * [[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]. This is only used for creating
 * a streaming [[org.apache.spark.sql.DataFrame]] from [[org.apache.spark.sql.DataFrameReader]],
 * and should be converted before passing to [[StreamExecution]].
 */
case class StreamingRelationV2(
    dataSource: DataSourceV2,
    sourceName: String,
    extraOptions: Map[String, String],
    output: Seq[Attribute],
    v1Relation: Option[StreamingRelation])(session: SparkSession)
  extends LeafNode with MultiInstanceRelation {
  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
  override def isStreaming: Boolean = true
  override def toString: String = sourceName

  override def computeStats(): Statistics = Statistics(
    sizeInBytes = BigInt(session.sessionState.conf.defaultSizeInBytes)
  )

  override def newInstance(): LogicalPlan = this.copy(output = output.map(_.newInstance()))(session)
}
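The two readers are what actually define how the micro-batch and continuous modes fetch data.
StreamingRelation and StreamingRelationV2 merely wrap the data source. They extend catalyst.plans.logical.LeafNode and add no other behavior; they are pure wrapper classes.
All of this is, in effect, the construction of a logical plan: it produces a Dataset that carries a logical plan, so that when stream processing is triggered later, the plan is executed to produce the data.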
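The "V1 shim" idea from the source comment quoted above can be sketched as a wrapper that carries an optional fallback. The classes below only borrow the Spark names for readability. They are hypothetical plain Java, and resolve is an invented helper, not a Spark method.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class RelationShimSketch {
    // Hypothetical V1 leaf node: it only names its source.
    static final class StreamingRelation {
        final String sourceName;

        StreamingRelation(String sourceName) { this.sourceName = sourceName; }
    }

    // Hypothetical V2 leaf node: a pure wrapper plus an optional V1 fallback.
    static final class StreamingRelationV2 {
        final String sourceName;
        final Map<String, String> extraOptions;
        final List<String> output;
        final Optional<StreamingRelation> v1Relation;

        StreamingRelationV2(String sourceName, Map<String, String> extraOptions,
                            List<String> output, Optional<StreamingRelation> v1Relation) {
            this.sourceName = sourceName;
            this.extraOptions = extraOptions;
            this.output = output;
            this.v1Relation = v1Relation;
        }
    }

    // Invented resolution step, run once it is known whether V2 can be used.
    static String resolve(StreamingRelationV2 relation, boolean useV2) {
        if (useV2) {
            return "v2:" + relation.sourceName;
        }
        // Swap the V1 relation back in, or fail if the source has none.
        return relation.v1Relation
            .map(v1 -> "v1:" + v1.sourceName)
            .orElseThrow(() -> new IllegalStateException("no V1 fallback for " + relation.sourceName));
    }

    public static void main(String[] args) {
        StreamingRelationV2 relation = new StreamingRelationV2(
            "kafka",
            Collections.emptyMap(),
            Arrays.asList("key", "value"),
            Optional.of(new StreamingRelation("kafka")));
        System.out.println(resolve(relation, true));
        System.out.println(resolve(relation, false));
    }
}
```

The wrapper holds no data and performs no reads; it only records enough information for a later stage to decide which path to take.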
The start() method returns a StreamingQuery object; StreamingQuery is an interface.
aggResult.writeStream.format("kafka").option("", "").trigger(Trigger.Continuous(1000)) is a DataStreamWriter object.
DataStreamWriter defines a start method, and this method is the entry point from which the whole streaming program begins to execute.
Inside DataStreamWriter#start(), the branch taken by this example ends with a call to StreamingQueryManager#startQuery().
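Because load() cannot know the writer or the trigger (as the comment on v1DataSource says), the choice between micro-batch and continuous execution presumably has to wait until the query is started. The sketch below illustrates that deferred decision under this assumption. TriggerKind and chooseExecution are hypothetical names, and the rule shown is a simplification, not the code of StreamingQueryManager.

```java
final class StartQuerySketch {
    // Hypothetical trigger categories.
    enum TriggerKind { PROCESSING_TIME, ONCE, CONTINUOUS }

    // Assumed rule: continuous execution needs both a continuous trigger
    // and a sink that supports it; everything else runs as micro-batches.
    static String chooseExecution(boolean sinkSupportsContinuous, TriggerKind trigger) {
        if (sinkSupportsContinuous && trigger == TriggerKind.CONTINUOUS) {
            return "continuous";
        }
        return "micro-batch";
    }

    public static void main(String[] args) {
        System.out.println(chooseExecution(true, TriggerKind.CONTINUOUS));
        System.out.println(chooseExecution(true, TriggerKind.PROCESSING_TIME));
        System.out.println(chooseExecution(false, TriggerKind.CONTINUOUS));
    }
}
```

If the real decision works along these lines, the branch taken in load() would say nothing about which reader serves the running query, which would explain the debugging result above.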