Lesson 1 - Understanding the nature of Spark Streaming

What is Spark Streaming?

According to the official Apache Spark website, Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

Spark Streaming can integrate with different kinds of input sources, such as Kafka, Flume, HDFS/S3, Kinesis and Twitter. The input data is extracted and transformed by the Spark Streaming application, and the transformed data can be loaded into HDFS or into databases such as Redis, MySQL or Oracle. It can also feed a live dashboard.

Caption from the Apache Spark website:

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
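To make this concrete, here is a minimal word-count style sketch in the spirit of the official documentation (not the blacklist program analysed below); the hostname, port and 1-second batch interval are arbitrary choices for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingHelloWorld {
  def main(args: Array[String]): Unit = {
    // Collect one batch of input every second and hand it to the Spark engine.
    val conf = new SparkConf().setAppName("StreamingHelloWorld").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Each batch interval produces one RDD of the lines received on the socket.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()           // output operation: triggers a job for every batch

    ssc.start()              // start receiving and processing
    ssc.awaitTermination()   // run until the application is stopped
  }
}
```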


Spark Streaming Data Flow experiment

I will use the Blacklist Filtering example to look into the details of the data flow; a sketch of such a program is given below. Let me set the batch interval to 5 minutes and see what happens to the data flow.
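The exact source of the Blacklist Filtering example is not reproduced here; the following is a minimal sketch of what such a program might look like, assuming a socket source on HadoopS1:9999, a hard-coded blacklist, and the 5-minute batch interval used in this experiment. Note that it contains no reduceByKey of its own:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object BlacklistFiltering {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BlacklistFiltering")
    val ssc  = new StreamingContext(conf, Minutes(5))   // 5-minute batch interval

    // Hypothetical blacklist; in practice it could be loaded from a file or a database.
    val blacklistRDD = ssc.sparkContext.parallelize(Seq(("hacker1", true), ("hacker2", true)))

    // Hypothetical host/port for the socket source.
    val lines = ssc.socketTextStream("HadoopS1", 9999)

    // Assume each line is "<name> <message>"; drop the lines whose name is blacklisted.
    val whitelisted = lines
      .map(line => (line.split(" ")(0), line))
      .transform { rdd =>
        rdd.leftOuterJoin(blacklistRDD)
          .filter { case (_, (_, flagged)) => !flagged.getOrElse(false) }
          .map { case (_, (line, _)) => line }
      }

    whitelisted.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```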

From the History Server, the application takes a total of 4.9 minutes to complete.


Select the application you just ran and you can see clearly that it executed 5 jobs. Next, we will check them one by one.

Job Id 0

We will start checking the details with Job Id 0. As you can see, Job 0 has 2 stages, stage 0 and stage 1.

From the Spark Streaming UI, we can see that it executed a reduceByKey operation. But according to our program, we don't have any reduceByKey transformation, so why?


Job Id 0 – Stage 0

This is the first job inside the Spark application.


Job Id 0 – Stage 1

This tells us something very important: starting a Spark Streaming program automatically generates one or more Spark jobs of its own, containing operations such as reduceByKey that we never wrote in our code.


Next, go to Job Id 1 and check the details

Reading through the screenshot, you can see that the total duration of the application is 4.9 minutes; however, Job 1 alone ran for 4.5 minutes, which is around 90% of the whole Spark Streaming application.

What is inside Job 1, and why does it consume such a high percentage of the time? Let's dive in and see what's going on inside this job.



It was very surprising that a Spark Streaming program spends around 90% of its time on a job that creates a Receiver for receiving the streaming data, and that this job ran on a single node (HadoopS1) with only one task. The Receiver is started on an executor by a Spark job of its own, triggered by an action.

Have you ever thought about the difference between the job that starts the Receiver and a normal Spark job? There is in fact no difference, and this implies that we can create many Spark jobs in one application and that different jobs can cooperate with each other. The automatically created Receiver job in a Spark Streaming application serves as a very good starting point for collecting data that more complicated programs can then process, such as passing the streaming data to graph analysis or machine learning. A complicated Spark application usually consists of a number of jobs.
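As a loose analogy (plain Spark, not Streaming), here is a sketch showing that one application can submit several jobs against the same SparkContext, with later jobs reusing data cached by earlier ones; the names and numbers are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultiJobApplication {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MultiJobApplication").setMaster("local[2]"))

    val numbers = sc.parallelize(1 to 1000).cache()

    // Each action below submits a separate job to the same SparkContext;
    // Job 1 and Job 2 reuse the data cached while Job 0 was running.
    val total = numbers.sum()                       // Job 0
    val evens = numbers.filter(_ % 2 == 0).count()  // Job 1
    val top5  = numbers.top(5).toSeq                // Job 2

    println(s"sum=$total, evens=$evens, top5=$top5")
    sc.stop()
  }
}
```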

Another point to note is the locality level: the Receiver task runs at PROCESS_LOCAL, meaning it is scheduled in the same process as its data. The received data itself is kept in memory first and spills to disk only when memory is not enough, as sketched below.
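This in-memory-then-disk behaviour is governed by the receiver's storage level rather than by the locality level; a short spark-shell style sketch, reusing the `ssc` and the HadoopS1:9999 socket assumed in the blacklist sketch above:

```scala
import org.apache.spark.storage.StorageLevel

// socketTextStream stores received blocks as MEMORY_AND_DISK_SER_2 by default:
// serialized in memory, replicated to two executors, and spilled to local disk
// when memory is tight. The storage level can also be chosen explicitly,
// e.g. keep received blocks in memory only:
val memoryOnlyLines = ssc.socketTextStream("HadoopS1", 9999, StorageLevel.MEMORY_ONLY)
```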

Q: How many jobs are initiated by your Spark application, and how do the different jobs in the same program work with each other?

Q: But why does it take so long to create a Receiver, and what is it actually doing?



Next, go to Job ID 2 and check the details

It generated a BlockRDD that came from the socket text stream; this is the input DStream generating an RDD based on the time interval (300 seconds). While the job is running, data keeps arriving at regular intervals.

We receive data on one executor, but we process the data on the Spark cluster (3 worker nodes).
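One way to confirm this from the driver side is to inspect the RDD that the input DStream hands to each batch; a small sketch, reusing the `lines` DStream from the blacklist sketch above:

```scala
// For a socket source, each 300-second batch produces a BlockRDD whose partitions
// correspond to the blocks stored by the single receiver; the transformations on it
// then run as ordinary tasks across the whole cluster.
lines.foreachRDD { (rdd, time) =>
  println(s"batch $time -> ${rdd.getClass.getSimpleName} with ${rdd.partitions.length} partitions")
}
```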



Stage 3 of Job Id 2

Stage 4 of Job Id 2

Stage 5 of Job Id 2

Stage 6 of Job Id 3

Q: Why are there so many tasks being skipped?

Stage 11 of Job Id 4

In Conclusion

This exercise aimed to show readers the process flow inside a Spark Streaming application. There was only one piece of business logic, which filtered out a blacklist (e.g. names/IPs). However, when we looked into the DStream graph of the application, we found that it also automatically generated other jobs that the streaming program needs to function properly.

Obviously, I still have a few questions in my mind: what is the function of the other jobs, and why are they needed? (Those questions will not be answered at this moment.)

Spark Streaming is basically batch processing on a regular time interval: each interval generates an RDD, and the resulting sequence of RDDs is called a DStream. The time interval can be configured down to the millisecond, as sketched below. In our example, we configured the time interval to 5 minutes, which means that every 5 minutes the program generates an RDD and an action on it triggers a job.
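For reference, the batch interval is just a Duration chosen when the StreamingContext is created; a brief sketch of the available helpers:

```scala
import org.apache.spark.streaming.{Milliseconds, Minutes, Seconds}

// Any of these Durations could be passed as the batch interval:
val halfSecond = Milliseconds(500)   // sub-second batches are possible
val oneSecond  = Seconds(1)          // common in the documentation examples
val fiveMins   = Minutes(5)          // the 300-second interval used in this post

// e.g. new StreamingContext(conf, fiveMins)
```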

The DStreams also form a DStream graph that describes the dependencies between the different operations, which is the same concept as the RDD DAG.
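One way to see this correspondence is to print the lineage of the RDD produced in each batch; a sketch reusing the `whitelisted` DStream from the blacklist example above:

```scala
// The DStream graph is translated into an RDD DAG once per batch;
// toDebugString shows the chain of RDDs back to the receiver's BlockRDD.
whitelisted.foreachRDD { rdd =>
  println(rdd.toDebugString)
}
```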

 

Bullet Points:

1. Spark Streaming is an extension of the core Spark API.
2. DStream = RDD + Time Interval
3. The Receiver is one of the jobs executed on an executor, and it consumes around 90% of the duration of this Spark Streaming application.




Thanks for reading

Janice

——————————————————————————————–
Reference: DT大數據夢工廠 Spark Customized Course – Lesson 1: A thorough understanding of Spark Streaming through case studies, part 1 of 3: decoding the Spark Streaming alternative experiment and analysing the essence of Spark Streaming

Sharing is Good, Learning is Fun. "Today is brutal, tomorrow is more brutal, and the day after tomorrow is beautiful. But most people die tomorrow night and never see the sun of the day after tomorrow." – Jack Ma (馬雲)
