Spark Structured Streaming Framework (1): Basic Usage

Date: 2019-12-12


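Spark Structured Streaming is a stream-processing engine newly added to Spark around version 2.1.0 (it first appeared as an experimental API in Spark 2.0). This blog will cover the framework in detail over several posts. This first post introduces the basic development approach, using one of the examples that ships with Spark, the file "StructuredNetworkWordCount.scala", for testing and explanation.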

1. Quick Example

Because we are testing on a single machine, the example has to be changed to run in local mode. The modified program is as follows:

package org.apache.spark.examples.sql.streaming

import org.apache.spark.sql.SparkSession

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network.
 *
 * Usage: StructuredNetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Structured Streaming
 * would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example sql.streaming.StructuredNetworkWordCount
 *    localhost 9999`
 */
object StructuredNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StructuredNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val host = args(0)
    val port = args(1).toInt

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCount")
      .master("local[*]") // added so the example runs on a single machine
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
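
Hard-coding .master("local[*]") is convenient for a single-machine test, but a master set in code also takes precedence when the program is later submitted to a cluster. One possible alternative, offered as a sketch and not as part of the original example, is to fall back to local mode only when no master has been configured. The names LocalFallback and withLocalFallback are hypothetical; SparkConf.setIfMissing and SparkSession.Builder.config(SparkConf) are existing Spark APIs.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the Spark example.
object LocalFallback {
  // Keeps an already configured master and uses local[*] only when none is set.
  def withLocalFallback(conf: SparkConf): SparkConf =
    conf.setIfMissing("spark.master", "local[*]")

  def main(args: Array[String]): Unit = {
    // new SparkConf() also picks up spark.* system properties,
    // which is how a master passed to spark-submit becomes visible here.
    val conf = withLocalFallback(new SparkConf().setAppName("StructuredNetworkWordCount"))

    val spark = SparkSession.builder.config(conf).getOrCreate()
    println(s"Running with master = ${spark.sparkContext.master}")
    spark.stop()
  }
}
```

Run from an IDE with no master configured, this starts in local mode; launched through spark-submit with an explicit master, that master is kept.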

2. Analysis

The program above can be read and analyzed as follows:

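2.1 Data Input

After the SparkSession object has been created, the type of the data source and its configuration must be set. Data is then received continuously from that source and exposed as a DataFrame. This is the same type that Spark SQL uses; a DataFrame is simply a Dataset of Row objects (Dataset[Row]).

As the code below shows, the source type is set to socket and configured with the host address and port of the socket server. The received stream is held in lines, whose type is DataFrame.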

// Create DataFrame representing the stream of input lines from connection to host:port
val lines = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .load()
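
To see what the socket source actually hands back, the DataFrame can be inspected before any query is started. The following is a minimal sketch under stated assumptions: the object name SocketSourceSchema and the helper describe are hypothetical, and a throwaway ServerSocket stands in for `nc -lk 9999` because some Spark versions may already try to connect while load() runs.

```scala
import java.net.ServerSocket

import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the Spark example.
object SocketSourceSchema {
  // Returns (isStreaming, "name: type" for every column) of a socket-backed stream.
  def describe(spark: SparkSession, host: String, port: Int): (Boolean, Seq[String]) = {
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    val columns = lines.schema.fields
      .map(f => s"${f.name}: ${f.dataType.simpleString}")
      .toSeq
    (lines.isStreaming, columns)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local listener on a free port replaces the Netcat server.
    val server = new ServerSocket(0)
    val spark = SparkSession.builder
      .appName("SocketSourceSchema")
      .master("local[*]")
      .getOrCreate()
    try {
      val (streaming, columns) = describe(spark, "localhost", server.getLocalPort)
      println(s"isStreaming = $streaming") // true: the DataFrame is backed by a stream
      columns.foreach(println)             // a single string column named "value"
    } finally {
      spark.stop()
      server.close()
    }
  }
}
```

The single column named value is the reason the word-count step later groups by "value".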

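2.2 Word Count

As the code below shows, the received stream lines is first converted to a Dataset[String]; each incoming line is then split on spaces into individual words; finally, the words are grouped and the occurrences of each word are counted. The grouping column is "value" because a Dataset[String] has a single column with that name.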

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()
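
Because a streaming DataFrame supports the same operators as a static one, the split-and-count logic can be tried on a fixed input first. The sketch below applies the same flatMap and groupBy("value").count() to an in-memory Dataset[String]; the object name StaticWordCount, the helper countWords, and the sample sentences are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the Spark example.
object StaticWordCount {
  // Runs the word-count transformations from the example on a static Dataset.
  def countWords(spark: SparkSession, input: Seq[String]): Map[String, Long] = {
    import spark.implicits._

    input.toDS()               // Dataset[String] with one column named "value"
      .flatMap(_.split(" "))   // one row per word
      .groupBy("value")        // group identical words
      .count()                 // columns: value, count
      .as[(String, Long)]      // tuple columns are matched by position
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StaticWordCount")
      .master("local[*]")
      .getOrCreate()
    try {
      // Counts: spark -> 2, apache -> 1, streaming -> 1 (print order may vary)
      countWords(spark, Seq("apache spark", "spark streaming")).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

In the streaming program the same counts are maintained as a running total across all lines received so far.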

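2.3 Data Output

The writeStream method of the DataFrame returns a DataStreamWriter object; the DataStreamWriter class defines the ways in which the data can be output. The quick example writes its results to the console. Note that the streaming computation only starts once start() is called. start() returns a StreamingQuery object, which can be used to manage the running query. In the program above, awaitTermination() is called to block the main thread, so the query keeps receiving and processing data until it is terminated.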
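
The role of the StreamingQuery handle can be illustrated with a short sketch. It is not the author's program: to avoid needing a Netcat server it swaps the socket source for Spark's built-in "rate" test source (assumed available, Spark 2.2 or later), and the names QueryLifecycleSketch and run are hypothetical. It uses the timed variant awaitTermination(timeoutMs), which returns whether the query terminated within the timeout, and then stops the query explicitly.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the Spark example.
object QueryLifecycleSketch {
  // Returns (active after start, terminated within waitMs, active after stop).
  def run(spark: SparkSession, waitMs: Long): (Boolean, Boolean, Boolean) = {
    // Assumption: the "rate" source generates rows locally in place of the socket source.
    val ticks = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5L)
      .load()

    // Nothing is executed until start() is called.
    val query = ticks.writeStream
      .outputMode("append")
      .format("console")
      .start()

    val activeAfterStart = query.isActive
    // Blocks for at most waitMs; false means the query is still running.
    val terminated = query.awaitTermination(waitMs)
    query.stop()
    (activeAfterStart, terminated, query.isActive)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("QueryLifecycleSketch")
      .master("local[*]")
      .getOrCreate()
    try {
      println(run(spark, 5000L))
    } finally {
      spark.stop()
    }
  }
}
```

The word-count program uses the untimed awaitTermination() instead, so it runs until the query is stopped or fails.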