SQL On Streaming

時間 2019-11-10

標籤 sql streaming 欄目 SQL 简体版

原文原文鏈接

此文已由做者嶽猛受權網易雲社區發佈。
html

歡迎訪問網易雲社區，瞭解更多網易技術產品運營經驗。sql

實時計算的一個方向

實時計算將來會成爲一個趨勢，基本上全部的離線計算任務都能經過實時計算來完成，對於實時計算來算，除了性能，延遲性和吞吐量這些硬指標要求之外，我以爲易用性上面應該是將來的一個發展方向，畢竟如今的實時計算入storm，flink，sparkstreaming等都是經過API來進行的，這些使用起來都不太方便，後續更大的一個側重方向應該是SQL ON STREAMING，對storm瞭解不是不少，可是有些公司已經針對storm進行了sql封裝,下面只想談下兩個比較流行的開源流計算引擎對SQL的封裝粒度。apache

Flink

SQL on Streaming Tables安全

code examplesapp

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)// read a DataStream from an external sourceval ds: DataStream[(Long, String, Integer)] = env.addSource(...)// register the DataStream under the name "Orders"tableEnv.registerDataStream("Orders", ds, 'user, 'product, 'amount)// run a SQL query on the Table and retrieve the result as a new Table
val result = tableEnv.sql( "SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")

限制

  1.2版本 只支持SELECT, FROM, WHERE, and UNION，不支持聚合，join操做，感受離真正的使用仍是有一段距離要走。

spark 2.0，structure streaming

code examplesoop

import org.apache.spark.sql.functions._import org.apache.spark.sql.SparkSession
val input = spark.readStream.text("file:///home/hadoop/data/")
val words = input.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream.outputMode("complete").format("console").start
query.awaitTermination

限制

output mode只實現了兩種，且有限制性能

Append mode (default)
This is the default mode, where only the new rows added to the result table since the last trigger will be outputted to the sink. This is only applicable to queries that do not have any aggregations (e.g. queries with only select, where, map, flatMap, filter,join, etc.).
Complete mode
The whole result table will be outputted to the sink.This is only applicable to queries that have aggregations