Major open-source release: KSQL, a streaming SQL engine for Apache Kafka (2017-08-29)

Date: 2019-11-05

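Neha Narkhede, one of the creators of Kafka, published a blog post on the Confluent site introducing KSQL, a streaming SQL engine newly introduced for Kafka. KSQL is meant to lower the barrier to entry for stream processing by providing a simple, fully interactive SQL interface for working with Kafka data. It currently supports a range of streaming operations, including aggregations, joins, time windows, and sessionization.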

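Main differences from traditional SQL

KSQL differs considerably from the SQL used in relational databases. A traditional SQL statement is a one-off operation: whether it is a query or an update, it runs against the data set as it exists at that moment. KSQL queries and updates, by contrast, run continuously, and the data set can keep growing without bound. What KSQL performs is in effect a transformation, that is, stream processing.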
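The contrast between a one-off query and a continuous one can be sketched in plain Python. This is an illustration of the idea only, not KSQL itself; the function names `one_shot_total` and `continuous_totals` and the sample values are hypothetical.

```python
from typing import Iterable, Iterator, List


def one_shot_total(rows: List[int]) -> int:
    """Traditional query: runs once over the data set as it exists right now."""
    return sum(rows)


def continuous_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming query: emits an updated result each time a new event arrives."""
    total = 0
    for value in events:
        total += value
        yield total  # the result is refreshed per event; the query never "finishes"


if __name__ == "__main__":
    print(one_shot_total([5, 3, 7]))
    for running_total in continuous_totals([5, 3, 7]):
        print(running_total)
```

The generator stands in for an unbounded source: as long as events keep arriving, it keeps producing new results, which is the behavior the paragraph above describes for KSQL queries.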

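Use cases for KSQL

1. Real-time monitoring

On the one hand, KSQL can be used to define custom business-level metrics that are available in real time. Low-level metrics cannot tell us how an application actually behaves, so defining metrics on the raw events the application generates gives a better picture of its health. On the other hand, KSQL can be used to define criteria for an application and to check whether its behavior in production meets expectations.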

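2. Security detection

KSQL converts event streams into numeric time-series data, which a visualization tool can then display in a UI. This makes it possible to detect many kinds of security threats, such as fraud and intrusion. For this purpose KSQL offers a real-time, simple, and complete solution.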

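3. Online data integration

Most data processing goes through an ETL (Extract, Transform, Load) process, and such systems usually do their work in scheduled batch jobs, whose latency is often unacceptable. With KSQL and Kafka connectors, batch data integration can be turned into online data integration. For example, a stream-table join can enrich the events in a stream with metadata stored in a table, or sensitive information can be filtered out of the data before it is sent to other systems.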
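The two examples in this paragraph, enriching events from a table and filtering out sensitive fields, can be sketched in plain Python as follows. This only mimics the semantics of a stream-table left join and is not KSQL or a Kafka connector. The `USERS` table, the `card_number` field, and the function name `enrich_and_mask` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical reference table keyed by user id (the "table" side of the join).
USERS: Dict[str, Dict[str, str]] = {
    "u1": {"level": "Platinum"},
    "u2": {"level": "Gold"},
}

# Assumed name of a sensitive field; chosen only for this example.
SENSITIVE_FIELDS = {"card_number"}


def enrich_and_mask(
    events: Iterable[dict], users: Dict[str, Dict[str, str]]
) -> Iterator[dict]:
    """Process events one at a time, as they arrive, instead of in a batch job."""
    for event in events:
        # Drop sensitive fields before the event is passed downstream.
        clean = {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}
        # Left join against the table: unknown users get level None.
        clean["level"] = users.get(event["userid"], {}).get("level")
        yield clean
```

Each event is handled as soon as it is read, so the latency is per event rather than per batch run.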

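4. Application development

For complex applications, Kafka's native Streams API may be the better fit. For simple applications, or for people who would rather not program in Java, KSQL is the better choice.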

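Core abstractions in KSQL

KSQL is built on Kafka's Streams API, so its two core concepts are the stream and the table. A stream is an unbounded sequence of structured data: new data can be appended continuously, but data already in the stream never changes; it is neither modified nor deleted. A table is a view of a stream, or put differently, it represents a collection of mutable data. It resembles a traditional database table, except that it carries some streaming semantics such as time windows, and the data in it can change. KSQL brings streams and tables together, allowing a table that represents current state to be joined with a stream that represents events as they happen.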
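One way to picture the relationship between a stream and a table is the plain-Python sketch below. It is not KSQL or the Streams API; the `(key, value)` event shape and the names `apply_to_table` and `materialize` are hypothetical. The stream is an append-only list of events, and the table is the mutable state obtained by applying those events in order.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (key, new_value): a hypothetical event shape


def apply_to_table(table: Dict[str, int], event: Event) -> None:
    """Apply one event to the table: the latest value per key wins."""
    key, value = event
    table[key] = value


def materialize(stream: Iterable[Event]) -> Dict[str, int]:
    """Build the table (current state) by replaying the stream in order."""
    table: Dict[str, int] = {}
    for event in stream:
        apply_to_table(table, event)
    return table


if __name__ == "__main__":
    stream = [("alice", 10), ("bob", 5), ("alice", 12)]
    print(materialize(stream))  # "alice" was updated in the table
    print(len(stream))          # the stream still holds every event
```

The stream keeps both events for `alice`, while the table holds only her latest value, which matches the immutable-stream and mutable-table distinction drawn above.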

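KSQL architecture

KSQL runs as a standalone server. Multiple KSQL servers can form a cluster, and server instances can be added dynamically. The cluster is fault tolerant: if one server fails, the others take over its work. The KSQL command-line client issues queries to the cluster through a REST API and can be used to inspect streams and tables, query data, and check the status of queries. Because it is built on the Streams API, KSQL inherits the Streams API's elasticity, state management, and fault tolerance, as well as exactly-once semantics. The KSQL server embeds these capabilities and adds a distributed SQL engine, automatic bytecode generation to improve query performance, and a REST API for running queries and for administration.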

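Kafka + KSQL aim to upend the traditional database

A traditional relational database is centered on tables, and the log is merely an implementation detail. In an event-centric world the situation is reversed: the log becomes the core and tables are derived from it. New events are continuously appended to the log, and the state of the tables changes as a result. With Kafka as the central log and KSQL as the engine, we can create the materialized views we want, and those views are kept continuously up to date.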

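The future of KSQL

KSQL is currently in developer preview, and its authors are still collecting feedback from the community. Future plans include more features, among them support for richer SQL syntax, with the aim of making KSQL production ready.

A quick start guide and a demo application are available for KSQL. Feedback can be given to the authors in the #KSQL channel on Slack, and bugs can be reported on GitHub.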

KSQL - Streaming SQL for Apache Kafka

KSQL is now GA and officially supported by Confluent Inc. Get started with KSQL today.

KSQL is the streaming SQL engine for Apache Kafka. It provides a simple and completely interactive SQL interface for stream processing on Kafka; no need to write code in a programming language such as Java or Python. KSQL is distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more. You can find more KSQL tutorials and resources here if you are interested.

Click here to watch a screencast of the KSQL demo on YouTube.

Getting Started and Download

Stable Releases

Stable releases are published every four months and are officially supported by Confluent.

Download latest stable KSQL, which is included in Confluent Platform.
Follow the Quick Start.
Read the KSQL Documentation, notably the KSQL Tutorials and Examples, which include Docker-based variants.

Preview Releases

In addition to supported stable KSQL releases, we also provide preview releases. We encourage you to try them in development and testing environments and to take advantage of Confluent Community resources to get help and share feedback.

Download latest KSQL Preview.

Documentation

See KSQL documentation for the latest stable release.

Use Cases and Examples

Streaming ETL

Apache Kafka is a popular choice for powering data pipelines. KSQL makes it simple to transform data within the pipeline, readying messages to cleanly land in another system.

CREATE STREAM vip_actions AS
  SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';

Anomaly Detection

KSQL is a good fit for identifying patterns or anomalies in real-time data. By processing the stream as data arrives, you can identify and properly surface out-of-the-ordinary events with millisecond latency.

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count(*) > 3;
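To make the windowing in this query concrete, here is a plain-Python sketch of a tumbling-window count with a threshold. It is an approximation under stated assumptions, not how KSQL executes the query: it works on a finite list rather than a continuously updated table, and the `(timestamp_ms, card_number)` input shape and the function name `possible_fraud` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

WINDOW_SIZE_MS = 5_000  # mirrors SIZE 5 SECONDS in the query
THRESHOLD = 3           # mirrors HAVING count(*) > 3


def possible_fraud(
    attempts: Iterable[Tuple[int, str]]
) -> List[Tuple[int, str, int]]:
    """Count attempts per (window, card) and keep the groups above the threshold.

    Each attempt is (timestamp_ms, card_number). Returns a sorted list of
    (window_start_ms, card_number, count).
    """
    counts: Counter = Counter()
    for timestamp_ms, card_number in attempts:
        # Tumbling windows are fixed-size and non-overlapping, so every
        # timestamp falls into exactly one window.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_SIZE_MS)
        counts[(window_start, card_number)] += 1
    return sorted(
        (start, card, n) for (start, card), n in counts.items() if n > THRESHOLD
    )
```

Four attempts on the same card inside one five-second window would be reported, while the same four attempts spread across two windows would not.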

Monitoring

Kafka's ability to provide scalable ordered messages with stream processing makes it a common solution for log data monitoring and alerting. KSQL lends a familiar syntax for tracking, understanding, and managing alerts.

CREATE TABLE error_counts AS
  SELECT error_code, count(*) FROM monitoring_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE type = 'ERROR' GROUP BY error_code;

Join the Community

You can get help, learn how to contribute to KSQL, and find the latest news by connecting with the Confluent community.

Ask a question in the #ksql channel in our public Confluent Community Slack. Account registration is free and self-service.
Join the Confluent Google group.

Contributing

Contributions to the code, examples, documentation, etc. are very much appreciated.

Report issues and bugs directly in this GitHub project.
Learn how to work with the KSQL source code, including building and testing KSQL as well as contributing code changes to KSQL by reading our Development and Contribution guidelines.
One good way to get started is by tackling a newbie issue.