The original article is available at:
https://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650769273&idx=1&sn=195b25c91f476aba7b7c6cbf1c902f2e&chksm=f3f932ecc48ebbfa7cf8742ede5956c19ab5ca0bad839fd00c4fcf73de3abf6bd8c7d2eb437b&scene=0#rd
Some images from the author's article are quoted here; their copyright belongs to the author. Images downloaded from the internet likewise belong to their original owners.
Personally I find the article rather one-sided: the diagrams and data the author uses diverge widely from the data-warehouse bible, Dimensional Modeling (the Chinese edition 《維度建模》). Putting data loading and data transformation in the same layer also clashes with ETL. The steps of ETL are Extract, Transform, Load, so placing them in one layer fails to convey their ordering and can easily confuse readers who are just starting a data-warehouse implementation. Transform first, then load, and perhaps a second or third round of transformation afterwards; these steps should really run as interleaved iterations.
Below is the author's original figure:
Data warehouses come in two broad architectures: the data marts advocated by Kimball and the centralized data warehouse advocated by Inmon. Data marts split data by subject area and flow it back to the individual business units for information retrieval. A centralized data warehouse fuses all subject areas together to support broader, cross-subject analysis; before that, the operational data store (ODS) layer has already loaded data from each business system into a staging layer using a snowflake schema, where the business systems can pick up aggregated information.
The author's figure does not adequately show the distinguishing features of the Kimball and Inmon architectures, so I found another diagram online that is easier to understand. Note that a Cube is just one MOLAP implementation and only one part of a data warehouse.
Briefly, the advantages of a distributed computing platform over a traditional data warehouse built on a commercial database:
By pushing computation out to the storage nodes closest to the data, parallel computation becomes possible.
Splitting a large dataset into small chunks scattered across different storage nodes is the precondition for distributed computation.
After distributed-storage operations such as partitioning and sharding of databases and tables, the system records the resulting structural metadata, manages it for high availability, and exposes a routing function to applications, so that incoming query requests are dispatched to the appropriate data nodes for computation.
All of this is hard to deploy quickly on Oracle, SQL Server, MySQL, or PostgreSQL. A small cluster of 5-10 machines is still manageable, but past 100 nodes the management difficulty and cost climb sharply.
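The routing idea in the list above can be sketched with a toy hash-based router. This is purely illustrative: the class and node names are made up, and real systems add replication, rebalancing, and failure handling.

```python
# Toy sketch of distributed-storage routing: queries for a key are
# dispatched to the node that holds that key's shard.
import hashlib

class ShardRouter:
    def __init__(self, nodes):
        self.nodes = list(nodes)  # e.g. ["node-0", "node-1", "node-2"]

    def route(self, key):
        # Stable hash so the same key always lands on the same node.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

router = ShardRouter(["node-0", "node-1", "node-2"])
assignment = {k: router.route(k) for k in ["order:1", "order:2", "user:9"]}
```

The application only talks to the router; where the data physically lives is the routing layer's concern, which is exactly what is hard to bolt onto a single-node commercial database.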
I do not think a data warehouse already built on a commercial database platform needs to be torn down and redone on Hadoop; on this point I disagree with the author.
a) The data warehouse can perfectly well serve as a data source fed into a distributed system for computation.
b) The distributed system can act as the warehouse's compute engine, simply supplying processing power.
c) The distributed system can flow aggregated data and fast-computation results back to the warehouse.
d) As needs arise, other subject-area models and computations can be built out in a new distributed system.
Borrowing the author's diagram: we can combine a data warehouse with a distributed Hadoop cluster to implement result storage plus a search engine, with Sqoop as the transport channel between the warehouse and Hadoop. This flows distributed compute power back to the warehouse, while presentation and analysis work can still connect to the warehouse; ad hoc queries that need heavy computation can connect directly to the Hadoop cluster.
In the original article, the author describes it like this:
Building on the traditional big data architecture, the streaming architecture is quite radical: it rips out batch processing entirely, and data is handled as a stream from end to end. There is no ETL at the ingestion side; it is replaced by a data channel. Data refined by stream processing is pushed directly to consumers as messages. There is a storage component, but it stores data mostly in the form of windows, so that storage happens not in the data lake but in peripheral systems.
This brings up the concept of a "data lake" (Data Lake), which deserves a brief introduction:
A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Excerpt from Wikipedia: https://en.wikipedia.org/wiki/Data_lake
In plain terms: a data lake is a single store holding all enterprise data in its natural format (object blobs or files), both raw source-system data and data transformed for reporting, visualization, analytics, and machine learning, spanning structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).
The author gives no sound account of where messages are stored, and separating messages from the data lake strikes me as wrong. By the Wikipedia definition, all stored data is archived in the data lake; messages, at the very least, should count as part of it.
The figure above is hard to follow and does not show the three elements of the Lambda Architecture well, so it is worth explaining Lambda's elements and including the widely accepted Lambda architecture diagram.
Wikipedia defines the Lambda architecture as follows:
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.
Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
In short: Lambda is a data-processing architecture that exploits the respective strengths of batch processing and stream processing, balancing latency, throughput, and fault tolerance. Batch processing produces correct, deeply aggregated data views, while stream processing serves online data analysis; the two outputs can be merged before presentation. The rise of Lambda is inseparable from big data, real-time analytics, and the push against map-reduce latency. Lambda depends on an append-only data source: historical data never changes, what changes is always the newest arriving data, and it never overwrites history. The state and attributes of any event or object must be inferred from the time-ordered events.
So in this architecture we can see both up-to-the-moment data and historical aggregated data.
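The append-only idea can be sketched in a few lines: state is never stored as mutable truth, only derived from the ordered event log. A toy example, not tied to any particular framework:

```python
# Toy append-only event log: state is derived from ordered events,
# never overwritten in place.
events = [
    {"ts": 1, "account": "a", "delta": +100},
    {"ts": 2, "account": "a", "delta": -30},
    {"ts": 3, "account": "b", "delta": +50},
]

def balances(log):
    state = {}
    for e in sorted(log, key=lambda e: e["ts"]):  # natural time order
        state[e["account"]] = state.get(e["account"], 0) + e["delta"]
    return state

# Appending a new event changes the derived state; history stays intact.
events.append({"ts": 4, "account": "a", "delta": +5})
```

Corrections arrive as new events rather than edits, which is what makes full recomputation (and hence error recovery) possible.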
Wikipedia describes Lambda's three classic elements:
Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. [3]:13 The processing layers ingest from an immutable master copy of the entire data set.
In short: Lambda comprises three layers: batch processing (batch), speed or real-time processing (speed/real-time), and a serving layer that answers queries.
One distinctive point of "Mainstream Big Data Architectures" is its especially rich description of data sources, which means Lambda could even be implemented with traditional commercial databases: the batch side with commercial ETL tools such as SSIS or Informatica, the stream-processing side with a message broker such as RabbitMQ or ActiveMQ.
Wikipedia sees it differently, though: only distributed systems in the mold of Hadoop qualify as components of a Lambda architecture, as is clear from its definitions of the three elements:
Batch layer
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures
In short: the batch layer uses the processing power of a distributed system to deliver reliable, accurate data views. Whenever bad data is found, the full dataset can simply be recomputed to produce fresh results. Hadoop is regarded as the de facto standard for this class of high-throughput architecture.
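A minimal sketch of the batch layer's recompute-everything behavior, with plain Python standing in for a MapReduce job:

```python
# Toy batch layer: recompute the full view from the complete master
# dataset, then replace the old precomputed view wholesale.
master_dataset = [
    ("page:home", 1), ("page:docs", 1), ("page:home", 1),
]

def recompute_view(dataset):
    view = {}
    for key, count in dataset:  # a full scan, as a batch job would do
        view[key] = view.get(key, 0) + count
    return view

batch_view = recompute_view(master_dataset)

# A late-arriving correction is handled by appending to the master
# dataset and recomputing, not by patching the old view in place.
master_dataset.append(("page:docs", 1))
batch_view = recompute_view(master_dataset)
```

Accuracy comes from always processing the complete dataset; the price is latency, which the speed layer exists to cover.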
Speed layer
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.
Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL databases
In short: the speed layer integrates data into the store using real-time computation. Its results are not as complete or precise as what batch processing eventually produces, but they make up for batch processing's poor timeliness. Typical engines here are Apache Storm, SQLstream, and Apache Spark, with output usually landing in a NoSQL database.
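The speed layer's incremental, low-latency updates can be sketched like this (a toy stand-in for a Storm or Spark job, names invented for illustration):

```python
# Toy speed layer: update the real-time view incrementally per event,
# instead of rescanning the whole dataset like the batch layer.
realtime_view = {}

def on_event(key, count=1):
    # O(1) per event: low latency, but mistakes are never revisited;
    # the batch layer's recomputed view eventually supersedes this one.
    realtime_view[key] = realtime_view.get(key, 0) + count

for event in ["page:home", "page:home", "page:docs"]:
    on_event(event)
```

This is the "gap filler" Wikipedia describes: fast but approximate views over the most recent data.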
Serving layer
Output from the batch and speed layers are stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, MongoDB, VoltDB or Elasticsearch for speed-layer output, and Elephant DB, Apache Impala or Apache Hive for batch-layer output.
In short: this layer serves ad hoc queries and analysis for end data consumers; once the final computed results land here, they can be served externally. Speed-layer output can be handed to Apache Cassandra, HBase, MongoDB, or Elasticsearch; batch-layer output to Apache Impala or Hive.
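How the serving layer answers a query by combining the two views can be sketched as follows, with toy dictionaries standing in for the serving stores:

```python
# Toy serving layer: answer queries by merging the (accurate but stale)
# batch view with the (fresh but approximate) real-time view.
batch_view = {"page:home": 100, "page:docs": 40}   # complete up to hour H
realtime_view = {"page:home": 3, "page:about": 1}  # events since hour H

def query(key):
    # The two view outputs are joined at query time, as in the
    # Wikipedia definition of Lambda.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

When the next batch run lands, the real-time portion it covers is discarded and the merge starts over from the new batch view.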
Elasticsearch provides full-text indexing, and Impala is a project for Cube-like analytical applications, so the claim in "Mainstream Big Data Architectures" that the Cube is the center of the warehouse is not quite right. As I see it, the center should be a grand composite of derived data systems: the traditional data warehouse (Dimension/Fact), plus full-text indexing, OLAP applications, and more.
Some applications of the Lambda architecture:
Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses Druid for storing and serving both the streamed and batch-processed data.
For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Apache Hadoop, and Druid.
The Netflix Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views. Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.
Metamarkets, a company in programmatic advertising analytics, built a version of the Lambda architecture on Druid's columnar store, which serves both the streamed and the batch-processed data.
Yahoo took a similar stack, building Lambda with Storm, Hadoop, and Druid.
Netflix's Suro project separates the processing paths for online and offline data; latency-insensitive data still goes through map-reduce style processing. It does not strictly follow Lambda, but in essence it is no different.
I found a diagram online much like the author's. In my view neither draws out the Stream and Serving layers, and neither calls out the data flow from messaging to "Raw Data Reserved", which is quite confusing.
I can only treat it as a target store for the message queue.
All data takes the real-time path; everything is a stream, with the data lake as the final destination. In essence this is still Lambda as the foundation, only with the batch layer removed, leaving the Streaming and Serving layers.
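This stream-only design (commonly called the Kappa architecture) hinges on one trick: history is recomputed by replaying the same stream job over a retained log. A toy sketch, no Kafka involved:

```python
# Toy Kappa: one stream-processing function serves both live updates
# and full recomputation, by replaying the retained log from offset 0.
log = []   # retained, ordered event log (stands in for a Kafka topic)
view = {}

def process(event, target):
    target[event["key"]] = target.get(event["key"], 0) + event["n"]

def append(event):
    log.append(event)       # retain the event
    process(event, view)    # live path updates the view immediately

def replay():
    # "Batch" in Kappa is just the same code re-run over the whole log.
    fresh = {}
    for event in log:
        process(event, fresh)
    return fresh

for e in [{"key": "a", "n": 1}, {"key": "b", "n": 2}, {"key": "a", "n": 3}]:
    append(e)
```

Because replay uses the same code path as live processing, there is only one codebase to maintain, which is Kappa's main selling point over Lambda.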
Combining the two figures above with the one below may make this clearer.
Some variants of social network applications, devices connected to a cloud based monitoring system, Internet of things (IoT) use an optimized version of Lambda architecture which mainly uses the services of speed layer combined with streaming layer to process the data over the data lake.
Kappa architecture can be deployed for those data processing enterprise models where:
Multiple data events or queries are logged in a queue to be catered against a distributed file system storage or history.
The order of the events and queries is not predetermined. Stream processing platforms can interact with database at any time.
It is resilient and highly available as handling Terabytes of storage is required for each node of the system to support replication.
The above mentioned data scenarios are handled by exhausting Apache Kafka which is extremely fast, fault tolerant and horizontally scalable. It allows a better mechanism for governing the data-streams. A balanced control on the stream processors and databases makes it possible for the applications to perform as per expectations. Kafka retains the ordered data for longer durations and caters the analogous queries by linking them to the appropriate position of the retained log. LinkedIn and some other applications use this flavor of big data processing and reap the benefit of retaining large amount of data to cater those queries that are mere replica of each other.
The Internet of Things (IoT) finds this architecture irresistible: for something like running a red light, analyzing or alerting after the fact is of little use.
Kafka plays the role of real-time data distribution here, and its speed, fault tolerance, and horizontal scalability are all outstanding.
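The point above about Kafka retaining ordered data and serving repeated queries from a position in the retained log can be sketched with a toy offset-based log (illustrative only; real Kafka tracks offsets per partition and consumer group):

```python
# Toy retained log with consumer offsets: a repeated query resumes from
# its recorded position instead of reprocessing everything, while a new
# consumer can replay the whole retained history from offset 0.
log = ["e1", "e2", "e3", "e4", "e5"]
offsets = {}  # consumer -> next position to read

def read(consumer, max_events=2):
    pos = offsets.get(consumer, 0)
    batch = log[pos:pos + max_events]
    offsets[consumer] = pos + len(batch)   # commit the new position
    return batch

first = read("dashboard")
second = read("dashboard")
fresh = read("audit", max_events=5)  # a new consumer starts from 0
```

Retention plus offsets is what lets Kafka serve "analogous queries" cheaply: each consumer just links back to the appropriate position of the log.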
It has to be said: the author of "Mainstream Big Data Architectures" even spells unified as "unifield", which makes one wonder whether he actually knows what unified means.
The concept of Unified Lambda
Lambda is inherently quite complex. To overcome that complexity, architects have searched for all kinds of alternatives, but they never escape these three:
1) Adopt a pure streaming approach and use a nimble framework such as Apache Samza to accomplish a form of batch processing. Samza relies on Kafka, so its replayable, ordered partitions can be consumed in sequence to achieve batching;
2) Take the opposite extreme and use micro-batches to achieve near-real-time processing. Spark works this way, with batch intervals down to the second.
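Micro-batching as in point 2 can be sketched like this (plain Python; real Spark Streaming groups events by a time interval rather than a fixed count, and the class name here is invented):

```python
# Toy micro-batching: buffer incoming events and process them in small
# fixed-size batches, trading a little latency for batch semantics.
processed_batches = []

class MicroBatcher:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def receive(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # Each flushed buffer is one small "batch job".
            processed_batches.append(list(self.buffer))
            self.buffer.clear()

mb = MicroBatcher(batch_size=2)
for e in [1, 2, 3, 4, 5]:
    mb.receive(e)
mb.flush()  # drain whatever is left in the tail
```

The smaller the batch, the closer this gets to true streaming; the larger, the closer to classic batch processing. The interval (or size) is the knob.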
3) Summingbird, open-sourced by Twitter back in 2013, is a framework supporting batch and real-time processing at the same time: a Scala API wraps all Batch and Speed layer operations, with the Batch layer running on Hadoop and the Speed layer on Storm, all behind the one abstraction. Lambdoop works on the same principle, a single API wrapping both real-time and batch processing; unfortunately the latter project was shut down in September 2017.
The downside of λ is its inherent complexity. Keeping in sync two already complex distributed systems is quite an implementation and maintenance challenge. People have started to look for simpler alternatives that would bring just about the same benefits and handle the full problem set. There are basically three approaches:
1) Adopt a pure streaming approach, and use a flexible framework such as Apache Samza to provide some type of batch processing. Although its distributed streaming layer is pluggable, Samza typically relies on Apache Kafka. Samza’s streams are replayable, ordered partitions. Samza can be configured for batching, i.e. consume several messages from the same stream partition in sequence.
2) Take the opposite approach, and choose a flexible Batch framework that would also allow micro-batches, small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Spark streaming is essentially a sequence of small batch processes that can reach latency as low as one second. Trident is a high-level abstraction on top of Storm that can process streams as small batches as well as do batch aggregation.
3) Use a technology stack already combining batch and real-time, such as Spring "XD", Summingbird or Lambdoop. Summingbird ("Streaming MapReduce") is a hybrid system where both batch/real-time workflows can be run at the same time and the results merged automatically. The Speed layer runs on Storm and the Batch layer on Hadoop. Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing paradigms:
The integrated approach (unified λ) seeks to handle Big Data’s Volume and Velocity by featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. And with a unified framework, there would be only one system to learn, and one system to maintain.
To sum up: Lambda is the architecture that combines a batch layer with a speed layer (real-time processing); Kappa uses the speed layer (real-time processing) to handle both real-time and historical data end to end; Unified Lambda uses a single API framework covering both the batch and speed layers, and that same API is also used when operating on the results of the two layers.
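The unified-API idea can be sketched as one transformation defined once and executed by both a batch runner and a streaming runner. This is loosely in the spirit of Summingbird, but purely illustrative; all names are invented:

```python
# Toy unified API: the pipeline (a word count) is defined once; the
# batch and "streaming" runners both execute the same logic.
def wordcount(words, into=None):
    out = {} if into is None else into
    for w in words:
        out[w] = out.get(w, 0) + 1
    return out

def run_batch(dataset):
    return wordcount(dataset)        # one pass over the full dataset

def run_streaming(stream):
    view = {}
    for w in stream:                 # same logic, applied per event
        wordcount([w], into=view)
    return view

data = ["a", "b", "a"]
```

With one definition driving both paradigms, there is only one system to learn and one to maintain, which is exactly the promise of unified Lambda.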
Worth mentioning is the butterfly architecture, which takes the Unified Architecture as its prototype. From the introduction on the http://ampool.io company website:
https://www.ampool.io/emerging-data-architectures-lambda-kappa-and-butterfly
We note that the primary difficulty in implementing the speed, serving, and batch layers in the same unified architecture is due to the deficiencies of the distributed file system in the Hadoop ecosystem. If a storage component could replace or augment the HDFS to serve the speed and serving layers, while keeping data consistent with HDFS for batch processing, it could truly provide a unified data processing platform. This observation leads to the butterfly architecture.
In short: in the Hadoop ecosystem, the distributed file system lacks consistent support across the speed, batch, and serving layers, which makes unified storage management on top of Hadoop difficult.
The main differentiating characteristics of the butterfly architecture is the flexibility in computational paradigms on top of each of the above data abstractions. Thus a multitude of computational engines, such as MPP SQL-engines ( Apache Impala, Apache Drill, or Apache HAWQ), MapReduce, Apache Spark, Apache Flink, Apache Hive, or Apache Tez can process various data abstractions, such as datasets, dataframes, and event streams. These computation steps can be strung together to form data pipelines, which are orchestrated by an external scheduler. A resource manager, associated with pluggable resource schedulers that are data aware, are a must for implementing the butterfly architecture. Both Apache YARN, and Apache Mesos, along with orchestration frameworks, such as Kubernetes, or hybrid resource management frameworks, such as Apache Myriad, have emerged in the last few years to fulfill this role.
The butterfly architecture's biggest highlight is that it can layer flexible computational abstractions on top of each storage tier, so that reads and writes go through the same software framework. A multitude of compute engines, such as MPP SQL engines (Impala, Drill, HAWQ), MapReduce, Spark, Flink, Hive, and Tez, can all access the data in these storage abstractions: datasets, dataframes, and event streams.
Datasets: partitioned collections, possibly distributed over multiple storage backends
DataFrames: structured datasets, similar to tables in RDBMS or NoSQL. Immutable dataframes are suited for analytical workloads, while mutable dataframes are suited for transactional CRUD workloads.
Event Streams: are unbounded dataframes, such as time series or sequences.
Datasets are distributed collections of data; DataFrames are structured datasets, akin to relational tables or NoSQL documents, where mutable dataframes support transactional OLTP workloads and immutable ones support decision-support workloads; an event stream is the source that datasets derive from, sitting downstream of publishers and stream processing.
The above draws on that article; I have only translated it and have not personally implemented an architecture like this.
1. The Kappa architecture uses real-time processing to satisfy high-speed, real-time needs while also covering batch scenarios. Given that, what are Kappa's drawbacks?
1.1 Is Kappa's batch-processing capability inferior to MapReduce under Lambda?
Under Lambda, MapReduce's strengths are easy scaling of storage and compute nodes, a high success rate for offline processing, reliable fault tolerance at every Map/Reduce step, and fast data recovery in failure scenarios. Does that mean the real-time engines Storm/Flink/Spark must be slower or less reliable? Not necessarily; it comes down to how the cluster is configured and managed.
1.2 Kappa's batch-processing capability may require more expensive hardware than a pure batch architecture.
That is indeed possible: memory-based real-time processing is bound to consume more resources than Hadoop's Map/Reduce model.
2. The Unified Lambda architecture provides a single, unified API to drive both the Batch and Speed layer operations.
It thus keeps Kappa's advantage, a single codebase, while overcoming Kappa's weaknesses, namely error-prone stream processing and high cost.
Although Unified Lambda (the hybrid architecture) looks comprehensive, I still favor Kappa. Just compare the two diagrams Microsoft published, one for Lambda and one for Kappa: the simplest thing is often the most efficient!
Since the Batch layer can already be subsumed by the Speed layer, Kappa already has the simplest practical core; why go to the trouble of all that extra work?
Reposted from:
解讀主流大數據架構 - 知乎
https://zhuanlan.zhihu.com/p/40996525