Distributed Computing Study Notes


 

1. Overview

1.1 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.

About Kafka:

1. A distributed message publish-subscribe system;

2. Designed to handle real-time streaming data;

3. Originally developed at LinkedIn, now part of Apache;

4. Does not follow the JMS standard and does not use the JMS API;

5. Persists each partition's updates to a topic (Kafka maintains feeds of messages in topics).

Kafka is message-oriented middleware with the following characteristics:

A. Focuses on high throughput rather than other features;

B. Targets real-time scenarios;

C. The state of message consumption is maintained on the Consumer side, not by the Kafka server;

D. Distributed: Producers, Brokers, and Consumers all run across multiple machines.

 

 

1.1.1 Overview

Apache Kafka™ is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  • It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.

  • It lets you store streams of records in a fault-tolerant way.

  • It lets you process streams of records as they occur.

What is Kafka good for?

It gets used for two broad classes of application:

  • Building real-time streaming data pipelines that reliably get data between systems or applications
  • Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

  1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.
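As a minimal illustration of the Producer API, the sketch below publishes a few string records to a topic over that client protocol. The broker address (localhost:9092) and the topic name my-topic are placeholders for this example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key end up in the same partition of the topic.
                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i));
            }
        } // close() flushes any buffered records before returning
    }
}
```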

1.1.2 Topics and Logs

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

 

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

 

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
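A small sketch of this offset control, assuming a local broker and a topic named my-topic: the consumer joins a (hypothetical) group, waits for its partition assignment, then rewinds to the beginning of its partitions to reprocess history.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "reprocessing-group");        // placeholder group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));

            // poll() until partitions are actually assigned to this instance.
            while (consumer.assignment().isEmpty()) {
                consumer.poll(Duration.ofMillis(100));
            }
            // The offset is owned by the consumer: rewind to the start to reprocess history.
            consumer.seekToBeginning(consumer.assignment());

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```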

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

1.1.3 Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

1.1.4 Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
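A sketch of such a semantic partition function, written as a custom Partitioner that hashes the record key so that all records for one key land in one partition. The class name and the hashing choice are illustrative; Kafka's default partitioner already behaves similarly for keyed records, and a custom class like this would be registered through the producer's "partitioner.class" property.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative key-hash partitioner: same key -> same partition.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;   // no key: fall back to a fixed partition for this sketch
        }
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    @Override public void configure(Map<String, ?> configs) { }
    @Override public void close() { }
}
```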

1.1.5 Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

 

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

1.2 Storm

Storm is a real-time processing system; its typical usage pattern is a consumer that pulls data and computes over it. Storm ships with storm-kafka, so it can read data directly from Kafka through Kafka's low-level API.

Storm provides the following:

1. At-least-once message processing;

2. Fault tolerance;

3. Horizontal scalability;

4. No intermediate queues;

5. Lower operational overhead;

6. It "just works" for the task at hand;

7. Coverage of a wide range of use cases, such as stream processing, continuous computation, and distributed RPC.

Storm architecture diagram:

 

The Topology of a Storm job:

 

1.2.1 Task Scheduling and Load Balancing

 

  1. nimbus refers to workers that are available for work as worker-slots.
  2. nimbus is the control core of the entire cluster. It is responsible overall for topology submission, runtime status monitoring, load balancing, task reassignment, and so on. The assignment nimbus produces includes the path of the topology code (local to nimbus) as well as the tasks, executors, and workers information. A worker is uniquely identified by node + port.
  3. The supervisor is responsible for actually synchronizing workers; one supervisor is called a node. "Synchronizing workers" means responding to nimbus's scheduling and assignment by creating, scheduling, and destroying workers. The supervisor downloads the topology code from nimbus to the local machine in order to schedule tasks.
  4. The assignment information contains the task-to-worker mapping (task -> node + port), so a worker node can use it to determine which remote machines to communicate with.

 

1.2.2 Worker

工做者,執行一個拓撲的子集,可能爲一個或多個組件執行一個或者多個執行任務(Exectutor)。

1.2.3 Executor

工做者產生的線程。

1.2.4 Task

The actual unit of data processing.
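The relationship between workers, executors, and tasks is set when the topology is built. The sketch below is illustrative only: SentenceSpout and SplitBolt are hypothetical user-defined components, and the parallelism numbers are arbitrary.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // SentenceSpout and SplitBolt are hypothetical user-defined components.
        // 2 executors (threads) for the spout, one task per executor by default.
        builder.setSpout("sentences", new SentenceSpout(), 2);

        // 2 executors for the bolt but 4 tasks, so each thread runs 2 task instances.
        builder.setBolt("splitter", new SplitBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setNumWorkers(3);   // 3 worker JVM processes spread across the cluster

        StormSubmitter.submitTopology("parallelism-demo", conf, builder.createTopology());
    }
}
```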

1.3 Flume

Flume is a distributed, reliable, and highly available system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. It supports plugging custom data senders into the logging system to collect data, and it can also do simple processing on the data and write it to a variety of data receivers.

1.4 Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS relaxes some POSIX requirements and allows streaming access to the data in the file system.

The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over it.
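The classic word-count job is a compact way to see this division of labor: HDFS holds the input and output, while MapReduce does the computation. This is a standard sketch; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```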

1.5 Spark

Apache Spark is a fast and general engine for large-scale data processing.

 

Spark Streaming is part of Spark's core API; it supports high-throughput, fault-tolerant processing of real-time data streams.

It can accept data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, process it with simple API functions such as map, reduce, join, and window, and also apply the built-in machine learning and graph algorithm packages directly to the data.

 

它的工做流程像下面的圖所示同樣,接受到實時數據後,給數據分批次,而後傳給Spark Engine處理最後生成該批次的結果。它支持的數據流叫Dstream,直接支持Kafka、Flume的數據源。Dstream是一種連續的RDDs。

 

1.5.1 Shark

Shark (Hive on Spark): Shark essentially provides the same HiveQL command interface as Hive on top of the Spark framework. To stay as compatible with Hive as possible, Shark uses Hive's APIs for query parsing and logical plan generation, and only the final physical plan execution phase replaces Hadoop MapReduce with Spark. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory for data reuse, which speeds up retrieval of particular data sets. Shark also implements specific data analysis and learning algorithms through user-defined functions (UDFs), so that SQL querying and analytical computation can be combined and RDD reuse is maximized.

1.5.2 Spark streaming

Spark Streaming: a framework built on Spark for processing stream data. The basic idea is to split the stream into small time slices (a few seconds) and process those small chunks in a batch-like manner. Spark Streaming builds on Spark for two reasons: Spark's low-latency execution engine (on the order of 100 ms) can be used for near-real-time computation, and compared with record-at-a-time frameworks such as Storm, the RDD data set makes efficient fault tolerance easier to implement. In addition, the micro-batch approach is compatible with both batch and real-time logic and algorithms, which is convenient for applications that need to analyze historical and real-time data together.

1.5.3 Bagel

Bagel: Pregel on Spark, which lets you do graph computation with Spark; a very useful small project. Bagel ships with an example that implements Google's PageRank algorithm.

1.6 Solr

Solr is a standalone enterprise search server that exposes a web-service-like API. Users can submit XML files in a specified format to the search server over HTTP to build indexes, and they can issue search requests via HTTP GET and receive results in XML format.

Solr is a high-performance full-text search server developed in Java 5 and based on Lucene. It extends Lucene with a richer query language, is configurable and scalable, optimizes query performance, and provides a complete administration UI, making it a very capable full-text search engine.
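For illustration, a query can be issued with any HTTP client; the sketch below uses the JDK's built-in HttpClient (Java 11+) against a hypothetical local Solr core named mycore and asks for XML results.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr host and core name; wt=xml asks for an XML response.
        String url = "http://localhost:8983/solr/mycore/select?q=title%3Akafka&wt=xml";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());   // XML search results returned by Solr
    }
}
```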

1.7 MongoDB

MongoDB is a database based on distributed file storage, written in C++. It aims to provide scalable, high-performance data storage for web applications.

MongoDB sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the closest to a relational database. The data structures it supports are very loose, in a JSON-like (BSON) format, so it can store fairly complex data types. Mongo's biggest strength is its very powerful query language, whose syntax somewhat resembles an object-oriented query language: it can express most of the functionality of single-table queries in a relational database, and it also supports indexing the data.
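A short sketch with the MongoDB Java (sync) driver that inserts a loosely structured document, creates an index, and queries it back; the connection string, database, and collection names are placeholders.

```java
import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoExample {
    public static void main(String[] args) {
        // Placeholder connection string, database, and collection names.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> users = db.getCollection("users");

            // Loosely structured, BSON-like documents: fields can differ per document.
            users.insertOne(new Document("name", "alice")
                    .append("age", 30)
                    .append("tags", Arrays.asList("admin", "dev")));

            // Secondary index on a field, then a query that can use it.
            users.createIndex(Indexes.ascending("name"));
            Document first = users.find(Filters.eq("name", "alice")).first();
            System.out.println(first == null ? "not found" : first.toJson());
        }
    }
}
```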

1.8 Mesos

Mesos is an open-source distributed resource management framework under Apache, often described as the kernel of a distributed system. Mesos was originally developed at the AMPLab at UC Berkeley and later saw wide use at Twitter.

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.

1.9 HBase

HBase is a distributed, column-oriented open-source database whose design comes from Fay Chang's Google paper "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data, and it uses a column-based rather than row-based model.
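A brief sketch with the HBase Java client that writes and reads cells addressed by row key, column family, and qualifier; it assumes a table named user with a column family info already exists.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

        // Assumes a table "user" with a column family "info" already exists.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {

            // Cells are addressed by row key + column family + column qualifier.
            Put put = new Put(Bytes.toBytes("row-1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("beijing"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row-1001"));
            Result result = table.get(get);
            String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}
```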

 

1.10 Cassandra

Cassandra's data model is a four- or five-dimensional model based on column families. It borrows the data structures and features of Amazon's Dynamo and Google's Bigtable, and stores data using Memtables and SSTables. Before writing data, Cassandra first records it in a CommitLog; the data is then written to the Memtable of the corresponding column family. A Memtable is an in-memory structure that keeps data sorted by key; when certain conditions are met, the Memtable's data is flushed to disk in bulk and stored as an SSTable.

1.11 ZooKeeper

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.

ZooKeeper's goal is to encapsulate these complex and error-prone key services, and to expose a simple, easy-to-use interface on top of an efficient, stable system.
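As a small illustration of the configuration-maintenance use case, the sketch below stores a value in a znode, reads it back, and registers a watch. The connection string, znode path, and stored value are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder connection string; 5000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a piece of configuration as a persistent znode.
        String path = "/demo-config";
        byte[] value = "jdbc:mysql://db1:3306/demo".getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back and register a one-shot watch that fires when the node changes.
        byte[] data = zk.getData(path, event -> System.out.println("config changed: " + event), null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```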

ZooKeeper is based on the Fast Paxos algorithm. Plain Paxos can livelock: when multiple proposers submit proposals in an interleaved fashion, they may block one another so that no proposer ever succeeds. Fast Paxos optimizes this by electing a leader, and only the leader may submit proposals.

ZooKeeper's basic operating flow:

1. Elect a Leader;

2. Synchronize data;

3. The Leader must have the highest execution ID, similar to root privileges;

4. A majority of the machines in the cluster respond to and follow the elected Leader.

 

1.12 YARN

YARN is a complete rewrite of the Hadoop cluster architecture. YARN is the next-generation Hadoop resource manager: with YARN, users can run and manage multiple kinds of jobs on the same physical cluster, for example MapReduce batch jobs and graph-processing jobs. This not only consolidates the number of systems an organization has to manage, but also allows different types of analysis to be run on the same data.

Compared with the classic MapReduce engine in the first version of Hadoop, YARN offers clear advantages in scalability, efficiency, and flexibility. Both small and large Hadoop clusters benefit greatly from YARN. For end users (developers, not administrators) the changes are almost invisible, because unmodified MapReduce jobs can still be run with the same MapReduce APIs and CLI.

 

2. Architecture

2.1 Spark

Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark implements distributed computation based on the MapReduce model and keeps the advantages of Hadoop MapReduce; but unlike MapReduce, intermediate job output and results can be kept in memory, so there is no need to read and write HDFS between steps. This makes Spark better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning.
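A minimal sketch of why in-memory reuse helps iterative jobs: the data set is cached once and then scanned repeatedly inside a loop without re-reading the source. The tiny data set and the toy iteration (which simply converges to the mean) are only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeRddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeRddExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Double> numbers = Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0);

            // cache() keeps the data set in memory, so the repeated passes below
            // do not have to re-read or recompute it from the source.
            JavaRDD<Double> data = sc.parallelize(numbers).cache();

            double estimate = 0.0;
            for (int i = 0; i < 10; i++) {      // stand-in for an iterative ML-style loop
                final double current = estimate;
                double correction = data.map(x -> x - current).reduce(Double::sum) / data.count();
                estimate += correction;         // converges to the mean of the data
            }
            System.out.println("estimate = " + estimate);
        }
    }
}
```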

 

2.2 Spark Streaming

Spark Streaming decomposes stream computation into a series of short batch jobs. The batch engine is Spark: the input data of Spark Streaming is divided, according to a batch size (for example, 1 second), into segments (a Discretized Stream); each segment is turned into an RDD (Resilient Distributed Dataset) in Spark, and transformations on the DStream in Spark Streaming become transformations on RDDs in Spark, with the intermediate results of the RDD operations kept in memory. Depending on business requirements, the streaming computation can aggregate these intermediate results or write them out to external storage.

 

2.3 Flume + Kafka + Storm

Messages enter the Kafka message middleware in various ways; for example, Flume can be used to collect log data, which is then routed and buffered in Kafka and analyzed in real time by a Storm program. In that case we read the messages from Kafka in a Storm Spout and hand them to the specific Bolt components for analysis and processing.

In the Flume + Kafka + Storm pipeline, Flume acts as the message Producer: the data it produces (log data, business request data, and so on) is published to Kafka. Storm topologies then act as the message Consumers via subscription and process the messages in the Storm cluster.

There are two ways to process the data (a wiring sketch follows at the end of this section):

1. Use a Storm topology to analyze and process the data in real time directly;

2. Integrate Storm with HDFS, writing the processed messages to HDFS for offline analysis.

The Kafka cluster must ensure that each broker's ID is unique across the entire cluster.

The Storm cluster also depends on a ZooKeeper cluster, so the ZooKeeper cluster must be kept running properly.
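A wiring sketch for the first option, assuming the storm-kafka-client module (whose builder options vary somewhat across Storm versions); the broker list, topic, group id, and the LogAnalysisBolt class are placeholders.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaToStormTopology {
    public static void main(String[] args) throws Exception {
        // Subscribe to the topic that Flume (as producer) publishes its events to.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka1:9092,kafka2:9092", "flume-events")
                                .setProp("group.id", "storm-analytics")
                                .build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        // LogAnalysisBolt is a placeholder for the bolt(s) doing the actual analysis.
        builder.setBolt("analysis", new LogAnalysisBolt(), 4).shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("flume-kafka-storm", conf, builder.createTopology());
    }
}
```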

 

 

 

 

2.4 Kafka + Storm Use Cases

1. Current and planned scaling needs

Kafka + Storm makes it easy to scale the topology out; scaling is limited only by hardware.

2. Rapid application development

Open-source software evolves quickly and has strong community support. With microservices, you only need to build what actually has to be built.

3. Risk

The topologies are not yet mature, but there is no real alternative.

 

 

3. Comparison

3.1 Storm vs. Spark Streaming

Storm and Spark Streaming are both open-source frameworks for distributed stream processing.

3.1.1 Processing Model and Latency

Storm processes events one at a time as they arrive, while Spark Streaming processes the stream of events within a time window. Consequently, Storm can achieve sub-second latency for a single event, whereas Spark Streaming has a latency of several seconds.

3.1.2 Fault Tolerance and Data Guarantees

The trade-off in fault tolerance and data guarantees is that Spark Streaming provides better support for fault-tolerant stateful computation, whereas in Storm every individual record must be tracked as it passes through the system. Storm can therefore guarantee that each record is processed at least once, but duplicates may appear when recovering from a failure, which means mutable state may be incorrectly updated twice.

In short, if you need sub-second latency with no data loss, Storm is a good choice. If you need stateful computation with a strict guarantee that every event is processed exactly once, Spark Streaming is better. Spark Streaming's programming logic may also be easier, because it resembles batch programming on Hadoop.

3.1.3 Implementation and Programming API

Storm was originally implemented in Clojure, while Spark Streaming is written in Scala. This is worth knowing if you want to read the code or customize either system, so that you can see how each one works. Storm was developed by BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Spark is built on the idea that when data is huge, it is more efficient to move the computation to the data than to move the data to the computation: each node stores (or caches) its data set, and tasks are submitted to the nodes.

Storm's architecture is the opposite of Spark's. Storm is a distributed stream computing engine: each node implements a basic computation step, and data items flow in and out across a network of interconnected nodes; unlike Spark, the data is delivered to the computation. Both frameworks are used for parallel computation over large amounts of data. Storm is better at dynamically processing small chunks of data as they are generated. It is unclear which approach has the advantage in throughput, but Storm has lower computation latency.

3.2 Storm vs. Hadoop

Hadoop does disk-level computation: during a computation the data sits on disk and must be read from and written to disk. Storm does memory-level computation: data is fed directly into memory over the network.

Hadoop | Storm
Batch processing | Real-time processing
Jobs run to completion | Topologies run forever
Stateful nodes | Stateless nodes
Scalable | Scalable
Guarantees no data loss | Guarantees no data loss
Open source | Open source
Big batch processing | Fast, reactive, real-time processing

3.3 Spark vs. Hadoop

Spark keeps intermediate data in memory, which is more efficient for iterative computation.

Spark is better suited to machine learning and data mining workloads with many iterations, because Spark provides the RDD abstraction.

3.3.1 Spark Is More General than Hadoop

Spark offers many kinds of data set operations, unlike Hadoop, which only provides Map and Reduce. Spark calls operations such as map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy transformations, and it also provides actions such as count, collect, reduce, lookup, and save.

This rich set of data set operations is convenient for users developing higher-level applications, and the communication model between processing nodes is no longer limited to the single data-shuffle pattern of Hadoop. Users can name, materialize, and control the storage and partitioning of intermediate results, so the programming model is more flexible than Hadoop's.

However, because of the nature of RDDs, Spark is not a good fit for applications that make asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, it is not suited to models based on incremental modification.

3.3.2 Fault Tolerance

Fault tolerance in distributed data set computation is achieved through checkpointing. There are two checkpointing approaches: checkpointing the data and logging the updates. Users can choose which approach to use.
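A small sketch of the two options in Spark terms: the logged lineage of transformations is what recovery relies on by default, while checkpoint() materializes the data itself and truncates the lineage. The checkpoint directory here is a placeholder and would normally live on HDFS.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CheckpointExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder path; on a cluster this would normally be an HDFS directory.
            sc.setCheckpointDir("/tmp/spark-checkpoints");

            JavaRDD<Integer> base = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // By default, recovery relies on the logged lineage of transformations.
            JavaRDD<Integer> derived = base.map(x -> x * x).filter(x -> x > 4);

            // checkpoint() materializes the data itself (the "checkpoint data" option),
            // cutting the lineage when the next action runs.
            derived.checkpoint();
            System.out.println(derived.collect());
        }
    }
}
```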

3.3.3 Usability

Spark improves usability by providing rich Scala, Java, and Python APIs and an interactive shell.

3.4 Spark vs. Hadoop vs. Storm

3.4.1 Similarities

1. All are open-source data processing frameworks;

2. All can be used for real-time business intelligence and big data analytics;

3. Because they are relatively simple to implement and use, they are commonly used for big data processing;

4. All are JVM-based implementations, using Java, Scala, or Clojure.

3.4.2 Differences

Data processing models

Hadoop MapReduce is best suited for batch processing. Applications that require real-time processing need other open-source platforms such as Impala or Storm. Apache Spark is designed to do more than plain data processing: it can make use of existing machine learning libraries and process graphs. Because of its high performance, Spark can be used for both batch processing and real-time processing, so everything can be handled on a single platform instead of splitting tasks across platforms.

Micro-batching is a special kind of batch processing with a much smaller batch size. Micro-batching offers stateful computation, which makes windowing easier.

Apache Storm is a stream-processing architecture that can also do micro-batching via Trident.

Spark is a batch-processing architecture that can do micro-batching via Spark Streaming.

The main difference between Spark and Storm is that Spark performs data-parallel computation while Storm performs task-parallel computation.

Feature | Apache Storm/Trident | Spark Streaming
Programming languages | Java, Clojure, Scala | Java, Scala
Reliability | Supports "exactly once" processing mode; can also be used in "at least once" and "at most once" modes. | Supports only "exactly once" processing mode.
Stream source | Spout | HDFS
Stream primitives | Tuple, Partition | DStream
Persistence | MapState | Per RDD
State management | Supported | Supported
Resource management | YARN, Mesos | YARN, Mesos
Provisioning | Apache Ambari | Basic monitoring using Ganglia
Messaging | ZeroMQ, Netty | Netty, Akka

In Spark Streaming, if a worker node fails, the system can recompute from a remaining copy of the input data. But if the node running the network receiver fails, data that has not yet been replicated to other nodes may be lost. In short, only data backed by HDFS is safe.

In Storm/Trident, if a worker fails, the nimbus assigns the failed worker's tasks to other nodes in the system. All tuples sent to the failed node time out and are therefore replayed automatically. In Storm, too, the delivery guarantee depends on a safe data source.

Situation | Choice of framework
Strictly low latency | Storm can provide better latency with fewer restrictions than Spark Streaming.
Low development cost | With Spark, the same code base can be used for batch processing and stream processing; with Storm, that is not possible.
Message delivery guarantee | Both Apache Storm (Trident) and Spark Streaming offer "exactly once" processing mode.
Fault tolerance | Both frameworks are fault tolerant to roughly the same extent. In Apache Storm/Trident, if a process fails, the supervisor process restarts it automatically, since state management is handled through ZooKeeper. Spark restarts workers via the resource manager, which can be YARN, Mesos, or its standalone manager.

4. Summary

Spark is the Swiss Army knife of big data analytics tools.

Storm excels at reliably processing unbounded streams of data in real time, while Hadoop is suited to batch processing.

Hadoop, Spark, and Storm are all open-source data processing platforms; although their functionality overlaps, each has a different focus.

Hadoop is an open-source distributed data framework, used to store large data sets and to run distributed analytics and processing across clusters. Hadoop's MapReduce distributed computation is limited to batch processing, which is why Hadoop today serves mainly as a data warehouse rather than a data analysis tool.

Spark does not have its own distributed storage system, so it relies on Hadoop's HDFS to store data.

Storm is a task-parallel, open-source distributed computing system. Storm has its own independent workflows in topologies, i.e., directed acyclic graphs. Topologies run continuously until they are interrupted or the system is shut down. Storm does not run on a Hadoop cluster; instead it uses ZooKeeper to manage its processes. Storm can read and write files to HDFS.

Hadoop is popular for big data processing, but Spark and Storm are even more popular.

 

 

5. Appendix

Below is the original English text, which has already been translated above.

Understanding the differences

1) Data processing models

Hadoop MapReduce is best suited for batch processing. For big data applications that require real-time processing, organizations must use other open source platforms like Impala or Storm. Apache Spark is designed to do more than plain data processing as it can make use of existing machine learning libraries and process graphs. Thanks to the high performance of Apache Spark, it can be used for both batch processing and real time processing. Spark provides an opportunity to use a single platform for everything rather than splitting the tasks on different open source platforms, avoiding the overhead of learning and maintaining different platforms.

Micro-batching is a special kind of batch processing wherein the batch size is orders of magnitude smaller. Windowing becomes easy with micro-batching as it offers stateful computation of data.

Storm is a stream processing framework that also does micro-batching (Trident).

Spark is a batch processing framework that also does micro-batching (Spark Streaming).

Apache Storm is a stream processing framework, which can do micro-batching using Trident (an abstraction on Storm to perform stateful stream processing in batches).

Spark is a framework to perform batch processing. It can also do micro-batching using Spark Streaming (an abstraction on Spark to perform stateful stream processing).

One key difference between these two frameworks is that Spark performs data-parallel computations while Storm performs task-parallel computations.

Feature | Apache Storm/Trident | Spark Streaming
Programming languages | Java, Clojure, Scala | Java, Scala
Reliability | Supports "exactly once" processing mode; can also be used in "at least once" and "at most once" modes. | Supports only "exactly once" processing mode.
Stream source | Spout | HDFS
Stream primitives | Tuple, Partition | DStream
Persistence | MapState | Per RDD
State management | Supported | Supported
Resource management | YARN, Mesos | YARN, Mesos
Provisioning | Apache Ambari | Basic monitoring using Ganglia
Messaging | ZeroMQ, Netty | Netty, Akka

 

 

 

 

In Spark Streaming, if a worker node fails, then the system can re-compute from the remaining copy of the input data. But if the node where the network receiver runs fails, then the data which has not yet been replicated to other nodes might be lost. In short, only an HDFS-backed data source is safe.

In Apache Storm/Trident, if a worker fails, the nimbus assigns the worker's tasks to other nodes in the system. All tuples sent to the failed node will be timed out and hence replayed automatically. In Storm as well, the delivery guarantee depends on a safe data source.

Situation | Choice of framework
Strictly low latency | Storm can provide better latency with fewer restrictions than Spark Streaming.
Low development cost | With Spark, the same code base can be used for batch processing and stream processing; with Storm, that is not possible.
Message delivery guarantee | Both Apache Storm (Trident) and Spark Streaming offer "exactly once" processing mode.
Fault tolerance | Both frameworks are relatively fault tolerant to the same extent. In Apache Storm/Trident, if a process fails, the supervisor process will restart it automatically, as state management is handled through ZooKeeper. Spark handles restarting workers via the resource manager, which can be YARN, Mesos, or its standalone manager.

 

Spark is what you might call a Swiss Army knife of Big Data analytics tools.

Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Hadoop, Spark, and Storm are some of the popular open source platforms for real-time data processing. Each of these tools has some intersecting functionality. However, they have different roles to play.

Hadoop is an open source distributed processing framework.  Hadoop is used for storing large data sets and running distributed analytics processes on various clusters. Hadoop MapReduce is limited to batch processing of one job at a time. This is the reason why these days Hadoop is being used extensively as a data warehousing tool and not as data analysis tool.

Spark doesn’t have its own distributed storage system. This is the reason why most of the big data projects install Apache Spark on Hadoop so that the advanced big data applications can be run on Spark by using the data stored in Hadoop Distributed File System.

Storm is a task parallel, open source distributed computing system. Storm has its independent workflows in topologies i.e. Directed Acyclic Graphs. The topologies in Storm execute until there is some kind of a disturbance or if the system shuts down completely. Storm does not run on Hadoop clusters but uses ZooKeeper and its own minion worker to manage its processes. Storm can read and write files to HDFS.

Apache Hadoop is hot in the big data market but its cousins Spark and Storm are hotter.
