Introduction:
Storm is a free, open-source, distributed, highly fault-tolerant real-time computation system. What sets it apart from other big-data solutions is its processing model. Hadoop is essentially a batch system: data is loaded into the Hadoop file system (HDFS) and distributed across nodes for processing, and when processing completes, the results are written back to HDFS for the originator to use. Hadoop's high throughput makes it convenient to process massive data sets, but its weaknesses are as pronounced as its strengths: high latency, slow response, and complex operations. Storm was created to make up for Hadoop's lack of real-time processing. Storm supports building topologies that transform unbounded streams of data. Unlike Hadoop jobs, these transformations never stop; they keep processing data as it arrives. Storm is often used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and similar workloads. Storm is also very simple to deploy and manage, and among comparable stream-processing tools its performance stands out.
Storm's advantages:
Storm's components:
Before introducing Storm, let's first compare it with Hadoop:
Storm consists of two main kinds of components: Nimbus and Supervisor. Both are fail-fast and stateless; task state and heartbeat information are stored in ZooKeeper, and submitted code and resources live on each machine's local disk. Some concepts in Storm:
The following figure shows the relationships among Nimbus, Supervisor, Worker, Task, and ZooKeeper:
In Storm, the computation of a real-time application is packaged and submitted as a topology, similar to a Hadoop MapReduce job. The key difference is that a MapReduce job eventually completes, while a topology, once submitted, runs forever unless you explicitly kill it. A topology is a graph of spouts and bolts connected by streams. Below is a structural diagram of a topology:
Each computation component (spout or bolt) in a topology has a parallelism degree that can be specified when the topology is created; Storm allocates that many threads across the cluster to run the component concurrently. Since each spout or bolt runs as multiple task threads, how are tuples routed between two components? Storm provides several stream grouping strategies to answer exactly this question. When defining a topology, you specify for each bolt which streams it receives as input (note: a spout does not receive streams, it only emits them).
The following figure shows the topology submission flow:
One of the most interesting aspects of Storm is its emphasis on fault tolerance and management. Storm implements guaranteed message processing, so every tuple is fully processed through the topology; if a tuple is found to be unprocessed, it is automatically replayed from the spout. Storm also implements task-level failure detection: when a task fails, messages are automatically reassigned so that processing resumes quickly. Storm's process management is smarter than Hadoop's: processes are managed by supervisors to ensure that resources are fully utilized.
The figure below shows Storm's data interactions. Note that the two components Nimbus and Supervisor never communicate directly; all state is kept in ZooKeeper. Workers exchange data over ZeroMQ (newer versions use Netty in place of ZeroMQ).
Storm uses ZeroMQ for messaging, which eliminates intermediate queueing and lets messages flow directly between tasks. Behind the messages is an automated, efficient mechanism for serializing and deserializing Storm's primitive types.
Applications of Storm:
Storm is widely used for real-time analytics, online machine learning, continuous computation, and distributed RPC. If your business scenario needs low-latency responses, with analysis and results delivered within seconds or even milliseconds, and must scale as data volume grows, Storm is worth considering. Scenarios where Storm fits:
Let's look at some real-world applications:
References:
1. Taobao Search tech blog: an introduction to Storm (2012-10-09)
2. UC tech blog: Storm, the hottest stream-processing framework (2013-09-23)
Key diagrams: the topology submission flow and Storm's data interaction diagram.
Evolution:
From version 0.5.0 when it was open-sourced, to today's 0.8.0+ and the upcoming 0.9.0+, Storm has successively added the following major new features:
Transactional topologies and Trident are both solutions to the duplicate-counting and usability problems encountered in real applications. Clearly, commercial adoption has fed a great deal of good feedback back into Storm.
Storm has over 4,000 followers on GitHub. It integrates with many libraries and supports systems including Kestrel, Kafka, JMS, Cassandra, Memcached, and more; as the list of supported libraries grows, Storm becomes easier to fit into existing systems.
Storm has an active community and a group of enthusiastic contributors. Over the past two years, its growth has been a success.
Current production deployments:
Summary:
To use Storm you need a message queue as the data entry point, you must think about how to keep state within a stream, and you must work out how to decompose a large problem for distributed processing. Solving these problems may cost more than adding another server. But once you commit to Storm and work through those annoying details, you get to enjoy the simplicity and scalability it brings.
This article introduces Storm's key concepts (translation excerpted from Xu Mingming's blog).
A reading of Storm's official Tutorial documentation
This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are:
Overview:
The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.
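Since a topology is just a graph of spouts and bolts connected by stream groupings, its shape can be sketched in a few lines. The Python below is only an illustration (Storm's real API is Java's TopologyBuilder), and all component names are made up.

```python
# Minimal illustration of a topology as a graph: nodes are spouts/bolts,
# edges carry the stream grouping used between them. Names are hypothetical.
topology = {
    "sentence-spout": [],                                  # spouts have no inputs
    "split-bolt":  [("sentence-spout", "shuffle")],        # reads from the spout
    "count-bolt":  [("split-bolt", ("fields", "word"))],   # partitioned by "word"
}

def upstream(component):
    """Return the components a given component subscribes to."""
    return [src for src, _ in topology[component]]

print(upstream("count-bolt"))  # -> ['split-bolt']
```

This is only the wiring; the grouping tags on each edge say how tuples are partitioned among the downstream tasks, which is covered later in the stream-grouping section.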
Resources:
The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
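As an illustration of a stream's schema naming the tuple fields, here is a minimal Python sketch (Storm's real tuples are Java objects; the schema and values below are hypothetical):

```python
from collections import namedtuple

# A stream's schema names the fields of its tuples; this hypothetical
# two-field schema ("word", "count") is analogous to declaring
# new Fields("word", "count") in Storm's Java API.
WordCount = namedtuple("WordCount", ["word", "count"])

t = WordCount(word="storm", count=3)
print(t.word, t.count)  # fields are accessed by name, much like getValueByField
```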
Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".
Resources:
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
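The reliable-spout contract described above (emit, then remember the tuple until it is acked or failed) can be modeled in a few lines of Python. This is only a sketch of the idea, not Storm's API:

```python
import collections

class ReliableSpoutSketch:
    """Toy model of a reliable spout: emitted tuples stay 'pending' until
    acked; a failed tuple is re-queued so nextTuple replays it later."""
    def __init__(self, source):
        self.queue = collections.deque(source)  # the external source (e.g. a queue)
        self.pending = {}                       # message id -> tuple, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None                         # nothing to emit: just return
        self.next_id += 1
        tup = self.queue.popleft()
        self.pending[self.next_id] = tup        # remember it until acked
        return self.next_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)          # fully processed: forget it

    def fail(self, msg_id):
        self.queue.append(self.pending.pop(msg_id))  # re-queue for replay
```

For instance, after failing an emitted tuple the sketch puts it back on the queue, so a later next_tuple() call replays it; an unreliable spout would simply omit the pending map.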
Resources:
All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
The main method in bolts is the execute method, which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting zero or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically. It's perfectly fine to launch new threads in bolts that do processing asynchronously; OutputCollector is thread-safe and can be called at any time.
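The execute/emit/ack protocol can be sketched like this in Python (a toy stand-in for a Java bolt and its OutputCollector; the sentence-splitting logic is just an example):

```python
class SplitSentenceBolt:
    """Toy model of a bolt: execute() takes an input tuple, emits derived
    tuples, then acks the input, mirroring the OutputCollector protocol."""
    def __init__(self, collector):
        self.collector = collector   # stand-in for Storm's OutputCollector

    def execute(self, sentence):
        for word in sentence.split():
            self.collector["emitted"].append(word)   # emit 0..n new tuples
        self.collector["acked"].append(sentence)     # then ack the input

collector = {"emitted": [], "acked": []}
bolt = SplitSentenceBolt(collector)
bolt.execute("hello storm world")
print(collector["emitted"])  # -> ['hello', 'storm', 'world']
```

The emit-then-ack ordering in execute() is exactly what IBasicBolt automates for you in the real API.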
Resources:
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.
There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface: shuffle grouping, fields grouping, all grouping, global grouping, none grouping, direct grouping, and local-or-shuffle grouping.
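To make the two most common groupings concrete, here is a small Python sketch: shuffle grouping spreads tuples evenly across tasks (modeled here as round-robin for determinism), while fields grouping hashes a field's value so that equal values always land on the same task. Python's built-in hash() is only a stand-in for Storm's internal field hashing.

```python
import itertools

def make_shuffle(n_tasks):
    """Shuffle grouping sketch: spread tuples evenly over task ids."""
    rr = itertools.cycle(range(n_tasks))
    return lambda tup: next(rr)

def fields_grouping(field_value, n_tasks):
    """Fields grouping sketch: equal field values map to the same task."""
    return hash(field_value) % n_tasks

shuffle = make_shuffle(4)
print([shuffle(t) for t in ["a", "b", "c"]])                       # -> [0, 1, 2]
print(fields_grouping("storm", 4) == fields_grouping("storm", 4))  # -> True
```

The fields-grouping property is what makes per-key aggregations (like the word count above) correct: all tuples for one key are processed by one task.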
You can get the task ids a tuple was sent to by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).

Resources: InputDeclarer, the object returned whenever setBolt is called on TopologyBuilder, is used for declaring a bolt's input streams and how those streams should be grouped.

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple (a bolt that processes a tuple may emit further tuples, which is how the tree forms) and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.
This is all explained in much more detail in Guaranteeing message processing.
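One detail worth knowing: Storm's acker tracks each tuple tree in constant space by XOR-ing 64-bit tuple ids, folding an id in when an edge is anchored and again when it is acked; the running value returns to zero exactly when the whole tree is complete. A Python sketch of that trick (the ids below are fixed for determinism; Storm uses random 64-bit ids):

```python
class AckerSketch:
    """Constant-space tuple-tree tracking via XOR, as in Storm's acker."""
    def __init__(self):
        self.val = 0

    def anchor(self, tuple_id):
        self.val ^= tuple_id      # a new edge in the tree was created

    def ack(self, tuple_id):
        self.val ^= tuple_id      # that tuple finished processing

    def tree_complete(self):
        return self.val == 0      # every anchored id has also been acked

acker = AckerSketch()
ids = [0xA, 0xB, 0xC]             # stand-ins for random 64-bit tuple ids
for i in ids:
    acker.anchor(i)
print(acker.tree_complete())      # -> False
for i in ids:
    acker.ack(i)
print(acker.tree_complete())      # -> True
```

Because each id appears exactly twice (once anchored, once acked), the XOR cancels out regardless of the order in which acks arrive.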
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
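The 300-tasks-over-50-workers arithmetic above can be sketched as a simple round-robin assignment (a simplification of what Storm's scheduler actually does):

```python
def assign_tasks(n_tasks, n_workers):
    """Spread task ids evenly across workers, round-robin; a simplified
    model of how a topology's tasks are packed into worker processes."""
    workers = [[] for _ in range(n_workers)]
    for task in range(n_tasks):
        workers[task % n_workers].append(task)
    return workers

workers = assign_tasks(300, 50)
print(len(workers), len(workers[0]))   # -> 50 6
```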
Resources:
Storm has a number of parameters for tuning the behavior of nimbus, supervisors, and running topologies. Some configurations are system-level and some are topology-level. Every configuration with a default value has that default defined in defaults.yaml in the Storm codebase. You can override these defaults by placing a storm.yaml on your classpath, and you can also set topology-specific configuration in code when submitting with StormSubmitter. The precedence of these configurations is: defaults.yaml < storm.yaml < topology-specific configuration.
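The precedence chain behaves like a left-to-right map merge, which a short Python sketch makes concrete (the keys shown are real Storm configuration keys; the values are made up):

```python
# Later sources override earlier ones: defaults.yaml < storm.yaml <
# topology-specific config, modeled here as a left-to-right dict merge.
defaults = {"topology.workers": 1, "topology.message.timeout.secs": 30}
storm_yaml = {"topology.workers": 4}                   # cluster-wide override
topology_conf = {"topology.message.timeout.secs": 60}  # set via the Config object

effective = {**defaults, **storm_yaml, **topology_conf}
print(effective)
# -> {'topology.workers': 4, 'topology.message.timeout.secs': 60}
```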