Introduction:
Storm is a free, open-source, distributed, highly fault-tolerant real-time computation system. What sets it apart from other big-data solutions is its processing model. Hadoop is essentially a batch system: data is loaded into the Hadoop file system (HDFS) and distributed across nodes for processing, and when processing completes, the results are written back to HDFS for the originator to use. Hadoop's high throughput makes it convenient to process massive data sets, but its weaknesses are as pronounced as its strengths: high latency, slow response, and complex operations. Storm was created to make up for Hadoop's lack of real-time processing. Storm supports building topologies that transform unbounded streams of data. Unlike Hadoop jobs, these transformations never stop; they keep processing data as it arrives. Storm is often used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and similar workloads. Storm is also very simple to deploy and manage, and among comparable stream-processing tools its performance stands out.
Storm's advantages:
Storm's components:
Before introducing Storm, let's first compare it with Hadoop:
Storm consists of two main kinds of components: Nimbus and Supervisor. Both are fail-fast and stateless; task state and heartbeat information are stored in ZooKeeper, and submitted code and resources live on each machine's local disk. Some concepts in Storm:
The following figure shows the relationships among Nimbus, Supervisor, Worker, Task, and ZooKeeper:
In Storm, the computation of a real-time application is packaged and submitted as a topology, similar to a Hadoop MapReduce job. The key difference is that a MapReduce job eventually completes, while a topology, once submitted, runs forever unless you explicitly kill it. A topology is a graph of spouts and bolts connected by streams. Below is a structural diagram of a topology:
Each computation component (spout or bolt) in a topology has a parallelism degree that can be specified when the topology is created; Storm allocates that many threads across the cluster to run the component concurrently. Since each spout or bolt runs as multiple task threads, how are tuples routed between two components? Storm provides several stream grouping strategies to answer exactly this question. When defining a topology, you specify for each bolt which streams it receives as input (note: a spout does not receive streams, it only emits them).
The following figure shows the topology submission flow:
One of the most interesting aspects of Storm is its emphasis on fault tolerance and management. Storm implements guaranteed message processing, so every tuple is fully processed through the topology; if a tuple is found to be unprocessed, it is automatically replayed from the spout. Storm also implements task-level failure detection: when a task fails, messages are automatically reassigned so that processing resumes quickly. Storm's process management is smarter than Hadoop's: processes are managed by supervisors to ensure that resources are fully utilized.
The figure below shows Storm's data interactions. Note that the two components Nimbus and Supervisor never communicate directly; all state is kept in ZooKeeper. Workers exchange data over ZeroMQ (newer versions use Netty in place of ZeroMQ).
Storm uses ZeroMQ for messaging, which eliminates intermediate queueing and lets messages flow directly between tasks. Behind the messages is an automated, efficient mechanism for serializing and deserializing Storm's primitive types.
Applications of Storm:
Storm is widely used for real-time analytics, online machine learning, continuous computation, and distributed RPC. If your business scenario needs low-latency responses, with analysis and results delivered within seconds or even milliseconds, and must scale as data volume grows, Storm is worth considering. Scenarios where Storm fits:
Let's look at some real-world applications:
References:
1. Taobao Search tech blog: an introduction to Storm (2012-10-09)
2. UC tech blog: Storm, the hottest stream-processing framework (2013-09-23)
Key diagrams: the topology submission flow and Storm's data interaction diagram.
Evolution:
From version 0.5.0 when it was open-sourced, to today's 0.8.0+ and the upcoming 0.9.0+, Storm has successively added the following major new features:
Transactional topologies and Trident are both solutions to the duplicate-counting and usability problems encountered in real applications. Clearly, commercial adoption has fed a great deal of good feedback back into Storm.
Storm has over 4,000 followers on GitHub. It integrates with many libraries and supports systems including Kestrel, Kafka, JMS, Cassandra, Memcached, and more; as the list of supported libraries grows, Storm becomes easier to fit into existing systems.
Storm has an active community and a group of enthusiastic contributors. Over the past two years, its growth has been a success.
Current production deployments:
Summary:
To use Storm you need a message queue as the data entry point, you must think about how to keep state within a stream, and you must work out how to decompose a large problem for distributed processing. Solving these problems may cost more than adding another server. But once you commit to Storm and work through those annoying details, you get to enjoy the simplicity and scalability it brings.
This article introduces Storm's key concepts (translation excerpted from Xu Mingming's blog).
A reading of Storm's official Tutorial documentation
This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are:
Overview:
The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.
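Since a topology is just a graph of spouts and bolts connected by stream groupings, its shape can be sketched in a few lines. The Python below is only an illustration (Storm's real API is Java's TopologyBuilder), and all component names are made up.

```python
# Minimal illustration of a topology as a graph: nodes are spouts/bolts,
# edges carry the stream grouping used between them. Names are hypothetical.
topology = {
    "sentence-spout": [],                                  # spouts have no inputs
    "split-bolt":  [("sentence-spout", "shuffle")],        # reads from the spout
    "count-bolt":  [("split-bolt", ("fields", "word"))],   # partitioned by "word"
}

def upstream(component):
    """Return the components a given component subscribes to."""
    return [src for src, _ in topology[component]]

print(upstream("count-bolt"))  # -> ['split-bolt']
```

This is only the wiring; the grouping tags on each edge say how tuples are partitioned among the downstream tasks, which is covered later in the stream-grouping section.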
Resources:
The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
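As an illustration of a stream's schema naming the tuple fields, here is a minimal Python sketch (Storm's real tuples are Java objects; the schema and values below are hypothetical):

```python
from collections import namedtuple

# A stream's schema names the fields of its tuples; this hypothetical
# two-field schema ("word", "count") is analogous to declaring
# new Fields("word", "count") in Storm's Java API.
WordCount = namedtuple("WordCount", ["word", "count"])

t = WordCount(word="storm", count=3)
print(t.word, t.count)  # fields are accessed by name, much like getValueByField
```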
Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".
Resources:
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
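The reliable-spout contract described above (emit, then remember the tuple until it is acked or failed) can be modeled in a few lines of Python. This is only a sketch of the idea, not Storm's API:

```python
import collections

class ReliableSpoutSketch:
    """Toy model of a reliable spout: emitted tuples stay 'pending' until
    acked; a failed tuple is re-queued so nextTuple replays it later."""
    def __init__(self, source):
        self.queue = collections.deque(source)  # the external source (e.g. a queue)
        self.pending = {}                       # message id -> tuple, awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.queue:
            return None                         # nothing to emit: just return
        self.next_id += 1
        tup = self.queue.popleft()
        self.pending[self.next_id] = tup        # remember it until acked
        return self.next_id, tup

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)          # fully processed: forget it

    def fail(self, msg_id):
        self.queue.append(self.pending.pop(msg_id))  # re-queue for replay
```

For instance, after failing an emitted tuple the sketch puts it back on the queue, so a later next_tuple() call replays it; an unreliable spout would simply omit the pending map.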
Resources:
All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
The main method in bolts is the execute method, which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting zero or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically. It's perfectly fine to launch new threads in bolts that do processing asynchronously; OutputCollector is thread-safe and can be called at any time.
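The execute/emit/ack protocol can be sketched like this in Python (a toy stand-in for a Java bolt and its OutputCollector; the sentence-splitting logic is just an example):

```python
class SplitSentenceBolt:
    """Toy model of a bolt: execute() takes an input tuple, emits derived
    tuples, then acks the input, mirroring the OutputCollector protocol."""
    def __init__(self, collector):
        self.collector = collector   # stand-in for Storm's OutputCollector

    def execute(self, sentence):
        for word in sentence.split():
            self.collector["emitted"].append(word)   # emit 0..n new tuples
        self.collector["acked"].append(sentence)     # then ack the input

collector = {"emitted": [], "acked": []}
bolt = SplitSentenceBolt(collector)
bolt.execute("hello storm world")
print(collector["emitted"])  # -> ['hello', 'storm', 'world']
```

The emit-then-ack ordering in execute() is exactly what IBasicBolt automates for you in the real API.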
Resources:
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.
There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface: shuffle grouping, fields grouping, all grouping, global grouping, none grouping, direct grouping, and local-or-shuffle grouping.
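To make the two most common groupings concrete, here is a small Python sketch: shuffle grouping spreads tuples evenly across tasks (modeled here as round-robin for determinism), while fields grouping hashes a field's value so that equal values always land on the same task. Python's built-in hash() is only a stand-in for Storm's internal field hashing.

```python
import itertools

def make_shuffle(n_tasks):
    """Shuffle grouping sketch: spread tuples evenly over task ids."""
    rr = itertools.cycle(range(n_tasks))
    return lambda tup: next(rr)

def fields_grouping(field_value, n_tasks):
    """Fields grouping sketch: equal field values map to the same task."""
    return hash(field_value) % n_tasks

shuffle = make_shuffle(4)
print([shuffle(t) for t in ["a", "b", "c"]])                       # -> [0, 1, 2]
print(fields_grouping("storm", 4) == fields_grouping("storm", 4))  # -> True
```

The fields-grouping property is what makes per-key aggregations (like the word count above) correct: all tuples for one key are processed by one task.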
You can get the task ids a tuple was sent to by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).

Resources: InputDeclarer, the object returned whenever setBolt is called on TopologyBuilder, is used for declaring a bolt's input streams and how those streams should be grouped.

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple (a bolt that processes a tuple may emit further tuples, which is how the tree forms) and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.
This is all explained in much more detail in Guaranteeing message processing.
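One detail worth knowing: Storm's acker tracks each tuple tree in constant space by XOR-ing 64-bit tuple ids, folding an id in when an edge is anchored and again when it is acked; the running value returns to zero exactly when the whole tree is complete. A Python sketch of that trick (the ids below are fixed for determinism; Storm uses random 64-bit ids):

```python
class AckerSketch:
    """Constant-space tuple-tree tracking via XOR, as in Storm's acker."""
    def __init__(self):
        self.val = 0

    def anchor(self, tuple_id):
        self.val ^= tuple_id      # a new edge in the tree was created

    def ack(self, tuple_id):
        self.val ^= tuple_id      # that tuple finished processing

    def tree_complete(self):
        return self.val == 0      # every anchored id has also been acked

acker = AckerSketch()
ids = [0xA, 0xB, 0xC]             # stand-ins for random 64-bit tuple ids
for i in ids:
    acker.anchor(i)
print(acker.tree_complete())      # -> False
for i in ids:
    acker.ack(i)
print(acker.tree_complete())      # -> True
```

Because each id appears exactly twice (once anchored, once acked), the XOR cancels out regardless of the order in which acks arrive.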
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
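The 300-tasks-over-50-workers arithmetic above can be sketched as a simple round-robin assignment (a simplification of what Storm's scheduler actually does):

```python
def assign_tasks(n_tasks, n_workers):
    """Spread task ids evenly across workers, round-robin; a simplified
    model of how a topology's tasks are packed into worker processes."""
    workers = [[] for _ in range(n_workers)]
    for task in range(n_tasks):
        workers[task % n_workers].append(task)
    return workers

workers = assign_tasks(300, 50)
print(len(workers), len(workers[0]))   # -> 50 6
```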
Resources:
Storm has a number of parameters for tuning the behavior of nimbus, supervisors, and running topologies. Some configurations are system-level and some are topology-level. Every configuration with a default value has that default defined in defaults.yaml in the Storm codebase. You can override these defaults by placing a storm.yaml on your classpath, and you can also set topology-specific configuration in code when submitting with StormSubmitter. The precedence of these configurations is: defaults.yaml < storm.yaml < topology-specific configuration.
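The precedence chain behaves like a left-to-right map merge, which a short Python sketch makes concrete (the keys shown are real Storm configuration keys; the values are made up):

```python
# Later sources override earlier ones: defaults.yaml < storm.yaml <
# topology-specific config, modeled here as a left-to-right dict merge.
defaults = {"topology.workers": 1, "topology.message.timeout.secs": 30}
storm_yaml = {"topology.workers": 4}                   # cluster-wide override
topology_conf = {"topology.message.timeout.secs": 60}  # set via the Config object

effective = {**defaults, **storm_yaml, **topology_conf}
print(effective)
# -> {'topology.workers': 4, 'topology.message.timeout.secs': 60}
```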