This series is still being updated... stay tuned.
Scalability and resilience: clusters, nodes, and shards
Elasticsearch can scale up or down as needed, because it is distributed by nature. You can grow a cluster's capacity and store more data simply by adding servers (nodes): Elasticsearch automatically shifts some of the data and query load onto the newly joined nodes, without the application having to get involved at all. It knows exactly how to balance data across the nodes of a cluster while providing scalability and high availability, and the more nodes there are, the smoother and more seamless this gets. Silky smooth — like silk stockings!
Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes. No need to overhaul your application, Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.
Why is it this smooth? What's the secret recipe? Let's crack it open and find out. Internally, an Elasticsearch index is really just a logical grouping of one or more physical shards, and each shard is a self-contained index of its own. Elasticsearch stores an index in a distributed fashion by splitting the index's documents across multiple shards, and then spreading those shards across multiple nodes. When the cluster grows or shrinks, Elasticsearch automatically migrates shards to rebalance them — which is how it keeps serving normally whether a disk fails or new servers (nodes) are added.
How does this work? Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster. As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.
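To make that concrete, here is a minimal sketch in Python with the `requests` library — assuming an unsecured local cluster at `http://localhost:9200` and a hypothetical index name `my-index`, not anything from the docs above — that creates an index with an explicit shard layout and then asks the cluster where the shards landed:

```python
import requests

ES = "http://localhost:9200"  # assumed: local cluster, security disabled

# Create an index whose documents will be split across 3 physical shards,
# with one replica copy of each primary.
requests.put(f"{ES}/my-index", json={
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    },
}).raise_for_status()

# Ask the cluster where each shard landed. On a multi-node cluster the
# shards show up spread across nodes -- no application logic required.
print(requests.get(f"{ES}/_cat/shards/my-index?v").text)
```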
Wait! Something doesn't add up. Scaling out is easy enough, but if a disk fails, aren't all the shards on it broken too? How can the cluster still serve normally? I may not be well read, but don't try to fool me!
This is where shard types come in. There are actually two kinds of shard: primary shards and replica shards (backup shards). Each document in an index belongs to one primary shard, and a replica shard is simply a copy of a primary shard.
The replica shard:
"I'm really just the legendary spare tire, ee-ya-ee-ya-yo!"
主分片和副分片一般是不在一個磁盤上的,當發生磁盤損壞時,磁盤上的主分片對應的副分片也就轉正了,這樣就解釋了爲何當磁盤損壞時Elasticsearch仍能提供服務了.另外備胎,不!是副分片也能提供讀服務,這樣就提升了集羣的讀取文檔的吞吐量.(由於同時能夠有多臺機器提供文檔讀取服務)
There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.
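You can see this redundancy from the outside via the cluster health API. A quick sketch, against the same assumed local cluster as above:

```python
import requests

health = requests.get("http://localhost:9200/_cluster/health").json()

# active_shards counts primaries plus started replicas; the difference
# is how many redundant copies are currently available to serve reads.
print("primaries:          ", health["active_primary_shards"])
print("primaries + replicas:", health["active_shards"])
print("unassigned:         ", health["unassigned_shards"])  # e.g. replicas with no second node to live on
```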
The number of primary shards in an index must be specified when the index is created and cannot be changed afterward. The number of replica shards, however, can still be changed after the index is created — and changing it does not interrupt indexing or query operations that are already running. That's the weight a spare tire carries!
The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.
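A sketch of that asymmetry, under the same local-cluster assumption and the hypothetical `my-index` from before — replica count is a dynamic setting, primary count is not:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Replica count is dynamic: this succeeds on a live index without
# interrupting indexing or searches.
r = requests.put(f"{ES}/my-index/_settings",
                 json={"index": {"number_of_replicas": 2}})
print(r.status_code)  # expect 200

# Primary count is fixed at creation time: this is rejected.
r = requests.put(f"{ES}/my-index/_settings",
                 json={"index": {"number_of_shards": 5}})
print(r.status_code)  # expect 400 (non-dynamic setting)
```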
It depends:
There are trade-offs to weigh when choosing shard sizes and the number of primary shards for an index:
The more shards you have, the more overhead there is just in maintaining them; the larger each shard is, the more data has to be moved — and the longer it takes — whenever the cluster rebalances.
You can't have your cake and eat it too.
There are a number of performance considerations and trade offs with respect to shard size and the number of primary shards configured for an index. The more shards, the more overhead there is simply in maintaining those indices. The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.
The smaller the shards, the faster each individual shard is to query — but more shards means more queries, and more queries means more overhead, so sometimes a smaller number of larger shards is actually faster. In short: it depends, so use your own judgment.
Querying lots of small shards makes the processing per shard faster, but more queries means more overhead, so querying a smaller number of larger shards might be faster. In short…it depends.
Here are a few rules of thumb, for reference only (see the sketch after these tips):
Keep shard sizes between a few GB and a few tens of GB. For time-series data, shards in the 20GB to 40GB range are common.
> Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.
Avoid having too many shards. The number of shards a node can hold is proportional to its available heap space; as a general rule, keep the number of shards per GB of heap below 20.
The best way to settle on shard size and count is to test with your own data and queries in your own use case.
The best way to determine the optimal configuration for your use case is through testing with your own data and queries.
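As a rough illustration of those two rules of thumb, here is a hypothetical audit script — the thresholds are the guideline values quoted above, not anything the API enforces, and the heap size is made up:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# List every shard with its store size in raw bytes.
shards = requests.get(f"{ES}/_cat/shards?format=json&bytes=b").json()
sizes = [int(s["store"]) for s in shards if s.get("store")]

if sizes:
    avg_gb = sum(sizes) / len(sizes) / 1024**3
    print(f"{len(sizes)} shards, average size {avg_gb:.1f} GB")
    if not 1 <= avg_gb <= 40:  # the guideline band quoted above
        print("average shard size falls outside the suggested range")

# Second rule of thumb: fewer than 20 shards per GB of node heap.
heap_gb = 8  # hypothetical heap size of one data node
print("suggested shard ceiling for that node:", 20 * heap_gb)
```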
In case of disaster
For good performance, the nodes in a cluster need to be on the same network, because migrating and rebalancing shards across data centers simply takes too long. But for high availability, you can't put all your eggs in one basket. So how do you get a second data center to take over promptly when the first suffers a major outage — ideally so smoothly that users never even notice? Use cross-cluster replication, a.k.a. CCR.
For performance reasons, the nodes within a cluster need to be on the same network. Balancing shards in a cluster across nodes in different data centers simply takes too long. But high-availability architectures demand that you avoid putting all of your eggs in one basket. In the event of a major outage in one location, servers in another location need to be able to take over. Seamlessly. The answer? Cross-cluster replication (CCR).
Cross-cluster replication automatically replicates index data from the primary cluster to a second cluster that serves as a hot backup. If the primary cluster goes down, the second cluster keeps serving. The second cluster can also handle read-only requests, so you can route users to whichever cluster is geographically closer to them.
CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.
Cross-cluster replication is an active-passive, leader-follower arrangement. The index on the primary cluster is the active leader and handles all write requests; the indices replicated to secondary clusters are read-only followers.
Cross-cluster replication is active-passive. The index on the primary cluster is the active leader index and handles all write requests. Indices replicated to secondary clusters are read-only followers.
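In API terms, the follower cluster pulls from the leader. A sketch of the two calls involved, run against the secondary cluster — the host names and index names are hypothetical, and CCR additionally requires an appropriate license and security configuration, both omitted here:

```python
import requests

FOLLOWER = "http://follower-host:9200"  # hypothetical secondary cluster

# 1. Register the leader cluster on the follower side
#    (seeds point at the leader's transport port, typically 9300).
requests.put(f"{FOLLOWER}/_cluster/settings", json={
    "persistent": {
        "cluster": {"remote": {"leader": {"seeds": ["leader-host:9300"]}}},
    },
}).raise_for_status()

# 2. Create a read-only follower index that tracks the leader's index.
requests.put(f"{FOLLOWER}/products-follower/_ccr/follow", json={
    "remote_cluster": "leader",
    "leader_index": "products",
}).raise_for_status()
```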
Care and feeding
Like any other enterprise system, Elasticsearch needs to be secured, managed, and monitored. Kibana provides a control center for managing and monitoring Elasticsearch, and features such as data rollups and index lifecycle management help you manage your data sensibly over time.
As with any enterprise system, you need tools to secure, manage, and monitor your Elasticsearch clusters. Security, monitoring, and administrative features that are integrated into Elasticsearch enable you to use Kibana as a control center for managing a cluster. Features like data rollups and index lifecycle management help you intelligently manage your data over time.
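For example, index lifecycle management boils down to a policy document. A minimal sketch — the policy name and thresholds are hypothetical — that rolls an index over before its shards outgrow the sizing guidance above, then deletes old data:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Hypothetical policy: roll over to a fresh index at roughly 40 GB
# (matching the shard-size guidance above), delete indices after 30 days.
requests.put(f"{ES}/_ilm/policy/logs-policy", json={
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "40gb"}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        },
    },
}).raise_for_status()
```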