An Architectural Analysis of Cortex: Horizontally Scaling Prometheus

Date: 2020-04-24


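Cortex was created by Weaveworks. It is an open-source time series database and monitoring system for applications and microservices. Built on Prometheus, Cortex adds horizontal scaling and virtually unlimited data retention.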

Cortex architecture diagram

The workflow in Cortex is as follows:

The role of Prometheus

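Prometheus instances scrape samples from their targets and then push them to the Cortex cluster through the remote write API. The API sends batched, Snappy-compressed Protocol Buffer messages in the body of an HTTP PUT request.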

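Cortex requires every HTTP request to carry an X-Scope-OrgID header, which holds the tenant ID within Cortex. Request authentication and authorization are handled by a reverse proxy such as Nginx.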
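To make the tenant header concrete, here is a minimal Python sketch of how a receiving service might read the tenant ID from a request. Only the header name X-Scope-OrgID comes from the text; the function name and the error type are hypothetical, and this is not Cortex's actual implementation.

```python
from typing import Mapping

TENANT_HEADER = "X-Scope-OrgID"  # header name taken from the text above


class MissingTenantError(Exception):
    """Raised when a request carries no tenant ID (hypothetical error type)."""


def tenant_id_from_headers(headers: Mapping[str, str]) -> str:
    """Return the tenant ID found in the request headers, or raise if absent."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    tenant = normalised.get(TENANT_HEADER.lower(), "").strip()
    if not tenant:
        raise MissingTenantError(f"request is missing the {TENANT_HEADER} header")
    return tenant
```

Because authentication happens in the reverse proxy, a sketch like this trusts the header as given; it only checks that the header is present.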

Prometheus itself can be scaled or sharded:

Scaling Prometheus

Sharding Prometheus

Storage

Cortex currently supports two storage engines for storing and querying time series:

Chunks (chunk storage): the default storage engine, stable.
Blocks (block storage): experimental; the approach is loosely comparable to block storage in HDFS.

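Chunks storage

This engine stores each individual time series in a separate object called a chunk. Each chunk contains the samples for a given time period (12 h by default). Chunks are then indexed by time range and labels.

The chunk storage backend we currently use is Apache Cassandra.

Internally, access to the chunks storage goes through a unified interface called the Chunks Store. Unlike the other Cortex components, this interface is not a separate service but a library embedded in the services that need access to long-term storage: ingester, querier and ruler.

The chunk and index formats in Cortex are versioned, which means a cluster can be upgraded to take advantage of new features. This strategy also allows the storage format to change without downtime and without a complex procedure to rewrite the stored data.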

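Blocks storage

This engine is still experimental. It is based on Prometheus TSDB: each tenant's time series are stored in that tenant's own TSDB, which then writes the series out to on-disk blocks. Each block is composed of a few files storing the chunks and the block index.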

Service

Cortex is in fact a microservices architecture. Its services are:

Distributor
Ingester
Querier
Query Frontend
Ruler
Alertmanager
Config API

Distributor

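The distributor is the entry point for data arriving from Prometheus; as its name suggests, it distributes that data onward. After receiving data from Prometheus, the distributor validates each sample and checks that it is within the tenant's limits, falling back to the default limits if none have been overridden for that tenant. Valid samples are then split into batches and sent to multiple ingesters in parallel.
The validations performed by the distributor include: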

The metric label names are formally correct
The configured maximum number of labels per metric is respected
The configured maximum length of a label name and value is respected
The timestamp is not older or newer than the configured min/max time range
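The four checks listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the distributor's real code: the Limits class, its field names and its default values are hypothetical placeholders, and the regular expression is the standard Prometheus label-name pattern.

```python
import re
from dataclasses import dataclass
from typing import List, Mapping

# Prometheus label names: a letter or underscore, then letters, digits, underscores.
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


@dataclass(frozen=True)
class Limits:
    # All defaults are illustrative placeholders, not Cortex's defaults.
    max_labels_per_series: int = 30
    max_label_name_length: int = 1024
    max_label_value_length: int = 2048
    max_past_seconds: float = 14 * 24 * 3600.0
    max_future_seconds: float = 600.0


def validate_sample(labels: Mapping[str, str], timestamp: float, now: float,
                    limits: Limits = Limits()) -> List[str]:
    """Return a list of validation errors; an empty list means the sample is valid."""
    errors: List[str] = []
    if len(labels) > limits.max_labels_per_series:
        errors.append("too many labels")
    for name, value in labels.items():
        if not LABEL_NAME_RE.fullmatch(name):
            errors.append(f"invalid label name: {name!r}")
        if len(name) > limits.max_label_name_length:
            errors.append(f"label name too long: {name!r}")
        if len(value) > limits.max_label_value_length:
            errors.append(f"label value too long for {name!r}")
    # Timestamps and limits are both expressed in seconds here for simplicity.
    if timestamp < now - limits.max_past_seconds:
        errors.append("timestamp too old")
    if timestamp > now + limits.max_future_seconds:
        errors.append("timestamp too new")
    return errors
```

Returning a list of errors rather than raising on the first failure makes it easy to report every problem with a sample at once; a real distributor may well choose to stop at the first one.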

The distributor is stateless and can be scaled up or down as needed.

High Availability Tracker

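The distributor includes an HA tracker, which it uses to deduplicate redundant samples coming from Prometheus. In other words, multiple HA replicas of a Prometheus server can write the same data to Cortex, and the distributor removes the duplicates.

The HA tracker deduplicates incoming samples based on a cluster label and a replica label. The cluster label uniquely identifies a cluster of redundant Prometheus servers for a given tenant, while the replica label uniquely identifies a replica within that Prometheus cluster. Incoming samples are considered duplicates (and are therefore dropped) if they come from any replica that is not the current primary within its cluster.

The HA tracker needs a key-value (KV) store to coordinate which replica is currently elected. The distributor only accepts samples from the current leader. By default, samples that do not carry the replica and cluster labels are accepted and are never deduplicated.

The KV stores currently supported for the HA tracker are: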

Consul
Etcd
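The deduplication rule can be sketched as follows. This is a simplified illustration: an in-memory dict stands in for the Consul or Etcd KV store, the label names are assumptions, and leader failover (re-electing a replica when the current one stops sending) is deliberately left out.

```python
from typing import Dict, Mapping, Tuple

CLUSTER_LABEL = "cluster"      # assumed label name, for illustration only
REPLICA_LABEL = "__replica__"  # assumed label name, for illustration only


class HATracker:
    """Tracks the elected replica per (tenant, cluster) pair."""

    def __init__(self) -> None:
        # Stand-in for the KV store: (tenant, cluster) -> elected replica.
        self._elected: Dict[Tuple[str, str], str] = {}

    def accept(self, tenant: str, labels: Mapping[str, str]) -> bool:
        """Return True if the sample should be kept, False if it is a duplicate."""
        cluster = labels.get(CLUSTER_LABEL)
        replica = labels.get(REPLICA_LABEL)
        # Samples without both labels are accepted and never deduplicated.
        if cluster is None or replica is None:
            return True
        # The first replica seen for a cluster becomes its leader in this sketch.
        elected = self._elected.setdefault((tenant, cluster), replica)
        return replica == elected
```

Keying the election on the tenant as well as the cluster keeps two tenants that happen to use the same cluster name from interfering with each other.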

Hashing

The distributor uses consistent hashing to decide which ingester receives a given series. Cortex supports two hashing strategies:

Hash the metric name and tenant ID (the default)
Hash the metric name, labels and tenant ID (enabled with -distributor.shard-by-all-labels=true)
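The difference between the two strategies comes down to what goes into the hashed key. The sketch below illustrates that; the function names are hypothetical, and SHA-256 truncated to 32 bits is used purely as a convenient stand-in, not as the hash function Cortex actually uses.

```python
import hashlib
from typing import Mapping


def shard_key(tenant: str, labels: Mapping[str, str],
              shard_by_all_labels: bool = False) -> bytes:
    """Build the byte string that gets hashed for a series."""
    parts = [tenant, labels.get("__name__", "")]
    if shard_by_all_labels:
        # Sort so that label ordering never changes the resulting key.
        parts += [f"{name}={value}" for name, value in sorted(labels.items())
                  if name != "__name__"]
    return "\x00".join(parts).encode("utf-8")


def shard_token(tenant: str, labels: Mapping[str, str],
                shard_by_all_labels: bool = False) -> int:
    """Map a series to a 32-bit position on the ring (stand-in hash)."""
    digest = hashlib.sha256(shard_key(tenant, labels, shard_by_all_labels)).digest()
    return int.from_bytes(digest[:4], "big")
```

With the default strategy, every series of one metric for one tenant lands on the same token; hashing all labels spreads those series across the ring instead.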

Hash Ring

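The hash ring is stored in a KV store and provides the consistent hashing used for series sharding and replication. Each ingester registers itself, together with the set of tokens it owns, in this DHT (distributed hash table), which is the hash ring.

The KV stores currently supported for the hash ring are: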

Consul
Etcd
Gossip memberlist (experimental)
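A minimal sketch of a token ring shows how sharding and replication fall out of the same lookup. The class and method names are hypothetical, the ring lives in memory rather than in a KV store, and the replication factor is a parameter of the sketch rather than a value taken from the text.

```python
import bisect
from typing import Dict, List


class HashRing:
    """In-memory token ring: each token is owned by exactly one ingester."""

    def __init__(self, replication_factor: int = 3) -> None:
        self.replication_factor = replication_factor
        self._tokens: List[int] = []      # kept sorted for binary search
        self._owner: Dict[int, str] = {}  # token -> ingester name

    def register(self, ingester: str, tokens: List[int]) -> None:
        """Add an ingester and the tokens it owns to the ring."""
        for token in tokens:
            self._owner[token] = ingester
        self._tokens = sorted(self._owner)

    def replicas_for(self, series_token: int) -> List[str]:
        """Walk clockwise from the series token, collecting distinct ingesters."""
        if not self._tokens:
            return []
        start = bisect.bisect_left(self._tokens, series_token)
        replicas: List[str] = []
        for step in range(len(self._tokens)):
            token = self._tokens[(start + step) % len(self._tokens)]
            owner = self._owner[token]
            if owner not in replicas:
                replicas.append(owner)
                if len(replicas) == self.replication_factor:
                    break
        return replicas
```

For example, with tokens 100 and 400 owned by "a", 200 by "b" and 300 by "c", and a replication factor of 2, a series hashed to 150 goes to "b" and "c", while a series hashed to 450 wraps around to "a" and "b".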

Quorum Consistency

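Because all distributors share the same hash ring, a request can be sent to any distributor, so a stateless load balancer can be placed in front of them.

To keep query results consistent, Cortex uses Dynamo-style quorum consistency on reads and writes. This means that, before responding successfully to a Prometheus write request, the distributor waits for a positive response from at least one half plus one of the ingesters to which the sample was sent.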
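The "one half plus one" rule is simple arithmetic, sketched below with hypothetical helper names. With three replicas the quorum is two, so a write still succeeds when one ingester is unavailable.

```python
from typing import Iterable


def quorum_size(replication_factor: int) -> int:
    """One half plus one of the replicas, using integer division."""
    return replication_factor // 2 + 1


def write_succeeded(acks: Iterable[bool]) -> bool:
    """True when enough ingesters acknowledged the write to form a quorum."""
    results = list(acks)
    # Each True is a positive response from one ingester.
    return sum(results) >= quorum_size(len(results))
```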

Ingester

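The ingester is responsible for writing incoming series to a long-term storage backend on the write path, and for returning the series it holds in memory to queries on the read path.

Incoming series are not written to storage immediately. They are kept in memory and flushed to storage periodically (by default every 12 h for chunks storage and every 2 h for blocks storage). For this reason, queriers executing a query on the read path may need to fetch samples from both the ingesters and long-term storage.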
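Since recent samples live in ingester memory and older ones in long-term storage, a read has to merge the two sources. The following is a minimal sketch of that merge for a single series; the function name and the sample representation are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, float]  # (timestamp in milliseconds, value)


def merge_samples(from_ingesters: Iterable[Sample],
                  from_storage: Iterable[Sample]) -> List[Sample]:
    """Merge samples of one series from both sources, deduplicated by timestamp."""
    merged: Dict[int, float] = {}
    for timestamp, value in from_storage:
        merged[timestamp] = value
    # A sample already flushed to storage may still be in ingester memory,
    # so the same timestamp can appear in both inputs.
    for timestamp, value in from_ingesters:
        merged[timestamp] = value
    return sorted(merged.items())
```

Keying on the timestamp handles the overlap window in which a sample exists in both places.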

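Each ingester contains a lifecycler, which manages the ingester's lifecycle and stores the ingester's state in the hash ring. An ingester can be in one of the following states:

PENDING: see the description below.
JOINING: see the description below.
ACTIVE: the ingester is fully initialized. It may receive both write and read requests for the tokens it owns.
LEAVING: see the description below.
UNHEALTHY: the ingester has failed to heartbeat to the ring's KV store. While it is in this state, distributors skip the ingester when building the replication set for incoming series, and the ingester receives neither write nor read requests.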

PENDING explained

`
PENDING is an ingester's state when it just started and is waiting for a hand-over from another ingester that is LEAVING. If no hand-over occurs within the configured timeout period ("auto-join timeout", configurable via -ingester.join-after option), the ingester will join the ring with a new set of random tokens (i.e. during a scale up). When hand-over process starts, state changes to JOINING.
`

JOINING explained

`
JOINING is an ingester’s state in two situations. First, ingester will switch to a
JOINING state from PENDING state after auto-join timeout. In this case, ingester
will generate tokens, store them into the ring, optionally observe the ring for
token conflicts and then move to ACTIVE state. Second, ingester will also switch
into a JOINING state as a result of another LEAVING ingester initiating a hand-over
process with PENDING (which then switches to JOINING state). JOINING ingester then
receives series and tokens from LEAVING ingester, and if everything goes well,
JOINING ingester switches to ACTIVE state. If hand-over process fails, JOINING
ingester will move back to PENDING state and either wait for another hand-over or
auto-join timeout.
`

LEAVING explained

`
LEAVING is an ingester’s state when it is shutting down. It cannot receive write requests anymore, while it could still receive read requests for series it has in memory. While in this state, the ingester may look for a PENDING ingester to start a hand-over process with, used to transfer the state from LEAVING ingester to the PENDING one, during a rolling update (PENDING ingester moves to JOINING state during hand-over process). If there is no new ingester to accept hand-over, ingester in LEAVING state will flush data to storage instead.
`
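The state descriptions above amount to a small state machine, sketched here in Python. The transition table is read off the quoted descriptions, with ACTIVE to LEAVING inferred from "shutting down". UNHEALTHY is left out of the table because it results from missed heartbeats rather than from a lifecycle step. Treating PENDING and JOINING as not serving requests is an assumption of this sketch, and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    JOINING = "JOINING"
    ACTIVE = "ACTIVE"
    LEAVING = "LEAVING"
    UNHEALTHY = "UNHEALTHY"


# Lifecycle transitions read off the descriptions above.
ALLOWED = {
    State.PENDING: {State.JOINING},                # auto-join timeout or hand-over start
    State.JOINING: {State.ACTIVE, State.PENDING},  # success, or hand-over failure
    State.ACTIVE: {State.LEAVING},                 # shutting down
    State.LEAVING: set(),
    State.UNHEALTHY: set(),
}


def transition(current: State, target: State) -> State:
    """Return the target state, or raise if the lifecycle does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def accepts_writes(state: State) -> bool:
    # Only ACTIVE is described as receiving writes; the rest is assumed False.
    return state is State.ACTIVE


def accepts_reads(state: State) -> bool:
    # LEAVING can still serve reads for the series it holds in memory.
    return state in (State.ACTIVE, State.LEAVING)
```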

Ingesters are semi-stateful.