Kafka Cluster Broker-Side Parameter Settings and Tuning Guidelines - Kafka in Commercial Production Practice

Date: 2019-12-08



1 Distributed streaming platform

Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
   -  Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
   -  Store streams of records in a fault-tolerant durable way.
   -  Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:

- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.

2 Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them 
is effectively acting as a storage system for the in-flight messages. What is 
different about Kafka is that it is a very good storage system.

- Data written to Kafka is written to disk and replicated for fault-tolerance. 
Kafka allows producers to wait on acknowledgement so that a write isn't considered
complete until it is fully replicated and guaranteed to persist even if the server 
written to fails.

- The disk structures Kafka uses scale well; Kafka will perform the same whether you 
have 50 KB or 50 TB of persistent data on the server.

- As a result of taking storage seriously and allowing the clients to control 
their read position, you can think of Kafka as a kind of special purpose 
distributed filesystem dedicated to high-performance, low-latency commit 
log storage, replication, and propagation.

3 The secret behind Kafka's high throughput

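Consider a user program that wants to send the contents of a file over the network. The program runs in user space, while the file and the network socket are hardware resources, and kernel space sits between the two. Inside the operating system, the whole process is therefore: the data is read from disk into a kernel buffer, copied into the user-space buffer, copied again into the kernel's socket buffer, and finally handed to the network card.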

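Since Linux kernel 2.2 there has been a system-call mechanism known as "zero-copy". It skips the copy into the user buffer by establishing a direct mapping between disk space and memory, so the data is no longer copied into a user-space buffer.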
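The zero-copy idea can be tried out with a minimal Python sketch. It relies on `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and falls back to ordinary `send()` calls otherwise. Kafka itself is a JVM application and uses Java's `FileChannel.transferTo` for this purpose, so the code below only illustrates the system-call pattern and is not Kafka's implementation. The payload, the temporary file, and the local socket pair are assumptions made for the demo.

```python
import socket
import tempfile

PAYLOAD = b"kafka log segment bytes\n" * 4  # stand-in for on-disk log data


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send the whole file over a blocking stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # No read() into a Python buffer: the kernel moves the bytes from the
        # page cache to the socket when os.sendfile() is available.
        return sock.sendfile(f)


def demo() -> tuple[int, bytes]:
    """Write PAYLOAD to a temp file, ship it through a local socket pair."""
    with tempfile.NamedTemporaryFile(suffix=".log") as tmp:
        tmp.write(PAYLOAD)
        tmp.flush()
        sender, receiver = socket.socketpair()
        with sender, receiver:
            sent = send_file_zero_copy(tmp.name, sender)
            sender.shutdown(socket.SHUT_WR)  # signal end of stream
            received = b""
            while chunk := receiver.recv(4096):
                received += chunk
    return sent, received
```

Calling `demo()` returns the number of bytes sent together with the bytes the receiving end read back.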

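A Kafka topic (queue) is divided into multiple partitions, and each partition is further divided into multiple segments, so the messages of one queue are actually stored across many segment files. With this segmentation, every file operation touches only a small file, which increases parallel processing capability.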
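To make the segment layout concrete, here is a small sketch of how a reader could locate the segment file that holds a given offset. It assumes the usual on-disk convention in which each segment file is named after the offset of its first message, zero-padded to 20 digits. The base offsets in the usage note are made-up illustration values.

```python
from bisect import bisect_right


def segment_file_name(base_offset: int) -> str:
    """Name of the segment file whose first message has `base_offset`."""
    return f"{base_offset:020d}.log"


def find_segment(base_offsets: list[int], offset: int) -> str:
    """Return the segment file containing `offset`.

    `base_offsets` must be sorted ascending. The right segment is the one
    with the largest base offset that is <= the requested offset.
    """
    idx = bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} is below the first segment")
    return segment_file_name(base_offsets[idx])
```

For a partition with segments starting at offsets 0, 368769 and 737337, `find_segment([0, 368769, 737337], 400000)` resolves to the file `00000000000000368769.log`, so only that one small file needs to be opened.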

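Kafka allows messages to be sent in batches: messages are first buffered in memory and then sent out together in a single request. For example, you can send once the buffered messages reach a certain amount, or once they have been buffered for a fixed time, such as every 100 messages or every 5 seconds. This strategy greatly reduces the number of I/O operations on the server.
Kafka also supports compressing message sets. The producer can compress a message set with GZIP, Snappy, or LZ4. The benefit of compression is that less data is transferred, which relieves pressure on the network.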
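The batching and compression strategy can be sketched in a few lines of Python. This is a toy model, not the Kafka producer: the class `BatchingSender`, its parameters `max_records` and `linger_seconds`, and the newline-joined payload format are all hypothetical names chosen for illustration. In the real producer the corresponding knobs are `batch.size` (measured in bytes, not message count) and `linger.ms`.

```python
import gzip
import time
from typing import Callable, List, Optional


class BatchingSender:
    """Buffer records and flush them as one gzip-compressed request."""

    def __init__(self, max_records: int = 100, linger_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_records = max_records
        self.linger_seconds = linger_seconds
        self.clock = clock
        self.buffer: List[bytes] = []
        self.first_append_at: Optional[float] = None
        self.requests: List[bytes] = []  # each entry stands for one network request

    def send(self, record: bytes) -> None:
        if not self.buffer:
            self.first_append_at = self.clock()
        self.buffer.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush when the batch is full or has waited long enough."""
        if not self.buffer:
            return
        full = len(self.buffer) >= self.max_records
        expired = self.clock() - self.first_append_at >= self.linger_seconds
        if full or expired:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        payload = b"\n".join(self.buffer)
        self.requests.append(gzip.compress(payload))
        self.buffer.clear()
        self.first_append_at = None
```

Sending 250 records through a sender configured with `max_records=100` produces two full requests and leaves 50 records buffered until the time limit expires or `flush()` is called, which is the reduction in request count the text describes.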

4 Global broker-side parameters for a Kafka cluster

broker.id

A unique integer that identifies each broker. It must not conflict with the ID of any other broker, and starting from 0 is recommended.

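log.dirs <= improves throughput

Make sure this directory has plenty of disk space. To specify multiple directories, separate them with commas, for example /xin/kafka1,/xin/kafka2. The benefit is that Kafka tries to spread partition data evenly across the directories. If the directories sit on several mounted disks, multiple disk heads can write at the same time, which gives a very strong boost to throughput.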
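The "spread partitions evenly" behaviour can be approximated with a short sketch that always places the next partition in the directory currently holding the fewest partitions. This is a simplified model under that assumption and not the broker's actual code. The function names and the partition names in the usage note are hypothetical.

```python
def parse_log_dirs(value: str) -> list[str]:
    """Split a comma-separated log.dirs value into a list of directories."""
    return [d.strip() for d in value.split(",") if d.strip()]


def pick_log_dir(partition_counts: dict[str, int]) -> str:
    """Return the directory holding the fewest partitions (ties by name)."""
    return min(partition_counts, key=lambda d: (partition_counts[d], d))


def place_partitions(log_dirs: list[str],
                     partitions: list[str]) -> dict[str, list[str]]:
    """Assign each partition to the least-loaded directory in turn."""
    placement: dict[str, list[str]] = {d: [] for d in log_dirs}
    for partition in partitions:
        counts = {d: len(ps) for d, ps in placement.items()}
        placement[pick_log_dir(counts)].append(partition)
    return placement
```

With `log.dirs=/xin/kafka1,/xin/kafka2` and four partitions `orders-0` to `orders-3`, the sketch alternates between the two directories, so each disk ends up with two partitions.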

zookeeper.connect

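This parameter has no default value at all and must be configured. It can also be a comma-separated list, for example zk1:2181,zk2:2181,zk3:2181/kafka. Note the trailing /kafka: it is the ZooKeeper chroot, which is optional. If it is not specified, the ZooKeeper root path is used by default.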
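The structure of the connection string, a host list followed by an optional chroot, can be illustrated with a small parser. This is only a sketch for reading the format. The function name is hypothetical, and falling back to port 2181 (ZooKeeper's conventional client port) when an entry omits the port is an assumption of this example.

```python
def parse_zookeeper_connect(value: str) -> tuple[list[tuple[str, int]], str]:
    """Split 'h1:p1,h2:p2/chroot' into ([(host, port), ...], chroot)."""
    hosts_part, slash, chroot = value.partition("/")
    hosts = []
    for entry in hosts_part.split(","):
        host, _, port = entry.strip().partition(":")
        # Assumption: use 2181 when the entry has no explicit port.
        hosts.append((host, int(port) if port else 2181))
    # Without a chroot the ZooKeeper root path "/" is used.
    return hosts, ("/" + chroot if slash else "/")
```

Parsing `zk1:2181,zk2:2181,zk3:2181/kafka` yields three host/port pairs and the chroot `/kafka`, while the same string without the suffix yields the root path `/`.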

listeners

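Supported protocols include PLAINTEXT, SSL, SASL_SSL, and others. The format is [protocol]://[hostname]:[port],[[protocol]://[hostname]:[port]]. This parameter defines the listener ports that the broker opens to clients. Recommended settings: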

PLAINTEXT://hostname:port (security authentication not enabled)
SSL://hostname:port (security authentication enabled)
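A short sketch shows how a listeners value in this format breaks down into its parts. The parser is written by hand because protocol names such as SASL_SSL contain an underscore, which generic URL parsers do not accept in a scheme. The function name, the host name `broker1`, and the ports in the usage note are hypothetical.

```python
def parse_listeners(value: str) -> list[tuple[str, str, int]]:
    """Split 'PROTO://host:port,...' into (protocol, host, port) tuples."""
    listeners = []
    for item in value.split(","):
        protocol, sep, address = item.strip().partition("://")
        if not sep:
            raise ValueError(f"missing '://' in listener {item!r}")
        # rpartition keeps everything before the last ':' as the host,
        # so an empty host (bind to all interfaces) is also accepted.
        host, _, port = address.rpartition(":")
        listeners.append((protocol, host, int(port)))
    return listeners
```

For example, `parse_listeners("PLAINTEXT://broker1:9092,SSL://broker1:9093")` returns one plaintext listener and one SSL listener on the same host.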

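unclean.leader.election.enable <= data integrity guarantee

This parameter covers the case where the ISR is empty and the leader has also gone down. How should a new leader be chosen then? As of Kafka 1.0.0 the parameter defaults to false, meaning that a broker outside the ISR may not be elected leader. Between high availability and data integrity, the Kafka project chose the latter.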
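The trade-off can be expressed as a small decision function. This is a simplified model of the election rule described above, not the controller's real code. The function name and the broker IDs in the usage note are hypothetical.

```python
from typing import Optional


def elect_leader(assigned_replicas: list[int], isr: set[int],
                 live_brokers: set[int],
                 unclean_enabled: bool = False) -> Optional[int]:
    """Pick a new partition leader, or None if the partition stays offline."""
    # Clean election: only a live replica that is still in the ISR qualifies.
    for broker in assigned_replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # Unclean election: accept any live replica, at the risk of losing the
    # messages that this replica had not yet copied from the old leader.
    if unclean_enabled:
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None
```

With replicas `[1, 2, 3]`, an ISR of `{1}`, and only brokers 2 and 3 alive, the function returns `None` by default (the partition is unavailable but no data is lost) and returns `2` when unclean election is enabled (available again, but possibly missing messages).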

delete.topic.enable

This simply controls whether topics may be deleted. Since 0.9.0.0 added an ACL-based permission mechanism, accidental deletions are largely ruled out.

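log.retention.{hours|minutes|ms} <= time dimension

The ms setting takes precedence, followed by minutes, then hours. The default retention period is 7 days. How expiry is determined:

Newer versions: based on the timestamps in the messages. Older versions: based on the last modification time of the log file.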
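The precedence rule can be written down as a small resolver. The sketch assumes the settings are available as a plain dictionary of integers and uses 168 hours (7 days) as the fallback. The function names are hypothetical, and special values such as -1 are not handled.

```python
def effective_retention_ms(config: dict[str, int]) -> int:
    """Resolve the retention time: ms wins over minutes, minutes over hours."""
    if "log.retention.ms" in config:
        return config["log.retention.ms"]
    if "log.retention.minutes" in config:
        return config["log.retention.minutes"] * 60 * 1000
    # Fall back to hours; 168 hours is the 7-day default.
    return config.get("log.retention.hours", 168) * 60 * 60 * 1000


def segment_expired(newest_timestamp_ms: int, now_ms: int,
                    retention_ms: int) -> bool:
    """Timestamp-based check: expired once the newest record is too old."""
    return now_ms - newest_timestamp_ms > retention_ms
```

An empty configuration resolves to 604800000 ms (7 days). If both `log.retention.hours` and `log.retention.minutes` are set, the minutes value is the one that takes effect.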

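log.retention.bytes <= size dimension

Kafka periodically deletes the oldest log files of a partition once the size of its log exceeds this value. The default is -1, meaning Kafka never deletes logs based on size.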
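One plausible reading of size-based retention is sketched below: walk the segments from oldest to newest and drop a segment only if the log would still hold at least the configured number of bytes afterwards. This is a simplified model under that assumption and not a copy of the broker's code. The function name and segment names are hypothetical.

```python
def segments_to_delete(segments: list[tuple[str, int]],
                       retention_bytes: int) -> list[str]:
    """Return names of the oldest segments that can be removed.

    `segments` is a list of (name, size_in_bytes), ordered oldest first.
    """
    if retention_bytes < 0:  # -1 disables size-based deletion
        return []
    excess = sum(size for _, size in segments) - retention_bytes
    doomed = []
    for name, size in segments:
        if excess - size < 0:
            break  # deleting this one would shrink the log below the limit
        doomed.append(name)
        excess -= size
    return doomed
```

With four 100-byte segments and a limit of 250 bytes, only the oldest segment is removed, leaving 300 bytes on disk. With the default of -1 nothing is ever removed because of size.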

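min.insync.replicas <= used together with acks=-1

This is the durability level: the minimum number of replicas that must be in sync. It only has meaning when acks=all (or -1). min.insync.replicas specifies the minimum number of replicas that must acknowledge a write request. If that cannot be satisfied, the producer throws a NotEnoughReplicas or NotEnoughReplicasAfterAppend exception. The parameter is used to achieve stronger message durability.

For example:

5 brokers, acks = -1, min.insync.replicas = 3

This means that at least 3 replicas must be in sync before the broker accepts a write; otherwise an exception is thrown. Assuming the topic has one replica on each of the 5 brokers, if 3 brokers go down the write fails even though the remaining 2 replicas are fully in sync and acks = -1 would otherwise be satisfied.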
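The interaction between acks and min.insync.replicas in this example can be modelled with a few lines of Python. The exception class and function are hypothetical stand-ins for illustration and are not part of any Kafka client library. The sketch assumes the topic has 5 replicas, one per broker.

```python
class NotEnoughReplicasError(Exception):
    """Illustrative stand-in for the broker's NotEnoughReplicas error."""


def check_write(acks, isr_size: int, min_insync_replicas: int) -> bool:
    """Accept or reject a produce request based on the current ISR size."""
    # min.insync.replicas is only enforced for acks=all (also written -1).
    if acks in (-1, "all") and isr_size < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR has {isr_size} replicas, need {min_insync_replicas}")
    return True
```

With all 5 brokers up, `check_write(-1, 5, 3)` succeeds. After 3 brokers fail, only 2 replicas remain in sync, so `check_write(-1, 2, 3)` raises the error, while a producer using `acks=1` would still be accepted because the limit is not enforced for it.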