Kafka itself does not provide a good way to monitor consumer lag. Although the Kafka broker ships with scripts such as kafka-topics.sh, kafka-consumer-groups.sh and kafka-console-consumer.sh, collecting lag with these scripts is quite unreliable on large production clusters.
Burrow is actively developed by the Streaming SRE team in LinkedIn's data infrastructure organization. It is written in Go, released under the Apache License, and hosted on GitHub.
It collects information about the consumer groups in a cluster and computes a single status for each group, telling us whether the group is running normally, whether it is falling behind, and whether it is slowing down or has stopped working, thereby monitoring consumer state. It does not need thresholds to monitor group progress, although users can still obtain message lag figures from it.
Burrow automatically monitors all consumers and every partition they consume. It obtains consumer offsets by consuming the special internal Kafka offsets topic. Burrow then exposes this consumer information as a centralized service, separate from any single consumer. Consumer status is determined by evaluating the consumer's behavior over a sliding window.
This information is broken down into a status per partition and then aggregated into a single status for the consumer. The consumer status can be OK, WARNING (the consumer is working but falling behind) or ERROR (the consumer has stopped consuming or is offline). This status can be fetched from Burrow with a simple HTTP request, or Burrow can check it periodically and push notifications out via email or to a separate HTTP endpoint (for example a monitoring or alerting system).
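As a rough illustration of the sliding-window idea, a group's status could be derived from its most recent samples like this. This is a simplification for illustration only, not Burrow's actual evaluation rules:

```python
# Simplified sketch of sliding-window consumer evaluation (NOT Burrow's
# exact algorithm): look at the last N (committed_offset, lag) samples
# for a partition and derive one status string.
def classify(window):
    """window: list of (offset, lag) tuples, oldest sample first."""
    offsets = [o for o, _ in window]
    lags = [l for _, l in window]
    if all(l == 0 for l in lags):
        return "OK"          # no lag anywhere in the window
    if offsets[-1] == offsets[0]:
        return "STALL"       # commits are not advancing, lag is non-zero
    if lags[-1] >= lags[0]:
        return "WARN"        # offsets moving, but lag keeps growing
    return "OK"              # moving and catching up

print(classify([(100, 5), (100, 5), (100, 6)]))  # STALL
```

Burrow applies rules of this general shape per partition and then rolls the partition statuses up into the single group status described above.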
Burrow monitors the lag of a consumer's message consumption, and thus the health of the application, and it can monitor multiple Kafka clusters at the same time. Its HTTP service for reporting information about Kafka clusters and consumers is separate from the lag status, which makes it very useful for applications that manage a Kafka cluster in environments where a Java Kafka client cannot be run.
Burrow is developed in Go, and version v1.1 has already been released.
A Docker image of Burrow is also provided.
Burrow_1.2.2_checksums.txt 297 Bytes
Burrow_1.2.2_darwin_amd64.tar.gz 4.25 MB
Burrow_1.1.0_linux_amd64.tar.gz 3.22 MB (CentOS 6)
Burrow_1.2.2_linux_amd64.tar.gz 4.31 MB (CentOS 7, requires GLIBC >= 2.14)
Burrow_1.2.2_windows_amd64.tar.gz 4 MB
This release contains several important fixes for issues found in the initial 1.0.0 release.
It also includes some minor feature updates.
Changelog - version 1.2

[d244fce922] - Bump sarama to 1.20.1 (Vlad Gorodetsky)
[793430d249] - Golang 1.9.x is no longer supported (Vlad Gorodetsky)
[735fcb7c82] - Replace deprecated megacheck with staticcheck (Vlad Gorodetsky)
[3d49b2588b] - Link the README to the Compose file in the project (Jordan Moore)
[3a59b36d94] - Tests fixed (Mikhail Chugunkov)
[6684c5e4db] - Added unit test for v3 value decoding (Mikhail Chugunkov)
[10d4dc39eb] - Added v3 messages protocol support (Mikhail Chugunkov)
[d6b075b781] - Replace deprecated MAINTAINER directive with a label (Vlad Gorodetsky)
[52606499a6] - Refactor parseKafkaVersion to reduce method complexity (gocyclo) (Vlad Gorodetsky)
[b0440f9dea] - Add gcc to build zstd (Vlad Gorodetsky)
[6898a8de26] - Add libc-dev to build zstd (Vlad Gorodetsky)
[b81089aada] - Add support for Kafka 2.1.0 (Vlad Gorodetsky)
[cb004f9405] - Build with Go 1.11 (Vlad Gorodetsky)
[679a95fb38] - Fix golint import path (golint fixer)
[f88bb7d3a8] - Update docker-compose Readme section with working url. (Daniel Wojda)
[3f888cdb2d] - Upgrade sarama to support Kafka 2.0.0 (#440) (daniel)
[1150f6fef9] - Support linux/arm64 using Dup3() instead of Dup2() (Mpampis Kostas)
[1b65b4b2f2] - Add support for Kafka 1.1.0 (#403) (Vlad Gorodetsky)
[74b309fc8d] - code coverage for newly added lines (Clemens Valiente)
[279c75375c] - accidentally reverted this (Clemens Valiente)
[192878c69c] - gofmt (Clemens Valiente)
[33bc8defcd] - make first regex test case a proper match everything (Clemens Valiente)
[279b256b27] - only set whitelist / blacklist if it's not empty string (Clemens Valiente)
[b48d30d18c] - naming (Clemens Valiente)
[7d6c6ccb03] - variable naming (Clemens Valiente)
[4e051e973f] - add tests (Clemens Valiente)
[545bec66d0] - add blacklist for memory store (Clemens Valiente)
[07af26d2f1] - Updated burrow endpoint in README : #401 (Ratish Ravindran)
[fecab1ea88] - pass custom headers to http notifications. (#357) (vixns)

Changelog - version 1.1

fecab1e pass custom headers to http notifications. (#357)
7c0b8b1 Add minimum-complete config for the evaluator (#388)
dc4cb84 Fix mail template (#369)
e2216d7 Fetch goreleaser via curl instead of 'go get' as compilation only works in 1.10 (#387)
f3659d1 Add a send-interval configuration parameter (#364)
3e488a2 Allow env vars to be used for configuration (#363)
b7428c9 Fix typo in slack close (#361)
5b546cc Create the broker offset rings earlier (#360)
61f097a Metadata refresh on detecting a deleted topic must not be for that topic (#359)
b890885 Make inmemory module request channel's size configurable (#352)
9911709 Update sarama to support 10.2.1 too. (#345)
a1bdcde Adjusting docker build to be self-contained (#344)
a91cf4d Fix an incorrect cast from #338 and add a test to cover it (#340)
389ef47 Store broker offset history (#338)
1a60efe Fix alert closing (#334)
b75a6f3 Fix typo in Cluster reference
cacf05e Reject offsets that are older than the group expiration time (#330)
b6184ff Fix typo in the config checked for TLS no-verify #316 (#329)
3b765ea Sync Gopkg.lock with Gopkg.toml (#312)
e47ec4c Fix ZK watch problem (#328)
846d785 Assume backward-compatible consumer protocol version (fix #313) (#327)
e3a1493 Update sarama to support Kafka 1.0.0 (#306)
946a425 Fixing requests for StorageFetchConsumersForTopic (#310)
52e3e5d Update burrow.toml (#300)
3a4372f Upgrade sarama dependency to support Kafka 0.11.0 (#297)
8993eb7 Fix goreleaser condition (#299)
d088c99 Add gitter webhook to travis config (#296)
08e9328 Merge branch 'gitter-badger-gitter-badge'
76db0a9 Fix positioning
dddd0ea Add Gitter badge
Burrow can be installed either by compiling from source or by using the official binary packages. The binary package approach is recommended here.
Burrow keeps no local state; it is a CPU-intensive and network-IO-intensive application.
# wget https://github.com/linkedin/Burrow/releases/download/v1.1.0/Burrow_1.1.0_linux_amd64.tar.gz
# mkdir burrow
# tar -xf Burrow_1.1.0_linux_amd64.tar.gz -C burrow
# cp burrow/burrow /usr/bin/
# mkdir /etc/burrow
# cp burrow/config/* /etc/burrow/
# chkconfig --add burrow
# /etc/init.d/burrow start
[general]
pidfile="/var/run/burrow.pid"
stdout-logfile="/var/log/burrow.log"
access-control-allow-origin="mysite.example.com"

[logging]
filename="/var/log/burrow.log"
level="info"
maxsize=512
maxbackups=30
maxage=10
use-localtime=true
use-compression=true

[zookeeper]
servers=[ "test1.localhost:2181","test2.localhost:2181" ]
timeout=6
root-path="/burrow"

[client-profile.prod]
client-id="burrow-lagchecker"
kafka-version="0.10.0"

[cluster.production]
class-name="kafka"
servers=[ "test1.localhost:9092","test2.localhost:9092" ]
client-profile="prod"
topic-refresh=180
offset-refresh=30

[consumer.production_kafka]
class-name="kafka"
cluster="production"
servers=[ "test1.localhost:9092","test2.localhost:9092" ]
client-profile="prod"
start-latest=false
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
group-whitelist=""

[consumer.production_consumer_zk]
class-name="kafka_zk"
cluster="production"
servers=[ "test1.localhost:2181","test2.localhost:2181" ]
# If specified, this is the root of the Kafka cluster metadata in the
# Zookeeper ensemble. If not specified, the root path is used.
#zookeeper-path="/"
zookeeper-timeout=30
group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"
group-whitelist=""

[httpserver.default]
address=":8000"

[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1

#[notifier.default]
#class-name="http"
#url-open="http://127.0.0.1:1467/v1/event"
#interval=60
#timeout=5
#keepalive=30
#extras={ api_key="REDACTED", app="burrow", tier="STG", fabric="mydc" }
#template-open="/etc/burrow/default-http-post.tmpl"
#template-close="/etc/burrow/default-http-delete.tmpl"
#method-close="DELETE"
#send-close=false
##send-close=true
#threshold=1
#!/bin/bash
#
# Comments to support chkconfig
# chkconfig: - 98 02
# description: Burrow is a Kafka lag check program by LinkedIn, Inc.
#
# Source function library.
. /etc/init.d/functions

### Default variables
prog_name="burrow"
prog_path="/usr/bin/${prog_name}"
pidfile="/var/run/${prog_name}.pid"
options="-config-dir /etc/burrow/"

# Check if requirements are met
[ -x "${prog_path}" ] || exit 1

RETVAL=0

start(){
    echo -n $"Starting $prog_name: "
    PID=$(pidofproc -p $pidfile $prog_name)
    if [ -z $PID ]; then
        $prog_path $options > /dev/null 2>&1 &
        [ ! -e $pidfile ] && sleep 1
    fi
    [ -z $PID ] && PID=$(pidof ${prog_path})
    if [ -f $pidfile -a -d "/proc/$PID" ]; then
        RETVAL=0
        echo_success
        [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
    else
        RETVAL=1
        echo_failure
    fi
    echo
    return $RETVAL
}

stop(){
    echo -n $"Shutting down $prog_name: "
    killproc -p ${pidfile} $prog_name
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
    return $RETVAL
}

restart() {
    stop
    start
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        restart
        ;;
    status)
        status $prog_path
        RETVAL=$?
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        RETVAL=1
esac

exit $RETVAL
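On CentOS 7, a systemd unit can be used instead of the SysV init script above. The following is a minimal sketch, assuming the binary and config paths from the install steps; tune `Restart` and dependencies to your environment:

```ini
# /etc/systemd/system/burrow.service (hypothetical unit, paths assumed)
[Unit]
Description=Burrow Kafka consumer lag checker
After=network.target

[Service]
ExecStart=/usr/bin/burrow -config-dir /etc/burrow/
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now burrow`.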
The default configuration file is burrow.toml.
GET /v3/kafka/(cluster)/consumer
The responses Burrow returns are all JSON objects, which makes them very convenient for secondary collection and processing.
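For example, a collector might list all consumer groups in a cluster and then walk through them. A minimal sketch, assuming Burrow is listening on the `address=":8000"` from the config above and that the group names come back under a `consumers` key (verify the field name against your Burrow version):

```python
import json
import urllib.request

BURROW = "http://127.0.0.1:8000"  # assumed, matches [httpserver.default]

def extract_consumers(body):
    # Pull the group-name list out of a decoded response object.
    return body.get("consumers", [])

def list_consumers(cluster):
    # GET /v3/kafka/(cluster)/consumer returns a JSON object.
    url = f"{BURROW}/v3/kafka/{cluster}/consumer"
    with urllib.request.urlopen(url) as resp:
        return extract_consumers(json.load(resp))
```

Calling `list_consumers("production")` would then enumerate the groups that the `[consumer.production_kafka]` section is tracking.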
GET /v3/kafka/(cluster)/consumer/(group)/status
GET /v3/kafka/(cluster)/consumer/(group)/lag
The consumer-group health statuses returned by these interfaces have the following meanings:
NOTFOUND – the consumer group was not found
OK – the consumer group is operating normally
WARN – the consumer group is in a warning state, e.g. the offsets are moving but the lag is increasing
ERR – the consumer group is in an error state, e.g. the offsets have stopped for one or more partitions but the lag is non-zero
STOP – the consumer group is in a stopped state, e.g. the offsets have not been committed in a long period of time
STALL – the consumer group is stalled, e.g. the offsets are being committed but they are not changing and the lag is non-zero
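When scraping the status endpoint, these strings can be ranked so a collector knows when to raise an alert. The numeric ordering below is our own convention for alerting purposes, not something Burrow defines:

```python
# Rank Burrow's status strings by severity (ordering is an assumption,
# not part of Burrow's API).
SEVERITY = {"NOTFOUND": -1, "OK": 0, "WARN": 1, "STALL": 2, "STOP": 3, "ERR": 4}

def should_alert(status, threshold="WARN"):
    # Unknown status strings are treated as the most severe case.
    return SEVERITY.get(status, max(SEVERITY.values())) >= SEVERITY[threshold]

print(should_alert("STALL"))  # True
```

A monitoring pipeline would typically page on ERR/STOP and only warn on WARN/STALL; the `threshold` parameter lets each channel pick its own cutoff.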
GET /v3/kafka/(cluster)/topic
GET /v3/kafka/(cluster)/topic/(topic)
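Putting the endpoints together, the per-partition lag from the lag endpoint can be summed into a single metric per group. The `status.partitions[].current_lag` field names below are assumptions based on typical v3 payloads; verify them against your Burrow version:

```python
import json
import urllib.request

BURROW = "http://127.0.0.1:8000"  # assumed Burrow HTTP address

def total_lag(lag_body):
    # Sum current lag across all partitions in a decoded lag response.
    # Field names ("status", "partitions", "current_lag") are assumptions.
    partitions = lag_body.get("status", {}).get("partitions", [])
    return sum(p.get("current_lag", 0) for p in partitions)

def group_total_lag(cluster, group):
    url = f"{BURROW}/v3/kafka/{cluster}/consumer/{group}/lag"
    with urllib.request.urlopen(url) as resp:
        return total_lag(json.load(resp))
```

Because the responses are plain JSON, a snippet like this is all that is needed to feed group lag into an external metrics system.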