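Our monitoring system uses collectd for collection, Graphite for storage, and Grafana for visualization. The advantages are that newly added services are picked up automatically and that the graphical dashboards are intuitive. The shortcoming is that nothing in the stack sends an alert notification when a metric goes wrong. After trying several alerting systems we settled on seyren as our alerting component. This article describes the other components we tried and their pros and cons.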
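Our requirements for an alerting system come down to the following points:
Read data from Graphite: since the architecture already uses Graphite to store monitoring metrics, we do not want to introduce another storage component.
Centralized configuration: all metric thresholds must be configured and managed in one place so they are easy to adjust. Components that check and alert on each individual machine are therefore not candidates.
Automated configuration: when a new service is onboarded or an existing service is scaled out, it should be covered automatically. This requires the alerting system either to support generic, pattern-based configuration or to expose an interface for adding alert rules.
Extensible notification channels: we need to integrate with the SMS alerting platform provided by our company, so the channel must be customizable.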
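Our first choice was Grafana's alerting feature, because we already use Grafana for graphing and dashboards. Grafana added alerting in 4.x: you can attach an alert condition to a query and pick a notification channel. The configuration screen is shown below: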
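There are quite a few notification channels to choose from, including Slack, Mail, and WebHook. WebHook serves as the extension point: when an alert fires, Grafana sends a POST request to the WebHook URL carrying the alert details, so we can implement our own HTTP endpoint to handle the request and bridge to other alerting systems. The POST body looks like this (taken from the Grafana documentation):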
{ "title": "My alert", "ruleId": 1, "ruleName": "Load peaking!", "ruleUrl": "http://url.to.grafana/db/dashboard/my_dashboard?panelId=2", "state": "alerting", "imageUrl": "http://s3.image.url", "message": "Load is peaking. Make sure the traffic is real and spin up more webfronts", "evalMatches": [ { "metric": "requests", "tags": {}, "value": 122 } ] }
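To make the WebHook idea concrete, here is a minimal sketch of the conversion step such an HTTP endpoint might perform: parse the POST body and turn it into a one-line text suitable for an SMS gateway. The function name `format_grafana_alert` and the output format are our own illustrative assumptions; only the JSON field names come from the payload above.

```python
import json


def format_grafana_alert(body: str) -> str:
    """Turn a Grafana webhook POST body into a one-line notification text.

    Hypothetical helper: the message layout is an assumption, not part of Grafana.
    """
    payload = json.loads(body)
    # Each entry in evalMatches describes one series that matched the alert condition.
    matches = ", ".join(
        f"{m.get('metric')}={m.get('value')}"
        for m in payload.get("evalMatches", [])
    )
    return (
        f"[{payload.get('state')}] {payload.get('ruleName')}: "
        f"{payload.get('message')} ({matches})"
    )


# A trimmed version of the documented payload, used only for illustration.
sample = """{"title": "My alert", "ruleId": 1, "ruleName": "Load peaking!",
  "state": "alerting", "message": "Load is peaking.",
  "evalMatches": [{"metric": "requests", "tags": {}, "value": 122}]}"""

print(format_grafana_alert(sample))
```

The resulting string would then be handed to whatever client the SMS platform provides, which is outside the scope of this sketch.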
Once an alert fires, the dashboard shows the panel in a different color:
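In practice we ran into two problems with Grafana alerting. The first is that alert queries do not support Grafana templates. Templating neatly removes the tedious work of onboarding a new project: as long as the project reports metrics according to the agreed naming rules, no new dashboard has to be created at all. Because the alerting module lacks template support, the alert query for every single server must be defined explicitly and cannot contain template variables. Onboarding a new project therefore requires a lot of manual or semi-automated work just to configure alerts.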
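The second problem is that a query expression cannot maintain alert state separately for each individual metric it matches. For example, define an alert query collectd.*.cpu.percent-idle. With two servers, this query matches two metrics: collectd.host1.cpu.percent-idle and collectd.host2.cpu.percent-idle. When host1's CPU idle reaches the alert threshold, the check's state changes to ALERTING and a notification is sent. But if host2 then reaches the threshold as well, no notification is sent, because the check is already in the ALERTING state. The Grafana documentation mentions that support for this is planned, but it is not available yet.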
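The difference between one state per query and one state per metric can be shown with a small simulation. This is a simplified model we wrote for illustration, not Grafana's actual implementation: the function names, the sample values, and the "value >= threshold means breaching" rule are all assumptions.

```python
def notify_single_state(samples, threshold):
    """One state for the whole query: notify only when the overall state changes."""
    state = "OK"
    sent = []
    for step in samples:  # each step maps series name -> value
        breaching = sorted(name for name, v in step.items() if v >= threshold)
        new_state = "ALERTING" if breaching else "OK"
        if new_state != state:
            sent.append((new_state, breaching))
            state = new_state
    return sent


def notify_per_series(samples, threshold):
    """One state per series: notify whenever any single series changes state."""
    states = {}
    sent = []
    for step in samples:
        for name, v in sorted(step.items()):
            new_state = "ALERTING" if v >= threshold else "OK"
            if new_state != states.get(name, "OK"):
                sent.append((new_state, name))
            states[name] = new_state
    return sent


# host1 breaches at the second evaluation, host2 at the third (made-up values).
samples = [
    {"collectd.host1.cpu.percent-idle": 50, "collectd.host2.cpu.percent-idle": 50},
    {"collectd.host1.cpu.percent-idle": 60, "collectd.host2.cpu.percent-idle": 50},
    {"collectd.host1.cpu.percent-idle": 60, "collectd.host2.cpu.percent-idle": 61},
]

print(notify_single_state(samples, 57))  # only one notification, for host1
print(notify_per_series(samples, 57))    # one notification per host
```

With a single shared state, the second host's breach is swallowed because the check is already alerting, which is exactly the behaviour described above.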
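cabot is an alerting system designed mainly for Graphite data sources. Like Grafana, it alerts based on a Graphite metric query plus a threshold, and notifications can be sent through plugins you implement yourself. It also shares Grafana's limitation: when a query matches multiple metrics, it cannot track alert state for each metric separately.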
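seyren is also an alerting system built for Graphite data sources. Its advantage is that when a query matches multiple metrics, it tracks alert state for each metric separately.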
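First we define a check. The metric query is collectd.base.control.jy.*.cpu.percent-idle, where the * matches the IP addresses of all servers. The figure below shows the check configuration screen: after defining the query you set the warn threshold and the error threshold, and the screen then shows the graph for the most recent period.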
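How the two thresholds map a value to a state can be sketched as follows. This is our reading of the behaviour rather than seyren's source code: the branch for "higher is worse" agrees with the sample payload in this article (warn 57, error 62, value 58.17 reported as WARN and 52.63 as OK), while the reversed branch for error below warn is an assumption. The function name `classify` is hypothetical.

```python
def classify(value: float, warn: float, error: float) -> str:
    """Map a metric value to OK / WARN / ERROR given the two thresholds."""
    if error >= warn:
        # Higher values are worse: crossing warn, then error, escalates the state.
        if value >= error:
            return "ERROR"
        if value >= warn:
            return "WARN"
        return "OK"
    # Assumed: when error < warn, lower values are worse.
    if value <= error:
        return "ERROR"
    if value <= warn:
        return "WARN"
    return "OK"


print(classify(58.1702216645755, warn=57, error=62))  # WARN
print(classify(52.6318006613729, warn=57, error=62))  # OK
```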
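After saving the check, its alert status is visible on the dashboard, and every matched metric has its own independently tracked state. This is what makes automatic onboarding of servers and services possible: a newly added machine only has to report its monitoring data following the naming convention to be covered by the check above.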
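Why a new machine is covered without any configuration change comes down to wildcard matching on the metric path. The sketch below is a simplified stand-in for Graphite's matching, assuming that `*` matches within a single dot-separated segment only; it does not handle other Graphite pattern syntax such as `{a,b}`. The helper name `matches` and the host names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, metric: str) -> bool:
    """Match a metric path against a pattern, one dot-separated segment at a time."""
    p_parts = pattern.split(".")
    m_parts = metric.split(".")
    # A wildcard never spans a dot, so the segment counts must agree.
    return len(p_parts) == len(m_parts) and all(
        fnmatchcase(m, p) for p, m in zip(p_parts, m_parts)
    )


check_target = "collectd.base.control.jy.*.cpu.percent-idle"

# A newly added host that follows the naming convention is picked up as-is.
print(matches(check_target, "collectd.base.control.jy.host3.cpu.percent-idle"))
# A different metric on the same host is not.
print(matches(check_target, "collectd.base.control.jy.host3.memory.percent-used"))
```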
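For sending alerts, seyren also supports many channels, such as Email, Flowdock, HipChat, HTTP, and Hubot. Here we only care about HTTP. The figure below shows an HTTP channel configuration: you only need to define a URL that accepts POST requests, and the alert details are sent as JSON in the POST body.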
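A receiving endpoint for such an HTTP channel can be very small. The following is a minimal sketch using only the Python standard library; the class name `AlertHandler`, the in-memory `received` list, and the `make_server` helper are illustrative assumptions, and a real deployment would forward each payload to the SMS platform instead of storing it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed payloads, kept in memory for this sketch only


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            payload = json.loads(raw)
        except ValueError:
            # Body was not valid JSON: reject it.
            self.send_response(400)
            self.end_headers()
            return
        received.append(payload)
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging to stderr


def make_server(host="127.0.0.1", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer((host, port), AlertHandler)
```

Calling `make_server(port=8083).serve_forever()` would start listening; the port number here is only an example, and the URL configured in the HTTP channel has to point at wherever this server actually runs.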
The key part of the alert POST body is excerpted below:
"alerts": [ { "checkId": "59327b84e4b0a957ebb25f77", "targetHash": "\ufffd\u0006LC\ufffd\ufffd\ufffd\ufffd\u0002\u007f\u0002\ufffd\ufffd\ufffdkE", "fromType": "OK", "toType": "WARN", "warn": 57, "timestamp": 1496484513227, "error": 62, "value": 58.1702216645755, "id": "59328aa1e4b0a957ebb26201", "target": "collectd.base.control.jy.host1.cpu.percent-idle" }, { "checkId": "59327b84e4b0a957ebb25f77", "targetHash": "\ufffd\ufffd\ufffd\ufffd\ufffd>G\u001c\ufffd\ufffd\ufffd\u001c\ufffd9\ufffd\ufffd", "fromType": "WARN", "toType": "OK", "warn": 57, "timestamp": 1496484513227, "error": 62, "value": 52.6318006613729, "id": "59328aa1e4b0a957ebb26209", "target": "collectd.base.control.jy.host2.cpu.percent-idle" } ],
As you can see, the alerts node maintains and reports the check state separately for each host.
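Because each entry in alerts carries its own target and state transition, the receiving side can produce one notification per host. Below is a minimal sketch of that step; the function name `alert_messages` and the message layout are our own assumptions, while the field names (`check`, `alerts`, `fromType`, `toType`, `target`, `value`, `warn`, `error`) are taken from the payload shown in this article.

```python
import json


def alert_messages(body: str) -> list:
    """Build one notification text per alert entry in a seyren POST body."""
    payload = json.loads(body)
    check_name = payload.get("check", {}).get("name", "?")
    messages = []
    for alert in payload.get("alerts", []):
        if alert["fromType"] == alert["toType"]:
            continue  # no state change, nothing to report
        messages.append(
            f"[{alert['toType']}] {check_name} {alert['target']}: "
            f"value={alert['value']:.2f} "
            f"(warn={alert['warn']}, error={alert['error']}), was {alert['fromType']}"
        )
    return messages


# Trimmed copy of the payload from this article, keeping only the fields used above.
sample_body = json.dumps({
    "check": {"name": "cpu"},
    "alerts": [
        {"fromType": "OK", "toType": "WARN", "warn": 57, "error": 62,
         "value": 58.1702216645755,
         "target": "collectd.base.control.jy.host1.cpu.percent-idle"},
        {"fromType": "WARN", "toType": "OK", "warn": 57, "error": 62,
         "value": 52.6318006613729,
         "target": "collectd.base.control.jy.host2.cpu.percent-idle"},
    ],
})

for line in alert_messages(sample_body):
    print(line)
```

Recovery transitions (here host2 going from WARN back to OK) are reported as well; whether those should reach the SMS channel is a policy decision left to the receiver.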
The full POST body follows:
{ "preview": "<br /><img src=http://192.168.1.1/render/?target=collectd.base.control.jy.*.cpu.percent-idle&from=10:08_20170603&until=09:08_20170603&target=alias(dashed(color(constantLine(57),%22yellow%22)),%22warn%20level%22)&target=alias(dashed(color(constantLine(62),%22red%22)),%22error%20level%22)&width=500&height=225></img>", "subscription": { "su": true, "mo": true, "tu": true, "we": true, "th": true, "fr": true, "sa": true, "ignoreWarn": false, "ignoreError": false, "ignoreOk": false, "fromTime": { "chronology": { "zone": { "fixed": true, "id": "UTC" } }, "millisOfSecond": 0, "millisOfDay": 0, "secondOfMinute": 0, "hourOfDay": 0, "minuteOfHour": 0, "fieldTypes": [ { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" } ], "values": [ 0, 0, 0, 0 ], "fields": [ { "range": 24, "rangeDurationField": { "unitMillis": 86400000, "precise": true, "name": "days", "type": { "name": "days" }, "supported": true }, "maximumValue": 23, "lenient": false, "unitMillis": 3600000, "durationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "hourOfDay", "type": { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 60000, "durationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, 
"supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "minuteOfHour", "type": { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 1000, "durationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "secondOfMinute", "type": { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, "supported": true }, { "range": 1000, "rangeDurationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "maximumValue": 999, "lenient": false, "unitMillis": 1, "durationField": { "unitMillis": 1, "precise": true, "name": "millis", "type": { "name": "millis" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "millisOfSecond", "type": { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" }, "supported": true } ] }, "toTime": { "chronology": { "zone": { "fixed": true, "id": "UTC" } }, "millisOfSecond": 0, "millisOfDay": 86340000, "secondOfMinute": 0, "hourOfDay": 23, "minuteOfHour": 59, "fieldTypes": [ { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" } ], "values": [ 23, 59, 0, 0 ], "fields": [ { "range": 
24, "rangeDurationField": { "unitMillis": 86400000, "precise": true, "name": "days", "type": { "name": "days" }, "supported": true }, "maximumValue": 23, "lenient": false, "unitMillis": 3600000, "durationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "hourOfDay", "type": { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 60000, "durationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "minuteOfHour", "type": { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 1000, "durationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "secondOfMinute", "type": { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, "supported": true }, { "range": 1000, "rangeDurationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "maximumValue": 999, "lenient": false, "unitMillis": 1, "durationField": { "unitMillis": 1, "precise": true, "name": "millis", "type": { "name": "millis" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": 
"millisOfSecond", "type": { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" }, "supported": true } ] }, "enabled": true, "id": "59328a65e4b0a957ebb26200", "type": "HTTP", "target": "http://10.153.74.117:8083/sonar/1.0/alarm_str" }, "check": { "subscriptions": [ { "su": true, "mo": true, "tu": true, "we": true, "th": true, "fr": true, "sa": true, "ignoreWarn": false, "ignoreError": false, "ignoreOk": false, "fromTime": { "chronology": { "zone": { "fixed": true, "id": "UTC" } }, "millisOfSecond": 0, "millisOfDay": 0, "secondOfMinute": 0, "hourOfDay": 0, "minuteOfHour": 0, "fieldTypes": [ { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" } ], "values": [ 0, 0, 0, 0 ], "fields": [ { "range": 24, "rangeDurationField": { "unitMillis": 86400000, "precise": true, "name": "days", "type": { "name": "days" }, "supported": true }, "maximumValue": 23, "lenient": false, "unitMillis": 3600000, "durationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "hourOfDay", "type": { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 60000, "durationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, 
"minimumValue": 0, "leapDurationField": null, "name": "minuteOfHour", "type": { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 1000, "durationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "secondOfMinute", "type": { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, "supported": true }, { "range": 1000, "rangeDurationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "maximumValue": 999, "lenient": false, "unitMillis": 1, "durationField": { "unitMillis": 1, "precise": true, "name": "millis", "type": { "name": "millis" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "millisOfSecond", "type": { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" }, "supported": true } ] }, "toTime": { "chronology": { "zone": { "fixed": true, "id": "UTC" } }, "millisOfSecond": 0, "millisOfDay": 86340000, "secondOfMinute": 0, "hourOfDay": 23, "minuteOfHour": 59, "fieldTypes": [ { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" } ], "values": [ 23, 59, 0, 0 ], "fields": [ { "range": 24, 
"rangeDurationField": { "unitMillis": 86400000, "precise": true, "name": "days", "type": { "name": "days" }, "supported": true }, "maximumValue": 23, "lenient": false, "unitMillis": 3600000, "durationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "hourOfDay", "type": { "durationType": { "name": "hours" }, "rangeDurationType": { "name": "days" }, "name": "hourOfDay" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 3600000, "precise": true, "name": "hours", "type": { "name": "hours" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 60000, "durationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "minuteOfHour", "type": { "durationType": { "name": "minutes" }, "rangeDurationType": { "name": "hours" }, "name": "minuteOfHour" }, "supported": true }, { "range": 60, "rangeDurationField": { "unitMillis": 60000, "precise": true, "name": "minutes", "type": { "name": "minutes" }, "supported": true }, "maximumValue": 59, "lenient": false, "unitMillis": 1000, "durationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": "secondOfMinute", "type": { "durationType": { "name": "seconds" }, "rangeDurationType": { "name": "minutes" }, "name": "secondOfMinute" }, "supported": true }, { "range": 1000, "rangeDurationField": { "unitMillis": 1000, "precise": true, "name": "seconds", "type": { "name": "seconds" }, "supported": true }, "maximumValue": 999, "lenient": false, "unitMillis": 1, "durationField": { "unitMillis": 1, "precise": true, "name": "millis", "type": { "name": "millis" }, "supported": true }, "minimumValue": 0, "leapDurationField": null, "name": 
"millisOfSecond", "type": { "durationType": { "name": "millis" }, "rangeDurationType": { "name": "seconds" }, "name": "millisOfSecond" }, "supported": true } ] }, "enabled": true, "id": "59328a65e4b0a957ebb26200", "type": "HTTP", "target": "http://192.168.1.1/sonar/1.0/alarm_str" } ], "warn": 57, "until": null, "from": null, "lastCheck": 1496484513253, "description": null, "enabled": true, "error": 62, "name": "cpu", "id": "59327b84e4b0a957ebb25f77", "state": "ERROR", "target": "collectd.base.control.jy.*.cpu.percent-idle", "live": false }, "alerts": [ { "checkId": "59327b84e4b0a957ebb25f77", "targetHash": "\ufffd\u0006LC\ufffd\ufffd\ufffd\ufffd\u0002\u007f\u0002\ufffd\ufffd\ufffdkE", "fromType": "OK", "toType": "WARN", "warn": 57, "timestamp": 1496484513227, "error": 62, "value": 58.1702216645755, "id": "59328aa1e4b0a957ebb26201", "target": "collectd.base.control.jy.host1.cpu.percent-idle" }, { "checkId": "59327b84e4b0a957ebb25f77", "targetHash": "\ufffd\ufffd\ufffd\ufffd\ufffd>G\u001c\ufffd\ufffd\ufffd\u001c\ufffd9\ufffd\ufffd", "fromType": "WARN", "toType": "OK", "warn": 57, "timestamp": 1496484513227, "error": 62, "value": 52.6318006613729, "id": "59328aa1e4b0a957ebb26209", "target": "collectd.base.control.jy.host2.cpu.percent-idle" } ], "seyrenUrl": "http://localhost:8080/seyren" }