Prometheus is an open-source monitoring system with many advanced features. It periodically pulls the state of the monitored systems over HTTP, attaches a timestamp to the collected samples and organizes them into time series, and identifies different time series by metric name and labels. Users can display the data with visualization tools and configure alerting thresholds.
This article introduces the Prometheus client library for Go. The Go client is what answers Prometheus's scrape requests and returns the monitored system's data in the expected format; in other words, it is essentially an HTTP server. The source code is on GitHub and the reference documentation is on GoDoc; readers can develop directly from the docs, and this article is only meant to aid understanding.
To learn the Prometheus Go client you need a test environment. I recommend running Prometheus in Docker: it deploys quickly and has no impact on the host system (see "Using Docker" for installation). Prepare the Prometheus configuration file prometheus.yml locally and map it into the container as a volume. The configuration file contains the following:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "go-test"
    scrape_interval: 60s
    scrape_timeout: 60s
    metrics_path: "/metrics"
    static_configs:
      - targets: ["localhost:8888"]
As you can see, the configuration file specifies a job_name; each task to be monitored is treated as a job. scrape_interval and scrape_timeout are the interval and timeout Prometheus uses when collecting data, metrics_path specifies the HTTP path for fetching the data, and targets is the target's ip:port (here, port 8888 on the same host). This is only a basic configuration; see the official site for more.
Once the configuration is in place you can start the Prometheus service:
docker run --network=host -p 9090:9090 -v /home/gaorong/project/prometheus_test/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
The container uses host networking here, so Prometheus inside Docker can reach the program it monitors on the same host directly via localhost. Prometheus exposes port 9090 for the web UI and other operations, so port 9090 of the container needs to be mapped. After startup you can open the web UI at http://localhost:9090/graph; the Status drop-down menu shows the configuration file and the state of the targets. At this point the target state is DOWN, because the service we want to monitor has not been started yet, so let's get to the main topic and implement that service with the Prometheus Go client.
Prometheus stores all data as time series, distinguished by metric name and labels; labels provide a finer-grained partition within a metric name. Each sample consists of a float64 value and a timestamp; the timestamp is attached implicitly and sometimes not displayed. In the data shown below (taken from the monitoring data Prometheus itself exposes, available at http://localhost:9090/metrics), go_gc_duration_seconds is the metric name, quantile="0.5" is a key-value label, and the value that follows is the float64 sample value.
To make the client library easy to use, Prometheus provides four metric types: Counter, Gauge, Histogram, and Summary. Roughly speaking, a Counter only ever increases, a Gauge can go up and down, and Histogram and Summary provide richer statistics. In the example below, the comment line # TYPE go_gc_duration_seconds summary marks the metric as a Summary.
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0.5"} 0.000107458
go_gc_duration_seconds{quantile="0.75"} 0.000200112
go_gc_duration_seconds{quantile="1"} 0.000299278
go_gc_duration_seconds_sum 0.002341738
go_gc_duration_seconds_count 18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 107
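The exposition above shows what a Summary looks like on the wire. On the Go side, Histogram and Summary are declared much like the other types and are fed with Observe. The sketch below is only an illustration; the metric names, buckets, and objectives are my own assumptions rather than values taken from Prometheus's own metrics:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Histogram: counts observations into predefined buckets.
    reqDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "demo_request_duration_seconds", // hypothetical metric name
        Help:    "Request duration in seconds.",
        Buckets: prometheus.LinearBuckets(0.01, 0.05, 10), // 10 buckets starting at 0.01s, 0.05s apart
    })
    // Summary: computes the configured quantiles on the client side.
    reqSize = prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "demo_request_size_bytes", // hypothetical metric name
        Help:       "Request size in bytes.",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    })
)

func main() {
    prometheus.MustRegister(reqDuration, reqSize)

    start := time.Now()
    // ... do some work ...
    reqDuration.Observe(time.Since(start).Seconds())
    reqSize.Observe(512)

    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8888", nil)
}

A Histogram buckets its observations so they can be aggregated on the server, while a Summary computes the configured quantiles inside the client itself.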
A Basic Example demonstrates how to use these metric types (note: change the 8080 port in the example to 8888 as used in this article):
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    cpuTemp = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "cpu_temperature_celsius",
        Help: "Current temperature of the CPU.",
    })
    hdFailures = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "hd_errors_total",
            Help: "Number of hard-disk errors.",
        },
        []string{"device"},
    )
)

func init() {
    // Metrics have to be registered to be exposed:
    prometheus.MustRegister(cpuTemp)
    prometheus.MustRegister(hdFailures)
}

func main() {
    cpuTemp.Set(65.3)
    hdFailures.With(prometheus.Labels{"device": "/dev/sda"}).Inc()

    // The Handler function provides a default handler to expose metrics
    // via an HTTP server. "/metrics" is the usual endpoint for that.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8888", nil))
}
The code creates a Gauge and a CounterVec and gives each a metric name and a help string. A CounterVec manages a group of Counters that share a metric name but carry different labels; a GaugeVec exists for the same purpose. The code above declares a label with the key "device", so a label must also be supplied when the counter is used: hdFailures.With(prometheus.Labels{"device":"/dev/sda"}).Inc().
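As a minimal sketch of the GaugeVec counterpart (the disk_usage_bytes metric and its device label are made-up names for illustration):

package main

import "github.com/prometheus/client_golang/prometheus"

// A GaugeVec manages one Gauge per label combination, just as a CounterVec
// does for Counters.
var diskUsage = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "disk_usage_bytes",
        Help: "Disk usage per device in bytes.",
    },
    []string{"device"},
)

func main() {
    prometheus.MustRegister(diskUsage)

    // Look up the child gauge either by a label map ...
    diskUsage.With(prometheus.Labels{"device": "/dev/sda"}).Set(1.2e9)
    // ... or by positional label values.
    diskUsage.WithLabelValues("/dev/sdb").Set(3.4e9)
}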
After the variables are defined they are registered, and finally an HTTP server is started on port 8888; that completes the program. Prometheus collects the data by periodically issuing HTTP requests to this port.
After starting the program, open http://localhost:8888/metrics in a web browser to see the data exposed by the client; one fragment looks like this:
# HELP cpu_temperature_celsius Current temperature of the CPU.
# TYPE cpu_temperature_celsius gauge
cpu_temperature_celsius 65.3
# HELP hd_errors_total Number of hard-disk errors.
# TYPE hd_errors_total counter
hd_errors_total{device="/dev/sda"} 1
This is the data exposed by the example program. You can see that the CounterVec carries a label while the plain Gauge does not; that is the difference between the basic metric types and their Vec versions. Looking at http://localhost:9090/graph again, the target state has now changed to UP.
The example above is only a simple demo. The prometheus.yml configuration sets the scrape interval to 60s, so Prometheus issues an HTTP request for the exposed data every 60 seconds, while the code sets the gauge cpuTemp only once. If the value is set several times within one 60s scrape interval, the earlier values are overwritten and only the value present at the moment Prometheus scrapes is visible; and if the value is never changed again, Prometheus will keep seeing this constant value unless the user explicitly removes the variable with the Delete function.
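If every scrape should see a fresh value, one common pattern is to update the gauge from a background goroutine. A minimal sketch on top of the basic example above, where readCPUTemp is a hypothetical helper standing in for the real measurement (the time package also needs to be imported):

func startCPUTempUpdater() {
    go func() {
        for {
            // cpuTemp is the Gauge from the basic example; readCPUTemp is hypothetical.
            cpuTemp.Set(readCPUTemp())
            time.Sleep(10 * time.Second) // refresh well within the 60s scrape interval
        }
    }()
}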
Structures such as Counter and Gauge are simple to use, but when a variable is no longer needed we have to remove it ourselves; the Reset function can be called to clear the previously recorded metrics.
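A sketch of the cleanup options, reusing hdFailures and cpuTemp from the basic example (the cleanupMetrics wrapper is hypothetical):

func cleanupMetrics() {
    // Drop the series for one specific label combination; Delete reports
    // whether such a series existed.
    hdFailures.Delete(prometheus.Labels{"device": "/dev/sda"})

    // Or drop every series managed by this vector at once.
    hdFailures.Reset()

    // A plain (non-Vec) metric is removed by unregistering it.
    prometheus.Unregister(cpuTemp)
}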
A more advanced approach is to use a Collector. A Go client Collector only gathers data at the moment it answers a Prometheus scrape, and the values have to be passed explicitly on every scrape; otherwise the variable is not retained and Prometheus will no longer see it. Collector is an interface: every object that collects metric data implements it, Counter and Gauge included. It consists of two methods: Collect gathers the data and sends the collected metrics to the channel passed in as its argument, and Describe describes the Collector. When collecting system data is expensive, a custom Collector lets you control how the data is gathered and optimize the process; and if a mature source of metrics already exists, you don't need Counter, Gauge, and the like at all, since the Collector can simply act as a proxy for it. Most advanced use cases can be handled with a custom Collector.
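For reference, the interface as defined in the prometheus package consists of exactly these two methods:

type Collector interface {
    // Describe sends the descriptors of the metrics this Collector can
    // produce to the provided channel.
    Describe(chan<- *Desc)
    // Collect is called by the registry on every gather and sends the
    // current metric values to the provided channel.
    Collect(chan<- Metric)
}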
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type ClusterManager struct {
    Zone         string
    OOMCountDesc *prometheus.Desc
    RAMUsageDesc *prometheus.Desc
    // ... many more fields
}

// Simulate preparing the data.
func (c *ClusterManager) ReallyExpensiveAssessmentOfTheSystemState() (
    oomCountByHost map[string]int, ramUsageByHost map[string]float64,
) {
    // Just example fake data.
    oomCountByHost = map[string]int{
        "foo.example.org": 42,
        "bar.example.org": 2001,
    }
    ramUsageByHost = map[string]float64{
        "foo.example.org": 6.023e23,
        "bar.example.org": 3.14,
    }
    return
}

// Describe simply sends the two Descs in the struct to the channel.
func (c *ClusterManager) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.OOMCountDesc
    ch <- c.RAMUsageDesc
}

func (c *ClusterManager) Collect(ch chan<- prometheus.Metric) {
    oomCountByHost, ramUsageByHost := c.ReallyExpensiveAssessmentOfTheSystemState()
    for host, oomCount := range oomCountByHost {
        ch <- prometheus.MustNewConstMetric(
            c.OOMCountDesc,
            prometheus.CounterValue,
            float64(oomCount),
            host,
        )
    }
    for host, ramUsage := range ramUsageByHost {
        ch <- prometheus.MustNewConstMetric(
            c.RAMUsageDesc,
            prometheus.GaugeValue,
            ramUsage,
            host,
        )
    }
}

// NewClusterManager creates the two Descs OOMCountDesc and RAMUsageDesc. Note
// that the zone is set as a ConstLabel. (It's different in each instance of the
// ClusterManager, but constant over the lifetime of an instance.) Then there is
// a variable label "host", since we want to partition the collected metrics by
// host. Since all Descs created in this way are consistent across instances,
// with a guaranteed distinction by the "zone" label, we can register different
// ClusterManager instances with the same registry.
func NewClusterManager(zone string) *ClusterManager {
    return &ClusterManager{
        Zone: zone,
        OOMCountDesc: prometheus.NewDesc(
            "clustermanager_oom_crashes_total",
            "Number of OOM crashes.",
            []string{"host"},
            prometheus.Labels{"zone": zone},
        ),
        RAMUsageDesc: prometheus.NewDesc(
            "clustermanager_ram_usage_bytes",
            "RAM usage as reported to the cluster manager.",
            []string{"host"},
            prometheus.Labels{"zone": zone},
        ),
    }
}

func main() {
    workerDB := NewClusterManager("db")
    workerCA := NewClusterManager("ca")

    // Since we are dealing with custom Collector implementations, it might
    // be a good idea to try it out with a pedantic registry.
    reg := prometheus.NewPedanticRegistry()
    reg.MustRegister(workerDB)
    reg.MustRegister(workerCA)

    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    http.ListenAndServe(":8888", nil)
}
Now you can open http://localhost:8888/metrics and see the data that was passed through. The example defines two metrics, with host and zone as their labels. The Prometheus Go client actually ships with several Collectors for us to use, and their implementations are worth studying; in the source package you can find them in go_collector.go, process_collector.go, and expvar_collector.go.
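As a rough sketch of how those built-in Collectors can be wired onto a custom registry (constructor names as in the client_golang version current when this was written; newer releases move them into a separate collectors package):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    reg := prometheus.NewRegistry()

    // Go runtime metrics (goroutines, GC, memstats), implemented in go_collector.go.
    reg.MustRegister(prometheus.NewGoCollector())
    // Process metrics (CPU, memory, file descriptors), implemented in process_collector.go.
    reg.MustRegister(prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}))

    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    http.ListenAndServe(":8888", nil)
}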
--- update at 2019.2.26 ---
I strongly recommend reading the Best Practices section on the Prometheus website: once we know how to use the tool, we still need to understand which metrics a system should expose and which variables are best to use...