See the google-sre-ebook. For RPC services, I usually build monitoring dashboards around qps/rtt/error: qps covers saturation and traffic (provided the service has been load-tested), rtt (query round trip time) covers latency, and errors speak for themselves.
I used to use only a histogram for both qps and rtt. The problem with that is: when requests time out, the histogram count first dips and then rebounds. The correct approach is:
counter_inc()
start_time = now()
call()
end_time = now()
histogram_observe(end_time - start_time)
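A minimal sketch of this pattern in Python, using a dict and a list as stand-ins for a real metrics client (in practice these would be something like prometheus_client's Counter and Histogram; all names here are hypothetical):

```python
import time

# Stand-ins for a real metrics client.
request_counter = {"calls": 0}
latency_histogram = []  # observed latencies in ms

def instrumented_call(fn):
    # Increment the counter BEFORE the call, so a request that later
    # times out is already counted and the qps series does not dip.
    request_counter["calls"] += 1
    start = time.monotonic()
    try:
        return fn()
    finally:
        # Observe latency only after the call finishes (or fails).
        latency_histogram.append((time.monotonic() - start) * 1000.0)

result = instrumented_call(lambda: "ok")
```

Because the counter is bumped on entry rather than on completion, the qps series derived from it keeps rising even while requests are hung, which is exactly the behavior the histogram-only approach lacked.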
For example:
sum(rate(xxxx_count{app="xxx"}[30s])) by (method)
For example, with a 5s scrape interval, rate reports the average qps over the window; the actual qps is not evenly distributed, so short bursts get smoothed out.
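The smoothing effect can be seen with a toy rate() computation (all numbers hypothetical): 100 requests arriving in a single 1-second burst, viewed through a 30s window, average out to about 3.3 qps:

```python
# Cumulative counter samples as (timestamp_seconds, value): a burst of
# 100 requests in the first second, then silence until t=30.
samples = [(0, 0), (1, 100), (30, 100)]

def simple_rate(samples):
    # Like PromQL rate(): increase over the window divided by its length.
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

print(simple_rate(samples))  # ~3.33 qps, though the burst peaked at 100 qps
```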
Average:
sum(rate(xxx_ts_sum{app="xxx"}[30s])) by (method) / sum(rate(xxx_ts_count{app="xxx"}[30s])) by (method)
Median:
histogram_quantile(0.5, sum(rate(xxx_ts_bucket{app="xxx"}[30s])) by (le, method))
p99:
histogram_quantile(0.99, sum(rate(xxx_ts_bucket{app="xxx"}[30s])) by (le, method))
As the output of wget http://localhost:9090/metrics below shows, a histogram only records the number of requests falling into each bucket interval.
Suppose the buckets are only [0, 1000, 5000] and every request takes 100ms. The median then comes out as 500ms and the p99 as 990ms, because histogram_quantile interpolates linearly within a bucket and cannot know the actual distribution between 0 and 1000ms.
So in ranges where latencies are highly concentrated, the buckets need to be divided more finely.
Likewise, when analyzing tail latency, watch out for the statistical error introduced by coarse buckets.
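To see where such figures come from, here is a sketch of histogram_quantile's per-bucket linear interpolation (based on PromQL's documented behavior; the bucket data is hypothetical):

```python
def histogram_quantile(q, buckets):
    # buckets: list of (upper_bound, cumulative_count), sorted by bound.
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly between the bucket's two bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests, all actually 100ms, but the buckets are coarse:
buckets = [(1000, 1000), (5000, 1000)]
print(histogram_quantile(0.5, buckets))   # 500.0, not 100
print(histogram_quantile(0.99, buckets))  # 990.0, not 100
```

Every request lands in the 0–1000ms bucket, so the quantile is spread uniformly across that bucket; both reported quantiles are far from the true 100ms.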
# TYPE http_ts histogram
# HELP http_ts Http Post execution time.
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="2"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="5"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="10"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="25"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="50"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="100"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="250"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="500"} 1