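Prometheus is monitoring software, similar to Nagios.
Installation
1. Download the prometheus-2.2.0.linux-amd64 archive from the official site, extract it, and run ./prometheus. The important part here is the configuration file.
a. To hot-reload the configuration file remotely, start Prometheus with the --web.enable-lifecycle flag. The reload is then triggered with: curl -X POST http://localhost:9090/-/reload
b. The key thing to master is the prometheus.yml configuration file, which Prometheus loads at startup.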
[root@vm-local1 prometheus-2.2.0.linux-amd64]# cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. (rule evaluation interval)
  # scrape_timeout is set to the global default (10s). (scrapes time out after 10 seconds by default)

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]   # Alertmanager is a separate download; it listens on port 9093 by default

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - rules/mengyuan.rules   # alerts require rules, so name the rule file here

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:   # scrape configuration: which hosts to scrape
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'   # job name
    # metrics_path defaults to '/metrics'   (the URL path scraped on each target)
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090','localhost:9100']
        labels:
          group: 'zus'   # targets are the hosts (clients) to scrape; I have two and put both in one group named zus
  - job_name: dj   # a second job
    static_configs:
      - targets: ['localhost:8000']   # my custom client written in Django
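Note:
localhost:9090 is Prometheus itself, which exposes its own metrics endpoint by default; port 9100 is node_exporter, a monitoring client provided by the Prometheus project.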
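With the configuration in place, the reload call from step 1a can also be issued from Python instead of curl. This is a minimal sketch using only the standard library; the function names `build_reload_request` and `reload_prometheus` are made up for illustration, and it assumes Prometheus was started with --web.enable-lifecycle and listens on localhost:9090.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to reload its config."""
    # An empty body with method="POST" mirrors `curl -X POST`.
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on a non-2xx answer, for example
    when the lifecycle API is not enabled.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reload_prometheus()` after editing prometheus.yml or a rule file should make Prometheus pick up the change without a restart.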
2. Install a Prometheus client
Download the node_exporter-0.16.0-rc.1.linux-amd64 client from the official site, extract it, and run ./node_exporter. It listens on port 9100 by default.
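To make the defaults concrete: each target in prometheus.yml is scraped at scheme://target/metrics_path, with the scheme defaulting to http and the path to /metrics. The sketch below derives those URLs from a config already loaded into plain Python dicts. `scrape_urls` is a hypothetical helper, not part of Prometheus, and it ignores relabeling, service discovery, and other features of the real scraper.

```python
def scrape_urls(scrape_configs):
    """Return the URL that would be scraped for every static target."""
    urls = []
    for job in scrape_configs:
        scheme = job.get("scheme", "http")          # default scheme
        path = job.get("metrics_path", "/metrics")  # default metrics path
        for static in job.get("static_configs", []):
            for target in static.get("targets", []):
                urls.append(f"{scheme}://{target}{path}")
    return urls


# The two jobs from the prometheus.yml in this note, written as dicts.
config = [
    {
        "job_name": "prometheus",
        "static_configs": [
            {"targets": ["localhost:9090", "localhost:9100"],
             "labels": {"group": "zus"}},
        ],
    },
    {"job_name": "dj", "static_configs": [{"targets": ["localhost:8000"]}]},
]

for url in scrape_urls(config):
    print(url)
```

One consequence for a custom client on localhost:8000: unless metrics_path is changed for that job, the client has to serve its data at /metrics.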
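3. Writing a custom client is actually simple: it only has to return data in the format shown here. I use Django, but anything works as long as the format is correct.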
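from django.http import HttpResponse

def metrics(req):
    # One "<metric_name> <value>" sample per line; the text format expects the last line to end with a newline.
    ss = "feiji 32" + "\n" + "caidian 31" + "\n"
    return HttpResponse(ss, content_type="text/plain")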
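When there are more than a couple of metrics, building the string by hand gets error-prone. The following is a small sketch of a helper that renders a dict into the same text format; `render_metrics` is a name invented here, and declaring every metric as a gauge is an assumption made for simplicity. The official prometheus_client library is the more complete route.

```python
def render_metrics(samples):
    """Render {metric_name: value} as Prometheus text exposition format."""
    lines = []
    for name, value in samples.items():
        # Optional metadata line; every metric is assumed to be a gauge here.
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


print(render_metrics({"feiji": 32, "caidian": 31}), end="")
```

In the Django view above, the returned string would take the place of `ss`.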
4. Write the rules/mengyuan.rules rule file. Rules are the prerequisite for sending alerts.
[root@vm-local1 rules]# cat mengyuan.rules
groups:
- name: zus
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown   # alert name, anything you like
    expr: up == 0         # an expression: up == 0 means the target is down; when it is true the alert triggers. The value is available as $value
    for: 5s               # if up is still 0 after 5s, the alert is sent to Alertmanager (the English comment and description say 5 minutes, but the value used here is 5s)
    labels:
      severity: page      # a label attached to this kind of alert
    annotations:
      summary: "Instance {{ $labels.instance }} down dangqian {{ $value }}"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
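The role of `for` is easier to see with a toy model. The sketch below walks through a series of rule evaluations and reports whether the InstanceDown alert would be inactive, pending, or firing at each one. It is a simplified illustration, not Prometheus code: `alert_states` is a hypothetical function, and it assumes one `up` sample per evaluation.

```python
def alert_states(samples, for_seconds):
    """Classify each evaluation of `up == 0` as inactive, pending, or firing.

    samples: list of (timestamp_seconds, up_value), one per rule evaluation.
    """
    states = []
    active_since = None  # time at which the expression first became true
    for timestamp, up in samples:
        if up != 0:
            # Expression is false: the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        # The alert fires only once the condition has held for `for_seconds`.
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# Evaluations every 15s (the evaluation_interval), with `for: 5s`.
print(alert_states([(0, 1), (15, 0), (30, 0), (45, 1)], for_seconds=5))
```

With a 15s evaluation interval, a `for` of 5s is already exceeded by the second consecutive failing evaluation, so in this model the alert spends exactly one evaluation in the pending state.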
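5. Now install Alertmanager
a. Download alertmanager-0.15.0-rc.1.linux-amd64 from the official site.
Again the important part is the configuration file; create it or edit the existing one.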
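[root@vm-local1 alertmanager-0.15.0-rc.1.linux-amd64]# cat alertmanager.yml
route:
  receiver: mengyuan2     # default receiver; one is required and it must match a "- name" under receivers
  group_wait: 1s          # wait 1s before sending the first notification for a group
  group_interval: 1s      # interval between notifications for a group: 1s
  repeat_interval: 1m     # wait 1m before sending the same alert again
  group_by: ["zus"]       # group_by takes label names; no alert here has a label named "zus" (it is the value of the "group" label), so groupLabels arrives empty in the webhook payload
  routes:                 # sub-routes: alerts with the rule label severity: page go to receiver mengyuan; if routes is omitted, the default mengyuan2 is used
  - receiver: mengyuan
    match:
      severity: page
receivers:
- name: 'mengyuan'
  webhook_configs:        # I use the webhook_configs hook here; the alert information is sent to 127.0.0.1:8000
  - url: http://127.0.0.1:8000
    send_resolved: true
- name: 'mengyuan2'
  webhook_configs:
  - url: http://127.0.0.1:8000/2
    send_resolved: true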
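The routing decision in this file can be pictured as a simple lookup: try each sub-route in order and fall back to the top-level receiver. Here is a rough sketch of that idea; `pick_receiver` is a made-up name, and the model handles only exact `match` entries on one level (no match_re, nested routes, or continue).

```python
def pick_receiver(labels, route):
    """Return the receiver name for an alert with the given labels."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        # A sub-route applies when every label in `match` equals the alert's label.
        if all(labels.get(key) == value for key, value in match.items()):
            return child["receiver"]
    # No sub-route matched: use the default receiver.
    return route["receiver"]


# The route section of alertmanager.yml, written as a dict.
route = {
    "receiver": "mengyuan2",
    "routes": [{"receiver": "mengyuan", "match": {"severity": "page"}}],
}

print(pick_receiver({"alertname": "InstanceDown", "severity": "page"}, route))
print(pick_receiver({"alertname": "Other", "severity": "warning"}, route))
```

So the InstanceDown rule, which carries severity: page, is delivered to http://127.0.0.1:8000, while anything else would go to http://127.0.0.1:8000/2.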
6. Receiving the alert messages in Django
Django's request.body receives JSON data that looks roughly like this:
{"receiver":"mengyuan","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"InstanceDown","group":"zus","instance":"localhost:9100","job":"prometheus","severity":"page"},"annotations":{"description":"localhost:9100 of job prometheus has been down for more than 5 minutes.","summary":"Instance localhost:9100 down dangqian 0"},"startsAt":"2018-04-06T22:34:13.51281763+08:00","endsAt":"2018-04-06T23:07:43.514552824+08:00","generatorURL":"http://vm-local1:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1"}],"groupLabels":{},"commonLabels":{"alertname":"InstanceDown","group":"zus","instance":"localhost:9100","job":"prometheus","severity":"page"},"commonAnnotations":{"description":"localhost:9100 of job prometheus has been down for more than 5 minutes.","summary":"Instance localhost:9100 down dangqian 0"},"externalURL":"http://vm-local1:9093","version":"4","groupKey":"{}/{severity=\"page\"}:{}"}
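At this point I can use the received data to call a mail API or some other third-party alerting API.
Summary:
I am just getting started myself; this is only a set of notes.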
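As a starting point for that last step, the payload can be reduced to one readable line per alert before it is handed to a mail or chat API. This sketch uses only the json module; `summarize_alerts` is a hypothetical helper, and the field names are the ones visible in the sample payload shown earlier. In a Django view it would be called as `summarize_alerts(request.body)`.

```python
import json


def summarize_alerts(body):
    """Turn an Alertmanager webhook body (bytes or str) into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{status}] {name} on {instance}: {summary}".format(
            status=alert.get("status", "unknown"),
            name=labels.get("alertname", "?"),
            instance=labels.get("instance", "?"),
            summary=annotations.get("summary", ""),
        ))
    return lines


# A trimmed copy of the payload shown above.
sample = (
    b'{"receiver":"mengyuan","status":"resolved","alerts":[{"status":"resolved",'
    b'"labels":{"alertname":"InstanceDown","instance":"localhost:9100"},'
    b'"annotations":{"summary":"Instance localhost:9100 down dangqian 0"}}]}'
)
print(summarize_alerts(sample))
```

Because send_resolved is true in alertmanager.yml, the same endpoint receives both "firing" and "resolved" notifications, which is why the status is included in each line.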