This article analyzes Prometheus's common configuration options and service discovery, in four parts: overview, configuration details, service discovery, and common scenarios.
Prometheus can be configured through command-line flags or a configuration file; inside a Kubernetes cluster, the configuration usually lives in a ConfigMap (everything below refers to Prometheus 2.7).
To list the available command-line flags, run ./prometheus -h
You can also point the server at a configuration file with the --config.file flag, typically prometheus.yml
If the configuration changes, for example when a new scrape job is added, Prometheus can reload it without a restart: send SIGHUP to the process, or send an HTTP POST to the /-/reload endpoint. For example:
curl -X POST http://localhost:9090/-/reload
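Note that in Prometheus 2.x the /-/reload endpoint is disabled by default; the server has to be started with the --web.enable-lifecycle flag for the POST above to work. A minimal sketch of both reload paths:

# start with the lifecycle endpoints enabled
./prometheus --config.file=prometheus.yml --web.enable-lifecycle

# then either of these triggers a configuration reload
curl -X POST http://localhost:9090/-/reload
kill -HUP $(pidof prometheus)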
Running ./prometheus -h shows what each flag means. For example:
--web.listen-address="0.0.0.0:9090"
  Listen address; the default port is 9090. It can be restricted to local access only, or moved to a different port for safety (the built-in web UI has no authentication).
--web.max-connections=512
  Maximum number of simultaneous connections; default 512.
--storage.tsdb.path="data/"
  Storage path; defaults to the data/ directory.
--storage.tsdb.retention.time=15d
  Data retention period; default 15 days. The old storage.tsdb.retention flag is deprecated.
--alertmanager.timeout=10s
  Timeout for sending alerts to Alertmanager; default 10s.
--query.timeout=2m
  Query timeout; default 2 minutes, after which the query is killed automatically. It can be paired with a Grafana timeout such as 60s.
--query.max-concurrency=20
  Maximum number of concurrent queries. Prometheus's own metrics include prometheus_engine_queries_concurrent_max, which exposes the configured maximum and the current query load.
--log.level=info
  Log level, one of [debug, info, warn, error]; switch to debug when troubleshooting.
...
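Putting several of these flags together, a startup command might look like the following sketch (the values are the defaults discussed above and are illustrative):

./prometheus \
  --config.file=prometheus.yml \
  --web.listen-address="0.0.0.0:9090" \
  --web.max-connections=512 \
  --storage.tsdb.path="data/" \
  --storage.tsdb.retention.time=15d \
  --alertmanager.timeout=10s \
  --query.timeout=2m \
  --query.max-concurrency=20 \
  --log.level=info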
On the Prometheus web UI, under Status > Command-Line Flags, you can see the flag values currently in effect (for example, those of a prometheus-operator deployment).
The prometheus binary downloaded from the official download page ships with a default configuration file, prometheus.yml:
-rw-r--r--@ LICENSE
-rw-r--r--@ NOTICE
drwxr-xr-x@ console_libraries
drwxr-xr-x@ consoles
-rwxr-xr-x@ prometheus
-rw-r--r--@ prometheus.yml
-rwxr-xr-x@ promtool
prometheus.yml covers a lot of settings, including remote storage, alerting, and much more; the main ones are annotated below:
# Default global configuration
global:
  scrape_interval: 15s      # scrape every 15s; the default is 1m
  evaluation_interval: 15s  # evaluate rules every 15s; the default is 1m
  scrape_timeout: 10s       # scrape timeout; the default is 10s
  external_labels:          # labels attached when talking to external systems,
                            # e.g. remote storage or federation
    prometheus: monitoring/k8s            # e.g. the prometheus-operator values
    prometheus_replica: prometheus-k8s-1

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093   # the Alertmanager address, e.g. 127.0.0.1:9093
  alert_relabel_configs:       # rewrite alert labels before alerts are sent to Alertmanager
    - separator: ;
      regex: prometheus_replica
      replacement: $1
      action: labeldrop

# Once loaded, rule files are evaluated every evaluation_interval (15s here);
# more than one rule file can be listed
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# scrape_configs holds the scrape jobs; at least one is needed
scrape_configs:
  # Prometheus self-monitoring; scraped series get the label job=prometheus
  - job_name: 'prometheus'
    # the default metrics path is /metrics, e.g. localhost:9090/metrics
    # the default scheme is http
    static_configs:
      - targets: ['localhost:9090']

# Optional remote read/write, e.g. sending monitoring data to an InfluxDB
# address instead of local storage; both take a list of URL entries
remote_write:
  - url: http://127.0.0.1:8090   # remote write endpoint
remote_read:
  - url: http://127.0.0.1:8090   # remote read endpoint
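Before reloading an edited prometheus.yml, it is worth validating it with the promtool binary that ships in the same tarball (visible in the directory listing above); a quick sketch:

./promtool check config prometheus.yml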
The most frequently edited part of the Prometheus configuration is scrape_configs, e.g. to add a new scrape target or to change an existing target's address or scrape frequency.
The simplest configuration is:
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - localhost:9090
A complete configuration looks like this (with prometheus-operator's recommended settings attached):
# The job name is attached as a label to every scraped series,
# e.g. data scraped by node-exporter carries job=node-exporter
job_name: node-exporter
# scrape frequency: 30s
scrape_interval: 30s
# scrape timeout: 10s
scrape_timeout: 10s
# path to scrape on the target
metrics_path: /metrics
# scrape protocol: http or https
scheme: https
# optional URL query parameters (values are lists)
params:
  name: [demo]
# how conflicts between custom labels and scraped labels are handled;
# by default (false) the conflicting scraped label is renamed to exported_xx
honor_labels: false
# credentials for targets that require authentication
basic_auth:
  username: admin
  password: admin
  password_file: /etc/pwd
# bearer token, inline or from a file (OAuth 2.0 style auth)
bearer_token: kferkhjktdgjwkgkrwg
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# TLS settings, e.g. skipping verification or pointing at certificate files
tls_config:
  # insecure_skip_verify: true
  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  server_name: kubernetes
  insecure_skip_verify: false
# proxy address
proxy_url: 127.9.9.0:9999
# Azure service discovery
azure_sd_configs:
# Consul service discovery
consul_sd_configs:
# DNS service discovery
dns_sd_configs:
# EC2 service discovery
ec2_sd_configs:
# OpenStack service discovery
openstack_sd_configs:
# file-based service discovery
file_sd_configs:
# GCE service discovery
gce_sd_configs:
# Marathon service discovery
marathon_sd_configs:
# AirBnB Nerve service discovery
nerve_sd_configs:
# Zookeeper Serverset service discovery
serverset_sd_configs:
# Triton service discovery
triton_sd_configs:
# Kubernetes service discovery
kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
        - monitoring
# static target configuration, e.g. for attaching fixed labels
static_configs:
  - targets: ['localhost:9090', 'localhost:9191']
    labels:
      my: label
      your: label
# relabel_configs rewrites label values from the target's metadata before
# Prometheus scrapes it, e.g. turning the raw __meta_kubernetes_namespace
# into a clean, readable namespace label
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: web
    action: replace
# metric relabeling, e.g. dropping metrics that are not needed
metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: etcd_(debugging|disk|request|server).*
    replacement: $1
    action: drop
# cap on the number of samples per scrape; a scrape exceeding it fails.
# 0 (the default) means no limit
sample_limit: 0
The configuration above contains many *_sd_configs entries, such as kubernetes_sd_configs; these are the scrape settings used for service discovery.
The supported service discovery types:
// prometheus/discovery/config/config.go
type ServiceDiscoveryConfig struct {
	StaticConfigs       []*targetgroup.Group           `yaml:"static_configs,omitempty"`
	DNSSDConfigs        []*dns.SDConfig                `yaml:"dns_sd_configs,omitempty"`
	FileSDConfigs       []*file.SDConfig               `yaml:"file_sd_configs,omitempty"`
	ConsulSDConfigs     []*consul.SDConfig             `yaml:"consul_sd_configs,omitempty"`
	ServersetSDConfigs  []*zookeeper.ServersetSDConfig `yaml:"serverset_sd_configs,omitempty"`
	NerveSDConfigs      []*zookeeper.NerveSDConfig     `yaml:"nerve_sd_configs,omitempty"`
	MarathonSDConfigs   []*marathon.SDConfig           `yaml:"marathon_sd_configs,omitempty"`
	KubernetesSDConfigs []*kubernetes.SDConfig         `yaml:"kubernetes_sd_configs,omitempty"`
	GCESDConfigs        []*gce.SDConfig                `yaml:"gce_sd_configs,omitempty"`
	EC2SDConfigs        []*ec2.SDConfig                `yaml:"ec2_sd_configs,omitempty"`
	OpenstackSDConfigs  []*openstack.SDConfig          `yaml:"openstack_sd_configs,omitempty"`
	AzureSDConfigs      []*azure.SDConfig              `yaml:"azure_sd_configs,omitempty"`
	TritonSDConfigs     []*triton.SDConfig             `yaml:"triton_sd_configs,omitempty"`
}
Because Prometheus pulls its monitoring data, the server side has to decide what to scrape, i.e. the jobs configured in scrape_configs. The main drawback of the pull model is that the server cannot detect new services on its own, which is why most monitoring systems support a service discovery mechanism that automatically finds new endpoints in the cluster and adds them to the configuration.
Prometheus supports many service discovery mechanisms: files, DNS, Consul, Kubernetes, OpenStack, EC2, and so on. The discovery process itself is not complicated: through a third-party interface, Prometheus obtains the list of targets to monitor, then polls those targets for metrics.
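As a concrete illustration, file-based discovery is the simplest of these mechanisms: Prometheus watches a set of JSON or YAML files and picks up target changes without a restart. A minimal sketch, with hypothetical paths and addresses:

scrape_configs:
  - job_name: 'file-sd-demo'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json   # hypothetical target directory
        refresh_interval: 5m

# /etc/prometheus/targets/nodes.json could then contain:
# [
#   { "targets": ["10.0.0.1:9100", "10.0.0.2:9100"],
#     "labels": { "env": "prod" } }
# ]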
For Kubernetes, Prometheus talks to the Kubernetes API and then polls the discovered resource endpoints. Five discovery roles are currently supported: Node, Service, Pod, Endpoints, and Ingress, selected with role: node, role: service, and so on in the configuration.
For example, to dynamically discover all nodes, add the following configuration:
- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: null
      role: node
      namespaces:
        names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: kubernetes.default.svc:443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics
      action: replace
and the discovered nodes will show up on the Targets page.
Services and pods are discovered the same way, with role: service or role: pod; see the sketch below.
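A hedged sketch of a pod-level job (the annotation-based keep rule is an illustrative convention, not a requirement):

- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # keep only pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      regex: "true"
      action: keep
    # carry the namespace over as a clean label
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
      action: replace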
Note that for Prometheus to access the Kubernetes API, it must be granted access via a ServiceAccount; without that authorization, the discovery configuration alone has no permission to fetch anything.
Prometheus's permissions are a ClusterRole + ClusterRoleBinding + ServiceAccount trio, with the ServiceAccount then referenced in the Deployment or StatefulSet.
ClusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: kube-system
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - nodes/proxy
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions"]
    resources:
      - daemonsets
      - deployments
      - replicasets
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - cronjobs
      - jobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["policy"]
    resources:
      - poddisruptionbudgets
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
ClusterRoleBinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  namespace: kube-system
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
ServiceAccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: kube-system
  name: prometheus
prometheus.yaml
....
spec:
  serviceAccountName: prometheus
....
A complete Kubernetes configuration looks like this:
- job_name: kubernetes-apiservers
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: null
      role: endpoints
      namespaces:
        names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: default;kubernetes;https
      replacement: $1
      action: keep
- job_name: kubernetes-nodes
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: null
      role: node
      namespaces:
        names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: kubernetes.default.svc:443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics
      action: replace
- job_name: kubernetes-cadvisor
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
    - api_server: null
      role: node
      namespaces:
        names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
    - separator: ;
      regex: __meta_kubernetes_node_label_(.+)
      replacement: $1
      action: labelmap
    - separator: ;
      regex: (.*)
      target_label: __address__
      replacement: kubernetes.default.svc:443
      action: replace
    - source_labels: [__meta_kubernetes_node_name]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      action: replace
- job_name: kubernetes-service-endpoints
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - api_server: null
      role: endpoints
      namespaces:
        names: []
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      separator: ;
      regex: "true"
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      separator: ;
      regex: (https?)
      target_label: __scheme__
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      separator: ;
      regex: (.+)
      target_label: __metrics_path__
      replacement: $1
      action: replace
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      separator: ;
      regex: ([^:]+)(?::\d+)?;(\d+)
      target_label: __address__
      replacement: $1:$2
      action: replace
    - separator: ;
      regex: __meta_kubernetes_service_label_(.+)
      replacement: $1
      action: labelmap
    - source_labels: [__meta_kubernetes_namespace]
      separator: ;
      regex: (.*)
      target_label: kubernetes_namespace
      replacement: $1
      action: replace
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: (.*)
      target_label: kubernetes_name
      replacement: $1
      action: replace
Once the configuration is in place, the corresponding targets appear on the Targets page.
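The kubernetes-service-endpoints job above keeps only targets whose Service carries the prometheus.io/scrape annotation, so exposing a new service to Prometheus is just a matter of annotating it. A sketch with hypothetical names and port:

apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical service
  annotations:
    prometheus.io/scrape: "true"    # opt in to scraping
    prometheus.io/port: "8080"      # override the scrape port
    prometheus.io/path: "/metrics"  # override the metrics path
spec:
  selector:
    app: my-app
  ports:
    - port: 8080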
For example, when scraping node data with the Kubernetes role: node, the node's zone is available through the __meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone meta label; it comes from the failure-domain.beta.kubernetes.io/zone label applied to nodes at cluster creation, which kubectl describe node shows.
A new label can then be defined via relabel_configs:
relabel_configs:
  - source_labels: ["__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone"]
    regex: "(.*)"
    replacement: $1
    action: replace
    target_label: "zone"
Afterwards, queries can filter by region directly, e.g. node{zone="XX"}.
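The new label also works in aggregations; for example, a hedged query counting healthy nodes per zone (the metric and job names are illustrative):

sum by (zone) (up{job="kubernetes-nodes"})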
People in different roles (development, testing, operations) may each care about only a subset of the monitoring data, and may run their own Prometheus server for the metrics they need. Unneeded data should be filtered out to avoid wasting resources, with a configuration like this:
metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: etcd_(debugging|disk|request|server).*
    replacement: $1
    action: drop
action: drop means the matching metrics are discarded and never ingested.
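Conversely, action: keep whitelists metrics: everything that does not match the regex is dropped. A sketch for a team that only cares about node CPU and memory metrics (the regex is illustrative):

metric_relabel_configs:
  - source_labels: [__name__]
    regex: node_(cpu|memory).*
    action: keep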
If there are multiple regions, each with many nodes or clusters, the standard federation setup can be used: every region runs its own Prometheus server that scrapes its local data, and a single global server then scrapes all the regional servers, presenting a unified view grouped by region.
Configuration:
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node.*"}'
    static_configs:
      - targets:
          - '192.168.77.11:9090'
          - '192.168.77.12:9090'
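For the global server to tell the regions apart, each regional Prometheus should set a distinguishing external label; because the federate job above uses honor_labels: true, that label survives federation and is attached to every forwarded series. A sketch with a hypothetical region value:

# global section of each regional Prometheus server
global:
  external_labels:
    region: bj-01   # hypothetical region identifier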
This article is part of the container monitoring practice series; for the full content see: container-monitor-book