k8s Monitoring (Part 4): Monitoring the Host Machines

This article is part of the k8s monitoring series.

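This is the fourth article on k8s monitoring, and it covers monitoring host-level metrics. The official setup and most users rely on node_exporter for this job, but I prefer telegraf. The reason is that node_exporter has several major pain points:

Too many metrics. For cpu alone, every core has 6 metrics, so a 72-core machine produces 432 cpu metrics (6 × 72), which is hard to make sense of;
You cannot customize which metrics are collected. You either collect a whole category or skip it entirely; you cannot collect only part of a category;
It does not support custom monitoring scripts;
There are no metrics for the 11 tcp states (or maybe I just don't know where to look?), and I have no idea what all those network metrics are for, since I can't make sense of any of them.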

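telegraf has none of these problems. With that in mind, this article deploys both of them, and which one you pick is up to you.

As usual, all the yml files have been uploaded to github.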

node_exporter

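You only need to pay attention to the following points:

Use a daemonset so that it is deployed on every k8s node;
Mount the host's /proc and /sys, and apparently the root filesystem as well;
Use the host network namespace.

There are only 5 deployment files, all prefixed with node-exporter-. Their purposes are obvious at a glance, so I won't go into detail. Run kubectl apply first, wait until the pods are running, and then you can hit port 9100 on the host directly to see which metrics are available: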

curl 127.0.0.1:9100/metrics
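To get a feel for how many series that page actually contains, the output can be counted per metric name. The following is a minimal sketch, not part of the original deployment: the function name `count_series` and the sample text are made up for illustration, and the parsing only handles the simple cases of the Prometheus text format.

```python
from collections import Counter


def count_series(metrics_text: str) -> Counter:
    """Count time series per metric name in Prometheus text exposition output."""
    counts = Counter()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Hypothetical sample shaped like node_exporter output for a 2-core host.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="0",mode="user"} 20.1
node_cpu_seconds_total{cpu="1",mode="idle"} 990.2
node_cpu_seconds_total{cpu="1",mode="user"} 25.7
node_load1 0.42
"""

if __name__ == "__main__":
    for name, n in count_series(SAMPLE).most_common():
        print(f"{n:4d}  {name}")
```

In practice you would feed it the saved output of the curl command above instead of `SAMPLE`.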

Once you have confirmed that the metrics are being collected, edit prometheus-config.yml and add the following configuration:

- job_name: node_exporter
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scrape_timeout: 30s
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: node-exporter

Note that the - in the service label name k8s-app must be written as _ here, otherwise reloading prometheus will report an error.
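The renaming follows from how prometheus builds meta labels during Kubernetes service discovery: characters that are not allowed in a label name are converted to underscores. Here is a small sketch of that rule; the helper name `sd_label_name` is my own, and it approximates the behavior rather than reproducing prometheus's code.

```python
import re


def sd_label_name(kind: str, label: str) -> str:
    """Approximate the meta label name prometheus derives from a k8s label."""
    # Anything outside [a-zA-Z0-9_] is not valid in a label name,
    # so it is replaced with an underscore.
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", label)
    return f"__meta_kubernetes_{kind}_label_{sanitized}"


if __name__ == "__main__":
    print(sd_label_name("service", "k8s-app"))
    print(sd_label_name("service", "app.kubernetes.io/name"))
```

The first call yields the exact name used in source_labels above.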

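After the change, run kubectl apply -f prometheus-config.yml. At this point it is best to log in to the prometheus container and check whether the configuration file has been updated. Once it has, you can reload from the host: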

curl -XPOST POD_IP:9090/-/reload
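If you would rather script the reload than type the curl command, the same request can be sent from Python. This is only a sketch: the function names are made up, the host is whatever POD_IP is in your cluster, and it assumes prometheus was started with --web.enable-lifecycle, since the reload endpoint is disabled without that flag.

```python
import urllib.request


def build_reload_request(host: str, port: int = 9090) -> urllib.request.Request:
    """Build the POST request equivalent to `curl -XPOST host:port/-/reload`."""
    url = f"http://{host}:{port}/-/reload"
    return urllib.request.Request(url, method="POST")


def reload_prometheus(host: str, port: int = 9090, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only builds the request; call reload_prometheus() with a real pod IP to send it.
    req = build_reload_request("POD_IP")
    print(req.get_method(), req.full_url)
```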

After that, the job shows up under targets in the prometheus web page.

telegraf

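node_exporter exposes too many metrics whose purpose is unclear, and they take up a lot of extra resources, so I chose the more customizable telegraf. telegraf is a metrics collection tool written in go by InfluxData. Another InfluxData product, influxdb, is very well known; these two, together with Chronograf and Kapacitor, make up InfluxData's monitoring stack, tick.

I won't say more about tick here, since we only use telegraf. telegraf is somewhat similar to logstash: it is divided into four parts, Input, Processor, Aggregator, and Output, and each part gets its concrete functionality from plugins. In other words, all of telegraf's functionality is provided by plugins, and the plugins simply fall into four categories.

This article uses Input, Output, and Processor. As for Aggregator (aggregation, used to compute the maximum, minimum, average, and so on over a period of time), interested readers can look into it themselves.

Here we use telegraf to collect host performance metrics. Because there are many kinds of metrics, including cpu, memory, disk, and network, we use several input plugins. Some plugins offer options that give us finer control over which metrics are collected, which is very convenient and far more useful than node_exporter collecting everything wholesale.

Once we have these metrics, prometheus has to scrape them, so we use the prometheus output, which exposes all collected metrics on a metrics page.

Let's start with its configuration file, which holds all of its settings. Before looking at it, we need to know a couple of telegraf's own concepts:

field: the name of a metric;
tag: a label on a metric.

To avoid showing things twice, I'll just paste the content of the configmap directly. We only need to read from [agent] onward.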
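To make field and tag concrete, here is a rough sketch of how one telegraf metric ends up on the prometheus metrics page: each field becomes its own metric named after the plugin and the field, and the tags become labels. The function `to_prometheus` and the sample values are illustrative assumptions, not telegraf's actual code.

```python
def to_prometheus(measurement: str, tags: dict, fields: dict) -> list:
    """Render one telegraf-style metric as Prometheus exposition lines."""
    # Tags become labels, rendered as key="value" pairs.
    label_str = ",".join(f'{key}="{value}"' for key, value in sorted(tags.items()))
    lines = []
    for field, value in sorted(fields.items()):
        # Each field becomes a separate metric: <measurement>_<field>.
        name = f"{measurement}_{field}"
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return lines


if __name__ == "__main__":
    # Hypothetical values, shaped like the cpu and mem inputs used in this article.
    for line in to_prometheus("cpu", {"cpu": "cpu-total"}, {"usage_idle": 97.5}):
        print(line)
    for line in to_prometheus("mem", {}, {"total": 1024, "used": 512}):
        print(line)
```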

apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    name: telegraf
data:
  telegraf.conf: |+
    [agent]
      interval = "10s"
      round_interval = true
      collection_jitter = "1s"
      omit_hostname = true
    [[outputs.prometheus_client]]
      listen = ":9273"
      collectors_exclude = ["gocollector", "process"]
      metric_version = 2
    [[inputs.cpu]]
      percpu = false
      totalcpu = true
      collect_cpu_time = false
      report_active = false
    [[inputs.disk]]
      ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
      [inputs.disk.tagdrop]
        path = ["/etc/telegraf", "/dev/termination-log", "/etc/hostname", "/etc/hosts", "/etc/resolv.conf"]
    [[inputs.diskio]]
    [[inputs.kernel]]
    [[inputs.mem]]
      fielddrop = ["slab", "wired", "commit_limit", "committed_as", "dirty", "high_free", "high_total", "huge_page_size", "huge_pages_free", "low_free", "low_total", "mapped", "page_tables", "sreclaimable", "sunreclaim", "swap_cached", "swap_free", "vmalloc_chunk", "vmalloc_total", "vmalloc_used", "write_back", "write_back_tmp"]
    [[inputs.processes]]
    [[inputs.system]]
    [[inputs.netstat]]
    [[inputs.net]]
      ignore_protocol_stats = true
      interfaces = ["eth*", "bond*", "em*"]
      fielddrop = ["packets_sent", "packets_recv"]

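Configuration file

The official documentation for the telegraf configuration file is not long, so have a look if you like. If you don't want to read it, that's fine too; I'll explain every setting I use here. telegraf uses the toml configuration format, in which [] denotes a dictionary (a table) and [[]] denotes a list (an array of tables). Converted to yml, it looks like this: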

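agent:
  # Collection interval
  interval: 30s
  # Align collection times to the interval (with a 30s interval, collect on :00 and :30)
  round_interval: true
  # If several inputs collect at the same moment it can cause a cpu spike; this jitter staggers them
  collection_jitter: 1s
  # Do not add the hostname tag (label) to every metric
  omit_hostname: true
inputs:
  - disk:
      # Do not collect the listed filesystem types
      ignore_fs: []
      # Do not collect metrics whose path tag has one of the following values
      tagdrop:
        path: ["/etc/telegraf", "/dev/termination-log", "/etc/hostname", "/etc/hosts", "/etc/resolv.conf"]
  - system: {}
  - cpu:
      # Do not create separate metrics for each cpu. node_exporter works per cpu and you cannot turn that off, but telegraf can
      percpu: false
      # Definitely enable this: it reports total cpu usage
      totalcpu: true
      # Collect cpu time; enable it if you need it, it is usually left off
      collect_cpu_time: false
      # Whether to add an active metric, whose value is the sum of everything except idle. If you are not
      # collecting cpu time, you can simply subtract idle from 100 to get the same value
      report_active: false
  - mem:
      # Fields with these names are not collected; see the mem input documentation for the available fields
      # I don't understand many of the memory metrics, so I simply dropped them; use your own judgment
      fielddrop: []
outputs:
  - prometheus_client:
      listen: ":9273"
      # Exclude the go runtime (goroutine, gc, etc.) and process metrics
      collectors_exclude: ["gocollector", "process"]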

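Of telegraf's four parts, the Processor role used in this article has no section of its own in my configuration: here it only does filtering, and the filter options can be set inside Input, Output, and Aggregator plugins. (Strictly speaking, telegraf also has dedicated processor plugins configured under [[processors.*]]; the options below are per-plugin metric filters, which I loosely treat as the Processor part.) The fielddrop and tagdrop entries in the Input sections above are such filters. The filter keywords are:

namepass: filters on the metric name. pass means whitelist: only metrics whose name matches are collected. Its value is a list, and the elements may use wildcards;
namedrop: the blacklist counterpart. Note that name and field are not the same thing: for example, the memory metrics have a field called total, but the resulting metric is named mem_total;
fieldpass: filters on field names; its value is also a list;
fielddrop: the blacklist counterpart;
tagpass: only metrics whose tags contain a given key/value are collected. Note that its value is a dictionary; it takes the same form as the tagdrop usage above;
tagdrop: the blacklist counterpart; a metric whose tags contain a given key/value is not collected;
taginclude: this one removes tags rather than metrics. Its value is a list, and only the tags in the list are kept;
tagexclude: the tags in the list are removed.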
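The pass/drop semantics can be modelled in a few lines. This is a simplified sketch under my own assumptions: the function names are invented, only four of the filters are covered, and Python's fnmatch stands in for telegraf's glob matching, which may differ in corner cases.

```python
from fnmatch import fnmatchcase


def matches(value: str, patterns: list) -> bool:
    """True if the value matches any glob pattern in the list."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def apply_filters(tags, fields, fieldpass=None, fielddrop=None,
                  tagpass=None, tagdrop=None):
    """Return the surviving fields, or None if the whole metric is dropped."""
    # tagpass (whitelist): keep the metric only if a listed tag matches.
    if tagpass and not any(key in tags and matches(tags[key], pats)
                           for key, pats in tagpass.items()):
        return None
    # tagdrop (blacklist): drop the metric if a listed tag matches.
    if tagdrop and any(key in tags and matches(tags[key], pats)
                       for key, pats in tagdrop.items()):
        return None
    kept = {}
    for name, value in fields.items():
        if fieldpass and not matches(name, fieldpass):
            continue  # not on the field whitelist
        if fielddrop and matches(name, fielddrop):
            continue  # on the field blacklist
        kept[name] = value
    # A metric with no fields left has nothing to report.
    return kept or None


if __name__ == "__main__":
    print(apply_filters({}, {"total": 8, "slab": 1, "used": 4}, fielddrop=["slab"]))
    print(apply_filters({"path": "/etc/hosts"}, {"used": 1},
                        tagdrop={"path": ["/etc/hosts"]}))
```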

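I only use tagdrop and fielddrop; the others are there if you need them. With these filters it is easy to drop the metrics we don't need, which is very convenient.

With this in mind, you should have no trouble understanding the configuration I use here. I only collect some common system metrics; if you need anything else, check the official input documentation and take your pick from the plugins.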

pod

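telegraf obviously runs as a daemonset too. A few key points about its pod configuration are worth mentioning.

The root filesystem must be mounted into the container; if you mount only /proc and /sys, the disk metrics will be wrong;
The HOST_PROC, HOST_SYS, and HOST_MOUNT_PREFIX environment variables tell telegraf to collect from the host directories mounted into the container;
hostNetwork and hostPID must both be true;
securityContext makes the pod run as a non-root user. The uid specified is a uid on the host, and it does not matter whether that user exists in the image. Once the pod is running, you can check which user it runs as on the pod's host with ps -ef|grep telegraf.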
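The idea behind HOST_PROC and HOST_SYS is simply that the collector looks up its base directory from the environment and falls back to the usual location. A minimal sketch of that lookup follows; the helper `host_path` is hypothetical and only illustrates the mechanism, it is not telegraf's implementation.

```python
import os
import posixpath

# Environment variable and default base directory for each kind of path.
_BASES = {
    "proc": ("HOST_PROC", "/proc"),
    "sys": ("HOST_SYS", "/sys"),
}


def host_path(kind: str, *parts: str, environ=os.environ) -> str:
    """Resolve a path under the host's /proc or /sys as seen from the container."""
    env_name, default = _BASES[kind]
    base = environ.get(env_name, default)
    return posixpath.join(base, *parts)


if __name__ == "__main__":
    # With the env values from the daemonset, reads go to the mounted host tree.
    env = {"HOST_PROC": "/host/proc", "HOST_SYS": "/host/sys"}
    print(host_path("proc", "meminfo", environ=env))
    # Without them, the container's own /proc would be read instead.
    print(host_path("proc", "meminfo", environ={}))
```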

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    k8s-app: telegraf
spec:
  selector:
    matchLabels:
      name: telegraf
  template:
    metadata:
      labels:
        name: telegraf
    spec:
      containers:
      - name: telegraf
        image: telegraf:1.13.2-alpine
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 500Mi
        env:
        - name: "HOST_PROC"
          value: "/host/proc"
        - name: "HOST_SYS"
          value: "/host/sys"
        - name: "HOST_MOUNT_PREFIX"
          value: "/host"
        volumeMounts:
        - name: config
          mountPath: /etc/telegraf
          readOnly: true
        - mountPath: /host
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      tolerations:
      - operator: Exists
      terminationGracePeriodSeconds: 30
      volumes:
      - name: config
        configMap:
          name: telegraf
      - hostPath:
          path: /
        name: root

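I won't say much about the service; it exists so that prometheus can discover telegraf. After applying all three files, modify the prometheus configuration.

Modifying the prometheus configuration

The prometheus configuration may differ depending on how you use it. Here is the configuration first: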

- job_name: telegraf
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - monitoring
  scrape_interval: 30s
  scrape_timeout: 30s
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - action: keep
    source_labels:
    - __meta_kubernetes_service_label_k8s_app
    regex: telegraf
  - source_labels:
    - __meta_kubernetes_endpoint_node_name
    target_label: instance

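Only one thing is added here: the instance label is set to the node name instead of the default __address__. If you want to keep the default instance, you can change instance to whatever label name you like. I replaced the default instance rather than adding another label because I wanted to keep the metrics lean.

The reason for using the node name is to work with the kubectl top node command (this involves the previous article in this series). The value of the instance label therefore has to correspond one-to-one with the names shown by kubectl get node (they are basically identical). Of course, if kubectl top node already works for you, there is no need to add this label.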
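What that last relabel rule does can be sketched as a plain dictionary operation. This is a simplified model of the default replace action, assuming the default regex and replacement; the function name and the sample label values are made up for illustration.

```python
def relabel_replace(labels: dict, source_labels: list, target_label: str,
                    separator: str = ";") -> dict:
    """Model a relabel rule with the default `replace` action.

    The source label values are joined with the separator and written to
    the target label. Regex and replacement handling is left out.
    """
    value = separator.join(labels.get(name, "") for name in source_labels)
    result = dict(labels)
    result[target_label] = value
    return result


if __name__ == "__main__":
    # Hypothetical labels for one discovered telegraf endpoint.
    discovered = {
        "__address__": "10.0.0.5:9273",
        "__meta_kubernetes_endpoint_node_name": "k8s-node1",
    }
    relabeled = relabel_replace(
        discovered, ["__meta_kubernetes_endpoint_node_name"], "instance"
    )
    print(relabeled["instance"])
```

Without the rule, instance would simply default to the value of __address__.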

After the change, apply it, then again exec into the prometheus container and check whether /etc/prometheus/config/prometheus.yml has been updated. Once it has, go back to the host and run:

curl -XPOST PROMETHEUS_CONTAINER_IP:9090/-/reload

After the reload you can access the metrics page directly via the host ip.

curl IP:9273/metrics

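As you can see, the metrics are clear and easy to understand, and there are far fewer of them, which is a big step up from node_exporter.

Modifying the prometheus adapter configuration

In the previous article we deployed prometheus adapter and used it to provide the resource metrics api, which is what makes the kubectl top command work. However, because I dropped the metrics carrying the id="/" label, the default queries for node metrics no longer work.

If you want to use them, you can restore the dropped metrics. The default queries look like this: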

# cpu
sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>, id='/'}[1m])) by (<<.GroupBy>>)

# memory
sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'}) by (<<.GroupBy>>)

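But restoring them all would increase the number of metrics considerably, which is hardly worth it. Since we are already collecting host metrics, we can simply have the adapter query those; there is no need to query the container metrics at all. So we only need to replace the two queries with the following: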

# cpu
100-cpu_usage_idle{cpu="cpu-total", <<.LabelMatchers>>}

# mem
mem_used{<<.LabelMatchers>>}

But you must make sure the following configuration is present:

resources:
  overrides:
    instance:
      resource: nodes

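This configuration maps the node resource to the value of the instance label. When you run kubectl top node, it first fetches all the nodes and then substitutes each node into the query expression. For example, querying the cpu of the node named k8s-node1: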

100-cpu_usage_idle{cpu="cpu-total", instance="k8s-node1"}
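The substitution step can be pictured as a plain string replacement. The sketch below is a simplification for a single node: the function name is my own, and the real adapter builds the matchers from its resource mapping rather than from a hard-coded label.

```python
CPU_TEMPLATE = '100-cpu_usage_idle{cpu="cpu-total", <<.LabelMatchers>>}'
MEM_TEMPLATE = "mem_used{<<.LabelMatchers>>}"


def render_node_query(template: str, label: str, node: str) -> str:
    """Fill the <<.LabelMatchers>> placeholder for a single node."""
    matcher = f'{label}="{node}"'
    return template.replace("<<.LabelMatchers>>", matcher)


if __name__ == "__main__":
    # "instance" is the label mapped to the nodes resource in the overrides above.
    print(render_node_query(CPU_TEMPLATE, "instance", "k8s-node1"))
    print(render_node_query(MEM_TEMPLATE, "instance", "k8s-node1"))
```

This also shows why the instance label has to carry the node name: the matcher is built from the name that kubectl get node reports.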

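The complete configuration is available on github; in fact only the two queries were changed.

After applying it, you need to delete the prometheus adapter pod so that it restarts, and then kubectl top node works. The cpu figure is not accurate, though. Maybe cpu_usage_total is missing? I don't really understand how the adapter is implemented, so interested readers can dig into it. Memory is accurate, so just look at memory for now.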
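One possible explanation for the cpu figure, offered as an assumption rather than a confirmed diagnosis: kubectl top reports cpu in cores, while 100-cpu_usage_idle is a percentage of the whole machine, so the number is read in the wrong unit. If that is the cause, the conversion would look like the sketch below; the function name is hypothetical and the core count has to come from somewhere else, for example a per-node cpu count metric.

```python
def percent_to_cores(idle_percent: float, n_cpus: int) -> float:
    """Convert a host-wide idle percentage into the number of cores in use."""
    # Fraction of total cpu capacity that is busy, between 0.0 and 1.0.
    busy_fraction = (100.0 - idle_percent) / 100.0
    # Scale by the core count to express usage in cores.
    return busy_fraction * n_cpus


if __name__ == "__main__":
    # Hypothetical host: 8 cores, 75% idle.
    print(percent_to_cores(75.0, 8))
```

Treat this as a starting point for investigation, not as a verified fix for the adapter query.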