Istio Observability: Metrics

Date: 2021-02-21

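Istio generates detailed telemetry for all service communication within the mesh. This telemetry provides observability into service behavior, letting operators troubleshoot, maintain, and optimize applications without placing any extra burden on service developers. Through Istio, operators gain a thorough understanding of how monitored services interact, both with other services and with the Istio components themselves.

Istio generates the following types of telemetry to provide overall service mesh observability:

Metrics. Istio generates a set of service metrics based on the four "golden signals" of monitoring (latency, traffic, errors, and saturation). Istio also provides detailed metrics for the mesh control plane, along with a default set of mesh monitoring dashboards built on top of these metrics.
Distributed Traces. Istio generates distributed trace spans for each service, giving a detailed view of call flows and service dependencies within the mesh.
Access Logs. As traffic flows into a service within the mesh, Istio can generate a full record of each request, including source and destination metadata. This information makes it possible to audit service behavior down to the individual workload instance level.

This article focuses on metrics.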

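Metrics overview

In most cases, metrics are the starting point for observability. They are cheap to collect, cheap to store, lend themselves to quick analysis, and are a good way to gauge overall health.
To monitor service behavior, Istio generates metrics for all service traffic in, out of, and within the Istio service mesh. These metrics provide information on behaviors such as the overall volume of traffic, the error rates within the traffic, and the response times for requests.
In addition to monitoring the behavior of services within the mesh, it is also important to monitor the behavior of the mesh itself. Istio components export metrics on their own internal behaviors to provide insight into the health and function of the mesh control plane.

The sections below cover the following in detail:

Proxy-level metrics
Service-level metrics
Control plane metrics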

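Proxy-level metrics

Istio metrics collection begins with the sidecar proxies (Envoy). Each proxy generates a rich set of metrics about all traffic passing through it, both inbound and outbound. The proxies also provide detailed statistics about the administrative functions of the proxy itself, including configuration and health information.

Envoy-generated metrics provide monitoring of the mesh at the granularity of Envoy resources, such as listeners and clusters. As a result, monitoring Envoy metrics requires understanding how mesh services map to Envoy resources.

Istio lets us select which Envoy metrics are generated and collected at each workload instance. By default, Istio enables only a small subset of the Envoy-generated statistics, to avoid overwhelming metrics backends and to reduce the CPU overhead associated with metrics collection. We can, however, easily expand the set of collected proxy metrics when needed. This enables targeted debugging of networking behavior while keeping the overall cost of mesh monitoring down.

The Envoy documentation includes a detailed overview of Envoy statistics collection. The operations guide on Envoy Statistics provides more information on controlling the generation of proxy-level metrics.

Example proxy-level metrics: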

envoy_cluster_internal_upstream_rq{response_code_class="2xx",cluster_name="xds-grpc"} 7163

envoy_cluster_upstream_rq_completed{cluster_name="xds-grpc"} 7164

envoy_cluster_ssl_connection_error{cluster_name="xds-grpc"} 0

envoy_cluster_lb_subsets_removed{cluster_name="xds-grpc"} 0

envoy_cluster_internal_upstream_rq{response_code="503",cluster_name="xds-grpc"} 1
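Sample lines like the ones above follow the Prometheus text exposition format: a metric name, an optional set of labels in braces, and a numeric value. As a rough illustration of how such a line breaks down, here is a minimal Python sketch. The function name parse_sample and the regular expressions are assumptions made for this example, not part of Istio, Envoy, or any Prometheus client library, and the sketch handles only simple lines like these (no timestamps, comments, or escaped braces).

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: name="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split a text-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


lines = [
    'envoy_cluster_internal_upstream_rq{response_code_class="2xx",cluster_name="xds-grpc"} 7163',
    'envoy_cluster_upstream_rq_completed{cluster_name="xds-grpc"} 7164',
    'envoy_cluster_internal_upstream_rq{response_code="503",cluster_name="xds-grpc"} 1',
]
samples = [parse_sample(line) for line in lines]

for name, labels, value in samples:
    print(name, labels, value)
```

Each sample is identified by its name together with its full label set, which is why the two envoy_cluster_internal_upstream_rq lines are distinct series.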

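Service-level metrics

In addition to the proxy-level metrics, Istio provides a set of service-oriented metrics for monitoring service communications. These metrics cover the four basic service monitoring needs: latency, traffic, errors, and saturation. Istio ships with a default set of dashboards for monitoring service behaviors based on these metrics.
By default, the standard Istio metrics are exported to Prometheus.
Use of the service-level metrics is entirely optional. Operators may choose to turn off generation and collection of these metrics to meet their individual needs.

Example service-level metric: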

istio_requests_total{
  connection_security_policy="mutual_tls",
  destination_app="details",
  destination_canonical_service="details",
  destination_canonical_revision="v1",
  destination_principal="cluster.local/ns/default/sa/default",
  destination_service="details.default.svc.cluster.local",
  destination_service_name="details",
  destination_service_namespace="default",
  destination_version="v1",
  destination_workload="details-v1",
  destination_workload_namespace="default",
  reporter="destination",
  request_protocol="http",
  response_code="200",
  response_flags="-",
  source_app="productpage",
  source_canonical_service="productpage",
  source_canonical_revision="v1",
  source_principal="cluster.local/ns/default/sa/default",
  source_version="v1",
  source_workload="productpage-v1",
  source_workload_namespace="default"
} 214
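To make the "errors" signal concrete, the following Python sketch derives a per-service 5xx ratio from istio_requests_total samples. It is only an illustration under stated assumptions: the function name error_ratio_by_service is made up, only the first sample mirrors the metric shown above, and the remaining label sets and values are invented. It also works on a single snapshot of cumulative counters, whereas a real dashboard would normally compute rates over a time window in Prometheus.

```python
from collections import defaultdict

# Hypothetical counter snapshot. Only the first entry mirrors the sample
# shown above; the other entries are invented for illustration.
samples = [
    ({"destination_service": "details.default.svc.cluster.local",
      "reporter": "destination", "response_code": "200"}, 214),
    ({"destination_service": "details.default.svc.cluster.local",
      "reporter": "destination", "response_code": "503"}, 6),
    ({"destination_service": "details.default.svc.cluster.local",
      "reporter": "source", "response_code": "200"}, 214),
]


def error_ratio_by_service(samples, reporter="destination"):
    """Return {destination_service: share of requests answered with 5xx}."""
    total = defaultdict(float)
    errors = defaultdict(float)
    for labels, value in samples:
        # Both the source and the destination proxy report a request, so
        # keep a single reporter to count each request once.
        if labels.get("reporter") != reporter:
            continue
        service = labels["destination_service"]
        total[service] += value
        if labels["response_code"].startswith("5"):
            errors[service] += value
    return {svc: errors[svc] / total[svc] for svc in total if total[svc] > 0}


ratios = error_ratio_by_service(samples)
print(ratios)
```

Filtering on the reporter label matters because the same request otherwise shows up twice, once from each side of the connection.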

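Control plane metrics

The Istio control plane also provides a collection of self-monitoring metrics. These metrics allow monitoring of the behavior of Istio itself, as distinct from that of the services within the mesh.

These mainly cover components such as Pilot and Galley. See the official documentation for more information.

Best practices

The best practices here mainly cover two aspects:

A scalable Prometheus for production: a single Prometheus instance has limited scrape and storage capacity, while Istio exposes a rich set of metrics, so a highly available, elastic Prometheus cluster is a must. This article does not go into detail on that point; open-source projects such as Thanos and Cortex can be used to scale Prometheus.
Pre-aggregating some metrics with recording rules.

For example, to aggregate metrics across instances and pods, update the default Prometheus configuration with the following recording rules: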

groups:
- name: "istio.recording-rules"
  interval: 5s
  rules:
  - record: "workload:istio_requests_total"
    expr: |
      sum without(instance, namespace, pod) (istio_requests_total)

  - record: "workload:istio_request_duration_milliseconds_count"
    expr: |
      sum without(instance, namespace, pod) (istio_request_duration_milliseconds_count)

  - record: "workload:istio_request_duration_milliseconds_sum"
    expr: |
      sum without(instance, namespace, pod) (istio_request_duration_milliseconds_sum)

  - record: "workload:istio_request_duration_milliseconds_bucket"
    expr: |
      sum without(instance, namespace, pod) (istio_request_duration_milliseconds_bucket)

  - record: "workload:istio_request_bytes_count"
    expr: |
      sum without(instance, namespace, pod) (istio_request_bytes_count)

  - record: "workload:istio_request_bytes_sum"
    expr: |
      sum without(instance, namespace, pod) (istio_request_bytes_sum)

  - record: "workload:istio_request_bytes_bucket"
    expr: |
      sum without(instance, namespace, pod) (istio_request_bytes_bucket)

  - record: "workload:istio_response_bytes_count"
    expr: |
      sum without(instance, namespace, pod) (istio_response_bytes_count)

  - record: "workload:istio_response_bytes_sum"
    expr: |
      sum without(instance, namespace, pod) (istio_response_bytes_sum)

  - record: "workload:istio_response_bytes_bucket"
    expr: |
      sum without(instance, namespace, pod) (istio_response_bytes_bucket)

  - record: "workload:istio_tcp_connections_opened_total"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_connections_opened_total)

  - record: "workload:istio_tcp_connections_closed_total"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_connections_closed_total)

  - record: "workload:istio_tcp_sent_bytes_total_count"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_sent_bytes_total_count)

  - record: "workload:istio_tcp_sent_bytes_total_sum"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_sent_bytes_total_sum)

  - record: "workload:istio_tcp_sent_bytes_total_bucket"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_sent_bytes_total_bucket)

  - record: "workload:istio_tcp_received_bytes_total_count"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_received_bytes_total_count)

  - record: "workload:istio_tcp_received_bytes_total_sum"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_received_bytes_total_sum)

  - record: "workload:istio_tcp_received_bytes_total_bucket"
    expr: |
      sum without(instance, namespace, pod) (istio_tcp_received_bytes_total_bucket)
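Every rule above has the same shape: sum without(instance, namespace, pod) drops the three per-pod labels and adds up all series that become identical once those labels are gone. The Python sketch below mimics that aggregation on in-memory data to show the effect. The helper name sum_without and the per-pod series (addresses, pod names, values) are assumptions invented for the example, and this is not how Prometheus implements the operator.

```python
from collections import defaultdict


def sum_without(series, drop):
    """Sum series values after removing the labels listed in `drop`."""
    grouped = defaultdict(float)
    for labels, value in series:
        # Series that share all remaining labels fall into the same group.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        grouped[key] += value
    return [(dict(key), value) for key, value in grouped.items()]


# Hypothetical per-pod series for one workload; all values are invented.
series = [
    ({"destination_workload": "details-v1", "response_code": "200",
      "instance": "10.0.0.1:15020", "namespace": "default",
      "pod": "details-v1-a"}, 120),
    ({"destination_workload": "details-v1", "response_code": "200",
      "instance": "10.0.0.2:15020", "namespace": "default",
      "pod": "details-v1-b"}, 94),
    ({"destination_workload": "details-v1", "response_code": "503",
      "instance": "10.0.0.1:15020", "namespace": "default",
      "pod": "details-v1-a"}, 3),
]

aggregated = sum_without(series, drop={"instance", "namespace", "pod"})
for labels, value in aggregated:
    print(labels, value)
```

Three per-pod series collapse into two workload-level series, one per response code. Queries against the recorded workload: metrics then touch far fewer series than queries against the raw ones.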

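How it works

Exposing metrics

By default, Istio tells Prometheus which port and path to scrape by adding annotations to the control plane and data plane Pods.

For example, the control plane istiod component: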

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/port: "15014"
    prometheus.io/scrape: "true"
    sidecar.istio.io/inject: "false"
  labels:
    app: istiod

istiod exposes metrics on port 15014.

For istio-ingressgateway:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15090"
    prometheus.io/scrape: "true"
    sidecar.istio.io/inject: "false"
  labels:
    app: istio-ingressgateway

istio-ingressgateway exposes metrics on port 15090.

For data plane proxies:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true" 
  labels:
    app: helloworld

The proxy exposes metrics on port 15020.

Because both the ingressgateway and the sidecar proxy are built on Envoy, their metrics path is /stats/prometheus in both cases.
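The three annotations can be read as a small contract: prometheus.io/scrape opts the Pod in, prometheus.io/port names the port, and prometheus.io/path names the path, with Prometheus falling back to its default /metrics when no path is given (as in the istiod example). The Python sketch below turns a Pod's annotations into the URL that would be scraped. The function name scrape_url and the Pod IPs are hypothetical, and the sketch assumes a port annotation is always present.

```python
def scrape_url(pod_ip, annotations):
    """Return the URL to scrape for a Pod, or None if the Pod opts out."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None  # Assumption: this sketch requires an explicit port.
    # Prometheus scrapes /metrics unless a path is configured.
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


# Annotations copied from the Pod examples above; the IPs are invented.
istiod = {
    "prometheus.io/port": "15014",
    "prometheus.io/scrape": "true",
    "sidecar.istio.io/inject": "false",
}
sidecar = {
    "prometheus.io/path": "/stats/prometheus",
    "prometheus.io/port": "15020",
    "prometheus.io/scrape": "true",
}

print(scrape_url("10.0.0.10", istiod))
print(scrape_url("10.0.0.20", sidecar))
```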

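Prometheus configuration

All that remains is to configure Prometheus to collect this data. This is simply the standard Prometheus Kubernetes service discovery feature, specifically automatic discovery of Pods.

The scrape configuration looks like this: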

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Keep only Pods annotated with prometheus.io/scrape: "true".
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  # Use the prometheus.io/path annotation as the metrics path.
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  # Point the scrape address at the port from prometheus.io/port.
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  # Copy Pod labels onto the scraped series.
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  # Record the namespace and Pod name as labels.
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
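The address-rewriting rule is the least obvious part of this configuration. Prometheus joins the values of the listed source labels with a semicolon, matches the whole joined string against the regex, and on a match writes the expanded replacement into the target label; if the regex does not match, a replace action leaves the target unchanged. The Python sketch below reproduces that one step, with the regex written using \d digit classes. The function name rewrite_address and the example addresses are assumptions for illustration.

```python
import re

# Host, an optional existing ":port", then ";" and the annotated port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the replace rule that points __address__ at the annotated port."""
    # Source label values are joined with ";" before matching.
    joined = f"{address};{port_annotation}"
    # Relabel regexes are anchored, so the whole string must match.
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address  # No match: the target label is left as it was.
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8080", "15020"))
print(rewrite_address("10.0.0.5", "15090"))
print(rewrite_address("10.0.0.5:8080", ""))
```

The last call models a Pod without a prometheus.io/port annotation: the joined string ends in a bare semicolon, the regex does not match, and the discovered address is kept. Note that this pattern assumes IPv4 addresses or hostnames, since it splits on the first colon.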

At this point the metrics have been collected into our monitoring system; the next step is visualization.