Many people know that nginx can act as a reverse proxy and load balancer, but far fewer understand its health-check (health_check) mechanism. The health checking offered by community nginx is actually quite weak: it amounts to setting max_fails and fail_timeout on the servers in an upstream block. This article digs into how that community-edition mechanism behaves. There are better options, of course: the commercial nginx plus and Alibaba's Tengine both ship with far more complete and efficient health checking, and if you insist on community nginx, you can always write your own module or compile in a third-party one.
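To give a taste of the third-party route, here is a minimal sketch of active checking with Tengine's upstream check module. The directive names follow that module's documentation, while the interval, thresholds, and probe request below are illustrative assumptions, not values from this article's test setup:

upstream backend {
    server localhost:9090;
    server localhost:9191;
    # probe each server every 3000 ms with an HTTP check; 5 consecutive
    # failures mark it down, 2 consecutive successes bring it back
    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_http_send "HEAD / HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}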
First, my test environment: CentOS release 6.4 (Final) + nginx 1.6.0, with two Tomcat 8.0.15 instances as backend servers. (Disclaimer: all configuration below is for testing only and does not reflect a real production setup; a production environment needs considerably more configuration and tuning.)
The nginx configuration is as follows:
#user  nobody;
worker_processes  1;

#pid        logs/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  logs/access.log  main;

    sendfile        on;
    keepalive_timeout  65;

    upstream backend {
        server localhost:9090 max_fails=1 fail_timeout=40s;
        server localhost:9191 max_fails=1 fail_timeout=40s;
    }

    server {
        listen       80;
        server_name  localhost;

        location / {
            proxy_pass http://backend;
            proxy_connect_timeout 1;
            proxy_read_timeout 1;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}
I won't go over the basic nginx and Tomcat configuration here; see the official documentation.
As you can see, I configured two servers in the upstream block, each with max_fails and fail_timeout set.
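To spell out what those two parameters mean, here is a commented restatement of the upstream block above; the semantics come from the documentation of nginx's ngx_http_upstream_module:

upstream backend {
    # max_fails=1: a single failed attempt within fail_timeout is enough
    # to mark the server unavailable
    # fail_timeout=40s: the window in which failures are counted, and
    # also how long the server is then considered unavailable before
    # nginx tries it again
    server localhost:9090 max_fails=1 fail_timeout=40s;
    server localhost:9191 max_fails=1 fail_timeout=40s;
}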
Now start nginx, then start the two backend servers, with a deliberate 10-minute sleep placed in the Tomcat Listener. In other words, Tomcat takes about 10 minutes to start: the port is open, but no requests are being served. Then access http://localhost/response/ (response is a simple servlet interface I wrote in Tomcat: the server on port 9090 answers a request with 9090, and the server on port 9191 answers with 9191) and observe how nginx behaves.
Let's check the nginx logs.

access.log:
192.168.42.254 - - [29/Dec/2014:11:24:23 +0800] "GET /response/ HTTP/1.1" 504 537 720 380 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 2.004 host:health.iflytek.com
192.168.42.254 - - [29/Dec/2014:11:24:24 +0800] "GET /favicon.ico HTTP/1.1" 502 537 715 311 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.000 host:health.iflytek.com
error.log:

2014/12/29 11:24:22 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9090/response/", host: "health.iflytek.com"
2014/12/29 11:24:23 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9191/response/", host: "health.iflytek.com"
2014/12/29 11:24:24 [error] 6318#0: *4785892017 no live upstreams while connecting to upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /favicon.ico HTTP/1.1", upstream: "http://health/favicon.ico", host: "health.iflytek.com"
(Why sleep for 10 minutes in the listener? Because our service needs to warm its caches at startup, so those 10 minutes simulate a 10-minute window during startup in which the server is unavailable.)
Watching the logs, we see that while the two Tomcats are starting, a single request makes nginx automatically retry every backend server in turn and finally report a "no live upstreams while connecting to upstream" error. This is, in a sense, one way nginx performs a health check. One thing deserves special emphasis here: we set proxy_read_timeout to 1 second. More on this parameter later; it matters a lot.
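That retry behavior comes from proxy_next_upstream, which by default passes a failed request on to the next server on connection errors and timeouts. Combined with the aggressive 1-second timeouts, a backend that is still warming up is treated as failed almost immediately. A commented sketch of the relevant location block; the proxy_next_upstream line just makes the documented default explicit:

location / {
    proxy_pass http://backend;
    # with 1s timeouts, a backend that is still warming up fails fast
    proxy_connect_timeout 1;
    proxy_read_timeout 1;
    # default value made explicit: on an error or a timeout, try the
    # next server in the upstream; once every server has failed within
    # its fail_timeout window, nginx reports "no live upstreams"
    proxy_next_upstream error timeout;
}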
Now wait 40 seconds and let the 9090 server finish starting, while 9191 is still starting up, and observe what nginx logs.
access.log:
192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.210 host:health.iflytek.com
192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /favicon.ico HTTP/1.1" 404 453 674 311 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.212 host:health.iflytek.com
No errors are printed.
The browser shows 9090, so nginx is passing requests through normally.
Let's send one more request.
192.168.42.254 - - [29/Dec/2014:13:43:13 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 1.005 host:health.iflytek.com
The request returns normally, again with 9090.
error.log:
2014/12/29 13:43:13 [error] 6323#0: *4801368618 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9191/response/", host: "health.iflytek.com"
We find that one "upstream timed out" line has been added to nginx's error.log, yet the client still got a normal response. The upstream defaults to round-robin load balancing, so this request was first forwarded to the 9191 machine; because 9191 is still starting, that attempt failed, and nginx then retried the request against the 9090 machine.
OK, but what does fail_timeout=40s actually mean? Shall we reproduce it to see why this parameter matters? Let's go! For now, just sit tight and wait for the 9191 machine to finish starting, then send a few more requests. And then, look: 9191 starts answering with 9191! fail_timeout=40s means that once a request finds 9191 unable to respond normally, that server is considered unavailable for 40 seconds; but as soon as the 40 seconds are up, requests are forwarded to it again, whether or not it has actually recovered. You can see how weak the community edition's health_check mechanism really is: it is nothing more than a temporary blacklist, repeated over and over.

If you have used nginx plus, you will find that its health_check mechanism is far more powerful. A few keywords to look up yourself: zone, slow_start, health_check, match. slow_start in particular neatly solves the cache warm-up problem: when nginx sees that a machine has come back, it ramps traffic to that server up gradually over the configured slow_start period instead of hitting it at full weight immediately, which gives the cache time to warm. A sketch follows.
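A rough sketch of what that looks like, based on the nginx plus documentation. The directives zone, slow_start, health_check, and match are real nginx plus features (active health checks require a shared-memory zone), but the intervals, thresholds, and match condition below are illustrative assumptions for this article's scenario:

http {
    upstream backend {
        # shared memory zone so all worker processes agree on server state
        zone backend 64k;
        # ramp a recovered server's weight back up over 10 minutes,
        # covering the cache warm-up window simulated above
        server localhost:9090 slow_start=10m;
        server localhost:9191 slow_start=10m;
    }

    # a probe only counts as healthy if it returns HTTP 200
    match server_ok {
        status 200;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://backend;
            # actively probe /response/ every 5s; 2 failures mark the
            # server down, 2 passes bring it back
            health_check uri=/response/ interval=5s fails=2 passes=2 match=server_ok;
        }
    }
}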