nginx 502 和 504 超時演示

時間 2019-11-07

標籤 nginx 超時演示欄目 Nginx 简体版

原文原文鏈接

最近線上 nginx 遇到了一些較難排查的 502 和 504 錯誤，順便了解了一下 nginx 的相關配置。我發現網上不少介紹 nginx 超時配置只是列了這幾個配置的含義和數值，並無解釋什麼緣由會觸發哪一個配置。所以趁這個機會演示一下，如何讓 nginx 符合預期正確出現 502 和 504。html

502 和 504 的解釋

在 http status 的定義中：nginx

502 Bad Gateway: The server was acting as a gateway or proxy and received an invalid response from the upstream server.
504: he server was acting as a gateway or proxy and did not receive a timely response from the upstream server.

502 的錯誤緣由是 Bad Gateway，通常是因爲上游服務的故障引發的；而 504 則是 nginx 訪問上游服務超時，兩者徹底是兩個意思。但在某些狀況下，上游服務的超時（觸發 tcp reset）也可能引起 502，咱們會在以後詳述。git

演示環境

你須要 3 個邏輯組件：nginx 服務器，php-fpm，client 訪問客戶端。3 個組件能夠在同一臺機器中，我用的是 docker 來配置 PHP 和 nginx 環境，在宿主機上訪問。若是你很熟悉這 3 個組件，這部分能夠跳過。用 docker 來作各類測試和實驗很是方便，這裏就不展開了。docker-compose 的配置參考了這篇文章。個人 docker composer 文件以下：github

version: '3'
services:
 web:
 image: nginx:alpine
 ports:
 - "8080:80"
 volumes:
 - ./code:/code
 - ./nginx/site.conf:/etc/nginx/conf.d/site.conf
 depends_on:
 - php
 php:
 image: php:7.1-fpm-alpine
 volumes:
 - ./code:/code
 - ./php/php-fpm.conf:/usr/local/etc/php-fpm.conf
複製代碼

使用的鏡像都是基於 alpine 製做的，很是小巧：web

REPOSITORY  TAG               SIZE
php         7.1-fpm-alpin     69.5MB
nginx       alpine            18.6MB
複製代碼

nginx 的配置：docker

server {
  index index.php index.html;
  server_name php-docker.local;
  error_log  /var/log/nginx/error.log;
  access_log /var/log/nginx/access.log;
  root /code;

  location ~ \.php$ {
    try_files $uri =404;
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass 127.0.0.1:9000;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param PATH_INFO $fastcgi_path_info;
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout 8s;
    fastcgi_send_timeout 10s;
  }
}
複製代碼

php-fpm 的配置shell

[global]
include=etc/php-fpm.d/*.conf
request_terminate_timeout=3s
複製代碼

代碼放在 github。bash

關鍵參數

在這個演示中，PHP 的關鍵參數有兩個，一個是 PHP 腳本的 max_execution_time，這個配置在php.ini中；另外一個是 php-fpm 的 request_terminate_timeout，在php-fpm.conf中。當以 php-fpm 提供服務時，request_terminate_timeout 設置會覆蓋 max_execution_time 的設置，所以咱們這裏只測試 request_terminate_timeout。服務器

request_terminate_timeout 的意思是 php-fpm 接受的請求的超時時間，超過這個時間 php-fpm 會 kill 掉執行腳本的 worker 進程。

nginx的關鍵參數是 fastcgi 相關的 timeout，即：fastcgi_connect_timeout，fastcgi_read_timeout，fastcgi_send_timeout。

這幾個 nginx 參數的主語都是 nginx，因此 fastcgi_connect_timeout 的意思是 nginx 鏈接到 fastcgi 的超時時間，fastcgi_read_timeout 是 nginx 讀取 fastcgi 的內容的超時時間，fastcgi_send_timeout 是 nginx 發送內容到 fastcgi 的超時時間。

演示過程

首先啓動 nginx 和 PHP：

docker-compose up
複製代碼

在 code 文件夾下添加一個 index.php 文件：

<?php
sleep(70);
echo 'hello world';
複製代碼

上游服務主動 reset

訪問 php-docker.local:8080/index.php，報錯 502 bad gateway。並且是在 3s 以後報的錯，說明觸發了 request_terminate_timeout 設置，php-fpm 關閉了鏈接。

經過觀察 ps aux | grep php 能夠發現，php-fpm 是經過殺掉超時的進程來解決進程超時問題的（pid 每次有一個會變化，說明一個進程殺掉了，並啓動了另外一個進程。這和 php-fpm 的進程池設定有關，你的設定未必會從新啓動一個新的進程）。

/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
    6 www-data   0:00 php-fpm: pool www
    7 www-data   0:00 php-fpm: pool www
/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
    7 www-data   0:00 php-fpm: pool www
   17 www-data   0:00 php-fpm: pool www
/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
   17 www-data   0:00 php-fpm: pool www
   20 www-data   0:00 php-fpm: pool www
複製代碼

在這種狀況下，nginx 日誌中的錯誤是：

recv() failed (104: Connection reset by peer) while reading response header from upstream
複製代碼

即鏈接被服務端（PHP）reset 了，也就很好理解了。

注意，在這種狀況下，php-fpm 的日誌中也會記錄的：

php_1  | [18-Jul-2018 16:33:42] WARNING: [pool www] child 5, script '/code/index.php' (request: "GET /index.php") execution timed out (3.040130 sec), terminating
php_1  | [18-Jul-2018 16:33:42] WARNING: [pool www] child 5 exited on signal 15 (SIGTERM) after 30.035736 seconds from start
php_1  | [18-Jul-2018 16:33:42] NOTICE: [pool www] child 8 started
複製代碼

這也是能夠發現問題的一個地方。

nginx 讀取上游服務超時

刪掉 request_terminate_timeout 配置，重啓應用：

docker-compose down && docker-compose up
複製代碼

此時，PHP 腳本將要執行 70s，確定超過 nginx 設置的超時時間，get 一下發現確實如此，8s 以後拋出 504 Gateway Time-out 錯誤，nginx 日誌是：

upstream timed out (110: Operation timed out) while reading response header from upstream
複製代碼

說明觸發了 fastcgi_read_timeout 設置。

關閉上游服務

關掉 PHP 服務：

docker-composer stop php
複製代碼

PHP 服務停掉以後第一次訪問，獲得 504 錯誤，錯誤是：

upstream timed out (110: Operation timed out) while connecting to upstream
複製代碼

超時時間爲 fastcgi_connect_timeout 的設置。說明這個時候 tcp 鏈接還在，可是嘗試鏈接的時候失敗了。

再次訪問，獲得 502 錯誤，錯誤是：

connect() failed (113: Host is unreachable) while connecting to upstream
複製代碼

502 的緣由很容易理解，上游服務掛了，同時由於以前訪問的時候發現鏈接不上就把鏈接斷掉了，再次鏈接的時候便沒法找到 host 了。

我曾懷疑第一次訪問 504 是因爲 keepalive。但我停掉 PHP 以後隔了很久才發第一個請求，仍然是這個結果。

若是將 nginx fastcgi_pass 配置爲 127.0.0.1:9000（本地沒有這個端口），則立刻就會拋出 502 錯誤，錯誤爲：

connect() failed (111: Connection refused) while connecting to upstream
複製代碼

登入 nginx 服務，使用 tcpdump 監控 9000 上的通訊：

tcpdump -i eth0 -nnA tcp port 9000
# 若是你的 PHP 在本地，eth0 應該改爲 lo
複製代碼

咱們發現，當 PHP 關閉以後第一次訪問，nginx 會嘗試向 PHP 發起若干次 TCP SYN 請求，但 PHP 顯然不會響應，這個時候 nginx 就返回了 504。第二次訪問的時候 nginx 根本不會發起任何請求，直接 502 了[^2]。若是咱們這個時候執行nginx -t會發現，nginx 已經認爲配置文件有問題了：nginx: configuration file /etc/nginx/nginx.conf test failed。

換一種配置

這篇文章提到，咱們以前的 nginx 配置並不合理[^1]，咱們從新設置 nginx：

server {
  index index.php index.html;
  server_name php-docker.local;
  error_log  /var/log/nginx/error.log;
  access_log /var/log/nginx/access.log;
  root /code;
  resolver 127.0.0.11;  # here
  location ~ \.php$ {
    set $upstream php:9000; # here
    try_files $uri =404;
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass $upstream;  # here
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param PATH_INFO $fastcgi_path_info;
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout 8s;
    fastcgi_send_timeout 10s;
  }
}
複製代碼

其中 127.0.0.11 是 docker 的內網 dns resolver。該配置動態指定 fastcgi pass，因此 nginx 不會檢查該鏈接可否創建起來。

按照這個配置啓動，先訪問 index.php 創建鏈接，而後關閉 PHP，表現爲：

在 keepalive 期間，拋出 504 錯誤，超時時間爲 fastcgi_connect_timeout，錯誤是：

upstream timed out (110: Operation timed out) while connecting to upstream
複製代碼

keepalive 斷線以後，拋出 502 錯誤，超時時間不定，錯誤是：

connect() failed (113: Host is unreachable) while connecting to upstream
複製代碼

按照這篇文章所說，這種配置 nginx 不會認爲有問題，執行nginx -t確實如此。在 一段時間 內，每次請求 nginx 都會向 upstream 發送 SYN，這段時間的狀態碼都是 504，以後再訪問就再也不發 TCP 包，狀態碼也變成 502。

其餘

除此以外，PHP 腳本還有一個超時時間的設置：max_execution_time。它是限制 PHP 腳本的執行時間，但這個時間不會計算系統調用（好比 sleep，io，等）。由於該緣由致使 PHP 殺掉進程時，會拋出 fatal error，而 php-fpm 不會有 fatal error。

這裏實驗使用的是 PHP 的 fastcgi 工做方式，若是是 nginx 經過代理的方式鏈接上游服務的話，fastcgi_connect_timeout，fastcgi_read_timeout，fastcgi_send_timeout 都須要替換成對應的 proxy_connect_timeout，proxy_read_timeout，proxy_send_timeout。

結論

504 的緣由比較簡單，通常都是上游服務的執行時間超過了 nginx 的等待時間，這種狀況是因爲上游服務的業務太過耗時致使的，或者鏈接到上游服務器超時。從上面的實驗來看，後者的緣由比較難以追蹤，由於這種狀況下鏈接是存在的，可是卻連不上，好在這種 504 通常都會在一段時間後轉爲 502。

502 的緣由是因爲上游服務器的故障，好比停機，進程被殺死，上游服務 reset 了鏈接，進程僵死等各類緣由。在 nginx 的日誌中咱們可以發現 502 錯誤的具體緣由，分別爲：104: Connection reset by peer，113: Host is unreachable，111: Connection refused。

有一些細節上的差異和 nginx 的工做原理有關，這部分還沒有深挖。

[^1]: 這篇文章代表，咱們以前的設置中，若是 PHP 沒有先啓動起來，那麼 nginx 也是啓動不起來的，這種設置並不合理：nginx 的一臺上游服務有問題，結果 nginx 就沒法提供服務了。但這和咱們的演示關係不大，所以並無在正文中過多描述。

[^2]: 按理說，既然 nginx 已經知道 PHP 不可達，不去發 TCP 請求了，那麼應該當即 502 纔是。實驗中發現，這種狀況下的 502 有 3s 左右的延時，不知何故。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。