In a Python web application deployed on Kubernetes and served by Gunicorn, uploading large files triggered a series of errors. This post records how the problems were tracked down and solved.
File upload flow
The first error was 413 Request Entity Too Large: the upload was interrupted partway through (at roughly the same percentage each time) and the request returned 413. The obvious suspect was Nginx's limit on the request body size. The Nginx documentation shows that the client_max_body_size directive controls this limit and that its default is 1 MB.
client_max_body_size: Sets the maximum allowed size of the client request body, specified in the "Content-Length" request header field. If the size in a request exceeds the configured value, the 413 (Request Entity Too Large) error is returned to the client. Please be aware that browsers cannot correctly display this error. Setting size to 0 disables checking of client request body size.
First, add the following directive to the http block of the Nginx running on the Kubernetes host.
client_max_body_size 1024m;
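For orientation, a minimal sketch of where this directive can sit in the host's nginx.conf; putting it in the http block applies it to every server, and the server_name and proxy_pass target below are placeholders rather than values from the original setup.

http {
    client_max_body_size 1024m;          # allow request bodies up to 1 GiB

    server {
        listen 80;
        server_name example.com;          # placeholder

        location / {
            proxy_pass http://127.0.0.1:8000;   # placeholder upstream
        }
    }
}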
Note that besides the Nginx running on the Kubernetes host, the Nginx inside the Ingress Controller also has to be changed. For the Ingress Nginx, add the following entry to the Annotations field.
"nginx.ingress.kubernetes.io/proxy-body-size": "1024m"
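For reference, a minimal sketch of an Ingress resource carrying that annotation, assuming the current networking.k8s.io/v1 API; example-ingress, example.com and example-svc are placeholder names.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "1024m"
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-svc
            port:
              number: 80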
Another upload attempt still failed, this time with 504 Gateway Timeout. Inspecting the request in Chrome's developer tools showed that the upload takes at least five minutes, so the next thing to examine was Nginx's timeout handling.
The read and write timeouts were raised in both Nginx and the Ingress: the send timeout was set to 600s and the read timeout to 30s.
proxy_send_timeout 600s;
proxy_read_timeout 30s;
Another attempt produced the same 504 error. Could there be yet another timeout directive that needs to be set? Reading the documentation again revealed the catch.
proxy_send_timeout: Sets a timeout for transmitting a request to the proxied server. The timeout is set only between two successive write operations, not for the transmission of the whole request. If the proxied server does not receive anything within this time, the connection is closed.
proxy_read_timeout: Defines a timeout for reading a response from the proxied server. The timeout is set only between two successive read operations, not for the transmission of the whole response. If the proxied server does not transmit anything within this time, the connection is closed.
The subject of send and read here is not the client but Nginx itself: by the time the error occurs, Nginx has already sent the file to the upstream, and it is while waiting for the upstream to finish processing and respond that the proxy_read_timeout limit is exceeded. So it is the read timeout that has to be increased.
Configure the host Nginx and the Ingress as follows.
proxy_send_timeout 30s;
proxy_read_timeout 600s;
nginx.ingress.kubernetes.io/proxy-send-timeout: 30s
nginx.ingress.kubernetes.io/proxy-read-timeout: 600s
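Put together, the relevant directives for the upload route on the host Nginx might look like the sketch below; the server_name, path and upstream address are placeholders, not the actual values from this deployment.

server {
    listen 80;
    server_name example.com;              # placeholder

    location /upload {
        client_max_body_size 1024m;       # large request bodies allowed
        proxy_send_timeout   30s;         # Nginx -> upstream writes
        proxy_read_timeout   600s;        # waiting for the upstream's slow response
        proxy_pass http://127.0.0.1:8000; # placeholder upstream
    }
}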
With the timeouts and the upload size limit fixed, a new error appeared: 502 Bad Gateway. This time there was no obvious lead. Since the error had changed, the earlier changes had evidently taken effect, and those two limits were not the cause. Searching the Nginx and Ingress logs turned up the following error in the Ingress.
2019/02/27 07:18:36 [error] 4265#4265: *19932411 upstream prematurely closed connection while reading response header from upstream, client: 172.20.0.1, server: example.com, request: "POST /upload HTTP/1.0", upstream: "http://172.0.0.1/upload", host: "example.com", referrer: "http://example.com/"
This was puzzling: the timeouts had just been raised, so why was the Ingress still reporting what looked like a timeout? Judging from the log, it was the Ingress's upstream, i.e. Gunicorn, that timed out. Someone on Stack Overflow had hit a similar problem, and the answer was to pass the -t parameter to Gunicorn. Gunicorn's documentation defines the timeout setting as follows.
timeout: Workers silent for more than this many seconds are killed and restarted. Generally set to thirty seconds. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.
In other words, when a worker handling the upload request fails to report back to the master within the default timeout, it gets killed, which explains why the Ingress saw the connection closed while reading the response header from its upstream. After changing the Gunicorn configuration to set the timeout to 600s and uploading again, the problem was solved.
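As a sketch of the Gunicorn side, the timeout can be raised either on the command line or in a config file; app:app, the worker count and the bind address below are placeholder values, and the config-file variant assumes Gunicorn picks up gunicorn.conf.py (or is pointed at it with -c).

gunicorn --timeout 600 --workers 4 --bind 0.0.0.0:8000 app:app

# or, equivalently, in gunicorn.conf.py:
timeout = 600   # allow a worker up to 10 minutes before the master kills and restarts it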