In an earlier work log, 《【原創】記錄幾個最近遇到的未解問題(resolved)》, I recorded three problems. Two of them were straightforward and could be dispatched in a couple of sentences, but the last one deserves a fuller write-up. This post digs into that problem: "when terminal devices access the backend API servers over HTTP through nginx, the TCP connection behavior is bizarre."
Problem description:
In the production environment, the company's in-house instant messaging software kept dropping offline intermittently.
Troubleshooting:
Operations and test staff discovered and reported a network anomaly in the production environment, specifically: the virtual IP address of the login server could not be pinged, and the instant messaging tool kept dropping offline;
Given that, the on-site staff's first reaction was that we were under external attack (we had run into attacks before), because they saw the following messages:
```
...
Apr 20 18:21:48 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:24:37 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:25:50 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:27:02 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:29:01 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:30:14 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:31:28 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:32:44 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:35:33 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:37:06 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:37:52 localhost ntpd_intres[1732]: host name not found: 0.centos.pool.ntp.org
Apr 20 18:38:12 localhost ntpd_intres[1732]: host name not found: 1.centos.pool.ntp.org
Apr 20 18:38:20 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:38:32 localhost ntpd_intres[1732]: host name not found: 2.centos.pool.ntp.org
Apr 20 18:38:52 localhost ntpd_intres[1732]: host name not found: 3.centos.pool.ntp.org
Apr 20 18:39:29 localhost kernel: possible SYN flooding on port 80. Sending cookies.
Apr 20 18:40:43 localhost kernel: possible SYN flooding on port 80. Sending cookies.
...
```
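These "Sending cookies" lines only appear when tcp_syncookies is enabled and the SYN backlog of some listening socket overflows, so one quick sanity check is to read that switch. A minimal sketch, assuming a standard Linux /proc layout; this checker is my own illustration, not a tool used in the original investigation:

```c
/* Minimal sketch: confirm whether SYN cookies are enabled on this host.
 * The kernel only logs "possible SYN flooding ... Sending cookies." when
 * tcp_syncookies is non-zero and a listen socket's SYN queue overflows. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_syncookies", "r");
    int value;

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%d", &value) == 1)
        printf("tcp_syncookies = %d (%s)\n", value,
               value ? "enabled" : "disabled");
    fclose(f);
    return 0;
}
```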
as well as the following packet capture contents:
Preliminary investigation by the development team ruled out the possibility of an external attack, because they found that the backend …
Putting the above together, the cause of the problem could basically be pinned down as:
The system configuration at the time was:
In the follow-up email discussion, a certain guru gave this conclusion:
"The platform has SYN attack detection enabled, so it decided that port 80 was under a SYN flood attack, and in that state even normal keepalive heartbeats are affected. Yesterday's problem was caused by the SYN attack detection, not by the TCP SYN queue being full."
When this conclusion was first given I raised no objection (I hadn't studied the matter in depth myself~). After digging further, however, I found that it is actually flawed:
Although the conclusion above is somewhat off, it did help push the investigation forward. The actual fix turns out to be fairly simple: in this incident the "principal offender" was clearly the misbehaving terminal, while the untuned kernel parameters and nginx configuration were merely "accomplices". Shoot the principal first and the problem is basically solved; the accomplices can, in theory, get off with probation.
As the investigation went on, further conclusions emerged:
In addition, while reproducing the problem, the following was also captured:
Note that at the very start of the capture it is not only client-initiated SYN packets: the exchange goes SYN -> SYN,ACK -> RST. Only after running for a while does it degenerate into nothing but SYNs being sent, with no responses at all.
This capture was extremely valuable; investigating it yielded the following conclusion:
After sending a SYN, the terminal starts a timer in a separate thread to judge, with a timeout (reportedly 10 seconds), whether the current fd has become writable. Under certain conditions (since the kernel parameters had never been tuned, those "certain conditions" should be easy to hit), the timeout fires, the business layer concludes the connection was never established, and it close()s the socket. Because no TCP connection has actually been established at that point, the client-side stack does not send a FIN. Later, when a SYN,ACK arrives from the server (which has no idea the connection has already been abandoned), the client's TCP stack simply replies with an RST.
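The terminal's source code was never shown, so the following is only a minimal sketch of the pattern described above, assuming a non-blocking connect() whose writability is waited on with a timeout and a close() on expiry; the function name, the TEST-NET address 192.0.2.1, and the 10000 ms figure are all my own illustrative choices.

```c
/* Sketch: non-blocking connect() with an application-level timeout.
 * If the SYN,ACK does not arrive before the timeout, the app closes the
 * socket while it is still in SYN-SENT. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int connect_with_timeout(const char *ip, int port, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(port) };
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return fd;                      /* connected immediately */
    if (errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* Wait for the fd to become writable, i.e. handshake completion. */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (poll(&pfd, 1, timeout_ms) <= 0) {
        close(fd);  /* handshake incomplete: no FIN; a late SYN,ACK -> RST */
        return -1;
    }

    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {
        close(fd);
        return -1;
    }
    return fd;  /* established */
}

int main(void)
{
    /* 192.0.2.1 (TEST-NET) never answers, so this demonstrates the
     * timeout-then-close path after 10 seconds. */
    return connect_with_timeout("192.0.2.1", 80, 10000) < 0;
}
```

Because close() happens while the socket is still in SYN-SENT, there is no established connection to send a FIN for; by the time the server's SYN,ACK arrives the socket is gone, and the client kernel answers with an RST, which is exactly the SYN -> SYN,ACK -> RST sequence seen in the capture.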
At this point the problem was solved, but one last question remained: how was the production incident triggered in the first place? In principle the defect should have been there all along~~
The conclusion, roughly (and inferred), is this: without kernel parameter tuning, the machine running nginx can handle a certain number of concurrent connections, and with a modest request volume and short per-request processing times it serves traffic just fine. Some other services had recently gone through version upgrades, and the suspicion is that the processing time of part of the requests grew, dragging down overall request throughput. On top of that, the nginx status-check script itself probes over the same HTTP interface, which inevitably tightens the squeeze on connection resources; and once a check fails and the virtual address fails over, the problem is aggravated further. All of this combined finally triggered the blow-up described above.
Notes on the relevant parameters:
man 2 listen
```
...
#include <sys/types.h>          /* See NOTES */
#include <sys/socket.h>

int listen(int sockfd, int backlog);
...
The behavior of the backlog argument on TCP sockets changed with Linux 2.2.
Now it specifies the queue length for completely established sockets waiting
to be accepted, instead of the number of incomplete connection requests. The
maximum length of the queue for incomplete sockets can be set using
/proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is
no logical maximum length and this setting is ignored. See tcp(7) for more
information.

If the backlog argument is greater than the value in
/proc/sys/net/core/somaxconn, then it is silently truncated to that value;
the default value in this file is 128. In kernels before 2.4.25, this limit
was a hard coded value, SOMAXCONN, with the value 128.
```
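The "silently truncated" behavior above is easy to miss in practice: an application can request a deep accept queue and end up with somaxconn without any warning. A minimal sketch, with port 8080 and the 4096 backlog as arbitrary illustrative values of mine:

```c
/* Sketch: listen() silently caps backlog at net.core.somaxconn.
 * Asking for 4096 on a box with the old default still yields an
 * effective accept queue of 128 unless somaxconn is raised. */
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* The kernel clamps this to /proc/sys/net/core/somaxconn without
     * reporting an error. */
    if (listen(fd, 4096) == 0)
        printf("listening; effective backlog = min(4096, somaxconn)\n");

    close(fd);
    return 0;
}
```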
man 7 tcp
```
tcp_abort_on_overflow (Boolean; default: disabled; since Linux 2.4)
    Enable resetting connections if the listening service is too slow and
    unable to keep up and accept them. It means that if overflow occurred
    due to a burst, the connection will recover. Enable this option only
    if you are really sure that the listening daemon cannot be tuned to
    accept connections faster. Enabling this option can harm the clients
    of your server.
...
tcp_max_syn_backlog (integer; default: see below; since Linux 2.2)
    The maximum number of queued connection requests which have still not
    received an acknowledgement from the connecting client. If this number
    is exceeded, the kernel will begin dropping requests. The default
    value of 256 is increased to 1024 when the memory present in the
    system is adequate or greater (>= 128Mb), and reduced to 128 for those
    systems with very low memory (<= 32Mb). It is recommended that if this
    needs to be increased above 1024, TCP_SYNQ_HSIZE in include/net/tcp.h
    be modified to keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the
    kernel be recompiled.
...
tcp_synack_retries (integer; default: 5; since Linux 2.2)
    The maximum number of times a SYN/ACK segment for a passive TCP
    connection will be retransmitted. This number should not be higher
    than 255.

tcp_syncookies (Boolean; since Linux 2.2)
    Enable TCP syncookies. The kernel must be compiled with
    CONFIG_SYN_COOKIES. Send out syncookies when the syn backlog queue of
    a socket overflows. The syncookies feature attempts to protect a
    socket from a SYN flood attack. This should be used as a last resort,
    if at all. This is a violation of the TCP protocol, and conflicts with
    other areas of TCP such as TCP extensions. It can cause problems for
    clients and relays. It is not recommended as a tuning mechanism for
    heavily loaded servers to help with overloaded or misconfigured
    conditions. For recommended alternatives see tcp_max_syn_backlog,
    tcp_synack_retries, and tcp_abort_on_overflow.
...
```
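All of the tunables above (plus somaxconn from listen(2)) can be read back from /proc for a quick audit of a box like the one in this incident. Bundling them into one checker is my own convenience sketch, not something from the original post:

```c
/* Sketch: dump the current values of the tunables discussed above so they
 * can be compared against the defaults quoted from tcp(7) and listen(2). */
#include <stdio.h>

static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f && fgets(buf, sizeof(buf), f))
        printf("%-45s = %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/sys/net/ipv4/tcp_abort_on_overflow");
    show("/proc/sys/net/ipv4/tcp_max_syn_backlog");
    show("/proc/sys/net/ipv4/tcp_synack_retries");
    show("/proc/sys/net/ipv4/tcp_syncookies");
    show("/proc/sys/net/core/somaxconn");
    return 0;
}
```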