How TCP backlog works in Linux

January 1, 2014 (updated March 14, 2015)

When an application puts a socket into LISTEN state using the listen syscall, it needs to specify a backlog for that socket. The backlog is usually described as the limit for the queue of incoming connections.

Because of the 3-way handshake used by TCP, an incoming connection goes through an intermediate state SYN RECEIVED before it reaches the ESTABLISHED state and can be returned by the accept syscall to the application (see the part of the TCP state diagram reproduced above). This means that a TCP/IP stack has two options to implement the backlog queue for a socket in LISTEN state:

Option 1: the implementation uses a single queue, the size of which is determined by the backlog argument of the listen syscall. When a SYN packet is received, it sends back a SYN/ACK packet and adds the connection to the queue. When the corresponding ACK is received, the connection changes its state to ESTABLISHED and becomes eligible for handover to the application. This means that the queue can contain connections in two different states: SYN RECEIVED and ESTABLISHED. Only connections in the latter state can be returned to the application by the accept syscall.

Option 2: the implementation uses two queues, a SYN queue (or incomplete connection queue) and an accept queue (or complete connection queue). Connections in state SYN RECEIVED are added to the SYN queue and later moved to the accept queue when their state changes to ESTABLISHED, i.e. when the ACK packet in the 3-way handshake is received. As the name implies, the accept call is then implemented simply to consume connections from the accept queue. In this case, the backlog argument of the listen syscall determines the size of the accept queue.

Historically, BSD derived TCP implementations use the first approach. That choice implies that when the maximum backlog is reached, the system will no longer send back SYN/ACK packets in response to SYN packets. Usually the TCP implementation will simply drop the SYN packet (instead of responding with a RST packet) so that the client will retry. This is what is described in section 14.5, listen Backlog Queue in W. Richard Stevens’ classic textbook TCP/IP Illustrated, Volume 3.

Note that Stevens actually explains that the BSD implementation does use two separate queues, but they behave as a single queue with a fixed maximum size determined by (but not necessarily exactly equal to) the backlog argument, i.e. BSD logically behaves as described in option 1:


The queue limit applies to the sum of […] the number of entries on the incomplete connection queue […] and […] the number of entries on the completed connection queue […].

On Linux, things are different, as mentioned in the man page of the listen syscall:

The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.

This means that current Linux versions use the second option with two distinct queues: a SYN queue with a size specified by a system wide setting and an accept queue with a size specified by the application.

The interesting question is now how such an implementation behaves if the accept queue is full and a connection needs to be moved from the SYN queue to the accept queue, i.e. when the ACK packet of the 3-way handshake is received. This case is handled by the tcp_check_req function in net/ipv4/tcp_minisocks.c. The relevant code reads:

child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
if (child == NULL)
        goto listen_overflow;

For IPv4, the first line of code will actually call tcp_v4_syn_recv_sock in net/ipv4/tcp_ipv4.c, which contains the following code:

if (sk_acceptq_is_full(sk))
        goto exit_overflow;
...
exit_overflow:
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
exit:
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
        dst_release(dst);
        return NULL;
...

We see here the check for the accept queue. The code after the exit_overflow label will perform some cleanup, update the ListenOverflows and ListenDrops statistics in /proc/net/netstat and then return NULL. This will trigger the execution of the listen_overflow code in tcp_check_req:

listen_overflow:
        if (!sysctl_tcp_abort_on_overflow) {
                inet_rsk(req)->acked = 1;
                return NULL;
        }

This means that unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case the code right after the code shown above will send a RST packet), the implementation basically does… nothing!

To summarize, if the TCP implementation in Linux receives the ACK packet of the 3-way handshake and the accept queue is full, it will basically ignore that packet. At first, this sounds strange, but remember that there is a timer associated with the SYN RECEIVED state: if the ACK packet is not received (or if it is ignored, as in the case considered here), then the TCP implementation will resend the SYN/ACK packet (with a certain number of retries specified by /proc/sys/net/ipv4/tcp_synack_retries and using an exponential backoff algorithm).

This can be seen in the following packet trace for a client attempting to connect (and send data) to a socket that has reached its maximum backlog:

  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 53302 > 9999 [SYN] Seq=0 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 66 53302 > 9999 [ACK] Seq=1 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 71 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.207  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.623  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  1.199  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  1.199  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 6#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  1.455  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.123  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.399  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  3.399  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 10#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  6.459  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  7.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  7.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 13#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 13.131  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 15.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 15.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 16#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 26.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 31.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 31.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 19#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 53.179  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 54 9999 > 53302 [RST] Seq=1 Len=0

Since the TCP implementation on the client side gets multiple SYN/ACK packets, it will assume that the ACK packet was lost and resend it (see the lines with TCP Dup ACK in the above trace). If the application on the server side reduces the backlog (i.e. consumes an entry from the accept queue) before the maximum number of SYN/ACK retries has been reached, then the TCP implementation will eventually process one of the duplicate ACKs, transition the state of the connection from SYN RECEIVED to ESTABLISHED and add it to the accept queue. Otherwise, the client will eventually get a RST packet (as in the sample shown above).

The packet trace also shows another interesting aspect of this behavior. From the point of view of the client, the connection will be in state ESTABLISHED after reception of the first SYN/ACK. If it sends data (without waiting for data from the server first), then that data will be retransmitted as well. Fortunately TCP slow-start should limit the number of segments sent during this phase.

On the other hand, if the client first waits for data from the server and the server never reduces the backlog, then the end result is that on the client side, the connection is in state ESTABLISHED, while on the server side, the connection is considered CLOSED. This means that we end up with a half-open connection!

There is one other aspect that we didn’t discuss yet. The quote from the listen man page suggests that every SYN packet would result in the addition of a connection to the SYN queue (unless that queue is full). That is not exactly how things work. The reason is the following code in the tcp_v4_conn_request function (which does the processing of SYN packets) in net/ipv4/tcp_ipv4.c:

/* Accept backlog is full. If we have already queued enough
 * of warm entries in syn queue, drop request. It is better than
 * clogging syn queue with openreqs with exponentially increasing
 * timeout.
 */
if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        goto drop;
}
...
drop:
        return 0;
...

What this means is that if the accept queue is full, then the kernel will impose a limit on the rate at which SYN packets are accepted. If too many SYN packets are received, some of them will be dropped. In this case, it is up to the client to retry sending the SYN packet and we end up with the same behavior as in BSD derived implementations.

To conclude, let’s try to see why the design choice made by Linux would be superior to the traditional BSD implementation. Stevens makes the following interesting point:

The backlog can be reached if the completed connection queue fills (i.e., the server process or the server host is so busy that the process cannot call accept fast enough to take the completed entries off the queue) or if the incomplete connection queue fills. The latter is the problem that HTTP servers face, when the round-trip time between the client and server is long, compared to the arrival rate of new connection requests, because a new SYN occupies an entry on this queue for one round-trip time. […]

The completed connection queue is almost always empty because when an entry is placed on this queue, the server’s call to accept returns, and the server takes the completed connection off the queue.

The solution suggested by Stevens is simply to increase the backlog. The problem with this is that it assumes that an application is expected to tune the backlog not only taking into account how it intends to process newly established incoming connections, but also as a function of traffic characteristics such as the round-trip time. The implementation in Linux effectively separates these two concerns: the application is only responsible for tuning the backlog such that it can call accept fast enough to avoid filling the accept queue (note that the backlog value passed to listen is additionally capped by /proc/sys/net/core/somaxconn); a system administrator can then tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on traffic characteristics.
