[Translation] TCP Implementation in Linux

TCP Implementation in Linux: A Brief Tutorial

A brief tutorial on how the TCP protocol is implemented in the Linux kernel

Translated by: 內核小王子 (you are welcome to subscribe to the WeChat public account). Original authors: Helali Bhuiyan, Mark McGinley, Tao Li, Malathi Veeraraghavan, University of Virginia

Original article: TCP Implementation in Linux: A Brief Tutorial

A. Introduction

This document provides a brief overview of how TCP is implemented in Linux. It is not meant to be comprehensive, nor do we assert that it is without inaccuracies.

B. TCP implementation in Linux

Figures 1 and 2 show the internals of the TCP implementation in the Linux kernel. Fig. 1 shows the path taken by a new packet from the wire to a user application. The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, it invokes the DMA engine to place the packet into kernel memory via empty sk_buffs stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, the packet data remains in the same kernel memory, avoiding any extra memory copies.
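
A minimal user-space sketch of this idea, not the kernel's actual sk_buff or driver code: a fixed-size receive ring holds packet descriptors, and an arriving frame is dropped when the ring is full. The struct and function names (pkt_buf, rx_ring_receive) are invented stand-ins for illustration.

```c
#include <string.h>

#define RX_RING_SIZE 256              /* descriptors in the ring */

/* Greatly simplified stand-in for the kernel's sk_buff packet descriptor. */
struct pkt_buf {
    unsigned char data[2048];         /* bytes placed here by the "DMA engine" */
    size_t len;
};

/* Simplified rx_ring: a circular queue of packet descriptors. */
struct rx_ring {
    struct pkt_buf slots[RX_RING_SIZE];
    unsigned head;                    /* next slot the driver fills   */
    unsigned tail;                    /* next slot the stack consumes */
};

/* Called when a frame arrives: copy it into an empty descriptor, or drop it. */
int rx_ring_receive(struct rx_ring *ring, const void *frame, size_t len)
{
    unsigned next = (ring->head + 1) % RX_RING_SIZE;
    if (next == ring->tail)
        return -1;                    /* ring full: the incoming packet is dropped */

    struct pkt_buf *buf = &ring->slots[ring->head];
    if (len > sizeof(buf->data))
        len = sizeof(buf->data);
    memcpy(buf->data, frame, len);
    buf->len = len;
    ring->head = next;
    return 0;
}
```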

Once a packet is successfully received, the NIC raises an interrupt to the CPU, which processes each incoming packet and passes it to the IP layer. The IP layer performs its processing on each packet and passes it up to the TCP layer if it is a TCP packet. The TCP process is then scheduled to handle received packets. Each packet in TCP goes through a series of complex processing steps. The TCP state machine is updated, and finally the packet is stored inside the TCP recv buffer.
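
As a rough mental model of that per-packet path (not the kernel's real functions), one can picture the IP layer checking the protocol field of each packet and handing TCP segments upward; the handler names ip_rx, tcp_rx and drop below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>   /* IPPROTO_TCP */

/* Hypothetical handlers standing in for the kernel's IP/TCP input paths. */
void tcp_rx(const uint8_t *segment, size_t len);  /* updates the TCP state machine, fills the recv buffer */
void drop(const uint8_t *pkt, size_t len);

/* Very simplified IP-layer input: look at the protocol byte of an IPv4
 * header (offset 9) and pass TCP segments up to the TCP layer. */
void ip_rx(const uint8_t *pkt, size_t len)
{
    size_t ihl;

    if (len < 20) {                       /* shorter than a minimal IPv4 header */
        drop(pkt, len);
        return;
    }
    ihl = (size_t)(pkt[0] & 0x0F) * 4;    /* IPv4 header length in bytes */
    if (pkt[9] == IPPROTO_TCP && len >= ihl)
        tcp_rx(pkt + ihl, len - ihl);     /* hand the TCP segment to the TCP layer */
    else
        drop(pkt, len);
}
```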

A critical parameter for tuning TCP is the size of the recv buffer at the receiver. The number of packets a TCP sender is able to have outstanding (unacknowledged) is the minimum of the congestion window (cwnd) and the receiver's advertised window (rwnd). The maximum size of the receiver's advertised window is the TCP recv buffer size. Hence, if the size of the recv buffer is smaller than the bandwidth-delay product (BDP) of the end-to-end path, the achievable throughput will be low. On the other hand, a large recv buffer allows a correspondingly large number of packets to remain outstanding, possibly exceeding the number of packets the end-to-end path can sustain. The size of the recv buffer can be set by modifying the /proc/sys/net/ipv4/tcp_rmem variable. It takes three values: min, default, and max. The min value defines the minimum receive buffer size even when the operating system is under hard memory pressure. The default value is the default size of the receive buffer, which is used together with the TCP window scaling factor to calculate the actual advertised window. The max value defines the maximum size of the receive buffer.
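
For a concrete sense of the sizing rule, the small program below compares the max value read from /proc/sys/net/ipv4/tcp_rmem against the bandwidth-delay product of an assumed 1 Gbit/s, 50 ms path; the link numbers are made up purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Example path: 1 Gbit/s bottleneck, 50 ms round-trip time (assumed values). */
    double bandwidth_bps = 1e9;
    double rtt_s = 0.050;
    double bdp_bytes = bandwidth_bps / 8.0 * rtt_s;   /* about 6.25 MB */

    long rmem_min = 0, rmem_default = 0, rmem_max = 0;
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_rmem", "r");
    if (f && fscanf(f, "%ld %ld %ld", &rmem_min, &rmem_default, &rmem_max) == 3) {
        printf("tcp_rmem: min=%ld default=%ld max=%ld\n",
               rmem_min, rmem_default, rmem_max);
        printf("BDP for this path: %.0f bytes\n", bdp_bytes);
        if (rmem_max < bdp_bytes)
            printf("recv buffer max is below the BDP: throughput will be limited\n");
    }
    if (f)
        fclose(f);
    return 0;
}
```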

Also at the receiver, the parameter netdev_max_backlog dictates the maximum number of packets queued at a device while waiting to be processed by the TCP receiving process. If adding a newly received packet to the queue would cause the queue to exceed netdev_max_backlog, the packet is discarded.
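
Conceptually, the backlog check behaves like the sketch below; this is a simplification, and the queue structure and helper names (input_queue, enqueue, drop) are illustrative rather than the kernel's own.

```c
#include <stdbool.h>

/* Illustrative per-device input queue guarded by netdev_max_backlog. */
struct input_queue {
    unsigned len;            /* packets currently queued        */
    unsigned backlog_max;    /* value of netdev_max_backlog     */
};

struct pkt;                                            /* opaque packet descriptor */
void enqueue(struct input_queue *q, struct pkt *p);    /* assumed helper */
void drop(struct pkt *p);                              /* assumed helper */

/* A newly received packet is discarded if adding it would exceed the backlog. */
bool backlog_enqueue(struct input_queue *q, struct pkt *p)
{
    if (q->len + 1 > q->backlog_max) {
        drop(p);             /* queue full: packet is discarded */
        return false;
    }
    enqueue(q, p);
    q->len++;
    return true;
}
```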

On the sender, as shown in Fig. 2, a user application writes data into the TCP send buffer by calling the write() system call. Like the TCP recv buffer, the send buffer is a crucial parameter for achieving maximum throughput. The maximum size of the congestion window is related to the amount of send buffer space allocated to the TCP socket. The send buffer holds all outstanding packets (for potential retransmission) as well as all data queued to be transmitted. Therefore, the congestion window can never grow larger than the send buffer can accommodate. If the send buffer is too small, the congestion window will not fully open, limiting the throughput. On the other hand, a large send buffer allows the congestion window to grow to a large value. If not constrained by the TCP recv buffer, the number of outstanding packets will also grow as the congestion window grows, causing packet loss if the end-to-end path cannot hold the large number of outstanding packets. The size of the send buffer can be set by modifying the /proc/sys/net/ipv4/tcp_wmem variable, which also takes three values: min, default, and max.
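
Besides the system-wide /proc settings, an application can request per-socket buffer sizes with setsockopt(), as in the sketch below; the 4 MB figure is an arbitrary example, and the kernel may clamp the request to its own limits.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int bufsize = 4 * 1024 * 1024;   /* 4 MB, chosen only as an example */

    /* Request larger send/receive buffers for this socket; the kernel may
     * clamp the request to system-wide limits and books extra space for
     * its own bookkeeping overhead. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
        perror("SO_RCVBUF");

    socklen_t len = sizeof(bufsize);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, &len) == 0)
        printf("effective send buffer: %d bytes\n", bufsize);

    close(fd);
    return 0;
}
```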

The analogue to the receiver's netdev_max_backlog is the sender's txqueuelen. The TCP layer builds packets when data is available in the send buffer, or ACK packets in response to received data packets. Each packet is pushed down to the IP layer for transmission. The IP layer enqueues each packet in an output queue (qdisc) associated with the NIC. The size of the qdisc can be modified by assigning a value to the txqueuelen variable associated with each NIC device. If the output queue is full, the attempt to enqueue a packet generates a local-congestion event, which is propagated upward to the TCP layer. The TCP congestion-control algorithm then enters the Congestion Window Reduced (CWR) state and reduces the congestion window by one every other ACK (known as rate halving). After a packet is successfully queued in the output queue, its descriptor (sk_buff) is placed in the output ring buffer tx_ring. When packets are available in the ring buffer, the device driver invokes the NIC DMA engine to transmit them onto the wire.
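
A toy sketch of the rate-halving behaviour described above (not the kernel's actual code): on a local-congestion event the sender enters CWR and then decrements cwnd by one segment on every other ACK until it reaches half of its value at the time of the event. The names and the segment-counted arithmetic are simplifications.

```c
/* Illustrative CWR / rate-halving state for one connection. */
struct cwr_state {
    unsigned cwnd;        /* congestion window, in segments           */
    unsigned cwr_target;  /* cwnd value to stop reducing at           */
    unsigned ack_parity;  /* counts ACKs so we act on every other one */
    int      in_cwr;      /* currently in the CWR state?              */
};

/* Local congestion event (e.g. the device output queue was full). */
void enter_cwr(struct cwr_state *s)
{
    s->in_cwr = 1;
    s->cwr_target = s->cwnd / 2;
    s->ack_parity = 0;
}

/* Called for each ACK while in CWR: reduce cwnd by one every other ACK. */
void cwr_on_ack(struct cwr_state *s)
{
    if (!s->in_cwr)
        return;
    if (++s->ack_parity % 2 == 0 && s->cwnd > s->cwr_target)
        s->cwnd--;
    if (s->cwnd <= s->cwr_target)
        s->in_cwr = 0;    /* rate halving complete */
}
```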

While the above parameters dictate the flow-control profile of a connection, congestion-control behavior can also have a large impact on throughput. TCP uses one of several congestion-control algorithms to match its sending rate to the bottleneck-link rate. Over a connectionless network, a large number of TCP flows and other types of traffic share the same bottleneck link. As the number of flows sharing the bottleneck link changes, the bandwidth available to a given TCP flow varies. Packets get lost when the sending rate of a TCP flow is higher than the available bandwidth. Over a circuit, on the other hand, packets are not lost through competition with other flows, since bandwidth is reserved; however, when a fast sender is connected to a circuit with a lower rate, packets can be lost due to buffer overflow at the switch.

When a TCP connection is set up, a TCP sender uses ACK packets as a 'clock' to inject new packets into the network, a mechanism known as ACK-clocking [1]. Since TCP receivers cannot send ACK packets faster than the bottleneck-link rate, a TCP sender's transmission rate under ACK-clocking is matched to the bottleneck-link rate. In order to start the ACK clock, a TCP sender uses the slow-start mechanism. During the slow-start phase, for each ACK packet received, a TCP sender transmits two data packets back-to-back. Since ACK packets arrive at the bottleneck-link rate, the sender is essentially transmitting data twice as fast as the bottleneck link can sustain. The slow-start phase ends when the size of the congestion window grows beyond ssthresh. In many congestion-control algorithms, such as BIC [2], the initial slow-start threshold (ssthresh) can be adjusted, as can other factors such as the maximum increment, to make BIC more or less aggressive. However, like changing the buffers via the sysctl function, these are system-wide changes that could adversely affect other ongoing and future connections. A TCP sender is allowed to send the minimum of the congestion window and the receiver's advertised window number of packets. Therefore, the number of outstanding packets doubles each round-trip time, unless bounded by the receiver's advertised window. Since packets are forwarded at the bottleneck-link rate, doubling the number of outstanding packets each round-trip time also doubles the buffer occupancy inside the bottleneck switch. Eventually, there will be packet losses inside the bottleneck switch once the buffer overflows.
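
In the classic segment-counted formulation, slow start grows cwnd by one segment per ACK, which doubles it every round-trip time until ssthresh is reached; the sketch below illustrates just that rule, not the kernel's byte-based bookkeeping.

```c
/* Simplified per-connection congestion-control state (segment counts). */
struct cc_state {
    unsigned cwnd;       /* congestion window     */
    unsigned ssthresh;   /* slow-start threshold  */
};

/* Called once per ACK that acknowledges new data. */
void on_ack_slow_start(struct cc_state *s)
{
    if (s->cwnd < s->ssthresh)
        s->cwnd += 1;    /* +1 segment per ACK, so cwnd doubles each RTT */
    /* once cwnd reaches ssthresh the sender switches to congestion avoidance */
}
```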

After packet loss occurs, a TCP sender enters the congestion-avoidance phase. During congestion avoidance, the congestion window is increased by one packet each round-trip time. As ACK packets arrive at the bottleneck-link rate, the congestion window keeps growing, as does the number of outstanding packets. Therefore, packets will be lost again once the number of outstanding packets grows larger than the buffer size in the bottleneck switch plus the number of packets on the wire.
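
Congestion avoidance, by contrast, adds roughly one segment per round-trip time; a common per-ACK approximation is sketched below, again with simplified segment counts rather than the kernel's actual arithmetic.

```c
/* Congestion avoidance: grow the window by about one segment per RTT.
 * With cwnd segments outstanding, roughly cwnd ACKs arrive per RTT, so
 * growing by one segment after every cwnd ACKs gives about +1 segment/RTT. */
void on_ack_congestion_avoidance(unsigned *cwnd, unsigned *acks_since_growth)
{
    if (++(*acks_since_growth) >= *cwnd) {
        *cwnd += 1;
        *acks_since_growth = 0;
    }
}
```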

There are many other parameters relevant to the operation of TCP in Linux, and each is at least briefly explained in the documentation included in the kernel source distribution (Documentation/networking/ip-sysctl.txt). An example of a configurable parameter in the TCP implementation is the RFC 2861 congestion-window restart function. RFC 2861 proposes restarting the congestion window if the sender is idle for a period of time (one RTO). The purpose is to ensure that the congestion window reflects the current state of the network. If the connection has been idle, the congestion window may reflect an obsolete view of the network and so is reset. This behavior can be disabled using the sysctl tcp_slow_start_after_idle, but, again, this change affects all connections system-wide.
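
As an illustration, the idle-restart behaviour can be inspected or disabled through the /proc interface; the snippet below reads the current value of tcp_slow_start_after_idle and, when run with sufficient privileges, writes 0 to disable it, a change that affects every connection on the machine.

```c
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/sys/net/ipv4/tcp_slow_start_after_idle";
    int value = -1;

    /* Read the current setting (1 = restart the congestion window after idle). */
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &value) == 1)
            printf("tcp_slow_start_after_idle = %d\n", value);
        fclose(f);
    }

    /* Disable restarting the congestion window after idle periods.
     * Requires root; note that this is a system-wide change. */
    f = fopen(path, "w");
    if (f) {
        fprintf(f, "0\n");
        fclose(f);
    } else {
        perror("write tcp_slow_start_after_idle");
    }
    return 0;
}
```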

If TCP flow control and congestion control are still unclear, you are welcome to follow the WeChat public account 內核小王子; next week's post, on how the network stack can reach C10M, will take a deeper look at Linux's networking model.
