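Connection Tracking: TCP

Introduction

Compared with UDP, TCP is considerably more complex. To give the application layer reliable delivery, TCP introduces flags, options, sequence numbers, and a sliding window.

TCP flags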

#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80

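URG: the urgent pointer field is valid.
ACK: the acknowledgment number is valid.
PSH: the receiver should pass this segment to the application layer as soon as possible.
RST: reset the connection.
SYN: synchronize sequence numbers; used to initiate a connection.
FIN: the sender has finished sending data.
ECE: the ECN-Echo flag. During the TCP three-way handshake it signals that an endpoint is ECN-capable; on an established connection it signals that a received packet carried the ECN codepoint 11 (Congestion Experienced) in its IP header. See RFC 3168 for details.
CWR: the Congestion Window Reduced flag is set by the sending host to indicate that it received a segment with ECE set. The congestion window is an internal variable maintained by TCP to manage the size of the send window.

When two ECN-capable TCP endpoints establish a connection, they exchange SYN, SYN-ACK, and ACK segments. For ECN-capable endpoints, the SYN has both ECE and CWR set, while the SYN-ACK has only ECE set.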
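The ECN handshake rule can be expressed as two small predicates over the TCP flag byte. This is only a sketch: the helper names is_ecn_setup_syn and is_ecn_setup_synack are hypothetical and are not taken from the kernel; the flag values are the TCPHDR_* constants listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPHDR_SYN 0x02
#define TCPHDR_ACK 0x10
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80

/* Hypothetical helper: a SYN that requests ECN has SYN, ECE and CWR set
 * and ACK clear. */
static bool is_ecn_setup_syn(uint8_t flags)
{
    uint8_t mask = TCPHDR_SYN | TCPHDR_ACK | TCPHDR_ECE | TCPHDR_CWR;

    return (flags & mask) == (TCPHDR_SYN | TCPHDR_ECE | TCPHDR_CWR);
}

/* Hypothetical helper: a SYN-ACK that accepts ECN has SYN, ACK and ECE set
 * and CWR clear. */
static bool is_ecn_setup_synack(uint8_t flags)
{
    uint8_t mask = TCPHDR_SYN | TCPHDR_ACK | TCPHDR_ECE | TCPHDR_CWR;

    return (flags & mask) == (TCPHDR_SYN | TCPHDR_ACK | TCPHDR_ECE);
}
```

For example, a flag byte of 0xC2 (SYN, ECE, CWR) is an ECN-setup SYN, and 0x52 (SYN, ACK, ECE) is an ECN-setup SYN-ACK.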

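TCP state transitions

Connection tracking follows the state of a connection based on the flags of the TCP segments exchanged. Because the firewall sits between the client and the server, its state machine is not exactly the same as the one on either endpoint.

The figure below shows the state transitions on the client and the server:
[Figure: TCP state transition diagram for the client and the server]

State transitions in Linux connection tracking: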

/*
 * The TCP state transition table needs a few words...
 *
 * We are the man in the middle. All the packets go through us
 * but might get lost in transit to the destination.
 * It is assumed that the destinations can't receive segments
 * we haven't seen.
 *
 * The checked segment is in window, but our windows are *not*
 * equivalent with the ones of the sender/receiver. We always
 * try to guess the state of the current sender.
 *
 * The meaning of the states are:
 *
 * NONE:    initial state
 * SYN_SENT:    SYN-only packet seen
 * SYN_SENT2:    SYN-only packet seen from reply dir, simultaneous open
 * SYN_RECV:    SYN-ACK packet seen
 * ESTABLISHED:    ACK packet seen
 * FIN_WAIT:    FIN packet seen
 * CLOSE_WAIT:    ACK seen (after FIN)
 * LAST_ACK:    FIN seen (after FIN)
 * TIME_WAIT:    last ACK seen
 * CLOSE:    closed connection (RST)
 *
 * Packets marked as IGNORED (sIG):
 *    if they may be either invalid or valid
 *    and the receiver may send back a connection
 *    closing RST or a SYN/ACK.
 *
 * Packets marked as INVALID (sIV):
 *    if we regard them as truly invalid packets
 */
static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
    {
/* ORIGINAL */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*syn*/       { sSS, sSS, sIG, sIG, sIG, sIG, sIG, sSS, sSS, sS2 },
/*
 *    sNO -> sSS    Initialize a new connection
 *    sSS -> sSS    Retransmitted SYN
 *    sS2 -> sS2    Late retransmitted SYN
 *    sSR -> sIG
 *    sES -> sIG    Error: SYNs in window outside the SYN_SENT state
 *            are errors. Receiver will reply with RST
 *            and close the connection.
 *            Or we are not in sync and hold a dead connection.
 *    sFW -> sIG
 *    sCW -> sIG
 *    sLA -> sIG
 *    sTW -> sSS    Reopened connection (RFC 1122).
 *    sCL -> sSS
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*synack*/ { sIV, sIV, sSR, sIV, sIV, sIV, sIV, sIV, sIV, sSR },
/*
 *    sNO -> sIV    Too late and no reason to do anything
 *    sSS -> sIV    Client can't send SYN and then SYN/ACK
 *    sS2 -> sSR    SYN/ACK sent to SYN2 in simultaneous open
 *    sSR -> sSR    Late retransmitted SYN/ACK in simultaneous open
 *    sES -> sIV    Invalid SYN/ACK packets sent by the client
 *    sFW -> sIV
 *    sCW -> sIV
 *    sLA -> sIV
 *    sTW -> sIV
 *    sCL -> sIV
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*fin*/    { sIV, sIV, sFW, sFW, sLA, sLA, sLA, sTW, sCL, sIV },
/*
 *    sNO -> sIV    Too late and no reason to do anything...
 *    sSS -> sIV    Client migth not send FIN in this state:
 *            we enforce waiting for a SYN/ACK reply first.
 *    sS2 -> sIV
 *    sSR -> sFW    Close started.
 *    sES -> sFW
 *    sFW -> sLA    FIN seen in both directions, waiting for
 *            the last ACK.
 *            Migth be a retransmitted FIN as well...
 *    sCW -> sLA
 *    sLA -> sLA    Retransmitted FIN. Remain in the same state.
 *    sTW -> sTW
 *    sCL -> sCL
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*ack*/       { sES, sIV, sES, sES, sCW, sCW, sTW, sTW, sCL, sIV },
/*
 *    sNO -> sES    Assumed.
 *    sSS -> sIV    ACK is invalid: we haven't seen a SYN/ACK yet.
 *    sS2 -> sIV
 *    sSR -> sES    Established state is reached.
 *    sES -> sES    :-)
 *    sFW -> sCW    Normal close request answered by ACK.
 *    sCW -> sCW
 *    sLA -> sTW    Last ACK detected (RFC5961 challenged)
 *    sTW -> sTW    Retransmitted last ACK. Remain in the same state.
 *    sCL -> sCL
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*rst*/    { sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL },
/*none*/   { sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV }
    },
    {
/* REPLY */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*syn*/       { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sSS, sIV, sS2 },
/*
 *    sNO -> sIV    Never reached.
 *    sSS -> sS2    Simultaneous open
 *    sS2 -> sS2    Retransmitted simultaneous SYN
 *    sSR -> sIV    Invalid SYN packets sent by the server
 *    sES -> sIV
 *    sFW -> sIV
 *    sCW -> sIV
 *    sLA -> sIV
 *    sTW -> sSS    Reopened connection, but server may have switched role
 *    sCL -> sIV
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*synack*/ { sIV, sSR, sIG, sIG, sIG, sIG, sIG, sIG, sIG, sSR },
/*
 *    sSS -> sSR    Standard open.
 *    sS2 -> sSR    Simultaneous open
 *    sSR -> sIG    Retransmitted SYN/ACK, ignore it.
 *    sES -> sIG    Late retransmitted SYN/ACK?
 *    sFW -> sIG    Might be SYN/ACK answering ignored SYN
 *    sCW -> sIG
 *    sLA -> sIG
 *    sTW -> sIG
 *    sCL -> sIG
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*fin*/    { sIV, sIV, sFW, sFW, sLA, sLA, sLA, sTW, sCL, sIV },
/*
 *    sSS -> sIV    Server might not send FIN in this state.
 *    sS2 -> sIV
 *    sSR -> sFW    Close started.
 *    sES -> sFW
 *    sFW -> sLA    FIN seen in both directions.
 *    sCW -> sLA
 *    sLA -> sLA    Retransmitted FIN.
 *    sTW -> sTW
 *    sCL -> sCL
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*ack*/       { sIV, sIG, sSR, sES, sCW, sCW, sTW, sTW, sCL, sIG },
/*
 *    sSS -> sIG    Might be a half-open connection.
 *    sS2 -> sIG
 *    sSR -> sSR    Might answer late resent SYN.
 *    sES -> sES    :-)
 *    sFW -> sCW    Normal close request answered by ACK.
 *    sCW -> sCW
 *    sLA -> sTW    Last ACK detected (RFC5961 challenged)
 *    sTW -> sTW    Retransmitted last ACK.
 *    sCL -> sCL
 */
/*          sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2    */
/*rst*/    { sIV, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL, sCL },
/*none*/   { sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sIV }
    }
};
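The second dimension of tcp_conntracks is selected from the flags of the packet. The sketch below is modeled on the kernel's get_conntrack_index(), but takes a raw flag byte instead of a struct tcphdr, which is an assumption made so that it compiles in userspace. The enum order matches the row order of the table (syn, synack, fin, ack, rst, none).

```c
#include <assert.h>
#include <stdint.h>

#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_ACK 0x10

/* Row index into tcp_conntracks[dir][index][state] */
enum tcp_bit_set {
    TCP_SYN_SET,
    TCP_SYNACK_SET,
    TCP_FIN_SET,
    TCP_ACK_SET,
    TCP_RST_SET,
    TCP_NONE_SET,
};

/* Userspace sketch (hypothetical name): RST has the highest priority,
 * then SYN (with or without ACK), then FIN, then a plain ACK. */
static enum tcp_bit_set flags_to_index(uint8_t flags)
{
    if (flags & TCPHDR_RST)
        return TCP_RST_SET;
    if (flags & TCPHDR_SYN)
        return (flags & TCPHDR_ACK) ? TCP_SYNACK_SET : TCP_SYN_SET;
    if (flags & TCPHDR_FIN)
        return TCP_FIN_SET;
    if (flags & TCPHDR_ACK)
        return TCP_ACK_SET;
    return TCP_NONE_SET;
}
```

A FIN|ACK segment therefore selects the fin row, and a segment with no relevant flag selects the none row, whose entries are all sIV.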

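TCP sliding window and window scale factor

Connection tracking also checks the TCP window: by validating the sequence and acknowledgment numbers of each segment, it filters out invalid TCP segments and mitigates DDoS attacks.

Sliding window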

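[Figure: TCP sliding window]
As the figure shows, when sending a segment the sender uses the window size advertised by the receiver, together with the number of bytes already acknowledged, to determine the highest sequence number it may currently send. In the figure, the highest valid sequence number the sender can send is 9, and the lowest is 4 (which may be a retransmission). For a well-behaved host, a sequence number between 4 and 9 is therefore valid.

The firewall sits between the client and the server. To filter on seq it must never block a segment sent by a well-behaved host. To estimate the sender's seq range accurately, all segments between the client and the server must pass through the firewall.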

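In the paper "Real Stateful TCP Packet Filtering in IP Filter", the author defines the following four inequalities to validate the sequence and acknowledgment numbers:

  • Upper bound of the sequence number: as the figure shows, the upper bound for the sender's sequence number is the highest sequence number acknowledged by the receiver plus the latest window advertised by the receiver.
  • Lower bound of the sequence number: as the figure shows, the lower bound is the highest sequence number the receiver has already acknowledged. However, an ACK that has passed the firewall may not reach the sender immediately, or may be lost, so the sender may retransmit data that the firewall considers acknowledged. The lower bound therefore has to be relaxed. Linux currently uses the highest sequence number sent by the sender minus the receiver's latest window.
  • Upper bound of the acknowledgment number: the acknowledgment number acknowledges bytes the sender has already sent, so its upper bound is the highest sequence number the sender has sent; bytes that were never sent cannot be acknowledged.
  • Lower bound of the acknowledgment number: this bound is harder to determine, because an ACK segment may also carry data, and anything that carries data may be retransmitted. Even if the receiver has already sent an acknowledgment and the sender has received it, the sender's reply to that data-carrying segment may be lost, causing the receiver to send the same segment, with the same old acknowledgment number, again. For these reasons the lower bound is relaxed to:
    highest sequence number sent by the peer - MAXACKWINDOW, where MAXACKWINDOW is 65536. The Linux implementation uses MAXACKWINCONST, whose value is 66000.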

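Linux adjusts the four inequalities above in its implementation:

The boundaries and the conditions are changed according to RFC793:
   // The segment must intersect the window: its end (seq + len) may exceed sender.td_maxend, but seq
   // itself must not exceed sender.td_maxend. This differs from the paper, which bounds seq + len by sender.td_maxend.
   // The end of the segment (seq + len) must be at or above the lower bound sender.td_end - receiver.td_maxwin.
   the packet must intersect the window (i.e. segments may be
   after the right or before the left edge) and thus receivers may ACK
   segments after the right edge of the window.
    // td_maxend is the highest sequence number this side may send: the highest acknowledgment number sent by the peer plus the peer's current window.
    td_maxend = max(sack + max(win,1)) seen in reply packets
    // td_maxwin is this side's window plus the bytes it has selectively acknowledged. This is slightly imprecise, but it never blocks valid segments.
    td_maxwin = max(max(win, 1)) + (sack - ack) seen in sent packets
    td_maxwin += seq + len - sender.td_maxend
            if seq + len > sender.td_maxend
    // Highest sequence number sent by this side.
    td_end    = max(seq + len) seen in sent packets
   // Upper bound of the sequence number: seq must not exceed sender.td_maxend.
   I.   Upper bound for valid data:    seq <= sender.td_maxend
   // Lower bound of the sequence number: the end of the segment must be at or above sender.td_end - receiver.td_maxwin.
   // This relaxes the bound used in the paper.
   II.  Lower bound for valid data:    seq + len >= sender.td_end - receiver.td_maxwin
   // Upper bound of the acknowledgment number: it must not exceed the highest sequence number sent by the peer. Same as in the paper.
   III.    Upper bound for valid (s)ack:   sack <= receiver.td_end
   // Lower bound of the acknowledgment number. Same as in the paper.
   IV.    Lower bound for valid (s)ack:    sack >= receiver.td_end - MAXACKWINDOW
   // sack is the largest right edge among all SACK blocks, so it is greater than or equal to ack. Without a SACK option it equals ack.
   where sack is the highest right edge of sack block found in the packet
   or ack in the case of packet without SACK option.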

   The upper bound limit for a valid (s)ack is not ignored -
   we doesn't have to deal with fragments.
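The four bounds can be tried out in userspace with the sketch below. It is an illustration under stated assumptions, not the kernel code: struct dir_state, seq_before, seq_after, max_ack_window and check_bounds are hypothetical names, and the comparisons use signed 32-bit differences so that they stay correct when sequence numbers wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAXACKWINCONST 66000u

/* Hypothetical per-direction state holding only what the checks need */
struct dir_state {
    uint32_t td_end;
    uint32_t td_maxend;
    uint32_t td_maxwin;
};

/* Wraparound-safe "a is before b" and "a is after b" */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static bool seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

static uint32_t max_ack_window(const struct dir_state *s)
{
    return s->td_maxwin > MAXACKWINCONST ? s->td_maxwin : MAXACKWINCONST;
}

/* Returns a bitmask of satisfied bounds: bit 0 = I, bit 1 = II,
 * bit 2 = III, bit 3 = IV. end is seq + len. */
static unsigned int check_bounds(const struct dir_state *sender,
                                 const struct dir_state *receiver,
                                 uint32_t seq, uint32_t end, uint32_t sack)
{
    unsigned int ok = 0;

    if (!seq_after(seq, sender->td_maxend))              /* I */
        ok |= 1u;
    if (receiver->td_maxwin == 0 ||
        !seq_before(end, sender->td_end - receiver->td_maxwin)) /* II */
        ok |= 2u;
    if (!seq_after(sack, receiver->td_end))              /* III */
        ok |= 4u;
    if (!seq_before(sack, receiver->td_end - max_ack_window(sender))) /* IV */
        ok |= 8u;
    return ok;
}
```

A segment is inside the window only when all four bits are set; a segment whose seq lies beyond td_maxend clears bit 0, and an acknowledgment of data the peer has not sent clears bit 2.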

Data structures

struct ip_ct_tcp_state {
    u_int32_t    td_end;        /* max of seq + len: highest segment end seen from this side, i.e. the sequence number of the first byte of the next segment it will send */
    u_int32_t    td_maxend;    /* max of ack + max(win, 1): highest sequence number this side may send (the upper bound for seq), i.e. the peer's acknowledgment number plus its window. The 1 allows a one-byte probe when the window is 0. */
    u_int32_t    td_maxwin;    /* max(win) the maximum window seen: largest window advertised by this side */
    u_int32_t    td_maxack;    /* max of ack: highest acknowledgment number sent by this side */
    u_int8_t    td_scale;    /* window scale factor */
    u_int8_t    flags;        /* per direction options */
};
struct ip_ct_tcp {
    struct ip_ct_tcp_state seen[2];    /* connection parameters per direction */
    u_int8_t    state;        /* state of the connection (enum tcp_conntrack) */
    /* For detecting stale connections */
    u_int8_t    last_dir;    /* Direction of the last packet (enum ip_conntrack_dir) */
    u_int8_t    retrans;    /* Number of retransmitted packets */
    u_int8_t    last_index;    /* Index of the last packet (its flag-set index) */
    u_int32_t    last_seq;    /* Last sequence number seen in dir */
    u_int32_t    last_ack;    /* Last sequence number seen in opposite dir */
    u_int32_t    last_end;    /* Last seq + len */
    u_int16_t    last_win;    /* Last window advertisement seen in dir */
    /* For SYN packets while we may be out-of-sync */
    u_int8_t    last_wscale;    /* Last window scaling factor seen */
    u_int8_t    last_flags;    /* Last flags set (TCP option flags of that packet) */
};

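Initialization

Normally the first SYN creates the conntrack entry and initializes it, and the initialization must let the following packets pass the sequence number check. The next packet is normally a SYN/ACK or a retransmitted SYN. There is also an abnormal case: after the firewall reboots it has lost most of its connection state, so the first packet seen for a connection is not a SYN but a mid-stream segment. This case needs special handling.

Initialization is done by tcp_new: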

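/* Compute the sequence number just past the end of the segment. dataoff is the offset of the TCP header within the packet, len is the total packet length. */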
static inline __u32 segment_seq_plus_len(__u32 seq,
                     size_t len,
                     unsigned int dataoff,
                     const struct tcphdr *tcph)
{
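    /* XXX Should I use payload length field in IP/IPv6 header ?
     * - YK
     * Packet length minus the offset of the TCP header minus the TCP
     * header length gives the payload length. SYN and FIN each add 1,
     * because each of these flags consumes one sequence number. */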
    return (seq + len - dataoff - tcph->doff*4
        + (tcph->syn ? 1 : 0) + (tcph->fin ? 1 : 0));
}

/* Fixme: what about big packets? */
#define MAXACKWINCONST            66000
#define MAXACKWINDOW(sender)                        \
    ((sender)->td_maxwin > MAXACKWINCONST ? (sender)->td_maxwin    \
                          : MAXACKWINCONST)

/* Called when a new connection for this protocol found. */
static bool tcp_new(struct nf_conn *ct, const struct sk_buff *skb,
            unsigned int dataoff, unsigned int *timeouts)
{
    enum tcp_conntrack new_state;
    const struct tcphdr *th;
    struct tcphdr _tcph;
    struct net *net = nf_ct_net(ct);
    struct nf_tcp_net *tn = tcp_pernet(net);
    // state of the original (sending) direction
    const struct ip_ct_tcp_state *sender = &ct->proto.tcp.seen[0];
    // state of the reply (receiving) direction
    const struct ip_ct_tcp_state *receiver = &ct->proto.tcp.seen[1];
    // locate the TCP header
    th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
    BUG_ON(th == NULL);

    /* Don't need lock here: this conntrack not in circulation yet */
    /* Look up the next state from the direction (original) and the flags; the first packet always starts from TCP_CONNTRACK_NONE */
    new_state = tcp_conntracks[0][get_conntrack_index(th)][TCP_CONNTRACK_NONE];

    /* Invalid: delete conntrack */
    if (new_state >= TCP_CONNTRACK_MAX) {
        pr_debug("nf_ct_tcp: invalid new deleting.\n");
        return false;
    }

    // initialize the state
    if (new_state == TCP_CONNTRACK_SYN_SENT) {/* the usual case: the first packet is a SYN; initialize the tracking data */
        // zero the state
        memset(&ct->proto.tcp, 0, sizeof(ct->proto.tcp));
        /* SYN packet */
        ct->proto.tcp.seen[0].td_end = /* td_end = s + n, i.e. the segment's sequence number plus its length */
            segment_seq_plus_len(ntohl(th->seq), skb->len,
                         dataoff, th);
        // window advertised by this packet
        ct->proto.tcp.seen[0].td_maxwin = ntohs(th->window);
        if (ct->proto.tcp.seen[0].td_maxwin == 0)
            ct->proto.tcp.seen[0].td_maxwin = 1;
        ct->proto.tcp.seen[0].td_maxend = ct->proto.tcp.seen[0].td_end;
        // parse the TCP options
        tcp_options(skb, dataoff, th, &ct->proto.tcp.seen[0]);
    } else if (tn->tcp_loose == 0) {/* strict mode: refuse to pick up the connection. Configurable via sysctl; synproxy sets this flag */
        /* Don't try to pick up connections. */
        return false;
    } else {// loose mode, the abnormal case: the first packet seen is not a SYN but a mid-stream segment. The window scale option is only carried in SYN and SYN/ACK, so much information is lost and only simple handling is possible
        memset(&ct->proto.tcp, 0, sizeof(ct->proto.tcp));
        /*
         * We are in the middle of a connection,
         * its history is lost for us.
         * Let's try to use the data from the packet.
         */
        ct->proto.tcp.seen[0].td_end =
            segment_seq_plus_len(ntohl(th->seq), skb->len,
                         dataoff, th);
        ct->proto.tcp.seen[0].td_maxwin = ntohs(th->window);
        if (ct->proto.tcp.seen[0].td_maxwin == 0)
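            ct->proto.tcp.seen[0].td_maxwin = 1;// at least 1: with a zero window a one-byte persist probe must still be accepted
        ct->proto.tcp.seen[0].td_maxend =/* td_maxend = ack + win, but nothing has been seen from the peer yet, so it is initialized to td_end + td_maxwin */
            ct->proto.tcp.seen[0].td_end +
            ct->proto.tcp.seen[0].td_maxwin;

        /* We assume SACK and liberal window checking to handle
         * window scaling. With IP_CT_TCP_FLAG_BE_LIBERAL set, sequence number failures no longer reject packets */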
        ct->proto.tcp.seen[0].flags =
        ct->proto.tcp.seen[1].flags = IP_CT_TCP_FLAG_SACK_PERM |
                          IP_CT_TCP_FLAG_BE_LIBERAL;
    }

    /* tcp_packet will set them */
    ct->proto.tcp.last_index = TCP_NONE_SET;

    pr_debug("tcp_new: sender end=%u maxend=%u maxwin=%u scale=%i "
         "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
         sender->td_end, sender->td_maxend, sender->td_maxwin,
         sender->td_scale,
         receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
         receiver->td_scale);
    return true;
}
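The arithmetic in segment_seq_plus_len can be checked with a small userspace sketch. The function name segment_end and its parameters are hypothetical; the flags are passed as booleans instead of being read from a struct tcphdr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace sketch of segment_seq_plus_len():
 *   pkt_len    total packet length (skb->len in the kernel)
 *   tcp_off    offset of the TCP header inside the packet (dataoff)
 *   doff_words TCP header length in 32-bit words (tcph->doff)
 * SYN and FIN each consume one sequence number. */
static uint32_t segment_end(uint32_t seq, size_t pkt_len, unsigned int tcp_off,
                            unsigned int doff_words, bool syn, bool fin)
{
    uint32_t payload = (uint32_t)(pkt_len - tcp_off - doff_words * 4u);

    return seq + payload + (syn ? 1u : 0u) + (fin ? 1u : 0u);
}
```

For a bare SYN with seq 1000 (20-byte IP header, 40-byte TCP header, no payload) the result is 1001, which is the value tcp_new stores in td_end. The addition is done in uint32_t, so the result wraps around like real sequence numbers.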

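Sequence number validation

/* Returns false if the segment is rejected, true if it is accepted */
static bool tcp_in_window(const struct nf_conn *ct,/* conntrack entry the packet belongs to */
              struct ip_ct_tcp *state,/* TCP tracking state of that entry */
              enum ip_conntrack_dir dir,/* packet direction */
              unsigned int index,/* flag-set index of the packet */
              const struct sk_buff *skb,/* the packet */
              unsigned int dataoff,/* offset of the TCP header */
              const struct tcphdr *tcph)/* pointer to the TCP header */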
{
    struct net *net = nf_ct_net(ct);
    struct nf_tcp_net *tn = tcp_pernet(net);
    struct ip_ct_tcp_state *sender = &state->seen[dir];
    struct ip_ct_tcp_state *receiver = &state->seen[!dir];
    const struct nf_conntrack_tuple *tuple = &ct->tuplehash[dir].tuple;
    __u32 seq, ack, sack, end, win, swin;
    s32 receiver_offset;
    bool res, in_recv_win;

    /*
     * Get the required data from the packet.
     */
    seq = ntohl(tcph->seq);/* sequence number */
    ack = sack = ntohl(tcph->ack_seq);/* acknowledgment number */
    win = ntohs(tcph->window);/* advertised window */
    /* end is the sequence number of the byte following the last byte of this segment */
    end = segment_seq_plus_len(seq, skb->len, dataoff, tcph);
    /* If the receiver permitted SACK, parse the SACK option and take the highest right edge. Without a SACK option, sack stays equal to ack */
    if (receiver->flags & IP_CT_TCP_FLAG_SACK_PERM)
        tcp_sack(skb, dataoff, tcph, &sack);

    /* Take into account NAT sequence number mangling */
    /* NAT may change the payload length (e.g. ALG, synproxy), so the numbers are adjusted by the accumulated offset. The offset looked up here applies to the acknowledgment side. */
    receiver_offset = nf_ct_seq_offset(ct, !dir, ack - 1);
    ack -= receiver_offset;/* real acknowledgment number */
    sack -= receiver_offset;/* real SACK right edge */

    pr_debug("tcp_in_window: START\n");
    pr_debug("tcp_in_window: ");
    nf_ct_dump_tuple(tuple);
    pr_debug("seq=%u ack=%u+(%d) sack=%u+(%d) win=%u end=%u\n",
         seq, ack, receiver_offset, sack, receiver_offset, win, end);
    pr_debug("tcp_in_window: sender end=%u maxend=%u maxwin=%u scale=%i "
         "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
         sender->td_end, sender->td_maxend, sender->td_maxwin,
         sender->td_scale,
         receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
         receiver->td_scale);
    
    if (sender->td_maxwin == 0) {/* no window recorded yet for this side: this may be a SYN/ACK (passive open), a SYN (simultaneous open), or an entry created from a mid-stream packet after state was lost */
        /*
         * Initialize sender data.
         * For a SYN/ACK this is the first packet from the responder, so the responder's state is initialized here.
         */
        if (tcph->syn) {
            /*
             * SYN-ACK in reply to a SYN
             * or SYN from reply direction in simultaneous open.
             */
            sender->td_end =
            sender->td_maxend = end;/* seq + 1 */
            sender->td_maxwin = (win == 0 ? 1 : win);/* initial window; scaling is applied later. A minimum of 1 lets persist probes through */
            /* parse the TCP options to get the window scale factor */
            tcp_options(skb, dataoff, tcph, sender);
            /*
             * RFC 1323:
             * Both sides must send the Window Scale option
             * to enable window scaling in either direction.
             */
            if (!(sender->flags & IP_CT_TCP_FLAG_WINDOW_SCALE
                  && receiver->flags & IP_CT_TCP_FLAG_WINDOW_SCALE))
                sender->td_scale =/* disable window scaling */
                receiver->td_scale = 0;
            // no ACK means simultaneous open: accept
            if (!tcph->ack)
                return true;
        } else {// no SYN flag: a mid-stream segment. The window scale factor is only carried in SYN segments, so the connection history is lost.
            /*
             * We are in the middle of a connection,
             * its history is lost for us.
             * Let's try to use the data from the packet.
             */
            sender->td_end = end;
            // apply the window scale
            swin = win << sender->td_scale;
            sender->td_maxwin = (swin == 0 ? 1 : swin);
            sender->td_maxend = end + sender->td_maxwin;
            /*
             * We haven't seen traffic in the other direction yet
             * but we have to tweak window tracking to pass III
             * and IV until that happens.
             */
            if (receiver->td_maxwin == 0)
                receiver->td_end = receiver->td_maxend = sack;
        }
    } else if (((state->state == TCP_CONNTRACK_SYN_SENT// retransmitted SYN
             && dir == IP_CT_DIR_ORIGINAL)
           || (state->state == TCP_CONNTRACK_SYN_RECV// retransmitted SYN/ACK
             && dir == IP_CT_DIR_REPLY))
           && after(end, sender->td_end)) {
        /*
         * RFC 793: "if a TCP is reinitialized ... then it need
         * not wait at all; it must only be sure to use sequence
         * numbers larger than those recently used."
         */
        sender->td_end =
        sender->td_maxend = end;
        sender->td_maxwin = (win == 0 ? 1 : win);

        tcp_options(skb, dataoff, tcph, sender);
    }

    if (!(tcph->ack)) {
        /*
         * If there is no ACK, just pretend it was set and OK.
         * Without the ACK flag the acknowledgment number is meaningless, so the peer's highest sent sequence number is used instead.
         */
        ack = sack = receiver->td_end;
    } else if (((tcp_flag_word(tcph) & (TCP_FLAG_ACK|TCP_FLAG_RST)) ==/* RST segment */
            (TCP_FLAG_ACK|TCP_FLAG_RST))
           && (ack == 0)) {
        /*
         * Broken TCP stacks, that set ACK in RST packets as well
         * with zero ack value.
         */
        ack = sack = receiver->td_end;
    }
    /* RST answering the very first packet: the client sent its request and the server immediately replied with RST */
    if (tcph->rst && seq == 0 && state->state == TCP_CONNTRACK_SYN_SENT)
        /*
         * RST sent answering SYN.
         */
        seq = end = sender->td_end;

    pr_debug("tcp_in_window: ");
    nf_ct_dump_tuple(tuple);
    pr_debug("seq=%u ack=%u+(%d) sack=%u+(%d) win=%u end=%u\n",
         seq, ack, receiver_offset, sack, receiver_offset, win, end);
    pr_debug("tcp_in_window: sender end=%u maxend=%u maxwin=%u scale=%i "
         "receiver end=%u maxend=%u maxwin=%u scale=%i\n",
         sender->td_end, sender->td_maxend, sender->td_maxwin,
         sender->td_scale,
         receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
         receiver->td_scale);

    /* Is the ending sequence in the receive window (if available)? */
    /* Check inequality (II), the lower bound of the sequence number */
    in_recv_win = !receiver->td_maxwin ||
              after(end, sender->td_end - receiver->td_maxwin - 1);/* end >= td_end - td_maxwin */
    //|--------------|
    //               td_end

    pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
         before(seq, sender->td_maxend + 1),
         (in_recv_win ? 1 : 0),
         before(sack, receiver->td_end + 1),
         after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1));

    if (before(seq, sender->td_maxend + 1) && // sequence number upper bound, inequality (I)
        in_recv_win &&                         // sequence number lower bound, inequality (II)
        before(sack, receiver->td_end + 1) &&  // acknowledgment upper bound, inequality (III)
        after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1)) {// acknowledgment lower bound, inequality (IV)
        /*  |---------------------------| */
        //  |----sack-------------------|td_end
        /*
         * Take into account window scaling (RFC 1323).
         * Compute the sender's real receive window from its advertised window and scale factor.
         */
        if (!tcph->syn)
            win <<= sender->td_scale;

        /*
         * Update sender data.
         */
        // This side selectively acknowledged some data, so it can accept correspondingly more bytes.
        swin = win + (sack - ack);
        if (sender->td_maxwin < swin)/* window update */
            sender->td_maxwin = swin;
        if (after(end, sender->td_end)) {/* update the highest sequence number sent by this side */
            sender->td_end = end;
            sender->flags |= IP_CT_TCP_FLAG_DATA_UNACKNOWLEDGED;/* the sender has new unacknowledged data */
        }

        if (tcph->ack) {/* ACK flag set: update the highest acknowledgment number */
            if (!(sender->flags & IP_CT_TCP_FLAG_MAXACK_SET)) {
                sender->td_maxack = ack;
                sender->flags |= IP_CT_TCP_FLAG_MAXACK_SET;
            } else if (after(ack, sender->td_maxack))/* new data acknowledged */
                sender->td_maxack = ack;/* record how far this side has acknowledged the peer's data */
        }

        /*
         * Update receiver data.
         */
        // The segment ends beyond the sender's known right edge. It only has to intersect the window,
        // so the receiver's tracked window is widened by the overshoot (the td_maxwin += seq + len - sender.td_maxend rule).
        if (receiver->td_maxwin != 0 && after(end, sender->td_maxend))
            receiver->td_maxwin += end - sender->td_maxend;
        // Update the receiver's td_maxend: the sequence number acknowledged here plus the window advertised by this side is the highest sequence number the peer may send.
        if (after(sack + win, receiver->td_maxend - 1)) {/* sack is the sequence number acknowledged by this segment */
            receiver->td_maxend = sack + win;/* td_maxend = ack + win */
            if (win == 0)// with a zero window the one-byte persist-timer probe must still pass, hence the +1
                receiver->td_maxend++;
        }

        if (ack == receiver->td_end)/* this segment acknowledges everything the peer has sent */
            receiver->flags &= ~IP_CT_TCP_FLAG_DATA_UNACKNOWLEDGED;/* all of the peer's data is now acknowledged */

        /*
         * Check retransmissions.
         */
        if (index == TCP_ACK_SET) {
            if (state->last_dir == dir/* same direction */
                && state->last_seq == seq/* same sequence number */
                && state->last_ack == ack
                && state->last_end == end
                && state->last_win == win)
                state->retrans++;// retransmission of the previous segment: count it
            else {// not a retransmission: refresh the record of the last segment
                state->last_dir = dir;
                state->last_seq = seq;
                state->last_ack = ack;
                state->last_end = end;
                state->last_win = win;
                state->retrans = 0;
            }
        }
        res = true;
    } else {// at least one of the four checks failed
        res = false;
        if (sender->flags & IP_CT_TCP_FLAG_BE_LIBERAL ||
            tn->tcp_be_liberal)// liberal mode (e.g. an entry built from a mid-stream packet, with history lost): skip the check
            res = true;
        if (!res) {
            nf_ct_l4proto_log_invalid(skb, ct,
            "%s",
            before(seq, sender->td_maxend + 1) ?  // if false: "SEQ is over the upper bound (over the window of the receiver)"
            in_recv_win ?                         // if false: "SEQ is under the lower bound (already ACKed data retransmitted)"
            before(sack, receiver->td_end + 1) ?  // if false: "ACK is over the upper bound (ACKed data not seen yet)"
            after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1) ? "BUG" // all four true: "BUG", which should not happen in this branch
            : "ACK is under the lower bound (possible overly delayed ACK)"    // if the fourth check is false
            : "ACK is over the upper bound (ACKed data not seen yet)"
            : "SEQ is under the lower bound (already ACKed data retransmitted)"
            : "SEQ is over the upper bound (over the window of the receiver)");
        }
    }

    pr_debug("tcp_in_window: res=%u sender end=%u maxend=%u maxwin=%u "
         "receiver end=%u maxend=%u maxwin=%u\n",
         res, sender->td_end, sender->td_maxend, sender->td_maxwin,
         receiver->td_end, receiver->td_maxend, receiver->td_maxwin);

    return res;
}
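Two details of tcp_in_window are easy to miss: the advertised window is scaled on every segment except a SYN, and td_maxend gets one extra byte when the advertised window is zero. The sketch below isolates both steps. The names effective_window and next_maxend are hypothetical, and the functions work on plain integers rather than on the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* The window field of a SYN or SYN/ACK is never scaled (RFC 1323);
 * every other segment is shifted by the sender's scale factor. */
static uint32_t effective_window(uint16_t raw_win, uint8_t scale, bool syn)
{
    uint32_t win = raw_win;

    if (!syn)
        win <<= scale;
    return win;
}

/* New td_maxend of the peer after a segment acknowledging up to sack
 * with an effective window of win. A zero window still admits a
 * one-byte persist probe, so one byte is added in that case. */
static uint32_t next_maxend(uint32_t cur_maxend, uint32_t sack, uint32_t win)
{
    if (seq_after(sack + win, cur_maxend - 1)) {
        cur_maxend = sack + win;
        if (win == 0)
            cur_maxend++;
    }
    return cur_maxend;
}
```

With a scale factor of 7, a raw window of 1024 on a data segment stands for 131072 bytes, while the same field on a SYN means 1024. The right edge never moves backwards: next_maxend(5000, 4000, 500) stays at 5000.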

Implementation

Connection tracking control block

const struct nf_conntrack_l4proto nf_conntrack_l4proto_tcp4 =
{
    .l3proto        = PF_INET,
    .l4proto         = IPPROTO_TCP,
    .pkt_to_tuple         = tcp_pkt_to_tuple,
    .invert_tuple         = tcp_invert_tuple,
#ifdef CONFIG_NF_CONNTRACK_PROCFS
    .print_conntrack     = tcp_print_conntrack,
#endif
    .packet         = tcp_packet,
    .get_timeouts        = tcp_get_timeouts,
    .new             = tcp_new,
    .error            = tcp_error,
    .can_early_drop        = tcp_can_early_drop,/* whether the conntrack entry may be reclaimed early */
#if IS_ENABLED(CONFIG_NF_CT_NETLINK)
    .to_nlattr        = tcp_to_nlattr,
    .from_nlattr        = nlattr_to_tcp,
    .tuple_to_nlattr    = nf_ct_port_tuple_to_nlattr,
    .nlattr_to_tuple    = nf_ct_port_nlattr_to_tuple,
    .nlattr_tuple_size    = tcp_nlattr_tuple_size,
    .nlattr_size        = TCP_NLATTR_SIZE,
    .nla_policy        = nf_ct_port_nla_policy,
#endif
#if IS_ENABLED(CONFIG_NF_CT_NETLINK_TIMEOUT)
    .ctnl_timeout        = {
        .nlattr_to_obj    = tcp_timeout_nlattr_to_obj,
        .obj_to_nlattr    = tcp_timeout_obj_to_nlattr,
        .nlattr_max    = CTA_TIMEOUT_TCP_MAX,
        .obj_size    = sizeof(unsigned int) *
                    TCP_CONNTRACK_TIMEOUT_MAX,
        .nla_policy    = tcp_timeout_nla_policy,
    },
#endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
    .init_net        = tcp_init_net,
    .get_net_proto        = tcp_get_net_proto,
};

Common flags

/* Window scaling is advertised by the sender */
#define IP_CT_TCP_FLAG_WINDOW_SCALE        0x01

/* SACK is permitted by the sender */
#define IP_CT_TCP_FLAG_SACK_PERM        0x02

/* This sender sent FIN first */
#define IP_CT_TCP_FLAG_CLOSE_INIT        0x04

/* Be liberal in window checking (e.g. the entry was created from a mid-stream packet and the history is lost) */
#define IP_CT_TCP_FLAG_BE_LIBERAL        0x08

/* Has unacknowledged data */
#define IP_CT_TCP_FLAG_DATA_UNACKNOWLEDGED    0x10

/* The field td_maxack has been set */
#define IP_CT_TCP_FLAG_MAXACK_SET        0x20

/* Marks possibility for expected RFC5961 challenge ACK */
#define IP_CT_EXP_CHALLENGE_ACK         0x40

/* Simultaneous open initialized */
#define IP_CT_TCP_SIMULTANEOUS_OPEN        0x80

struct nf_ct_tcp_flags {
    __u8 flags;
    __u8 mask;
};

tcp_error

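This is the first function called during connection tracking. It performs error checking: header sanity checks, flag-combination validation, and checksum verification.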

/* Protect conntrack agaist broken packets. Code taken from ipt_unclean.c.  */
static int tcp_error(struct net *net, struct nf_conn *tmpl,
             struct sk_buff *skb,
             unsigned int dataoff,
             u_int8_t pf,
             unsigned int hooknum)
{
    const struct tcphdr *th;
    struct tcphdr _tcph;
    unsigned int tcplen = skb->len - dataoff;
    u_int8_t tcpflags;

    /* Smaller that minimal TCP header? */
    th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
    if (th == NULL) {
        tcp_error_log(skb, net, pf, "short packet");
        return -NF_ACCEPT;
    }

    /* Not whole TCP header or malformed packet */
    if (th->doff*4 < sizeof(struct tcphdr) || tcplen < th->doff*4) {
        tcp_error_log(skb, net, pf, "truncated packet");
        return -NF_ACCEPT;
    }

    /* Checksum invalid? Ignore.
     * We skip checking packets on the outgoing path
     * because the checksum is assumed to be correct.
     */
    /* FIXME: Source route IP option packets --RR */
    if (net->ct.sysctl_checksum && hooknum == NF_INET_PRE_ROUTING &&
        nf_checksum(skb, hooknum, dataoff, IPPROTO_TCP, pf)) {
        tcp_error_log(skb, net, pf, "bad checksum");
        return -NF_ACCEPT;
    }

    /* Check TCP flags. */
    /* Take the TCP flag byte and mask out the TCPHDR_ECE, TCPHDR_CWR and TCPHDR_PSH bits */
    tcpflags = (tcp_flag_byte(th) & ~(TCPHDR_ECE|TCPHDR_CWR|TCPHDR_PSH));
    /* Check that the remaining bits form a valid combination */
    if (!tcp_valid_flags[tcpflags]) {
        tcp_error_log(skb, net, pf, "invalid tcp flag combination");
        return -NF_ACCEPT;
    }

    return NF_ACCEPT;
}
/* table of valid flag combinations - PUSH, ECE and CWR are always valid */
/* Any combination not listed here is rejected as invalid */
static const u8 tcp_valid_flags[(TCPHDR_FIN|TCPHDR_SYN|TCPHDR_RST|TCPHDR_ACK|
                 TCPHDR_URG) + 1] =
{
    [TCPHDR_SYN]                = 1,// a SYN may carry only the SYN flag
    [TCPHDR_SYN|TCPHDR_URG]            = 1,
    [TCPHDR_SYN|TCPHDR_ACK]            = 1,// a SYN/ACK carries both flags
    [TCPHDR_RST]                = 1,// an RST may carry only the RST flag
    [TCPHDR_RST|TCPHDR_ACK]            = 1,// an RST may also acknowledge data, so ACK may be set
    [TCPHDR_FIN|TCPHDR_ACK]            = 1,// a FIN must acknowledge data, so ACK must be set
    [TCPHDR_FIN|TCPHDR_ACK|TCPHDR_URG]    = 1,
    [TCPHDR_ACK]                = 1,// mid-stream data segments may carry only the ACK flag
    [TCPHDR_ACK|TCPHDR_URG]            = 1,
};
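The flag check in tcp_error can be reproduced in userspace to see which combinations pass. The table below copies the entries of tcp_valid_flags; the wrapper name tcp_flags_valid is hypothetical and takes the raw flag byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80

/* Same entries as tcp_valid_flags; everything else defaults to 0 */
static const uint8_t valid_flags[(TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST |
                                  TCPHDR_ACK | TCPHDR_URG) + 1] = {
    [TCPHDR_SYN]                           = 1,
    [TCPHDR_SYN | TCPHDR_URG]              = 1,
    [TCPHDR_SYN | TCPHDR_ACK]              = 1,
    [TCPHDR_RST]                           = 1,
    [TCPHDR_RST | TCPHDR_ACK]              = 1,
    [TCPHDR_FIN | TCPHDR_ACK]              = 1,
    [TCPHDR_FIN | TCPHDR_ACK | TCPHDR_URG] = 1,
    [TCPHDR_ACK]                           = 1,
    [TCPHDR_ACK | TCPHDR_URG]              = 1,
};

/* Hypothetical wrapper: PSH, ECE and CWR are masked out first, so the
 * index can never exceed the table size. */
static bool tcp_flags_valid(uint8_t flag_byte)
{
    uint8_t f = flag_byte & (uint8_t)~(TCPHDR_ECE | TCPHDR_CWR | TCPHDR_PSH);

    return valid_flags[f] != 0;
}
```

A PSH|ACK data segment (0x18) passes because PSH is masked out, while SYN|FIN (0x03), a bare FIN (0x01), and a segment with no flags at all (0x00) are rejected.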

tcp_pkt_to_tuple

Reads the source and destination ports from the packet and stores them in the tuple.

static bool tcp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff,
                 struct net *net, struct nf_conntrack_tuple *tuple)
{
    const struct tcphdr *hp;
    struct tcphdr _hdr;

    /* Actually only need first 4 bytes to get ports. */
    hp = skb_header_pointer(skb, dataoff, 4, &_hdr);
    if (hp == NULL)
        return false;

    tuple->src.u.tcp.port = hp->source;
    tuple->dst.u.tcp.port = hp->dest;

    return true;
}

tcp_invert_tuple

Builds the transport-layer tuple for the reverse direction by swapping the source and destination ports.

static bool tcp_invert_tuple(struct nf_conntrack_tuple *tuple,
                 const struct nf_conntrack_tuple *orig)
{
    tuple->src.u.tcp.port = orig->dst.u.tcp.port;
    tuple->dst.u.tcp.port = orig->src.u.tcp.port;
    return true;
}
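The two tuple helpers can be illustrated together on a raw TCP header. This is a sketch with assumptions: struct port_tuple, ports_from_header and invert_ports are hypothetical names, and the ports are converted to host byte order for readability, whereas the kernel keeps them in network byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TCP part of nf_conntrack_tuple */
struct port_tuple {
    uint16_t src_port;
    uint16_t dst_port;
};

/* Only the first 4 bytes of the TCP header are needed for the ports */
static bool ports_from_header(const uint8_t *tcp, size_t len,
                              struct port_tuple *t)
{
    if (len < 4)
        return false;
    t->src_port = (uint16_t)((tcp[0] << 8) | tcp[1]);
    t->dst_port = (uint16_t)((tcp[2] << 8) | tcp[3]);
    return true;
}

/* The reply direction simply swaps source and destination */
static struct port_tuple invert_ports(struct port_tuple orig)
{
    struct port_tuple inv = { orig.dst_port, orig.src_port };

    return inv;
}
```

Inverting a tuple twice yields the original tuple, which is what lets a reply packet be matched back to the entry created by the original direction.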

tcp_options

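A simplified TCP option parser that only looks at the SACK-permitted option and the window scale option.

/*
 * Simplified tcp_parse_options routine from tcp_input.c
 * It handles only the two options connection tracking needs:
 * SACK-permitted and window scale.
 */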
static void tcp_options(const struct sk_buff *skb,
            unsigned int dataoff,
            const struct tcphdr *tcph,
            struct ip_ct_tcp_state *state)
{
    unsigned char buff[(15 * 4) - sizeof(struct tcphdr)];
    const unsigned char *ptr;
    int length = (tcph->doff*4) - sizeof(struct tcphdr);

    if (!length)
        return;

    ptr = skb_header_pointer(skb, dataoff + sizeof(struct tcphdr),
                 length, buff);
    BUG_ON(ptr == NULL);

    state->td_scale =
    state->flags = 0;

    while (length > 0) {
        int opcode=*ptr++;
        int opsize;

        switch (opcode) {
        case TCPOPT_EOL:
            return;
        case TCPOPT_NOP:    /* Ref: RFC 793 section 3.1 */
            length--;
            continue;
        default:
            if (length < 2)
                return;
            opsize=*ptr++;
            if (opsize < 2) /* "silly options" */
                return;
            if (opsize > length)
                return;    /* don't parse partial options */
            /* the peer supports the SACK option */
            if (opcode == TCPOPT_SACK_PERM
                && opsize == TCPOLEN_SACK_PERM)
                state->flags |= IP_CT_TCP_FLAG_SACK_PERM;
            else if (opcode == TCPOPT_WINDOW /* window scale option */
                 && opsize == TCPOLEN_WINDOW) {
                state->td_scale = *(u_int8_t *)ptr;

                if (state->td_scale > TCP_MAX_WSCALE) /* clamp the window scale factor */
                    state->td_scale = TCP_MAX_WSCALE;

                state->flags |=
                    IP_CT_TCP_FLAG_WINDOW_SCALE;
            }
            ptr += opsize - 2;
            length -= opsize;
        }
    }
}
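
The option walk is easier to follow on a plain byte buffer. Below is a minimal userspace sketch of the same loop; `struct opt_summary` and `parse_options` are hypothetical names, and the option kinds and lengths are the standard TCP values (window scale is kind 3 with length 3, SACK-permitted is kind 4 with length 2, and the scale is capped at 14).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OPT_EOL          0
#define OPT_NOP          1
#define OPT_WINDOW       3
#define OPT_SACK_PERM    4
#define OPTLEN_WINDOW    3
#define OPTLEN_SACK_PERM 2
#define MAX_WSCALE       14

/* Hypothetical result type for this sketch. */
struct opt_summary {
    bool    sack_perm;
    bool    has_wscale;
    uint8_t wscale;
};

static void parse_options(const uint8_t *opt, size_t length,
                          struct opt_summary *out)
{
    size_t i = 0;

    assert(out != NULL);
    out->sack_perm = false;
    out->has_wscale = false;
    out->wscale = 0;

    while (i < length) {
        uint8_t kind = opt[i];
        uint8_t size;

        if (kind == OPT_EOL)
            return;                 /* end of option list */
        if (kind == OPT_NOP) {      /* one-byte padding */
            i++;
            continue;
        }
        if (length - i < 2)
            return;                 /* no room for a length byte */
        size = opt[i + 1];
        if (size < 2 || size > length - i)
            return;                 /* malformed or truncated option */

        if (kind == OPT_SACK_PERM && size == OPTLEN_SACK_PERM) {
            out->sack_perm = true;
        } else if (kind == OPT_WINDOW && size == OPTLEN_WINDOW) {
            uint8_t scale = opt[i + 2];

            out->wscale = scale > MAX_WSCALE ? MAX_WSCALE : scale;
            out->has_wscale = true;
        }
        i += size;                  /* skip to the next option */
    }
}
```

Feeding it the options of a typical SYN (MSS, SACK-permitted, timestamp, NOP, window scale 7) sets both flags and records a scale of 7, while an out-of-range scale is clamped to 14.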

/* Process the packet's SACK option and keep the largest right edge */
static void tcp_sack(const struct sk_buff *skb, unsigned int dataoff,
                     const struct tcphdr *tcph, __u32 *sack)
{
    unsigned char buff[(15 * 4) - sizeof(struct tcphdr)];
    const unsigned char *ptr;
    int length = (tcph->doff*4) - sizeof(struct tcphdr);
    __u32 tmp;

    if (!length)
        return;
    /* locate the start of the options */
    ptr = skb_header_pointer(skb, dataoff + sizeof(struct tcphdr),
                 length, buff);
    BUG_ON(ptr == NULL);

    /* Fast path for timestamp-only option */
    /* If the timestamp is the only option present, return right away. */
    if (length == TCPOLEN_TSTAMP_ALIGNED
        && *(__be32 *)ptr == htonl((TCPOPT_NOP << 24)
                       | (TCPOPT_NOP << 16)
                       | (TCPOPT_TIMESTAMP << 8)
                       | TCPOLEN_TIMESTAMP))
        return;

    while (length > 0) {
        int opcode = *ptr++;
        int opsize, i;

        switch (opcode) {
        case TCPOPT_EOL:
            return;
        case TCPOPT_NOP:    /* Ref: RFC 793 section 3.1 */
            length--;
            continue;
        default:
            if (length < 2)
                return;
            opsize = *ptr++;
            if (opsize < 2) /* "silly options" */
                return;
            if (opsize > length)
                return;    /* don't parse partial options */

            if (opcode == TCPOPT_SACK
                && opsize >= (TCPOLEN_SACK_BASE
                      + TCPOLEN_SACK_PERBLOCK)
                && !((opsize - TCPOLEN_SACK_BASE)
                 % TCPOLEN_SACK_PERBLOCK)) {
                for (i = 0;
                     i < (opsize - TCPOLEN_SACK_BASE);
                     i += TCPOLEN_SACK_PERBLOCK) {
                    tmp = get_unaligned_be32((__be32 *)(ptr+i)+1);
                    /* keep the largest SACK right edge */
                    if (after(tmp, *sack))
                        *sack = tmp;
                }
                return;
            }
            ptr += opsize - 2;
            length -= opsize;
        }
    }
}
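
Picking the "largest" right edge has to respect 32-bit sequence number wraparound, which is what `after()` takes care of. The sketch below shows the comparison and the per-block scan in userspace. `seq_after`, `read_be32` and `max_sack_right_edge` are hypothetical names, and the starting value is assumed to be the packet's ACK number supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Read a big-endian 32-bit value without alignment requirements. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * blocks holds n_blocks SACK blocks of 8 bytes each:
 * left edge followed by right edge, both big endian.
 * Returns the latest right edge, or start if none is later.
 */
static uint32_t max_sack_right_edge(const uint8_t *blocks, size_t n_blocks,
                                    uint32_t start)
{
    uint32_t best = start;
    size_t i;

    assert(blocks != NULL || n_blocks == 0);
    for (i = 0; i < n_blocks; i++) {
        uint32_t right = read_be32(blocks + i * 8 + 4);

        if (seq_after(right, best))
            best = right;
    }
    return best;
}
```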

tcp_packet

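Validates the packet, performs the state transition and checks that the new state is legal, then checks the packet against the window.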

/* Returns verdict for packet, or -1 for invalid. */
/* A return value of -1 (-NF_ACCEPT) means the packet is invalid. */
static int tcp_packet(struct nf_conn *ct,
              const struct sk_buff *skb,
              unsigned int dataoff,
              enum ip_conntrack_info ctinfo,
              unsigned int *timeouts)
{
    struct net *net = nf_ct_net(ct);
    struct nf_tcp_net *tn = tcp_pernet(net);
    struct nf_conntrack_tuple *tuple;
    enum tcp_conntrack new_state, old_state;
    enum ip_conntrack_dir dir;
    const struct tcphdr *th;
    struct tcphdr _tcph;
    unsigned long timeout;
    unsigned int index;

    th = skb_header_pointer(skb, dataoff, sizeof(_tcph), &_tcph);
    BUG_ON(th == NULL);

    spin_lock_bh(&ct->lock);
    old_state = ct->proto.tcp.state;
    dir = CTINFO2DIR(ctinfo);
    index = get_conntrack_index(th);
    /* look up the next state in the TCP state table */
    new_state = tcp_conntracks[dir][index][old_state];
    tuple = &ct->tuplehash[dir].tuple;

    switch (new_state) {
    case TCP_CONNTRACK_SYN_SENT: /* SYN packet, possibly a retransmitted SYN */
        if (old_state < TCP_CONNTRACK_TIME_WAIT) /* not a reopen; only TIME_WAIT and later need the special handling below */
            break;
        /* RFC 1122: "When a connection is closed actively,
         * it MUST linger in TIME-WAIT state for a time 2xMSL
         * (Maximum Segment Lifetime). However, it MAY accept
         * a new SYN from the remote TCP to reopen the connection
         * directly from TIME-WAIT state, if..."
         * We ignore the conditions because we are in the
         * TIME-WAIT state anyway.
         *
         * Handle aborted connections: we and the server
         * think there is an existing connection but the client
         * aborts it and starts a new one.
         */
        if (((ct->proto.tcp.seen[dir].flags
              | ct->proto.tcp.seen[!dir].flags)
             & IP_CT_TCP_FLAG_CLOSE_INIT) /* either side has initiated a close */
            || (ct->proto.tcp.last_dir == dir
                && ct->proto.tcp.last_index == TCP_RST_SET)) {
            /* Attempt to reopen a closed/aborted connection.
             * Delete this connection and look up again. */
            spin_unlock_bh(&ct->lock);

            /* Only repeat if we can actually remove the timer.
             * Destruction may already be in progress in process
             * context and we must give it a chance to terminate.
             */
            if (nf_ct_kill(ct))
                return -NF_REPEAT; /* REPEAT re-runs the packet so that a new conntrack entry is created */
            return NF_DROP;
        }
        /* Fall through */
    case TCP_CONNTRACK_IGNORE:
        /* Ignored packets:
         *
         * Our connection entry may be out of sync, so ignore
         * packets which may signal the real connection between
         * the client and the server.
         *
         * a) SYN in ORIGINAL
         * b) SYN/ACK in REPLY
         * c) ACK in reply direction after initial SYN in original.
         *
         * If the ignored packet is invalid, the receiver will send
         * a RST we'll catch below.
         */
        if (index == TCP_SYNACK_SET
            && ct->proto.tcp.last_index == TCP_SYN_SET
            && ct->proto.tcp.last_dir != dir
            && ntohl(th->ack_seq) == ct->proto.tcp.last_end) { /* reply to a SYN; it ends up in TCP_CONNTRACK_IGNORE only because our entry is out of sync, see b) below */
            /* b) This SYN/ACK acknowledges a SYN that we earlier
             * ignored as invalid. This means that the client and
             * the server are both in sync, while the firewall is
             * not. We get in sync from the previously annotated
             * values.
             * The firewall's state is then updated to the correct one.
             */
            old_state = TCP_CONNTRACK_SYN_SENT;
            new_state = TCP_CONNTRACK_SYN_RECV;
            ct->proto.tcp.seen[ct->proto.tcp.last_dir].td_end =
                ct->proto.tcp.last_end;
            ct->proto.tcp.seen[ct->proto.tcp.last_dir].td_maxend =
                ct->proto.tcp.last_end;
            ct->proto.tcp.seen[ct->proto.tcp.last_dir].td_maxwin =
                ct->proto.tcp.last_win == 0 ?
                    1 : ct->proto.tcp.last_win;
            ct->proto.tcp.seen[ct->proto.tcp.last_dir].td_scale =
                ct->proto.tcp.last_wscale;
            ct->proto.tcp.last_flags &= ~IP_CT_EXP_CHALLENGE_ACK;
            ct->proto.tcp.seen[ct->proto.tcp.last_dir].flags =
                ct->proto.tcp.last_flags;
            memset(&ct->proto.tcp.seen[dir], 0,
                   sizeof(struct ip_ct_tcp_state));
            break;
        }
        ct->proto.tcp.last_index = index; /* remember the flag index of this packet */
        ct->proto.tcp.last_dir = dir; /* remember its direction */
        ct->proto.tcp.last_seq = ntohl(th->seq); /* remember its sequence number */
        ct->proto.tcp.last_end =
            segment_seq_plus_len(ntohl(th->seq), skb->len, dataoff, th); /* sequence number plus segment length; the length can be recovered from it */
        ct->proto.tcp.last_win = ntohs(th->window);

        /* a) This is a SYN in ORIGINAL. The client and the server
         * may be in sync but we are not. In that case, we annotate
         * the TCP options and let the packet go through. If it is a
         * valid SYN packet, the server will reply with a SYN/ACK, and
         * then we'll get in sync. Otherwise, the server potentially
         * responds with a challenge ACK if implementing RFC5961.
         * In short: record the SYN's TCP options, let it through, and
         * resynchronise when the matching SYN/ACK arrives.
         */
        if (index == TCP_SYN_SET && dir == IP_CT_DIR_ORIGINAL) {
            struct ip_ct_tcp_state seen = {};

            ct->proto.tcp.last_flags =
            ct->proto.tcp.last_wscale = 0;
            tcp_options(skb, dataoff, th, &seen);
            if (seen.flags & IP_CT_TCP_FLAG_WINDOW_SCALE) {
                ct->proto.tcp.last_flags |=
                    IP_CT_TCP_FLAG_WINDOW_SCALE;
                ct->proto.tcp.last_wscale = seen.td_scale;
            }
            if (seen.flags & IP_CT_TCP_FLAG_SACK_PERM) {
                ct->proto.tcp.last_flags |=
                    IP_CT_TCP_FLAG_SACK_PERM;
            }
            /* Mark the potential for RFC5961 challenge ACK,
             * this poses a special problem for LAST_ACK state
             * as ACK is interpreted as ACKing last FIN.
             */
            if (old_state == TCP_CONNTRACK_LAST_ACK)
                ct->proto.tcp.last_flags |=
                    IP_CT_EXP_CHALLENGE_ACK;
        }
        spin_unlock_bh(&ct->lock);
        nf_ct_l4proto_log_invalid(skb, ct, "invalid packet ignored in "
                      "state %s ", tcp_conntrack_names[old_state]);
        return NF_ACCEPT;
    case TCP_CONNTRACK_MAX:
        /* Special case for SYN proxy: when the SYN to the server or
         * the SYN/ACK from the server is lost, the client may transmit
         * a keep-alive packet while in SYN_SENT state. This needs to
         * be associated with the original conntrack entry in order to
         * generate a new SYN with the correct sequence number.
         * In short: the keep-alive has to match the original conntrack
         * entry so that the new SYN gets the correct sequence number.
         */
        if (nfct_synproxy(ct) && old_state == TCP_CONNTRACK_SYN_SENT &&
            index == TCP_ACK_SET && dir == IP_CT_DIR_ORIGINAL &&
            ct->proto.tcp.last_dir == IP_CT_DIR_ORIGINAL &&
            ct->proto.tcp.seen[dir].td_end - 1 == ntohl(th->seq)) {
            pr_debug("nf_ct_tcp: SYN proxy client keep alive\n");
            spin_unlock_bh(&ct->lock);
            return NF_ACCEPT;
        }

        /* Invalid packet */
        pr_debug("nf_ct_tcp: Invalid dir=%i index=%u ostate=%u\n",
             dir, get_conntrack_index(th), old_state);
        spin_unlock_bh(&ct->lock);
        nf_ct_l4proto_log_invalid(skb, ct, "invalid state");
        return -NF_ACCEPT;
    case TCP_CONNTRACK_TIME_WAIT: /* entered after an endpoint receives the peer's FIN and sends its ACK, so this packet must be an ACK */
        /* RFC5961 compliance cause stack to send "challenge-ACK"
         * e.g. in response to spurious SYNs.  Conntrack MUST
         * not believe this ACK is acking last FIN.
         */
        if (old_state == TCP_CONNTRACK_LAST_ACK &&
            index == TCP_ACK_SET &&
            ct->proto.tcp.last_dir != dir &&
            ct->proto.tcp.last_index == TCP_SYN_SET &&
            (ct->proto.tcp.last_flags & IP_CT_EXP_CHALLENGE_ACK)) {
            /* Detected RFC5961 challenge ACK */
            ct->proto.tcp.last_flags &= ~IP_CT_EXP_CHALLENGE_ACK;
            spin_unlock_bh(&ct->lock);
            nf_ct_l4proto_log_invalid(skb, ct, "challenge-ack ignored");
            return NF_ACCEPT; /* Don't change state */
        }
        break;
    case TCP_CONNTRACK_SYN_SENT2: /* a SYN arrived from the peer after a SYN was sent: simultaneous open */
        /* tcp_conntracks table is not smart enough to handle
         * simultaneous open.
         */
        ct->proto.tcp.last_flags |= IP_CT_TCP_SIMULTANEOUS_OPEN;
        break;
    case TCP_CONNTRACK_SYN_RECV: /* in a simultaneous open, an ACK from the peer means the connection is established */
        if (dir == IP_CT_DIR_REPLY && index == TCP_ACK_SET &&
            ct->proto.tcp.last_flags & IP_CT_TCP_SIMULTANEOUS_OPEN)
            new_state = TCP_CONNTRACK_ESTABLISHED;
        break;
    case TCP_CONNTRACK_CLOSE:
        if (index == TCP_RST_SET
            && (ct->proto.tcp.seen[!dir].flags & IP_CT_TCP_FLAG_MAXACK_SET)
            && before(ntohl(th->seq), ct->proto.tcp.seen[!dir].td_maxack)) {
            /* Invalid RST  */
            spin_unlock_bh(&ct->lock);
            nf_ct_l4proto_log_invalid(skb, ct, "invalid rst");
            return -NF_ACCEPT;
        }
        if (index == TCP_RST_SET
            && ((test_bit(IPS_SEEN_REPLY_BIT, &ct->status)
             && ct->proto.tcp.last_index == TCP_SYN_SET)
            || (!test_bit(IPS_ASSURED_BIT, &ct->status)
                && ct->proto.tcp.last_index == TCP_ACK_SET))
            && ntohl(th->ack_seq) == ct->proto.tcp.last_end) {
            /* RST sent to invalid SYN or ACK we had let through
             * at a) and c) above:
             *
             * a) SYN was in window then
             * c) we hold a half-open connection.
             *
             * Delete our connection entry.
             * We skip window checking, because packet might ACK
             * segments we ignored. */
            goto in_window;
        }
        /* Just fall through */
    default:
        /* Keep compilers happy. */
        break;
    }
    /* Check whether the packet is inside the window; retransmissions are handled there too */
    if (!tcp_in_window(ct, &ct->proto.tcp, dir, index, /* returns 1 if the packet is in window, 0 otherwise */
               skb, dataoff, th)) {
        spin_unlock_bh(&ct->lock);
        return -NF_ACCEPT;
    }
     in_window:
    /* From now on we have got in-window packets */
    ct->proto.tcp.last_index = index;
    ct->proto.tcp.last_dir = dir;

    pr_debug("tcp_conntracks: ");
    nf_ct_dump_tuple(tuple);
    pr_debug("syn=%i ack=%i fin=%i rst=%i old=%i new=%i\n",
         (th->syn ? 1 : 0), (th->ack ? 1 : 0),
         (th->fin ? 1 : 0), (th->rst ? 1 : 0),
         old_state, new_state);

    ct->proto.tcp.state = new_state; /* commit the new state */
    if (old_state != new_state
        && new_state == TCP_CONNTRACK_FIN_WAIT)
        ct->proto.tcp.seen[dir].flags |= IP_CT_TCP_FLAG_CLOSE_INIT; /* this direction sent the first FIN */

    if (ct->proto.tcp.retrans >= tn->tcp_max_retrans && /* too many retransmissions: use the shorter timeout */
        timeouts[new_state] > timeouts[TCP_CONNTRACK_RETRANS])
        timeout = timeouts[TCP_CONNTRACK_RETRANS];
    else if ((ct->proto.tcp.seen[0].flags | ct->proto.tcp.seen[1].flags) &
         IP_CT_TCP_FLAG_DATA_UNACKNOWLEDGED &&
         timeouts[new_state] > timeouts[TCP_CONNTRACK_UNACK])
        timeout = timeouts[TCP_CONNTRACK_UNACK];
    else if (ct->proto.tcp.last_win == 0 &&
         timeouts[new_state] > timeouts[TCP_CONNTRACK_RETRANS])
        timeout = timeouts[TCP_CONNTRACK_RETRANS];
    else
        timeout = timeouts[new_state];
    spin_unlock_bh(&ct->lock);

    if (new_state != old_state) /* state changed: cache a protocol-info event */
        nf_conntrack_event_cache(IPCT_PROTOINFO, ct);
    /* no packet has been seen in the reply direction yet */
    if (!test_bit(IPS_SEEN_REPLY_BIT, &ct->status)) {
        /* If only reply is a RST, we can consider ourselves not to
           have an established connection: this is a fairly common
           problem case, so we can delete the conntrack
           immediately.  --RR */
        if (th->rst) {
            nf_ct_kill_acct(ct, ctinfo, skb);
            return NF_ACCEPT;
        }
        /* ESTABLISHED without SEEN_REPLY, i.e. mid-connection
         * pickup with loose=1. Avoid large ESTABLISHED timeout.
         */
        if (new_state == TCP_CONNTRACK_ESTABLISHED &&
            timeout > timeouts[TCP_CONNTRACK_UNACK])
            timeout = timeouts[TCP_CONNTRACK_UNACK];
    } else if (!test_bit(IPS_ASSURED_BIT, &ct->status)
           && (old_state == TCP_CONNTRACK_SYN_RECV
               || old_state == TCP_CONNTRACK_ESTABLISHED)
           && new_state == TCP_CONNTRACK_ESTABLISHED) {
        /* Set ASSURED if we see valid ack in ESTABLISHED
           after SYN_RECV or a valid answer for a picked up
           connection. The connection is established, so flag
           it to keep it from being dropped early. */
        set_bit(IPS_ASSURED_BIT, &ct->status);
        nf_conntrack_event_cache(IPCT_ASSURED, ct);
    }
    /* refresh the connection timeout */
    nf_ct_refresh_acct(ct, ctinfo, skb, timeout);

    return NF_ACCEPT;
}

tcp_get_timeouts

Returns the array of TCP timeouts. The timeout applied to a connection depends on its state.

/* default TCP timeouts, indexed by conntrack state */
static const unsigned int tcp_timeouts[TCP_CONNTRACK_TIMEOUT_MAX] = {
    [TCP_CONNTRACK_SYN_SENT]    = 2 MINS,
    [TCP_CONNTRACK_SYN_RECV]    = 60 SECS,
    [TCP_CONNTRACK_ESTABLISHED]    = 5 DAYS,
    [TCP_CONNTRACK_FIN_WAIT]    = 2 MINS,
    [TCP_CONNTRACK_CLOSE_WAIT]    = 60 SECS,
    [TCP_CONNTRACK_LAST_ACK]    = 30 SECS,
    [TCP_CONNTRACK_TIME_WAIT]    = 2 MINS,
    [TCP_CONNTRACK_CLOSE]        = 10 SECS,
    [TCP_CONNTRACK_SYN_SENT2]    = 2 MINS,
/* RFC1122 says the R2 limit should be at least 100 seconds.
   Linux uses 15 packets as limit, which corresponds
   to ~13-30min depending on RTO. */
    [TCP_CONNTRACK_RETRANS]        = 5 MINS,
    [TCP_CONNTRACK_UNACK]        = 5 MINS,
};

static unsigned int *tcp_get_timeouts(struct net *net)
{
    return tcp_pernet(net)->timeouts;
}
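
Initialisers such as `2 MINS` and `5 DAYS` are not standard C on their own; they rely on postfix-style macros that expand to multiplications by the tick rate. The following sketch shows how such macros can work. The value of `HZ` here is an assumption for illustration only, and `demo_timeouts` and `ticks_to_seconds` are hypothetical names.

```c
#include <assert.h>

#define HZ 1000             /* assumed tick rate, for this sketch only */
#define SECS * HZ
#define MINS * 60 SECS
#define HOURS * 60 MINS
#define DAYS * 24 HOURS

/* "2 MINS" expands to "2 * 60 * HZ", i.e. a tick count. */
static const unsigned long demo_timeouts[] = {
    2 MINS,
    60 SECS,
    5 DAYS,
};

/* Convert a tick count back to seconds; expects whole seconds. */
static unsigned long ticks_to_seconds(unsigned long ticks)
{
    assert(ticks % HZ == 0);
    return ticks / HZ;
}
```

Under these assumptions, `5 DAYS` corresponds to 432000 seconds, which is why ESTABLISHED connections stay in the table far longer than connections in any closing state.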

tcp_can_early_drop

Reports whether a conntrack entry may be destroyed early. Used for garbage collection.

static bool tcp_can_early_drop(const struct nf_conn *ct)
{
    switch (ct->proto.tcp.state) {
    case TCP_CONNTRACK_FIN_WAIT:
    case TCP_CONNTRACK_LAST_ACK:
    case TCP_CONNTRACK_TIME_WAIT:
    case TCP_CONNTRACK_CLOSE:
    case TCP_CONNTRACK_CLOSE_WAIT:
        return true;
    default:
        break;
    }

    return false;
}