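When the kernel's TCP stack receives a SYN segment, it uses the segment's destination IP and port to find a local socket in the LISTEN state, and the handshake proceeds on that socket.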
The current listener hashtable is hashed by port only. When a process is listening at many IP addresses with the same port (e.g.[IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener() performance is degraded to a link list. It is prone to syn attack.
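Before 4.17, TCP listener sockets were hashed by port alone and then inserted into the corresponding collision chain. As a result, when many listening sockets listen on the same port, that chain grows long. The situation became worse after 3.9 introduced SO_REUSEPORT, which lets several sockets bind the same address and port.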
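For example, a host starts 6 listeners that all listen on port 21, so they all end up on the same chain (sk_B uses SO_REUSEPORT). If a SYN connection request for 1.1.1.4:21 arrives, the listener lookup always starts at the head of the chain and walks it until it reaches the matching sk_D.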
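The cost of a port-only hash can be sketched with a toy table. This is a minimal illustration, not kernel code: `toy_listener`, `toy_bucket` and the `toy_port_*` helpers are hypothetical names, addresses are kept in host byte order, and new entries are assumed to go to the head of the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 32u

/* Hypothetical stand-in for a listening socket; not a kernel structure. */
struct toy_listener {
    uint32_t addr;              /* bound IPv4 address, host byte order */
    uint16_t port;              /* bound port */
    struct toy_listener *next;  /* next entry on the same collision chain */
};

struct toy_bucket {
    struct toy_listener *head;
    unsigned int count;
};

/* Port-only hash: every listener on the same port shares one bucket. */
static unsigned int toy_port_hash(uint16_t port)
{
    return port % TOY_NBUCKETS;
}

/* Assumption for this sketch: new listeners go to the head of the chain. */
static void toy_port_insert(struct toy_bucket *table, struct toy_listener *l)
{
    struct toy_bucket *b;

    assert(table != NULL && l != NULL);
    b = &table[toy_port_hash(l->port)];
    l->next = b->head;
    b->head = l;
    b->count++;
}

/* Walk the chain until address and port both match.
 * *steps reports how many chain nodes were visited. */
static struct toy_listener *toy_port_lookup(struct toy_bucket *table,
                                            uint32_t daddr, uint16_t dport,
                                            unsigned int *steps)
{
    struct toy_listener *l;

    assert(table != NULL && steps != NULL);
    *steps = 0;
    for (l = table[toy_port_hash(dport)].head; l != NULL; l = l->next) {
        (*steps)++;
        if (l->port == dport && l->addr == daddr)
            return l;
    }
    return NULL;
}
```

With six listeners on port 21, all six land in one bucket, and the number of nodes visited depends only on where the target happens to sit in the chain.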
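4.17 added a new hashtable (lhash2) for organizing listening sockets. lhash2 uses port+addr as the hash key, while the original hashtable, hashed by port alone, stays unchanged. In other words, the same listening socket is placed in both hashtables at once. A listener bound to 0.0.0.0 is no exception: it is hashed into lhash2 with INADDR_ANY as its address, which is why the lookup code probes lhash2 a second time with INADDR_ANY.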
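Because lhash2 adds addr to the key, the hash values are spread more widely. Taking the example above again, the former sk_A through sk_C may now hash to other collision chains. At the same time, an sk_E that used to sit on a different chain may of course hash into the lhash2[0] chain.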
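The effect of adding the address to the key can be sketched by counting chain lengths under both hash functions. This is a toy illustration under stated assumptions: `toy_portaddr_hash` is a deliberately trivial stand-in chosen so the result is easy to follow, and it is not how the kernel's `ipv4_portaddr_hash()` mixes its inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 8u  /* power of two so that masking works */

/* Port-only key: the address plays no role. */
static unsigned int toy_port_hash(uint16_t port)
{
    return port & (TOY_NBUCKETS - 1u);
}

/* Toy port+addr key (hypothetical); the real kernel hash mixes far better. */
static unsigned int toy_portaddr_hash(uint32_t addr, uint16_t port)
{
    return (addr + port) & (TOY_NBUCKETS - 1u);
}

/* Count how long the longest chain would be in each table
 * for n listeners that share one port but have distinct addresses. */
static void toy_chain_lengths(const uint32_t *addrs, unsigned int n,
                              uint16_t port,
                              unsigned int *max_port_chain,
                              unsigned int *max_portaddr_chain)
{
    unsigned int by_port[TOY_NBUCKETS] = {0};
    unsigned int by_portaddr[TOY_NBUCKETS] = {0};
    unsigned int i;

    assert(addrs != NULL && max_port_chain != NULL &&
           max_portaddr_chain != NULL);
    *max_port_chain = 0;
    *max_portaddr_chain = 0;
    for (i = 0; i < n; i++) {
        unsigned int c1 = ++by_port[toy_port_hash(port)];
        unsigned int c2 = ++by_portaddr[toy_portaddr_hash(addrs[i], port)];

        if (c1 > *max_port_chain)
            *max_port_chain = c1;
        if (c2 > *max_portaddr_chain)
            *max_portaddr_chain = c2;
    }
}
```

For six listeners on 1.1.1.1 through 1.1.1.6, all on port 21, the port-only table puts all six on one chain, while this toy port+addr key gives each its own bucket. A real hash will not always separate them this cleanly.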
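So when looking up a listening socket, the kernel uses the port+addr from the SYN segment to work out which chain a matching socket would be on in each of the two hashtables, and then compares the lengths of the two chains. If the 1st chain is short (at most 10 entries in the code below) or is shorter than the 2nd chain, the lookup is done the original way, in the 1st chain. Otherwise it is done in the 2nd chain.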
                                    struct inet_hashinfo *hashinfo, struct sk_buff *skb, int doff,
@@ -217,10 +306,42 @@ struct sock *__inet_lookup_listener(struct net *net,
        unsigned int hash = inet_lhashfn(net, hnum);
        struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
        bool exact_dif = inet_exact_dif_match(net, skb);
+       struct inet_listen_hashbucket *ilb2;
        struct sock *sk, *result = NULL;
        int score, hiscore = 0;
+       unsigned int hash2;
        u32 phash = 0;
 
+       if (ilb->count <= 10 || !hashinfo->lhash2)
+               goto port_lookup;
+
+       /* Too many sk in the ilb bucket (which is hashed by port alone).
+        * Try lhash2 (which is hashed by port and addr) instead.
+        */
+
+       hash2 = ipv4_portaddr_hash(net, daddr, hnum);
+       ilb2 = inet_lhash2_bucket(hashinfo, hash2);
+       if (ilb2->count > ilb->count)
+               goto port_lookup;
+
+       result = inet_lhash2_lookup(net, ilb2, skb, doff,
+                                   saddr, sport, daddr, hnum,
+                                   dif, sdif);
+       if (result)
+               return result;
+
+       /* Lookup lhash2 with INADDR_ANY */
+
+       hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+       ilb2 = inet_lhash2_bucket(hashinfo, hash2);
+       if (ilb2->count > ilb->count)
+               goto port_lookup;
+
+       return inet_lhash2_lookup(net, ilb2, skb, doff,
+                                 saddr, sport, daddr, hnum,
+                                 dif, sdif);
+
+port_lookup:
        sk_for_each_rcu(sk, &ilb->head) {
                score = compute_score(sk, net, hnum, daddr,
                                      dif, sdif, exact_dif);
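The table selection in the 4.17 diff boils down to two comparisons, which can be isolated as a small pure function. This is a sketch for illustration only: `toy_choose_table`, `toy_table` and `TOY_SHORT_CHAIN` are hypothetical names, and the kernel makes this decision inline (and once more for the INADDR_ANY bucket) rather than through a helper like this.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_table { TOY_PORT_TABLE, TOY_PORTADDR_TABLE };

#define TOY_SHORT_CHAIN 10u  /* the "<= 10" threshold seen in the diff */

/* Decide which chain to walk, given the two candidate chain lengths.
 * have_lhash2 models the "!hashinfo->lhash2" check. */
static enum toy_table toy_choose_table(unsigned int port_chain_len,
                                       unsigned int portaddr_chain_len,
                                       bool have_lhash2)
{
    /* A short port chain is cheap enough to walk directly. */
    if (port_chain_len <= TOY_SHORT_CHAIN || !have_lhash2)
        return TOY_PORT_TABLE;
    /* The port+addr chain only helps if it is not the longer one. */
    if (portaddr_chain_len > port_chain_len)
        return TOY_PORT_TABLE;
    return TOY_PORTADDR_TABLE;
}

/* Usage example: a few representative chain-length combinations. */
static void toy_choose_table_demo(void)
{
    assert(toy_choose_table(6, 1, true) == TOY_PORT_TABLE);
    assert(toy_choose_table(50, 3, true) == TOY_PORTADDR_TABLE);
    assert(toy_choose_table(50, 80, true) == TOY_PORT_TABLE);
}
```

Note that with only 6 listeners, as in the earlier example, the port-only chain is still preferred, so the second table pays off only once a port bucket holds more than 10 sockets.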
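In 5.0 the kernel changed the lookup again so that it searches only the 2nd hashtable. The reason is a problem with the earlier scheme: if the lookup went through the 1st hashtable while a wildcard address (0.0.0.0) and a specific address (for example 1.1.1.1) were both listening on the same port, the wildcard listener could be matched instead of the specific one. This is not actually 4.17's fault. The problem has existed ever since SO_REUSEPORT was introduced in 3.9!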
Let's see what happens:
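sk_A and sk_B both have SO_REUSEPORT set and both listen on port 21. If sk_A is started later, it is added at the head of the chain. When a segment for 1.1.1.2:21 arrives, the kernel finds that sk_A already matches and never goes on to try sk_B, which is the more exact match! This is clearly bad. Before SO_REUSEPORT entered the kernel, the kernel traversed the whole chain and scored how well each socket matched (compute_score).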
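The difference between the two behaviours can be sketched as follows. This is a simplified model, not the kernel's code: `toy_listener`, `toy_score`, `toy_lookup_shortcut` and `toy_lookup_full` are hypothetical names, the scores only loosely follow the pre-5.0 compute_score(), and the reuseport group selection is assumed to hand back the socket that triggered it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_listener {
    uint32_t addr;   /* 0 stands for the wildcard address 0.0.0.0 */
    uint16_t port;
    bool reuseport;
};

/* Simplified scoring: -1 means no match; an exact address match
 * scores higher than a wildcard match. */
static int toy_score(const struct toy_listener *l,
                     uint32_t daddr, uint16_t dport)
{
    int score;

    if (l->port != dport)
        return -1;
    score = 2;
    if (l->addr != 0) {
        if (l->addr != daddr)
            return -1;
        score += 4;
    }
    return score;
}

/* Chain walk with the SO_REUSEPORT shortcut: the first reuseport
 * socket that beats the current score ends the search. */
static const struct toy_listener *
toy_lookup_shortcut(const struct toy_listener *chain, size_t n,
                    uint32_t daddr, uint16_t dport)
{
    const struct toy_listener *result = NULL;
    int hiscore = 0;
    size_t i;

    assert(chain != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int score = toy_score(&chain[i], daddr, dport);

        if (score > hiscore) {
            if (chain[i].reuseport)
                return &chain[i];  /* assumption: group selection picks this one */
            result = &chain[i];
            hiscore = score;
        }
    }
    return result;
}

/* Full scan: every socket is scored and the best one wins. */
static const struct toy_listener *
toy_lookup_full(const struct toy_listener *chain, size_t n,
                uint32_t daddr, uint16_t dport)
{
    const struct toy_listener *result = NULL;
    int hiscore = 0;
    size_t i;

    assert(chain != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int score = toy_score(&chain[i], daddr, dport);

        if (score > hiscore) {
            result = &chain[i];
            hiscore = score;
        }
    }
    return result;
}
```

With a wildcard reuseport listener at the head of the chain and a listener bound to 1.1.1.2 behind it, the shortcut version returns the wildcard socket for a segment to 1.1.1.2:21, while the full scan returns the exact match.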
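5.0 changed the lookup to search only the 2nd hashtable, and it also changed the implementation of compute_score: if the listening address differs from the destination address passed in, the match fails outright. Previously, a wildcard address passed this check directly. Wildcard listeners are still found, because the second lhash2 probe now passes htonl(INADDR_ANY) as the address to compare against.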
The change to the lookup:
 struct sock *__inet_lookup_listener(struct net *net,
                                     const __be32 daddr, const unsigned short hnum,
                                     const int dif, const int sdif)
 {
-       unsigned int hash = inet_lhashfn(net, hnum);
-       struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
-       bool exact_dif = inet_exact_dif_match(net, skb);
        struct inet_listen_hashbucket *ilb2;
-       struct sock *sk, *result = NULL;
-       int score, hiscore = 0;
+       struct sock *result = NULL;
        unsigned int hash2;
-       u32 phash = 0;
-
-       if (ilb->count <= 10 || !hashinfo->lhash2)
-               goto port_lookup;
-
-       /* Too many sk in the ilb bucket (which is hashed by port alone).
-        * Try lhash2 (which is hashed by port and addr) instead.
-        */
 
        hash2 = ipv4_portaddr_hash(net, daddr, hnum);
        ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-       if (ilb2->count > ilb->count)
-               goto port_lookup;
 
        result = inet_lhash2_lookup(net, ilb2, skb, doff,
                                    saddr, sport, daddr, hnum,
@@ -335,34 +313,12 @@ struct sock *__inet_lookup_listener(struct net *net,
                goto done;
 
        /* Lookup lhash2 with INADDR_ANY */
-
        hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
        ilb2 = inet_lhash2_bucket(hashinfo, hash2);
-       if (ilb2->count > ilb->count)
-               goto port_lookup;
 
        result = inet_lhash2_lookup(net, ilb2, skb, doff,
-                                   saddr, sport, daddr, hnum,
+                                   saddr, sport, htonl(INADDR_ANY), hnum,
                                    dif, sdif);
-       goto done;
-
-port_lookup:
-       sk_for_each_rcu(sk, &ilb->head) {
-               score = compute_score(sk, net, hnum, daddr,
-                                     dif, sdif, exact_dif);
-               if (score > hiscore) {
-                       if (sk->sk_reuseport) {
-                               phash = inet_ehashfn(net, daddr, hnum,
-                                                    saddr, sport);
-                               result = reuseport_select_sock(sk, phash,
-                                                              skb, doff);
-                               if (result)
-                                       goto done;
-                       }
-                       result = sk;
-                       hiscore = score;
-               }
-       }
The change to the scoring:
@@ -234,24 +234,16 @@ static inline int compute_score(struct sock *sk, struct net *net,
                                const int dif, const int sdif, bool exact_dif)
 {
        int score = -1;
-       struct inet_sock *inet = inet_sk(sk);
-       bool dev_match;
 
-       if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
+       if (net_eq(sock_net(sk), net) && sk->sk_num == hnum &&
                        !ipv6_only_sock(sk)) {
-               __be32 rcv_saddr = inet->inet_rcv_saddr;
-               score = sk->sk_family == PF_INET ? 2 : 1;
-               if (rcv_saddr) {
-                       if (rcv_saddr != daddr)
-                               return -1;
-                       score += 4;
-               }
-               dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
-                                                dif, sdif);
-               if (!dev_match)
+               if (sk->sk_rcv_saddr != daddr)
+                       return -1;
+
+               if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
                        return -1;
-               score += 4;
 
+               score = sk->sk_family == PF_INET ? 2 : 1;
                if (sk->sk_incoming_cpu == raw_smp_processor_id())
                        score++;
        }
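Taken together, the two 5.0 changes amount to a two-pass lookup with a strict address comparison in each pass. The sketch below models only that ordering: `toy_listener`, `toy_pass` and `toy_lookup_listener` are hypothetical names, and a flat array scan stands in for the two lhash2 bucket walks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INADDR_ANY 0u  /* stands for 0.0.0.0 in this sketch */

struct toy_listener {
    uint32_t addr;  /* bound address, TOY_INADDR_ANY for a wildcard bind */
    uint16_t port;
};

/* One pass: only listeners whose bound address equals addr exactly
 * are candidates, so a wildcard listener never matches a specific addr. */
static const struct toy_listener *
toy_pass(const struct toy_listener *set, size_t n,
         uint32_t addr, uint16_t port)
{
    size_t i;

    assert(set != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (set[i].port == port && set[i].addr == addr)
            return &set[i];
    }
    return NULL;
}

/* Pass 1 looks for a listener bound to the destination address;
 * pass 2 falls back to wildcard listeners only if pass 1 found nothing. */
static const struct toy_listener *
toy_lookup_listener(const struct toy_listener *set, size_t n,
                    uint32_t daddr, uint16_t dport)
{
    const struct toy_listener *result = toy_pass(set, n, daddr, dport);

    if (result != NULL)
        return result;
    return toy_pass(set, n, TOY_INADDR_ANY, dport);
}
```

Because the specific-address pass always runs first, the order in which the listeners were started no longer decides which one receives the connection.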
inet: Add a 2nd listener hashtable (port+addr) inet_connection_sock.h
inet: Add a 2nd listener hashtable (port+addr) inet_hashtables.h
inet: Add a 2nd listener hashtable (port+addr) inet_hashtables.c
net: tcp: prefer listeners bound to an address inet_hashtables.c