iptables is a tool built on the kernel-space netfilter framework for filtering IP packets and performing network address translation (NAT). It is typically used as a firewall or for load balancing.
Put simply: iptables is a user-space command-line tool that drives several kernel-space iptables modules (which sit on top of the lower-level netfilter modules) to filter packets or rewrite their addresses.
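For example, a minimal sketch of that user-space-to-kernel interaction (the port 8080 and the DROP rule are illustrative choices, not from any real setup):

# Ask the kernel to list the rules currently in the filter table (the default table)
sudo iptables -t filter -L -n -v

# Append a rule to the INPUT chain that drops inbound TCP packets to port 8080
sudo iptables -A INPUT -p tcp --dport 8080 -j DROP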
In K8S, the kube-proxy component calls an iptables Go client to write user-defined iptables chains into the Linux kernel, and to add custom rules to certain tables (such as the filter table or nat table), so that every Node provides Layer 4 load balancing. Because kube-proxy is deployed as a DaemonSet, this process runs on every Node, which makes the load balancing distributed. However, every time a new Service is created in the cluster, iptables rules must be written on every Node, and because iptables stores its rules in a linked list, every lookup is O(n). That is inefficient, so a K8s cluster that uses iptables as its underlying load-balancing technology can only support a small-to-medium number of Services. IPVS is more efficient here: its data structure is a hash table, and it was built for load balancing from the start, so it supports more balancing policies, including rr, wrr, lc, and so on.
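If kube-proxy runs in IPVS mode instead, the difference is easy to inspect on a Node (assuming the ipvsadm tool is installed; the output varies per cluster):

# List IPVS virtual servers and their backends; the scheduler column
# shows the balancing policy in use (rr, wrr, lc, ...)
sudo ipvsadm -Ln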
That does not mean iptables isn't worth learning. On the contrary, upstream K8S uses iptables heavily in its code, and even when IPVS is used, iptables is still required in some situations.
These notes are a summary of what I've learned over the past few days.
As a formula: iptables = 4 tables; each table = (5 built-in chains + user-defined chains); each chain = an ordered list of rules.
A rule matches a packet (the -m option) and tells the packet where to go next (the -j option): jump to another target, or drop it outright.
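For instance (an illustrative rule; the source network 10.0.0.0/8 and the comment text are assumptions for the example):

# -m loads a match extension (here the comment module); -j sets the target/verdict
sudo iptables -A INPUT -s 10.0.0.0/8 -m comment --comment "trusted range" -j ACCEPT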
What happens when a packet arrives? (Ignoring user-defined chains inside the tables.) Roughly: an inbound packet first traverses PREROUTING (raw, mangle, nat); after the routing decision it either goes to INPUT (mangle, filter) and on to a local socket, or to FORWARD (mangle, filter) and then POSTROUTING (mangle, nat) on its way out. Locally generated packets go through OUTPUT (raw, mangle, nat, filter) and then POSTROUTING.
How does kube-proxy use iptables to provide Layer 4 load balancing for a Service?
When a NodePort Service is created, rules pointing at each endpoint IP behind the Service are written into user-defined chains in the kernel of every machine. The NodePort Service flow looks like this:
# (1) The packet first enters the PREROUTING chain and jumps to the KUBE-SERVICES chain
sudo iptables-save -t nat | grep -- '-A PREROUTING'
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

# (2) It then jumps to the KUBE-NODEPORTS chain
sudo iptables-save -t nat | grep -- '-A KUBE-SERVICES'
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

# (3) The default/nginx-demo NodePort service jumps to the KUBE-SVC-JKOCBQALQGD3X3RT chain
sudo iptables-save -t nat | grep -- '-A KUBE-NODEPORTS'
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp --dport 32719 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp --dport 32719 -j KUBE-SVC-JKOCBQALQGD3X3RT

# (4) In the KUBE-SVC-JKOCBQALQGD3X3RT chain, there is a 0.33333333349 probability of jumping to KUBE-SEP-HWWSIA644OJY5W7C;
# of the remaining 2/3, a 0.50000000000 probability of jumping to KUBE-SEP-5Z6HLG57ALXCA2BN; what remains jumps to KUBE-SEP-HE7NEHV2WH3AYFZT.
# That is, traffic is spread with equal probability across KUBE-SEP-HWWSIA644OJY5W7C, KUBE-SEP-5Z6HLG57ALXCA2BN, and KUBE-SEP-HE7NEHV2WH3AYFZT.
sudo iptables-save -t nat | grep -- '-A KUBE-SVC-JKOCBQALQGD3X3RT'
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-HWWSIA644OJY5W7C
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-5Z6HLG57ALXCA2BN
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -j KUBE-SEP-HE7NEHV2WH3AYFZT
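The final hop happens in the KUBE-SEP-* chains: each one DNATs the packet to a single endpoint pod. A sketch of what such a rule looks like (the pod address 10.244.1.15:80 is hypothetical; the actual IP and port come from the Service's endpoints):

# (5) Each KUBE-SEP-* chain rewrites the destination to one endpoint pod
sudo iptables-save -t nat | grep -- '-A KUBE-SEP-HWWSIA644OJY5W7C'
-A KUBE-SEP-HWWSIA644OJY5W7C -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp -j DNAT --to-destination 10.244.1.15:80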
What tables does iptables contain?
iptables contains the following tables:
filter: This is the default table (if no -t option is passed). It contains the chains:
INPUT (for packets destined to local sockets),
FORWARD (for packets being routed through the box),
OUTPUT (for locally-generated packets)
nat: The network address translation table, e.g. for SNAT or DNAT (see the sketch after this list). It contains the chains:
PREROUTING (for altering packets as soon as they come in),
OUTPUT (for altering locally-generated packets before routing),
POSTROUTING (for altering packets as they are about to go out)
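A hedged sketch of typical nat rules (the interface eth0 and the backend address 192.168.1.10:8080 are assumptions for the example):

# SNAT in POSTROUTING: masquerade traffic leaving through eth0
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# DNAT in PREROUTING: redirect inbound TCP port 80 to a backend
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:8080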
mangle: This table is used for specialized packet alteration (see the sketch after this list). It contains the chains:
PREROUTING (for altering incoming packets before routing),
INPUT (for packets coming into the box itself),
OUTPUT (for altering locally-generated packets before routing),
FORWARD (for altering packets being routed through the box),
POSTROUTING (for altering packets as they are about to go out)
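For example, a hedged mangle rule (marking SSH traffic; the mark value 1 is arbitrary):

# Set a firewall mark on inbound TCP packets to port 22; other subsystems
# (e.g. policy routing or tc) can then act on that mark
sudo iptables -t mangle -A PREROUTING -p tcp --dport 22 -j MARK --set-mark 1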
raw: This table is used mainly for configuring exemptions from connection tracking in combination with the NOTRACK target. It registers at the netfilter hooks with higher priority and is thus called before ip_conntrack, or any other IP tables. It contains the chains:
PREROUTING (for packets arriving via any network interface),
OUTPUT (for packets generated by local processes)
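A hedged example of a raw rule (skipping conntrack for DNS; UDP port 53 is the illustrative choice):

# Exempt inbound DNS queries from connection tracking
sudo iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK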