The iptables rules on a node are generated by kube-proxy; for the concrete implementation, see the kube-proxy source code.
kube-proxy only modifies the filter and nat tables. It extends the built-in chains with its own: KUBE-SERVICES, KUBE-NODEPORTS, KUBE-POSTROUTING, KUBE-MARK-MASQ and KUBE-MARK-DROP, and configures traffic routing mainly by adding rules to the KUBE-SERVICES chain (attached to PREROUTING and OUTPUT). The official definitions are as follows:
// the services chain
kubeServicesChain utiliptables.Chain = "KUBE-SERVICES"
// the external services chain
kubeExternalServicesChain utiliptables.Chain = "KUBE-EXTERNAL-SERVICES"
// the nodeports chain
kubeNodePortsChain utiliptables.Chain = "KUBE-NODEPORTS"
// the kubernetes postrouting chain
kubePostroutingChain utiliptables.Chain = "KUBE-POSTROUTING"
// the mark-for-masquerade chain
KubeMarkMasqChain utiliptables.Chain = "KUBE-MARK-MASQ" /* qualifying packets get mark 0x4000 and are MASQUERADEd in the KUBE-POSTROUTING chain */
// the mark-for-drop chain
KubeMarkDropChain utiliptables.Chain = "KUBE-MARK-DROP" /* traffic that matches no jump rule gets mark 0x8000 and is dropped in the filter table */
// the kubernetes forward chain
kubeForwardChain utiliptables.Chain = "KUBE-FORWARD"
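To see where these custom chains hook into the standard tables on a node, you can dump the built-in chains; the lines below are a sketch of what to expect with an iptables-mode kube-proxy (the exact comments may vary by version):

iptables -t nat -S PREROUTING | grep KUBE-SERVICES
# expected: -A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
iptables -t nat -S OUTPUT | grep KUBE-SERVICES
# expected: -A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
iptables -t nat -S POSTROUTING | grep KUBE-POSTROUTING
# expected: -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING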
KUBE-MARK-MASQ and KUBE-MARK-DROP
These two chains are used to mark packets that pass through them; marked packets then receive the corresponding treatment. The marking is done as follows:
(Note: for how iptables marks work, see
https://unix.stackexchange.com/questions/282993/how-to-add-marks-together-in-iptables-targets-mark-and-connmark
http://ipset.netfilter.org/iptables-extensions.man.html)
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
KUBE-MARK-DROP and KUBE-MARK-MASQ are essentially just uses of the iptables MARK target:
Chain KUBE-MARK-DROP (6 references)
 pkts bytes target prot opt in  out source    destination
    0     0 MARK   all  --  *   *   0.0.0.0/0 0.0.0.0/0   MARK or 0x8000

Chain KUBE-MARK-MASQ (89 references)
 pkts bytes target prot opt in  out source    destination
   88  5280 MARK   all  --  *   *   0.0.0.0/0 0.0.0.0/0   MARK or 0x4000
Every rule in the KUBE-MARK-MASQ chain sets the Kubernetes-specific mark; in the KUBE-POSTROUTING chain, packets on the node that carry this mark are SNATed (source IP MASQUERADE) as they leave the node:
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
Packets marked by KUBE-MARK-DROP, on the other hand, are all dropped in KUBE-FIREWALL:
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
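The mark-then-act pattern can be reproduced outside Kubernetes. The sketch below uses hypothetical TEST-* chains (not part of kube-proxy) on a scratch machine or network namespace, just to show how a MARK set in one chain drives a later MASQUERADE or DROP:

# mark-for-masquerade: set bit 0x4000, then SNAT marked packets in POSTROUTING
iptables -t nat -N TEST-MARK-MASQ
iptables -t nat -A TEST-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
iptables -t nat -N TEST-POSTROUTING
iptables -t nat -A TEST-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
iptables -t nat -A POSTROUTING -j TEST-POSTROUTING
# mark-for-drop: set bit 0x8000, then drop marked packets in the filter table
iptables -t filter -N TEST-FIREWALL
iptables -t filter -A TEST-FIREWALL -m mark --mark 0x8000/0x8000 -j DROP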
KUBE-SVC and KUBE-SEP
kube-proxy then creates a KUBE-SVC- chain for each service and, in the nat table, directs every packet in KUBE-SERVICES whose destination address is that service into the corresponding KUBE-SVC- chain. If no endpoint has been created yet, the KUBE-SVC- chain has no rules, and any incoming packet that fails to match is handled by KUBE-MARK-DROP. In the filter table this is handled as follows: packets that KUBE-SVC fails to handle are dropped via KUBE-FIREWALL.
Chain INPUT (policy ACCEPT 209 packets, 378K bytes)
 pkts bytes target        prot opt in out source    destination
 540K 1370M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
 540K 1370M KUBE-FIREWALL all  --  *  *   0.0.0.0/0 0.0.0.0/0
KUBE-FIREWALL looks like this; it simply drops every packet carrying the drop mark:
Chain KUBE-FIREWALL (2 references)
 pkts bytes target prot opt in out source    destination
    0     0 DROP   all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
Below is the handling for the nexus service. The rule sends packets whose destination IP is 172.21.12.49 (the cluster IP) and whose destination port is 8080 to KUBE-SVC-HVYO5BWEF5HC7MD7:
-A KUBE-SERVICES -d 172.21.12.49/32 -p tcp -m comment --comment "default/sonatype-nexus: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-HVYO5BWEF5HC7MD7
KUBE-SEP chains represent the endpoints behind a KUBE-SVC chain. When the serviceInfo received by kube-proxy contains endpoint information, a jump rule is created for each endpoint. For example, KUBE-SVC-HVYO5BWEF5HC7MD7 above has an endpoint, and its iptables rule is:
-A KUBE-SVC-HVYO5BWEF5HC7MD7 -m comment --comment "oqton-backoffice/sonatype-nexus:" -j KUBE-SEP-ESZGVIJJ5GN2KKUR
KUBE-SEP-ESZGVIJJ5GN2KKUR DNATs every TCP packet passing through it to 172.20.5.141:8080, the address and port exposed inside the container. Combined with the KUBE-SVC handling above, this is exactly the cluster-IP access path: a packet whose destination is the cluster IP and the service port is DNATed to the container IP and the port exposed by the container.
-A KUBE-SEP-ESZGVIJJ5GN2KKUR -p tcp -m comment --comment "oqton-backoffice/sonatype-nexus:" -m tcp -j DNAT --to-destination 172.20.5.141:8080
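The whole cluster-IP path of a service can be traced on a node with a few commands. This is a sketch reusing the example values above (172.21.12.49 and the KUBE-SVC/KUBE-SEP names); substitute your own service:

# find the service's cluster IP and its endpoints
kubectl get svc --all-namespaces | grep sonatype-nexus
kubectl get endpoints --all-namespaces | grep sonatype-nexus
# the KUBE-SERVICES rule that matches the cluster IP
iptables -t nat -S KUBE-SERVICES | grep 172.21.12.49
# the per-service chain it jumps to, and the per-endpoint DNAT behind it
iptables -t nat -S KUBE-SVC-HVYO5BWEF5HC7MD7
iptables -t nat -S KUBE-SEP-ESZGVIJJ5GN2KKUR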
If the service type is NodePort (packets forwarded from an LB to the node also fall into this category), every packet in the KUBE-NODEPORTS chain whose destination is a node port is directed into the corresponding KUBE-SVC- chain. The jump to KUBE-NODEPORTS must be the last rule in KUBE-SERVICES, so iptables first tries to match packets destined for a cluster IP and falls back to the NodePort path only when that fails. The rules below show that in NodePort mode, a packet whose destination IP is the node and whose destination port is the node's exposed port is handed to KUBE-SVC-HVYO5BWEF5HC7MD7, which then performs the DNAT. The only difference between the cluster-IP and NodePort paths is therefore whether KUBE-SERVICES matches on the cluster IP or on the node port.
"-m addrtype --dst-type LOCAL"表示對目的地址是本機地址的報文執行KUBE-NODEPORTS鏈的操做
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-NODEPORTS -p tcp -m comment --comment "oqton-backoffice/sonatype-nexus:" -m tcp --dport 32257 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "oqton-backoffice/sonatype-nexus:" -m tcp --dport 32257 -j KUBE-SVC-HVYO5BWEF5HC7MD7
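To confirm which rules a NodePort request actually hits, watch the packet counters of the two rules above while sending a request to the node port (32257 is the example port from this article; <node-ip> is a placeholder):

iptables -t nat -L KUBE-NODEPORTS -n -v --line-numbers | grep 32257
curl -s http://<node-ip>:32257/ >/dev/null
# the pkts/bytes counters of both the KUBE-MARK-MASQ and the KUBE-SVC rule should increase
iptables -t nat -L KUBE-NODEPORTS -n -v --line-numbers | grep 32257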
If the service sits behind a load balancer, packets arrive inbound from the LB; kube-proxy matches them on the LB's address and hands them to a KUBE-FW- chain, which load-balances them across the endpoints. The rule below sends packets whose destination IP is 50.1.1.1 (the LB's public IP) and destination port is 443 (typically HTTPS) to KUBE-FW-J4ENLV444DNEMLR3. (See the data flow from a Kubernetes ingress to a pod.)
-A KUBE-SERVICES -d 50.1.1.1/32 -p tcp -m comment --comment "kube-system/nginx-ingress-lb:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-J4ENLV444DNEMLR3
The KUBE-FW chain eventually reaches KUBE-SVC-J4ENLV444DNEMLR3, shown below, which lists the LB's 3 endpoints (each endpoint may itself be a service) and uses the statistic match to spread packets across them by probability:
Chain KUBE-SVC-J4ENLV444DNEMLR3 (3 references)
 pkts bytes target                    prot opt in out source    destination
   10   600 KUBE-SEP-ZVUNFBS77WHMPNFT all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kube-system/nginx-ingress-lb:https */ statistic mode random probability 0.33332999982
   18  1080 KUBE-SEP-Y47C2UBHCAA5SP4C all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kube-system/nginx-ingress-lb:https */ statistic mode random probability 0.50000000000
   16   960 KUBE-SEP-QGNNICTBV4CXTTZM all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kube-system/nginx-ingress-lb:https */
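The probabilities are not arbitrary: with N endpoints, rule i only sees traffic that earlier rules did not claim, so matching with probability 1/(N-i+1) gives each endpoint 1/N of the total (1/3, then 1/2 of the remaining 2/3, then the rest). A small sketch that prints the same cascade for N=3 (kube-proxy's exact float formatting differs slightly):

#!/bin/bash
# print the statistic-mode probabilities that split traffic evenly over N endpoints
N=3
for i in $(seq 1 $((N - 1))); do
  awk -v n="$N" -v i="$i" \
    'BEGIN { printf "rule %d: -m statistic --mode random --probability %.11f\n", i, 1/(n-i+1) }'
done
echo "rule $N: no statistic match (takes whatever is left)"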
The three KUBE-SEP chains referenced above are handled as follows; each of them DNATs the packet, rewriting the destination IP from the LB's public IP to the LB container's IP:
-A KUBE-SEP-ZVUNFBS77WHMPNFT -s 172.20.1.231/32 -m comment --comment "kube-system/nginx-ingress-lb:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZVUNFBS77WHMPNFT -p tcp -m comment --comment "kube-system/nginx-ingress-lb:https" -m tcp -j DNAT --to-destination 172.20.1.231:443
-A KUBE-SEP-Y47C2UBHCAA5SP4C -s 172.20.2.191/32 -m comment --comment "kube-system/nginx-ingress-lb:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-Y47C2UBHCAA5SP4C -p tcp -m comment --comment "kube-system/nginx-ingress-lb:https" -m tcp -j DNAT --to-destination 172.20.2.191:443
-A KUBE-SEP-QGNNICTBV4CXTTZM -s 172.20.2.3/32 -m comment --comment "kube-system/nginx-ingress-lb:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-QGNNICTBV4CXTTZM -p tcp -m comment --comment "kube-system/nginx-ingress-lb:https" -m tcp -j DNAT --to-destination 172.20.2.3:443
From the above we can see that the iptables rules on a node cover every service. A service's cluster IP is not a real IP; it exists only so that the actual endpoint addresses can be found, and every packet destined for a cluster IP is DNATed to a pod IP (and port). Between nodes, traffic is actually carried by pod IPs. The cluster IP is a purely node-local concept used for lookup and DNAT: packets whose destination is a cluster IP are only ever sent by the node itself, never by other nodes (there is no route for them), so by default a cluster IP only supports service access from the local node. Cross-node access requires a network plugin such as flannel, which encapsulates pod IP traffic.
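This is easy to verify on a node running kube-proxy in iptables mode: the cluster IP is not bound to any interface and has no dedicated route; it only appears as a match in the nat table. A quick check, reusing the 172.21.12.49 example address from above:

ip addr | grep 172.21.12.49              # no output expected
ip route | grep 172.21.12.49             # no output expected
iptables -t nat -S | grep 172.21.12.49   # only here do the DNAT-related rules show up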
TIPs:
A backend service can also be reached from inside the cluster via {service:servicePort} --> (iptables) DNAT --> {dstPodIP:dstPodPort}.
apiVersion: v1
kind: Service
metadata:
  annotations:
  name: app-test
  namespace: openshift-monitoring
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: cluster
    nodePort: 33333
    port: 44444
    protocol: TCP
    targetPort: 55555
  selector:
    app: app
  sessionAffinity: None
  type: NodePort
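A usage sketch for the Service above, assuming the manifest is saved as app-test-svc.yaml and a pod labeled app=app is listening on port 55555 (<cluster-ip> and <node-ip> are placeholders):

kubectl apply -f app-test-svc.yaml
kubectl -n openshift-monitoring get svc app-test   # note the CLUSTER-IP column
curl http://<cluster-ip>:44444/                    # inside the cluster: DNATed to <pod-ip>:55555
curl http://<node-ip>:33333/                       # from outside: nodePort, same KUBE-SVC chain, same DNAT
iptables -t nat -S KUBE-NODEPORTS | grep 33333     # the matching rules on a node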
Main reference: https://blog.csdn.net/ebay/article/details/52798074
Viewing kube-proxy forwarding rules: http://www.lijiaocn.com/%E9%A1%B9%E7%9B%AE/2017/03/27/Kubernetes-kube-proxy.html