This article walks through requests between two simple services, combined with tcpdump packet captures, to analyze in detail how Kubernetes in IPVS mode implements access to NodePort and ClusterIP services by service name.
Kubernetes networking involves a lot of background knowledge, especially the low-level Linux facilities it builds on. If you are not familiar with Linux network namespaces, veth pairs, bridges and so on, you may want to read the earlier article on Docker networking first, and then the Kubernetes kube-proxy article to learn about kube-proxy's IPVS mode.
The environment in this article uses the flannel network plugin; see the earlier article on installing Kubernetes from binaries for the setup details.
Role | OS | CPU cores | Memory | Hostname | IP | Installed components |
---|---|---|---|---|---|---|
master | 18.04.1-Ubuntu | 4 | 8G | master | 192.168.0.107 | kubectl,kube-apiserver,kube-controller-manager,kube-scheduler,etcd,flanneld,kubelet,kube-proxy |
slave | 18.04.1-Ubuntu | 4 | 4G | slave | 192.168.0.114 | docker,flanneld,kubelet,kube-proxy,coredns |
master node

```
$ route -n -v
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlp3s0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlp3s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-471858815e83
172.30.22.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.30.78.0     172.30.78.0     255.255.255.0   UG    0      0        0 flannel.1
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlp3s0
```
slave node

```
$ route -v -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
172.30.22.0     172.30.22.0     255.255.255.0   UG    0      0        0 flannel.1
172.30.78.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
```
web image

A web service built with Spring Boot listens on port 8080 and exposes an endpoint /header/list. When the endpoint is called, it logs information about the caller's address.
```java
@RequestMapping("/header/list")
public String listHeader(HttpServletRequest request) {
    log.info("host is" + request.getHeader("host"));
    log.info("remoteAddr is " + request.getRemoteHost());
    log.info("remotePort is " + request.getRemotePort());
    return "OK";
}
```
curl image

Based on the alpine image, with only curl installed, so that we can use this command to access the web service.
```dockerfile
FROM alpine:latest
RUN apk update
RUN apk add --upgrade curl
```
To control which node each pod is scheduled on for the analysis below, add a different label to each of the two nodes.

```
$ kubectl label nodes master sample=master
node/master labeled
$ kubectl label nodes slave sample=slave
node/slave labeled
```
The pods for both the web service and the curl client are on the master node. From the topology, this traffic only needs to go through the docker0 bridge on the master node.

Create the web service manifest
```yaml
$ cat > web.yml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: clientip
spec:
  #type: NodePort
  selector:
    app: clientip
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    #nodePort: 8086
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clientip-deployment
spec:
  selector:
    matchLabels:
      app: clientip
  replicas: 1
  template:
    metadata:
      labels:
        app: clientip
    spec:
      nodeSelector:
        sample: master
      containers:
      - name: clientip
        image: 192.168.0.107/k8s/client-ip-test:0.0.2
        ports:
        - containerPort: 8080
EOF
```
Create the manifest for the curl pod

```yaml
$ cat > pod_curl.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - name: curl
    image: 192.168.0.107/k8s/curl:1.0
    command:
    - sleep
    - "3600"
  nodeSelector:
    sample: master
EOF
```
Start the services

```
$ kubectl create -f web.yml -f pod_curl.yml
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
clientip-deployment-5d8b5dcb46-qprps   1/1     Running   0          4s    172.30.22.4   master   <none>           <none>
curl                                   1/1     Running   0          9s    172.30.22.3   master   <none>           <none>
$ kubectl get svc
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
clientip     ClusterIP   10.254.0.30   <none>        8080/TCP   51s
kubernetes   ClusterIP   10.254.0.1    <none>        443/TCP    25d
```
Both pods started successfully and are running on the master node.

Start capturing on the docker0 and flannel.1 devices on the master node
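Before capturing, it can be handy to confirm that the kernel really delivers this pod-to-pod traffic over docker0. A minimal check, assuming the pod IPs from the listing above:

```bash
# On the master node: the web pod's IP (172.30.22.4 above) falls inside
# 172.30.22.0/24, which the routing table binds directly to docker0, so no
# overlay device is involved for same-node traffic.
$ ip route get 172.30.22.4
# expected (roughly): 172.30.22.4 dev docker0 src 172.30.22.1
```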
```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
```

```
$ kubectl exec -it curl curl http://clientip:8080/header/list
OK
```
Analysis of the captured output

Web service log

```
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : host isclientip:8080
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.22.3
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : remotePort is 42000
```
docker0 capture (only the packets for the main flow are shown)
```
172.30.22.3.47980 > 10.254.0.2.53: [bad udp cksum 0xcd6e -> 0xdae6!] 22093+ A? clientip.default.svc.cluster.local. (52)
...
10.254.0.2.53 > 172.30.22.3.47980: [udp sum ok] 22093*- q: A? clientip.default.svc.cluster.local. 1/0/0 clientip.default.svc.cluster.local. A 10.254.0.30 (102)
...
172.30.22.3.42000 > 10.254.0.30.8080: Flags [P.], cksum 0xcdbb (incorrect -> 0x95b1), seq 0:88, ack 1, win 507, options [nop,nop,TS val 3200284558 ecr 1892112994], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
172.30.22.3.42000 > 172.30.22.4.8080: Flags [P.], cksum 0x84c2 (incorrect -> 0xdeaa), seq 1:89, ack 1, win 507, options [nop,nop,TS val 3200284558 ecr 1892112994], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
172.30.22.4.8080 > 172.30.22.3.42000: Flags [P.], cksum 0x84dd (incorrect -> 0xe64b), seq 1:116, ack 89, win 502, options [nop,nop,TS val 1892113104 ecr 3200284558], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Fri, 06 Mar 2020 08:29:05 GMT
    OK[!http]
```
The flannel.1 capture shows nothing related to the request to 172.30.22.4, so it is omitted here.

The curl pod is on the master node and the web pod is on the slave node. From the topology, a request from inside the curl pod to the web service now has to pass through, in order: master.docker0 -> master.flannel.1 -> master.wlp3s0 -> slave.wlo1 -> slave.flannel.1 -> slave.docker0.
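A quick way to confirm the first hops on the master, as a sketch (172.30.78.0/24 is the slave's pod subnet from the routing table above):

```bash
# On the master node: traffic for the slave's pod subnet is routed into the
# flannel.1 VXLAN device, which encapsulates it and sends it out via wlp3s0.
$ ip route get 172.30.78.3
# expected (roughly): 172.30.78.3 via 172.30.78.0 dev flannel.1 ...
```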
Modify the web manifest, change the nodeSelector value to sample: slave, and redeploy the web application.

```
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
clientip-deployment-68c57b7965-pmwp2   1/1     Running   0          33s   172.30.78.3   slave    <none>           <none>
curl                                   1/1     Running   0          48m   172.30.22.3   master   <none>           <none>
$ kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
clientip     ClusterIP   10.254.167.63   <none>        8080/TCP   94s
kubernetes   ClusterIP   10.254.0.1      <none>        443/TCP    25d
```
Monitoring setup

Watch the web service log
$ kubectl logs -f clientip-deployment-68c57b7965-pmwp2
Capture on each network device on the master

```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlp3s0
```
Capture on each network device on the slave

```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlo1
```
Analysis of the captured output

Web service log

```
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : host isclientip:8080
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.22.3
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : remotePort is 51596
```
Captures on the master's network devices (only the main flow is shown; the TCP handshake is omitted)

docker0 device
```
...
11:13:22.346481 IP (tos 0x0, ttl 64, id 28047, offset 0, flags [DF], proto UDP (17), length 80)
    172.30.22.3.35482 > 10.254.0.2.53: [bad udp cksum 0xcd6e -> 0x55df!] 3111+ A? clientip.default.svc.cluster.local. (52)
...
11:13:22.355447 IP (tos 0x0, ttl 62, id 34179, offset 0, flags [DF], proto UDP (17), length 130)
    10.254.0.2.53 > 172.30.22.3.35482: [udp sum ok] 3111*- q: A? clientip.default.svc.cluster.local. 1/0/0 clientip.default.svc.cluster.local. A 10.254.167.63 (102)
...
11:13:22.359009 IP (tos 0x0, ttl 64, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 10.254.167.63.8080: Flags [P.], cksum 0x74dd (incorrect -> 0x0f66), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
11:13:22.372907 IP (tos 0x0, ttl 62, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    10.254.167.63.8080 > 172.30.22.3.51596: Flags [P.], cksum 0x077c (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sat, 07 Mar 2020 03:13:22 GMT
    OK[!http]
```
The response comes back from 10.254.167.63:8080, the same address the request was sent to, so the return path mirrors the outgoing one: on the way back, IPVS masquerading (SNAT) rewrites the real server address back into the virtual service address.

flannel.1 device
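To see the NAT mapping IPVS keeps for this connection, the connection table can be listed while the request is active. A sketch, assuming ipvsadm is installed on the master:

```bash
# On the master node: each IPVS connection entry maps the client, the virtual
# service address, and the real server the request was scheduled to.
$ ipvsadm -Lnc
# expected columns (roughly):
# pro expire state       source             virtual            destination
# TCP 14:58  ESTABLISHED 172.30.22.3:51596  10.254.167.63:8080 172.30.78.3:8080
```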
```
11:13:22.359020 IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xbcc1 (incorrect -> 0xc781), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
11:13:22.372887 IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sat, 07 Mar 2020 03:13:22 GMT
    OK[!http]
```
wlp3s0 (physical NIC)

```
...
11:13:22.359026 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [none], proto UDP (17), length 190)
    192.168.0.107.33404 > 192.168.0.114.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
11:13:22.372815 IP (tos 0x0, ttl 64, id 57065, offset 0, flags [none], proto UDP (17), length 217)
    192.168.0.114.43021 > 192.168.0.107.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sat, 07 Mar 2020 03:13:22 GMT
    OK[!http]
```
Summary of packet flow on the master node (the order in which packets reach each device is derived from the capture timestamps; the red box marks where curl starts)

Captures on the slave's network devices (only the main flow is shown; the TCP handshake is omitted)

docker0 device
```
...
11:13:22.379401 IP (tos 0x0, ttl 62, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
...
11:13:22.389173 IP (tos 0x0, ttl 64, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbcdc (incorrect -> 0xbf97), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
```
flannel.1 device

```
11:13:22.379392 IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
11:13:22.389192 IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbcdc (incorrect -> 0xbf97), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sat, 07 Mar 2020 03:13:22 GMT
    OK[!http]
```
wlo1 (physical NIC)

```
11:13:22.379300 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [none], proto UDP (17), length 190)
    192.168.0.107.33404 > 192.168.0.114.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
    GET /header/list HTTP/1.1
    Host: clientip:8080
    User-Agent: curl/7.67.0
    Accept: */*
...
11:13:22.389223 IP (tos 0x0, ttl 64, id 57065, offset 0, flags [none], proto UDP (17), length 217)
    192.168.0.114.43021 > 192.168.0.107.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sat, 07 Mar 2020 03:13:22 GMT
    OK[!http]
...
```
Request handling on the slave node (the order in which packets reach each device is derived from the capture timestamps; the red box marks where the request enters)

Start only one web service, on the slave, set the service type to NodePort with nodePort set to 8086, and access the web service from the master host with curl http://slaveIp:8086/header/list. (If we accessed it directly from the slave, the packets would never leave the host and could not be seen on the slave's physical NIC, so for the analysis we send the request from the master.)
When a NodePort service is created, Kubernetes picks a port for it from the range given by the API server's --service-node-port-range flag; you can also choose it yourself via .spec.ports[*].nodePort. Kubernetes then listens on that port on every node in the cluster.

Besides listening on the port on every node, Kubernetes also creates a cluster IP for the service automatically, so after creating a NodePort service it can still be accessed inside the cluster by service name + service port, just like in the previous examples. Two quick checks for this are sketched below.

In this case the packets do not need to travel between pods across hosts, so they never touch flannel.1. The path is: request master.wlp3s0 -> slave.wlo1 -> slave.docker0, response slave.docker0 -> slave.wlo1 -> master.wlp3s0.
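A couple of quick checks, as a sketch (the jsonpath expression and port number match this example's service; ss availability on the node is assumed):

```bash
# Which nodePort was allocated for the service (8086 here, since we set it
# explicitly in the manifest).
$ kubectl get svc clientip -o jsonpath='{.spec.ports[0].nodePort}'

# The article notes that every node holds this port; ss shows which process
# owns the listening socket on a node (typically kube-proxy).
$ ss -ltnp | grep 8086
```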
Modify the web manifest

```yaml
$ cat > web.yml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: clientip
spec:
  type: NodePort
  selector:
    app: clientip
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 8086
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clientip-deployment
spec:
  selector:
    matchLabels:
      app: clientip
  replicas: 1
  template:
    metadata:
      labels:
        app: clientip
    spec:
      nodeSelector:
        sample: slave
      containers:
      - name: clientip
        image: 192.168.0.107/k8s/client-ip-test:0.0.2
        ports:
        - containerPort: 8080
EOF
```
Start the service

```
$ kubectl create -f web.yml
service/clientip created
deployment.apps/clientip-deployment created
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
clientip-deployment-68c57b7965-28w4t   1/1     Running   0          10s   172.30.78.3   slave   <none>           <none>
$ kubectl get svc -o wide
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE   SELECTOR
clientip     NodePort    10.254.85.24   <none>        8080:8086/TCP   17s   app=clientip
kubernetes   ClusterIP   10.254.0.1     <none>        443/TCP         27d   <none>
```
Watch the web service log
$ kubectl logs -f clientip-deployment-68c57b7965-28w4t
Capture on the master's wlp3s0 NIC
$ tcpdump -n -vv -i wlp3s0
Capture on each network device on the slave

```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlo1
```
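With the captures running, the request is sent from the master host, as described above. A sketch of the call (192.168.0.114 is the slave's IP from the environment table):

```bash
# From the master host: call the NodePort exposed on the slave.
$ curl http://192.168.0.114:8086/header/list
OK
```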
Web service log

```
2020-03-08 10:15:01.498  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : host is192.168.0.114:8086
2020-03-08 10:15:01.499  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.78.1
2020-03-08 10:15:01.499  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : remotePort is 38362
```
docker0 device

```
...
10:15:01.494019 IP (tos 0x0, ttl 63, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    172.30.78.1.38362 > 172.30.78.3.8080: Flags [P.], cksum 0x171b (correct), seq 0:93, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93: HTTP, length: 93
    GET /header/list HTTP/1.1
    Host: 192.168.0.114:8086
    User-Agent: curl/7.58.0
    Accept: */*
...
10:15:01.503806 IP (tos 0x0, ttl 64, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 192.168.0.107.38362: Flags [P.], cksum 0xbbce (incorrect -> 0x0f9e), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115: HTTP, length: 115
    HTTP/1.1 200
    Content-Type: text/plain;charset=UTF-8
    Content-Length: 2
    Date: Sun, 08 Mar 2020 02:15:01 GMT
    OK[!http]
...
```
First entry: the request enters through docker0 toward 172.30.78.3.8080. Note that it now comes from 172.30.78.1.38362, which is exactly the remoteAddr we see in the web container; the reason is explained in the request-flow analysis below.

Second entry: after the request is handled, the response leaves through docker0, and its destination has become the address that actually issued the request, 192.168.0.107.38362.
flannel.1 device

```
$ tcpdump -n -vv -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
```
wlo1 (physical NIC)

```
...
10:15:01.493998 IP (tos 0x0, ttl 64, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    192.168.0.107.38362 > 192.168.0.114.8086: Flags [P.], cksum 0x8928 (correct), seq 1:94, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93
...
10:15:01.503827 IP (tos 0x0, ttl 63, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    192.168.0.114.8086 > 192.168.0.107.38362: Flags [P.], cksum 0x489f (correct), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115
...
```
master wlp3s0 NIC

```
...
10:15:01.447172 IP (tos 0x0, ttl 64, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    192.168.0.107.38362 > 192.168.0.114.8086: Flags [P.], cksum 0x82b1 (incorrect -> 0x8928), seq 1:94, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93
...
10:15:01.460324 IP (tos 0x0, ttl 63, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    192.168.0.114.8086 > 192.168.0.107.38362: Flags [P.], cksum 0x489f (correct), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115
...
```
Summary of request handling on the slave (the red box marks where the request enters)

The request from master to slave travels like any ordinary request, so it is not described again here.

In the POSTROUTING stage, the iptables rules perform a masquerade (why this happens is explained in the masquerade section below). Routing then takes place: according to the slave's routing table, packets destined for 172.30.78.3:8080 have to go out through docker0, and because of how masquerade works, the source address is rewritten to the address of the docker0 device on the way out.
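A minimal way to see both decisions at once, assuming the addresses above: ask the slave's kernel which device and source address it would pick for the pod IP.

```bash
# On the slave node: the route lookup selects docker0, and the preferred source
# address for that route is docker0's own IP, which is what masquerade uses.
$ ip route get 172.30.78.3
# expected (roughly): 172.30.78.3 dev docker0 src 172.30.78.1
```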
```
$ ip addr
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0d:ab:b0:60 brd ff:ff:ff:ff:ff:ff
    inet 172.30.78.1/24 brd 172.30.78.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:dff:feab:b060/64 scope link
       valid_lft forever preferred_lft forever
```

The corresponding address is 172.30.78.1, which is why the requests we see in the web log and in the docker0 capture come from 172.30.78.1.

After the web container handles the request and builds the response, the return path sees that the request came in through a masquerade, so it looks up the original requester behind the masquerade and sets the response's destination to 192.168.0.107.38362. According to the routing table, packets for 192.168.0.107 have to leave through the physical NIC wlo1, so the response is forwarded to wlo1. Before it leaves the node, IPVS masquerading rewrites the source address to 192.168.0.114, and the packet is sent to the master through wlo1.
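One way to observe both rewrites together is to dump the connection-tracking entry for the NodePort. A sketch, assuming the conntrack tool is installed and that IPVS connections are tracked by netfilter (which kube-proxy typically enables in IPVS mode):

```bash
# On the slave node: the original tuple shows the client hitting the nodePort,
# the reply tuple shows the pod answering toward the docker0 address, i.e. the
# DNAT to the pod plus the masquerade of the client address.
$ conntrack -L -p tcp --dport 8086
# expected (roughly):
# tcp ... src=192.168.0.107 dst=192.168.0.114 sport=38362 dport=8086 \
#         src=172.30.78.3 dst=172.30.78.1 sport=8080 dport=38362 ...
```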
How IPVS determines that 192.168.0.114:8086 is a cluster service

IPVS makes this decision based on its own hash table, so Kubernetes only needs to put the cluster-service information into that hash table. Use the ipvsadm tool to see what Kubernetes stores in the table after a NodePort service is started (unrelated entries are omitted from the output below).

```
$ ipvsadm --list
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  localhost:8086 rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
TCP  slave:8086 rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
TCP  promote.cache-dns.local:http rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
...
```
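By default ipvsadm resolves addresses and ports to names (http-alt is port 8080, and the hostnames above are reverse-resolved addresses). Adding -n keeps everything numeric, which is usually easier to match against the service list:

```bash
# Numeric output: virtual servers show up as IP:port, e.g. the node IP with
# port 8086 and the service's cluster IP with port 8080 (names as resolved
# above may differ in your environment).
$ ipvsadm -Ln
```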
Why the request is masqueraded in the POSTROUTING stage

First, look at the ipsets Kubernetes creates when IPVS mode is used, and what they are for.
set name | members | usage |
---|---|---|
KUBE-CLUSTER-IP | All service IP + port | Mark-Masq for cases that masquerade-all=true or clusterCIDR specified |
KUBE-LOOP-BACK | All service IP + port + IP | masquerade for solving hairpin purpose |
KUBE-EXTERNAL-IP | service external IP + port | masquerade for packets to external IPs |
KUBE-LOAD-BALANCER | load balancer ingress IP + port | masquerade for packets to load balancer type service |
KUBE-LOAD-BALANCER-LOCAL | LB ingress IP + port with externalTrafficPolicy=local | accept packets to load balancer with externalTrafficPolicy=local |
KUBE-LOAD-BALANCER-FW | load balancer ingress IP + port with loadBalancerSourceRanges | packet filter for load balancer with loadBalancerSourceRanges specified |
KUBE-LOAD-BALANCER-SOURCE-CIDR | load balancer ingress IP + port + source CIDR | packet filter for load balancer with loadBalancerSourceRanges specified |
KUBE-NODE-PORT-TCP | nodeport type service TCP port | masquerade for packets to nodePort(TCP) |
KUBE-NODE-PORT-LOCAL-TCP | nodeport type service TCP port with externalTrafficPolicy=local | accept packets to nodeport service with externalTrafficPolicy=local |
KUBE-NODE-PORT-UDP | nodeport type service UDP port | masquerade for packets to nodePort(UDP) |
KUBE-NODE-PORT-LOCAL-UDP | nodeport type service UDP port with externalTrafficPolicy=local | accept packets to nodeport service with externalTrafficPolicy=local |
Next, to understand how Kubernetes uses these ipsets, look at the iptables rules it appends for us.

The output below was taken with the kube-proxy parameters iptables.masqueradeAll=false and clusterCIDR=172.30.0.0/16; with other settings the KUBE-SERVICES chain will look slightly different. The output is trimmed to the Kubernetes-related rules.
```
$ iptables -n -L -t nat
Chain PREROUTING (policy ACCEPT)
target            prot opt source               destination
KUBE-SERVICES     all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target            prot opt source               destination
KUBE-SERVICES     all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target            prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-FIREWALL (0 references)
target            prot opt source               destination
KUBE-MARK-DROP    all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-KUBELET-CANARY (0 references)
target            prot opt source               destination

Chain KUBE-LOAD-BALANCER (0 references)
target            prot opt source               destination
KUBE-MARK-MASQ    all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-MARK-DROP (1 references)
target            prot opt source               destination

Chain KUBE-MARK-MASQ (2 references)
target            prot opt source               destination
MARK              all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-NODE-PORT (1 references)
target            prot opt source               destination
KUBE-MARK-MASQ    tcp  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes nodeport TCP port for masquerade purpose */ match-set KUBE-NODE-PORT-TCP dst

Chain KUBE-POSTROUTING (1 references)
target            prot opt source               destination
MASQUERADE        all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE        all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target            prot opt source               destination
KUBE-MARK-MASQ    all  -- !172.30.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT    all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
ACCEPT            all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
```
PREROUTING stage

The KUBE-NODE-PORT chain checks whether the request's destination port is in the KUBE-NODE-PORT-TCP ipset; if so, it jumps to KUBE-MARK-MASQ. Look at this ipset's contents after our NodePort service is started:

```
$ ipset --list KUBE-NODE-PORT-TCP
Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8268
References: 1
Number of entries: 1
Members:
8086
```

Kubernetes has indeed stored the node port of the service we created in this set.
The KUBE-MARK-MASQ chain marks every packet that enters it with the mark 0x4000, and in the POSTROUTING stage the KUBE-POSTROUTING chain masquerades any packet carrying that mark.
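To see just these two pieces in rule form, a quick check assuming root access on the node:

```bash
# The rule that sets the 0x4000 mark on node-port traffic...
$ iptables -t nat -S KUBE-MARK-MASQ
# ...and the POSTROUTING rule that masquerades anything carrying that mark.
$ iptables -t nat -S KUBE-POSTROUTING
```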
This article has used three examples, analyzing the packets captured by tcpdump on each network device, to explain how Kubernetes handles network requests in different situations. The last example also used the ipsets and iptables rules Kubernetes creates to explain how service access is implemented. Readers can apply the same approach, together with the iptables chains, to verify the packet flow in the first two examples as well.

In the last example, the web service can also be reached through port 8086 on the cluster's master node. In that case the packets additionally pass through the flannel.1 devices on both nodes, but not through master.docker0, and the remoteAddr seen by the web service is also different. Only the paths are given below; the detailed tcpdump output is omitted.

Request:
master.wlp3s0->master.flannel.1->master.wlp3s0->slave.wlo1->slave.flannel.1->slave.docker0
Response:
slave.docker0->slave.flannel.1->slave.wlo1->master.wlp3s0->master.flannel.1-> master.wlp3s0
Readers can try this themselves and analyze the packet flow with tcpdump and the iptables rules.
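As a starting point, a sketch (device names match the environment table above; adjust to your own interfaces): captures like these on the master should show the VXLAN-encapsulated traffic on wlp3s0 and the inner pod-to-pod traffic on flannel.1.

```bash
# On the master node: flannel's VXLAN traffic uses UDP port 8472 (visible in
# the earlier captures), while the inner connection targets the web pod's 8080.
$ tcpdump -n -vv -i wlp3s0 udp port 8472
$ tcpdump -n -vv -i flannel.1 tcp port 8080
```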
One extra thought: why does Kubernetes go to such lengths to masquerade requests that come in through a node port? The reason is that when a NodePort service is created, Kubernetes does not only serve the port on the nodes where the service's endpoints run; every node in the cluster serves it. So when we access the service from outside through a node port, the pod backing the service may not be on the node we are talking to. Without the masquerade, the real endpoint would see the real client IP, and its response would not go back through the node the client originally contacted but would be sent directly to the client. The client would then receive a response from an address different from the one it sent the request to and would treat it as invalid. To make sure the client gets the response from the node it actually contacted, all external node-port traffic is masqueraded, so the response first returns to the node the client contacted and is then forwarded back to the client. If your business requirements (auditing, for example) demand the real client IP, you can consider the following three approaches:

The official Kubernetes documentation discusses this:
Using Source IP