OpenShift wraps Kubernetes and, on the networking side, uses OVS to implement multi-tenant isolation. When a service is exposed externally, packets must pass through the OVS tun0 interface. The following walks through how a pod is reached through tun0 via the service network (172.30.0.0/16). (The figure below comes from 理解OpenShift(3):網絡之 SDN; the IPs in the original figure have been removed.)
The OpenShift version used here:
# openshift version
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
etcd 3.2.1
First, look at the routing table inside the pod (an Elasticsearch pod). The default gateway is 10.131.2.1 and the egress interface is eth0 (IP: 10.131.2.45).
sh-4.2$ ip route
default via 10.131.2.1 dev eth0
10.128.0.0/14 dev eth0
10.131.2.0/23 dev eth0 proto kernel scope link src 10.131.2.45
224.0.0.0/4 dev eth0
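The routing table above was taken from a shell already open inside the pod. If no such shell is available, the same information can be pulled through oc from a master or node; a minimal sketch, where the pod name is a placeholder for the actual Elasticsearch pod in the logging project:
# oc -n logging get pods -o wide
# oc -n logging exec <logging-es-pod-name> -- ip route
# oc -n logging exec <logging-es-pod-name> -- ip addr show eth0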
Next, look at the node's routing table. The route for the 172.30.0.0/16 subnet, which contains the pod service's cluster IP (172.30.229.30), points at tun0, so external traffic first reaches tun0 and is then forwarded to the pod.
[root@dt-infra1 home]# ip route
default via 10.122.70.1 dev eth0 proto static metric 100
10.122.70.0/24 dev eth0 proto kernel scope link src 10.122.70.72 metric 100
10.128.0.0/14 dev tun0 proto kernel scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.30.0.0/16 dev tun0 scope link
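As a quick cross-check (not part of the original capture), ip route get asks the kernel which of the routes above it would actually pick for the cluster IP; the reply should name tun0 as the output device with source 10.131.2.1:
[root@dt-infra1 home]# ip route get 172.30.229.30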
The node's interface information is shown below (unrelated interfaces have been trimmed). vethd0d3571b is the peer of the pod's eth0 interface, and the tun0 address is also the pod's default gateway.
[root@dt-infra1 home]# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 50:6b:8d:d4:25:db brd ff:ff:ff:ff:ff:ff
    inet 10.122.70.72/24 brd 10.122.70.255 scope global dynamic eth0
       valid_lft 2137631556sec preferred_lft 2137631556sec
    inet6 fe80::748f:487f:6bf1:685c/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
    inet6 fe80::8fab:527e:43dd:f1b1/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
    inet6 fe80::f2d6:994f:f43:cce5/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
5: br0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN qlen 1000
    link/ether 8e:13:86:1d:ab:43 brd ff:ff:ff:ff:ff:ff
9: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether d6:48:30:3a:bb:d0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d448:30ff:fe3a:bbd0/64 scope link
       valid_lft forever preferred_lft forever
10: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN qlen 1000
    link/ether a6:02:fb:84:24:d0 brd ff:ff:ff:ff:ff:ff
    inet 10.131.2.1/23 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::a402:fbff:fe84:24d0/64 scope link
       valid_lft forever preferred_lft forever
53: vethd0d3571b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
    link/ether 0e:7d:3a:9f:ff:6e brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::c7d:3aff:fe9f:ff6e/64 scope link
       valid_lft forever preferred_lft forever
Looking at the pod's interface information, eth0 is one end of a veth pair whose peer is the interface with index 53 (if53) on the node.
sh-4.2$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if53: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT
    link/ether 0a:58:0a:83:02:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
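One way to confirm the pairing, beyond matching the @if53 / @if3 suffixes, is to read the peer ifindex from sysfs inside the pod and then look that index up on the node; in this environment the first command should print 53 and the second should show vethd0d3571b:
sh-4.2$ cat /sys/class/net/eth0/iflink
[root@dt-infra1 home]# ip -o link | grep '^53:'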
As the figure above shows, tun0 and the pod's veth pair are attached to a bridge. The bridge is created by OVS (it is not visible with ip link show type bridge). The following command lists the ports on br0, which include tun0 and the node-side peer of the pod's veth pair.
[root@dt-infra1 home]# ovs-vsctl show
b9e6f9ba-efb4-4d5f-97f0-e53191ccf174
    Bridge "br0"
        fail_mode: secure
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "br0"
            Interface "br0"
                type: internal
        Port "vethd0d3571b"
            Interface "vethd0d3571b"
        Port "tun0"
            Interface "tun0"
                type: internal
    ovs_version: "2.7.2"
The following command lists the OVS bridges in this environment; there is only one, br0.
[root@dt-infra1 home]# ovs-vsctl list-br
br0
The following command shows the port numbers in this environment; OpenFlow rules use these numbers to refer to interfaces.
[root@dt-infra1 home]# ovs-ofctl -O OpenFlow13 dump-ports-desc br0
OFPST_PORT_DESC reply (OF1.3) (xid=0x2):
 1(vxlan0): addr:aa:fd:ab:96:85:ed
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:a6:02:fb:84:24:d0
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 10(vetha739667c): addr:9e:14:13:9c:6e:73
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 37(veth6c630c72): addr:2e:f7:70:40:b6:fb
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 45(vethd0d3571b): addr:0e:7d:3a:9f:ff:6e
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:8e:13:86:1d:ab:43
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
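If only the port number of a single interface is needed, ovs-vsctl can query it directly; with the veth identified earlier, the first command should return 45, and the second lists the name-to-ofport mapping for all interfaces:
[root@dt-infra1 home]# ovs-vsctl get Interface vethd0d3571b ofport
[root@dt-infra1 home]# ovs-vsctl --columns=name,ofport list Interface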
The iptables rules related to this pod are listed below. Before a packet enters tun0, its destination is DNATed to 10.131.2.45:9200, i.e. ContainerIP:ContainerPort.
-A KUBE-SERVICES -d 172.30.229.30/32 -p tcp -m comment --comment "logging/logging-es: cluster IP" -m tcp --dport 9200 -j KUBE-SVC-BWSQUABZDDFLJOKN
-A KUBE-SVC-BWSQUABZDDFLJOKN -m comment --comment "logging/logging-es:" -j KUBE-SEP-H2SZLG7QT6WVYISM
-A KUBE-SEP-H2SZLG7QT6WVYISM -s 10.131.2.45/32 -m comment --comment "logging/logging-es:" -j KUBE-MARK-MASQ
-A KUBE-SEP-H2SZLG7QT6WVYISM -p tcp -m comment --comment "logging/logging-es:" -m tcp -j DNAT --to-destination 10.131.2.45:9200
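These rules can be reproduced on the node by dumping the nat table and filtering on the service name or the cluster IP, for example:
[root@dt-infra1 home]# iptables-save -t nat | grep 'logging/logging-es'
[root@dt-infra1 home]# iptables-save -t nat | grep 172.30.229.30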
Packets leaving the pod go to br0 for flow-table processing, which then picks tun0 as the egress interface. The OVS flow tables are organized as follows:
table 0: classifies incoming traffic by ingress port (in_port). Traffic from the VXLAN tunnel goes to table 10, with its VXLAN VNI saved in OVS for later use; traffic from tun0 (originating on this node, or entering this node to be forwarded) goes to table 30; the rest, i.e. traffic from this node's containers (veth***), goes to table 20;
table 10: ingress validity check. If the tunnel's remote IP (tun_src) is the IP of a cluster node, the packet is considered legitimate and continues to table 30;
table 20: ingress validity check. If the packet's source IP (nw_src) matches its ingress port (in_port), it is considered legitimate, the source project tag is set, and processing continues at table 30; if they do not match, there may be ARP/IP spoofing and the packet is treated as illegitimate;
table 30: dispatches on the packet's destination (destination IP, or the IP in an ARP request) to tables 40-70;
table 40: local ARP forwarding. Based on the requested IP, the packet is sent out the corresponding port (veth);
table 50: remote ARP forwarding. Based on the requested IP, the VXLAN tunnel remote IP is set and the packet is sent out through the tunnel;
table 60: Service forwarding. Based on the target Service, the destination project tag and egress mark are set, and processing continues at table 80;
table 70: local IP forwarding for packets destined to local containers. Based on the destination IP, the destination project tag and egress mark are set, and processing continues at table 80;
table 80: egress validity check for local IP packets. It verifies that the source project tag matches the destination project tag, or that the destination project is public, and only then forwards. (This is where OpenShift implements multi-tenant isolation at the network level: isolation is per project, each project is assigned a VXLAN VNI, and table 80 only forwards a packet when its VNI matches the port's VNI tag.)
table 90: remote IP "addressing" for packets destined to remote containers. Based on the destination IP, the VXLAN tunnel remote IP is set and the packet is sent out through the tunnel;
table 100: handles traffic leaving for the external network and sends the packet out via tun0.
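Each of these tables can be inspected on the node with ovs-ofctl; for example, the local ARP rules (table 40) and the local IP forwarding rules for this pod (table 70) can be dumped as follows:
[root@dt-infra1 home]# ovs-ofctl -O OpenFlow13 dump-flows br0 table=40
[root@dt-infra1 home]# ovs-ofctl -O OpenFlow13 dump-flows br0 table=70 | grep 10.131.2.45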
The following flow rules relate to this pod, covering ARP and IP handling. The first rule handles ARP: packets whose ARP target is the pod IP are sent straight out port 45, the node-side peer of the pod's veth. The next two handle IP: the second matches on the destination address and loads 0x2d into NXM_NX_REG2, and the third then outputs packets whose source address is 10.131.2.1 (i.e. tun0) to the port stored in NXM_NX_REG2[] (0x2d, decimal 45, again the pod's veth peer).
1 cookie=0x0, duration=1876880.378s, table=40, n_packets=30185, n_bytes=1267770, priority=100,arp,arp_tpa=10.131.2.45 actions=output:45
2 cookie=0x0, duration=1876880.373s, table=70, n_packets=743139978, n_bytes=738498091589, priority=100,ip,nw_dst=10.131.2.45 actions=load:0->NXM_NX_REG1[],load:0x2d->NXM_NX_REG2[],goto_table:80
3 cookie=0x0, duration=9851088.683s, table=80, n_packets=29879646, n_bytes=113936915873, priority=300,ip,nw_src=10.131.2.1 actions=output:NXM_NX_REG2[]
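To watch br0 walk a packet through these tables end to end, ofproto/trace can simulate a packet entering from tun0 (port 2) towards the pod; a sketch, with the flow fields chosen to match this example (given the rules above, the trace should end with the output:45 action):
[root@dt-infra1 home]# ovs-appctl ofproto/trace br0 in_port=2,tcp,nw_src=10.131.2.1,nw_dst=10.131.2.45,tp_dst=9200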
Summary:
The full path of a packet from outside to the pod is: node route -> iptables -> OVS flow tables -> pod. Note that pod-to-pod traffic travels over VXLAN. For the complete flows see SDN Flows Inside a Node.
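To verify this path with live traffic, tcpdump on the node-side interfaces is usually enough; the filters below simply reuse the addresses from this example, with vxlan_sys_4789 covering the inter-node VXLAN leg:
[root@dt-infra1 home]# tcpdump -nn -i tun0 host 10.131.2.45 and port 9200
[root@dt-infra1 home]# tcpdump -nn -i vxlan_sys_4789 host 10.131.2.45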
References:
https://cloud.tencent.com/developer/article/1070415
https://blog.51cto.com/c2014/1829179
http://www.cnblogs.com/sammyliu/p/10064450.html
http://www.openvswitch.org/support/dist-docs-2.5/ovs-ofctl.8.html
http://www.openvswitch.org/support/dist-docs/ovs-fields.7.txt
https://www.ibm.com/developerworks/cn/cloud/library/1401_zhaoyi_openswitch/index.html
https://www.ibm.com/developerworks/cn/linux/l-tuntap/