When learning Docker networking, it helps to keep two questions about Linux in mind: how does a container get a network stack of its own, and how does traffic get into and out of that stack?
Docker is built on Linux kernel virtualization features, so to better understand how Docker implements networking we first need a working knowledge of the main Linux technologies involved.
Common network namespace operations (the following commands require root privileges):
ip netns add <name>
Example: create a namespace named netns1
ip netns add netns1
ip netns exec <name> <command>
Example: assign an address to veth1 inside netns1 and bring the device up
ip netns exec netns1 ip addr add 30.0.0.1/24 dev veth1
ip netns exec netns1 ip link set dev veth1 up
or
ip netns exec netns1 ifconfig veth1 30.0.0.1/24 up
ip netns exec <name> bash
ip netns list
ip netns delete <name>
A veth (virtual Ethernet) device always exists as a pair; it cannot exist on its own, and in Linux terminology each end is called the peer of the other. In Linux every network device belongs to exactly one namespace, and physical devices are normally associated only with the root namespace. veth pairs can be created and destroyed at will with commands, can be attached to a chosen namespace, and can later be moved between namespaces.
As mentioned above, each network namespace represents an independent protocol stack; namespaces are isolated from each other and cannot communicate directly. To connect two namespaces you use a veth pair. It behaves like a pipe: one end is plugged into the protocol stack of its own namespace, and the other end is the peer in another namespace. When data is sent on one end, it is delivered directly to the peer, which receives it and hands it to its own network stack for processing, so the two protocol stacks can exchange data.
Common veth pair operations:
ip link add <name1> type veth peer name <name2>
Example: create a veth pair whose ends are named veth1 and veth2
ip link add veth1 type veth peer name veth2
ip link show
or
ip addr
ip link set <vethName> netns <netnsName>
Example: move veth1 into the netns1 namespace
ip link set veth1 netns netns1
To verify this understanding of network namespaces and veth devices, we can design the following experiment:
Create two namespaces and one veth pair, put one end of the pair into each namespace, assign an IP address to each veth, and then ping the other end from each namespace. If both pings succeed, the veth pair really does break through the wall between namespaces and lets them communicate.
Create the namespaces:
root@chengf:~# ip netns add netns1
root@chengf:~# ip netns add netns2
root@chengf:~# ip netns list
netns2
netns1
Create the veth pair:
root@chengf:~# ip link add veth1 type veth peer name veth2
root@chengf:~# ip link show
...
58: veth2@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:62:b8:f5:4a:68 brd ff:ff:ff:ff:ff:ff
59: veth1@veth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:69:c0:3f:e2:27 brd ff:ff:ff:ff:ff:ff
Move the two ends of the veth pair into different namespaces. After the move, the veth devices we just created no longer show up in the root namespace.
root@chengf:~# ip link set veth1 netns netns1
root@chengf:~# ip link set veth2 netns netns2
Looking inside each namespace confirms that the veth devices have indeed moved there:
root@chengf:~# ip netns exec netns1 ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
59: veth1@if58: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:69:c0:3f:e2:27 brd ff:ff:ff:ff:ff:ff link-netnsid 1
root@chengf:~# ip netns exec netns2 ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
58: veth2@if59: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether fe:62:b8:f5:4a:68 brd ff:ff:ff:ff:ff:ff link-netnsid 0
Assign addresses to the veth devices and bring them up:
root@chengf:~# ip netns exec netns1 ip addr add 30.0.0.1/24 dev veth1
root@chengf:~# ip netns exec netns1 ip link set dev veth1 up
root@chengf:~# ip netns exec netns2 ip addr add 30.0.0.2/24 dev veth2
root@chengf:~# ip netns exec netns2 ip link set dev veth2 up
Check the devices with ip addr after bringing them up:
root@chengf:~# ip netns exec netns1 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
59: veth1@if58: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 46:69:c0:3f:e2:27 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 30.0.0.1/24 scope global veth1
       valid_lft forever preferred_lft forever
    inet6 fe80::4469:c0ff:fe3f:e227/64 scope link
       valid_lft forever preferred_lft forever
root@chengf:~# ip netns exec netns2 ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
58: veth2@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:62:b8:f5:4a:68 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 30.0.0.2/24 scope global veth2
       valid_lft forever preferred_lft forever
    inet6 fe80::fc62:b8ff:fef5:4a68/64 scope link
       valid_lft forever preferred_lft forever
Ping each end from the other namespace:
root@chengf:~# ip netns exec netns1 ping -c 1 30.0.0.2
PING 30.0.0.2 (30.0.0.2) 56(84) bytes of data.
64 bytes from 30.0.0.2: icmp_seq=1 ttl=64 time=0.125 ms
--- 30.0.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.125/0.125/0.125/0.000 ms
root@chengf:~# ip netns exec netns2 ping -c 1 30.0.0.1
PING 30.0.0.1 (30.0.0.1) 56(84) bytes of data.
64 bytes from 30.0.0.1: icmp_seq=1 ttl=64 time=0.084 ms
--- 30.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.084/0.084/0.084/0.000 ms
This shows that a veth pair really does let different namespaces reach each other. But a veth pair is strictly one-to-one; when there are more than two namespaces (more than two Docker containers), how do they all reach one another? That is what a bridge is for, so next we look at the basics of the Linux bridge.
A traditional switch either forwards or drops the frames it receives. The bridge that runs inside the Linux network stack is different: the machine it runs on is itself a host and may well be the destination of the traffic, so besides being forwarded or dropped, a received frame may also be handed up to the network layer of the protocol stack and processed by the host itself. If the bridge is given an IP address, other devices can reach the host's protocol stack through it, so in this sense the bridge can also be viewed as a layer-3 device.
Common bridge operations:
brctl addbr <name>
brctl delbr <name>
brctl addif <brname> <ifname>
Example: attach the eth0 NIC to br001
brctl addif br001 eth0
ifconfig <name> <IP>
ifconfig <name> up
ifconfig <name> down
brctl show
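brctl comes from the bridge-utils package, which many newer distributions no longer install by default; the same operations can also be done with iproute2. A rough sketch of the equivalents, assuming the same br001 and eth0 names as above:
ip link add name br001 type bridge      # brctl addbr br001
ip link set dev eth0 master br001       # brctl addif br001 eth0
ip link set dev br001 up                # ifconfig br001 up
bridge link show                        # list the ports attached to bridges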
With the basics of the bridge covered, we can test it with the following simple experiment:
Create three namespaces and three veth pairs. Put one end of each pair into one of the namespaces and attach the other ends to a bridge br001. After assigning addresses, if the namespaces can all ping each other, the bridge really is forwarding data to the right destination network.
Create the namespaces:
root@chengf:~# ip netns add netns1
root@chengf:~# ip netns add netns2
root@chengf:~# ip netns add netns3
root@chengf:~# ip netns list
netns3
netns2
netns1
Create the veth pairs:
root@chengf:~# ip link add veth001 type veth peer name peer-veth001
root@chengf:~# ip link add veth002 type veth peer name peer-veth002
root@chengf:~# ip link add veth003 type veth peer name peer-veth003
root@chengf:~# ip link show
...
51: peer-veth001@veth001: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 2e:8b:9e:92:76:28 brd ff:ff:ff:ff:ff:ff
52: veth001@peer-veth001: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:9e:04:44:30:a9 brd ff:ff:ff:ff:ff:ff
53: peer-veth002@veth002: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:1f:8e:f5:8b:8c brd ff:ff:ff:ff:ff:ff
54: veth002@peer-veth002: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether f2:1a:b4:54:94:3d brd ff:ff:ff:ff:ff:ff
55: peer-veth003@veth003: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:b9:5f:ab:6a:d2 brd ff:ff:ff:ff:ff:ff
56: veth003@peer-veth003: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 0a:84:78:42:8f:77 brd ff:ff:ff:ff:ff:ff
Move one end of each pair into its namespace:
root@chengf:~# ip link set veth001 netns netns1
root@chengf:~# ip link set veth002 netns netns2
root@chengf:~# ip link set veth003 netns netns3
Create the bridge:
root@chengf:~# brctl addbr br001
root@chengf:~# ip link show
...
57: br001: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:0b:ad:d5:fe:73 brd ff:ff:ff:ff:ff:ff
Attach the other end of each pair to the bridge:
root@chengf:~# brctl addif br001 peer-veth001
root@chengf:~# brctl addif br001 peer-veth002
root@chengf:~# brctl addif br001 peer-veth003
root@chengf:~# ip link show
...
51: peer-veth001@if52: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master br001 state DOWN mode DEFAULT group default qlen 1000
    link/ether 2e:8b:9e:92:76:28 brd ff:ff:ff:ff:ff:ff link-netnsid 2
53: peer-veth002@if54: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master br001 state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:1f:8e:f5:8b:8c brd ff:ff:ff:ff:ff:ff link-netnsid 3
55: peer-veth003@if56: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master br001 state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:b9:5f:ab:6a:d2 brd ff:ff:ff:ff:ff:ff link-netnsid 4
57: br001: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:b9:5f:ab:6a:d2 brd ff:ff:ff:ff:ff:ff
Assign an IP address to the veth device in each namespace and bring the devices up:
root@chengf:~# ip netns exec netns1 ifconfig veth001 30.0.1.1/16 up
root@chengf:~# ip netns exec netns2 ifconfig veth002 30.0.1.2/16 up
root@chengf:~# ip netns exec netns3 ifconfig veth003 30.0.1.3/16 up
At this point the namespaces still cannot reach each other, because neither the bridge nor the peer devices attached to it are up:
root@chengf:~# ip netns exec netns1 ping 30.0.1.2
PING 30.0.1.2 (30.0.1.2) 56(84) bytes of data.
^C
--- 30.0.1.2 ping statistics ---
13 packets transmitted, 0 received, 100% packet loss, time 12285ms
Assign an IP address to the bridge and bring it up:
root@chengf:~# ifconfig br001 30.0.0.1/16 up
The ping still fails, because the peer devices attached to the bridge have not been brought up yet:
root@chengf:~# ip netns exec netns1 ping -c 1 30.0.1.2
PING 30.0.1.2 (30.0.1.2) 56(84) bytes of data.
--- 30.0.1.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
Bring up the peer devices:
root@chengf:~# ip link set dev peer-veth001 up
root@chengf:~# ip link set dev peer-veth002 up
root@chengf:~# ip link set dev peer-veth003 up
Now the namespaces can reach one another:
root@chengf:~# ip netns exec netns1 ping -c 1 30.0.1.2
PING 30.0.1.2 (30.0.1.2) 56(84) bytes of data.
64 bytes from 30.0.1.2: icmp_seq=2 ttl=64 time=0.249 ms
--- 30.0.1.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 1027ms
rtt min/avg/max/mdev = 0.249/0.325/0.401/0.076 ms
root@chengf:~# ip netns exec netns1 ping -c 1 30.0.1.3
PING 30.0.1.3 (30.0.1.3) 56(84) bytes of data.
64 bytes from 30.0.1.3: icmp_seq=1 ttl=64 time=0.472 ms
--- 30.0.1.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.472/0.472/0.472/0.000 ms
root@chengf:~# ip netns exec netns2 ping -c 1 30.0.1.1
PING 30.0.1.1 (30.0.1.1) 56(84) bytes of data.
64 bytes from 30.0.1.1: icmp_seq=1 ttl=64 time=0.219 ms
--- 30.0.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.219/0.219/0.219/0.000 ms
root@chengf:~# ip netns exec netns2 ping -c 1 30.0.1.3
PING 30.0.1.3 (30.0.1.3) 56(84) bytes of data.
64 bytes from 30.0.1.3: icmp_seq=1 ttl=64 time=0.389 ms
--- 30.0.1.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.389/0.389/0.389/0.000 ms
root@chengf:~# ip netns exec netns3 ping -c 1 30.0.1.1
PING 30.0.1.1 (30.0.1.1) 56(84) bytes of data.
64 bytes from 30.0.1.1: icmp_seq=1 ttl=64 time=0.194 ms
--- 30.0.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.194/0.194/0.194/0.000 ms
root@chengf:~# ip netns exec netns3 ping -c 1 30.0.1.2
PING 30.0.1.2 (30.0.1.2) 56(84) bytes of data.
64 bytes from 30.0.1.2: icmp_seq=1 ttl=64 time=0.216 ms
--- 30.0.1.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.216/0.216/0.216/0.000 ms
The corresponding tcpdump output on br001:
21:49:59.741205 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.2 tell 30.0.1.1, length 28
21:49:59.741313 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.2 is-at f2:1a:b4:54:94:3d, length 28
21:49:59.741345 IP (tos 0x0, ttl 64, id 14004, offset 0, flags [DF], proto ICMP (1), length 84)
    30.0.1.1 > 30.0.1.2: ICMP echo request, id 10633, seq 1, length 64
21:49:59.741526 IP (tos 0x0, ttl 64, id 3913, offset 0, flags [none], proto ICMP (1), length 84)
    30.0.1.2 > 30.0.1.1: ICMP echo reply, id 10633, seq 1, length 64
21:50:00.768500 IP (tos 0x0, ttl 64, id 14060, offset 0, flags [DF], proto ICMP (1), length 84)
    30.0.1.1 > 30.0.1.2: ICMP echo request, id 10633, seq 2, length 64
21:50:00.768646 IP (tos 0x0, ttl 64, id 3968, offset 0, flags [none], proto ICMP (1), length 84)
    30.0.1.2 > 30.0.1.1: ICMP echo reply, id 10633, seq 2, length 64
root@chengf:~# tcpdump -n -vv -i br001 | grep ARP
tcpdump: listening on br001, link-type EN10MB (Ethernet), capture size 262144 bytes
22:36:54.364519 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.2 tell 30.0.1.3, length 28
22:36:54.364646 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.2 is-at f2:1a:b4:54:94:3d, length 28
22:37:04.604563 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.3 tell 30.0.1.2, length 28
22:37:04.604783 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.3 is-at 0a:84:78:42:8f:77, length 28
22:37:28.156545 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.2 tell 30.0.1.3, length 28
22:37:28.156703 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.2 is-at f2:1a:b4:54:94:3d, length 28
22:37:33.280248 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.3 tell 30.0.1.2, length 28
22:37:33.280383 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.3 is-at 0a:84:78:42:8f:77, length 28
22:37:54.780257 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.3 tell 30.0.1.2, length 28
22:37:54.780418 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.3 is-at 0a:84:78:42:8f:77, length 28
22:38:01.948258 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 30.0.1.2 tell 30.0.1.3, length 28
22:38:01.948414 ARP, Ethernet (len 6), IPv4 (len 4), Reply 30.0.1.2 is-at f2:1a:b4:54:94:3d, length 28
In the Linux network stack, the mechanism used to filter, modify, and drop packets as they are processed is netfilter together with iptables. netfilter runs inside the kernel and is the logic that actually applies the various rules to packets; iptables runs in user space and provides a rich command-line interface that helps users maintain and modify the rule tables netfilter uses. Together they implement Linux packet processing.
The points in the Linux network stack where netfilter can hook in and act on packets are called hook points. Each hook point can hold multiple rules, which form a chain, called a rule chain. There are five hook points (PREROUTING, INPUT, OUTPUT, FORWARD, POSTROUTING) and correspondingly five built-in chains, but users may also create their own chains and jump to them from the built-in chains to process packets.
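As a concrete illustration, here is a minimal sketch of a user-defined chain; the chain name MY_WEB, the port, and the source network are made up for the example:
iptables -N MY_WEB                                 # create a user-defined chain
iptables -A INPUT -p tcp --dport 8080 -j MY_WEB    # jump to it from the built-in INPUT chain
iptables -A MY_WEB -s 10.0.0.0/8 -j ACCEPT         # rules processed inside the custom chain
iptables -A MY_WEB -j DROP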
The five hook points sit at different stages of packet processing: PREROUTING is hit as soon as a packet arrives, before the routing decision; INPUT on the way up to the local protocol stack; FORWARD for packets routed through the host; OUTPUT for locally generated packets; and POSTROUTING just before packets leave the host.
By function, the rules in these chains fall into four categories, stored in four tables in Linux: raw, mangle, nat, and filter.
The priority of these four tables, from highest to lowest, is raw --> mangle --> nat --> filter.
Not every table hooks into all five points; the correspondence is:
raw: PREROUTING, OUTPUT
mangle: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING
nat: PREROUTING, INPUT, OUTPUT, POSTROUTING
filter: INPUT, FORWARD, OUTPUT
Besides jumping to another chain, a rule's target can be, for example, ACCEPT, DROP, SNAT, DNAT, or MASQUERADE (a variant of SNAT in which the source address is taken dynamically from the NIC that sends the packet out).
Example:
The following example illustrates this scenario: hosts 192.168.0.102 and 192.168.0.108 are on the same LAN. The 108 machine runs a Docker container with a web application listening on port 80, mapped to port 1180 on the 108 machine. When the 102 machine accesses this web application, the request is handled as follows.
The key rule, in the PREROUTING chain, is:
-A PREROUTING ! -i docker0 -p tcp -m tcp --dport 1180 -j DNAT --to-destination 172.17.0.2:80
It means that, in the PREROUTING stage, any TCP packet that did not come in through docker0 and whose destination port is 1180 has its destination rewritten to 172.17.0.2:80. In other words, a request to port 1180 on the host is turned into a request to 172.17.0.2:80.
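For example, assuming the web application answers on /, the container could then be reached from the 102 machine through the host's mapped port with something like:
curl http://192.168.0.108:1180/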
Common iptables commands
Rules are configured mainly with the iptables command, and Docker or Kubernetes automatically create many rules for us when installing and deploying containers. Linux offers a rich set of parameters for the iptables command to append different kinds of rules; run iptables -h for details.
Here are a few commands for viewing the rules that already exist:
iptables-save
iptables -n -v -L
iptables -n -v -L -t nat
Routing: basic concepts
In the previous section we saw that after a packet passes PREROUTING, Linux makes a routing decision to determine whether the packet is delivered locally or forwarded; this is where the routing function comes in.
Routing is implemented by the routing table maintained at the IP layer. When a packet is received from the network, the IP layer first checks whether its destination IP address belongs to the local machine. If it does, the packet is passed up to the relevant transport-layer protocol; if not, the host forwards it according to the routing rules, and if no rule matches, the packet is dropped.
The routing table consists of entries; a typical entry contains a destination address (host or network), a netmask (Genmask), a gateway, flags, a metric, and the outgoing interface.
The routing lookup proceeds as follows:
First, the table is searched for an entry whose address field exactly matches the destination IP of the packet. If one is found, the packet is sent to the device or intermediate router named by that entry.
If there is no exact host match, the table is searched for a matching network ID, with masks tried from longest to shortest. If a match is found, the packet is forwarded to the indicated router. In this way all hosts on that network are handled by this single entry, which keeps the routing table small.
If neither of the above matches, the packet is forwarded to a "default router", the entry whose Genmask is 0.0.0.0.
If that also fails, that is, there is no default router, the packet cannot be forwarded. Any undeliverable packet generates an ICMP host-unreachable or network-unreachable error, which is returned to the application that produced the packet.
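You can ask the kernel which entry a given destination would match with ip route get; the addresses below are just placeholders:
ip route get 192.168.0.109    # expected to match the directly connected LAN entry
ip route get 8.8.8.8          # expected to fall through to the default route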
Routing: basic commands
route -n -v
To view IPv6 routes:
route -n -v -6
ip route list
ip route list table all
route add -host 192.168.0.114 dev eth0
meaning that traffic destined for the host 192.168.0.114 is sent out through the eth0 interface
route add -net 192.168.0.0 netmask 255.255.255.0 dev eth0
meaning that traffic destined for the 192.168.0.0/24 network is sent out through the eth0 interface
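The same routes can also be added with the newer ip command; a sketch assuming the same addresses and interface:
ip route add 192.168.0.114/32 dev eth0
ip route add 192.168.0.0/24 dev eth0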
Docker networking: basic concepts
Standard Docker supports four network models: bridge, host, container, and none.
bridge is the default; when we create a container without specifying a network, this is the mode used, so it is the one we will focus on.
In bridge mode, the first time the Docker daemon starts it creates a virtual bridge docker0 and picks an address for it from a few candidate subnets, usually one starting with 172 that does not overlap with the host's own addresses. For every container Docker then creates, it uses network-namespace technology to give the container an isolated namespace, creates a pair of virtual Ethernet devices (a veth pair), attaches one end to the docker0 bridge, moves the other end into the container's network namespace and renames it eth0, and finally assigns eth0 an IP address from the same subnet as docker0.
This gives the bridge-mode network model: each container's eth0 is one end of a veth pair whose other end sits on the docker0 bridge in the host.
It follows that a container's IP address is, by default, not exposed outside the host; unless a port is mapped out through the host, external machines cannot reach the application inside the container.
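On a host that already has Docker installed, these pieces can be seen with standard commands (the exact device names and addresses will differ from machine to machine):
docker network inspect bridge    # the subnet and gateway chosen for docker0
ip -d link show docker0          # docker0 is an ordinary Linux bridge
bridge link                      # the veth endpoints attached to docker0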
Next, we analyze Docker networking further by looking at how the Linux network stack changes in three scenarios.
Before Docker is installed
root@slave:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp8s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:0f:cf:6a:23:fc brd ff:ff:ff:ff:ff:ff
3: wlo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether c4:8e:8f:d0:33:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.114/24 brd 192.168.0.255 scope global dynamic noprefixroute wlo1
       valid_lft 6415sec preferred_lft 6415sec
    inet6 fe80::c25:3752:ea14:78dc/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
root@slave:~# iptables-save
root@slave:~# route -v -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
After Docker is installed and started
root@slave:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp8s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:0f:cf:6a:23:fc brd ff:ff:ff:ff:ff:ff
3: wlo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether c4:8e:8f:d0:33:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.114/24 brd 192.168.0.255 scope global dynamic noprefixroute wlo1
       valid_lft 5916sec preferred_lft 5916sec
    inet6 fe80::c25:3752:ea14:78dc/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:8c:7a:a3:9d brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
root@slave:~# iptables-save
# Generated by iptables-save v1.6.1 on Sun Dec 22 12:18:01 2019
*filter
:INPUT ACCEPT [124:8899]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [70:7745]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Sun Dec 22 12:18:01 2019
# Generated by iptables-save v1.6.1 on Sun Dec 22 12:18:01 2019
*nat
:PREROUTING ACCEPT [11:508]
:INPUT ACCEPT [9:428]
:OUTPUT ACCEPT [7:491]
:POSTROUTING ACCEPT [7:491]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Sun Dec 22 12:18:01 2019
root@slave:~# route -v -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
In the nat table, one rule was appended to the POSTROUTING chain:
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
It means that packets whose source address is in 172.17.0.0/16 and that leave through an interface other than docker0 get their source address rewritten to the address of the NIC that actually sends them out; in other words, traffic from a container to the outside world is SNATed once.
And one entry was added to the routing table:
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
When the destination address falls in the 172.17.0.0/16 subnet, the docker0 interface is used; given how a bridge works, as described earlier, this is what lets containers reach one another.
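One way to confirm that the MASQUERADE rule is actually hit is to watch its packet counters while a container talks to the outside world; the counter values will of course be machine-specific:
iptables -t nat -v -n -L POSTROUTING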
Now start a simple nginx container as an example:
slave@slave:~$ docker run -d -it --name nginxfornet nginx:1.9.1
a3b1d91e31ea72420233a8a3bcd1ce0946cee60ce47f2f4004e74103a96098e5
slave@slave:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp8s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:0f:cf:6a:23:fc brd ff:ff:ff:ff:ff:ff
3: wlo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether c4:8e:8f:d0:33:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.114/24 brd 192.168.0.255 scope global dynamic noprefixroute wlo1
       valid_lft 6637sec preferred_lft 6637sec
    inet6 fe80::c25:3752:ea14:78dc/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:8c:7a:a3:9d brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:8cff:fe7a:a39d/64 scope link
       valid_lft forever preferred_lft forever
6: vethf064714@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 16:66:c2:9d:e5:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::1466:c2ff:fe9d:e503/64 scope link
       valid_lft forever preferred_lft forever
slave@slave:~$ docker exec -it nginxfornet ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
5: eth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
slave@slave:~$ sudo iptables-save
# Generated by iptables-save v1.6.1 on Sun Dec 22 21:39:14 2019
*filter
:INPUT ACCEPT [40148:56446687]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [18377:800696]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Sun Dec 22 21:39:14 2019
# Generated by iptables-save v1.6.1 on Sun Dec 22 21:39:14 2019
*nat
:PREROUTING ACCEPT [38:2301]
:INPUT ACCEPT [32:1417]
:OUTPUT ACCEPT [62:4827]
:POSTROUTING ACCEPT [62:4827]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Sun Dec 22 21:39:14 2019
slave@slave:~$ route -n -v
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
Still using nginx as the example, expose port 80 inside the container as port 1180 on the host:
slave@slave:~$ docker rm -f nginxfornet
nginxfornet
slave@slave:~$ docker run -d -it --name nginxfornet -p 1180:80 nginx:1.9.1
93b58ae974017b63e6e344ec0f1c4f5e45bbf119146f97422ce04d8b60b68010
slave@slave:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp8s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
    link/ether 48:0f:cf:6a:23:fc brd ff:ff:ff:ff:ff:ff
3: wlo1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether c4:8e:8f:d0:33:89 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.114/24 brd 192.168.0.255 scope global dynamic noprefixroute wlo1
       valid_lft 6488sec preferred_lft 6488sec
    inet6 fe80::c25:3752:ea14:78dc/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:8c:7a:a3:9d brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:8cff:fe7a:a39d/64 scope link
       valid_lft forever preferred_lft forever
8: veth9e0cff3@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 12:9c:31:7b:29:0c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::109c:31ff:fe7b:290c/64 scope link
       valid_lft forever preferred_lft forever
slave@slave:~$ sudo iptables-save
# Generated by iptables-save v1.6.1 on Sun Dec 22 21:41:14 2019
*filter
:INPUT ACCEPT [49:3352]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [38:5168]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 80 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Sun Dec 22 21:41:14 2019
# Generated by iptables-save v1.6.1 on Sun Dec 22 21:41:14 2019
*nat
:PREROUTING ACCEPT [6:216]
:INPUT ACCEPT [6:216]
:OUTPUT ACCEPT [3:219]
:POSTROUTING ACCEPT [3:219]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 1180 -j DNAT --to-destination 172.17.0.2:80
COMMIT
# Completed on Sun Dec 22 21:41:14 2019
slave@slave:~$ route -n -v
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
The following rules were added to iptables:
*filter
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 80 -j ACCEPT
*nat
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 80 -j MASQUERADE
...
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 1180 -j DNAT --to-destination 172.17.0.2:80
The DNAT rule rewrites requests arriving on host port 1180 to 172.17.0.2:80, the filter rule accepts that forwarded traffic towards the container's port 80, and the extra MASQUERADE rule handles the hairpin case where the container reaches itself through the mapped port.
From another machine (192.168.0.109) on the same LAN as the host (192.168.0.114), access port 1180 on the host, and capture packets on the wlo1 and docker0 devices with tcpdump to follow the request.
On the host's physical NIC wlo1:
root@slave:~# tcpdump -n -vv -i wlo1
...
21:46:02.990084 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.1 tell 192.168.0.114, length 28
21:46:02.990605 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 9c:21:6a:d7:91:28, length 28
...
21:46:13.521957 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.114 (c4:8e:8f:d0:33:89) tell 192.168.0.109, length 28
21:46:13.521983 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.114 is-at c4:8e:8f:d0:33:89, length 28
...
21:46:27.345491 IP (tos 0x0, ttl 128, id 4599, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.109.26999 > 192.168.0.114.1180: Flags [S], cksum 0xb9eb (correct), seq 4049181629, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 2543868184 ecr 0], length 0
21:46:27.345597 IP (tos 0x0, ttl 128, id 4600, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.109.27000 > 192.168.0.114.1180: Flags [S], cksum 0x23d2 (correct), seq 3764470477, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 2543868185 ecr 0], length 0
21:46:27.345789 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.114.1180 > 192.168.0.109.26999: Flags [S.], cksum 0x6121 (correct), seq 1260248266, ack 4049181630, win 65160, options [mss 1460,sackOK,TS val 4197603350 ecr 2543868184,nop,wscale 7], length 0
21:46:27.345863 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.114.1180 > 192.168.0.109.27000: Flags [S.], cksum 0x2595 (correct), seq 996708850, ack 3764470478, win 65160, options [mss 1460,sackOK,TS val 4197603350 ecr 2543868185,nop,wscale 7], length 0
...
On the docker0 bridge:
root@slave:~# tcpdump -n -vv -i docker0
tcpdump: listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
...
21:46:27.345560 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.17.0.2 tell 172.17.0.1, length 28
21:46:27.345651 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.17.0.2 is-at 02:42:ac:11:00:02, length 28
...
21:46:32.430063 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.17.0.1 tell 172.17.0.2, length 28
21:46:32.430161 ARP, Ethernet (len 6), IPv4 (len 4), Reply 172.17.0.1 is-at 02:42:8c:7a:a3:9d, length 28
21:46:27.345676 IP (tos 0x0, ttl 127, id 4599, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.109.26999 > 172.17.0.2.80: Flags [S], cksum 0xd33e (correct), seq 4049181629, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 2543868184 ecr 0], length 0
21:46:27.345684 IP (tos 0x0, ttl 127, id 4600, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.0.109.27000 > 172.17.0.2.80: Flags [S], cksum 0x3d25 (correct), seq 3764470477, win 8192, options [mss 1460,nop,wscale 8,sackOK,TS val 2543868185 ecr 0], length 0
21:46:27.345731 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.2.80 > 192.168.0.109.26999: Flags [S.], cksum 0x6d57 (incorrect -> 0x7a74), seq 1260248266, ack 4049181630, win 65160, options [mss 1460,sackOK,TS val 4197603350 ecr 2543868184,nop,wscale 7], length 0
21:46:27.345748 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.2.80 > 192.168.0.109.27000: Flags [S.], cksum 0x6d57 (incorrect -> 0x3ee8), seq 996708850, ack 3764470478, win 65160, options [mss 1460,sackOK,TS val 4197603350 ecr 2543868185,nop,wscale 7], length 0
Starting from basic Linux building blocks such as veth pairs, bridges, routing, and iptables, this article has walked through how Docker implements connectivity between containers at the lowest layer. Of course bridge is not the only driver in the Docker networking stack, and in production the default docker0 bridge is not recommended; user-defined bridges created through Docker additionally provide service discovery. Docker swarm mode also offers overlay and macvlan solutions, which are more powerful and also more complex, and are left for further study. If anything here is wrong or poorly understood, corrections from readers are welcome.