DVR Distributed Routing

1. Background

  Scenario without DVR:

  

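  The figure shows that both east-west and north-south traffic converge on the network node, which makes the network node a bottleneck.

  With DVR enabled, the picture looks like this: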

  

 

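  East-west traffic passes directly between compute nodes.

  For north-south traffic, a VM with a floating IP sends and receives directly through its compute node. A VM without a floating IP still goes through the network node.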
 

2. Deployment and traffic flow

  

   

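  2.1 East-west traffic

  VM1 (10.0.1.5, Net1) pings VM2 (10.0.2.5, Net2)

   1) VM1 (10.0.1.5) -> qr (10.0.1.1)

    Following its default route, VM1 broadcasts an ARP request for the address of the qr gateway. Once the gateway address is resolved, the ICMP packet is sent to the qr port.

    (A note on packet format: when VM1 pings VM2, the source and destination IP addresses of the packet never change, while the source and destination MAC addresses change on each segment of the path.)

    Meanwhile, the br-tun bridge drops ARP broadcasts whose target address belongs to an interface_distributed port, so that unnecessary traffic does not leave the node: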

# ovs-ofctl dump-flows br-tun
NXST_FLOW reply (xid=0x4):  
...
cookie=0x0, duration=64720.432s, table=1, n_packets=4, n_bytes=168, idle_age=64607, priority=3,arp,dl_vlan=1,arp_tpa=10.0.1.1 actions=drop
...

   2) qr (10.0.1.1) -> qr (10.0.2.1)

    Inside the qrouter namespace, the Linux kernel's policy routing is in use. List the routing rules:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip rule  
0: from all lookup local  
32766: from all lookup main  
32767: from all lookup default  
32768: from 10.0.1.5 lookup 16  
32769: from 10.0.2.3 lookup 16  
167772417: from 10.0.1.1/24 lookup 167772417  
167772417: from 10.0.1.1/24 lookup 167772417  
167772673: from 10.0.2.1/24 lookup 167772673 

    First look at the main table:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip route list table main  
10.0.1.0/24 dev qr-ddbdc784-d7 proto kernel scope link src 10.0.1.1  
10.0.2.0/24 dev qr-001d0ed9-01 proto kernel scope link src 10.0.2.1  
169.254.31.28/31 dev rfp-0fbb351e-a proto kernel scope link src 169.254.31.28

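    The main table has a matching route (10.0.2.0/24), so the packet leaves through the other qr port. (Q1: Is the qr port IP of a given subnet the same on different compute nodes?)

   3) qr -> br-int

  Next, the MAC address of 10.0.2.5 has to be resolved. Neutron sets it up as a static ARP entry: because Neutron knows about every VM, it can install the static ARP entries in advance: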

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip nei
10.0.1.5 dev qr-ddbdc784-d7 lladdr fa:16:3e:da:75:6d PERMANENT
10.0.2.3 dev qr-001d0ed9-01 lladdr fa:16:3e:a4:fc:98 PERMANENT
10.0.1.6 dev qr-ddbdc784-d7 lladdr fa:16:3e:9f:55:67 PERMANENT
10.0.2.2 dev qr-001d0ed9-01 lladdr fa:16:3e:13:55:66 PERMANENT
10.0.2.5 dev qr-001d0ed9-01 lladdr fa:16:3e:51:99:b8 PERMANENT
10.0.1.4 dev qr-ddbdc784-d7 lladdr fa:16:3e:da:e3:6e PERMANENT
10.0.1.7 dev qr-ddbdc784-d7 lladdr fa:16:3e:14:b8:ec PERMANENT  
169.254.31.29 dev rfp-0fbb351e-a lladdr 42:0d:9f:49:63:c6 STALE
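
  As a rough illustration of how the router resolves VM2's MAC without sending any ARP, the neighbour table can be treated as a plain lookup table. The sketch below parses a few lines copied from the listing; the function name parse_neigh is made up for this example and is not part of Neutron or iproute2.

```python
# A few `ip neigh` lines copied from the listing above.
NEIGH_OUTPUT = """\
10.0.1.5 dev qr-ddbdc784-d7 lladdr fa:16:3e:da:75:6d PERMANENT
10.0.2.3 dev qr-001d0ed9-01 lladdr fa:16:3e:a4:fc:98 PERMANENT
10.0.2.5 dev qr-001d0ed9-01 lladdr fa:16:3e:51:99:b8 PERMANENT
169.254.31.29 dev rfp-0fbb351e-a lladdr 42:0d:9f:49:63:c6 STALE
"""


def parse_neigh(text):
    """Return {ip: (device, mac, state)} for each neighbour line (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: <ip> dev <device> lladdr <mac> <state>
        if len(fields) == 6 and fields[1] == "dev" and fields[3] == "lladdr":
            table[fields[0]] = (fields[2], fields[4], fields[5])
    return table


neigh = parse_neigh(NEIGH_OUTPUT)
device, mac, state = neigh["10.0.2.5"]
print(device, mac, state)
```

  The PERMANENT state is what marks the entry as statically installed; the STALE entry on the rfp port was learned dynamically.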

  The packet now enters br-int and is forwarded by the NORMAL action in table 0:

cookie=0x0, duration=16440.644s, table=0, n_packets=1074, n_bytes=104318, idle_age=8917, priority=1 actions=NORMAL

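  The NORMAL action makes OVS look up the destination MAC address in its FDB (forwarding database) to decide which port the packet goes out of. If the FDB has no entry for that MAC, the packet is flooded to every port in the same VLAN except the one it arrived on. For example: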

# ovs-appctl fdb/show br-int
 port  VLAN  MAC                Age
LOCAL     0  da:91:42:cd:fb:44   18
   18     0  52:54:00:a9:b8:b0    0
   19     0  52:54:00:a9:b8:b1    0

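  So if VM2 is on the same compute node, it receives the packet directly without going through br-tun (once an FDB entry for VM2's MAC exists). Otherwise the packet continues on to br-tun.

  4) br-int -> br-tun -> out of compute node 1

  The packet then passes from br-int into br-tun and is matched against the flow tables: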
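
  The learn-then-forward-or-flood behaviour of NORMAL can be sketched as a tiny learning switch. This is a simplified model, not OVS code: the class name, the port names and the assumption that every port is an access port in exactly one VLAN are all made up for the illustration.

```python
class LearningSwitch:
    """Toy model of the NORMAL action: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        # ports: {port_name: vlan_id}; every port is assumed to be an access port.
        self.ports = dict(ports)
        self.fdb = {}  # (vlan, mac) -> port_name

    def receive(self, in_port, src_mac, dst_mac):
        vlan = self.ports[in_port]
        # Learn where the source MAC lives.
        self.fdb[(vlan, src_mac)] = in_port
        out_port = self.fdb.get((vlan, dst_mac))
        if out_port is not None:
            return [out_port]
        # Unknown destination: flood within the VLAN, except the ingress port.
        return sorted(p for p, v in self.ports.items() if v == vlan and p != in_port)


# Hypothetical port names; VLAN 2 stands for Net2 on this node.
sw = LearningSwitch({"qr-net2": 2, "vm2": 2, "patch-tun": 2, "vm1": 1})
print(sw.receive("qr-net2", "fa:16:3e:69:b4:05", "fa:16:3e:51:99:b8"))  # flooded
print(sw.receive("vm2", "fa:16:3e:51:99:b8", "fa:16:3e:69:b4:05"))      # reply teaches the FDB
print(sw.receive("qr-net2", "fa:16:3e:69:b4:05", "fa:16:3e:51:99:b8"))  # now unicast
```

  The first call floods to both the local VM port and the patch port towards br-tun; after VM2 has replied once, later packets go only to its port.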

 cookie=0x0, duration=66172.51s, table=0, n_packets=58, n_bytes=5731, idle_age=20810, hard_age=65534, priority=1,in_port=3 actions=resubmit(,4)
 cookie=0x0, duration=67599.526s, table=0, n_packets=273, n_bytes=24999, idle_age=1741, hard_age=65534, priority=1,in_port=1 actions=resubmit(,1)
 cookie=0x0, duration=64437.052s, table=0, n_packets=28, n_bytes=2980, idle_age=20799, priority=1,in_port=4 actions=resubmit(,4)
 cookie=0x0, duration=67601.704s, table=0, n_packets=5, n_bytes=390, idle_age=65534, hard_age=65534, priority=0 actions=drop
 cookie=0x0, duration=66135.811s, table=1, n_packets=140, n_bytes=13720, idle_age=65534, hard_age=65534, priority=1,dl_vlan=1,dl_src=fa:16:3e:66:13:af actions=mod_dl_src:fa:16:3f:fe:49:e9,resubmit(,2)
 cookie=0x0, duration=64082.141s, table=1, n_packets=2, n_bytes=200, idle_age=64081, priority=1,dl_vlan=2,dl_src=fa:16:3e:69:b4:05 actions=mod_dl_src:fa:16:3f:fe:49:e9,resubmit(,2)
 cookie=0x0, duration=66135.962s, table=1, n_packets=1, n_bytes=98, idle_age=65301, hard_age=65534, priority=2,dl_vlan=1,dl_dst=fa:16:3e:66:13:af actions=drop 
 cookie=0x0, duration=64082.297s, table=1, n_packets=0, n_bytes=0, idle_age=64082, priority=2,dl_vlan=2,dl_dst=fa:16:3e:69:b4:05 actions=drop
 cookie=0x0, duration=66136.115s, table=1, n_packets=4, n_bytes=168, idle_age=65534, hard_age=65534, priority=3,arp,dl_vlan=1,arp_tpa=10.0.1.1 actions=drop
 cookie=0x0, duration=64082.449s, table=1, n_packets=2, n_bytes=84, idle_age=63991, priority=3,arp,dl_vlan=2,arp_tpa=10.0.2.1 actions=drop
 cookie=0x0, duration=67599.22s, table=1, n_packets=123, n_bytes=10687, idle_age=1741, hard_age=65534, priority=0 actions=resubmit(,2)

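  Table 0 is matched first, then table 1, which rewrites the source MAC address (that of the other qr port) to a globally unique MAC bound to the compute node.

  This globally unique, per-compute-node MAC address is allocated by Neutron. In the database you can see that there is one such MAC per host: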

  

  Its base MAC can be configured in neutron.conf:

  

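  In addition, the higher-priority table 1 flows drop ARP requests whose target IP belongs to an interface_distributed port and packets whose destination MAC is that of an interface_distributed port, so that packets a VM sends to its local gateway are not forwarded out onto the network.

  Lookup then continues in table 2, the VXLAN table: broadcast and multicast packets are resubmitted to table 22, unicast packets to table 20.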
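
  The priority ordering in table 1 can be summarised with a small decision function. This is only a sketch of the flows in the dump above, with the gateway IP/MAC pairs and the host MAC copied from it; the function name and the packet dictionary layout are assumptions for the example.

```python
# Local VLAN -> (gateway IP, gateway MAC), copied from the table 1 flows above.
GATEWAYS = {
    1: ("10.0.1.1", "fa:16:3e:66:13:af"),
    2: ("10.0.2.1", "fa:16:3e:69:b4:05"),
}
# Per-host DVR MAC that table 1 writes into the source field.
DVR_HOST_MAC = "fa:16:3f:fe:49:e9"


def br_tun_table1(pkt):
    """pkt: dict with vlan, src_mac, dst_mac and optionally arp_tpa. Returns (action, pkt)."""
    gateway = GATEWAYS.get(pkt["vlan"])
    if gateway is not None:
        gw_ip, gw_mac = gateway
        if pkt.get("arp_tpa") == gw_ip:   # priority 3: ARP for the distributed gateway
            return "drop", pkt
        if pkt["dst_mac"] == gw_mac:      # priority 2: frame addressed to the gateway
            return "drop", pkt
        if pkt["src_mac"] == gw_mac:      # priority 1: routed frame leaving the node
            return "resubmit(,2)", dict(pkt, src_mac=DVR_HOST_MAC)
    return "resubmit(,2)", pkt            # priority 0: everything else


routed = {"vlan": 2, "src_mac": "fa:16:3e:69:b4:05", "dst_mac": "fa:16:3e:51:99:b8"}
print(br_tun_table1(routed))
```

  Hiding the gateway MAC behind a per-host MAC matters because the same qr MAC exists on every compute node that hosts the router.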

cookie=0x0, duration=67601.554s, table=2, n_packets=176, n_bytes=16981, idle_age=20810, hard_age=65534, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
 cookie=0x0, duration=67601.406s, table=2, n_packets=92, n_bytes=7876, idle_age=1741, hard_age=65534, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22)

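  The broadcast MAC address is FF:FF:FF:FF:FF:FF, and IPv4 multicast MAC addresses start with 01-00-5E (see http://book.51cto.com/art/200904/120471.htm for details). The match is written as value/mask, much like CIDR notation: with the mask 01:00:00:00:00:00 only the lowest bit of the first octet is tested, which is 1 for broadcast and multicast and 0 for unicast.

  The ICMP packet is unicast, so table 20 is consulted. Because L2 population (l2pop) is enabled, table 20 already contains pre-populated entries that say which VTEP the packet should be forwarded to: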
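
  To make the value/mask match concrete, here is a minimal sketch of the table 2 decision. The helper names are made up for the example; the mask and the table numbers come from the two flows above.

```python
def mac_to_int(mac):
    """Convert aa:bb:cc:dd:ee:ff (or aa-bb-...) to an integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def masked_match(mac, value, mask):
    """True when mac equals value on every bit that is set in mask."""
    m = mac_to_int(mask)
    return (mac_to_int(mac) & m) == (mac_to_int(value) & m)


def br_tun_table2(dst_mac):
    """Return the table the packet is resubmitted to: 20 for unicast, 22 otherwise."""
    mask = "01:00:00:00:00:00"  # only the group bit of the first octet
    if masked_match(dst_mac, "00:00:00:00:00:00", mask):
        return 20
    return 22


for mac in ("fa:16:3e:51:99:b8", "ff:ff:ff:ff:ff:ff", "01:00:5e:00:00:01"):
    print(mac, "-> table", br_tun_table2(mac))
```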

cookie=0x0, duration=64015.308s, table=20, n_packets=0, n_bytes=0, idle_age=64015, priority=2,dl_vlan=2,dl_dst=fa:16:3e:51:99:b8 actions=strip_vlan,set_tunnel:0x3eb,output:4

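  (Q2: In upstream Neutron, how are the tunnel ports under br-tun associated with the physical interface?)

  5) Into compute node 2 -> br-tun

  In br-tun, packets arriving from outside are first matched against the following table 0 flows: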

 cookie=0x0, duration=66293.658s, table=0, n_packets=31, n_bytes=3936, idle_age=22651, hard_age=65534, priority=1,in_port=3 actions=resubmit(,4)
 cookie=0x0, duration=69453.368s, table=0, n_packets=103, n_bytes=9360, idle_age=22651, hard_age=65534, priority=1,in_port=1 actions=resubmit(,1)
 cookie=0x0, duration=66292.808s, table=0, n_packets=20, n_bytes=1742, idle_age=3598, hard_age=65534, priority=1,in_port=4 actions=resubmit(,4)
 cookie=0x0, duration=69455.675s, table=0, n_packets=5, n_bytes=390, idle_age=65534, hard_age=65534, priority=0 actions=drop

  Table 4 maps the VNI to the corresponding local VLAN ID and then resubmits to table 9:

 cookie=0x0, duration=65937.871s, table=4, n_packets=32, n_bytes=3653, idle_age=22651, hard_age=65534, priority=1,tun_id=0x3eb actions=mod_vlan_vid:3,resubmit(,9)
 cookie=0x0, duration=66294.732s, table=4, n_packets=19, n_bytes=2025, idle_age=3598, hard_age=65534, priority=1,tun_id=0x3e9 actions=mod_vlan_vid:2,resubmit(,9)
 cookie=0x0, duration=69455.115s, table=4, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=0 actions=drop

  In table 9, if the packet's source MAC is one of the globally unique per-compute-node MAC addresses, the packet is forwarded to br-int:

cookie=0x0, duration=69453.507s, table=9, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1,dl_src=fa:16:3f:fe:49:e9 actions=output:1
 cookie=0x0, duration=69453.782s, table=9, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1,dl_src=fa:16:3f:72:3f:a7 actions=output:1
 cookie=0x0, duration=69453.23s, table=9, n_packets=56, n_bytes=6028, idle_age=3598, hard_age=65534, priority=0 actions=resubmit(,10)

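  6) br-tun -> br-int

  Once in br-int, table 0 sends packets whose source MAC is a globally unique per-compute-node MAC to table 1. Everything else is forwarded normally.

  Table 1 holds pre-installed flows: if the destination MAC is VM2's, the source MAC is rewritten to the MAC of the Net2 gateway (the qr port). (Q3: Why rewrite the source MAC? So that the reply can find its way back.)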

cookie=0x0, duration=70039.903s, table=0, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=2,in_port=6,dl_src=fa:16:3f:72:3f:a7 actions=resubmit(,1)
 cookie=0x0, duration=70039.627s, table=0, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=2,in_port=6,dl_src=fa:16:3f:fe:49:e9 actions=resubmit(,1)
 cookie=0x0, duration=70040.053s, table=0, n_packets=166, n_bytes=15954, idle_age=4184, hard_age=65534, priority=1 actions=NORMAL
 cookie=0x0, duration=66458.695s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=4,dl_vlan=3,dl_dst=fa:16:3e:51:99:b8 actions=strip_vlan,mod_dl_src:fa:16:3e:69:b4:05,output:12
 cookie=0x0, duration=66877.515s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=4,dl_vlan=2,dl_dst=fa:16:3e:14:b8:ec actions=strip_vlan,mod_dl_src:fa:16:3e:66:13:af,output:9
 cookie=0x0, duration=66877.369s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=2,ip,dl_vlan=2,nw_dst=10.0.1.0/24 actions=strip_vlan,mod_dl_src:fa:16:3e:66:13:af,output:9
 cookie=0x0, duration=66458.559s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=2,ip,dl_vlan=3,nw_dst=10.0.2.0/24 actions=strip_vlan,mod_dl_src:fa:16:3e:69:b4:05,output:12
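
  Putting steps 4 to 6 together, the frame's source MAC, VLAN tag and tunnel ID change as follows. This is a hand-written sketch of the flows shown above for the VM1 -> VM2 ping, not agent code: the Frame class and the function names are invented, while the MACs, VLAN IDs and tunnel ID are copied from the dumps.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    vlan: Optional[int] = None
    vni: Optional[int] = None


QR_NET2_MAC = "fa:16:3e:69:b4:05"    # qr port of Net2 (same MAC on every node)
VM2_MAC = "fa:16:3e:51:99:b8"
NODE1_DVR_MAC = "fa:16:3f:fe:49:e9"  # per-host MAC of compute node 1


def node1_br_tun(frame):
    # table 1: replace the qr MAC with the host's DVR MAC
    frame = replace(frame, src_mac=NODE1_DVR_MAC)
    # table 20: strip the local VLAN (2 on node 1) and set the tunnel ID
    return replace(frame, vlan=None, vni=0x3EB)


def node2_br_tun(frame):
    # table 4: tunnel ID -> local VLAN on node 2; table 9 then outputs to br-int
    local_vlan = {0x3EB: 3, 0x3E9: 2}[frame.vni]
    return replace(frame, vlan=local_vlan, vni=None)


def node2_br_int(frame):
    # table 0 matches the DVR source MAC; table 1 restores the gateway MAC
    if frame.src_mac == NODE1_DVR_MAC and frame.vlan == 3 and frame.dst_mac == VM2_MAC:
        return replace(frame, src_mac=QR_NET2_MAC, vlan=None)
    return frame


after_routing = Frame(src_mac=QR_NET2_MAC, dst_mac=VM2_MAC, vlan=2)
on_the_wire = node1_br_tun(after_routing)
delivered = node2_br_int(node2_br_tun(on_the_wire))
print(on_the_wire)
print(delivered)
```

  Note that the local VLAN ID for Net2 differs between the two nodes (2 on node 1, 3 on node 2); only the tunnel ID 0x3eb is shared.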

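  7) br-int -> VM2

  At this point VM2 receives VM1's packet. As the walkthrough shows, east-west traffic between subnets never passes through the network node.

  2.2 North-south traffic (VM has a floating IP)

  VM1 (local IP: 10.0.1.5, floating IP: 172.24.4.5) pings 8.8.8.8

  1) VM1 (10.0.1.5) -> qr (10.0.1.1)

    Same as above.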

  2) qr (10.0.1.1) -> rfp (169.254.31.28) -> fpr (169.254.31.29)

  Inside the qrouter namespace:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip rule  
0: from all lookup local  
32766: from all lookup main  
32767: from all lookup default  
32768: from 10.0.1.5 lookup 16  
32769: from 10.0.2.3 lookup 16  
167772417: from 10.0.1.1/24 lookup 167772417  
167772417: from 10.0.1.1/24 lookup 167772417  
167772673: from 10.0.2.1/24 lookup 167772673

  The main table has no suitable route:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip route list table main  
10.0.1.0/24 dev qr-ddbdc784-d7 proto kernel scope link src 10.0.1.1  
10.0.2.0/24 dev qr-001d0ed9-01 proto kernel scope link src 10.0.2.1  
169.254.31.28/31 dev rfp-0fbb351e-a proto kernel scope link src 169.254.31.28

  Because the packet comes from 10.0.1.5, table 16 is consulted next, and the packet hits this route:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip route list table 16  
default via 169.254.31.29 dev rfp-0fbb351e-a
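
  The rule-by-rule lookup can be modelled in a few lines. The sketch below is a simplified model of Linux policy routing, assuming rules are tried in ascending priority and a table without a matching route simply falls through to the next rule; the rule and route data are copied from the listings above, the local table is left out, and the function name lookup is made up.

```python
import ipaddress

# (priority, source selector, table) from `ip rule`; the duplicated rule is listed once.
RULES = [
    (0, "0.0.0.0/0", "local"),
    (32766, "0.0.0.0/0", "main"),
    (32767, "0.0.0.0/0", "default"),
    (32768, "10.0.1.5/32", "16"),
    (32769, "10.0.2.3/32", "16"),
    (167772417, "10.0.1.1/24", "167772417"),
    (167772673, "10.0.2.1/24", "167772673"),
]

# table -> [(prefix, device, next hop)]; tables not listed here are treated as empty.
TABLES = {
    "main": [
        ("10.0.1.0/24", "qr-ddbdc784-d7", None),
        ("10.0.2.0/24", "qr-001d0ed9-01", None),
        ("169.254.31.28/31", "rfp-0fbb351e-a", None),
    ],
    "16": [("0.0.0.0/0", "rfp-0fbb351e-a", "169.254.31.29")],
}


def lookup(src, dst):
    """Return (table, device, next_hop) of the first rule whose table has a route to dst."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for _priority, selector, table in sorted(RULES):
        if src_ip not in ipaddress.ip_network(selector, strict=False):
            continue
        matches = [
            (ipaddress.ip_network(prefix), dev, via)
            for prefix, dev, via in TABLES.get(table, [])
            if dst_ip in ipaddress.ip_network(prefix)
        ]
        if matches:
            _net, dev, via = max(matches, key=lambda m: m[0].prefixlen)
            return table, dev, via
    return None


print(lookup("10.0.1.5", "10.0.2.5"))  # east-west: answered by main
print(lookup("10.0.1.5", "8.8.8.8"))   # north-south with floating IP: table 16
```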

  After routing, SNAT is applied in netfilter's POSTROUTING chain:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa iptables -nvL -t nat
...
Chain neutron-l3-agent-float-snat (1 references)
 pkts bytes target prot opt in out source destination
    0 0 SNAT all -- * * 10.0.2.3 0.0.0.0/0 to:172.24.4.7
    0 0 SNAT all -- * * 10.0.1.5 0.0.0.0/0 to:172.24.4.5
...

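  The packet is then sent through rfp-0fbb351e-a to 169.254.31.29.

  The ports rfp-0fbb351e-a and fpr-0fbb351e-a form a veth pair. The fpr end is visible in the fip namespace.

  3) fpr (169.254.31.29) -> fg (172.24.4.6)

  Once in the fip namespace, a route lookup is performed. The main table has a default route to the public network: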

# ip netns exec fip-fbd46644-c70f-4227-a414-862a00cbd1d2 ip route  
default via 172.24.4.1 dev fg-081d537b-06  
169.254.31.28/31 dev fpr-0fbb351e-a proto kernel scope link src 169.254.31.29  
172.24.4.0/24 dev fg-081d537b-06 proto kernel scope link src 172.24.4.6  
172.24.4.5 via 169.254.31.28 dev fpr-0fbb351e-a  
172.24.4.7 via 169.254.31.28 dev fpr-0fbb351e-a

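  The packet is sent out through fg-081d537b-06 to br-ex. That is the path from the VM to the public network. (Q4: What do the flow tables on br-ex look like? If there were no br-ex and traffic went straight through br-int, how would the flow tables change?)

  

  External network pings VM1 (floating IP: 172.24.4.5)

  1) fip namespace

  Here the fip namespace acts as an ARP proxy:

  (Q5: What is the ARP proxy for? An external ARP broadcast enters the fip namespace asking for the MAC address of 172.24.4.5. ARP cannot cross a router, and that IP is configured in the qrouter namespace, so the fip namespace has to answer on its behalf.)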

# ip netns exec fip-fbd46644-c70f-4227-a414-862a00cbd1d2 sysctl net.ipv4.conf.fg-081d537b-06.proxy_arp  
net.ipv4.conf.fg-081d537b-06.proxy_arp = 1

  Proxy ARP is enabled on the interface, and the following routes exist for the floating IPs:

# ip netns exec fip-fbd46644-c70f-4227-a414-862a00cbd1d2 ip route  
...
172.24.4.5 via 169.254.31.28 dev fpr-0fbb351e-a
172.24.4.7 via 169.254.31.28 dev fpr-0fbb351e-a
...

  The lookup follows the veth pair into the IR (Inter Router) namespace, where the interface rfp-0fbb351e-a is configured with the floating IPs:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default  
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
    inet 127.0.0.1/8 scope host lo 
       valid_lft forever preferred_lft forever 
    inet6 ::1/128 scope host  
       valid_lft forever preferred_lft forever 
2: rfp-0fbb351e-a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ea:5c:56:9a:36:9c brd ff:ff:ff:ff:ff:ff
    inet 169.254.31.28/31 scope global rfp-0fbb351e-a
       valid_lft forever preferred_lft forever
    inet 172.24.4.5/32 brd 172.24.4.5 scope global rfp-0fbb351e-a
       valid_lft forever preferred_lft forever
    inet 172.24.4.7/32 brd 172.24.4.7 scope global rfp-0fbb351e-a
       valid_lft forever preferred_lft forever
    inet6 fe80::e85c:56ff:fe9a:369c/64 scope link 
       valid_lft forever preferred_lft forever 
17: qr-ddbdc784-d7: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default  
    link/ether fa:16:3e:66:13:af brd ff:ff:ff:ff:ff:ff 
    inet 10.0.1.1/24 brd 10.0.1.255 scope global qr-ddbdc784-d7 
       valid_lft forever preferred_lft forever 
    inet6 fe80::f816:3eff:fe66:13af/64 scope link  
       valid_lft forever preferred_lft forever 
19: qr-001d0ed9-01: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default  
    link/ether fa:16:3e:69:b4:05 brd ff:ff:ff:ff:ff:ff 
    inet 10.0.2.1/24 brd 10.0.2.255 scope global qr-001d0ed9-01 
       valid_lft forever preferred_lft forever 
    inet6 fe80::f816:3eff:fe69:b405/64 scope link  
       valid_lft forever preferred_lft forever

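  The fip namespace therefore answers ARP requests for this floating IP.

  When an external host sends a request to the floating IP, the fip namespace forwards it to the IR, whose PREROUTING chain contains the following rules: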
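
  The proxy ARP decision can be approximated as: answer when proxy_arp is on and the target IP is routed out of a different interface than the one the request arrived on. The sketch below is a simplified model of that kernel behaviour, not the kernel's actual logic in full; the routes are copied from the fip namespace listing and the function names are made up.

```python
import ipaddress

# (prefix, device) from `ip route` in the fip namespace.
FIP_ROUTES = [
    ("0.0.0.0/0", "fg-081d537b-06"),
    ("169.254.31.28/31", "fpr-0fbb351e-a"),
    ("172.24.4.0/24", "fg-081d537b-06"),
    ("172.24.4.5/32", "fpr-0fbb351e-a"),
    ("172.24.4.7/32", "fpr-0fbb351e-a"),
]


def route_dev(dst):
    """Longest-prefix match; returns the outgoing device or None."""
    dst_ip = ipaddress.ip_address(dst)
    best = None
    for prefix, dev in FIP_ROUTES:
        net = ipaddress.ip_network(prefix)
        if dst_ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, dev)
    return best[1] if best else None


def should_proxy_arp(in_dev, target_ip, proxy_arp_enabled=True):
    """Simplified rule: reply only if the target is reachable via another interface."""
    if not proxy_arp_enabled:
        return False
    out_dev = route_dev(target_ip)
    return out_dev is not None and out_dev != in_dev


print(should_proxy_arp("fg-081d537b-06", "172.24.4.5"))  # floating IP behind fpr
print(should_proxy_arp("fg-081d537b-06", "172.24.4.9"))  # some other host on the external net
```

  This is why the /32 routes for the floating IPs matter: without them, 172.24.4.5 would be routed back out of the fg port and no proxy reply would be sent.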

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa iptables -nvL -t nat
...
Chain neutron-l3-agent-PREROUTING (1 references)
 pkts bytes target prot opt in out source destination
    0 0 REDIRECT tcp -- * * 0.0.0.0/0 169.254.169.254 tcp dpt:80 redir ports 9697
    0 0 DNAT all -- * * 0.0.0.0/0 172.24.4.7 to:10.0.2.3
    0 0 DNAT all -- * * 0.0.0.0/0 172.24.4.5 to:10.0.1.5
...

  The DNAT rule translates the floating IP into the internal address, after which a route lookup is performed:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip route  
10.0.1.0/24 dev qr-ddbdc784-d7 proto kernel scope link src 10.0.1.1  
10.0.2.0/24 dev qr-001d0ed9-01 proto kernel scope link src 10.0.2.1  
169.254.31.28/31 dev rfp-0fbb351e-a proto kernel scope link src 169.254.31.28

  The destination is in 10.0.1.0/24, so the packet is forwarded out of qr-ddbdc784-d7, then on to br-int and finally to the VM.
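
  The floating IP handling is a one-to-one address mapping applied in both directions. A minimal sketch, using the two mappings from the iptables listings (the function names are invented for the example):

```python
# floating IP -> fixed IP, from the DNAT rules in neutron-l3-agent-PREROUTING.
FLOATING_IPS = {
    "172.24.4.5": "10.0.1.5",
    "172.24.4.7": "10.0.2.3",
}
# fixed IP -> floating IP, the inverse mapping used by neutron-l3-agent-float-snat.
FIXED_TO_FLOATING = {fixed: floating for floating, fixed in FLOATING_IPS.items()}


def dnat(dst_ip):
    """Inbound: rewrite a floating destination to the VM's fixed IP."""
    return FLOATING_IPS.get(dst_ip, dst_ip)


def snat(src_ip):
    """Outbound: rewrite the VM's fixed source IP to its floating IP, if it has one."""
    return FIXED_TO_FLOATING.get(src_ip, src_ip)


print(dnat("172.24.4.5"))  # inbound ping to VM1
print(snat("10.0.1.5"))    # VM1's outbound ping
print(snat("10.0.1.7"))    # a VM without a floating IP is left unchanged here
```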

 

  2.3 North-south traffic (VM has no floating IP)

  When the VM has no floating IP, packets from the VM first reach the IR, which performs a route lookup:

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip rule  
0: from all lookup local  
32766: from all lookup main  
32767: from all lookup default  
32768: from 10.0.1.5 lookup 16  
32769: from 10.0.2.3 lookup 16  
167772417: from 10.0.1.1/24 lookup 167772417   
167772673: from 10.0.2.1/24 lookup 167772673

  The main table is consulted first, then table 167772417. (Q7: Why is table 16 not matched?)

# ip netns exec qrouter-0fbb351e-a65b-4790-a409-8fb219ce16aa ip route list table 167772417  
default via 10.0.1.6 dev qr-ddbdc784-d7
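
  The large table numbers are not arbitrary: read as 32-bit integers they decode to the gateway addresses of the two subnets. The check below only shows the arithmetic; that Neutron derives the table ID this way is an inference from the listing, not something the output itself states.

```python
import ipaddress


def table_id_to_ip(table_id):
    """Interpret a routing table ID as a 32-bit IPv4 address."""
    return str(ipaddress.IPv4Address(table_id))


for table_id in (167772417, 167772673):
    print(table_id, "->", table_id_to_ip(table_id))
```

  As for Q7, the rules pointing at table 16 select on a single source address (10.0.1.5 and 10.0.2.3, the VMs with floating IPs), so a packet from any other VM skips them and falls through to the per-subnet rule.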

  This table forwards the packet to 10.0.1.6, which is the router_centralized_snat port on the network node.

  That port lives in the snat namespace on the network node, whose NAT rules look like this:

$ sudo ip netns exec snat-0fbb351e-a65b-4790-a409-8fb219ce16aa iptables -nvL -t nat
...
Chain neutron-l3-agent-snat (1 references)
 pkts bytes target prot opt in out source destination
    0 0 SNAT all -- * * 10.0.1.0/24 0.0.0.0/0 to:172.24.4.4
    0 0 SNAT all -- * * 10.0.2.0/24 0.0.0.0/0 to:172.24.4.4
...

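  This works like the legacy L3 agent: packets from VMs without a floating IP are SNATed to 172.24.4.4 (the DVR router's gateway port). The rest of the process is the same as with legacy L3 and is not repeated here.

  Reference: http://www.sxt.cn/u/756/blog/3168

3. Q&A

  (To be continued)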
