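I have been testing k8s lately. If you know Docker or have worked with it, you know that most of Docker networking comes down to bridging, routing, and iptables. If you have also worked with k8s and understand how it works underneath, you know that kube-proxy pushes iptables about as far as it will go. You may think of the usual troubleshooting tools, such as the packet capture tools I used to rely on, or route tracing tools, but they are a poor fit for an environment this complex. A packet bounces between several physical or virtual NICs on the same host, and a large number of iptables rules and routes move it around inside kernel space. Packet capture tools cannot see any of that, and traceroute cannot trace the route taken by a packet with a specific src, dst, and port.
Here are some newer troubleshooting tools:
Speaking of iptables troubleshooting, I have to bring out this very clearly laid-out diagram. I recommend keeping it at hand whenever you debug iptables, so you can follow the path a packet takes. I used to think that no routing decision was involved as a packet moved from one table to the next. Real troubleshooting, combined with this diagram, showed me that a packet may pass through a routing decision between tables.
Please refer to the troubleshooting process in this K8s issue of mine.
Next I have to mention the iptables TRACE target. Before I knew about it I used the LOG target, and found that even after writing many iptables rules I still could not be sure of tracing every rule a packet passed through, or how each rule handled it.
My demonstration is on Ubuntu: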
######### Make sure the TRACE-related module is loaded
modprobe nf_log_ipv4
########## The TRACE target can only be used in the raw table
sudo iptables -t raw -I PREROUTING -p tcp -m tcp --dport 8081 -j TRACE
sudo iptables -t raw -I OUTPUT -p tcp -m tcp --dport 8081 -j TRACE
########### grep TRACE in /var/log/kern.log
grep TRACE /var/log/kern.log
ubuntu@ceph3:~$ grep TRACE /var/log/kern.log|grep 2213090174
May 8 16:30:29 ceph3 kernel: [324781.838361] TRACE: raw:OUTPUT:policy:2 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.43.206.251 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838389] TRACE: nat:OUTPUT:rule:1 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.43.206.251 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838417] TRACE: nat:KUBE-SERVICES:rule:9 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.43.206.251 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838439] TRACE: nat:KUBE-SVC-ZP4VKUJYTBCROZYY:rule:1 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.43.206.251 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838454] TRACE: nat:KUBE-SEP-OR6JECCPPINGGGRC:rule:2 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.43.206.251 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838479] TRACE: filter:OUTPUT:rule:1 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838493] TRACE: filter:KUBE-SERVICES:return:2 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838505] TRACE: filter:OUTPUT:rule:2 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838518] TRACE: filter:KUBE-FIREWALL:return:2 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838531] TRACE: filter:OUTPUT:rule:4 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838551] TRACE: filter:OUTPUT:policy:6 IN= OUT=enp3s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838564] TRACE: nat:POSTROUTING:rule:1 IN= OUT=enp5s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838577] TRACE: nat:KUBE-POSTROUTING:return:2 IN= OUT=enp5s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838589] TRACE: nat:POSTROUTING:rule:8 IN= OUT=enp5s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
May 8 16:30:29 ceph3 kernel: [324781.838609] TRACE: nat:POSTROUTING:policy:10 IN= OUT=enp5s0 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=57266 DF PROTO=TCP SPT=18130 DPT=8081 SEQ=2213090174 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04D5CCCC0000000001030307) UID=1000 GID=1000
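To explain: every trace line carries a label such as raw:OUTPUT:policy:2 or nat:OUTPUT:rule:1. The format is table:chain:type:rule number, where the type is rule for an explicitly configured rule, return for the implicit return at the end of a user-defined chain, and policy for the default policy of a built-in chain. I usually filter for a single packet with something like grep ID=57266.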
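The filtering described above can be taken one step further by printing only the table:chain:type:number label of each hop for one packet. The following is a minimal sketch, not part of the original workflow: the function name trace_path is made up for illustration, and it assumes the log format shown above, where the label is the field right after "TRACE:".

```shell
#!/usr/bin/env bash
# trace_path ID
# Reads kernel log lines on stdin and prints, in log order, the
# table:chain:type:number label of every TRACE line whose IP ID field
# equals the given ID. (Hypothetical helper, for illustration only.)
trace_path() {
  local id="$1"
  awk -v id="ID=${id}" '
    / TRACE: / {
      hit = 0; hop = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "TRACE:") hop = $(i + 1)  # label follows the TRACE: marker
        if ($i == id) hit = 1               # exact match on the ID= field
      }
      if (hit && hop != "") print hop
    }'
}
```

A possible invocation would be `trace_path 57266 < /var/log/kern.log`. Matching the whole ID= field avoids picking up other packets whose ID merely starts with the same digits, which a plain grep would do. Keep in mind that the IP ID is only 16 bits wide, so on a busy host it is worth combining it with the SEQ value as well.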
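Everything above only captures iptables logs. What if a routing problem sits in between? Take the log of my problem packet below: the packet reached nat:PREROUTING:policy:3 and then nothing followed. It should have continued into mangle:INPUT or filter:INPUT, but neither showed up. Looking at the packet flow diagram above, there is a routing decision at exactly this point. So next, let's see how to troubleshoot local routing.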
ubuntu@ceph2:~$ grep TRACE /var/log/kern.log|grep 1726587944
May 8 15:51:07 ceph2 kernel: [309854.514762] TRACE: raw:PREROUTING:policy:2 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514799] TRACE: nat:PREROUTING:rule:1 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514841] TRACE: nat:KUBE-SERVICES:rule:13 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514861] TRACE: nat:KUBE-NODEPORTS:return:1 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514881] TRACE: nat:KUBE-SERVICES:return:14 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514897] TRACE: nat:PREROUTING:rule:2 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514914] TRACE: nat:DOCKER:return:2 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
May 8 15:51:07 ceph2 kernel: [309854.514930] TRACE: nat:PREROUTING:policy:3 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
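One housekeeping note before moving on to routing: the two TRACE rules inserted earlier keep logging every matching packet until they are deleted. A small sketch for generating the matching commands follows. The function name trace_rule_cmds is hypothetical; it only prints the iptables commands (same match as the -I rules above, with the action swapped), so they can be reviewed before being run as root.

```shell
#!/usr/bin/env bash
# trace_rule_cmds ACTION PORT
# Prints the iptables commands for the two raw-table TRACE rules used in
# this post. ACTION is -I to insert or -D to delete; nothing is executed.
# (Hypothetical helper, for illustration only.)
trace_rule_cmds() {
  local action="$1" port="$2" chain
  for chain in PREROUTING OUTPUT; do
    printf 'iptables -t raw %s %s -p tcp -m tcp --dport %s -j TRACE\n' \
      "$action" "$chain" "$port"
  done
}
```

For example, `trace_rule_cmds -D 8081` prints the two delete commands, and piping that output into `sudo sh` would apply them. Printing first and executing second is a deliberate choice here, since a typo in a firewall command on a remote host is easy to regret.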
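A host usually has only a handful of local routes, so the traditional approach is to go through the route entries one by one and work out by hand where the packet ends up. If no route can handle it, the packet is dropped. On newer Linux versions, ip rule and ip route display and manipulate the routing tables. ip belongs to the iproute2 package, and after skimming its documentation my reaction was "wait, you can do that too?"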
######## ip rule to list ip route tables
ubuntu@ceph2:~$ ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
######## There are three tables here: local (priority 0) is consulted first, then main (32766), then default (32767)
######## List the routes in each table
ubuntu@ceph2:~$ ip route list table local
broadcast 10.0.1.0 dev enp5s0 proto kernel scope link src 10.0.1.12
local 10.0.1.12 dev enp5s0 proto kernel scope host src 10.0.1.12
broadcast 10.0.1.255 dev enp5s0 proto kernel scope link src 10.0.1.12
local 10.42.2.0 dev flannel.1 proto kernel scope host src 10.42.2.0
broadcast 10.42.2.0 dev cni0 proto kernel scope link src 10.42.2.1
local 10.42.2.1 dev cni0 proto kernel scope host src 10.42.2.1
broadcast 10.42.2.255 dev cni0 proto kernel scope link src 10.42.2.1
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 172.17.0.0 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
local 172.17.0.1 dev docker0 proto kernel scope host src 172.17.0.1
broadcast 172.17.255.255 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
broadcast 192.168.235.0 dev enp3s0 proto kernel scope link src 192.168.235.12
local 192.168.235.12 dev enp3s0 proto kernel scope host src 192.168.235.12
broadcast 192.168.235.255 dev enp3s0 proto kernel scope link src 192.168.235.12
ubuntu@ceph2:~$ ip route list table main
default via 192.168.235.2 dev enp3s0 onlink
10.0.1.0/24 dev enp5s0 proto kernel scope link src 10.0.1.12
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 dev cni0 proto kernel scope link src 10.42.2.1
10.42.3.0/24 via 10.42.3.0 dev flannel.1 onlink
10.42.4.0/24 via 10.42.4.0 dev flannel.1 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.235.0/24 dev enp3s0 proto kernel scope link src 192.168.235.12
ubuntu@ceph2:~$ ip route list table default
ubuntu@ceph2:~$
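As mentioned, the usual method is to go through the route entries one by one and check whether any of them matches a given packet. That requires knowing routing rules very well, and it is easy to miss something when doing it by hand. So here is a command for testing routes directly: ip route get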
###### Lazily excerpted from man 8 ip
ip route get - get a single route
    this command gets a single route to a destination and prints its contents exactly as the kernel sees it.
    to ADDRESS (default)    # the destination address.
    from ADDRESS            # the source address.
    tos TOS
    dsfield TOS             # TOS = the Type Of Service.
    iif NAME                # the device from which this packet is expected to arrive.
    oif NAME                # force the output device on which this packet will be routed.
Take this log line from the iptables tracing section above as an example:
May 8 15:51:07 ceph2 kernel: [309854.514930] TRACE: nat:PREROUTING:policy:3 IN=enp5s0 OUT= MAC=00:23:7d:5b:96:ec:00:21:5a:ef:39:fe:08:00 SRC=192.168.235.13 DST=10.0.1.12 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=28265 DF PROTO=TCP SPT=14024 DPT=8081 SEQ=1726587944 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A04CCCA410000000001030307)
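Suppose we want to find out which route this packet will match. We can use the command below. At first glance the result makes you wonder whether the command syntax is wrong and that is why it errors out. Then look at the result I got after consulting rp_filter_kernel_setting: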
ubuntu@ceph2:~$ ip route get from 192.168.235.13 to 10.0.1.12 iif enp5s0 tos 0x00
RTNETLINK answers: Invalid cross-device link
######### change rp_filter kernel setting
ubuntu@ceph2:~$ sudo bash
root@ceph2:~# echo 2 > /proc/sys/net/ipv4/conf/default/rp_filter
root@ceph2:~# echo 2 > /proc/sys/net/ipv4/conf/all/rp_filter
######### OK, after changing the rp_filter kernel parameter, the same command now returns a match.
root@ceph2:~# ip route get from 192.168.235.13 to 10.0.1.12 iif enp5s0 tos 0x00
local 10.0.1.12 from 192.168.235.13 dev lo
    cache <local> iif enp5s0
root@ceph2:~#
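The ip route get command above was assembled by hand from the SRC, DST, IN, and TOS fields of the trace line. As a rough sketch, the same mapping can be scripted. The function name route_get_cmd is made up for this example, and it assumes the TRACE log format shown earlier; it only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# route_get_cmd
# Reads TRACE log lines on stdin and prints, for each one, the
# "ip route get" command built from its SRC=, DST=, IN= and TOS= fields.
# The iif argument is left out when IN= is empty (locally generated packet).
# (Hypothetical helper, for illustration only.)
route_get_cmd() {
  awk '
    / TRACE: / {
      src = ""; dst = ""; iif = ""; tos = ""
      for (i = 1; i <= NF; i++) {
        if      ($i ~ /^SRC=/) src = substr($i, 5)
        else if ($i ~ /^DST=/) dst = substr($i, 5)
        else if ($i ~ /^IN=/)  iif = substr($i, 4)
        else if ($i ~ /^TOS=/) tos = substr($i, 5)
      }
      if (src == "" || dst == "") next
      cmd = "ip route get from " src " to " dst
      if (iif != "") cmd = cmd " iif " iif
      if (tos != "") cmd = cmd " tos " tos
      print cmd
    }'
}
```

Something like `grep ID=28265 /var/log/kern.log | tail -n 1 | route_get_cmd` would then print the command for the last hop that was logged, which is the point where the packet disappeared.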
At this point you may want to try a few more ip route get commands for practice and look at the output, for example:
ubuntu@ceph2:~$ ip route get from 172.18.0.3 to 10.0.1.12
RTNETLINK answers: Invalid argument
# This host has no way to route from 172.18.0.3 to 10.0.1.12
ubuntu@ceph2:~$ ip route get from 192.168.235.3 to 10.0.1.12
RTNETLINK answers: Invalid argument
# Nor can this host route from 192.168.235.3 to 10.0.1.12
ubuntu@ceph2:~$ ip route get from 192.168.235.12 to 10.0.1.12
# From the local NIC address 192.168.235.12, 10.0.1.12 is reached via lo
local 10.0.1.12 from 192.168.235.12 dev lo
    cache <local>
ubuntu@ceph2:~$ ip route get from 192.168.235.12 to 10.0.1.13
# From the local NIC address 192.168.235.12, 10.0.1.13 (another host) is reached via enp5s0
10.0.1.13 from 192.168.235.12 dev enp5s0
    cache
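Finally, let's go through what the fields of a route entry mean. For details, see [iproute2 doc].
Take this fairly long entry as an example: broadcast 10.0.1.0 dev enp5s0 proto kernel scope link src 10.0.1.12
broadcast 10.0.1.0: the first field is the route type, which can be broadcast, unicast, local, and so on. If it is omitted, the type is unicast. 10.0.1.0 is the destination.
dev enp5s0: packets leave through the NIC enp5s0.
via 10.42.3.0: you may see this in some entries. It means the next-hop gateway is 10.42.3.0.
proto kernel: the routing protocol is kernel, meaning the route was generated by the kernel.
scope link: the destination is valid only on this link.
src 10.0.1.12: the source IP to prefer when sending to this destination is 10.0.1.12. This address must exist on one of the local NICs.
onlink: pretend that the next-hop gateway is directly attached to this link.
OK, now that we are near the end, I have to mention Linux kernel parameters. The parameters that the kernel exposes for tuning tend to be the distilled essence of a lot of experience. Which raises some questions: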
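Q: How do I know which parameters I need?
A: The Linux kernel documentation describes these parameters in detail, so we can read it, for example the section on IP-related parameters, to find the ones we might need. My own approach is to search the internet for the kernel parameter names that look relevant, and then see which concrete problems other people solved with them.
Q: Where can I find the Linux kernel documentation?
A: On Ubuntu, for example, linux-doc is the documentation package for the current kernel. After installation the files live under /usr/share/doc/linux-doc/. You can run dpkg -L linux-doc to list the files and find the document you need. For example, zcat /usr/share/doc/linux-doc/networking/ip-sysctl.txt.gz shows the documentation for the networking-related kernel parameters.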
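When rp_filter is set to 2 (loose mode), the packet's src is checked against the routes of all NICs, and the packet is accepted if the source is reachable through any of them. When it is set to 1 (strict mode), the packet is dropped if the NIC it arrived on is not the best return path to its source.
net.ipv4.conf.default.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2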
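One detail that is not spelled out above and comes from the kernel's ip-sysctl documentation rather than from my own testing: for each interface, the kernel uses the larger of conf/all/rp_filter and conf/<interface>/rp_filter when validating the source. The sketch below computes that effective value. The function name effective_rp_filter is made up for this example, and it is a simplified model of the rule, not kernel code.

```shell
#!/usr/bin/env bash
# effective_rp_filter ALL_VALUE IFACE_VALUE
# Prints the rp_filter mode that applies to an interface, modelled as the
# larger of the "all" value and the per-interface value.
# (Hypothetical helper, for illustration only.)
effective_rp_filter() {
  local all="$1" iface="$2"
  if [ "$all" -gt "$iface" ]; then
    echo "$all"
  else
    echo "$iface"
  fi
}

# Possible use on a live system (enp5s0 is the interface from this post):
# effective_rp_filter "$(cat /proc/sys/net/ipv4/conf/all/rp_filter)" \
#                     "$(cat /proc/sys/net/ipv4/conf/enp5s0/rp_filter)"
```

This would explain why writing 2 into all takes effect immediately on interfaces that already exist, whereas the default value is only copied to interfaces created afterwards. In the case above, the packet from 192.168.235.13 arrived on enp5s0 while the route back to that source points at enp3s0, so strict mode rejects it and loose mode lets it through.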