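One day, after an update, we found that a service was receiving requests but the client never got the response. Fortunately, a colleague on the client side could reproduce the problem in the test environment. We bisected to find the first faulty version and rolled it back. The faulty version changed nothing but snd_buf, which in theory should not cause this problem.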
15:27:52.589785 IP (tos 0x0, ttl 52, id 15592, offset 0, flags [DF], proto TCP (6), length 187)
    106.11.0.208.26082 > XXX.12345: Flags [P.], cksum 0xfe51 (correct), seq 51:186, ack 104, win 347, options [nop,nop,TS val 94902541 ecr 3888443977], length 135
E...<.@.4..Aj.......e.09..1........[.Q..... .......I....s...K......}..[..KG.....N...r?[T....\"...U@q.3.O*M..5..........e...hx.1...43K7#.......<Gu..O..&qNX.o .....Z......?.s....q....N*.y..
15:27:52.632302 IP (tos 0x0, ttl 63, id 13795, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.12345 > 106.11.0.208.26082: Flags [.], cksum 0x6b54 (correct), seq 104, ack 186, win 224, options [nop,nop,TS val 3888446989 ecr 94902541], length 0
E..45.@.?.......j...09e.......2(....kT..... .......
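The server called :gen_tcp.send, and the call returned :ok. So why did the packet capture show no outgoing data?
Erlang version: OTP-21.0
We eventually traced the send path to inet_drv.c:10902, where we can see that Erlang ultimately calls writev.
Use ltrace to trace the writev calls.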
ltrace -p 343 -e writev

<... writev resumed> ) = 4
<... writev resumed> ) = 4
exe->writev(233, 0xe697e9d0, 2, 247) = 213
exe->writev(249, 0xe697e9d0, 2, 10) = 0x5166
exe->writev(19, 0xe687c800, 2, 0x2b5b82c0) = 34
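writev was indeed called and returned 0x5166 bytes, which matches the 20834 bytes passed to send in the code (0x5166 = 20838 = 20834 + 4; the extra 4 bytes are the length header that Erlang adds automatically).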
inet_drv.c:10856
    case TCP_PB_4:
        put_int32(len, buf);
        h_len = 4;
        break;
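The length arithmetic can be checked with a small sketch. It assumes the socket was opened with `{packet, 4}`, the mode handled by the `TCP_PB_4` branch, so a 4-byte big-endian length is prepended to every message. The helper name `frame_packet4` is made up for illustration and is not part of Erlang or inet_drv.

```python
import struct

def frame_packet4(payload: bytes) -> bytes:
    """Prepend a 4-byte big-endian length, mimicking Erlang's {packet, 4} framing."""
    return struct.pack(">I", len(payload)) + payload

payload = b"x" * 20834            # same size as the message sent in the post
framed = frame_packet4(payload)
print(len(framed), hex(len(framed)))   # 20838 0x5166
```

The total matches the value returned by writev in the ltrace output, so the driver handed the whole message to the kernel.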
Next, we tried running the following Python code inside a docker container:
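import socket

# Listen on port 12345 and accept a single connection.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('0.0.0.0', 12345))
s.listen(5)
conn, addr = s.accept()

# Non-blocking socket with a small send buffer, then send a large payload.
conn.setblocking(False)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
conn.send(b"a" * 40960)  # bytes literal so the script also runs on Python 3
conn.close()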
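Telnetting in from another container, the response arrived, so the problem could not be reproduced this way.
Running the same script directly outside the container and telnetting in from the public network did reproduce the problem. So docker was ruled out.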
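Telnet only shows whether something arrives. For comparing environments, a client that counts the received bytes may be easier to read. This is a minimal sketch, not the tool used in the post: `count_received` is a hypothetical helper, and the host and port are whatever the test script is listening on.

```python
import socket

def count_received(host: str, port: int, timeout: float = 5.0) -> int:
    """Connect, read until the peer closes, and return the number of bytes received."""
    total = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:         # empty read: the server closed the connection
                break
            total += len(chunk)
    return total

# Example call against the test script above (port 12345):
#   print(count_received("127.0.0.1", 12345))
```

Because the script sends on a non-blocking socket with a small send buffer, the count may well be less than 40960 even on a healthy kernel. A count of 0, or a timeout, would presumably correspond to the failure described here.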
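With the same Python script, one QA environment running kernel 4.15.0-1026-gcp could not reproduce the problem.
Another QA environment running kernel 4.15.0-1034-gcp could reproduce it. So we began to suspect the Linux kernel.
Linux v4.15 went through 9 rc releases, and the final release was on Jan 29, 2018. In theory, a bug like this would not go unreported for that long.
So what is gcp?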
Software Description
    linux - Linux kernel
    linux-aws - Linux kernel for Amazon Web Services (AWS) systems
    linux-gcp - Linux kernel for Google Cloud Platform (GCP) systems
    linux-kvm - Linux kernel for cloud environments
    linux-raspi2 - Linux kernel for Raspberry Pi 2
    linux-snapdragon - Linux kernel for Snapdragon processors
    linux-azure - Linux kernel for Microsoft Azure Cloud systems
    linux-hwe - Linux hardware enablement (HWE) kernel
    linux-oem - Linux kernel for OEM processors
    linux-oracle - Linux kernel for Oracle Cloud systems
    linux-aws-hwe - Linux kernel for Amazon Web Services (AWS-HWE) systems
My guess is that the Linux kernel carries some private modifications for Google Cloud.
So we tested the kernel versions one by one, going upward from 1026:
root@XXX:/home/dingxinglong# apt-cache search gcp | grep 4.15.0 | grep signed
linux-image-unsigned-4.15.0-1018-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1019-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1021-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1023-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1024-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1027-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1028-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1029-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1030-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1032-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1033-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1036-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1037-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1040-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1041-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1042-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1044-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1046-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1047-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
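To turn a listing like this into the sequence of builds to test, the build numbers can be pulled out with a regular expression. This is only a sketch of that bookkeeping step; `gcp_builds_from` is a hypothetical helper, and the sample text is a few lines copied from the output above.

```python
import re

# Matches package names such as linux-image-unsigned-4.15.0-1034-gcp.
PATTERN = re.compile(r"^linux-image-unsigned-4\.15\.0-(\d+)-gcp\b")

def gcp_builds_from(apt_output: str, start: int = 1026) -> list:
    """Return the sorted gcp build numbers >= start found in apt-cache output."""
    builds = set()
    for line in apt_output.splitlines():
        match = PATTERN.match(line.strip())
        if match and int(match.group(1)) >= start:
            builds.add(int(match.group(1)))
    return sorted(builds)

sample = """\
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
"""
print(gcp_builds_from(sample))   # [1026, 1034]
```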
The problem does turn out to depend on the Linux kernel image:
4.15.0-1026-gcp OK
4.15.0-1027-gcp OK
4.15.0-1028-gcp OK
4.15.0-1029-gcp OK
4.15.0-1030-gcp OK
4.15.0-1032-gcp OK
4.15.0-1033-gcp OK
4.15.0-1034-gcp BUG
4.15.0-1036-gcp OK
On the same machine, 1033 was OK, upgrading to 1034 showed the BUG, and upgrading to 1036 was OK again.
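The results can be turned into a quick check of the running kernel. The table below encodes only the builds tested in this post, so anything else is reported as untested rather than assumed safe; `gcp_status` is a hypothetical helper name.

```python
import platform
import re

# Results observed in this post only; other builds were not tested here.
TESTED = {1026: "OK", 1027: "OK", 1028: "OK", 1029: "OK", 1030: "OK",
          1032: "OK", 1033: "OK", 1034: "BUG", 1036: "OK"}

def gcp_status(release: str) -> str:
    """Map a kernel release string such as '4.15.0-1034-gcp' to the test result."""
    match = re.fullmatch(r"4\.15\.0-(\d+)-gcp", release)
    if match is None:
        return "not a 4.15.0 gcp kernel"
    return TESTED.get(int(match.group(1)), "untested")

# platform.release() returns the same string as `uname -r`.
print(platform.release(), "->", gcp_status(platform.release()))
```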
The corresponding bug can be found in the changelog:
http://changelogs.ubuntu.com/...
The bug was introduced on:
-- Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Thu, 06 Jun 2019 11:07:33 -0300
The bug fix is dated:
-- Kleber Sacilotto de Souza <kleber.souza@canonical.com> Mon, 24 Jun 2019 14:48:10 +0200
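For a sense of how long the faulty build was the newest one, the two changelog timestamps can be subtracted. This sketch assumes, as stated above, that they are the signature lines of the build that introduced the bug and the build that fixed it.

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the two changelog signature lines above.
introduced = parsedate_to_datetime("Thu, 06 Jun 2019 11:07:33 -0300")
fixed = parsedate_to_datetime("Mon, 24 Jun 2019 14:48:10 +0200")

# Both values carry a UTC offset, so the subtraction accounts for the time zones.
window = fixed - introduced
print(window)   # 17 days, 22:40:37
```

That is just under 18 days between the two changelog entries.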
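I could not find the code corresponding to the bug fix. My guess is that it is not open source. Maybe this is why Google Cloud can't beat Amazon?
For reference, here is the command line for changing the kernel version.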
apt-get install -y linux-image-4.15.0-1034-gcp linux-headers-4.15.0-1034-gcp linux-modules-4.15.0-1034-gcp linux-modules-extra-4.15.0-1034-gcp