Requests with no response caused by a TCP SACK bug in the 4.15.0-1034-gcp kernel

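Background

After an update one day, we found that a service was receiving requests but the client never received the response. Fortunately, the colleague working on the client could reproduce the problem in the test environment. We bisected to the first faulty version and rolled it back. The faulty version only changed snd_buf, which in theory should not cause this kind of problem.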
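
A minimal sketch of such a bisection, in Python. The names `first_bad` and `is_bad` and the release labels are hypothetical; the sketch assumes the versions are ordered oldest to newest, the first is known good, the last is known bad, and every version after the first faulty one also fails.

```python
from typing import Callable, Sequence

def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]

# Hypothetical release names; in practice is_bad() would deploy the release
# and replay the failing request against it.
releases = ["r1", "r2", "r3", "r4", "r5", "r6"]
print(first_bad(releases, lambda v: v >= "r4"))  # r4
```

Each probe halves the remaining range, so six releases need at most three deployments.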

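Investigation

Capturing packets to identify the faulty side

The capture showed that the server never sent the response: it ACKed the request, and no data followed.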

15:27:52.589785 IP (tos 0x0, ttl 52, id 15592, offset 0, flags [DF], proto TCP (6), length 187)
    106.11.0.208.26082 > XXX.12345: Flags [P.], cksum 0xfe51 (correct), seq 51:186, ack 104, win 347, options [nop,nop,TS val 94902541 ecr 3888443977], length 135
E...<.@.4..Aj.......e.09..1........[.Q.....
.......I....s...K......}..[..KG.....N...r?[T....\"...U@q.3.O*M..5..........e...hx.1...43K7#.......<Gu..O..&qNX.o .....Z......?.s....q....N*.y..
15:27:52.632302 IP (tos 0x0, ttl 63, id 13795, offset 0, flags [DF], proto TCP (6), length 52)
    XXX.12345 > 106.11.0.208.26082: Flags [.], cksum 0x6b54 (correct), seq 104, ack 186, win 224, options [nop,nop,TS val 3888446989 ecr 94902541], length 0
E..45.@.?.......j...09e.......2(....kT.....
.......

The server called :gen_tcp.send, and the send call returned :ok. So why did the response never show up in the capture?
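
One thing worth keeping in mind at this point: a successful send only means the kernel accepted the bytes into the socket's send buffer, not that they reached the wire or the peer. The sketch below illustrates this with plain Python sockets on the loopback interface (not Erlang, and not the setup from this incident). The peer never reads, yet send still reports success.

```python
import socket

# Listener on an ephemeral loopback port (illustrative setup only).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

# The client never calls recv(), but send() still succeeds: at this point
# the bytes are only queued in the kernel.
sent = server_side.send(b"x" * 1024)
print("send returned", sent)

server_side.close()
client.close()
listener.close()
```

So an :ok from the application layer does not rule out a problem further down the stack.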

Checking Erlang

Erlang version: OTP-21.0
We traced the send path down to inet_drv.c:10902, where Erlang ultimately calls writev.
We then used ltrace to trace the writev calls.

ltrace -p 343 -e writev

<... writev resumed> )                                                                                                                            = 4
<... writev resumed> )                                                                                                                            = 4
exe->writev(233, 0xe697e9d0, 2, 247)                                                                                                              = 213
exe->writev(249, 0xe697e9d0, 2, 10)                                                                                                               = 0x5166
exe->writev(19, 0xe687c800, 2, 0x2b5b82c0)                                                                                                        = 34

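writev was called and returned 0x5166, which matches the 20834 bytes passed to send in our code (0x5166 = 20838 = 20834 + 4; the extra 4 bytes are the length header that Erlang prepends automatically).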

inet_drv.c:10856
    case TCP_PB_4:
            put_int32(len, buf);
            h_len = 4;
            break;
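
The arithmetic can be checked with a short Python sketch of 4-byte length-prefixed framing. It mirrors what the TCP_PB_4 branch does (a big-endian 32-bit length in front of the payload); `frame_packet4` is a name made up for this example, not an Erlang or inet_drv function.

```python
import struct

def frame_packet4(payload: bytes) -> bytes:
    """Prefix payload with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(payload)) + payload

framed = frame_packet4(b"a" * 20834)
print(len(framed), hex(len(framed)))  # 20838 0x5166
```

The framed size equals the writev return value, so Erlang handed the complete message to the kernel.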

Checking Docker

We tried running the following Python code inside a Docker container:

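import socket

# Minimal server: accept one connection, shrink its send buffer, send 40 KB.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('0.0.0.0', 12345))
s.listen(5)
conn, addr = s.accept()
conn.setblocking(False)
# Small send buffer (the faulty release had changed snd_buf).
conn.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
# bytes literal so the script also runs on Python 3;
# a non-blocking send may queue only part of the data.
conn.send(b"a" * 40960)
conn.close()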

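Connecting with telnet from another container, the response arrived and the problem could not be reproduced.
Running the same script directly on the host, outside any container, and connecting with telnet from the public network did reproduce the problem. That ruled out Docker.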
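
For a check that is easier to script than telnet, a small client can count how many bytes actually arrive. This is a sketch under assumptions: `count_received` is a name invented here, and the host and port passed to it are placeholders for wherever the test server is listening.

```python
import socket

def count_received(host: str, port: int, timeout: float = 5.0) -> int:
    """Connect, read until the peer closes or the timeout expires,
    and return the number of bytes received."""
    total = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            try:
                chunk = sock.recv(65536)
            except socket.timeout:
                break  # nothing more arrived within the timeout
            if not chunk:
                break  # peer closed the connection
            total += len(chunk)
    return total
```

Calling `count_received(host, 12345)` against the server script returns the number of bytes that made it across; a result of 0 means nothing arrived at all.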

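Checking the Linux kernel

With the same Python script, one QA environment running kernel 4.15.0-1026-gcp could not reproduce the problem.
Another QA environment running kernel 4.15.0-1034-gcp could. That made the Linux kernel the prime suspect.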
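
A quick way to tell which of the two cases a machine falls into is to compare its running kernel release against the version that reproduced the problem. The sketch below uses Python's platform.release(); the AFFECTED set holds only the one release observed to be bad at this stage, and `is_affected` is a hypothetical helper.

```python
import platform

# Only the release that reproduced the problem at this stage.
AFFECTED = {"4.15.0-1034-gcp"}

def is_affected(release: str) -> bool:
    """True if the given kernel release string is in the known-bad set."""
    return release in AFFECTED

release = platform.release()  # on Linux, the same string as `uname -r`
print(release, "affected" if is_affected(release) else "not known to be affected")
```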

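Looking at the Linux kernel 4.15.0 commit log

Linux v4.15 went through 9 rc releases, and the final release came out on Jan 29, 2018. In theory, a bug like this should not have gone unreported for that long.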

gcp

So what is gcp?

Software Description

linux - Linux kernel
linux-aws - Linux kernel for Amazon Web Services (AWS) systems
linux-gcp - Linux kernel for Google Cloud Platform (GCP) systems
linux-kvm - Linux kernel for cloud environments
linux-raspi2 - Linux kernel for Raspberry Pi 2
linux-snapdragon - Linux kernel for Snapdragon processors
linux-azure - Linux kernel for Microsoft Azure Cloud systems
linux-hwe - Linux hardware enablement (HWE) kernel
linux-oem - Linux kernel for OEM processors
linux-oracle - Linux kernel for Oracle Cloud systems
linux-aws-hwe - Linux kernel for Amazon Web Services (AWS-HWE) systems

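Our guess was that the kernel for Google Cloud carries some private modifications to the Linux kernel.
So we tested the kernel versions one by one, starting from 1026 and working upward: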

root@XXX:/home/dingxinglong# apt-cache search gcp | grep 4.15.0 | grep signed
linux-image-unsigned-4.15.0-1018-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1019-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1021-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1023-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1024-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1027-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1028-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1029-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1030-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1032-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1033-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1036-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1037-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1040-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1041-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1042-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1044-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1046-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1047-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
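
Picking the candidate versions out of that listing can be scripted. The sketch below is an illustration only: it parses package names of the form shown in the listing with a regular expression and returns the -gcp build numbers at or above a starting point. `candidate_builds` is a made-up helper, and the sample text is a shortened copy of the listing.

```python
import re

PACKAGE_RE = re.compile(r"^linux-image-unsigned-4\.15\.0-(\d+)-gcp\b", re.MULTILINE)

def candidate_builds(listing: str, start: int) -> list:
    """Return sorted -gcp build numbers >= start found in apt-cache output."""
    builds = {int(m.group(1)) for m in PACKAGE_RE.finditer(listing)}
    return sorted(b for b in builds if b >= start)

sample = """\
linux-image-unsigned-4.15.0-1025-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1026-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1027-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
linux-image-unsigned-4.15.0-1034-gcp - Linux kernel image for version 4.15.0 on 64 bit x86 SMP
"""
print(candidate_builds(sample, 1026))  # [1026, 1027, 1034]
```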

The problem was indeed tied to the Linux kernel image:
4.15.0-1026-gcp OK
4.15.0-1027-gcp OK
4.15.0-1028-gcp OK
4.15.0-1029-gcp OK
4.15.0-1030-gcp OK
4.15.0-1032-gcp OK
4.15.0-1033-gcp OK
4.15.0-1034-gcp BUG
4.15.0-1036-gcp OK
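On the same machine, 1033 was OK, upgrading to 1034 introduced the bug, and upgrading to 1036 made it OK again.
The corresponding bug can be found in the changelog: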
http://changelogs.ubuntu.com/...

  • Remote denial of service (resource exhaustion) caused by TCP SACK scoreboard

The bug was introduced in the release dated:
-- Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Thu, 06 Jun 2019 11:07:33 -0300
The bug fix was released on:
-- Kleber Sacilotto de Souza <kleber.souza@canonical.com> Mon, 24 Jun 2019 14:48:10 +0200
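We could not find the code corresponding to the bug fix, and guess that it is not open source. Perhaps this is why Google Cloud cannot beat Amazon?
For reference, this is the command line used to install a given kernel version (1034 is shown here; substitute the version you want to test):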

apt-get install -y linux-image-4.15.0-1034-gcp linux-headers-4.15.0-1034-gcp linux-modules-4.15.0-1034-gcp  linux-modules-extra-4.15.0-1034-gcp

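Summary

  • Before releasing any update, have a rollback plan ready. There will always be problems nobody anticipated.
  • Make automated test cases as rich as possible. Had a test case returned a large enough response, this problem would have surfaced much earlier.
  • When a bug is reproducible, chase it to the bottom as early as possible. There is always something to learn.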