最近作性能測試時,發現鏈接第三方系統時會有約1%的交易提示以下錯誤java
nested exception is org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
起先抱着能google就google的思路去找,結果沒找到相應的解決方案,只能本身一步一步去排查了,下面記錄下排查過程。python
涉及到網絡的問題,先抓個包,tcpdump走一波。shell
tcpdump -i eth0 host xxx.xxx.xxx.xxx > tcp.dmp
節選了一次TCP鏈接的整個內容,從三次握手到報文發送到四次揮手,下面將一次通信分爲三個部分進行講解apache
時間戳 協議類型 請求方地址>接收方地址 Flags [xxx], seq xxx,......,length
協議類型:此處針對TCP/IP協議,爲IP網絡
Flags常見值及含義併發
標識 | 含義 |
---|---|
S | 即SYN,用於創建鏈接 |
. | .即ACK,可與其餘組合使用好比[S.]表明SYN的ACK回執 |
P | 即PSH,指示接收方應當即將數據交給上層 |
F | 即FIN,用於斷開鏈接 |
09:23:00.038908 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [S], seq 2590465106, win 29200, options [mss 1460,sackOK,TS val 1752507811 ecr 0,nop,wscale 7], length 0 09:23:00.039329 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [S.], seq 148046299, ack 2590465107, win 14480, options [mss 1460,sackOK,TS val 2174555693 ecr 1752507811,nop,wscale 7], length 0 09:23:00.039341 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [.], ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 0
客戶端發送SYN=1,seq=隨機生成的序號
到服務端,對應第一個包內容app
服務端響應SYN=1,seq=服務端隨機序號,ack=第一步中的seq+1
給客戶端,對應第二個包內容tcp
客戶端發送SYN=0,seq=第一步中的seq+2,ack=服務端的隨機序號+1
給服務端,對應第三個包性能
至此鏈接創建完成測試
TCP鏈接爲全雙工的,既能夠C->S,亦可S->C
09:23:00.039406 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [P.], seq 1:167, ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 166 09:23:00.039426 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [P.], seq 167:2259, ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 2092 09:23:00.039778 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 167, win 122, options [nop,nop,TS val 2174555693 ecr 1752507811], length 0 09:23:00.039784 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 2259, win 155, options [nop,nop,TS val 2174555693 ecr 1752507811], length 0 09:23:00.533102 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [P.], seq 1:972, ack 2259, win 155, options [nop,nop,TS val 2174555817 ecr 1752507811], length 971 09:23:00.533114 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [.], ack 972, win 244, options [nop,nop,TS val 1752507935 ecr 2174555817], length 0
09:23:00.533121 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [F.], seq 972, ack 2259, win 155, options [nop,nop,TS val 2174555817 ecr 1752507811], length 0 09:23:00.533166 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [F.], seq 2259, ack 973, win 244, options [nop,nop,TS val 1752507935 ecr 2174555817], length 0 09:23:00.533479 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 2260, win 155, options [nop,nop,TS val 2174555817 ecr 1752507935], length 0
此處由於使用了HTTP協議,且在系統頭中設置了
Connection=close
,因此在請求處理完成後,服務端便會將鏈接關閉。
FIN
到客戶端,對應包1。ACK
到服務端,對應包2。FIN
到服務端,對應包2。由於第2、第三步都是從客戶端發送到服務端,TCP協議實現時在此處作了合併,將ACK和FIN一併發給對端,減小了網絡消耗。
ACK
到客戶端,對應包3,鏈接釋放完成。包是抓到了,但是在性能測試時,業務量特別大,若是純粹靠手工分析的話,怕是要找到地老天荒了,爲此寫了一段Python代碼來分析tcpdump的包,由於本次的問題爲請求鏈接有問題,必然是三次握手的時候存在問題,因此如下代碼主要分析三次握手時的狀況。
解析包內容
import re class Packet(object): '分組' timestamp = '00:00:00.000000' # 時間戳 from_addr = '0.0.0.0' # 源地址 to_addr = '0.0.0.0' # 目的地址 protocol_type = 'IP' # 協議類型 flags = '[S]' seq = '0' ack = '0' length = 0 # 數據包長度 def __init__(self, packet_str): hb = re.split(': ', packet_str) header = hb[0] body = hb[1] hs = re.split('[\s]', header) self.timestamp = hs[0] self.protocol_type = hs[1] self.from_addr = hs[2] self.to_addr = hs[4] bs = re.split(',\s', body) for b in bs: kv = re.split('\s', b, 1) k = kv[0] v = kv[1] if 'Flags' == k: self.flags = v elif 'seq' == k: if self.flags != '[P.]': self.seq = v elif 'length' == k: self.length = int(v) elif 'ack' == k: self.ack = v def __str__(self): return self.timestamp + self.protocol_type + self.from_addr + '==>' + self.to_addr + self.flags + str( self.seq) + str(self.ack) + str(self.length)
分析dmp包內容
from tcpdump.Packet import Packet client_addr = '192.168.1.100' server_addr = '192.168.1.200.39045' packet_times = {} # 鏈接ID:時間 first_packets = [] #三次握手第一次的包 ''' 第一步:syn=1,seq=xxx 第二步:syn=1,seq=yyy,ack=xxx+1 第三步:syn=0,seq=xxx+1,ack=yyy+1 ''' with open('C:\\Users\\xxx\\Desktop\\39045.dmp') as dmp: for line in dmp.readlines(): line = str(line.strip()) packet = Packet(line) if packet.from_addr.startswith(client_addr) and packet.flags == '[S]': first_packets.append(packet) id = packet.from_addr + packet.to_addr + packet.flags + 'ack=' + packet.ack if packet.flags != '[S.]': id = id + 'seq=' + packet.seq packet_times[id] = packet.timestamp print('請求ID(客戶端IP.端口\\請求序號)\t第一步時間\t\t第二步時間\t\t第三步時間') for fpack in first_packets: sync_time = fpack.timestamp # 第一步的時間 id = fpack.to_addr + fpack.from_addr + '[S.]' + 'ack=' + str(int(fpack.seq) + 1) # 第二步:ack=第一步的seq+1 ack_time = packet_times.get(id) conn_time = None if ack_time is not None: conn_time = packet_times.get(fpack.from_addr + fpack.to_addr + '[.]ack=1seq=0') # 第三步:ack=1,seq無值 print( '[' + fpack.from_addr + '\\' + fpack.seq + ']\t' + sync_time + '\t\t' + str(ack_time) + '\t\t' + str(conn_time))
經過代碼分析,發現存在部分建立鏈接時第二步的時間爲None的,所以能夠得出結論爲發出去的SYN請求包沒有獲得服務端的回執,能夠光明正大的摔鍋給網絡/第三方了![]