A critical business system had been failing frequently, and then suffered a large-scale outage.
The system uses two IBM Power 740 servers running AIX 6.1 with Oracle 11gR2 RAC as database servers, two DELL PowerEdge R720 servers as application servers, and an F5 load balancer in front.
A Cisco ASA5585 hardware firewall sits between the database servers and the application servers for access control.
The application servers run an in-house C program that accesses the RAC database over long-lived connections through the interface provided by Oracle Client.
When the failure occurred, the database servers' system load was checked first. Compared with normal operation, CPU load was somewhat high and I/O load was clearly elevated, with IOPS reaching around 13,000.
Load during normal operation vs. load during the incident (charts omitted).
Checking the database logs turned up a large number of TNS errors:
Fatal NI connect error 12170.
VERSION INFORMATION: VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Mon Feb 23 13:22:16 2015
Time: 23-Feb-2015 13:22:16
*********************************************************************** Time: 23-Feb-2015 13:22:16
Time: 23-Feb-2015 13:22:16
Time: 23-Feb-2015 13:22:16
Tracing not turned on.
Tracing not turned on.
Fatal NI connect error 12170.
Tracing not turned on.
Tracing not turned on.
Tns error struct:
Tns error struct:
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
Tns error struct:
Tns error struct:
ns main err code: 12535
ns main err code: 12535
Time: 23-Feb-2015 13:22:16
ns main err code: 12535
ns main err code: 12535
Tracing not turned on.
Tns error struct:
TNS-12535: TNS:operation timed out
TNS-12535: TNS:operation timed out
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
ns secondary err code: 12560
nt main err code: 505
ns secondary err code: 12560
nt main err code: 505
nt main err code: 505
nt main err code: 505
TNS-12535: TNS:operation timed out
TNS-00505: Operation timed out
TNS-00505: Operation timed out
ns secondary err code: 12560
TNS-00505: Operation timed out
nt secondary err code: 78
nt secondary err code: 78
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 78
nt OS err code: 0
nt OS err code: 0
nt secondary err code: 78
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=37975))
TNS-00505: Operation timed out
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=25972))
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=9108))
nt secondary err code: 78
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=52073))
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=49148))
Mon Feb 23 13:22:16 2015
Meanwhile, on the application server, the very connections the database had reported as timed out still showed as ESTABLISHED:
[sysadmin@appsrv1 ~]$ netstat -an|grep 10.1.197.15
tcp 0 0 10.1.32.70:37975 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:25972 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:9108 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:52073 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:49148 10.1.197.15:1521 ESTABLISHED
.
.
.
.
Checking the kernel parameters on the application server turned up a suspicious iptables conntrack setting:
net.netfilter.nf_conntrack_tcp_timeout_established = 300
This parameter makes iptables clear an established connection from its conntrack table after 300 seconds without activity; the default is 432000 seconds (5 days).
When asked who had set it, the answer was that the application team changed it before go-live, during performance testing, on the advice of the application vendor. Why it was changed, and why to that particular value, nobody could say.
Fine, that had to be it: changing it back to the default 432000 should solve the problem. Have the application team tune their SQL while they are at it, and the incident report could be written.
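A minimal sketch of how the value could be checked and put back, assuming the standard sysctl tooling on appsrv1 (making it persistent would also mean updating the entry in /etc/sysctl.conf):
[root@appsrv1 ~]# sysctl net.netfilter.nf_conntrack_tcp_timeout_established
net.netfilter.nf_conntrack_tcp_timeout_established = 300
[root@appsrv1 ~]# sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=432000
net.netfilter.nf_conntrack_tcp_timeout_established = 432000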
Is that the end of the story? Of course not. After the value was adjusted, things were calm at first. Some time later the application team reported that a small number of business timeouts were appearing again, with an increasing trend: from one or two a day to ten or twenty a day. Never a dull moment.
Digging further, this time the database servers seemed to have nothing to do with it: the problem lay in the connections between this application server and the service bus application. When the service bus application server initiated a connection to this application server, it was occasionally refused.
Since the application servers sit behind the F5 load balancer, the F5's view of the application service was checked first; everything looked fine, and the F5 health checks against the application showed no anomalies. Fine, look at the traffic directly and find out why connections were being refused.
Communication between the service bus application server and this application server uses short-lived connections, initiated by the service bus side whenever there is a business call. The application team located a refused call, pulled the relevant port information from its debug logs, and asked the network team to capture the packets for that connection.
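The capture command itself is not recorded; a hypothetical tcpdump invocation on appsrv1 might look like the following, where the interface name, the capture file name, the peer address 10.1.41.192 and the service port 9102 (both of which appear in the conntrack output further down) are assumptions:
[root@appsrv1 ~]# tcpdump -i eth0 -nn -s 0 -w svcbus2_appsrv1.pcap host 10.1.41.192 and tcp port 9102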
The capture showed that after svcbus2 sent its request to appsrv1, the request was acknowledged as usual, but one minute later svcbus2 closed the connection. Three minutes after that, when appsrv1 finished processing and returned its data to svcbus2, the response was rejected by svcbus2, and appsrv1 again kept retransmitting until it timed out.
The application team explained that the application on svcbus2 has a 60-second timeout: if the peer has not returned a result within 60 seconds, it abandons the request and closes the connection.
The packets show exactly that: svcbus2 sent a FIN, but because appsrv1 had not finished processing, the close was only one-way; the connection was not fully torn down until appsrv1 finished, its retransmitted data timed out, and the connection was completely closed.
But what does this have to do with new connections from svcbus2 to appsrv1 being refused? I couldn't see it at first either. iptables on appsrv1 keeps the service port open at all times and should not be refusing new connections, unless there is something special about those new connections.
Red Hat's customer portal turned up some relevant information.
https://access.redhat.com/solutions/73353 iptables randomly drops new connection requests
if /proc/net/ipv4/netfilter/ip_conntrack_tcp_loose is set to 1, iptables creates a new connection-tracking entry after receiving any packet, not just packets with the SYN flag set
the ip_conntrack_tcp_loose setting is useful in firewalls for keeping already established connections unaffected if the firewall restarts, but this causes issues in the following scenario:
a client initiates a new connection with the same source port it used in a previous connection
the server already has an iptables connection-tracking entry for the same IP address and port, with states ESTABLISHED and UNREPLIED
the default rule gets hit, instead of the desired one
the packet is dropped, and the server returns icmp-host-prohibited
https://access.redhat.com/solutions/73273 Why iptables sporadically drops initial connections requests?
This behavior is caused by enabling ip_conntrack_tcp_loose sysctl parameter which enables iptables to create a conntrack entry whenever it sees any packet from any direction and not just SYN packet.
This is called connection picking and is usually used by the firewalls to ensure that the ongoing end to end sessions remain active if for some reason the firewall gets restarted. But this feature is not needed in standalone servers.
Both articles describe situations in which iptables drops packets even though the matching rule is ACCEPT. In short, by default (when /proc/net/ipv4/netfilter/ip_conntrack_tcp_loose is 1), iptables records state even for TCP connections it has not seen from the start, so that existing connections survive an iptables restart, but this breaks things in certain corner cases.
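As both articles note, this loose tracking only matters on firewalls that need to pick up connections across a restart; on a standalone server it can simply be turned off. A minimal sketch, assuming the nf_conntrack sysctl name used on RHEL 6 era kernels (older ip_conntrack kernels expose it under a different /proc path):
[root@appsrv1 ~]# sysctl net.netfilter.nf_conntrack_tcp_loose
net.netfilter.nf_conntrack_tcp_loose = 1
[root@appsrv1 ~]# sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
net.netfilter.nf_conntrack_tcp_loose = 0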
Our application hit exactly such a corner case. When svcbus2 gave up waiting for appsrv1's answer, it closed the connection. In the nf_conntrack table on appsrv1, the entry for that connection moved to CLOSE_WAIT when the FIN from svcbus2 arrived, and 60 seconds later the entry was cleared. But once appsrv1 began sending its response, an entry for the same connection reappeared in the nf_conntrack table, this time in state ESTABLISHED [UNREPLIED].
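The two lifetimes come from two different conntrack timers: the CLOSE_WAIT entry is governed by nf_conntrack_tcp_timeout_close_wait (60 seconds by default), while the re-created entry falls under nf_conntrack_tcp_timeout_established again. This could be confirmed with sysctl; the values below are the defaults, with established already restored to 432000 at this point in the story:
[root@appsrv1 ~]# sysctl net.netfilter.nf_conntrack_tcp_timeout_close_wait
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
[root@appsrv1 ~]# sysctl net.netfilter.nf_conntrack_tcp_timeout_established
net.netfilter.nf_conntrack_tcp_timeout_established = 432000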
The corresponding entries on appsrv1:
[root@appsrv1 ~]# grep 51522 /proc/net/nf_conntrack
ipv4 2 tcp 6 35 CLOSE_WAIT src=10.1.32.70 dst=10.1.41.192 sport=9102 dport=51522 src=10.1.41.192 dst=10.1.32.70 sport=51522 dport=9102 [ASSURED] mark=0 secmark=0 use=2
.
.
wait more seconds
.
.
[root@appsrv1 ~]# grep 51522 /proc/net/nf_conntrack
ipv4 2 tcp 6 431965 ESTABLISHED src=10.1.32.70 dst=10.1.41.192 sport=9102 dport=51522 [UNREPLIED] src=10.1.41.192 dst=10.1.32.70 sport=51522 dport=9102 mark=0 secmark=0 use=2
The behavior is easy to reproduce with a small test:
1. Configure iptables on the server to allow TCP port 8888, as sketched below.
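The exact rules used are not shown; one way this might look on a stock RHEL 6 rule set, whose INPUT chain ends with a REJECT --reject-with icmp-host-prohibited rule (that reject is what will answer the SYN in step 5), is:
[root@server ~]# iptables -I INPUT -p tcp --dport 8888 -m state --state NEW -j ACCEPT
[root@server ~]# service iptables save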
2. Use Python to start a listener on TCP port 8888.
[root@server ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> serversocket.bind(("0.0.0.0",8888))
>>> serversocket.listen(5)
>>> (clientsocket, address) = serversocket.accept()
3. From the client, initiate a connection using TCP source port 22222 and send some request data.
[root@client ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(("0.0.0.0",22222))
>>> s.connect(("1.1.1.101",8888))
>>> s.send("request",100)
Check the iptables conntrack state on the server:
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431949 ESTABLISHED src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [ASSURED] mark=0 secmark=0 use=2
Wait a few seconds, then close the connection on the client.
>>> s.close()
>>> exit()
Check the conntrack state on the server again:
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 54 CLOSE_WAIT src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [ASSURED] mark=0 secmark=0 use=2
[root@server ~]# sleep 55
[root@server ~]# grep 103 /proc/net/nf_conntrack
The entry on the server side has disappeared.
4. Now have the server send its response data to the client:
>>> clientsocket.recv(1000)
'request'
>>> clientsocket.send("respond",100)
Checking the conntrack state on the server again now shows an ESTABLISHED [UNREPLIED] entry, but the state of the TCP connection itself is CLOSE_WAIT.
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431996 ESTABLISHED src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [UNREPLIED] src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 mark=0 secmark=0 use=2
[root@server ~]# netstat -ntp|grep 103
tcp 1 7 1.1.1.101:8888 1.1.1.103:22222 CLOSE_WAIT 28978/python
At this point a Wireshark capture also shows the client rejecting the server's response data and sending ICMP [host administratively prohibited].
After the server's TCP connection ages out of CLOSE_WAIT, the ESTABLISHED [UNREPLIED] entry remains in the nf_conntrack table, now governed by the net.netfilter.nf_conntrack_tcp_timeout_established parameter:
[root@server ~]# netstat -ntp|grep 103
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431066 ESTABLISHED src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [UNREPLIED] src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 mark=0 secmark=0 use=2
The listening port itself is still perfectly healthy:
[root@server ~]# netstat -nplt|grep 8888
tcp 0 0 0.0.0.0:8888 0.0.0.0:* LISTEN 28978/python
5. Now, when the client initiates another connection to TCP 8888 from source port 22222, iptables on the server rejects the SYN, and the client's connect fails with "No route to host" (errno 113, EHOSTUNREACH, which is how the ICMP host-prohibited reject surfaces on the client):
[root@client ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(("0.0.0.0",22222))
>>> s.connect(("1.1.1.101",8888))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1, in connect
socket.error: [Errno 113] No route to host
>>>