ORACLE的Dead Connection Detection淺析

時間 2019-11-16

標籤 oracle dead connection detection 淺析欄目 Oracle 简体版

原文原文鏈接

在複雜的應用環境下，咱們常常會遇到一些很是複雜而且有意思的問題，例如，咱們會遇到網絡異常（網絡掉包、無線網絡斷線）、客戶端程序異常（例如應用程序崩潰Crash）、操做系統藍屏、客戶端電腦掉電、死機重啓等異常狀況，此時數據庫鏈接可能都沒有正常關閉（Colse）、事務都沒有提交，鏈接（connections）就斷開了。若是遇到這些狀況，你未提交的一個事務在數據庫中是否會回滾？若是回滾，什麼條件纔會觸發回滾？須要多久纔會觸發回滾（不是回滾須要多少時間）？若是是一個查詢呢，那麼狀況又是怎麼樣呢？ORACLE數據庫是否提供某些機制來解決這些問題呢？若是這些問題你都能回答，那麼能夠不用看下文了，在介紹理論知識以前，咱們先經過構造測試案例，測試一下，畢竟實踐出真知，抽象的理論須要實驗來加深理解、全面詳細闡述。 html

咱們首先來測試一下數據庫會話正常退出的狀況吧，我在客戶端使用（SQL*Plus）鏈接到數據庫，執行一個UPDATE語句後不提交，而後退出（注意：實驗步驟是在服務器端查詢一些信息後才退出）。以下所示： web

 
  SQL> select * from v$mystat where rownum=1; 
   
         SID STATISTIC#      VALUE 
   
  ---------- ---------- ---------- 
   
         196          0          0 
   
  SQL> select sid,serial# from v$session where sid=196; 
   
         SID    SERIAL# 
   
  ---------- ---------- 
   
         196          9 
   
  SQL> update scott.dept set loc='CHICAGO' where deptno=40; 
   
  1 row updated. 
   
  SQL> exit    --在服務器查詢一些信息後才執行該命令

在服務器端咱們查看會話（196,9）的一些相關信息，以下所示：sql

 
  SQL> set linesize 1200 
   
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196; 
   
         SID SECONDS_IN_WAIT EVENT 
   
  ---------- ----------- ------------------------------------------------- 
   
         196              33 SQL*Net message from client 
   
  SQL> SELECT B.USERNAME 
   
    2         ,B.SID 
   
    3         ,B.SERIAL# 
   
    4         ,LOGON_TIME 
   
    5         ,A.OBJECT_ID  
   
    6         ,A.LOCKED_MODE 
   
    7  FROM   V$LOCKED_OBJECT A,  
   
    8         V$SESSION B  
   
    9  WHERE  A.SESSION_ID = B.SID  
   
   10  ORDER  BY B.LOGON_TIME; 
   
  USERNAME              SID    SERIAL# LOGON_TIM  OBJECT_ID   LOCKED_MODE  
   
  ----------------- ---------- ---------- --------- ----------   ----------- 
   
  TEST                   196          9 01-DEC-16      73199       3

從上面能夠看到196會話對錶SCOTT.DEPT持有鎖（Row-X 行獨佔(RX)），對象ID爲73199，而後咱們在客戶端不提交UPDATE語句就執行exit命令退出會話後，而後在服務器端檢查會話是否回滾。以下所示，測試結果咱們能夠看到，正常exit後，會話會當即回滾。（pmon進程當即回收相關進程，回收資源）數據庫

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196; 
   
  no rows selected 
   
  SQL> SELECT B.USERNAME 
   
    2         ,B.SID 
   
    3         ,B.SERIAL# 
   
    4         ,LOGON_TIME 
   
    5         ,A.OBJECT_ID  
   
    6  FROM   V$LOCKED_OBJECT A,  
   
    7         V$SESSION B  
   
    8  WHERE  A.SESSION_ID = B.SID  
   
    9  ORDER  BY B.LOGON_TIME; 
   
  no rows selected 
   
  SQL>

接下來，咱們來構造網絡異常的案例（須要多臺機器或虛擬機），以下所示，咱們首先在虛擬機上使用SQL*Plus鏈接到服務器端（帳號爲test，另外服務器上sqlnet.ora 不要設置SQLNET.EXPIRE_TIME參數，不啓用DCD，後面介紹至這個），而後執行一個UPATE語句不提交服務器

 
  SQL> show user; 
   
  USER is "TEST" 
   
  SQL> select * from v$mystat where rownum =1; 
   
         SID STATISTIC#      VALUE 
   
  ---------- ---------- ---------- 
   
         914          0          1 
   
  SQL> select sid,serial# from v$session where sid=914; 
   
         SID    SERIAL# 
   
  ---------- ---------- 
   
         914       3944 
   
  SQL> update scott.emp set sal=8000 where empno=7369; 
   
  1 row updated. 
   
  SQL>

而後咱們斷開虛擬機的網絡，構造網絡異常案例（在客戶端機器上執行service network stop命令斷開網絡），咱們在服務器端使用SQL*Plus查看會話（914,3944）的狀況，以下所示網絡

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914; 
   
         SID SECONDS_IN_WAIT EVENT 
   
  ---------- --------------- ---------------------------------------------------------------- 
   
         914              93 SQL*Net message from client 
   
  SQL>  SELECT B.USERNAME 
   
    2         ,B.SID 
   
    3         ,B.SERIAL# 
   
    4         ,LOGON_TIME 
   
    5         ,A.OBJECT_ID  
   
    6  FROM   V$LOCKED_OBJECT A,  
   
    7         V$SESSION B  
   
    8  WHERE  A.SESSION_ID = B.SID  
   
    9  ORDER  BY B.LOGON_TIME; 
   
  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID 
   
  ------------------------------ ---------- ---------- --------- ---------- 
   
  TEST                                  914       3944 01-DEC-16     782460 
   
  SQL>

咱們繼續執行上面語句，你會看到看到會話914一直是INACTIVE，對錶一直持有Row-X 行獨佔(RX)，並且seconds_in_wait也一直在增加 session

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914; 
   
         SID SECONDS_IN_WAIT EVENT 
   
  ---------- --------------- ----------------------------------------------------- 
   
         914            4928 SQL*Net message from client 
   
  SQL>  SELECT B.USERNAME 
   
    2         ,B.SID 
   
    3         ,B.SERIAL# 
   
    4         ,LOGON_TIME 
   
    5         ,A.OBJECT_ID  
   
    6  FROM   V$LOCKED_OBJECT A,  
   
    7         V$SESSION B  
   
    8  WHERE  A.SESSION_ID = B.SID  
   
    9  ORDER  BY B.LOGON_TIME; 
   
  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID 
   
  ------------------------------ ---------- ---------- --------- ---------- 
   
  TEST                                  914       3944 01-DEC-16     782460 
   
  SQL>  select sid, seconds_in_wait, event from v$session_wait where sid=914; 
   
         SID SECONDS_IN_WAIT EVENT 
   
  ---------- --------------- ------------------------------------------------ 
   
         914            5853 SQL*Net message from client 
   
  SQL> SELECT B.USERNAME 
   
    2         ,B.SID 
   
    3         ,B.SERIAL# 
   
    4         ,LOGON_TIME 
   
    5         ,A.OBJECT_ID  
   
    6  FROM   V$LOCKED_OBJECT A,  
   
    7         V$SESSION B  
   
    8  WHERE  A.SESSION_ID = B.SID  
   
    9  ORDER  BY B.LOGON_TIME; 
   
  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID 
   
  ------------------------------ ---------- ---------- --------- ---------- 
   
  TEST                                  914       3944 01-DEC-16     782460 
   
  SQL>

最後一直等待pmon進程回收資源，通過屢次測試，發現這個時間都是在7860多秒後纔會被PMON進程回收資源。 oracle

那麼這個是否有一個固定的值？這個值是否有規律呢？我構造了這樣一個腳本在服務器端運行(根據實際狀況修改sid, serial#的值)，測試數據庫須要耗費多久時間，PMON進程纔會回收進程，釋放資源，回滾事務。app

 
  CREATE TABLE TEST.SESSION_WAIT_RECORD 
   
  AS 
   
  SELECT sid,  
   
         seconds_in_wait,  
   
         event, 
   
     sysdate as curr_datetime 
   
  FROM   v$session_wait  
   
         where 1=0; 
   
  CREATE TABLE TEST.LOCK_OBJECT_RECORD AS  
   
          SELECT B.username,  
   
                 B.sid,  
   
                 B.serial#,  
   
                 logon_time,  
   
                 A.object_id , 
   
                 sysdate as curr_datetime 
   
          FROM   v$locked_object A,  
   
                 v$session B  
   
          WHERE  A.session_id = B.sid  
   
         AND 1=0; 
   
  DECLARE  
   
      v_index NUMBER := 1;  
   
  BEGIN  
   
      WHILE v_index != 0 LOOP  
   
            INSERT INTO SESSION_WAIT_RECORD 
   
          SELECT sid,  
   
                 seconds_in_wait,  
   
                 event , 
   
           sysdate 
   
          FROM   v$session_wait  
   
          WHERE  sid = 916;  
   
                  INSERT INTO LOCK_OBJECT_RECORD 
   
          SELECT B.username,  
   
                     B.sid,  
   
                 B.serial#,  
   
                 logon_time,  
   
                 A.object_id , 
   
                 sysdate 
   
          FROM   v$locked_object A,  
   
                 v$session B  
   
          WHERE  A.session_id = B.sid  
   
                  AND A.session_id=916 AND B.serial#=415 
   
          ORDER  BY B.logon_time;  
   
          commit; 
   
          dbms_lock.Sleep(10);  
   
          SELECT Count(*)  
   
          INTO   v_index  
   
          FROM   v$session_wait  
   
          WHERE  sid = 916;  
   
      END LOOP;  
   
  END;

1：屢次的測試結果是否一直一致？這個值是否有什麼規律？ less

從個人幾回測試結果來看（固然沒有大量測試和考慮各類場景），幾回測試的結果以下（查詢SESSION_WAIT_RECORD表），基本上都是在7872~7876. 因爲上面SQL會休眠10秒，因此能夠推斷數據庫會在一個固定的時間後清理斷開的會話。

測試實驗1

測試實驗2：

測試實驗3：

將上面腳本休眠的時間改成2秒，避免修改時間過長引發的偏差，測試結果是7876

看似結果有點不一致，實際上是由於偏差，由於腳本里面休眠的時間(實驗一、2的休眠時間爲10秒，實驗3改成2秒)，以及其餘方面的一些偏差致使，規律就是這個跟Linux系統的TCP keepalive有關係，咱們先來看看TCP keepalive概念，以下

The keepalive concept is very simple: when you set up a TCP connection, you associate a set of timers. Some of these timers deal with the keepalive procedure. When the keepalive timer reaches zero, you send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.

顧名思義，TCP keepalive它是用來保存TCP鏈接的，注意它只適用於TCP鏈接。系統會替你維護一個timer，時間到了，就會向remote peer發送一個probe package，固然裏面是沒有數據的，對方就會返回一個應答，這時你就知道這個通道保持正常。與TCP keepalive有關的三個參數tcp_keepalive_time、tcp_keepalive_intvl、tcp_keepalive_probes

[root@myln01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time

7200

[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl

[root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes

[root@getlnx01uat ~]#

/proc/sys/net/ipv4/tcp_keepalive_time 當keepalive起用的時候，TCP發送keepalive消息的頻度。默認是2小時。

/proc/sys/net/ipv4/tcp_keepalive_intvl 當探測沒有確認時，keepalive探測包的發送間隔。缺省是75秒。

/proc/sys/net/ipv4/tcp_keepalive_probes 若是對方不予應答，keepalive探測包的發送次數。缺省值是9。

那麼在Oracle沒有啓用DCD時，系統和數據庫如何判斷一個鏈接是否異常，須要關閉呢？這個時間是這樣計算的，首先它等待了7200，而後每隔75秒發送探測包，一共發送了9次後（7200+ 75*9 = 7875 ），都沒有收到客戶端應答，那麼它就判斷這個鏈接死掉了，能夠關閉了。因此這個值是一個固定值, 具體爲7875, 固然不一樣的操做系統可能有所不一樣,取決於上面三個tcp_keepalive參數，過了7875秒後，這個時候PMON進程就會回收與它相關的全部資源（例如回滾事務，釋放lock、latch、memory）。這個值與我測試的時間很是接近了（考慮咱們是採集的等待時間，以及測試腳本里面有休眠時間，這樣採集的數據有些許誤差）。

2：若是是一個查詢操做呢？結果又是什麼狀況。

若是是查詢操做，結果依然是如此，有興趣的能夠自行測試。

3：這個是否跟專用鏈接服務器模式與共享鏈接服務器模式有關？

測試結果發現，專用鏈接服務器模式與共享鏈接服務器模式都同樣。只是跟Linux的系統內核參數tcp_keepalive_time、tcp_keepalive_intvl、tcp_keepalive_probes有關係。

那麼問題來了，若是會話持續持有Row-X 行獨佔(RX)長達7875秒，那麼頗有可能致使系統出現一些性能問題，重要的系統裏面這個是不可接受的，好了，如今回到咱們討論的正題，ORACLE是怎麼處理這些問題的？它應該有一套機制來解決這個問題，不然它也太弱了。其實ORACLE提供了DCD（Dead Connection Detection 即死鏈接檢測）機制來解決這個問題，下面來介紹這個：

Dead Connection Detection概念

DCD是Dead Connection Detection的縮寫，用於檢查那些死掉但沒有斷開的session。它的具體原理以下所示：

當一個新的數據庫鏈接創建後，SQL*Net讀取參數文件的SQLNET.EXPIRE_TIME設置（若是設置了的話），在服務端初始化DCD，DCD會爲這個鏈接建立一個定時器，當該定時器超過SQLNET.EXPIRE_TIME指定時間間隔後，就會向客戶端發送一個probe package（偵測包），該包實質上是一個空的SQL*NET包，不包括任何有用數據，它僅在底層協議上建立了數據流。若是此時客戶端鏈接仍是正常的話，那麼這個probe package就會被客戶端直接丟棄，而後Oracle服務器就會把該鏈接對應的定時器從新復位。若是客戶異常退出的話，偵測包由客戶端的IP層交到TCP層時，就會發現原先的鏈接已經不存在了，而後TCP層就會返回錯誤信息，該信息被ORACLE服務端接收到後，ORACLE就會知道該鏈接已經不可用了，因而SQL*NET就會向操做系統發送消息，釋放該鏈接的相關資源。

官方文檔關於Dead Connection Detection的介紹請參考文檔「Dead Connection Detection (DCD) Explained (文檔 ID 151972.1)」，摘抄部分以下所示

 DEAD CONNECTION DETECTION ========================= OVERVIEW -------- Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8 and Oracle NET. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and flags the dead session so PMON can release the resources associated with it. DCD is intended primarily for environments in which clients power down their systems without disconnecting from their Oracle sessions, a problem characteristic of networks with PC clients. DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file. When the timer expires, SQL*Net on the server sends a "probe" packet to the client. (In the case of a database link, the destination of the link constitutes the server side of the connection.) The probe is essentially an empty SQL*Net packet and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol. If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection's resources. On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or $ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid. It should be also be noted that in SQL*Net 2.1.x, an active orphan process (one processing a query, for example) will not be killed until the query completes. In SQL*Net 2.2, orphaned resources will be released regardless of activity. This is a server feature only. The client may be running any supported SQL*Net V2 release.

如何開啓/啓用DCD

開啓DCD(Dead Connection Detection)很是簡單，只須要在服務器端的sqlnet.ora裏面設置SQLNET.EXPIRE_TIME便可，固然客戶端也須要支持SQL*Net V2以及後面版本。如何檢查、確認是否開啓了DCD，官方文檔有詳細介紹：Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g。此處不作展開。

DCD的問題與異常

DCD在一些版本和平臺仍是有蠻多Bug的，你在Oracle Metalink上搜索一下，都能查到不少，另外我在測試過程當中，設置SQLNET.EXPIRE_TIME=5，測試發現，清理這些Dead Connection的時間不是5分鐘，而是20多分鐘，

搜索了大量資料，也沒有徹底完全弄清楚這個問題，只是知道這個跟TCP/IP有超時重傳機制有關係，網絡知識是個人薄弱項啊（嘗試了屢次無果後，只能放棄），固然，數據庫回收Dead Connection也不會徹底跟SQLNET.EXPIRE_TIME指定的時間一致的（例如，下面官方文檔就明確指出not at the exact time of the DCD value）。另外這個值還有可能被防火牆影響，能夠參考防火牆、DCD與TCP Keep alive這篇文章。

To answer common questions about Dead Connection Detection (DCD). Common Questions about Dead Connection Detection ------------------------------------------------ Q: What is Dead Connection Detection? A: Dead Connection Detection (DCD) allows SQL*Net/Net8 to identify connections that have been left hanging by the abnormal termination of a client. This feature minimizes the waste of resources by connections that are no longer valid. It also automatically forces a rollback of uncommitted transactions and locks held by the user of the broken connection. Q: How does Dead Connection Work? A: On a connection with DCD enabled, a small probe packet is sent from server to client at a user defined interval (usually several minutes). If the connection is invalid (usually due to the client process or machine being unreachable), the session is "flagged" as dead, and PMon cleans up that session when next doing housekeeping (not at the exact time of the DCD value). The DCD mechanism does NOT terminate any sessions (idle or active). It merely marks the "dead" session for deletion by PMon. Q: How do you set the Dead Connection Detection feature? A: DCD is enabled on the server side of the connection by defining a parameter in the sqlnet.ora file in $ORACLE_HOME/network/admin called SQLNET.EXPIRE_TIME. This parameter determines the time period between successive probe packets across a connection between client and server. SQLNET.EXPIRE_TIME= <# of minutes> The sqlnet.expire_time parameter is defined in minutes and can have any value between 1 and an infinite number. If it is not defined in the sqlnet.ora file, DCD is not used. A time of 10 minutes is probably optimum for most applications. DCD probe packets originate on the server side, so it must be enabled on the server side. If you define sqlnet.expire_time on the client side, it will be ignored. Q: Will this work with the Oracle Multi-Threaded Server? A: DCD will work and is very useful with Multi-Threaded Server (MTS) configurations. MTS alone does not solve the problem, as a client that is powered down when connected to a MTS will also leave a defunct connection within the MTS (at least until the underlying protocol detects the loss of the client, at which time it will inform MTS, which will then free the resources). The resources used per client with MTS are less than those used by dedicated server, however, so the net gain per connection within MTS is less than that with dedicated server. Having said that, DCD has a distinct advantage within MTS configurations - as each server process is managing multiple clients simultaneously, the DBA has no option of killing a single process as a result of the termination of a single client. DCD therefore increases database uptime by allowing resources to be managed more effectively. Q: Can I use DCD on all of my connections over all protocols? A: You can use DCD over all protocols except for APPC/LU6.2, which prevents DCD from working due to its half-duplex nature. It also does not work over bequeathed connections. You should carefully consider whether to use DCD before you use it, however, as it creates additional processing work on the server and can also increase network traffic. Furthermore, some protocols already implement a form of DCD already, so it may not necessarily be needed on all protocols. Q: Are there any differences if I am using DCD on connections that go through the Oracle Multi-Protocol Interchange (MPI) or Connection Manager (CMAN)? A: No. DCD works through MPI and CMAN in the same way as direct client/server. If your connection spans across half-duplex and full-duplex protocols (for example APPC/LU6.2 and TCP/IP), DCD will be disabled by the server.

DCD的好處與弊端

其實，DCD的好處上面已經基本闡述清楚了，其實DCD仍是有一些弊端的。例如，在Windows平臺性能不好（bug#303578）；在SCO Unix下它會觸發Bug，消耗大量CPU資源; DCD 在協議層是很消耗資源的, 因此若是要用DCD來清除死進程, 會加劇系統的負擔, 任什麼時候候, 乾淨的退出系統，這是首要的. 以下英文所述：

DCD is much more resource-intensive than similar mechanisms at the protocol level, so if you depend on DCD to clean up all dead processes, that will put

an undue load on the server. Clearly it is advantageous to exit applications cleanly in the first place.

參考資料：

http://www.laoxiong.net/firewall-dcd-and-tcp-keep-alive.html

Note.601605.1 A discussion of Dead Connection Detection, Resource Limits, V$SESSION, V$PROCESS and OS processes:Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g:Connections on Windows Platform Timout after 2 Hours, Why ? (文檔 ID 1073461.1)Concurrent Manager Functionality Not Working And PCP Failover Takes Long Inspite of Enabling DCD With Database Server (文檔 ID 438921.1)Common Questions About Dead Connection Detection (DCD) (文檔 ID 1018160.6)Dead Connection Detection (DCD) Explained (文檔 ID 151972.1)

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。