技術培訓 | RAC 宕機罪犯案情探析之子游標

時間 2019-11-18

原文原文鏈接

你們好，我是雲和恩墨的李軼楠，不過網上的朋友更習慣叫我600，因此我也慢慢熟悉了這個稱呼，其實這個稱呼來自於ITPUB論壇上當時我註冊的論壇ID「ORA-600」，由於這個ID跟Oracle的著名錯誤號同樣，很容易給你們留下深入印象，因此被我借用了過來，呵呵。這些年經過論壇上認識了不少朋友，也結識瞭如今與我一塊兒奮戰的恩墨小夥伴們。閒話很少說，咱們來看看咱們今天要分享的主題吧，這些年咱們積累了大量的客戶羣體，也意味着咱們面對着各類複雜的環境與事件，後續我會把咱們小夥伴們所遭遇到的各類或者經典、或者靈異、或者有趣的case，逐一作成案件分析來分享給你們，但願你們喜歡。今天就以去年咱們所遇到的一次RAC節點宕機的故障做爲開場吧。前端

1. 案情描述

2015年6月的一個傍晚，大雨滂沱，正坐在家裏發呆中，忽然被一陣鈴聲驚醒。拿起電話，發現是客戶來電，因而趕忙接聽：」咱們的核心繫統在晚上9點多忽然有一個節點掛了，實例自動重啓。雖然沒有影響業務，但這種平白無故的重啓發生在覈心繫統上，直接威脅咱們的業務運行，領導很重視，因此今天必須把緣由找到，多晚都要給出結果，拜託了！「sql

2. 案情分析

聽到客戶的描述，內心第一個反應是：還好，只是單節點重啓，問題自己應該不嚴重，至少對業務影響並不大，看來這大雨天是不用淋雨去客戶現場了。。。呵呵，當心思而已。客戶目前急需的是快速給出問題的緣由，以便根據狀況做出後續的規避措施，因此須要儘快拿到現場的一些數據來作分析，畢竟分析所須要的信息數據距離故障時間越近，精準度就越高，越有利分析。數據庫

通過進一步的溝通，獲得了一些案發時的基本信息：api

重啓的是整個數據庫架構的2節點，這個庫是核心系統，晚上也有業務；
重啓實例的狀況其實之前也發生過，只不過發生的很少；（潛臺詞是也許這不是個案，之前的重啓意味着可能這個問題一直存在）
當前數據庫版本爲11.2.0.1。（看到每一個大版本的第一個小版本，老是以爲這種系統會BUG纏身。。。雖然誇大了點，但咱們確實遇到了很多這種小版本的BUG）

固然，很明顯只聽客戶描述是不夠的。因而，我遠程登陸了客戶的服務器，開始作進一步檢查。在開始分析以前，先跟你們及一下故障分析的常規思路，也便於初學者在遇到問題的時候知道從何處入手。服務器

在遇到故障時，首先要辨識一下當前的場景主要是性能問題的仍是真的故障；
若是是性能問題，那就須要收集當時的系統性能信息與數據庫性能信息，awr、ash，或者系統的nmon、top之類均可以；
若是是故障，那就要檢查數據庫的告警日誌與跟蹤文件了，很是典型的就是alertSID.log，這裏面每每給了咱們進一步追蹤痕跡的指引。除此以外，各個進程的trace文件，asm的trace文件以及RAC的各類log、trace文件都會給出一些故障的緣由或者印記，具體的這些文件所在目錄你們都應該熟悉，最典型的就是經過background_dump_dest、user_dump_dest參數去查找（注意，rac的log和trace文件與數據庫的目錄並不在一塊兒，能夠檢查$GRID_HOME/log/<hostname>的各個子目錄），更詳細的在這裏就再也不展開了。
另外，當遭遇這些問題的時候，若是有MOS帳號的話，建議首先去MOS中查看是否有故障相關的BUG或者技術文章，這既是快速診斷問題、解決問題的途徑，也是DBA快速成長的重要手段。關於MOS的使用，你們能夠加公衆號「oracle」，在其中能夠看到5月5日發的一篇「Oracle初學者入門指南-什麼是Metalink或MOS？」

好吧，回到咱們的案例中：session

一、下面是節點2的alert log部份內容：架構

Fri Jun 26 20:24:52 2015
 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_p001_13581.trc  (incident=204565):
 ORA-00600: 內部錯誤代碼, 參數: [kksfbc-wrong-kkscsflgs], [39913467504], [33], [], [], [], [], [], [], [], [], []
 Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204565/orcl2_p001_13581_i204565.trc
 Fri Jun 26 20:25:06 2015
 Trace dumping is performing id=[cdmp_20150626202506]
 Fri Jun 26 20:25:06 2015
 Sweep [inc][204565]: completed
 Sweep [inc2][204565]: completed
 Fri Jun 26 20:25:18 2015
 Trace dumping is performing id=[cdmp_20150626202517]
 。。。。。
 Fri Jun 26 21:21:08 2015
 Archived Log entry 164580 added for thread 2 sequence 48360 ID 0xa822d65a dest 1:
 Fri Jun 26 21:41:09 2015
 Thread 2 advanced to log sequence 48362 (LGWR switch)
   Current log# 2 seq# 48362 mem# 0: +DATA/orcl/onlinelog/redo21.log
   Current log# 2 seq# 48362 mem# 1: +DATA/orcl/onlinelog/redo22.log
 Fri Jun 26 21:41:09 2015
 Archived Log entry 164584 added for thread 2 sequence 48361 ID 0xa822d65a dest 1:
 Fri Jun 26 21:50:26 2015
 LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204141):
 ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
 Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
 MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
 LMHB (ospid: 29939) kills MMON (ospid: 29967).
 Please check LMHB trace file for more detail.
 Fri Jun 26 21:51:06 2015
 Restarting dead background process MMON
 Fri Jun 26 21:51:06 2015
 MMON started with pid=213, OS id=16612
 Fri Jun 26 21:51:09 2015
 Sweep [inc][204141]: completed
 Sweep [inc2][204141]: completed
 Fri Jun 26 21:54:10 2015
 LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs.
 LCK0 (ospid: 29987) has not called a wait for 85 secs.
 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204142):
 ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
  Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204142/orcl2_lmhb_29939_i204142.trc
 Fri Jun 26 21:54:20 2015
 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204143):
 ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
 Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204143/orcl2_lmhb_29939_i204143.trc
 ERROR: Some process(s) is not making progress.
 LMHB (ospid: 29939) is terminating the instance.
 Please check LMHB trace file for more details.
 Please also check the CPU load, I/O load and other system properties for anomalous behavior
 ERROR: Some process(s) is not making progress.
 LMHB (ospid: 29939): terminating the instance due to error 29770
 Fri Jun 26 21:54:21 2015
 opiodr aborting process unknown ospid (26414) as a result of ORA-1092
 Fri Jun 26 21:54:21 2015
 ORA-1092 : opitsk aborting process
 Fri Jun 26 21:54:21 2015
 System state dump is made for local instance
 System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_29889.trc
 Instance terminated by LMHB, pid = 29939
 Sat Jun 27 12:52:23 2015
 Starting ORACLE instance (normal)
 LICENSE_MAX_SESSION = 0
 LICENSE_SESSIONS_WARNING = 0
 Interface type 1 eth1 172.16.0.0 configured from GPnP Profile for use as a cluster interconnect
 Interface type 1 eth0 192.168.0.128 configured from GPnP Profile for use as a public interface
 Picked latch-free SCN scheme 3
 Autotune of undo retention is turned on.
 LICENSE_MAX_USERS = 0
 SYS auditing is disabled

在告警日誌中咱們發現一個很明顯的ORA-600錯誤，同時也發現一些其餘的ORA報錯，見上面標紅部分。因而咱們對這些錯誤分別進行了分析，發現：oracle

1）ORA-600 [kksfbc-wrong-kkscsflgs] (Doc ID 970885.1) 確實是一個bug，在MOS上有說明：app

Bug	Fixed	Description
9067282	11.2.0.1.2, 11.2.0.1.BP01, 11.2.0.3, 12.1.0.1	ORA-600 [kksfbc-wrong-kkscsflgs] can occur
9066130	10.2.0.5, 11.1.0.7.2, 11.2.0.2, 12.1.0.1	OERI [kksfbc-wrong-kkscsflgs] / spin with multiple children
8828328	11.2.0.1.1, 11.2.0.1.BP04, 11.2.0.2, 12.1.0.1	OERI [kksfbc-wrong-kkscsflgs]
8661168	11.2.0.1.1, 11.2.0.1.BP01, 11.2.0.2, 12.1.0.1	OERI[kksfbc-wrong-kkscsflgs] can occur

但MOS上並未說明該BUG會致使實例宕機，這個600錯誤看來應該與這次重啓關係不大，好吧，做爲一個問題記下來就是了。ide

2）在故障時間點，LMHB 進程check發現mmon進程阻塞了LCK0進程，超過70s，所以嘗試kill MMON進程，該進程被kill以後將會自動重啓。能夠看到在Jun 26 21:51:06 時間點，MMON進程重啓完成。

這裏在插播一件事，正好就在前幾天，咱們的客戶也遇到了MMON進程的相關問題，在這也順便分享給你們：客戶的MMON進程被Kill以後並未自動啓動，致使AWR等性能信息沒法自動收集，所以客戶但願可以在不啓動數據庫的狀況下啓動MMON，再想了各類辦法以後，最終找到了方法：

Restart the database instance

或者

Set the instance to "restricted session" mode and then bring it back to normal mode using following commands as SYSDBA:
alter system enable restricted session;
alter system disable restricted session;

你們也能夠搜尋一下MOS上的文章，看到相關的信息：文檔 ID 2023652.1

接下來，在Jun 26 21:54:10，LMS1進程報錯沒法得到latch（object queue header operation）超過85秒。

注: LMHB是11gR2中引入的後臺進程，官方文檔介紹是Global Cache/Enqueue Service Heartbeat Monitor，Monitor the heartbeat of LMON, LMD, and LMSn processes，LMHB monitors LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning。該進程負責監控LMON、LMD、LMSn等RAC關鍵的後臺進程，保證這些background process不被阻塞或spin。 LMHB是Lock Manager Heartbeat的縮寫。

二、爲了更清楚的理清線索，咱們根據節點2的alert log信息，整理出以下的時間線點：

Jun 26 20:24:52   ORA-00600  [kksfbc-wrong-kkscsflgs]
Jun 26 21:50:26   LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs
                  MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
                  MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
                  LMHB (ospid: 29939) kills MMON (ospid: 29967)
                  
Jun 26 21:51:06   MMON started with pid=213, OS id=16612   

Jun 26 21:54:10   LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs
                  LCK0 (ospid: 29987) has not called a wait for 85 secs
                  ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
                  
Jun 26 21:54:20   ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
                  ERROR: Some process(s) is not making progress.
                  LMHB (ospid: 29939) is terminating the instance 
                  LMHB (ospid: 29939): terminating the instance due to error 29770

從最後的信息能夠看出,在21:54:20時間點，LMHB進程強行終止了數據庫實例. 而終止實例的緣由是LMHB進程發現LCK0進行hung住了，並且超過了70s。在從前面的信息也能夠看出，實際上在21:54:10時間點，LCK0進程就已經沒有活動了了，並且在該時間點LMS1進程也一直在等待latch。很明顯，若是LMS1進程沒法正常工做，Oracle爲了保證集羣數據的一致性，爲了不腦裂，必然將問題節點強行驅逐重啓。那麼LMS1在等什麼呢？LCK0爲何被Hung住了？

三、lms1進程的狀況

讓咱們來看看LMS1到底在幹什麼？檢查orcl2_lmhb_29939_i204142.trc，而該trace 文件產生的時間點是：Jun 26 21:54:10：

SO: 0x9a1175160, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9a1175160, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:14, ser:1, calls cur/top: 0x9b17e5330/0x9b17e0e60
          flags : (0x6) SYSTEM
          flags2: (0x100),  flags3: (0x0) 
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 116
              last post received-location: kjc.h LINE:1810 ID:KJCS Post snd proxy to flush msg
              last process to post me: 991126330 1 6
              last post sent: 41134582880 162 1
              last post sent-location: ksl2.h LINE:2160 ID:kslges
              last process posted by me: 9811032c8 1 14
  (latch info) wait_event=0 bits=a
        Location from where call was made: kcbo.h LINE:890 ID:kcbo_unlink_q_bg: 
    waiting for 993cfec60 Child object queue header operation level=5 child#=117 
        Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs: 
        Context saved from call: 0
        state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         14 (95, 1435326858, 4)
         21 (94, 1435326858, 7)
         waiter count=2
        gotten 73961423 times wait, failed first 4752 sleeps 1927
        gotten 33986 times nowait, failed: 4
        possible holder pid = 36 ospid=29987
    on wait list for 993cfec60
    holding    (efd=5) 9b59be480 Child gc element level=3 child#=20 
        Location from where latch is held: kcl2.h LINE:3535 ID:kclbla: 
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0]
    holding    (efd=5) 9b45cac50 Child cache buffers chains level=1 child#=61221 
        Location from where latch is held: kcl2.h LINE:3140 ID:kclbla: 
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0]
  Process Group: DEFAULT, pseudo proc: 0x9b11ca008
  O/S info: user: oracle, term: UNKNOWN, ospid: 29929 
  OSD pid info: Unix process pid: 29929, image: oracle@ebtadbsvr2 (LMS1)

從LMS1進程的信息來看，LMS1 進程所等待的資源（object queue header operation）正被ospid=29987 持有，那麼29987又是什麼呢？

四、進一步分析下ospid=29987 是什麼？

讓咱們接着往下看：

SO: 0x9911283b0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9911283b0, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:36, ser:2, calls cur/top: 0x9b17e58e0/0x9b17e58e0
          flags : (0x6) SYSTEM
          flags2: (0x0),  flags3: (0x0) 
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 35
              last post received-location: ksr2.h LINE:603 ID:ksrpublish
              last process to post me: 981110608 118 0
              last post sent: 0 0 36
              last post sent-location: ksr2.h LINE:607 ID:ksrmdone
              last process posted by me: 9911283b0 2 6
  (latch info) wait_event=0 bits=20
    holding    (efd=3) 993cfec60 Child object queue header operation level=5 child#=117 
        Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs: 
        Context saved from call: 0
        state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         14 (95, 1435326858, 4)
         21 (94, 1435326858, 7)
         waiter count=2
  Process Group: DEFAULT, pseudo proc: 0x9b11ca008
  O/S info: user: oracle, term: UNKNOWN, ospid: 29987 
  OSD pid info: Unix process pid: 29987, image: oracle@ebtadbsvr2 (LCK0)

最後一句很明顯的告訴咱們，29987原來就是LCK0進程。這意味着lms1 進程所等待的資源正被LCK0 進程所持有。而同時還有另一個進程orapid=21 也在等待該進程。經過分析咱們分析發現： orapid=21，是dbw2進程，即數據庫寫進程。

注：這裏解釋一下，orapid是oracle中的進程id，而pid則是os上的進程id。因此orapid=21從v$process中能夠查到是dbw2，而orapid=14是lms1.

五、從數據庫alert log來看，在Jun 26 21:54:10 有提示lck0 進程已經超過85秒沒有響應

LCK0 (ospid: 29987) has not called a wait for 85 secs

根據時間點來計算，大概在Jun 26 21:52:45點左右開始，LCK0進程即沒有響應了。那麼很明顯，咱們只要知道LCK0進程爲何會hung，就知道了這次故障的緣由。那麼咱們來看看LCK0的trace文件，看能不能看到一些線索。

六、LCK0進程的trace信息

*** 2015-06-26 21:50:29.329                                                                                Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987,  
pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3                                                          
-------------------------------------------------------------------------------                                             
loadavg : 6.47 26.96 34.97                                                                                                 
Memory (Avail / Total) = 10598.05M / 64421.55M                           
Swap (Avail / Total) = 20256.00M /  20479.99M            
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD           
0 S oracle   29987     1  0  76   0 - 10541946 semtim Apr05 ?     01:25:21 ora_lck0_orcl2 
Short stack dump:                                        
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semop()+7<-skgpwwait()+156<-kslgess()+1799<-ksl_get_shared_latch()+620<-kclbufs()+272<-kclanticheck()+412<-kclahrt()+88<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36

*** 2015-06-26 21:54:18.438
Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987,
pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3 
-------------------------------------------------------------------------------
loadavg : 2.04 13.34 27.63
Memory (Avail / Total) = 9519.06M / 64421.55M
Swap (Avail / Total) = 20256.00M /  20479.99M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle   29987     1  0  85   0 - 10541965 ?   Apr05 ?        01:26:55 ora_lck0_orcl2
Short stack dump: 
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-kcbo_get_next_qheader()+255<-kclbufs()+321<-kcldirtycheck()+231<-kclahrt()+93<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36

從上述lck0進程的幾個時間點的信息來看，第一個時間點21:50:29 ：wchan 爲semtim。wchan，表示進程sleeping的等待表現形式。semtim表示在該時間點，lck0 進程一直處於sleep狀態。所謂的sleep狀態，是進程持有的資源是不會釋放的。

而在第2個時間點21:54:18，lck0進程的wchan狀態是 ?，這表示未知，若是是爲空，則表示該進程處理running狀態。在21：50到21：52 時間段內，LCK0進程仍然沒有恢復正常。那麼LCK0到底被什麼阻塞了（或者說它須要的資源被誰佔用了）？同時也能夠看到內存和swap都空閒不少，CPU也並不很忙。

七、繼續檢查trace，在21：50時間點咱們發現，lck0進程是被MMON進程鎖阻塞了

SO: 0x9d10f97c0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
 proc=0x9d10f97c0, name=process, file=ksu.h LINE:11459, pg=0
(process) Oracle pid:31, ser:1, calls cur/top: 0x965657408/0x9b17e3f18
          flags : (0x2) SYSTEM
          flags2: (0x20),  flags3: (0x0) 
          intr error: 0, call error: 0, sess error: 0, txn error 0
          intr queue: empty
ksudlp FALSE at location: 0
  (post info) last post received: 0 0 35
              last post received-location: ksr2.h LINE:603 ID:ksrpublish
              last process to post me: 9911283b0 2 6
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 9911283b0 2 6
  (latch info) wait_event=0 bits=26
    holding    (efd=7) 993cfec60 Child object queue header operation level=5 child#=117 
        Location from where latch is held: kcbo.h LINE:884 ID:kcbo_link_q: 
        Context saved from call: 0
        state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0]
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         36 (82, 1435326627, 1)
         21 (81, 1435326627, 1)
         waiter count=2
    holding    (efd=7) 9b5a5d630 Child cache buffers lru chain level=2 child#=233 
        Location from where latch is held: kcb2.h LINE:3601 ID:kcbzgws: 
        Context saved from call: 0
        state=busy [holder orapid=31] wlstate=free [value=0]
    holding    (efd=7) 9c2a99938 Child cache buffers chains level=1 child#=27857 
        Location from where latch is held: kcb2.h LINE:3214 ID:kcbgtcr: fast path (cr pin): 
        Context saved from call: 12583184
        state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0]
  Process Group: DEFAULT, pseudo proc: 0x9b11ca008
  O/S info: user: oracle, term: UNKNOWN, ospid: 29967 
  OSD pid info: Unix process pid: 29967, image: oracle@ebtadbsvr2 (MMON)

從上面的trace能夠看到以前一直被等待的993cfec60 資源（Child object queue header operation）正被 mmon進程持有。

21:50:29的時候LCK0在等待mmon釋放資源，而此時mmon出現異常，所以在21:51:06 mmon 進程被LMHB強制重啓。而後在重啓後，因爲mmon的異常，致使21:54:18該資源仍沒法被lck0 進程正常持有，最終致使21:54:20LMHB進程強制重啓了整個實例。所以，最終的罪魁禍首是mmon進程。

Fri Jun 26 21:50:26 2015
 LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204141):
 ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
 Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
 MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
 LMHB (ospid: 29939) kills MMON (ospid: 29967).
 Please check LMHB trace file for more detail.
 Fri Jun 26 21:51:06 2015
 Restarting dead background process MMON
 Fri Jun 26 21:51:06 2015
 MMON started with pid=213, OS id=16612

mmon進程Oracle是用於進行AWR信息收集的。既然案情發生的緣由與它有關，那麼接下來的分析天然是查看它的相關trace了。

八、檢查MMON的相關trace 能夠看到，MMON進程負責處理對象的統計信息。

從trace中能夠看到大量的cursor包含了太多的子游標，例如：

User=b1456358 Session=c146d760 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=558d5760 
      LibraryHandle:  Address=2f79eb08 Hash=3dec6f4a LockMode=N PinMode=0 LoadLockMode=0 Status=VALD 
        ObjectName:  Name=        select time_mp, scn, num_mappings, tim_scn_map from smon_scn_time   where scn =    (select max(scn) from smon_scn_time where scn <= :1)

          FullHashValue=c36d5a579fdc3e19733192893dec6f4a Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1038905162 OwnerIdn=0 
        Statistics:  InvalidationCount=0 ExecutionCount=23741 LoadCount=107 ActiveLocks=1 TotalLockCount=6093 TotalPinCount=1 
        Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=106 KeepHandle=106 
BucketInUse=6092 HandleInUse=6092 
        Concurrency:  DependencyMutex=2f79ebb8(0, 0, 0, 0) Mutex=2f79ec30(0, 87578, 0, 0) 
        Flags=RON/PIN/TIM/PN0/DBN/[10012841] 
        WaitersLists:  
          Lock=2f79eb98[2f79eb98,2f79eb98] 
          Pin=2f79eba8[2f79eb78,2f79eb78] 
        Timestamp:  Current=04-05-2015 09:48:42 
        LibraryObject:  Address=dff3bc60 HeapMask=0000-0001-0001 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000] 
          ChildTable:  size='112' 
            Child:  id='0' Table=dff3cb60 Reference=dff3c5b0 Handle=2f79e908 
            Child:  id='1' Table=dff3cb60 Reference=dff3c8e0 Handle=2f4e2d90 
            Child:  id='2' Table=dff3cb60 Reference=df3e7400 Handle=2f8c9e90 
            Child:  id='3' Table=dff3cb60 Reference=df3e76e8 Handle=2f8abce8 
            ......
            ......
            Child:  id='101' Table=dc86f748 Reference=df02d368 Handle=288e6460 
            Child:  id='102' Table=dc86f748 Reference=dd65c3e0 Handle=274d0b40 
            Child:  id='103' Table=dc86f748 Reference=dd65c6c8 Handle=29aa92f8 
            Child:  id='104' Table=dc86f748 Reference=dd65c9b8 Handle=26f3a460 
            Child:  id='105' Table=dc86f748 Reference=dd65ccd0 Handle=25c02dd8 
        NamespaceDump:  
          Parent Cursor:  sql_id=76cckj4yysvua parent=0x8dff3bd48 maxchild=106 plk=y ppn=n 
            Current Cursor Sharing Diagnostics Nodes:  
             ......
             ......
              Child Node: 100  ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0 
                already processed:   
              Child Node: 101  ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0 
                already processed:

相似上面的信息很是多。很明顯，上述parent cursor包含了大量的子游標，這是爲何在20點-21點（節點2還未重啓前的時段）的awr報告中出現大量cursor: mutex S 的緣由，以下是這個時段的等待事件：

Event	Waits	Time(s)	Avg wait (ms)	% DB time	Wait Class
DB CPU		47,072		93.05
cursor: mutex S	31,751,317	18,253	1	36.08	Concurrency
gc cr multi block request	359,897	1,281	4	2.53	Cluster
gc buffer busy acquire	10,465	686	66	1.36	Cluster
library cache lock	9,285	550	59	1.09	Concurrency

在mmon正常經過內部SQL收集系統信息時，根本不該該出現這種狀況，而此時MMON進程卻出現異常，這個異常看來應該是與cursor子游標過多有關了。

九、最後數據庫實例被強行終止的緣由是lck0進程出現異常致使lmhb進程強行終止instance 在最後instance終止以前的diag dump中，lck0進程的狀態仍然是hang的狀態，同時也阻塞了3個其餘session，以下：

SO: 0x9914dbce8, type: 4, owner: 0x9911283b0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x9911283b0, name=session, file=ksu.h LINE:11467, pg=0
    (session) sid: 1729 ser: 3 trans: (nil), creator: 0x9911283b0
              flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
              flags2: (0x408) -/-
              DID: , short-term DID: 
              txn branch: (nil)
              oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
    ksuxds FALSE at location: 0
    service name: SYS$BACKGROUND
    Current Wait Stack:
      Not in wait; last wait ended 1 min 39 sec ago 
    There are 3 sessions blocked by this session.
    Dumping one waiter:
      inst: 2, sid: 1009, ser: 1
      wait event: 'latch: object queue header operation'
        p1: 'address'=0x993cfec60
        p2: 'number'=0xa2
        p3: 'tries'=0x0
      row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
      min_blocked_time: 81 secs, waiter_cache_ver: 14285
    Wait State:
      fixed_waits=0 flags=0x20 boundary=(nil)/-1

對於數據庫進程，若是狀態不是dead，而是busy，並且持有latch不釋放，那麼這意味着該進程已被掛起，lck0 持有的latch是object queue header operation。oracle mos文檔也有關於該event 的描述以下：Scans of the object queue in the buffer cache can hold the "object queue header operation"。

基於上述的分析，咱們最終判斷，lck0 進程出現問題的緣由與cursor 子游標過多有關。同時，這又與在11.2.0.1版本上的child cursor總數閥值限制太高有關（實際在版本10g中就引入了該Cursor Obsolescence遊標廢棄特性，10g的child cursor 的總數閥值是1024，即子游標超過1024即被過時，可是這個閥值在11g的初期版本中被移除了。這就致使出現一個父遊標下大量child cursor即high version count的發生；由此引起了一系列的版本11.2.0.3以前的cursor sharing 性能問題。這意味着版本11.2.0.1和11.2.0.2的數據庫，將可能出現大量的Cursor: Mutex S 和 library cache lock等待事件這種症狀，進而誘發其餘故障的發生。

所以，強烈建議11.2.0.3如下的版本應儘快將數據庫版本升級到11.2.0.3以上（11.2.0.3中默認就有」_cursor_obsolete_threshold」了，並且默認值爲100），或者經過_cursor_features_enabled 和106001 event來強制控制子游標過多的狀況。

3. 結案陳詞

該案例的分析仍是有些曲折，由於常見致使單節點故障最多見的狀況主要是心跳、節點資源之類，而該案例的誘發緣由相對有些詭異，先是出現了ora-600錯誤，而後又報了kill mmon的信息，這都讓案情分析有些撲朔迷離，固然，最終咱們仍是找出了問題的主要根源。

經過該起案件咱們發現：

數據庫版本的選擇絕對是影響系統穩定性的關鍵要點；
不要覺得性能問題只是影響用戶體驗，有些性能問題是會誘發系統一系列問題的；
問題的分析不要想固然，經過trace逐步遞進，結合原理與經驗，才能更爲準確的肯定問題；
子游標過多千萬不能小視，最好能找出根源並解決它，若是確實很差解決，那麼可經過設置隱含參數_cursor_features_enabled 和106001 event強制失效子游標的方式來防止子游標過多的狀況，操做以下：

SQL> alter system set "_cursor_features_enabled"=300 scope=spfile;
SQL> alter system set event='106001 trace name context forever,level 1024' scope=spfile;

正常重啓數據庫便可。

Q&A

一、在什麼狀況下能夠重用父遊標和子游標呢？

父遊標只要SQL代碼共享就能夠重用，子游標的要求的比較多，特別是當使用了綁定變量或者謂詞字段的數據分佈傾斜的時候，很容易出現多個子遊標的狀況，具體子游標不能共享的緣由，能夠查v，這個視圖裏的一堆字段代表了某個特定父遊標的子游標不能共享的各類緣由

二、請問一下，監測rac心跳可否用直連網線？通常大家怎麼作？

rac的心跳從原理上是可使用直連網線的，但不管是穩定性仍是傳輸速率都不能徹底知足心跳的要求，所以Oracle從9i開始，在白皮書裏明確寫了不容許使用直連線的方式，主要就是出於心跳穩定和傳輸速率考慮。其實若是心跳壓力很是小的話，用直連確定也能夠，只是用rac的大部分是核心，通常不肯意去冒險。

三、能不能簡單說下 rac 的高可用和咱們平時作的雙機熱備有哪些區別呀。

rac的高可用是兩節點的數據庫實例同時可用，而傳統的ha則只是在一個節點上可用實例，另外一個節點的實例並無啓動，所以，簡單來講，rac比ha至少在高可用能力上更強（兩節點同時在用，一旦單節點故障，故障節點的會話能夠無縫切換到另外一可用節點，前端沒有業務中斷的感受，固然，這個須要配置好才行），而ha一旦單節點故障，業務必定得中斷，等待另外一節點實例起來以後才能鏈接。同時，rac也能同時發揮兩臺機器的效能，而ha的一個節點徹底是浪費着（有的用戶爲了不這種浪費，就把兩個庫分別在ha的兩個節點運行，讓兩節點的兩個庫互爲熱備）

四、請問除了子游標之外，還有哪些因素會致使，RAC宕機或者致使其不穩定。

rac不穩定的因素不少，最典型的就是心跳線不穩定或者心跳線出現數據風暴，此外，11g的DRM特性也很容易致使rac單節點宕機，通常咱們都會建議禁用DRM特性。此外，11g還有不少新特性都會誘發節點宕機，好比咱們之前還遇到過因爲asm實例內存過小，而在rac的asm上部署的數據庫太多致使asm實例掛掉，從而致使rac庫宕掉的。緣由不少了

五、可否簡單介紹下在 oracle優化中對遊標的優化思路與RAC經常使用優化方向。

遊標的優化思路，最簡單也是最根本的就是在適當的場景下用綁定變量，在不應用的時候不要亂用綁定。好比，謂詞傾斜的字段就極度不適合綁定。另外，統計信息的準確性，尤爲是傾斜字段的統計信息準確性也直接影響着子游標的產生。還有一些參數也要注意，特別是cursor_sharing，寧願設置成force，也儘可能不要用similar，similar很容易遭遇bug，致使一堆子游標撐爆shared pool。最後就是咱們那個案例中的不一樣版本對子游標數的限制，不能聽任子游標無窮增加。 rac的優化牽扯麪比較多，我先零零星星寫一些吧，太多了。好比sequence，在rac下必定要開啓cache，並且不能過小，咱們對於一些經常使用的sequence甚至建議開啓cache到500以上，決不能使用order。再好比若是發現GC的enq特別多，那麼就要考慮作業務分割，把操做不一樣表的會話鏈接到不一樣的節點實例上，減小兩節點對同一張表的徵用。

六、請說一下，RAC宕機的處理思路，謝謝。

對於宕機，基本上都是從數據庫的alert.log看起，通常告警日誌裏都會有最後宕掉時數據庫作的事情，甚至有更爲詳細的trace文件信息，那麼就須要根據這些信息結合rac各進程的關係來進行分析，這個的確不是一兩句話能說的完的。總之，確定是要看數據庫告警日誌和crs日誌的，asm的日誌有時候也要看。而致使宕機的緣由一樣五花八門，有些是系統緣由，好比asm存儲陣列出問題、光纖鏈路不穩定、swap耗盡等等，有些是數據庫問題，好比心跳檢測超時，爲了恢復出問題的節點，自動宕掉那個節點重啓，好比因爲一些數據庫的bug或者異常致使Oracle的某些關鍵進程被強制殺掉之類。。。可能性很是多了

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。