Hello everyone, I'm Li Yinan (李軼楠) from Yunhe Enmo (雲和恩墨), though friends online are more used to calling me "600", and I've grown used to the nickname myself. It comes from "ORA-600", the ID I registered on the ITPUB forum years ago; since it is identical to Oracle's famous internal error number, it sticks in people's minds, so I borrowed it. Over the years the forum has brought me many friends, including the Enmo colleagues I now work alongside. Enough small talk; on to today's topic. Over the years we have built up a large customer base, which also means we face all sorts of complex environments and incidents. Going forward I will turn the classic, bizarre, or simply interesting cases my colleagues and I have run into into case studies and share them with you; I hope you enjoy them. Let's open with a RAC node crash we handled last year.
One evening in June 2015, with rain pouring outside, I was sitting at home daydreaming when the phone rang. It was a customer, so I picked up right away: "One node of our core system went down suddenly a little after 9 pm; the instance restarted by itself. It didn't affect the business, but an unexplained restart on the core system is a direct threat to our operations. Management is taking it very seriously, so we need the root cause today, however late it gets. Please!"
My first reaction was: not too bad, only one node restarted, so the problem itself shouldn't be serious and the business impact is limited; at least I won't have to go on site in this downpour... just a small private thought. What the customer urgently needed was a quick answer on the cause so they could plan countermeasures, which meant getting data from the site as soon as possible; after all, the closer the diagnostic data is to the time of the failure, the more accurate and useful the analysis.
After some further communication, I had the basic facts of the incident:
Of course, the customer's description alone is obviously not enough, so I logged in to the customer's servers remotely and started digging. Before the analysis itself, a quick word on the usual approach to fault diagnosis, so that beginners know where to start when they hit a problem.
OK, back to our case:
1. Below is part of the alert log from node 2:
```
Fri Jun 26 20:24:52 2015
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_p001_13581.trc  (incident=204565):
ORA-00600: internal error code, arguments: [kksfbc-wrong-kkscsflgs], [39913467504], [33], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204565/orcl2_p001_13581_i204565.trc
Fri Jun 26 20:25:06 2015
Trace dumping is performing id=[cdmp_20150626202506]
Fri Jun 26 20:25:06 2015
Sweep [inc][204565]: completed
Sweep [inc2][204565]: completed
Fri Jun 26 20:25:18 2015
Trace dumping is performing id=[cdmp_20150626202517]
.....
Fri Jun 26 21:21:08 2015
Archived Log entry 164580 added for thread 2 sequence 48360 ID 0xa822d65a dest 1:
Fri Jun 26 21:41:09 2015
Thread 2 advanced to log sequence 48362 (LGWR switch)
  Current log# 2 seq# 48362 mem# 0: +DATA/orcl/onlinelog/redo21.log
  Current log# 2 seq# 48362 mem# 1: +DATA/orcl/onlinelog/redo22.log
Fri Jun 26 21:41:09 2015
Archived Log entry 164584 added for thread 2 sequence 48361 ID 0xa822d65a dest 1:
Fri Jun 26 21:50:26 2015
LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204141):
ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
LMHB (ospid: 29939) kills MMON (ospid: 29967). Please check LMHB trace file for more detail.
Fri Jun 26 21:51:06 2015
Restarting dead background process MMON
Fri Jun 26 21:51:06 2015
MMON started with pid=213, OS id=16612
Fri Jun 26 21:51:09 2015
Sweep [inc][204141]: completed
Sweep [inc2][204141]: completed
Fri Jun 26 21:54:10 2015
LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs.
LCK0 (ospid: 29987) has not called a wait for 85 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204142):
ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204142/orcl2_lmhb_29939_i204142.trc
Fri Jun 26 21:54:20 2015
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204143):
ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204143/orcl2_lmhb_29939_i204143.trc
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29939) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29939): terminating the instance due to error 29770
Fri Jun 26 21:54:21 2015
opiodr aborting process unknown ospid (26414) as a result of ORA-1092
Fri Jun 26 21:54:21 2015
ORA-1092 : opitsk aborting process
Fri Jun 26 21:54:21 2015
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_29889.trc
Instance terminated by LMHB, pid = 29939
Sat Jun 27 12:52:23 2015
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 172.16.0.0 configured from GPnP Profile for use as a cluster interconnect
Interface type 1 eth0 192.168.0.128 configured from GPnP Profile for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
```
In the alert log there is one very obvious ORA-600 error, plus a few other ORA errors (see the excerpt above). We looked into each of them and found the following:
1) ORA-600 [kksfbc-wrong-kkscsflgs] (Doc ID 970885.1) is indeed a bug, documented on MOS:
| NB | Bug | Fixed | Description |
|----|-----|-------|-------------|
|    | 9067282 | 11.2.0.1.2, 11.2.0.1.BP01, 11.2.0.3, 12.1.0.1 | ORA-600 [kksfbc-wrong-kkscsflgs] can occur |
|    | 9066130 | 10.2.0.5, 11.1.0.7.2, 11.2.0.2, 12.1.0.1 | OERI [kksfbc-wrong-kkscsflgs] / spin with multiple children |
|    | 8828328 | 11.2.0.1.1, 11.2.0.1.BP04, 11.2.0.2, 12.1.0.1 | OERI [kksfbc-wrong-kkscsflgs] |
|    | 8661168 | 11.2.0.1.1, 11.2.0.1.BP01, 11.2.0.2, 12.1.0.1 | OERI [kksfbc-wrong-kkscsflgs] can occur |
However, MOS does not say this bug can bring an instance down, so the ORA-600 looks largely unrelated to this restart. Fine; we log it as an open item and move on.
2) At the time of the failure, LMHB's health check found that MMON had been blocking LCK0 for more than 70 seconds, so it killed the MMON process; a killed MMON restarts automatically. We can see that MMON finished restarting at Jun 26 21:51:06.
A quick aside here: just a few days ago another customer of ours hit a related MMON problem, so I'll share it too. Their MMON process did not restart automatically after being killed, so AWR and other performance data were no longer collected. The customer wanted to bring MMON back without restarting the database; after trying various things, we found the documented options:
Restart the database instance
or
Set the instance to "restricted session" mode and then bring it back to normal mode, using the following commands as SYSDBA:

```
alter system enable restricted session;
alter system disable restricted session;
```
You can also find the related note on MOS: Doc ID 2023652.1.
Next, at Jun 26 21:54:10, LMS1 reported that it had been waiting for the 'object queue header operation' latch for 81 seconds.
Note: LMHB is a background process introduced in 11gR2. The documentation describes it as the Global Cache/Enqueue Service Heartbeat Monitor: it monitors the LMON, LMD, and LMSn processes to ensure they are running normally without blocking or spinning. In other words, it watches over these critical RAC background processes so that they do not get stuck. LMHB is short for Lock Manager Heartbeat.
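If you want to see how Oracle itself describes LMHB and the processes it watches on your own instance, v$bgprocess carries a one-line description for each background process. This is only a minimal sketch; the name list below is simply the set of processes that appear in this case:

```sql
-- Background processes involved in this case, with Oracle's own descriptions
select name, description
from   v$bgprocess
where  name in ('LMHB', 'LMON', 'LMD0', 'LMS0', 'LMS1', 'LCK0', 'MMON')
and    paddr <> hextoraw('00');   -- only processes that are actually running
```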
2. To lay the clues out more clearly, we built the following timeline from node 2's alert log:
```
Jun 26 20:24:52  ORA-00600 [kksfbc-wrong-kkscsflgs]
Jun 26 21:50:26  LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs
                 MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
                 MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
                 LMHB (ospid: 29939) kills MMON (ospid: 29967)
Jun 26 21:51:06  MMON started with pid=213, OS id=16612
Jun 26 21:54:10  LMS1 (ospid: 29929) waits for latch 'object queue header operation' for 81 secs
                 LCK0 (ospid: 29987) has not called a wait for 85 secs
                 ORA-29770: global enqueue process LMS1 (OSID 29929) is hung for more than 70 seconds
Jun 26 21:54:20  ORA-29770: global enqueue process LCK0 (OSID 29987) is hung for more than 70 seconds
                 ERROR: Some process(s) is not making progress.
                 LMHB (ospid: 29939) is terminating the instance
                 LMHB (ospid: 29939): terminating the instance due to error 29770
```
The last entries show that at 21:54:20 LMHB forcibly terminated the database instance, because it found that LCK0 had been hung for more than 70 seconds. The earlier entries also show that by 21:54:10 LCK0 had already stopped doing anything, and at that point LMS1 was stuck waiting on a latch as well. Clearly, if LMS1 cannot do its job, Oracle will evict and restart the problem node to keep cluster data consistent and avoid split-brain. So what was LMS1 waiting for, and why was LCK0 hung?
3. The state of the LMS1 process
Let's see what LMS1 was actually doing, by examining orcl2_lmhb_29939_i204142.trc, which was generated at Jun 26 21:54:10:
SO: 0x9a1175160, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x9a1175160, name=process, file=ksu.h LINE:11459, pg=0 (process) Oracle pid:14, ser:1, calls cur/top: 0x9b17e5330/0x9b17e0e60 flags : (0x6) SYSTEM flags2: (0x100), flags3: (0x0) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 116 last post received-location: kjc.h LINE:1810 ID:KJCS Post snd proxy to flush msg last process to post me: 991126330 1 6 last post sent: 41134582880 162 1 last post sent-location: ksl2.h LINE:2160 ID:kslges last process posted by me: 9811032c8 1 14 (latch info) wait_event=0 bits=a Location from where call was made: kcbo.h LINE:890 ID:kcbo_unlink_q_bg: waiting for 993cfec60 Child object queue header operation level=5 child#=117 Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs: Context saved from call: 0 state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 14 (95, 1435326858, 4) 21 (94, 1435326858, 7) waiter count=2 gotten 73961423 times wait, failed first 4752 sleeps 1927 gotten 33986 times nowait, failed: 4 possible holder pid = 36 ospid=29987 on wait list for 993cfec60 holding (efd=5) 9b59be480 Child gc element level=3 child#=20 Location from where latch is held: kcl2.h LINE:3535 ID:kclbla: Context saved from call: 0 state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0] holding (efd=5) 9b45cac50 Child cache buffers chains level=1 child#=61221 Location from where latch is held: kcl2.h LINE:3140 ID:kclbla: Context saved from call: 0 state=busy(exclusive) [value=0x200000000000000e, holder orapid=14] wlstate=free [value=0] Process Group: DEFAULT, pseudo proc: 0x9b11ca008 O/S info: user: oracle, term: UNKNOWN, ospid: 29929 OSD pid info: Unix process pid: 29929, image: oracle@ebtadbsvr2 (LMS1)
From the LMS1 process state we can see that the resource LMS1 was waiting for (the 'object queue header operation' latch) was held by ospid=29987. So what is 29987?
4. Digging further: what is ospid=29987?
Let's read on:
SO: 0x9911283b0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x9911283b0, name=process, file=ksu.h LINE:11459, pg=0 (process) Oracle pid:36, ser:2, calls cur/top: 0x9b17e58e0/0x9b17e58e0 flags : (0x6) SYSTEM flags2: (0x0), flags3: (0x0) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 35 last post received-location: ksr2.h LINE:603 ID:ksrpublish last process to post me: 981110608 118 0 last post sent: 0 0 36 last post sent-location: ksr2.h LINE:607 ID:ksrmdone last process posted by me: 9911283b0 2 6 (latch info) wait_event=0 bits=20 holding (efd=3) 993cfec60 Child object queue header operation level=5 child#=117 Location from where latch is held: kcl2.h LINE:3966 ID:kclbufs: Context saved from call: 0 state=busy(shared) [value=0x4000000000000001] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 14 (95, 1435326858, 4) 21 (94, 1435326858, 7) waiter count=2 Process Group: DEFAULT, pseudo proc: 0x9b11ca008 O/S info: user: oracle, term: UNKNOWN, ospid: 29987 OSD pid info: Unix process pid: 29987, image: oracle@ebtadbsvr2 (LCK0)
The last line tells us plainly that 29987 is the LCK0 process, which means the resource LMS1 was waiting on was held by LCK0. At the same time another process, orapid=21, was waiting for it too; further checking showed that orapid=21 is DBW2, a database writer process.
Note: orapid is Oracle's internal process number, while pid here is the OS process id. That is why orapid=21 maps to DBW2 in v$process, and orapid=14 to LMS1.
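For readers who want to reproduce this mapping, a quick join of v$process and v$bgprocess does it. This is only a sketch; the PID list 14, 21, 36 is simply the set of orapids seen in this trace:

```sql
-- Map Oracle process numbers (PID in v$process = "orapid" in the trace)
-- to OS pids and background process names.
select p.pid   as orapid,    -- Oracle-internal process number
       p.spid  as ospid,     -- OS process id
       b.name  as bg_name,   -- background process name (LMS1, DBW2, LCK0, ...)
       p.program
from   v$process p
       left join v$bgprocess b on b.paddr = p.addr
where  p.pid in (14, 21, 36);  -- the orapids seen in the trace
```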
5. The database alert log shows that at Jun 26 21:54:10 LCK0 had not responded for 85 seconds:
LCK0 (ospid: 29987) has not called a wait for 85 secs
Counting backwards, LCK0 must have stopped responding at around Jun 26 21:52:45. So clearly, once we know why LCK0 hung, we know the cause of this incident. Let's look at LCK0's trace file for clues.
6. The trace information for the LCK0 process
```
*** 2015-06-26 21:50:29.329
Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987, pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3
-------------------------------------------------------------------------------
loadavg : 6.47 26.96 34.97
Memory (Avail / Total) = 10598.05M / 64421.55M
Swap (Avail / Total) = 20256.00M /  20479.99M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle   29987     1  0  76   0 - 10541946 semtim Apr05 ?        01:25:21 ora_lck0_orcl2
Short stack dump:
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semop()+7<-skgpwwait()+156<-kslgess()+1799<-ksl_get_shared_latch()+620<-kclbufs()+272<-kclanticheck()+412<-kclahrt()+88<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36

*** 2015-06-26 21:54:18.438
Process diagnostic dump for oracle@ebtadbsvr2 (LCK0), OS id=29987, pid: 36, proc_ser: 2, sid: 1729, sess_ser: 3
-------------------------------------------------------------------------------
loadavg : 2.04 13.34 27.63
Memory (Avail / Total) = 9519.06M / 64421.55M
Swap (Avail / Total) = 20256.00M /  20479.99M
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 R oracle   29987     1  0  85   0 - 10541965 ?      Apr05 ?        01:26:55 ora_lck0_orcl2
Short stack dump:
ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-kcbo_get_next_qheader()+255<-kclbufs()+321<-kcldirtycheck()+231<-kclahrt()+93<-ksbcti()+212<-ksbabs()+1732<-kclabs()+186<-ksbrdp()+923<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+266<-ssthrdmain()+214<-main()+201<-__libc_start_main()+244<-_start()+36
```
Looking at these two snapshots of LCK0: at the first timestamp, 21:50:29, WCHAN is semtim. WCHAN shows where a sleeping process is waiting; semtim means that at this point LCK0 was sleeping in a semaphore wait, and a sleeping process does not release the resources it holds.
At the second timestamp, 21:54:18, LCK0's WCHAN is '?', which means unknown (if it were blank, the process would be in a running state). Throughout 21:50 to 21:52 LCK0 never returned to normal. So what was blocking LCK0 (or rather, who was holding the resource it needed)? Note also that memory and swap were largely free and the CPUs were not particularly busy.
7. Continuing through the trace, we find that at 21:50 LCK0 was blocked by the MMON process:
SO: 0x9d10f97c0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x9d10f97c0, name=process, file=ksu.h LINE:11459, pg=0 (process) Oracle pid:31, ser:1, calls cur/top: 0x965657408/0x9b17e3f18 flags : (0x2) SYSTEM flags2: (0x20), flags3: (0x0) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 35 last post received-location: ksr2.h LINE:603 ID:ksrpublish last process to post me: 9911283b0 2 6 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:282 ID:ksasnd last process posted by me: 9911283b0 2 6 (latch info) wait_event=0 bits=26 holding (efd=7) 993cfec60 Child object queue header operation level=5 child#=117 Location from where latch is held: kcbo.h LINE:884 ID:kcbo_link_q: Context saved from call: 0 state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 36 (82, 1435326627, 1) 21 (81, 1435326627, 1) waiter count=2 holding (efd=7) 9b5a5d630 Child cache buffers lru chain level=2 child#=233 Location from where latch is held: kcb2.h LINE:3601 ID:kcbzgws: Context saved from call: 0 state=busy [holder orapid=31] wlstate=free [value=0] holding (efd=7) 9c2a99938 Child cache buffers chains level=1 child#=27857 Location from where latch is held: kcb2.h LINE:3214 ID:kcbgtcr: fast path (cr pin): Context saved from call: 12583184 state=busy(exclusive) [value=0x200000000000001f, holder orapid=31] wlstate=free [value=0] Process Group: DEFAULT, pseudo proc: 0x9b11ca008 O/S info: user: oracle, term: UNKNOWN, ospid: 29967 OSD pid info: Unix process pid: 29967, image: oracle@ebtadbsvr2 (MMON)
The trace above shows that 993cfec60, the resource everything had been waiting for (a child 'object queue header operation' latch), was held by the MMON process.
At 21:50:29 LCK0 was waiting for MMON to release the resource, and MMON was misbehaving, so at 21:51:06 LMHB forcibly restarted MMON. Even after that restart, because of MMON's earlier anomaly, at 21:54:18 LCK0 still could not acquire the resource, and at 21:54:20 LMHB finally terminated the whole instance. So the real culprit is the MMON process.
```
Fri Jun 26 21:50:26 2015
LCK0 (ospid: 29987) waits for latch 'object queue header operation' for 77 secs.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_lmhb_29939.trc  (incident=204141):
ORA-29771: process MMON (OSID 29967) blocks LCK0 (OSID 29987) for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_204141/orcl2_lmhb_29939_i204141.trc
MMON (ospid: 29967) is blocking LCK0 (ospid: 29987) in a wait
LMHB (ospid: 29939) kills MMON (ospid: 29967). Please check LMHB trace file for more detail.
Fri Jun 26 21:51:06 2015
Restarting dead background process MMON
Fri Jun 26 21:51:06 2015
MMON started with pid=213, OS id=16612
```
MMON is the process Oracle uses to collect AWR information. Since it is implicated in this case, the next step is naturally to look at its traces.
8. Checking the MMON-related traces, we can see that MMON was working on object statistics.
The trace contains a large number of cursors with far too many child cursors, for example:
User=b1456358 Session=c146d760 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=558d5760 LibraryHandle: Address=2f79eb08 Hash=3dec6f4a LockMode=N PinMode=0 LoadLockMode=0 Status=VALD ObjectName: Name= select time_mp, scn, num_mappings, tim_scn_map from smon_scn_time where scn = (select max(scn) from smon_scn_time where scn <= :1) FullHashValue=c36d5a579fdc3e19733192893dec6f4a Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1038905162 OwnerIdn=0 Statistics: InvalidationCount=0 ExecutionCount=23741 LoadCount=107 ActiveLocks=1 TotalLockCount=6093 TotalPinCount=1 Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=106 KeepHandle=106 BucketInUse=6092 HandleInUse=6092 Concurrency: DependencyMutex=2f79ebb8(0, 0, 0, 0) Mutex=2f79ec30(0, 87578, 0, 0) Flags=RON/PIN/TIM/PN0/DBN/[10012841] WaitersLists: Lock=2f79eb98[2f79eb98,2f79eb98] Pin=2f79eba8[2f79eb78,2f79eb78] Timestamp: Current=04-05-2015 09:48:42 LibraryObject: Address=dff3bc60 HeapMask=0000-0001-0001 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000] ChildTable: size='112' Child: id='0' Table=dff3cb60 Reference=dff3c5b0 Handle=2f79e908 Child: id='1' Table=dff3cb60 Reference=dff3c8e0 Handle=2f4e2d90 Child: id='2' Table=dff3cb60 Reference=df3e7400 Handle=2f8c9e90 Child: id='3' Table=dff3cb60 Reference=df3e76e8 Handle=2f8abce8 ...... ...... Child: id='101' Table=dc86f748 Reference=df02d368 Handle=288e6460 Child: id='102' Table=dc86f748 Reference=dd65c3e0 Handle=274d0b40 Child: id='103' Table=dc86f748 Reference=dd65c6c8 Handle=29aa92f8 Child: id='104' Table=dc86f748 Reference=dd65c9b8 Handle=26f3a460 Child: id='105' Table=dc86f748 Reference=dd65ccd0 Handle=25c02dd8 NamespaceDump: Parent Cursor: sql_id=76cckj4yysvua parent=0x8dff3bd48 maxchild=106 plk=y ppn=n Current Cursor Sharing Diagnostics Nodes: ...... ...... Child Node: 100 ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0 already processed: Child Node: 101 ID=34 reason=Rolling Invalidate Window Exceeded(2) size=0x0 already processed:
There is a great deal of similar output. Clearly this parent cursor carries a huge number of child cursors, which is exactly why the AWR report for 20:00-21:00 (before node 2 was restarted) is full of cursor: mutex S waits. Here are the top wait events for that period:
| Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
|-------|-------|---------|---------------|-----------|------------|
| DB CPU |  | 47,072 |  | 93.05 |  |
| cursor: mutex S | 31,751,317 | 18,253 | 1 | 36.08 | Concurrency |
| gc cr multi block request | 359,897 | 1,281 | 4 | 2.53 | Cluster |
| gc buffer busy acquire | 10,465 | 686 | 66 | 1.36 | Cluster |
| library cache lock | 9,285 | 550 | 59 | 1.09 | Concurrency |
MMON collecting system information through its internal SQL should never run into this, yet here MMON misbehaved; its anomaly appears to be tied to the excessive number of child cursors.
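Had we been watching the shared pool at the time, the same picture would have shown up as high version counts in v$sqlarea. A minimal sketch (the cutoff of 100 is only illustrative):

```sql
-- Parent cursors carrying an unusually large number of child cursors
select sql_id,
       version_count,               -- number of child cursors under the parent
       loaded_versions,
       executions,
       substr(sql_text, 1, 60) as sql_text
from   v$sqlarea
where  version_count > 100          -- illustrative cutoff; adjust as needed
order  by version_count desc;
```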
9. In the end, the instance was forcibly terminated because the LCK0 anomaly drove LMHB to kill it. In the diag dump taken just before the instance died, LCK0 was still hung and was blocking three other sessions:
SO: 0x9914dbce8, type: 4, owner: 0x9911283b0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x9911283b0, name=session, file=ksu.h LINE:11467, pg=0 (session) sid: 1729 ser: 3 trans: (nil), creator: 0x9911283b0 flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x408) -/- DID: , short-term DID: txn branch: (nil) oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS ksuxds FALSE at location: 0 service name: SYS$BACKGROUND Current Wait Stack: Not in wait; last wait ended 1 min 39 sec ago There are 3 sessions blocked by this session. Dumping one waiter: inst: 2, sid: 1009, ser: 1 wait event: 'latch: object queue header operation' p1: 'address'=0x993cfec60 p2: 'number'=0xa2 p3: 'tries'=0x0 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 81 secs, waiter_cache_ver: 14285 Wait State: fixed_waits=0 flags=0x20 boundary=(nil)/-1
For a database process whose state is busy rather than dead, and which holds a latch without releasing it, the process is effectively hung. The latch LCK0 held is 'object queue header operation'; a MOS note describes it as follows: scans of the object queue in the buffer cache can hold the 'object queue header operation' latch.
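For reference, contention on this latch can also be observed from the child-latch statistics; the following query is a sketch for illustration and was not part of the original diagnosis:

```sql
-- Contention statistics for the latch involved in this case
select child#, gets, misses, sleeps, immediate_gets, immediate_misses
from   v$latch_children
where  name = 'object queue header operation'
order  by sleeps desc;
```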
Based on all of the above, we concluded that LCK0's trouble was caused by the excessive number of child cursors, which in turn comes from the child-cursor count threshold being far too high (effectively unlimited) in 11.2.0.1. (The Cursor Obsolescence feature was actually introduced back in 10g, where the threshold was 1024 child cursors: once a cursor exceeded 1024 children it was obsoleted. That threshold was removed in the early 11g releases, so a single parent cursor could accumulate a huge number of child cursors, i.e. a high version count, and this caused a whole series of cursor-sharing performance problems in releases before 11.2.0.3.) In other words, databases on 11.2.0.1 and 11.2.0.2 are prone to massive cursor: mutex S and library cache lock waits, which can then trigger further failures.
We therefore strongly recommend upgrading any database below 11.2.0.3 to 11.2.0.3 or later (11.2.0.3 has "_cursor_obsolete_threshold" by default, with a default value of 100), or using _cursor_features_enabled together with event 106001 to keep the number of child cursors under control.
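To verify what the threshold actually is on a given release, the usual hidden-parameter query (run as SYS) can be used. This is only a sketch; hidden parameters are undocumented, so check with Oracle Support before relying on or changing them:

```sql
-- Current value of the hidden cursor-obsolescence threshold (run as SYS)
select i.ksppinm  as name,
       v.ksppstvl as value,
       i.ksppdesc as description
from   x$ksppi  i
       join x$ksppcv v on v.indx = i.indx
where  i.ksppinm = '_cursor_obsolete_threshold';
```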
The analysis in this case took a few turns. The most common causes of single-node failures are things like the interconnect heartbeat or node resources, whereas the trigger here was rather odd: first an ORA-600, then messages about MMON being killed, which made the investigation a little murky for a while. In the end, though, we tracked down the root cause.
From this case we also arrived at a practical workaround for databases that cannot be upgraded right away:
```
SQL> alter system set "_cursor_features_enabled"=300 scope=spfile;
SQL> alter system set event='106001 trace name context forever,level 1024' scope=spfile;
```
Then a normal restart of the database is all that is needed.
1. Under what circumstances can a parent cursor or a child cursor be reused?
A parent cursor can be reused as long as the SQL text is identical. Child cursors have many more requirements; in particular, when bind variables are used or the data in predicate columns is heavily skewed, it is easy to end up with many child cursors. For the specific reasons a child cursor cannot be shared, check v$sql_shared_cursor: its many flag columns spell out the various reasons why the children of a given parent cursor could not be shared.
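For example, a sketch of such a check against the parent cursor from the MMON trace above (replace the sql_id with your own; only a few of the many flag columns are shown):

```sql
-- Why the children of one parent cursor could not be shared.
-- Each 'Y' in a flag column is one reason a new child cursor was created.
select child_number,
       bind_mismatch,
       optimizer_mismatch,
       roll_invalid_mismatch,
       language_mismatch
from   v$sql_shared_cursor
where  sql_id = '76cckj4yysvua';   -- sql_id taken from the MMON trace above
```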
2. Can a direct (crossover) network cable be used for the RAC interconnect heartbeat? How do you usually do it?
In principle the RAC interconnect can run over a direct cable, but neither the stability nor the throughput fully meets the heartbeat's requirements, which is why, starting with 9i, Oracle's white papers explicitly disallow direct cabling, mainly out of concern for heartbeat stability and bandwidth. If interconnect traffic is very light a direct cable will certainly work, but most RAC deployments are core systems and nobody wants to take the risk.
3. Could you briefly explain how RAC high availability differs from the usual active/standby (HA) clustering we build?
With RAC, the database instances on both nodes are open and serving at the same time, whereas with traditional HA only one node runs an instance and the other node's instance is not started. So, simply put, RAC offers stronger availability than HA: with both nodes active, when one node fails its sessions can fail over smoothly to the surviving node and the front end barely notices any interruption (provided it is configured properly), while with HA a node failure always interrupts the business until the instance comes up on the other node. RAC also puts both machines to work, whereas one HA node sits idle (to avoid that waste, some users run two different databases on the two HA nodes and have them back each other up).
4. Apart from child cursors, what other factors can crash a RAC node or make it unstable?
There are many destabilizing factors. The most typical is an unstable interconnect or a data storm on the interconnect. The DRM feature in 11g is also a frequent cause of single-node crashes, and we generally recommend disabling it. Several other 11g new features can trigger node crashes too; for example, we once saw an ASM instance with too little memory hosting too many databases, and when the ASM instance died it took the RAC databases down with it. The list of possible causes is long.
5. Could you briefly describe the approach to cursor tuning in Oracle, and the usual directions for RAC tuning?
For cursors, the simplest and most fundamental rule is to use bind variables where they fit and not to force them where they don't; columns with skewed predicate values are a very poor fit for binds. The accuracy of statistics, especially on skewed columns, also directly affects how many child cursors get created. Watch the relevant parameters as well, in particular cursor_sharing: if you must change it, prefer force over similar, because similar is prone to bugs that flood the shared pool with child cursors. And, as in our case, mind each version's limit on child-cursor counts; don't let them grow without bound. RAC tuning touches many areas, so I'll just mention a few scattered points. For sequences, always enable caching in RAC, and don't make the cache too small; for hot sequences we even recommend a cache of 500 or more, and never use ORDER. Also, if you see a lot of GC enqueue waits, consider workload partitioning: route sessions that work on different tables to different instances, to reduce contention between the two nodes on the same table.
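To illustrate the sequence advice, a RAC-friendly definition looks roughly like this (the sequence name is made up for the example):

```sql
-- A RAC-friendly sequence: large cache, no ordering guarantee across instances
create sequence order_seq          -- hypothetical name
  start with 1
  increment by 1
  cache 500                        -- large cache reduces sequence contention across nodes
  noorder;                         -- ORDER forces cross-instance coordination; avoid it in RAC
```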
6. Could you walk through the general approach to handling a RAC node crash? Thanks.
For a crash, you almost always start from the database alert.log. It usually records what the database was doing right before the instance died, often with pointers to more detailed trace files, and you analyze those together with the relationships between the RAC background processes; that is not something that fits in a sentence or two. In any case you will be reading the database alert log and the CRS logs, and sometimes the ASM logs as well. The causes are just as varied: some are system-level, such as problems with the ASM storage array, an unstable fibre link, or exhausted swap; others are database-level, such as heartbeat timeouts that make the cluster evict and restart the problem node, or bugs and anomalies that get a critical Oracle process killed. There are a great many possibilities.