A Case Study: Oracle Restart Fails to Start a Database Instance


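Oracle Restart is an important High Availability feature introduced in 11gR2. In a Single Instance environment, Clusterware forms an availability-management framework, and the Oracle component services are all managed on top of it.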

 

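Oracle Restart has two responsibilities. The first is automatic startup of the Oracle service components: because the dependencies between components are complex, letting Restart work out the startup order is a sound strategy. The second is high-availability support: Oracle Restart periodically checks the health of the components it manages, and if one of them terminates unexpectedly, for example after an abnormal interruption, Restart starts it again automatically.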

 

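For a single-instance Oracle installation, the components that Oracle Restart currently manages are: the Listener, the Oracle instance and database, the ASM instance, ASM disk groups, database Services, and ONS (Oracle Notification Service).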

 

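This article records a failure I ran into. It is not complicated and does not compare with the write-ups of the experts in the field; treat it as a record of the reasoning, kept for anyone who may need it.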

 

1. The problem appears

 

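On an 11gR2 host I had deployed a single-instance ASM instance with disk groups, and a Single Instance Oracle database on top of it. Because this is a test machine, I had run various tests and experiments on it. After booting the server today, I found a problem.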

 

 

[grid@SimpleLinux simplelinux]$ uptime

 13:58:13 up  2:24,  1 user,  load average: 0.03, 0.02, 0.00

[grid@SimpleLinux simplelinux]$ ps -ef | grep pmon

grid      3212     1  0 11:35 ?        00:00:01 asm_pmon_+ASM

grid     27724 27685  0 13:58 pts/0    00:00:00 grep pmon

 

 

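With a standard Oracle Restart configuration, the ASM instance, the ASM disk groups, and the database instance are all under Restart's management and should start automatically when the server boots. In practice, the ASM instance had started automatically but the database instance had not.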
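The check above can be reduced to a one-line filter. The following is a minimal sketch, not part of any Oracle tool: the function name `list_pmon_instances` is an assumption made for illustration, and it only relies on the `asm_pmon_<SID>` / `ora_pmon_<SID>` process naming visible in the listing.

```shell
#!/bin/sh
# list_pmon_instances: hypothetical helper (not an Oracle utility).
# Reads `ps -ef`-style lines on stdin and prints the instance name of
# every pmon background process (asm_pmon_<SID> or ora_pmon_<SID>).
list_pmon_instances() {
    awk '{ cmd = $NF }
         cmd ~ /^(asm|ora)_pmon_/ {
             sub(/^(asm|ora)_pmon_/, "", cmd)   # strip the prefix, keep the SID
             print cmd
         }'
}

# Typical use on a live host:
#   ps -ef | list_pmon_instances
```

Because the filter matches only the last field of each line, the `grep pmon` process itself is not reported. On the host described here it would be expected to print only `+ASM` until the database instance is running.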

 

As in a RAC configuration, Restart relies on the high-availability daemons, led by ohasd, to bring the components up step by step while the server boots.

 

In this situation the best option is to read the logs and see which step went wrong.

 

 

[grid@SimpleLinux simplelinux]$ pwd

/u01/app/grid/product/11.2.0/grid/log/simplelinux

[grid@SimpleLinux simplelinux]$ ls -l | grep alert

-rw-rw---- 1 grid oinstall 14494 Oct 17 11:35 alertsimplelinux.log

 

 

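The grid and Clusterware logs are all kept in directories under $ORACLE_HOME/log. The alert log (alertsimplelinux.log here) is the main log and the starting point for any check; problems found there usually drive the next step of the analysis.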

 

 

[ohasd(2744)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2013-10-17 11:35:34.373

[cssd(3130)]CRS-1601:CSSD Reconfiguration complete. Active nodes are simplelinux .

2013-10-17 11:35:50.094

[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

2013-10-17 11:35:55.645

[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

2013-10-17 11:35:55.806

[ohasd(2744)]CRS-2807:Resource 'ora.ora11g.db' failed to start automatically.

 

 

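This locates the problem: see the CRS-5010 and CRS-2807 entries above. After handling the diskmon resource, Clusterware tried to start the database, that is, ora.ora11g.db, and ran into a problem while accessing a parameter file (note that it is a pfile).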

 

The oraagent_grid.log file that the message points to does not offer much more information either.

 

 

2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab file updated with dbName ora11g value /u01/app/oracle/product/11.2.0/db_1:N

2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab CSS unlock

2013-10-17 11:35:50.090: [ora.ora11g.db][3013430160] {0:0:2} [start] (:CLSN00014:)Failed to open file /u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora

2013-10-17 11:35:50.091: [   AGENT][3013430160] {0:0:2} UserErrorException: Locale is

2013-10-17 11:35:50.091: [ora.ora11g.db][3013430160] {0:0:2} [start] clsnUtils::error Exception type=2 string=

CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

 

 

According to these messages, the agent was unable to open the pfile.
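When an alert log is long, a summary of the CRS message codes it contains can help pick out the failing step quickly. The sketch below is only an illustration: `crs_code_counts` is a hypothetical helper name, and it assumes nothing about the log beyond the `CRS-<number>` codes seen in the excerpts above.

```shell
#!/bin/sh
# crs_code_counts: hypothetical helper (not an Oracle utility).
# Prints each distinct CRS-nnnn code found in the given log file,
# followed by the number of times it occurs, sorted by code.
crs_code_counts() {
    grep -o 'CRS-[0-9][0-9]*' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use, with the alert log path shown earlier:
#   crs_code_counts /u01/app/grid/product/11.2.0/grid/log/simplelinux/alertsimplelinux.log
```

A code that repeats, such as CRS-5010 in this case, is a reasonable place to start reading the surrounding entries.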

 

2. An unsuccessful attempt

 

The log says the text parameter file (the pfile) could not be opened. My first guess was file permissions, so I checked them.

 

 

[grid@SimpleLinux oraagent_grid]$ cd /u01/app/oracle/product/11.2.0/db_1/dbs/

[grid@SimpleLinux dbs]$ ls -l

total 20

-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat

-rw-r--r-- 1 oracle oinstall 2851 May 15  2009 init.ora

-rw-r----- 1 oracle oinstall  887 Sep 29 09:31 initora11g.ora

-rw-r----- 1 oracle asmadmin   24 Sep 12 12:58 lkORA11G

-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g

[grid@SimpleLinux dbs]$ id oracle

uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(oper),602(asmdba)

[grid@SimpleLinux dbs]$ id grid

uid=501(grid) gid=500(oinstall) groups=500(oinstall),501(dba),600(asmadmin),601(asmoper),602(asmdba)

 

 

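The permissions are read/write for the oracle user and read-only for the group. Both grid and oracle belong to oinstall, so grid can read the file but not modify it. On the face of it that did not look like a serious problem, but I tested it anyway.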
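Rather than reasoning from the mode bits alone, the effective access can be tested directly as each user. This is a minimal sketch; `check_access` is a hypothetical helper name, and it reports access for whichever user runs it.

```shell
#!/bin/sh
# check_access: hypothetical helper (not an Oracle utility).
# Reports whether the *current* user can read and write the given file.
check_access() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing"
        return 1
    fi
    r=no
    w=no
    [ -r "$f" ] && r=yes
    [ -w "$f" ] && w=yes
    echo "read=$r write=$w"
}

# Typical use, run once as grid and once as oracle:
#   check_access /u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora
```

Given the listing above (mode rw-r-----, owner oracle, group oinstall), the grid user would be expected to see `read=yes write=no`, which is worth keeping in mind because CRS-5010 complains about an update of the file.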

 

 

[oracle@SimpleLinux dbs]$ chmod 770 initora11g.ora

[oracle@SimpleLinux dbs]$ ls -l

total 20

-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat

-rw-r--r-- 1 oracle oinstall 2851 May 15  2009 init.ora

-rwxrwx--- 1 oracle oinstall  887 Sep 29 09:31 initora11g.ora

-rw-r----- 1 oracle asmadmin   24 Sep 12 12:58 lkORA11G

-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g

 

 

Try to start the database.

 

 

[grid@SimpleLinux ~]$ srvctl start database -d ora11g

PRCR-1079 : Failed to start resource ora.ora11g.db

CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:

CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".

 

CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed

 

 

The start failed. Would starting it the traditional way, from the sqlplus command line, work?

 

 

[oracle@SimpleLinux ~]$ sqlplus /nolog

 

SQL*Plus: Release 11.2.0.3.0 Production on Thu Oct 17 14:17:11 2013

 

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

 

SQL> conn / as sysdba

Connected to an idle instance.

SQL> startup

ORACLE instance started.

 

Total System Global Area  263639040 bytes

Fixed Size                  1344312 bytes

Variable Size             134221000 bytes

Database Buffers          125829120 bytes

Redo Buffers                2244608 bytes

Database mounted.

Database opened.

SQL> quit

Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production

With the Partitioning, Automatic Storage Management, OLAP, Data Mining

and Real Application Testing options

[oracle@SimpleLinux ~]$ ps -ef | grep pmon

grid      3212     1  0 11:35 ?        00:00:02 asm_pmon_+ASM

oracle   27979     1  0 14:17 ?        00:00:00 ora_pmon_ora11g

oracle   28106 27921  0 14:17 pts/0    00:00:00 grep pmon

[oracle@SimpleLinux ~]$ srvctl status database -d ora11g

Database is running.

 

 

That worked: the database can be started from the sqlplus command line, yet starting it through Oracle Restart fails. So where is the problem?

 

3. Spfile vs. Pfile

 

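On the face of it, Oracle Restart wants to access the pfile at startup. It seems odd that the pfile, long since superseded by the spfile, should be involved at all. Using the instance that is now running, let us check which parameter file is currently in use.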

 

 

SQL> show parameter spfile

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

spfile                               string

SQL>

 

 

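The instance was started from a pfile, and the earlier listing of $ORACLE_HOME/dbs showed no spfile either. By default, Oracle startup first looks for an spfile at a path assembled from the environment variables (ORACLE_HOME and ORACLE_SID), and only then for a pfile. The spfile parameter is empty, which means the instance is currently using a pfile.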
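The default search can be mimicked with a few lines of shell. The sketch below assumes the documented order for a STARTUP issued without a PFILE clause on Linux, namely spfile<SID>.ora, then spfile.ora, then init<SID>.ora in $ORACLE_HOME/dbs; `find_param_file` is a hypothetical helper name, not an Oracle command.

```shell
#!/bin/sh
# find_param_file: hypothetical helper (not an Oracle utility).
# Prints the first parameter file that exists in the given dbs directory,
# using the assumed default order: spfile<SID>.ora, spfile.ora, init<SID>.ora.
find_param_file() {
    dbs_dir=$1
    sid=$2
    for name in "spfile${sid}.ora" "spfile.ora" "init${sid}.ora"; do
        if [ -f "$dbs_dir/$name" ]; then
            echo "$dbs_dir/$name"
            return 0
        fi
    done
    return 1   # no parameter file found in the default location
}

# Typical use:
#   find_param_file "$ORACLE_HOME/dbs" "$ORACLE_SID"
```

With the directory contents listed earlier, only initora11g.ora exists, so that is the file a plain `startup` ends up using.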

 

The startup information recorded in Oracle Restart, however, looks somewhat different.

 

 

[grid@SimpleLinux ~]$ srvctl config database -d ora11g

Database unique name: ora11g

Database name: ora11g

Oracle home: /u01/app/oracle/product/11.2.0/db_1

Oracle user: oracle

Spfile: +DATA/ora11g/spfileora11g.ora

Domain:

Start options: open

Stop options: immediate

Database role: PRIMARY

Management policy: AUTOMATIC

Database instance: ora11g

Disk Groups: DATA,RECO

Services:

 

 

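There is an obvious difference: Restart records an spfile in the +DATA disk group. At this point I remembered an earlier experiment in which I had generated spfiles and pfiles in the ASM environment, and I suspected that it had left the Restart configuration and the instance out of sync.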
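The comparison made here can be scripted by pulling the registered value out of the `srvctl config database` output. This is a rough sketch that assumes the `Spfile:` line format shown above; `registered_spfile` is a hypothetical helper name.

```shell
#!/bin/sh
# registered_spfile: hypothetical helper (not an Oracle utility).
# Reads `srvctl config database -d <name>` output on stdin and prints the
# value of the "Spfile:" line (an empty line if no spfile is registered).
registered_spfile() {
    sed -n 's/^Spfile:[[:space:]]*//p'
}

# Typical use:
#   srvctl config database -d ora11g | registered_spfile
```

The printed path can then be compared with the value reported by `show parameter spfile` inside the instance; in this case the two disagree.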

 

That suggested a second repair strategy.

 

 

SQL> create spfile from pfile;

 

File created.

 

SQL> startup force

ORACLE instance started.

 

Total System Global Area  263639040 bytes

Fixed Size                  1344312 bytes

Variable Size             134221000 bytes

Database Buffers          125829120 bytes

Redo Buffers                2244608 bytes

Database mounted.

Database opened.

SQL> show parameter spfile

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

spfile                               string      /u01/app/oracle/product/11.2.0

                                                 /db_1/dbs/spfileora11g.ora

 

 

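Now register the newly created spfile as the startup parameter file, in an attempt to make the Restart configuration consistent with the instance.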

 

 

[oracle@SimpleLinux ~]$ srvctl modify database -d ora11g -p /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora

[oracle@SimpleLinux ~]$ srvctl config database -d ora11g

Database unique name: ora11g

Database name: ora11g

Oracle home: /u01/app/oracle/product/11.2.0/db_1

Oracle user: oracle

Spfile: /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora

Domain:

Start options: open

Stop options: immediate

Database role: PRIMARY

Management policy: AUTOMATIC

Database instance: ora11g

Disk Groups: DATA,RECO

Services:

 

 

Test the start: the failure persists.

 

 

[oracle@SimpleLinux tmp]$ srvctl start database -d ora11g

PRCR-1079 : Failed to start resource ora.ora11g.db

CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:

CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"

. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".

 

CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed

 

 

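The second repair attempt also ended in failure; Oracle Restart still looks for that pfile. But it gave me a direction: the problem lies in an inconsistency in the database startup parameter file as recorded by Restart.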

 

4. Resolving the problem

 

Oracle Restart is a complex system. Without experience or reference material, I cannot prove that this is an Oracle bug or anything of that kind.

 

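There is one approach worth trying. In Oracle Restart, every component is pluggable, and components can be registered and configured dynamically as needed. From what we have seen, the database itself is fine, so the fault should lie in its registration, and modify does not repair it. Could we remove database ora11g from Restart and then add it back?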

 

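The srvctl add and remove commands do exactly that. When adding, besides the database name (-d), only the -o parameter is mandatory; it takes the ORACLE_HOME directory.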

 

 

[oracle@SimpleLinux dbs]$ srvctl remove database -d ora11g

Remove the database ora11g? (y/[n]) y

[oracle@SimpleLinux dbs]$ srvctl add database -d ora11g -o /u01/app/oracle/product/11.2.0/db_1

[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g

Database unique name: ora11g

Database name:

Oracle home: /u01/app/oracle/product/11.2.0/db_1

Oracle user: oracle

Spfile:

Domain:

Start options: open

Stop options: immediate

Database role: PRIMARY

Management policy: AUTOMATIC

Database instance: ora11g

Disk Groups:

Services:

 

 

Spfile is empty. Try starting again.

 

 

[oracle@SimpleLinux dbs]$ srvctl start database -d ora11g

[oracle@SimpleLinux dbs]$ ps -ef | grep pmon

grid      3215     1  0 14:47 ?        00:00:00 asm_pmon_+ASM

oracle    5265     1  0 15:22 ?        00:00:00 ora_pmon_ora11g

oracle    5386  3578  0 15:22 pts/0    00:00:00 grep pmon

[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g

Database unique name: ora11g

Database name:

Oracle home: /u01/app/oracle/product/11.2.0/db_1

Oracle user: oracle

Spfile:

Domain:

Start options: open

Stop options: immediate

Database role: PRIMARY

Management policy: AUTOMATIC

Database instance: ora11g

Disk Groups: DATA,RECO

Services:

 

 

The start succeeded! Finally, check whether the database starts automatically when the system is rebooted.

 

 

-- Reboot the system

[root@SimpleLinux simplelinux]# ps -ef | grep pmon

grid      3213     1  0 15:27 ?        00:00:00 asm_pmon_+ASM

oracle    3270     1  0 15:27 ?        00:00:00 ora_pmon_ora11g

root      3336  3042  0 15:27 pts/0    00:00:00 grep pmon

 

 

[grid@SimpleLinux ~]$ lsnrctl status

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 17-OCT-2013 15:32:07

 

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))

STATUS of the LISTENER

------------------------

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production

Start Date                17-OCT-2013 15:27:06

Uptime                    0 days 0 hr. 5 min. 0 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/grid/product/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/SimpleLinux/listener/alert/log.xml

Listening Endpoints Summary...

  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=SimpleLinux.localdomain)(PORT=1521)))

Services Summary...

Service "+ASM" has 1 instance(s).

  Instance "+ASM", status READY, has 1 handler(s) for this service...

Service "ora11g" has 1 instance(s).

  Instance "ora11g", status READY, has 1 handler(s) for this service...

Service "ora11gXDB" has 1 instance(s).

  Instance "ora11g", status READY, has 1 handler(s) for this service...

The command completed successfully

 

SQL> show parameter spfile

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

spfile                               string      /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora

 

 

Problem solved.

 

5. Conclusions and reflections

 

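My impression is that this is a coordination failure between Restart and the traditional commands. After the earlier create pfile, Restart no longer seemed able to start the database from the pfile. In addition, throughout the repair the change to the registered spfile never seemed to take effect, which remains a puzzle.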

 

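What is certain is that when database ora11g was added back, no spfile location was specified explicitly, so startup presumably fell back to the automatic spfile-then-pfile search in the default directory. That is why the system was repaired.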
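For reference, the repair sequence used in section 4 can be collected into a small wrapper. This is a sketch under stated assumptions rather than a recommended procedure: `reregister_database` and `run_cmd` are hypothetical names, the wrapper defaults to a dry run that only prints the commands, and the three srvctl invocations are the ones that appear in this article (the start is the step that was run right after the add).

```shell
#!/bin/sh
# run_cmd: hypothetical helper. Prints the command when DRY_RUN is 1
# (the default); otherwise executes it.
run_cmd() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reregister_database: hypothetical wrapper (not an Oracle utility).
# Removes a database from Oracle Restart, adds it back with only the
# database name and Oracle home, then starts it.
reregister_database() {
    db=$1
    home=$2
    run_cmd srvctl remove database -d "$db"
    run_cmd srvctl add database -d "$db" -o "$home"
    run_cmd srvctl start database -d "$db"
}

# Dry run (prints the three commands without executing them):
#   reregister_database ora11g /u01/app/oracle/product/11.2.0/db_1
```

Note that `srvctl remove database` asks for confirmation interactively, as seen in the session above, and that removing a resource discards its registered settings, so the previous `srvctl config database` output is worth saving first.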
