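Oracle Restart is an important High Availability feature introduced in 11gR2. In a Single Instance environment, Clusterware provides an availability-maintenance framework, and the Oracle component services are all managed on top of it.
Oracle Restart has two responsibilities. The first is automatic startup of the Oracle service components: given the complex dependencies between components, letting Restart work out the startup order automatically is a sound strategy. The second is high availability support: Oracle Restart periodically checks the health of the components under its management, and if one of them has been terminated unexpectedly, for example by an abnormal interruption, it restarts that component automatically.
At present, the components that Oracle Restart supports for single-instance Oracle are: the Listener, the Oracle instance and database, the ASM instance, ASM disk groups, database Services, and ONS (Oracle Notification Service).
This post records a failure scenario the author ran into. It is not very complex and does not compare with the work of the experts in the field; it is simply a record of the reasoning, kept for anyone who may need it.
1. The Problem Appears
On an 11gR2 Oracle server, the author had deployed a single-node ASM instance with disk groups, and a Single Instance Oracle database on top of them. Because this is a test machine, the author had run various tests and experiments on it. After booting the server today, a problem showed up.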
[grid@SimpleLinux simplelinux]$ uptime
13:58:13 up 2:24, 1 user, load average: 0.03, 0.02, 0.00
[grid@SimpleLinux simplelinux]$ ps -ef | grep pmon
grid 3212 1 0 11:35 ? 00:00:01 asm_pmon_+ASM
grid 27724 27685 0 13:58 pts/0 00:00:00 grep pmon
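Under a standard Oracle Restart configuration, the ASM instance, the ASM disk groups and the database instance are all managed by Restart and should start automatically when the server boots. In practice, however, the ASM instance started automatically but the database instance did not.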
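The manual `ps -ef | grep pmon` check can also be scripted. The following is a minimal sketch and not part of the original troubleshooting: `pmon_instances` is a hypothetical helper name, and it assumes the `ps -ef` layout shown above, where the process name is the last field.

```shell
# pmon_instances: read `ps -ef` output on stdin and print the name of every
# ASM or database instance that has a running PMON background process.
# (Hypothetical helper; assumes the process name is the last field.)
pmon_instances() {
    awk '$NF ~ /^(asm|ora)_pmon_/ {
        name = $NF
        sub(/^(asm|ora)_pmon_/, "", name)   # strip the process-name prefix
        print name
    }'
}

# Usage on a live system:
#   ps -ef | pmon_instances
```

Fed the output above, it prints only +ASM, which makes the missing ora11g instance easy to spot.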
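As in a RAC configuration, Restart relies on the high-availability daemons, led by ohasd, to bring the components up step by step during server boot.
In this situation, checking the logs is the best option, to see which step went wrong.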
[grid@SimpleLinux simplelinux]$ pwd
/u01/app/grid/product/11.2.0/grid/log/simplelinux
[grid@SimpleLinux simplelinux]$ ls -l | grep alert
-rw-rw---- 1 grid oinstall 14494 Oct 17 11:35 alertsimplelinux.log
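The grid and clusterware logs are all kept in directories under $ORACLE_HOME/log (here, the Grid home). The alert log is the main log and the starting point for any check; problems found there usually lead to further analysis.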
[ohasd(2744)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2013-10-17 11:35:34.373
[cssd(3130)]CRS-1601:CSSD Reconfiguration complete. Active nodes are simplelinux .
2013-10-17 11:35:50.094
[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-10-17 11:35:55.645
[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-10-17 11:35:55.806
[ohasd(2744)]CRS-2807:Resource 'ora.ora11g.db' failed to start automatically.
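This is the problem fragment; see the CRS-5010 and CRS-2807 entries above. After handling the diskmon resource, Clusterware tried to start the database, that is, ora.ora11g.db, and ran into a problem while accessing a parameter file (note that it is a pfile).
Checking further in the oraagent_grid.log that the message points to does not reveal much more.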
2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab file updated with dbName ora11g value /u01/app/oracle/product/11.2.0/db_1:N
2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab CSS unlock
2013-10-17 11:35:50.090: [ora.ora11g.db][3013430160] {0:0:2} [start] (:CLSN00014:)Failed to open file /u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora
2013-10-17 11:35:50.091: [ AGENT][3013430160] {0:0:2} UserErrorException: Locale is
2013-10-17 11:35:50.091: [ora.ora11g.db][3013430160] {0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
According to the message, the pfile could not be opened.
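When an alert log is longer than this excerpt, a quick tally of the CRS message codes helps decide where to look first. This is only a sketch under stated assumptions: `crs_code_counts` is a hypothetical helper name, and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
# crs_code_counts FILE: print each CRS message code found in FILE together
# with the number of times it occurs, one "CODE COUNT" pair per line.
# (Hypothetical helper; it only counts codes, it does not interpret them.)
crs_code_counts() {
    grep -oE 'CRS-[0-9]+' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage, with the alert log path from this post:
#   crs_code_counts /u01/app/grid/product/11.2.0/grid/log/simplelinux/alertsimplelinux.log
```

A code that repeats, such as CRS-5010 here, is usually a good place to start reading.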
2. An Unsuccessful Attempt
The log says that the text parameter file could not be opened. The first guess is a file permission problem, so let's check.
[grid@SimpleLinux oraagent_grid]$ cd /u01/app/oracle/product/11.2.0/db_1/dbs/
[grid@SimpleLinux dbs]$ ls -l
total 20
-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat
-rw-r--r-- 1 oracle oinstall 2851 May 15 2009 init.ora
-rw-r----- 1 oracle oinstall 887 Sep 29 09:31 initora11g.ora
-rw-r----- 1 oracle asmadmin 24 Sep 12 12:58 lkORA11G
-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g
[grid@SimpleLinux dbs]$ id oracle
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(oper),602(asmdba)
[grid@SimpleLinux dbs]$ id grid
uid=501(grid) gid=500(oinstall) groups=500(oinstall),501(dba),600(asmadmin),601(asmoper),602(asmdba)
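The file is read/write for the oracle user and read-only for the group. Both grid and oracle belong to the oinstall group, so grid can at least read the file, and permissions do not look like a serious problem. Still, it is worth loosening them as a test.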
[oracle@SimpleLinux dbs]$ chmod 770 initora11g.ora
[oracle@SimpleLinux dbs]$ ls -l
total 20
-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat
-rw-r--r-- 1 oracle oinstall 2851 May 15 2009 init.ora
-rwxrwx--- 1 oracle oinstall 887 Sep 29 09:31 initora11g.ora
-rw-r----- 1 oracle asmadmin 24 Sep 12 12:58 lkORA11G
-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g
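The permission reasoning can be written down as a small check. The sketch below is an illustration only: `can_read` is a hypothetical function that looks at nothing but the classic owner/group/other mode bits, so it ignores ACLs and the permissions of parent directories.

```shell
# can_read MODE OWNER GROUP USER "GROUP1 GROUP2 ..."
# Prints "yes" or "no" depending on whether USER, who belongs to the listed
# groups, gets read access to a file with the given three-digit octal MODE.
# (Hypothetical helper; classic Unix mode bits only.)
can_read() {
    mode=$1; owner=$2; group=$3; user=$4; user_groups=$5
    # Split the mode into its owner, group and other digits.
    o_digit=${mode%??}
    rest=${mode#?}
    g_digit=${rest%?}
    t_digit=${mode#??}
    if [ "$user" = "$owner" ]; then
        digit=$o_digit
    else
        digit=$t_digit
        for g in $user_groups; do
            if [ "$g" = "$group" ]; then digit=$g_digit; break; fi
        done
    fi
    # Read permission is the 4 bit of the selected digit.
    if [ $(( digit & 4 )) -ne 0 ]; then echo yes; else echo no; fi
}

# grid against initora11g.ora before (640) and after (770) the chmod:
can_read 640 oracle oinstall grid "oinstall dba asmadmin asmoper asmdba"
can_read 770 oracle oinstall grid "oinstall dba asmadmin asmoper asmdba"
```

Both calls print yes, so by the mode bits alone grid could read the file even before the chmod.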
Try to start the database.
[grid@SimpleLinux ~]$ srvctl start database -d ora11g
PRCR-1079 : Failed to start resource ora.ora11g.db
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed
Startup failed. Does starting the database the traditional way, from the sqlplus command line, work?
[oracle@SimpleLinux ~]$ sqlplus /nolog
SQL*Plus: Release 11.2.0.3.0 Production on Thu Oct 17 14:17:11 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 263639040 bytes
Fixed Size 1344312 bytes
Variable Size 134221000 bytes
Database Buffers 125829120 bytes
Redo Buffers 2244608 bytes
Database mounted.
Database opened.
SQL> quit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options
[oracle@SimpleLinux ~]$ ps -ef | grep pmon
grid 3212 1 0 11:35 ? 00:00:02 asm_pmon_+ASM
oracle 27979 1 0 14:17 ? 00:00:00 ora_pmon_ora11g
oracle 28106 27921 0 14:17 pts/0 00:00:00 grep pmon
[oracle@SimpleLinux ~]$ srvctl status database -d ora11g
Database is running.
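Startup succeeded. The database can be started from the sqlplus command line, but starting it through Oracle Restart fails. So where is the problem?
3. Spfile vs. Pfile
On the face of it, Oracle Restart wants to access the pfile at startup, which is odd: why would the pfile, long since superseded by the spfile, come up at all? With the instance now running, let's check which parameter file is currently in use.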
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string
SQL>
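The instance was started from a pfile, and the earlier listing of $ORACLE_HOME/dbs showed no spfile either. During startup, Oracle by default first looks for an spfile at a path assembled from environment variables, and only then for a pfile. The empty spfile parameter shows that a pfile is currently in use.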
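The default lookup can be mimicked with a few lines of shell. This is a rough sketch rather than what Oracle actually executes: `find_param_file` is a hypothetical name, and it assumes the usual Unix default location $ORACLE_HOME/dbs with the search order spfile<SID>.ora, then spfile.ora, then init<SID>.ora.

```shell
# find_param_file DBS_DIR SID: print the first parameter file that a plain
# STARTUP would pick up by default, or return non-zero if there is none.
# (Hypothetical helper mirroring the default search order.)
find_param_file() {
    dbs_dir=$1; sid=$2
    for name in "spfile${sid}.ora" "spfile.ora" "init${sid}.ora"; do
        if [ -f "${dbs_dir}/${name}" ]; then
            echo "${dbs_dir}/${name}"
            return 0
        fi
    done
    return 1
}

# Usage with the paths from this post:
#   find_param_file /u01/app/oracle/product/11.2.0/db_1/dbs ora11g
```

With only initora11g.ora present in the directory, the search falls through to the pfile, which matches the empty spfile parameter seen above.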
However, the startup information registered in Oracle Restart looks somewhat different.
[grid@SimpleLinux ~]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name: ora11g
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: +DATA/ora11g/spfileora11g.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
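The two clearly differ. At this point the author recalled an earlier experiment that generated spfiles and pfiles in the ASM environment, and suspected that it had left the Restart registration out of step with the instance.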
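To compare the two values without reading the whole listing, the Spfile line can be pulled out of the `srvctl config database` output. A minimal sketch, assuming the "Spfile:" label shown above; `registered_spfile` is a hypothetical helper name.

```shell
# registered_spfile: read `srvctl config database` output on stdin and print
# the value of its "Spfile:" line (an empty line if none is registered).
# (Hypothetical helper; relies on the "Spfile:" label in the listing above.)
registered_spfile() {
    sed -n 's/^Spfile:[[:space:]]*//p'
}

# Usage on a live system:
#   srvctl config database -d ora11g | registered_spfile
```

The result can then be set against the value from `show parameter spfile` in the running instance.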
This suggested a second repair strategy.
SQL> create spfile from pfile;
File created.
SQL> startup force
ORACLE instance started.
Total System Global Area 263639040 bytes
Fixed Size 1344312 bytes
Variable Size 134221000 bytes
Database Buffers 125829120 bytes
Redo Buffers 2244608 bytes
Database mounted.
Database opened.
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string /u01/app/oracle/product/11.2.0
/db_1/dbs/spfileora11g.ora
Register the newly created spfile as the startup parameter file, to try to bring the Restart registration in line with the instance.
[oracle@SimpleLinux ~]$ srvctl modify database -d ora11g -p /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
[oracle@SimpleLinux ~]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name: ora11g
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
Test the startup: the failure persists.
[oracle@SimpleLinux tmp]$ srvctl start database -d ora11g
PRCR-1079 : Failed to start resource ora.ora11g.db
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed
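The second repair attempt also ended in failure; Oracle Restart is still looking for that pfile. But the author now had a direction: the problem lies in an inconsistency in how Restart records the database's startup parameter file.
4. Solving the Problem
Oracle Restart is a complex system. Without experience or reference material, the author cannot establish whether this is an Oracle bug or something of that kind.
There is one idea worth trying. In Oracle Restart, every component is pluggable, and components can be registered and configured dynamically as needed. From what we have seen, the database itself is fine, so the fault should lie in the configuration, and modifying the configuration does not repair it. Could we remove database ora11g from Restart and then add it back?
The srvctl add and remove commands make this possible. Moreover, when adding, apart from the database name (-d) the only mandatory parameter is -o, which takes the ORACLE_HOME directory.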
[oracle@SimpleLinux dbs]$ srvctl remove database -d ora11g
Remove the database ora11g? (y/[n]) y
[oracle@SimpleLinux dbs]$ srvctl add database -d ora11g -o /u01/app/oracle/product/11.2.0/db_1
[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name:
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups:
Services:
The Spfile field is empty. Try starting the database again.
[oracle@SimpleLinux dbs]$ srvctl start database -d ora11g
[oracle@SimpleLinux dbs]$ ps -ef | grep pmon
grid 3215 1 0 14:47 ? 00:00:00 asm_pmon_+ASM
oracle 5265 1 0 15:22 ? 00:00:00 ora_pmon_ora11g
oracle 5386 3578 0 15:22 pts/0 00:00:00 grep pmon
[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name:
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
Startup succeeded! Finally, check whether the database starts automatically when the system reboots.
-- Reboot the system
[root@SimpleLinux simplelinux]# ps -ef | grep pmon
grid 3213 1 0 15:27 ? 00:00:00 asm_pmon_+ASM
oracle 3270 1 0 15:27 ? 00:00:00 ora_pmon_ora11g
root 3336 3042 0 15:27 pts/0 00:00:00 grep pmon
[grid@SimpleLinux ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 17-OCT-2013 15:32:07
Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date 17-OCT-2013 15:27:06
Uptime 0 days 0 hr. 5 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/grid/product/11.2.0/grid/network/admin/listener.ora
Listener Log File /u01/app/grid/diag/tnslsnr/SimpleLinux/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=SimpleLinux.localdomain)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM", status READY, has 1 handler(s) for this service...
Service "ora11g" has 1 instance(s).
Instance "ora11g", status READY, has 1 handler(s) for this service...
Service "ora11gXDB" has 1 instance(s).
Instance "ora11g", status READY, has 1 handler(s) for this service...
The command completed successfully
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
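Problem solved.
5. Conclusions and Reflection
Intuitively, this looks like a coordination failure between Restart and the traditional commands. After the earlier create pfile operation, Restart no longer seemed able to start the database from a pfile. In addition, throughout the repair we kept seeing that changing the spfile setting did not take effect, which remains puzzling.
What is certain is that when database ora11g was added back, no spfile location was specified explicitly, so startup should have gone through the automatic directory search, spfile first and then pfile. That is why the system was repaired.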