Reposted from love wife & love life, Roger's Oracle technical blog.
Original post: A fix for a RAC node that cannot start because of a HAIP failure.
A reader asked me about a problem with his 11.2.0.2 RAC (on AIX), with no patches or PSUs installed. After a reboot, one of the nodes could no longer start. The ocssd log showed the following:
-08-09 14:21:46.094: [ CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
-08-09 14:21:46.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:47.042: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
-08-09 14:21:47.051: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
-08-09 14:21:47.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:48.042: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
-08-09 14:21:48.052: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
-08-09 14:21:48.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:49.043: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
-08-09 14:21:49.056: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
-08-09 14:21:49.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:50.044: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
-08-09 14:21:50.057: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
-08-09 14:21:50.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:51.046: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
-08-09 14:21:51.057: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
-08-09 14:21:51.102: [ CSSD][5414]clssnmSendingThread: sending join msg to all nodes
-08-09 14:21:51.102: [ CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
-08-09 14:21:51.421: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
-08-09 14:21:52.050: [ CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
-08-09 14:21:52.058: [ CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
-08-09 14:21:52.089: [ CSSD][5671]clssnmRcfgMgrThread: Local Join
-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
-08-09 14:21:52.089: [ CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
-08-09 14:21:52.090: [ CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
-08-09 14:21:52.431: [ CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
The messages above easily give the impression of a heartbeat problem. That reading is not wrong, except that the "heartbeat" here is not the traditional heartbeat network we usually have in mind. I asked him to run the following query on a node where CRS was still healthy, and the cause became clear:
SQL> select name,ip_address from v$cluster_interconnects;

NAME            IP_ADDRESS
--------------- ----------------
en0             169.254.116.242
As you can see, the interconnect IP is in the 169.254 range. That clearly does not match what is configured in /etc/hosts. Why?
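For context, 169.254.0.0/16 is the link-local block, and HAIP assigns its interconnect addresses from it regardless of the private IPs in /etc/hosts. A minimal sketch (the helper name is made up for illustration) of the distinction:

```shell
# Hypothetical helper: classify an interconnect address. HAIP-managed
# addresses always fall in the link-local block 169.254.0.0/16,
# independent of the private IPs configured in /etc/hosts.
classify_ic_addr() {
  case "$1" in
    169.254.*) echo "link-local (HAIP-managed)" ;;
    *)         echo "static (OS-configured)" ;;
  esac
}

classify_ic_addr 169.254.116.242   # the address v$cluster_interconnects reported
classify_ic_addr 192.168.1.100     # the kind of address set in /etc/hosts
```

The first call prints "link-local (HAIP-managed)", the second "static (OS-configured)", which is exactly the mismatch seen in the query above.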
This is where the HAIP feature, introduced in Oracle 11gR2, comes in. Oracle added it to provide interconnect redundancy with its own technology, instead of relying on third-party mechanisms such as Linux NIC bonding.
Before 11.2.0.2, if the interconnect NICs were bonded at the OS level, Oracle simply used the bonded interface. From 11.2.0.2 onward, HAIP is enabled even when no redundancy has been configured at the OS level. So although you configured 192.168.1.100, Oracle actually uses an address in the 169.254 range. You can confirm this in the alert log, so I will not go into more detail here.
We can see that the healthy node carries an IP in the 169.254 range as shown above, while the problem node indeed has no such address.
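A quick way to compare the two nodes is to look for the extra 169.254 alias on the interconnect NIC. A sketch of that check, using hypothetical `ifconfig en0` output for the healthy node (the sample lines below are illustrative, not from the actual system):

```shell
# Hypothetical 'ifconfig en0' output from the healthy node; the problem
# node would show only the first inet line, without the 169.254 alias.
sample_en0="inet 192.168.1.100 netmask 0xffffff00 broadcast 192.168.1.255
inet 169.254.116.242 netmask 0xffff0000 broadcast 169.254.255.255"

# The check itself: does the interface carry a HAIP (169.254.x.x) alias?
if printf '%s\n' "$sample_en0" | grep -q 'inet 169\.254\.'; then
  echo "HAIP alias present"
else
  echo "HAIP alias missing"
fi
```

Running the same grep against the problem node's interface output would print "HAIP alias missing".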
Oracle MOS offers one remedy, as follows:
crsctl start res ora.cluster_interconnect.haip -init
I tested this, running it as root, and it did not work either. For HAIP failing to start, the Oracle MOS documentation lists the following typical causes:
1) A faulty interconnect NIC
2) Broken multicast
3) Firewalls and similar interference
4) Oracle bugs
A faulty interconnect NIC is easy to rule out: if there is only one interconnect NIC, simply pinging the other nodes' private IPs verifies it.
Multicast can be checked with Oracle's mcasttest.pl script (see "Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement" (Doc ID 1212703.1)). My check results were as follows:
$ ./mcasttest.pl -n rac02,rac01 -i en0
###########  Setup for node rac02  ##########
Checking node access 'rac02'
Checking node login 'rac02'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac02'
Distributing mcast2 binary to node 'rac02'
###########  Setup for node rac01  ##########
Checking node access 'rac01'
Checking node login 'rac01'
Checking/Creating Directory /tmp/mcasttest for binary on node 'rac01'
Distributing mcast2 binary to node 'rac01'
###########  testing Multicast on all nodes  ##########
Test for Multicast address 230.0.1.0
Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
$
Although the script shows that both the 230.x and 224.x multicast groups fail, that does not necessarily prove multicast is the root cause, even though searching ocssd.log for the keyword "mcast" does turn up related messages.
In fact, in my own 11.2.0.3 Linux RAC environment, CRS starts normally even when the mcasttest.pl checks fail.
Since this reader's system is AIX, I ruled out a firewall problem, which left Bug 9974223 as the most likely suspect. In fact, if you search for information on HAIP, you will find that the feature is associated with quite a few Oracle bugs.
The MOS note on known HAIP issues in 11gR2/12c Grid Infrastructure (Doc ID 1640865.1) alone documents 12 HAIP-related bugs.
Because his first node could not be experimented on, we had to keep operations on it to a minimum for safety.
If you are not using multiple interconnect NICs, I think HAIP can safely be disabled altogether. The MOS document I checked yesterday says it cannot be disabled, but my testing shows that it actually can. Here is my test:
[root@rac1 bin]# ./crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
[root@rac1 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
CRS-2673: Attempting to stop 'ora.GRID.dg' on 'rac1'
CRS-2673: Attempting to stop 'ora.registry.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.rac1.vip' on 'rac2'
CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac2'
CRS-2676: Start of 'ora.rac1.vip' on 'rac2' succeeded
CRS-2676: Start of 'ora.scan1.vip' on 'rac2' succeeded
CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
CRS-2677: Stop of 'ora.registry.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
CRS-2677: Stop of 'ora.GRID.dg' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'rac1'
CRS-2677: Stop of 'ora.net1.network' on 'rac1' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@rac1 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@rac1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@rac1 bin]# ./crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
ONLINE ONLINE rac1 Started
ora.cluster_interconnect.haip
ONLINE OFFLINE
ora.crf
ONLINE ONLINE rac1
ora.crsd
ONLINE ONLINE rac1
ora.cssd
ONLINE ONLINE rac1
ora.cssdmonitor
ONLINE ONLINE rac1
ora.ctssd
ONLINE ONLINE rac1 ACTIVE:0
ora.diskmon
OFFLINE OFFLINE
ora.drivers.acfs
ONLINE ONLINE rac1
ora.evmd
ONLINE ONLINE rac1
ora.gipcd
ONLINE ONLINE rac1
ora.gpnpd
ONLINE ONLINE rac1
ora.mdnsd
ONLINE ONLINE rac1
[root@rac1 bin]#
Note, however, that after making this change you must restart CRS on both nodes; restarting it on only one node is not enough, and the ASM instance will fail to start.
Why does a HAIP failure prevent the node's CRS from starting at all? A look at the resource's attributes makes this clear:
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:oracle:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CHECK_INTERVAL=30
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=0
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)
START_TIMEOUT=60
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=hard(ora.cssd)
STOP_TIMEOUT=0
UPTIME_THRESHOLD=1m
USR_ORA_AUTO=
USR_ORA_IF=
USR_ORA_IF_GROUP=cluster_interconnect
USR_ORA_IF_THRESHOLD=20
USR_ORA_NETMASK=
USR_ORA_SUBNET=
As the START_DEPENDENCIES and STOP_DEPENDENCIES attributes show, this resource is tied to gpnpd and cssd, so when it misbehaves, gpnpd and cssd run into problems as well.
Note: you can also avoid HAIP problems by specifying cluster_interconnects at the ASM level.
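As a sketch of that workaround (the instance names and addresses below are examples, not taken from the case above), you would pin the interconnect address for each ASM instance:

```
SQL> -- example only: substitute your own private-network IPs and SIDs
SQL> alter system set cluster_interconnects='192.168.1.100' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='192.168.1.101' scope=spfile sid='+ASM2';
```

Because cluster_interconnects is a static parameter, each ASM instance must be restarted for the setting to take effect; once set, the instance uses the specified address instead of the HAIP one.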