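An earlier article, the advanced WSFC log analysis post, touched on some of the underlying mechanics of WSFC, such as Resource.dll, RHS, and RCM. Understanding these components helps a great deal when troubleshooting a cluster later on, so in this article we walk through a real resource deadlock case to reinforce them.
First, what does Resource.dll do? A resource DLL is the component every cluster resource depends on: a cluster resource can only function if its resource DLL is present. The resource DLL defines the LooksAlive and IsAlive health check functions for the resource, as well as the operations that can be invoked on it.
RHS, the Resource Hosting Subsystem, runs as a process and monitors whether cluster resources are healthy. It does so by calling the LooksAlive and IsAlive methods defined in each resource's resource DLL.
When the IsAlive check confirms that a resource has failed, RHS reports the failure to RCM, the Resource Control Manager. RCM then restarts, retries, or fails over the resource according to the resource's failure policy.
Put simply, Resource.dll defines how a patient should be examined and which steps the examination follows. RHS is the doctor who examines the patient according to those steps. Once RHS has a diagnosis it hands the patient to the RCM nurse, who decides, based on policy, whether the patient gets an injection or is admitted to hospital.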
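Today we focus on the stage where RHS does the examining. Normally the outcome is binary: the resource is either healthy or it is not. If it is healthy, RHS reports that and keeps looping through its checks; if it is not, RCM starts handling the failure according to policy.
Besides these two outcomes, though, there is a third common one: resource deadlock.
What is a resource deadlock? RHS sends an IsAlive check to a cluster resource, but the resource never responds. Without a response the cluster cannot tell whether the resource is alive, and it will not wait forever, because it must know for certain whether each resource is available. After a period of time the cluster declares the resource deadlocked. RHS moves the unresponsive resource into a separate, isolated RHS process, tries to restart it, and creates a deadlock WER report.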
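As for how long that takes: starting with WSFC 2008 R2, the cluster waits 5 minutes by default. If the resource still has not responded after 5 minutes, it is considered deadlocked, and the cluster attempts to restart it in its own RHS process.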
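The three possible outcomes of a health check (healthy, failed, deadlocked) can be summed up in a few lines. This is only a toy model to make the decision rule concrete, not how RHS is implemented; the names `Verdict` and `classify_check` are made up for illustration, and the 300000 ms default is the 5-minute wait mentioned above.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    HEALTHY = "healthy"        # resource answered: alive
    FAILED = "failed"          # resource answered: not alive, RCM takes over
    PENDING = "pending"        # no answer yet, still inside the wait window
    DEADLOCKED = "deadlocked"  # no answer and the wait window has expired


def classify_check(answer: Optional[bool], waited_ms: int,
                   deadlock_timeout_ms: int = 300_000) -> Verdict:
    """Toy decision rule for one health check.

    answer is True/False if the resource replied, or None if it has not
    replied at all. waited_ms is how long the check has been outstanding.
    """
    if answer is None:
        if waited_ms >= deadlock_timeout_ms:
            return Verdict.DEADLOCKED
        return Verdict.PENDING
    return Verdict.HEALTHY if answer else Verdict.FAILED


if __name__ == "__main__":
    print(classify_check(True, 20))         # Verdict.HEALTHY
    print(classify_check(False, 20))        # Verdict.FAILED
    print(classify_check(None, 120_000))    # Verdict.PENDING
    print(classify_check(None, 300_000))    # Verdict.DEADLOCKED
```

The point of the fourth state is that "no answer" is treated differently from "answered: failed": only the former leads to the deadlock handling described here.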
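Starting with 2008 R2, the resource deadlock timeout can be changed with the following commands (the value is in milliseconds, so 300000 is the 5-minute default).
At the individual resource level:
(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000
At the resource type level:
(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000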
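The next layer of protection kicks in when the cluster issues a request to terminate the RHS process. It waits four times the DeadlockTimeout value (20 minutes by default) for the RHS process to terminate. If RHS has not terminated within those 20 minutes, the cluster concludes that the server has a serious health problem and bugchecks the server to force failover and recovery. Cluster resources may be inaccessible during this period. The bugcheck code is Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4). Note: if the bugcheck is the result of an RHS process failing to terminate, Parameter3 is always 0x5.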
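The two time windows are simple arithmetic on DeadlockTimeout, and it is worth redoing the calculation whenever the value is changed. A minimal sketch, assuming only what is stated above (a millisecond value and a multiplier of four); the constant and function names are made up for illustration:

```python
DEFAULT_DEADLOCK_TIMEOUT_MS = 300_000   # default DeadlockTimeout, in milliseconds
RHS_TERMINATION_MULTIPLIER = 4          # cluster waits 4x DeadlockTimeout for RHS to exit


def deadlock_windows(deadlock_timeout_ms: int = DEFAULT_DEADLOCK_TIMEOUT_MS):
    """Return (minutes until a resource is declared deadlocked,
    minutes the cluster then waits for RHS to terminate before bugchecking)."""
    deadlock_minutes = deadlock_timeout_ms / 60_000
    termination_minutes = deadlock_minutes * RHS_TERMINATION_MULTIPLIER
    return deadlock_minutes, termination_minutes


if __name__ == "__main__":
    print(deadlock_windows())          # (5.0, 20.0)
    print(deadlock_windows(600_000))   # (10.0, 40.0)
```

Doubling DeadlockTimeout therefore also doubles how long a hung RHS process can keep resources unavailable before the node is bugchecked.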
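From WSFC 2008 R2 onward:
The cluster IP address, cluster network name, and quorum resource run together in one RHS monitoring process
Available Storage disks and CSV run together in one RHS monitoring process
All other cluster resources run in dedicated RHS monitoring processes
This avoids the old weakness of hosting every resource in a single RHS process. Before 2008 R2, all cluster resources were hosted in the same RHS process, so if just one resource crashed, the whole RHS process could fail and take every hosted resource down with it.
With this separation in place, if a single unresponsive resource causes RHS to crash, the cluster service treats that specific resource as suspect and in need of isolation. It automatically sets the resource common property SeparateMonitor to mark the resource to run in its own dedicated RHS process, so that if the resource becomes unresponsive again it does not affect the other cluster resources.
Starting with 2008 R2, once a resource deadlock like the one described above occurs (the application does not respond to the IsAlive request), a WER report is generated. It is visible in Control Panel and includes a dump of the deadlocked RHS process. WSFC 2016 takes this further and can show more detailed information about the deadlock, helping troubleshooters with Zero Downtime Debugging.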
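OK, with the background covered, let's look at this resource deadlock case.
In some situations you run into baffling problems, especially during changes. You roll out many changes at once, something suddenly breaks, and you have no idea which change caused it. That is the most painful kind of problem, and it is when you need to sit down and dig in properly.
This problem came out of a patching change. A site was patching a batch of cluster nodes. After two of the nodes finished updating and rebooted, the machines ran slowly and the cluster service would not start. Forcing quorum on each of the two nodes did not help: the cluster failed again immediately after the force quorum, and the witness disk and cluster disks never came online.
The patches were the first suspect, but uninstalling them did not change anything, so we moved on to cluster-level troubleshooting. We first checked the path between the cluster and the storage and confirmed the links were fine. Next we started with the cluster event logs.
Cluster event log entries: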
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1573
Task Category: Quorum Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Node 'ZQ1' failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
Cluster resource '羣集磁盤 1' in clustered role 'wlc' failed.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date:
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer:
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource '羣集磁盤 1' (resource type '', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
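Most of the cluster events show that the cluster disk cannot come online, together with deadlocks and RHS timeouts.
The cluster log shows a deadlock on cluster disk 2 (the resource name appears garbled in the log output):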
00001148.00001158::2017/09/15-20:24:36.365 ERR [RHS] RhsCall::DeadlockMonitor: Call ONLINERESOURCE timed out for resource 'èoˉ′ì 2'.
00001148.00001158::2017/09/15-20:24:36.365 ERR [RHS] Resource èoˉ′ì 2 handling deadlock. Cleaning current operation.
00001148.00001158::2017/09/15-20:24:36.365 ERR [RHS] About to send WER report.
0000084c.0000159c::2017/09/15-20:24:36.365 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'èoˉ′ì 2', gen(0) result 5018.
0000084c.0000159c::2017/09/15-20:24:36.365 INFO [RCM] TransitionToState(èoˉ′ì 2) OnlinePending-->ProcessingFailure.
0000084c.0000159c::2017/09/15-20:24:36.365 ERR [RCM] rcm::RcmResource::HandleFailure: (èoˉ′ì 2)
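So far the logs indicate that the cluster problem is tied to the cluster disks failing to come online on both nodes. The cluster disks cannot be accessed and RHS deadlocks, which indicates that cluster disk I/O is not being answered in time. The system event log also records a bugcheck: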
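Scanning a large cluster log by eye for lines like these is tedious. As a rough sketch, the snippet below parses the line layout seen in the excerpt above (process ID, thread ID, timestamp, level, component, message) and pulls out the DeadlockMonitor timeouts. The regular expressions and the name `find_deadlock_timeouts` are my own illustration based only on these sample lines, not an official description of the cluster log format.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines: pid.tid::timestamp LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::(?P<ts>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Z]+)\]\s+(?P<message>.*)$"
)
TIMEOUT_RE = re.compile(
    r"DeadlockMonitor: Call (?P<call>\w+) timed out for resource '(?P<resource>[^']*)'"
)

SAMPLE = """\
00001148.00001158::2017/09/15-20:24:36.365 ERR [RHS] RhsCall::DeadlockMonitor: Call ONLINERESOURCE timed out for resource 'èoˉ′ì 2'.
00001148.00001158::2017/09/15-20:24:36.365 ERR [RHS] About to send WER report.
0000084c.0000159c::2017/09/15-20:24:36.365 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'èoˉ′ì 2', gen(0) result 5018.
"""


def find_deadlock_timeouts(lines):
    """Return one dict per DeadlockMonitor timeout found in the given lines."""
    hits = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if parsed is None:
            continue  # not a log line in the expected layout
        timeout = TIMEOUT_RE.search(parsed.group("message"))
        if timeout is None:
            continue  # a valid log line, but not a deadlock timeout
        hits.append({
            "time": datetime.strptime(parsed.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"),
            "pid": parsed.group("pid"),
            "call": timeout.group("call"),
            "resource": timeout.group("resource"),
        })
    return hits


if __name__ == "__main__":
    for hit in find_deadlock_timeouts(SAMPLE.splitlines()):
        print(hit["time"], hit["pid"], hit["call"], repr(hit["resource"]))
```

The `call` field is the useful part here: it shows that the resource hung while being brought online (ONLINERESOURCE), not during a routine IsAlive check.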
The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0xfffffab030cce600, 0x00000000000004b0, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP
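At this point we still cannot tell what actually caused the problem. Following the bugcheck, the next step is to look at a dump file to find the cause of the resource deadlock. The dump can be the process dump from WER, but a complete memory dump is better. In this case we use the complete MEMORY.DMP.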
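Before opening the dump, the bugcheck message itself already says something. Below is a small sketch that splits the message into its code and four parameters. It assumes, per my reading of Microsoft's reference for bug check 0x9E (USER_MODE_HEALTH_MONITOR), that parameter 1 is the address of the process that failed the health check and parameter 2 is the health monitoring timeout in seconds; please verify that against the official documentation. The function name `decode_bugcheck` is made up for illustration.

```python
import re

BUGCHECK_RE = re.compile(r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)")

MESSAGE = (
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e "
    "(0xfffffab030cce600, 0x00000000000004b0, 0x0000000000000000, "
    "0x0000000000000000). A dump was saved in: C:\\Windows\\MEMORY.DMP"
)


def decode_bugcheck(message: str) -> dict:
    """Split a bugcheck event message into its code and parameters."""
    match = BUGCHECK_RE.search(message)
    if match is None:
        raise ValueError("no bugcheck code found in message")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return {
        "code": code,
        "process_address": params[0],      # assumed meaning for 0x9E
        "timeout_seconds": params[1],      # assumed meaning for 0x9E
        "timeout_minutes": params[1] / 60,
        "parameter3": params[2],
        "parameter4": params[3],
    }


if __name__ == "__main__":
    info = decode_bugcheck(MESSAGE)
    print(hex(info["code"]))                # 0x9e
    print(hex(info["process_address"]))     # 0xfffffab030cce600
    print(info["timeout_seconds"])          # 1200
    print(info["timeout_minutes"])          # 20.0
```

Under that assumption, 0x4b0 is 1200 seconds, i.e. 20 minutes, which lines up with four times the default DeadlockTimeout.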
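From the dump, the deadlocked RHS process is fffffab030cce600. The process has three threads, two of which are waiting on system thread fffffab0`30f0a6d0. Its call stack is shown below; this system thread is waiting in the TmXPFlt.sys driver.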
00 fffff880`08d92420 fffff800`01ad3142 nt!KiSwapContext+0x7a
01 fffff880`08d92560 fffff800`01ad596f nt!KiCommitThreadWait+0x1d2
02 fffff880`08d925f0 fffff880`056880e4 nt!KeWaitForSingleObject+0x19f
03 fffff880`08d92690 fffff880`05680838 TmXPFlt+0xb0e4
04 fffff880`08d926f0 fffff880`05670be2 TmXPFlt+0x3838
05 fffff880`08d927e0 fffff880`0148c0f7 TmPreFlt!TmpQueryFullName+0xb66
06 fffff880`08d928a0 fffff880`0148ea0a fltmgr!FltpPerformPreCallbacks+0x50b
07 fffff880`08d929b0 fffff880`014aa2a3 fltmgr!FltpPassThroughInternal+0x4a
08 fffff880`08d929e0 fffff800`01dd22bb fltmgr!FltpCreate+0x293
09 fffff880`08d92a90 fffff800`01dcddde nt!IopParseDevice+0x14e2
0a fffff880`08d92bf0 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
0b fffff880`08d92cf0 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
0c fffff880`08d92dc0 fffff800`01d7316b nt!IopCreateFile+0x2bc
0d fffff880`08d92e60 fffff880`014b1f60 nt!IoCreateFileEx+0xfb
0e fffff880`08d92f00 fffff880`014bdc61 fltmgr!FltpCreateFile+0x194
0f fffff880`08d92ff0 fffff880`014e3506 fltmgr!FltCreateFileEx+0x91
10 fffff880`08d93080 fffff880`014de40e dfsrro!DfsrRopLoadPrefixEntriesFromFile+0x416
11 fffff880`08d93250 fffff880`014b00c6 dfsrro!DfsrRoNewInstanceCallback+0x2e2
12 fffff880`08d932b0 fffff880`014af0cb fltmgr!FltpDoInstanceSetupNotification+0x86
13 fffff880`08d93310 fffff880`014afe81 fltmgr!FltpInitInstance+0x27b
14 fffff880`08d93380 fffff880`014b0d5b fltmgr!FltpCreateInstanceFromName+0x1d1
15 fffff880`08d93450 fffff880`014aed6c fltmgr!FltpEnumerateRegistryInstances+0x15b
16 fffff880`08d934f0 fffff880`014aa3f0 fltmgr!FltpDoFilterNotificationForNewVolume+0xec
17 fffff880`08d93560 fffff800`01dd22bb fltmgr!FltpCreate+0x3e0
18 fffff880`08d93610 fffff800`01dcddde nt!IopParseDevice+0x14e2
19 fffff880`08d93770 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
1a fffff880`08d93870 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
1b fffff880`08d93940 fffff800`01ddbd34 nt!IopCreateFile+0x2bc
1c fffff880`08d939e0 fffff800`01acd0d3 nt!NtCreateFile+0x78
1d fffff880`08d93a70 00000000`76fac28a nt!KiSystemServiceCopyEnd+0x13
1e 00000000`0219f748 00000000`00000000 0x76fac28a
The other thread in the RHS process, fffffab0`30ef2b50, is likewise waiting in the TmXPFlt.sys driver:
00 fffff880`08d92420 fffff800`01ad3142 nt!KiSwapContext+0x7a
01 fffff880`08d92560 fffff800`01ad596f nt!KiCommitThreadWait+0x1d2
02 fffff880`08d925f0 fffff880`056880e4 nt!KeWaitForSingleObject+0x19f
03 fffff880`08d92690 fffff880`05680838 TmXPFlt+0xb0e4
04 fffff880`08d926f0 fffff880`05670be2 TmXPFlt+0x3838
05 fffff880`08d927e0 fffff880`0148c0f7 TmPreFlt!TmpQueryFullName+0xb66
06 fffff880`08d928a0 fffff880`0148ea0a fltmgr!FltpPerformPreCallbacks+0x50b
07 fffff880`08d929b0 fffff880`014aa2a3 fltmgr!FltpPassThroughInternal+0x4a
08 fffff880`08d929e0 fffff800`01dd22bb fltmgr!FltpCreate+0x293
09 fffff880`08d92a90 fffff800`01dcddde nt!IopParseDevice+0x14e2
0a fffff880`08d92bf0 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
0b fffff880`08d92cf0 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
0c fffff880`08d92dc0 fffff800`01d7316b nt!IopCreateFile+0x2bc
0d fffff880`08d92e60 fffff880`014b1f60 nt!IoCreateFileEx+0xfb
0e fffff880`08d92f00 fffff880`014bdc61 fltmgr!FltpCreateFile+0x194
0f fffff880`08d92ff0 fffff880`014e3506 fltmgr!FltCreateFileEx+0x91
10 fffff880`08d93080 fffff880`014de40e dfsrro!DfsrRopLoadPrefixEntriesFromFile+0x416
11 fffff880`08d93250 fffff880`014b00c6 dfsrro!DfsrRoNewInstanceCallback+0x2e2
12 fffff880`08d932b0 fffff880`014af0cb fltmgr!FltpDoInstanceSetupNotification+0x86
13 fffff880`08d93310 fffff880`014afe81 fltmgr!FltpInitInstance+0x27b
14 fffff880`08d93380 fffff880`014b0d5b fltmgr!FltpCreateInstanceFromName+0x1d1
15 fffff880`08d93450 fffff880`014aed6c fltmgr!FltpEnumerateRegistryInstances+0x15b
16 fffff880`08d934f0 fffff880`014aa3f0 fltmgr!FltpDoFilterNotificationForNewVolume+0xec
17 fffff880`08d93560 fffff800`01dd22bb fltmgr!FltpCreate+0x3e0
18 fffff880`08d93610 fffff800`01dcddde nt!IopParseDevice+0x14e2
19 fffff880`08d93770 fffff800`01dce8c6 nt!ObpLookupObjectName+0x784
1a fffff880`08d93870 fffff800`01dd06bc nt!ObOpenObjectByName+0x306
1b fffff880`08d93940 fffff800`01ddbd34 nt!IopCreateFile+0x2bc
1c fffff880`08d939e0 fffff800`01acd0d3 nt!NtCreateFile+0x78
1d fffff880`08d93a70 00000000`76fac28a nt!KiSystemServiceCopyEnd+0x13
1e 00000000`0219f748 00000000`00000000 0x76fac28a
start end module name
fffff880`0567d000 fffff880`056d3000 TmXPFlt (no symbols)
Loaded symbol image file: TmXPFlt.sys
Image path: \??\C:\Program Files (x86)\Trend Micro\OfficeScan Client\TmXPFlt.sys
Image name: TmXPFlt.sys
Timestamp: Wed Jun 10 18:54:43 2009 (4A2F90F4)
CheckSum: 00040739
ImageSize: 00056000
Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
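Based on this dump, the RHS deadlock is related to the Trend Micro driver. Taken together with the analysis above, this driver is very likely the reason the disks could not come online.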
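When a stack is this long, a quick way to spot the suspect is to list the modules that appear in it and drop the ones known to ship with Windows. A minimal sketch follows, using a handful of frames from the stack above. The set of "known Microsoft" modules is my own assumption for this particular dump, not a general allowlist, and the name `third_party_modules` is made up for illustration.

```python
import re

# Frame layout from the debugger output: index, child SP, return address, symbol
FRAME_RE = re.compile(r"^[0-9a-f]{2}\s+[0-9a-f`]+\s+[0-9a-f`]+\s+(?P<symbol>\S+)$")
# A symbol is module!function+offset or module+offset; raw addresses have no module
SYMBOL_RE = re.compile(r"^(?P<module>[A-Za-z_]\w*)(?:!|\+|$)")

# Assumption: modules in this stack that ship with Windows
KNOWN_MICROSOFT_MODULES = {"nt", "fltmgr", "dfsrro"}

SAMPLE_STACK = """\
02 fffff880`08d925f0 fffff880`056880e4 nt!KeWaitForSingleObject+0x19f
03 fffff880`08d92690 fffff880`05680838 TmXPFlt+0xb0e4
04 fffff880`08d926f0 fffff880`05670be2 TmXPFlt+0x3838
05 fffff880`08d927e0 fffff880`0148c0f7 TmPreFlt!TmpQueryFullName+0xb66
06 fffff880`08d928a0 fffff880`0148ea0a fltmgr!FltpPerformPreCallbacks+0x50b
10 fffff880`08d93080 fffff880`014de40e dfsrro!DfsrRopLoadPrefixEntriesFromFile+0x416
1e 00000000`0219f748 00000000`00000000 0x76fac28a
"""


def third_party_modules(stack_text, known=KNOWN_MICROSOFT_MODULES):
    """Return modules outside the known set, in order of first appearance
    (top of stack first)."""
    found = []
    for line in stack_text.splitlines():
        frame = FRAME_RE.match(line.strip())
        if frame is None:
            continue
        symbol = SYMBOL_RE.match(frame.group("symbol"))
        if symbol is None:
            continue  # raw address such as 0x76fac28a
        module = symbol.group("module")
        if module.lower() not in known and module not in found:
            found.append(module)
    return found


if __name__ == "__main__":
    print(third_party_modules(SAMPLE_STACK))   # ['TmXPFlt', 'TmPreFlt']
```

A module showing up this way is only a lead. It still has to be confirmed in the debugger, as was done here by checking the module details and the wait at the top of the stack.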
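We uninstalled Trend Micro on the cluster nodes and applied the patches again, and the problem was resolved.
As it turned out, we later found that Trend Micro's official site already documents the cause of this problem and how to fix it.
One option is to apply the registry change described on Trend Micro's site.
The other is to upgrade the Trend Micro VSAPI scan engine to version 9.5 or later; every VSAPI scan engine earlier than 9.5 has this resource deadlock problem with clusters.
That concludes the analysis of this resource deadlock case. I hope it helps anyone who is interested!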