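I recently worked on a case in which a customer performed a failover in an AlwaysOn environment, after which the availability group on every replica went into the RESOLVING state. In addition, one or more dump files were generated on the replica where the failover was executed.
The troubleshooting details follow.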
Environment
===
SQL Server 2014 SP1 CU3
Primary replica: p1
Secondary replica: p2
Secondary replica: p3
P1 and P2 are in the same subnet.
P3 is in a different subnet.
All replicas use synchronous-commit availability mode.
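From discussions with the customer I learned that failing over between p1 and p2 worked fine, with no failure and no dump. The error occurred only when trying to make p3 the primary replica.
The statement executed was: ALTER AVAILABILITY GROUP groupName FAILOVER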
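To confirm the symptom after a failed failover, the role each replica reports can be read from the AlwaysOn DMVs. The query below is only a sketch for checking the RESOLVING state described above; it was not part of the original case notes.

```sql
-- Shows the role and health reported for each availability replica.
-- On the primary this returns a row per replica; on a secondary it
-- returns only the row for the local replica.
SELECT ag.name                        AS availability_group,
       ar.replica_server_name,
       ars.role_desc,                 -- PRIMARY, SECONDARY, or RESOLVING
       ars.operational_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag
     ON ag.group_id = ars.group_id;
```

Because a secondary only sees its own row, the query may need to be run on each replica in turn when no replica currently holds the primary role.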
The errorlog recorded the following:
2015-12-14 09:57:47.18 spid52 ***Stack Dump being sent to F:\MSSQL12.DBAAGINS1\MSSQL\LOG\SQLDump0001.txt
2015-12-14 09:57:47.18 spid52 * *******************************************************************************
2015-12-14 09:57:47.18 spid52 *
2015-12-14 09:57:47.18 spid52 * BEGIN STACK DUMP:
2015-12-14 09:57:47.18 spid52 * 12/14/15 09:57:47 spid 52
2015-12-14 09:57:47.18 spid52 *
2015-12-14 09:57:47.18 spid52 * Location: HadrFstrVnnUtils.cpp:479
2015-12-14 09:57:47.18 spid52 * Expression: SUCCEEDED (hr)
2015-12-14 09:57:47.18 spid52 * SPID: 52
2015-12-14 09:57:47.18 spid52 * Process ID: 5412
2015-12-14 09:57:47.18 spid52 *
2015-12-14 09:57:47.18 spid52 * Input Buffer 255 bytes -
2015-12-14 09:57:47.18 spid52 * 16 00 00 00 12 00 00 00 02 00 00 00 00 00 00 00 00 00
2015-12-14 09:57:47.18 spid52 * ÿÿ & ç 01 00 00 00 ff ff 0d 00 00 00 00 01 26 04 00 00 00 e7
2015-12-14 09:57:47.18 spid52 * ÿÿ þÿÿÿÿÿÿÿF ff ff 09 04 00 02 00 fe ff ff ff ff ff ff ff 46 00 00
2015-12-14 09:57:47.18 spid52 * @ P 1 n v a r c 00 40 00 50 00 31 00 20 00 6e 00 76 00 61 00 72 00 63
2015-12-14 09:57:47.18 spid52 * h a r ( 8 0 ) , @ 00 68 00 61 00 72 00 28 00 38 00 30 00 29 00 2c 00 40
2015-12-14 09:57:47.18 spid52 * P 2 b i g i n t 00 50 00 32 00 20 00 62 00 69 00 67 00 69 00 6e 00 74
2015-12-14 09:57:47.18 spid52 * , @ P 3 i n t 00 2c 00 40 00 50 00 33 00 20 00 69 00 6e 00 74 00 00
2015-12-14 09:57:47.18 spid52 * çÿÿ þÿÿÿÿ 00 00 00 00 00 e7 ff ff 09 04 00 02 00 fe ff ff ff ff
2015-12-14 09:57:47.18 spid52 * ÿÿÿx e x e c s ff ff ff 78 00 00 00 65 00 78 00 65 00 63 00 20 00 73
2015-12-14 09:57:47.18 spid52 * p _ a v a i l a b 00 70 00 5f 00 61 00 76 00 61 00 69 00 6c 00 61 00 62
2015-12-14 09:57:47.18 spid52 * i l i t y _ g r o 00 69 00 6c 00 69 00 74 00 79 00 5f 00 67 00 72 00 6f
2015-12-14 09:57:47.18 spid52 * u p _ c o m m a n 00 75 00 70 00 5f 00 63 00 6f 00 6d 00 6d 00 61 00 6e
2015-12-14 09:57:47.18 spid52 * d _ i n t e r n a 00 64 00 5f 00 69 00 6e 00 74 00 65 00 72 00 6e 00 61
2015-12-14 09:57:47.18 spid52 * l @ P 1 , 1 , 00 6c 00 20 00 40 00 50 00 31 00 2c 00 20 00 31 00 2c
2015-12-14 09:57:47.18 spid52 * @ P 2 , @ P 3 00 20 00 40 00 50 00 32 00 2c 00 20 00 40 00 50 00 33
2015-12-14 09:57:47.18 spid52 * ç H 8 00 00 00 00 00 00 00 e7 a0 00 09 04 00 02 00 48 00 38
2015-12-14 09:57:47.18 spid52 * e a 6 b e b 5 - 0 00 65 00 61 00 36 00 62 00 65 00 62 00 35 00 2d 00 30
2015-12-14 09:57:47.18 spid52 * d e 3 - 4 f 7 1 - 00 64 00 65 00 33 00 2d 00 34 00 66 00 37 00 31 00 2d
2015-12-14 09:57:47.18 spid52 * 9 0 b 5 - 3 5 d f 00 39 00 30 00 62 00 35 00 2d 00 33 00 35 00 64 00 66
2015-12-14 09:57:47.18 spid52 * d 1 0 3 6 5 c 2 00 64 00 31 00 30 00 33 00 36 00 35 00 63 00 32 00 00
2015-12-14 09:57:47.18 spid52 * & ø ¨ & © 00 26 08 08 f8 06 a8 0d 00 00 00 00 00 00 26 04 04 a9
2015-12-14 09:57:47.18 spid52 * UM 03 55 4d
2015-12-14 09:57:47.18 spid52 *
So I started by analyzing the dump file. The call stack that produced the dump was:
Callstack
===
sqlmin!HadrFstrVnnUtils::GetRsFxEndpointPath+0x7e
sqlmin!HadrFstrVnnUtils::SetClusterResourceProperties+0x153
sqlmin!HadrFstrVnnUtils::RefreshWsfcConfig+0x299
sqlmin!CHadrArProxy::RefreshFilestreamInWsfc+0xff
sqlmin!CHadrArController::RefreshFilestreamInWsfc+0x4f
sqlmin!CFstrSubscriber::Publish+0x138
sqlmin!CHadrPublisher::Publish+0x333
sqlmin!CHadrArProxy::PublishRoleChangeEvent+0x19d
sqlmin!CHadrArProxy::Signal+0x469
sqlmin!CHadrArController::Online+0x1b5
sqlmin!CHadrArManager::OnlineAg+0x12d
sqlmin!SpAvailabilityGroupCommand+0x2f5
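After testing and further investigation, I finally found the cause:
P1 and P2 both had FILESTREAM and a Windows share configured, but P3 did not.
Explanation:
AlwaysOn and SQL Cluster share a concept called the WSFC store (kept in the registry), which holds shared information. In AlwaysOn, when certain settings on the primary change, those changes are also written to the WSFC store and then synchronized to the secondary replicas.
If the primary replica has FILESTREAM and a Windows share name enabled, that information is stored in the WSFC store (registry) and synchronized to all replicas.
When a secondary replica receives the failover command, it reads its local WSFC store. If the store shows that FILESTREAM and the Windows share are not enabled, the failover proceeds normally. If they are enabled, the replica tries to obtain the corresponding Windows share. If the current replica does not have FILESTREAM enabled, or has no Windows share enabled, an exception is raised, the failover fails, and a dump file is generated. This is consistent with the call stack above, where the failure occurs in GetRsFxEndpointPath underneath RefreshFilestreamInWsfc.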
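A quick way to spot this kind of mismatch is to compare the FILESTREAM settings that each instance reports. The following is a minimal sketch to run on every replica and compare by hand; the column aliases are illustrative only.

```sql
-- Run on each replica and compare the results side by side.
-- A level of 0 means FILESTREAM is disabled on that instance.
SELECT @@SERVERNAME                                AS replica,
       SERVERPROPERTY('FilestreamConfiguredLevel') AS configured_level,
       SERVERPROPERTY('FilestreamEffectiveLevel')  AS effective_level,
       SERVERPROPERTY('FilestreamShareName')       AS share_name;
```

In the case described here, p1 and p2 would be expected to report a non-zero level and a share name, while p3 would not.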
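The problem can be reproduced as follows:
Create an availability group with two replicas:
P1 is the primary replica.
P2 is the secondary replica.
Availability mode is synchronous.
Failover mode is manual on both.
P1 is configured as follows:
FILESTREAM is enabled and a Windows share name is set.
If P2 is configured differently from P1, the failover fails.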
The fix is simple:
Keep the configuration consistent across replicas.
If these features are not needed, disable them on all replicas.
Otherwise, enable them on all replicas.
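As a rough illustration of aligning the instance-level setting, the FILESTREAM access level can be changed with sp_configure. This is a sketch under assumptions: it covers only the instance access level, while the Windows-level FILESTREAM switch and the share name are set in SQL Server Configuration Manager, and a service restart may be needed for the effective level to change.

```sql
-- Run on every replica so that all of them end up with the same value.
-- 0 = disabled, 1 = Transact-SQL access, 2 = Transact-SQL and Win32 streaming access
EXEC sp_configure 'filestream access level', 0;   -- use the same level everywhere
RECONFIGURE;
```

Whichever value is chosen, the point is that every replica uses the same one, along with matching Windows share settings.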