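This afternoon's outage began at around 14:09. The first symptom was 502 errors when accessing the blog admin backend.
When the outage occurred, the first error log entry from the blog admin backend showed SqlClient failing to connect to the SQL Server database (we use an Alibaba Cloud RDS SQL Server instance):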
2020-12-03T14:09:48 ERR [Path:/healthz]/[Action:]/[Version:]
Health check "blogdb" completed after 0.3522ms with status Unhealthy and description 'null'
Microsoft.Data.SqlClient.SqlException (0x80131904): Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. This failure occurred while attempting to connect to the Principle server. The duration spent while attempting to connect to this server was - [Pre-Login] initialization=20025; handshake=3;
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
When the outage occurred, the first error log entry from the blog site showed SqlClient failing to resolve the database server name:
2020-12-03 14:12:46.729 [Error] An exception occurred while iterating over the results of a query for context type 'BlogServer.Infrastructure.Data.EfUnitOfWork'.
Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout, Boolean isInfiniteTimeout, String cachedFQDN, SQLDNSInfo& pendingDNSInfo)
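The "Name or service not known" error in the trace above is a DNS resolution failure. One way to check from inside a pod whether lookups are failing intermittently is a small repeated probe. The following is only a sketch under assumptions: the function name `probe_dns` and the hostname in the usage line are hypothetical, and nothing here is taken from the actual troubleshooting done during the outage.

```python
import socket
import time


def probe_dns(hostname, attempts=3, delay_seconds=0.5, resolver=socket.getaddrinfo):
    """Resolve `hostname` several times and count successes and failures.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    probe can be exercised without real network access.
    Returns (number_of_successes, list_of_error_messages).
    """
    successes = 0
    failures = []
    for attempt in range(attempts):
        try:
            resolver(hostname, None)
            successes += 1
        except socket.gaierror as exc:
            # gaierror is what Python raises for lookup failures such as
            # "Name or service not known".
            failures.append(str(exc))
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return successes, failures


if __name__ == "__main__":
    # Hypothetical hostname; substitute the real database server name.
    ok, errors = probe_dns("db.example.internal", attempts=5)
    print(f"resolved {ok} time(s), failed {len(errors)} time(s)")
    for message in errors:
        print("  ", message)
```

If such a probe fails on some pods and succeeds on others, that would point at per-node or per-pod DNS trouble rather than at the database itself.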
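After that, the blog admin backend kept returning 502, while the blog site became slow and frequently returned 500 errors.
We then tried everything we could, but we could not get the blog site fully back to normal. Once it had partially recovered, we found that requests were sometimes very fast and sometimes very slow, depending on which pod a request landed on. We later added more servers to the k8s cluster, scaled out to more pods, and then forcibly took the longest-running pods out of service one by one. Only then did things ease somewhat, but real recovery came only after the traffic peak had passed.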
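The last step, retiring the longest-running pods one at a time, comes down to ordering pods by start time and working through the oldest ones first. Here is a minimal sketch of that ordering, assuming the pod list is already available as plain data. The pod names, timestamps and the helper `oldest_pods_first` are hypothetical, and this is not the procedure or tooling we actually used. In a real cluster the start time could be read from each pod's `status.startTime` field.

```python
from datetime import datetime, timezone


def oldest_pods_first(pods, batch_size):
    """Return the names of the `batch_size` longest-running pods, oldest first."""
    ordered = sorted(pods, key=lambda pod: pod["start_time"])
    return [pod["name"] for pod in ordered[:batch_size]]


# Hypothetical sample data standing in for a real pod listing.
pods = [
    {"name": "blog-web-a", "start_time": datetime(2020, 12, 1, 8, 0, tzinfo=timezone.utc)},
    {"name": "blog-web-b", "start_time": datetime(2020, 12, 3, 14, 30, tzinfo=timezone.utc)},
    {"name": "blog-web-c", "start_time": datetime(2020, 12, 2, 9, 0, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    # Retire one pod at a time so the remaining pods keep serving traffic.
    for name in oldest_pods_first(pods, batch_size=2):
        print(f"retire next: {name}")
```

Going one pod at a time rather than removing the whole batch at once keeps serving capacity available while the replacement pods start up.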