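Hello everyone. We are very sorry: during the traffic peak yesterday afternoon (December 3), the site saw higher concurrency than before, and under that load a sudden database connection failure brought the entire blog site down. We apologize for the serious inconvenience this caused you.
Recently we have been working on the AWS partnership project, speeding up product improvements, unifying the UI across the whole site, and dealing with the various problems that appear under high concurrency, all at the same time. The site is at a critical stage of its development, and we are doing everything we can to meet these challenges and move into its next phase. Thank you for your support, and please bear with us for the trouble caused during this period.
The outage began at around 14:09 on December 3. The first symptom was 502 errors when accessing the blog admin backend.
When the outage started, the first error in the blog admin backend's log was SqlClient failing to connect to the SQL Server database (we use an Alibaba Cloud RDS SQL Server instance):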
2020-12-03T14:09:48 ERR [Path:/healthz]/[Action:]/[Version:]
Health check "blogdb" completed after 0.3522ms with status Unhealthy and description 'null'
Microsoft.Data.SqlClient.SqlException (0x80131904): Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. This failure occurred while attempting to connect to the Principle server. The duration spent while attempting to connect to this server was - [Pre-Login] initialization=20025; handshake=3;
---> System.ComponentModel.Win32Exception (258): Unknown error 258
at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
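Three minutes later the blog site itself started failing, with requests intermittently returning 500 errors.
When that started, the first error in the blog site's log was SqlClient failing to resolve the database server's host name: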
2020-12-03 14:12:46.729 [Error] An exception occurred while iterating over the results of a query for context type '"BlogServer.Infrastructure.Data.EfUnitOfWork"'."
""Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 35 - An internal exception was caught)
---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
at System.Net.Dns.GetHostEntryOrAddressesCore(String hostName, Boolean justAddresses)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Connect(String serverName, Int32 port, TimeSpan timeout, Boolean isInfiniteTimeout, String cachedFQDN, SQLDNSInfo& pendingDNSInfo)
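From then on the blog admin backend returned 502 continuously, while the blog site was slow and frequently returned 500 errors.
While handling the outage we performed a primary/standby failover of the database server, after which the blog admin backend returned to normal. The blog site, still under high-concurrency load, would not recover, and the number of database connections spiked after the failover.
We then tried everything we could think of and still could not bring the blog site fully back to normal. Once it had partly recovered, we noticed that requests were sometimes very fast and sometimes very slow, depending on which pod served them. We added more servers to the k8s cluster, scaled out to more pods, and then forcibly stopped the longest-running batch of pods one by one. That eased the problem somewhat, but the site only truly recovered after the traffic peak had passed.
We are publishing this post first to give you a rough account of the outage. The cause still needs further investigation and analysis on our side. Once again, we apologize for the trouble this outage caused you.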
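The step of stopping the longest-running pods first could be sketched roughly as below. This is only an illustration under assumptions, not our actual tooling: the `Pod` type, the `oldest_pods` helper, and the sample pod names and start times are all hypothetical, and the sketch only chooses which pods to stop, without calling the cluster.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Pod:
    """Hypothetical record for one pod; not a Kubernetes API type."""
    name: str
    started_at: datetime  # when the pod started running


def oldest_pods(pods: List[Pod], batch_size: int) -> List[Pod]:
    """Return up to batch_size pods, longest-running first."""
    if batch_size <= 0:
        return []
    # Sorting by start time puts the earliest-started pods at the front.
    return sorted(pods, key=lambda p: p.started_at)[:batch_size]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Pod("blog-web-a", datetime(2020, 12, 3, 9, 0)),
        Pod("blog-web-b", datetime(2020, 12, 1, 9, 0)),
        Pod("blog-web-c", datetime(2020, 12, 2, 9, 0)),
    ]
    for pod in oldest_pods(sample, 2):
        print(pod.name)
```

Stopping the selected pods one at a time, rather than all at once, keeps most of the serving capacity online while replacements start.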