Zhang Yanjun: Senior DBA at Ctrip's Technical Assurance Center, with a strong interest in database architecture and in analyzing and troubleshooting difficult problems.

Shou Xiangchen: Senior DBA at Ctrip's Technical Assurance Center, mainly responsible for the operation of Ctrip's Redis and database services. He has extensive hands-on experience in automated operations, process standardization, and monitoring and troubleshooting, and enjoys digging deep into problems to improve the team's operational efficiency.
A Redis instance in our production environment occasionally reported connection failures. There was no obvious pattern in when the errors occurred or which client IPs were affected, and after a while the errors would clear up on their own.

The client-side error looked like this:
```
CRedis.Client.RExceptions.ExcuteCommandException: Unable to Connect redis server: ---> CRedis.Third.Redis.RedisException: Unable to Connect redis server:
   at CRedis.Third.Redis.RedisNativeClient.CreateConnectionError()
   at CRedis.Third.Redis.RedisNativeClient.SendExpectData(Byte[][] cmdWithBinaryArgs)
   at CRedis.Client.Entities.RedisServer.<>c__DisplayClassd`1.b__c(RedisClient client)
   at CRedis.Client.Logic.ClientPool.ExcuteAction[T](Func`2 action, String methodName)
   at CRedis.Client.Logic.ClientPool.Excute[T](Func`2 action, String methodName)
```
Judging from the error message, the client could not connect to Redis. The Redis version was 2.8.19; although somewhat old, it had been running stably.

In production, only this cluster showed these occasional errors. One distinctive feature of this cluster is its large number of client servers, over a hundred of them.

From the error message, the client was unable to reach the server, which can have several common causes.

The monitoring deployed on the server side showed that at the moment of the problem the connection count spiked abruptly, jumping from 3,500 connections to 4,100, as shown in the figure below.

At the same time, the server-side statistics showed the Redis server dropping packets: 345539 - 344683 = 856 packets.
```
Sat Apr 7 10:41:40 CST 2018
    1699 outgoing packets dropped
    92 dropped because of missing route
    344683 SYNs to LISTEN sockets dropped
    344683 times the listen queue of a socket overflowed

Sat Apr 7 10:41:41 CST 2018
    1699 outgoing packets dropped
    92 dropped because of missing route
    345539 SYNs to LISTEN sockets dropped
    345539 times the listen queue of a socket overflowed
```
This essentially confirmed the cause of the client errors: connections were being opened so fast that the server's backlog queue overflowed, and the connections were reset by the server.
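The two counters above come from the kernel's TcpExt statistics, which netstat -s reads from /proc/net/netstat. As a side note, here is a minimal sketch of how these overflow counters could be polled programmatically instead of diffing netstat -s output by hand; it assumes the standard Linux /proc/net/netstat layout (a TcpExt header line followed by a TcpExt value line) and is purely illustrative:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    char names[4096], values[4096], line[4096];
    FILE *fp = fopen("/proc/net/netstat", "r");
    if (!fp) return 1;

    /* Grab the TcpExt header line and the matching value line. */
    names[0] = values[0] = '\0';
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "TcpExt:", 7) != 0) continue;
        if (names[0] == '\0') {
            snprintf(names, sizeof(names), "%s", line);
        } else {
            snprintf(values, sizeof(values), "%s", line);
            break;
        }
    }
    fclose(fp);

    /* The header names the fields; the value line holds the numbers in
     * the same order, so walk the two in lock-step. ListenOverflows and
     * ListenDrops are what netstat -s prints as the two dropped-SYN
     * lines shown above. */
    char *nsave, *vsave;
    char *ntok = strtok_r(names, " \n", &nsave);
    char *vtok = strtok_r(values, " \n", &vsave);
    while (ntok && vtok) {
        if (strcmp(ntok, "ListenOverflows") == 0 ||
            strcmp(ntok, "ListenDrops") == 0)
            printf("%s = %s\n", ntok, vtok);
        ntok = strtok_r(NULL, " \n", &nsave);
        vtok = strtok_r(NULL, " \n", &vsave);
    }
    return 0;
}
```

Polling these two values periodically and alerting on a rising delta makes this kind of overflow easy to spot as it happens.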
This is a very common type of TCP error in high-concurrency, short-connection services. A normal TCP connection setup proceeds as follows:

1. The client sends a SYN to the server.
2. The server replies to the client with a SYN,ACK.
3. The client replies with an ACK.

The three-way handshake is now complete. From the client's point of view the connection has been established and it can start sending data packets to the server, but at this point the server is not necessarily ready, as shown in the figure below:
In the TCP implementation derived from the BSD kernels, the server side uses two queues during connection setup: a SYN queue and an accept queue. The former is the half-open (half-connection) queue; a connection is added to it when the server receives the client's SYN. (A common network attack is to keep sending SYNs without ever sending the final ACK, flooding the server's half-open queue until it can no longer serve requests.) The latter is the full-connection queue. After the server returns SYN,ACK and then receives the client's ACK (at which point the client considers the connection established and may start sending PSH packets), the server moves the connection from the SYN queue to the accept queue, provided the accept queue is not full. If the accept queue is full, the server's behavior depends on configuration: with tcp_abort_on_overflow set to 0 (the default), the server simply drops the PSH packet sent by the client; the client then enters retransmission, and after a while the server resends the SYN,ACK and connection setup restarts from step 2. With tcp_abort_on_overflow set to 1, the server sends a reset as soon as it finds the accept queue full.
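To make the role of the two queues concrete, here is a minimal sketch of the server side in C. It is illustrative only, but Redis does essentially the same thing, passing its tcp-backlog setting (511 by default) to listen(); on Linux the effective backlog is further capped by net.core.somaxconn. Error handling is omitted for brevity.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(6380);            /* arbitrary port for the sketch */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* The second argument sizes the accept (full-connection) queue.
     * Handshakes that complete wait here until accept() picks them up;
     * once the queue is full, further connections are dropped or reset
     * depending on tcp_abort_on_overflow, as described above. */
    listen(fd, 511);

    for (;;) {
        int client = accept(fd, NULL, NULL);   /* drains the accept queue */
        if (client >= 0) close(client);
    }
}
```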
Searching the capture in Wireshark revealed more than 2,000 connection attempts to the Redis server within a single second. We tried increasing the TCP backlog from 511 to 2048, but the problem remained, so clearly this kind of minor tuning was not going to fix it for good.

We then used Wireshark to pinpoint exactly when, and why, the network became congested. Since we already knew the precise time of the errors, we first used editcap to cut the huge capture into 30-second slices and then used Wireshark's I/O graph with a 100 ms interval to locate the exact moment of the blockage:

The graph clearly shows a block in the back-and-forth flow of TCP packets.

A detailed look at the packets just before and after this block shows the following traffic:
| Time | Source | Dest | Description |
|---|---|---|---|
| 12:01:54.6536050 | Redis-Server | Clients | TCP:Flags=…AP… |
| 12:01:54.6538580 | Redis-Server | Clients | TCP:Flags=…AP… |
| 12:01:54.6539770 | Redis-Server | Clients | TCP:Flags=…AP… |
| 12:01:54.6720580 | Redis-Server | Clients | TCP:Flags=…A..S.. |
| 12:01:54.6727200 | Redis-Server | Clients | TCP:Flags=…A…… |
| 12:01:54.6808480 | Redis-Server | Clients | TCP:Flags=…AP….. |
| 12:01:54.6910840 | Redis-Server | Clients | TCP:Flags=…A…S., |
| 12:01:54.6911950 | Redis-Server | Clients | TCP:Flags=…A…… |
| … | … | … | … |
| 12:01:56.1181350 | Redis-Server | Clients | TCP:Flags=…AP…. |
At 12:01:54.6808480 the Redis server sent a PSH packet to a client, i.e. it returned the result of a query. Every packet after that is connection handling: ACKs, ACK confirmations, and RST resets. The next PSH packet was not sent until 12:01:56.1181350, a gap of 1.4372870 seconds. In other words, during those 1.4372870 seconds the Redis server did nothing except serve that one query and either establish or refuse connections.

The chain of events behind the client errors was now clear: redis-server stalled for 1.43 seconds, the clients' connection pools filled up and they began frantically opening new connections, the server's accept queue overflowed, service was refused outright, and the clients reported errors. We first suspected the client had sent some unusual command, so we needed to see what the last few commands were. We located the first packet before redis-server froze, installed a Redis dissector plugin for Wireshark, and saw that the last few commands were plain GETs with small keys and values, nothing that should take 1.43 seconds to complete. The server had no slow-log entries either, and the investigation was stuck again.

To find out what the Redis server was doing during those 1.43 seconds, we used pstack to capture its call stacks. pstack is essentially a gdb attach, and sampling too frequently hurts Redis throughput, so we simply looped it every 0.5 seconds. When redis-server froze, we caught the following stacks (irrelevant frames filtered out):
```
Thu May 31 11:29:18 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000000048cec4 in ?? ()
#1  0x00000000004914a4 in je_arena_ralloc ()
#2  0x00000000004836a1 in je_realloc ()
#3  0x0000000000422cc5 in zrealloc ()
#4  0x00000000004213d7 in sdsRemoveFreeSpace ()
#5  0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#6  0x00000000004205de in clientsCron ()
#7  0x0000000000420784 in serverCron ()
#8  0x0000000000418542 in aeProcessEvents ()
#9  0x000000000041873b in aeMain ()
#10 0x0000000000420fce in main ()

Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x0000003729ee5407 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a4e in je_pages_purge ()
#2  0x000000000048cf70 in ?? ()
#3  0x00000000004914a4 in je_arena_ralloc ()
#4  0x00000000004836a1 in je_realloc ()
#5  0x0000000000422cc5 in zrealloc ()
#6  0x00000000004213d7 in sdsRemoveFreeSpace ()
#7  0x000000000041ef3c in clientsCronResizeQueryBuffer ()
#8  0x00000000004205de in clientsCron ()
#9  0x0000000000420784 in serverCron ()
#10 0x0000000000418542 in aeProcessEvents ()
#11 0x000000000041873b in aeMain ()
#12 0x0000000000420fce in main ()

Thu May 31 11:29:19 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000000048108c in je_malloc_usable_size ()
#1  0x0000000000422be6 in zmalloc ()
#2  0x00000000004220bc in sdsnewlen ()
#3  0x000000000042c409 in createStringObject ()
#4  0x000000000042918e in processMultibulkBuffer ()
#5  0x0000000000429662 in processInputBuffer ()
#6  0x0000000000429762 in readQueryFromClient ()
#7  0x000000000041847c in aeProcessEvents ()
#8  0x000000000041873b in aeMain ()
#9  0x0000000000420fce in main ()

Thu May 31 11:29:20 CST 2018
Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)):
#0  0x000000372a60e7cd in write () from /lib64/libpthread.so.0
#1  0x0000000000428833 in sendReplyToClient ()
#2  0x0000000000418435 in aeProcessEvents ()
#3  0x000000000041873b in aeMain ()
#4  0x0000000000420fce in main ()
```
After repeating the capture several times, a suspicious frame kept showing up in the stacks: clientsCronResizeQueryBuffer, called under serverCron(). This is redis-server's own periodic housekeeping job, not part of handling a user command, which explains why the stall produced no slow queries.

We checked the Redis source to confirm what redis-server was actually doing:
clientsCron (server.c):

```c
#define CLIENTS_CRON_MIN_ITERATIONS 5
void clientsCron(void) {
    /* Make sure to process at least numclients/server.hz of clients
     * per call. Since this function is called server.hz times per second
     * we are sure that in the worst case we process all the clients in 1
     * second. */
    int numclients = listLength(server.clients);
    int iterations = numclients/server.hz;
    mstime_t now = mstime();

    /* Process at least a few clients while we are at it, even if we need
     * to process less than CLIENTS_CRON_MIN_ITERATIONS to meet our contract
     * of processing each client once per second. */
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS) ?
                     numclients : CLIENTS_CRON_MIN_ITERATIONS;

    while(listLength(server.clients) && iterations--) {
        client *c;
        listNode *head;

        /* Rotate the list, take the current head, process.
         * This way if the client must be removed from the list it's the
         * first element and we don't incur into O(N) computation. */
        listRotate(server.clients);
        head = listFirst(server.clients);
        c = listNodeValue(head);
        /* The following functions do different service checks on the client.
         * The protocol is that they return non-zero if the client was
         * terminated. */
        if (clientsCronHandleTimeout(c,now)) continue;
        if (clientsCronResizeQueryBuffer(c)) continue;
    }
}
```
clientsCron first looks at the current number of clients, which controls how many connections are processed in a single pass. Our production instance had just under 5,000 connections, which works out to roughly 50 connections handled per call.
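To make the arithmetic explicit, here is a small illustrative snippet mirroring the computation above. The numbers are assumptions: the roughly-50-per-call figure would correspond to server.hz = 100 (hz is configurable; with the stock default of 10, the same number of clients would give about 500 per call):

```c
/* Illustrative arithmetic only: how many clients a single clientsCron()
 * call touches. All numbers here are assumptions for the example. */
#include <stdio.h>

#define CLIENTS_CRON_MIN_ITERATIONS 5

int main(void) {
    int numclients = 5000;   /* assumed connection count */
    int hz = 100;            /* assumed server.hz; the stock default is 10 */

    int iterations = numclients / hz;
    if (iterations < CLIENTS_CRON_MIN_ITERATIONS)
        iterations = (numclients < CLIENTS_CRON_MIN_ITERATIONS)
                     ? numclients : CLIENTS_CRON_MIN_ITERATIONS;

    /* serverCron() runs hz times per second, so each client is still
     * visited roughly once per second overall. */
    printf("clients handled per clientsCron() call: %d\n", iterations);
    return 0;
}
```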
clientsCronResizeQueryBuffer (server.c):

```c
/* The client query buffer is an sds.c string that can end with a lot of
 * free space not used, this function reclaims space if needed.
 *
 * The function always returns 0 as it never terminates the client. */
int clientsCronResizeQueryBuffer(client *c) {
    size_t querybuf_size = sdsAllocSize(c->querybuf);
    time_t idletime = server.unixtime - c->lastinteraction;

    /* The query buffer is resized only in two cases:
     * 1) Query buffer > BIG_ARG (defined in server.h as
     *    #define PROTO_MBULK_BIG_ARG (1024*32)) and much larger than the
     *    client's recent peak usage.
     * 2) The client has been idle for more than 2s and the buffer size
     *    is larger than 1k. */
    if (((querybuf_size > PROTO_MBULK_BIG_ARG) &&
         (querybuf_size/(c->querybuf_peak+1)) > 2) ||
        (querybuf_size > 1024 && idletime > 2))
    {
        /* Only resize the query buffer if it is actually wasting space. */
        if (sdsavail(c->querybuf) > 1024) {
            c->querybuf = sdsRemoveFreeSpace(c->querybuf);
        }
    }
    /* Reset the peak again to capture the peak memory usage in the next
     * cycle. */
    c->querybuf_peak = 0;
    return 0;
}
```
If a redisClient's query buffer meets the conditions, it gets resized on the spot. Two kinds of connections qualify: buffers that are genuinely large, well beyond the client's recent peak usage, and buffers on idle connections (idle > 2 seconds); in both cases the free part of the buffer must also exceed 1 KB. So the reason redis-server stalled is that it happened to hit around 50 connections that were either large or idle, each with more than 1 KB of free space, and resized them all in one loop. Since Redis is single-threaded, this blocked all clients. The way to fix it is therefore clear: make the resize happen less often, or make it run faster.
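For intuition, here is a small self-contained sketch of that predicate with hypothetical values. BIG_ARG stands in for the 32 KB constant (PROTO_MBULK_BIG_ARG / REDIS_MBULK_BIG_ARG in the Redis source), and the numbers are made up for illustration:

```c
#include <stdio.h>

#define BIG_ARG (1024 * 32)

/* Mirrors the two conditions plus the free-space check above. */
static int should_resize(size_t querybuf_size, size_t querybuf_peak,
                         long idletime, size_t freespace) {
    int oversized = querybuf_size > BIG_ARG &&
                    querybuf_size / (querybuf_peak + 1) > 2;
    int idle      = querybuf_size > 1024 && idletime > 2;
    return (oversized || idle) && freespace > 1024;
}

int main(void) {
    /* A client that issued one small GET: its buffer was grown to ~32 KB
     * on the first read and is almost entirely free. More than two
     * seconds of idling make it a resize candidate. */
    printf("%d\n", should_resize(32 * 1024, 20, 3, 32 * 1024 - 20));   /* 1 */

    /* A busy client whose peak usage matches its buffer is left alone. */
    printf("%d\n", should_resize(32 * 1024, 30 * 1024, 0, 1500));      /* 0 */
    return 0;
}
```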
Since the problem lies in the query buffer, let's first look at where it gets modified:
createClient (networking.c):

```c
redisClient *createClient(int fd) {
    redisClient *c = zmalloc(sizeof(redisClient));

    /* passing -1 as fd it is possible to create a non connected client.
     * This is useful since all the Redis commands needs to be executed
     * in the context of a client. When commands are executed in other
     * contexts (for instance a Lua script) we need a non connected client. */
    if (fd != -1) {
        anetNonBlock(NULL,fd);
        anetEnableTcpNoDelay(NULL,fd);
        if (server.tcpkeepalive)
            anetKeepAlive(NULL,fd,server.tcpkeepalive);
        if (aeCreateFileEvent(server.el,fd,AE_READABLE,
            readQueryFromClient, c) == AE_ERR)
        {
            close(fd);
            zfree(c);
            return NULL;
        }
    }

    selectDb(c,0);
    c->id = server.next_client_id++;
    c->fd = fd;
    c->name = NULL;
    c->bufpos = 0;
    c->querybuf = sdsempty();          /* initialized to size 0 */
```

readQueryFromClient (networking.c):

```c
void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *c = (redisClient*) privdata;
    int nread, readlen;
    size_t qblen;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);

    server.current_client = c;
    readlen = REDIS_IOBUF_LEN;
    /* If this is a multi bulk request, and we are processing a bulk reply
     * that is large enough, try to maximize the probability that the query
     * buffer contains exactly the SDS string representing the object, even
     * at the risk of requiring more read(2) calls. This way the function
     * processMultiBulkBuffer() can avoid copying buffers to create the
     * Redis Object representing the argument. */
    if (c->reqtype == REDIS_REQ_MULTIBULK && c->multibulklen && c->bulklen != -1
        && c->bulklen >= REDIS_MBULK_BIG_ARG)
    {
        int remaining = (unsigned)(c->bulklen+2)-sdslen(c->querybuf);

        if (remaining < readlen) readlen = remaining;
    }

    qblen = sdslen(c->querybuf);
    if (c->querybuf_peak < qblen) c->querybuf_peak = qblen;
    c->querybuf = sdsMakeRoomFor(c->querybuf, readlen);   /* the buffer gets enlarged here */
```
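The enlargement on that last line is where the 32 KB figure below comes from: readlen is REDIS_IOBUF_LEN (16 KB in Redis 2.8), and sds over-allocates when it grows a string. Here is a simplified sketch of the preallocation rule in sds.c's sdsMakeRoomFor (header bookkeeping and the early-return path omitted; an illustration, not the full function):

```c
#include <stdio.h>

#define SDS_MAX_PREALLOC (1024 * 1024)

/* Simplified: the allocation size sdsMakeRoomFor aims for when a string
 * of length curlen needs addlen more bytes. */
static size_t sds_grow_target(size_t curlen, size_t addlen) {
    size_t newlen = curlen + addlen;
    if (newlen < SDS_MAX_PREALLOC)
        newlen *= 2;                 /* double the requested size... */
    else
        newlen += SDS_MAX_PREALLOC;  /* ...or add 1 MB past the threshold */
    return newlen;
}

int main(void) {
    /* An empty query buffer plus a 16 KB read request ends up as a 32 KB
     * allocation, which is exactly what the resize pass later finds
     * almost entirely unused. */
    printf("%zu\n", sds_grow_target(0, 16 * 1024));   /* 32768 */
    return 0;
}
```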
So after a connection reads its first command, c->querybuf has already been allocated at least 1024*32 bytes. Looking back at the resize logic, the problem is now obvious: every query buffer that has actually been used is at least 1024*32 bytes, yet the cleanup threshold is > 1024. In other words, every previously used connection that goes idle for more than 2 seconds gets resized, only to be reallocated back to 1024*32 the next time a request arrives. This is unnecessary, and in a busy cluster it means memory is constantly being reclaimed and reallocated. So we changed the cleanup condition as follows, which avoids most of the pointless resize operations:
```c
if (((querybuf_size > REDIS_MBULK_BIG_ARG) &&
     (querybuf_size/(c->querybuf_peak+1)) > 2) ||
     (querybuf_size > 1024*32 && idletime > 2))
{
    /* Only resize the query buffer if it is actually wasting space. */
    if (sdsavail(c->querybuf) > 1024*32) {
        c->querybuf = sdsRemoveFreeSpace(c->querybuf);
    }
}
```
The side effect of this change is memory overhead: at 5k connections per instance, 5000 * 1024 * 32 = 160 MB, which is entirely acceptable on servers with hundreds of gigabytes of memory.

After deploying the Redis server built from the modified source, however, the problem reappeared: clients still reported the same kind of error, and server memory still jittered when the errors occurred. The stack traces captured at that point were:
```
Thu Jun 14 21:56:54 CST 2018
#3  0x0000003729ee893d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()

Thu Jun 14 21:56:54 CST 2018
Thread 1 (Thread 0x7f2dc108d720 (LWP 27851)):
#0  0x0000003729ee5400 in madvise () from /lib64/libc.so.6
#1  0x0000000000493a1e in je_pages_purge ()
#2  0x000000000048cf40 in arena_purge ()
#3  0x00000000004a7dad in je_tcache_bin_flush_large ()
#4  0x00000000004a85e9 in je_tcache_event_hard ()
#5  0x000000000042c0b5 in decrRefCount ()
#6  0x000000000042744d in resetClient ()
#7  0x000000000042963b in processInputBuffer ()
#8  0x0000000000429762 in readQueryFromClient ()
#9  0x000000000041847c in aeProcessEvents ()
#10 0x000000000041873b in aeMain ()
#11 0x0000000000420fce in main ()
```
Clearly, the frequent query-buffer resizes had been optimized away, yet the client errors still occurred, and we were stuck again. Could something else be making the query buffer resize slow? We captured pstack output once more, and this time jemalloc caught our attention. Recall Redis's memory allocation scheme: to avoid heavy fragmentation from memory that libc never returns, Redis uses jemalloc as its allocator by default. The stacks for these errors all showed je_pages_purge(), i.e. Redis calling into jemalloc to reclaim dirty pages. Let's look at what jemalloc does there:
arena_purge (arena.c):

```c
static void
arena_purge(arena_t *arena, bool all)
{
    arena_chunk_t *chunk;
    size_t npurgatory;
    if (config_debug) {
        size_t ndirty = 0;

        arena_chunk_dirty_iter(&arena->chunks_dirty, NULL,
            chunks_dirty_iter_cb, (void *)&ndirty);
        assert(ndirty == arena->ndirty);
    }
    assert(arena->ndirty > arena->npurgatory || all);
    assert((arena->nactive >> opt_lg_dirty_mult) < (arena->ndirty -
        arena->npurgatory) || all);

    if (config_stats)
        arena->stats.npurge++;

    npurgatory = arena_compute_npurgatory(arena, all);
    arena->npurgatory += npurgatory;

    while (npurgatory > 0) {
        size_t npurgeable, npurged, nunpurged;

        /* Get next chunk with dirty pages. */
        chunk = arena_chunk_dirty_first(&arena->chunks_dirty);
        if (chunk == NULL) {
            arena->npurgatory -= npurgatory;
            return;
        }
        npurgeable = chunk->ndirty;
        assert(npurgeable != 0);

        if (npurgeable > npurgatory && chunk->nruns_adjac == 0) {
            arena->npurgatory += npurgeable - npurgatory;
            npurgatory = npurgeable;
        }

        arena->npurgatory -= npurgeable;
        npurgatory -= npurgeable;
        npurged = arena_chunk_purge(arena, chunk, all);
        nunpurged = npurgeable - npurged;
        arena->npurgatory += nunpurged;
        npurgatory += nunpurged;
    }
}
```
On every purge, jemalloc walks every chunk that actually needs cleaning and keeps counts as it goes, which is an expensive habit for a latency-sensitive system, so we considered upgrading jemalloc to improve purge performance. Redis 4.0 brought large performance improvements and the ability to reclaim memory on demand via a command (MEMORY PURGE), and we were already preparing to upgrade our fleet. The jemalloc version shipped with Redis 4.0 is 4.1, and starting with jemalloc 4.0 arena_purge() was heavily optimized: the counter calls were removed, much of the branching logic was simplified, a new arena_stash_dirty() merges the previous computation and checks, and a purge_runs_sentinel was added so that dirty blocks are kept on each arena's LRU instead of in the arena tree's dirty-run-containing chunks. This greatly reduces the volume of dirty pages purged, and memory blocks are no longer moved during reclamation. The code is as follows:
arena_purge (arena.c):

```c
static void
arena_purge(arena_t *arena, bool all)
{
    chunk_hooks_t chunk_hooks = chunk_hooks_get(arena);
    size_t npurge, npurgeable, npurged;
    arena_runs_dirty_link_t purge_runs_sentinel;
    extent_node_t purge_chunks_sentinel;

    arena->purging = true;

    /*
     * Calls to arena_dirty_count() are disabled even for debug builds
     * because overhead grows nonlinearly as memory usage increases.
     */
    if (false && config_debug) {
        size_t ndirty = arena_dirty_count(arena);
        assert(ndirty == arena->ndirty);
    }
    assert((arena->nactive >> arena->lg_dirty_mult) < arena->ndirty || all);

    if (config_stats)
        arena->stats.npurge++;

    npurge = arena_compute_npurge(arena, all);
    qr_new(&purge_runs_sentinel, rd_link);
    extent_node_dirty_linkage_init(&purge_chunks_sentinel);

    npurgeable = arena_stash_dirty(arena, &chunk_hooks, all, npurge,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurgeable >= npurge);
    npurged = arena_purge_stashed(arena, &chunk_hooks,
        &purge_runs_sentinel, &purge_chunks_sentinel);
    assert(npurged == npurgeable);
    arena_unstash_purged(arena, &chunk_hooks, &purge_runs_sentinel,
        &purge_chunks_sentinel);

    arena->purging = false;
}
```
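In both the old and the new code, the actual release of dirty pages ultimately happens through madvise(), which is exactly the frame at the top of the pstack samples above. Here is a minimal, purely illustrative sketch of what that purge boils down to (jemalloc itself chooses between MADV_DONTNEED and MADV_FREE depending on platform and version, and applies it per dirty run):

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* After this call the kernel may reclaim the physical pages backing
 * [addr, addr+length); the next write faults in fresh zeroed pages.
 * Doing this for many small runs inside Redis's single-threaded event
 * loop is what stalls redis-server. */
static void purge_pages(void *addr, size_t length) {
    madvise(addr, length, MADV_DONTNEED);
}

int main(void) {
    size_t len = 64 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    memset(p, 1, len);       /* dirty the pages */
    purge_pages(p, len);     /* hand them back to the kernel */
    munmap(p, len);
    return 0;
}
```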
In practice we had several options: replace jemalloc with Google's tcmalloc, upgrade jemalloc itself, and so on. Based on the analysis above, we chose to upgrade jemalloc, which in practice meant upgrading Redis. After upgrading Redis to 4.0.9 and observing the cluster, this thorny client connection timeout problem was finally gone.

Redis is widely used in production because of its support for high concurrency, its fast response times, and its ease of operation. For operations engineers, those response-time expectations bring all kinds of problems, and connection timeouts are a fairly typical one. From discovering the issue (client connection timeouts), to locating it by capturing network packets between client and server and by sampling memory and call stacks, we were misled by several false leads along the way, and in the end solved it by upgrading the jemalloc version (by upgrading Redis). What is most worth summarizing and borrowing from this case is the overall line of analysis.