API response latency fluctuations caused by Redis BGSAVE (an in-depth look at Linux's fork() mechanism)

Date: 2019-11-17


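Recently, the P99 response latency of one of our production APIs was fluctuating heavily, so we optimized it. The response latency line chart looked like this:

[Figure: line chart of the API's P99 response latency]

After the optimization was finished at around 11:00 on December 11, the P99 leveled off at an average of roughly 70 ms.

The optimization process is described below.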

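1 Tracing the API's execution path

A request to this API passes through three services before the result is returned to the client. The execution flow is as follows:

[Figure: request flow between the API layer and services 1, 2, and 3]

Following the arrows in the figure, the API layer first calls service 1, which returns its result to the API layer. The API layer then calls service 2, service 2 calls service 3, and the result is returned to the API layer.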

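2 Analysis

We then examined service 1, service 2, and service 3 one by one, mainly watching the following metrics:

Response latency as seen by callers
CPU load
Network jitter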

These metrics all looked fine for service 1 and service 3.

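The latency fluctuation of service 2 closely matched that of the API, so we dug into service 2. It is an I/O-intensive service with an average QPS of around 3K.

Its main I/O operations are:

Reads from a standalone (single-node) Redis
Reads from a Redis cluster
Database reads
Fetches from two HTTP APIs
One call to another service

The Redis cluster responded quickly, averaging around 5 ms including the network round trip. The database took around 10 ms, the HTTP APIs had only occasional slow requests, and the call to the other service showed no problems either.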

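In the end we found that the standalone Redis was taking too long to respond.

[Figure: service 2's calls to the standalone Redis]

As the figure shows, every request that service 2 receives hits this standalone Redis three times, and the three calls add up to nearly 100 ms. So we analyzed this standalone Redis next.

Its CPU usage showed the following fluctuation pattern:

[Figure: CPU usage of the standalone Redis host]

It spiked roughly once a minute.

We immediately realized this was caused by BGSAVE being enabled (it ran about once a minute). Having run into something similar before, we simply turned BGSAVE off and kept watching.

With that, the service stabilized.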

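3 Solution

BGSAVE cannot stay disabled in production forever: if a failure occurs, a large amount of data would be lost.

The plan was as follows:

Re-enable BGSAVE on this machine
Provision a replica server and have it sync data from this machine
Once the sync is complete, disable BGSAVE on the master and enable it on the replica

This way, reads and writes on the master are no longer affected by BGSAVE, while the replica still protects against data loss.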

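4 Why BGSAVE makes the CPU fluctuate

First, a word on how BGSAVE works. Whenever a BGSAVE runs (no matter how it is triggered), Redis first forks a child process. The child writes a snapshot of the database to disk while the parent continues to serve client requests.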
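The fork-then-snapshot pattern can be illustrated with a minimal C sketch. This is not Redis code: the `dataset` buffer, the `snapshot_in_background` function, and the file path are hypothetical stand-ins, and the parent here blocks in `waitpid` for simplicity, whereas a real server would keep serving clients and reap the child later.

```c
#define _POSIX_C_SOURCE 200809L  /* for fork() and waitpid() under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical in-memory "dataset" standing in for a server's keyspace. */
static char dataset[64] = "value-at-fork-time";

/*
 * Fork a child that writes the dataset to `path` while the parent keeps
 * mutating its own copy. Returns 0 if the child saved successfully, -1 otherwise.
 */
int snapshot_in_background(const char *path) {
    assert(path != NULL);

    pid_t pid = fork();
    if (pid == -1) return -1;            /* fork failed, no child exists */

    if (pid == 0) {
        /* Child: sees memory exactly as it was when fork() returned. */
        FILE *fp = fopen(path, "w");
        if (fp == NULL) _exit(1);
        int ok = fputs(dataset, fp) >= 0;
        ok = (fclose(fp) == 0) && ok;
        _exit(ok ? 0 : 1);
    }

    /* Parent: this write lands in the parent's own copy of the page
     * (copy-on-write), so the child never observes it. */
    strcpy(dataset, "value-changed-by-parent");

    /* Simplification: block until the child finishes. */
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling `snapshot_in_background("/tmp/example.rdb")` should leave the fork-time value in the file even though the parent has already changed its in-memory copy, which is the property that lets a snapshot stay consistent while writes continue.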

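So when no BGSAVE is running, the process list looks like this:

[Figure: process list with no BGSAVE running]

During a BGSAVE, it looks like this:

[Figure: process list during a BGSAVE, with the forked child at the top]

The process at the top using 100% CPU is the forked child performing the BGSAVE, and it fully occupies one CPU core (the red box in the figure).

So we can conclude that this CPU fluctuation is normal: every peak is the child process running a BGSAVE.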

5 Why BGSAVE affects API response latency

On fork, the official Redis documentation has this to say:

RDB disadvantages

RDB is NOT good if you need to minimize the chance of data loss in case Redis stops working (for example after a power outage). You can configure different save points where an RDB is produced (for instance after at least five minutes and 100 writes against the data set, but you can have multiple save points). However you'll usually create an RDB snapshot every five minutes or more, so in case of Redis stopping working without a correct shutdown for any reason you should be prepared to lose the latest minutes of data.
RDB needs to fork() often in order to persist on disk using a child process. Fork() can be time consuming if the dataset is big, and may result in Redis to stop serving clients for some millisecond or even for one second if the dataset is very big and the CPU performance not great. AOF also needs to fork() but you can tune how often you want to rewrite your logs without any trade-off on durability.

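This passage lists the disadvantages of RDB; the second point describes the problem that fork can cause.

In short: to persist data to disk, RDB has to fork a child process often. fork() can be time consuming when the dataset is big, and if the dataset is very big and the CPU is not very fast, Redis may stop serving clients for some milliseconds or even for a full second.

The length of this pause depends on the system Redis runs on. On real hardware, VMware virtual machines, or KVM virtual machines, forking takes an additional 10-20 ms for every 1 GB of memory the Redis process uses. On Xen virtual machines, it takes an additional 200-300 ms per 1 GB.

For a heavily loaded Redis, though, even 10-20 ms is a long time. Our Redis used about 10 GB of memory, so we estimated a pause of around 100 ms, the low end of the 100-200 ms that the per-GB figures imply.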
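The estimate above is simple multiplication, sketched here in C. The per-GB constants are the rule-of-thumb figures quoted in the text, not measurements, and the function and macro names are hypothetical.

```c
#include <assert.h>

/* Rule-of-thumb fork cost per GB of process memory, as quoted in the text. */
#define FORK_MS_PER_GB_LOW       10.0   /* real hardware, VMware, KVM: lower bound */
#define FORK_MS_PER_GB_HIGH      20.0   /* real hardware, VMware, KVM: upper bound */
#define FORK_MS_PER_GB_XEN_LOW  200.0   /* Xen: lower bound */
#define FORK_MS_PER_GB_XEN_HIGH 300.0   /* Xen: upper bound */

/* Estimated fork pause in milliseconds for a process using `used_memory_gb`. */
double estimate_fork_pause_ms(double used_memory_gb, double ms_per_gb) {
    assert(used_memory_gb >= 0.0 && ms_per_gb >= 0.0);
    return used_memory_gb * ms_per_gb;
}
```

For the 10 GB instance in this article, `estimate_fork_pause_ms(10.0, FORK_MS_PER_GB_LOW)` gives 100 ms and the upper bound gives 200 ms. The same instance on Xen would land between 2 and 3 seconds under these figures.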

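At this point, the cause of the API latency was clear:

Redis executes commands in a single thread of a single process. If fork() takes too long while creating the child, the server stalls and cannot process requests. Every client with a request in flight to Redis then hangs, and the API responds slowly.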

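6 A closer look at the fork mechanism

With the cause identified, let's look at the Redis source code that performs BGSAVE (the fork part). The comments explain what happens if fork() stalls: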

// Perform a BGSAVE
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;
    long long start;
    // Refuse to start if an AOF rewrite or another RDB save is already running
    if (server.aof_child_pid != -1 || server.rdb_child_pid != -1) return C_ERR;
    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);
    openChildInfoPipe();

    // Record when fork() starts so its duration can be computed
    start = ustime();
    // fork() is called here!
    // If fork() stalls, the parent's else branch below cannot run until it returns,
    // and the child only starts executing once fork() has completed
    if ((childpid = fork()) == 0) {
        // fork() returned 0, so this is the child process
        int retval;
        // Everything below runs in the child
        /* Child */
        closeListeningSockets(0);
        redisSetProcTitle("redis-rdb-bgsave");
        // The child writes the snapshot to disk
        retval = rdbSave(filename, rsi);
        if (retval == C_OK) {
            size_t private_dirty = zmalloc_get_private_dirty(-1);
            if (private_dirty) {
                serverLog(LL_NOTICE, "RDB: %zu MB of memory used by copy-on-write", private_dirty / (1024 * 1024));
            }
            server.child_info_data.cow_size = private_dirty;
            sendChildInfo(CHILD_INFO_TYPE_RDB);
        }
        // The child exits and reports the result to the parent: 0 = success, 1 = failure
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        // Everything below runs in the parent
        /* Parent */
        // Compute how long fork() took
        server.stat_fork_time = ustime() - start;
        server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024 * 1024 * 1024);
        /* GB per second. */
        latencyAddSampleIfNeeded("fork", server.stat_fork_time / 1000);
        if (childpid == -1) { // fork failed: log the error
            closeChildInfoPipe();
            server.lastbgsave_status = C_ERR;
            serverLog(LL_WARNING, "Can't save in background: fork: %s", strerror(errno));
            return C_ERR;
        }
        serverLog(LL_NOTICE, "Background saving started by pid %d", childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_pid = childpid;
        server.rdb_child_type = RDB_CHILD_TYPE_DISK;
        updateDictResizePolicy();
        return C_OK;
    }
    return C_OK;
    /* unreached */
}
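The parent branch above times fork() by taking a timestamp before and after the call. The same measurement can be reproduced outside Redis with a small standalone sketch. The names `now_us` and `measure_fork_us` are hypothetical, and the buffer is only a stand-in for a large dataset; actual numbers depend heavily on the kernel, the hardware, and whether huge pages are in use.

```c
#define _POSIX_C_SOURCE 200809L  /* for fork(), waitpid(), clock_gettime() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic time in microseconds (plays the role of ustime() in the excerpt). */
static long long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/*
 * Allocate and touch `mb` megabytes, then time a single fork() call.
 * Returns the elapsed microseconds as seen by the parent, or -1 on error.
 */
long long measure_fork_us(size_t mb) {
    assert(mb > 0);
    size_t bytes = mb * 1024 * 1024;
    char *heap = malloc(bytes);
    if (heap == NULL) return -1;
    memset(heap, 1, bytes);      /* touch every page so it is actually mapped */

    long long start = now_us();
    pid_t pid = fork();
    if (pid == 0) _exit(0);      /* child: nothing to do, exit immediately */
    long long elapsed = now_us() - start;

    if (pid == -1) {             /* fork failed: no child to reap */
        free(heap);
        return -1;
    }
    waitpid(pid, NULL, 0);       /* reap the child so it does not linger as a zombie */
    free(heap);
    return elapsed;
}
```

Running `measure_fork_us` with increasing sizes (for example 256, 1024, 4096) gives a rough feel for how fork time grows with the amount of mapped memory on a given machine. On a live Redis instance, the duration of the most recent fork is reported as `latest_fork_usec` in the output of `INFO stats`.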

The fork(2) man page describes the return value as follows:

Return Value

On success, the PID of the child process is returned in the parent, and 0 is returned in the child. On failure, -1 is returned in the parent, no child process is created, and errno is set appropriately.

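That is, on success the child's PID is returned to the parent and 0 is returned to the forked child. On failure, -1 is returned to the parent, no child process is created, and errno is set to indicate the error.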
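The three possible return values map onto a three-way branch, sketched below in C. The function name `fork_and_collect` is hypothetical; the calls themselves (`fork`, `waitpid`, `_exit`) are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L  /* for fork() and waitpid() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork once and have the child exit with `child_exit_code` (0..255).
 * Returns the child's exit code as observed by the parent, or -1 on failure.
 */
int fork_and_collect(int child_exit_code) {
    assert(child_exit_code >= 0 && child_exit_code <= 255);

    pid_t pid = fork();

    if (pid == -1) {
        /* Failure: only the parent reaches this branch; no child was created. */
        fprintf(stderr, "fork: %s\n", strerror(errno));
        return -1;
    }
    if (pid == 0) {
        /* Child: fork() returned 0. */
        _exit(child_exit_code);
    }

    /* Parent: fork() returned the child's PID, which is always positive. */
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the Redis excerpt tests `== 0` first and handles `-1` inside the else branch, so a failed fork is still timed and reported by the parent. The sketch checks `-1` first only because that ordering is easier to read in isolation.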

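From this, the execution flow of fork is as follows:

[Figure: flowchart of fork() execution in the parent and the child]

Next, let's look at the notes on fork() in the Linux man page.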

Notes

Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child. Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).

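The first paragraph describes the cost of fork(). In short:

On Linux, fork() is implemented with copy-on-write pages, so the only penalty it incurs is the time and memory needed to duplicate the parent's page tables and to create a unique task structure for the child.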
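To get a feel for why duplicating page tables becomes noticeable on a large instance, the size of the tables can be approximated with a short calculation. The sketch below assumes 4 KiB pages and 8-byte page-table entries (typical for x86-64), counts only the lowest-level entries, and ignores huge pages; all of these are assumptions for illustration, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL  /* assumed: 4 KiB pages */
#define PTE_SIZE_BYTES  8ULL     /* assumed: 8-byte page-table entries (x86-64) */

/*
 * Approximate size in bytes of the lowest-level page-table entries needed
 * to map `resident_bytes` of memory. Upper-level tables are ignored.
 */
uint64_t leaf_page_table_bytes(uint64_t resident_bytes) {
    /* Round up so a partially used page still counts as one entry. */
    uint64_t pages = (resident_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
    return pages * PTE_SIZE_BYTES;
}
```

Under these assumptions a 10 GiB process maps 2,621,440 pages, so `leaf_page_table_bytes(10ULL << 30)` comes to 20 MiB of entries that fork() has to copy before it can return, even though none of the 10 GiB of data itself is copied up front.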