Cluster Snapshot Practice in Nebula Graph, a Distributed Graph Database

Date: 2020-02-10

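1 Overview

1.1 Background

In production, Nebula Graph holds very large volumes of data and handles a high rate of business requests. Human error, hardware faults, and business-processing errors are unavoidable in practice, and some severe errors will leave the cluster unable to run or invalidate its data. When the cluster cannot start or its data is invalid, rebuilding the cluster and re-importing the data is tedious and time-consuming. To address this, Nebula Graph provides the ability to create cluster snapshots.

The snapshot feature requires that a snapshot of the cluster be created in advance at some point in time, so that when a catastrophic problem occurs the cluster can be conveniently restored to a usable state from a historical snapshot.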

1.2 Terminology

This article uses the following terms:

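StorageEngine: the smallest physical storage unit in Nebula Graph. RocksDB and HBase are currently supported; this article covers RocksDB only.
Partition: the smallest logical storage unit in Nebula Graph. One StorageEngine can contain multiple Partitions. A Partition is either a leader or a follower, and Raftex guarantees data consistency between leader and follower.
GraphSpace: each GraphSpace is an independent business graph unit with its own set of tags and edges. A Nebula Graph cluster can contain multiple GraphSpaces.
checkpoint: a snapshot of a StorageEngine at a point in time. A checkpoint can serve as a full backup. Checkpoint files are hard links to the sst files.
snapshot: in this article, a snapshot is a snapshot of the Nebula Graph cluster at a point in time, that is, the set of checkpoints of all StorageEngines in the cluster. A snapshot can be used to restore the cluster to the state it was in when that snapshot was created.
wal: Write-Ahead Logging; Raftex is used to keep leader and follower consistent.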

2 System Architecture

2.1 Overall System Architecture

2.2 Storage System Structure

2.3 Physical File Layout of the Storage System

[bright2star@hp-server storage]$ tree
.
└── nebula
    └── 1
        ├── checkpoints
        │   ├── SNAPSHOT_2019_12_04_10_54_42
        │   │   ├── data
        │   │   │   ├── 000006.sst
        │   │   │   ├── 000008.sst
        │   │   │   ├── CURRENT
        │   │   │   ├── MANIFEST-000007
        │   │   │   └── OPTIONS-000005
        │   │   └── wal
        │   │       ├── 1
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 2
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 3
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 4
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 5
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 6
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 7
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 8
        │   │       │   └── 0000000000000000233.wal
        │   │       └── 9
        │   │           └── 0000000000000000233.wal
        │   └── SNAPSHOT_2019_12_04_10_54_44
        │       ├── data
        │       │   ├── 000006.sst
        │       │   ├── 000008.sst
        │       │   ├── 000009.sst
        │       │   ├── CURRENT
        │       │   ├── MANIFEST-000007
        │       │   └── OPTIONS-000005
        │       └── wal
        │           ├── 1
        │           │   └── 0000000000000000236.wal
        │           ├── 2
        │           │   └── 0000000000000000236.wal
        │           ├── 3
        │           │   └── 0000000000000000236.wal
        │           ├── 4
        │           │   └── 0000000000000000236.wal
        │           ├── 5
        │           │   └── 0000000000000000236.wal
        │           ├── 6
        │           │   └── 0000000000000000236.wal
        │           ├── 7
        │           │   └── 0000000000000000236.wal
        │           ├── 8
        │           │   └── 0000000000000000236.wal
        │           └── 9
        │               └── 0000000000000000236.wal
        ├── data

3 Processing Logic

3.1 Logic Analysis

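Create snapshot is triggered from the client API or the console. The graph server parses the AST of the create snapshot statement and then sends the creation request to the meta server through the meta client. On receiving the request, the meta server first fetches all active hosts and builds the request needed by adminClient. The request is then sent through adminClient to every storage host. On receiving the create request, the storage host iterates over all StorageEngines of the specified space and creates a checkpoint for each, then hard-links the wal of every partition in the StorageEngine. Because a write-blocking request has already been sent to all leader partitions beforehand, the database is read-only while the checkpoints and wal hard links are being created.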
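
The ordering described above (block writes, checkpoint every host, then accept writes again) can be modelled in a few lines. This is a toy in-memory sketch, not Nebula Graph's implementation: `FakeStorageHost`, `blockWrites`, and this `createSnapshot` are hypothetical names, and lifting the write block after the checkpoints is an assumption that the text does not spell out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one storage host; not a real Nebula Graph class.
struct FakeStorageHost {
    bool writeBlocked = false;
    std::vector<std::string> checkpoints;

    void blockWrites(bool on) { writeBlocked = on; }

    // In this model a checkpoint is refused unless writes are blocked,
    // mirroring the read-only state described in the text.
    bool createCheckpoint(const std::string& name) {
        if (!writeBlocked) {
            return false;
        }
        checkpoints.push_back(name);
        return true;
    }
};

// Returns true only if every host created the checkpoint.
bool createSnapshot(std::vector<FakeStorageHost>& hosts, const std::string& name) {
    assert(!name.empty());
    // Step 1: block writes everywhere before touching any data.
    for (auto& host : hosts) {
        host.blockWrites(true);
    }
    // Step 2: checkpoint each host, stopping at the first failure.
    bool ok = true;
    for (auto& host : hosts) {
        if (!host.createCheckpoint(name)) {
            ok = false;
            break;
        }
    }
    // Step 3 (assumed): always lift the write block, even on failure.
    for (auto& host : hosts) {
        host.blockWrites(false);
    }
    return ok;
}
```

Blocking all hosts before the first checkpoint is what makes the per-host checkpoints refer to the same logical point in time.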

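Because the snapshot name is generated automatically from the system timestamp, there is no need to worry about name collisions. If an unnecessary snapshot has been created, it can be deleted with the drop snapshot command.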
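
As an illustration of timestamp-based naming, the sketch below formats a broken-down time into the pattern seen in this article, such as SNAPSHOT_2019_12_04_10_54_44. `snapshotName` is a hypothetical helper, and the format string is inferred from the example names rather than taken from the Nebula Graph source.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: build a snapshot name from a broken-down time.
// The pattern is inferred from names like SNAPSHOT_2019_12_04_10_54_44.
std::string snapshotName(const std::tm& t) {
    char buf[64];
    const std::size_t n =
        std::strftime(buf, sizeof(buf), "SNAPSHOT_%Y_%m_%d_%H_%M_%S", &t);
    assert(n > 0);  // the buffer is large enough for this fixed-width pattern
    return std::string(buf, n);
}
```

A caller would typically pass the result of `std::localtime` or `std::gmtime` for the current time.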

3.2 Create Snapshot

3.3 Create Checkpoint

4 Key Code

4.1 Create Snapshot

folly::Future<Status> AdminClient::createSnapshot(GraphSpaceID spaceId, const std::string& name) {
    // Get the hosts of all storage engines.
    auto allHosts = ActiveHostsMan::getActiveHosts(kv_);
    storage::cpp2::CreateCPRequest req;
    
    // Set the spaceId. Checkpoints are currently taken for all spaces;
    // listing the spaces is already done by the caller.
    req.set_space_id(spaceId);
    
    // Set the snapshot name, generated by the meta server from the timestamp.
    // For example: SNAPSHOT_2019_12_04_10_54_44
    req.set_name(name);
    folly::Promise<Status> pro;
    auto f = pro.getFuture();
    
    // Send the request to all storage engines through getResponse.
    getResponse(allHosts, 0, std::move(req), [] (auto client, auto request) {
        return client->future_createCheckpoint(request);
    }, 0, std::move(pro), 1 /*The snapshot operation only needs to be retried twice*/);
    return f;
}

4.2 Create Checkpoint

ResultCode NebulaStore::createCheckpoint(GraphSpaceID spaceId, const std::string& name) {
    auto spaceRet = space(spaceId);
    if (!ok(spaceRet)) {
        return error(spaceRet);
    }
    auto space = nebula::value(spaceRet);
    
    // Iterate over all StorageEngines that belong to this space.
    for (auto& engine : space->engines_) {
        
        // First create a checkpoint of the StorageEngine.
        auto code = engine->createCheckpoint(name);
        if (code != ResultCode::SUCCEEDED) {
            return code;
        }
        
        // Then hard-link the last wal of every partition in this StorageEngine.
        auto parts = engine->allParts();
        for (auto& part : parts) {
            auto ret = this->part(spaceId, part);
            if (!ok(ret)) {
                LOG(ERROR) << "Part not found. space : " << spaceId << " Part : " << part;
                return error(ret);
            }
            auto walPath = folly::stringPrintf("%s/checkpoints/%s/wal/%d",
                                                      engine->getDataRoot(), name.c_str(), part);
            auto p = nebula::value(ret);
            if (!p->linkCurrentWAL(walPath.data())) {
                return ResultCode::ERR_CHECKPOINT_ERROR;
            }
        }
    }
    return ResultCode::SUCCEEDED;
}

5 User Guide

5.1 CREATE SNAPSHOT

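CREATE SNAPSHOT creates a snapshot of the entire cluster at the current point in time. The snapshot name is built from the meta server's timestamp.

Creation may fail partway. The current version does not automatically garbage-collect the files left by a failed creation. A cluster checker is planned for metaServer; it will check the cluster state in an asynchronous thread and automatically reclaim the garbage files left by failed snapshot creation.

In the current version, if snapshot creation fails, the invalid snapshot must be removed with the DROP SNAPSHOT command.

The current version does not support taking a snapshot of a specified space. Executing CREATE SNAPSHOT creates a snapshot of all spaces in the cluster.

CREATE SNAPSHOT syntax: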

CREATE SNAPSHOT

The following is an example in which the author creates three snapshots:

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 28211/28838 us)

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 22892/23923 us)

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 18575/19168 us)

Using the SHOW SNAPSHOTS command described in 5.3, we can list the snapshots that now exist:

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_36 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 3 rows (Time spent: 907/1495 us)

As SNAPSHOT_2019_12_04_10_54_36 above shows, the snapshot name is derived from the timestamp.

5.2 DROP SNAPSHOT

DROP SNAPSHOT deletes the snapshot with the specified name. Snapshot names can be obtained with the SHOW SNAPSHOTS command. DROP SNAPSHOT can delete both valid snapshots and snapshots whose creation failed.

Syntax:

DROP SNAPSHOT name

The author deletes SNAPSHOT_2019_12_04_10_54_36, which was created successfully in 5.1, and then lists the remaining snapshots with the SHOW SNAPSHOTS command.

(user@127.0.0.1) [default_space]> drop snapshot SNAPSHOT_2019_12_04_10_54_36;
Execution succeeded (Time spent: 6188/7348 us)

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 2 rows (Time spent: 1097/1721 us)

5.3 SHOW SNAPSHOTS

SHOW SNAPSHOTS lists all snapshots in the cluster, showing each snapshot's status (VALID or INVALID), its name, and the IP addresses of all storage servers at the time the snapshot was created.

Syntax:

SHOW SNAPSHOTS

A small example follows:

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_36 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 3 rows (Time spent: 907/1495 us)

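6 Notes

After the system topology changes, it is best to create a snapshot immediately, for example after add host, drop host, create space, drop space, or balance.
The current version does not yet let users specify the snapshot path; snapshots are created under the data_path/nebula directory by default.
The current version does not yet provide snapshot restore; users need to write a shell script suited to their production environment. The logic is fairly simple: copy each engineServer's snapshot into a designated folder, set that folder as data_path, and start the cluster.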