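Contents

Ceph architecture overview
NFS overview
Comparison of distributed file systems
CephFS overview
MDS overview
- 5.1 Single active MDS
- 5.2 High availability with a single active MDS
Problems encountered with CephFS
- 6.1 Client cache problem
- 6.2 Server-side cache not released
- 6.3 Client hangs or slow requests
- 6.4 Client loses connection
- 6.5 Failover problems
Solutions to CephFS problems
- 7.1 Server-side cache warning
- 7.2 Client hangs
  - 7.2.1 MDS locking
- 7.3 MDS failover problems
  - 7.3.1 Why does an MDS failover take so long?
  - 7.3.2 MDS failover loops
- 7.4 Client loses connection
Summary and recommended optimizations
Multiple active MDS
- 9.1 Introduction
- 9.2 Advantages of multiple active MDS
- 9.3 Characteristics of multiple active MDS
- 9.4 CephFS Subtree Partitioning
  - 9.4.1 Introduction
- 9.5 Subtree Pinning (static subtree partitioning)
- 9.6 Dynamic load balancing
  - 9.6.1 Introduction
  - 9.6.2 Configurable load balancing
  - 9.6.3 Load balancing policies
  - 9.6.4 Flexible load-balancing control with Lua
  - 9.6.5 Internal structure diagram
Multi-active load balancing: hands-on
- 10.1 Cluster architecture
- 10.2 Adding active MDSs
- 10.3 Load-testing multiple active MDS
- 10.4 Multiple active MDS: dynamic load balancing
- 10.5 Multiple active MDS: static partitioning (multi-tenant isolation)
- 10.6 Multiple active MDS: active/standby mode
Multi-active load balancing: summary
- 11.1 Test report
- 11.2 Conclusions
MDS states
- 12.1 MDS failover flow diagram
- 12.2 MDS states
- 12.3 State Diagram
In-depth study
- 13.1 MDS startup phase analysis
- 13.2 MDS core components
- 13.3 MDSDaemon class diagram
- 13.4 MDSDaemon source code analysis
- 13.5 MDSRank class diagram
- 13.6 MDSRank source code analysis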

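1. Ceph architecture overview

Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.

Its main features are:

High performance
- a. It abandons the traditional centralized metadata lookup scheme and uses the CRUSH algorithm, so data is evenly distributed and access is highly parallel.
- b. It takes failure-domain isolation into account and supports replica placement rules for various workloads, such as cross-datacenter and rack-aware placement.
- c. It scales to thousands of storage nodes and supports data volumes from TB to PB.
High availability
- a. The number of replicas is flexibly configurable.
- b. It supports failure-domain separation and strong data consistency.
- c. It repairs and heals itself automatically in many failure scenarios.
- d. It has no single point of failure and manages itself automatically.
High scalability
- a. Decentralized.
- b. Flexible to expand.
- c. Grows linearly as nodes are added.
Rich features
- a. Three storage interfaces: block storage, file storage, and object storage.
- b. Custom interfaces and drivers for multiple languages.

Use cases:

Block storage (suited to a single client)
- Typical devices: disk arrays, hard disks.
- Use cases:
  - a. Remotely mounted disks for docker containers and virtual machines.
  - b. Log storage.
  - ...
File storage (suited to multiple clients that need a directory structure)
- Typical devices: FTP and NFS servers.
- Use cases:
  - a. Log storage.
  - b. Shared file storage with a directory structure for multiple users.
  - ...
Object storage (suited to data that rarely changes; no directory structure; files cannot be opened or modified in place)
- Typical interfaces: S3, Swift.
- Use cases:
  - a. Image storage.
  - b. Video storage.
  - c. Files.
  - d. Software installation packages.
  - e. Archived data.
  - ...

System architecture:

The Ceph ecosystem can be divided into four parts:

Clients: the clients (users of the data)
mds: Metadata server cluster (caches and synchronizes the distributed metadata)
osd: Object storage cluster (stores data and metadata as objects and performs other key functions)
mon: Cluster monitors (perform monitoring functions)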

image.png

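2. NFS overview

1. NAS (Network Attached Storage)

Network storage transfers data over the standard network protocols NFSv3/NFSv4.
It provides file sharing and data backup for computers running different operating systems, such as Windows, Linux, and Mac OS.
NAS products on the market today are dedicated appliances: they are costly, hard to expand dynamically, and rely on underlying RAID for data availability.
CephFS is one kind of NAS solution; its main advantages are cost, capacity scaling, and high performance.

2. NFS (Network File System)

NFS is the Network File System. With NFS, users and programs can access files on a remote system as if they were local files.
NFS clients and NFS servers communicate through the NFS protocol.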
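Current NFS protocol versions include NFSv3, NFSv4, and NFSv4.1. NFSv3 is stateless and NFSv4 is stateful. NFSv4.1 adds pNFS, whose layout types include the file layout and the block layout. This article mainly uses NFSv4.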

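3. Comparison of distributed file systems

Name	Features	Suitable scenarios	Pros and cons
MFS	1. Single MDS 2. FUSE support 3. Data sharded across nodes 4. Multiple replicas 5. Manual failure recovery	Heavy small-file reads and writes	1. Simple to deploy and operate 2. Single point of failure
Ceph	1. Multiple MDSs, scalable 2. FUSE support 3. Data sharded with CRUSH 4. Multiple replicas / erasure coding 5. Automatic failure recovery	Unified small-file storage	1. Simple to deploy and operate 2. Self-healing 3. MDS locking problems 4. Jewel has many pitfalls; Luminous is ready for production
GlusterFS	1. No metadata node 2. FUSE support 3. Data sharded across nodes 4. Mirroring 5. Automatic failure recovery	Large files	1. Simple to deploy and operate 2. No metadata server to manage 3. Adds computation load on clients
Lustre	1. Two MDSs backing each other up, not scalable 2. FUSE support 3. Data sharded across nodes 4. No redundancy 5. Automatic failure recovery	Large-file reads and writes	1. Complex to deploy and operate 2. Very large system 3. Fairly mature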

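4. CephFS overview

image.png

Notes:

CephFS is a POSIX-compatible file system.
File directories and other metadata are stored in RADOS.
The MDS caches metadata and directory information.
Core components: MDS, Clients, RADOS.
- Client <–> MDS
  Metadata operations and capabilities.
- Client <–> OSD
  Data IO.
- MDS <–> OSD
  Metadata IO.
Mount methods:
- ceph-fuse ... .
- mount -t ceph ... .
Scalability
- Clients read and write the OSDs directly.
Shared file system
- Multiple clients can read and write at the same time.
High availability
- MDS active/standby mode (Active/Standby MDSs).
File/directory layouts
- File and directory layouts can be configured to use different pools.
POSIX ACLs
- Supported by default in the CephFS kernel client.
- Configurable in the CephFS FUSE client.
NFS-Ganesha
- An NFS server supporting NFSv3, v4, and v4.1.
- Runs in user space on most Linux distributions and also supports the 9P2000.L protocol.
- Ganesha supports a CephFS FSAL (File System Abstraction Layer) through the libcephfs library, so CephFS can be re-exported over NFS.
Client quotas
- The CephFS FUSE client supports quotas on any directory.
Load balancing
- Dynamic load balancing.
- Static load balancing.
- Hash-based load balancing.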

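5. MDS overview

5.1 Single active MDS

image.png

Notes:

MDS stands for Ceph Metadata Server; it is the metadata service that CephFS depends on.
It keeps an in-memory metadata cache to speed up metadata access.
It stores the file system metadata (directory objects hold the names and inode numbers of their subdirectories and files).
It stores the CephFS journal, which is used to recover the metadata cache in the MDS.
When an MDS restarts, it replays the journal to reload the previously cached metadata from the OSDs.
Only one active MDS serves requests.
All client requests land on that single active MDS.

5.2 High availability with a single active MDS

image.png

Notes:

One active MDS serves requests, backed by several standby MDSs.
If the active MDS fails, a standby MDS takes over immediately, keeping the cluster highly available.
Standby MDS
- Cold standby: the standby MDS only serves as a backup process and does not keep a copy of the LRU metadata cache. The active and standby processes exchange heartbeats; once the active MDS fails, the standby replays the metadata into its cache, which takes some time.
- Hot standby: besides backing up the process, the standby keeps its metadata cache continuously in sync with the active MDS. When the active MDS fails, the hot standby becomes active directly without a replay, and its metadata cache is the same size as that of the former active MDS.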

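6. Problems encountered with CephFS

6.1 Client cache problem

Message: Client name failing to respond to cache pressure

Explanation: Each client has its own metadata cache, and entries in a client's cache (such as inodes) are also held in the MDS cache. When the MDS needs to shrink its cache (to stay below mds_cache_size), it also sends messages asking clients to shrink theirs. If a client takes longer than mds_recall_state_timeout (60 s by default) to respond, this message appears.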
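
The warning condition described above amounts to a timeout check per client session. The following is a minimal illustrative model in Python, not MDS code: the ClientSession class, its fields, and the function name are hypothetical, and only the 60 s default for mds_recall_state_timeout comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

MDS_RECALL_STATE_TIMEOUT = 60.0  # seconds; default quoted in the text


@dataclass
class ClientSession:
    """Hypothetical, simplified view of one client session on the MDS."""
    name: str
    # Time at which the MDS asked the client to trim its cache;
    # None when no recall is pending.
    recalled_at: Optional[float]


def clients_failing_cache_pressure(sessions, now, timeout=MDS_RECALL_STATE_TIMEOUT):
    """Return the names of clients whose pending recall is older than the timeout."""
    return [
        s.name
        for s in sessions
        if s.recalled_at is not None and now - s.recalled_at > timeout
    ]


sessions = [
    ClientSession("client.a", recalled_at=100.0),  # asked 70 s ago, still pending
    ClientSession("client.b", recalled_at=150.0),  # asked 20 s ago
    ClientSession("client.c", recalled_at=None),   # nothing pending
]
late = clients_failing_cache_pressure(sessions, now=170.0)
print(late)
```

In this model a client is only reported while a recall is still pending, so a recall timestamp that is never cleared would keep producing the warning.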

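6.2 Server-side cache not released

If a client is unresponsive or buggy, it can prevent the MDS from keeping its cache below mds_cache_size, and the MDS may eventually run out of memory and crash.

6.3 Client hangs or slow requests

Clients traverse the tree searching for files (not controllable).
Sessions hold too many inodes, which overloads the MDS.
The log level is set too verbose, which raises MDS load.
MDS locking problems cause clients to hang.
MDS performance is limited because only one MDS is active.

6.4 Client loses connection

A client becomes unavailable because of network problems or other issues.

6.5 Failover problems

Failover takes a long time.
Failover enters an election loop.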

7. Solutions to CephFS problems

7.1 Server-side cache warning

Fixed in v12 Luminous:
github.com/ceph/ceph/c…
mds: fix false "failing to respond to cache pressure" warning

MDS has cache pressure, sends recall state messages to clients
Client does not trim as many caps as MDS expected. So MDS
does not reset session->recalled_at
MDS no longer has cache pressure, it stop sending recall state
messages to clients.
Client does not release its caps. So session->recalled_at in
MDS keeps unchanged

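7.2 Client hangs

7.2.1 MDS locking

7.2.1.1 Scenario simulation

User A opens a file read-only and does not close the file handle, then unexpectedly goes offline or loses power. User B's reads and writes to that file then hang.

Read and write test programs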

// read.c: open test.log read-only every 5 s and never close the descriptor,
// so this client keeps holding the file open.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* sleep */

int main(void)
{
    const char *filename = "test.log";
    for (;;)
    {
        /* Deliberately not closed: the scenario needs an open handle. */
        int fd = open(filename, O_RDONLY);
        printf("fd=[%d]", fd);
        fflush(stdout);
        sleep(5);
    }
}
 
// write.c: append a line to test.log every 5 s.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* write, close, sleep */

int main(void)
{
    const char *filename = "test.log";
    for (;;)
    {
        int fd = open(filename, O_CREAT | O_WRONLY | O_APPEND, S_IRUSR | S_IWUSR);
        if (fd >= 0)
        {
            write(fd, "aaaa\n", 5);   /* 5 bytes: do not write the terminating NUL */
            close(fd);
        }
        printf("fd=[%d] buffer=[%s]", fd, "aaaa");
        fflush(stdout);
        sleep(5);
    }
}

User A runs read and user B runs write.

a. For user A, run kill -9 on the ceph-fuse pid at 19:55:39.
b. Observe what happens to users A and B:

image.png
c. Check the MDS log:

2018-12-13 19:56:11.222816 7fffee6d0700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.670943 secs
2018-12-13 19:56:11.222826 7fffee6d0700  0 log_channel(cluster) log [WRN] : slow request 30.670943 seconds old, received at 2018-12-13 19:55:40.551820: client_request(client.22614489:538 lookup #0x1/test.log 2018-12-13 19:55:40.551681 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting
2018-12-13 19:56:13.782378 7ffff0ed5700  1 mds.ceph-xxx-osd02.ys Updating MDS map to version 229049 from mon.0
2018-12-13 19:56:33.782572 7ffff0ed5700  1 mds.ceph-xxx-osd02.ys Updating MDS map to version 229050 from mon.0
2018-12-13 20:00:26.226405 7fffee6d0700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph-xxx-osd01.ys (22532339), after 303.489228 seconds

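Summary:

After the kill, user A is unavailable.
At the same time user B is also unavailable and only recovers after about 300 s.
The MDS log shows that the client with the blocked slow request, client.22614489, is user B.
The MDS log shows that the request is blocked waiting for a read lock (currently failed to rdlock, waiting).
The MDS log shows that the unresponsive client, user A, is evicted about 300 s after the hang begins.
Clients are evicted automatically in two cases:
- On an active MDS daemon, if a client has not communicated with the MDS for more than mds_session_autoclose seconds (300 by default), it is evicted automatically. (Clients send a heartbeat to the MDS every 20 s, handled by handle_client_session.)
- During MDS startup (including failover), the MDS passes through a state called reconnect, in which it waits for all clients to connect to the new MDS daemon. Any client that fails to do so within the time window (mds_reconnect_timeout, 45 s by default) is evicted.
Lowering mds session autoclose (default 300 s) releases abnormal sessions sooner, so other clients become usable again sooner.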
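
The two eviction rules above can be written as two small predicates. This is a sketch under stated assumptions: the function names are hypothetical and the logic is a simplification of what the MDS does; only the two default values (300 s and 45 s) come from the text.

```python
MDS_SESSION_AUTOCLOSE = 300.0  # seconds; default quoted in the text
MDS_RECONNECT_TIMEOUT = 45.0   # seconds; default quoted in the text


def should_evict_active(last_seen, now, autoclose=MDS_SESSION_AUTOCLOSE):
    """Rule 1: on an active MDS, evict a client that has been silent longer than autoclose."""
    return now - last_seen > autoclose


def should_evict_reconnect(reconnected_at, window_start, timeout=MDS_RECONNECT_TIMEOUT):
    """Rule 2: during up:reconnect, evict a client that did not reconnect within the window.

    reconnected_at is None when the client never reconnected.
    """
    if reconnected_at is None:
        return True
    return reconnected_at - window_start > timeout


# The log above reports the eviction "after 303.489228 seconds" of silence.
print(should_evict_active(last_seen=0.0, now=303.489228))  # True
# With a shorter autoclose, the same silent client would be evicted much earlier.
print(should_evict_active(last_seen=0.0, now=65.0, autoclose=60.0))  # True
```

The second call illustrates the trade-off behind tuning the timeout: a smaller value unblocks other clients sooner, at the price of evicting clients that are only briefly unreachable.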

7.3 MDS failover problems

7.3.1 Why does an MDS failover take so long?

Analyze the log (the rejoin_start and rejoin_joint_start steps take a long time).

2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active
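
The time spent in each rejoin phase can be read off the timestamps in the log excerpt above. A small Python sketch that does the subtraction; the function name phase_durations is an assumption, and the parsing relies only on the line layout visible in the excerpt.

```python
from datetime import datetime

LOG = """\
2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done
"""


def phase_durations(log_text):
    """Return [(from_event, to_event, seconds)] for consecutive log lines."""
    events = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip lines that do not match the expected layout
        stamp = datetime.strptime(fields[0] + " " + fields[1], "%Y-%m-%d %H:%M:%S.%f")
        events.append((stamp, fields[-1]))
    return [
        (a_name, b_name, (b_time - a_time).total_seconds())
        for (a_time, a_name), (b_time, b_name) in zip(events, events[1:])
    ]


durations = phase_durations(LOG)
for start, end, seconds in durations:
    print(f"{start} -> {end}: {seconds:.3f} s")
```

For this excerpt the two phases take about 60 s and 144 s, so the rank spends roughly 204 s in rejoin before it becomes active.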

Tracing the code shows that process_imported_caps timed out; this function mainly opens inodes and loads them into the cache.

image.png

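7.3.2 MDS failover loops

An MDS daemon is expected to send a beacon to the monitors every mds_beacon_interval. If it fails to do so for mds_beacon_grace, the Ceph monitors automatically replace it with a standby MDS. If the MDS holds too many session inodes and is busy, the newly promoted MDS may also fail to send its beacons in time during the failover, which can lead to a failover loop. The usual recommendation is to increase mds_beacon_grace.

mds beacon grace
Description: How long without a beacon before the MDS is considered laggy (and possibly replaced).
Type: Float
Default: 15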
7.4 Client loses connection

client: fix fuse client hang because its pipe to mds is not ok
There is a risk that the client will hang if the fuse client session has been killed by the mds and
the mds daemon restart or hot-standby switch happens right away, but the client
did not receive any message from the monitor, due to network or any other reason,
until the mds became active again. The client then never calls closed_mds_session,
so the session is still STATE_OPEN, but the client can't send any message to
the mds because its pipe is not ok.

So we should create a pipe to the mds to guarantee that any meta request can be sent to
the server. When the mds receives the message, it will send a CLOSE_SESSION to the client
because its session for this client is STATE_CLOSED. After the previous
steps the old session of the client can be closed, a new session and pipe
can be established, and the mountpoint will be ok.

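8. Summary and recommended optimizations

If user A goes offline unexpectedly while reading, user B's operations hang waiting for A to recover; if A does not recover, A is evicted automatically after a timeout. (The lock granularity is coarse, which makes this a serious pitfall.)
Tune mds session autoclose (default 300 s) to evict problematic clients sooner.
- On an active MDS daemon, if a client has not communicated with the MDS for over session_autoclose (a file system variable) seconds (300 seconds by default), then it will be evicted automatically.
Clients are evicted automatically in two cases:
- On an active MDS daemon, if a client has not communicated with the MDS for more than mds_session_autoclose seconds (300 by default), it is evicted automatically. (Clients send a heartbeat to the MDS every 20 s, handled by handle_client_session.)
- During MDS startup (including failover), the MDS passes through a state called reconnect, in which it waits for all clients to connect to the new MDS daemon. Any client that fails to do so within the time window (mds_reconnect_timeout, 45 s by default) is evicted.
If MDS load or memory use is too high, limit MDS memory to reduce resource consumption. mds limiting cache by memory github.com/ceph/ceph/p…
If MDS load or memory use is too high, there is an upstream patch that lets the MDS proactively drop its cache; it is still under review and targets ceph-14.0.0 github.com/ceph/ceph/p…
The MDS uses a single thread for its main processing path, which limits the performance of a single MDS: at most about 8k ops/s, with CPU utilization around 140%.
Apply QoS rate limiting on ceph-fuse clients so that a sudden burst of IO does not make the MDS jitter (limit IOPS on the client side to avoid resource contention and shocks to system resources).
Evicting clients releases inodes but does not reduce memory use; failing over at that point makes the failover faster.
Multiple active MDS is officially declared production-ready in v12 Luminous.
When a file system client is unresponsive or otherwise misbehaves, it is evicted to prevent the misbehaving client from causing data inconsistency.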

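9. Multiple active MDS

9.1 Introduction

Also known as: multi-mds, active-active MDS
By default each CephFS file system is configured with a single active MDS daemon. On large systems you can configure multiple active MDS daemons to scale metadata performance; they share the metadata load.

In Luminous, the CephFS multiple metadata server (Multi-MDS) and directory fragmentation (dirfragment) features are declared ready for production use.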

image.png

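9.2 Advantages of multiple active MDS

When the default single MDS becomes the metadata bottleneck, configuring multiple active MDS daemons improves cluster performance.
Multiple active MDSs improve performance.
Multiple active MDSs enable MDS load balancing.
Multiple active MDSs enable multi-tenant resource isolation.

9.3 Characteristics of multiple active MDS

The file system tree can be split into subtrees.
Each subtree can be handed to a specific MDS, which becomes authoritative for it.
As a result, cluster performance scales linearly as metadata servers are added.
Each subtree is created dynamically, based on how hot the metadata in a given directory tree is.
Once a subtree is created, its metadata is migrated to an underloaded MDS.
Subsequent client requests to the previously authoritative MDS are forwarded.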

image.png

9.4 CephFS Subtree Partitioning

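9.4.1 Introduction

image.png

Notes:
To balance data and metadata load in a file system, the industry generally uses one of several partitioning methods:

Static subtree partitioning
- Data is assigned to a particular server node by hand; when the load becomes unbalanced, an administrator reassigns it manually.
- This suits scenarios where data placement is fixed, but not dynamic scaling or scenarios where failures may occur.
Hash-based partitioning
- The storage location of data is computed with a hash.
- This suits cases where data should be evenly distributed and various failures must be tolerated, but not cases that need fixed data placement or where the environment changes frequently.
Dynamic subtree partitioning
- Subtrees are redistributed across nodes dynamically, based on real-time monitoring of node load.
- This handles many abnormal situations and adapts the data distribution to the load, but migrating a large amount of data is bound to cause service jitter and hurt performance.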
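
The difference between static and hash-based partitioning can be shown with a few lines of Python. This is an illustrative sketch, not CephFS code: the path layout, the assignment table, and both function names are made up for the example, and CRC32 stands in for whatever hash a real system would use.

```python
import zlib


def static_rank(path, assignment, default_rank=0):
    """Static partitioning: an administrator-maintained table maps top-level directories to ranks."""
    top_level = path.strip("/").split("/")[0]
    return assignment.get(top_level, default_rank)


def hash_rank(path, num_ranks):
    """Hash partitioning: the rank is computed from the path, so no table is needed."""
    return zlib.crc32(path.encode("utf-8")) % num_ranks


paths = [f"/tenant{t}/file{i}" for t in range(4) for i in range(25)]
assignment = {"tenant0": 0, "tenant1": 1, "tenant2": 0, "tenant3": 1}

# Static placement only changes when the administrator edits the table.
print(static_rank("/tenant1/file3", assignment))

# Hash placement is recomputed when the number of ranks changes, so many paths move.
moved = sum(hash_rank(p, 2) != hash_rank(p, 3) for p in paths)
print(f"{moved} of {len(paths)} paths change rank when going from 2 to 3 ranks")
```

The second print shows why hash partitioning fits poorly when the environment changes often: adding one server relocates a large share of the paths.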

9.5 Subtree Pinning(static subtree partitioning)

image.png

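Notes:

Pinning binds a directory to an MDS.
Pinning lets different users' directories be served by different MDSs.
It enables MDS load balancing across tenants.
It enables MDS resource isolation between tenants.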

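9.6 Dynamic load balancing

9.6.1 Introduction

Multiple active MDSs can migrate directories to balance the metadata load. The policies for when, where, and how much to migrate are hard-coded into the metadata balancing module.

Mantle is a programmable metadata balancer built into the MDS. The idea is to protect the mechanisms for balancing load (migration, replication, fragmentation) while letting users customize the balancing policies with Lua.

Most of the implementation is in MDBalancer. Metrics are passed to the balancer policies via the Lua stack, and a list of loads is returned to MDBalancer. These loads are "how much to send to each MDS" and are inserted directly into the MDBalancer "my_targets" vector.

The metrics exposed to the Lua policy are the same ones already stored in mds_load_t: auth.meta_load(), all.meta_load(), req_rate, queue_length, cpu_load_avg.

It sits alongside the current balancer implementation and is enabled with a string in "ceph.conf". If the Lua policy fails (for whatever reason), the MDS falls back to the original metadata load balancer.
The balancer is stored in the RADOS metadata pool, and a string in the MDSMap tells the MDSs which balancer to use.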
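
To make the "targets" idea concrete, here is a Python rendering of a greedy-spill style policy: a rank that has load sheds half of it to its next neighbor when that neighbor is idle. Real Mantle policies are written in Lua and run inside the MDS; the function name, the list-based interface, and the 0.01 idle threshold below are assumptions made for this sketch, not the shipped greedyspill.lua.

```python
def greedy_spill(loads, whoami, idle_threshold=0.01):
    """Return a targets list: targets[i] is the load this rank wants to send to rank i.

    loads[i] is the current metadata load of rank i; whoami is this rank's index.
    """
    targets = [0.0] * len(loads)
    neighbor = whoami + 1
    if neighbor >= len(loads):
        return targets  # the last rank has no neighbor to spill to
    if loads[whoami] > idle_threshold and loads[neighbor] < idle_threshold:
        targets[neighbor] = loads[whoami] / 2  # shed half of the load
    return targets


print(greedy_spill([10.0, 0.0], whoami=0))  # [0.0, 5.0]
print(greedy_spill([10.0, 4.0], whoami=0))  # [0.0, 0.0]
```

A policy this simple only reacts when the neighbor is almost completely idle, which is one way a balancer can leave the load visibly uneven.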

This PR does not have the following features from the Supercomputing paper:

Balancing API: all we require is that balancer written in Lua returns a targets table, where each index is the amount of load to send to each MDS
"How much" hook: this lets the user define meta_load()
Instantaneous CPU utilization as metric
Supercomputing '15 Paper: sc15.supercomputing.org/schedule/ev…

9.6.2 Configurable load balancing

image.png

9.6.3 Load balancing policies

image.png

9.6.4 Flexible load-balancing control with Lua

image.png

9.6.5 Internal structure diagram

image.png

Reference:

www.soe.ucsc.edu/sites/defau…

10. Multi-active load balancing: hands-on

10.1 Cluster architecture

mon: ceph-xxx-osd02.ys,ceph-xxx-osd03.ys,ceph-xxx-osd01.ys
mgr: ceph-xxx-osd03.ys(active), standbys: ceph-xxx-osd02.ys
mds: test1_fs-1/1/1 up {0=ceph-xxx-osd02.ys=up:active}, 2 up:standby
osd: 36 osds: 36 up, 36 in
rgw: 1 daemon active

10.2 Adding active MDSs

10.2.1 Set max_mds to 2

$ ceph fs set test1_fs max_mds 2

10.2.2 Check the fs status

$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd02.ys | Reqs:    0 /s | 3760  |   14  |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s |   11  |   13  |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  194M | 88.7T |
|   cephfs_data   |   data   |    0  | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd03.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

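10.2.3 Summary

Each CephFS file system has its own max_mds setting, which controls how many ranks are created.
The actual number of ranks only increases when a spare daemon is available to take the new rank.
Setting max_mds increases the number of active MDSs.
- The newly created rank (1) transitions from the creating state to the active state.
- After creation there are two active MDSs and one standby MDS.

10.3 Load-testing multiple active MDS

10.3.1 Mount the file system on the client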

$ ceph-fuse /mnt/
$ df
ceph-fuse      95330861056     40960 95330820096   1% /mnt

10.3.2 filebench load test

image.png

10.3.3 Check the MDS load of the fs

$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 5624 /s |  139k |  133k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  238M | 88.7T |
|   cephfs_data   |   data   | 2240M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
| ceph-xxx-osd02.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

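10.3.4 Summary

In FUSE mode the MDS delivers 5624 ops/s.
Although there are two active MDSs, all requests currently land on rank 0.
By default the load is not balanced across multiple active MDSs.

10.4 Multiple active MDS: dynamic load balancing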

10.4.1 Put the balancer into RADOS

rados put --pool=cephfs_metadata greedyspill.lua ../src/mds/balancers/greedyspill.lua

10.4.2 Activate Mantle

ceph fs set test1_fs max_mds 2
ceph fs set test1_fs balancer greedyspill.lua

10.4.3 Mount and load test

$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 2132 /s | 4522  | 1783  |
|  1   | active | ceph-xxx-osd02.ys | Reqs:  173 /s |  306  |  251  |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  223M | 88.7T |
|   cephfs_data   |   data   | 27.1M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

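10.4.4 Summary

Lua allows flexible control of the load-balancing policy.
The test results show that the balancing is poor.
Dynamic load balancing currently has deep pitfalls and is not recommended for now.

10.5 Multiple active MDS: static partitioning (multi-tenant isolation)

10.5.1 Bind directories to different MDSs

# Bind mds00 (rank 0) to /mnt/test0
# Bind mds01 (rank 1) to /mnt/test1
# setfattr -n ceph.dir.pin -v <rank> <path>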
 
setfattr -n ceph.dir.pin -v 0 /mnt/test0
setfattr -n ceph.dir.pin -v 1 /mnt/test1
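
A pin applies to the whole subtree below the pinned directory: a directory without its own pin follows its closest pinned ancestor. The Python sketch below models that lookup for the two pins set above. The helper names are hypothetical; pin_directory is the programmatic counterpart of the setfattr call (Linux only, and it needs a mounted CephFS, so it is defined but not executed here).

```python
import os
import posixpath


def pin_directory(path, rank):
    """Programmatic counterpart of `setfattr -n ceph.dir.pin -v <rank> <path>`."""
    os.setxattr(path, "ceph.dir.pin", str(rank).encode("ascii"))


def effective_rank(path, pins):
    """Return the rank of the closest pinned ancestor (or of the path itself), else None."""
    current = posixpath.normpath(path)
    while True:
        if current in pins:
            return pins[current]
        parent = posixpath.dirname(current)
        if parent == current:  # reached the root without finding a pin
            return None
        current = parent


pins = {"/mnt/test0": 0, "/mnt/test1": 1}
print(effective_rank("/mnt/test1/tenant/data", pins))  # 1
print(effective_rank("/mnt/other", pins))              # None: left to the balancer
```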

10.5.2 Load test from two clients

image.png

10.5.3 Observe the fs status (two load generators)

# Check the request load on each MDS
$ ceph fs status
test1_fs - 3 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 3035 /s |  202k |  196k |
|  1   | active | ceph-xxx-osd02.ys | Reqs: 3039 /s | 70.8k | 66.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  374M | 88.7T |
|   cephfs_data   |   data   | 4401M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd01.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

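10.5.4 Conclusions

ceph.dir.pin binds directories to different MDSs and thereby isolates tenants.
Each of the two clients wrote to its own directory under sustained load for 20 minutes.
The two clients reached 3035 ops/s and 3039 ops/s respectively.
The CPU consumption of the two clients was very close.
Both active MDSs now carry request load, so the load of multiple clients is balanced.

10.6 Multiple active MDS: active/standby mode

10.6.1 Check the MDS status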

$ ceph fs status
test1_fs - 4 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd02.ys | Reqs:    0 /s | 75.7k | 72.6k |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  311M | 88.7T |
|   cephfs_data   |   data   | 3322M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd03.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.6.2 Stop mds2 (ceph-xxx-osd02.ys)

$ systemctl stop ceph-mds.target

10.6.3 Check the MDS status

$ ceph fs status
test1_fs - 2 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | replay | ceph-xxx-osd03.ys |               |    0  |    0  |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  311M | 88.7T |
|   cephfs_data   |   data   | 3322M | 88.7T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

10.6.4 Observe under load

# Load-test rank 0: requests land on mds3 (ceph-xxx-osd03.ys) as expected
$ ceph fs status
test1_fs - 4 clients
========
+------+--------+------------------------+---------------+-------+-------+
| Rank | State  |          MDS           |    Activity   |  dns  |  inos |
+------+--------+------------------------+---------------+-------+-------+
|  0   | active | ceph-xxx-osd03.ys | Reqs: 2372 /s | 72.7k | 15.0k |
|  1   | active | ceph-xxx-osd01.ys | Reqs:    0 /s | 67.8k | 64.0k |
+------+--------+------------------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  367M | 88.7T |
|   cephfs_data   |   data   | 2364M | 88.7T |
+-----------------+----------+-------+-------+
+------------------------+
|      Standby MDS       |
+------------------------+
| ceph-xxx-osd02.ys |
+------------------------+
MDS version: didi_dss version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

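10.6.5 Summary

With multiple active MDSs, if an active MDS fails, a standby MDS takes its place.
The new active MDS inherits the static partitioning (pin) relationships.

11. Multi-active load balancing: summary

11.1 Test report

Tool	Cluster mode	Clients (load generators)	Performance
filebench	1MDS	2 clients	5624 ops/s
filebench	2MDS	2 clients	Client 1: 3035 ops/s Client 2: 3039 ops/s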
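
The figures in the table above can be turned into an aggregate throughput and a speedup with a little arithmetic. The Python sketch below uses only the three measured values; the variable names and the "efficiency" metric (speedup divided by the number of active MDSs) are choices made for this example.

```python
single_mds_ops = 5624.0          # 1 MDS, 2 clients
pinned_ops = [3035.0, 3039.0]    # 2 MDS, one pinned client each

aggregate = sum(pinned_ops)               # total ops/s with two active MDSs
speedup = aggregate / single_mds_ops      # relative to the single-MDS run
efficiency = speedup / len(pinned_ops)    # 1.0 would mean perfectly linear scaling

print(f"aggregate:  {aggregate:.0f} ops/s")
print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.2f}")
```

In this test the aggregate is 6074 ops/s, about 1.08 times the single-MDS result. The numbers alone do not show whether the two ceph-fuse clients or the MDSs were the limiting factor, so they should not be read as the ceiling of a two-MDS cluster.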

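11.2 Conclusions

Single active MDS
- Performance is about 5624 ops/s.
- Active/standby mode provides high availability.
Multiple active MDS, default settings
- All client requests land on the MDS holding rank 0.
Multiple active MDS, dynamic load balancing (not recommended on 12.2 for now)
- In testing, the load across the MDSs was not balanced.
- The balancing policy can be tuned flexibly with Lua.
- There are still serious pitfalls, such as resources migrating back and forth.
Multiple active MDS, static partitioning (recommended; also used in production by others in the community)
- Different directories can be bound to different MDSs.
- This isolates MDS resources between tenants.
- Cluster performance can in principle grow linearly as MDSs are added.
- The two clients reached 3035 ops/s and 3039 ops/s respectively.
Multiple active MDS, active/standby mode
- If one active MDS fails, a standby takes over immediately.
- The new active MDS also inherits the static partitioning relationships.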

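12. MDS states

12.1 MDS failover flow diagram

image.png

Notes:

The user manually triggers a failover (fail).
The active MDS receives the signal and respawns (restarts).
A standby MDS receives the signal and is elected the new active MDS through a distributed algorithm.
The new active MDS goes from up:boot to up:replay. This is the journal recovery phase: it reads the journal into memory and replays it there.
The new active MDS goes from up:replay to up:reconnect. The recovering MDS must re-establish connections with its previous clients and query the file handles they hold, so that it can rebuild the capability (consistency) and lock state in its cache.
The new active MDS goes from up:reconnect to up:rejoin. It loads the clients' inodes into the MDS cache. (This is the most time-consuming step.)
The new active MDS goes from up:rejoin to up:active. The MDS is now in its normal, available state.
recovery_done: the migration is complete.
active_start: the normal available state starts, and MDCache loads the relevant information.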

12.2 MDS states

State	Description
up:active	This is the normal operating state of the MDS. It indicates that the MDS and its rank in the file system is available.
up:standby	The MDS is available to take over for a failed rank. The monitor will automatically assign an MDS in this state to a failed rank once available.
up:standby_replay	The MDS is following the journal of another up:active MDS. Should the active MDS fail, having a standby MDS in replay mode is desirable as the MDS is replaying the live journal and will more quickly take over. A downside to having standby replay MDSs is that they are not available to take over for any other MDS that fails, only the MDS they follow. A standby-replay daemon continuously reads the metadata journal of an up rank, so it holds a hot metadata cache and speeds up failover when the daemon serving that rank fails. A running rank can have only one standby-replay daemon; if two daemons are both set to standby-replay, one of them wins and the other becomes a normal, non-replay standby. Once a daemon enters standby-replay it only backs up its own rank: if another rank fails, this daemon will not take over, even if no other standby is available.
up:boot	This state is broadcast to the Ceph monitors during startup. This state is never visible as the Monitor immediately assigns the MDS to an available rank or commands the MDS to operate as a standby. The state is documented here for completeness.
up:creating	The MDS is creating a new rank (perhaps rank 0) by constructing some per-rank metadata (like the journal) and entering the MDS cluster.
up:starting	The MDS is restarting a stopped rank. It opens associated per-rank metadata and enters the MDS cluster.
up:stopping	When a rank is stopped, the monitors command an active MDS to enter the up:stopping state. In this state, the MDS accepts no new client connections, migrates all subtrees to other ranks in the file system, flushes its metadata journal, and, if it is the last rank (0), evicts all clients and shuts down.
up:replay	The MDS is taking over a failed rank. This state represents that the MDS is recovering its journal and other metadata. In this journal recovery phase it reads the journal into memory and replays it there.
up:resolve	The MDS enters this state from up:replay if the Ceph file system has multiple ranks (including this one), i.e. it's not a single active MDS cluster. The MDS is resolving any uncommitted inter-MDS operations. All ranks in the file system must be in this state or later for progress to be made, i.e. no rank can be failed/damaged or up:replay. This state resolves disagreements over authoritative metadata across multiple MDSs: on the server side this covers subtree distribution and Anchor table updates, and on the client side operations such as rename and unlink.
up:reconnect	An MDS enters this state from up:replay or up:resolve. This state is to solicit reconnections from clients. Any client which had a session with this rank must reconnect during this time, configurable via mds_reconnect_timeout. The recovering MDS re-establishes connections with its previous clients and queries the file handles they hold in order to rebuild the capability and lock state in its cache. The MDS does not synchronously record which files are open, both to avoid adding latency to MDS access and because most files are opened read-only.
up:rejoin	The MDS enters this state from up:reconnect. In this state, the MDS is rejoining the MDS cluster cache. In particular, all inter-MDS locks on metadata are reestablished. If there are no known client requests to be replayed, the MDS directly becomes up:active from this state. This is where the clients' inodes are loaded into the MDS cache.
up:clientreplay	The MDS may enter this state from up:rejoin. The MDS is replaying any client requests which were replied to but not yet durable (not journaled). Clients resend these requests during up:reconnect and the requests are replayed once again. The MDS enters up:active after completing replay.
down:failed	No MDS actually holds this state. Instead, it is applied to the rank in the file system.
down:damaged	No MDS actually holds this state. Instead, it is applied to the rank in the file system.
down:stopped	No MDS actually holds this state. Instead, it is applied to the rank in the file system.
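
The recovery-related rows of the table imply an order in which a daemon taking over a failed rank moves through the states. A minimal Python sketch of that order, assuming only the transitions stated in the table; the function name and its two boolean parameters are hypothetical.

```python
def recovery_path(multiple_ranks, has_client_replay):
    """Return the states an MDS passes through when it takes over a failed rank.

    multiple_ranks: the file system has more than one rank, so up:resolve is needed.
    has_client_replay: there are client requests that must be replayed.
    """
    path = ["up:replay"]
    if multiple_ranks:
        path.append("up:resolve")
    path += ["up:reconnect", "up:rejoin"]
    if has_client_replay:
        path.append("up:clientreplay")
    path.append("up:active")
    return path


print(" -> ".join(recovery_path(multiple_ranks=False, has_client_replay=False)))
print(" -> ".join(recovery_path(multiple_ranks=True, has_client_replay=True)))
```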

12.3 State Diagram

This state diagram shows the possible state transitions for the MDS/rank. The legend is as follows:

Color

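Green: The MDS is active.
Orange: The MDS is in a transient state, trying to become active.
Red: The MDS is indicating a state that causes the rank to be marked failed.
Purple: The MDS and rank are stopping.
Red: The MDS is indicating a state that causes the rank to be marked damaged.

Shape

Circle: An MDS holds this state.
Hexagon: No MDS holds this state.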

Lines

A double-lined shape indicates the rank is "in".

image.png

13. In-depth study

CephFS source code analysis

Multiple active MDS configuration

CephFS introduction and usage experience
