Ceph Architecture (Part 1)

Date: 2019-11-12


The Ceph storage cluster

A Ceph cluster is made up of two types of daemons (typically one daemon per host):

Ceph Monitor & Ceph OSD Daemon

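A Ceph Monitor maintains the master copy of the entire cluster map. A cluster of multiple monitors avoids the single point of failure that the crash of a lone monitor would cause. Storage cluster clients can retrieve a copy of the cluster map from a monitor and cache it.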

A Ceph OSD Daemon checks its own state and the state of other OSDs, and reports back to the monitors.

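Storage cluster clients and every Ceph OSD Daemon use the CRUSH algorithm to efficiently compute information about data location, instead of depending on a huge central lookup table.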

1. Storing data

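Ceph receives data from Ceph clients, whether it arrives through a Ceph Block Device, Ceph Object Storage, the Ceph Filesystem, or a custom implementation you create with librados, and stores it as objects. Each object corresponds to a file in a filesystem, and these files are stored on an Object Storage Device (OSD). Ceph OSD Daemons handle the read and write operations on the disks.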

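Ceph OSD Daemons store all data as objects in a flat namespace (that is, with no directory hierarchy). An object consists of an identifier, binary data, and metadata made up of name/value pairs. The semantics of the data are entirely up to the Ceph client. For example, the Ceph Filesystem uses metadata to store file attributes such as the file owner, creation time, and last modification time.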
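The object model described above can be sketched in a few lines of Python. This is only an illustration: `StoredObject` and `FlatObjectStore` are hypothetical names invented for this example and are not part of Ceph or librados.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    """Hypothetical stand-in for a Ceph object: ID, binary data, metadata."""
    object_id: str                                            # unique across the cluster
    data: bytes                                               # opaque binary payload
    metadata: Dict[str, str] = field(default_factory=dict)    # name/value pairs


class FlatObjectStore:
    """Toy flat namespace: a single dict, no directories or nesting."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, obj: StoredObject) -> None:
        self._objects[obj.object_id] = obj

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = FlatObjectStore()
# The "/" is just part of the ID; the store attaches no hierarchy to it.
store.put(StoredObject("dir/file.txt", b"hello", {"owner": "alice"}))
fetched = store.get("dir/file.txt")
```

The store never interprets `data` or `metadata`; as in the text, their meaning is left to whichever client wrote them.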

Note that an object ID is unique across the entire cluster, not just within the local filesystem.

2. Scalability and high availability

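In a traditional architecture, clients reach a complex subsystem through a single centralized component. That component easily becomes a single point of failure as well as a performance and scalability bottleneck. Ceph takes a decentralized approach instead and lets clients exchange data directly with OSD Daemons. Ceph keeps replicas of the same data on multiple nodes to ensure availability, and the monitors likewise form a multi-node cluster for the same reason.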

Ceph uses the CRUSH algorithm to achieve this decentralization.

Introduction to CRUSH

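Both Ceph clients and Ceph OSD Daemons use the CRUSH algorithm to efficiently compute object locations rather than relying on a central lookup table. For a detailed description of CRUSH, see the paper CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data.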
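To make the idea of computing a location instead of looking it up more concrete, here is a minimal sketch. It deliberately uses rendezvous (highest-random-weight) hashing as a stand-in; it is not the real CRUSH algorithm, which also walks a failure-domain hierarchy and applies placement rules. The function names `score` and `place` are assumptions made for this example.

```python
import hashlib


def score(object_name: str, osd_id: int) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osd_ids: list, replicas: int = 3) -> list:
    """Pick `replicas` OSDs by highest score; no lookup table is consulted."""
    ranked = sorted(
        osd_ids,
        key=lambda osd: (score(object_name, osd), osd),  # OSD ID breaks ties
        reverse=True,
    )
    return ranked[:replicas]


osds = [0, 1, 2, 3, 4, 5]
client_view = place("my-object", osds)
# Another party with the same OSD list, in any order, gets the same answer.
osd_view = place("my-object", list(reversed(osds)))
```

Because the result depends only on the object name and the list of OSDs, a client and an OSD Daemon that hold the same cluster map arrive at the same placement independently.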

Cluster Map

Ceph clients and Ceph OSD Daemons both have knowledge of the cluster topology, and Ceph depends on it. This topology information consists of five maps:

The Monitor Map: Contains the cluster fsid, the position, name, address and port of each monitor. It also indicates the current epoch, when the map was created, and the last time it changed. To view a monitor map, execute ceph mon dump.
The OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, a list of OSDs and their status (e.g., up, in). To view an OSD map, execute ceph osd dump.
The PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios, and details on each placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean), and data usage statistics for each pool.
The CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (e.g., device, host, rack, row, room, etc.), and rules for traversing the hierarchy when storing data. To view a CRUSH map, execute ceph osd getcrushmap -o {filename}; then, decompile it by executing crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You can view the decompiled map in a text editor or with cat.
The MDS Map: Contains the current MDS map epoch, when the map was created, and the last time it changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, execute ceph mds dump.

These five maps are collectively called the cluster map.

Each map maintains a history of its own operating state changes. Ceph Monitors maintain the master copy of the cluster map.
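The role of the epoch can be illustrated with a small sketch in which a client keeps a cached map and replaces it only when the monitor holds a newer epoch. `MapSnapshot`, `MonitorStub`, and `refresh_if_stale` are hypothetical names for this example, and the map payload is reduced to a tuple of OSD IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapSnapshot:
    epoch: int        # increases every time the map changes
    osds_up: tuple    # simplified payload: IDs of OSDs that are up


class MonitorStub:
    """Hypothetical monitor that keeps every epoch of one map."""

    def __init__(self) -> None:
        self.history = [MapSnapshot(epoch=1, osds_up=(0, 1, 2))]

    def latest(self) -> MapSnapshot:
        return self.history[-1]

    def mark_down(self, osd_id: int) -> None:
        current = self.latest()
        remaining = tuple(o for o in current.osds_up if o != osd_id)
        self.history.append(MapSnapshot(current.epoch + 1, remaining))


def refresh_if_stale(cached: MapSnapshot, monitor: MonitorStub) -> MapSnapshot:
    """Return the monitor's copy only when the cached epoch is older."""
    latest = monitor.latest()
    return latest if latest.epoch > cached.epoch else cached


monitor = MonitorStub()
cached = monitor.latest()       # client caches epoch 1
monitor.mark_down(2)            # map changes, epoch becomes 2
cached = refresh_if_stale(cached, monitor)
```

Keeping the older snapshots in `history` mirrors the statement that each map retains a record of its state changes.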

3. High-availability monitors

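Before a Ceph client can read or write data, it must contact a monitor and obtain a copy of the most recent cluster map. To avoid a single point of failure (if the only monitor goes down, clients cannot read or write), Ceph supports a cluster of monitors. When one or more monitors go down, Ceph as a whole keeps working, provided a majority of the monitors is still available to form a quorum.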

4. High-availability authentication

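To identify users and protect against man-in-the-middle attacks, Ceph provides the cephx authentication system, which authenticates users and daemons. Note that cephx does not address encryption of data in transit or at rest.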

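Cephx uses shared secret keys for authentication, meaning that both the client and the monitor cluster hold a copy of the client's secret key. The authentication protocol lets both parties prove to each other that they hold the key without revealing it.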
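Proving possession of a shared key without sending it is commonly done with a challenge-response exchange. The sketch below shows that general technique using HMAC from the Python standard library; it is not the actual cephx wire protocol, and the function names `prove` and `verify` are assumptions for this example.

```python
import hashlib
import hmac
import secrets


def prove(shared_key: bytes, challenge: bytes) -> bytes:
    """Answer a challenge with an HMAC; the key itself is never sent."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def verify(shared_key: bytes, challenge: bytes, response: bytes) -> bool:
    """Recompute the expected answer and compare in constant time."""
    expected = hmac.new(shared_key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


user_key = secrets.token_bytes(32)   # held by both the client and the monitor

# The monitor challenges the client...
monitor_challenge = secrets.token_bytes(16)
client_ok = verify(user_key, monitor_challenge, prove(user_key, monitor_challenge))

# ...and the client challenges the monitor, so the proof is mutual.
client_challenge = secrets.token_bytes(16)
monitor_ok = verify(user_key, client_challenge, prove(user_key, client_challenge))

# A party without the key cannot produce a valid response.
impostor_ok = verify(user_key, monitor_challenge, prove(b"wrong key", monitor_challenge))
```

Only the random challenge and the HMAC cross the wire, so an eavesdropper learns neither the key nor anything reusable for a different challenge.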

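Because Ceph is designed to scale, it avoids a centralized interface to the Ceph object store, which means that Ceph clients can interact with OSDs directly. The cephx protocol operates in a manner similar to Kerberos.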

A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each monitor can authenticate users and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure similar to a Kerberos ticket that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the user’s permanent secret key, so that only the user can request services from the Ceph monitor(s). The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. Like Kerberos, cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained surreptitiously. This form of authentication will prevent attackers with access to the communications medium from either creating bogus messages under another user’s identity or altering another user’s legitimate messages, as long as the user’s secret key is not divulged before it expires.

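To use cephx, an administrator must first set up users. In this step, the client.admin user (presumably on the client side) runs ceph auth get-or-create-key to generate a user name and secret key. Once Ceph's auth subsystem has generated the user name and key, it stores a copy on the monitors and transmits the key back to the client.admin user. This means that the client and the monitors share a secret key.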

To authenticate with the monitor, the client passes in the user name to the monitor, and the monitor generates a session key and encrypts it with the secret key associated to the user name. Then, the monitor transmits the encrypted ticket back to the client. The client then decrypts the payload with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket on behalf of the user signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.

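The cephx protocol authenticates ongoing communication between the client and Ceph servers. After the initial authentication, every message sent between a client and a server is signed using the ticket.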
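Per-message signing with an expiring ticket can be sketched as follows. This is a simplified illustration, not cephx itself: the `Ticket` class, the HMAC-SHA256 signature, and the assumption that the server already knows the ticket's session key are all choices made for this example.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    session_key: bytes
    expires_at: float    # Unix timestamp after which the ticket is rejected


def sign(ticket: Ticket, message: bytes) -> bytes:
    """Client side: sign a message with the ticket's session key."""
    return hmac.new(ticket.session_key, message, hashlib.sha256).digest()


def accept(ticket: Ticket, message: bytes, signature: bytes, now: float) -> bool:
    """Server side: the ticket must be unexpired and the signature must match."""
    if now >= ticket.expires_at:
        return False
    return hmac.compare_digest(sign(ticket, message), signature)


now = time.time()
ticket = Ticket(session_key=b"demo-session-key", expires_at=now + 3600)
msg = b"write object-1"
sig = sign(ticket, msg)

ok = accept(ticket, msg, sig, now)                       # valid message
tampered = accept(ticket, b"write object-2", sig, now)   # altered message
expired = accept(ticket, msg, sig, now + 7200)           # ticket too old
```

The two rejected cases correspond to the two guarantees in the text: altered messages fail the signature check, and a ticket obtained by an attacker becomes useless once it expires.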

Smart daemons enable hyperscale

Ceph takes a decentralized approach:

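Ceph OSD Daemons and Ceph clients are both cluster-aware: from the start they know about the cluster as a whole and about the other nodes in it. Like Ceph clients, each Ceph OSD Daemon knows about the other OSD Daemons in the cluster. This lets OSD Daemons interact directly with other OSD Daemons and with monitors, and it also lets clients interact directly with OSD Daemons.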

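Because Ceph clients, monitors, and OSD Daemons can all interact with each other directly, OSD Daemons can use the CPU and RAM of every node to take on work that would otherwise fall to a centralized server. This design has the following benefits: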

1. OSDs service clients directly

This design removes the single point of failure on the access path and improves performance and scalability.

2. OSD membership and status

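After a Ceph OSD Daemon joins a cluster, it actively reports its status. An OSD Daemon that has crashed cannot tell the monitors that it is down, so the Ceph monitors periodically ping OSD Daemons to check that they are running. Ceph also lets each OSD Daemon check whether neighboring OSD Daemons are down and report this to the monitors so that the cluster map can be updated. This means the monitors can remain lightweight.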
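Peer-based failure detection of this kind can be sketched with a small tracker that records when each neighbor was last heard from. `HeartbeatTracker` is a hypothetical class for this example, and the 20-second grace period is an arbitrary value chosen for illustration rather than a statement about Ceph's configuration.

```python
class HeartbeatTracker:
    """Hypothetical per-OSD view of when each peer was last heard from."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.last_seen = {}

    def record_heartbeat(self, peer_id: int, now: float) -> None:
        self.last_seen[peer_id] = now

    def suspected_down(self, now: float) -> list:
        """Peers silent for longer than the grace period, to report to a monitor."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.grace_seconds
        )


tracker = HeartbeatTracker(grace_seconds=20.0)
tracker.record_heartbeat(1, now=0.0)
tracker.record_heartbeat(2, now=0.0)
tracker.record_heartbeat(1, now=15.0)     # peer 1 keeps responding
down = tracker.suspected_down(now=25.0)   # peer 2 has been silent for 25 s
```

Each OSD Daemon only tracks its own neighbors, which is why the monitors do not have to watch every daemon themselves.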

3. Data scrubbing

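Data scrubbing maintains data consistency and cleanliness: OSD Daemons compare the copies of an object held on different OSDs to detect mismatches.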
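The core comparison behind scrubbing can be sketched by checksumming each replica and flagging the ones that disagree. The majority vote used here to decide which copy is correct is a simplifying assumption for this example, not a description of how Ceph chooses an authoritative copy, and `find_inconsistent` is a hypothetical name.

```python
import hashlib
from collections import Counter


def find_inconsistent(replicas: dict) -> list:
    """Return OSD IDs whose copy's checksum differs from the majority."""
    digests = {osd: hashlib.sha256(data).hexdigest() for osd, data in replicas.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(osd for osd, d in digests.items() if d != majority_digest)


# OSD 2 holds a corrupted copy of the same object.
copies = {0: b"payload", 1: b"payload", 2: b"payl0ad"}
bad = find_inconsistent(copies)
```

Comparing digests instead of full payloads keeps the amount of data exchanged between OSDs small.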

4. Replication

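Like Ceph clients, Ceph OSD Daemons use the CRUSH algorithm, but an OSD Daemon uses it to compute where replicas of objects should be stored. In a typical write scenario, a client uses CRUSH to compute where to store an object, maps the object to a pool and placement group, and then looks at the CRUSH map to identify the primary OSD for that placement group.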
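The write path described here, object to placement group to primary OSD, can be sketched as two small functions. This is an approximation under stated assumptions: SHA-256 with a plain modulo stands in for Ceph's own object hash, and `osds_for_pg` is a toy hash ranking that stands in for the real CRUSH lookup. Both function names are invented for this example.

```python
import hashlib


def pg_for_object(pool_id: int, object_name: str, pg_num: int) -> str:
    """Hash the object name into one of the pool's placement groups."""
    h = int.from_bytes(hashlib.sha256(object_name.encode()).digest()[:4], "big")
    return f"{pool_id}.{h % pg_num:x}"   # "<pool>.<pg in hex>"


def osds_for_pg(pg_id: str, osd_ids: list, size: int) -> list:
    """Toy stand-in for the CRUSH lookup: rank OSDs by a per-PG hash."""
    def rank(osd: int):
        digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        return (int.from_bytes(digest[:8], "big"), osd)
    return sorted(osd_ids, key=rank)[:size]


pg_id = pg_for_object(pool_id=1, object_name="my-object", pg_num=128)
acting = osds_for_pg(pg_id, osd_ids=[0, 1, 2, 3, 4], size=3)
# The first OSD in the set acts as primary; the others hold the replicas.
primary, replicas = acting[0], acting[1:]
```

Placement is computed per placement group rather than per object, so all objects that hash to the same group share one set of OSDs.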
