This article was translated by QiYu of the Ceph China Community.
English source: Using Ceph with MySQL
Welcome to join CCTG.
Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare disks host configuration to LVM-managed disks configuration.
Most of the work I’ve done with Ceph was in collaboration with folks from Red Hat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the Red Hat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before starting with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.
Introduction to Ceph
Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. Red Hat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.
A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point. There are minimally two types of nodes with Ceph, monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.
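To make the monitors' role concrete, here is a minimal sketch (not from the original post) using the librados Python binding: the client contacts the monitors to fetch the cluster map, and only then can it reach the OSDs. It assumes a reachable cluster, a valid /etc/ceph/ceph.conf and keyring on the host, and the rados Python bindings installed.

```python
import rados

# Connecting goes through the monitors: they hand the client the cluster maps.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

print("cluster fsid:", cluster.get_fsid())
# Basic cluster-wide stats; with the maps in hand, reads and writes then go
# straight to the OSDs without any intermediate proxy.
print(cluster.get_cluster_stats())

cluster.shutdown()
```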
Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the "crushmap") in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.
Ceph stores data in a logical container called a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor ("size" in Ceph terminology).
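As a rough illustration of the two previous paragraphs, the sketch below mimics client-side placement: hash the object name to a placement group, then map the pg to a set of OSDs. This is a deliberately simplified stand-in for CRUSH (the real algorithm weighs OSDs and walks the crushmap hierarchy); the four OSDs, 256 pgs and size of three are example values.

```python
import hashlib

# Toy placement: NOT the real CRUSH algorithm, just the idea that placement is
# a pure function of the object name and the cluster map, so any client can
# compute it without asking a proxy.
OSDS = ["osd.0", "osd.1", "osd.2", "osd.3"]   # hypothetical 4-OSD cluster
PG_NUM = 256                                   # placement groups in the pool
SIZE = 3                                       # replication factor ("size")

def place(object_name: str):
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    pg = h % PG_NUM                            # object -> placement group
    first = pg % len(OSDS)                     # pg -> ordered set of OSDs
    return pg, [OSDS[(first + i) % len(OSDS)] for i in range(SIZE)]

print(place("rbd_data.1234.0000000000000001"))
```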
The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.
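For reference, creating such a pool might look like the following sketch, which shells out to the standard ceph CLI (the pool name "mysql-data" is just an example, and it assumes admin credentials on the host). With four OSDs and 256 pgs, each OSD is primary for roughly 64 pgs; with a size of three, every pg also keeps replicas on two other OSDs.

```python
import subprocess

POOL, PG_NUM, SIZE = "mysql-data", 256, 3      # example values

# Create the pool with 256 placement groups and set its replication factor.
subprocess.run(["ceph", "osd", "pool", "create", POOL, str(PG_NUM), str(PG_NUM)],
               check=True)
subprocess.run(["ceph", "osd", "pool", "set", POOL, "size", str(SIZE)],
               check=True)
```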
So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly as if it were a regular file. That update capability allows for much fancier usage of the object store – for things like the support of block devices, rbd devices, and even a network file system, cephfs.
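Here is a small sketch of that in-place update capability through the librados Python binding. The pool and object names are examples; it writes five bytes at an arbitrary offset of an existing object, just as described above.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                     # example pool name

ioctx.write_full('demo-object', b'A' * 4096)          # create/overwrite the object
ioctx.write('demo-object', b'hello', offset=2048)     # update 5 bytes in place
print(ioctx.read('demo-object', length=16, offset=2048))

ioctx.close()
cluster.shutdown()
```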
When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.
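Creating an rbd image can be done with the rbd Python binding, as in the hedged sketch below (the image name and size are examples). Mapping it as a /dev/rbd* block device then goes through the kernel rbd module, for instance with "rbd map mysql-vol01 --pool rbd".

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')              # pool used for rbd images

# An 8 GB image, stored under the hood as a series of 4 MB rados objects.
rbd.RBD().create(ioctx, 'mysql-vol01', 8 * 1024**3)

ioctx.close()
cluster.shutdown()
```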
The benefits of using Ceph
In a world striving for virtualization and containers, Ceph makes it easy to move database resources between hosts.
On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.
Ceph replicates data at the storage level, and provides resiliency against storage node crashes. A kind of DRBD on steroids…
Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.
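The snapshot and clone workflow from the last two points could look roughly like this (the names are examples; cloning assumes a format-2 image with the layering feature, which is the default on recent Ceph releases):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Snapshot the MySQL volume; snapshots are copy-on-write and near-instant.
img = rbd.Image(ioctx, 'mysql-vol01')
img.create_snap('backup-20160701')
img.protect_snap('backup-20160701')            # required before cloning
img.close()

# Clone the snapshot into a new image that a fresh replica can mount.
rbd.RBD().clone(ioctx, 'mysql-vol01', 'backup-20160701',
                ioctx, 'mysql-replica-02')

ioctx.close()
cluster.shutdown()
```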
Of course, nothing is free. Ceph use comes with some caveats.
If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph cluster with nearly full storage; you must have enough disk space to handle the loss of one node.
The "no out" OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the "no out" attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.
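As a sketch of the "no out" workflow using the standard ceph CLI from Python (the flag is spelled "noout" on the command line, and admin credentials are assumed):

```python
import subprocess

# Prevent Ceph from automatically re-replicating when an OSD goes down,
# e.g. before planned maintenance.
subprocess.run(["ceph", "osd", "set", "noout"], check=True)

# You are now responsible for noticing degraded operation yourself.
health = subprocess.run(["ceph", "health"], capture_output=True, text=True).stdout
if "HEALTH_OK" not in health:
    print("cluster degraded, take action:", health.strip())

# Once the OSD is back, restore normal recovery behaviour.
subprocess.run(["ceph", "osd", "unset", "noout"], check=True)
```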
Every day (and every week for deep scrubs), Ceph runs scrub operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine. But you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.
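A hedged example of pushing scrubs to off hours (22:00 to 05:00 here, purely illustrative): "ceph config set" exists on recent releases, while older clusters would put the same options in the [osd] section of ceph.conf or inject them with "ceph tell osd.* injectargs".

```python
import subprocess

# Restrict scrubbing to a nightly window and cap the load at which it runs.
for option, value in [("osd_scrub_begin_hour", "22"),
                      ("osd_scrub_end_hour", "5"),
                      ("osd_scrub_load_threshold", "0.5")]:
    subprocess.run(["ceph", "config", "set", "osd", option, value], check=True)
```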
Ceph has many parameters, so tuning Ceph can be complex and confusing. Since distributed systems push the hardware, properly tuning Ceph might require things like distributing interrupt load among cores, thread core pinning, and handling of NUMA zones – especially if you use high-speed NVMe devices.
Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.
Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a 2 TB portable USB-3 hard disk, a 16 GB eMMC flash disk and a gigabit Ethernet port.
I won’t claim record-breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!