https://www.jianshu.com/p/36c2d5682d87
Following the earlier post 《Ceph介紹及原理架構分享》 (an introduction to Ceph and its architecture), this post walks through the various PG states in Ceph. The PG is one of the most complex and hardest-to-understand concepts in Ceph.
The normal PG state is 100% active + clean, which means all PGs are accessible and all replicas are available for all PGs.
Beyond that, Ceph may also report other warning or error states for PGs. The PG state table:
| State | Description |
|---|---|
| Activating | Peering has completed; the PG is waiting for all PG instances to synchronize and persist the Peering results (Info, Log, etc.) |
| Active | Active state. The PG can serve read and write requests from clients normally |
| Backfilling | Backfill in progress. Backfill is a special case of recovery: after Peering completes, if some PG instances in the Up Set cannot be synchronized incrementally from the current authoritative log (for example, the OSDs hosting them were offline too long, or a new OSD joining the cluster forced a wholesale migration of PG instances), they are synchronized in full by copying every object currently held by the Primary |
| Backfill-toofull | A PG instance that needs to be backfilled sits on an OSD with insufficient free space, so the backfill is suspended |
| Backfill-wait | The PG is waiting for backfill resource reservation |
| Clean | Clean state. The PG has no objects awaiting repair, the Acting Set and Up Set are identical, and their size equals the pool's replica count |
| Creating | The PG is being created |
| Deep | The PG is undergoing, or about to undergo, a deep object-consistency scrub |
| Degraded | Degraded state. After Peering, the PG found objects in some PG instance that are inconsistent (need synchronization/repair), or the current Acting Set is smaller than the pool's replica count |
| Down | During Peering, the PG found an Interval that cannot be skipped (for example, the PG completed Peering during that Interval and switched to Active, so it may have served client reads and writes), and the OSDs still online are not sufficient to complete data repair |
| Incomplete | During Peering, either (a) an authoritative log could not be elected, or (b) the Acting Set chosen via choose_acting is insufficient to complete data repair, so Peering could not finish normally |
| Inconsistent | Inconsistent state. A scrub or deep scrub detected inconsistencies among the replicas of some object in the PG, for example mismatched object sizes, or a replica missing after Recovery finished |
| Peered | Peering has completed, but the PG's current Acting Set is smaller than the minimum replica count (min_size) required by the pool |
| Peering | The PG is performing Peering (replica negotiation and synchronization) |
| Recovering | Recovering state. The cluster is migrating or synchronizing objects and their replicas |
| Recovery-wait | The PG is waiting for Recovery resource reservation |
| Remapped | Remapped state. Whenever the PG's Acting Set changes, data migrates from the old Acting Set to the new one. During the migration the Primary of the old Acting Set keeps serving client requests; once the migration completes, the Primary of the new Acting Set takes over |
| Repair | While scrubbing, the PG found inconsistent objects that can be repaired, and is repairing them automatically |
| Scrubbing | The PG is undergoing, or about to undergo, an object-consistency scrub |
| Inactive | Inactive state. The PG cannot serve read or write requests |
| Unclean | Unclean state. The PG could not recover from a previous failure |
| Stale | Stale state. The PG's status has not been refreshed by any OSD, which suggests that all OSDs storing this PG may be down, or that the Mon has not received statistics from the Primary (e.g., network flapping) |
| Undersized | The PG's current Acting Set is smaller than the pool's replica count |
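Most of these states can be observed with a handful of read-only commands; a minimal sketch (the PG id `1.0` is a placeholder):

```
# Overall PG state counts
$ ceph pg stat
# Health warnings with the affected PGs listed
$ ceph health detail
# PGs stuck in a problematic state (inactive|unclean|stale|undersized|degraded)
$ ceph pg dump_stuck inactive
# Full Peering/recovery detail for a single PG
$ ceph pg 1.0 query
```

The experiments below use these commands to provoke and observe each state in turn.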
Simulating the Degraded state (pool size = 3, min_size = 2):

a. Stop osd.1

```
$ systemctl stop ceph-osd@1
```
b. Check the PG state

```
$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)
```
c. Check the cluster health

```
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is active+undersized+degraded, acting [0,2]
    pg 1.1 is active+undersized+degraded, acting [2,0]
```
d. Client IO

```
# Write an object
$ bin/rados -p test_pool put myobject ceph.conf
# Read the object back into a file
$ bin/rados -p test_pool get myobject ceph.conf.old
# Check the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul 3 19:57 ceph.conf.old
```
Fault summary:

To simulate the failure (size = 3, min_size = 2) we manually stopped osd.1 and then checked the PG state. The PGs are now active+undersized+degraded: when an OSD hosting a PG goes down, the PG enters undersized+degraded, and the trailing [0,2] means the two surviving replicas live on osd.0 and osd.2. At this point the client can still read and write normally.
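To find the degraded PGs without scanning the full health output, the stuck-PG listings are handy; a small sketch (all read-only, safe to run at any time):

```
# PGs stuck degraded / undersized
$ ceph pg dump_stuck degraded
$ ceph pg dump_stuck undersized
# Watch the degraded object count fall once the OSD comes back
$ watch -n 1 ceph pg stat
```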
Simulating the Peered state (continuing with size = 3, min_size = 2):

a. Stop two replicas, osd.1 and osd.0

```
$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0
```
b. Check the cluster health

```
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]
    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]
    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]
    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is undersized+degraded+peered, acting [2]
    pg 1.1 is undersized+degraded+peered, acting [2]
```
c. Client IO (hangs)

```
# Reading the object into a file hangs
$ bin/rados -p test_pool get myobject ceph.conf.old
```
Fault summary:

With only osd.2 still alive, the PGs pick up the peered state. Because the pool's min_size is 2 and only one replica survives, the PGs cannot become active, so client IO hangs.
d. Setting min_size = 1 resolves the IO hang

```
# Set min_size = 1
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1
```
e. Check the cluster health

```
$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
    pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
    pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
    pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]
```
f. 客戶端IO操做
#讀取對象到文件中 $ ll -lh ceph.conf* -rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf -rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old -rw-r--r-- 1 root root 6.1K Jul 3 20:11 ceph.conf.old.1
Fault summary:

After min_size is lowered to 1, the PGs go back to active+undersized+degraded and client IO resumes.
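Note that min_size = 1 means a single surviving copy will accept writes, which risks data loss if that last OSD also fails, so it is best treated as a temporary measure. A small sketch for checking and restoring the setting once the OSDs are back (pool name as in the example above):

```
# Inspect the pool's replication settings
$ ceph osd pool get test_pool size
$ ceph osd pool get test_pool min_size
# Restore the safer default after recovery completes
$ ceph osd pool set test_pool min_size 2
```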
Simulating the Remapped state:

a. Stop osd.x

```
$ systemctl stop ceph-osd@x
```
b. Wait 5 minutes, then start osd.x

```
$ systemctl start ceph-osd@x
```
c. Check the PG state

```
$ ceph pg stat
1416 pgs: 6 active+clean+remapped, 1288 active+clean, 3 stale+active+clean, 119 active+undersized+degraded; 74940 MB data, 250 GB used, 185 TB / 185 TB avail; 1292/48152 objects degraded (2.683%)
$ ceph pg dump | grep remapped
dumped all
13.cd   0  0 0 0 0         0    2    2 active+clean+remapped 2018-07-03 20:26:14.478665   9453'2  20716:11343 [10,23] 10 [10,23,14] 10   9453'2 2018-07-03 20:26:14.478597   9453'2 2018-07-01 13:11:43.262605
3.1a   44  0 0 0 0 373293056 1500 1500 active+clean+remapped 2018-07-03 20:25:47.885366 20272'79063 20716:109173 [9,23]  9 [9,23,12]   9 20272'79063 2018-07-03 03:14:23.960537 20272'79063 2018-07-03 03:14:23.960537
5.f     0  0 0 0 0         0    0    0 active+clean+remapped 2018-07-03 20:25:47.888430      0'0  20716:15530 [23,8] 23 [23,8,22]  23      0'0 2018-07-03 06:44:05.232179      0'0 2018-06-30 22:27:16.778466
3.4a   45  0 0 0 0 390070272 1500 1500 active+clean+remapped 2018-07-03 20:25:47.886669 20272'78385 20716:108086 [7,23]  7 [7,23,17]   7 20272'78385 2018-07-03 13:49:08.190133  7998'78363 2018-06-28 10:30:38.201993
13.102  0  0 0 0 0         0    5    5 active+clean+remapped 2018-07-03 20:25:47.884983   9453'5  20716:11334 [1,23]  1 [1,23,14]   1   9453'5 2018-07-02 21:10:42.028288   9453'5 2018-07-02 21:10:42.028288
13.11d  1  0 0 0 0   4194304 1539 1539 active+clean+remapped 2018-07-03 20:25:47.886535 20343'22439 20716:86294 [4,23]  4 [4,23,15]   4 20343'22439 2018-07-03 17:21:18.567771 20343'22439 2018-07-03 17:21:18.567771

# Query again two minutes later
$ ceph pg stat
1416 pgs: 2 active+undersized+degraded+remapped+backfilling, 10 active+undersized+degraded+remapped+backfill_wait, 1401 active+clean, 3 stale+active+clean; 74940 MB data, 247 GB used, 179 TB / 179 TB avail; 260/48152 objects degraded (0.540%); 49665 kB/s, 9 objects/s recovering
$ ceph pg dump | grep remapped
dumped all
13.1e8  2  0 2 0 0   8388608 1527 1527 active+undersized+degraded+remapped+backfill_wait 2018-07-03 20:30:13.999637 9493'38727 20754:165663 [18,33,10] 18 [18,10] 18 9493'38727 2018-07-03 19:53:43.462188 0'0 2018-06-28 20:09:36.303126
```
d. 客戶端IO操做
#rados讀寫正常 rados -p test_pool put myobject /tmp/test.log
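A remapped PG is easiest to read by comparing its up set (where CRUSH wants the data) with its acting set (the OSDs actually serving IO right now); while the two differ, the PG reports remapped and the old acting set's Primary keeps serving requests, matching the state-table entry above. For a single PG from the dump, something along these lines (the output shown is illustrative):

```
$ ceph pg map 13.cd
osdmap e20716 pg 13.cd (13.cd) -> up [10,23] acting [10,23,14]
```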
Simulating the Recovering state:

a. Stop osd.x

```
$ systemctl stop ceph-osd@x
```
b. Wait 1 minute, then start osd.x

```
$ systemctl start ceph-osd@x
```
c. Check the cluster health

```
$ ceph health detail
HEALTH_WARN Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
PG_DEGRADED Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
    pg 1.19 is active+recovery_wait+degraded, acting [29,9,17]
```
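A short outage like this is healed by incremental recovery driven by the PG log, which is why the PGs pass through recovery_wait and recovering. If recovery competes with client IO, the usual OSD options can be changed at runtime; a hedged sketch (the values are illustrative and defaults vary by release):

```
# Allow more parallel recovery ops per OSD to finish faster
$ ceph tell osd.* injectargs '--osd-recovery-max-active 8'
# Or throttle recovery to protect client latency
$ ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-recovery-sleep 0.1'
```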
Simulating the Backfilling state:

a. Stop osd.x

```
$ systemctl stop ceph-osd@x
```
b. Wait 10 minutes, then start osd.x

```
$ systemctl start ceph-osd@x
```
c. Check the cluster health

```
$ ceph health detail
HEALTH_WARN Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
PG_DEGRADED Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
    pg 3.7f is active+undersized+degraded+remapped+backfilling, acting [21,29]
```
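An OSD that stays down long enough falls outside the horizon of the PG log, so recovery degrades to backfill (full object copies), which is the state seen above. Backfill concurrency is also tunable at runtime; a small sketch (the `ceph daemon` form must run on the host of the OSD in question):

```
# Limit concurrent backfills per OSD (value is illustrative)
$ ceph tell osd.* injectargs '--osd-max-backfills 1'
# Confirm the running value on one OSD's host
$ ceph daemon osd.0 config get osd_max_backfills
```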
Simulating the Stale state:

a. Stop the PG's three replica OSDs one by one; first stop osd.23

```
$ systemctl stop ceph-osd@23
```
b. Then stop osd.24

```
$ systemctl stop ceph-osd@24
```
c. With two replicas stopped, check the state of PG 1.45 (undersized+degraded+peered)

```
$ ceph health detail
HEALTH_WARN 2 osds down; Reduced data availability: 9 pgs inactive; Degraded data redundancy: 3041/47574 objects degraded (6.392%), 149 pgs unclean, 149 pgs degraded, 149 pgs undersized
OSD_DOWN 2 osds down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 9 pgs inactive
    pg 1.45 is stuck inactive for 281.355588, current state undersized+degraded+peered, last acting [10]
```
d. Then stop osd.10, the third replica of PG 1.45

```
$ systemctl stop ceph-osd@10
```
e. With all three replicas stopped, check the state of PG 1.45 (stale+undersized+degraded+peered)

```
$ ceph health detail
HEALTH_WARN 3 osds down; Reduced data availability: 26 pgs inactive, 2 pgs stale; Degraded data redundancy: 4770/47574 objects degraded (10.026%), 222 pgs unclean, 222 pgs degraded, 222 pgs undersized
OSD_DOWN 3 osds down
    osd.10 (root=default,host=ceph-xx-osd01) is down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 26 pgs inactive, 2 pgs stale
    pg 1.9 is stuck inactive for 171.200290, current state undersized+degraded+peered, last acting [13]
    pg 1.45 is stuck stale for 171.206909, current state stale+undersized+degraded+peered, last acting [10]
    pg 1.89 is stuck inactive for 435.573694, current state undersized+degraded+peered, last acting [32]
    pg 1.119 is stuck inactive for 435.574626, current state undersized+degraded+peered, last acting [28]
```
f. 客戶端IO操做
#讀寫掛載磁盤IO 夯住 ll /mnt/
Fault summary:

Stopping two replicas of the same PG leaves it undersized+degraded+peered.
Stopping all three replicas of the same PG leaves it stale+undersized+degraded+peered: with no OSD left to report the PG's status to the Mon, the state goes stale.
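Stale PGs can be listed directly, and the last known mapping of a stale PG points at the OSDs that must be revived; a quick sketch:

```
# PGs that no OSD is reporting on any more
$ ceph pg dump_stuck stale
# Last known mapping for the stale PG from this experiment
$ ceph pg map 1.45
```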
Simulating the Inconsistent state:

a. Delete an object's head file from replica osd.34 of PG 3.0

```
$ rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3
```
b. Manually trigger a scrub of PG 3.0

```
$ ceph pg scrub 3.0
instructing pg 3.0 on osd.34 to scrub
```
c. Check the cluster health

```
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.0 is active+clean+inconsistent, acting [34,23,1]
```
d. Repair PG 3.0

```
$ ceph pg repair 3.0
instructing pg 3.0 on osd.34 to repair

# Check the cluster health again
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent, 1 pg repair
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent, 1 pg repair
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [34,23,1]

# The cluster health is back to normal
$ ceph health detail
HEALTH_OK
```
Fault summary:

When the three replicas inside a PG hold inconsistent data, running the ceph pg repair command is enough to fix the inconsistent object files: Ceph copies the missing or damaged files back from the other replicas.
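Before repairing blindly, it is often worth checking what exactly is inconsistent. The rados inconsistency listings read the results of the most recent scrub; a sketch (the pool name is illustrative):

```
# PGs in a pool that scrub flagged as inconsistent
$ rados list-inconsistent-pg test_pool
# Objects and shards that are inconsistent inside the PG
$ rados list-inconsistent-obj 3.0 --format=json-pretty
```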
Simulating the Down state:

a. Check the replicas of PG 3.7f

```
$ ceph pg dump | grep ^3.7f
dumped all
3.7f  43  0  0  0  0  494927872  1569  1569  active+clean  2018-07-05 02:52:51.512598  21315'80115  21356:111666  [5,21,29]  5  [5,21,29]  5  21315'80115  2018-07-05 02:52:51.512568  6206'80083  2018-06-29 22:51:05.831219
```
b. Stop replica osd.21 of PG 3.7f

```
$ systemctl stop ceph-osd@21
```
c. Check the state of PG 3.7f

```
$ ceph pg dump | grep ^3.7f
dumped all
3.7f  66  0  89  0  0  591396864  1615  1615  active+undersized+degraded  2018-07-05 15:29:15.741318  21361'80161  21365:128307  [5,29]  5  [5,29]  5  21315'80115  2018-07-05 02:52:51.512568  6206'80083  2018-06-29 22:51:05.831219
```
d. Write data from the client, making sure it lands on PG 3.7f's surviving replicas [5,29]

```
$ fio -filename=/mnt/xxxsssss -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=4M -size=2G -numjobs=30 -runtime=200 -group_reporting -name=read-libaio
read-libaio: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 30 threads
read-libaio: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 5 (f=5): [_(5),R(1),_(5),R(1),_(3),R(1),_(2),R(1),_(1),R(1),_(9)] [96.5% done] [1052MB/0KB/0KB /s] [263/0/0 iops] [eta 00m:02s]
read-libaio: (groupid=0, jobs=30): err= 0: pid=32966: Thu Jul 5 15:35:16 2018
  read : io=61440MB, bw=1112.2MB/s, iops=278, runt= 55203msec
    slat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat (usec): min=0, max=33, avg= 2.51, stdev= 1.45
     lat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    2],
     | 70.00th=[    3], 80.00th=[    3], 90.00th=[    4], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[    8], 99.90th=[   10], 99.95th=[   14],
     | 99.99th=[   32]
    bw (KB /s): min=15058, max=185448, per=3.48%, avg=39647.57, stdev=12643.04
    lat (usec) : 2=19.59%, 4=64.52%, 10=15.78%, 20=0.08%, 50=0.03%
  cpu          : usr=0.01%, sys=0.37%, ctx=491792, majf=0, minf=15492
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=15360/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=61440MB, aggrb=1112.2MB/s, minb=1112.2MB/s, maxb=1112.2MB/s, mint=55203msec, maxt=55203msec
```
e. Stop replica osd.29 of PG 3.7f and check the state (undersized+degraded+peered)

```
# Stop replica osd.29
$ systemctl stop ceph-osd@29
# PG 3.7f is now undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f  70  0  140  0  0  608174080  1623  1623  undersized+degraded+peered  2018-07-05 15:35:51.629636  21365'80169  21367:132165  [5]  5  [5]  5  21315'80115  2018-07-05 02:52:51.512568  6206'80083  2018-06-29 22:51:05.831219
```
f. Stop replica osd.5 of PG 3.7f and check the state (stale+undersized+degraded+peered)

```
# Stop replica osd.5
$ systemctl stop ceph-osd@5
# PG 3.7f is now stale+undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f  70  0  140  0  0  608174080  1623  1623  stale+undersized+degraded+peered  2018-07-05 15:35:51.629636  21365'80169  21367:132165  [5]  5  [5]  5  21315'80115  2018-07-05 02:52:51.512568  6206'80083  2018-06-29 22:51:05.831219
```
g. Bring replica osd.21 back up (its data is now stale) and check the PG state (down)

```
# Start osd.21 again
$ systemctl start ceph-osd@21
# PG 3.7f is now down
$ ceph pg dump | grep ^3.7f
dumped all
3.7f  66  0  0  0  0  591396864  1548  1548  down  2018-07-05 15:36:38.365500  21361'80161  21370:111729  [21]  21  [21]  21  21315'80115  2018-07-05 02:52:51.512568  6206'80083  2018-06-29 22:51:05.831219
```
h. 客戶端IO操做
#此時客戶端IO都會夯住 ll /mnt/
Fault summary:

PG 3.7f starts with three replicas [5,21,29]. After osd.21 is stopped, new data is written to osd.5 and osd.29. Then osd.29 and osd.5 are stopped as well, and finally osd.21 is brought back up. osd.21's data is now older than what the PG last acknowledged, so the PG goes down and client IO hangs; the only way to repair this is to bring the stopped OSDs back up.
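When a PG is down, querying it directly shows which offline OSDs Peering is blocked on; a quick check (the relevant fields sit under recovery_state in the JSON output, e.g. "blocked" and "peering_blocked_by" in Luminous-era releases):

```
$ ceph pg 3.7f query
```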
If the downed OSDs cannot be brought back (for example, the disk is dead), the sequence is:

a. Delete the OSD that cannot be started (see the sketch after this list)
b. Recreate an OSD with the same id
c. The PG's Down state then disappears
d. For PGs with unfound objects, choose delete or revert: `ceph pg {pg-id} mark_unfound_lost revert|delete`
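A hedged sketch of steps a and b (the OSD id and device path are illustrative; newer releases use ceph-volume, older ones ceph-disk):

```
# Remove the dead OSD from the cluster (id 21 is illustrative)
$ ceph osd out 21
$ ceph osd crush remove osd.21
$ ceph auth del osd.21
$ ceph osd rm 21
# Recreate it on a replacement disk; the lowest free id is typically reused
$ ceph-volume lvm create --data /dev/sdX
```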
Simulating the Incomplete state (three replicas on OSDs A, B, C):

a. First kill B
b. Write new data to A and C
c. Kill A and C
d. Bring B back up

This leaves the PG incomplete: during Peering, either (a) an authoritative log cannot be elected, or (b) the Acting Set chosen via choose_acting is insufficient to complete data repair, so Peering cannot finish normally.
This is commonly seen when servers are rebooted back and forth, or lose power, while the cluster is still peering.
Quick fix (data may be lost):

a. Stop the OSD that is primary for the incomplete PG
b. Run: `ceph-objectstore-tool --data-path ... --journal-path ... --pgid $PGID --op mark-complete`
c. Start the OSD
Safer fix (preserves data integrity):

```
# 1. Check the object counts of pg 1.1's replicas. Assuming the primary holds
#    the most objects, run this on the primary's node:
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 1.1 --op export --file /home/pg1.1
# 2. scp /home/pg1.1 to each replica's node (repeat for every replica), then on each replica node:
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op import --file /home/pg1.1
# 3. Mark the PG complete
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op mark-complete
# 4. Finally, start the OSD
$ systemctl start ceph-osd@1
```
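Step 1 above says to compare object counts per replica but does not show how; one way is ceph-objectstore-tool's list operation, assuming the OSD is stopped first (the tool needs exclusive access to the data store):

```
$ systemctl stop ceph-osd@0
# Count the objects this replica holds for pg 1.1
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 1.1 --op list | wc -l
```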