(1) Check the cluster status; two OSDs are reported as down
[root@node140 /]# ceph -s
  cluster:
    id:     58a12719-a5ed-4f95-b312-6efd6e34e558
    health: HEALTH_ERR
            noout flag(s) set
            2 osds down
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1633/10191 objects degraded (16.024%), 84 pgs degraded, 122 pgs undersized

  services:
    mon: 2 daemons, quorum node140,node142 (age 3d)
    mgr: admin(active, since 3d), standbys: node140
    osd: 18 osds: 16 up (since 3d), 18 in (since 5d)
         flags noout

  data:
    pools:   2 pools, 384 pgs
    objects: 3.40k objects, 9.8 GiB
    usage:   43 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     1633/10191 objects degraded (16.024%)
             261 active+clean
             84  active+undersized+degraded
             38  active+undersized
             1   active+clean+inconsistent
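The health summary above only gives counts. A hedged sketch (not part of the original session) for drilling into which placement groups are actually affected:

ceph health detail      # lists the inconsistent/degraded PGs by id
ceph pg ls degraded     # per-PG view of the degraded placement groups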
(2) Check the OSD status
[root@node140 /]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       9.80804 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       3.26935     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 7   hdd 0.54489         osd.7      down  1.00000 1.00000
 8   hdd 0.54489         osd.8      down  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000
(3) Check osd.7 and osd.8: the services have failed and cannot be restarted
[root@node140 /]# systemctl status ceph-osd@8.service
● ceph-osd@8.service - Ceph object storage daemon osd.8
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2019-08-30 17:36:50 CST; 1min 20s ago
  Process: 433642 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service holdoff time over, scheduling restart.
Aug 30 17:36:50 node140 systemd[1]: Stopped Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: start request repeated too quickly for ceph-osd@8.service
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
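The unit status only shows that systemd gave up restarting the daemon. A minimal sketch for finding out why it keeps dying, assuming default log locations (the device names in the dmesg filter are assumptions based on the disks replaced later):

journalctl -u ceph-osd@8.service --no-pager -n 50   # recent daemon output captured by systemd
dmesg | grep -i -E 'sdc|sdd|I/O error'              # kernel-level disk errors
tail -n 50 /var/log/ceph/ceph-osd.8.log             # the OSD's own log, if it got far enough to write one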
(4) OSD disk failure and state changes
When an OSD's disk fails, the OSD state changes to down. After the interval configured by mon osd down out interval has elapsed, Ceph marks the OSD out and begins migrating data to restore redundancy. To reduce the impact, you can disable this automatic recovery first and re-enable it once the disk has been replaced; a runtime sketch for the setting follows the config snippet below.
[root@node140 /]# cat /etc/ceph/ceph.conf
[global]
mon osd down out interval = 900
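Editing ceph.conf only takes effect after the daemons restart. A hedged sketch for applying and checking the value at runtime (the second command talks to the admin socket, so it has to be run on node140 itself):

ceph tell mon.* injectargs '--mon_osd_down_out_interval=900'      # push the value to the running monitors
ceph daemon mon.node140 config get mon_osd_down_out_interval      # confirm the active value on one monitor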
(5) Stop data rebalancing
[root@node140 /]# for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd set $i;done
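A quick way to confirm the flags took effect (a sketch on the same cluster):

ceph osd dump | grep flags   # should list noout,nobackfill,norecover,noscrub,nodeep-scrub
ceph -s | grep flags         # the status output also shows the active flags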
(6) Locate the failed disks
[root@node140 /]# ceph osd tree | grep -i down
7 hdd 0.54489 osd.7 down 0 1.00000
8 hdd 0.54489 osd.8 down 0 1.00000
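ceph osd tree gives the OSD ids but not the physical device behind them. A hedged sketch for mapping an id to its host and backing disk:

ceph osd find 7                    # reports the host (node142 here) and address of osd.7
ssh node142 ceph-volume lvm list   # on that host, shows which /dev/sdX backs each OSD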
(7) Unmount the failed OSDs
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-7
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-8
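If the failed daemons are still flapping under systemd, stop and disable the units before unmounting (a sketch; the umount commands above assume the daemons are already dead):

systemctl stop ceph-osd@7.service ceph-osd@8.service
systemctl disable ceph-osd@7.service ceph-osd@8.service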
(8) Remove the OSDs from the CRUSH map
[root@node142 ~]# ceph osd crush remove osd.7
removed item id 7 name 'osd.7' from crush map
[root@node142 ~]# ceph osd crush remove osd.8
removed item id 8 name 'osd.8' from crush map
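A quick sanity check that the two entries are really gone from the CRUSH hierarchy (a sketch on the same cluster):

ceph osd crush tree | grep -E 'osd\.(7|8)'   # should print nothing
ceph osd tree                                # host node142's weight should have dropped accordingly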
(9) Delete the failed OSDs' auth keys
[root@node142 ~]# ceph auth del osd.7
updated
[root@node142 ~]# ceph auth del osd.8
updated
(10) Remove the failed OSDs from the cluster
[root@node142 ~]# ceph osd rm 7
removed osd.7
[root@node142 ~]# ceph osd rm 8
removed osd.8
[root@node142 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       8.71826 root default
-2       3.26935     host node140
 0   hdd 0.54489         osd.0        up  1.00000 1.00000
 1   hdd 0.54489         osd.1        up  1.00000 1.00000
 2   hdd 0.54489         osd.2        up  1.00000 1.00000
 3   hdd 0.54489         osd.3        up  1.00000 1.00000
 4   hdd 0.54489         osd.4        up  1.00000 1.00000
 5   hdd 0.54489         osd.5        up  1.00000 1.00000
-3       3.26935     host node141
12   hdd 0.54489         osd.12       up  1.00000 1.00000
13   hdd 0.54489         osd.13       up  1.00000 1.00000
14   hdd 0.54489         osd.14       up  1.00000 1.00000
15   hdd 0.54489         osd.15       up  1.00000 1.00000
16   hdd 0.54489         osd.16       up  1.00000 1.00000
17   hdd 0.54489         osd.17       up  1.00000 1.00000
-4       2.17957     host node142
 6   hdd 0.54489         osd.6        up  1.00000 1.00000
 9   hdd 0.54489         osd.9        up  1.00000 1.00000
10   hdd 0.54489         osd.10       up  1.00000 1.00000
11   hdd 0.54489         osd.11       up  1.00000 1.00000
[root@node142 ~]#
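On Luminous and later releases, steps (8) through (10) can be collapsed into a single command per OSD; a hedged alternative, not what was run above:

ceph osd purge 7 --yes-i-really-mean-it   # removes the OSD from CRUSH, deletes its auth key, and removes it from the OSD map
ceph osd purge 8 --yes-i-really-mean-it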
(11) Replace the failed disks, check the new device names, and then recreate the OSDs
[root@node142 ~]# ceph-volume lvm create --data /dev/sdd
[root@node142 ~]# ceph-volume lvm create --data /dev/sdc
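If the replacement disks carry leftover partitions or LVM metadata from a previous life, ceph-volume lvm create may refuse them; a hedged sketch for wiping them first (device names taken from the create commands above):

ceph-volume lvm zap /dev/sdd --destroy
ceph-volume lvm zap /dev/sdc --destroy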
(12) Verify the new OSDs
[root@node142 ~]# ceph-volume lvm list
(13) After the new OSDs have been added to the CRUSH map, clear the cluster flags that were set earlier
for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd unset $i;done
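After clearing the flags, the cluster starts backfilling onto the new OSDs. A sketch for watching it converge back to HEALTH_OK:

ceph -s   # overall health and recovery progress
ceph -w   # follow cluster events as backfill proceeds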