Zabbix raised an alert that the journal SSD in one of our Ceph nodes has passed 96% of its rated write endurance (PercentageUsed : 97). According to Intel, once a drive reaches its rated write endurance it can no longer be written to reliably.
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50
Twelve OSDs use this drive for their journals:
[root@ceph-11 ~]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0   5.5T  0 disk
└─sda1          8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb             8:16   0   5.5T  0 disk
└─sdb1          8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc             8:32   0   5.5T  0 disk
└─sdc1          8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd             8:48   0   5.5T  0 disk
└─sdd1          8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde             8:64   0   5.5T  0 disk
└─sde1          8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf             8:80   0   5.5T  0 disk
└─sdf1          8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg             8:96   0   5.5T  0 disk
└─sdg1          8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh             8:112  0   5.5T  0 disk
└─sdh1          8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi             8:128  0   5.5T  0 disk
└─sdi1          8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj             8:144  0   5.5T  0 disk
└─sdj1          8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk             8:160  0   5.5T  0 disk
└─sdk1          8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl             8:176  0   5.5T  0 disk
└─sdl1          8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm             8:192  0 419.2G  0 disk
└─sdm1          8:193  0 419.2G  0 part /
nvme0n1       259:0    0 372.6G  0 disk
├─nvme0n1p1   259:1    0    30G  0 part
├─nvme0n1p2   259:2    0    30G  0 part
├─nvme0n1p3   259:3    0    30G  0 part
├─nvme0n1p4   259:4    0    30G  0 part
├─nvme0n1p5   259:5    0    30G  0 part
├─nvme0n1p6   259:6    0    30G  0 part
├─nvme0n1p7   259:7    0    30G  0 part
├─nvme0n1p8   259:8    0    30G  0 part
├─nvme0n1p9   259:9    0    30G  0 part
├─nvme0n1p10  259:10   0    30G  0 part
├─nvme0n1p11  259:11   0    30G  0 part
└─nvme0n1p12  259:12   0    30G  0 part
[root@ceph-11 ~]#
1. Lower the OSD primary affinity
Most failure scenarios of this kind require powering the machine off. To keep the operation invisible to users, we first lower the priority of the node about to be serviced. Check the Ceph version first: this cluster runs 10.x (Jewel). We have primary-affinity support enabled, so a client IO request is handled by the primary PG first and then written to the other replicas. We therefore find the OSDs belonging to host ceph-11 and set their primary-affinity to 0, meaning the PGs on them should not become primary unless the other replicas are down.
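One quick way to confirm the version and pull out just this host's subtree (a sketch added here for convenience; the 12-line grep window matches the 12 OSDs shown next):

ceph -v                                    # should report 10.x (Jewel)
ceph osd tree | grep -A 12 "host ceph-11"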
-12  65.47299     host ceph-11
 87   5.45599         osd.87        up  1.00000          0.89999
 88   5.45599         osd.88        up  0.79999          0.29999
 89   5.45599         osd.89        up  1.00000          0.89999
 90   5.45599         osd.90        up  1.00000          0.89999
 91   5.45599         osd.91        up  1.00000          0.89999
 92   5.45599         osd.92        up  1.00000          0.79999
 93   5.45599         osd.93        up  1.00000          0.89999
 94   5.45599         osd.94        up  1.00000          0.89999
 95   5.45599         osd.95        up  1.00000          0.89999
 96   5.45599         osd.96        up  1.00000          0.89999
 97   5.45599         osd.97        up  1.00000          0.89999
 98   5.45599         osd.98        up  0.89999          0.89999
Set the primary affinity of osd.87 through osd.98 to 0:

for osd in {87..98}; do ceph osd primary-affinity "$osd" 0; done
Running ceph osd tree again shows the updated values for these OSDs:
-12  65.47299     host ceph-11
 87   5.45599         osd.87        up  1.00000                0
 88   5.45599         osd.88        up  0.79999                0
 89   5.45599         osd.89        up  1.00000                0
 90   5.45599         osd.90        up  1.00000                0
 91   5.45599         osd.91        up  1.00000                0
 92   5.45599         osd.92        up  1.00000                0
 93   5.45599         osd.93        up  1.00000                0
 94   5.45599         osd.94        up  1.00000                0
 95   5.45599         osd.95        up  1.00000                0
 96   5.45599         osd.96        up  1.00000                0
 97   5.45599         osd.97        up  1.00000                0
 98   5.45599         osd.98        up  0.89999                0
2. Prevent OSDs from being marked out

ceph osd set noout
By default, an OSD that stays unresponsive for too long is automatically marked out of the cluster, which triggers data migration. Powering down to replace the SSD takes a while, so we temporarily stop the cluster from marking OSDs out to avoid pointless back-and-forth migration. Use ceph -s to confirm the change: the cluster state turns to HEALTH_WARN with an extra note that the noout flag is set, and noout appears in the flags line.
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            noout flag(s) set
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73511: 111 osds: 108 up, 108 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v85913863: 5064 pgs, 24 pools, 89164 GB data, 12450 kobjects
            261 TB used, 141 TB / 403 TB avail
                5060 active+clean
                   4 active+clean+scrubbing+deep
  client io 27608 kB/s rd, 59577 kB/s wr, 399 op/s rd, 668 op/s wr
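The flag can also be read straight from the osdmap; a minimal check (this grep is our addition, not part of the original run):

ceph osd dump | grep flags   # expect: flags noout,sortbitwise,require_jewel_osds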
3. Check whether the primary PGs have moved
[root@ceph-11 ~]# ceph pg ls | grep "\[9[1-8],"
13.24  5066 0 0 0 0 41480507922 3071 3071 active+clean 2019-07-02 19:33:37.537802 73497'120563162 73511:110960694 [94,25,64] 94 [94,25,64] 94 73497'120562718 2019-07-02 19:33:37.537761 73294'120561198 2019-07-01 18:11:54.686413
13.10f 4874 0 0 0 0 39967832064 3083 3083 active+clean 2019-07-01 23:56:13.911259 73511'59603193 73511:52739094 [91,44,38] 91 [91,44,38] 91 73302'59589396 2019-07-01 23:56:13.911226 69213'5954576 2019-06-26 22:58:12.864475
13.17d 5001 0 0 0 0 40919228578 3088 3088 active+clean 2019-07-02 13:51:04.162137 73511'34680543 73511:26095334 [96,45,72] 96 [96,45,72] 96 73497'34678725 2019-07-02 13:51:04.162089 70393'3467604 2019-07-01 08:47:58.771910
13.20d 4872 0 0 0 0 40007166482 3036 3036 active+clean 2019-07-03 07:40:28.677097 73511'27811217 73511:22372286 [93,85,73] 93 [93,85,73] 93 73497'27809831 2019-07-03 07:40:28.677059 73302'2779662 2019-07-01 23:15:14.731237
13.214 5006 0 0 0 0 40940654592 3079 3079 active+clean 2019-07-02 21:10:51.094829 73511'34400529 73511:27161705 [94,61,53] 94 [94,61,53] 94 73497'34398612 2019-07-02 21:10:51.094784 73294'3439396 2019-07-01 18:54:06.249357
13.2fd 4950 0 0 0 0 40522633728 3086 3086 active+clean 2019-07-02 06:36:14.763435 73511'149011011 73511:136693896 [91,58,36] 91 [91,58,36] 91 73497'148963815 2019-07-02 06:36:14.763383 73497'148963815 2019-07-02 06:36:14.763383
13.3ae 4989 0 0 0 0 40879544320 3055 3055 active+clean 2019-07-02 00:30:44.817062 73511'67827999 73511:60578765 [91,54,25] 91 [91,54,25] 91 73302'67806651 2019-07-02 00:30:44.817017 69213'67776352
The primary PGs refuse to move. Since noout is already set and the cluster stores three replicas, we simply leave them as they are: when this machine is shut down, Ceph serves IO from the remaining replicas and no data migration is triggered.
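To count how many PGs still have one of these OSDs as acting primary, something like the following works (a sketch; it assumes Jewel's pgs_brief layout, where the acting primary is the last column):

ceph pg dump pgs_brief 2>/dev/null | awk '$NF >= 87 && $NF <= 98' | wc -l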
A cluster that stores three copies can tolerate the failure of any two hosts, so make sure the number of hosts that are down stays within that limit, to avoid turning maintenance into a larger outage.
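Before powering off, it is worth double-checking the replication size of every pool (a quick check we would add here; three-replica pools show "size 3"):

ceph osd pool ls detail | grep -w size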
4. Stop the services, power off the server, and replace the SSD
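No commands were recorded for this step; a minimal sketch, assuming systemd-managed OSDs as on a stock Jewel/EL7 install:

for osd in {87..98}; do systemctl stop ceph-osd@"$osd"; done   # stop this host's 12 OSDs
shutdown -h now                                                # then swap the NVMe journal SSD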
The replacement SSD shows zero wear (PercentageUsed : 0):
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x063F3
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x00
CrcErrorCount : 0
AverageNandEraseCycles : 7
MediaErrors : 0x00
PowerCycles : 0x012
ProgramFailCount : 0
MaxNandEraseCycles : 10
HighestLifetimeTemperature : 48
PercentageUsed : 0
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 6
LowestLifetimeTemperature : 16
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 48
5. The newly inserted disk shows up as nvme0n1
[root@ceph-11 ~]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0   5.5T  0 disk
└─sda1          8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb             8:16   0   5.5T  0 disk
└─sdb1          8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc             8:32   0   5.5T  0 disk
└─sdc1          8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd             8:48   0   5.5T  0 disk
└─sdd1          8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde             8:64   0   5.5T  0 disk
└─sde1          8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf             8:80   0   5.5T  0 disk
└─sdf1          8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg             8:96   0   5.5T  0 disk
└─sdg1          8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh             8:112  0   5.5T  0 disk
└─sdh1          8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi             8:128  0   5.5T  0 disk
└─sdi1          8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj             8:144  0   5.5T  0 disk
└─sdj1          8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk             8:160  0   5.5T  0 disk
└─sdk1          8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl             8:176  0   5.5T  0 disk
└─sdl1          8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm             8:192  0 419.2G  0 disk
└─sdm1          8:193  0 419.2G  0 part /
nvme0n1       259:0    0 372.6G  0 disk
6. Rebuild the journals
With its journal gone, an OSD cannot start after boot, so the journals have to be recreated. Edit the script below to generate the final script that will actually be executed.
#!/bin/bash
desc="create ceph journal part for specified osd."
type_journal_uuid=45b0969e-9b03-4f30-b4c6-b4b80ceff106
sgdisk=sgdisk
journal_size=30G            # journal partition size
journal_dev=/dev/nvme0n1    # SSD device name
sleep=5
osd_uuids=$(grep "" /var/lib/ceph/osd/ceph-*/journal_uuid 2>/dev/null)
die(){ echo >&2 "$@"; exit 1; }
tip(){ printf >&2 "%b" "$@"; }
[ "$osd_uuids" ] || die "no osd uuid found."
echo "osd journal uuid:"
echo "$osd_uuids"
echo "now sleep $sleep"
sleep $sleep
journal_script="/dev/shm/ceph-journal.sh"
echo "ls -l /dev/nvme0n1p*" > "$journal_script"
echo "sleep 5" >> "$journal_script"
# The partitions must be created first; only then can the name,
# uuid, and similar metadata be set successfully.
IFS=": "
while read osd_path uuid; do
    let d++
    [ "$osd_path" ] || continue
    osd_id=${osd_path#/var/lib/ceph/osd/ceph-}
    osd_id=${osd_id%/journal_uuid}
    journal_link=${osd_path%_uuid}
    [ ${osd_id:-1} -ge 0 ] || { echo "invalid osd id: $osd_id."; exit 11; }
    tip "create journal for osd $osd_id ... "
    $sgdisk --mbrtogpt --new=$d:0:+"$journal_size" \
        --change-name=$d:'ceph journal' \
        --typecode=$d:"$type_journal_uuid" \
        --partition-guid=$d:"$uuid" \
        "$journal_dev" || exit 1
    tip "part done.\n"
    ln -sfT /dev/disk/by-partuuid/"$uuid" "$journal_link" || exit 3
    echo "ceph-osd --mkjournal --osd-journal /dev/nvme0n1p"$d "-i "$osd_id >> "$journal_script"
    sleep 1
done << EOF
$osd_uuids
EOF
The script above only generates the final execution script, whose default path is /dev/shm/ceph-journal.sh.
Be sure to review the generated script by hand and confirm every operation is correct before executing it manually as root:

[root@ceph-11 ~]# bash /dev/shm/ceph-journal.sh
Generated script contents:
[root@ceph-11 ~]# cat /dev/shm/ceph-journal.sh
#!/bin/bash
ls -l /dev/nvme0n1p*
sleep 5
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p1 -i 87
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p2 -i 88
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p3 -i 89
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p4 -i 90
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p5 -i 91
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p6 -i 92
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p7 -i 93
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p8 -i 94
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p9 -i 95
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p10 -i 96
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p11 -i 97
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p12 -i 98
[root@ceph-11 ~]#
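With the journals recreated, the OSDs still have to be brought back up; the original notes do not record this step, but assuming the same systemd units as in step 4 it would be:

for osd in {87..98}; do systemctl start ceph-osd@"$osd"; done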
7. Journal replacement done; verify and restore services
The OSD services have come back up:
[root@ceph-11 ~]# ceph osd tree
ID     WEIGHT    TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10008         0 root sas6t3
-10007         0 root sas6t2
-10006 130.94598 root sas6t1
   -12  65.47299     host ceph-11
    87   5.45599         osd.87        up  1.00000                0
    88   5.45599         osd.88        up  0.79999                0
    89   5.45599         osd.89        up  1.00000                0
    90   5.45599         osd.90        up  1.00000                0
    91   5.45599         osd.91        up  1.00000                0
    92   5.45599         osd.92        up  1.00000                0
    93   5.45599         osd.93        up  1.00000                0
    94   5.45599         osd.94        up  1.00000                0
    95   5.45599         osd.95        up  1.00000                0
    96   5.45599         osd.96        up  1.00000                0
    97   5.45599         osd.97        up  1.00000                0
    98   5.45599         osd.98        up  0.89999                0
Clear the OSD flag. Everything that was changed during the maintenance window must be reverted:

ceph osd unset noout
Restore the OSD primary affinity:
[root@ceph-11 ~]# for osd in {87..98}; do ceph osd primary-affinity "$osd" 0.8; done
set osd.87 primary-affinity to 0.8 (8524282)
set osd.88 primary-affinity to 0.8 (8524282)
set osd.89 primary-affinity to 0.8 (8524282)
set osd.90 primary-affinity to 0.8 (8524282)
set osd.91 primary-affinity to 0.8 (8524282)
set osd.92 primary-affinity to 0.8 (8524282)
set osd.93 primary-affinity to 0.8 (8524282)
set osd.94 primary-affinity to 0.8 (8524282)
set osd.95 primary-affinity to 0.8 (8524282)
set osd.96 primary-affinity to 0.8 (8524282)
set osd.97 primary-affinity to 0.8 (8524282)
set osd.98 primary-affinity to 0.8 (8524282)
[root@ceph-11 ~]#
Wait for the cluster to recover
Wait for the cluster to recover automatically to the HEALTH_OK state.
If the cluster reports HEALTH_ERR during this period, follow up promptly and search Google for the specific message.
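A simple poll loop can stand in for repeatedly running ceph -s by hand (our addition; adjust the interval to taste):

until ceph health | grep -q HEALTH_OK; do ceph health; sleep 30; done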
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            12 pgs degraded
            2 pgs recovering
            10 pgs recovery_wait
            12 pgs stuck unclean
            recovery 116/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918476: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            116/38259009 objects degraded (0.000%)
                5049 active+clean
                  10 active+recovery_wait+degraded
                   3 active+clean+scrubbing+deep
                   2 active+recovering+degraded
recovery io 22105 kB/s, 4 objects/s
  client io 55017 kB/s rd, 77280 kB/s wr, 944 op/s rd, 590 op/s wr
[root@ceph-11 ~]#
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            1 pgs degraded
            1 pgs recovering
            1 pgs stuck unclean
            recovery 2/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918493: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            2/38259009 objects degraded (0.000%)
                5060 active+clean
                   3 active+clean+scrubbing+deep
                   1 active+recovering+degraded
  client io 81789 kB/s rd, 245 MB/s wr, 1441 op/s rd, 651 op/s wr
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918494: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
                5061 active+clean
                   3 active+clean+scrubbing+deep
recovery io 7388 kB/s, 0 objects/s
  client io 67551 kB/s rd, 209 MB/s wr, 1153 op/s rd, 901 op/s wr
[root@ceph-11 ~]#
The cluster has returned to a healthy state.