Version
[root@controller1 ~]# ceph -v
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
Status
Run ceph -s on the admin node.
This shows the cluster status, for example:
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62553: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v72844263: 5064 pgs, 24 pools, 93130 GB data, 13301 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 57046 kB/s rd, 35442 kB/s wr, 1703 op/s rd, 1486 op/s wr
If we need to watch the cluster continuously, there are two approaches.
The first is: ceph -w
This is the official way; the output is the same as ceph -s, except that the client io line at the bottom keeps updating.
Sometimes we also want to watch how the other fields above change, so I wrote a script:
watch -n 1 "ceph -s| awk -v ll=$COLUMNS '/^ *mds[0-9]/{ \$0=substr(\$0, 1, ll); } /^ +[0-9]+ pg/{next} /monmap/{ next } /^ +recovery [0-9]+/{next} { print}'; ceph osd pool stats | awk '/^pool/{ p=\$2 } /^ +(recovery|client)/{ if(p){print \"\n\"p; p=\"\"}; print }'"
Sample output:
Every 1.0s: ceph -s| awk -v ll=105 '/^ *mds[0-9]/{$0=substr($0, 1, ll);} /^ ...          Mon Jan 21 18:09:44 2019

    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62561: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v73183831: 5064 pgs, 24 pools, 93179 GB data, 13310 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 263 MB/s rd, 58568 kB/s wr, 755 op/s rd, 1165 op/s wr

cinder-sas
  client io 248 MB/s rd, 33529 kB/s wr, 363 op/s rd, 597 op/s wr

vms
  client io 1895 B/s rd, 2343 kB/s wr, 121 op/s rd, 172 op/s wr

cinder-ssd
  client io 15620 kB/s rd, 22695 kB/s wr, 270 op/s rd, 395 op/s wr
Usage
# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    407T     146T      260T         64.04
POOLS:
    NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
    cinder-sas     13     76271G     89.25     9186G         10019308
    images         14     649G       6.60      9186G         339334
    vms            15     7026G      43.34     9186G         1807073
    cinder-ssd     16     4857G      74.73     1642G         645823
    rbd            17     0          0         16909G        1
osd
ceph osd tree gives a quick view of the OSD topology and can be used to check OSD status and other information:
# ceph osd tree
ID      WEIGHT     TYPE NAME          UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-10008          0  root sas6t3
-10007          0  root sas6t2
-10006  130.94598  root sas6t1
   -12   65.47299      host ceph-11
    87    5.45599          osd.87          up   1.00000           0.89999
    88    5.45599          osd.88          up   0.79999           0.29999
    89    5.45599          osd.89          up   1.00000           0.89999
    90    5.45599          osd.90          up   1.00000           0.89999
    91    5.45599          osd.91          up   1.00000           0.89999
    92    5.45599          osd.92          up   1.00000           0.79999
    93    5.45599          osd.93          up   1.00000           0.89999
    94    5.45599          osd.94          up   1.00000           0.89999
    95    5.45599          osd.95          up   1.00000           0.89999
    96    5.45599          osd.96          up   1.00000           0.89999
    97    5.45599          osd.97          up   1.00000           0.89999
    98    5.45599          osd.98          up   0.89999           0.89999
   -13   65.47299      host ceph-12
    99    5.45599          osd.99          up   1.00000           0.79999
   100    5.45599          osd.100         up   1.00000           0.79999
   101    5.45599          osd.101         up   1.00000           0.79999
   102    5.45599          osd.102         up   1.00000           0.79999
   103    5.45599          osd.103         up   1.00000           0.79999
   104    5.45599          osd.104         up   0.79999           0.79999
   105    5.45599          osd.105         up   1.00000           0.79999
   106    5.45599          osd.106         up   1.00000           0.79999
   107    5.45599          osd.107         up   1.00000           0.79999
   108    5.45599          osd.108         up   1.00000           0.79999
   109    5.45599          osd.109         up   1.00000           0.79999
   110    5.45599          osd.110         up   1.00000           0.79999
I also wrote a one-liner that highlights the %USE column of ceph osd df (blue above 84, red above 90):

ceph osd df | awk -v c1=84 -v c2=90 '{z=NF-2; if($z<=100&&$z>c1){c=34;if($z>c2)c=31;$z="\033["c";1m"$z"\033[0m"}; print}'
reweight
Manual weighting
When OSD load becomes unbalanced, the weights need manual intervention. The default is 1, and we normally only lower it:

osd reweight <int[0-]> <float[0.0-1.0]>     reweight osd to 0.0 < <weight> < 1.0
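For example, to take some load off a busy OSD (the osd id 88 and the value 0.8 are only illustrative; pick values that suit your cluster):

# ceph osd reweight 88 0.8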
primary affinity
This controls how likely the PGs on an OSD are to become primary. 0 means the OSD will not be chosen as primary unless the other copies are down; 1 means it will always be primary unless the others are also set to 1. For values in between, the actual number of primary PGs is computed from the OSD topology, since different pools may sit on different OSDs.
osd primary-affinity <osdname (id|osd.id)> <float[0.0-1.0]>     adjust osd primary-affinity from 0.0 <= <weight> <= 1.0
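For example, to make an OSD less likely to be chosen as primary (again, the osd id and value are only illustrative):

# ceph osd primary-affinity osd.88 0.5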
pool
These commands all start with ceph osd pool.
List the pools:

ceph osd pool ls
Append detail to see the pool details:
# ceph osd pool ls detail
pool 13 'cinder-sas' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 63138 flags hashpspool stripe_width 0
	removed_snaps [1~5,7~2,a~2,e~10,23~4,2c~24,51~2,54~2,57~2,5a~a]
pool 14 'images' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 512 pgp_num 512 last_change 63012 flags hashpspool stripe_width 0
Adjust pool attributes:
# ceph osd pool set <pool name> <attribute> <value>

osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority <val> {--yes-i-really-mean-it} : set pool parameter <var> to <val>
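For instance, to temporarily stop one pool from being scrubbed (the pool name is taken from the listing above; whether this is appropriate depends on your situation), set the pool-level scrub flags:

# ceph osd pool set cinder-sas noscrub 1
# ceph osd pool set cinder-sas nodeep-scrub 1

Set them back to 0 to re-enable scrubbing.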
pg
These commands all start with ceph pg.
Check the status:
# ceph pg stat
v79188443: 5064 pgs: 1 active+clean+scrubbing, 2 active+clean+scrubbing+deep, 5061 active+clean; 88809 GB data, 260 TB used, 146 TB / 407 TB avail; 384 MB/s rd, 134 MB/s wr, 2380 op/s
ceph pg ls can be followed by a state, or by other parameters:
# ceph pg ls | grep scrub
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
13.1e 4832 0 0 0 0 39550330880 3034 3034 active+clean+scrubbing+deep 2019-04-08 15:24:46.496295 63232'167226529 63232:72970092 [95,80,44] 95 [95,80,44] 95 63130'167208564 2019-04-07 05:16:01.452400 63130'167117875 2019-04-05 18:53:54.796948
13.13b 4955 0 0 0 0 40587477010 3065 3065 active+clean+scrubbing+deep 2019-04-08 15:19:43.641336 63232'93849435 63232:89107385 [87,39,78] 87 [87,39,78] 87 63130'93838372 2019-04-07 08:07:43.825933 62998'93796094 2019-04-01 22:23:14.399257
13.1ac 4842 0 0 0 0 39605106850 3081 3081 active+clean+scrubbing+deep 2019-04-08 15:26:40.119698 63232'29801889 63232:23652708 [110,31,76] 110 [110,31,76] 110 63130'29797321 2019-04-07 10:50:26.243588 62988'29759937 2019-04-01 08:19:34.927978
13.31f 4915 0 0 0 0 40128633874 3013 3013 active+clean+scrubbing 2019-04-08 15:27:19.489919 63232'45174880 63232:38010846 [99,25,42] 99 [99,25,42] 99 63130'45170307 2019-04-07 06:29:44.946734 63130'45160962 2019-04-05 21:30:38.849569
13.538 4841 0 0 0 0 39564094976 3003 3003 active+clean+scrubbing 2019-04-08 15:27:15.731348 63232'69555013 63232:58836987 [109,85,24] 109 [109,85,24] 109 63130'69542700 2019-04-07 08:09:00.311084 63130'69542700 2019-04-07 08:09:00.311084
13.71f 4851 0 0 0 0 39552301568 3014 3014 active+clean+scrubbing 2019-04-08 15:27:16.896665 63232'57281834 63232:49191849 [100,75,66] 100 [100,75,66] 100 63130'57247440 2019-04-07 05:43:44.886559 63008'57112775 2019-04-03 05:15:51.434950
13.774 4867 0 0 0 0 39723743842 3092 3092 active+clean+scrubbing 2019-04-08 15:27:19.501188 63232'32139217 63232:28360980 [101,63,21] 101 [101,63,21] 101 63130'32110484 2019-04-07 06:24:22.174377 63130'32110484 2019-04-07 06:24:22.174377
13.7fe 4833 0 0 0 0 39485484032 3015 3015 active+clean+scrubbing+deep 2019-04-08 15:27:15.699899 63232'38297730 63232:32962414 [108,82,56] 108 [108,82,56] 108 63130'38286258 2019-04-07 07:59:53.586416 63008'38267073 2019-04-03 14:44:02.779800
You can also use the ls-by-* variants:
pg ls {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-pool <poolstr> {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
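For example, to list only the scrubbing PGs of a single pool (pool name from this cluster; adjust to yours):

# ceph pg ls-by-pool cinder-sas scrubbing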
Repair
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Routine troubleshooting
pg inconsistent
A PG in the inconsistent state indicates this problem; the "1 scrub errors" that follows shows that it is scrub-related:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+inconsistent, acting [110,55,21]
1 scrub errors
noout flag(s) set
Run the following:
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Check
At this point you can see that 13.e1 has entered deep scrub:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+scrubbing+deep+inconsistent+repair, acting [110,55,21]
1 scrub errors
noout flag(s) set
After a while, the error disappears and pg 13.e1 returns to the active+clean state:
# ceph health detail
HEALTH_WARN noout flag(s) set
noout flag(s) set
Root cause
Ceph scrubs PGs periodically. The inconsistent state does not necessarily mean the data really is inconsistent; it only means the data and its checksum disagree. When you run repair, Ceph performs a deep scrub to determine whether the data is actually inconsistent. If the deep scrub passes, the data is fine and only the checksum needs to be corrected.
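If you want to see what the scrub actually flagged before running repair, there is also a rados subcommand for this (as far as I recall it is available since Jewel; the pg id here is the one from the example above):

# rados list-inconsistent-obj 13.e1 --format=json-pretty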
request blocked for XXs
Locate the OSDs whose requests are blocked:

ceph health detail | grep blocked
Then lower the primary affinity of the OSDs found above; this diverts some primary PGs elsewhere and reduces their pressure. The previous value can be checked with ceph osd tree:

ceph osd primary-affinity OSD_ID <a value lower than before>
This is mostly caused by an unbalanced cluster: some OSDs are under too much pressure and cannot process requests in time. If it happens frequently, it is worth investigating the cause (a workflow sketch follows this list):
1. If it is because client IO demand has grown, try to optimize the clients and cut unnecessary reads and writes.
2. If it is because certain OSDs consistently cannot process requests, temporarily lower those OSDs' primary affinity, and keep watching them, since this can be an early sign of disk failure.
3. If all OSDs backed by the same journal SSD show this problem, check whether that journal SSD has a write bottleneck or is failing.
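A minimal sketch of that workflow, assuming the blocked requests pointed at osd.93 and that 0.5 is an acceptable temporary value (both are only illustrative):

# ceph health detail | grep blocked        # find the OSDs with blocked requests
# ceph osd tree | grep -w osd.93           # note its current reweight / primary affinity
# ceph osd primary-affinity osd.93 0.5     # temporarily lower its primary affinity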