### Environment
### Symptoms

While simulating a single-node failure, the same osd was manually removed and re-added several times (only its data and keyring were deleted; the crushmap was not touched). In the end, the newly added ceph-osd process started and its startup log showed no errors, but the OSD never entered the up state.
```
2016-04-01 11:19:16.868837 7fee3654b900 0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 104255
.....
2016-04-01 11:19:19.295992 7fee3654b900 0 osd.139 12789 crush map has features 2200130813952, adjusting msgr requires for clients
2016-04-01 11:19:19.296008 7fee3654b900 0 osd.139 12789 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2016-04-01 11:19:19.296016 7fee3654b900 0 osd.139 12789 crush map has features 2200130813952, adjusting msgr requires for osds
2016-04-01 11:19:19.296052 7fee3654b900 0 osd.139 12789 load_pgs
2016-04-01 11:19:19.296094 7fee3654b900 0 osd.139 12789 load_pgs opened 0 pgs
2016-04-01 11:19:19.296878 7fee3654b900 -1 osd.139 12789 log_to_monitors {default=true}
2016-04-01 11:19:19.305091 7fee246f1700 0 osd.139 12789 ignoring osdmap until we have initialized
2016-04-01 11:19:19.305239 7fee246f1700 0 osd.139 12789 ignoring osdmap until we have initialized
2016-04-01 11:19:19.305425 7fee3654b900 0 osd.139 12789 done with init, starting boot process
```
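The stuck state itself is easiest to confirm with the standard status commands. The original post does not show this output, so the following is only a sketch of what one would typically check (osd.139 and pid 104255 are taken from the log above):

```
# Cluster summary: osd.139 is counted but never reported as up
ceph -s

# Per-OSD view: osd.139 keeps showing as down even though its daemon is running
ceph osd tree | grep osd.139

# The daemon process itself is alive (pid from the startup log)
ps -p 104255
```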
With debug osd = 20 enabled, the log showed the daemon looping endlessly through the following operations:
```
2016-04-01 11:46:23.300790 7f9219d15700 20 osd.139 12813 update_osd_stat osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:23.300821 7f9219d15700 5 osd.139 12813 heartbeat: osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:25.200613 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:25.200644 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:25.200648 7f9231e86700 10 osd.139 12813 do_waiters -- finish
2016-04-01 11:46:25.600974 7f9219d15700 20 osd.139 12813 update_osd_stat osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:25.601002 7f9219d15700 5 osd.139 12813 heartbeat: osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:26.200759 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:26.200784 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:26.200788 7f9231e86700 10 osd.139 12813 do_waiters -- finish
2016-04-01 11:46:27.200867 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:27.200892 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:27.200895 7f9231e86700 10 osd.139 12813 do_waiters -- finish
2016-04-01 11:46:28.201002 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:28.201022 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:28.201030 7f9231e86700 10 osd.139 12813 do_waiters -- finish
2016-04-01 11:46:29.101147 7f9219d15700 20 osd.139 12813 update_osd_stat osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:29.101180 7f9219d15700 5 osd.139 12813 heartbeat: osd_stat(538 MB used, 3723 GB avail, 3724 GB total, peers []/[] op hist [])
2016-04-01 11:46:29.201115 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:29.201128 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:29.201132 7f9231e86700 10 osd.139 12813 do_waiters -- finish
2016-04-01 11:46:30.201237 7f9231e86700 5 osd.139 12813 tick
2016-04-01 11:46:30.201267 7f9231e86700 10 osd.139 12813 do_waiters -- start
2016-04-01 11:46:30.201271 7f9231e86700 10 osd.139 12813 do_waiters -- finish
```
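For reference, the debug osd = 20 level used above can normally be raised on a running daemon without a restart. Both forms below are standard Ceph commands, though the post does not say which method was actually used:

```
# Inject the higher debug level through the monitors
ceph tell osd.139 injectargs '--debug-osd 20/20'

# Or set it locally via the OSD's admin socket
ceph daemon osd.139 config set debug_osd 20/20
```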
### Solution

1. Remove the corresponding OSD entry from the crush map:
```
ceph osd crush remove osd.139    # note: this may trigger data migration
```
2. Start the osd service and add the OSD back into the crushmap (a verification sketch follows below):
```
ceph osd crush add 139 1.0 host=xxx
```
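Once the entry has been added back, the OSD should finish its boot process and be marked up shortly afterwards. A hedged sketch for watching this (standard commands, not part of the original post):

```
# Follow cluster events: the crush changes, any resulting recovery, and the boot message for osd.139
ceph -w

# Confirm osd.139 is now up and weighted in the crush tree
ceph osd tree | grep osd.139
```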
### Summary

Frequently adding and removing the same osd may have triggered some bug that prevented the osdmap from being updated in real time, so the map had to be refreshed manually by operating on the crushmap (removing and re-adding the osd's crush entry).
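A hedged way to observe the stale-osdmap symptom directly is to compare the epoch the stuck OSD is sitting on (12813 in the debug log above) against the cluster's current epoch; the post does not show this check, so the commands below are only a sketch:

```
# Current osdmap epoch according to the monitors
ceph osd dump | grep ^epoch

# Epoch range and boot state as seen by the osd daemon itself (via its admin socket)
ceph daemon osd.139 status
```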