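This case is reposted from material shared by 李大玉.
The liveness-probe script failed 8 consecutive probes, judged the primary to be abnormal, and triggered a switchover (check whether the standby has replication lag, kill the original primary, float the VIP to the standby, make the new primary writable).
After the switchover the business was still abnormal: SQL queries did not return and the DB connections were exhausted. To restore service, the new primary was restarted, after which the business recovered.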
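For readers unfamiliar with this kind of probe, the "fail 8 times in a row" rule can be sketched in shell roughly as follows. This is only an illustration and not the actual HA tool: the function name check_primary, the ha.heartbeat table, and the 3-second timeout are assumptions made for the example. The post only states that the probe updates an InnoDB table and gives up after 8 consecutive failures.

```shell
#!/bin/sh
# check_primary MAX_FAILS INTERVAL CMD...
# Runs CMD up to MAX_FAILS times, INTERVAL seconds apart.
# Returns 0 as soon as one attempt succeeds (the failure streak is broken),
# returns 1 if every attempt fails, i.e. MAX_FAILS consecutive failures.
check_primary() {
    max_fails=$1
    interval=$2
    shift 2
    fails=0
    while [ "$fails" -lt "$max_fails" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        fails=$((fails + 1))
        # no need to wait after the final failed attempt
        if [ "$fails" -lt "$max_fails" ]; then
            sleep "$interval"
        fi
    done
    return 1
}

# Hypothetical usage (ha.heartbeat is an assumed probe table; timeout bounds a hung UPDATE):
#   check_primary 8 1 timeout 3 mysql -S /tmp/mysql.sock3306 \
#       -e "update ha.heartbeat set ts = now() where id = 1" \
#       || echo "primary considered down, start switchover"
```

Note that a probe written this way fails whenever its UPDATE cannot get into InnoDB in time, even though the mysqld process itself is alive.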
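Two problems
1. Based on system uptime, the system logs, and the server's out-of-band logs, OOM could be ruled out for both the server and the database.
2. The per-minute thread snapshots showed a large number of threads in the Sending data and statistics states at the time of the failure, yet the snapshot from one minute earlier showed no blocking or lock waits at all.
3. The blocked SQL statements all access data through the primary key or a business index, so in theory they are fine. Extracting them and running them on a replica returned quickly, and no slow-query log entries were produced at the time, so slow SQL execution was ruled out as the cause of the blocking.
4. The statistics state means the server is generating an execution plan for the statement, which triggers statistics collection on the table. That collection samples pages of the table and therefore triggers I/O. Disk IOPS, throughput, and CPU load at the time were essentially the same as on the previous day, so a performance drop caused by system load was ruled out.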
-----------------
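At this point the investigation hit a dead end.
During the post-mortem, the developers reported that after the business was switched to the new DB, every interface became slower, and asked whether the new server performed worse than the old one.
After the meeting, a look at the instance showed 20 statements in the User sleep state, all roughly of the form where id = ''+(select 'rbzd' where 6910=6910 and sleep(300))+''. This was suspicious, because the developers would not call the sleep function in application code.
Focus: analyzing this SQL
The statement consumes no system resources, but the way it is written is suspicious and looks like SQL injection.
Investigation showed that a statement that is sleeping still occupies one of the slots governed by the innodb_thread_concurrency parameter, so at most 24 statements can be inside InnoDB at the same time.
In other words, once 24 threads have entered the engine and are sleeping there, no other thread can enter the InnoDB engine layer. The number 24 comes from the production MySQL setting: innodb_thread_concurrency is set to 24.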
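A quick way to see how close an instance is to this condition is to compare the number of User sleep statements with the configured limit. The sketch below is an assumption-laden illustration: the helper names count_user_sleep and slots_exhausted are made up for this example, and it only counts sleeping statements, so it gives a lower bound on the number of threads actually inside InnoDB.

```shell
#!/bin/sh
# count_user_sleep: reads tab-separated "id<TAB>state" rows on stdin
# and prints how many of them are in the "User sleep" state.
count_user_sleep() {
    awk -F '\t' '$2 == "User sleep" { n++ } END { print n + 0 }'
}

# slots_exhausted LIMIT COUNT: succeeds when COUNT sleeping statements are
# enough to fill all LIMIT slots. A LIMIT of 0 means "no limit" in MySQL,
# so it can never be exhausted.
slots_exhausted() {
    [ "$1" -gt 0 ] && [ "$2" -ge "$1" ]
}

# Hypothetical usage against a live instance (-N drops the header, -B gives tab-separated rows):
#   limit=$(mysql -N -B -e "select @@global.innodb_thread_concurrency")
#   n=$(mysql -N -B -e "select id, state from information_schema.processlist" | count_user_sleep)
#   slots_exhausted "$limit" "$n" && echo "all InnoDB slots are held by sleeping statements"
```

With the values from this incident, 24 sleeping statements against a limit of 24 would report exhaustion, while 20 would not.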
CREATE TABLE `test` (
  `id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
insert into test select 1;
Set innodb_thread_concurrency to 24 beforehand to simulate the production environment.
[root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+''" & done
After running the command above, open another session and query the test table. The thread hangs without returning, in the Sending data state, as shown below:
(root@localhost) [test]> show processlist;
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
| Id | User | Host            | db   | Command | Time | State        | Info                                                                                   |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
|  2 | root | localhost:38204 | NULL | Sleep   |    0 |              | NULL                                                                                   |
|  3 | root | localhost       | test | Query   |    0 | starting     | show processlist                                                                       |
|  4 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  5 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  6 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  7 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  8 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  9 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 10 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 11 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 12 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 13 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 14 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 15 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 16 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 17 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 18 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 19 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 20 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 21 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 22 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 23 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 24 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 25 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 26 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 27 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 28 | root | localhost       | NULL | Query   |   11 | Sending data | select id from test.test                                                               |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
27 rows in set (0.00 sec)
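1. SQL injection: once the number of statements in the User sleep state reached 24, no other SQL could enter InnoDB, including the high-availability probe. Because the thread snapshots filtered out sleep, the injected SQL was never captured, which made the later troubleshooting harder.
2. The primary's liveness probe failed (the probe works by updating an InnoDB table). After 8 consecutive failures, the HA system considered the instance abnormal, killed the primary, and triggered the switchover.
3. After the switchover to the new primary, the injection continued, so the same failure repeated on the new primary.
4. After the new primary was restarted, only 20 injected statements entered InnoDB and stayed in User sleep. Since the threshold of 24 was not reached, the business appeared normal, but performance was worse than on the old primary because 20 threads were stuck inside InnoDB and never left. After these threads were killed, the business returned to normal.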
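The final cleanup step, killing the stuck threads, could be scripted along the following lines. This is a minimal sketch rather than the procedure used in the incident: the helper name gen_kill and the 60-second cutoff in the usage note are assumptions. It matches the State column exactly against User sleep, so idle connections (Command = Sleep with an empty state) are not touched.

```shell
#!/bin/sh
# gen_kill MIN_SECONDS: reads tab-separated "id<TAB>time<TAB>state" rows on stdin
# and prints one KILL statement for every row whose state is exactly
# "User sleep" and whose time is at least MIN_SECONDS.
gen_kill() {
    awk -F '\t' -v min="$1" \
        '$3 == "User sleep" && ($2 + 0) >= (min + 0) { print "KILL " $1 ";" }'
}

# Hypothetical usage: review the generated statements first, then feed them back to mysql.
#   mysql -N -B -e "select id, time, state from information_schema.processlist" \
#       | gen_kill 60 > /tmp/kill_user_sleep.sql
#   mysql < /tmp/kill_user_sleep.sql
```

Generating the statements into a file first keeps a human in the loop, which seems prudent when the rows being killed come from an attacker-controlled query pattern. Killing the threads only treats the symptom. The injection point in the application still has to be fixed.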