被sleep開了個小玩笑

本案例轉載自李大玉老師分享mysql

Ⅰ、問題背景

探活腳本連續8次探測,判斷主庫異常,觸發切換(判斷備機是否有延遲,kill原主,VIP飄到備機,設置新主可寫)
切換後,業務仍是異常,SQL查詢沒返回,DB鏈接數耗完,爲了恢復業務,重啓新主後業務恢復sql

兩個問題數據庫

  • 主庫爲何會切換
  • 切換後新主庫爲何仍是不可用

Ⅱ、問題排查

一、根據系統運行時間、系統日誌以及服務器帶外日誌可排除服務器和數據庫OOM
二、從每分鐘的線程快照中發現故障時大量線程處於sending data和statistics狀態,但前一分鐘快照中未看到任何阻塞與鎖等待
三、被阻塞的sql都是基於主鍵或者業務索引訪問,理論上沒問題,提取sql在從庫執行一遍,很快,且當時沒有產生慢日誌,固排除sql執行效率慢致使阻塞
四、statistics是爲sql生成執行計劃的,會觸發表的統計操做,而統計操做須要對錶中page進行採樣,會觸發io,分析當時磁盤iops、吞吐量、cpu負載等,和前一天基本吻合,排除系統負載致使性能降低
-----------------
思路中斷~~~centos

覆盤,開發反映業務切到新db,業務各個接口耗時變大,詢問是否新庫服務器性能不如老庫
會後上機器發現實例上存在20個處於user sleep狀態的sql,大概模型都是where id = '+(select 'rbzd' where 6910=6910 and sleep(300)+)',比較可疑,由於開發不會在程序中調用sleep函數服務器

重點分析此sql
此sql不佔用系統資源,可是寫法可疑,相似sql注入
經查,sleep操做和innodb_thread_concurrency參數互斥,這樣每秒只能處理24個sql
換言之當有24個線程進入引擎並處於sleep狀態的話,其餘線程是不能進入innodb引擎層,這裏的24是和線上MySQL參數innodb_thread_concurrency被設置爲24有關session

Ⅲ、模擬故障

3.1 準備數據

CREATE TABLE `test` (
`id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

insert into test select 1;

3.2 開24個線程訪問此表

提早設置innodb_thread_concurrency設置爲24模擬線上狀況函數

[root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+''" & done

3.3 觀察一波

執行上一步腳本後,再開一個session訪問test表會發現線程hang住,沒返回,狀態爲sending data,以下:性能

(root@localhost) [test]> show processlist;
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
| Id | User | Host            | db   | Command | Time | State        | Info                                                                                   |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
|  2 | root | localhost:38204 | NULL | Sleep   |    0 |              | NULL                                                                                   |
|  3 | root | localhost       | test | Query   |    0 | starting     | show processlist                                                                       |
|  4 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  5 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  6 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  7 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  8 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
|  9 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 10 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 11 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 12 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 13 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 14 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 15 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 16 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 17 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 18 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 19 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 20 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 21 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 22 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 23 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 24 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 25 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 26 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 27 | root | localhost       | NULL | Query   |   23 | User sleep   | select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+'' |
| 28 | root | localhost       | NULL | Query   |   11 | Sending data | select id from test.test                                                               |
+----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
27 rows in set (0.00 sec)

Ⅳ、故障回顧

一、sql注入,user sleep狀態的sql累計到24條以後,其餘sql就不能進入innodb進行操做,包括高可用探測程序,因爲線程快照中過濾了sleep,致使沒抓到注入的sql,加大了後續排查難度優化

二、主庫的存活探測程序檢查失敗(探測方式爲update一個innodb表),連續8次失敗後,ha認爲實例異常,則kill主,觸發切換流程線程

三、切換到主庫後,注入還在繼續,因此一樣的故障在新主上重演

四、重啓主庫後,僅有20個注入進入innodb且一直爲user sleep,因爲沒達到24個觸發閾值,因此業務表現正常,只是性能不及老主庫,緣由就是已經有20個線程在innodb層一直沒退出,kill掉這些線程,業務恢復正常

Ⅴ、問題解決

  • 因爲線程快照過濾了sleep關鍵字,致使排查過程較長,後續要對user sleep狀態線程保留在線程快照中
  • 開發優化程序,防止sql注入
相關文章
相關標籤/搜索