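The masters run MySQL 5.7.22 and the replica runs MariaDB 10.2.17. This is a multi-source replication setup: several MySQL master instances replicate into a single MariaDB replica instance.
One replication channel, rep_erp, stopped applying data: tables in the db_core_assets database on the db-saas master were no longer being synchronized to the replica. Yet on replica db-read1 both the IO thread and the SQL thread of that channel stayed at Yes the whole time. Replication monitoring based on these two thread states was therefore useless and never raised an alert. In the end it was the developers who reported the problem.
Cause: that afternoon an index was added with pt-osc to table b_refit_bill_sn (over 4 million rows) in db_core_assets on the db-saas master. This triggered a Waiting for table metadata lock wait on channel rep_erp of replica db-read1, which blocked the channel so that it stopped applying data.
The index was added to the large table db_core_assets.b_refit_bill_sn on the master, and once that had finished, channel rep_erp on replica db-read1 showed Waiting for table metadata lock.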
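Log in to replica db-read1 and run show all slaves status\G
The first check of all replication channels on db-read1 showed every status as Yes. A second show all slaves status\G on db-read1, however, showed channel rep_erp with Slave_SQL_Running_State: Waiting for table metadata lock
Running show slave 'rep_erp' status\G several times in a row
showed that Exec_Master_Log_Pos: 151201185 never changed, which means the channel had been blocked the whole time and was not applying data.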
Exec_Master_Log_Pos: 151201185
Relay_Log_Space: 1146902246
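Because both threads reported Yes while the applied position stood still, a monitoring check has to look at more than the two thread flags. The following is a minimal sketch, not the monitoring actually used here: the function name channel_is_stalled and the idea of passing successive status rows as dictionaries are assumptions for illustration. It flags a channel whose threads are both Yes, whose applied position (Relay_Master_Log_File, Exec_Master_Log_Pos) has not moved across the samples, and whose IO thread has already fetched events beyond that position.

```python
from typing import Mapping, Sequence


def channel_is_stalled(samples: Sequence[Mapping[str, str]]) -> bool:
    """Return True if a channel looks blocked even though both threads say Yes.

    `samples` holds successive SHOW SLAVE STATUS rows for ONE channel,
    oldest first, each as a dict of column name -> string value.
    (Hypothetical helper; how the rows are fetched is left to the caller.)
    """
    if len(samples) < 2:
        return False  # one sample cannot show whether the position moves

    threads_ok = all(
        s["Slave_IO_Running"] == "Yes" and s["Slave_SQL_Running"] == "Yes"
        for s in samples
    )

    def applied(s: Mapping[str, str]):
        # Position in the master's binlog that the SQL thread has executed.
        return (s["Relay_Master_Log_File"], int(s["Exec_Master_Log_Pos"]))

    def fetched(s: Mapping[str, str]):
        # Position in the master's binlog that the IO thread has read.
        return (s["Master_Log_File"], int(s["Read_Master_Log_Pos"]))

    frozen = all(applied(s) == applied(samples[0]) for s in samples)
    backlog = fetched(samples[-1]) != applied(samples[-1])
    return threads_ok and frozen and backlog
```

An idle master also leaves the applied position unchanged, which is why the sketch additionally requires a backlog between the fetched and the applied position before reporting a stall.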
Running the following command revealed what the channel was waiting on:
[root@db-read1 ~]# mysql -e "show full processlist"|grep 'Waiting for table metadata lock'
30 system user db_core_assets Slave_SQL 15205 Waiting for table metadata lock CREATE DEFINER=`pt_tools`@`172.16.0.237` TRIGGER `pt_osc_db_core_assets_b_refit_bill_sn_del` AFTER DELETE ON `db_core_assets`.`b_refit_bill_sn` FOR EACH ROW DELETE IGNORE FROM `db_core_assets`.`_b_refit_bill_sn_new` WHERE `db_core_assets`.`_b_refit_bill_sn_new`.`id` <=> OLD.`id` 0.000
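I assumed that restarting the rep_erp channel would fix it, but it made no difference:
stop slave 'rep_erp';
start slave 'rep_erp';
So why did this happen? With that question in mind, I looked at the transactions currently running: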
root@db-read1 20:26: [db_core_assets]> select * from information_schema.innodb_trx\G
*************************** 1. row ***************************
trx_id: 422133762488088
trx_state: RUNNING
trx_started: 2021-04-10 19:53:10
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 288863654
trx_query: select count(*) from (
    SELECT d.device_id, d.in_storage_rolex
    FROM db_core_assets.b_refit_bill b
    LEFT JOIN db_core_assets.b_refit_bill_sku s ON b.id = s.refit_bill_id
    LEFT JOIN db_core_assets.b_refit_bill_sn n ON n.refit_sku_id = s.id
    LEFT JOIN db_core_assets.d_depreciation d ON d.device_id = n.device_id
        AND b.created_date >= '2021-01-01 00:00:00'
        AND b.created_date <= '2021-04-01 00:00:00'
        AND s.refit_operate_type = 1
        AND s.sku_type = 1
    ) a
trx_operation_state: NULL
trx_tables_in_use: 4
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_is_read_only: 1
trx_autocommit_non_locking: 1
1 row in set (0.00 sec)
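Root cause analysis:
Before pt-osc started adding the index to db_core_assets.b_refit_bill_sn on the db-saas master, a query transaction touching db_core_assets.b_refit_bill_sn had already been running on replica db-read1 and had not yet finished or committed.
The online DDL on db_core_assets.b_refit_bill_sn from the master arrived on the replica at exactly that time, so the replica's SQL thread ended up in Waiting for table metadata lock and replication was blocked.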
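Scenarios in which Waiting for table metadata lock appears:
When MySQL runs DDL such as alter table, the statement sometimes ends up in Waiting for table metadata lock.
Moreover, once an alter table on TableA is stuck in Waiting for table metadata lock, no later operation on TableA (reads included) can proceed, because each of them also joins the Waiting for table metadata lock queue during the Opening tables stage.
If such a lock wait queue builds up on a table of a core business system, the consequences can be disastrous.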
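The queueing behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch and not how the server implements metadata locking: the class MetadataLockQueue, the two modes "shared" and "exclusive", and the strict first-in-first-out queue are all assumptions made for illustration, whereas the real server has more lock types and its own priority rules. It shows only the effect that matters here: once an exclusive request is waiting, later shared requests queue up behind it.

```python
from collections import deque


class MetadataLockQueue:
    """Toy model of the metadata lock on one table (illustration only)."""

    def __init__(self):
        self.holders = []        # (session, mode) pairs currently granted
        self.waiting = deque()   # (session, mode) pairs waiting, in arrival order

    def _compatible(self, mode):
        # Anything is compatible with no holders; otherwise only shared with shared.
        if not self.holders:
            return True
        return mode == "shared" and all(m == "shared" for _, m in self.holders)

    def request(self, session, mode):
        """Ask for the lock; return True if granted at once, False if queued."""
        if not self.waiting and self._compatible(mode):
            self.holders.append((session, mode))
            return True
        self.waiting.append((session, mode))
        return False

    def release(self, session):
        """Drop the session's lock and grant queued requests from the front."""
        self.holders = [(s, m) for s, m in self.holders if s != session]
        while self.waiting and self._compatible(self.waiting[0][1]):
            self.holders.append(self.waiting.popleft())


lock = MetadataLockQueue()
lock.request("long_report_query", "shared")   # long-running read holds the lock
lock.request("alter_table", "exclusive")      # DDL has to wait for the read
lock.request("new_reader", "shared")          # queued behind the waiting DDL
```

In this model the new reader is not blocked by the long query itself but by the alter table request waiting in front of it, which matches the pile-up described above.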
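show processlist shows an operation in progress on TableA (reads included). In that case the alter table statement cannot obtain the exclusive metadata lock and waits.
This is the most basic case, and it does not conflict with online DDL in MySQL 5.6. In a typical alter table, the exclusive metadata lock is acquired at the after create step. Once the operation reaches the altering table stage (usually the most time-consuming step), reads and writes on the table proceed normally. That is online DDL at work: writes are no longer blocked for the whole duration of the alter table as they used to be.
(Of course, not every kind of alter operation can run online. For details see the official manual: http://dev.mysql.com/doc/refman/5.6/en/innodb-create-index-overview.html)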
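Fix: kill the session that is running the DDL.
So for the problem above, I killed the session running the DDL, which here is the replication SQL thread with Id 30: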
[root@db-read1 ~]# mysql -e "show full processlist"|grep 'Waiting for table metadata lock'
30 system user db_core_assets Slave_SQL 15205 Waiting for table metadata lock CREATE DEFINER=`pt_tools`@`172.16.0.237` TRIGGER `pt_osc_db_core_assets_b_refit_bill_sn_del` AFTER DELETE ON `db_core_assets`.`b_refit_bill_sn` FOR EACH ROW DELETE IGNORE FROM `db_core_assets`.`_b_refit_bill_sn_new` WHERE `db_core_assets`.`_b_refit_bill_sn_new`.`id` <=> OLD.`id` 0.000
KILL 30;
Checking the channel afterwards shows that replication now reports an error:
root@db-read1 20:15: [db_core_assets]> show slave 'rep_erp' status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.0.127
Master_User: backup
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: saas-mysql-bin.011335
Read_Master_Log_Pos: 97512499
Relay_Log_File: db-read1-relay-bin-rep_erp.003561
Relay_Log_Pos: 105416465
Relay_Master_Log_File: saas-mysql-bin.011327
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,db_redis.%,db_scheduler.%
Last_Errno: 1927
Last_Error: Error 'Connection was killed' on query. Default database: 'db_core_assets'. Query: 'CREATE DEFINER=`pt_tools`@`172.16.0.237` TRIGGER `pt_osc_db_core_assets_b_refit_bill_sn_del` AFTER DELETE ON `db_core_assets`.`b_refit_bill_sn` FOR EACH ROW DELETE IGNORE FROM `db_core_assets`.`_b_refit_bill_sn_new` WHERE `db_core_assets`.`_b_refit_bill_sn_new`.`id` <=> OLD.`id`'
Skip_Counter: 0
Exec_Master_Log_Pos: 151201185
Relay_Log_Space: 1119370857
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1927
Last_SQL_Error: Error 'Connection was killed' on query. Default database: 'db_core_assets'. Query: 'CREATE DEFINER=`pt_tools`@`172.16.0.237` TRIGGER `pt_osc_db_core_assets_b_refit_bill_sn_del` AFTER DELETE ON `db_core_assets`.`b_refit_bill_sn` FOR EACH ROW DELETE IGNORE FROM `db_core_assets`.`_b_refit_bill_sn_new` WHERE `db_core_assets`.`_b_refit_bill_sn_new`.`id` <=> OLD.`id`'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 172160127
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
1 row in set (0.00 sec)
I restarted the rep_erp channel: stop slave 'rep_erp'; start slave 'rep_erp';
The replication status went back to Yes/Yes, but the channel was still blocked with Slave_SQL_Running_State: Waiting for table metadata lock.
root@db-read1 20:29: [(none)]> show slave 'rep_erp' status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.0.127
Master_User: backup
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: saas-mysql-bin.011335
Read_Master_Log_Pos: 167521372
Relay_Log_File: db-read1-relay-bin-rep_erp.003561
Relay_Log_Pos: 105416465
Relay_Master_Log_File: saas-mysql-bin.011327
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: mysql.%,db_redis.%,db_scheduler.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 151201185
Relay_Log_Space: 1146902246
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 172160127
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for table metadata lock
1 row in set (0.00 sec)
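show processlist shows no operation on TableA, yet there is in fact an uncommitted transaction, which can be seen in information_schema.innodb_trx. Until that transaction finishes, the lock on TableA is not released, so alter table again cannot obtain the exclusive metadata lock.
Fix: use select * from information_schema.innodb_trx\G to find the sid (trx_mysql_thread_id) of the uncommitted transaction, then kill it so that it rolls back.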
root@db-read1 20:26: [db_core_assets]> select * from information_schema.innodb_trx\G
*************************** 1. row ***************************
trx_id: 422133762488088
trx_state: RUNNING
trx_started: 2021-04-10 19:53:10
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 288863654
trx_query: select count(*) from (
    SELECT d.device_id, d.in_storage_rolex
    FROM db_core_assets.b_refit_bill b
    LEFT JOIN db_core_assets.b_refit_bill_sku s ON b.id = s.refit_bill_id
    LEFT JOIN db_core_assets.b_refit_bill_sn n ON n.refit_sku_id = s.id
    LEFT JOIN db_core_assets.d_depreciation d ON d.device_id = n.device_id
        AND b.created_date >= '2021-01-01 00:00:00'
        AND b.created_date <= '2021-04-01 00:00:00'
        AND s.refit_operate_type = 1
        AND s.sku_type = 1
    ) a
trx_operation_state: NULL
trx_tables_in_use: 4
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_is_read_only: 1
trx_autocommit_non_locking: 1
1 row in set (0.00 sec)
Kill the session with trx_mysql_thread_id: 288863654
root@db-read1 20:30: [db_core_assets]> KILL 288863654 ;
Query OK, 0 rows affected (0.00 sec)
root@db-read1 20:29: [(none)]> show slave 'rep_erp' status\G
Replication was now healthy again and data was being applied normally.
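Picking the blocking transaction out of information_schema.innodb_trx by eye works for a single row, but the same check can be scripted. The following is a minimal sketch under stated assumptions: the helper names long_running_trx and kill_statements and the five-minute default threshold are hypothetical, and the rows are assumed to arrive as dictionaries keyed by the innodb_trx column names with trx_started formatted as in the output above. It only builds KILL statements and does not execute anything.

```python
from datetime import datetime, timedelta


def long_running_trx(rows, now, max_age=timedelta(minutes=5)):
    """Return (thread_id, age) for transactions older than max_age, oldest first.

    `rows` are innodb_trx rows as dicts; only trx_started and
    trx_mysql_thread_id are used. (Hypothetical helper for illustration.)
    """
    found = []
    for row in rows:
        started = datetime.strptime(row["trx_started"], "%Y-%m-%d %H:%M:%S")
        age = now - started
        if age > max_age:
            found.append((int(row["trx_mysql_thread_id"]), age))
    return sorted(found, key=lambda item: item[1], reverse=True)


def kill_statements(candidates):
    """Turn the candidates into KILL statements for a human to review."""
    return [f"KILL {thread_id};" for thread_id, _ in candidates]
```

Leaving the final KILL to a person is a deliberate choice in this sketch, since killing a session rolls its transaction back.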
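show processlist shows no operation on TableA, and information_schema.innodb_trx shows no transaction in progress either. This is most likely because a statement against TableA failed inside an explicit transaction (for example, a query on a nonexistent column). At that point the transaction has not started, but the locks acquired by the failed statement are still held and have not been released. The failed statement can be found in the performance_schema.events_statements_current table.
The official manual explains this as follows: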
If the server acquires metadata locks for a statement that is syntactically valid but fails during execution, it does not release the locks early. Lock release is still deferred to the end of the transaction because the failed statement is written to the binary log and the locks protect log consistency.
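In other words, apart from syntax errors, the locks acquired by a failed statement are not released until the transaction commits or rolls back. The reason given for this behaviour, "because the failed statement is written to the binary log and the locks protect log consistency", is hard to follow, though, because a failed statement is not written to the binary log at all.
Fix: find the sid of the session through performance_schema.events_statements_current and kill that session. Killing the session that runs the DDL also works.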
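One detail when looking up the session: THREAD_ID in performance_schema.events_statements_current is a performance_schema thread id, not the connection id that KILL expects, so it has to be mapped through PROCESSLIST_ID in performance_schema.threads. The following is a minimal sketch of that lookup, assuming both tables have already been read into lists of dictionaries keyed by column name. The function name sessions_with_failed_statements is hypothetical, and the sketch treats ERRORS greater than zero as the sign of a failed statement.

```python
def sessions_with_failed_statements(statements, threads):
    """Return (connection_id, sql_text) for sessions whose current statement failed.

    `statements`: rows of performance_schema.events_statements_current
                  (uses THREAD_ID, ERRORS, SQL_TEXT).
    `threads`:    rows of performance_schema.threads
                  (uses THREAD_ID, PROCESSLIST_ID).
    (Hypothetical helper for illustration.)
    """
    connection_by_thread = {t["THREAD_ID"]: t["PROCESSLIST_ID"] for t in threads}
    result = []
    for stmt in statements:
        if int(stmt.get("ERRORS", 0)) > 0:
            connection_id = connection_by_thread.get(stmt["THREAD_ID"])
            # Background threads have no connection id and cannot be killed this way.
            if connection_id is not None:
                result.append((connection_id, stmt["SQL_TEXT"]))
    return result
```

The returned connection ids are the values to check against show processlist before deciding which session to kill.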
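alter table statements are dangerous (although the danger really comes from uncommitted or long-running transactions). Before running one, confirm that the target table has no operation in progress, no uncommitted transaction, and no failed statement inside an explicit transaction. (This applies to master/replica replication setups as well: when adding a column or an index to a large table on the master, do it during off-peak hours and also check the replicas for large transactions.) If an alter table maintenance job runs unattended, set a timeout with lock_wait_timeout so that it does not sit in a long metadata lock wait.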
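The lock_wait_timeout advice can be made concrete with a small helper that prepares the statements for an unattended job. This is a minimal sketch: the function name guarded_ddl and the 30-second default are assumptions, and the index name in the usage line is made up for illustration. It only builds the statement list, and running it on a connection is left to whatever client the job uses.

```python
def guarded_ddl(ddl, lock_wait_timeout_s=30):
    """Return the statements to run, in order, on ONE session.

    Setting lock_wait_timeout for the session first makes the DDL give up
    with an error instead of waiting indefinitely for the metadata lock.
    (Hypothetical helper; the default timeout is an arbitrary example value.)
    """
    timeout = int(lock_wait_timeout_s)
    if timeout < 1:
        raise ValueError("lock_wait_timeout_s must be a positive number of seconds")
    statement = ddl.strip().rstrip(";").rstrip()
    return [
        f"SET SESSION lock_wait_timeout = {timeout};",
        f"{statement};",
    ]


# idx_example is a made-up index name used only for this illustration.
for sql in guarded_ddl(
    "ALTER TABLE db_core_assets.b_refit_bill_sn ADD INDEX idx_example (refit_sku_id)"
):
    print(sql)
```

Both statements have to run on the same connection, because the timeout is set at session scope.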