sql_slave_skip_counter,你真的用對了嗎?

寫在前面:html

    最近一段時間都在作傳統主從複製相關的測試,思考了不少上線主從複製架構後,可能會發生的問題,而後針對性設置了這些故障,再而後思考如何在保證業務可用,或者對業務衝擊儘量的小的前提下,進行故障的恢復,也算是小有所得,如今呢,筆者就主從複製故障時,須要跳過故障點時所用到的 sql_slave_skip_counte 變量進行記錄描述mysql


sql_slave_skip_counter 介紹:sql

摘自MySQL官方的解釋(強烈建議閱讀英文原文。中文版,是筆者本身的理解,只能說仁者見仁)session

SET GLOBAL sql_slave_skip_counter Syntax:
        SET GLOBAL sql_slave_skip_counter = N
This statement skips the next N events from the master. This is useful for recovering from replication stops caused by a statement.
        跳過N個events。注意:以event爲單位,而不是以事務爲單位,只有在由單條語句組成的事務時,二者纔等價。
        如:一個事務由多個EVENT組成,BEGIN;INSERT;UPDATE;DELETE;COMMOIT; 這種狀況下,二者毫不相等
This statement is valid only when the slave threads are not running. Otherwise, it produces an error.

When using this statement, it is important to understand that the binary log is actually organized as a sequence of groups known as event groups. Each event group consists of a sequence of events.

For transactional tables, an event group corresponds to a transaction.
        對於事務表,一個event group對應一個事務
or nontransactional tables, an event group corresponds to a single SQL statement.
        對於非事務表,一個event group對應一條SQL
When you use SET GLOBAL sql_slave_skip_counter to skip events and the result is in the middle of a group, the slave continues to skip events until it reaches the end of the group. Execution then starts with the next event group
        當你跳過event的時候,若是N的值,處於event group之中,那麼slave會繼續跳過event,直至跳過這個event group,從下一個event group開始
架構


對於事務表使用sql_slave_skip_counter的狀況:app


一、跳過1032複製錯誤(update/delete error)
ide

跳過由單條SQL組成的事務:工具


在Slave主機上人爲的刪除兩條數據:測試

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '9';
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '11';
this


而Master在變動上述兩條記錄的時候會報錯,致使複製中斷:

INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('孫權', '吳國', '妹妹');
UPDATE `edusoho_e`.`t1` SET xname='遊戲' WHERE id=7;
UPDATE `edusoho_e`.`t1` SET age=40 WHERE id=11;    #報錯
DELETE FROM `edusoho_e`.`t1` WHERE age=40;            #報錯
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('曹丕', '魏國', '甄姬');
DELETE FROM `edusoho_e`.`t1` WHERE id=1;
UPDATE `edusoho_e`.`t1` SET hobby='Games' WHERE id=3; 


在Slave查看主從複製狀態時,就會發現報錯信息:

mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 2176
Exec_Master_Log_Pos: 874
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1127
Slave_IO_Running: Yes
Slave_SQL_Running: No


在Master主機上查看position作了什麼操做:

mysql> show binlog events in 'mysql-bin.000002' from 874;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                            |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 |  874 | Query       |         2 |         956 | BEGIN                           |
| mysql-bin.000002 |  956 | Table_map   |         2 |        1017 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1017 | Update_rows |         2 |        1127 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1127 | Xid         |         2 |        1158 | COMMIT /* xid=437 */            |
| mysql-bin.000002 | 1158 | Query       |         2 |        1240 | BEGIN                           |
| mysql-bin.000002 | 1240 | Table_map   |         2 |        1301 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1301 | Delete_rows |         2 |        1407 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1407 | Xid         |         2 |        1438 | COMMIT /* xid=446 */            |
| mysql-bin.000002 | 1438 | Query       |         2 |        1520 | BEGIN                           |
| mysql-bin.000002 | 1520 | Table_map   |         2 |        1581 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1581 | Write_rows  |         2 |        1644 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1644 | Xid         |         2 |        1675 | COMMIT /* xid=455 */            |
| mysql-bin.000002 | 1675 | Query       |         2 |        1757 | BEGIN                           |
| mysql-bin.000002 | 1757 | Table_map   |         2 |        1818 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1818 | Delete_rows |         2 |        1880 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1880 | Xid         |         2 |        1911 | COMMIT /* xid=464 */            |
| mysql-bin.000002 | 1911 | Query       |         2 |        1993 | BEGIN                           |
| mysql-bin.000002 | 1993 | Table_map   |         2 |        2054 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 2054 | Update_rows |         2 |        2145 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 2145 | Xid         |         2 |        2176 | COMMIT /* xid=473 */            |
+------------------+------+-------------+-----------+-------------+---------------------------------+


在Slave跳過第一個Update_rows event複製報錯:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 1158
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1407


成功跳過第一個events group


在Slave繼續跳過第二個Delete_rows event複製報錯:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_SQL_Errno: 0
Last_SQL_Error:


成功跳過第二個events group


注意:

雖然主從複製出現的故障成功跳過了,但只是暫時恢復了正常的主從複製狀態,須要儘快的對Slave缺失的數據進行補齊,否則Master對Slave不存在的數據作的變動,仍然會重複致使主從複製故障,筆者以爲若是你的數據量差別不是太大的話,能夠考慮使用pt-table-checksumpt-table-sync工具進行恢復,若是你的數據量很大且數據差別不少,仍是建議重作Slave較好,由於使用工具會鎖表,會對線上業務形成必定的影響,具體狀況,請自行考量。


跳過由多條SQL(event)組成的事務:

在Slave主機上人爲的刪除一條數據:

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';


在Master主機上產生一個由多條SQL組成的事務:

BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('懶死', '不知道', '吃了睡睡了吃');
COMMIT;


由於Slave主機上已經刪除id=7的數據,在Slave查看主從複製狀態時,就會發現報錯信息:

mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 6840
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 7049


在Master主機上查看position作了什麼操做:

mysql> show binlog events in 'mysql-bin.000002' from 6840;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                            |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 6840 | Query       |         2 |        6922 | BEGIN                           |
| mysql-bin.000002 | 6922 | Table_map   |         2 |        6983 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 6983 | Delete_rows |         2 |        7049 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7049 | Table_map   |         2 |        7110 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 7110 | Write_rows  |         2 |        7188 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7188 | Xid         |         2 |        7219 | COMMIT /* xid=825 */            |
+------------------+------+-------------+-----------+-------------+---------------------------------+


能夠看到,這個事務是由兩個SQL(event)組成的


若是使用 sql_slave_skip_counter=N 跳過由多條SQL組成的事務會怎樣呢?

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 7219
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:


發現問題沒有,在使用sql_slave_skip_counter跳過由多條SQL(event)組成的事務時,從在Master上執行的 show binlog events 能夠看到,若是隻是跳過出報錯SQL語句,那麼 Exec_Master_Log_Pos 值應該爲7110,可是如今爲7219,說明將整個event group跳過了,可是7110的SQL數據是咱們須要的,因此,和單條SQL組成的事務同樣,主從複製狀態雖然恢復,可是數據仍處於不一致狀態,要抓緊時間補齊數據或重作Slave


二、由多條SQL(event)組成的事務時,僅跳過一個event,而不是一個event group:

在Slave主機上人爲的刪除一條數據:

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';


在Master主機上產生一個由多條SQL組成的事務:

BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('我是誰', '不知道', '吃了睡睡了吃');
COMMIT;


由於Slave主機上已經刪除id=17的數據,在Slave查看主從複製狀態時,就會發現報錯信息:

Exec_Master_Log_Pos: 120

Slave_IO_Running: Yes
Slave_SQL_Running: No

Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 341


在Master主機上查看position作了什麼操做:

mysqlbinlog -v --base64-output=decode --start-position=120 mysql-bin.000004
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 120
#190507 13:52:05 server id 2  end_log_pos 202 CRC32 0x0ca0c280     Query    thread_id=3    exec_time=0    error_code=0
SET TIMESTAMP=1557208325/*!*/;
SET @@session.pseudo_thread_id=3/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 202
#190507 13:52:05 server id 2  end_log_pos 263 CRC32 0x20d2e89d     Table_map: `edusoho_e`.`t1` mapped to number 216
# at 263
#190507 13:52:05 server id 2  end_log_pos 341 CRC32 0xbec6fd45     Delete_rows: table id 216 flags: STMT_END_F
### DELETE FROM `edusoho_e`.`t1`
### WHERE
###   @1=17
###   @2='懶死'
###   @3='不知道'
###   @4=1
###   @5='吃了睡睡了吃'
###   @6=18

# at 341
#190507 13:52:05 server id 2  end_log_pos 402 CRC32 0xa37bc5c9     Table_map: `edusoho_e`.`t1` mapped to number 216
# at 402
#190507 13:52:05 server id 2  end_log_pos 483 CRC32 0x0d774707     Write_rows: table id 216 flags: STMT_END_F
### INSERT INTO `edusoho_e`.`t1`
### SET
###   @1=21
###   @2='我是誰'
###   @3='不知道'
###   @4=1
###   @5='吃了睡睡了吃'
###   @6=18
# at 483
#190507 13:52:05 server id 2  end_log_pos 514 CRC32 0x8c333b30     Xid = 411
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

能夠看到,綠色的部分就是咱們須要跳過的,而第二個event是須要咱們保留的

這個時候,就須要用到slave_exec_mode這個變量了,至於slave_exec_mode詳細介紹,仍是請參考MySQL官網資料


mysql> set global slave_exec_mode='IDEMPOTENT';
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

Exec_Master_Log_Pos: 514


去Slave上edusoho_e.t1表上查看,數據id=21的數據已通過去了,此時,數據處於一致性狀態


三、跳過主鍵衝突1062錯誤(Duplicate entry):

在Slave主鍵上先插入一條id值:

INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'小玩子', '明朝', '皇后', '25');


由於Slave已經佔用了Master要自動產生的主鍵值id=19,因此Slave主機會報錯:

INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'朱棣', '明朝', '皇帝', '36'); 


查看Slave主從複製狀態發現已經發生了主從複製報錯:

mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table edusoho_e.t1; Duplicate entry '19' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000002, end_log_pos 7425
Exec_Master_Log_Pos: 7219


查看Master binlog:

mysql> show binlog events in 'mysql-bin.000002' from 7219;
+------------------+------+------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type | Server_id | End_log_pos | Info                            |
+------------------+------+------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 7219 | Query      |         2 |        7301 | BEGIN                           |
| mysql-bin.000002 | 7301 | Table_map  |         2 |        7362 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 7362 | Write_rows |         2 |        7425 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7425 | Xid        |         2 |        7456 | COMMIT /* xid=893 */            |
+------------------+------+------------+-----------+-------------+---------------------------------+


思考:

由於Slave這條數據已經存在,若是在Slave主機上把這條數據刪除了,Slave會不會直接同步過來?(答案是:不會。須要重啓Slave thread):

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '19';
mysql> stop slave;
mysql> start slave user='repliter' password='123456';

驗證的時候,發現數據已經同步過去了


題外:

以上是筆者對於單條SQL組成的事務、多條SQL組成的事務,及在這些單/多條SQL組成的事務下,人爲設置的1032和1062複製錯誤和解決方法,還有sql_slave_skip_counter和slave_exec_mode各自的用法和跳過的範圍,固然了,筆者呢,作的只是線上應用前的部署測試,並無通過任何的實戰檢測。一方面,僅爲廣大同行作個參考;另外一方面,記錄筆者本身的心得和針對問題解決的思路作個總結,當問題真正發生的時候,有個方向能夠進行參考,而不至於手忙腳亂,不知所措,因此,對其中有誤之處和理解不到位的地方,望請下方留言指正,不勝感激!

還有,筆者作的,只是針對事務表,作的sql_slave_skip_counter和slave_exec_mode測試,對於非事務表,sql_slave_skip_counter和slave_exec_mode用途會稍有不一樣,請自行百度吧

相關文章
相關標籤/搜索