How to quickly repair MySQL master-slave replication that stopped because a single table's data went wrong

Date: 2021-01-30

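Scenario:

If table t on the slave is inconsistent with the master and replication fails as a result, and the whole database is so large that rebuilding the slave would be slow, how can the data of just that one table be restored?
It is generally believed that a single table cannot be repaired on its own, because the tables would end up in inconsistent states relative to each other.
Below are the problems you run into when restoring a single-table backup to the slave, and how to solve them.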

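1. Demo environment:

Two Dell R620 physical servers
Both on the internal network
master: 192.168.1.220
slave: 192.168.1.217
OS: CentOS 7.8 x86_64, minimal install, iptables disabled, SELinux disabled
Software under test: MySQL 5.7.27 binary package
GTID-based master-slave replication configured in advance
Create test data and simulate the failure
Repair master-slave replication
Use pt-table-checksum to verify that master and slave data are consistent after the repair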

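2. Configure master-slave replication

The MySQL installation itself is not covered here; look it up if needed.

Configuring replication
When adding a new slave to a master, remember to pass --set-gtid-purged=ON to mysqldump when taking the backup.
Background:
1. For a routine backup, add --set-gtid-purged=OFF, which gets rid of the warning printed during the dump: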
[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=OFF --single-transaction -A -B |gzip > 2020-09-17.sql.gz
2. A backup taken to build a slave must not use --set-gtid-purged=OFF; use --set-gtid-purged=ON instead:
[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=ON --single-transaction -A -B --master-data=2 |gzip > 2020-09-17.sql.gz
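Note:
When building replication, never use OFF. For routine backups, OFF is fine.
--set-gtid-purged accepts AUTO, ON, or OFF:
1. --set-gtid-purged=OFF can be used for routine backups.
2. --set-gtid-purged=ON is required when building a replication environment.

The GTID-based replication setup steps are as follows:

On the master: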

grant replication slave on *.* to rep@'192.168.1.217' identified by 'JuwoSdk21TbUser'; flush privileges;
 mysqldump -uroot -p'dXdjVF#(y3lt' --set-gtid-purged=ON --single-transaction -A -B --master-data=2 |gzip > 2020-09-20.sql.gz

On the slave:

[root@mysql02 ~]# mysql < 2020-09-17.sql 
 mysql>  change master to master_host='192.168.1.220',master_user='rep',master_password='JuwoSdk21TbUser',MASTER_AUTO_POSITION = 1;start slave;show slave status\G
ERROR 29 (HY000): File '/data1/mysql/3306/relaylog/relay-bin.index' not found (Errcode: 2 - No such file or directory)
ERROR 29 (HY000): File '/data1/mysql/3306/relaylog/relay-bin.index' not found (Errcode: 2 - No such file or directory)
Empty set (0.00 sec)

The cause: my.cnf on the slave sets a relay-log path, but that directory does not exist on the slave server. Create the directory, give the mysql user ownership of it, then run CHANGE MASTER again:

mkdir -p /data1/mysql/3306/relaylog/
cd /data1/mysql/3306/
chown -R mysql.mysql relaylog
mysql> change master to master_host='192.168.1.220',master_user='rep',master_password='JuwoSdk21TbUser',MASTER_AUTO_POSITION = 1;start slave;show slave status\G
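
To catch this kind of problem before running CHANGE MASTER, the relay-log directory can be read straight from the config file. The following is a minimal sketch, assuming my.cnf contains a line of the form `relay-log = /path/to/relay-bin` (or `relay_log`); the function name and the config path in the usage comment are illustrative and not from the original setup.

```shell
# Print the directory that a my.cnf-style file configures for relay logs.
# Assumption: the file has a line like "relay-log = /dir/relay-bin".
relay_log_dir() {
    cnf=$1
    # Match relay-log / relay_log only, not relay-log-index and friends.
    base=$(sed -n 's/^[[:space:]]*relay[-_]log[[:space:]]*=[[:space:]]*//p' "$cnf" | tail -n 1)
    [ -n "$base" ] || return 1
    dirname "$base"
}

# Usage (hypothetical config path), run as root on the slave:
#   dir=$(relay_log_dir /etc/my.cnf) && mkdir -p "$dir" && chown -R mysql:mysql "$dir"
```

The function only reports the directory; creating it and fixing ownership stay explicit steps, as in the commands above.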

Replication setup is complete.

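3. Prepare test data and simulate the failure

On the master, create a demo table plus a scheduled event that keeps writing rows into it, which makes the replication recovery demo below easier to follow.
Create the test table: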

CREATE TABLE `test_event` (
`id` int(8) NOT NULL AUTO_INCREMENT, 
`username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`password` varchar(20) COLLATE utf8_unicode_ci NOT NULL, 
`create_time` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`) # primary key
) ENGINE=innodb AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

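Create an event that starts one minute from now and inserts one row every second (events only run while the event scheduler is enabled, i.e. event_scheduler=ON):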

delimiter $$
create event event_2 
on schedule every 1 second STARTS   CURRENT_TIMESTAMP + INTERVAL 1 MINUTE
COMMENT 'xiaowu create'
do 
    BEGIN
           insert into test_event(username,password,create_time) values("李四","tomcat",now());
    END $$
delimiter ;

In the same way, create a second test table, txt, that also receives rows on a schedule.

Simulate the failure on the slave:

insert into test_event(username,password,create_time) values("李四","tomcat",now());
 insert into test_event(username,password,create_time) values("李四","tomcat",now());
 delete from txt where id=200;

Then delete the row with id=200 on the master as well.
On the master: delete from txt where id=200;

Checking the slave now shows that replication has stopped:

[root@mysql02 ~]#  mysql -e "show slave status\G"|grep -A 1 'Last_SQL_Errno'
               Last_SQL_Errno: 1062
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '8a9fb9a3-f579-11ea-830d-90b11c12779c:42083' at master log mysql-bin.000001, end_log_pos 18053730. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
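
The GTID of the failed transaction is embedded in the Last_SQL_Error text, and it is the value the rest of the repair revolves around. A small sketch for pulling it out of the `show slave status\G` output (the function name is made up for illustration, and the pattern assumes the usual `uuid:number` form shown above):

```shell
# Read "show slave status\G" text on stdin and print the GTID of the
# transaction named in Last_SQL_Error (empty output if there is none).
failed_gtid() {
    grep 'Last_SQL_Error:' |
        grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:[0-9]+' |
        head -n 1
}

# Usage on the slave:
#   mysql -e "show slave status\G" | failed_gtid
```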

4. Recovery

Scenario 1

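Suppose that after the replication error nobody has tried to fix replication by skipping the error or by setting a replication filter. The master keeps receiving updates while the slave is stuck at the failed transaction (in this demo the slave has executed 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082 and failed on 8a9fb9a3-f579-11ea-830d-90b11c12779c:42083).
The straightforward repair would be:
Back up table test_event on the master (assume the backup snapshot GTID is 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262);
Restore it to the slave;
Start replication.
The problem is that replication resumes at 8a9fb9a3-f579-11ea-830d-90b11c12779c:42083, while table test_event on the slave is now ahead of every other table.
Any transaction in 8a9fb9a3-f579-11ea-830d-90b11c12779c:42083-42262 that modifies test_event will make replication fail again, for example with a duplicate key or a missing row (and 8a9fb9a3-f579-11ea-830d-90b11c12779c:42083, the transaction that failed in the first place, certainly modifies test_event).
The fix: when starting replication, skip the transactions in 8a9fb9a3-f579-11ea-830d-90b11c12779c:42083-42262 that modify test_event.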

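The correct repair steps:

Back up table test_event on the master (backup snapshot GTID 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262) and restore it to the slave;
Set a replication filter that ignores table test_event: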
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('dbtest01.test_event');
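Start replication and stop once 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262 has been replayed (at that point every table on the slave is at the same state, so the data is consistent):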
START SLAVE UNTIL SQL_AFTER_GTIDS = '8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262';
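Remove the replication filter and start replication normally.
Note: take the backup with mysqldump --single-transaction --master-data=2 so that the GTID of the backup snapshot is recorded.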
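
The statement sequence from these steps can be generated for any table and snapshot GTID set. This is only a sketch under a few assumptions: the function names are invented, and the leading STOP SLAVE SQL_THREAD is an addition, included because MySQL 5.7 only accepts CHANGE REPLICATION FILTER while the SQL thread is stopped. Phase 2 must not be run until Slave_SQL_Running has turned to No, meaning the UNTIL condition was reached.

```shell
# Phase 1: ignore the restored table and replay up to the snapshot GTID set.
repair_phase1_sql() {
    schema=$1
    table=$2
    snapshot_gtids=$3
    cat <<EOF
STOP SLAVE SQL_THREAD;
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('${schema}.${table}');
START SLAVE UNTIL SQL_AFTER_GTIDS = '${snapshot_gtids}';
EOF
}

# Phase 2: drop the filter and resume normal replication.
# Run only after the SQL thread has stopped at the UNTIL condition.
repair_phase2_sql() {
    cat <<EOF
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ();
START SLAVE SQL_THREAD;
EOF
}

# Usage on the slave (values from this demo):
#   repair_phase1_sql dbtest01 test_event '8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262' | mysql
#   ... wait until Slave_SQL_Running: No ...
#   repair_phase2_sql | mysql
```

Printing the SQL instead of executing it keeps the two phases reviewable before anything touches the slave.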

The detailed steps:

A. On the master, dump test_event, the table that caused replication to stop:

[root@localhost ~]# mysqldump -uroot -p'dXdjVF#(y3lt'  --single-transaction dbtest01 test_event --master-data=2 |gzip >$(date +%F).test_event.sql.gz
mysqldump: [Warning] Using a password on the command line interface can be insecure.
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to restore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events.

B. Read the snapshot GTID set from the single-table dump:

8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262

[root@mysql02 ~]# gzip -d 2020-09-17.test_event.sql.gz
[root@mysql02 ~]# grep -A6 'GLOBAL.GTID_PURGED' 2020-09-17.test_event.sql 
SET @@GLOBAL.GTID_PURGED='8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262';
--
-- Position to start replication or point-in-time recovery from
--
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=18130552;
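
For scripting, the same value can be extracted without reading the grep output by eye. A minimal sketch, assuming the MySQL 5.7 dump format shown above where the whole GTID set sits on one line (a set spanning several server UUIDs is written across multiple lines and would need more handling); the function name is illustrative:

```shell
# Print the GTID set recorded in a mysqldump file's GTID_PURGED statement.
# Assumption: the statement fits on a single line, as in the dump above.
dump_gtid_purged() {
    sed -n "s/^SET @@GLOBAL\.GTID_PURGED='\([^']*\)';.*/\1/p" "$1"
}

# Usage:
#   snapshot=$(dump_gtid_purged 2020-09-17.test_event.sql)
```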

C. Restore the table on the slave. Because GTID_EXECUTED on the slave is not empty, importing test_event fails with the following error:
On the slave:

[root@mysql02 ~]#  mysql dbtest01 < 2020-09-17.test_event.sql 
ERROR 1840 (HY000) at line 24: @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.

 mysql> select  @@GLOBAL.GTID_EXECUTED;
+----------------------------------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                                                 |
+----------------------------------------------------------------------------------------+
| 5ec577a4-f401-11ea-bf6d-14187756553d:1-2,
8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082 |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000001
         Position: 368620
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 5ec577a4-f401-11ea-bf6d-14187756553d:1-2,
8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082
1 row in set (0.00 sec)

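The workaround is to log in to the slave and run:
mysql> reset master;
This empties GTID_EXECUTED on the slave (and deletes the slave's own binary logs). Keep in mind that once the dump then sets GTID_PURGED to 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262, the slave treats every transaction in that set as already applied, so transactions 42083-42262 that touch other tables are not replayed either. The consistency check in section 5 is what reveals any drift this leaves behind.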

[root@mysql02 ~]#  mysql dbtest01 < 2020-09-17.test_event.sql 

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000001
         Position: 154
     Binlog_Do_DB: 
 Binlog_Ignore_DB: 
Executed_Gtid_Set: 8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262
1 row in set (0.00 sec)
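
Comparing the two Executed_Gtid_Set values for the master's UUID, 1-42082 before the import and 1-42262 after it, shows which transactions the slave now regards as applied without having run them through replication. On the server this can be computed with `SELECT GTID_SUBTRACT('<after>', '<before>');`. The sketch below does the same arithmetic in shell for the simple case used in this demo: one UUID and a single `1-N` interval on each side. The function name is illustrative.

```shell
# Print the GTID range present in $2 but not in $1.
# Assumption: both arguments have the form "<uuid>:1-<N>" with the same uuid.
gtid_gap() {
    before=$1
    after=$2
    uuid=${after%%:*}
    [ "${before%%:*}" = "$uuid" ] || { echo "UUID mismatch" >&2; return 1; }
    lo=${before##*-}   # last executed transaction number before the import
    hi=${after##*-}    # last transaction number after the import
    if [ "$hi" -eq $((lo + 1)) ]; then
        echo "$uuid:$hi"
    elif [ "$hi" -gt "$lo" ]; then
        echo "$uuid:$((lo + 1))-$hi"
    fi
}

# Usage with the values from the outputs above:
#   gtid_gap '8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42082' \
#            '8a9fb9a3-f579-11ea-830d-90b11c12779c:1-42262'
```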

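D. Enable the replication filter online (db_name stands for the schema that holds the table, dbtest01 in this demo):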

mysql> CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('db_name.test_event');
Query OK, 0 rows affected (0.00 sec)

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'db_name.test_event'
  Replicate_Wild_Ignore_Table: db_name.test_event

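E. Start replication and stop after 8a9fb9a3-f579-11ea-830d-90b11c12779c:42262 has been replayed (at that point every table on the slave is at the same state, so the data is consistent):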

mysql> START SLAVE UNTIL SQL_AFTER_GTIDS ='8a9fb9a3-f579-11ea-830d-90b11c12779c:42262';
Query OK, 0 rows affected, 1 warning (0.03 sec)

mysql>

The SQL thread now shows No, but replication no longer reports an error:

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'Last_SQL_Error|Slave_IO|Slave_SQL'
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
               Last_SQL_Error:

F. Disable the replication filter online:

mysql> CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ();
Query OK, 0 rows affected (0.00 sec)

mysql> 
[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'db_name.test_event|IO_Running|SQL_Running'
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
      Slave_SQL_Running_State:

G. Start the slave SQL thread:

mysql> start slave sql_thread;
Query OK, 0 rows affected (0.04 sec)

Replication is restored:

[root@mysql02 ~]# mysql -e "show slave status\G"|egrep 'IO_Running|SQL_Running'
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
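
The same check is easy to turn into a yes/no test for scripts or monitoring. A small sketch that reads the `show slave status\G` text and succeeds only when both threads report Yes (the function name is made up for illustration):

```shell
# Exit 0 if both Slave_IO_Running and Slave_SQL_Running are "Yes" in the
# "show slave status\G" text read from stdin, non-zero otherwise.
replication_ok() {
    awk -F': *' '
        /Slave_IO_Running:/  { io = $2;  sub(/[ \t\r]+$/, "", io) }
        /Slave_SQL_Running:/ { sql = $2; sub(/[ \t\r]+$/, "", sql) }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Usage on the slave:
#   mysql -e "show slave status\G" | replication_ok && echo "replication running"
```

The Slave_SQL_Running_State line is deliberately not matched, since its text varies while the two Yes/No fields do not.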

5. Verify master-slave data consistency

Use pt-table-checksum to verify. For installation and usage, see this post:
http://www.javashuo.com/article/p-hcnfdszd-cm.html

[root@localhost bin]# time /usr/local/percona-toolkit/bin/pt-table-checksum h=192.168.1.220,u=ptsum,p='ptchecksums',P=3306 --ignore-databases sys,mysql  --truncate-replicate-table  --replicate=percona.ptchecksums --no-check-binlog-format --nocheck-replication-filters --recursion-method="processlist"   2>&1 | tee 2020-09-18-pt-checksum.log

Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
09-18T07:49:09      0      0     9739          0       4       0   0.747 dbtest01.hlz_ad
09-18T07:49:10      0      0    64143          0       4       0   0.968 dbtest01.hlz_ad_step
09-18T07:49:16      0      0   741424          0      10       0   6.014 dbtest01.hlz_bubble
09-18T07:49:18      0      0   499991          0       5       0   1.610 dbtest01.test01
09-18T07:49:25      0      0  3532986          0      13       0   7.802 dbtest01.test02
09-18T07:49:26      0      0   126863          0       1       0   0.976 dbtest01.test_event
09-18T07:49:27      0      1    30294          0       1       0   0.582 test01.txt

real    1m22.725s
user    0m0.387s
sys 0m0.078s

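Table test01.txt on the master differs from test01.txt on the slave.

The cause: during the simulated failure, the delete was executed on the slave first
(delete from txt where id=200;), so txt on the slave has one row fewer than txt on the master.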
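
With many tables, scanning the DIFFS column by eye gets tedious. Since the command above already writes its output to a log file with tee, the affected tables can be listed from that log. A minimal sketch, assuming the column layout shown above (DIFFS is the third field, TABLE the last); the function name is illustrative:

```shell
# Print the tables whose DIFFS column is non-zero in pt-table-checksum output.
# Reads the named log files, or stdin when no file is given.
tables_with_diffs() {
    awk '$1 ~ /^[0-9][0-9]-[0-9][0-9]T[0-9:]+$/ && $3 + 0 > 0 { print $NF }' "$@"
}

# Usage:
#   tables_with_diffs 2020-09-18-pt-checksum.log
```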

Repair the data:

[root@localhost bin]# /usr/local/percona-toolkit/bin/pt-table-sync h=192.168.1.220,u=ptsum,p=ptchecksums,P=3306 --databases=test01 --tables=test01.txt  --replicate=percona.ptchecksums  --charset=utf8  --transaction --execute

Check again; the data is now consistent:

[root@localhost bin]# time /usr/local/percona-toolkit/bin/pt-table-checksum h=192.168.1.220,u=ptsum,p='ptchecksums',P=3306 --ignore-databases sys,mysql  --truncate-replicate-table  --replicate=percona.ptchecksums --no-check-binlog-format --nocheck-replication-filters --recursion-method="processlist"   2>&1 | tee 2020-09-18-pt-checksum.log
Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  DIFF_ROWS  CHUNKS SKIPPED    TIME TABLE
09-18T09:48:10      0      0     9739          0       4       0   0.784 dbtest01.hlz_ad
09-18T09:48:11      0      0    64143          0       4       0   0.995 dbtest01.hlz_ad_step
09-18T09:48:16      0      0   741424          0       9       0   4.224 dbtest01.hlz_bubble
09-18T09:48:17      0      0   499991          0       5       0   1.470 dbtest01.test01
09-18T09:48:24      0      0  3532986          0      13       0   6.403 dbtest01.test02
09-18T09:48:24      0      0   133999          0       1       0   0.894 dbtest01.test_event
09-18T09:48:25      0      0    37431          0       1       0   0.511 test01.txt

real    0m15.676s
user    0m0.359s
sys 0m0.055s