Common MySQL Replication Errors

Date: 2019-11-21


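Many companies today run a MySQL (master) --> MySQL (slave) topology. One master with several slaves is a natural extension of it, and some companies run master-master, which causes all sorts of problems if it is not handled carefully. Every database architecture has its own strengths and weaknesses; one that fits the company's business needs and is easy to maintain can be considered ideal. When replication breaks, should we blindly use --slave-skip-errors=[error_code] to skip the error? No: doing so can leave the data inconsistent. Below I cover only the common MySQL Replication errors and how to handle them.

1. A failure when a record is updated on the master (master and slave are replicating, binlog format is ROW)

On the slave, simulate a missing row by first deleting the record with id=6: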

root@mysql-slave> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | aa   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)

root@mysql-slave> delete from test where id=6;
Query OK, 1 row affected (0.00 sec)

Then update the record with id=6 on the master:

root@mysql-master> show variables like 'binlog_format';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW   |
+---------------+-------+
1 row in set (0.00 sec)

root@mysql-master> select * from test;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | aa   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)

root@mysql-master> update test set name='AA' where id=6;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
root@mysql-master>

Back on the slave, check whether replication is still healthy:

Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 3704
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3529
              Relay_Log_Space: 4183
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 3704
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
1 row in set (0.00 sec)

root@mysql-slave>

Replication has stopped. Use the slave's error message to find out what the master actually wrote to its binlog: the failing event is in mysql-bin.000004, with end_log_pos=3704.

[root@localhost ~]# /usr/local/services/mysql/bin/mysqlbinlog  -v --base64-output=DECODE-ROWS /data/mysql/data/mysql-bin.000004 | grep -A '10' 3704
#150610 22:33:08 server id 1  end_log_pos 3704  Update_rows: table id 34 flags: STMT_END_F
### UPDATE xuanzhi.test
### WHERE
###   @1=6
###   @2='aa'
###   @3=10002011
### SET
###   @1=6
###   @2='AA'
###   @3=10002011
# at 3704
#150610 22:33:08 server id 1  end_log_pos 3731  Xid = 89
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
[root@localhost ~]#
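The lookup above starts from two values buried in the error text. As a rough sketch, they can be pulled out with a regular expression; the function name `parse_row_error` is my own, and the pattern assumes the message layout shown in this article, which may differ between MySQL versions.

```python
import re

# Matches the row-event apply errors shown in this article, for example:
# "Could not execute Update_rows event on table xuanzhi.test; ...
#  Error_code: 1032; ... the event's master log mysql-bin.000004, end_log_pos 3704"
_ROW_ERROR = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[^;]+); "
    r".*?Error_code: (?P<code>\d+); "
    r".*?master log (?P<log_file>\S+), end_log_pos (?P<end_log_pos>\d+)"
)


def parse_row_error(last_sql_error):
    """Return event type, table, error code and binlog coordinates, or None."""
    match = _ROW_ERROR.search(last_sql_error)
    if match is None:
        return None
    info = match.groupdict()
    info["code"] = int(info["code"])
    info["end_log_pos"] = int(info["end_log_pos"])
    return info


message = (
    "Could not execute Update_rows event on table xuanzhi.test; "
    "Can't find record in 'test', Error_code: 1032; "
    "handler error HA_ERR_KEY_NOT_FOUND; "
    "the event's master log mysql-bin.000004, end_log_pos 3704"
)
print(parse_row_error(message))
```

The `log_file` and `end_log_pos` values are what the mysqlbinlog command above is given.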

The event is an UPDATE on xuanzhi.test for the row with id=6. Look up that row on the slave:

root@mysql-slave> select * from xuanzhi.test where id=6;
Empty set (0.00 sec)

root@mysql-slave>

The slave has no such row. Back on the master, check the row with id=6:

root@mysql-master>   select * from xuanzhi.test where id=6;
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | AA   | 10002011 |
+----+------+----------+
1 row in set (0.00 sec)

root@mysql-master>

How do we fix replication? Insert the missing row on the slave:

root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> insert into test (id,name,code) values (6,'AA',10002011);
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G                   
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.132
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 3731
               Relay_Log_File: localhost-relay-bin.000004
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

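Replication is back to normal. If many rows are missing, use pt-table-checksum to verify data consistency. Why would a slave be missing data in the first place? Here are a few causes I have run into; there are others:

(1) Someone manually ran set session sql_log_bin=0, so that session's operations were not written to the binlog.

(2) The slave is not set to read only, and rows were deleted directly on the slave.

(3) After reading the master's binlog, the slave has to persist three files: relay log, relay log info and master info. If they are not flushed to disk promptly, a host crash leaves the data inconsistent.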
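To make the idea of drift concrete, here is a minimal sketch that compares two copies of a small table by primary key. It is not how pt-table-checksum works (that tool checksums chunks of rows and lets the checksum queries flow through replication); `diff_by_primary_key` is a hypothetical helper that only makes sense for tables small enough to load into memory.

```python
def diff_by_primary_key(master_rows, slave_rows):
    """Compare two {primary_key: row} mappings and report the drift."""
    master_keys = set(master_rows)
    slave_keys = set(slave_rows)
    return {
        "missing_on_slave": sorted(master_keys - slave_keys),
        "extra_on_slave": sorted(slave_keys - master_keys),
        "different": sorted(
            key for key in master_keys & slave_keys
            if master_rows[key] != slave_rows[key]
        ),
    }


# Rows keyed by id, copied from the test table in this section
# after id=6 was deleted on the slave.
master = {6: ("aa", 10002011), 7: ("bb", 10002012),
          8: ("cc", 10002013), 9: ("dd", 10002014)}
slave = {7: ("bb", 10002012), 8: ("cc", 10002013), 9: ("dd", 10002014)}

print(diff_by_primary_key(master, slave))
```

Each key under `missing_on_slave` is a row that would trigger error 1032 as soon as the master updates or deletes it.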

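2. Probably one of the more common errors: error code 1062, a primary key conflict

Insert a row on the slave to simulate stale data that still exists there but not on the master (id is the auto-increment primary key):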

root@mysql-slave> insert into test value (5,'zz',10002010);
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> select * from test;                      
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | zz   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-slave>

On the master, insert a row with id=5:

root@mysql-master>  select * from test;                      
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
4 rows in set (0.00 sec)
 
root@mysql-master>  insert into test value (5,'ZZ',10002010);
Query OK, 1 row affected (0.00 sec)

Back on the slave, check the replication status:

 Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table xuanzhi.test; Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 3893
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3731
              Relay_Log_Space: 2322
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Write_rows event on table xuanzhi.test; Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 3893
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
1 row in set (0.00 sec)

root@mysql-slave>

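The error is 1062, a primary key conflict on table xuanzhi.test. Whose data should win? The master's, of course, so the row with primary key 5 has to be deleted on the slave. Before deleting, run desc to check the table structure and confirm which column is the auto-increment primary key: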

root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> desc xuanzhi.test;
+-------+----------+------+-----+---------+----------------+
| Field | Type     | Null | Key | Default | Extra          |
+-------+----------+------+-----+---------+----------------+
| id    | int(11)  | NO   | PRI | NULL    | auto_increment |
| name  | char(10) | YES  |     | NULL    |                |
| code  | int(20)  | YES  |     | NULL    |                |
+-------+----------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

root@mysql-slave> delete from xuanzhi.test where id=5;
Query OK, 1 row affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G                 
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.132
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 3920
               Relay_Log_File: localhost-relay-bin.000005
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

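If there are many primary key conflicts, clearing them by hand is impractical, so I wrote a script to do it; see my post on resolving replication error 1062. You may wonder whether the row is still replicated once the conflicting row has been deleted and the SQL thread restarted. It is: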

root@mysql-slave> select * from test;                      
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-slave>

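The row with primary key 5 has been replicated. When a replica restarts after an unplanned shutdown, it reads the master.info file to find the position where replication last stopped. Unfortunately, that file may not have been synced to disk: the position is held in cache and may not have been flushed to master.info. The stored position can therefore be wrong, and the replica may try to re-execute some binary log events, which can cause primary key conflicts, the 1062 error we see so often. Unless you can determine where the replica actually stopped (which is hard), the only option is to ignore those errors.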
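When conflicts like the one above pile up, the manual fix can be turned into generated SQL. The following is only a sketch and not the script mentioned earlier: `delete_for_duplicate` is a hypothetical helper, it assumes an integer primary key, and it relies on the `Duplicate entry '...' for key 'PRIMARY'` wording shown in this article.

```python
import re

_DUPLICATE = re.compile(r"Duplicate entry '(?P<value>[^']*)' for key 'PRIMARY'")


def delete_for_duplicate(last_sql_error, table, pk_column):
    """Build the slave-side DELETE for a 1062 conflict on an integer key."""
    match = _DUPLICATE.search(last_sql_error)
    if match is None:
        return None
    key = int(match.group("value"))  # raises ValueError for non-integer keys
    return f"delete from {table} where {pk_column}={key};"


error = (
    "Could not execute Write_rows event on table xuanzhi.test; "
    "Duplicate entry '5' for key 'PRIMARY', Error_code: 1062; "
    "handler error HA_ERR_FOUND_DUPP_KEY; "
    "the event's master log mysql-bin.000004, end_log_pos 3893"
)
print(delete_for_duplicate(error, "xuanzhi.test", "id"))
```

The generated statement is the same delete that was run by hand above; it still has to be executed between stop slave sql_thread and start slave sql_thread.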

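3. A failure when a record is deleted on the master

What happens when a row is deleted on the master but does not exist on the slave: does replication break? Let's find out.

Delete a row on the slave to simulate the slave having less data than the master: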

root@mysql-slave> select * from test;                      
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-slave> delete from test where id=5;
Query OK, 1 row affected (0.00 sec)

root@mysql-slave>

Delete the same row on the master; by now the slave no longer has it:

root@mysql-master> select * from test;                      
+----+------+----------+
| id | name | code     |
+----+------+----------+
|  5 | ZZ   | 10002010 |
|  6 | AA   | 10002011 |
|  7 | bb   | 10002012 |
|  8 | cc   | 10002013 |
|  9 | dd   | 10002014 |
+----+------+----------+
5 rows in set (0.00 sec)

root@mysql-master>  delete from test where id = 5;
Query OK, 1 row affected (0.00 sec)

Back on the slave, the status shows that replication has stopped:

Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4082
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3920
              Relay_Log_Space: 937
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows event on table xuanzhi.test; Can't find record in 'test', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 4082
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
1 row in set (0.00 sec)

root@mysql-slave>

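The key word in the error message is Delete_rows, which makes this easy to handle: the master was deleting the row, so it does not matter that the slave lacks it, and the error can simply be skipped on the slave. (When this happens, treat it as a warning sign and check whether more data is missing.)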

root@mysql-slave> stop slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> set global sql_slave_skip_counter=1;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> start slave sql_thread;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.132
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 4109
               Relay_Log_File: localhost-relay-bin.000006
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

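Replication is healthy again. Note that deleting a row that does not exist even on the master does not break replication, because the statement changes nothing on either side.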
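The three failures so far share one mechanism: a row event names a specific row, and the slave refuses to apply it when its own copy does not match. The toy model below illustrates that under heavy simplification. The table is a dict keyed by primary key, all names are made up for the example, and the real server also handles before-images, tables without keys and much more.

```python
class ReplicaError(Exception):
    """Raised when the toy replica cannot apply an event."""

    def __init__(self, code, text):
        super().__init__(f"{code}: {text}")
        self.code = code


def apply_row_event(table, event, key, row=None):
    """Apply a row-based event to `table`, a dict keyed by primary key."""
    if event == "Write_rows":
        if key in table:
            raise ReplicaError(1062, f"Duplicate entry '{key}' for key 'PRIMARY'")
        table[key] = row
    elif event in ("Update_rows", "Delete_rows"):
        if key not in table:
            raise ReplicaError(1032, "Can't find record")
        if event == "Update_rows":
            table[key] = row
        else:
            del table[key]
    else:
        raise ValueError(f"unknown event {event!r}")


def apply_statement_delete(table, key):
    """Statement-based DELETE ... WHERE id=key: return the affected row count."""
    if key in table:
        del table[key]
        return 1
    return 0


slave = {6: ("AA", 10002011), 7: ("bb", 10002012)}  # id=5 is already gone
try:
    apply_row_event(slave, "Delete_rows", 5)
except ReplicaError as err:
    print("row format:", err)
print("statement format, rows affected:", apply_statement_delete(slave, 5))
```

In the model, the row event fails with 1032, while the same delete replayed as a statement just affects zero rows, so the difference stays hidden instead of stopping replication.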

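4. The slave's relay log is corrupted

Now simulate the slave host going down with a corrupted relay log, so that replication cannot continue:

After a power loss, start the slave instance and run start slave; the status then reports that the relay log cannot be read or is corrupted. (With sensible configuration, a power loss does not necessarily corrupt the relay log or lose data.)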

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.132
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000006
          Read_Master_Log_Pos: 401269011
               Relay_Log_File: localhost-relay-bin.000010
                Relay_Log_Pos: 439914363
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1594
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 590788350
              Relay_Log_Space: 2398604371

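Key fields in show slave status:

Slave_IO_Running: whether the I/O thread, which receives the master's binlog events, is running

Master_Log_File: name of the master binlog file the I/O thread is currently reading

Read_Master_Log_Pos: position in that binlog file up to which the I/O thread has read

Slave_SQL_Running: whether the SQL thread, which applies the received events, is running

Relay_Master_Log_File: name of the master binlog file containing the event the SQL thread executed most recently

Exec_Master_Log_Pos: position in that binlog file up to which the SQL thread has executed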

When the relay log is corrupted, use Relay_Master_Log_File and Exec_Master_Log_Pos as the restart point. Above, Relay_Master_Log_File is mysql-bin.000004 and Exec_Master_Log_Pos is 590788350, so what we need is a change master:

root@mysql-slave> stop slave;
Query OK, 0 rows affected (0.01 sec)

root@mysql-slave> change master to master_host='192.168.10.132',master_port=3306,master_user='root',master_password='123456',master_log_file='mysql-bin.000004',master_log_pos=590788350;
Query OK, 0 rows affected (0.04 sec)

root@mysql-slave> start slave;
Query OK, 0 rows affected (0.00 sec)

root@mysql-slave> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Connecting to master
                  Master_Host: 192.168.10.132
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 590788573
               Relay_Log_File: localhost-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 590788573
              Relay_Log_Space: 107

Doing this discards all relay logs on disk.
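Copying the coordinates by hand is easy to get wrong, so here is a small sketch that reads them from the status output and builds the statement. Both function names are my own. The sketch leaves out the connection options (host, port, user, password), on the assumption that CHANGE MASTER TO keeps any option that is not specified.

```python
def parse_slave_status(text):
    """Parse vertical-format SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only; error texts contain more colons.
        name, sep, value = line.partition(":")
        if sep:
            status[name.strip()] = value.strip()
    return status


def change_master_statement(status):
    """Build CHANGE MASTER TO from the SQL thread's last executed position."""
    log_file = status["Relay_Master_Log_File"]
    log_pos = int(status["Exec_Master_Log_Pos"])
    return (
        f"change master to master_log_file='{log_file}', "
        f"master_log_pos={log_pos};"
    )


sample = """\
*************************** 1. row ***************************
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 1594
          Exec_Master_Log_Pos: 590788350
"""
print(change_master_statement(parse_slave_status(sample)))
```

The coordinates deliberately come from Relay_Master_Log_File and Exec_Master_Log_Pos, not from Master_Log_File and Read_Master_Log_Pos, which describe what the I/O thread has fetched and may be far ahead of what was actually applied.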

If you see the following error, fix it the same way:

             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1593
                   Last_Error: Error initializing relay log position: I/O error reading the header from the binary log
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 59078

Repairing the relay log this way is rather tedious. MySQL 5.5 already accounts for relay log corruption after a slave crash: adding relay_log_recovery=1 to the slave's my.cnf is enough.
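The four cases covered in this article can be condensed into a lookup from error to fix. This is merely a restatement of the steps above in code form; the names `ACTIONS` and `suggested_action` are hypothetical, and any error not listed needs its own investigation.

```python
# (error code, event type) -> fix used in this article
ACTIONS = {
    (1032, "Update_rows"): "insert the missing row on the slave from the master's data",
    (1062, "Write_rows"): "delete the conflicting row on the slave",
    (1032, "Delete_rows"): "skip the event with sql_slave_skip_counter=1",
    (1594, None): "change master to Relay_Master_Log_File / Exec_Master_Log_Pos",
    (1593, None): "change master to Relay_Master_Log_File / Exec_Master_Log_Pos",
}


def suggested_action(errno, event=None):
    """Return the article's fix for an error, or None if it is not covered."""
    if errno in (1593, 1594):
        event = None  # relay log errors are not tied to a row event
    return ACTIONS.get((errno, event))


print(suggested_action(1032, "Delete_rows"))
print(suggested_action(1594))
```

Only the Delete_rows case is skipped; the other row errors are fixed by repairing the data on the slave first.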

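Summary:

1. When replication breaks, do not blindly use --slave-skip-errors=[error_code] to skip the error; that easily leads to inconsistent data.

2. With binlog format STATEMENT, updating or deleting on the master a row that the slave does not have does not break replication.

3. Check data integrity regularly; pt-table-checksum can verify master-slave consistency. Data integrity is, without question, what matters most to a company.

4. On the slave, enable the important options: read only, relay_log_recovery, sync_master_info, sync_relay_log_info and sync_relay_log.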
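As a rough sketch, the options from item 4 can be checked against a dict of server variables, for example one built from the output of show variables. `audit_slave_settings` is a hypothetical helper, and treating any non-zero sync_* value as "enabled" is my assumption; which value is appropriate depends on how much write overhead you can accept.

```python
def audit_slave_settings(variables):
    """Return the names of recommended options that look disabled."""
    problems = []
    # Boolean options: show variables reports them as ON or OFF.
    for name in ("read_only", "relay_log_recovery"):
        if str(variables.get(name, "OFF")).upper() not in ("ON", "1"):
            problems.append(name)
    # Counter options: 0 leaves flushing to the operating system.
    for name in ("sync_master_info", "sync_relay_log_info", "sync_relay_log"):
        if int(variables.get(name, 0)) == 0:
            problems.append(name)
    return problems


current = {
    "read_only": "ON",
    "relay_log_recovery": "OFF",
    "sync_master_info": "1",
    "sync_relay_log_info": "0",
    "sync_relay_log": "1",
}
print(audit_slave_settings(current))
```

A missing variable is reported as disabled, so the check errs on the side of flagging too much.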

References:

http://blog.itpub.net/25704976/viewspace-1318714

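Author: 陸炫志

Source: xuanzhi's blog http://www.cnblogs.com/xuanzhi201111

Your support is the greatest encouragement to the author; thank you for reading. Copyright of this article belongs to the author. Reposting is welcome, but please keep this notice.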