MySQL 5.5: a failure caused by physically deleting binlog files

Date: 2019-11-10

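Symptoms:

Shortly after noon, the master of a master-slave cluster ran out of memory during a deployment because huge pages had not been configured. The MySQL instance restarted and MHA then performed a failover, which itself went through normally. Afterward the old master had to be configured as a slave of the new master. I took the CHANGE MASTER TO ... statement from manager.log and ran it, but the replication state stayed stuck at Connecting. Naming used below: M1 is the node that hit the OOM, and S1 is the node that took over after M1 went down.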

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting to reconnect after a failed master event read
                  Master_Host: 10.3.171.40
                  Master_User: rep_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: centos-bin.000002
          Read_Master_Log_Pos: 107
               Relay_Log_File: relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: centos-bin.000002
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 107
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2017140

The error log, shown below, reports that the binlog file of the new master cannot be found on S1:

160408 12:25:40 [Note] Slave I/O thread: connected to master 'rep_user@10.3.171.40:3306',replication started in log 'centos-bin.000002' at position 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:25:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:26:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:26:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)

On S1, SHOW MASTER STATUS and SHOW MASTER LOGS show that business data is being written and that the position keeps changing. The odd part is that file 000001 has a size of 0:

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568661746 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568941034 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 569017617 |
+-------------------+-----------+
2 rows in set (0.00 sec)

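Looking in the data directory, neither of these two files is there, which matches the "file not found" error reported by replication.

So the strange situation is this: the application writes to the database normally and SHOW MASTER STATUS shows the position advancing, yet no file exists on disk and replication cannot be established.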

[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]# find / -name centos-bin.000002
[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]#
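
An empty `find` result does not by itself prove the data is gone. `find` only walks directory entries, and on Linux a file removed with `rm` while a process still has it open stays alive until that process closes it. Below is a minimal sketch, assuming a Linux host with `/proc` mounted, of how one might list such files for a given process ID. The helper name `deleted_open_files` is hypothetical and is not part of MySQL or MHA.

```python
import os


def deleted_open_files(pid):
    """Return {fd: path} for files the process holds open but that were unlinked.

    Assumption: Linux, where /proc/<pid>/fd/<n> is a symlink whose target
    ends in " (deleted)" once the file's name has been removed.
    """
    suffix = " (deleted)"
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        if target.endswith(suffix):
            found[int(name)] = target[: -len(suffix)]
    return found
```

Calling it with the pid of mysqld (and enough privileges to read that process's `/proc` entry) would show whether a binlog such as centos-bin.000002 is still open even though no path leads to it.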

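#Reproducing the failure

1) Start the instance normally, enable the binlog, and set up replication.

2) Use rm to delete binlog.index and binlog.0000X on the master.

3) Keep writing data; the position keeps changing.

4) The slave reports an error: the binlog file cannot be found.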

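#Why did this happen

Looking back, this failure most likely followed the same path as the reproduction above. The cluster was built three or four months ago. After replication was working, the binlog-related files on the standby were deleted. In fact the whole directory was deleted, a directory dedicated to binlogs and relay logs. When replication was set up again afterward, CHANGE MASTER TO recreated the relay logs, but the binlogs were not recreated. Today MHA failed over, so the standby became the master and started accepting writes. The file name and position that MHA records come from running SHOW MASTER STATUS on the standby, but the files they refer to had already been deleted, so replication failed.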
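
This explanation relies on a Unix property: removing a file's directory entry does not stop a process that already has the file open from writing to it. Here is a small sketch of that behaviour on a POSIX system, using a scratch file in place of a binlog. The file name is made up for illustration, and nothing here touches MySQL.

```python
import os
import tempfile


def write_after_unlink():
    """Unlink a file that is still open, keep appending, and report what the OS sees."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo-bin.000001")  # hypothetical stand-in for a binlog
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, b"event-1\n")
        os.unlink(path)             # same effect as `rm` on the file
        os.write(fd, b"event-2\n")  # the write still succeeds
        info = os.fstat(fd)
        return {
            "visible_on_disk": os.path.exists(path),
            "size_seen_by_writer": info.st_size,
            "link_count": info.st_nlink,
        }
    finally:
        os.close(fd)
        os.rmdir(workdir)


if __name__ == "__main__":
    print(write_after_unlink())
```

The writer sees its file grow while the path no longer exists, which is consistent with a position that advances in SHOW MASTER STATUS while `find` returns nothing.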

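#Further experiments

1) After binlog.000001 has been created, rm both binlog.index and binlog.000001, then keep writing so that the position grows. What happens at the file switch once the size exceeds 1 GB?

Answer: when file 1 is full and the switch takes place, binlog.index is missing, so the highest file ID cannot be determined and numbering starts from 1 again. Conclusion: the server keeps writing to 000001.

2) Keep the index file, delete 000001, and continue writing. What happens once the size exceeds 1 GB?

Answer: 000002 is created, and it is a normal binlog file that actually lands on disk.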
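
Experiment 2 leaves the index file pointing at a binlog that no longer exists on disk. Below is a minimal sketch of a consistency check for that situation, assuming the index file lists one binlog path per line. Relative entries are resolved against a directory the caller passes in, which for a real server would normally be the data directory. The function name `missing_binlogs` is hypothetical.

```python
import os


def missing_binlogs(index_path, base_dir=None):
    """Return the index entries whose files are not present on disk."""
    if base_dir is None:
        # Assumption for this sketch: fall back to the index file's own directory.
        base_dir = os.path.dirname(os.path.abspath(index_path))
    missing = []
    with open(index_path, encoding="utf-8") as index_file:
        for line in index_file:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            full = entry if os.path.isabs(entry) else os.path.join(base_dir, entry)
            if not os.path.isfile(full):
                missing.append(entry)
    return missing
```

A check like this could run before a node is promoted. It does not cover experiment 1, where the index file itself is gone; in that case `open` raises `FileNotFoundError`, which is a warning sign in its own right.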

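#How can the events from today's failure be extracted?

In testing, if the binlog format is statement, the statements can be retrieved with SHOW BINLOG EVENTS IN 'xxxx'. If the format is row, the actual SQL statements cannot be recovered.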
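
If the deleted binlog is still held open by a running mysqld, a Linux host offers one more avenue that the tests above do not cover: the content stays readable through `/proc/<pid>/fd/<n>` until the process closes the file or exits. The sketch below is an illustration under those assumptions, not a procedure taken from the incident. `copy_open_fd` is a hypothetical helper, and the pid and descriptor number have to be looked up first, for example with `lsof` or by listing `/proc/<pid>/fd`.

```python
import shutil


def copy_open_fd(pid, fd, dest_path):
    """Copy the current content of an open (possibly deleted) file to dest_path.

    Assumptions: Linux with /proc mounted, and permission to read the target
    process's descriptors. The copy is a snapshot; data written after the
    copy starts may be missing from it.
    """
    source = f"/proc/{pid}/fd/{fd}"
    with open(source, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```

A file recovered this way could then be inspected with `mysqlbinlog`. For row-format events, `mysqlbinlog --base64-output=DECODE-ROWS -v` prints a pseudo-SQL rendering of each row change rather than the original statement.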