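Symptoms:
Shortly after 12 noon, the master of a master-slave cluster hit an OOM during a release because huge pages had not been configured. The MySQL instance restarted and MHA then performed a failover, which went normally. After the failover, the old master had to be configured as a slave of the new master. I found the CHANGE MASTER TO ... command in manager.log and ran it, but the replication state stayed stuck at Connecting. Naming used below: M1 is the instance that hit the OOM, and S1 is the one that took over after it went down.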
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting to reconnect after a failed master event read
Master_Host: 10.3.171.40
Master_User: rep_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: centos-bin.000002
Read_Master_Log_Pos: 107
Relay_Log_File: relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: centos-bin.000002
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 107
Relay_Log_Space: 107
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2017140
The error log shows the following: the binlog file requested from the master cannot be found on S1.
160408 12:25:40 [Note] Slave I/O thread: connected to master 'rep_user@10.3.171.40:3306',replication started in log 'centos-bin.000002' at position 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:25:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:26:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:26:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
On S1, show master status and show master logs show that business data is being written and that the position keeps advancing. The odd thing is that the size of the 000001 file is 0.
mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568661746 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568941034 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 569017617 |
+-------------------+-----------+
2 rows in set (0.00 sec)
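Looking in the data directory, neither of these two files is there, which matches the replication error about the missing file.
So the strange situation is this: the application writes to the database normally, show master status shows the position advancing, yet the files do not exist on disk and replication cannot be established. A search of the whole filesystem returns nothing: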
[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]# find / -name centos-bin.000002
[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]#
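#Reproducing the failure
1) Start the instance normally with binlog enabled, and set up replication.
2) Use rm to delete the master's binlog.index and binlog.0000X files.
3) Keep writing data; the position keeps advancing.
4) The slave reports an error: the binlog file cannot be found.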
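The behaviour in steps 2) to 4) can be imitated without MySQL. The following is a minimal sketch, assuming Linux with bash and GNU coreutils. A shell file descriptor stands in for the binlog handle that mysqld keeps open, and the name demo-bin.000002 is made up for illustration. It shows the likely mechanism: a file removed with rm keeps accepting writes through a descriptor that was opened beforehand, while a lookup by name finds nothing.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
binlog="$workdir/demo-bin.000002"   # hypothetical file name, not a real binlog

exec 3>>"$binlog"        # open the file for append, like a server holding its log open
echo "event 1" >&3       # write while the file still has a name
rm "$binlog"             # delete the name; the open descriptor survives
echo "event 2" >&3       # writes still succeed after the rm

found=$(find "$workdir" -name 'demo-bin.000002')   # search by name
size=$(stat -L -c %s "/proc/$$/fd/3")              # size as seen through the descriptor
target=$(readlink "/proc/$$/fd/3")                 # link target of the descriptor

echo "find result: '${found}'"
echo "size via descriptor: ${size} bytes"
echo "descriptor target: ${target}"

exec 3>&-                # closing the descriptor finally releases the data
rm -rf "$workdir"
```

On a real server the equivalent check would be to list /proc/<pid of mysqld>/fd and look for link targets ending in "(deleted)".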
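#Why did this happen
Thinking back, this failure must have come about the same way as in the reproduction above. The cluster was built 3 or 4 months ago. After replication was working, the binlog files on the standby were deleted. In fact the whole directory was deleted, a directory dedicated to binlogs and relay logs. When replication was set up again after the deletion, CHANGE MASTER TO recreated the relay log, but nothing recreated the binlog. Today MHA failed over, the standby became the master and started accepting writes. The file name and position that MHA uses are obtained by connecting to the standby and running SHOW MASTER STATUS, but the files they refer to had already been deleted. That is why replication failed.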
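A check along the following lines, run on every candidate master ahead of a failover, might have flagged the problem months earlier. This is only a sketch: the function name check_binlog_on_disk and its three arguments are hypothetical, the demo file names are illustrative, and in practice the current binlog name would be taken from SHOW MASTER STATUS on the instance being checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that the binlog index file and the binlog file
# a server reports as current both exist in the binlog directory.
# Usage: check_binlog_on_disk <binlog_dir> <index_file_name> <current_binlog_name>
check_binlog_on_disk() {
    local dir=$1 index=$2 current=$3 status=0
    if [ ! -f "$dir/$index" ]; then
        echo "MISSING index: $dir/$index"
        status=1
    fi
    if [ ! -f "$dir/$current" ]; then
        echo "MISSING binlog: $dir/$current"
        status=1
    fi
    return "$status"
}

# Demo against a temporary directory (no MySQL involved).
demo=$(mktemp -d)
touch "$demo/centos-bin.index" "$demo/centos-bin.000002"

if check_binlog_on_disk "$demo" centos-bin.index centos-bin.000002; then
    before=ok
else
    before=broken
fi

rm "$demo/centos-bin.index" "$demo/centos-bin.000002"   # simulate the deleted directory contents

if check_binlog_on_disk "$demo" centos-bin.index centos-bin.000002; then
    after=ok
else
    after=broken
fi

echo "before delete: $before, after delete: $after"
rm -rf "$demo"
```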
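#Further experiments
1) After binlog.00001 has been created, rm both binlog.index and binlog.00001, then keep writing data so the position keeps growing. What happens when the size passes 1G and the log has to rotate?
Answer: when file 1 is full and a rotation is attempted, binlog.index is missing, so the largest file ID cannot be obtained and numbering starts from 1 again. Conclusion: writes keep going to the 00001 file.
2) Keep the index file, delete only 00001, and keep writing. What happens when the size passes 1G?
Answer: a 00002 file is created, and it is a normal binlog file that really exists on disk.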
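#How can the events from today's failure be extracted?
In my tests, if the binlog format is statement, the statements can be obtained with SHOW BINLOG EVENTS IN 'xxxx'. If the format is row, the actual SQL statements cannot be obtained this way.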
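As a different approach from the one tested above, on Linux a deleted file that a process still holds open can usually be copied back out through /proc/<pid>/fd while that process is running. The sketch below assumes Linux with bash and GNU coreutils. A background shell stands in for mysqld and every file name is made up. On a real server the idea would be to find the mysqld descriptor whose link target ends in "(deleted)" and copy it. The copy could then be read with mysqlbinlog, whose --base64-output=DECODE-ROWS and --verbose options print row-format changes as commented pseudo-SQL.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
lost="$workdir/demo-bin.000002"          # hypothetical file name

# Background writer standing in for mysqld: it opens the file and keeps it open.
( exec 3>>"$lost"; echo "event 1" >&3; sleep 30 ) &
writer=$!

# Wait until the writer has actually written something.
until [ -s "$lost" ]; do sleep 0.1; done

rm "$lost"                               # the file name is gone from disk

# Scan the writer's descriptors for the deleted file and copy its contents out.
for fd in /proc/"$writer"/fd/*; do
    target=$(readlink "$fd" || true)
    case "$target" in
        "$lost (deleted)") cp "$fd" "$workdir/recovered.000002" ;;
    esac
done

recovered=$(cat "$workdir/recovered.000002")
echo "recovered content: $recovered"

kill "$writer"
wait "$writer" 2>/dev/null || true
rm -rf "$workdir"
```

The copy is only a snapshot: anything the process writes after the cp is not included, so on a busy server the copy would need to be taken after writes have stopped or the log has been rotated.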