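Last weekend the developers and I ran online DDL and UPDATE statements against some tables in the production database. Some unexpected problems came up along the way; this post summarizes what happened, the analysis, and the solutions.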
1. Background:
A column (modified_at) has to be added to the following tables, and its default value changed
table_name {
baby_comp
baby_comp_status
baby_usr
baby_ad_user
baby_camp
baby_ord
baby_acc_eva
}
The following statement is run on each table:
ALTER TABLE `$table_name` ADD COLUMN `modified_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Creation time / last modified time';
The update statement:
UPDATE `baby_camp`
SET `modified_at` = FROM_UNIXTIME(updated_time + 60)
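WHERE `modified_at` <= '1970-01-01 08:00:00';
master: 192.168.100.18 > master; takes writes and is the replication source
slave1: 192.168.100.17 > used for search
slave2: 192.168.100.19 > used for queries
slave3: 192.168.100.10 > used for queries
slave4: 192.168.100.15 > used for backups
Problem 1. Replication crash caused by an insufficient max binlog cache. Affected replicas: 192.168.100.17 (search) and 192.168.100.15 (backup)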
161009 21:42:49 [ERROR] Slave SQL: Could not execute Write_rows event on table baby.baby_delta; Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again, Error_code: 1197; Writing one row to the row-based binary log failed, Error_code: 1534; handler error HA_ERR_RBR_LOGGING_FAILED; the event's master log mysql-bin.007759, end_log_pos 3856759100, Error_code: 1197
161009 21:42:49 [Warning] Slave: Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again Error_code: 1197
161009 21:42:49 [Warning] Slave: Writing one row to the row-based binary log failed Error_code: 1534
161009 21:42:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.007759' position 633959791
161009 21:43:48 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
161009 21:43:48 [Note] Slave I/O thread killed while reading event
161009 21:43:48 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.007760', position 301659
161009 21:43:53 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.007759' at position 633959791, relay log './serverdb01-relay-bin.009618' position: 633959937
161009 21:43:53 [Note] Slave I/O thread: connected to master 'backup@192.168.100.18:3306',replication started in log 'mysql-bin.007760' at position 301659
++++
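Explanation:
The error essentially says: on the replica, the operations on table baby.baby_delta could not be written to the binlog because the multi-statement transaction needed more than max_binlog_cache_size; increase max_binlog_cache_size and retry.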
++++
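One way to act on this error is sketched below. It assumes a MySQL version that accepts the STOP SLAVE / START SLAVE syntax referred to in the log above, and the 4 GiB value is purely illustrative, not the value used in this incident. It would be run on the affected replica:

```sql
-- Check the current limit on the replica.
SHOW GLOBAL VARIABLES LIKE 'max_binlog_cache_size';

-- Stop replication first, so the replication threads start with the new value.
STOP SLAVE;

-- Illustrative value (4 GiB); size it to the largest transaction you expect.
SET GLOBAL max_binlog_cache_size = 4294967296;

START SLAVE;

-- Confirm that both replication threads are running again.
SHOW SLAVE STATUS;
```

A value set with SET GLOBAL is lost on restart, so the same setting would also have to go into the configuration file.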
Problem 2. Replication crash caused by an insufficient max allowed packet. Affected replica: 192.168.100.15 (backup)
161009 21:42:49 [ERROR] Error reading packet from server: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master (server_errno=1236) 131118
161009 21:42:49 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master', Error_code: 1236
++++
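Explanation:
The error essentially says: the binlog packet the replica read from the master exceeded the configured max_allowed_packet; increase this parameter on the master.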
++++
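A minimal sketch of that adjustment follows. The value is an illustrative choice (1 GiB is the largest value MySQL accepts for this variable), not the value used in this incident, and the sketch assumes the STOP SLAVE / START SLAVE syntax of the MySQL version in the logs:

```sql
-- On the master and on the replica: check the current value.
SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';

-- Illustrative value: 1 GiB, the upper bound for this variable.
SET GLOBAL max_allowed_packet = 1073741824;

-- On the replica: restart the I/O thread so it reconnects to the master.
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```

SET GLOBAL only affects connections opened afterwards, which is why the I/O thread is restarted here.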
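We first ran the change on baby_ord alone. This table is large, with a little over 4 million rows.
It has several triggers on INSERT/UPDATE/DELETE that insert the affected rows into the baby_delta table. After the change finished, nothing unusual happened apart from replication lag.
So we over-optimistically assumed that the remaining tables did not hold much data and that nothing beyond replication lag would go wrong, and changed all of them together in a single release.
After DBMigrate we monitored the SQL running on the master. The master finished normally, but replication crashed on replicas 17 and 15.
Looking at the binlog sizes after the last batch of tables was changed, the file mysql-bin.007759 had reached nearly 9G, even though the configuration file limits binlog files to a maximum of 1G.
The column changes for that later batch of tables ran as one transaction, and the binlog written by a single transaction is never split across two binlog files. That is what led to problems 1 and 2 above.
Afterwards we found that the physical file of the babysitter_campagin table is 6G, and several of its columns are of type text.
But why did the binlog file get so large, beyond the configured limit?
The master's binlog format is mixed: the server decides, based on the type of SQL, whether to log in row format or stmt format. It logs in stmt format by default, so when does it log in row format?
1. When the SQL statement is an UPDATE or DELETE (more precisely, when the server judges the statement unsafe to log in stmt format)
The drawback of row format is that it records the change to every single row in detail, so the binlog file becomes very large and takes up more binlog cache.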
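To see which format actually ended up in the binlog, something like the following could be run on the master. The file name and position are the ones from the replica's error log above; this assumes the file has not been purged yet.

```sql
-- Which logging format and file-size limit are in effect on the master?
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'max_binlog_size';

-- Look at the events from the position where the replica's SQL thread stopped.
-- Row-format changes show up as Table_map / Write_rows / Update_rows events,
-- statement-format changes as Query events that contain the SQL text.
SHOW BINLOG EVENTS IN 'mysql-bin.007759' FROM 633959791 LIMIT 20;
```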
mysql> show master logs;
+------------------+------------+
| Log_name | File_size |
+------------------+------------+
| mysql-bin.007758 | 2514487585 |
| mysql-bin.007759 | 9107651572 |
+------------------+------------+
mysql> desc baby_camp;
+-----------------------------------+-----------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------------------+-----------------------+------+-----+-------------------+-----------------------------+
| content | mediumtext | YES | | NULL | |
| tweet_url | text | YES | | NULL | |
| note | text | YES | | NULL | |
| requirement | text | YES | | NULL | |
(some rows omitted ...)
+-----------------------------------+-----------------------+------+-----+-------------------+-----------------------------+
mysql> select count(*) from baby_camp;
+----------+
| count(*) |
+----------+
| 1131460 |
+----------+
1 row in set (0.50 sec)
mysql> select count(*) from baby_delta;
+----------+
| count(*) |
+----------+
| 10136301 |
+----------+
1 row in set (1.12 sec)
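1. Avoid adding or altering columns or default values if you can. That seems an unreasonable demand, though; what has to be done still has to be done :(
2. When several tables need column changes and similar operations, process them in batches to reduce the binlog generated at one time. It is more work, but safety and stability matter more.
3. The last resort is to brute-force the database parameters. The drawback is that some parameters require restarting the database instance.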
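The batching idea in point 2 can also be applied within a single table. The sketch below splits the UPDATE from the beginning of this post into key ranges. The `id` column and the batch size of 50000 are assumptions for illustration only; the post does not show the tables' key columns.

```sql
-- `id` is a hypothetical integer primary key (not shown in the original post).
-- With autocommit on, each statement is its own transaction, so each one
-- writes a bounded amount of binlog instead of one multi-gigabyte transaction.
UPDATE `baby_camp`
SET `modified_at` = FROM_UNIXTIME(updated_time + 60)
WHERE `modified_at` <= '1970-01-01 08:00:00'
  AND `id` BETWEEN 1 AND 50000;

UPDATE `baby_camp`
SET `modified_at` = FROM_UNIXTIME(updated_time + 60)
WHERE `modified_at` <= '1970-01-01 08:00:00'
  AND `id` BETWEEN 50001 AND 100000;

-- Continue with the next range until the highest id has been covered.
```

Smaller batches also give the replicas a chance to catch up between statements, at the cost of a longer overall run.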
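There is only one replication source (the master). Replication crashed on 17 and 15 (note: both have binlog enabled in row format and both hit problem 1, but only 15 hit both problem 1 and problem 2), while the two query-only replicas 19 and 10 (note: neither has binlog enabled) did not hit problem 2. Why is still unclear.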