技術分享 | MySQL：產生大量小 relay log 的故障一例

時間 2020-01-16

標籤技術分享 mysql 產生量小 relay log 故障一例欄目 MySQL 简体版

原文原文鏈接

做者：高鵬
文章末尾有他著做的《深刻理解 MySQL 主從原理 32 講》，深刻透徹理解 MySQL 主從，GTID 相關技術知識。
源碼版本：5.7.22
水平有限，若有誤請諒解

本文建議橫屏觀看，效果更佳html

1、案例來源和現象

這個案例是朋友 @peaceful 遇到的線上問題，最終線索也是他本身找到的。現象以下：mysql

一、出現了大量很小的 relay log 以下，堆積量大約 2600 個：

...
-rw-r----- 1 mysql dba    12827 Oct 11 12:28 mysql-relay-bin.036615
-rw-r----- 1 mysql dba     4908 Oct 11 12:28 mysql-relay-bin.036616
-rw-r----- 1 mysql dba     1188 Oct 11 12:28 mysql-relay-bin.036617
-rw-r----- 1 mysql dba     5823 Oct 11 12:29 mysql-relay-bin.036618
-rw-r----- 1 mysql dba      507 Oct 11 12:29 mysql-relay-bin.036619
-rw-r----- 1 mysql dba     1188 Oct 11 12:29 mysql-relay-bin.036620
-rw-r----- 1 mysql dba     3203 Oct 11 12:29 mysql-relay-bin.036621
-rw-r----- 1 mysql dba    37916 Oct 11 12:30 mysql-relay-bin.036622
-rw-r----- 1 mysql dba      507 Oct 11 12:30 mysql-relay-bin.036623
-rw-r----- 1 mysql dba     1188 Oct 11 12:31 mysql-relay-bin.036624
-rw-r----- 1 mysql dba     4909 Oct 11 12:31 mysql-relay-bin.036625
-rw-r----- 1 mysql dba     1188 Oct 11 12:31 mysql-relay-bin.036626
-rw-r----- 1 mysql dba      507 Oct 11 12:31 mysql-relay-bin.036627
-rw-r----- 1 mysql dba      507 Oct 11 12:32 mysql-relay-bin.036628
-rw-r----- 1 mysql dba     1188 Oct 11 12:32 mysql-relay-bin.036629
-rw-r----- 1 mysql dba      454 Oct 11 12:32 mysql-relay-bin.036630
-rw-r----- 1 mysql dba     6223 Oct 11 12:32 mysql-relay-bin.index

二、主庫錯誤日誌有以下錯誤

2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421).
2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4)
2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425).
2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4)

2、slave_net_timeout 參數分析

實際上第一眼看這個案例我也以爲很奇怪，由於不多有人會去設置 slave_net_timeout 參數，一樣咱們也沒有設置過，所以關注較少。可是 @peaceful 本身找到了可能出現問題的設置就是當前從庫 slave_net_timeout 參數設置爲 10。我就順着這個線索往下分析，咱們先來看看 slave_net_timeout 參數的功能。sql

當前看來從庫的 slave_net_timeout 有以下兩個功能：less

一、設置 IO 線程在空閒狀況下（沒有 Event 接收的狀況下）的鏈接超時時間。

這個參數 5.7.7 事後是 60 秒，之前是 3600 秒，修改後須要重啓主從纔會生效。ide

二、若是 change master 沒有指定 MASTER_HEARTBEAT_PERIOD 的狀況下會設置爲 slave_net_timeout/2

通常咱們配置主從都沒有去指定這個心跳週期，所以就是 slave_net_timeout/2，它控制的是若是在主庫沒有 Event 產生的狀況下，多久發送一個心跳 Event 給從庫的 IO 線程，用於保持鏈接。可是一旦咱們配置了主從（change master）這個值就定下來了，不會隨着 slave_net_timeout 參數的更改而更改，咱們能夠在 slave_master_info 表中找到相應的設置以下：ui

mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.01 sec)

若是咱們要更改這個值只能從新 change master 才行。spa

3、緣由總結

若是知足下面三個條件，將會出現案例中的故障：線程

主從中的 MASTER_HEARTBEAT_PERIOD 的值大於從庫 slave_net_timeout
主庫當前壓力很小持續 slave_net_timeout 設置時間沒有產生新的 Event
以前主從有必定的延遲

那麼這種狀況下在主庫心跳 Event 發送給從庫的IO線程以前，IO 線程已經斷開了。斷開後 IO 線程會進行重連，每次重連將會生成新的 relay log，可是這些 relay log 因爲延遲問題不能清理就出現了案例中的狀況。日誌

下面是官方文檔中關於這部分說明：code

If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen
as the value of the Heartbeat column of the mysql.slave_master_info table.
Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the
value of slave_net_timeout divided by 2.
Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval
results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to
reset it to the default value.

4、案例模擬

有了理論基礎就很好了模擬了，延遲這一點我模擬的時候關閉了從庫的 SQL 線程來模擬 relay log 積壓的狀況，由於這個案例和 SQL 線程沒有太多的關係。

提早配置好主從，查看當前的心跳週期和 slave_net_timeout 參數以下：

mysql> show variables like '%slave_net_timeout%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| slave_net_timeout | 60    |
+-------------------+-------+
1 row in set (0.01 sec)

mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.00 sec)

一、中止從庫的 SQL 線程

stop slave sql_thread;

二、設置 slave_net_timeout 爲 10

mysql> set global slave_net_timeout=10;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

能夠看到這裏實際上已經有一個警告了。

三、重啓 IO 線程

這樣纔會讓 slave_net_timeout 參數生效

mysql> stop slave ;
Query OK, 0 rows affected (0.01 sec)

mysql> start slave io_thread;
Query OK, 0 rows affected (0.01 sec)

四、觀察現象

大概每 10 秒會生成一個 relay log 文件以下：

-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:32.655001361 +0800 relay.000142
-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:42.943001355 +0800 relay.000143
-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:53.293001363 +0800 relay.000144
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:03.502000598 +0800 relay.000145
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:13.799001357 +0800 relay.000146
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:24.055001354 +0800 relay.000147
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:34.280001827 +0800 relay.000148
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:44.496001365 +0800 relay.000149
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:54.789001353 +0800 relay.000150
-rw-r----- 1 mysql mysql        500 2019-09-27 23:50:05.485001371 +0800 relay.000151
-rw-r----- 1 mysql mysql        500 2019-09-27 23:50:15.910001430 +0800 relay.000152

大概每 10 秒主庫的日誌會輸出以下日誌：

2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216).
2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4)
2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217).
2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4)
2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218).
2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4)
2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219).
2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4)

這個日誌就和案例中的如出一轍了。

解決問題

知道緣由後解決也就很簡單了咱們只需設置 slave_net_timeout 參數爲 MASTER_HEARTBEAT_PERIOD 的 2 倍就能夠了，設置後重啓主從便可。

5、實現方式

這裏咱們將經過簡單的源碼調用分析來看看到底 slave_net_timeout 參數和 MASTER_HEARTBEAT_PERIOD 對主從的影響。

一、從庫使用參數 slave_net_timeout

從庫IO線程啓動時候會經過參數 slave_net_timeout 設置超時：

->connect_to_master
  -> mysql_options

case MYSQL_OPT_CONNECT_TIMEOUT: //MYSQL_OPT_CONNECT_TIMEOUT
    mysql->options.connect_timeout= *(uint*) arg;
    break;

而在創建和主庫的鏈接時候會使用這個值

connect_to_master
 ->mysql_real_connect
   -> get_vio_connect_timeout

timeout_sec= mysql->options.connect_timeout;

所以咱們也看到了 slave_net_timeout 參數只有在 IO 線程重啓的時候纔會生效

二、從庫設置 MASTER_HEARTBEAT_PERIOD 值

在每次使用從庫 change master 時候會設置這個值以下，默認爲 slave_net_timeout/2：

->change_master
  ->change_receive_options


    mi->heartbeat_period= min<float>(SLAVE_MAX_HEARTBEAT_PERIOD,
                                     (slave_net_timeout/2.0f));

所以咱們看到只有 change master 纔會從新設置這個值，重啓主從是不會從新設置的。

三、使用 MASTER_HEARTBEAT_PERIOD 值

每次 IO 線程啓動時候會將這個值傳遞給主庫的 DUMP 線程，方式應該是經過構建語句 ‘SET @master_heartbeat_period’ 來完成的。以下：

->handle_slave_io
 ->get_master_version_and_clock


if (mi->heartbeat_period != 0.0)
  {
    char llbuf[22];
    const char query_format[]= "SET @master_heartbeat_period= %s";
    char query[sizeof(query_format) - 2 + sizeof(llbuf)];

主庫啓動 DUMP 線程的時候會經過搜索的方式找到這個值以下

->Binlog_sender::init
   ->Binlog_sender::init_heartbeat_period

user_var_entry *entry=
    (user_var_entry*) my_hash_search(&m_thd->user_vars, (uchar*) name.str,
                                     name.length);
  m_heartbeat_period= entry ? entry->val_int(&null_value) : 0;

四、DUMP 線程使用 MASTER_HEARTBEAT_PERIOD 發送心跳 Event

這裏主要是經過一個超時等待來完成，以下：

->Binlog_sender::wait_new_events
  ->Binlog_sender::wait_with_heartbeat
 

set_timespec_nsec(&ts, m_heartbeat_period); //心跳超時
    ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//等待
    if (ret != ETIMEDOUT && ret != ETIME) //若是是正常收到則收到信號，說明有新的Event到來，不然若是是超時則發送心跳Event
      break; //正常返回0 是超時返回ETIMEDOUT 繼續循環

      if (send_heartbeat_event(log_pos)) //發送心跳Event
        return 1;

五、重連會殺掉可能的存在的 DUMP 線程

根據 UUID 進行比對以下：

->kill_zombie_dump_threads


Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid);
THD *tmp= Global_THD_manager::get_instance()->
                                find_thd(&find_zombie_dump_thread);
  if (tmp)
  {
    /*
      Here we do not call kill_one_thread() as
      it will be slow because it will iterate through the list
      again. We just to do kill the thread ourselves.
    */
    if (log_warnings > 1)
    {
      if (slave_uuid.length())
      {
        sql_print_information("While initializing dump thread for slave with "
                              "UUID <%s>, found a zombie dump thread with the "
                              "same UUID. Master is killing the zombie dump "
                              "thread(%u).", slave_uuid.c_ptr(),
                              tmp->thread_id());
      }//這裏就是本案例中的日誌了
.....

這裏咱們看到了案例中的日誌。