Case study - data inconsistency caused by an IP switch

Date: 2021-01-22


The business team asked: why is this record missing in datacenter 10 when the other datacenters have it?

mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid        | tm_timestamp | tm_lasttime | gid                 | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version           | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx |   1495773704 |  1495773704 | xxxxxxxxxxx | 處對象     |            0 |          5 |  3611732366 | vx:wtc2033      |      0 |     18 |        8 |           0 |                 0 |             1 |            0 |         0 |          |        0 | 6126694332813803019 |         0 |
+------------+--------------+-------------+---------------------+-------

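My first assessment was that replication into datacenter 10 was broken. The plan was to find out which datacenter the record had been inserted from, and then check whether replication between datacenter 10 and that datacenter was working. Logging in via 8827, I fetched the record's version number, __version, and decoded it with a conversion function. The record had been inserted from datacenter 14: date 2017-05-26 05:03:03, datacenter 14, port number 11.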

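This is comparable to the binlog in MySQL, which records the originating server-id for every SQL statement in order to prevent circular replication. myshard goes further than recording the server-id in the binlog: every record carries a version number that encodes which datacenter and which port it was written from, and when it was written to the database.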
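To illustrate why the originating server-id matters, here is a minimal sketch of the loop-prevention idea: a node skips replicated events that it originally produced itself. The `Event` class and `events_to_apply` function are hypothetical names made up for this illustration; this is not myshard's or MySQL's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    server_id: int   # id of the server where the write first happened
    statement: str   # the replicated statement (placeholder text here)


def events_to_apply(local_server_id: int, incoming: Iterable[Event]) -> List[Event]:
    """Keep only events that did not originate on this server.

    An event carrying our own server-id has travelled around the
    replication ring and come back; applying it again would loop forever.
    """
    return [e for e in incoming if e.server_id != local_server_id]


incoming = [
    Event(server_id=14, statement="INSERT ... (written in datacenter 14)"),
    Event(server_id=10, statement="INSERT ... (our own write, echoed back)"),
]
applied = events_to_apply(10, incoming)
print([e.server_id for e in applied])
```

Only the event that originated elsewhere is applied; the echoed local write is dropped.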

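At this point we know that data written in datacenter 14 could not be replicated to datacenter 10, so the next step is to check the sync status on datacenter 14: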

[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local             Read_offset             48494420885     
shard_local             Read_speed              33373           
shard_local             Read_bytes_behind       0                    
sync_r12m0              Read_offset             48494420885     
sync_r12m0              Read_speed              33373           
sync_r12m0              Read_bytes_behind       0               
sync_r13m0              Read_offset             48494420885     
sync_r13m0              Read_speed              33373           
sync_r13m0              Read_bytes_behind       0               
sync_r1m0               Read_offset             48494420885     
sync_r1m0               Read_speed              33373           
sync_r1m0               Read_bytes_behind       0               
sync_r3m0               Read_offset             48494420885     
sync_r3m0               Read_speed              33373           
sync_r3m0               Read_bytes_behind       0               
shard_remote            Read_offset             52080697507     
shard_remote            Read_speed              27290           
shard_remote            Read_bytes_behind       0
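Checking this output for a missing peer can be made mechanical. The sketch below parses the three-column stat lines and reports which expected sync clients do not appear. The `EXPECTED` list is an assumption made for illustration; the real set of peers depends on the deployment.

```python
def readers_in_stat(stat_output: str) -> set:
    """Return the reader names (first column) seen in the stat output."""
    names = set()
    for line in stat_output.splitlines():
        fields = line.split()
        # Lines look like: <reader> <metric> <value>
        if len(fields) == 3 and fields[1].startswith("Read_"):
            names.add(fields[0])
    return names


def missing_peers(stat_output: str, expected) -> list:
    """List expected sync clients that are not pulling from this node."""
    return sorted(set(expected) - readers_in_stat(stat_output))


SAMPLE = """\
shard_local             Read_bytes_behind       0
sync_r12m0              Read_bytes_behind       0
sync_r13m0              Read_bytes_behind       0
sync_r1m0               Read_bytes_behind       0
sync_r3m0               Read_bytes_behind       0
shard_remote            Read_bytes_behind       0
"""

# Assumed list of peers that should be connected (hypothetical).
EXPECTED = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]

print(missing_peers(SAMPLE, EXPECTED))
```

Run periodically, a check like this would flag a silent peer long before the business notices missing rows.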

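There was no r10m0 client pulling data, which confirmed that replication was broken. I then checked the sync log on datacenter 10 and saw it repeatedly trying to reconnect to the datacenter 14 endpoint: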

[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more                                     
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2

There were many log entries showing repeated attempts to connect to datacenter 14. The earliest reconnect appears in the file db_sync_xxxxxxxx_r10m0_d.log.13.gz, which is dated May 14:

-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz 
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz 
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz 
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz 
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz 
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz  
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz  
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz  
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz  
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz  
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz  
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz  
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz  
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
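Finding the first failed attempt does not have to be done by eye. The following sketch scans log lines in the format shown earlier and returns the timestamp of the first connect line for a peer whose retry counter is non-zero. Treating a non-zero counter as evidence that the previous attempt failed is an assumption about the log format, and the function names are hypothetical.

```python
import gzip
import re
from typing import Iterable, Optional

CONNECT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*"
    r"connecting to (?P<peer>\S+) via (?P<addr>\S+), retry:(?P<retry>\d+)"
)


def first_failed_connect(lines: Iterable[str], peer: str) -> Optional[str]:
    """Timestamp of the first connect line for `peer` with retry > 0."""
    for line in lines:
        m = CONNECT_RE.search(line)
        if m and m.group("peer") == peer and int(m.group("retry")) > 0:
            return m.group("ts")
    return None


def first_failed_connect_in_gz(path: str, peer: str) -> Optional[str]:
    """Same check, reading a rotated .gz log the way zcat would."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return first_failed_connect(fh, peer)


SAMPLE_LINES = [
    "May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] "
    "[OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0",
    "May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] "
    "[OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1",
]

print(first_failed_connect(SAMPLE_LINES, "sync_r14m0"))
```

Running the gz variant over the rotated files from oldest to newest would locate the file in which the trouble started.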

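Repeated reconnects usually have only two possible causes. One is that datacenter 14 has not whitelisted datacenter 10 and refuses its connections. Replication had been set up successfully earlier, though, so the whitelist must have been opened at that time, which made a firewall problem the most likely explanation. So on datacenter 14 I ran: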

iptables -n -L | grep <IP of datacenter 10>

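The China Telecom IP had a firewall rule, but the China Unicom IP did not. Datacenter 10 is a dual-line datacenter, and I had deployed the environment on May 12. Two days after deployment, poor network quality made the Telecom path unusable and traffic switched to the Unicom path. Because the Unicom IP had never been authorized, datacenter 10 could no longer connect to datacenter 14.

The business was not using this database at the time. Only yesterday, May 25, when the business began deploying processes in datacenter 14 and found that the data was not replicating, did they contact the DBA. I immediately added the firewall rule and restarted the sync process to pull the data again, but datacenter 10 kept logging errors and reconnecting.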

Datacenter 14, meanwhile, showed a different error:

May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132

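It kept reporting that offset 2587613132 does not exist and cannot be pulled. The connection had been down for too long, and datacenter 14 keeps its binlog for only 7 days: the 14th plus 7 days means the offset was gone from the 21st. The fix was to find an offset on datacenter 14 that still existed and had not yet been purged, and have datacenter 10 pull from there. I then asked the business which tables had been written to in datacenter 14, exported the data of those tables from datacenter 14, and imported it into datacenter 10.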
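The retention arithmetic above can be written down as a small check. This is a sketch under stated assumptions: the year 2017 is taken from the decoded version timestamp, May 14 is the disconnection date used in the text, and the function names are hypothetical.

```python
from datetime import date, timedelta


def purge_date(last_synced: date, retention_days: int) -> date:
    """Day on which logs written on `last_synced` fall out of retention."""
    return last_synced + timedelta(days=retention_days)


def offset_still_available(last_synced: date, retention_days: int, today: date) -> bool:
    """True while the replica can still resume from its saved offset."""
    return today < purge_date(last_synced, retention_days)


last_synced = date(2017, 5, 14)   # replication stopped
retention_days = 7                # binlog retention on datacenter 14

print(purge_date(last_synced, retention_days))                                  # 2017-05-21
print(offset_still_available(last_synced, retention_days, date(2017, 5, 26)))   # False
```

By the time the gap was noticed on May 26, the saved offset had been out of retention for five days, which is why a plain restart could not recover it.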

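One advantage of myshard is that missing data can be repaired by exporting and importing, whereas with MySQL the only option is Percona's repair tools. This was also a lesson for me: when a datacenter's network conditions are poor, every one of its IPs must be whitelisted, not just one. The business also needed its data backfilled, and we held a meeting afterwards to agree on a few rules.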
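The lesson about opening every IP can be turned into a pre-deployment check. The sketch below parses the text printed by `iptables -n -L` and reports which of a peer's IPs are not covered by any ACCEPT rule. The sample rule and the addresses (taken from documentation ranges) are made up for illustration, and the check deliberately ignores chains, ports and negated sources.

```python
import ipaddress


def accepted_sources(iptables_output: str) -> list:
    """Collect source networks from ACCEPT lines of `iptables -n -L` output."""
    nets = []
    for line in iptables_output.splitlines():
        fields = line.split()
        # Rule lines look like: target prot opt source destination [extras]
        if len(fields) >= 5 and fields[0] == "ACCEPT":
            try:
                nets.append(ipaddress.ip_network(fields[3], strict=False))
            except ValueError:
                continue  # e.g. negated sources such as "!10.0.0.0/8"
    return nets


def unlisted_ips(iptables_output: str, peer_ips) -> list:
    """Return the peer IPs that no ACCEPT rule covers."""
    nets = accepted_sources(iptables_output)
    return [ip for ip in peer_ips
            if not any(ipaddress.ip_address(ip) in net for net in nets)]


SAMPLE_RULES = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  192.0.2.10           0.0.0.0/0            tcp dpt:12505
"""

# Hypothetical dual-line peer: one Telecom address, one Unicom address.
PEER_IPS = ["192.0.2.10", "198.51.100.10"]

print(unlisted_ips(SAMPLE_RULES, PEER_IPS))
```

Here the second address is reported as uncovered, which is the situation that broke replication in this incident.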