Case study: data inconsistency caused by an IP switch

The business team asked: why is this record missing from data center 10 when the other data centers have it?

mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid        | tm_timestamp | tm_lasttime | gid                 | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version           | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx |   1495773704 |  1495773704 | xxxxxxxxxxx | 處對象     |            0 |          5 |  3611732366 | vx:wtc2033      |      0 |     18 |        8 |           0 |                 0 |             1 |            0 |         0 |          |        0 | 6126694332813803019 |         0 |
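+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+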

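The initial judgment was that data sync to data center 10 was broken. The plan: first find out which data center inserted this record, then check whether sync between data center 10 and that data center is working. Logging in on 8827 and taking the record's version number, __version, a conversion function shows that the record was inserted from data center 14. Date: 2017-05-26 05:03:03, data center: 14, port: 11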


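This is comparable to the binlog in MySQL, which records the server-id that each SQL statement came from in order to prevent circular replication. myshard goes further than recording the server-id in the binlog: every record carries a version number that encodes which data center and which port wrote it, and when it was written to the database.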


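At this point we know that data written in data center 14 is not reaching data center 10, so the next step is to check the sync status on data center 14: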

[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local             Read_offset             48494420885     
shard_local             Read_speed              33373           
shard_local             Read_bytes_behind       0                    
sync_r12m0              Read_offset             48494420885     
sync_r12m0              Read_speed              33373           
sync_r12m0              Read_bytes_behind       0               
sync_r13m0              Read_offset             48494420885     
sync_r13m0              Read_speed              33373           
sync_r13m0              Read_bytes_behind       0               
sync_r1m0               Read_offset             48494420885     
sync_r1m0               Read_speed              33373           
sync_r1m0               Read_bytes_behind       0               
sync_r3m0               Read_offset             48494420885     
sync_r3m0               Read_speed              33373           
sync_r3m0               Read_bytes_behind       0               
shard_remote            Read_offset             52080697507     
shard_remote            Read_speed              27290           
shard_remote            Read_bytes_behind       0
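
Checking this kind of output for a peer that should be pulling but is absent can be scripted. The following is a minimal sketch, assuming the three-column layout shown above (channel, metric, value) and assuming that the list of expected peers is maintained by the operator; `EXPECTED_PEERS` and the function names are illustrative and not part of myshard.

```python
def parse_stat(text):
    """Parse 'channel  metric  value' rows into {channel: {metric: int}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed rows: three columns with a numeric value.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        channel, metric, value = parts
        stats.setdefault(channel, {})[metric] = int(value)
    return stats


def missing_peers(stats, expected_peers):
    """Peers that should be pulling from this node but have no rows at all."""
    return sorted(p for p in expected_peers if p not in stats)


def lagging_channels(stats, max_bytes_behind=0):
    """Channels whose Read_bytes_behind exceeds the allowed threshold."""
    return sorted(
        name for name, metrics in stats.items()
        if metrics.get("Read_bytes_behind", 0) > max_bytes_behind
    )


# Abbreviated copy of the stat output shown above.
SAMPLE = """\
shard_local   Read_offset        48494420885
shard_local   Read_bytes_behind  0
sync_r12m0    Read_offset        48494420885
sync_r12m0    Read_bytes_behind  0
sync_r13m0    Read_bytes_behind  0
sync_r1m0     Read_bytes_behind  0
sync_r3m0     Read_bytes_behind  0
shard_remote  Read_bytes_behind  0
"""

# Assumption: this list is kept by the operator; it is not read from myshard.
EXPECTED_PEERS = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]

stats = parse_stat(SAMPLE)
print(missing_peers(stats, EXPECTED_PEERS))   # ['sync_r10m0']
print(lagging_channels(stats))                # []
```

A peer that is missing entirely is a different failure from a peer that is merely behind, which is why the sketch reports the two separately: a lag threshold alone would not have caught this incident.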

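There is no r10m0 client pulling data, which confirms that sync is broken. The sync log on data center 10 shows it reconnecting to the data center 14 node over and over: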

[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more                                     
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2

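There are many log lines showing repeated attempts to connect to data center 14. The earliest reconnects appear in the file db_sync_xxxxxxxx_r10m0_d.log.13.gz, which was rotated on May 14 (its entries are dated May 13):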

-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz 
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz 
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz 
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz 
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz 
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz  
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz  
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz  
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz  
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz  
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz  
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz  
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz  
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
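
Finding the first connection attempt across many rotated files is tedious by hand. Below is a rough sketch, assuming the log line format shown above; the regular expression, the function names, and the `year` parameter (syslog-style timestamps carry no year) are assumptions made for illustration.

```python
import gzip
import re
from datetime import datetime

# Matches lines such as:
# May 13 15:05:31 info ...: [...] connecting to sync_r14m0 via <addr>, retry:0
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"connecting to (?P<peer>\S+) via (?P<addr>\S+), retry:(?P<retry>\d+)"
)


def read_lines(path):
    """Yield text lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        yield from fh


def connect_attempts(lines, year):
    """Return {peer: (first_attempt, attempt_count)} for 'connecting to' lines."""
    summary = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        first, count = summary.get(m["peer"], (when, 0))
        summary[m["peer"]] = (min(first, when), count + 1)
    return summary


sample = [
    "May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0",
    "May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1",
    "May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2",
]
first, count = connect_attempts(sample, 2017)["sync_r14m0"]
print(first, count)
```

For real files, the lines from each rotated log can be fed in with `connect_attempts(read_lines(path), 2017)` and the earliest `first_attempt` across files taken as the start of the outage.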

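Repeated reconnects usually have only two possible causes. One is that data center 14 has not whitelisted data center 10, so data center 10 is not allowed to connect. But the setup had worked before, so the whitelist must have been opened. That left the firewall as the likely culprit, so on data center 14 I ran: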

iptables -n -L | grep <IP of data center 10>

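The Telecom IP had a rule, but the Unicom IP had no firewall rule at all. This is a dual-line data center, and I had deployed the environment on May 12. So within two days of deployment, poor network quality made the Telecom link unusable and traffic switched to the Unicom link. The Unicom IP was not authorized, so data center 10 could no longer connect to data center 14. The business was not using this database at the time. Yesterday, May 25, the business began deploying processes in data center 14, found that data was not syncing, and only then contacted the DBA. I immediately added the firewall rule and restarted the sync process to pull the data again, but data center 10 kept reporting errors and reconnecting.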
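
The gap described above, where one of a peer's two IPs has no rule, could be caught by checking every IP of the peer against the rule listing at once. This is a simplified sketch that assumes the usual `iptables -n -L` column layout (target, prot, opt, source, destination). It only looks at ACCEPT rules and ignores chain order, negated sources, and default policies. The addresses are documentation placeholders, not the real ones.

```python
import ipaddress


def accepted_sources(iptables_output, port):
    """Collect source networks from ACCEPT rules that apply to the given port."""
    networks = []
    for line in iptables_output.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "ACCEPT":
            continue
        # A rule with a 'dpt:' match only counts if it names our port;
        # a rule without any 'dpt:' match is treated as covering all ports.
        dports = [p for p in parts[5:] if p.startswith("dpt:")]
        if dports and f"dpt:{port}" not in dports:
            continue
        try:
            networks.append(ipaddress.ip_network(parts[3], strict=False))
        except ValueError:
            continue  # e.g. a negated source such as '!192.0.2.0/24'
    return networks


def unauthorized_ips(peer_ips, iptables_output, port):
    """Return the peer IPs that no ACCEPT rule covers for this port."""
    networks = accepted_sources(iptables_output, port)
    return [
        ip for ip in peer_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# Hypothetical listing: the Telecom IP has a rule, the Unicom IP does not.
LISTING = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  192.0.2.10           0.0.0.0/0            tcp dpt:12505
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:22
"""
PEER_IPS = ["192.0.2.10", "198.51.100.10"]  # placeholder Telecom and Unicom IPs

print(unauthorized_ips(PEER_IPS, LISTING, 12505))   # ['198.51.100.10']
```

Running such a check right after deployment, with the full IP list of each multi-line peer, would flag the missing rule before a link switch exposes it.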


On data center 14, meanwhile, a different error shows up:

May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132

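The log keeps reporting that offset 2587613132 does not exist and cannot be pulled. The cause is that the connection had been down for too long while data center 14 keeps its binlog for only 7 days: May 14 + 7 days = May 21, the point from which the requested offset was no longer available. The fix was to find an offset on data center 14 that still exists and has not yet been purged, and have data center 10 pull from there. I then asked the business which tables had been written to in data center 14, exported those tables from data center 14, and imported them into data center 10.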
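
The retention arithmetic above can be written down as a small check. This is only a sketch of the reasoning: the function names and the "resume" / "reseed" labels are made up for illustration, and whether the boundary day itself is still recoverable depends on how the log is actually purged.

```python
from datetime import date, timedelta


def offset_expiry(last_sync, retention_days):
    """Date from which the last synced offset may already be purged."""
    return last_sync + timedelta(days=retention_days)


def recovery_plan(last_sync, today, retention_days=7):
    """'resume' if the old offset should still exist, otherwise 'reseed'."""
    if today < offset_expiry(last_sync, retention_days):
        return "resume"   # restart sync from the saved offset
    return "reseed"       # pick a surviving offset and backfill via export/import


last_sync = date(2017, 5, 14)          # the date used in the text
print(offset_expiry(last_sync, 7))     # 2017-05-21
print(recovery_plan(last_sync, date(2017, 5, 26)))   # reseed
```

An alert that fires well before `offset_expiry` leaves time to fix the link and resume from the saved offset instead of backfilling by hand.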


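An advantage of myshard is that missing data can be repaired by exporting and importing it, whereas with MySQL one has to fall back on Percona's repair tools. This was also a lesson for me: where data center network conditions are poor, access must be opened for every one of a peer's IPs, not just some of them. The business also needed its data backfilled. In the review meeting afterwards we agreed on a few rules: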


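  1. Monitoring of myshard must be thorough, so that replication lag is detected promptly

  2. When business staff apply for database access, multi-line data centers must provide all of their IPs (Telecom IP, Unicom IP, intranet IP, management network IP)

  3. myshard should use consistent hashing: for a given user, data is modified in the same data center where it was written

  4. While sync is lagging, do not switch business traffic between nodes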
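
Rule 3 can be illustrated with a generic consistent-hash ring that pins each user to one data center. This is a textbook sketch and not myshard's actual routing; the class name, the virtual-node count, and the data center labels are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        index = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[index][1]


# Illustrative data center labels, following the rNN naming seen in the logs.
DATA_CENTERS = ["r1", "r3", "r10", "r12", "r13", "r14"]
ring = HashRing(DATA_CENTERS)

user = "3611732366"
# The same user always maps to the same data center for writes and updates.
print(ring.node_for(user) == ring.node_for(user))   # True
```

With this scheme, removing one data center from the ring only remaps the users that were assigned to it; everyone else keeps writing where they wrote before, which limits conflicting updates to the same row from different data centers.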
