The business team asked: why is this record missing from data center 10 when every other data center has it?
mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid | tm_timestamp | tm_lasttime | gid | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx | 1495773704 | 1495773704 | xxxxxxxxxxx | 處對象 | 0 | 5 | 3611732366 | vx:wtc2033 | 0 | 18 | 8 | 0 | 0 | 1 | 0 | 0 | | 0 | 6126694332813803019 | 0 |
+------------+--------------+-------------+---------------------+-------
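My first guess was that data synchronization into data center 10 was broken. The plan: find out which data center this record was inserted from, then check whether sync between data center 10 and that data center is working. Logging in through 8827, I read the record's version number, __version, and ran it through the conversion function, which showed the record had been inserted from data center 14:
Date: 2017-05-26 05:03:03  Data center: 14  Port: 11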
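This is comparable to the binlog in MySQL, which records the server-id each SQL statement came from in order to prevent circular replication. myshard does more than record the server-id in its binlog: every record carries a version number that encodes which data center and which port it was written from, and when it was written to the database.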
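At this point we know that data written in data center 14 is not reaching data center 10, so the next step is to run the sync status command on data center 14: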
[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local Read_offset 48494420885
shard_local Read_speed 33373
shard_local Read_bytes_behind 0
sync_r12m0 Read_offset 48494420885
sync_r12m0 Read_speed 33373
sync_r12m0 Read_bytes_behind 0
sync_r13m0 Read_offset 48494420885
sync_r13m0 Read_speed 33373
sync_r13m0 Read_bytes_behind 0
sync_r1m0 Read_offset 48494420885
sync_r1m0 Read_speed 33373
sync_r1m0 Read_bytes_behind 0
sync_r3m0 Read_offset 48494420885
sync_r3m0 Read_speed 33373
sync_r3m0 Read_bytes_behind 0
shard_remote Read_offset 52080697507
shard_remote Read_speed 27290
shard_remote Read_bytes_behind 0
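Reading this output by eye works, but a missing peer is easy to overlook. Below is a minimal Python sketch of how the check could be scripted. It assumes the stat output is always three whitespace-separated fields per line (channel, metric, value), as in the sample above, and that the list of expected peers is maintained by hand; `EXPECTED_PEERS` and the function names are hypothetical, not part of myshard.

```python
from collections import defaultdict

# Hypothetical list of peers that should be pulling from this node.
EXPECTED_PEERS = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]

# A few lines copied from the stat output above.
SAMPLE = """\
shard_local Read_bytes_behind 0
sync_r12m0 Read_bytes_behind 0
sync_r13m0 Read_bytes_behind 0
sync_r1m0 Read_bytes_behind 0
sync_r3m0 Read_bytes_behind 0
shard_remote Read_speed 27290
shard_remote Read_bytes_behind 0
"""


def parse_stat(text):
    """Parse 'channel metric value' lines into {channel: {metric: int}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # not a three-field stat line, ignore it
        channel, metric, value = parts
        try:
            stats[channel][metric] = int(value)
        except ValueError:
            continue  # value is not an integer, ignore the line
    return dict(stats)


def missing_peers(stats, expected):
    """Return the expected channels that do not appear in the stat output."""
    return sorted(set(expected) - set(stats))


def lagging_peers(stats, max_bytes_behind=0):
    """Return channels whose Read_bytes_behind exceeds the threshold."""
    return sorted(
        channel
        for channel, metrics in stats.items()
        if metrics.get("Read_bytes_behind", 0) > max_bytes_behind
    )


if __name__ == "__main__":
    stats = parse_stat(SAMPLE)
    print("missing:", missing_peers(stats, EXPECTED_PEERS))
    print("lagging:", lagging_peers(stats))
```

In practice the text would come from piping the `echo stat | /scripts/nc_myshard ...` command into the script on a schedule rather than from a hard-coded sample.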
There is no r10m0 peer pulling data, which confirms that sync is broken. The sync log on data center 10 shows it reconnecting to data center 14 over and over:
[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
The log is full of repeated attempts to connect to data center 14. The earliest reconnect appears in the file db_sync_xxxxxxxx_r10m0_d.log.13.gz, and that file is dated May 14:
-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
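Hunting for the first reconnect by running zcat on one rotated file at a time is tedious. Here is a rough Python sketch of how the search could be automated. It assumes the log line layout shown above, takes the year as a parameter because syslog-style timestamps omit it, and relies on an English locale for the month abbreviation. `first_connect_attempt` and `iter_gz_lines` are hypothetical helper names.

```python
import gzip
import re
from datetime import datetime

# Matches lines such as:
# May 13 15:05:31 info ...: [tid:77428] [OpLogSynchronizer::run]
#   connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
CONNECT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"connecting to (?P<peer>\S+) via (?P<addr>\S+), retry:(?P<retry>\d+)"
)


def iter_gz_lines(paths):
    """Yield text lines from a list of gzip-compressed log files."""
    for path in paths:
        with gzip.open(path, "rt", errors="replace") as handle:
            for line in handle:
                yield line


def first_connect_attempt(lines, peer, year):
    """Return the earliest 'connecting to <peer>' timestamp, or None."""
    earliest = None
    for line in lines:
        match = CONNECT_RE.match(line)
        if not match or match.group("peer") != peer:
            continue
        # The log has no year, so the caller supplies it.
        stamp = datetime.strptime(
            f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        if earliest is None or stamp < earliest:
            earliest = stamp
    return earliest


if __name__ == "__main__":
    sample = [
        "May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] "
        "[OpLogSynchronizer::run] connecting to sync_r14m0 via "
        "xxx.xxx.xxx.144:12505, retry:1",
        "May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] "
        "[OpLogSynchronizer::run] connecting to sync_r14m0 via "
        "xxx.xxx.xxx.144:12505, retry:0",
    ]
    print(first_connect_attempt(sample, "sync_r14m0", 2017))
```

For the real files, something like `first_connect_attempt(iter_gz_lines(sorted_paths), "sync_r14m0", 2017)` would scan all rotated logs in one pass.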
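Endless reconnects usually have only two possible causes. One is that data center 14 never whitelisted data center 10. But the link had been set up successfully earlier, so the whitelist must have been open. That made the firewall the most likely culprit, so on data center 14 I ran: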
iptables -n -L | grep <IP of data center 10>
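The China Telecom IP had a firewall rule, but the China Unicom IP did not. This is a dual-line data center, and I had deployed the environment on May 12. So two days after deployment, poor network quality made the Telecom line unreachable, the connection switched to the Unicom line, and because the Unicom IP had never been authorized, data center 10 could no longer connect to data center 14. Nobody noticed because the business was not using this database yet. Only yesterday, May 25, when the business started deploying processes in data center 14 and found the data was not syncing, did they come to the DBA. I immediately added the firewall rule and restarted the sync process to pull the data again, but data center 10 kept logging errors and reconnecting.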
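The grep above only finds what you think to search for. A small Python sketch like the following could check every IP of a peer data center against the rule listing in one go. The IP addresses are made-up documentation addresses, the sample rule line only imitates the usual `iptables -n -L` layout, and `ips_missing_from_rules` is a hypothetical helper.

```python
import re

# Hypothetical rule listing: only the first (say, Telecom) IP is allowed.
SAMPLE_RULES = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  192.0.2.10           0.0.0.0/0            tcp dpt:12505
"""


def ips_missing_from_rules(iptables_output, required_ips):
    """Return the required IPs that never appear in the rule listing."""
    missing = []
    for ip in required_ips:
        # Match the address as a whole token so that 192.0.2.1
        # does not count as present just because 192.0.2.10 is.
        pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)")
        if not pattern.search(iptables_output):
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Hypothetical Telecom and Unicom addresses of the peer data center.
    peer_ips = ["192.0.2.10", "198.51.100.10"]
    print(ips_missing_from_rules(SAMPLE_RULES, peer_ips))
```

This only tells you whether an address shows up in the listing at all. It does not check the chain, port, or target of the rule, so treat a clean result as a hint rather than proof.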
Meanwhile, data center 14 was logging a different error:
May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
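It keeps reporting that offset 2587613132 does not exist and cannot be pulled. The link had been down for too long, and the binlog on data center 14 is kept for only 7 days: the 14th plus 7 days means the offset became unrecoverable on the 21st. The fix was to find an offset on data center 14 that still exists and has not yet been purged, and have data center 10 pull from there. I then asked the business which tables had been written to in data center 14, exported those tables from data center 14, and imported them into data center 10.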
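The date arithmetic above is the kind of thing monitoring could do for you. Here is a minimal Python sketch, assuming retention is purely time-based at 7 days as described; how myshard actually decides what to purge may differ, and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def retention_deadline(disconnected_at, retention_days=7):
    """Return the moment after which the peer's last offset may be purged."""
    return disconnected_at + timedelta(days=retention_days)


def can_resume(disconnected_at, now, retention_days=7):
    """Return True while the peer can still resume from its old offset."""
    return now < retention_deadline(disconnected_at, retention_days)


if __name__ == "__main__":
    lost = datetime(2017, 5, 14)   # sync from data center 14 stopped
    found = datetime(2017, 5, 26)  # the day the problem was investigated
    print(retention_deadline(lost))  # 2017-05-21 00:00:00
    print(can_resume(lost, found))   # False
```

An alert that fires well before `retention_deadline` would have left enough time to fix the firewall and resume from the old offset instead of exporting and re-importing tables.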
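One advantage of myshard is that missing data can be repaired by exporting and importing data, whereas with MySQL you have to rely on Percona's repair tools. This was also a lesson for me: when the network between data centers is poor, access must be opened for every IP, not just some of them. On top of that, the business had to backfill data. We held a review meeting afterwards and agreed on a few rules: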
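Monitoring of myshard must be thorough enough that replication lag is detected promptly.
When business staff apply for database access, they must provide every IP for multi-line data centers (Telecom IP, Unicom IP, intranet IP, management-network IP).
myshard should use consistent hashing, so that a given user's data is modified in the same data center where it was written.
While sync is lagging, do not switch business traffic between nodes.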
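To make the consistent-hashing rule concrete, here is a small generic hash ring in Python that pins each user to one "home" data center. It is only a sketch of the general technique, not how myshard implements it. The node names reuse the r*m0 labels from the logs, and the virtual-node count and the choice of SHA-256 are arbitrary assumptions.

```python
import bisect
import hashlib


class HashRing:
    """A minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16)

    def home_node(self, user_id):
        """Return the node that owns all writes and updates for this user."""
        index = bisect.bisect_right(self._keys, self._hash(str(user_id)))
        # Wrap around to the first point when the hash is past the last one.
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["r1m0", "r3m0", "r10m0", "r12m0", "r13m0", "r14m0"])
    # The same user always maps to the same data center.
    print(ring.home_node(3611732366))
```

The point of the ring over a plain `hash(user) % N` is that removing one data center only moves the users that lived there. Everyone else keeps their home node, which limits how much traffic has to switch while sync is unhealthy.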