Mastering TiDB Disaster Recovery in One Article

Date: 2020-05-12


1. Background

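High availability is another key feature of TiDB. The TiDB, TiKV, and PD components can each tolerate the failure of some instances without affecting the availability of the whole cluster. The following describes, for each of the three components, its availability, the consequences of a single instance failing, and how to recover.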

TiDB
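TiDB is stateless. It is recommended to deploy at least two instances, with the front end serving traffic through a load-balancing component. When a single instance fails, the sessions running on that instance are affected: from the application's point of view a single request fails, and service resumes after reconnecting. After a single instance fails, you can restart it or deploy a new instance.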

PD
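PD is a cluster that keeps its data consistent through the Raft protocol. When a single instance fails and it is not the Raft leader, service is not affected at all. If it is the Raft leader, a new Raft leader is elected and service recovers automatically. PD cannot serve requests during the election, which takes about 3 seconds. It is recommended to deploy at least three PD instances. After a single instance fails, restart it or add a new instance.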

TiKV
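TiKV is a cluster that keeps its data consistent through the Raft protocol (the number of replicas is configurable; three replicas are stored by default) and relies on PD for load-balancing scheduling. When a single node fails, all Regions stored on that node are affected. For Regions whose Leader is on that node, service is interrupted until a new leader is elected. For Regions that only have a Follower on that node, service is not affected. If a TiKV node fails and cannot recover within a period of time (30 minutes by default), PD migrates its data to other TiKV nodes.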
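
As a rough illustration of why three replicas survive one failed node but not two, the Raft majority rule can be sketched in a few lines of Python. The function names are illustrative only and are not part of any TiKV or PD API.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas may fail while a majority is still alive."""
    return replicas - quorum(replicas)


# Print replica count, majority size, and tolerated failures.
for n in (3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

With the default of three replicas, a Region that loses two of them can no longer elect a leader on its own, which is the situation the tests later in this article deal with.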

2. Architecture

wtidb28.add.shbt.qihoo.net  192.168.1.1  TiDB/PD/pump/prometheus/grafana/CCS
wtidb27.add.shbt.qihoo.net  192.168.1.2  TiDB
wtidb26.add.shbt.qihoo.net  192.168.1.3  TiDB
wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV
wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV
wtidb17.add.shbt.qihoo.net  192.168.1.9  TiFlash
wtidb16.add.shbt.qihoo.net  192.168.1.10  TiFlash

The cluster uses 3 TiDB nodes, 5 TiKV nodes, and 2 TiFlash nodes to test disaster recovery. The cluster was deployed first and TiFlash was added afterwards. The version is 3.1.0 GA.

3. Test: two machines down

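The cluster keeps 3 replicas by default. When any two of the five machines go down, there are in theory three cases for a given region: two of its three replicas happen to be on the two failed machines; only one of its three replicas is on the failed machines; or none of its replicas are on the failed machines. In this test we take down the two TiKV nodes wtidb21 and wtidb22.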
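
The three cases can be made concrete with a small sketch that classifies a region by how many of its replicas sit on failed stores. The store IDs used in the calls are example values and the function name is hypothetical.

```python
def classify_region(peer_stores, failed_stores):
    """Classify a region by how many of its replicas are on failed stores."""
    lost = sum(1 for store in peer_stores if store in failed_stores)
    alive = len(peer_stores) - lost
    if lost == 0:
        return "unaffected"
    if alive > lost:
        return "majority alive"      # can still elect a leader by itself
    if alive > 0:
        return "majority lost"       # needs manual recovery, data survives
    return "all replicas lost"       # nothing left to recover from


failed = {1, 4}  # example: the two failed stores
print(classify_region([1, 4, 6], failed))
print(classify_region([4, 5, 6], failed))
print(classify_region([5, 6, 7], failed))
```

Only the "majority lost" case needs the manual steps in this section. With three replicas and two failed machines, the "all replicas lost" case cannot occur.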

First, look at the test table before the outage:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.91 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

After both machines go down at the same time:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;               
ERROR 9002 (HY000): TiKV server timeout 
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
ERROR 9005 (HY000): Region is unavailable

Look up the store_id of the two failed machines:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

The failed stores are 1 and 4.

Find the regions that have at least half of their replicas on the failed nodes:

[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length>=$total-length)}'
{"id":18,"peer_stores":[4,6,1]}
{"id":405,"peer_stores":[7,4,1]}
{"id":120,"peer_stores":[4,1,6]}
{"id":337,"peer_stores":[4,5,1]}
{"id":128,"peer_stores":[4,1,6]}
{"id":112,"peer_stores":[1,4,6]}
{"id":22,"peer_stores":[4,6,1]}
{"id":222,"peer_stores":[7,4,1]}
{"id":571,"peer_stores":[4,6,1]}
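
For readers who find the jq filter hard to read, the same selection can be sketched in Python. It assumes the region dump has the shape the jq expression implies (a top-level `regions` list whose entries carry `id` and `peers[].store_id`). The sample data takes three regions from the output above and adds two hypothetical regions (IDs 9000 and 9001) to show what gets filtered out.

```python
def regions_majority_failed(region_dump, failed_stores):
    """Return regions where failed replicas are at least as many as healthy ones."""
    selected = []
    for region in region_dump["regions"]:
        peer_stores = [peer["store_id"] for peer in region["peers"]]
        failed = [s for s in peer_stores if s in failed_stores]
        # Same condition as the jq filter: length >= $total - length
        if len(failed) >= len(peer_stores) - len(failed):
            selected.append({"id": region["id"], "peer_stores": peer_stores})
    return selected


sample_dump = {
    "regions": [
        {"id": 18, "peers": [{"store_id": 4}, {"store_id": 6}, {"store_id": 1}]},
        {"id": 405, "peers": [{"store_id": 7}, {"store_id": 4}, {"store_id": 1}]},
        {"id": 337, "peers": [{"store_id": 4}, {"store_id": 5}, {"store_id": 1}]},
        # Hypothetical regions, not from the real cluster:
        {"id": 9000, "peers": [{"store_id": 5}, {"store_id": 6}, {"store_id": 7}]},
        {"id": 9001, "peers": [{"store_id": 1}, {"store_id": 5}, {"store_id": 6}]},
    ]
}

for row in regions_majority_failed(sample_dump, {1, 4}):
    print(row)
```

Region 9001 has only one replica on a failed store, so it keeps its majority and is not selected.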

On the remaining healthy TiKV nodes, run the script that stops TiKV:

ps -ef|grep tikv
sh /data1/tidb/deploy/scripts/stop_tikv.sh
ps -ef|grep tikv

Change the owner of tikv-ctl and copy it into the tidb directory:
chown -R tidb. /home/helei/tikv-ctl

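This depends on the situation. If there are too many regions, repairing them one region at a time is slow and tedious, and using --all-regions is more convenient. However, it may also hit regions that still have two healthy peers. In other words, a region may have only one replica on a failed machine, yet after the operation it may still be left with a single replica in the cluster.
The following operation must be run on every surviving node. Stop TiKV first (TiKV must be shut down), then run: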

[tidb@wtidb20 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4 --all-regions                                                                    
removing stores [1, 4] from configrations...
success

Restart the PD nodes:

ansible-playbook stop.yml --tags=pd
If all PD nodes are stopped, you cannot log in to the database:
[helei@db-admin01 ~]$ /usr/local/mysql56/bin/mysql -u mydba -h 10.217.36.146 -P 43000 -p4c41b9
6c7687f66d 
…
…
…
ansible-playbook start.yml --tags=pd

Restart the surviving TiKV nodes:
sh /data1/tidb/deploy/scripts/start_tikv.sh

Check for regions that have no leader:
[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
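Here I found no leaderless regions. This PD check only returns results when the replica count is 3, three or more machines go down at the same time, some regions happen to have all of their replicas on those machines, and the previous step was unsafe-recover with --all-regions. Only then do you need to map regions back to tables to find out which data was lost and create empty regions. In my case, as long as one replica survives, the later steps are not needed, whether unsafe-recover was run with --all-regions or with specific region IDs.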
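
The leaderless check above amounts to "keep every region whose JSON has no `leader` key". A minimal Python sketch of that filter follows, assuming the same region dump shape as the jq expression. Region 9001 is hypothetical and exists only to show a match.

```python
def regions_without_leader(region_dump):
    """Mirror of the jq filter: select(has("leader") | not)."""
    return [
        {"id": region["id"],
         "peer_stores": [peer["store_id"] for peer in region["peers"]]}
        for region in region_dump["regions"]
        if "leader" not in region
    ]


sample_dump = {
    "regions": [
        {"id": 18,
         "peers": [{"store_id": 6}, {"store_id": 7}, {"store_id": 5}],
         "leader": {"store_id": 6}},
        # Hypothetical region whose replicas were all on failed stores:
        {"id": 9001,
         "peers": [{"store_id": 1}, {"store_id": 4}, {"store_id": 5}]},
    ]
}

print(regions_without_leader(sample_dump))
```

An empty result, as in this two-machine test, means every region still has at least one surviving replica that could become leader.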

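After the cluster starts normally, you can use pd-ctl to inspect the regions listed earlier. In theory, after unsafe-recover with --all-regions, the single remaining replica becomes leader, and the remaining TiKV nodes use the Raft protocol to replicate two new followers onto other stores.
For example, take this region from the current case:
{"id":18,"peer_stores":[4,6,1]}
pd-ctl shows that, because TiKV stores 1 and 4 were damaged, running unsafe-recover remove-fail-stores --all-regions removed stores 1 and 4, the remaining replica on store 6 became leader, and Raft created new followers on stores 5 and 7. The region is back to 3 replicas and the cluster started successfully.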

» region 18
{
  "id": 18,
  "start_key": "7480000000000000FF0700000000000000F8",
  "end_key": "7480000000000000FF0900000000000000F8",
  "epoch": {
    "conf_ver": 60,
    "version": 4
  },
  "peers": [
    {
      "id": 717,
      "store_id": 6
    },
    {
      "id": 59803,
      "store_id": 7
    },
    {
      "id": 62001,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 717,
    "store_id": 6
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}
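
The same inspection can be expressed as a small check. The sketch below copies the relevant fields of the region 18 output above and verifies three things: no peer is left on a failed store, the replica count is back to 3, and the leader is one of the peers. The function name and the wording of the messages are illustrative.

```python
region_18 = {
    "id": 18,
    "peers": [
        {"id": 717, "store_id": 6},
        {"id": 59803, "store_id": 7},
        {"id": 62001, "store_id": 5},
    ],
    "leader": {"id": 717, "store_id": 6},
}


def check_recovered(region, failed_stores, expected_replicas=3):
    """Return a list of problems; an empty list means the region looks healthy."""
    peer_stores = {peer["store_id"] for peer in region["peers"]}
    problems = []
    if peer_stores & set(failed_stores):
        problems.append("peers remain on failed stores")
    if len(peer_stores) != expected_replicas:
        problems.append("unexpected replica count")
    leader = region.get("leader")
    if leader is None or leader["store_id"] not in peer_stores:
        problems.append("no valid leader")
    return problems


print(check_recovered(region_18, failed_stores={1, 4}))
```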

If only two machines went down at the same time, recovery ends here. If only one machine went down, no action is needed.

First confirm that the data is intact; the previous steps recovered smoothly:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.86 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

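A side note:
When I brought the failed stores 1 and 4 back, with no new data written to the cluster in the meantime, the layout changed again. Store 6 had been the leader with the newly created replicas on stores 5 and 7 as followers. After the recovery, stores 5 and 7 were dropped and stores 1 and 4 became followers again, so region 18 was back on store_id 1, 4, 6.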

4. Test: three machines down

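If three or more machines go down at the same time, the leaderless-region check above does return results.
This time we take down the following three machines: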

wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV

First, stop all healthy TiKV nodes, which in this case are wtidb19 and wtidb18.

Look up the store_id of the three failed machines:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

The failed stores are 1, 4, and 5.

Find the regions that have at least half of their replicas on the failed nodes:

[tidb@wtidb28 bin]$  /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4,5) then . else empty end) | length>=$total-length)}'
{"id":156,"peer_stores":[1,4,6]}
{"id":14,"peer_stores":[6,1,4]}
{"id":89,"peer_stores":[5,4,1]}
{"id":144,"peer_stores":[1,4,6]}
{"id":148,"peer_stores":[6,1,4]}
{"id":152,"peer_stores":[7,1,4]}
{"id":260,"peer_stores":[6,1,4]}
{"id":480,"peer_stores":[7,1,4]}
{"id":132,"peer_stores":[5,4,6]}
{"id":22,"peer_stores":[6,1,4]}
{"id":27,"peer_stores":[4,1,6]}
{"id":37,"peer_stores":[1,4,6]}
{"id":42,"peer_stores":[5,4,6]}
{"id":77,"peer_stores":[5,4,6]}
{"id":116,"peer_stores":[5,4,6]}
{"id":222,"peer_stores":[6,1,4]}
{"id":69,"peer_stores":[5,4,6]}
{"id":73,"peer_stores":[7,4,1]}
{"id":81,"peer_stores":[5,4,1]}
{"id":128,"peer_stores":[6,1,4]}
{"id":2,"peer_stores":[5,6,4]}
{"id":10,"peer_stores":[7,4,1]}
{"id":18,"peer_stores":[6,1,4]}
{"id":571,"peer_stores":[6,5,4]}
{"id":618,"peer_stores":[7,1,4]}
{"id":218,"peer_stores":[6,5,1]}
{"id":47,"peer_stores":[1,4,6]}
{"id":52,"peer_stores":[6,1,4]}
{"id":57,"peer_stores":[4,7,1]}
{"id":120,"peer_stores":[6,1,4]}
{"id":179,"peer_stores":[5,1,4]}
{"id":460,"peer_stores":[5,7,1]}
{"id":93,"peer_stores":[6,1,4]}
{"id":112,"peer_stores":[6,1,4]}
{"id":337,"peer_stores":[5,6,4]}
{"id":400,"peer_stores":[5,7,1]}

Now only two machines are still alive:

wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV

The following operation must be run on every surviving node (wtidb19 and wtidb18 in this case). Stop TiKV first (TiKV must be shut down), then run:

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4,5 --all-regions                                                                 
removing stores [1, 4, 5] from configrations...
success

Restart the PD nodes:

ansible-playbook stop.yml --tags=pd
ansible-playbook start.yml --tags=pd

Restart the surviving TiKV nodes:
sh /data1/tidb/deploy/scripts/start_tikv.sh

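Check for regions that have no leader. Here we can see that some regions had all of their replicas on the three damaged stores 1, 4, and 5. Once these regions are discarded, their data cannot be recovered.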

[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
{"id":179,"peer_stores":[5,1,4]}
{"id":81,"peer_stores":[5,4,1]}
{"id":89,"peer_stores":[5,4,1]}

Use the region ID to determine which table a region belongs to:

[tidb@wtidb28 tidb-ansible-3.1.0]$ curl http://192.168.1.1:10080/regions/179
{
 "region_id": 179,
 "start_key": "dIAAAAAAAAA7X2mAAAAAAAAAAwOAAAAAATQXJwE4LjQuMAAAAPwBYWxsAAAAAAD6A4AAAAAAAqs0",
 "end_key": "dIAAAAAAAAA7X3KAAAAAAAODBA==",
 "frames": [
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": false,
   "index_name": "key2",
   "index_id": 3,
   "index_values": [
    "20191015",
    "8.4.0",
    "all",
    "174900"
   ]
  },
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": true,
   "record_id": 230148
  }
 ]
}
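
The `start_key` and `end_key` values in this response are base64 strings. As far as I understand TiDB's key layout, a table key starts with `t`, followed by the 8-byte table ID, then `_r` (row) or `_i` (index), then an 8-byte row or index ID, with integers stored big-endian and the sign bit flipped. Under that assumption, the following sketch decodes the two keys above and reproduces the `table_id`, `index_id`, and `record_id` shown in the `frames`. The function names are illustrative.

```python
import base64

SIGN_MASK = 1 << 63


def _decode_int(raw: bytes) -> int:
    """Decode an 8-byte big-endian integer whose sign bit was flipped."""
    value = int.from_bytes(raw, "big") ^ SIGN_MASK
    return value - (1 << 64) if value >= (1 << 63) else value


def decode_table_key(b64_key: str):
    """Return (table_id, kind, id) for a base64-encoded table key prefix."""
    raw = base64.b64decode(b64_key)
    if len(raw) < 19 or raw[0:1] != b"t":
        raise ValueError("not a table key")
    kind = {b"_r": "record", b"_i": "index"}.get(raw[9:11])
    if kind is None:
        raise ValueError("unknown key type")
    return _decode_int(raw[1:9]), kind, _decode_int(raw[11:19])


start_key = "dIAAAAAAAAA7X2mAAAAAAAAAAwOAAAAAATQXJwE4LjQuMAAAAPwBYWxsAAAAAAD6A4AAAAAAAqs0"
end_key = "dIAAAAAAAAA7X3KAAAAAAAODBA=="

print(decode_table_key(start_key))
print(decode_table_key(end_key))
```

The decoded range runs from an entry of index 3 to row 230148 of table 59, which matches the two frames in the response: region 179 holds both index and row data of that partition.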

If you look at the cluster state at this point:

» store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "192.168.1.4:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "192.168.1.5:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "192.168.1.6:20160",
        "version": "3.1.0",
        "state_name": "Down"

Monitoring shows no data either.

Queries in the database are still blocked:

mysql> use hl
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;

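Create empty regions to resolve the unavailable state. This command requires PD and TiKV to be stopped.
The -r option must be given one region ID at a time, otherwise the command fails: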

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p 192.168.1.1:2379 -r 89,179,81
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Aborted

./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 89
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 81
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 179

After starting PD and TiKV, run the check again:
[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
No output means the result is as expected.

Querying again shows that data has been lost, because several regions (81, 89, 179) were lost:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1262523 |
+----------+
1 row in set (0.92 sec)

Here the index data was not in regions 81, 89, or 179, so the count is the same as before:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (1.01 sec)
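
To put the two counts side by side, the arithmetic below uses only the numbers from the queries in this article. Reading the gap as index entries that no longer have a matching row is my interpretation of the result, not something the tools reported.

```python
rows_before = 1653394    # table count before the outage
rows_after = 1262523     # table count after recreate-region
index_entries = 1653394  # count via idx_day_ver_ch after recovery


def loss_summary(before, after, index_count):
    """Rows lost, share of rows lost, and index entries without a row."""
    lost = before - after
    return {
        "rows_lost": lost,
        "percent_lost": lost / before * 100,
        "index_entries_without_row": index_count - after,
    }


summary = loss_summary(rows_before, rows_after, index_entries)
print(summary["rows_lost"], summary["index_entries_without_row"])
print(f"{summary['percent_lost']:.1f}% of rows lost")
```

Because the table scan and the index scan now disagree, the table and its index should be treated as inconsistent until the lost rows are restored from a backup or the index is rebuilt.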

This completes the test.

5. Summary

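After reading this article, you should no longer be afraid of recovering TiDB data after several nodes lose power at once. Under normal circumstances it is very rare for multiple machines in a cluster to go down at the same time. If only one machine goes down, the cluster keeps running and handles it automatically: when a TiKV node fails and cannot recover within a period of time (30 minutes by default), PD migrates its data to other TiKV nodes. If two machines, or even three or more, go down at the same time, you now have a procedure to follow instead of scrambling.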
