MySQL High-Availability Replication Management Tool: Orchestrator Usage

Background

The previous post, "MySQL High-Availability Replication Management Tool: Orchestrator Introduction", covered Orchestrator's features, configuration, and deployment in broad strokes; the most complete reference is of course the official documentation. This post starts testing and explaining each aspect of Orchestrator.

Test Notes

Environment

Server environment:

Three servers:
1: MySQL instances (3306 is orchestrator's backend database; 3307 is the managed MySQL master-replica topology, with GTID enabled)
Master : 192.168.163.131:3307
Slave  : 192.168.163.132:3307
Slave  : 192.168.163.133:3307

2: hosts (/etc/hosts):
192.168.163.131 test1
192.168.163.132 test2
192.168.163.133 test3

Note that orchestrator relies on the replicas' IO threads to detect a dead master (once it cannot reach the master itself, it double-checks through the replicas). With the default CHANGE MASTER settings, replicas take too long to notice that the master is gone, so the replication connection needs a slight tune:

change master to master_host='192.168.163.131',master_port=3307,master_user='rep',master_password='rep',master_auto_position=1,MASTER_HEARTBEAT_PERIOD=2,MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400;
set global slave_net_timeout=8;

slave_net_timeout (global variable): since MySQL 5.7.7 the default is 60 seconds. It defines how many seconds a replica waits for data from the master before it gives up reading, drops the connection, and attempts to reconnect.

master_heartbeat_period: the replication heartbeat interval, defaulting to half of slave_net_timeout. When the master has no data to send, it emits a heartbeat packet every master_heartbeat_period seconds, so the replica knows whether the master is still alive.

slave_net_timeout sets how long the replica may go without receiving data before it considers the network timed out, after which its IO thread reconnects to the master. Combined, the two settings avoid replication interruptions caused by network glitches. master_heartbeat_period is in seconds and accepts fractional values such as 10.5, with millisecond resolution.

The retry policy:
If the replica has received nothing from the master for slave-net-timeout seconds, it starts its first retry. Every master-connect-retry seconds thereafter it tries to reconnect again, and it only gives up after master-retry-count attempts. If a retry manages to connect, the master is considered healthy again and the slave-net-timeout countdown restarts. The defaults are slave-net-timeout = 60 seconds, master-connect-retry = 60 seconds, and master-retry-count = 86400 attempts. In other words, only after the master has sent nothing for a full minute does the replica even attempt to reconnect.
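It is worth verifying that the settings actually took effect on a replica; a quick check (a sketch, using the performance_schema table available from MySQL 5.7, run with an account that can read it):

# on a replica, e.g. test2
mysql -uroot -p -h192.168.163.132 -P3307 -e "
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';
SELECT CHANNEL_NAME, HEARTBEAT_INTERVAL, CONNECTION_RETRY_INTERVAL, CONNECTION_RETRY_COUNT
  FROM performance_schema.replication_connection_configuration\G"

With the CHANGE MASTER statement above, the expected values are HEARTBEAT_INTERVAL = 2, CONNECTION_RETRY_INTERVAL = 1, CONNECTION_RETRY_COUNT = 86400.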

With these settings, a dead master is noticed in roughly 8 to 10 seconds, after which Orchestrator starts the switchover. Also note that orchestrator manages instances by hostname by default, so the report_host and report_port parameters must be added to each MySQL configuration file.
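A minimal sketch of the relevant my.cnf fragment on test1 (file path and section layout depend on your installation):

# my.cnf for the 3307 instance on test1
[mysqld]
report_host = test1
report_port = 3307

Without report_host/report_port, the master's SHOW SLAVE HOSTS output carries no usable hostnames and orchestrator may not resolve the replicas correctly.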

Database environment:

Orchestrator's backend database:
When the Orchestrator process starts, it automatically creates an orchestrator database in the backend instance to persist its own state.
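The backend connection itself lives in /etc/orchestrator.conf.json; a minimal sketch, assuming the 3306 instance from the environment above and a dedicated orc_server_user account (user and password here are placeholders):

{
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_server_user",
  "MySQLOrchestratorPassword": "orc_server_password"
}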

Databases managed by Orchestrator:
Several query parameters in the configuration file expect each managed instance to carry a meta schema holding metadata (much like a small CMDB): for example, pt-heartbeat for measuring replication lag, and a cluster table holding the alias, data center, and so on.

The cluster table in this test environment looks like this:

> CREATE TABLE `cluster` (
  `anchor` tinyint(4) NOT NULL,
  `cluster_name` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  `cluster_domain` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
  `data_center` varchar(128) NOT NULL,
  PRIMARY KEY (`anchor`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

>select * from cluster;
+--------+--------------+----------------+-------------+
| anchor | cluster_name | cluster_domain | data_center |
+--------+--------------+----------------+-------------+
|      1 | test         | CaoCao         | BJ          |
+--------+--------------+----------------+-------------+
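The query parameters mentioned above then read this table on each managed instance; a sketch of the matching orchestrator.conf.json entries (the ifnull(max(...)) wrapping keeps the queries harmless when the row is absent):

{
  "DetectClusterAliasQuery": "select ifnull(max(cluster_name), '') from meta.cluster where anchor=1",
  "DetectClusterDomainQuery": "select ifnull(max(cluster_domain), '') from meta.cluster where anchor=1",
  "DetectDataCenterQuery": "select ifnull(max(data_center), '') from meta.cluster where anchor=1"
}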

Test Walkthrough

Start the Orchestrator process:

./orchestrator --config=/etc/orchestrator.conf.json http

Point a browser at any of the three hosts (e.g. http://192.168.163.131:3000) to reach the web UI, then enter any one managed MySQL instance under Clusters -> Discover. Once added, the whole topology shows up in the web UI.
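The same discovery works without the browser; a sketch using the client and the equivalent HTTP API endpoint:

# discover one instance; orchestrator crawls the rest of the topology from it
orchestrator-client -c discover -i 192.168.163.131:3307
curl -s "http://192.168.163.131:3000/api/discover/192.168.163.131/3307"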

Most management tasks can be done from the web UI; the buttons are explained below.

1.  Some of the parameters that can be changed (click any icon of the instance to modify on the web UI):

Field descriptions:

Instance Alias : instance alias
Last seen       : time of the last check
Self coordinates : the instance's own binlog coordinates
Num replicas : number of replicas
Server ID    : MySQL server_id
Server UUID :    MySQL server UUID
Version :    version
Read only : whether the instance is read-only
Has binary logs : whether binlog is enabled
Binlog format    : binlog format
Logs slave updates : whether log_slave_updates is enabled
GTID supported : whether GTID is supported
GTID based replication : whether replication is GTID-based
GTID mode    : whether GTID mode is enabled for replication
Executed GTID set : GTIDs executed by replication
Uptime : uptime
Allow TLS : whether TLS is enabled
Cluster : cluster alias
Audit : the instance's audit records
Agent : the instance's agent

Note: in the view above, every field followed by a control can be changed from the web UI, such as read-only status or GTID-based replication. Begin Downtime marks the instance as downtimed; while downtimed, the instance takes no part in any failover.

2.  Reshape the replication topology at will: drag instances around directly in the graph and the replication relationships are re-established automatically:

3.  Automatic failover when the master dies, for example:

As the picture shows, once the master dies it is removed from the topology, the most suitable replica is promoted to master, and the replication tree is repaired. During the failover you can watch /tmp/recovery.log (hard-coded in the configuration file), which records the external scripts run by the hooks, much like MHA's master_ip_failover_script parameter. Those scripts are where you switch a VIP, update a proxy, change DNS, reconfigure middleware or LVS, and so on; write them to match your own environment.
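For illustration, a hypothetical VIP-moving hook (the script path, VIP, network device, and root SSH access are all assumptions of this sketch; orchestrator hands the failover context to hooks both as {placeholder} tokens and as ORC_* environment variables):

#!/bin/bash
# /usr/local/bin/orch_vip_failover.sh -- registered under "PostMasterFailoverProcesses"
# Moves a service VIP from the failed master to the promoted one.
VIP=192.168.163.100
DEV=eth0
# the old master may be dead or unreachable, so ignore failures here
ssh -o ConnectTimeout=3 root@"$ORC_FAILED_HOST" "ip addr del $VIP/32 dev $DEV" || true
# bring the VIP up on the new master and announce it
ssh root@"$ORC_SUCCESSOR_HOST" "ip addr add $VIP/32 dev $DEV && arping -q -c 3 -A -I $DEV $VIP"

It would be wired in as "PostMasterFailoverProcesses": ["/usr/local/bin/orch_vip_failover.sh"].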

4.  Orchestrator's own high availability. Three nodes were deployed from the start and communicate through the Raft parameters in the configuration file. As long as 2 of the Orchestrator nodes are healthy, operation is unaffected; once 2 nodes are down, failover fails. With 2 nodes down it looks like this:

All nodes in the picture turn grey: the Raft quorum is lost, and Orchestrator's failover capability fails with it. Compared with MHA's single-point Manager, Orchestrator uses the Raft algorithm to solve both its own high availability and network isolation, cross-data-center network partitions in particular. A few words on Raft as a consensus protocol:

      Orchestrator nodes elect a leader by quorum: with 3 orchestrator nodes, one becomes leader (quorum size 2 for 3 nodes, 3 for 5 nodes). Only the leader may make changes. Every MySQL server in the topology is probed independently by the three orchestrator nodes; under normal conditions all three see more or less the same topology, but each analyzes it independently and writes to its own dedicated backend database server:

① All changes must go through the leader.

② The orchestrator command-line binary must not be used in raft mode.

③ Use orchestrator-client in raft mode instead; orchestrator-client can be installed on servers that do not run orchestrator.

④ The failure of a single orchestrator node does not affect orchestrator's availability: in a 3-node setup at most one server may fail; in a 5-node setup, two.

⑤ An orchestrator node that shuts down uncleanly can be restarted; it rejoins the Raft group and catches up on any events it missed, as long as enough Raft log is retained.

⑥ To join a node that has been away longer than the log retention allows, or whose database is completely empty, clone the backend DB from another active node.

For more on Raft, see: https://github.com/github/orchestrator/blob/master/docs/raft.md
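For reference, the three nodes in this test are tied together by the Raft keys in /etc/orchestrator.conf.json, roughly as follows (RaftBind differs per node; 10008 is the default Raft port, which also shows up in the raft-leader output later in this post):

{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "192.168.163.131",
  "DefaultRaftPort": 10008,
  "RaftNodes": ["192.168.163.131", "192.168.163.132", "192.168.163.133"]
}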

Orchestrator offers two high-availability approaches: the Raft setup described above (recommended), and synchronization through the backend database. See the documentation for details; it compares the two deployment methods thoroughly, with a diagram of each.

At this point Orchestrator's core features are in place: automatic failover, topology changes, and visual operation from the web UI.

5. What the web UI buttons do

①: Home -> Status: orchestrator's own state, including uptime, version, backend database, and the state of each Raft node.

②: Cluster -> Dashboard: all MySQL instances managed by orchestrator.

③: Cluster -> Failure analysis: failure analysis, including the list of recorded failure types.

④: Cluster -> Discover: discover MySQL instances to manage.

⑤: Audit -> Failure detection: failure detection information, with history.

⑥: Audit -> Recovery: recovery information and recovery acknowledgement.

⑦: Audit -> Agent: a service that runs on the MySQL hosts and talks to orchestrator, providing OS, filesystem, and LVM information and the ability to invoke certain commands and scripts.

⑧: Icons in the navigation bar, matching the icons in the left-hand sidebar:

Row 1: view and edit the cluster alias.

Row 2: pools.

Row 3: Compact display.

Row 4: Pool indicator.

Row 5: Colorize DC, a distinct color per data center.

Row 6: Anonymize, hide the hostnames in the cluster.

Note: the icons in the left-hand sidebar summarize each instance: name, alias, failure detection and recovery information, and so on.

⑨: Navigation-bar icon: disable global recoveries. While disabled, no failover is performed.

⑩: Navigation-bar icon: toggle page auto-refresh (every 60 seconds by default).

⑪: Navigation-bar icon: the MySQL instance migration mode.

Smart mode: let Orchestrator pick the migration mode itself.
Classic mode: migrate using binlog file and position.
GTID mode: migrate using GTID.
Pseudo GTID mode: migrate using Pseudo-GTID.

That covers Orchestrator's basic tests and the web UI. Compared with MHA the experience is already much better: some parameters can be changed on the web, the replication topology can be reshaped, and, most importantly, MHA's Manager single point is gone. What reason is left not to replace MHA? :)

How It Works

Orchestrator performs automatic failover; here is roughly how that flows.

1. Detection

① orchestrator exploits the replication topology: it checks the master itself and also observes its replicas.

② If orchestrator cannot reach the master but can reach that master's replicas, it checks through them; if the replicas cannot see the master either (via the IO thread), this "double check" declares the master dead.

This detection approach is sound: if even the replicas have lost the master, replication is certainly broken and a switchover is warranted, which makes it very reliable in production.

Not every detected failure triggers automatic recovery, for example: global recoveries disabled, a downtime window in effect, the previous recovery within RecoveryPeriodBlockSeconds of this one, or a failure type not considered worth recovering. Detection is decoupled from recovery and is always enabled. Every detection runs the OnFailureDetectionProcesses hooks.

Failure detection configuration:

{
  "FailureDetectionPeriodBlockMinutes": 60,
}

Hook-related parameters:
{
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countReplicas}' >> /tmp/recovery.log"
  ],
}

Related MySQL replication settings:
slave_net_timeout
MASTER_CONNECT_RETRY

2. Recovery

Recovery requires instances with GTID, Pseudo-GTID, or binary logging enabled. The recovery configuration looks like this (RecoverMasterClusterFilters takes either specific cluster names or the catch-all "*"):

{
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "thiscluster",
    "thatcluster"
  ],
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": [
    "*"
  ],
}

{
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PreventCrossDataCenterMasterFailover": false,
  "FailMasterPromotionIfSQLThreadNotUpToDate": true,
  "MasterFailoverLostInstancesDowntimeMinutes": 10,
  "DetachLostReplicasAfterMasterFailover": true,
}

Hooks:
{
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:
    {failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log"
  ],
}

For the meaning of each parameter, see "MySQL High-Availability Replication Management Tool: Orchestrator Introduction". Both failure detection and recovery can run external custom scripts (hooks) to integrate with the surroundings (VIP, proxy, DNS).

Both intermediate masters (DeadIntermediateMaster) and masters can be recovered:

Intermediate master: recovery re-attaches its replicas under a sibling node, matching replicas by which instances have log-slave-updates, whether an instance lags, whether it has replication filters, which MySQL version it runs, and so on.

Master: recovery can prefer specific replicas via promotion rules (register-candidate). The promoted replica is not necessarily the most up-to-date one but the most suitable one; a promotion rule stays in effect for 1 hour after being set.

The promotion-rule options are:

prefer     -- prefer to promote
neutral    -- neutral (default)
prefer_not -- prefer not to promote
must_not   -- never promote

The supported recovery types are: automatic recovery, graceful takeover, manual recovery, and manual forced recovery; hooks can run for each of them. See the recovery-flow documentation for the detailed steps, and the official notes for the recovery configuration.

One more point: every recovery, beyond the automatic failover itself, relies on your own hook scripts to handle the outside world: VIP changes, DNS changes, proxy changes, and so on. So how should all these hook parameters be set, which ones actually fire, and in what order do they run? The article touches on this, but to make it concrete, here is the hook order for various recovery scenarios:

   "OnFailureDetectionProcesses": [   #檢測故障時執行
    "echo '②  Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
  ],
  "PreGracefulTakeoverProcesses": [   #在主變爲只讀以前當即執行
    "echo '①   Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [   #在執行恢復操做以前當即執行
    "echo '③  Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostMasterFailoverProcesses": [ #在主恢復成功結束時執行
    "echo '④  Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [   #在任何成功恢復結束時執行
    "echo '⑤  (for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [  #在任何不成功的恢復結束時執行
    "echo '⑧  >> /tmp/recovery.log'"
  ],
  "PostIntermediateMasterFailoverProcesses": [  #在成功的中間主恢復結束時執行
    "echo '⑥ Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostGracefulTakeoverProcesses": [   #在舊主位於新晉升的主以後執行
    "echo '⑦ Planned takeover complete' >> /tmp/recovery.log"
  ],
Master crash, automatic failover:
②  Detected UnreachableMaster on test1:3307. Affected replicas: 2
②  Detected DeadMaster on test1:3307. Affected replicas: 2
③  Will recover from DeadMaster on test1:3307
④  Recovered from DeadMaster on test1:3307. Failed: test1:3307; Promoted: test2:3307
⑤  (for all types) Recovered from DeadMaster on test1:3307. Failed: test1:3307; Successor: test2:3307

Graceful master switchover: test2:3307 gracefully hands over to test1:3307; after the switch, START SLAVE must be run manually:
  orchestrator-client -c graceful-master-takeover -a test2:3307 -d test1:3307
①  Planned takeover about to take place on test2:3307. Master will switch to read_only
②  Detected DeadMaster on test2:3307. Affected replicas: 1
③  Will recover from DeadMaster on test2:3307
④  Recovered from DeadMaster on test2:3307. Failed: test2:3307; Promoted: test1:3307
⑤  (for all types) Recovered from DeadMaster on test2:3307. Failed: test2:3307; Successor: test1:3307
⑦ Planned takeover complete

Manual recovery: when a replica is downtimed or in maintenance mode and the master then dies, there is no automatic failover; run the recovery manually, naming the dead master instance:
  orchestrator-client -c recover -i test1:3307
②  Detected UnreachableMaster on test1:3307. Affected replicas: 2
②  Detected DeadMaster on test1:3307. Affected replicas: 2
③  Will recover from DeadMaster on test1:3307
④  Recovered from DeadMaster on test1:3307. Failed: test1:3307; Promoted: test2:3307
⑤  (for all types) Recovered from DeadMaster on test1:3307. Failed: test1:3307; Successor: test2:3307

Manual forced recovery: recover no matter what the current state is:
  orchestrator-client -c force-master-failover -i test2:3307
②  Detected DeadMaster on test2:3307. Affected replicas: 2
③  Will recover from DeadMaster on test2:3307
②  Detected AllMasterSlavesNotReplicating on test2:3307. Affected replicas: 2
④  Recovered from DeadMaster on test2:3307. Failed: test2:3307; Promoted: test1:3307
⑤  (for all types) Recovered from DeadMaster on test2:3307. Failed: test2:3307; Successor: test1:3307

In all of the scenarios above, ⑥ and ⑧ never ran: ⑥ fires only for intermediate-master recoveries, so without an intermediate master (cascading replication) it need not be set; ⑧ fires only when a recovery fails, which did not happen above, and is a good place to hang alerting.

Deploying in Production

For deploying Orchestrator in production, refer to the documentation.

1.  First decide how Orchestrator's own backend database is made highly available: a single MySQL, MySQL replication, or the built-in Raft.

2. Run discovery (web UI or orchestrator-client):

orchestrator-client -c discover -i this.hostname.com

3. Set promotion rules (some servers are better candidates for promotion than others):

orchestrator -c register-candidate -i ${::fqdn} --promotion-rule ${promotion_rule}

4. When a server has a problem, it shows up in the problems drop-down on the web UI. A downtimed server does not appear in the problem list and will not be recovered; it is treated as being in maintenance.

orchestrator -c begin-downtime -i ${::fqdn} --duration=5m --owner=cron --reason=continuous_downtime

Or via the API:
curl -s "http://my.orchestrator.service:80/api/begin-downtime/my.hostname/3306/wallace/experimenting+failover/45m"

5. Pseudo-GTID: if GTID is not enabled in MySQL, Pseudo-GTID can provide GTID-like behavior.

6. Persist metadata. Most metadata is fetched through the query parameters, e.g. the cluster alias (DetectClusterAliasQuery), data center (DetectDataCenterQuery), and domain (DetectClusterDomainQuery) from your own cluster table; replication lag via pt-heartbeat; semi-sync enforcement (DetectSemiSyncEnforcedQuery). Regex-based classification is also available: DataCenterPattern, PhysicalEnvironmentPattern, and so on.
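For the lag piece, a sketch of the wiring, assuming pt-heartbeat writes into a meta.heartbeat table on the master:

{
  "ReplicationLagQuery": "select round(unix_timestamp(now(6)) - unix_timestamp(ts)) from meta.heartbeat order by ts desc limit 1"
}

If ReplicationLagQuery is left empty, orchestrator falls back to Seconds_Behind_Master from SHOW SLAVE STATUS.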

7. Instances can be tagged.

Command Line and API Usage

Besides the web UI, Orchestrator can be driven through the command line (orchestrator-client) and the API (curl), which expose even more management commands. Here are the most commonly used ones.

List the available commands with ./orchestrator-client --help; each command is documented in the manual:

Usage: orchestrator-client -c <command> [flags...]
Example: orchestrator-client -c which-master -i some.replica
Options:

  -h, --help
    print this help
  -c <command>, --command <command>
    indicate the operation to perform (see listing below)
  -a <alias>, --alias <alias>
    cluster alias
  -o <owner>, --owner <owner>
    name of owner for downtime/maintenance commands
  -r <reason>, --reason <reason>
    reason for downtime/maintenance operation
  -u <duration>, --duration <duration>
    duration for downtime/maintenance operations
  -R <promotion rule>, --promotion-rule <promotion rule>
    rule for 'register-candidate' command
  -U <orchestrator_api>, --api <orchestrator_api>
    override $orchestrator_api environemtn variable,
    indicate where the client should connect to.
  -P <api path>, --path <api path>
    With '-c api', indicate the specific API path you wish to call
  -b <username:password>, --auth <username:password>
    Specify when orchestrator uses basic HTTP auth.
  -q <query>, --query <query>
    Indicate query for 'restart-replica-statements' command
  -l <pool name>, --pool <pool name>
    pool name for pool related commands
  -H <hostname> -h <hostname>
    indicate host for resolve and raft operations

    help                                     Show available commands
    which-api                                Output the HTTP API to be used
    api                                      Invoke any API request; provide --path argument
    async-discover                           Lookup an instance, investigate it asynchronously. Useful for bulk loads
    discover                                 Lookup an instance, investigate it
    forget                                   Forget about an instance's existence
    forget-cluster                           Forget about a cluster
    topology                                 Show an ascii-graph of a replication topology, given a member of that topology
    topology-tabulated                       Show an ascii-graph of a replication topology, given a member of that topology, in tabulated format
    clusters                                 List all clusters known to orchestrator
    clusters-alias                           List all clusters known to orchestrator
    search                                   Search for instances matching given substring
    instance"|"which-instance                Output the fully-qualified hostname:port representation of the given instance, or error if unknown
    which-master                             Output the fully-qualified hostname:port representation of a given instance's master
    which-replicas                           Output the fully-qualified hostname:port list of replicas of a given instance
    which-broken-replicas                    Output the fully-qualified hostname:port list of broken replicas of a given instance
    which-cluster-instances                  Output the list of instances participating in same cluster as given instance
    which-cluster                            Output the name of the cluster an instance belongs to, or error if unknown to orchestrator
    which-cluster-master                     Output the name of a writable master in given cluster
    all-clusters-masters                     List of writeable masters, one per cluster
    all-instances                            The complete list of known instances
    which-cluster-osc-replicas               Output a list of replicas in a cluster, that could serve as a pt-online-schema-change operation control replicas
    which-cluster-osc-running-replicas       Output a list of healthy, replicating replicas in a cluster, that could serve as a pt-online-schema-change operation control replicas
    downtimed                                List all downtimed instances
    dominant-dc                              Name the data center where most masters are found
    submit-masters-to-kv-stores              Submit a cluster's master, or all clusters' masters to KV stores
    relocate                                 Relocate a replica beneath another instance
    relocate-replicas                        Relocates all or part of the replicas of a given instance under another instance
    match                                    Matches a replica beneath another (destination) instance using Pseudo-GTID
    match-up                                 Transport the replica one level up the hierarchy, making it child of its grandparent, using Pseudo-GTID
    match-up-replicas                        Matches replicas of the given instance one level up the topology, making them siblings of given instance, using Pseudo-GTID
    move-up                                  Move a replica one level up the topology
    move-below                               Moves a replica beneath its sibling. Both replicas must be actively replicating from same master.
    move-equivalent                          Moves a replica beneath another server, based on previously recorded "equivalence coordinates"
    move-up-replicas                         Moves replicas of the given instance one level up the topology
    make-co-master                           Create a master-master replication. Given instance is a replica which replicates directly from a master.
    take-master                              Turn an instance into a master of its own master; essentially switch the two.
    move-gtid                                Move a replica beneath another instance via GTID
    move-replicas-gtid                       Moves all replicas of a given instance under another (destination) instance using GTID
    repoint                                  Make the given instance replicate from another instance without changing the binglog coordinates. Use with care
    repoint-replicas                         Repoint all replicas of given instance to replicate back from the instance. Use with care
    take-siblings                            Turn all siblings of a replica into its sub-replicas.
    tags                                     List tags for a given instance
    tag-value                                List tags for a given instance
    tag                                      Add a tag to a given instance. Tag in "tagname" or "tagname=tagvalue" format
    untag                                    Remove a tag from an instance
    untag-all                                Remove a tag from all matching instances
    tagged                                   List instances tagged by tag-string. Format: "tagname" or "tagname=tagvalue" or comma separated "tag0,tag1=val1,tag2" for intersection of all.
    submit-pool-instances                    Submit a pool name with a list of instances in that pool
    which-heuristic-cluster-pool-instances   List instances of a given cluster which are in either any pool or in a specific pool
    begin-downtime                           Mark an instance as downtimed
    end-downtime                             Indicate an instance is no longer downtimed
    begin-maintenance                        Request a maintenance lock on an instance
    end-maintenance                          Remove maintenance lock from an instance
    register-candidate                       Indicate the promotion rule for a given instance
    register-hostname-unresolve              Assigns the given instance a virtual (aka "unresolved") name
    deregister-hostname-unresolve            Explicitly deregister/dosassociate a hostname with an "unresolved" name
    stop-replica                             Issue a STOP SLAVE on an instance
    stop-replica-nice                        Issue a STOP SLAVE on an instance, make effort to stop such that SQL thread is in sync with IO thread (ie all relay logs consumed)
    start-replica                            Issue a START SLAVE on an instance
    restart-replica                          Issue STOP and START SLAVE on an instance
    reset-replica                            Issues a RESET SLAVE command; use with care
    detach-replica                           Stops replication and modifies binlog position into an impossible yet reversible value.
    reattach-replica                         Undo a detach-replica operation
    detach-replica-master-host               Stops replication and modifies Master_Host into an impossible yet reversible value.
    reattach-replica-master-host             Undo a detach-replica-master-host operation
    skip-query                               Skip a single statement on a replica; either when running with GTID or without
    gtid-errant-reset-master                 Remove errant GTID transactions by way of RESET MASTER
    gtid-errant-inject-empty                 Apply errant GTID as empty transactions on cluster's master
    enable-semi-sync-master                  Enable semi-sync (master-side)
    disable-semi-sync-master                 Disable semi-sync (master-side)
    enable-semi-sync-replica                 Enable semi-sync (replica-side)
    disable-semi-sync-replica                Disable semi-sync (replica-side)
    restart-replica-statements               Given `-q "<query>"` that requires replication restart to apply, wrap query with stop/start slave statements as required to restore instance to same replication state. Print out set of statements
    can-replicate-from                       Check if an instance can potentially replicate from another, according to replication rules
    can-replicate-from-gtid                  Check if an instance can potentially replicate from another, according to replication rules and assuming Oracle GTID
    is-replicating                           Check if an instance is replicating at this time (both SQL and IO threads running)
    is-replication-stopped                   Check if both SQL and IO threads state are both strictly stopped.
    set-read-only                            Turn an instance read-only, via SET GLOBAL read_only := 1
    set-writeable                            Turn an instance writeable, via SET GLOBAL read_only := 0
    flush-binary-logs                        Flush binary logs on an instance
    last-pseudo-gtid                         Dump last injected Pseudo-GTID entry on a server
    recover                                  Do auto-recovery given a dead instance, assuming orchestrator agrees there's a problem. Override blocking.
    graceful-master-takeover                 Gracefully promote a new master. Either indicate identity of new master via '-d designated.instance.com' or setup replication tree to have a single direct replica to the master.
    force-master-failover                    Forcibly discard master and initiate a failover, even if orchestrator doesn't see a problem. This command lets orchestrator choose the replacement master
    force-master-takeover                    Forcibly discard master and promote another (direct child) instance instead, even if everything is running well
    ack-cluster-recoveries                   Acknowledge recoveries for a given cluster; this unblocks pending future recoveries
    ack-all-recoveries                       Acknowledge all recoveries
    disable-global-recoveries                Disallow orchestrator from performing recoveries globally
    enable-global-recoveries                 Allow orchestrator to perform recoveries globally
    check-global-recoveries                  Show the global recovery configuration
    replication-analysis                     Request an analysis of potential crash incidents in all known topologies
    raft-leader                              Get identify of raft leader, assuming raft setup
    raft-health                              Whether node is part of a healthy raft group
    raft-leader-hostname                     Get hostname of raft leader, assuming raft setup
    raft-elect-leader                        Request raft re-elections, provide hint for new leader's identity

orchestrator-client does not have to live on the Orchestrator servers and needs no access to the backend database; it can run from any host.

Note: with Raft configured there are multiple Orchestrators, so set the ORCHESTRATOR_API environment variable and orchestrator-client picks the leader automatically, e.g.:

export ORCHESTRATOR_API="test1:3000/api test2:3000/api test3:3000/api"

1. List all clusters: clusters

Default:

# orchestrator-client -c clusters
test2:3307

Including the cluster alias: clusters-alias

# orchestrator-client -c clusters-alias
test2:3307,test

2. Discover a given instance: discover/async-discover

Synchronous discovery:

# orchestrator-client -c discover -i test1:3307
test1:3307

Asynchronous discovery, suited to bulk loads:

# orchestrator-client -c async-discover -i test1:3307
:null

3. Forget a given object: forget/forget-cluster

Forget an instance:

# orchestrator-client -c forget -i test1:3307

Forget a cluster:

# orchestrator-client -c forget-cluster -i test

4. Print a cluster's topology: topology/topology-tabulated

Plain output:

# orchestrator-client -c topology -i test1:3307
test2:3307   [0s,ok,5.7.25-0ubuntu0.16.04.2-log,rw,ROW,>>,GTID]
+ test1:3307 [0s,ok,5.7.25-0ubuntu0.16.04.2-log,ro,ROW,>>,GTID]
+ test3:3307 [0s,ok,5.7.25-log,ro,ROW,>>,GTID]

Tabulated output:

# orchestrator-client -c topology-tabulated -i test1:3307
test2:3307  |0s|ok|5.7.25-0ubuntu0.16.04.2-log|rw|ROW|>>,GTID
+ test1:3307|0s|ok|5.7.25-0ubuntu0.16.04.2-log|ro|ROW|>>,GTID
+ test3:3307|0s|ok|5.7.25-log                 |ro|ROW|>>,GTID

5. Show which API endpoint is in use (the client picks the leader itself): which-api

# orchestrator-client -c which-api
test3:3000/api

It can also be checked at http://192.168.163.133:3000/api/leader-check.
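The same check from the shell:

# curl -s http://192.168.163.133:3000/api/leader-check
"OK"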

6. Invoke any API request, together with the -path argument: api -path

# orchestrator-client -c api -path clusters
[ "test2:3307" ]
# orchestrator-client -c api -path leader-check
"OK"
# orchestrator-client -c api -path status
{ "Code": "OK", "Message": "Application node is healthy"...}

7. Search for instances: search

# orchestrator-client -c search -i test
test2:3307
test1:3307
test3:3307

8. Print a given instance's master: which-master

# orchestrator-client -c which-master -i test1:3307
test2:3307
# orchestrator-client -c which-master -i test3:3307
test2:3307
# orchestrator-client -c which-master -i test2:3307 # it is itself the master
:0

9. Print a given instance's replicas: which-replicas

# orchestrator-client -c which-replicas -i test2:3307
test1:3307
test3:3307

10. Print a given instance's fully-qualified name: which-instance

# orchestrator-client -c instance -i test1:3307
test1:3307

11. Print the broken replicas under a given master: which-broken-replicas (test3's replication was broken on purpose here):

# orchestrator-client -c which-broken-replicas -i test2:3307
test3:3307

12. Given an instance or a cluster alias, print all instances in that cluster: which-cluster-instances

# orchestrator-client -c which-cluster-instances -i test
test1:3307
test2:3307
test3:3307
root@test1:~# orchestrator-client -c which-cluster-instances -i test1:3307
test1:3307
test2:3307
test3:3307

13. Given an instance, print the name of its cluster (by default the master's hostname:port): which-cluster

# orchestrator-client -c which-cluster -i test1:3307
test2:3307
# orchestrator-client -c which-cluster -i test2:3307
test2:3307
# orchestrator-client -c which-cluster -i test3:3307
test2:3307

14. Print the writable master of the cluster a given instance or cluster alias belongs to: which-cluster-master

For a given instance or alias: which-cluster-master

# orchestrator-client -c which-cluster-master -i test2:3307
test2:3307
# orchestrator-client -c which-cluster-master -i test
test2:3307

For all clusters: all-clusters-masters, one writable master per cluster

# orchestrator-client -c all-clusters-masters
test1:3307

15. Print all instances: all-instances

# orchestrator-client -c all-instances
test2:3307
test1:3307
test3:3307

16. Print the replicas in a cluster that can serve as pt-online-schema-change control replicas: which-cluster-osc-replicas

# orchestrator-client -c which-cluster-osc-replicas -i test
test1:3307
test3:3307
root@test1:~# orchestrator-client -c which-cluster-osc-replicas -i test2:3307
test1:3307
test3:3307

17. Print the healthy, replicating replicas in a cluster that can serve as pt-online-schema-change control replicas: which-cluster-osc-running-replicas

# orchestrator-client -c which-cluster-osc-running-replicas -i test
test1:3307
test3:3307
# orchestrator-client -c which-cluster-osc-running-replicas -i test1:3307
test1:3307
test3:3307

18. Print all downtimed instances: downtimed

# orchestrator-client -c downtimed
test1:3307
test3:3307

19. Print the data center where most masters live: dominant-dc

# orchestrator-client -c dominant-dc
BJ

20. Submit a cluster's master (or all clusters' masters) to KV stores: submit-masters-to-kv-stores

# orchestrator-client -c submit-masters-to-kv-stores 
mysql/master/test:test2:3307
mysql/master/test/hostname:test2
mysql/master/test/port:3307
mysql/master/test/ipv4:192.168.163.132
mysql/master/test/ipv6:

21. Relocate a replica under another instance: relocate

# orchestrator-client -c relocate -i test3:3307 -d test1:3307 # make test3:3307 a replica of test1:3307
test3:3307<test1:3307

Check:
# orchestrator-client -c topology -i test2:3307
test2:3307     [0s,ok,5.7.25-0ubuntu0.16.04.2-log,rw,ROW,>>,GTID]
+ test1:3307   [0s,ok,5.7.25-0ubuntu0.16.04.2-log,ro,ROW,>>,GTID]
  + test3:3307 [0s,ok,5.7.25-log,ro,ROW,>>,GTID]

22. Relocate all replicas of one instance under another: relocate-replicas

# orchestrator-client -c relocate-replicas -i test1:3307 -d test2:3307 # move all replicas of test1:3307 under test2:3307; the relocated replicas are listed
test3:3307

23. Move a replica one level up the topology (equivalent to dragging in Classic mode on the web UI): move-up

# orchestrator-client -c move-up -i test3:3307 -d test2:3307
test3:3307<test2:3307

The topology changes from test2:3307 -> test1:3307 -> test3:3307 to:

test2:3307 -> test1:3307
           -> test3:3307

24. Move a replica one level down, beneath its sibling (equivalent to dragging in Classic mode on the web UI): move-below

# orchestrator-client -c move-below -i test3:3307 -d test1:3307
test3:3307<test1:3307

The topology changes from

test2:3307 -> test1:3307
           -> test3:3307

to test2:3307 -> test1:3307 -> test3:3307

25. Move all replicas of a given instance one level up the topology, based on Classic mode: move-up-replicas

# orchestrator-client -c move-up-replicas -i test1:3307 
 test3:3307

The topology changes from test2:3307 -> test1:3307 -> test3:3307 to:

test2:3307 -> test1:3307
           -> test3:3307

26. Create master-master replication, pairing the given instance with its current master (the instance must replicate directly from the master): make-co-master

# orchestrator-client -c make-co-master -i test1:3307
test1:3307<test2:3307

27. Turn an instance into the master of its own master, effectively swapping the two: take-master

# orchestrator-client -c take-master -i test3:3307
test3:3307<test2:3307

The topology changes from test2:3307 -> test1:3307 -> test3:3307 to test2:3307 -> test3:3307 -> test1:3307

28. Move a replica via GTID: move-gtid

Running it through orchestrator-client errors out:

# orchestrator-client -c move-gtid -i test3:3307 -d test1:3307
parse error: Invalid numeric literal at line 1, column 9
parse error: Invalid numeric literal at line 1, column 9
parse error: Invalid numeric literal at line 1, column 9

Running it through the orchestrator binary works, adding the --ignore-raft-setup flag:

# orchestrator -c move-gtid -i test3:3307 -d test2:3307 --ignore-raft-setup
test3:3307<test2:3307

29. Move all replicas of a given instance to another instance via GTID: move-replicas-gtid

Running it through orchestrator-client errors out:

# orchestrator-client -c move-replicas-gtid -i test3:3307 -d test1:3307
jq: error (at <stdin>:1): Cannot index string with string "Key"

Running it through the orchestrator binary works, adding the --ignore-raft-setup flag:

# ./orchestrator -c move-replicas-gtid -i test2:3307 -d test1:3307 --ignore-raft-setup
test3:3307

30. Turn all siblings of a replica into its own replicas: take-siblings

# orchestrator-client -c take-siblings -i test3:3307
test3:3307<test1:3307

The topology changes from

test1:3307 -> test2:3307
           -> test3:3307

to test1:3307 -> test3:3307 -> test2:3307

31. Tag a given instance: tag

# orchestrator-client -c tag -i test1:3307 --tag 'name=AAA'
test1:3307 

32. List a given instance's tags: tags

# orchestrator-client -c tags -i test1:3307
name=AAA 

33. Print a given instance's value for a tag: tag-value

# orchestrator-client -c tag-value -i test1:3307 --tag "name"
AAA

34. Remove a tag from an instance: untag

# orchestrator-client -c untag -i test1:3307 --tag "name=AAA"
test1:3307 

35. List the instances carrying a given tag: tagged

# orchestrator-client -c tagged -t name
test3:3307
test1:3307
test2:3307

36. Mark an instance as downtimed, with a duration, an owner, and a reason: begin-downtime

# orchestrator-client -c begin-downtime -i test1:3307 -duration=10m -owner=zjy -reason 'test'
test1:3307

37. Take an instance out of downtime: end-downtime

# orchestrator-client -c end-downtime -i test1:3307
test1:3307

38. Request a maintenance lock on an instance: topology changes place a lock on the minimally-affected instance to keep two uncoordinated operations off the same instance: begin-maintenance

# orchestrator-client -c begin-maintenance -i test1:3307 --reason "XXX"
test1:3307

The lock expires after 10 minutes by default, controlled by the MaintenanceExpireMinutes parameter.

39. Release the maintenance lock on an instance: end-maintenance

# orchestrator-client -c end-maintenance -i test1:3307
test1:3307

40. Set the promotion rule, so a particular instance is preferred at recovery time: register-candidate, used together with promotion-rule

# orchestrator-client -c register-candidate -i test3:3307 --promotion-rule prefer 
test3:3307

This raises test3:3307's promotion weight; on the next failover it becomes the master.

41. Stop replication on an instance:

Plain STOP SLAVE: stop-replica

# orchestrator-client -c stop-replica -i test2:3307
test2:3307

Apply all relay logs first, then stop: stop-replica-nice

# orchestrator-client -c stop-replica-nice -i test2:3307
test2:3307

42. Start replication on an instance: start-replica

# orchestrator-client -c start-replica -i test2:3307
test2:3307

43. Restart replication on an instance: restart-replica

# orchestrator-client -c restart-replica -i test2:3307
test2:3307

44. Reset replication on an instance: reset-replica

# orchestrator-client -c reset-replica -i test2:3307
test2:3307

45. Detach a replica: for non-GTID replication, rewrites the binlog position into an impossible yet reversible value: detach-replica

# orchestrator-client -c detach-replica -i test2:3307

46. Reattach the replica: reattach-replica

# orchestrator-client -c reattach-replica  -i test2:3307 

47. Detach a replica by mangling its master_host into a comment: detach-replica-master-host (Master_Host becomes e.g. //test1)

# orchestrator-client -c detach-replica-master-host -i test2:3307
test2:3307

48. Reattach the replica: reattach-replica-master-host

# orchestrator-client -c reattach-replica-master-host -i test2:3307
test2:3307

49. Skip a single statement on a replica's SQL thread (e.g. a duplicate-key error), with or without GTID: skip-query

# orchestrator-client -c skip-query -i test2:3307
test2:3307

50. Apply errant GTID transactions as empty transactions on the cluster's master: gtid-errant-inject-empty (the "fix" button on the web UI)

# orchestrator-client -c gtid-errant-inject-empty  -i test2:3307
test2:3307 

51. Remove errant GTID transactions by way of RESET MASTER: gtid-errant-reset-master

# orchestrator-client -c gtid-errant-reset-master  -i test2:3307
test2:3307

52. Set semi-sync parameters:

orchestrator-client -c $variable -i test1:3307
    enable-semi-sync-master       enable semi-sync on the master
    disable-semi-sync-master      disable semi-sync on the master
    enable-semi-sync-replica      enable semi-sync on a replica
    disable-semi-sync-replica     disable semi-sync on a replica

53. Run SQL that needs a stop/start slave wrapper around it: restart-replica-statements

# orchestrator-client -c restart-replica-statements -i test3:3307 -query "change master to auto_position=1" | jq .[] -r 
stop slave io_thread;
stop slave sql_thread;
change master to auto_position=1;
start slave sql_thread;
start slave io_thread;

# orchestrator-client -c restart-replica-statements -i test3:3307 -query "change master to master_auto_position=1" | jq .[] -r  |  mysql -urep -p -htest3 -P3307
Enter password: 

54. Check whether one instance can replicate from another, per the replication rules (GTID and non-GTID):

Non-GTID: can-replicate-from

# orchestrator-client -c can-replicate-from -i test3:3307 -d test1:3307
test1:3307

GTID: can-replicate-from-gtid

# orchestrator-client -c can-replicate-from-gtid -i test3:3307 -d test1:3307
test1:3307 

55. Check whether an instance is replicating at this time: is-replicating

# output returned: replicating
# orchestrator-client -c is-replicating -i test2:3307
test2:3307

# no output: not replicating
# orchestrator-client -c is-replicating -i test1:3307

56. Check whether both the IO and SQL threads of an instance are strictly stopped: is-replication-stopped

# orchestrator-client -c is-replication-stopped -i test2:3307

57. Make an instance read-only, via SET GLOBAL read_only=1: set-read-only

# orchestrator-client -c set-read-only -i test2:3307
test2:3307

58. Make an instance writable, via SET GLOBAL read_only=0: set-writeable

# orchestrator-client -c set-writeable -i test2:3307
test2:3307

59. Rotate (flush) the binary logs on an instance: flush-binary-logs

# orchestrator-client -c flush-binary-logs -i test1:3307
test1:3307

60. Run a recovery manually, naming a dead instance: recover

# orchestrator-client -c recover -i test2:3307
test3:3307

In testing, this command forces recovery even for instances that are downtimed or in maintenance mode. Topology:

test1:3307 -> test2:3307 -> test3:3307 (downtimed). When test2:3307 dies, test3:3307 is downtimed, so no automatic failover happens; after running recover, the topology becomes:

test1:3307 -> test2:3307
           -> test3:3307

61. Gracefully switch the master to a designated replica: graceful-master-takeover

# orchestrator-client -c graceful-master-takeover -a test1:3307 -d test2:3307
test2:3307

The topology changes from test1:3307 -> test2:3307 to test2:3307 -> test1:3307. The designated new master becomes read-write, the new replica read-only, and START SLAVE still has to be run by hand.

Note the required configuration: the replication account and password must be retrievable from the meta table:

"ReplicationCredentialsQuery":"SELECT repl_user, repl_pass from meta.cluster where anchor=1"

62. Manually force a failover even if orchestrator sees no problem: force-master-failover. After the takeover, the old master is left detached and must be rejoined to the cluster by hand.

# orchestrator-client -c force-master-failover -i test1:3307
test3:3307

63. Forcibly discard the master and promote a designated (direct replica) instance even if everything is healthy: force-master-takeover. The old master (test1) ends up detached; the designated replica (test2) becomes the master.

# orchestrator-client -c force-master-takeover -i test1:3307 -d test2:3307
test2:3307

64. Acknowledge recoveries, the Audit -> Recovery -> Acknowledged button on the web UI: ack-cluster-recoveries / ack-all-recoveries

Acknowledge a given cluster: ack-cluster-recoveries

# orchestrator-client -c ack-cluster-recoveries  -i test2:3307 -reason=''
test1:3307

Acknowledge all clusters: ack-all-recoveries

# orchestrator-client -c ack-all-recoveries  -reason='OOOPPP'

65. Check, disable, and enable orchestrator's global recoveries:

Check: check-global-recoveries

# orchestrator-client -c check-global-recoveries
enabled

Disable: disable-global-recoveries

# orchestrator-client -c disable-global-recoveries
disabled

Enable: enable-global-recoveries

# orchestrator-client -c enable-global-recoveries
enabled

66. Request an analysis of potential problems across the replication topologies: replication-analysis

# orchestrator-client -c replication-analysis
test1:3307 (cluster test1:3307): ErrantGTIDStructureWarning

67. Raft checks: view the leader, check health, move the leader:

View the leader node:
# orchestrator-client -c raft-leader
192.168.163.131:10008

Health check:
# orchestrator-client -c raft-health
healthy

Leader hostname:
# orchestrator-client -c raft-leader-hostname
test1

Elect a designated host as leader:
# orchestrator-client -c raft-elect-leader -hostname test3
test3

68. Pseudo-GTID related commands:

match      # Match a replica beneath another (destination) instance using Pseudo-GTID
match-up #Transport the replica one level up the hierarchy, making it child of its grandparent, using Pseudo-GTID
match-up-replicas  #Matches replicas of the given instance one level up the topology, making them siblings of given instance, using Pseudo-GTID
last-pseudo-gtid #Dump last injected Pseudo-GTID entry on a server

That concludes Orchestrator usage and the command-line reference. The web API is documented under Orchestrator API; between the CLI and the API there is plenty to build further automation on.

Summary:

Orchestrator is an open-source MySQL replication topology management tool written in Go. It supports reshaping master-replica topologies, automatic failover of a failed master, and manual master switchover. It provides a web UI showing the topology and state of each MySQL cluster, lets you change parts of an instance's configuration, and also offers a command-line client and an API. Unlike MHA, Orchestrator itself can be deployed across multiple nodes, using the Raft distributed consensus protocol to keep itself highly available.

 

For more advanced usage, see MySQL High Availability at GitHub (also available in Chinese translation).
