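The previous post, "MySQL High-Availability Replication Management Tool: Orchestrator Introduction", gave an overview of Orchestrator's features, configuration, and deployment; the official documentation remains the most detailed reference. This post starts testing and explaining Orchestrator's individual features.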
Server environment:
Three servers.
1: MySQL instances (port 3306 is the orch backend database; port 3307 is the MySQL master/replica setup with GTID enabled)
Master: 192.168.163.131:3307
Slave:  192.168.163.132:3307
Slave:  192.168.163.133:3307
2: hosts (/etc/hosts):
192.168.163.131 test1
192.168.163.132 test2
192.168.163.133 test3
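Note that orch relies on the replicas' IO threads to detect a master outage: when orch itself cannot reach the master, it also checks through the replicas whether the master is really down. With a default CHANGE MASTER setup, replicas take too long to notice that the master is gone, so the replication settings need a small adjustment: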
change master to master_host='192.168.163.131',master_port=3307,master_user='rep',master_password='rep',master_auto_position=1,MASTER_HEARTBEAT_PERIOD=2,MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400;
set global slave_net_timeout=8;
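slave_net_timeout (global variable): since MySQL 5.7.7 the default is 60 seconds. It defines how many seconds the replica waits for data from the master. Once that time is exceeded, the replica stops reading, drops the connection, and tries to reconnect.
master_heartbeat_period: the replication heartbeat interval, which defaults to half of slave_net_timeout. When the master has no data to send, it sends a heartbeat packet every master_heartbeat_period seconds so the replica knows whether the master is still alive.
slave_net_timeout sets how long the replica waits without receiving any data before treating the connection as timed out, after which the replica IO thread reconnects to the master. Combining the two settings avoids replication delays caused by network problems. master_heartbeat_period is expressed in seconds and may be fractional, for example 10.5, with a maximum resolution of 1 millisecond.
The retry strategy is:
If the replica has received nothing from the master for slave_net_timeout seconds, it makes its first reconnect attempt. It then retries every master_connect_retry seconds and gives up only after master_retry_count attempts. If a retry succeeds, the replica considers the master healthy and starts another slave_net_timeout wait. The defaults are 60 seconds for slave_net_timeout, 60 seconds for master_connect_retry, and 86400 attempts for master_retry_count. In other words, with the defaults the replica tries to reconnect only after the master has sent nothing at all for a full minute.
With the settings above, a master outage is noticed in roughly 8 to 10 seconds, after which Orchestrator starts the failover. Also note that orch manages instances by hostname by default, so the report_host and report_port parameters must be added to the MySQL configuration file.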
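The 8 to 10 second figure can be sanity-checked with a rough model. The sketch below is only an approximation under stated assumptions: the function name `detection_window` and the `failed_retries` parameter are made up for illustration, and the model ignores when the last heartbeat arrived and how often orchestrator itself polls.

```python
def detection_window(slave_net_timeout, master_connect_retry, failed_retries=1):
    """Rough bounds, in seconds, for a replica to notice a dead master.

    Assumption: the replica waits slave_net_timeout seconds without data or
    heartbeats, then makes failed_retries reconnect attempts spaced
    master_connect_retry seconds apart before the failure is obvious.
    """
    earliest = slave_net_timeout
    latest = slave_net_timeout + failed_retries * master_connect_retry
    return earliest, latest


if __name__ == "__main__":
    # Settings used in this post: slave_net_timeout=8, MASTER_CONNECT_RETRY=1
    print(detection_window(8, 1, failed_retries=2))
    # MySQL defaults: slave_net_timeout=60, MASTER_CONNECT_RETRY=60
    print(detection_window(60, 60))
```

Under this model the tuned settings give a window of 8 to 10 seconds, while the defaults push it to a minute or more, which is why the CHANGE MASTER options are adjusted.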
Database environment:
Orchestrator backend database:
When the Orchestrator program starts, it automatically creates the orchestrator database in the backend and stores its own data there.
Databases managed by Orchestrator:
Some of the query parameters in the configuration file require a meta database on every managed target instance to hold metadata (similar to a CMDB), for example pt-heartbeat for measuring replication lag and a cluster table for the alias, data center, and so on.
The cluster table in the test environment looks like this:
> CREATE TABLE `cluster` (
    `anchor` tinyint(4) NOT NULL,
    `cluster_name` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
    `cluster_domain` varchar(128) CHARACTER SET ascii NOT NULL DEFAULT '',
    `data_center` varchar(128) NOT NULL,
    PRIMARY KEY (`anchor`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8

> select * from cluster;
+--------+--------------+----------------+-------------+
| anchor | cluster_name | cluster_domain | data_center |
+--------+--------------+----------------+-------------+
|      1 | test         | CaoCao         | BJ          |
+--------+--------------+----------------+-------------+
Start the Orchestrator process:
./orchestrator --config=/etc/orchestrator.conf.json http
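Enter the IP and port of any of the three hosts in a browser (http://192.168.163.131:3000) to open the web management interface, then enter the details of any managed MySQL instance under Discover in the Clusters menu. Once the instance has been added, the web interface displays the discovered topology.
The web interface supports a range of management operations; its buttons are explained below.
1. Instance properties, some of which can be modified (click any icon on the instance you want to change in the web interface):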
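Field descriptions
Instance Alias: instance alias
Last seen: time of the last check
Self coordinates: the instance's own binlog coordinates
Num replicas: number of replicas
Server ID: MySQL server_id
Server UUID: MySQL UUID
Version: version
Read only: whether the instance is read-only
Has binary logs: whether binary logging is enabled
Binlog format: binlog format
Logs slave updates: whether log_slave_updates is enabled
GTID supported: whether GTID is supported
GTID based replication: whether replication is GTID-based
GTID mode: whether GTID is enabled for replication
Executed GTID set: the set of GTIDs executed through replication
Uptime: time since startup
Allow TLS: whether TLS is enabled
Cluster: cluster alias
Audit: audit the instance
Agent: agent for the instance
Note: every field that is followed by a button in the web interface can be changed there, for example read-only status or GTID-based replication. Begin Downtime marks the instance as downtimed; if a failover happens while it is downtimed, the instance does not take part.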
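2. Change the replication topology freely: drag instances directly on the graph to change replication, and the topology relationships are rebuilt automatically.
3. Automatic failover when the master goes down:
When the master fails, it is removed from the topology automatically, the most suitable replica is promoted to master, and the replication topology is repaired. During failover you can follow /tmp/recovery.log (the path is hard-coded in the configuration file), which records the external scripts run by hooks during the failover, similar to MHA's master_ip_failover_script parameter. External scripts can handle tasks such as VIP switching or changes to proxies, DNS, middleware, and LVS; write them to suit your own environment.
4. Orchestrator high availability. Three nodes were deployed from the start and communicate through the Raft parameters in the configuration file. As long as 2 Orchestrator nodes are healthy, service is unaffected; if 2 nodes fail, failover fails.
With 2 nodes down, every node in the graph is shown in grey: Raft has lost its quorum, so orch's failover function fails. Compared with the single point of failure of the MHA Manager, Orchestrator uses Raft to provide its own high availability and to handle network isolation, especially network failures across data centers. A few notes on Raft as a consensus algorithm: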
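Orchestrator nodes elect a leader that holds a quorum. With 3 orch nodes, one of them becomes leader (the quorum size is 2 for 3 nodes and 3 for 5 nodes). Only the leader is allowed to make changes. Every MySQL topology server is probed independently by the three orchestrator nodes. Under normal conditions the three nodes see more or less the same topology, but each analyzes it independently and writes to its own dedicated backend database server:
① All changes must go through the leader.
② The orchestrator command-line client must not be used when raft mode is enabled.
③ In raft mode, use orchestrator-client instead; it can be installed on servers that do not run orchestrator.
④ The failure of a single orchestrator node does not affect orchestrator's availability. In a 3-node setup at most one node may fail; in a 5-node setup, 2 nodes may fail.
⑤ An orchestrator node that shuts down abnormally and is then restarted rejoins the Raft group and receives any events it missed, as long as enough Raft log is retained.
⑥ To join an orchestrator node that has been away longer than log retention allows, or a node whose database is completely empty, clone the backend DB from another active node.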
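The quorum figures quoted above (2 of 3, 3 of 5) follow from the usual majority rule. A minimal sketch, with hypothetical helper names `quorum_size` and `tolerated_failures`:

```python
def quorum_size(nodes):
    """Smallest strict majority of a group with the given number of nodes."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

This also shows why an even node count adds little: 4 nodes need a quorum of 3 and still tolerate only one failure.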
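For more information on Raft, see: https://github.com/github/orchestrator/blob/master/docs/raft.md
Orchestrator can be made highly available in 2 ways: through Raft as described above (recommended), or through replication of the backend database. See the documentation for details; it compares the two deployment methods at length.
At this point Orchestrator's basic functions are in place: automatic failover, topology changes, and visual operation through the web interface.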
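5. What each button in the web interface does
①: Home > Status: shows orch's status, including uptime, version, backend database, and the state of each Raft node.
②: Clusters > Dashboard: shows all MySQL instances managed by orch.
③: Clusters > Failure analysis: shows failure analysis, including the list of recorded failure types.
④: Clusters > Discover: discovers MySQL instances to manage.
⑤: Audit > Failure detection: failure detection records, including history.
⑥: Audit > Recovery: failure recovery records and recovery acknowledgement.
⑦: Audit > Agent: an agent is a service that runs on the MySQL host and communicates with orchestrator; it can provide orch with operating system, file system, and LVM information, and invoke certain commands and scripts.
⑧: The navigation bar icon that corresponds to the icons in the left-hand sidebar:
Row 1: view and edit the cluster alias.
Row 2: pools.
Row 3: Compact display.
Row 4: Pool indicator.
Row 5: Colorize DC, which shows each data center in a different color.
Row 6: Anonymize, which hides the hostnames in the cluster.
Note: the icons in the left-hand sidebar summarize an instance: its name, alias, failure detection and recovery information, and so on.
⑨: The navigation bar icon that shows whether global recoveries are disabled. When they are disabled, no failover takes place.
⑩: The navigation bar icon that shows whether page auto-refresh is on (every 60 seconds by default).
⑪: The navigation bar icon that shows the MySQL instance relocation mode.
Smart mode: orch chooses the relocation method automatically.
Classic mode: relocation by binlog file and position.
GTID mode: relocation by GTID.
Pseudo GTID mode: relocation by Pseudo-GTID.
That covers the basic tests and the web interface. Compared with MHA the experience is much better: some parameters can be changed in the web interface, the replication topology can be rearranged, and above all the MHA Manager single point of failure is gone. What reason is left not to replace MHA? :)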
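Orchestrator implements automatic failover. Here is roughly how the failover process works.
1. Detection
① orchestrator uses the replication topology: it first checks the master itself and then observes its replicas.
② If orchestrator cannot reach the master but can reach the master's replicas, it checks through the replicas. If the replicas cannot see the master either (IO thread), the second check confirms it and the master is judged to be down.
This detection method is sound: when the replicas cannot reach the master either, replication is certainly broken, so a switch is justified. That makes it very reliable in production.
A detected failure does not always lead to automatic recovery. Examples: global recoveries are disabled, the instance is downtimed, the previous recovery happened less than RecoveryPeriodBlockSeconds ago, or the failure type is not considered worth recovering. Detection is independent of recovery and is always enabled. Every detection runs the OnFailureDetectionProcesses hooks.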
{
  "FailureDetectionPeriodBlockMinutes": 60
}
Hooks-related parameters:
{
  "OnFailureDetectionProcesses": [
    "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countReplicas}' >> /tmp/recovery.log"
  ]
}
MySQL replication settings to tune:
slave_net_timeout
MASTER_CONNECT_RETRY
2. Recovery
Instances to be recovered must support GTID or Pseudo-GTID and have binary logging enabled. The recovery configuration is as follows:
{
  "RecoveryPeriodBlockSeconds": 3600,
  "RecoveryIgnoreHostnameFilters": [],
  "RecoverMasterClusterFilters": [
    "thiscluster",
    "thatcluster"
  ],
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": [
    "*"
  ]
}
{
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PreventCrossDataCenterMasterFailover": false,
  "FailMasterPromotionIfSQLThreadNotUpToDate": true,
  "MasterFailoverLostInstancesDowntimeMinutes": 10,
  "DetachLostReplicasAfterMasterFailover": true
}
Hooks:
{
  "PreGracefulTakeoverProcesses": [
    "echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
  ],
  "PreFailoverProcesses": [
    "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
  ],
  "PostFailoverProcesses": [
    "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostUnsuccessfulFailoverProcesses": [],
  "PostMasterFailoverProcesses": [
    "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
  ],
  "PostIntermediateMasterFailoverProcesses": [],
  "PostGracefulTakeoverProcesses": [
    "echo 'Planned takeover complete' >> /tmp/recovery.log"
  ]
}
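The first fragment above lists RecoverMasterClusterFilters twice. If both entries end up in one JSON object, many parsers silently keep only one of them, so it can be worth checking a config file for repeated keys before deploying it. The following is a minimal sketch using Python's standard json module; `reject_duplicate_keys` and `load_config` are hypothetical helper names, not part of Orchestrator.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises when a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key in config: {key}")
        seen[key] = value
    return seen


def load_config(text):
    """Parse JSON text, failing loudly on duplicate keys at any nesting level."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


if __name__ == "__main__":
    sample = """
    {
      "RecoverMasterClusterFilters": ["thiscluster", "thatcluster"],
      "RecoverMasterClusterFilters": ["*"]
    }
    """
    try:
        load_config(sample)
    except ValueError as err:
        print(err)
```

The check is run against a complete, valid JSON file; trailing commas or comments in a fragment would be rejected by the parser before the duplicate check is reached.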
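For the meaning of each parameter, see "MySQL High-Availability Replication Management Tool: Orchestrator Introduction". External custom scripts (hooks) can run during both failure detection and recovery, to handle VIP, proxy, or DNS changes.
Both intermediate masters (DeadIntermediateMaster) and masters can be recovered:
Intermediate master: recovery looks for a sibling to replicate from. Replicas are matched according to which instances have log-slave-updates, whether an instance is lagging, whether it has replication filters, which MySQL version it runs, and so on.
Master: recovery can promote a specific replica according to a promotion rule (register-candidate). The promoted replica is not necessarily the most up to date one but the most suitable one. A promotion rule stays valid for 1 hour after it is set.
The promotion rule options are:
prefer      -- preferred
neutral     -- neutral (default)
prefer_not  -- preferably not
must_not    -- never promote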
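To make the ordering of the four rules concrete, here is a deliberately simplified sketch. It is not Orchestrator's actual selection algorithm, which also weighs replication position, version, binlog format, data center, and more; `RULE_RANK` and `pick_candidate` are hypothetical names, and ties are broken by list order purely for illustration.

```python
# Lower rank means more desirable; must_not is excluded entirely.
RULE_RANK = {"prefer": 0, "neutral": 1, "prefer_not": 2}


def pick_candidate(replicas):
    """Pick a promotion candidate from (instance, promotion_rule) pairs.

    Returns the instance name, or None if every replica is must_not.
    """
    eligible = [
        (RULE_RANK[rule], index, name)
        for index, (name, rule) in enumerate(replicas)
        if rule != "must_not"
    ]
    if not eligible:
        return None
    return min(eligible)[2]


if __name__ == "__main__":
    replicas = [("test2:3307", "neutral"), ("test3:3307", "prefer")]
    print(pick_candidate(replicas))
```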
The supported recovery types are automatic recovery, graceful takeover, manual recovery, and forced manual recovery; the corresponding hooks can run during each of them. For the detailed recovery flow, see the description of the recovery process; for recovery configuration, see the official documentation.
Addendum: apart from the failover itself, every recovery needs your own hook scripts to handle external changes such as VIP, DNS, or proxy updates. So how should all these hook parameters be set? Which ones run, which ones do not, and in what order? The documentation covers this, but to make it clearer, the hooks below are numbered and then run through each recovery scenario:
"OnFailureDetectionProcesses": [ # runs when a failure is detected
  "echo '② Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
],
"PreGracefulTakeoverProcesses": [ # runs immediately before the master is made read-only
  "echo '① Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
],
"PreFailoverProcesses": [ # runs immediately before recovery actions are taken
  "echo '③ Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
],
"PostMasterFailoverProcesses": [ # runs at the end of a successful master recovery
  "echo '④ Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostFailoverProcesses": [ # runs at the end of any successful recovery
  "echo '⑤ (for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostUnsuccessfulFailoverProcesses": [ # runs at the end of any unsuccessful recovery
  "echo '⑧' >> /tmp/recovery.log"
],
"PostIntermediateMasterFailoverProcesses": [ # runs at the end of a successful intermediate master recovery
  "echo '⑥ Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostGracefulTakeoverProcesses": [ # runs after the old master has been placed below the newly promoted master
  "echo '⑦ Planned takeover complete' >> /tmp/recovery.log"
],
Master crash, automatic failover:
② Detected UnreachableMaster on test1:3307. Affected replicas: 2
② Detected DeadMaster on test1:3307. Affected replicas: 2
③ Will recover from DeadMaster on test1:3307
④ Recovered from DeadMaster on test1:3307. Failed: test1:3307; Promoted: test2:3307
⑤ (for all types) Recovered from DeadMaster on test1:3307. Failed: test1:3307; Successor: test2:3307

Graceful master switchover: test2:3307 is gracefully switched over to test1:3307; after the switch, start slave has to be run manually:
orchestrator-client -c graceful-master-takeover -a test2:3307 -d test1:3307
① Planned takeover about to take place on test2:3307. Master will switch to read_only
② Detected DeadMaster on test2:3307. Affected replicas: 1
③ Will recover from DeadMaster on test2:3307
④ Recovered from DeadMaster on test2:3307. Failed: test2:3307; Promoted: test1:3307
⑤ (for all types) Recovered from DeadMaster on test2:3307. Failed: test2:3307; Successor: test1:3307
⑦ Planned takeover complete

Manual recovery: when a replica is downtimed or in maintenance mode and the master crashes, no automatic failover takes place; recovery has to be run manually, naming the dead master instance:
orchestrator-client -c recover -i test1:3307
② Detected UnreachableMaster on test1:3307. Affected replicas: 2
② Detected DeadMaster on test1:3307. Affected replicas: 2
③ Will recover from DeadMaster on test1:3307
④ Recovered from DeadMaster on test1:3307. Failed: test1:3307; Promoted: test2:3307
⑤ (for all types) Recovered from DeadMaster on test1:3307. Failed: test1:3307; Successor: test2:3307

Forced manual recovery, which recovers regardless of circumstances:
orchestrator-client -c force-master-failover -i test2:3307
② Detected DeadMaster on test2:3307. Affected replicas: 2
③ Will recover from DeadMaster on test2:3307
② Detected AllMasterSlavesNotReplicating on test2:3307. Affected replicas: 2
④ Recovered from DeadMaster on test2:3307. Failed: test2:3307; Promoted: test1:3307
⑤ (for all types) Recovered from DeadMaster on test2:3307. Failed: test2:3307; Successor: test1:3307
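In the scenarios above, neither ⑥ nor ⑧ ran. ⑥ runs when an intermediate master is recovered; without intermediate masters (cascading replication) it does not need to be set. ⑧ runs when a recovery fails; none of the recoveries above failed. It is a good place to hook in alerting.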
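Because each hook in this post writes a numbered marker to /tmp/recovery.log, the order in which hooks fired can be read back from the log. The sketch below assumes exactly that convention (a circled number at the start of each line); `HOOK_BY_MARKER` and `hook_sequence` are hypothetical names, and the mapping mirrors the numbering used in the hook listing of this post.

```python
# Marker-to-hook mapping taken from the numbered echo commands in this post.
HOOK_BY_MARKER = {
    "①": "PreGracefulTakeoverProcesses",
    "②": "OnFailureDetectionProcesses",
    "③": "PreFailoverProcesses",
    "④": "PostMasterFailoverProcesses",
    "⑤": "PostFailoverProcesses",
    "⑥": "PostIntermediateMasterFailoverProcesses",
    "⑦": "PostGracefulTakeoverProcesses",
    "⑧": "PostUnsuccessfulFailoverProcesses",
}


def hook_sequence(log_text):
    """Return the hook names in the order their markers appear in the log."""
    sequence = []
    for line in log_text.splitlines():
        line = line.strip()
        if line and line[0] in HOOK_BY_MARKER:
            sequence.append(HOOK_BY_MARKER[line[0]])
    return sequence


if __name__ == "__main__":
    log = """\
② Detected UnreachableMaster on test1:3307. Affected replicas: 2
② Detected DeadMaster on test1:3307. Affected replicas: 2
③ Will recover from DeadMaster on test1:3307
④ Recovered from DeadMaster on test1:3307. Failed: test1:3307; Promoted: test2:3307
⑤ (for all types) Recovered from DeadMaster on test1:3307. Failed: test1:3307; Successor: test2:3307
"""
    for name in hook_sequence(log):
        print(name)
```

A check like this can be used in a test environment to confirm that a scenario triggers the hooks you expect before wiring real VIP or DNS scripts into them.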
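For deploying Orchestrator in production, refer to the documentation.
1. First decide how Orchestrator's own backend database is made highly available: a single MySQL, MySQL replication, or Orchestrator's own Raft.
2. Run discovery (web, orchestrator-client):
orchestrator-client -c discover -i this.hostname.com
3. Decide on promotion rules (some servers are better suited for promotion):
orchestrator -c register-candidate -i ${::fqdn} --promotion-rule ${promotion_rule}
4. A server with a problem is shown in the Problems drop-down of the web interface. A downtimed instance is not shown in the problem list and is not recovered; it is in maintenance mode.
orchestrator -c begin-downtime -i ${::fqdn} --duration=5m --owner=cron --reason=continuous_downtime
The API can be used as well:
curl -s "http://my.orchestrator.service:80/api/begin-downtime/my.hostname/3306/wallace/experimenting+failover/45m"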
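When calling the API from scripts, building the URL programmatically avoids quoting mistakes. The sketch below only reproduces the path layout visible in the curl example above (host, port, owner, reason, duration); the function name `begin_downtime_url` is hypothetical, and encoding spaces in the reason as `+` is an assumption based on that same example.

```python
from urllib.parse import quote, quote_plus


def begin_downtime_url(base, host, port, owner, reason, duration):
    """Build a begin-downtime API URL following the curl example's layout."""
    parts = [
        quote(host, safe=""),
        str(port),
        quote(owner, safe=""),
        quote_plus(reason),  # assumption: spaces become '+', as in the example
        duration,
    ]
    return base.rstrip("/") + "/api/begin-downtime/" + "/".join(parts)


if __name__ == "__main__":
    print(begin_downtime_url(
        "http://my.orchestrator.service:80",
        "my.hostname", 3306, "wallace", "experimenting failover", "45m",
    ))
```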
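5. Pseudo-GTID: if GTID is not enabled in MySQL, Pseudo-GTID can be enabled to get GTID-like functionality.
6. Store metadata. Most metadata is obtained through the query parameters, for example reading the cluster alias (DetectClusterAliasQuery), data center (DetectDataCenterQuery), and domain (DetectClusterDomainQuery) from the custom cluster table, as well as replication lag (pt-heartbeat) and whether semi-sync is enforced (DetectSemiSyncEnforcedQuery). Regular expression matching is also available through DataCenterPattern, PhysicalEnvironmentPattern, and similar parameters.
7. Instances can be tagged.
Besides the web interface, Orchestrator offers a command line (orchestrator-client) and an API (curl) for running further management commands. The most commonly used ones are described next.
Use help to see which commands are available: ./orchestrator-client --help. The manual describes each command.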
Usage: orchestrator-client -c <command> [flags...] Example: orchestrator-client -c which-master -i some.replica Options: -h, --help print this help -c <command>, --command <command> indicate the operation to perform (see listing below) -a <alias>, --alias <alias> cluster alias -o <owner>, --owner <owner> name of owner for downtime/maintenance commands -r <reason>, --reason <reason> reason for downtime/maintenance operation -u <duration>, --duration <duration> duration for downtime/maintenance operations -R <promotion rule>, --promotion-rule <promotion rule> rule for 'register-candidate' command -U <orchestrator_api>, --api <orchestrator_api> override $orchestrator_api environemtn variable, indicate where the client should connect to. -P <api path>, --path <api path> With '-c api', indicate the specific API path you wish to call -b <username:password>, --auth <username:password> Specify when orchestrator uses basic HTTP auth. -q <query>, --query <query> Indicate query for 'restart-replica-statements' command -l <pool name>, --pool <pool name> pool name for pool related commands -H <hostname> -h <hostname> indicate host for resolve and raft operations help Show available commands which-api Output the HTTP API to be used api Invoke any API request; provide --path argument async-discover Lookup an instance, investigate it asynchronously. 
Useful for bulk loads discover Lookup an instance, investigate it forget Forget about an instance's existence forget-cluster Forget about a cluster topology Show an ascii-graph of a replication topology, given a member of that topology topology-tabulated Show an ascii-graph of a replication topology, given a member of that topology, in tabulated format clusters List all clusters known to orchestrator clusters-alias List all clusters known to orchestrator search Search for instances matching given substring instance"|"which-instance Output the fully-qualified hostname:port representation of the given instance, or error if unknown which-master Output the fully-qualified hostname:port representation of a given instance's master which-replicas Output the fully-qualified hostname:port list of replicas of a given instance which-broken-replicas Output the fully-qualified hostname:port list of broken replicas of a given instance which-cluster-instances Output the list of instances participating in same cluster as given instance which-cluster Output the name of the cluster an instance belongs to, or error if unknown to orchestrator which-cluster-master Output the name of a writable master in given cluster all-clusters-masters List of writeable masters, one per cluster all-instances The complete list of known instances which-cluster-osc-replicas Output a list of replicas in a cluster, that could serve as a pt-online-schema-change operation control replicas which-cluster-osc-running-replicas Output a list of healthy, replicating replicas in a cluster, that could serve as a pt-online-schema-change operation control replicas downtimed List all downtimed instances dominant-dc Name the data center where most masters are found submit-masters-to-kv-stores Submit a cluster's master, or all clusters' masters to KV stores relocate Relocate a replica beneath another instance relocate-replicas Relocates all or part of the replicas of a given instance under another instance match Matches 
a replica beneath another (destination) instance using Pseudo-GTID match-up Transport the replica one level up the hierarchy, making it child of its grandparent, using Pseudo-GTID match-up-replicas Matches replicas of the given instance one level up the topology, making them siblings of given instance, using Pseudo-GTID move-up Move a replica one level up the topology move-below Moves a replica beneath its sibling. Both replicas must be actively replicating from same master. move-equivalent Moves a replica beneath another server, based on previously recorded "equivalence coordinates" move-up-replicas Moves replicas of the given instance one level up the topology make-co-master Create a master-master replication. Given instance is a replica which replicates directly from a master. take-master Turn an instance into a master of its own master; essentially switch the two. move-gtid Move a replica beneath another instance via GTID move-replicas-gtid Moves all replicas of a given instance under another (destination) instance using GTID repoint Make the given instance replicate from another instance without changing the binglog coordinates. Use with care repoint-replicas Repoint all replicas of given instance to replicate back from the instance. Use with care take-siblings Turn all siblings of a replica into its sub-replicas. tags List tags for a given instance tag-value List tags for a given instance tag Add a tag to a given instance. Tag in "tagname" or "tagname=tagvalue" format untag Remove a tag from an instance untag-all Remove a tag from all matching instances tagged List instances tagged by tag-string. Format: "tagname" or "tagname=tagvalue" or comma separated "tag0,tag1=val1,tag2" for intersection of all. 
submit-pool-instances Submit a pool name with a list of instances in that pool which-heuristic-cluster-pool-instances List instances of a given cluster which are in either any pool or in a specific pool begin-downtime Mark an instance as downtimed end-downtime Indicate an instance is no longer downtimed begin-maintenance Request a maintenance lock on an instance end-maintenance Remove maintenance lock from an instance register-candidate Indicate the promotion rule for a given instance register-hostname-unresolve Assigns the given instance a virtual (aka "unresolved") name deregister-hostname-unresolve Explicitly deregister/dosassociate a hostname with an "unresolved" name stop-replica Issue a STOP SLAVE on an instance stop-replica-nice Issue a STOP SLAVE on an instance, make effort to stop such that SQL thread is in sync with IO thread (ie all relay logs consumed) start-replica Issue a START SLAVE on an instance restart-replica Issue STOP and START SLAVE on an instance reset-replica Issues a RESET SLAVE command; use with care detach-replica Stops replication and modifies binlog position into an impossible yet reversible value. reattach-replica Undo a detach-replica operation detach-replica-master-host Stops replication and modifies Master_Host into an impossible yet reversible value. 
reattach-replica-master-host Undo a detach-replica-master-host operation skip-query Skip a single statement on a replica; either when running with GTID or without gtid-errant-reset-master Remove errant GTID transactions by way of RESET MASTER gtid-errant-inject-empty Apply errant GTID as empty transactions on cluster's master enable-semi-sync-master Enable semi-sync (master-side) disable-semi-sync-master Disable semi-sync (master-side) enable-semi-sync-replica Enable semi-sync (replica-side) disable-semi-sync-replica Disable semi-sync (replica-side) restart-replica-statements Given `-q "<query>"` that requires replication restart to apply, wrap query with stop/start slave statements as required to restore instance to same replication state. Print out set of statements can-replicate-from Check if an instance can potentially replicate from another, according to replication rules can-replicate-from-gtid Check if an instance can potentially replicate from another, according to replication rules and assuming Oracle GTID is-replicating Check if an instance is replicating at this time (both SQL and IO threads running) is-replication-stopped Check if both SQL and IO threads state are both strictly stopped. set-read-only Turn an instance read-only, via SET GLOBAL read_only := 1 set-writeable Turn an instance writeable, via SET GLOBAL read_only := 0 flush-binary-logs Flush binary logs on an instance last-pseudo-gtid Dump last injected Pseudo-GTID entry on a server recover Do auto-recovery given a dead instance, assuming orchestrator agrees there's a problem. Override blocking. graceful-master-takeover Gracefully promote a new master. Either indicate identity of new master via '-d designated.instance.com' or setup replication tree to have a single direct replica to the master. force-master-failover Forcibly discard master and initiate a failover, even if orchestrator doesn't see a problem. 
This command lets orchestrator choose the replacement master force-master-takeover Forcibly discard master and promote another (direct child) instance instead, even if everything is running well ack-cluster-recoveries Acknowledge recoveries for a given cluster; this unblocks pending future recoveries ack-all-recoveries Acknowledge all recoveries disable-global-recoveries Disallow orchestrator from performing recoveries globally enable-global-recoveries Allow orchestrator to perform recoveries globally check-global-recoveries Show the global recovery configuration replication-analysis Request an analysis of potential crash incidents in all known topologies raft-leader Get identify of raft leader, assuming raft setup raft-health Whether node is part of a healthy raft group raft-leader-hostname Get hostname of raft leader, assuming raft setup raft-elect-leader Request raft re-elections, provide hint for new leader's identity
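orchestrator-client does not need to be installed alongside the Orchestrator service and does not need access to the backend database; it can run on any host.
Note: because Raft is configured and there are several Orchestrator nodes, the ORCHESTRATOR_API environment variable is required; orchestrator-client then picks the leader automatically. For example:
export ORCHESTRATOR_API="test1:3000/api test2:3000/api test3:3000/api"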
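The variable is a space-separated list of API endpoints, and the client has to find the one that is currently leader. The sketch below illustrates that idea only; it is not orchestrator-client's implementation. `parse_api_endpoints` and `find_leader` are hypothetical names, and `is_leader` stands for whatever probe you supply (for instance an HTTP request to the leader-check endpoint mentioned later in this post).

```python
def parse_api_endpoints(value):
    """Split an ORCHESTRATOR_API-style value into individual endpoints."""
    return value.split()


def find_leader(endpoints, is_leader):
    """Return the first endpoint for which the supplied probe reports leadership.

    is_leader is a caller-provided callable taking an endpoint string and
    returning True or False; no network access happens in this sketch.
    """
    for endpoint in endpoints:
        if is_leader(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    endpoints = parse_api_endpoints("test1:3000/api test2:3000/api test3:3000/api")
    # Stand-in probe that pretends test3 is the leader.
    print(find_leader(endpoints, lambda endpoint: endpoint.startswith("test3")))
```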
1. List all clusters: clusters
Default:
# orchestrator-client -c clusters
test2:3307
Include the cluster alias in the output: clusters-alias
# orchestrator-client -c clusters-alias
test2:3307,test
2. Discover a given instance: discover/async-discover
Synchronous discovery:
# orchestrator-client -c discover -i test1:3307
test1:3307
Asynchronous discovery, suitable for bulk loads:
# orchestrator-client -c async-discover -i test1:3307
:null
3. Forget a given object: forget/forget-cluster
Forget a given instance:
# orchestrator-client -c forget -i test1:3307
Forget a given cluster:
# orchestrator-client -c forget-cluster -i test
4. Print the topology of a given cluster: topology/topology-tabulated
Plain output:
# orchestrator-client -c topology -i test1:3307
test2:3307   [0s,ok,5.7.25-0ubuntu0.16.04.2-log,rw,ROW,>>,GTID]
+ test1:3307 [0s,ok,5.7.25-0ubuntu0.16.04.2-log,ro,ROW,>>,GTID]
+ test3:3307 [0s,ok,5.7.25-log,ro,ROW,>>,GTID]
Tabulated output:
# orchestrator-client -c topology-tabulated -i test1:3307
test2:3307  |0s|ok|5.7.25-0ubuntu0.16.04.2-log|rw|ROW|>>,GTID
+ test1:3307|0s|ok|5.7.25-0ubuntu0.16.04.2-log|ro|ROW|>>,GTID
+ test3:3307|0s|ok|5.7.25-log                 |ro|ROW|>>,GTID
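The tabulated format is convenient to post-process in scripts. The sketch below parses it into dictionaries. The field names in `FIELDS` are labels chosen here for illustration, and the depth calculation assumes each nesting level is shown as two more spaces of indent before the `+`; both are assumptions rather than documented behavior.

```python
# Labels for the pipe-separated columns after the instance name (chosen here).
FIELDS = ("lag", "status", "version", "mode", "binlog_format", "flags")


def parse_topology_tabulated(text):
    """Parse topology-tabulated style output into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cells = line.split("|")
        if len(cells) != len(FIELDS) + 1:
            continue  # not a topology row
        raw_name = cells[0]
        stripped = raw_name.lstrip()
        indent = len(raw_name) - len(stripped)
        if stripped.startswith("+ "):
            depth = indent // 2 + 1  # assumption: two spaces per nesting level
            name = stripped[2:].strip()
        else:
            depth = 0
            name = stripped.strip()
        row = {"instance": name, "depth": depth}
        row.update(zip(FIELDS, (cell.strip() for cell in cells[1:])))
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = """\
test2:3307  |0s|ok|5.7.25-0ubuntu0.16.04.2-log|rw|ROW|>>,GTID
+ test1:3307|0s|ok|5.7.25-0ubuntu0.16.04.2-log|ro|ROW|>>,GTID
+ test3:3307|0s|ok|5.7.25-log                 |ro|ROW|>>,GTID
"""
    for row in parse_topology_tabulated(sample):
        print(row["depth"], row["instance"], row["mode"])
```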
5. Show which API is in use; the client picks the leader itself: which-api
# orchestrator-client -c which-api
test3:3000/api
The same can be checked at http://192.168.163.133:3000/api/leader-check.
6. Invoke an API request; must be used together with the -path argument: api ... -path
# orchestrator-client -c api -path clusters
[ "test2:3307" ]
# orchestrator-client -c api -path leader-check
"OK"
# orchestrator-client -c api -path status
{ "Code": "OK", "Message": "Application node is healthy"...}
7. Search for instances: search
# orchestrator-client -c search -i test
test2:3307
test1:3307
test3:3307
8. Print the master of a given instance: which-master
# orchestrator-client -c which-master -i test1:3307
test2:3307
# orchestrator-client -c which-master -i test3:3307
test2:3307
# orchestrator-client -c which-master -i test2:3307   # test2 is itself the master
:0
9. Print the replicas of a given instance: which-replicas
# orchestrator-client -c which-replicas -i test2:3307
test1:3307
test3:3307
10. Print the instance name of a given instance: which-instance
# orchestrator-client -c instance -i test1:3307
test1:3307
11. Print the broken replicas of a given master: which-broken-replicas. Here replication on test3 was broken on purpose:
# orchestrator-client -c which-broken-replicas -i test2:3307
test3:3307
12. Given an instance or a cluster alias, print all instances in the same cluster: which-cluster-instances
# orchestrator-client -c which-cluster-instances -i test
test1:3307
test2:3307
test3:3307
# orchestrator-client -c which-cluster-instances -i test1:3307
test1:3307
test2:3307
test3:3307
13. Given an instance, print the name of its cluster, which defaults to hostname:port: which-cluster
# orchestrator-client -c which-cluster -i test1:3307
test2:3307
# orchestrator-client -c which-cluster -i test2:3307
test2:3307
# orchestrator-client -c which-cluster -i test3:3307
test2:3307
14. Print the writable instance of the cluster of a given instance or cluster name, or of all clusters:
For a given instance or cluster: which-cluster-master
# orchestrator-client -c which-cluster-master -i test2:3307
test2:3307
# orchestrator-client -c which-cluster-master -i test
test2:3307
For all clusters, one master per cluster: all-clusters-masters
# orchestrator-client -c all-clusters-masters
test1:3307
15. Print all instances: all-instances
# orchestrator-client -c all-instances
test2:3307
test1:3307
test3:3307
16. Print the replicas in a cluster that can serve as pt-online-schema-change control replicas: which-cluster-osc-replicas
# orchestrator-client -c which-cluster-osc-replicas -i test
test1:3307
test3:3307
# orchestrator-client -c which-cluster-osc-replicas -i test2:3307
test1:3307
test3:3307
17. Print the healthy, replicating replicas in a cluster that can serve as pt-online-schema-change control replicas: which-cluster-osc-running-replicas
# orchestrator-client -c which-cluster-osc-running-replicas -i test
test1:3307
test3:3307
# orchestrator-client -c which-cluster-osc-running-replicas -i test1:3307
test1:3307
test3:3307
18. Print all downtimed instances: downtimed
# orchestrator-client -c downtimed
test1:3307
test3:3307
19. Print the data center where most masters are located: dominant-dc
# orchestrator-client -c dominant-dc
BJ
20. Submit the clusters' masters to the KV stores: submit-masters-to-kv-stores
# orchestrator-client -c submit-masters-to-kv-stores
mysql/master/test:test2:3307
mysql/master/test/hostname:test2
mysql/master/test/port:3307
mysql/master/test/ipv4:192.168.163.132
mysql/master/test/ipv6:
21. Relocate a replica beneath another instance: relocate
# orchestrator-client -c relocate -i test3:3307 -d test1:3307   # make test3:3307 a replica of test1:3307
test3:3307<test1:3307
Check the result:
# orchestrator-client -c topology -i test2:3307
test2:3307     [0s,ok,5.7.25-0ubuntu0.16.04.2-log,rw,ROW,>>,GTID]
+ test1:3307   [0s,ok,5.7.25-0ubuntu0.16.04.2-log,ro,ROW,>>,GTID]
  + test3:3307 [0s,ok,5.7.25-log,ro,ROW,>>,GTID]
22. Relocate all replicas of one instance beneath another instance: relocate-replicas
# orchestrator-client -c relocate-replicas -i test1:3307 -d test2:3307   # move all replicas of test1:3307 beneath test2:3307 and list the relocated replicas
test3:3307
23. Move a replica one level up the topology; in the web interface this corresponds to dragging in Classic mode: move-up
# orchestrator-client -c move-up -i test3:3307 -d test2:3307
test3:3307<test2:3307
The topology changes from
test2:3307 -> test1:3307 -> test3:3307
to
test2:3307 -> test1:3307
           -> test3:3307
24. Move a replica one level down the topology, beneath its sibling; in the web interface this corresponds to dragging in Classic mode: move-below
# orchestrator-client -c move-below -i test3:3307 -d test1:3307
test3:3307<test1:3307
The topology changes from
test2:3307 -> test1:3307
           -> test3:3307
to
test2:3307 -> test1:3307 -> test3:3307
25. Move all replicas of a given instance one level up the topology, using Classic mode: move-up-replicas
# orchestrator-client -c move-up-replicas -i test1:3307
test3:3307
The topology changes from
test2:3307 -> test1:3307 -> test3:3307
to
test2:3307 -> test1:3307
           -> test3:3307
26. Create master-master replication between the given instance and its current master: make-co-master
# orchestrator-client -c make-co-master -i test1:3307
test1:3307<test2:3307
27. Turn an instance into the master of its own master, essentially swapping the two: take-master
# orchestrator-client -c take-master -i test3:3307
test3:3307<test2:3307
The topology changes from
test2:3307 -> test1:3307 -> test3:3307
to
test2:3307 -> test3:3307 -> test1:3307
28. Move a replica via GTID: move-gtid
Running it through orchestrator-client fails:
# orchestrator-client -c move-gtid -i test3:3307 -d test1:3307
parse error: Invalid numeric literal at line 1, column 9
parse error: Invalid numeric literal at line 1, column 9
parse error: Invalid numeric literal at line 1, column 9
Running it through orchestrator works, but requires the --ignore-raft-setup flag:
# orchestrator -c move-gtid -i test3:3307 -d test2:3307 --ignore-raft-setup
test3:3307<test2:3307
29. Move all replicas of a given instance beneath another instance via GTID: move-replicas-gtid
Running it through orchestrator-client fails:
# orchestrator-client -c move-replicas-gtid -i test3:3307 -d test1:3307
jq: error (at <stdin>:1): Cannot index string with string "Key"
Running it through orchestrator works, but requires the --ignore-raft-setup flag:
# ./orchestrator -c move-replicas-gtid -i test2:3307 -d test1:3307 --ignore-raft-setup
test3:3307
30. Turn the siblings of a given instance into its replicas: take-siblings
# orchestrator-client -c take-siblings -i test3:3307
test3:3307<test1:3307
The topology changes from
test1:3307 -> test2:3307
           -> test3:3307
to
test1:3307 -> test3:3307 -> test2:3307
31. Add a tag to a given instance: tag
# orchestrator-client -c tag -i test1:3307 --tag 'name=AAA'
test1:3307
32. List the tags of a given instance: tags
# orchestrator-client -c tags -i test1:3307
name=AAA
33. Print the value of a tag on a given instance: tag-value
# orchestrator-client -c tag-value -i test1:3307 --tag "name"
AAA
34. Remove a tag from a given instance: untag
# orchestrator-client -c untag -i test1:3307 --tag "name=AAA"
test1:3307
35. List the instances that carry a given tag: tagged
# orchestrator-client -c tagged -t name
test3:3307
test1:3307
test2:3307
36. Mark a given instance as downtimed, with duration, owner, and reason: begin-downtime
# orchestrator-client -c begin-downtime -i test1:3307 -duration=10m -owner=zjy -reason 'test'
test1:3307
37. End the downtime of a given instance: end-downtime
# orchestrator-client -c end-downtime -i test1:3307
test1:3307
38. Request a maintenance lock on a given instance: begin-maintenance. Topology changes place a lock on the minimal set of affected instances so that two uncoordinated operations cannot run on the same instance.
# orchestrator-client -c begin-maintenance -i test1:3307 --reason "XXX"
test1:3307
The lock expires after 10 minutes by default; the MaintenanceExpireMinutes parameter controls this.
39. Remove the maintenance lock from a given instance: end-maintenance
# orchestrator-client -c end-maintenance -i test1:3307
test1:3307
40. Set a promotion rule so that a specific instance can be promoted during recovery: register-candidate, used together with promotion-rule
# orchestrator-client -c register-candidate -i test3:3307 --promotion-rule prefer
test3:3307
This raises the weight of test3:3307; in a failover it becomes the master.
41. Stop replication on a given instance:
Plain stop slave: stop-replica
# orchestrator-client -c stop-replica -i test2:3307
test2:3307
Apply all relay logs first, then stop slave: stop-replica-nice
# orchestrator-client -c stop-replica-nice -i test2:3307
test2:3307
42. Start replication on a given instance: start-replica
# orchestrator-client -c start-replica -i test2:3307
test2:3307
43. Restart replication on a given instance: restart-replica
# orchestrator-client -c restart-replica -i test2:3307
test2:3307
44. Reset replication on a given instance: reset-replica
# orchestrator-client -c reset-replica -i test2:3307
test2:3307
45. Detach a replica by changing the binlog position (non-GTID): detach-replica
# orchestrator-client -c detach-replica -i test2:3307
46. Reattach a replica: reattach-replica
# orchestrator-client -c reattach-replica -i test2:3307
47. Detach a replica by commenting out master_host, for example Master_Host: //test1: detach-replica-master-host
# orchestrator-client -c detach-replica-master-host -i test2:3307
test2:3307
48. Reattach a replica: reattach-replica-master-host
# orchestrator-client -c reattach-replica-master-host -i test2:3307
test2:3307
49. Skip a query on the SQL thread, for example after a primary key conflict; works with and without GTID: skip-query
# orchestrator-client -c skip-query -i test2:3307
test2:3307
50. Apply errant GTID transactions as empty transactions on the replica's master (the "fix" button in the web interface): gtid-errant-inject-empty
# orchestrator-client -c gtid-errant-inject-empty -i test2:3307
test2:3307
51. Remove errant GTID transactions by way of RESET MASTER: gtid-errant-reset-master
# orchestrator-client -c gtid-errant-reset-master -i test2:3307
test2:3307
52. Semi-sync settings:
orchestrator-client -c $variable -i test1:3307
enable-semi-sync-master     enable semi-sync on the master
disable-semi-sync-master    disable semi-sync on the master
enable-semi-sync-replica    enable semi-sync on the replica
disable-semi-sync-replica   disable semi-sync on the replica
53. Wrap SQL that needs a stop/start slave around it: restart-replica-statements
# orchestrator-client -c restart-replica-statements -i test3:3307 -query "change master to master_auto_position=1" | jq .[] -r
stop slave io_thread;
stop slave sql_thread;
change master to master_auto_position=1;
start slave sql_thread;
start slave io_thread;
# orchestrator-client -c restart-replica-statements -i test3:3307 -query "change master to master_auto_position=1" | jq .[] -r | mysql -urep -p -htest3 -P3307
Enter password:
54. Check, according to replication rules, whether an instance can replicate from another instance (GTID and non-GTID):
Non-GTID: can-replicate-from
# orchestrator-client -c can-replicate-from -i test3:3307 -d test1:3307
test1:3307
GTID: can-replicate-from-gtid
# orchestrator-client -c can-replicate-from-gtid -i test3:3307 -d test1:3307
test1:3307
55. Check whether a given instance is replicating: is-replicating
# Output means the instance is replicating
# orchestrator-client -c is-replicating -i test2:3307
test2:3307
# No output means it is not replicating
# orchestrator-client -c is-replicating -i test1:3307
56. Check whether both the IO and SQL threads of a given instance are stopped: is-replication-stopped
# orchestrator-client -c is-replication-stopped -i test2:3307
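57. Make a given instance read-only via SET GLOBAL read_only=1: set-read-only
# orchestrator-client -c set-read-only -i test2:3307
test2:3307
58. Make a given instance writable via SET GLOBAL read_only=0: set-writeable
# orchestrator-client -c set-writeable -i test2:3307
test2:3307
59. Flush (rotate) the binary logs of a given instance: flush-binary-logs
# orchestrator-client -c flush-binary-logs -i test1:3307
test1:3307
60. Run recovery manually, naming a dead instance: recover
# orchestrator-client -c recover -i test2:3307
test3:3307
In testing, this command forces recovery even for instances that are downtimed or in maintenance. Topology:
test1:3307 -> test2:3307 -> test3:3307 (downtimed)
When test2:3307 dies while test3:3307 is downtimed, no failover takes place. After running the command the topology becomes:
test1:3307 -> test2:3307
           -> test3:3307
61. Gracefully switch the master with a designated replica: graceful-master-takeover
# orchestrator-client -c graceful-master-takeover -a test1:3307 -d test2:3307
test2:3307
The topology changes from test1:3307 -> test2:3307 to test2:3307 -> test1:3307. The designated new master becomes writable and the new replica becomes read-only; start slave still has to be run manually.
Note that this needs configuration: the replication user and password are looked up in the metadata table.
"ReplicationCredentialsQuery":"SELECT repl_user, repl_pass from meta.cluster where anchor=1"
62. Force a recovery manually even if orch sees no problem: force-master-failover. After the failover the old master is left standalone and has to be added back to the cluster manually.
# orchestrator-client -c force-master-failover -i test1:3307
test3:3307
63. Forcibly discard the master and promote a designated instance: force-master-takeover. The old master (test1) is left standalone and the designated replica (test2) is promoted to master.
# orchestrator-client -c force-master-takeover -i test1:3307 -d test2:3307
test2:3307
64. Acknowledge cluster recoveries, equivalent to the Acknowledged button under Audit -> Recovery in the web interface: ack-cluster-recoveries/ack-all-recoveries
Acknowledge a given cluster: ack-cluster-recoveries
# orchestrator-client -c ack-cluster-recoveries -i test2:3307 -reason=''
test1:3307
Acknowledge all clusters: ack-all-recoveries
# orchestrator-client -c ack-all-recoveries -reason='OOOPPP'
eason=XYZ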
65. Check, disable, or enable global recoveries in orchestrator:
Check: check-global-recoveries
# orchestrator-client -c check-global-recoveries
enabled
Disable: disable-global-recoveries
# orchestrator-client -c disable-global-recoveries
disabled
Enable: enable-global-recoveries
# orchestrator-client -c enable-global-recoveries
enabled
66. Analyze the replication topologies for existing problems: replication-analysis
# orchestrator-client -c replication-analysis
test1:3307 (cluster test1:3307): ErrantGTIDStructureWarning
67. Raft checks: show the leader, check health, move leadership:
Show the leader node:
# orchestrator-client -c raft-leader
192.168.163.131:10008
Health check:
# orchestrator-client -c raft-health
healthy
Leader hostname:
# orchestrator-client -c raft-leader-hostname
test1
Request that a given host be elected leader:
# orchestrator-client -c raft-elect-leader -hostname test3
test3
68. Pseudo-GTID commands:
match              # Match a replica beneath another (destination) instance using Pseudo-GTID
match-up           # Transport the replica one level up the hierarchy, making it child of its grandparent, using Pseudo-GTID
match-up-replicas  # Matches replicas of the given instance one level up the topology, making them siblings of given instance, using Pseudo-GTID
last-pseudo-gtid   # Dump last injected Pseudo-GTID entry on a server
That concludes the walkthrough of using Orchestrator and its command line. The web API is documented under Orchestrator API; the command line and the API together make it easier to build automation.
Orchestrator is an open-source MySQL replication topology management tool written in Go. It supports rearranging MySQL replication topologies, automatic master failover, manual master switchover, and more. It provides a web interface that shows the topology and status of MySQL clusters and allows some instance settings to be changed, and it also offers a command line and an API. Compared with MHA, Orchestrator itself can be deployed on several nodes and uses the Raft distributed consensus protocol to keep itself highly available.