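There are plenty of tutorials on MHA theory online, so the theory is not covered here; I only recommend a blog link.
MHA theory: http://www.ywnds.com/?p=8094
The MHA packages have to be downloaded from Google, or paid for on CSDN.
A detailed walkthrough of setting up MHA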
#The four servers are allocated as follows
10.0.102.214 test3 MHA manager node
10.0.102.204 test2 master node
10.0.102.179 test1 slave node (serves as the standby master)
10.0.102.221 mgt01 slave node
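#We use a one-master, two-slave architecture with binlog-based replication, so that replication topology has to be set up first.
#Note that the slave acting as standby master must have the binary log enabled and the log_slave_updates parameter set.
#How MySQL binlog-based replication works: http://www.javashuo.com/article/p-hlrsiymt-k.html
#This walkthrough does not explain how to set up MySQL master-slave replication.
Step 1: Set up replication, i.e. one master with two slaves. [MHA officially does not support one master with a single slave, although Alibaba is rumored to have patched the source to support it; the official topology is used here.]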
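Before moving on, it can help to confirm that each slave is really replicating. The following is a minimal sketch, assuming the output of `SHOW SLAVE STATUS\G` is piped into it; the helper name `repl_ok` is made up for illustration and is not part of MHA.

```shell
#!/bin/bash
# repl_ok: reads `SHOW SLAVE STATUS\G` output on stdin and succeeds only
# when both the IO thread and the SQL thread report "Yes".
# (Hypothetical helper for illustration; not shipped with MHA.)
repl_ok() {
  local status
  status=$(cat)
  echo "$status" | grep -Eq '^[[:space:]]*Slave_IO_Running:[[:space:]]*Yes' &&
  echo "$status" | grep -Eq '^[[:space:]]*Slave_SQL_Running:[[:space:]]*Yes'
}

# Demo with canned output; on a real slave one would pipe in
#   mysql -uroot -p -e 'SHOW SLAVE STATUS\G'
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'
if echo "$sample" | repl_ok; then
  echo "replication threads are running"
fi
```

Running the same check on both slaves before installing MHA avoids chasing replication problems later during masterha_check_repl.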
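Note that the following settings must be added on the server acting as standby master:
log-bin= #enable the binary log
log_slave_updates #write the changes applied by the SQL thread to the binary log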
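A quick way to double-check these two settings is to grep the server's option file. This is only a sketch: the helper name `has_candidate_settings` is hypothetical, the binlog base name `mysql-bin` in the demo is an assumed value, and the demo runs against a throwaway file rather than the real option file.

```shell
#!/bin/bash
# has_candidate_settings: succeeds when the given my.cnf-style file enables
# both the binary log and log_slave_updates (dash or underscore spelling).
# (Hypothetical helper; pass whatever option file your server uses.)
has_candidate_settings() {
  local cnf=$1
  grep -Eq '^[[:space:]]*log[-_]bin([[:space:]]*=|[[:space:]]*$)' "$cnf" &&
  grep -Eq '^[[:space:]]*log[-_]slave[-_]updates([[:space:]]*=|[[:space:]]*$)' "$cnf"
}

# Demo against a temporary file; "mysql-bin" is an assumed base name.
tmp=$(mktemp)
printf '[mysqld]\nlog-bin=mysql-bin\nlog_slave_updates\n' > "$tmp"
has_candidate_settings "$tmp" && echo "candidate master settings present"
rm -f "$tmp"
```

The check only looks at the file, so the running server still needs a restart before the settings take effect.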
Step 2: Install MHA
MHA Node must be installed on every server in the MHA cluster:
[root@mgt01 ~]# yum install epel-release perl-DBD-MySQL perl-CPAN -y #install dependencies
[root@mgt01 src]# ls
mha4mysql-node-0.56.tar.gz
[root@mgt01 src]# tar zxvf mha4mysql-node-0.56.tar.gz -C ../ #extract
[root@mgt01 src]# cd ../
[root@mgt01 local]# cd mha4mysql-node-0.56/
[root@mgt01 mha4mysql-node-0.56]# ls
AUTHORS bin COPYING debian inc lib Makefile.PL MANIFEST META.yml README rpm t
[root@mgt01 mha4mysql-node-0.56]# perl Makefile.PL #generate the Makefile
[root@mgt01 mha4mysql-node-0.56]# make && make install #build and install
[root@mgt01 ~]# cd /usr/local/bin #after installation, the following files are created under /usr/local/bin
[root@mgt01 bin]# ls
apply_diff_relay_logs filter_mysqlbinlog purge_relay_logs save_binary_logs
[root@mgt01 bin]# ll
total 44
-r-xr-xr-x 1 root root 16367 Dec 8 10:29 apply_diff_relay_logs
-r-xr-xr-x 1 root root 4807 Dec 8 10:29 filter_mysqlbinlog
-r-xr-xr-x 1 root root 8261 Dec 8 10:29 purge_relay_logs
-r-xr-xr-x 1 root root 7525 Dec 8 10:29 save_binary_logs
Note that the step above has to be performed on every node in the MHA cluster!
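Since the install has to be repeated on every node, a small check for the four tools listed above can catch a node that was missed. A minimal sketch follows; the function name `missing_node_tools` is made up for illustration.

```shell
#!/bin/bash
# missing_node_tools: prints the name of every MHA Node tool that is not
# present as an executable file in the given directory.
# (Hypothetical helper; the tool names come from the listing above.)
missing_node_tools() {
  local dir=$1 tool
  for tool in apply_diff_relay_logs filter_mysqlbinlog purge_relay_logs save_binary_logs; do
    [ -x "$dir/$tool" ] || echo "$tool"
  done
}

missing=$(missing_node_tools /usr/local/bin)
if [ -z "$missing" ]; then
  echo "all MHA Node tools found"
else
  echo "missing:" $missing
fi
```

The script would be run once per node; any name it prints points at an incomplete `make install` on that node.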
Install MHA Manager on the management node of the MHA cluster.
#First install the packages MHA Manager depends on
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
#Install MHA Manager
tar zxvf mha4mysql-manager-0.56.tar.gz
cd mha4mysql-manager-0.56/
perl Makefile.PL
make && make install
cp -frp samples/scripts/* /usr/local/bin #copy these scripts into /usr/local/bin so that no PATH change is needed
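master_ip_failover: manages the VIP during automatic failover; optional. If keepalived is used, you can write your own script to manage the VIP, for example one that monitors mysql and stops keepalived when mysql fails, so that the VIP moves over automatically.
master_ip_online_change: manages the VIP during an online switchover; optional, and a simple shell script of your own works as well.
power_manager: shuts down the host after a failure; optional.
send_report: sends an alert after a failover; optional, and a simple shell script of your own works as well.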
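Step 3: Configure MHA
Configuring MHA mainly means writing the MHA configuration files and creating the directories they refer to.
The samples/ directory mentioned above also contains a conf directory with two configuration templates:
[root@test3 conf]# ls
app1.cnf masterha_default.cnf
[root@test3 conf]#
Copy the templates to /etc:
mkdir /etc/masterha -p #create the directory for the MHA configuration files under /etc [any name works, preferably one that identifies its contents]
cp * /etc/masterha/
First edit masterha_default.cnf: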
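[root@mha ~]# cat /etc/masterha_default.cnf
[server default]
# Monitoring user mha; it needs the required grants
user=mha
# MySQL password; this is the password of the monitoring user created earlier
password=123456
# Replication user name in the replication setup
repl_user=repl
# Password of the replication user
repl_password=123456
# SSH login user
ssh_user=root
# SSH login port (defaults to 22 when omitted)
ssh_port=22
# Interval between pings sent to the monitored master, 3 seconds by default; failover starts automatically after three attempts without a response
ping_interval=3
# Directory where the MySQL master keeps its binlog, so that MHA can find the master's binary logs
master_binlog_dir= /data/mysql/
# Directory where the MySQL master's binlog is saved during a failover; it is created on the MySQL master (defaults to /var/tmp when omitted)
remote_workdir=/data/log/masterha
# If monitoring from MHA to mysql01 runs into a problem, MHA Manager tries to reach mysql01 by way of mysql02 and mysql03
secondary_check_script= masterha_secondary_check -s test1 -s mgt01 --user=root --port=22 --master_host=test2 --master_port=3306
# Switch script used during automatic failover (the sample script is flawed and needs to be adapted)
#master_ip_failover_script=/usr/local/bin/master_ip_failover
# Switch script used during a manual switchover (the sample script is flawed and needs to be adapted)
#master_ip_online_change_script=/usr/local/bin/master_ip_online_change
# Script that sends an alert after a failover (you can write your own)
#report_script=/usr/local/bin/send_report
# Script that shuts down the failed host (its main purpose is to power off the host to prevent split-brain; not used here)
#shutdown_script=""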
The listing above explains every parameter in masterha_default.cnf; below is my own configuration:
[server default]
user=root
password=123456
ssh_user=root
ssh_port=22
ping_interval=3
repl_user=repl
repl_password=123456
master_binlog_dir= /data/mysql/
remote_workdir=/data/log/masterha
secondary_check_script= masterha_secondary_check -s test1 -s mgt01 --user=root --port=22 --master_host=test2 --master_port=3306
master_ip_failover_script= /usr/local/bin/master_ip_failover
# shutdown_script= /script/masterha/power_manager
report_script= /usr/local/bin/send_report
# master_ip_online_change_script= /script/masterha/master_ip_online_change
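Then edit app1.cnf.
It applies to a single application only, but parameters in app1.cnf take precedence over those in masterha_default.cnf, and app1.cnf usually repeats all the parameters of masterha_default.cnf. MHA can monitor several replication clusters, each with its own configuration file distinguished by name; since there is only one cluster here, app1.cnf is the only such file.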
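The precedence rule can be pictured as a lookup that consults the application file first and only then the global file. The sketch below merely mimics that rule; it is not MHA's own parser, it ignores section headers, the function name `get_param` is hypothetical, and the value `ping_interval=1` in the demo is an invented example.

```shell
#!/bin/bash
# get_param KEY FILE...
# Prints the value of KEY from the first listed file that sets it.
# Listing app1.cnf before masterha_default.cnf mimics the precedence rule.
# (Illustration only; not how MHA itself reads its configuration.)
get_param() {
  local key=$1 file value
  shift
  for file in "$@"; do
    [ -f "$file" ] || continue
    value=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1)
    if [ -n "$value" ]; then
      echo "$value"
      return 0
    fi
  done
  return 1
}

# Demo with two throwaway files standing in for the real ones.
dir=$(mktemp -d)
printf '[server default]\nping_interval=3\nssh_port=22\n' > "$dir/masterha_default.cnf"
printf '[server default]\nping_interval=1\n' > "$dir/app1.cnf"
get_param ping_interval "$dir/app1.cnf" "$dir/masterha_default.cnf"   # value from app1.cnf
get_param ssh_port      "$dir/app1.cnf" "$dir/masterha_default.cnf"   # falls back to the global file
rm -rf "$dir"
```

Repeating every parameter in app1.cnf, as the author does, makes the result independent of whether the global file is found at all.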
[root@test3 masterha]# cat app1.cnf
[server default]
manager_log=/data/log/app1/manager.log
manager_workdir=/data/log/app1
master_binlog_dir=/data/mysql
password=123456
ping_interval=3
remote_workdir=/data/log/masterha
repl_password=123456
repl_user=repl
report_script=/usr/local/bin/send_report
secondary_check_script=masterha_secondary_check -s test1 -s mgt01 --user=root --port=22 --master_host=test2 --master_port=3306
ssh_port=22
ssh_user=root
user=root
[server1]
hostname=10.0.102.204
port=3306
candidate_master=1
[server2]
candidate_master=1
hostname=10.0.102.179
port=3306
[server3]
hostname=10.0.102.221
no_master=1
port=3306
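The parameters in this file are mostly self-explanatory. Note that the directories named in the configuration have to be created separately:
mkdir -p /data/log/masterha
mkdir -p /data/log/app1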
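To avoid forgetting a directory, the paths can be read back out of the configuration file. A minimal sketch, assuming the `key=value` layout shown above; the helper name `config_dirs` is hypothetical, and the demo prints the mkdir commands instead of running them.

```shell
#!/bin/bash
# config_dirs: prints the directories named by manager_workdir and
# remote_workdir in an MHA configuration file, one per line.
# (Hypothetical helper; it does not look at any other parameter.)
config_dirs() {
  sed -nE 's/^[[:space:]]*(manager_workdir|remote_workdir)[[:space:]]*=[[:space:]]*//p' "$1" | sort -u
}

# Demo with a throwaway file that uses the two values from app1.cnf.
tmp=$(mktemp)
printf 'manager_workdir=/data/log/app1\nremote_workdir=/data/log/masterha\n' > "$tmp"
config_dirs "$tmp" | while read -r d; do
  echo "mkdir -p $d"     # printed rather than executed, so it can be reviewed first
done
rm -f "$tmp"
```

Keep in mind that remote_workdir is used on the MySQL nodes, while manager_workdir belongs on the manager node.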
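When candidate_master is set to 1, the server is a candidate master: after a failover this slave is promoted to master even if it is not the slave with the most recent events in the cluster. By default MHA does not choose a slave as the new master if it is more than 100 MB of relay logs behind the master, because recovering that slave would take a long time. With check_repl_delay=0, MHA ignores replication delay when it picks the new master. This parameter is very useful for a host with candidate_master=1, because it ensures that the candidate becomes the new master during the switch.
Likewise, a slave set up as candidate master must have the binary log and log_slave_updates enabled!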
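Set how relay logs are purged (on every slave node)
Add relay_log_purge=0 to the configuration file; a restart is required for it to take effect.
Note: during a failover, recovering the slaves depends on information in the relay logs, so automatic relay log purging has to be set to OFF and the relay logs purged manually. By default, relay logs on a slave are deleted automatically once the SQL thread has executed them. In an MHA environment, however, those relay logs may be needed to recover other slaves, so automatic deletion has to be disabled. Purging relay logs periodically has to take replication delay into account: on an ext3 file system, deleting a large file takes a while and can cause serious replication delay. To avoid that, a hard link is created for the relay log first, because on Linux deleting a large file through a hard link is fast. (In MySQL, hard links are commonly used in the same way when dropping large tables.)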
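The hard-link trick itself can be tried on a throwaway file. The sketch below only illustrates the file system behavior; it is not what purge_relay_logs runs internally, and the helper name `unlink_via_hardlink` and the file names are made up.

```shell
#!/bin/bash
# unlink_via_hardlink FILE: gives FILE a second name (a hard link), removes
# the original name, and prints the path of the remaining link. Removing the
# original name only drops a directory entry; the data stays on disk until
# the last link is removed.
unlink_via_hardlink() {
  local file=$1 link="$1.hardlink"
  ln "$file" "$link" || return 1
  rm "$file"
  echo "$link"
}

# Demo on a temporary file standing in for a relay log.
work=$(mktemp -d)
echo "relay log payload" > "$work/relay-bin.000001"
link=$(unlink_via_hardlink "$work/relay-bin.000001")
cat "$link"        # the data is still readable through the remaining link
rm "$link"         # removing the last link is what actually frees the space
rmdir "$work"
```

Because a hard link cannot cross file systems, the link has to live on the same partition as the relay logs.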
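MHA Node includes the purge_relay_logs tool. It creates hard links for the relay logs, executes SET GLOBAL relay_log_purge=1, waits a few seconds so that the SQL thread switches to a new relay log, and then executes SET GLOBAL relay_log_purge=0.
The parameters of the purge_relay_logs script are:
--user mysql #user name;
--password mysql #password;
--port #port number;
--workdir #where the hard links of the relay logs are created, /var/tmp by default. Hard links cannot be created across partitions, so a location on the same partition has to be given explicitly; after the script runs successfully, the hard-linked relay log files are deleted;
--disable_relay_log_purge #by default the script purges nothing and exits if relay_log_purge=1; with this option, it sets relay_log_purge to 0 when it finds relay_log_purge=1, purges the relay logs, and leaves the parameter set to OFF at the end;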
Set up a script that purges relay logs periodically
[root@mgt01 ~]# cat purge_relay.sh
#!/bin/bash
user=root
passwd=123456
port=3306
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'
# create the log directory on first run
if [ ! -d "$log_dir" ]; then
    mkdir -p "$log_dir"
fi
# purge the relay logs and append the tool's output to the log file
"$purge" --user="$user" --password="$passwd" --disable_relay_log_purge --port="$port" --workdir="$work_dir" >> "$log_dir/purge_relay_logs.log" 2>&1
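Add the script to cron (the minute field is 0 so that it runs once at 04:00 rather than every minute of that hour):
[root@mgt01 log]# crontab -l
0 4 * * * sh /root/purge_relay.sh
Deleting relay logs with the purge_relay_logs script does not block the SQL thread.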
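Step 4: Set up passwordless SSH authentication
The MHA manager node must be able to reach every other node in the cluster without a password.
The nodes of the MySQL cluster must be able to reach each other without a password.
The procedure for setting up passwordless SSH is not written out here.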
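For readers who want a starting point anyway, one common approach is a key pair plus ssh-copy-id. The sketch below only prints the commands so they can be reviewed; it assumes the root user and port 22 from the configuration above, and the helper name `print_copy_cmds` is hypothetical.

```shell
#!/bin/bash
# print_copy_cmds USER PORT HOST...
# Prints one ssh-copy-id command per host; nothing is executed here.
# (Hypothetical helper; user and port are assumptions taken from app1.cnf.)
print_copy_cmds() {
  local user=$1 port=$2 host
  shift 2
  for host in "$@"; do
    echo "ssh-copy-id -p $port $user@$host"
  done
}

# Addresses taken from the server list at the top of this article.
print_copy_cmds root 22 10.0.102.214 10.0.102.204 10.0.102.179 10.0.102.221
```

On each node, a key would first be generated with `ssh-keygen -t rsa` if none exists, and the printed commands would then be run from that node.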
Use MHA to check that SSH works:
[root@test3 ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
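If it succeeds, move on to the next step and check replication.
[Some blogs suggest temporarily commenting out the master_ip_failover_script= /usr/local/bin/master_ip_failover option in the configuration file because the check would otherwise fail. In my test I did not comment it out and the check still passed.]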
[root@test3 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Sat Dec 8 17:03:38 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Dec 8 17:03:38 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sat Dec 8 17:03:38 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sat Dec 8 17:03:38 2018 - [info] MHA::MasterMonitor version 0.56.
Sat Dec 8 17:03:38 2018 - [info] GTID failover mode = 0
Sat Dec 8 17:03:38 2018 - [info] Dead Servers:
Sat Dec 8 17:03:38 2018 - [info] Alive Servers:
Sat Dec 8 17:03:38 2018 - [info] 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:03:38 2018 - [info] 10.0.102.179(10.0.102.179:3306)
Sat Dec 8 17:03:38 2018 - [info] 10.0.102.221(10.0.102.221:3306)
Sat Dec 8 17:03:38 2018 - [info] Alive Slaves:
Sat Dec 8 17:03:38 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:03:38 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:03:38 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:03:38 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:03:38 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:03:38 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:03:38 2018 - [info] Current Alive Master: 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:03:38 2018 - [info] Checking slave configurations..
Sat Dec 8 17:03:38 2018 - [info] read_only=1 is not set on slave 10.0.102.179(10.0.102.179:3306).
Sat Dec 8 17:03:38 2018 - [info] read_only=1 is not set on slave 10.0.102.221(10.0.102.221:3306).
Sat Dec 8 17:03:38 2018 - [warning] log-bin is not set on slave 10.0.102.221(10.0.102.221:3306). This host cannot be a master.
Sat Dec 8 17:03:38 2018 - [info] Checking replication filtering settings..
Sat Dec 8 17:03:38 2018 - [info] binlog_do_db= , binlog_ignore_db=
Sat Dec 8 17:03:38 2018 - [info] Replication filtering check ok.
Sat Dec 8 17:03:38 2018 - [info] GTID (with auto-pos) is not supported
Sat Dec 8 17:03:38 2018 - [info] Starting SSH connection tests..
Sat Dec 8 17:03:40 2018 - [info] All SSH connection tests passed successfully.
Sat Dec 8 17:03:40 2018 - [info] Checking MHA Node version..
Sat Dec 8 17:03:40 2018 - [info] Version check ok.
Sat Dec 8 17:03:40 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sat Dec 8 17:03:40 2018 - [info] HealthCheck: SSH to 10.0.102.204 is reachable.
Sat Dec 8 17:03:41 2018 - [info] Master MHA Node version is 0.56.
Sat Dec 8 17:03:41 2018 - [info] Checking recovery script configurations on 10.0.102.204(10.0.102.204:3306)..
Sat Dec 8 17:03:41 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql --output_file=/data/log/masterha/save_binary_logs_test --manager_version=0.56 --start_file=test2-bin.000007
Sat Dec 8 17:03:41 2018 - [info] Connecting to root@10.0.102.204(10.0.102.204:22)..
Creating /data/log/masterha if not exists.. ok.
Checking output directory is accessible or not.. ok.
Binlog found at /data/mysql, up to test2-bin.000007
Sat Dec 8 17:03:41 2018 - [info] Binlog setting check done.
Sat Dec 8 17:03:41 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sat Dec 8 17:03:41 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.102.179 --slave_ip=10.0.102.179 --slave_port=3306 --workdir=/data/log/masterha --target_version=5.7.22-log --manager_version=0.56 --relay_log_info=/data/mysql/relay-log.info --relay_dir=/data/mysql/ --slave_pass=xxx
Sat Dec 8 17:03:41 2018 - [info] Connecting to root@10.0.102.179(10.0.102.179:22)..
Checking slave recovery environment settings..
Opening /data/mysql/relay-log.info ... ok.
Relay log found at /data/mysql, up to test1-relay-bin.000002
Temporary relay log file is /data/mysql/test1-relay-bin.000002
Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sat Dec 8 17:03:41 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.102.221 --slave_ip=10.0.102.221 --slave_port=3306 --workdir=/data/log/masterha --target_version=5.7.22 --manager_version=0.56 --relay_log_info=/data/mysql/relay-log.info --relay_dir=/data/mysql/ --slave_pass=xxx
Sat Dec 8 17:03:41 2018 - [info] Connecting to root@10.0.102.221(10.0.102.221:22)..
Checking slave recovery environment settings..
Opening /data/mysql/relay-log.info ... ok.
Relay log found at /data/mysql, up to mgt01-relay-bin.000002
Temporary relay log file is /data/mysql/mgt01-relay-bin.000002
Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Sat Dec 8 17:03:41 2018 - [info] Slaves settings check done.
Sat Dec 8 17:03:41 2018 - [info]
10.0.102.204(10.0.102.204:3306) (current master)
 +--10.0.102.179(10.0.102.179:3306)
 +--10.0.102.221(10.0.102.221:3306)
Sat Dec 8 17:03:41 2018 - [info] Checking replication health on 10.0.102.179..
Sat Dec 8 17:03:41 2018 - [info] ok.
Sat Dec 8 17:03:41 2018 - [info] Checking replication health on 10.0.102.221..
Sat Dec 8 17:03:41 2018 - [info] ok.
Sat Dec 8 17:03:41 2018 - [warning] master_ip_failover_script is not defined.
Sat Dec 8 17:03:41 2018 - [warning] shutdown_script is not defined.
Sat Dec 8 17:03:41 2018 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
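I once ran into a case where the replication check always listed one server under Dead Servers even though the cluster itself was healthy in every respect; the cause turned out to be broken local name resolution. [Check the /etc/hosts file and the known_hosts file under the ssh directory; freshly built servers usually do not have this problem.]
Check the status of MHA Manager
[root@test3 masterha]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@test3 masterha]#
Start MHA Manager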
[root@test3 masterha]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover &
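#Parameter notes
remove_dead_master_conf: with this option, MHA Manager automatically removes the dead master's entry from the configuration file once a failover has finished. Without it, the dead master's entry stays in the file, and restarting MHA Manager after the failover fails with an error (there is a dead slave previous dead master).
ignore_last_failover: by default MHA does not start a new failover if the previous one finished too recently (within 8 hours by default) or failed, which avoids flapping between masters; with this option MHA ignores the state of the previous failover and fails over anyway. A related but separate option is --ignore_fail_on_start: by default the master monitor process stops when one or more slaves are down, even if ignore_fail is set for them; with --ignore_fail_on_start, a downed slave marked with ignore_fail no longer stops the master monitor.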
Check the status after starting:
[root@test3 masterha]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:18866) is running(0:PING_OK), master:10.0.102.204
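For monitoring, the status line can be matched in a script. A minimal sketch that relies only on the two output lines shown in this article; the helper name `manager_running` is hypothetical.

```shell
#!/bin/bash
# manager_running: succeeds when a masterha_check_status output line
# reports a running manager with a healthy ping (0:PING_OK).
# (Hypothetical helper; it matches the text shown in this article.)
manager_running() {
  case $1 in
    *"is running(0:PING_OK)"*) return 0 ;;
    *) return 1 ;;
  esac
}

# The two status lines from this article, used as canned input.
for line in "app1 is stopped(2:NOT_RUNNING)." \
            "app1 (pid:18866) is running(0:PING_OK), master:10.0.102.204"; do
  if manager_running "$line"; then echo "UP:   $line"; else echo "DOWN: $line"; fi
done
```

On the manager node the live output would be passed in, for example `manager_running "$(masterha_check_status --conf=/etc/masterha/app1.cnf)"`.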
If it starts without errors, the MHA cluster has been set up successfully!
MHA Manager can be stopped with the following command:
masterha_stop --conf=/etc/masterha/app1.cnf
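Stop the master of the MySQL replication cluster and check whether MHA switches to a slave automatically. It is best to write some data before testing the failover; here I used tpcc to load some data.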
./tpcc_load -h 10.0.102.204 -P 3306 -d tpcc_test -u root -p 123456 -w 3
Using tpcc for testing: http://www.javashuo.com/article/p-bzgirdva-k.html
Stop the current master server!
[root@test2 ~]# service mysqld stop
Shutting down MySQL............ SUCCESS!
Then look at the MHA manager log
Sat Dec 8 17:21:50 2018 - [info] Executing secondary network check script: masterha_secondary_check -s test1 -s mgt01 --user=root --port=22 --master_host=test2 --master_port=3306 --user=root --master_host=10.0.102.204 --master_ip=10.0.102.204 --master_port=3306 --master_user=root --master_password=123456 --ping_type=SELECT
Sat Dec 8 17:21:50 2018 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql --output_file=/data/log/masterha/save_binary_logs_test --manager_version=0.56 --binlog_prefix=test2-bin
Monitoring server test1 is reachable, Master is not reachable from test1. OK.
Sat Dec 8 17:21:50 2018 - [info] HealthCheck: SSH to 10.0.102.204 is reachable.
Monitoring server mgt01 is reachable, Master is not reachable from mgt01. OK.
Sat Dec 8 17:21:50 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Dec 8 17:21:53 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sat Dec 8 17:21:53 2018 - [warning] Connection failed 2 time(s)..
Sat Dec 8 17:21:56 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sat Dec 8 17:21:56 2018 - [warning] Connection failed 3 time(s)..
Sat Dec 8 17:21:59 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sat Dec 8 17:21:59 2018 - [warning] Connection failed 4 time(s)..
Sat Dec 8 17:21:59 2018 - [warning] Master is not reachable from health checker!
Sat Dec 8 17:21:59 2018 - [warning] Master 10.0.102.204(10.0.102.204:3306) is not reachable!
Sat Dec 8 17:21:59 2018 - [warning] SSH is reachable.
Sat Dec 8 17:21:59 2018 - [info] Connecting to a master server failed.
Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Sat Dec 8 17:21:59 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Dec 8 17:21:59 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sat Dec 8 17:21:59 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sat Dec 8 17:21:59 2018 - [info] GTID failover mode = 0
Sat Dec 8 17:21:59 2018 - [info] Dead Servers:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Alive Servers:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306)
Sat Dec 8 17:21:59 2018 - [info] Alive Slaves:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:21:59 2018 - [info] Checking slave configurations..
Sat Dec 8 17:21:59 2018 - [info] read_only=1 is not set on slave 10.0.102.179(10.0.102.179:3306).
Sat Dec 8 17:21:59 2018 - [info] read_only=1 is not set on slave 10.0.102.221(10.0.102.221:3306).
Sat Dec 8 17:21:59 2018 - [warning] log-bin is not set on slave 10.0.102.221(10.0.102.221:3306). This host cannot be a master.
Sat Dec 8 17:21:59 2018 - [info] Checking replication filtering settings..
Sat Dec 8 17:21:59 2018 - [info] Replication filtering check ok.
Sat Dec 8 17:21:59 2018 - [info] Master is down!
Sat Dec 8 17:21:59 2018 - [info] Terminating monitoring script.
Sat Dec 8 17:21:59 2018 - [info] Got exit code 20 (Master dead).
Sat Dec 8 17:21:59 2018 - [info] MHA::MasterFailover version 0.56.
Sat Dec 8 17:21:59 2018 - [info] Starting master failover.
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] * Phase 1: Configuration Check Phase..
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] GTID failover mode = 0
Sat Dec 8 17:21:59 2018 - [info] Dead Servers:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Checking master reachability via MySQL(double check)...
Sat Dec 8 17:21:59 2018 - [info] ok.
Sat Dec 8 17:21:59 2018 - [info] Alive Servers:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306)
Sat Dec 8 17:21:59 2018 - [info] Alive Slaves:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:21:59 2018 - [info] Starting Non-GTID based failover.
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Dec 8 17:21:59 2018 - [warning] master_ip_failover_script is not set. Skipping invalidating dead master IP address.
Sat Dec 8 17:21:59 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Dec 8 17:21:59 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] * Phase 3: Master Recovery Phase..
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] The latest binary log file/position on all slaves is test2-bin.000007:154
Sat Dec 8 17:21:59 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:21:59 2018 - [info] The oldest binary log file/position on all slaves is test2-bin.000007:154
Sat Dec 8 17:21:59 2018 - [info] Oldest slaves:
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:21:59 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:21:59 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:21:59 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Sat Dec 8 17:21:59 2018 - [info]
Sat Dec 8 17:21:59 2018 - [info] Fetching dead master's binary logs..
Sat Dec 8 17:21:59 2018 - [info] Executing command on the dead master 10.0.102.204(10.0.102.204:3306): save_binary_logs --command=save --start_file=test2-bin.000007 --start_pos=154 --binlog_dir=/data/mysql --output_file=/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /data/log/masterha if not exists.. ok.
Concat binary/relay logs from test2-bin.000007 pos 154 to test2-bin.000007 EOF into /data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql/test2-bin.000007 position 154 to tail(177).. ok.
Binlog Checksum enabled
Concat succeeded.
Sat Dec 8 17:22:00 2018 - [info] scp from root@10.0.102.204:/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog to local:/data/log/app1/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog succeeded.
Sat Dec 8 17:22:00 2018 - [info] HealthCheck: SSH to 10.0.102.179 is reachable.
Sat Dec 8 17:22:00 2018 - [info] HealthCheck: SSH to 10.0.102.221 is reachable.
Sat Dec 8 17:22:00 2018 - [info]
Sat Dec 8 17:22:00 2018 - [info] * Phase 3.3: Determining New Master Phase..
Sat Dec 8 17:22:00 2018 - [info]
Sat Dec 8 17:22:00 2018 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Sat Dec 8 17:22:00 2018 - [info] All slaves received relay logs to the same position. No need to resync each other.
Sat Dec 8 17:22:00 2018 - [info] Searching new master from slaves..
Sat Dec 8 17:22:00 2018 - [info] Candidate masters from the configuration file:
Sat Dec 8 17:22:00 2018 - [info] 10.0.102.179(10.0.102.179:3306) Version=5.7.22-log (oldest major version between slaves) log-bin:enabled
Sat Dec 8 17:22:00 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:22:00 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Dec 8 17:22:00 2018 - [info] Non-candidate masters:
Sat Dec 8 17:22:00 2018 - [info] 10.0.102.221(10.0.102.221:3306) Version=5.7.22 (oldest major version between slaves) log-bin:disabled
Sat Dec 8 17:22:00 2018 - [info] Replicating from 10.0.102.204(10.0.102.204:3306)
Sat Dec 8 17:22:00 2018 - [info] Not candidate for the new Master (no_master is set)
Sat Dec 8 17:22:00 2018 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Dec 8 17:22:00 2018 - [info] New master is 10.0.102.179(10.0.102.179:3306)
Sat Dec 8 17:22:00 2018 - [info] Starting master failover..
Sat Dec 8 17:22:00 2018 - [info]
From:
10.0.102.204(10.0.102.204:3306) (current master)
 +--10.0.102.179(10.0.102.179:3306)
 +--10.0.102.221(10.0.102.221:3306)
To:
10.0.102.179(10.0.102.179:3306) (new master)
 +--10.0.102.221(10.0.102.221:3306)
Sat Dec 8 17:22:00 2018 - [info]
Sat Dec 8 17:22:00 2018 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Sat Dec 8 17:22:00 2018 - [info]
Sat Dec 8 17:22:00 2018 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Sat Dec 8 17:22:00 2018 - [info] Sending binlog..
Sat Dec 8 17:22:01 2018 - [info] scp from local:/data/log/app1/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog to root@10.0.102.179:/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog succeeded.
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] * Phase 3.4: Master Log Apply Phase..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Sat Dec 8 17:22:01 2018 - [info] Starting recovery on 10.0.102.179(10.0.102.179:3306)..
Sat Dec 8 17:22:01 2018 - [info] Generating diffs succeeded.
Sat Dec 8 17:22:01 2018 - [info] Waiting until all relay logs are applied.
Sat Dec 8 17:22:01 2018 - [info] done.
Sat Dec 8 17:22:01 2018 - [info] Getting slave status..
Sat Dec 8 17:22:01 2018 - [info] This slave(10.0.102.179)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(test2-bin.000007:154). No need to recover from Exec_Master_Log_Pos.
Sat Dec 8 17:22:01 2018 - [info] Connecting to the target slave host 10.0.102.179, running recover script..
Sat Dec 8 17:22:01 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=10.0.102.179 --slave_ip=10.0.102.179 --slave_port=3306 --apply_files=/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog --workdir=/data/log/masterha --target_version=5.7.22-log --timestamp=20181208172159 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Sat Dec 8 17:22:01 2018 - [info]
MySQL client version is 5.7.22. Using --binary-mode.
Applying differential binary/relay log files /data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog on 10.0.102.179:3306. This may take long time...
Applying log files succeeded.
Sat Dec 8 17:22:01 2018 - [info] All relay logs were successfully applied.
Sat Dec 8 17:22:01 2018 - [info] Getting new master's binlog name and position..
Sat Dec 8 17:22:01 2018 - [info] test1-bin.000001:154
Sat Dec 8 17:22:01 2018 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.102.179', MASTER_PORT=3306, MASTER_LOG_FILE='test1-bin.000001', MASTER_LOG_POS=154, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Dec 8 17:22:01 2018 - [warning] master_ip_failover_script is not set. Skipping taking over new master IP address.
Sat Dec 8 17:22:01 2018 - [info] ** Finished master recovery successfully.
Sat Dec 8 17:22:01 2018 - [info] * Phase 3: Master Recovery Phase completed.
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] * Phase 4: Slaves Recovery Phase..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] -- Slave diff file generation on host 10.0.102.221(10.0.102.221:3306) started, pid: 19729. Check tmp log /data/log/app1/10.0.102.221_3306_20181208172159.log if it takes time..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] Log messages from 10.0.102.221 ...
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Sat Dec 8 17:22:01 2018 - [info] End of log messages from 10.0.102.221.
Sat Dec 8 17:22:01 2018 - [info] -- 10.0.102.221(10.0.102.221:3306) has the latest relay log events.
Sat Dec 8 17:22:01 2018 - [info] Generating relay diff files from the latest slave succeeded.
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] -- Slave recovery on host 10.0.102.221(10.0.102.221:3306) started, pid: 19731. Check tmp log /data/log/app1/10.0.102.221_3306_20181208172159.log if it takes time..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] Log messages from 10.0.102.221 ...
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] Sending binlog..
Sat Dec 8 17:22:01 2018 - [info] scp from local:/data/log/app1/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog to root@10.0.102.221:/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog succeeded.
Sat Dec 8 17:22:01 2018 - [info] Starting recovery on 10.0.102.221(10.0.102.221:3306)..
Sat Dec 8 17:22:01 2018 - [info] Generating diffs succeeded.
Sat Dec 8 17:22:01 2018 - [info] Waiting until all relay logs are applied.
Sat Dec 8 17:22:01 2018 - [info] done.
Sat Dec 8 17:22:01 2018 - [info] Getting slave status..
Sat Dec 8 17:22:01 2018 - [info] This slave(10.0.102.221)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(test2-bin.000007:154). No need to recover from Exec_Master_Log_Pos.
Sat Dec 8 17:22:01 2018 - [info] Connecting to the target slave host 10.0.102.221, running recover script..
Sat Dec 8 17:22:01 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=10.0.102.221 --slave_ip=10.0.102.221 --slave_port=3306 --apply_files=/data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog --workdir=/data/log/masterha --target_version=5.7.22 --timestamp=20181208172159 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Sat Dec 8 17:22:01 2018 - [info]
MySQL client version is 5.7.22. Using --binary-mode.
Applying differential binary/relay log files /data/log/masterha/saved_master_binlog_from_10.0.102.204_3306_20181208172159.binlog on 10.0.102.221:3306. This may take long time...
Applying log files succeeded.
Sat Dec 8 17:22:01 2018 - [info] All relay logs were successfully applied.
Sat Dec 8 17:22:01 2018 - [info] Resetting slave 10.0.102.221(10.0.102.221:3306) and starting replication from the new master 10.0.102.179(10.0.102.179:3306)..
Sat Dec 8 17:22:01 2018 - [info] Executed CHANGE MASTER.
Sat Dec 8 17:22:01 2018 - [info] Slave started.
Sat Dec 8 17:22:01 2018 - [info] End of log messages from 10.0.102.221.
Sat Dec 8 17:22:01 2018 - [info] -- Slave recovery on host 10.0.102.221(10.0.102.221:3306) succeeded.
Sat Dec 8 17:22:01 2018 - [info] All new slave servers recovered successfully.
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] * Phase 5: New master cleanup phase..
Sat Dec 8 17:22:01 2018 - [info]
Sat Dec 8 17:22:01 2018 - [info] Resetting slave info on the new master..
Sat Dec 8 17:22:02 2018 - [info] 10.0.102.179: Resetting slave info succeeded.
Sat Dec 8 17:22:02 2018 - [info] Master failover to 10.0.102.179(10.0.102.179:3306) completed successfully.
Sat Dec 8 17:22:02 2018 - [info] Deleted server1 entry from /etc/masterha/app1.cnf .
Sat Dec 8 17:22:02 2018 - [info]
----- Failover Report -----
app1: MySQL Master failover 10.0.102.204(10.0.102.204:3306) to 10.0.102.179(10.0.102.179:3306) succeeded
Master 10.0.102.204(10.0.102.204:3306) is down!
Check MHA Manager logs at test3:/data/log/app1/manager.log for details.
Started automated(non-interactive) failover.
The latest slave 10.0.102.179(10.0.102.179:3306) has all relay logs for recovery.
Selected 10.0.102.179(10.0.102.179:3306) as a new master.
10.0.102.179(10.0.102.179:3306): OK: Applying all logs succeeded.
10.0.102.221(10.0.102.221:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
10.0.102.221(10.0.102.221:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.0.102.179(10.0.102.179:3306)
10.0.102.179(10.0.102.179:3306): Resetting slave info succeeded.
Master failover to 10.0.102.179(10.0.102.179:3306) completed successfully.
Sat Dec 8 17:22:02 2018 - [info] Sending mail..
Unknown option: conf
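The MHA log shows that the whole failover has completed. Simplified, the core logic of each phase is:
Phase 1: configuration check
Phase 2: shut down services on the dead master
Phase 3: master recovery
Phase 3.1: find the slave with the least lag behind the master
Phase 3.2: generate the binlog difference between the master and the least-lagged slave and save it on the manager node
Phase 3.3: determine the new master; if the new master is not the most up-to-date slave, the relay log difference between them has to be generated
Phase 3.4: the new master applies the differential relay log, then the master binlog position is obtained and the differential binlog is applied
Phase 4: slave recovery
Phase 4.1: generate, in parallel, the relay log differences between the least-lagged slave and the other slave(s)
Phase 4.2: apply the differential relay logs on the slaves in parallel, then CHANGE MASTER to the new master
Phase 5: the new master clears its slave information, and the dead master's entry is removed from the MHA configuration file to prevent mistakes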
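The phase outline above can also be pulled straight out of a manager log. A minimal sketch, assuming the log format shown in this article; the helper name `failover_phases` is hypothetical, and the demo runs on three lines copied from the log above.

```shell
#!/bin/bash
# failover_phases: prints the phase banner lines ("* Phase N: ...") from an
# MHA manager log, without the timestamp prefix and without the
# "... completed." lines.
# (Hypothetical helper; it relies on the log format shown in this article.)
failover_phases() {
  grep -E '\* Phase [0-9.]+: ' "$1" | grep -v 'completed' |
    sed -E 's/^.*\* (Phase [0-9.]+: .*)$/\1/'
}

# Demo on a few lines copied from the log in this article.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Sat Dec 8 17:21:59 2018 - [info] * Phase 1: Configuration Check Phase..
Sat Dec 8 17:21:59 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Dec 8 17:21:59 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
EOF
failover_phases "$tmp"
rm -f "$tmp"
```

Against the real log it would be called as `failover_phases /data/log/app1/manager.log`, which gives a quick view of how far a failover got.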