MHA(Master High Availability)目前在MySQL高可用方面是一個相對成熟的解決方案,是一套優秀的做爲MySQL高可用性環境下故障切換和主從提高的高可用軟件。在MySQL故障切換過程當中,MHA能作到在0~30秒以內自動完成數據庫的故障切換操做,而且在進行故障切換的過程當中,MHA能在最大程度上保證數據的一致性,以達到真正意義上的高可用。它由兩部分組成:MHA Manager(管理節點)和MHA Node(數據節點)。MHA Manager能夠單獨部署在一臺獨立的機器上管理多個master-slave集羣,也能夠部署在一臺slave節點上。MHA Node運行在每臺MySQL服務器上,MHA Manager會定時探測集羣中的master節點,當master出現故障時,它能夠自動將最新數據的slave提高爲新的master,而後將全部其餘的slave從新指向新的master。整個故障轉移過程對應用程序徹底透明。在MHA自動故障切換過程當中,MHA試圖從宕機的主服務器上保存二進制日誌,最大程度的保證數據的不丟失,但這並不老是可行的。例如,若是主服務器硬件故障或沒法經過ssh訪問,MHA無法保存二進制日誌,只進行故障轉移而丟失了最新的數據。使用MySQL 5.5的半同步複製,能夠大大下降數據丟失的風險。MHA能夠與半同步複製結合起來。若是隻有一個slave已經收到了最新的二進制日誌,MHA能夠將最新的二進制日誌應用於其餘全部的slave服務器上,所以能夠保證全部節點的數據一致性。html
(1)從宕機崩潰的master保存二進制日誌事件(binlog events);node
(2)識別含有最新更新的slave;mysql
(3)應用差別的中繼日誌(relay log)到其餘的slave;linux
(4)應用從master保存的二進制日誌事件(binlog events);redis
(5)提高一個slave爲新的master;sql
(6)使其餘的slave鏈接新的master進行復制;shell
MHA軟件由兩部分組成,Manager工具包和Node工具包數據庫
Manager工具包主要包括如下幾個工具:express
masterha_check_ssh 檢查MHA的SSH配置情況 masterha_check_repl 檢查MySQL複製情況 masterha_manger 啓動MHA masterha_check_status 檢測當前MHA運行狀態 masterha_master_monitor 檢測master是否宕機 masterha_master_switch 控制故障轉移(自動或者手動) masterha_conf_host 添加或刪除配置的server信息
Node工具包(這些工具一般由MHA Manager的腳本觸發,無需人爲操做)主要包括如下幾個工具:vim
save_binary_logs 保存和複製master的二進制日誌 apply_diff_relay_logs 識別差別的中繼日誌事件並將其差別的事件應用於其餘的slave filter_mysqlbinlog 去除沒必要要的ROLLBACK事件(MHA已再也不使用這個工具) purge_relay_logs 清除中繼日誌(不會阻塞SQL線程)
Node服務器安裝
mkdir -p /usr/local/mha
manage服務器安裝
mkdir -p /usr/local/mha/ha1/fail_script
mkdir -p /usr/local/mha/ha1/workdir
/usr/local/mha:程序安裝目錄
/usr/local/mha/ha1:用於區別每個mha方案,當前方案ha1
/usr/local/mha/ha1/fail_script:方案ha1的failover腳本保存路徑
/usr/local/mha/ha1/workdir:方案ha1的的日誌和failover產生的binlog保存路徑
使用yum方式安裝,須要安裝epel源
epel源
wget http://mirrors.ustc.edu.cn/fedora/epel/6/x86_64/epel-release-6-8.noarch.rpm
全部服務器都安裝(mananage須要安裝如下全部插件,node節點只須要安裝perl-DBD-MySQL,cpan)
yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes cpan
也可使用perl方式安裝
#!/bin/bash wget http://xrl.us/cpanm --no-check-certificate mv cpanm /usr/bin chmod 755 /usr/bin/cpanm cat > /root/list << EOF install DBD::mysql install Config::Tiny install Log::Dispatch install Parallel::ForkManager install Time::HiRes install CPAN install Digest::SHA EOF for package in `cat /root/list` do cpanm $package done
tar -xvf mha4mysql-node-0.54.tar.gz cd mha4mysql-node-0.54 perl Makefile.PL INSTALL_BASE=/usr/local/mha make && make install
tar -xvf mha4mysql-manager-0.55.tar.gz cd mha4mysql-manager-0.55 perl Makefile.PL INSTALL_BASE=/usr/local/mha make && make install
cp samples/scripts/* /usr/local/mha/bin/
master_ip_failover:自動切換時vip管理的腳本
master_ip_online_change:手動切換使用的腳本
power_manager:故障發生後關閉主機的腳本
send_report:發送報警的腳本。
將MHA Manager主機的/usr/local/mha/bin加入環境變量
爲了避免麻煩全部服務器都執行吧,其實最後兩個mysql,mysqlbinlog的軟連接只有Node服務器須要添加,其它的全部服務器都須要添加。
mkdir -p /usr/local/bin mkdir -p /usr/local/share/man/man1 mkdir -p /usr/local/share/perl5/MHA ln -s /usr/local/mha/bin/* /usr/local/bin; ln -s /usr/local/mha/man/man1/* /usr/local/share/man/man1; ln -s /usr/local/mha/lib/perl5/MHA /usr/local/share/perl5/MHA; ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog; ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql;
(1)在manage配置到全部Node節點的無密碼登入
ssh-keygen -t rsa 一直enter,會在/root/.ssh/下面生成id_rsa.pub ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(2)在Node 10配置到Node 20,30的無密碼登入
ssh-keygen -t rsa ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(3)在Node 20配置到Node 10,30的無密碼登入
ssh-keygen -t rsa
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.30
(4)在Node 30配置到Node 10,20的無密碼登入
ssh-keygen -t rsa ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.10 ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.137.20
複製環境以前已經搭建好,能夠參考我前面寫過的文章,複製用戶和密碼都是repl;每一個Node上都必須建立這個repl帳號,除非Node不做爲故障專業的master
1.在全部Node上建立Manage監控用戶
grant all privileges on *.* to 'root'@'192.168.137.%' identified by 'root';
vim /usr/local/mha/ha1/ha1.cnf
[server default] manager_workdir=/usr/local/mha/ha1/workdir ##項目的主目錄 manager_log=/usr/local/mha/ha1/workdir/manager.log ###mha記錄日誌 master_binlog_dir=/mysql/log ####node服務器的binlog存放路徑,若是每一個node的binlog路徑不一致的話就在下面的每一個server下面單獨配置 master_ip_failover_script=/usr/local/mha/ha1/fail_script/master_ip_failover ####mha在線自動failover時處理VIP的配置文件 master_ip_online_change_script=/usr/local/mha/ha1/fail_script/master_ip_online_change ####在線手動執行master切換時VIP的處理文件 secondary_check_script=/usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 ##一旦MHA到master之間的網絡出現問題,manager會嘗試從backup登入到masger #report_script=/usr/local/mha/ha1/fail_script/send_report ###發生切換後執行的報警腳本 shutdown_script="" ####故障後關閉master主機的腳本(主要是使用keepalive作VIP時會出現腦裂致使VIP頻繁切換因此會將故障的master關閉) ping_interval=1 ###監控mater,ping的頻率 remote_workdir=/tmp ###node服務器在發生master切換時,binlog保持的路徑,每一個node都會在該目錄下保存一份差別的binlog,除非沒有差別。 repl_password=repl ##複製使用的用戶名,每一個node服務器都須要存在 repl_user=repl ##複製使用的密碼 user=root ##mnager監控用的mysql root用戶 password=root ##root用戶密碼 ssh_user=root ##ssh登入用戶名 [server1] hostname=192.168.137.10 port=3306 candidate_master=1 check_repl_delay=0 [server2] hostname=192.168.137.20 port=3306 #master_binlog_dir=/mysql/log candidate_master=1 ##設置爲候選master,若是設置該參數之後,發生主從切換之後將會將此從庫提高爲主庫,即便這個主庫不是集羣中事件最新的slave check_repl_delay=0 ##默認狀況下若是一個slave落後master 100M的relay logs的話,MHA將不會選擇該slave做爲一個新的master,由於對於這個slave的恢復須要花費很長時間,經過設置check_repl_delay=0,MHA觸發切換在選擇一個新master的時候將會忽略複製延時,這個參數對於設置了candidate_master=1的主機很是有用,由於這個候選主在切換的過程當中必定是新的master [server3] hostname=192.168.137.30 port=3306 ignore_fail=1 ####若是不加上該參數,當該slave主機故障了,mha將沒法啓動,加上該參數會忽略該主機是否正常,在mha啓動的時候加上參數--ignore_fail_on_start no_master=1 ###不將該主機轉換爲master
注意:對於上面的配置必定要確保server1和server2之間是最新的binlog,通常會配置兩者爲雙主的半同步複製,這樣就保證了它們之間的binlog是最新的,不然應用差別的binlog將花費很是長的時間(若是它們和master延時很是大的狀況下)
VIP的配置可使用keepalived也能夠寫腳本,keepalived對網絡的要求很高不然容易腦裂,在我前面搭建雙主環境講過keepalived的搭建方法,我這裏使用腳本的方式。
#!/usr/bin/env perl use strict; use warnings FATAL => 'all'; use Getopt::Long; my ( $command, $ssh_user, $orig_master_host, $orig_master_ip, $orig_master_port, $new_master_host, $new_master_ip, $new_master_port, $new_master_user, $new_master_password ); my $vip = '192.168.137.50/24'; ###VIP my $key = '1'; ###用於區別自己的eth0 my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip"; my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down"; GetOptions( 'command=s' => \$command, 'ssh_user=s' => \$ssh_user, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, 'new_master_user=s' => \$new_master_user, 'new_master_password=s' => \$new_master_password, ); exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = 1; eval { print "Disabling the VIP on old master: $orig_master_host \n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { my $exit_code = 10; eval { print "Enabling the VIP - $vip on the new master - $new_master_host \n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; exit 0; } else { &usage(); exit 1; } } sub start_vip() { `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`; } sub stop_vip() { return 0 unless ($ssh_user); `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
注意:須要手動先在master服務器上面添加VIP
/sbin/ifconfig eth0:1 192.168.137.50/24
perl腳本
#!/usr/bin/env perl use strict; use warnings FATAL =>'all'; use Getopt::Long; my $vip = '192.168.137.50/24'; # Virtual IP my $key = "1"; my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip"; my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down"; my $exit_code = 0; my ( $command, $orig_master_is_new_slave, $orig_master_host, $orig_master_ip, $orig_master_port, $orig_master_user, $orig_master_password, $new_master_host, $new_master_ip, $new_master_port, $new_master_user, $new_master_password, ); GetOptions( 'command=s' => \$command, 'orig_master_is_new_slave' => \$orig_master_is_new_slave, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'orig_master_user=s' => \$orig_master_user, 'orig_master_password=s' => \$orig_master_password, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, 'new_master_user=s' => \$new_master_user, 'new_master_password=s' => \$new_master_password, ); exit &main(); sub main { #print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { # $orig_master_host, $orig_master_ip, $orig_master_port are passed. # If you manage master ip address at global catalog database, # invalidate orig_master_ip here. my $exit_code = 1; eval { print "\n\n\n***************************************************************\n"; print "Disabling the VIP - $vip on old master: $orig_master_host\n"; print "***************************************************************\n\n\n\n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { # all arguments are passed. # If you manage master ip address at global catalog database, # activate new_master_ip here. # You can also grant write access (create user, set read_only=0, etc) here. my $exit_code = 10; eval { print "\n\n\n***************************************************************\n"; print "Enabling the VIP - $vip on new master: $new_master_host \n"; print "***************************************************************\n\n\n\n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; `ssh $orig_master_user\@$orig_master_host \" $ssh_start_vip \"`; exit 0; } else { &usage(); exit 1; } } # A simple system call that enable the VIP on the new master sub start_vip() { `ssh $new_master_user\@$new_master_host \" $ssh_start_vip \"`; } # A simple system call that disable the VIP on the old_master sub stop_vip() { `ssh $orig_master_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
shell腳本
#/bin/bash #source /root/.bash_profile vip=`echo '192.168.137.50/24'` # Virtual IP key=`echo '1'` command=`echo "$1" | awk -F = '{print $2}'` orig_master_host=`echo "$2" | awk -F = '{print $2}'` new_master_host=`echo "$7" | awk -F = '{print $2}'` stop_vip=`echo "ssh root@$orig_master_host /sbin/ifconfig eth0:$key down"` start_vip=`echo "ssh root@$new_master_host /sbin/ifconfig eth0:$key $vip"` if [ $command = 'stop' ] then echo -e "\n\n\n***************************************************************\n" echo -e "Disabling the VIP - $vip on old master: $orig_master_host\n" $stop_vip if [ $? -eq 0 ] then echo "Disabled the VIP successfully" else echo "Disabled the VIP failed" fi echo -e "***************************************************************\n\n\n\n" fi if [ $command = 'start' -o $command = 'status' ] then echo -e "\n\n\n***************************************************************\n" echo -e "Enabling the VIP - $vip on new master: $new_master_host \n" $start_vip if [ $? -eq 0 ] then echo "Enabled the VIP successfully" else echo "Enabled the VIP failed" fi echo -e "***************************************************************\n\n\n\n" fi
#!/usr/bin/perl # Copyright (C) 2011 DeNA Co.,Ltd. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., # 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA ## Note: This is a sample script and is not complete. Modify the script based on your environment. use strict; use warnings FATAL => 'all'; use Mail::Sender; use Getopt::Long; #new_master_host and new_slave_hosts are set only when recovering master succeeded my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body ); my $smtp='smtp.163.com'; my $mail_from='xxxx'; my $mail_user='xxxxx'; my $mail_pass='xxxxx'; my $mail_to=['xxxx','xxxx']; GetOptions( 'orig_master_host=s' => \$dead_master_host, 'new_master_host=s' => \$new_master_host, 'new_slave_hosts=s' => \$new_slave_hosts, 'subject=s' => \$subject, 'body=s' => \$body, ); mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body); sub mailToContacts { my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_; open my $DEBUG, "> /tmp/monitormail.log" or die "Can't open the debug file:$!\n"; my $sender = new Mail::Sender { ctype => 'text/plain; charset=utf-8', encoding => 'utf-8', smtp => $smtp, from => $mail_from, auth => 'LOGIN', TLS_allowed => '0', authid => $user, authpwd => $passwd, to => $mail_to, subject => $subject, debug => $DEBUG }; $sender->MailMsg( { msg => $msg, debug => $DEBUG } ) or print $Mail::Sender::Error; return 1; } # Do whatever you want here exit 0;
(1)全部Node的cnf配置文件加上
relay_log_purge=0
MHA在發生切換的過程當中,從庫的恢復過程當中依賴於relay log的相關信息,因此這裏要將relay log的自動清除設置爲OFF,採用手動清除relay log的方式。
在默認狀況下,從服務器上的中繼日誌會在SQL線程執行完畢後被自動刪除。可是在MHA環境中,這些中繼日誌在恢復其餘從服務器時可能會被用到,所以須要禁用中繼日誌的自動刪除功能。按期清除中繼日誌須要考慮到複製延時的問題。在ext3的文件系統下,刪除大的文件須要必定的時間,會致使嚴重的複製延時。爲了不復制延時,須要暫時爲中繼日誌建立硬連接,由於在linux系統中經過硬連接刪除大文件速度會很快。
提示:在mysql數據庫中,刪除大表時,一般也採用創建硬連接的方式
MHA節點中包含了pure_relay_logs命令工具,它能夠爲中繼日誌建立硬連接,執行SET GLOBAL relay_log_purge=1,等待幾秒鐘以便SQL線程切換到新的中繼日誌,再執行SET GLOBAL relay_log_purge=0。
pure_relay_logs腳本參數以下所示:
--user mysql 用戶名 --password mysql 密碼 --port 端口號 --workdir 指定建立relay log的硬連接的位置,默認是/var/tmp,因爲系統不一樣分區建立硬連接文件會失敗,故須要執行硬連接具體位置,成功執行腳本後,硬連接的中繼日誌文件被刪除 --disable_relay_log_purge 默認狀況下,若是relay_log_purge=1,腳本會什麼都不清理,自動退出,經過設定這個參數,當relay_log_purge=1的狀況下會將relay_log_purge設置爲0。清理relay log以後,最後將參數設置爲OFF。
(2)在每臺slave Node上建立
vim /usr/local/mha/purge_relay_log.sh
#!/bin/bash
user=root passwd=root ####確保用戶和密碼能經過127.0.0.1登入 host='127.0.0.1' port=3306 work_dir='/mysql/data' purge='/usr/local/mha/bin/purge_relay_logs' $purge --user=$user --password=$passwd --host=$host --disable_relay_log_purge --port=$port --workdir=$work_dir >> /usr/local/mha/purge_relay_logs.log 2>&1
chmod u+x /usr/local/mha/purge_relay_log.sh
將腳本加入到os定時任務中
masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf Thu Aug 25 14:53:30 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Aug 25 14:53:30 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 14:53:30 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 14:53:30 2016 - [info] Starting SSH connection tests.. Thu Aug 25 14:53:35 2016 - [debug] Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.10(192.168.137.10:22).. Thu Aug 25 14:53:33 2016 - [debug] ok. Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.20(192.168.137.20:22) to root@192.168.137.30(192.168.137.30:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:35 2016 - [debug] Thu Aug 25 14:53:31 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.10(192.168.137.10:22).. Thu Aug 25 14:53:33 2016 - [debug] ok. Thu Aug 25 14:53:33 2016 - [debug] Connecting via SSH from root@192.168.137.30(192.168.137.30:22) to root@192.168.137.20(192.168.137.20:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:36 2016 - [debug] Thu Aug 25 14:53:30 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.20(192.168.137.20:22).. Thu Aug 25 14:53:34 2016 - [debug] ok. Thu Aug 25 14:53:34 2016 - [debug] Connecting via SSH from root@192.168.137.10(192.168.137.10:22) to root@192.168.137.30(192.168.137.30:22).. Thu Aug 25 14:53:35 2016 - [debug] ok. Thu Aug 25 14:53:36 2016 - [info] All SSH connection tests passed successfully.
能夠看到每一個Node到其它的Node都是相通的。
masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf
[root@monitor ha1]# masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf Thu Aug 25 16:09:19 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Aug 25 16:09:19 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 16:09:19 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Thu Aug 25 16:09:19 2016 - [info] MHA::MasterMonitor version 0.55. Thu Aug 25 16:09:20 2016 - [info] Dead Servers: Thu Aug 25 16:09:20 2016 - [info] Alive Servers: Thu Aug 25 16:09:20 2016 - [info] 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306) Thu Aug 25 16:09:20 2016 - [info] Alive Slaves: Thu Aug 25 16:09:20 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Aug 25 16:09:20 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Thu Aug 25 16:09:20 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Not candidate for the new Master (no_master is set) Thu Aug 25 16:09:20 2016 - [info] Current Alive Master: 192.168.137.10(192.168.137.10:3306) Thu Aug 25 16:09:20 2016 - [info] Checking slave configurations.. Thu Aug 25 16:09:20 2016 - [info] read_only=1 is not set on slave 192.168.137.20(192.168.137.20:3306). Thu Aug 25 16:09:20 2016 - [info] Checking replication filtering settings.. Thu Aug 25 16:09:20 2016 - [info] binlog_do_db= , binlog_ignore_db= Thu Aug 25 16:09:20 2016 - [info] Replication filtering check ok. Thu Aug 25 16:09:20 2016 - [info] Starting SSH connection tests.. Thu Aug 25 16:09:25 2016 - [info] All SSH connection tests passed successfully. Thu Aug 25 16:09:25 2016 - [info] Checking MHA Node version.. Thu Aug 25 16:09:26 2016 - [info] Version check ok. Thu Aug 25 16:09:26 2016 - [info] Checking SSH publickey authentication settings on the current master.. Thu Aug 25 16:09:27 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Thu Aug 25 16:09:29 2016 - [info] Master MHA Node version is 0.54. Thu Aug 25 16:09:29 2016 - [info] Checking recovery script configurations on the current master.. Thu Aug 25 16:09:29 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --start_file=mysql-bin.000138 Thu Aug 25 16:09:29 2016 - [info] Connecting to root@192.168.137.10(192.168.137.10).. Creating /tmp if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /mysql/log, up to mysql-bin.000138 Thu Aug 25 16:09:30 2016 - [info] Master setting check done. Thu Aug 25 16:09:30 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Thu Aug 25 16:09:30 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.20 --slave_ip=192.168.137.20 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx Thu Aug 25 16:09:30 2016 - [info] Connecting to root@192.168.137.20(192.168.137.20:22).. Checking slave recovery environment settings.. Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000006 Temporary relay log file is /mysql/data/mysql-relay-bin.000006 Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Aug 25 16:09:31 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --workdir=/tmp --target_version=5.6.15-log --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx Thu Aug 25 16:09:31 2016 - [info] Connecting to root@192.168.137.30(192.168.137.30:22).. Checking slave recovery environment settings.. Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000002 Temporary relay log file is /mysql/data/mysql-relay-bin.000002 Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Aug 25 16:09:32 2016 - [info] Slaves settings check done. Thu Aug 25 16:09:32 2016 - [info] 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.20.. Thu Aug 25 16:09:32 2016 - [info] ok. Thu Aug 25 16:09:32 2016 - [info] Checking replication health on 192.168.137.30.. Thu Aug 25 16:09:32 2016 - [info] ok. Thu Aug 25 16:09:32 2016 - [info] Checking master_ip_failover_script status: Thu Aug 25 16:09:32 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Checking the Status of the script.. OK Thu Aug 25 16:09:32 2016 - [info] OK. Thu Aug 25 16:09:32 2016 - [warning] shutdown_script is not defined. Thu Aug 25 16:09:32 2016 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK.
--ignore_fail_on_start: 當有slave 節點宕掉時,默認是啓動不了的,加上 --ignore_fail_on_start 即便有節點宕掉也能啓動MHA,加上該參數會忽略啓動文件中配置ignore_fail=1的server
masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
因爲mha尚未啓動,因此這裏檢測是stopped
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
--remove_dead_master_conf:該參數表明當發生主從切換後,老的主庫的ip將會從配置文件中移除。這裏暫時不使用該參數,由於發生使用該參數會將ha1.cnf配置文件搞亂。
--start_log:日誌。
--ignore_last_failover:發生主從切換後,MHAmanager服務會自動停掉,且在manager_workdir目錄下面生成文件app1.failover.complete,若要啓動MHA,必須先刪除該文件,該參數表明忽略上次MHA觸發切換產生的文件,這裏設置爲-ignore_last_failover。 在缺省狀況下,若是MHA檢測到連續發生宕機,且兩次宕機間隔不足8小時的話,則不會進行Failover,之因此這樣限制是爲了不ping-pong效應。
--ignore_fail_on_start: 當有slave 節點宕掉時,默認是啓動不了的,加上 --ignore_fail_on_start 即便有節點宕掉也能啓動MHA,加上該參數會忽略啓動文件中配置ignore_fail=1的server。
(1)再次查看MHA狀態是否正常:
[root@monitor ha1]# masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf ha1 (pid:6371) is running(0:PING_OK), master:192.168.137.10 [root@monitor ha1]#
(2)查看啓動日誌
cat manager.log
Thu Aug 25 17:11:50 2016 - [info] 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 Thu Aug 25 17:11:50 2016 - [info] Checking master_ip_failover_script status: Thu Aug 25 17:11:50 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Checking the Status of the script.. OK Thu Aug 25 17:11:50 2016 - [info] OK. Thu Aug 25 17:11:50 2016 - [warning] shutdown_script is not defined. Thu Aug 25 17:11:50 2016 - [info] Set master ping interval 1 seconds. Thu Aug 25 17:11:50 2016 - [info] Set secondary check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 Thu Aug 25 17:11:50 2016 - [info] Starting ping health check on 192.168.137.10(192.168.137.10:3306).. Thu Aug 25 17:11:50 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. [root@monitor ha1]#
(3)產生的文件
ha1.master_status.health:mha正常啓動會產生該文件
manager.log:mha監控日誌
start.log:mha啓動時生成的日誌
masterha_stop --conf=/usr/local/mha/ha1/ha1.cnf
發生主從切換後,MHA服務會自動停掉
檢查故障處理的日誌,確保故障正常轉移。
cat /usr/local/mha/ha1/manager.log
處理故障的master,將其配置爲從庫chang到新的master,能夠從manager.log找到change語句。
grep "CHANGE MASTER TO MASTER" /usr/local/mha/ha1/manager.log | tail -1
Fri Aug 26 12:04:22 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000143', MASTER_LOG_POS=22123166, MASTER_USER='repl', MASTER_PASSWORD='xxx';
注意:這裏要確保slave的SQL_THREAD和IO_TRREAD正常,若是是配置半同步複製要確保半同步複製啓動正常,能夠執行"show status like '%rpl_%';",具體參考前面半同步複製的搭建。
須要修改"secondary_check_script"選項中的master_host、master_ip、master_port爲新的master;若是兩臺電腦的配置都相同的話其它地方不用修改。
因爲啓動mha的時候加上了--ignore_last_failover參數,因此不刪除failower生成的文件也能啓動,不然須要刪除failower生成的文件「ha1.failover.complete」。
rm -f /usr/local/mha/ha1/ha1.failover.complete
檢查SSH配置 masterha_check_ssh --conf=/usr/local/mha/ha1/ha1.cnf 檢查複製 masterha_check_repl --conf=/usr/local/mha/ha1/ha1.cnf 檢查狀態 masterha_check_status --conf=/usr/local/mha/ha1/ha1.cnf
必需保證全部的檢查都經過
nohup masterha_manager --conf=/usr/local/mha/ha1/ha1.cnf --ignore_fail_on_start --ignore_last_failover < /dev/null > /usr/local/mha/ha1/start.log 2>&1 &
我這裏是異步複製,137.20是當前的master,而後在137.20上執行併發插入,同時關閉137.10和137.30的IO線程,在137.20上壓測一段時間,而後先開啓137.30的IO線程,過幾秒鐘再開啓137.10的IO線程;保證137.30的binlog比候選的137.10的binlog更新。
master 137.20(22497564)
candidate slave:137.10(pos=9857376)
new replay slave:137.30(pos=22461852)
Fri Aug 26 11:57:36 2016 - [warning] Got error on MySQL select ping: 2013 (Lost connection to MySQL server during query) Fri Aug 26 11:57:36 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/log --output_file=/tmp/save_binary_logs_test --manager_version=0.55 --binlog_prefix=mysql-bin Fri Aug 26 11:57:36 2016 - [info] Executing seconary network check script: /usr/local/mha/bin/masterha_secondary_check -s backup -s master --user=root --master_host=master --master_ip=192.168.137.10 --master_port=3306 --user=root --master_host=192.168.137.20 --master_ip=192.168.137.20 --master_port=3306 Fri Aug 26 11:57:37 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:37 2016 - [warning] Connection failed 1 time(s).. Fri Aug 26 11:57:38 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:38 2016 - [warning] Connection failed 2 time(s).. Fri Aug 26 11:57:38 2016 - [info] HealthCheck: SSH to 192.168.137.20 is reachable. Monitoring server backup is reachable, Master is not reachable from backup. OK. Fri Aug 26 11:57:39 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.137.20' (111)) Fri Aug 26 11:57:39 2016 - [warning] Connection failed 3 time(s).. Monitoring server master is reachable, Master is not reachable from master. OK. Fri Aug 26 11:57:41 2016 - [info] Master is not reachable from all other monitoring servers. Failover should start. Fri Aug 26 11:57:41 2016 - [warning] Master is not reachable from health checker! Fri Aug 26 11:57:41 2016 - [warning] Master 192.168.137.20(192.168.137.20:3306) is not reachable! Fri Aug 26 11:57:41 2016 - [warning] SSH is reachable. Fri Aug 26 11:57:41 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /usr/local/mha/ha1/ha1.cnf again, and trying to connect to all servers to check server status.. Fri Aug 26 11:57:41 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Fri Aug 26 11:57:41 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 11:57:41 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 11:57:42 2016 - [info] Dead Servers: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Alive Servers: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 11:57:42 2016 - [info] Alive Slaves: Fri Aug 26 11:57:42 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:42 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:42 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:42 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:42 2016 - [info] Checking slave configurations.. Fri Aug 26 11:57:42 2016 - [info] read_only=1 is not set on slave 192.168.137.10(192.168.137.10:3306). Fri Aug 26 11:57:42 2016 - [info] Checking replication filtering settings.. Fri Aug 26 11:57:42 2016 - [info] Replication filtering check ok. Fri Aug 26 11:57:42 2016 - [info] Master is down! Fri Aug 26 11:57:42 2016 - [info] Terminating monitoring script. Fri Aug 26 11:57:42 2016 - [info] Got exit code 20 (Master dead). Fri Aug 26 11:57:42 2016 - [info] MHA::MasterFailover version 0.55. Fri Aug 26 11:57:42 2016 - [info] Starting master failover. Fri Aug 26 11:57:42 2016 - [info] Fri Aug 26 11:57:42 2016 - [info] * Phase 1: Configuration Check Phase.. Fri Aug 26 11:57:42 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] Dead Servers: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Checking master reachability via mysql(double check).. Fri Aug 26 11:57:44 2016 - [info] ok. Fri Aug 26 11:57:44 2016 - [info] Alive Servers: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:44 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 11:57:44 2016 - [info] Alive Slaves: Fri Aug 26 11:57:44 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:44 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:44 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:44 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:44 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:44 2016 - [info] ** Phase 1: Configuration Check Phase completed. Fri Aug 26 11:57:44 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Fri Aug 26 11:57:44 2016 - [info] Fri Aug 26 11:57:44 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Fri Aug 26 11:57:44 2016 - [info] Executing master IP deactivatation script: Fri Aug 26 11:57:44 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Disabling the VIP on old master: 192.168.137.20 Fri Aug 26 11:57:45 2016 - [info] done. Fri Aug 26 11:57:45 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Fri Aug 26 11:57:45 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3: Master Recovery Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] The latest binary log file/position on all slaves is mysql-bin.000074:22461852 Fri Aug 26 11:57:45 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Fri Aug 26 11:57:45 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:45 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:45 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:45 2016 - [info] The oldest binary log file/position on all slaves is mysql-bin.000074:9857376 Fri Aug 26 11:57:45 2016 - [info] Oldest slaves: Fri Aug 26 11:57:45 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:45 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:45 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:45 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Fri Aug 26 11:57:45 2016 - [info] Fri Aug 26 11:57:46 2016 - [info] Fetching dead master's binary logs.. Fri Aug 26 11:57:46 2016 - [info] Executing command on the dead master 192.168.137.20(192.168.137.20:3306): save_binary_logs --command=save --start_file=mysql-bin.000074 --start_pos=22461852 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000074 pos 22461852 to mysql-bin.000074 EOF into /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000074 position 22461852 to tail(22497564).. ok. Concat succeeded. Fri Aug 26 11:57:49 2016 - [info] scp from root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 11:57:52 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Fri Aug 26 11:57:55 2016 - [info] HealthCheck: SSH to 192.168.137.30 is reachable. Fri Aug 26 11:57:55 2016 - [info] Fri Aug 26 11:57:55 2016 - [info] * Phase 3.3: Determining New Master Phase.. Fri Aug 26 11:57:55 2016 - [info] Fri Aug 26 11:57:55 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Fri Aug 26 11:57:55 2016 - [info] Checking whether 192.168.137.30 has relay logs from the oldest position.. Fri Aug 26 11:57:55 2016 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --workdir=/tmp --timestamp=20160826115742 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ : Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Target relay log FOUND! Fri Aug 26 11:57:56 2016 - [info] OK. 192.168.137.30 has all relay logs. Fri Aug 26 11:57:56 2016 - [info] Searching new master from slaves.. Fri Aug 26 11:57:56 2016 - [info] Candidate masters from the configuration file: Fri Aug 26 11:57:56 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:56 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:56 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 11:57:56 2016 - [info] Non-candidate masters: Fri Aug 26 11:57:56 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 11:57:56 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Fri Aug 26 11:57:56 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 11:57:56 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Fri Aug 26 11:57:56 2016 - [info] Not found. Fri Aug 26 11:57:56 2016 - [info] Searching from all candidate_master slaves.. Fri Aug 26 11:57:56 2016 - [info] New master is 192.168.137.10(192.168.137.10:3306) Fri Aug 26 11:57:56 2016 - [info] Starting master failover.. Fri Aug 26 11:57:56 2016 - [info] From: 192.168.137.20 (current master) +--192.168.137.10 +--192.168.137.30 To: 192.168.137.10 (new master) +--192.168.137.30 Fri Aug 26 11:57:56 2016 - [info] Fri Aug 26 11:57:56 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Fri Aug 26 11:57:56 2016 - [info] Fri Aug 26 11:57:56 2016 - [info] Server 192.168.137.10 received relay logs up to: mysql-bin.000074:9857376 Fri Aug 26 11:57:56 2016 - [info] Need to get diffs from the latest slave(192.168.137.30) up to: mysql-bin.000074:22461852 (using the latest slave's relay logs) Fri Aug 26 11:57:56 2016 - [info] Connecting to the latest slave host 192.168.137.30, generating diff relay log files.. Fri Aug 26 11:57:56 2016 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.137.10 --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog --workdir=/tmp --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ Fri Aug 26 11:58:02 2016 - [info] Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Concat binary/relay logs from mysql-relay-bin.000003 pos 9857539 to mysql-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 283.. ok. Dumping effective binlog data from /mysql/data/mysql-relay-bin.000003 position 9857539 to tail(22462015).. ok. Concat succeeded. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog . scp slave:/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to root@192.168.137.10(22) succeeded. Fri Aug 26 11:58:02 2016 - [info] Generating diff files succeeded. Fri Aug 26 11:58:02 2016 - [info] Sending binlog.. Fri Aug 26 11:58:04 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to root@192.168.137.10:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 11:58:04 2016 - [info] Fri Aug 26 11:58:04 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Fri Aug 26 11:58:04 2016 - [info] Fri Aug 26 11:58:04 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Fri Aug 26 11:58:04 2016 - [info] Starting recovery on 192.168.137.10(192.168.137.10:3306).. Fri Aug 26 11:58:04 2016 - [info] Generating diffs succeeded. Fri Aug 26 11:58:04 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 12:00:06 2016 - [info] done. Fri Aug 26 12:00:06 2016 - [info] Getting slave status.. Fri Aug 26 12:00:06 2016 - [info] This slave(192.168.137.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:9857376). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:00:06 2016 - [info] Connecting to the target slave host 192.168.137.10, running recover script.. Fri Aug 26 12:00:06 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.10 --slave_ip=192.168.137.10 --slave_port=3306 --apply_files=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:22 2016 - [info] Concat all apply files to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog .. Copying the first binlog file /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog.. dumped up to pos 120. ok. /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog has effective binlog events from pos 120. Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog position 120 to tail(35832).. ok. Concat succeeded. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog . MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.10:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:22 2016 - [info] All relay logs were successfully applied. Fri Aug 26 12:04:22 2016 - [info] Getting new master's binlog name and position.. Fri Aug 26 12:04:22 2016 - [info] mysql-bin.000143:22123166 Fri Aug 26 12:04:22 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000143', MASTER_LOG_POS=22123166, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Fri Aug 26 12:04:22 2016 - [info] Executing master IP activate script: Fri Aug 26 12:04:22 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Enabling the VIP - 192.168.137.50/24 on the new master - 192.168.137.10 Fri Aug 26 12:04:25 2016 - [info] OK. Fri Aug 26 12:04:25 2016 - [info] ** Finished master recovery successfully. Fri Aug 26 12:04:25 2016 - [info] * Phase 3: Master Recovery Phase completed. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] * Phase 4: Slaves Recovery Phase.. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Fri Aug 26 12:04:25 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] -- Slave diff file generation on host 192.168.137.30(192.168.137.30:3306) started, pid: 5029. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826115742.log if it takes time.. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:25 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 12:04:26 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 12:04:26 2016 - [info] -- 192.168.137.30(192.168.137.30:3306) has the latest relay log events. Fri Aug 26 12:04:26 2016 - [info] Generating relay diff files from the latest slave succeeded. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Fri Aug 26 12:04:26 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) started, pid: 5031. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826115742.log if it takes time.. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:26 2016 - [info] Sending binlog.. Fri Aug 26 12:04:28 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to root@192.168.137.30:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded. Fri Aug 26 12:04:28 2016 - [info] Starting recovery on 192.168.137.30(192.168.137.30:3306).. Fri Aug 26 12:04:28 2016 - [info] Generating diffs succeeded. Fri Aug 26 12:04:28 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 12:04:28 2016 - [info] done. Fri Aug 26 12:04:28 2016 - [info] Getting slave status.. Fri Aug 26 12:04:28 2016 - [info] This slave(192.168.137.30)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:22461852). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:04:28 2016 - [info] Connecting to the target slave host 192.168.137.30, running recover script.. Fri Aug 26 12:04:28 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:30 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.30:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:30 2016 - [info] All relay logs were successfully applied. Fri Aug 26 12:04:30 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Fri Aug 26 12:04:31 2016 - [info] Executed CHANGE MASTER. Fri Aug 26 12:04:31 2016 - [info] Slave started. Fri Aug 26 12:04:32 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 12:04:32 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) succeeded. Fri Aug 26 12:04:32 2016 - [info] All new slave servers recovered successfully. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] * Phase 5: New master cleanup phase.. Fri Aug 26 12:04:32 2016 - [info] Fri Aug 26 12:04:32 2016 - [info] Resetting slave info on the new master.. Fri Aug 26 12:04:32 2016 - [info] 192.168.137.10: Resetting slave info succeeded. Fri Aug 26 12:04:32 2016 - [info] Master failover to 192.168.137.10(192.168.137.10:3306) completed successfully. Fri Aug 26 12:04:32 2016 - [info] ----- Failover Report ----- ha1: MySQL Master failover 192.168.137.20 to 192.168.137.10 succeeded Master 192.168.137.20 is down! Check MHA Manager logs at monitor:/usr/local/mha/ha1/manager.log for details. Started automated(non-interactive) failover. Invalidated master IP address on 192.168.137.20. The latest slave 192.168.137.30(192.168.137.30:3306) has all relay logs for recovery. Selected 192.168.137.10 as a new master. 192.168.137.10: OK: Applying all logs succeeded. 192.168.137.10: OK: Activated master IP address. 192.168.137.30: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.137.30: OK: Applying all logs succeeded. Slave started, replicating from 192.168.137.10. 192.168.137.10: Resetting slave info succeeded. Master failover to 192.168.137.10(192.168.137.10:3306) completed successfully.
說明:用紅色標記了一些主要的處理過程,亮色加粗標記了每一個步驟總共5個步驟
Failover步驟以下:
1.fail判斷,分別判斷dead master的mysql(Ping(SELECT))和ssh分別到達狀況(之間會調用masterha_secondary_check腳本)→dead master處理階段
2.配置文件檢查,會檢查整個集羣配置文件配置(分別肯定dead server,候選的master和全部的server的配置狀況以及配置是否知足條件)→dead master處理階段
3.宕機的master處理,包括虛擬ip摘除操做,主機關機操做(這裏暫時沒有配置關機操做)→dead master處理階段
/usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --command=stopssh --ssh_user=root
4.找到含有最新relay log的slave(同時找到最舊的binlog的slave的position), 分別判斷是不是候選的slave→new master還原階段
5.保存dead master(137.20)和最新slave(137.30)相差的relay log保存在dead master的/tmp目錄下(根據配置文件配置的remote_workdir),而後肯定這部分差別binlog(saved_master_binlog_)是否有效,也就是dead master和最新的slave之間是否存在binlog差別,存在差別則將生成的這個差別binlog拷貝到mha的workdir(137.40)下→new master還原階段
Fri Aug 26 11:57:46 2016 - [info] Executing command on the dead master 192.168.137.20(192.168.137.20:3306): save_binary_logs --command=save --start_file=mysql-bin.000074 --start_pos=22461852 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000074 pos 22461852 to mysql-bin.000074 EOF into /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000074 position 22461852 to tail(22497564).. ok. Concat succeeded. Fri Aug 26 11:57:49 2016 - [info] scp from root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog succeeded.
7.肯定新的master,並檢查最新slave(30)的relay log是否能夠用來還原其餘的slave→new master還原階段
8.生成最新slave(137.30)和new master(137.10)之間的差別relay log(在最新relay log的slave的/tmp下生成和其它slave差別的binlog,是兩者的「Read_Master_Log_Pos」的差,取名爲「relay_from_read_to_latest_後面緊接的是目標slave的ip」),而後cp到目標(new master)的slave的/tmp下,同時將mha workdir下剛纔保存的"saved_master_binlog_"(若是存在)文件拷貝到new master的/tmp下→new master還原階段
Fri Aug 26 11:57:56 2016 - [info] Connecting to the latest slave host 192.168.137.30, generating diff relay log files.. Fri Aug 26 11:57:56 2016 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=192.168.137.10 --latest_mlf=mysql-bin.000074 --latest_rmlp=22461852 --target_mlf=mysql-bin.000074 --target_rmlp=9857376 --server_id=30 --diff_file_readtolatest=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog --workdir=/tmp --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ Fri Aug 26 11:58:02 2016 - [info] Opening /mysql/data/relay-log.info ... ok. Relay log found at /mysql/data, up to mysql-relay-bin.000003 Fast relay log position search succeeded. Target relay log file/position found. start_file:mysql-relay-bin.000003, start_pos:9857539. Concat binary/relay logs from mysql-relay-bin.000003 pos 9857539 to mysql-relay-bin.000003 EOF into /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog .. Dumping binlog format description event, from position 0 to 283.. ok. Dumping effective binlog data from /mysql/data/mysql-relay-bin.000003 position 9857539 to tail(22462015).. ok. Concat succeeded. Generating diff relay log succeeded. Saved at /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog . scp slave:/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to root@192.168.137.10(22) succeeded.
9.new master應用差別的relay log(首先會判斷該salve本來Read_Master_Log_Pos」和「Exec_Master_Log_Pos是否相等,因爲不是半同步複製因此slave雖然讀到了該pos可是可是因爲複製是異步的全部還得等待master定時發送binlog到slave,若是這中間master故障了就會致使兩者的不一致),若是不相等會在該slave執行save_binary_logs命令保存之間差別的relay log取名爲「relay_from_exec_to_read_後面緊接的是自身的ip」;而後應用"relay_from_read_to_latest_、saved_master_binlog_、relay_from_exec_to_read_"這三個差別的relay log,同時將這三個文件的內容合併生成一個新的binlog文件「total_binlog_for_」→new master還原階段
Fri Aug 26 12:00:06 2016 - [info] This slave(192.168.137.10)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000074:9857376). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 12:00:06 2016 - [info] Connecting to the target slave host 192.168.137.10, running recover script.. Fri Aug 26 12:00:06 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.10 --slave_ip=192.168.137.10 --slave_port=3306 --apply_files=/tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826115742 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 12:04:22 2016 - [info] Concat all apply files to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog .. Copying the first binlog file /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog to /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog.. ok. Dumping binlog head events (rotate events), skipping format description events from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog.. dumped up to pos 120. ok. /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog has effective binlog events from pos 120. Dumping effective binlog data from /tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog position 120 to tail(35832).. ok. Concat succeeded. All apply target binary logs are concatinated at /tmp/total_binlog_for_192.168.137.10_3306.20160826115742.binlog . MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/relay_from_read_to_latest_192.168.137.10_3306_20160826115742.binlog,/tmp/saved_master_binlog_from_192.168.137.20_3306_20160826115742.binlog on 192.168.137.10:3306. This may take long time... Applying log files succeeded. Fri Aug 26 12:04:22 2016 - [info] All relay logs were successfully applied.
10.生成其它slave到新master的change語句,執行master_ip_failover完成切換生成VIP→new master還原階段
11. 其它的slave也重複new masters slave的步驟 (從第8-9步,例如這裏就會從mha的workdir拷貝saved_master_binlog_到最新的slave(137.30)上應用差別的relay log) →other slave還原階段
12. other slave change new master→other slave還原階段
13.生成failover report
注意:這裏的relay log指的是slave已經讀取到的master的binglog的位置(也就是slave中的relay log文件中已經保存了最新master binlog的位置)在show slave status \G中也就是「Read_Master_Log_Pos」並非「Exec_Master_Log_Pos」,因此最新relay log的salve並不必定就是數據最新的(可是這種狀況比較少),只能說明它保存的master binlog是最新的。
mha最早修復new master(無論它是不是最新的slave,因此若是候選的slave是最新的slave那麼天然是最好的候選的slave就會很快修復好),而後再去修復其它的slave。
注意:前提條件是mha沒有啓動,且存在dead的master,MHA manager檢測到沒有dead的server,將報錯,並結束failover。
手動failover,這種場景意味着在業務上沒有啓用MHA自動切換功能,當主服務器故障時,人工手動調用MHA來進行故障切換操做,具體命令以下:
語句以下:
masterha_master_switch --master_state=dead --conf=/usr/local/mha/ha1/ha1.cnf --dead_master_host=192.168.137.10 --dead_master_port=3306 --new_master_host=192.168.137.20 --new_master_port=3306 --ignore_fail_on_start --ignore_last_failover
切換的過程當中會存在幾回須要輸入「yes」進行下一步
[root@monitor ha1]# masterha_master_switch --master_state=dead --conf=/usr/local/mha/ha1/ha1.cnf --dead_master_host=192.168.137.10 --dead_master_port=3306 --new_master_host=192.168.137.20 --new_master_port=3306 --ignore_fail_on_start --ignore_last_failover --dead_master_ip=<dead_master_ip> is not set. Using 192.168.137.10. Fri Aug 26 17:44:10 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Fri Aug 26 17:44:10 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 17:44:10 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Fri Aug 26 17:44:10 2016 - [info] MHA::MasterFailover version 0.55. Fri Aug 26 17:44:10 2016 - [info] Starting master failover. Fri Aug 26 17:44:10 2016 - [info] Fri Aug 26 17:44:10 2016 - [info] * Phase 1: Configuration Check Phase.. Fri Aug 26 17:44:10 2016 - [info] Fri Aug 26 17:44:11 2016 - [info] Dead Servers: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Checking master reachability via mysql(double check).. Fri Aug 26 17:44:11 2016 - [info] ok. Fri Aug 26 17:44:11 2016 - [info] Alive Servers: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:11 2016 - [info] 192.168.137.30(192.168.137.30:3306) Fri Aug 26 17:44:11 2016 - [info] Alive Slaves: Fri Aug 26 17:44:11 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:11 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:11 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:11 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:11 2016 - [info] Not candidate for the new Master (no_master is set) Master 192.168.137.10 is dead. Proceed? (yes/NO): yes Fri Aug 26 17:44:18 2016 - [info] ** Phase 1: Configuration Check Phase completed. Fri Aug 26 17:44:18 2016 - [info] Fri Aug 26 17:44:18 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Fri Aug 26 17:44:18 2016 - [info] Fri Aug 26 17:44:20 2016 - [info] HealthCheck: SSH to 192.168.137.10 is reachable. Fri Aug 26 17:44:22 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Fri Aug 26 17:44:22 2016 - [info] Executing master IP deactivatation script: Fri Aug 26 17:44:22 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Disabling the VIP on old master: 192.168.137.10 Fri Aug 26 17:44:23 2016 - [info] done. Fri Aug 26 17:44:23 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Fri Aug 26 17:44:23 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3: Master Recovery Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] The latest binary log file/position on all slaves is mysql-bin.000144:120 Fri Aug 26 17:44:23 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Fri Aug 26 17:44:23 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:23 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 17:44:23 2016 - [info] The oldest binary log file/position on all slaves is mysql-bin.000144:120 Fri Aug 26 17:44:23 2016 - [info] Oldest slaves: Fri Aug 26 17:44:23 2016 - [info] 192.168.137.20(192.168.137.20:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Primary candidate for the new Master (candidate_master is set) Fri Aug 26 17:44:23 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Fri Aug 26 17:44:23 2016 - [info] Replicating from 192.168.137.10(192.168.137.10:3306) Fri Aug 26 17:44:23 2016 - [info] Not candidate for the new Master (no_master is set) Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:23 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Fri Aug 26 17:44:23 2016 - [info] Fri Aug 26 17:44:24 2016 - [info] Fetching dead master's binary logs.. Fri Aug 26 17:44:24 2016 - [info] Executing command on the dead master 192.168.137.10(192.168.137.10:3306): save_binary_logs --command=save --start_file=mysql-bin.000144 --start_pos=120 --binlog_dir=/mysql/log --output_file=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000144 pos 120 to mysql-bin.000144 EOF into /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog .. Dumping binlog format description event, from position 0 to 120.. ok. Dumping effective binlog data from /mysql/log/mysql-bin.000144 position 120 to tail(143).. ok. Concat succeeded. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:27 2016 - [info] scp from root@192.168.137.10:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:29 2016 - [info] HealthCheck: SSH to 192.168.137.20 is reachable. Fri Aug 26 17:44:31 2016 - [info] HealthCheck: SSH to 192.168.137.30 is reachable. Fri Aug 26 17:44:31 2016 - [info] Fri Aug 26 17:44:31 2016 - [info] * Phase 3.3: Determining New Master Phase.. Fri Aug 26 17:44:31 2016 - [info] Fri Aug 26 17:44:31 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Fri Aug 26 17:44:31 2016 - [info] All slaves received relay logs to the same position. No need to resync each other. Fri Aug 26 17:44:31 2016 - [info] 192.168.137.20 can be new master. Fri Aug 26 17:44:31 2016 - [info] New master is 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:31 2016 - [info] Starting master failover.. Fri Aug 26 17:44:31 2016 - [info] From: 192.168.137.10 (current master) +--192.168.137.20 +--192.168.137.30 To: 192.168.137.20 (new master) +--192.168.137.30 Starting master switch from 192.168.137.10(192.168.137.10:3306) to 192.168.137.20(192.168.137.20:3306)? (yes/NO): yes Fri Aug 26 17:44:40 2016 - [info] New master decided manually is 192.168.137.20(192.168.137.20:3306) Fri Aug 26 17:44:40 2016 - [info] Fri Aug 26 17:44:40 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Fri Aug 26 17:44:40 2016 - [info] Fri Aug 26 17:44:40 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 17:44:40 2016 - [info] Sending binlog.. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:42 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to root@192.168.137.20:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:42 2016 - [info] Fri Aug 26 17:44:42 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Fri Aug 26 17:44:42 2016 - [info] Fri Aug 26 17:44:42 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Fri Aug 26 17:44:42 2016 - [info] Starting recovery on 192.168.137.20(192.168.137.20:3306).. Fri Aug 26 17:44:42 2016 - [info] Generating diffs succeeded. Fri Aug 26 17:44:42 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 17:44:42 2016 - [info] done. Fri Aug 26 17:44:42 2016 - [info] Getting slave status.. Fri Aug 26 17:44:42 2016 - [info] This slave(192.168.137.20)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000144:120). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 17:44:42 2016 - [info] Connecting to the target slave host 192.168.137.20, running recover script.. Fri Aug 26 17:44:42 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.20 --slave_ip=192.168.137.20 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826174410 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 17:44:43 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog on 192.168.137.20:3306. This may take long time... Applying log files succeeded. Fri Aug 26 17:44:43 2016 - [info] All relay logs were successfully applied. Fri Aug 26 17:44:43 2016 - [info] Getting new master's binlog name and position.. Fri Aug 26 17:44:43 2016 - [info] mysql-bin.000075:120 Fri Aug 26 17:44:43 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.20', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000075', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Fri Aug 26 17:44:43 2016 - [info] Executing master IP activate script: Fri Aug 26 17:44:43 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.137.10 --orig_master_ip=192.168.137.10 --orig_master_port=3306 --new_master_host=192.168.137.20 --new_master_ip=192.168.137.20 --new_master_port=3306 --new_master_user='root' --new_master_password='root' IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.137.50/24=== Enabling the VIP - 192.168.137.50/24 on the new master - 192.168.137.20 Fri Aug 26 17:44:44 2016 - [info] OK. Fri Aug 26 17:44:44 2016 - [info] ** Finished master recovery successfully. Fri Aug 26 17:44:44 2016 - [info] * Phase 3: Master Recovery Phase completed. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] * Phase 4: Slaves Recovery Phase.. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Fri Aug 26 17:44:44 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] -- Slave diff file generation on host 192.168.137.30(192.168.137.30:3306) started, pid: 5354. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826174410.log if it takes time.. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:44 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Fri Aug 26 17:44:45 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 17:44:45 2016 - [info] -- 192.168.137.30(192.168.137.30:3306) has the latest relay log events. Fri Aug 26 17:44:45 2016 - [info] Generating relay diff files from the latest slave succeeded. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Fri Aug 26 17:44:45 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) started, pid: 5356. Check tmp log /usr/local/mha/ha1/192.168.137.30_3306_20160826174410.log if it takes time.. saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog 100% 143 0.1KB/s 00:00 Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] Log messages from 192.168.137.30 ... Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:45 2016 - [info] Sending binlog.. Fri Aug 26 17:44:45 2016 - [info] scp from local:/usr/local/mha/ha1/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog to root@192.168.137.30:/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog succeeded. Fri Aug 26 17:44:45 2016 - [info] Starting recovery on 192.168.137.30(192.168.137.30:3306).. Fri Aug 26 17:44:45 2016 - [info] Generating diffs succeeded. Fri Aug 26 17:44:45 2016 - [info] Waiting until all relay logs are applied. Fri Aug 26 17:44:45 2016 - [info] done. Fri Aug 26 17:44:45 2016 - [info] Getting slave status.. Fri Aug 26 17:44:45 2016 - [info] This slave(192.168.137.30)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000144:120). No need to recover from Exec_Master_Log_Pos. Fri Aug 26 17:44:45 2016 - [info] Connecting to the target slave host 192.168.137.30, running recover script.. Fri Aug 26 17:44:45 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='root' --slave_host=192.168.137.30 --slave_ip=192.168.137.30 --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog --workdir=/tmp --target_version=5.6.15-log --timestamp=20160826174410 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx Fri Aug 26 17:44:45 2016 - [info] MySQL client version is 5.6.15. Using --binary-mode. Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.137.10_3306_20160826174410.binlog on 192.168.137.30:3306. This may take long time... Applying log files succeeded. Fri Aug 26 17:44:45 2016 - [info] All relay logs were successfully applied. Fri Aug 26 17:44:45 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.20(192.168.137.20:3306).. Fri Aug 26 17:44:46 2016 - [info] Executed CHANGE MASTER. Fri Aug 26 17:44:46 2016 - [info] Slave started. Fri Aug 26 17:44:47 2016 - [info] End of log messages from 192.168.137.30. Fri Aug 26 17:44:47 2016 - [info] -- Slave recovery on host 192.168.137.30(192.168.137.30:3306) succeeded. Fri Aug 26 17:44:47 2016 - [info] All new slave servers recovered successfully. Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] * Phase 5: New master cleanup phase.. Fri Aug 26 17:44:47 2016 - [info] Fri Aug 26 17:44:47 2016 - [info] Resetting slave info on the new master.. Fri Aug 26 17:44:47 2016 - [info] 192.168.137.20: Resetting slave info succeeded. Fri Aug 26 17:44:47 2016 - [info] Master failover to 192.168.137.20(192.168.137.20:3306) completed successfully. Fri Aug 26 17:44:47 2016 - [info] ----- Failover Report ----- ha1: MySQL Master failover 192.168.137.10 to 192.168.137.20 succeeded Master 192.168.137.10 is down! Check MHA Manager logs at monitor for details. Started manual(interactive) failover. Invalidated master IP address on 192.168.137.10. The latest slave 192.168.137.20(192.168.137.20:3306) has all relay logs for recovery. Selected 192.168.137.20 as a new master. 192.168.137.20: OK: Applying all logs succeeded. 192.168.137.20: OK: Activated master IP address. 192.168.137.30: This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. 192.168.137.30: OK: Applying all logs succeeded. Slave started, replicating from 192.168.137.20. 192.168.137.20: Resetting slave info succeeded. Master failover to 192.168.137.20(192.168.137.20:3306) completed successfully.
在許多狀況下, 須要將現有的主服務器遷移到另一臺服務器上。 好比主服務器硬件故障,RAID 控制卡須要重建,將主服務器移到性能更好的服務器上等等。維護主服務器引發性能降低, 致使停機時間至少沒法寫入數據。 另外, 阻塞或殺掉當前運行的會話會致使主主之間數據不一致的問題發生。 MHA 提供快速切換和優雅的阻塞寫入,這個切換過程只須要 0.5-2s 的時間,這段時間內數據是沒法寫入的。在不少狀況下,0.5-2s 的阻塞寫入是能夠接受的。所以切換主服務器不須要計劃分配維護時間窗口。
MHA在線切換的大概過程:
1.檢測複製設置和肯定當前主服務器
2.肯定新的主服務器
3.阻塞寫入到當前主服務器
4.等待全部從服務器遇上覆制
5.授予寫入到新的主服務器
6.從新設置從服務器
注意,在線切換的時候應用架構須要考慮如下兩個問題:
1.自動識別master和slave的問題(master的機器可能會切換),若是採用了vip的方式,基本能夠解決這個問題。
2.負載均衡的問題(能夠定義大概的讀寫比例,每臺機器可承擔的負載比例,當有機器離開集羣時,須要考慮這個問題)
爲了保證數據徹底一致性,在最快的時間內完成切換,MHA的在線切換必須知足如下條件纔會切換成功,不然會切換失敗。
1.全部slave的IO線程都在運行
2.全部slave的SQL線程都在運行
3.全部的show slave status的輸出中Seconds_Behind_Master參數小於或者等於running_updates_limit秒,若是在切換過程當中不指定running_updates_limit,那麼默認狀況下running_updates_limit爲1秒。
4.在old master端,經過show processlist輸出,沒有一個更新花費的時間大於running_updates_limit秒。
1.首先,停掉MHA監控:
masterha_stop --conf=/usr/local/mha/ha1/ha1.cnf
2.手動在線failover
新的master爲192.168.137.10
masterha_master_switch --conf=/usr/local/mha/ha1/ha1.cnf --master_state=alive --new_master_host=192.168.137.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
--orig_master_is_new_slave 切換時加上此參數是將原 master 變爲 slave 節點,若是不加此參數,原來的 master 將不啓動
--running_updates_limit=10000,故障切換時,候選master 若是有延遲的話, mha 切換不能成功,加上此參數表示延遲在此時間範圍內均可切換(單位爲s),可是切換的時間長短是由recover 時relay 日誌的大小決定
注意:手動在線用的failover腳本是「master_ip_online_change」
3.切換日誌
[root@monitor fail_script]# masterha_master_switch --conf=/usr/local/mha/ha1/ha1.cnf --master_state=alive --new_master_host=192.168.137.10 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000 Mon Aug 29 10:41:37 2016 - [info] MHA::MasterRotate version 0.55. Mon Aug 29 10:41:37 2016 - [info] Starting online master switch.. Mon Aug 29 10:41:37 2016 - [info] Mon Aug 29 10:41:37 2016 - [info] * Phase 1: Configuration Check Phase.. Mon Aug 29 10:41:37 2016 - [info] Mon Aug 29 10:41:37 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Aug 29 10:41:37 2016 - [info] Reading application default configurations from /usr/local/mha/ha1/ha1.cnf.. Mon Aug 29 10:41:37 2016 - [info] Reading server configurations from /usr/local/mha/ha1/ha1.cnf.. Mon Aug 29 10:41:38 2016 - [info] Current Alive Master: 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Alive Slaves: Mon Aug 29 10:41:38 2016 - [info] 192.168.137.10(192.168.137.10:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Mon Aug 29 10:41:38 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Primary candidate for the new Master (candidate_master is set) Mon Aug 29 10:41:38 2016 - [info] 192.168.137.30(192.168.137.30:3306) Version=5.6.15-log (oldest major version between slaves) log-bin:enabled Mon Aug 29 10:41:38 2016 - [info] Replicating from 192.168.137.20(192.168.137.20:3306) Mon Aug 29 10:41:38 2016 - [info] Not candidate for the new Master (no_master is set) It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.137.20(192.168.137.20:3306)? (YES/no): yes Mon Aug 29 10:41:40 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] Checking MHA is not monitoring or doing failover.. Mon Aug 29 10:41:40 2016 - [info] Checking replication health on 192.168.137.10.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] Checking replication health on 192.168.137.30.. Mon Aug 29 10:41:40 2016 - [info] ok. Mon Aug 29 10:41:40 2016 - [info] 192.168.137.10 can be new master. Mon Aug 29 10:41:40 2016 - [info] From: 192.168.137.20 (current master) +--192.168.137.10 +--192.168.137.30 To: 192.168.137.10 (new master) +--192.168.137.30 +--192.168.137.20 Starting master switch from 192.168.137.20(192.168.137.20:3306) to 192.168.137.10(192.168.137.10:3306)? (yes/NO): yes Mon Aug 29 10:41:42 2016 - [info] Checking whether 192.168.137.10(192.168.137.10:3306) is ok for the new master.. Mon Aug 29 10:41:42 2016 - [info] ok. Mon Aug 29 10:41:42 2016 - [info] 192.168.137.20(192.168.137.20:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Mon Aug 29 10:41:42 2016 - [info] 192.168.137.20(192.168.137.20:3306): Resetting slave pointing to the dummy host. Mon Aug 29 10:41:42 2016 - [info] ** Phase 1: Configuration Check Phase completed. Mon Aug 29 10:41:42 2016 - [info] Mon Aug 29 10:41:42 2016 - [info] * Phase 2: Rejecting updates Phase.. Mon Aug 29 10:41:42 2016 - [info] Mon Aug 29 10:41:42 2016 - [info] Executing master ip online change script to disable write on the current master: Mon Aug 29 10:41:42 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_online_change --command=stop --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' *************************************************************** Disabling the VIP - 192.168.137.50/24 on old master: 192.168.137.20 *************************************************************** Mon Aug 29 10:41:43 2016 - [info] ok. Mon Aug 29 10:41:43 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Mon Aug 29 10:41:43 2016 - [info] Executing FLUSH TABLES WITH READ LOCK.. Mon Aug 29 10:41:43 2016 - [info] ok. Mon Aug 29 10:41:43 2016 - [info] Orig master binlog:pos is mysql-bin.000076:120. Mon Aug 29 10:41:43 2016 - [info] Waiting to execute all relay logs on 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:43 2016 - [info] master_pos_wait(mysql-bin.000076:120) completed on 192.168.137.10(192.168.137.10:3306). Executed 0 events. Mon Aug 29 10:41:43 2016 - [info] done. Mon Aug 29 10:41:43 2016 - [info] Getting new master's binlog name and position.. Mon Aug 29 10:41:43 2016 - [info] mysql-bin.000147:120 Mon Aug 29 10:41:43 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.10', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000147', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Mon Aug 29 10:41:43 2016 - [info] Executing master ip online change script to allow write on the new master: Mon Aug 29 10:41:43 2016 - [info] /usr/local/mha/ha1/fail_script/master_ip_online_change --command=start --orig_master_host=192.168.137.20 --orig_master_ip=192.168.137.20 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='root' --new_master_host=192.168.137.10 --new_master_ip=192.168.137.10 --new_master_port=3306 --new_master_user='root' --new_master_password='root' *************************************************************** Enabling the VIP - 192.168.137.50/24 on new master: 192.168.137.10 *************************************************************** Mon Aug 29 10:41:44 2016 - [info] ok. Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] * Switching slaves in parallel.. Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] -- Slave switch on host 192.168.137.30(192.168.137.30:3306) started, pid: 4371 Mon Aug 29 10:41:44 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] Log messages from 192.168.137.30 ... Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:44 2016 - [info] Waiting to execute all relay logs on 192.168.137.30(192.168.137.30:3306).. Mon Aug 29 10:41:44 2016 - [info] master_pos_wait(mysql-bin.000076:120) completed on 192.168.137.30(192.168.137.30:3306). Executed 0 events. Mon Aug 29 10:41:44 2016 - [info] done. Mon Aug 29 10:41:44 2016 - [info] Resetting slave 192.168.137.30(192.168.137.30:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:44 2016 - [info] Executed CHANGE MASTER. Mon Aug 29 10:41:44 2016 - [info] Slave started. Mon Aug 29 10:41:45 2016 - [info] End of log messages from 192.168.137.30 ... Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] -- Slave switch on host 192.168.137.30(192.168.137.30:3306) succeeded. Mon Aug 29 10:41:45 2016 - [info] Unlocking all tables on the orig master: Mon Aug 29 10:41:45 2016 - [info] Executing UNLOCK TABLES.. Mon Aug 29 10:41:45 2016 - [info] ok. Mon Aug 29 10:41:45 2016 - [info] Starting orig master as a new slave.. Mon Aug 29 10:41:45 2016 - [info] Resetting slave 192.168.137.20(192.168.137.20:3306) and starting replication from the new master 192.168.137.10(192.168.137.10:3306).. Mon Aug 29 10:41:45 2016 - [info] Executed CHANGE MASTER. Mon Aug 29 10:41:45 2016 - [info] Slave started. Mon Aug 29 10:41:45 2016 - [info] All new slave servers switched successfully. Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] * Phase 5: New master cleanup phase.. Mon Aug 29 10:41:45 2016 - [info] Mon Aug 29 10:41:45 2016 - [info] 192.168.137.10: Resetting slave info succeeded. Mon Aug 29 10:41:45 2016 - [info] Switching master to 192.168.137.10(192.168.137.10:3306) completed successfully.
切換失敗:
[error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln262] We should not start online master switch when one of connections are running long queries on the new master(192.168.137.10(192.168.137.10:3306)). Currently 1 thread(s) are running. Details: {'Time' => '1173','Command' => 'Daemon','db' => undef,'Id' => '3','Info' => undef,'User' => 'event_scheduler','State' => 'Waiting on empty queue','Host' => 'localhost'}
從錯誤信息已經說的很是的清除,在new master中存在一個long queries,且該線程的id=3是一個事件調度線程,因爲new master我開啓了「scheduler」致使了在線手動切換失敗,我把「scheduler」關閉就成功了。
在MHA的環境中備選的Master不能開啓「scheduler」
參考:
http://www.cnblogs.com/gomysql/p/3675429.html
http://blog.csdn.net/lichangzai/article/details/50470771
mha下載地址:
https://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.55.tar.gz
https://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.54.tar.gz
以前寫的複製相關文章:
主從複製:http://www.cnblogs.com/chenmh/p/5089919.html
主主複製:http://www.cnblogs.com/chenmh/p/5153184.html
MMM方案:http://www.cnblogs.com/chenmh/p/5563778.html
半同步複製與MMM:http://www.cnblogs.com/chenmh/p/5744227.html
主和備主之間須要半同步複製才能保證mha的最大程度的數據不丟失,不然使用MHA也就沒優點了;還有就主和備主不要開啓scheduler(做業),不然手動在線failover會失敗。mha最早修復new master(無論它是不是最新的slave,因此若是候選的slave是最新的slave那麼天然是最好的候選的slave就會很快修復好),而後再去修復其它的slave。
注意:文章中對一些配置作了備註說明,在實際部署中須要將這些備註刪掉
備註: 做者:pursuer.chen 博客:http://www.cnblogs.com/chenmh 本站點全部隨筆都是原創,歡迎你們轉載;但轉載時必須註明文章來源,且在文章開頭明顯處給明連接。 《歡迎交流討論》 |