This article walks through a gpdb master failure and recovery test, followed by a segment failure and recovery test.
Environment:
Gpdb version: 5.5.0 (binary release)
Operating system: CentOS Linux 7.0
Master node: 192.168.1.225/24 hostname: mfsmaster
Standby master node: 192.168.1.227/24 hostname: server227
Segment node 1: 192.168.1.227/24 hostname: server227
Segment node 2: 192.168.1.17/24 hostname: server17
Segment node 3: 192.168.1.11/24 hostname: server11
Each segment node runs one primary segment and one mirror segment.
1. View the initial state
select * from gp_segment_configuration;
$ gpstate -f
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Starting gpstate with args: -f
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.5.0 build commit:67afa18296aa238d53a2dfcc724da60ed2f944f0'
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.5.0 build commit:67afa18296aa238d53a2dfcc724da60ed2f944f0) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Feb 17 2018 15:23:55'
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Obtaining Segment details from master...
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-Standby master details
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:-----------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby address = server227
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby data directory = /home/gpadmin/master/gpseg-1
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby port = 5432
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby PID = 22279
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:- Standby status = Standby host passive
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--pg_stat_replication
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--WAL Sender State: streaming
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Sync state: sync
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Sent Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Flush Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--Replay Location: 0/CF2C470
20180320:13:50:38:021814 gpstate:mfsmaster:gpadmin-[INFO]:--------------------------------------------------------------
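The full gp_segment_configuration listing can be hard to scan by eye. As a minimal sketch, the helper below filters it down to rows that are either down or not running in their preferred role. The function name find_problem_segments is hypothetical, and it assumes the rows arrive in unaligned psql output (psql -At) with the columns content, role, preferred_role, status, hostname in that order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rows of the form
#   content|role|preferred_role|status|hostname
# on stdin and prints only the rows whose status is not "u" (up)
# or whose current role differs from the preferred role.
find_problem_segments() {
  awk -F'|' 'NF >= 5 && ($4 != "u" || $2 != $3) { print }'
}

# Example invocation against a live cluster (not executed here):
#   psql -At -d postgres -c "SELECT content, role, preferred_role, status, hostname
#                            FROM gp_segment_configuration ORDER BY content" \
#     | find_problem_segments
```

On a healthy cluster the filter should print nothing, so any output is a prompt to look closer before starting a failover test.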
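2. Master/standby switchover
1) Simulate a crash of the current master. Here we do this by using killall to kill every process owned by the gpadmin user (killall -u gpadmin).
2) On the master standby node (server 227), run the switchover command to promote 227 to master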
$ gpactivatestandby -d master/gpseg-1/
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby data directory = /home/gpadmin/master/gpseg-1
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby port = 5432
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby running = yes
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Force standby activation = no
20180320:13:53:20:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
Do you want to continue with standby master activation? Yy|Nn (default=N):
> y
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-found standby postmaster process
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Updating transaction files filespace flat files...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Updating temporary files filespace flat files...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Promoting standby...
20180320:13:53:26:030558 gpactivatestandby:server227:gpadmin-[DEBUG]:-Waiting for connection...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Standby master is promoted
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Reading current configuration...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[DEBUG]:-Connecting to dbname='postgres'
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Writing the gp_dbid file - /home/gpadmin/master/gpseg-1/gp_dbid...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-But found an already existing file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Hence removed that existing file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Creating a new file...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Wrote dbid: 1 to the file.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Now marking it as read only...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Verifying the file...
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-The activation of the standby master has completed successfully.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-server227 is now the new primary master.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-You will need to update your user access mechanism to reflect
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-the change of master hostname.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Do not re-start the failed master while the fail-over master is
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-operational, this could result in database corruption!
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-MASTER_DATA_DIRECTORY is now /home/gpadmin/master/gpseg-1 if
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-this has changed as a result of the standby master activation, remember
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to change this in any startup scripts etc, that may be configured
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to set this value.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-MASTER_PORT is now 5432, if this has changed, you
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-may need to make additional configuration changes to allow access
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-to the Greenplum instance.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Refer to the Administrator Guide for instructions on how to re-activate
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-the master to its previous state once it becomes available.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-Query planner statistics must be updated on all databases
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-following standby master activation.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:-When convenient, run ANALYZE against all user databases.
20180320:13:53:27:030558 gpactivatestandby:server227:gpadmin-[INFO]:------------------------------------------------------
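3) Test that the promoted master works correctly
$ psql -d postgres -c 'ANALYZE'
postgres=# select * from gp_segment_configuration;
4) You may also need to bring pg_hba.conf on the new master in line with the old one before clients can connect remotely
At this point the master failover is complete.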
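The article does not show the pg_hba.conf change from step 4, so the following is only a sketch of what such an entry could look like. The helper name hba_host_line, the 192.168.1.0/24 client range (taken from the subnet of the test hosts), and the md5 authentication method are all assumptions for illustration. Use whatever entries the original master actually had.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one pg_hba.conf "host" entry.
# Arguments: database, user, client CIDR range, authentication method.
hba_host_line() {
  printf 'host\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Example (not executed here): append an entry on the new master, then
# reload the configuration without restarting the cluster.
#   hba_host_line all all 192.168.1.0/24 md5 >> "$MASTER_DATA_DIRECTORY/pg_hba.conf"
#   gpstop -u
```

gpstop -u reloads pg_hba.conf and postgresql.conf without stopping the database, so existing sessions stay connected.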
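3. Add a new master standby
1) Running gpstart -a on server 225 to start the gpdb database fails with "error: Standby active, this node no more can act as master". Once the standby has been promoted to master, the original master server can only rejoin the cluster as a standby after it recovers from the failure.
2) Back up the data on the original master server 225
$ cd master/
$ ls
gpseg-1
$ mv gpseg-1/ backup-gpseg-1
3) On the current master server 227, run gpinitstandby to add 225 as the standby server
$ gpinitstandby -s mfsmaster
$ gpstate -f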
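After gpinitstandby, the gpstate -f output is what tells you whether the new standby has caught up. As a rough sketch, the check below looks for the "Sync state: sync" line that appears in the gpstate -f log shown in section 1. The function name standby_in_sync is hypothetical, and the check assumes the log format of that 5.5.0 output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `gpstate -f` output on stdin and succeeds
# (exit status 0) only if it contains a line ending in "Sync state: sync".
standby_in_sync() {
  grep -Eq 'Sync state: sync[[:space:]]*$'
}

# Example (not executed here):
#   gpstate -f | standby_in_sync && echo "standby is synchronized"
```

This only inspects the reported state. It does not compare the Sent, Flush and Replay locations.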
4. Switching between primary segments and mirror segments
1) First, let's recap the current database environment
Master node: 192.168.1.227/24 hostname: server227
Standby master node: 192.168.1.225/24 hostname: mfsmaster
Segment node 1: 192.168.1.227/24 hostname: server227
Segment node 2: 192.168.1.17/24 hostname: server17
Segment node 3: 192.168.1.11/24 hostname: server11
Each segment node runs one primary segment and one mirror segment.
2) Next, in the same way as before, kill all processes owned by the gpadmin user on server 227
$ killall -u gpadmin
3) On server 225, run the command to switch the master
$ gpactivatestandby -d master/gpseg-1/
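4) After the switchover, connect with a client tool and check the segment status: both the primary and the mirror segment on server227 (server 227) are down.
5) To make the state easier to see, we use the greenplum-cc-web tool to view the cluster status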
$ gpcmdr --start hbjy
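The pg_hba.conf file needs to be restored to its previous state: because all segments on 227 are down, running gpstop -u reports errors.
The segment status page shows that the segments are currently in an abnormal state. server11 hosts two primary segments, which is dangerous: if server11 also goes down, the whole cluster becomes unavailable.
6) Re-add server227 to the cluster as the master standby
$ cd master/
$ mv gpseg-1/ backupgpseg-1
$ gpinitstandby -s server227
7) Restart the cluster from the master
$ gpstop -M immediate
$ gpstart -a
8) Recover the cluster from the master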
$ gprecoverseg
Although all segments are now up, server11 still hosts two primary segments.
9) On the master, rebalance the segments back to their original distribution
$ gprecoverseg -r
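To confirm from the command line that the rebalance worked, without opening greenplum-cc-web, one option is to count the primary segments per host. The sketch below is an illustration only: the function name primaries_per_host is hypothetical, and it assumes unaligned psql output (psql -At) with the columns hostname and role.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rows of the form hostname|role on stdin and
# prints "<hostname> <number of primary segments>" for each host, sorted
# by hostname. Hosts with no primary segment are not listed.
primaries_per_host() {
  awk -F'|' '$2 == "p" { n[$1]++ } END { for (h in n) print h, n[h] }' | sort
}

# Example (not executed here); content >= 0 excludes the master and standby:
#   psql -At -d postgres -c "SELECT hostname, role FROM gp_segment_configuration
#                            WHERE content >= 0" | primaries_per_host
```

In this test environment each segment host is expected to show exactly one primary after gprecoverseg -r. A host with a count of 2, as server11 had before the rebalance, or a host missing from the output means the roles are still unbalanced.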