1. Introduction
Citus is a very practical extension that gives PostgreSQL the ability to scale horizontally; viewed another way, it is a distributed HTAP database built on PostgreSQL and shipped as a PostgreSQL extension. This article briefly explains the high-availability options for Citus and then demonstrates, step by step, how to build a Citus HA environment based on Patroni.
2. Technical Approach
2.1 Choosing a Citus HA Scheme
A Citus cluster consists of one CN (coordinator) node and N worker nodes. The CN node can be made highly available with any generic PostgreSQL HA scheme, i.e. by setting up two PG machines as a streaming-replication primary/standby pair for the CN. For the worker nodes, besides adopting the same native PG HA scheme as the CN, Citus also supports an alternative HA scheme based on multi-replica shards.
Multi-replica sharding was the default worker HA scheme in early versions of Citus (at that time the default value of citus.shard_replication_factor was 2). It is very simple to deploy, and losing one worker node does not affect the business. However, under this scheme the CN has to write every piece of data to 2 workers separately, which brings a series of drawbacks.
- Lower data-write performance
- The consistency guarantee across replicas is weaker than PG's native streaming replication
- Functional restrictions, for example the Citus MX architecture is not supported
For these reasons the multi-replica HA scheme only fits a limited range of scenarios; the official Citus documentation states that it is essentially only suitable for append-only workloads and no longer recommends it as the HA scheme (in Citus 6.1, the default value of citus.shard_replication_factor was changed from 2 to 1).
Therefore, it is recommended to deploy HA for both the Citus CN and the worker nodes with PG's native streaming replication.
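As an aside, two similarly named GUCs are involved here and are easy to confuse: citus.shard_count is the number of shards a distributed table is split into, while citus.shard_replication_factor is the number of copies kept of each shard (the multi-replica scheme above corresponds to a factor of 2 or more). A minimal sketch that prints both, assuming a reachable CN and the credentials used later in this article:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: print the two sharding GUCs to tell them apart.
import psycopg2

# connection parameters are assumptions of this example
conn = psycopg2.connect(host='192.168.234.201', dbname='postgres',
                        user='postgres', password='123456')
cur = conn.cursor()
for guc in ('citus.shard_count', 'citus.shard_replication_factor'):
    # shard_count: shards per distributed table (default 32)
    # shard_replication_factor: copies of each shard (1 since Citus 6.1)
    cur.execute('show ' + guc)
    print('%s = %s' % (guc, cur.fetchone()[0]))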
2.2 Choosing a PG HA Tool
Deploying and maintaining the streaming-replication HA that PG itself provides is not particularly complex, but if you are after a higher degree of automation, in particular automatic failover, you can turn to a third-party HA tool. Many open-source tools are available; the following are among the most commonly used:
- PAF(PostgreSQL Automatic Failover)
- repmgr
- Patroni
A comparison of them can be found at: https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/
Among them, Patroni stores its metadata in a DCS (Distributed Configuration Store, such as etcd, ZooKeeper or Consul), which strictly guarantees metadata consistency and makes it highly reliable; it is also quite powerful functionally.
I therefore personally recommend Patroni (if you only have 2 machines and cannot deploy etcd, consider one of the other tools). This article describes a PostgreSQL HA deployment based on Patroni.
2.3 Client Traffic Switchover
After a PG primary/standby switchover, the clients accessing the database must also connect to the new primary accordingly. The common options today are:
- HAProxy
  - Pros
    - Reliable
    - Supports load balancing
  - Cons
    - Performance overhead
    - HAProxy itself needs its own HA setup
- VIP
  - Pros
    - No performance overhead, consumes no machine resources
  - Cons
    - Primary and standby nodes must be on the same subnet
- Client multi-host URL (see the connection sketch after this list)
  - Pros
    - No performance overhead, consumes no machine resources
    - No dependency on a VIP, easy to deploy in cloud environments
    - pgjdbc supports read/write splitting and load balancing
  - Cons
    - Only supported by some client drivers (currently pgjdbc, libpq, and libpq-based drivers, such as those for Python and PHP)
    - If the database layer fails to prevent a "dual primary", the risk of clients writing to both primaries at once is higher
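To illustrate the multi-host URL approach: psycopg2 hands its conninfo string straight to libpq, so with libpq 10 or later the same multi-host syntax works from Python. A hedged sketch; the two hosts anticipate the CN pair from the test environment in section 3, and the credentials are assumptions of this example:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: libpq multi-host URL from Python (requires libpq >= 10).
import psycopg2

# target_session_attrs=read-write makes libpq skip hosts that are in
# recovery, so after a switchover new connections land on the new primary.
conn = psycopg2.connect(
    "postgresql://192.168.234.201:5432,192.168.234.202:5432/postgres"
    "?target_session_attrs=read-write&user=postgres&password=123456")
cur = conn.cursor()
cur.execute("select pg_is_in_recovery()")
print(cur.fetchone())  # expected: (False,) on the primary

For Java, the corresponding pgjdbc URL is of the form jdbc:postgresql://192.168.234.201:5432,192.168.234.202:5432/postgres?targetServerType=primary (older pgjdbc releases use targetServerType=master).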
Given the characteristics of a Citus cluster, the recommended candidates are as follows.
- Application connecting to the Citus CN
  - Client multi-host URL: if the client driver supports it, and especially for Java applications, this is the recommended way to access Citus
  - VIP
- Citus CN connecting to workers
  - VIP
  - Dynamically updating the worker metadata on the Citus CN when a worker fails over
For the CN-to-worker connection, the experiments below demonstrate 2 architectures with different implementations.
Basic architecture
- The CN connects to the worker primary via the worker's actual IP
- A monitoring script on the CN checks the worker nodes' state; when a worker primary/standby switchover happens, it dynamically updates the metadata on the Citus CN
Read/write-splitting architecture
- The CN connects to the workers through a read-write VIP and a read-only VIP
- A Patroni callback script on the CN dynamically ensures the CN primary uses the read-write VIPs and the CN standby uses the read-only VIPs
- A Patroni callback script on each worker dynamically binds the read-write VIP
- keepalived on each worker dynamically binds the read-only VIP
3. Test Environment
Main software
- CentOS 7.8
- PostgreSQL 12
- Citus 9.4
- patroni 1.6.5
- etcd 3.3.25
Machines and VIP resources
- Citus CN
- node1:192.168.234.201
- node2:192.168.234.202
- Citus Worker
- node3:192.168.234.203
- node4:192.168.234.204
- etcd
- node4:192.168.234.204
- VIP (Citus CN)
- read-write VIP: 192.168.234.210
- read-only VIP: 192.168.234.211
Environment preparation
Set up clock synchronization on all nodes:
yum install -y ntpdate
ntpdate time.windows.com && hwclock -w
If a firewall is used, the ports for postgres, etcd and patroni must be opened.
- postgres:5432
- patroni:8008
- etcd:2379/2380
A simpler approach is to disable the firewall entirely:
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service
iptables -F
4. etcd Deployment
Since etcd high availability is not the topic of this article, a single-node etcd is deployed on node4 just for the experiment. Production requires at least 3 machines; they may be dedicated or co-located with the databases. The etcd deployment steps are as follows.
Install the required packages:
yum install -y gcc python-devel epel-release
Install etcd:
yum install -y etcd
Edit the etcd configuration file /etc/etcd/etcd.conf; a reference configuration:
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.234.204:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://192.168.234.204:2379"
ETCD_NAME="etcd0"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.234.204:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.234.204:2379"
ETCD_INITIAL_CLUSTER="etcd0=http://192.168.234.204:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster1"
ETCD_INITIAL_CLUSTER_STATE="new"
Start etcd:
systemctl start etcd
Enable etcd at boot:
systemctl enable etcd
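Before moving on, it is worth verifying that etcd actually answers requests. A minimal check against its /health endpoint (a hedged sketch; the address matches the configuration above):

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: verify the single-node etcd answers health checks.
import json
import urllib2

resp = urllib2.urlopen("http://192.168.234.204:2379/health", timeout=2)
health = json.loads(resp.read())
print(health)  # expected: {u'health': u'true'}
assert health.get("health") == "true"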
5. PostgreSQL + Citus + Patroni HA Deployment
Install the following software on every machine that will run a PostgreSQL instance.
Install PostgreSQL 12 and Citus:
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql12-server postgresql12-contrib
yum install -y citus_12
Install Patroni:
yum install -y gcc epel-release
yum install -y python-pip python-psycopg2 python-devel
pip install --upgrade pip
pip install --upgrade setuptools
pip install patroni[etcd]
Create the PostgreSQL data directory:
mkdir -p /pgsql/data
chown postgres:postgres -R /pgsql
chmod -R 700 /pgsql/data
Create the Patroni service file /etc/systemd/system/patroni.service:
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
#StandardOutput=syslog
ExecStart=/usr/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
Create the Patroni configuration file /etc/patroni.yml; below is the example for node1:
scope: cn
namespace: /service/
name: pg1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.234.201:8008

etcd:
  host: 192.168.234.204:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        listen_addresses: "0.0.0.0"
        port: 5432
        wal_level: logical
        hot_standby: "on"
        wal_keep_segments: 1000
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        max_connections: "100"
        max_prepared_transactions: "100"
        shared_preload_libraries: "citus"
        citus.node_conninfo: "sslmode=prefer"
        citus.replication_model: streaming
        citus.task_assignment_policy: round-robin
  initdb:
    - encoding: UTF8
    - locale: C
    - lc-ctype: zh_CN.UTF-8
    - data-checksums
  pg_hba:
    - host replication repl 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.234.201:5432
  data_dir: /pgsql/data
  bin_dir: /usr/pgsql-12/bin
  authentication:
    replication:
      username: repl
      password: "123456"
    superuser:
      username: postgres
      password: "123456"
  basebackup:
    max-rate: 100M
    checkpoint: fast

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
其餘PG節點的patroni.yml須要相應修改下面4個參數
- scope
- node1,node2設置爲cn
- node3,node4設置爲wk1
- name
node1~node4
分別設置pg1~pg4
- restapi.connect_address
- 根據各自節點IP設置
- postgresql.connect_address
- 根據各自節點IP設置
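Since only these four values differ between nodes, the remaining files can be derived mechanically from the node1 example. A hedged helper sketch; the node table and output file names are assumptions of this example:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: derive per-node patroni.yml files from the node1 example.
import copy
import yaml

NODES = {  # name -> (scope, ip)
    'pg1': ('cn',  '192.168.234.201'),
    'pg2': ('cn',  '192.168.234.202'),
    'pg3': ('wk1', '192.168.234.203'),
    'pg4': ('wk1', '192.168.234.204'),
}

base = yaml.safe_load(open('/etc/patroni.yml'))
for name, (scope, ip) in sorted(NODES.items()):
    conf = copy.deepcopy(base)
    conf['scope'] = scope
    conf['name'] = name
    conf['restapi']['connect_address'] = ip + ':8008'
    conf['postgresql']['connect_address'] = ip + ':5432'
    with open('patroni-%s.yml' % name, 'w') as f:
        yaml.safe_dump(conf, f, default_flow_style=False)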
Start Patroni
Start Patroni on all nodes.
systemctl start patroni
Within one cluster, the first Patroni instance to start runs as the leader and performs the initial creation of the PostgreSQL instance and users; nodes started later clone their data from the leader on first startup.
Check the cn cluster state:
[root@node1 ~]# patronictl -c /etc/patroni.yml list
+ Cluster: cn (6869267831456178056) ------+---------+----+-----------+-----------------+
| Member | Host            | Role   | State   | TL | Lag in MB | Pending restart |
+--------+-----------------+--------+---------+----+-----------+-----------------+
| pg1    | 192.168.234.201 |        | running |  1 |       0.0 | *               |
| pg2    | 192.168.234.202 | Leader | running |  1 |           |                 |
+--------+-----------------+--------+---------+----+-----------+-----------------+
Check the wk1 cluster state:
[root@node3 ~]# patronictl -c /etc/patroni.yml list
+ Cluster: wk1 (6869267726994446390) -----+---------+----+-----------+-----------------+
| Member | Host            | Role   | State   | TL | Lag in MB | Pending restart |
+--------+-----------------+--------+---------+----+-----------+-----------------+
| pg3    | 192.168.234.203 |        | running |  1 |       0.0 | *               |
| pg4    | 192.168.234.204 | Leader | running |  1 |           |                 |
+--------+-----------------+--------+---------+----+-----------+-----------------+
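Besides patronictl, each Patroni instance also reports its own role over its REST API, which is handy for scripted checks. A hedged sketch against the four nodes above, using the /patroni endpoint:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: ask each Patroni REST API (port 8008) for the node's role.
import json
import urllib2

for ip in ['192.168.234.201', '192.168.234.202',
           '192.168.234.203', '192.168.234.204']:
    info = json.loads(urllib2.urlopen(
        'http://%s:8008/patroni' % ip, timeout=2).read())
    print('%s role=%s state=%s' % (ip, info['role'], info['state']))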
For convenient daily operation, set the global environment variable PATRONICTL_CONFIG_FILE:
echo 'export PATRONICTL_CONFIG_FILE=/etc/patroni.yml' >/etc/profile.d/patroni.sh
Add the following environment variables to ~postgres/.bash_profile:
export PGDATA=/pgsql/data
export PATH=/usr/pgsql-12/bin:$PATH
Grant postgres sudo privileges:
echo 'postgres ALL=(ALL) NOPASSWD: ALL'> /etc/sudoers.d/postgres
6. Configuring Citus
Create the citus extension on the primary nodes of cn and wk1:
create extension citus;
On the cn primary, add the wk1 primary's IP, with groupid set to 1.
SELECT * from master_add_node('192.168.234.204', 5432, 1, 'primary');
On the worker primary and standby nodes, edit the /pgsql/data/pg_hba.conf configuration file, adding the following entries before the other rules so that the CN can connect to the workers without a password:
host all all 192.168.234.201/32 trust
host all all 192.168.234.202/32 trust
Reload the configuration after the change:
su - postgres
pg_ctl reload
Note: passwordless access could also be achieved by setting up ~postgres/.pgpass on the CN, but that is less convenient to maintain than the approach above.
Create a distributed table to test and verify:
create table tb1(id int primary key,c1 text);
set citus.shard_count = 64;
select create_distributed_table('tb1','id');
select * from tb1;
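To confirm how the 64 shards were placed, you can query the Citus metadata on the CN. A minimal sketch, assuming the CN primary is still pg2 (192.168.234.202) as in the listing above:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: count tb1's shard placements per worker.
import psycopg2

# 192.168.234.202 is the CN primary (pg2) above; adjust as needed
conn = psycopg2.connect(host='192.168.234.202', dbname='postgres',
                        user='postgres', password='123456')
cur = conn.cursor()
# pg_dist_shard_placement maps every shard to the worker that stores it
cur.execute("select nodename, nodeport, count(*) "
            "from pg_dist_shard_placement group by 1, 2 order by 1, 2")
for nodename, nodeport, cnt in cur.fetchall():
    print('%s:%s holds %s shard placements' % (nodename, nodeport, cnt))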
7. Automatic Worker Traffic Switchover
The worker IP configured above is the worker primary's IP at that point in time; after a worker primary/standby switchover, this IP must be updated accordingly.
This can be implemented with a script that monitors the worker primary/standby state and, whenever the worker roles change, automatically updates the worker metadata in Citus to the new primary's IP. A reference implementation of such a script follows.
Add the following configuration to /etc/patroni.yml on the Citus CN primary and standby nodes:
citus:
  loop_wait: 10
  databases:
    - postgres
  workers:
    - groupid: 1
      nodes:
        - 192.168.234.203:5432
        - 192.168.234.204:5432
A standalone configuration file can also be used; in that case the authentication settings must be added as well:

postgresql:
  connect_address: 192.168.234.202:5432
  authentication:
    superuser:
      username: postgres
      password: "123456"
Create the worker traffic switchover script /pgsql/citus_controller.py:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import time
import argparse
import logging
import yaml
import psycopg2


def get_pg_role(url):
    result = 'unknow'
    try:
        with psycopg2.connect(url, connect_timeout=2) as conn:
            conn.autocommit = True
            cur = conn.cursor()
            cur.execute("select pg_is_in_recovery()")
            row = cur.fetchone()
            if row[0] == True:
                result = 'secondary'
            elif row[0] == False:
                result = 'primary'
    except Exception as e:
        logging.debug('get_pg_role() failed. url:{0} error:{1}'.format(
            url, str(e)))
    return result


def update_worker(url, role, groupid, nodename, nodeport):
    logging.debug('call update worker. role:{0} groupid:{1} nodename:{2} nodeport:{3}'.format(
        role, groupid, nodename, nodeport))
    try:
        sql = "select nodeid,nodename,nodeport from pg_dist_node where groupid={0} and noderole = '{1}' order by nodeid limit 1".format(
            groupid, role)
        conn = psycopg2.connect(url, connect_timeout=2)
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute(sql)
        row = cur.fetchone()
        if row is None:
            logging.error("can not found nodeid whose groupid={0} noderole = '{1}'".format(groupid, role))
            return False
        nodeid = row[0]
        oldnodename = row[1]
        oldnodeport = str(row[2])
        if oldnodename == nodename and oldnodeport == nodeport:
            logging.debug('skip for current nodename:nodeport is same')
            return False

        sql = "select master_update_node({0}, '{1}', {2})".format(nodeid, nodename, nodeport)
        cur.execute(sql)
        logging.info("Changed worker node {0} from '{1}:{2}' to '{3}:{4}'".format(
            nodeid, oldnodename, oldnodeport, nodename, nodeport))
        return True
    except Exception as e:
        logging.error('update_worker() failed. role:{0} groupid:{1} nodename:{2} nodeport:{3} error:{4}'.format(
            role, groupid, nodename, nodeport, str(e)))
        return False


def main():
    parser = argparse.ArgumentParser(description='Script to auto setup Citus worker')
    parser.add_argument('-c', '--config', default='citus_controller.yml')
    parser.add_argument('-d', '--debug', action='store_true', default=False)
    args = parser.parse_args()

    if args.debug:
        logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.DEBUG)
    else:
        logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.INFO)

    # read config file
    f = open(args.config, 'r')
    contents = f.read()
    config = yaml.load(contents, Loader=yaml.FullLoader)
    cn_connect_address = config['postgresql']['connect_address']
    username = config['postgresql']['authentication']['superuser']['username']
    password = config['postgresql']['authentication']['superuser']['password']
    databases = config['citus']['databases']
    workers = config['citus']['workers']
    loop_wait = config['citus'].get('loop_wait', 10)

    logging.info('start main loop')
    loop_count = 0
    while True:
        loop_count += 1
        logging.debug("##### main loop start [{}] #####".format(loop_count))

        dbname = databases[0]
        cn_url = "postgres://{0}/{1}?user={2}&password={3}".format(
            cn_connect_address, dbname, username, password)
        if get_pg_role(cn_url) == 'primary':
            for worker in workers:
                groupid = worker['groupid']
                nodes = worker['nodes']

                ## get role of worker nodes
                primarys = []
                secondarys = []
                for node in nodes:
                    wk_url = "postgres://{0}/{1}?user={2}&password={3}".format(
                        node, dbname, username, password)
                    role = get_pg_role(wk_url)
                    if role == 'primary':
                        primarys.append(node)
                    elif role == 'secondary':
                        secondarys.append(node)

                logging.debug('Role info groupid:{0} primarys:{1} secondarys:{2}'.format(
                    groupid, primarys, secondarys))

                ## update worker node
                for dbname in databases:
                    cn_url = "postgres://{0}/{1}?user={2}&password={3}".format(
                        cn_connect_address, dbname, username, password)
                    if len(primarys) == 1:
                        nodename = primarys[0].split(':')[0]
                        nodeport = primarys[0].split(':')[1]
                        update_worker(cn_url, 'primary', groupid, nodename, nodeport)
                    """
                    Citus requires nodename:nodeport to be unique in the
                    pg_dist_node metadata, so dynamic updates of secondary
                    nodes cannot be supported at the same time.
                    A possible workaround is to give each worker 2 IP
                    addresses, one used in the primary role and the other
                    in the secondary role.
                    if len(secondarys) >= 1:
                        nodename = secondarys[0].split(':')[0]
                        nodeport = secondarys[0].split(':')[1]
                        update_worker(cn_url, 'secondary', groupid, nodename, nodeport)
                    elif len(secondarys) == 0 and len(primarys) == 1:
                        nodename = primarys[0].split(':')[0]
                        nodeport = primarys[0].split(':')[1]
                        update_worker(cn_url, 'secondary', groupid, nodename, nodeport)
                    """

        time.sleep(loop_wait)


if __name__ == '__main__':
    main()
Create the service file /etc/systemd/system/citus_controller.service for this script:
[Unit]
Description=Auto update primary worker ip in Citus CN
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/bin/python /pgsql/citus_controller.py -c /etc/patroni.yml
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
Start the worker traffic switchover script on both the CN primary and standby:
systemctl start citus_controller
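To exercise the mechanism, you can trigger a worker switchover (for example with patronictl switchover wk1) and watch the CN metadata follow the new primary. A hedged sketch that polls pg_dist_node on the CN primary for one minute:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# Hedged sketch: watch pg_dist_node while a worker switchover runs.
import time
import psycopg2

conn = psycopg2.connect(host='192.168.234.202', dbname='postgres',
                        user='postgres', password='123456')
conn.autocommit = True
cur = conn.cursor()
for _ in range(6):  # poll every 10s, matching the script's loop_wait
    cur.execute("select nodeid, groupid, nodename, nodeport, noderole "
                "from pg_dist_node order by nodeid")
    print(cur.fetchall())
    time.sleep(10)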
8. Read/Write Splitting
With the configuration above, the Citus CN never accesses the worker standbys, so those machines sit idle. Can we put them to use and let the Citus CN support read/write splitting? Concretely, the CN standby should access the worker standbys by preference, falling back to the worker primaries when a worker standby fails.
Citus itself supports read/write splitting: the primary and standby of one worker can be added to the same worker group as two "workers" with the roles primary and secondary respectively. However, because the pg_dist_node metadata requires nodename:nodeport to be unique, the earlier approach of dynamically rewriting worker IPs in the Citus metadata cannot dynamically update both the primary and the secondary entries at the same time.
There are 2 solutions.
Method 1: store only fixed host names such as wk1, wk2, ... in the Citus metadata, and let the custom worker traffic switchover script resolve these fixed names to different IP addresses in /etc/hosts: on the CN primary they resolve to the worker primary IPs, on the CN standby to the worker standby IPs.
Method 2: dynamically bind a read-write VIP and a read-only VIP on the workers. In the Citus metadata, register the read-write VIP as the primary worker and the read-only VIP as the secondary worker.
For how Patroni binds VIPs dynamically, see 基於Patroni的PostgreSQL高可用環境部署.md. For Citus workers, the read-write VIP is bound dynamically by a callback script, and the read-only VIP by keepalived.
The configuration below follows method 2.
When creating the Citus cluster, on the CN primary add wk1's read-write VIP (192.168.234.210) and read-only VIP (192.168.234.211), as the primary worker and the secondary worker respectively, with groupid set to 1.
SELECT * from master_add_node('192.168.234.210', 5432, 1, 'primary');
SELECT * from master_add_node('192.168.234.211', 5432, 1, 'secondary');
For the CN standby to connect to the secondary workers, the following parameter must also be set on the CN standby:

alter system set citus.use_secondary_nodes=always;
select pg_reload_conf();

This parameter change only takes effect for newly created sessions; if you want it to take effect immediately, kill the existing sessions after changing it.
Now run the same SQL on the CN primary and on the CN standby; you can see it is dispatched to different workers.
On the CN primary (citus.use_secondary_nodes=always not set):
postgres=# explain select * from tb1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=36)
   Task Count: 32
   Tasks Shown: One of 32
   ->  Task
         Node: host=192.168.234.210 port=5432 dbname=postgres
         ->  Seq Scan on tb1_102168 tb1  (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
On the CN standby (with citus.use_secondary_nodes=always set):
postgres=# explain select * from tb1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=36)
   Task Count: 32
   Tasks Shown: One of 32
   ->  Task
         Node: host=192.168.234.211 port=5432 dbname=postgres
         ->  Seq Scan on tb1_102168 tb1  (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
Since the CN itself can also fail over, the citus.use_secondary_nodes parameter must be adjusted dynamically. This can be implemented with Patroni callback scripts.
Create /pgsql/switch_use_secondary_nodes.sh, which sets the parameter dynamically:
#!/bin/bash

DBNAME=postgres
KILL_ALL_SQL="select pg_terminate_backend(pid) from pg_stat_activity where backend_type='client backend' and application_name <> 'Patroni' and pid <> pg_backend_pid()"

action=$1
role=$2
cluster=$3

log() {
    echo "switch_use_secondary_nodes: $*" | logger
}

alter_use_secondary_nodes() {
    value="$1"
    oldvalue=`psql -d postgres -Atc "show citus.use_secondary_nodes"`
    if [ "$value" = "$oldvalue" ]; then
        log "old value of use_secondary_nodes already be '${value}', skip change"
        return
    fi

    psql -d ${DBNAME} -c "alter system set citus.use_secondary_nodes=${value}" >/dev/null
    rc=$?
    if [ $rc -ne 0 ]; then
        log "fail to alter use_secondary_nodes to '${value}' rc=$rc"
        exit 1
    fi

    psql -d ${DBNAME} -c 'select pg_reload_conf()' >/dev/null
    rc=$?
    if [ $rc -ne 0 ]; then
        log "fail to call pg_reload_conf() rc=$rc"
        exit 1
    fi

    log "changed use_secondary_nodes to '${value}'"

    ## kill all existing connections
    killed_conns=`psql -d ${DBNAME} -Atc "${KILL_ALL_SQL}" | wc -l`
    rc=$?
    if [ $rc -ne 0 ]; then
        log "failed to kill connections rc=$rc"
        exit 1
    fi

    log "killed ${killed_conns} connections"
}

log "switch_use_secondary_nodes start args:'$*'"

case $action in
    on_start|on_restart|on_role_change)
        case $role in
            master)
                alter_use_secondary_nodes never
                ;;
            replica)
                alter_use_secondary_nodes always
                ;;
            *)
                log "wrong role '$role'"
                exit 1
                ;;
        esac
        ;;
    *)
        log "wrong action '$action'"
        exit 1
        ;;
esac
Edit the Patroni configuration file /etc/patroni.yml to configure the callbacks:
postgresql:
  ...
  callbacks:
    on_start: /bin/bash /pgsql/switch_use_secondary_nodes.sh
    on_restart: /bin/bash /pgsql/switch_use_secondary_nodes.sh
    on_role_change: /bin/bash /pgsql/switch_use_secondary_nodes.sh
After the Patroni configuration files on all nodes have been modified, reload the Patroni configuration:
patronictl reload cn
After running a switchover on the CN, you can see that the use_secondary_nodes parameter was changed.
/var/log/messages:
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: switch_use_secondary_nodes start args:'on_role_change replica cn'
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: changed use_secondary_nodes to 'always'
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: killed 0 connections
9. References
- 基於Patroni的PostgreSQL高可用環境部署.md
- 《基於Patroni的Citus高可用方案》 (talk at the PostgreSQL China Conference 2019)
- https://patroni.readthedocs.io/en/latest/
- http://blogs.sungeek.net/unixwiz/2018/09/02/centos-7-postgresql-10-patroni/
- https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/
- https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters
- https://www.percona.com/blog/2019/10/23/seamless-application-failover-using-libpq-features-in-postgresql/