PostgreSQL & Pgpool: Common Problems and Solutions

Date: 2019-11-16

Original post: https://www.tony-yin.site/201...

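This post collects common PostgreSQL and Pgpool problems I have run into, along with their solutions.

Environment

PostgreSQL serves as the data backend and Pgpool sits in front of it as middleware. Clients reach the database through a VIP, and Pgpool's own failover mechanism keeps the database highly available.

Nodes: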

192.168.1.1
192.168.1.2
192.168.1.3

VIP:

192.168.1.4

Versions:

PostgreSQL: 9.4.20
Pgpool: 4.0.3

How to view Pgpool node information

[root@host1 ~]# psql -h 192.168.1.4 -p 9998 -U postgres postgres -c "show pool_nodes"
Password for user postgres: 
 node_id |  hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | 192.168.1.1 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 | 2019-10-23 16:14:42
 1       | 192.168.1.2 | 5432 | up     | 0.333333  | primary | 34333174   | true              | 0                 | 2019-10-23 16:14:42
 2       | 192.168.1.3 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 | 2019-10-23 16:14:42
(3 rows)
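
When this check is scripted, the table can be turned into records and filtered by status. The following is a minimal sketch, assuming the psql output is captured unwrapped (for example through a pipe rather than copied from a narrow terminal). The helper names parse_pool_nodes and down_node_ids are hypothetical, and the sample text is illustrative, with node 0 marked down on purpose.

```python
from typing import Dict, List

# Illustrative sample only, shaped like `show pool_nodes` output (node 0 set to down).
SAMPLE_OUTPUT = """\
 node_id |  hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | 192.168.1.1 | 5432 | down   | 0.333333  | standby | 0          | false             | 0                 | 2019-10-23 16:14:42
 1       | 192.168.1.2 | 5432 | up     | 0.333333  | primary | 34333174   | true              | 0                 | 2019-10-23 16:14:42
 2       | 192.168.1.3 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 | 2019-10-23 16:14:42
(3 rows)
"""


def parse_pool_nodes(output: str) -> List[Dict[str, str]]:
    """Parse the aligned psql table into one dict per node."""
    # Only the header and data rows contain '|'; the separator row uses '+'.
    rows = [line for line in output.splitlines() if "|" in line]
    if not rows:
        return []
    header = [col.strip() for col in rows[0].split("|")]
    nodes = []
    for line in rows[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == len(header):  # skip anything that is not a full row
            nodes.append(dict(zip(header, cells)))
    return nodes


def down_node_ids(nodes: List[Dict[str, str]]) -> List[int]:
    """Return the node_id of every node whose status column reads 'down'."""
    return [int(node["node_id"]) for node in nodes if node["status"] == "down"]


if __name__ == "__main__":
    print(down_node_ids(parse_pool_nodes(SAMPLE_OUTPUT)))
```

The returned ids are the values that would be passed as -n to the attach command in the later sections.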

How to detach a node from the cluster

[root@host1]# pcp_detach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

How to re-attach a node to the cluster

[root@host1]# pcp_attach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

How to view watchdog information

[root@host1]# pcp_watchdog_info -h 192.168.1.1 -p 9898 -U postgres
Password: 
3 YES 192.168.1.1:9998 Linux host16 192.168.1.1

192.168.1.1:9998 Linux host16 192.168.1.1 9998 9000 4 MASTER
192.168.1.2:9998 Linux host17 192.168.1.2 9998 9000 7 STANDBY
192.168.1.3:9998 Linux host18 192.168.1.3 9998 9000 7 STANDBY
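
For monitoring, the per-node lines can be parsed to find which host the watchdog currently reports as MASTER. This is a rough sketch that assumes the layout seen in the sample above: a node name that may contain spaces, followed by hostname, Pgpool port, watchdog port, a numeric state code and a state name. That reading of the columns is an assumption taken from this output, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# The sample output from above, including the password prompt and summary line.
SAMPLE_OUTPUT = """\
Password:
3 YES 192.168.1.1:9998 Linux host16 192.168.1.1

192.168.1.1:9998 Linux host16 192.168.1.1 9998 9000 4 MASTER
192.168.1.2:9998 Linux host17 192.168.1.2 9998 9000 7 STANDBY
192.168.1.3:9998 Linux host18 192.168.1.3 9998 9000 7 STANDBY
"""


def parse_watchdog_nodes(output: str) -> List[Dict[str, str]]:
    """Extract the per-node lines; the summary line and prompts are skipped."""
    nodes = []
    for line in output.splitlines():
        # Split from the right because the node name itself contains spaces.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue
        name, host, pgpool_port, wd_port, code, state = parts
        # Node lines carry three numeric fields; the summary line does not.
        if not (pgpool_port.isdigit() and wd_port.isdigit() and code.isdigit()):
            continue
        nodes.append({"name": name, "host": host, "pgpool_port": pgpool_port,
                      "wd_port": wd_port, "code": code, "state": state})
    return nodes


def watchdog_master(output: str) -> Optional[str]:
    """Return the host reported as MASTER, or None if there is not exactly one."""
    masters = [n["host"] for n in parse_watchdog_nodes(output) if n["state"] == "MASTER"]
    return masters[0] if len(masters) == 1 else None


if __name__ == "__main__":
    print(watchdog_master(SAMPLE_OUTPUT))
```

Running the same check against each node and comparing the answers is one possible way to notice nodes that disagree about the master.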

How to check whether a PostgreSQL node is in recovery

[root@host1]# psql -h 192.168.1.1 -p 5432 -U postgres postgres -c "select pg_is_in_recovery()"
Password for user postgres: 
 pg_is_in_recovery 
-------------------
 f
(1 row)

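How to designate a node as the new master

This command does not actually promote the PostgreSQL backend from standby to master. It only changes Pgpool's internal state; the corresponding PostgreSQL recovery file is left untouched and still has to be changed by the failover script.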

pcp_promote_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

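Split-brain after deployment caused by an Ansible version change

The database deployed in one environment frequently ran into split-brain. Investigation showed that the Pgpool configuration entries indexed by other_pgpool_id had been generated incorrectly: the index did not increment.

The entries were wrong because that environment ran Ansible 2.7, whereas the automated database deployment had been developed against Ansible 2.4. After Ansible 2.4.2, some of the more advanced Jinja2 syntax behaves differently. Since the Pgpool configuration file is generated through Jinja2, entries that still use the old syntax no longer render as expected, so the configuration template has to be tailored to the Ansible version.

Below are the relevant parts of the template file pgpool.conf.j2 for each version range: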

ansible < v2.4.2

# - Other pgpool Connection Settings -
{% set other_pgpool_id = 0 %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
heartbeat_destination{{other_pgpool_id}} = '{{backend.ip}}'
heartbeat_destination_port{{other_pgpool_id}} = 9694
heartbeat_device{{other_pgpool_id}} = ''

{% set other_pgpool_id = other_pgpool_id + 1 %}
{% endif %}
{% endfor %}

# - Other pgpool Connection Settings -
{% set other_pgpool_id = 0 %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
other_pgpool_hostname{{other_pgpool_id}} = '{{backend.ip}}'
other_pgpool_port{{other_pgpool_id}} = 9998
other_wd_port{{other_pgpool_id}} = 9000

{% set other_pgpool_id = other_pgpool_id + 1 %}
{% endif %}
{% endfor %}

ansible >= v2.4.2

# - Other pgpool Connection Settings -
{% set other_pgpool_id = namespace(a=0) %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
heartbeat_destination{{other_pgpool_id.a}} = '{{backend.ip}}'
heartbeat_destination_port{{other_pgpool_id.a}} = 9694
heartbeat_device{{other_pgpool_id.a}} = ''

{% set other_pgpool_id.a = other_pgpool_id.a + 1 %}
{% endif %}
{% endfor %}

# - Other pgpool Connection Settings -
{% set other_pgpool_id = namespace(a=0) %}
{% for backend in pgpool_cluster_entries %}
{% if inventory_hostname != backend.ip %}
other_pgpool_hostname{{other_pgpool_id.a}} = '{{backend.ip}}'
other_pgpool_port{{other_pgpool_id.a}} = 9998
other_wd_port{{other_pgpool_id.a}} = 9000

{% set other_pgpool_id.a = other_pgpool_id.a + 1 %}
{% endif %}
{% endfor %}
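
Since the symptom was an index that never increments, a deployment could check the rendered pgpool.conf before starting the service. The sketch below is one possible check and not part of the original playbook. The function names are hypothetical, and both config samples are illustrative, with the broken one showing every entry rendered with suffix 0.

```python
import re
from typing import List

# Illustrative rendered output for a node with two peers.
GOOD_CONF = """\
other_pgpool_hostname0 = '192.168.1.2'
other_pgpool_port0 = 9998
other_wd_port0 = 9000

other_pgpool_hostname1 = '192.168.1.3'
other_pgpool_port1 = 9998
other_wd_port1 = 9000
"""

# Illustrative broken output: the counter never advanced past 0.
BROKEN_CONF = GOOD_CONF.replace("hostname1", "hostname0")


def suffix_indices(conf_text: str, param: str) -> List[int]:
    """Collect the numeric suffixes of `param<N> = ...` lines, in file order."""
    pattern = re.compile(r"^\s*" + re.escape(param) + r"(\d+)\s*=", re.MULTILINE)
    return [int(match.group(1)) for match in pattern.finditer(conf_text)]


def has_consecutive_suffixes(conf_text: str, param: str, expected: int) -> bool:
    """True when the suffixes are exactly 0, 1, ..., expected - 1."""
    return suffix_indices(conf_text, param) == list(range(expected))


if __name__ == "__main__":
    for label, text in (("good", GOOD_CONF), ("broken", BROKEN_CONF)):
        print(label, has_consecutive_suffixes(text, "other_pgpool_hostname", 2))
```

In a three-node cluster each node has two peers, so expected would be 2 for heartbeat_destination and other_pgpool_hostname alike.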

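After restarting all 3 nodes, some nodes are in down status

Cause

Pgpool, as middleware for PostgreSQL, holds an election as soon as at least two nodes of the cluster are up. If the third node has not come up by then, Pgpool will not automatically add the node that missed the election to the cluster once the election finishes. The node has to be attached manually, or Pgpool has to be restarted on all nodes at the same time to trigger a new election. In other words, Pgpool itself has no mechanism for a node to rejoin the cluster and recover automatically after a restart.

Solution

Option 1: attach manually

Manually re-attach the down node to the database cluster. For example, if the down node is 192.168.1.1 with node id 0, run the following attach command: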

pcp_attach_node -n 0 -p 9898 -h 192.168.1.1 -U postgres

Option 2: restart Pgpool to trigger a new election

On each of the three nodes, stop the Pgpool service:

systemctl stop pgpool.service

Pgpool reads its status file at every election. So that it does not affect the next election, delete the status file:

rm -f /var/log/pgpool/pgpool_status

On each of the three nodes, start the Pgpool service:

systemctl start pgpool.service
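
The order matters here: Pgpool is stopped everywhere before it is started anywhere. The sketch below only builds that ordered plan as (node, command) pairs. How the commands are executed (ssh, Ansible, or by hand) is left open, and removing the status file on every node is an assumption. The name restart_plan is hypothetical.

```python
import shlex
from typing import List, Tuple

NODES = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
STATUS_FILE = "/var/log/pgpool/pgpool_status"


def restart_plan(nodes: List[str], status_file: str = STATUS_FILE) -> List[Tuple[str, str]]:
    """Return (node, command) pairs, finishing each phase on all nodes first."""
    phases = [
        "systemctl stop pgpool.service",      # phase 1: stop Pgpool everywhere
        "rm -f " + shlex.quote(status_file),  # phase 2: drop the stale status file
        "systemctl start pgpool.service",     # phase 3: start Pgpool everywhere
    ]
    # The outer loop is over phases, so no node starts before all have stopped.
    return [(node, command) for command in phases for node in nodes]


if __name__ == "__main__":
    for node, command in restart_plan(NODES):
        print(node, command)
```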

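After restarting all 3 nodes, one node is in down status

Cause

NetworkManager had not been disabled.

Solution

A running NetworkManager interferes with Pgpool's normal operation, so make sure it is disabled.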

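After restarting the primary node, the database goes read-only

Cause

This is a low-probability, intermittent problem. After the master node loses power, a new master is elected, and the new master rewrites its local configuration files to the master variants. While it is still in the process of becoming the new master, the master information read through the database VIP still points to the old master. The local failover script therefore concludes that the new master is inconsistent and reverts the PostgreSQL configuration files it had just switched to master back to the standby variants, with primary info still pointing to the old master. As a result no new master emerges while the old master stays down, and without a master node the database goes read-only.

Solution

Change the failover script logic so that, when the local configuration files and the database role disagree, it does not immediately modify the local recovery file. Add one more check first: only modify the recovery file if the PostgreSQL service on the master node is still reachable.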
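
The extra guard can be sketched as a small decision function. This is not the author's failover script, only an illustration of the rule described above. The names should_rewrite_recovery and postgres_reachable are hypothetical, and reachability is approximated here by a plain TCP connect to port 5432, which is weaker than running a real query.

```python
import socket


def postgres_reachable(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """TCP-level check only (an assumption): can a connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_rewrite_recovery(config_role: str, db_role: str, master_reachable: bool) -> bool:
    """Decide whether the local recovery file may be rewritten.

    config_role: role implied by the local configuration files
    db_role: role reported for this node by the database
    master_reachable: whether PostgreSQL on the reported master still answers
    """
    if config_role == db_role:
        return False  # nothing is inconsistent, so nothing to rewrite
    # Roles disagree: only follow the reported master if it is actually alive.
    return master_reachable


if __name__ == "__main__":
    # The reported master is the powered-off node, so the file is left alone.
    print(should_rewrite_recovery("master", "standby", master_reachable=False))
```

In the scenario from this section, the stale master reported through the VIP is unreachable, so the newly promoted node keeps its master configuration instead of reverting to standby.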

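The database frequently stalls for short periods

Cause

The number of client database connections exceeds the connection limit configured in Pgpool.

Solution

Client side

When an exception prevents the code from reaching the Response, add a try-catch and put the code that returns the Response in the final finally block;
For non-web entry points, where no Response is returned at the end, add code at the end of the program that explicitly closes the database connections;
For multi-threaded database use, every worker thread other than the main thread should close its database connections each time it finishes a task.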

from django.db import connections
# Each thread has its own connections; close every connection owned by the current thread.
connections.close_all()
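
One way to make sure the cleanup runs even when a task raises is to wrap each worker task in a decorator that closes connections in a finally block. This is a generic sketch: close_connections_after is a hypothetical helper, and the demo uses a stand-in callback so that it runs without Django. In a Django project the callback would presumably be connections.close_all.

```python
import functools
import threading
from typing import Callable, List


def close_connections_after(close_all: Callable[[], None]):
    """Build a decorator that calls close_all() after the task, even on error."""
    def decorator(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            try:
                return task(*args, **kwargs)
            finally:
                # Runs in the same thread as the task, so it releases that
                # thread's own connections.
                close_all()
        return wrapper
    return decorator


def run_demo() -> List[str]:
    """Run one task in a worker thread and one failing task in this thread."""
    closed: List[str] = []  # stand-in for real connection cleanup

    @close_connections_after(lambda: closed.append(threading.current_thread().name))
    def task(fail: bool) -> str:
        if fail:
            raise RuntimeError("simulated failure")
        return "done"

    worker = threading.Thread(target=task, args=(False,), name="worker-1")
    worker.start()
    worker.join()
    try:
        task(True)
    except RuntimeError:
        pass  # the cleanup callback has still been invoked
    return closed


if __name__ == "__main__":
    print(run_demo())
```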

Database side

In the Pgpool configuration file, set the maximum client idle time to 300 seconds:

client_idle_limit = 300     # Client is disconnected after being idle for that many seconds
                            # (even inside an explicit transactions!)
                            # 0 means no disconnection

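This parameter means that a client connection is dropped once the client has been idle for client_idle_limit seconds after its last query. It is set to 300 seconds here so that long-idle client connections do not occupy connection slots.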
