2017-12-03 corosync + pacemaker deployment and operations notes (compiled from the Magedu 馬哥 ops course)

 


corosync is the cluster messaging/framework engine, pacemaker is the high-availability cluster resource manager, and crmsh is a command-line tool for pacemaker.

1. NTP time synchronization and passwordless SSH login
[root@node-1 ~]# vim /etc/hosts
192.168.43.128 node-2
192.168.43.129 node-1
[root@node-1 ~]# ssh-keygen
[root@node-1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-2
[root@node-1 corosync]# scp /etc/hosts node-2:/etc/hosts
[root@node-1 ~]# ssh node-2
[root@node-1 ~]# yum install ntp -y
[root@node-2 ~]# hwclock -s   // set the system clock from the hardware (BIOS) clock; much more convenient than ntpdate or date -s

[root@node-2 ~]# ssh-keygen
[root@node-2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@node-1
[root@node-2 ~]# ssh node-1
[root@node-2 ~]# yum install ntp -y
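A quick check that the two clocks really agree (a minimal sketch, not in the original notes; it assumes node-2 is acting as the time source, as in the troubleshooting section below, and that ntpd is already running):
[root@node-1 ~]# ntpdate -u node-2          ## one-shot sync against node-2 (-u lets it run while ntpd is up)
[root@node-1 ~]# ntpq -p                    ## list NTP peers and check the offset column
[root@node-1 ~]# date; ssh node-2 "date"    ## the two timestamps should agree to within a second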

2. Install corosync and pacemaker

[root@node-1 corosync]# yum install corosync pacemaker -y   // the stock CentOS repos are enough; alternatively you can install just pcs
[root@node-2 ~]# yum install corosync pacemaker -y
[root@node-1 ~]# vim /etc/yum.repos.d/crm.repo
--------------------------------
[network_ha-clustering_Stable]
name=Stable High Availability/Clustering packages (CentOS_CentOS-7)
type=rpm-md
baseurl=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/
gpgcheck=1
gpgkey=http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/repodata/repomd.xml.key
enabled=1
-------------------------------
[root@node-1 ~]# yum install crmsh -y

[root@node-1 corosync]# cd /etc/corosync
[root@node-1 corosync]# cp corosync.conf.example corosync.conf
[root@node-1 corosync]# vim corosync.conf
bindnetaddr: 192.168.43.0
service {
    ver: 0              # 0 = corosync starts pacemaker as a plugin (corosync 1.x style; corosync 2.x on CentOS 7 ignores this block and pacemaker runs as its own systemd unit)
    name: pacemaker     # start pacemaker
}
------------------------
The corosync nodes authenticate to each other with a shared key.
[root@node-1 corosync]# mv /dev/{random,random.bak}          ## corosync-keygen reads /dev/random, which blocks when there is little entropy
[root@node-1 corosync]# ln -s /dev/urandom /dev/random       ## temporarily point /dev/random at urandom to speed up key generation
[root@node-1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.
[root@node-1 corosync]# scp corosync.conf authkey root@node-2:/etc/corosync/
[root@node-1 corosync]# systemctl start corosync;ssh node-2 systemctl start corosync   // start the corosync service on both machines

=====================
Theory notes (from the Magedu course):
Resource management layer: pacemaker — arbitrates which node is the active one, moves the IP address, and drives the local resource managers. Messaging layer: heartbeat or corosync — carries the heartbeat and membership messages. Resource Agents (think of them as service scripts) start, stop, and report the status of a service. Several different services may run across multiple nodes; the remaining standby nodes form the failover domain, and "the active node" is only a relative notion, just as a third-party arbiter is. Vote system: the majority wins. Failover: resources move to a standby node when the active node fails. Failback: resources move back to the original node once it has been repaired.
CRM: cluster resource manager ==> pacemaker (the cluster's "pacemaker"). Every node runs a crmd daemon (5560/tcp); the command-line front ends crmsh and pcs (pcs was introduced by Red Hat around the time of heartbeat v3) edit the XML configuration that crmd acts on. In that sense crmsh and pcs are equivalent.
Resource Agent classes include OCF (Open Cluster Framework).
primitive: a primary resource, running as exactly one instance in the cluster. clone: a clone resource that may run multiple instances. Every resource carries a score (priority).
Score arithmetic: infinity + (-infinity) = -infinity. The hostname must match the name that DNS (or /etc/hosts) resolves.
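To make the score arithmetic concrete, a hedged crmsh sketch (not from the original notes; WebIP refers to the resource created later in these notes). A positive preference and a -inf ban can both target the same node, and -inf always wins:
crm(live)configure# location prefer_node2 WebIP 100: node-2     ## score 100 for node-2
crm(live)configure# location ban_node2 WebIP -inf: node-2       ## 100 + (-inf) = -inf, so WebIP can never run on node-2
crm(live)configure# verify
crm(live)configure# commit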

1. Install the pcs management tool
[root@node-1 ~]# ansible corosync -m service -a "name=pcsd state=started enabled=yes"   // requires ansible, with a host group named corosync defined in the inventory
[root@node-1 ~]# systemctl status pcsd ;ssh node-2 "systemctl status pcsd"
[root@node-1 ~]# ansible corosync -m shell -a 'echo "passw0rd" | passwd --stdin hacluster'   ## set the hacluster user's password on both nodes; this account is used for pcs authentication
[root@node-1 ~]# pcs cluster auth node-2 node-1   ## the local pcs client talks to each node's pcsd daemon; if authentication against a remote node's pcsd fails, firewalld blocking pcsd (2224/tcp) is a likely cause
Username: hacluster
Password:
node-1: Authorized
node-2: Authorized
[root@node-2 yum.repos.d]# pcs cluster auth node-1 node-2   // best to authenticate in both directions
Username: hacluster
Password:
node-1: Authorized
node-2: Authorized


2. Create the cluster
[root@node-1 corosync]# pcs cluster setup --name mycluster node-1 node-2 --force
[root@node-2 corosync]# cat corosync.conf   // after the cluster-setup command runs, a corosync.conf is generated on every node
totem {
version: 2
secauth: off
cluster_name: mycluster
transport: udpu
}

nodelist {
node {
ring0_addr: node-1
nodeid: 1
}

node {
ring0_addr: node-2
nodeid: 2
}
}

quorum {
provider: corosync_votequorum
two_node: 1
}

logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}

Explanation: totem is the protocol the two nodes use to exchange heartbeat/membership messages; ring0 is the primary communication ring, i.e. the address each node is reachable at directly, without any intermediary.
[root@node-1 ~]# pcs cluster start
[root@node-1 ~]# pcs cluster status
Cluster Status:
Stack: unknown
Current DC: NONE
Last updated: Sat Oct 28 20:17:56 2017
Last change: Sat Oct 28 20:17:52 2017 by hacluster via crmd on node-1
2 nodes configured
0 resources configured
PCSD Status:
node-2: Online
node-1: Online
[root@node-2 ~]# pcs cluster start   ## cluster services have to be started on each node separately (or use pcs cluster start --all from one node)
Starting Cluster...
[root@node-2 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 192.168.43.128
status = ring 0 active with no faults
[root@node-2 ~]# corosync-cmapctl |grep members   ## check the current cluster membership
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.43.129)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.43.128)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@node-1 ~]# pcs status   ## DC = Designated Coordinator, the elected coordinating node
Every node runs a CRM, and one of them is elected DC, the brain of the whole cluster. The CIB (cluster information base) controlled by the DC is the master CIB; the CIBs on the other nodes are replicas.
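To look at the CIB that the DC keeps the other nodes in sync with, either of the standard tools below works (a small sketch; the output is XML):
[root@node-1 ~]# cibadmin --query | head -20     ## dump the live CIB (local replica, synchronized with the DC's master copy)
[root@node-1 ~]# pcs cluster cib | head -20      ## the same content via pcs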
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false   ## no stonith (fencing) device is configured; STONITH is what "shoots the other node in the head" so nodes cannot fight over resources
Stack: corosync
Current DC: node-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Sat Oct 28 20:28:01 2017
Last change: Sat Oct 28 20:18:13 2017 by hacluster via crmd on node-1
2 nodes configured
0 resources configured
Online: [ node-1 node-2 ]
No resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@node-2 ~]# pcs status corosync
Membership information
----------------------
Nodeid Votes Name
2 1 node-2 (local)
1 1 node-1
[root@node-1 ~]# crm_verify -L -V   ## crm_verify checks whether the current cluster configuration contains errors
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
[root@node-1 ~]# pcs property set stonith-enabled=false
[root@node-1 ~]# pcs property list   ## show cluster properties that have been changed; use pcs property --all to list every property, including defaults
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: mycluster
dc-version: 1.1.16-12.el7_4.4-94ff4df
have-watchdog: false
stonith-enabled: false
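In a two-node cluster it is also common to relax the quorum policy so the surviving node keeps its resources when the peer dies. A hedged sketch (the two_node: 1 setting in corosync.conf shown above already covers much of this, so treat it as optional):
[root@node-1 ~]# pcs property set no-quorum-policy=ignore
[root@node-1 ~]# pcs property list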

3. Install crmsh, the command-line cluster management tool
[root@node-1 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
crm(live)# configure
crm(live)configure# edit   ## edit the cluster configuration in a vim-like editor; save and quit when done

Deploying a web service with crm:
VIP:
httpd:
Install httpd on both nodes. Note: only stop the httpd service, do not restart it, and do not enable it at boot, because the resource manager will start and stop the service itself.
Do the following on both node-1 and node-2:
[root@node-2 ~]# systemctl start httpd
[root@node-2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
[root@node-1 ~]# systemctl start httpd   ## httpd must not be enabled; crm has to manage it itself
[root@node-1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
At this point the web page on both nodes should be reachable from a browser.
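Once the test pages are confirmed, hand httpd back to the cluster: stop it and make sure it is not enabled at boot (a sketch; run the same on both nodes):
[root@node-1 ~]# systemctl stop httpd; systemctl disable httpd
[root@node-1 ~]# ssh node-2 "systemctl stop httpd; systemctl disable httpd"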
[root@node-2 ~]# crm
crm(live)# status   ## make sure every node is online before running the commands below
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive

1) Add the WebIP resource
crm(live)ra# classes
crm(live)ra# list ocf   ## ocf is one of the resource classes
crm(live)ra# info ocf:IPaddr   ## IPaddr is the agent (full name ocf:heartbeat:IPaddr, where heartbeat is the provider)
crm(live)configure# primitive WebIP ocf:IPaddr params ip=192.168.43.120
crm(live)configure# show
node 1: node-1
node 2: node-2
primitive WebIP IPaddr \
params ip=192.168.43.120
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.13-10.el7-44eb2dd \
cluster-infrastructure=corosync \
cluster-name=mycluster \
stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)# status
WebIP (ocf::heartbeat:IPaddr): Stopped
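The primitive above has no monitor operation, so pacemaker will not notice if the address silently disappears. A monitor can be added when the resource is first created, or afterwards via crm configure edit WebIP (a hedged sketch, not from the original notes; the interval and timeout values are arbitrary):
crm(live)configure# primitive WebIP ocf:heartbeat:IPaddr \
    params ip=192.168.43.120 \
    op monitor interval=30s timeout=20s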
2) Add the web server resource
crm(live)configure# primitive WebServer systemd:httpd   ## systemd is one of the classes shown by the classes command
crm(live)configure# verify
WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commit
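The timeout warnings can be silenced by giving the start/stop operations the advised values (a sketch, using the 100s suggested by the warning; this would be part of the primitive definition, e.g. via crm configure edit WebServer):
crm(live)configure# primitive WebServer systemd:httpd \
    op start timeout=100s interval=0 \
    op stop timeout=100s interval=0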

3) Bind WebIP and WebServer into a group resource
crm(live)configure# help group
crm(live)configure# group WebService WebIP WebServer   ## order matters: the web server runs wherever the IP runs, and the IP is started first
crm(live)configure# verify
WARNING: WebServer: default timeout 20s for start is smaller than the advised 100
WARNING: WebServer: default timeout 20s for stop is smaller than the advised 100
crm(live)configure# commit
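Instead of a group, the same behaviour ("the web server runs where the IP runs, IP first") can be expressed with separate colocation and order constraints — a hedged crmsh sketch, not from the original notes:
crm(live)configure# colocation WebServer_with_WebIP inf: WebServer WebIP       ## WebServer must run on the node holding WebIP
crm(live)configure# order WebIP_before_WebServer Mandatory: WebIP WebServer    ## start WebIP before WebServer
crm(live)configure# verify
crm(live)configure# commit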


crm(live)# node standby   ## put the current node into standby (make it a backup node)
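To bring the node back out of standby afterwards (a minimal sketch):
crm(live)# node online     ## run on the standby node itself, or name it explicitly: node online node-1
crm(live)# status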


4. How do you keep resources from migrating back when a failed node comes online again?
Reference: http://blog.51cto.com/nmshuishui/1399811
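The usual answer is resource stickiness: give running resources a score for staying where they are, so a recovered node does not pull them back. A hedged sketch (100 is an arbitrary value; it just has to outweigh any location preferences):
crm(live)configure# rsc_defaults resource-stickiness=100
crm(live)configure# verify
crm(live)configure# commit
## or, equivalently, with pcs:
[root@node-1 ~]# pcs resource defaults resource-stickiness=100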

 

+++++++++++++++++++++++++++++++ Troubleshooting notes ++++++++++++++++++++++++++
1. On node-1, crm status shows OFFLINE: [ node-1 node-2 ]; on node-2, crm status shows Online: [ node-2 ] and OFFLINE: [ node-1 ]?
Fix: NTP time was out of sync.
(1) [root@node-2 ~]# systemctl status pcsd; ssh node-1 "systemctl status pcsd"   ## both normal
[root@node-2 ~]# systemctl status corosync; ssh node-1 "systemctl status corosync"   ## both active
Both nodes could ping and SSH to each other, so I checked the corosync and pcsd logs: no obvious errors.
(2) Suspected the key-based authentication was failing; it was not:
[root@node-1 ~]# pcs cluster auth node-1 node-2
node-1: Already authorized
node-2: Already authorized
[root@node-2 ~]# pcs cluster auth node-1 node-2
node-1: Already authorized
node-2: Already authorized
(3) [root@node-1 ~]# crm status   ## the cause was that pacemaker had died (crm status / crm_mon could not connect to the cluster)
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
(4) [root@node-1 ~]# systemctl status pacemaker   ## only after reading a blog post did I realize NTP had fallen out of sync again
Active: failed (Result: exit-code)
[root@node-1 ~]# vim /etc/ntp.conf
server 192.168.43.128 burst iburst prefer
[root@node-2 ~]# vim /etc/ntp.conf
server 127.127.1.0
fudge 127.127.1.0 stratum 10
Restarting NTP still did not help, so the only option left was date -s "23:52:10":
[root@node-1 ~]# date ; ssh node-2 "date"
2017年 12月 01日 星期五 23:57:55 CST
2017年 12月 01日 星期五 23:57:56 CST
(5) Finally, after systemctl restart pacemaker on both nodes, crm status at last showed Online: [ node-1 node-2 ].
Reference: http://blog.51cto.com/nmshuishui/1399811

2. The corosync service will not start, which in turn keeps pacemaker from starting?
Error: [root@node-2 ~]# crm status
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport endpoint is not connected
[root@node-2 ~]# systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Dec 04 19:57:28 node-2 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
Dec 04 19:57:28 node-2 systemd[1]: Job pacemaker.service/start failed with result 'dependency'.

Fix: the node's IP address had changed and the hosts file was never updated. Note: the hosts file must be updated on every node.
[root@node-2 ~]# tail /var/log/cluster/corosync.log
[4577] node-2 corosync error   [MAIN  ] parse error in config: No interfaces defined
[4577] node-2 corosync error   [MAIN  ] Corosync Cluster Engine exiting with status 8 at main.c:1414.
[root@node-2 ~]# vim /etc/hosts
# add the new IP address and hostname entries.
[root@node-2 ~]# systemctl restart corosync
[root@node-2 ~]# systemctl restart pacemaker

3. The pacemaker service will not start?
Error: [root@node-2 ~]# systemctl status pacemaker
Active: deactivating (stop-sigterm) since Mon 2017-12-04 21:04:44 CST; 54s ago
Dec 04 21:04:44 node-2 pengine[4880]: warning: Processing failed op stop for WebIP on node-2: not configured (6)
Dec 04 21:04:44 node-2 pengine[4880]: error: Preventing WebIP from re-starting anywhere: operation stop faile...d' (6
Fix: the WebIP resource had a failed stop operation; clean it up with cleanup (or delete the resource).
[root@node-2 ~]# crm resource cleanup WebIP
crm(live)configure# delete WebIP   ## deleting either a group or a single resource works here
crm(live)configure# commit
[root@node-2 ~]# systemctl status pacemaker

4. Even after crm(live)configure# delete WebIP, the status still reports WebIP (ocf::heartbeat:IPaddr): ORPHANED FAILED node-2 (unmanaged)?
Fix: [root@node-2 ~]# crm resource cleanup WebIP

5. node-1 thinks node-2 is offline, and node-2 thinks node-1 is offline?
Symptom: [root@node-2 ~]# crm status
Online: [ node-2 ]
OFFLINE: [ node-1 ]
[root@node-1 ~]# crm status
Online: [ node-1 ]
OFFLINE: [ node-2 ]

Not fully resolved: with only two nodes there is no real quorum arbitration, so each node can end up believing it is the DC.
[root@node-1 ~]# time=`date |awk '{print $5}'`;ssh node-2 date -s "$time"   ## make the remote host's clock match this one (only copies the HH:MM:SS field)
[root@node-1 ~]# date ;ssh node-2 "date"
2017年 12月 04日 星期一 21:37:33 CST
2017年 12月 04日 星期一 21:37:33 CST

[root@node-2 ~]# systemctl list-unit-files|grep ntp   ## make sure the NTP service stays enabled at boot
ntpd.service enabled
[root@node-2 ~]# hwclock -w   ## write the current system time back to the hardware (BIOS) clock

6. Pacemaker misbehaves and complains that the configuration file format is wrong
[root@node-2 ~]# systemctl status pacemaker -l
Dec 04 21:52:35 node-2 cib[6776]: error: Completed cib_replace operation for section 'all': Update does not conform to the configured schema
Fix: in corosync.conf every block is a keyword, then a space, then an opening brace, with the options indented four spaces below it. The formatting got mangled during the copy, so rather than scp'ing the file it is safer to edit it by hand:
[root@node-2 corosync]# vim corosync.conf
quorum {
provider: corosync_votequorum
two_node: 1
}
