I'm trying to set up a two-node cluster using CentOS 7, Corosync, Pacemaker and pcsd. I can migrate resources manually from one node to the other, but if I turn off the primary node (by unplugging the power cable), the secondary node does not become primary. I have two network interfaces: eno1 (10.211.0.0/24) for the default route and VRRP, and eno2 (10.255.255.0/30) for Corosync and Pacemaker.

Here are the configs:

    # pcs config show
    Cluster Name: PBX
    Corosync Nodes:
     pbx-1no pbx-2no
    Pacemaker Nodes:
     pbx-1no pbx-2no

    Resources:
     Master: PBX_DRBD_master
      Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
      Resource: PBX_DRBD (class=ocf provider=linbit type=drbd)
       Attributes: drbd_resource=asterisk_DRBD
       Operations: demote interval=0s timeout=90 (PBX_DRBD-demote-interval-0s)
                   monitor interval=10s on-fail=restart role=Master timeout=20s (PBX_DRBD-monitor-interval-10s)
                   monitor interval=20s on-fail=restart role=Slave timeout=20s (PBX_DRBD-monitor-interval-20s)
                   notify interval=0s timeout=90 (PBX_DRBD-notify-interval-0s)
                   promote interval=0s timeout=90 (PBX_DRBD-promote-interval-0s)
                   reload interval=0s timeout=30 (PBX_DRBD-reload-interval-0s)
                   start interval=0s on-fail=restart timeout=240s (PBX_DRBD-start-interval-0s)
                   stop interval=0s on-fail=block timeout=100s (PBX_DRBD-stop-interval-0s)
     Resource: PBX_FS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: device=/dev/drbd0 directory=/mnt/drbd0 fstype=ext4
      Operations: monitor interval=20s on-fail=restart timeout=40s (PBX_FS-monitor-interval-20s)
                  notify interval=0s timeout=60s (PBX_FS-notify-interval-0s)
                  start interval=0s on-fail=restart timeout=60s (PBX_FS-start-interval-0s)
                  stop interval=0s on-fail=block timeout=60s (PBX_FS-stop-interval-0s)
     Resource: PBX_IP (class=ocf provider=heartbeat type=IPaddr2)
      Attributes: cidr_netmask=24 iflabel=0 ip=10.211.0.10 nic=eno1
      Operations: monitor interval=10s on-fail=restart timeout=20s (PBX_IP-monitor-interval-10s)
                  start interval=0s on-fail=restart timeout=20s (PBX_IP-start-interval-0s)
                  stop interval=0s on-fail=block timeout=20s (PBX_IP-stop-interval-0s)
     Resource: PBX_ROUTE_default (class=ocf provider=heartbeat type=Route)
      Attributes: destination=0.0.0.0/0 family=ip4 gateway=10.211.0.1 source=10.211.0.10
      Operations: monitor interval=10s on-fail=restart timeout=20s (PBX_ROUTE_default-monitor-interval-10s)
                  reload interval=0s timeout=20s (PBX_ROUTE_default-reload-interval-0s)
                  start interval=0s on-fail=restart timeout=20s (PBX_ROUTE_default-start-interval-0s)
                  stop interval=0s on-fail=ignore timeout=20s (PBX_ROUTE_default-stop-interval-0s)
     Resource: PBX_mariadb (class=systemd type=mariadb.service)
      Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_mariadb-monitor-interval-100s)
                  start interval=0s on-fail=ignore timeout=100s (PBX_mariadb-start-interval-0s)
                  stop interval=0s on-fail=ignore timeout=100s (PBX_mariadb-stop-interval-0s)
     Resource: PBX_httpd (class=systemd type=httpd.service)
      Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_httpd-monitor-interval-100s)
                  start interval=0s on-fail=ignore timeout=100s (PBX_httpd-start-interval-0s)
                  stop interval=0s on-fail=ignore timeout=100s (PBX_httpd-stop-interval-0s)
     Resource: PBX_asterisk (class=systemd type=asterisk.service)
      Operations: monitor interval=100s on-fail=ignore timeout=60s (PBX_asterisk-monitor-interval-100s)
                  start interval=0s on-fail=ignore timeout=100s (PBX_asterisk-start-interval-0s)
                  stop interval=0s on-fail=ignore timeout=100s (PBX_asterisk-stop-interval-0s)
     Clone: ping_internal-clone
      Resource: ping_internal (class=ocf provider=pacemaker type=ping)
       Attributes: dampen=5s host_list="10.255.255.1 10.255.255.2" multiplier=1000
       Operations: monitor interval=10 timeout=60 (ping_internal-monitor-interval-10)
                   start interval=0s timeout=60 (ping_internal-start-interval-0s)
                   stop interval=0s timeout=20 (ping_internal-stop-interval-0s)

    Stonith Devices:
     Resource: hpilo1 (class=stonith type=fence_ilo5)
      Attributes: ipaddr=ilo1.emergency login=admin passwd=11111 pcmk_host_list=pbx-1no
      Operations: monitor interval=60s (hpilo1-monitor-interval-60s)
     Resource: hpilo2 (class=stonith type=fence_ilo5)
      Attributes: ipaddr=ilo2.emergency login=admin passwd=11111 pcmk_host_list=pbx-2no
      Operations: monitor interval=60s (hpilo2-monitor-interval-60s)
    Fencing Levels:

    Location Constraints:
      Resource: PBX_FS
        Enabled on: pbx-1no (score:INFINITY) (role: Started) (id:cli-prefer-PBX_FS)
      Resource: hpilo1
        Disabled on: pbx-1no (score:-INFINITY) (id:location-hpilo1-pbx-1no--INFINITY)
      Resource: hpilo2
        Disabled on: pbx-2no (score:-INFINITY) (id:location-hpilo2-pbx-2no--INFINITY)
    Ordering Constraints:
      promote PBX_DRBD_master then start PBX_FS (kind:Mandatory) (id:order-PBX_DRBD_master-PBX_FS-mandatory)
      start PBX_FS then start PBX_IP (kind:Mandatory) (id:order-PBX_FS-PBX_IP-mandatory)
      start PBX_IP then start PBX_ROUTE_default (kind:Mandatory) (id:order-PBX_IP-PBX_ROUTE_default-mandatory)
      start PBX_FS then start PBX_asterisk (kind:Mandatory) (id:order-PBX_FS-PBX_asterisk-mandatory)
      start PBX_FS then start PBX_mariadb (kind:Mandatory) (id:order-PBX_FS-PBX_mariadb-mandatory)
      start PBX_mariadb then start PBX_httpd (kind:Mandatory) (id:order-PBX_mariadb-PBX_httpd-mandatory)
    Colocation Constraints:
      PBX_ROUTE_default with PBX_IP (score:INFINITY) (id:colocation-PBX_ROUTE_default-PBX_IP-INFINITY)
      PBX_FS with PBX_DRBD_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-PBX_FS-PBX_DRBD_master-INFINITY)
      PBX_IP with PBX_FS (score:INFINITY) (id:colocation-PBX_IP-PBX_FS-INFINITY)
      PBX_asterisk with PBX_FS (score:INFINITY) (id:colocation-PBX_asterisk-PBX_FS-INFINITY)
      PBX_mariadb with PBX_FS (score:INFINITY) (id:colocation-PBX_mariadb-PBX_FS-INFINITY)
      PBX_httpd with PBX_FS (score:INFINITY) (id:colocation-PBX_httpd-PBX_FS-INFINITY)
    Ticket Constraints:

    Alerts:
     Alert: smtp_alert (path=/var/lib/pacemaker/alert_smtp.sh)
      Recipients:
       Recipient: smtp_alert-recipient (value=hidden)

    Resources Defaults:
     resource-stickiness=100
    Operations Defaults:
     No defaults set

    Cluster Properties:
     cluster-infrastructure: corosync
     cluster-name: PBX
     dc-version: 1.1.23-1.el7_9.1-9acf116022
     have-watchdog: false
     last-lrm-refresh: 1613632161
     no-quorum-policy: ignore
     stonith-enabled: true

    Quorum:
      Options:
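The mandatory ordering constraints above form a dependency chain: if an early action (for example the promote of PBX_DRBD_master on the surviving node) never happens, nothing ordered after it can start. A minimal Python 3.9+ sketch (it needs `graphlib`) that derives one valid start sequence from those constraints; it models ordering only, not colocation, scores or stickiness, so it illustrates the chain rather than reproducing Pacemaker's scheduler.

```python
from graphlib import TopologicalSorter

# Ordering constraints transcribed from the pcs output, as (first, then) pairs.
# "PBX_DRBD_master:promote" stands for the promote action of the
# master/slave resource; every other entry is a start action.
ORDERING = [
    ("PBX_DRBD_master:promote", "PBX_FS"),
    ("PBX_FS", "PBX_IP"),
    ("PBX_IP", "PBX_ROUTE_default"),
    ("PBX_FS", "PBX_asterisk"),
    ("PBX_FS", "PBX_mariadb"),
    ("PBX_mariadb", "PBX_httpd"),
]


def start_order(constraints):
    """Return one sequence of actions that satisfies every (first, then) pair."""
    graph = {}
    for first, then in constraints:
        # TopologicalSorter expects a mapping of node -> set of predecessors.
        graph.setdefault(then, set()).add(first)
        graph.setdefault(first, set())
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, action in enumerate(start_order(ORDERING), 1):
        print(step, action)
```

The promote action is the only node without predecessors, so it always comes first; everything else depends on it directly or through PBX_FS.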

Corosync.conf

    totem {
        version: 2
        cluster_name: PBX
        secauth: on
        transport: udpu
        token: 5000
    }

    nodelist {
        node {
            ring0_addr: pbx-1no
            nodeid: 1
        }

        node {
            ring0_addr: pbx-2no
            nodeid: 2
        }
    }

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
    }

asterisk.DRBD

    resource asterisk_DRBD {
        handlers {
            split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }
        disk {
            on-io-error detach;
        }
        net {
            protocol C;
            after-sb-0pri discard-zero-changes;
            after-sb-1pri discard-secondary;
            after-sb-2pri call-pri-lost-after-sb;
            cram-hmac-alg "sha1";
            shared-secret "something";
        }
        on pbx-1 {
            device /dev/drbd0;
            disk /dev/md3;
            address 10.255.255.1:7789;
            meta-disk internal;
        }
        on pbx-2 {
            device /dev/drbd0;
            disk /dev/md3;
            address 10.255.255.2:7789;
            meta-disk internal;
        }
    }

At first I suspected routing: when eno2 is down, there is no route for 10.255.255.0/30, so the traffic goes through the default gateway. But I added a rule on the router that drops these packets, and it had no effect. What could be the problem?
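The routing hypothesis rests on longest-prefix matching: once the 10.255.255.0/30 route disappears, the only remaining match for a peer address is the default route. A minimal Python sketch using the standard `ipaddress` module; the routing table is a hypothetical reconstruction from the addresses in the question, and the lookup ignores metrics, multiple tables and policy rules, so it models the reasoning rather than the kernel.

```python
import ipaddress

# Hypothetical routing table for one node, built from the question's
# addressing: (destination prefix, where matching traffic goes).
ROUTES = [
    ("10.211.0.0/24", "eno1"),
    ("10.255.255.0/30", "eno2"),
    ("0.0.0.0/0", "via 10.211.0.1 dev eno1"),
]


def lookup(dst, routes):
    """Return the target of the most specific (longest-prefix) matching route."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), target)
        for prefix, target in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    _, target = max(matches, key=lambda m: m[0].prefixlen)
    return target


if __name__ == "__main__":
    print(lookup("10.255.255.2", ROUTES))  # eno2
    # Simulate the /30 route vanishing when eno2 loses its address.
    without_eno2 = [r for r in ROUTES if r[1] != "eno2"]
    print(lookup("10.255.255.2", without_eno2))  # via 10.211.0.1 dev eno1
```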

  • Corosync supports a third node that can tell which server is really down, so no more split brain! You can run it on a Raspberry Pi, a VM or a big server; it has a very small footprint. Do not build clusters (with data) from only two nodes: use at least three nodes, even if the third will never see the data. Commented Feb 18, 2021 at 7:51
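The advice about a third node comes down to majority arithmetic. A minimal Python sketch, assuming one vote per node and a simple majority rule (floor(n/2) + 1), shows that two nodes cannot lose a member and keep a majority, while three can. The `two_node` flag is the special case set in the question's corosync.conf: with it a lone node stays quorate, so quorum alone cannot decide which side should run the resources and fencing has to settle it. This is a simplified model, not corosync's votequorum implementation.

```python
def quorum_threshold(expected_votes):
    """Votes needed for a simple majority: floor(n / 2) + 1."""
    return expected_votes // 2 + 1


def has_quorum(alive_votes, expected_votes, two_node=False):
    """Simplified quorum check with one vote per node.

    two_node mimics corosync's two_node: 1 special case, where a single
    surviving node of a 2-node cluster is treated as quorate. (The real
    option also turns on wait_for_all, which this sketch does not model.)
    """
    if two_node and expected_votes == 2:
        return alive_votes >= 1
    return alive_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    print(has_quorum(1, 2))                 # False: 1 of 2 is not a majority
    print(has_quorum(1, 2, two_node=True))  # True: the two_node special case
    print(has_quorum(2, 3))                 # True: 2 of 3 is a majority
```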

1 Answer

The problem was the IP address. When the main node shuts down, the Ethernet link on the secondary node also goes down and the interface loses its IP address. So I wrote a script that runs ifdown/ifup on the interface whenever it has no IP address.
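The answer does not include the script itself. A minimal Python sketch of the same idea, assuming the affected interface is eno2 (the Corosync/DRBD link; the answer does not name it), that iproute2's `ip` and the CentOS 7 `ifdown`/`ifup` network scripts are installed, and that it runs as root on a schedule such as a cron job or systemd timer (the scheduling is also an assumption). The command runner is injectable so the logic can be tested without touching real interfaces.

```python
import subprocess

IFACE = "eno2"  # assumed: the cluster interconnect from the question


def has_ipv4(iface, run=subprocess.run):
    """True if `ip -4 -o addr show dev IFACE` lists at least one IPv4 address."""
    result = run(
        ["ip", "-4", "-o", "addr", "show", "dev", iface],
        capture_output=True,
        text=True,
    )
    # One-line output looks like "3: eno2    inet 10.255.255.1/30 ...".
    return result.returncode == 0 and " inet " in result.stdout


def ensure_ipv4(iface, run=subprocess.run):
    """Bounce the interface with ifdown/ifup if it has no IPv4 address.

    Returns True when a bounce was attempted, False otherwise.
    """
    if has_ipv4(iface, run):
        return False
    run(["ifdown", iface], check=False)
    run(["ifup", iface], check=False)
    return True


if __name__ == "__main__":
    if ensure_ipv4(IFACE):
        print(f"{IFACE} had no IPv4 address; ran ifdown/ifup")
```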
