
My Debian 8.9 setup with DRBD 8.4.3 has somehow got into a state where the two nodes can no longer connect over the network. They should replicate a single resource, r1, but immediately after running drbdadm down r1; drbdadm up r1 on both nodes, their /proc/drbd describes the situation as follows:

on 1st node (Connection State is either WFConnection or StandAlone):

1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----- ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20 

on 2nd node:

1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r----- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48 
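When comparing the two nodes, it can help to pull individual fields out of the status line rather than reading it by eye. The following is a minimal sketch, not part of the original setup: the helper name drbd_field is hypothetical, and it assumes the status is available as a single line of space-separated key:value pairs, as quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key (cs, ro, ds, oos, ...)
# from a /proc/drbd status line given as a single string.
drbd_field() {
    # $1 = key, $2 = status line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Sample line, shortened from the 2nd node's output quoted above.
line='1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r----- ns:0 nr:0 oos:48'

drbd_field cs "$line"    # connection state
drbd_field ro "$line"    # local/peer roles
drbd_field ds "$line"    # local/peer disk states

# On a real node with a single resource, something like this should work:
#   drbd_field cs "$(grep 'cs:' /proc/drbd)"
```

A cs value of StandAlone on either node means that node has stopped trying to reach its peer, which is why the other side waits in WFConnection indefinitely.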

The two nodes can ping each other over the IP addresses cited in /etc/drbd.d/r1.res, and netstat shows that both are listening on the cited port.

How can I (further diagnose and) get out of this situation so that the two nodes can become Connected and replicate over DRBD again?

BTW, on a higher level of abstraction this problem currently manifests itself by systemctl start drbd never exiting, apparently because it gets stuck in drbdadm wait-connect all (as suggested by /lib/systemd/system/drbd.service).

2 Answers


The situation was apparently caused by a case of split-brain.

I had not noticed this because I had only inspected recent journal entries for drbd.service (sudo journalctl -u drbd), but the problem was apparently reported slightly earlier, in kernel log messages outside that unit (sudo journalctl | grep Split-Brain).
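The journal search can be made a little more forgiving by matching case-insensitively. This is a small sketch under assumptions: the function name find_split_brain is made up here, and the sample log lines are illustrative only, so the exact wording on a real system may differ.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines on stdin that mention a split-brain,
# regardless of capitalisation.
find_split_brain() {
    grep -i 'split-brain'
}

# Illustrative sample input; real kernel messages may be worded differently.
sample='Aug 24 10:01:02 node1 kernel: block drbd1: Handshake successful
Aug 24 10:01:03 node1 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!'

printf '%s\n' "$sample" | find_split_brain

# On a real node, search the whole journal rather than only the drbd unit:
#   sudo journalctl | find_split_brain
```

Searching the whole journal matters because the report comes from the kernel, so filtering by -u drbd hides it.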

Given that, manually resolving the split-brain (as described here or here) also fixed the connection problem, as follows.

On split-brain victim (assuming the DRBD resource is r1):

drbdadm disconnect r1
drbdadm secondary r1
drbdadm connect --discard-my-data r1

On split-brain survivor:

drbdadm primary r1
drbdadm connect r1
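Because --discard-my-data throws away the victim's changes, it is worth double-checking which commands belong to which node before running anything. The sketch below only prints the commands from this answer for a given role and resource; it does not execute them. The function name split_brain_plan and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the recovery commands for one node.
# $1 = role (victim | survivor), $2 = DRBD resource name
split_brain_plan() {
    role=$1
    res=$2
    case "$role" in
        victim)
            # The victim's local changes are discarded on reconnect.
            printf '%s\n' \
                "drbdadm disconnect $res" \
                "drbdadm secondary $res" \
                "drbdadm connect --discard-my-data $res"
            ;;
        survivor)
            # The connect step is only needed if the survivor is StandAlone.
            printf '%s\n' \
                "drbdadm primary $res" \
                "drbdadm connect $res"
            ;;
        *)
            echo "usage: split_brain_plan victim|survivor RESOURCE" >&2
            return 1
            ;;
    esac
}

split_brain_plan victim r1
split_brain_plan survivor r1
```

Once both sides have been handled, /proc/drbd on each node should move to cs:Connected after the victim has resynchronised from the survivor.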
  • It's best to include your steps in your answer versus linking to a site that might move later. I imagine you just needed drbdadm disconnect r1 on both nodes, then drbdadm connect r1 --discard-my-data on the victim, and drbdadm connect r1 on the survivor. Commented Aug 25, 2017 at 14:44
  • @MattKereczman Done now. Commented Aug 31, 2017 at 6:11
  • Thank you for adding the hint about grepping the journal for split-brain. I was in the same situation ;) Commented Feb 22, 2024 at 16:04

I use the following pattern. On the sick node (the one that is not the current DC; check with pcs status):

drbdadm dump all
drbdadm disconnect resource
drbdadm secondary resource
drbdadm connect resource

On the healthy node (the one that is the current DC; check with pcs status):

drbdadm dump all
drbdadm disconnect resource
drbdadm primary resource
drbdadm connect resource
