
I have an HA cluster with two nodes: node 1 is the primary and node 2 is its mirror. I have a problem with the mysql resource, since my nodes are not synchronized.

drbd-overview

Primary node:
0:home Connected Primary/Secondary UpToDate/UpToDate C r-----
1:storage Connected Secondary/Primary UpToDate/UpToDate C r-----
2:mysql StandAlone Secondary/Unknown UpToDate/Outdated r-----

Secondary node:
0:home Connected Secondary/Primary UpToDate/UpToDate C r-----
1:storage Connected Primary/Secondary UpToDate/UpToDate C r-----
2:mysql StandAlone Primary/Unknown UpToDate/Outdated r-----

Reviewing the messages file, I found the following:

Apr-19 18:20:36 clsstd2 kernel: block drbd2: self C1480E287A8CAFAB:C7B94724E2658B94:5CAE57DEB3EDC4EE:F5887A918B55FB1A bits:114390101 flags:0
Apr-19 18:20:36 clsstd2 kernel: block drbd2: peer 719D326BDE8272E2:0000000000000000:C7BA4724E2658B94:C7B94724E2658B95 bits:0 flags:1
Apr-19 18:20:36 clsstd2 kernel: block drbd2: uuid_compare()=-1000 by rule 100
Apr-19 18:20:37 clsstd2 kernel: block drbd2: Unrelated data, aborting!
Apr-19 18:20:37 clsstd2 kernel: block drbd2: conn (WFReportParams -> Disconnecting)
Apr-19 18:20:37 clsstd2 kernel: block drbd2: error receiving ReportState, l: 4!
Apr-19 18:20:38 clsstd2 kernel: block drbd2: asender terminated
Apr-19 18:20:38 clsstd2 kernel: block drbd2: Terminating asender thread
Apr-19 18:20:38 clsstd2 kernel: block drbd2: Connection closed
Apr-19 18:20:38 clsstd2 kernel: block drbd2: conn (Disconnecting -> StandAlone)
Apr-19 18:20:39 clsstd2 kernel: block drbd2: receiver terminated
Apr-19 18:20:39 clsstd2 kernel: block drbd2: Terminating receiver thread
Apr-19 18:20:39 clsstd2 auditd[3960]: Audit daemon rotating log files

I don't understand what the problem is or how I can solve it. Checking both nodes, I realized that the /var/lib/mysql directory on node 2 does not contain the ibdata1 file, but it does exist on node 1.

3 Answers


The problem is that you have hit the "famous" DRBD split-brain condition, and both DRBD nodes went into the StandAlone state. It's difficult to say whether the DB on your primary node is valid or corrupted, but for now you have two routes to choose from:

  1. Try to re-sync the DRBD nodes, treating one of them as having the more recent version of the data (which may not actually be true in your case); a quick way to verify the reconnection afterwards is sketched after this list.

This is what you run on the second node:

#drbdadm secondary resource
#drbdadm disconnect resource
#drbdadm -- --discard-my-data connect resource

This is what you run on your surviving node, the one you think has the most recent version of the data:

#drbdadm connect resource 

If that doesn't help, you can trash the second node's data and force a full resync from the peer by executing this command on it:

#drbdadm invalidate resource 
  2. Purge both nodes' data with the last command from (1) and recover your DB from backups.
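Whichever route you take, once the nodes reconnect you can confirm that they are actually talking and resyncing again. A minimal check, assuming your resource is really named mysql as in your drbd-overview output (substitute your own resource name):

drbdadm cstate mysql   # connection state: should go from StandAlone to Connected (or SyncSource/SyncTarget while syncing)
drbdadm dstate mysql   # disk states: UpToDate/Inconsistent while the resync is still running
cat /proc/drbd         # shows the resync progress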

Hope this helped!

P.S. I would really recommend avoiding DRBD in production. What you see is quite a common thing, unfortunately.

  • Right, this is a split brain in DRBD, and possibly there is the following message in the logs: "kernel: block drbd0: Split-Brain detected, dropping connection!" (although it's not always detected). Route 1 is worth trying. Just an example to illustrate: suse.com/support/kb/doc/?id=000019009. And you're right, DRBD is well known for this issue. To avoid it, either use quorum with a third node or go for something that works properly on 2 nodes, like StarWind vSAN for example. Commented May 4, 2023 at 7:47

The issue here is the "Unrelated data, aborting!" message you see in the logs. Likely the nodes have changed roles enough times while disconnected that the historical generation identifiers within the metadata no longer match. See the DRBD User's Guide for further information: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-gi

At this point, you will need to select a node to overwrite the data of the other and perform a new full sync. To do this, recreate the metadata on the node that is to become the SyncTarget, using drbdadm create-md <resource>.
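A minimal sketch of that sequence on the node that is to become the SyncTarget, assuming the resource is named mysql as in the question (the DRBD device must not be in use on that node, so stop or move any cluster resources using it first):

drbdadm down mysql        # take the resource down on this node only
drbdadm create-md mysql   # recreate the local metadata, discarding this node's generation identifiers
drbdadm up mysql          # attach and connect again; this node should come back as the SyncTarget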

  • Thank you for answering. When performing these steps, is the data on the main node not at risk? Commented Apr 28, 2023 at 21:33
  • As long as you do not recreate the metadata on the primary node, it will automatically be chosen as the SyncSource once they connect. Commented Apr 28, 2023 at 23:19
  • Thanks, you were right, the solution was to recreate the metadata Commented May 8, 2023 at 21:22

The solution was to recreate the metadata by running the following commands on the node where I wanted to regenerate it. Now, everything is synchronized again.

drbdadm down resource
drbdadm wipe-md resource
drbdadm create-md resource
drbdadm up resource
drbdadm disconnect resource
drbdadm connect resource

The last command is executed first on the node where the metadata is recreated, and then on the other node.
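As a final sanity check (a sketch only; the mysql resource name, the /dev/drbd2 device, and the /var/lib/mysql path are taken from the question), once the resync has finished you would expect something like:

cat /proc/drbd                   # cs:Connected ds:UpToDate/UpToDate on both nodes
drbdadm role mysql               # Primary/Secondary on the active node, Secondary/Primary on the other
ls -l /var/lib/mysql/ibdata1     # on the node where /dev/drbd2 is mounted, ibdata1 should be present again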

