MongoDB frequently switching primaries

Question

We are running a Mongo 2.6 replica set with 3 members: primary, secondary, arbiter. Nearly every day the replica set switches which server is primary, and this interrupts all connections to that DB. That would be perfectly fine if one of the servers truly were down; the challenge is that in each case the "down" server wasn't actually down. It was up the whole time.

Here's what we know:

The mongod process on all 3 servers did not restart or go down.
The servers were still reporting to New Relic the whole time.
From the mongo log we're seeing frequent heartbeat failures.
The servers aren't really under a very high load at any point. I am seeing a CPU spike every hour about 10 minutes past the hour, but that doesn't neatly line up with the failures.

The following is the result of show log rs while shell'd in to the current primary.

2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T15:05:49.358+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T15:05:56.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-17T22:11:39.140+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T22:11:39.147+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T23:05:47.876+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T10:05:46.822+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-18T10:05:51.014+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-18T22:12:11.507+0000 [rsMgr] replSet info electSelf 3
2015-05-18T22:12:14.708+0000 [rsMgr] replSet PRIMARY
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-18T22:12:21.610+0000 [rsHealthPoll] replSet member server1:27017 is now in state ROLLBACK
2015-05-18T22:12:23.612+0000 [rsHealthPoll] replSet member server1:27017 is now in state SECONDARY
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-19T22:13:24.127+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x) failed, connection attempt failed
2015-05-19T22:13:29.267+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state

You can see we're getting frequent heartbeat failures and down notifications, but in each case the server would go from down to back up in seconds each time. I'm not really sure where to even start looking next to try and figure out what could be causing the problem.

Wesley · Accepted Answer · 2015-05-21 03:54:59Z

I see this frequently, and the cause is always outside of the mongod process: DNS resolver issues, TCP/IP stack problems, network links, physical hardware, etc. Work your way out from the mongod process. Check for networking errors on your host OS, check physical links (if physical is in the equation), and check your cloud provider's network between the two servers if you're spanning regions. In all likelihood this is something on the host OSes and has nothing to do with MongoDB itself.
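
One low-tech way to start working outward from mongod is to time plain TCP connects to the other member and log only the slow or failed ones. The following is a minimal Python sketch under some assumptions: `probe` and `watch` are hypothetical helper names, the 0.5-second "slow" threshold is arbitrary, and this is an ordinary TCP connect, not MongoDB's actual heartbeat.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Time one TCP connect to host:port; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and then connects, so any
        # DNS delay is included in the measured time.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and resolver failures.
        ok = False
    return ok, time.monotonic() - start


def watch(host, port, interval=1.0, slow=0.5):
    """Probe forever, printing only failed or slow attempts (UTC stamps)."""
    while True:
        ok, elapsed = probe(host, port)
        if not ok or elapsed > slow:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
            print(f"{stamp} {host}:{port} ok={ok} elapsed={elapsed:.3f}s")
        time.sleep(interval)
```

Running something like `watch("server1", 27017)` from the other replica set member through a failover window would show whether connects stall at the OS or network level at the same moments mongod logs heartbeat failures. The timestamps are printed in UTC so they can be lined up against the mongod log.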

That was my suspicion. My current thought is to find some way to log the MongoDB heartbeat calls and look at that data. Unfortunately, I'm not able to find any docs that detail the implementation of the heartbeat.
– Owen Allen, Commented May 21, 2015 at 16:30
@Nucleon Run MongoDB with the most verbose logging: docs.mongodb.org/manual/reference/configuration-options/… However, mongod itself isn't really going to know anything about why. The path lies outside of the processes. Probably some tcpdumping is in order.
– Wesley, Commented May 21, 2015 at 17:28

Owen Allen · Accepted Answer · 2015-06-04 16:15:04Z

This has been resolved. The core issue was that our hosting provider was running VMware snapshots as a backup mechanism. These snapshots caused the VM to go into a temporary period of stasis; I believe the technical term is that the VM quiesces.

Once these snapshots were disabled, we no longer had any issues.
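
In hindsight, the log in the question is consistent with a scheduled job: every DOWN, connect-failure, and heartbeat-failure line falls between 22:11 and 22:15 UTC on consecutive days. The following is a minimal Python sketch of how that pattern could be surfaced by bucketing failure lines by hour of day. The function name `failures_by_hour` and the choice of which messages count as failures are assumptions for illustration; the sample lines are copied from the question's log.

```python
import re
from collections import Counter

# A handful of lines copied from the question's `show log rs` output.
LOG = """\
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
"""

# Assumption: these three message fragments are treated as "failures".
FAILURE = re.compile(r"state DOWN|couldn't connect|heartbeat failed")
# Capture the hour from a leading ISO-8601 timestamp.
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T(\d{2}):\d{2}")


def failures_by_hour(log_text):
    """Count failure lines per hour of day, keyed by a two-digit 'HH' string."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match and FAILURE.search(line):
            counts[match.group(1)] += 1
    return counts


print(failures_by_hour(LOG))  # Counter({'22': 4})
```

When all failures land in a single hour like this, it is worth asking the hosting provider what runs on the host or hypervisor at that time, such as backups or snapshots.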
