We are running a MongoDB 2.6 replica set with 3 members: primary, secondary, and arbiter. Nearly every day the replica set switches which server is primary, and this interrupts all connections to that database. That would be perfectly fine if one of the servers truly were down; the problem is that in each case the "down" server wasn't actually down. It was up the whole time.
Here's what we know:
- The mongod process on all 3 servers did not restart or go down.
- The servers were still reporting to New Relic the whole time.
- From the mongo log we're seeing frequent heartbeat failures.
- The servers aren't really under a very high load at any point. I am seeing a CPU spike every hour about 10 minutes past the hour, but that doesn't neatly line up with the failures.
The following is the output of show log rs, run from a mongo shell connected to the current primary:
2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T15:05:49.358+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T15:05:56.444+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-17T22:11:37.495+0000 [rsMgr] not electing self, we are not freshest
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-17T22:11:39.140+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T22:11:39.147+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-17T23:05:47.431+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-17T23:05:47.876+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T10:05:46.821+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017
2015-05-18T10:05:46.822+0000 [rsBackgroundSync] replSet syncing to: server1:27017
2015-05-18T10:05:51.014+0000 [rsBackgroundSync] replset setting syncSourceFeedback to server1:27017
2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):
2015-05-18T22:12:11.434+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
2015-05-18T22:12:11.507+0000 [rsMgr] replSet info electSelf 3
2015-05-18T22:12:14.708+0000 [rsMgr] replSet PRIMARY
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is up
2015-05-18T22:12:14.709+0000 [rsHealthPoll] replSet member server1:27017 is now in state PRIMARY
2015-05-18T22:12:21.610+0000 [rsHealthPoll] replSet member server1:27017 is now in state ROLLBACK
2015-05-18T22:12:23.612+0000 [rsHealthPoll] replSet member server1:27017 is now in state SECONDARY
2015-05-19T22:13:13.004+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x), connection attempt failed
2015-05-19T22:13:24.127+0000 [rsHealthPoll] couldn't connect to server1:27017: couldn't connect to server server1:27017 (x.x.x.x) failed, connection attempt failed
2015-05-19T22:13:29.267+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state
2015-05-20T22:14:35.832+0000 [rsHealthPoll] replset info server1:27017 just heartbeated us, but our heartbeat failed: , not changing state

You can see we're getting frequent heartbeat failures and down notifications, but in each case the server goes from down to back up within seconds. I'm not sure where to start looking next to figure out what could be causing the problem.
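To check whether the failures cluster at a particular time of day, the log lines can be bucketed by hour. The following is a minimal Python sketch rather than a tested tool. It assumes every line starts with a timestamp in the format shown in the log excerpt above (UTC, `+0000`); the names `failures_by_hour` and `FAILURE_MARKERS`, and the choice of which messages count as a failure, are hypothetical and only for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2015-05-17T22:11:36.644+0000 [rsHealthPoll] replSet member server1:27017 is now in state DOWN
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d{3}\+0000 \[(\w+)\] (.*)$"
)

# Substrings treated as failure events. This list is an assumption
# based on the messages in the excerpt, not an official classification.
FAILURE_MARKERS = (
    "is down (or slow to respond)",
    "sync source problem",
    "couldn't connect to",
    "our heartbeat failed",
)


def failures_by_hour(lines):
    """Count failure log lines per hour of the day (UTC)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped log line
        timestamp, _thread, message = match.groups()
        if any(marker in message for marker in FAILURE_MARKERS):
            hour = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").hour
            counts[hour] += 1
    return counts


# A few lines copied from the log excerpt, used as sample input.
SAMPLE = [
    "2015-05-17T15:05:49.339+0000 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: server1:27017",
    "2015-05-17T22:11:36.638+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):",
    "2015-05-17T22:11:38.656+0000 [rsHealthPoll] replSet member server1:27017 is up",
    "2015-05-18T22:12:11.433+0000 [rsHealthPoll] replSet info server1:27017 is down (or slow to respond):",
]

for hour, count in sorted(failures_by_hour(SAMPLE).items()):
    print(f"{hour:02d}:00 UTC  {count} failure line(s)")
```

Running the same function over the full mongod log, and comparing the busiest hours against cron jobs, backups, or other scheduled work on the servers, would be one way to narrow down where to look.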