Rapid Prototyping a Safe, Logless Reconfiguration Protocol for MongoDB with TLA+
MongoDB provides high availability and fault tolerance using replica sets, which are groups of database servers that operate a Raft-like consensus protocol. Each database write operation is replicated in a sequential log (the oplog) and applied to all replicas. The consensus protocol guarantees that once an oplog entry is committed on a majority of replica set nodes, the write will be durable even if some nodes fail. Over time, however, we may need to change the set of servers operating within a replica set, for example to remove or replace failed nodes, a problem known as dynamic reconfiguration. Reconfiguration is a critical operation for dynamically expanding a cluster or replacing unhealthy nodes, so its correctness is crucial both for customer confidence in these operations and for the overall reliability of a replica set or sharded cluster.

In 2019, we needed to implement a new, safe reconfiguration protocol with rigorous correctness guarantees. At the time, the MongoDB replication system had an existing, legacy reconfiguration mechanism, but it had several known correctness bugs that necessitated a new protocol design. Although the existing protocol had correctness issues, it also had some attractive design characteristics. In particular, it decoupled reconfigurations from the main database operation log and employed a logless design, storing configurations as single objects and replicating them between nodes in a gossip-based manner. Therefore, our design goal was to develop a new, safe reconfiguration protocol while minimizing changes to this existing, legacy gossip-based protocol.

We knew that dynamic reconfiguration protocols are notoriously difficult to design correctly, so we needed a design approach that would allow us to proceed efficiently and with high confidence. With the help of formal specification and model checking tools, specifically TLA+ and its model checker, TLC, we were able to rapidly develop the design of a new, safe, logless reconfiguration protocol in just a couple of weeks, and to implement it in production in a few months.

In this post, we discuss our process of formally modeling the legacy reconfiguration protocol in TLA+, characterizing its bugs with a model checker, and iteratively developing modifications that led to a safe, logless reconfiguration protocol design.

There were a few key, high-level takeaways from our process. Most notably, rigorous, formal modeling didn't slow us down; instead, it accelerated design and delivery timelines while maintaining a high correctness bar. It also led to a simpler protocol design, allowing us to maintain a unified reconfiguration engine rather than two parallel protocols, which would have been prone to unexpected interactions and added maintenance burden. The new protocol also provided novel performance benefits over standard reconfiguration approaches, due to the decoupling of reconfigurations from the main database log.

Background and motivation

The original MongoDB replication system used a legacy, gossip-based reconfiguration protocol that was fully decoupled from the main oplog. Each configuration was identified by a numeric, monotonically increasing config version, and nodes in a replica set learned the latest config from each other via periodic heartbeat messages. When a node learned of a config with a higher version, it immediately installed that config and began operating with it.
We refer to this original protocol design as logless, since it stored each configuration as a single object and propagated configurations in a gossip-based manner, with no use of a sequential log for recording and replicating reconfiguration operations. This protocol also had a "force reconfig" feature, allowing users to install a new configuration even if a majority of nodes were offline. While the legacy protocol performed well in most scenarios, it was known to be unsafe in certain cases. Moreover, we expected reconfiguration to become a more common operation in MongoDB, necessitating the development of a new, safe reconfiguration protocol.

Initially, we considered Raft's existing reconfiguration protocols, including its single-node reconfiguration protocol, which restricts reconfigurations to adding or removing a single server. The standard Raft approach, however, was ultimately deemed incompatible with "force reconfig," and it would have required maintaining both a new, log-based implementation and the legacy, gossip-based one. It would also have been complicated to ensure the two protocols didn't interfere with each other. Instead, we hoped to develop a new protocol that minimized changes to the existing legacy protocol, simplifying both design and implementation. Ideally, we would be able to adapt ideas from Raft's single-node reconfiguration protocol to our gossip-based, legacy reconfig protocol, which would allow for better compatibility with "force" reconfig, would be easier to upgrade and downgrade, and would eliminate the need for a new oplog entry format for reconfigurations.

This idea of developing a safe, logless reconfiguration protocol seemed promising, as it would eliminate the need to mix two protocols and would allow the same basic mechanism to serve both normal and force reconfigurations. We needed, however, to be very confident in the correctness of such an approach, which was difficult to achieve manually within a short design time frame. When we first pitched this idea early in the design process, it was unclear whether such a solution was possible and whether it could be implemented safely in production. There was some existing work on decoupling reconfigurations and on logless consensus, but none that directly applied to a Raft-based consensus system such as ours. Furthermore, the discovery of a critical safety bug in one of Raft's reconfiguration protocols highlighted how challenging designing or modifying reconfiguration protocols for consensus systems can be. That bug was only discovered over a year after Raft's initial publication and required subtle protocol modifications to address.

Around that time, in 2019, MongoDB's replication team had already had some success with TLA+ and model checking on similar protocol design problems. Encouraged by these experiences, we set off to employ TLA+ and its model checker, TLC, to rapidly iterate on a candidate design and to develop a safe, logless reconfiguration protocol that was simpler, easier to implement, and provided novel performance benefits.

Modeling the legacy protocol

We were focused on developing a reconfiguration protocol that minimized design changes to the existing system, so we started by developing a TLA+ specification of the legacy reconfiguration protocol. This allowed us to characterize the flaws in the legacy protocol precisely and guided us towards the modifications needed to make it safe.

To model the legacy, gossip-based protocol, we started from an existing TLA+ specification we had developed for an abstract version of the MongoDB replication protocol, one that did not include any reconfiguration behavior. We extended this specification with two key reconfiguration-related actions: a Reconfig action, which represents the installation of a new config on a primary node, and a SendConfig action, which gossips a config with a higher config version from one node to another.
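The following is a minimal sketch of what these two actions might look like. The variable and constant names (state, config, configVersion, Primary) are assumptions made for this illustration rather than excerpts from our published specifications, and UNCHANGED clauses for unrelated variables are omitted.

```tla
\* Illustrative sketch of the legacy protocol's reconfiguration actions.
\* Assumes variables state, config, configVersion and a constant Primary.

\* A primary node installs an arbitrary new config and bumps the config version.
Reconfig(i, newConfig) ==
    /\ state[i] = Primary
    /\ config' = [config EXCEPT ![i] = newConfig]
    /\ configVersion' = [configVersion EXCEPT ![i] = configVersion[i] + 1]

\* Gossip: node j adopts node i's config if i's config version is newer.
SendConfig(i, j) ==
    /\ configVersion[i] > configVersion[j]
    /\ config' = [config EXCEPT ![j] = config[i]]
    /\ configVersion' = [configVersion EXCEPT ![j] = configVersion[i]]
```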
This model also defines the high-level safety properties of the protocol. The fundamental external guarantee is that when a majority write is committed on a replica set, the write will remain durable as long as a majority of nodes are alive. This guarantee is largely captured in the LeaderCompleteness property, which states that any new leader in a higher term must contain all log entries committed in earlier terms. Along with this, we also include a lower-level correctness property of Raft-based systems, ElectionSafety, which states that there can never be two primaries in the same term.

Iteratively strengthening our reconfiguration rules

Our legacy protocol model and its underlying correctness properties served as the starting point for a series of experiments, guided by the model checker, that iteratively led us towards a safe protocol design. We explored a series of design candidates, incrementally analyzing and refining our design in response to counterexamples discovered by the model checker.

Single node changes

One of the fundamental challenges of dynamic reconfiguration is that the notion of "quorum" (i.e., majority) changes when the set of servers operating the protocol changes. For example, consider a reconfiguration that expands the protocol's set of servers from C1 = {n1, n2, n3} to C2 = {n1, n2, n3, n4, n5}. An operation contacting a quorum in C1 may (correctly) contact the servers Q1 = {n1, n2}, while a valid quorum in C2 may be Q2 = {n3, n4, n5}. This is problematic, since Q1 and Q2 do not intersect, and quorum intersection is a key property of standard Raft (and most other practical consensus protocols). Raft's single-node approach partially addresses this by restricting configuration changes to those that add or remove a single node, which enforces overlapping quorums between such configurations.

So we started with a basic initial question: does enforcing single-node changes address the safety issues of the legacy protocol? We expected this would not be a fully sufficient condition for safety, but it was a stepping stone towards safer protocol revisions, and we wanted to confirm each of our hypotheses along the way.

We introduced the single-node change rule as a precondition of the Reconfig action, ensuring that any majority of nodes in the old config and any majority of nodes in the new config share at least one common node. In our specification, we employed a slightly generalized definition of this property, which allows reconfiguration between any two configs whose majority quorums overlap, even if the change is not strictly a single-node addition or removal (e.g., all majority quorums of C1 = {n1, n2} and C2 = {n1, n2, n3, n4} intersect, but you cannot move from one to the other via a single addition or removal). One of the benefits of specifying the protocol in a high-level, mathematical specification language like TLA+ is that it enables concise definition of these kinds of properties, as seen below.
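The sketch below shows how the quorum overlap condition, along with the ElectionSafety property mentioned above, can each be written in a few lines. The operator and variable names (Server, state, currentTerm) are illustrative assumptions for this example rather than verbatim excerpts from our published specifications.

```tla
\* Illustrative sketch; assumes EXTENDS FiniteSets (for Cardinality), a constant
\* set Server of all servers, and variables state and currentTerm.

\* The set of majority quorums of a given config (a set of servers).
Quorums(cfg) == {q \in SUBSET cfg : Cardinality(q) * 2 > Cardinality(cfg)}

\* Generalized single-node change condition: every majority quorum of the
\* old config intersects every majority quorum of the new config.
QuorumsOverlap(oldCfg, newCfg) ==
    \A q1 \in Quorums(oldCfg), q2 \in Quorums(newCfg) : q1 \cap q2 # {}

\* ElectionSafety: there can never be two distinct primaries in the same term.
ElectionSafety ==
    \A i, j \in Server :
        (state[i] = Primary /\ state[j] = Primary /\ currentTerm[i] = currentTerm[j])
            => i = j
```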
After adding this condition to our Reconfig action, TLC produced a violation trace for the updated protocol in a few seconds, and the bug it revealed was easy to understand. Essentially, single-node changes only guarantee safe quorum intersection between adjacent configurations, but a series of locally adjacent reconfigurations may lead to a globally unsafe situation: two configurations that are both active but violate the quorum overlap property. The trace demonstrates this concretely and ends in a violation of the ElectionSafety property, with two nodes acting as primary in the same term. Node n1 is safely elected in configuration {n1}, but two subsequent reconfigurations then move the system to {n1, n2, n3}, and n2 is elected in this configuration with a quorum of {n2, n3}, which has no intersection with the original quorum of config {n1}.

Our initial expectation was that adding the single-node change constraint would not be correct by itself, but it was reassuring to have the model checker confirm this with a counterexample in just a few seconds. This began to give us more confidence to iterate on a new protocol design, which we proceeded to develop over the next week or so, moving on to a deeper investigation of the protocol's safety requirements.

Config commitment rule

Adopting the single-node change condition is straightforward, as it only requires verifying new configurations in a pairwise, local manner. As we saw above, though, it is still problematic to move through arbitrary sequences of overlapping configurations, so we need to take extra care to avoid these problematic cases. Our first hunch was to add an explicit notion of "config commitment" within the protocol, similar to the commitment rules of Raft. That is, restrict a reconfiguration from taking place until some appropriate commitment conditions have been satisfied. Intuitively, this would place restrictions on how quickly a primary could execute reconfigurations; it would prevent a primary from moving to a new configuration before an older, non-overlapping configuration was, in a sense, "deactivated." One natural idea was to borrow concepts from Raft's log commitment rules, adapted to our logless, gossip-based setting. After a few iterations, we developed the following additional preconditions for the Reconfig action, sketched below:

- ConfigQuorumCheck: a quorum of nodes have the same config version as the primary executing the reconfig.
- TermQuorumCheck: a majority of nodes in the primary's config have reached the term of the primary or newer.
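A minimal sketch of these two preconditions follows, reusing the illustrative Quorums operator and variable names from the earlier sketches; the exact definitions in our published specifications differ in their details.

```tla
\* Illustrative sketches of the two additional Reconfig preconditions for a
\* primary p, reusing the Quorums operator defined above.

\* A quorum of the primary's config has installed the primary's config version.
ConfigQuorumCheck(p) ==
    \E q \in Quorums(config[p]) :
        \A n \in q : configVersion[n] = configVersion[p]

\* A quorum of the primary's config has reached the primary's term or newer.
TermQuorumCheck(p) ==
    \E q \in Quorums(config[p]) :
        \A n \in q : currentTerm[n] >= currentTerm[p]
```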
We modeled the protocol with these new ConfigQuorumCheck and TermQuorumCheck preconditions, and they were initially sufficient to rule out the counterexamples we had encountered previously. They were not yet fully general enough to ensure safety, though, as we will see below, where we work out a final solution for config commitment.

Oplog commitment rule

In addition to the "config commitment" idea, it is worth noting the relationship between the config and the oplog that arises from our divergence from Raft. Raft sequences reconfigurations among other oplog entries, thereby establishing a strong implicit ordering between them. Since the gossip-based reconfig protocol does not include the configuration as part of the oplog, however, there may be implicit dependencies between oplog entries and configurations that are not accounted for.

We had started to think about this interaction between oplog entry commitment and reconfiguration, and conjectured a few problematic scenarios that we were able to confirm with the model checker. In a simplified error trace illustrating the problem, config C3 = {n1, n2, n3} (with version=3) is installed even though the entry <<1, 1>> (index, term), which was committed in a previous configuration, C1 = {n1}, has not yet been committed in the current configuration, C2 = {n1, n2}. Since quorums may not overlap across non-adjacent configurations (e.g., C1 and C3), ensuring that writes committed in a previous configuration are also committed in the current configuration lets us "propagate" the durability guarantee of earlier configurations into the future. As a result, we need to explicitly check this property when accepting reconfiguration commands. The rules for accepting a new configuration therefore include an additional, newly developed precondition, sketched below.
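The following is a sketch of what such an oplog commitment precondition might look like. The set committedEntries, the log variable, and the InLog helper are assumptions made for this illustration; they are not the exact definitions used in our published specifications.

```tla
\* Illustrative sketch: before a primary p may execute a reconfiguration, every
\* oplog entry committed in earlier configs must also be present on (and thus
\* committed by) a majority quorum of p's current config.
OplogCommitment(p) ==
    \A entry \in committedEntries :
        \E q \in Quorums(config[p]) :
            \A n \in q : InLog(entry, log[n])
```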
This rule ensures that durable, replicated log entries from older configs are carried forward to newer configs, which must be upheld for safe protocol operation over time. This requirement is implicit in Raft reconfiguration due to the tight coupling of reconfigurations and the main operation log, but it must be handled explicitly here due to the decoupled design.

The config as logless state machine

We were now confident that we had established strong rules to guarantee local quorum overlap, the proper sequential ordering of configs, and the appropriate transfer of oplog entries between configs. After re-checking our model with these new preconditions, though, the model checker discovered a new counterexample after running for several hours on a larger workstation. In a simplified version of this error trace, node n1 executes a reconfig to Ca = {n1, n2, n3} but has not yet propagated it to any other node. Then, n2 becomes the primary and reconfigures to config Cb = {n1, n2, n4}. Node n1 can then be elected in term 3 with quorum {n1, n3}, and n2 can be elected in term 3 with quorum {n2, n4}, violating the ElectionSafety property.

The problem in this trace is that when n2 moved to a new config, it should have ensured that, in the future, no leader could ever be elected in an "earlier" config. It failed to do so, and in the last step a quorum was still able to form in a config with version 2, leading to two active, non-overlapping quorums. The key observation is that divergence between configs in different terms causes the issue. That is, config commitment as defined above was sufficient for a sequence of reconfigs by a single leader, but not for concurrent leaders in competing terms.

Figure 1. Concurrent configurations with non-intersecting majority quorums.

After going through these counterexamples, we understood the problem more clearly and had a path to refine our correctness argument. We realized that agreeing on the configuration among nodes can be viewed as a separate consensus problem, distinct from the oplog consensus but governed by similar rules. In our system, the config itself can be viewed as a compacted (i.e., rolled-up) replicated state machine (RSM) that does not require a log (i.e., it is "logless"), since explicit maintenance of config history isn't needed and only the latest config takes effect. Propagating the config via heartbeats can be viewed as "appending" to the config log (as in Raft), and rolling back a config is never explicitly required; we always simply install a more up-to-date config. This config RSM already shares many similarities with the oplog RSM, such as term propagation.

The similarity suggests that using the config version alone to identify a config is not sufficient. Viewing the config as its own RSM, we need to assign the primary's term to configs. The config term is then a separate property of the config, similar to how the term is part of every oplog entry. Thus, a config should be identified and ordered by the tuple (configVersion, configTerm), analogous to how an oplog entry is identified and ordered by its (timestamp, term), with terms compared first, followed by the timestamp/version, as sketched below. The elections of these two consensus protocols can then be merged by adding a new rule: a voter checks whether the candidate's config is stale, in addition to its other checks. Moreover, we can borrow the definition of "commitment" from the oplog RSM for the config RSM. That is, when a config has been propagated to a majority of nodes in the primary's term, the config is committed. It also became clear that the config RSM only moves ahead through committed configs sequentially: it can choose the next config and commit it only if its current one is committed.
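As an illustration of this ordering, the comparison of configs can be written as follows; the record field names version and term are assumptions for this example.

```tla
\* Illustrative sketch: a config is identified by a (version, term) pair and
\* ordered with the term compared first, then the version. The gossip rule
\* (SendConfig) then uses this ordering rather than comparing versions alone.
NewerConfig(c1, c2) ==
    \/ c1.term > c2.term
    \/ /\ c1.term = c2.term
       /\ c1.version > c2.version
```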
Putting it all together

Our final protocol specification included all of the above preconditions and features, producing a version of the protocol that we refer to as safe, logless dynamic reconfiguration. We conducted final model checking runs of over 20 hours for several cases, exploring over 800 million protocol states with configurations of four and five servers, along with pen-and-paper arguments for the correctness of the final result.

At a high level, we can understand dynamic reconfiguration protocols like this one as needing to deal with two core conceptual aspects: (1) config deactivation and (2) state transfer. Our various config commitment rules combine to address the first, ensuring that configs that diverge over time cannot both be concurrently active. The second relates to the fact that various types of replicated, durable state within a configuration must be appropriately transferred to newer configurations. This is what the oplog commitment rule addresses, along with the rules ensuring that term state propagates appropriately between configurations.

Once we had the abstract protocol ironed out and had gained confidence in its correctness, we moved swiftly to implementation, completing it in the MongoDB replication system over the course of a few months. The protocol has been running reliably in production for several years since its introduction, and both the protocol and its implementation were significantly simpler than our original design alternatives.

Takeaways

Overall, we had a draft protocol within one week, and within two weeks we had finalized the protocol and passed correctness checks with the model checker. It was motivating to see our vague ideas turn into something tangible, and the successful outcome of this design phase gave us the confidence to move forward to the implementation phase. Model checking is an excellent tool for rapidly and precisely answering "what if" design questions.

Our efforts also emphasized an important feature of lightweight, design-level formal methods: they are about more than simply ensuring the correctness of a system design. They enable the exploration of protocol optimizations at a level of aggressiveness and velocity that would typically be infeasible with manual design methods. From this perspective, formal methods tools are not only a means of improving the correctness of our systems and protocols, but also a means of efficiently exploring the optimization design space while maintaining a high correctness bar. This also speaks to the value of investing some time upfront in models of key protocols that are highly critical and may need to evolve over time.

Due to our novel protocol design, the scope of the implementation changes also became much smaller. We delivered the project in three months with three to four developers, and "force reconfig" was implemented using the same mechanism with relaxed rules. Version upgrade/downgrade only involves a small on-disk format change to the config, avoiding any switch between two different reconfig approaches. Our approach also provided potential performance improvements: the decoupled reconfiguration design can bypass the oplog to recover the system when the oplog becomes the bottleneck. Similar ideas have since been explored in other recent reconfiguration protocols, such as Matchmaker Paxos.

Since its introduction in MongoDB 4.4, the new, logless reconfiguration protocol has proven to be reliable and has served as a solid building block for other features, such as automatically giving newly added nodes votes only after they complete initial sync. No significant protocol bugs have been discovered since its deployment, a testament to the value of these rigorous protocol design techniques.

While this article focused on the intuition behind the new protocol and our experience leveraging model checking, our paper, published in OPODIS 2021, includes a much more detailed description of the reconfiguration protocol, and a formal safety proof has also been published. The final versions of the specifications we developed and published can be found in this GitHub repository, along with some of the original specs we used in the MongoDB repository.
July 2, 2025