Rapid Prototyping a Safe, Logless Reconfiguration Protocol for MongoDB with TLA+
MongoDB provides high availability and fault tolerance using replica sets, which are groups of database servers that operate a Raft-like consensus protocol. Each database write operation is replicated in a sequential log (the oplog) and applied to all replicas. The consensus protocol guarantees that once an oplog entry is committed on a majority of replica set nodes, the write will be durable even if some nodes fail. Over time, however, we may need to change the set of servers operating within a replica set, for example to remove or replace failed nodes, a problem known as dynamic reconfiguration. Reconfiguration is a critical operation for dynamically expanding a cluster or replacing unhealthy nodes, so its correctness is crucial both for customer confidence in these operations and for the overall reliability of a replica set or sharded cluster.

In 2019, we needed to implement a new, safe reconfiguration protocol with rigorous correctness guarantees. At the time, the MongoDB replication system had an existing, legacy reconfiguration mechanism, but it had several known correctness bugs that necessitated a new protocol design. Although the existing protocol had correctness issues, it also had some attractive design characteristics. In particular, it decoupled reconfigurations from the main database operation log and employed a logless design, storing configurations as single objects and replicating them between nodes in a gossip-based manner. Therefore, our design goal was to develop a new, safe reconfiguration protocol while minimizing changes to this existing, legacy gossip-based protocol.

We knew that dynamic reconfiguration protocols are notoriously difficult to design correctly, so we needed a design approach that would allow us to proceed efficiently and with high confidence. With the help of formal specification and model checking tools, specifically TLA+ and its model checker, TLC, we were able to rapidly develop the design of a new, safe, logless reconfiguration protocol in just a couple of weeks, and to implement it in production in a few months.

In this post, we discuss our process of formally modeling the legacy reconfiguration protocol in TLA+, characterizing its bugs with a model checker, and iteratively developing modifications that led to a safe, logless reconfiguration protocol design.

There were a few key, high-level takeaways from our process. Most notably, rigorous, formal modeling didn't slow us down; instead, it accelerated design and delivery timelines while maintaining a high correctness bar. It also led to a simpler protocol design, allowing us to maintain a unified reconfiguration engine rather than two parallel protocols, which would have been prone to unexpected interactions and added maintenance burden. The new protocol also provided novel performance benefits over standard reconfiguration approaches, due to the decoupling of reconfigurations from the main database log.

Background and motivation

The original MongoDB replication system used a legacy, gossip-based reconfiguration protocol that was fully decoupled from the main oplog. Each configuration was identified by a numeric, monotonically increasing config version, and nodes in a replica set learned the latest config from each other via periodic heartbeat messages. When a node learned of a config with a higher version, it immediately installed that config and began operating with it.
We refer to this original protocol design as logless, since it stored each configuration as a single object and propagated configurations in a gossip-based manner, with no use of a sequential log for recording and replicating reconfiguration operations. This protocol also had a "force reconfig" feature, allowing users to install a new configuration even if a majority of nodes were offline. While the legacy protocol performed well in most scenarios, it was known to be unsafe in certain cases. Moreover, we expected reconfiguration to become a more common operation in MongoDB, necessitating the development of a new, safe reconfiguration protocol.

Initially, we considered Raft's existing reconfiguration protocols, including its single-node reconfiguration protocol, which restricts reconfigurations to adding or removing a single server. The standard Raft approach, however, was ultimately deemed incompatible with "force reconfig," and it would have required maintaining both a new, log-based implementation and the legacy, gossip-based one. It would also have been complicated to ensure the two protocols didn't interfere with each other. Instead, we hoped to develop a new protocol that minimized changes to the existing legacy protocol, simplifying both design and implementation. Ideally, we would be able to adapt ideas from Raft's single-node reconfiguration protocol to our gossip-based, legacy reconfig protocol, which would allow for better compatibility with "force" reconfig, would be easier to upgrade and downgrade, and would eliminate the need for a new oplog entry format for reconfigurations.

This idea of developing a safe, logless reconfiguration protocol seemed promising, as it would eliminate the need to mix two protocols and would allow the same basic mechanism to serve both normal and force reconfigurations. We needed, however, to be very confident in the correctness of such an approach, which was difficult to achieve manually within a short design time frame. When we first pitched this idea early in the design process, it was unclear whether such a solution was possible and whether it could be implemented safely in production. There was some existing work on decoupling reconfigurations and on logless consensus, but none that directly applied to a Raft-based consensus system such as ours. Furthermore, the discovery of a critical safety bug in one of Raft's reconfiguration protocols highlighted how challenging designing or modifying reconfiguration protocols for consensus systems can be. That bug was only discovered over a year after Raft's initial publication and required subtle protocol modifications to address.

Around that time, in 2019, MongoDB's replication team had already had some success with TLA+ and model checking on similar protocol design problems. Encouraged by these experiences, we set off to employ TLA+ and its model checker, TLC, to rapidly iterate on a candidate design and to develop a safe, logless reconfiguration protocol that was simpler, easier to implement, and provided novel performance benefits.

Modeling the legacy protocol

We were focused on developing a reconfiguration protocol that minimized design changes to the existing system, so we started by developing a TLA+ specification of the legacy reconfiguration protocol. This allowed us to characterize the flaws in the legacy protocol precisely and guided us towards the modifications needed to make it safe.

To model the legacy, gossip-based protocol, we started from an existing TLA+ specification we had developed for an abstract version of the MongoDB replication protocol, one that did not include any reconfiguration behavior. We extended this specification with two key reconfiguration-related actions: a Reconfig action, which represents the installation of a new config on a primary node, and a SendConfig action, which gossips a config with a higher config version from one node to another.
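The following is a minimal sketch of what these two actions might look like. The variable and constant names (state, config, configVersion, Primary) are assumptions made for this illustration rather than excerpts from our published specifications, and UNCHANGED clauses for unrelated variables are omitted.

```tla
\* Illustrative sketch of the legacy protocol's reconfiguration actions.
\* Assumes variables state, config, configVersion and a constant Primary.

\* A primary node installs an arbitrary new config and bumps the config version.
Reconfig(i, newConfig) ==
    /\ state[i] = Primary
    /\ config' = [config EXCEPT ![i] = newConfig]
    /\ configVersion' = [configVersion EXCEPT ![i] = configVersion[i] + 1]

\* Gossip: node j adopts node i's config if i's config version is newer.
SendConfig(i, j) ==
    /\ configVersion[i] > configVersion[j]
    /\ config' = [config EXCEPT ![j] = config[i]]
    /\ configVersion' = [configVersion EXCEPT ![j] = configVersion[i]]
```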
This model also defines the high-level safety properties of the protocol. The fundamental external guarantee is that when a majority write is committed on a replica set, the write will remain durable as long as a majority of nodes are alive. This guarantee is largely captured in the LeaderCompleteness property, which states that any new leader in a higher term must contain all log entries committed in earlier terms. Along with this, we also include a lower-level correctness property of Raft-based systems, ElectionSafety, which states that there can never be two primaries in the same term.

Iteratively strengthening our reconfiguration rules

Our legacy protocol model and its underlying correctness properties served as the starting point for a series of experiments, guided by the model checker, that iteratively led us towards a safe protocol design. We explored a series of design candidates, incrementally analyzing and refining our design in response to counterexamples discovered by the model checker.

Single node changes

One of the fundamental challenges of dynamic reconfiguration is that the notion of "quorum" (i.e., majority) changes when the set of servers operating the protocol changes. For example, consider a reconfiguration that expands the protocol's set of servers from C1 = {n1, n2, n3} to C2 = {n1, n2, n3, n4, n5}. An operation contacting a quorum in C1 may (correctly) contact the servers Q1 = {n1, n2}, while a valid quorum in C2 may be Q2 = {n3, n4, n5}. This is problematic, since Q1 and Q2 do not intersect, and quorum intersection is a key property of standard Raft (and most other practical consensus protocols). Raft's single-node approach partially addresses this by restricting configuration changes to those that add or remove a single node, which enforces overlapping quorums between such configurations.

So we started with a basic initial question: does enforcing single-node changes address the safety issues of the legacy protocol? We expected this would not be a fully sufficient condition for safety, but it was a stepping stone towards safer protocol revisions, and we wanted to confirm each of our hypotheses along the way.

We introduced the single-node change rule as a precondition of the Reconfig action, ensuring that any majority of nodes in the old config and any majority of nodes in the new config share at least one common node. In our specification, we employed a slightly generalized definition of this property, which allows reconfiguration between any two configs whose majority quorums overlap, even if the change is not strictly a single-node addition or removal (e.g., all majority quorums of C1 = {n1, n2} and C2 = {n1, n2, n3, n4} intersect, but you cannot move from one to the other via a single addition or removal). One of the benefits of specifying the protocol in a high-level, mathematical specification language like TLA+ is that it enables concise definition of these kinds of properties, as seen below.
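The sketch below shows how the quorum overlap condition, along with the ElectionSafety property mentioned above, can each be written in a few lines. The operator and variable names (Server, state, currentTerm) are illustrative assumptions for this example rather than verbatim excerpts from our published specifications.

```tla
\* Illustrative sketch; assumes EXTENDS FiniteSets (for Cardinality), a constant
\* set Server of all servers, and variables state and currentTerm.

\* The set of majority quorums of a given config (a set of servers).
Quorums(cfg) == {q \in SUBSET cfg : Cardinality(q) * 2 > Cardinality(cfg)}

\* Generalized single-node change condition: every majority quorum of the
\* old config intersects every majority quorum of the new config.
QuorumsOverlap(oldCfg, newCfg) ==
    \A q1 \in Quorums(oldCfg), q2 \in Quorums(newCfg) : q1 \cap q2 # {}

\* ElectionSafety: there can never be two distinct primaries in the same term.
ElectionSafety ==
    \A i, j \in Server :
        (state[i] = Primary /\ state[j] = Primary /\ currentTerm[i] = currentTerm[j])
            => i = j
```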
After adding this condition to our Reconfig action, TLC produced a violation trace for the updated protocol in a few seconds, and the bug it revealed was easy to understand. Essentially, single-node changes only guarantee safe quorum intersection between adjacent configurations, but a series of locally adjacent reconfigurations may lead to a globally unsafe situation: two configurations that are both active but violate the quorum overlap property. The trace demonstrates this concretely and ends in a violation of the ElectionSafety property, with two nodes acting as primary in the same term. Node n1 is safely elected in configuration {n1}, but two subsequent reconfigurations then move the system to {n1, n2, n3}, and n2 is elected in this configuration with a quorum of {n2, n3}, which has no intersection with the original quorum of config {n1}.

Our initial expectation was that adding the single-node change constraint would not be correct by itself, but it was reassuring to have the model checker confirm this with a counterexample in just a few seconds. This began to give us more confidence to iterate on a new protocol design, which we proceeded to develop over the next week or so, moving on to a deeper investigation of the protocol's safety requirements.

Config commitment rule

Adopting the single-node change condition is straightforward, as it only requires verifying new configurations in a pairwise, local manner. As we saw above, though, it is still problematic to move through arbitrary sequences of overlapping configurations, so we need to take extra care to avoid these problematic cases. Our first hunch was to add an explicit notion of "config commitment" within the protocol, similar to the commitment rules of Raft. That is, restrict a reconfiguration from taking place until some appropriate commitment conditions have been satisfied. Intuitively, this would place restrictions on how quickly a primary could execute reconfigurations; it would prevent a primary from moving to a new configuration before an older, non-overlapping configuration was, in a sense, "deactivated." One natural idea was to borrow concepts from Raft's log commitment rules, adapted to our logless, gossip-based setting. After a few iterations, we developed the following additional preconditions for the Reconfig action, sketched below:

- ConfigQuorumCheck: a quorum of nodes have the same config version as the primary executing the reconfig.
- TermQuorumCheck: a majority of nodes in the primary's config have reached the term of the primary or newer.
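A minimal sketch of these two preconditions follows, reusing the illustrative Quorums operator and variable names from the earlier sketches; the exact definitions in our published specifications differ in their details.

```tla
\* Illustrative sketches of the two additional Reconfig preconditions for a
\* primary p, reusing the Quorums operator defined above.

\* A quorum of the primary's config has installed the primary's config version.
ConfigQuorumCheck(p) ==
    \E q \in Quorums(config[p]) :
        \A n \in q : configVersion[n] = configVersion[p]

\* A quorum of the primary's config has reached the primary's term or newer.
TermQuorumCheck(p) ==
    \E q \in Quorums(config[p]) :
        \A n \in q : currentTerm[n] >= currentTerm[p]
```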
We modeled the protocol with these new ConfigQuorumCheck and TermQuorumCheck preconditions, and they were initially sufficient to rule out the counterexamples we had encountered previously. They were not yet fully general enough to ensure safety, though, as we will see below, where we work out a final solution for config commitment.

Oplog commitment rule

In addition to the "config commitment" idea, it is worth noting the relationship between the config and the oplog that arises from our divergence from Raft. Raft sequences reconfigurations among other oplog entries, thereby establishing a strong implicit ordering between them. Since the gossip-based reconfig protocol does not include the configuration as part of the oplog, however, there may be implicit dependencies between oplog entries and configurations that are not accounted for.

We had started to think about this interaction between oplog entry commitment and reconfiguration, and conjectured a few problematic scenarios that we were able to confirm with the model checker. In a simplified error trace illustrating the problem, config C3 = {n1, n2, n3} (with version=3) is installed even though the entry <<1, 1>> (index, term), which was committed in a previous configuration, C1 = {n1}, has not yet been committed in the current configuration, C2 = {n1, n2}. Since quorums may not overlap across non-adjacent configurations (e.g., C1 and C3), ensuring that writes committed in a previous configuration are also committed in the current configuration lets us "propagate" the durability guarantee of earlier configurations into the future. As a result, we need to explicitly check this property when accepting reconfiguration commands. The rules for accepting a new configuration therefore include an additional, newly developed precondition, sketched below.
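The following is a sketch of what such an oplog commitment precondition might look like. The set committedEntries, the log variable, and the InLog helper are assumptions made for this illustration; they are not the exact definitions used in our published specifications.

```tla
\* Illustrative sketch: before a primary p may execute a reconfiguration, every
\* oplog entry committed in earlier configs must also be present on (and thus
\* committed by) a majority quorum of p's current config.
OplogCommitment(p) ==
    \A entry \in committedEntries :
        \E q \in Quorums(config[p]) :
            \A n \in q : InLog(entry, log[n])
```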
This rule ensures that durable, replicated log entries from older configs are carried forward to newer configs, which must be upheld for safe protocol operation over time. This requirement is implicit in Raft reconfiguration due to the tight coupling of reconfigurations and the main operation log, but it must be handled explicitly here due to the decoupled design.

The config as logless state machine

We were now confident that we had established strong rules to guarantee local quorum overlap, the proper sequential ordering of configs, and the appropriate transfer of oplog entries between configs. After re-checking our model with these new preconditions, though, the model checker discovered a new counterexample after running for several hours on a larger workstation. In a simplified version of this error trace, node n1 executes a reconfig to Ca = {n1, n2, n3} but has not yet propagated it to any other node. Then, n2 becomes the primary and reconfigures to config Cb = {n1, n2, n4}. Node n1 can then be elected in term 3 with quorum {n1, n3}, and n2 can be elected in term 3 with quorum {n2, n4}, violating the ElectionSafety property.

The problem in this trace is that when n2 moved to a new config, it should have ensured that, in the future, no leader could ever be elected in an "earlier" config. It failed to do so, and in the last step a quorum was still able to form in a config with version 2, leading to two active, non-overlapping quorums. The key observation is that divergence between configs in different terms causes the issue. That is, config commitment as defined above was sufficient for a sequence of reconfigs by a single leader, but not for concurrent leaders in competing terms.

Figure 1. Concurrent configurations with non-intersecting majority quorums.

After going through these counterexamples, we understood the problem more clearly and had a path to refine our correctness argument. We realized that agreeing on the configuration among nodes can be viewed as a separate consensus problem, distinct from the oplog consensus but governed by similar rules. In our system, the config itself can be viewed as a compacted (i.e., rolled-up) replicated state machine (RSM) that does not require a log (i.e., it is "logless"), since explicit maintenance of config history isn't needed and only the latest config takes effect. Propagating the config via heartbeats can be viewed as "appending" to the config log (as in Raft), and rolling back a config is never explicitly required; we always simply install a more up-to-date config. This config RSM already shares many similarities with the oplog RSM, such as term propagation.

The similarity suggests that using the config version alone to identify a config is not sufficient. Viewing the config as its own RSM, we need to assign the primary's term to configs. The config term is then a separate property of the config, similar to how the term is part of every oplog entry. Thus, a config should be identified and ordered by the tuple (configVersion, configTerm), analogous to how an oplog entry is identified and ordered by its (timestamp, term), with terms compared first, followed by the timestamp/version, as sketched below. The elections of these two consensus protocols can then be merged by adding a new rule: a voter checks whether the candidate's config is stale, in addition to its other checks. Moreover, we can borrow the definition of "commitment" from the oplog RSM for the config RSM. That is, when a config has been propagated to a majority of nodes in the primary's term, the config is committed. It also became clear that the config RSM only moves ahead through committed configs sequentially: it can choose the next config and commit it only if its current one is committed.
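As an illustration of this ordering, the comparison of configs can be written as follows; the record field names version and term are assumptions for this example.

```tla
\* Illustrative sketch: a config is identified by a (version, term) pair and
\* ordered with the term compared first, then the version. The gossip rule
\* (SendConfig) then uses this ordering rather than comparing versions alone.
NewerConfig(c1, c2) ==
    \/ c1.term > c2.term
    \/ /\ c1.term = c2.term
       /\ c1.version > c2.version
```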
Putting it all together

Our final protocol specification included all of the above preconditions and features, producing a version of the protocol that we refer to as safe, logless dynamic reconfiguration. We conducted final model checking runs of over 20 hours for several cases, exploring over 800 million protocol states with configurations of four and five servers, along with pen-and-paper arguments for the correctness of the final result.

At a high level, we can understand dynamic reconfiguration protocols like this one as needing to deal with two core conceptual aspects: (1) config deactivation and (2) state transfer. Our various config commitment rules combine to address the first, ensuring that configs that diverge over time cannot both be concurrently active. The second relates to the fact that various types of replicated, durable state within a configuration must be appropriately transferred to newer configurations. This is what the oplog commitment rule addresses, along with the rules ensuring that term state propagates appropriately between configurations.

Once we had the abstract protocol ironed out and had gained confidence in its correctness, we moved swiftly to implementation, completing it in the MongoDB replication system over the course of a few months. The protocol has been running reliably in production for several years since its introduction, and both the protocol and its implementation were significantly simpler than our original design alternatives.

Takeaways

Overall, we had a draft protocol within one week, and within two weeks we had finalized the protocol and passed correctness checks with the model checker. It was motivating to see our vague ideas turn into something tangible, and the successful outcome of this design phase gave us the confidence to move forward to the implementation phase. Model checking is an excellent tool for rapidly and precisely answering "what if" design questions.

Our efforts also emphasized an important feature of lightweight, design-level formal methods: they are about more than simply ensuring the correctness of a system design. They enable the exploration of protocol optimizations at a level of aggressiveness and velocity that would typically be infeasible with manual design methods. From this perspective, formal methods tools are not only a means of improving the correctness of our systems and protocols, but also a means of efficiently exploring the optimization design space while maintaining a high correctness bar. This also speaks to the value of investing some time upfront in models of key protocols that are highly critical and may need to evolve over time.

Due to our novel protocol design, the scope of the implementation changes also became much smaller. We delivered the project in three months with three to four developers, and "force reconfig" was implemented using the same mechanism with relaxed rules. Version upgrade/downgrade only involves a small on-disk format change to the config, avoiding any switch between two different reconfig approaches. Our approach also provided potential performance improvements: the decoupled reconfiguration design can bypass the oplog to recover the system when the oplog becomes the bottleneck. Similar ideas have since been explored in other recent reconfiguration protocols, such as Matchmaker Paxos.

Since its introduction in MongoDB 4.4, the new, logless reconfiguration protocol has proven to be reliable and has served as a solid building block for other features, such as automatically giving newly added nodes votes only after they complete initial sync. No significant protocol bugs have been discovered since its deployment, a testament to the value of these rigorous protocol design techniques.

While this article focused on the intuition behind the new protocol and our experience leveraging model checking, our paper, published in OPODIS 2021, includes a much more detailed description of the reconfiguration protocol, and a formal safety proof has also been published. The final versions of the specifications we developed and published can be found in this GitHub repository, along with some of the original specs we used in the MongoDB repository.
July 2, 2025