I have seen a lot of developers start with .find() when working with MongoDB and stick to it. That's not inherently bad, as its simplicity gives you a fluid head start, but sticking to it as your app grows is what I am not a fan of.
Before jumping in, let's get one thing out of the way: find() is generally faster than aggregate(). If you are running a simple query on a collection with a limited number of documents, go with find(). For example (with mongoose, a JavaScript library that works as an ODM for MongoDB in Node):
```js
// .find()
const findResults = await BookModel.find({ author: "John Doe" });

// .aggregate()
const aggrResults = await BookModel.aggregate([
  { $match: { author: "John Doe" } },
]);
```
If you run the above code, you'll see a noticeable time difference (depending on collection size), with the aggregation taking longer than the find.
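If you want to see this for yourself, here is a minimal (and unscientific) timing sketch, using the same BookModel as the examples here, inside an async context:

```js
// Rough timing comparison of the two queries above
console.time("find");
await BookModel.find({ author: "John Doe" });
console.timeEnd("find");

console.time("aggregate");
await BookModel.aggregate([{ $match: { author: "John Doe" } }]);
console.timeEnd("aggregate");
```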
But that is not the reason I am writing this article, obviously. The problem with find() shows up when we want to post-process the data. For example, after finding the results, you want to group them by some field. Let's first weave an example:
Data
[ {"name": "The Silent Patient", "author": "Alex Michaelides", "rating": 4.3}, {"name": "The Maidens", "author": "Alex Michaelides", "rating": 4.0}, {"name": "Ares", "author": "Alex Michaelides", "rating": 3.9}, {"name": "1984", "author": "George Orwell", "rating": 4.7}, {"name": "Animal Farm", "author": "George Orwell", "rating": 4.5}, {"name": "Homage to Catalonia", "author": "George Orwell", "rating": 4.2}, {"name": "Norwegian Wood", "author": "Haruki Murakami", "rating": 4.4}, {"name": "Kafka on the Shore", "author": "Haruki Murakami", "rating": 4.6}, {"name": "1Q84", "author": "Haruki Murakami", "rating": 4.1}, {"name": "Killing Commendatore", "author": "Haruki Murakami", "rating": 4.0}, {"name": "Down and Out in Paris and London", "author": "George Orwell", "rating": 4.3}, {"name": "Hard-Boiled Wonderland and the End of the World", "author": "Haruki Murakami", "rating": 4.2} ]
With .find()
```js
// .find()
const findResults = await BookModel.find();

const byAuthor = findResults.reduce(
  (acc, r) => ({
    ...acc,
    [r.author]: [...(acc[r.author] ?? []), r],
  }),
  {},
);

const byRating = findResults.reduce(
  (acc, r) =>
    r.rating >= 4
      ? { ...acc, highRated: [...acc.highRated, r] }
      : r.rating <= 2
        ? { ...acc, lowRated: [...acc.lowRated, r] }
        : { ...acc, midRated: [...acc.midRated, r] },
  { highRated: [], lowRated: [], midRated: [] },
);

console.log({ byAuthor, byRating });
```
With .aggregate()
```js
// .aggregate()
const aggrResults = await BookModel.aggregate([
  {
    $facet: {
      byAuthor: [
        {
          $group: {
            _id: "$author",
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byAuthor: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byAuthor: { $arrayToObject: "$byAuthor" },
          },
        },
      ],
      byRating: [
        {
          $group: {
            _id: {
              $switch: {
                branches: [
                  { case: { $gte: ["$rating", 4] }, then: "highRated" },
                  { case: { $lte: ["$rating", 2] }, then: "lowRated" },
                ],
                default: "midRated",
              },
            },
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byRating: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byRating: { $arrayToObject: "$byRating" },
          },
        },
      ],
    },
  },
]);

// each $facet sub-pipeline yields a single-element array
const [{ byAuthor, byRating }] = aggrResults; // assuming everything went right
console.log({ byAuthor: byAuthor[0].byAuthor, byRating: byRating[0].byRating });
```
📝 I am not including the results as they can be guessed.
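(If you'd rather not guess: given the data above, both versions end up with roughly this shape, books elided.)

```js
// Approximate shape of the grouped output for the sample data
const expectedShape = {
  byAuthor: {
    "Alex Michaelides": [/* 3 books */],
    "George Orwell": [/* 4 books */],
    "Haruki Murakami": [/* 5 books */],
  },
  byRating: {
    highRated: [/* ratings >= 4 */],
    midRated: [/* everything in between */],
    lowRated: [/* ratings <= 2; empty for this dataset */],
  },
};
```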
So, I gave an example that makes aggregation look complicated. Why go to such lengths when you can just find() it and go about your day? Well, as I said in the intro, for simple use cases like this one, with a dozen documents in your collection, always find() it.
Let's complicate the example a bit: now you have an author collection, and you want to bring in data from there as well. If you are used to mongoose, you may recall a nifty feature called .populate().
So our example becomes:
```js
const findResults = await BookModel.find().limit(1000).populate("author");
// ... all the post-processing stuff
```
Day 1: All good. Never better.
Day 2: Now you have a million documents.
What changes is that you make an API call, binge-watch One Piece, and then come back to meet a loading screen.
Let me explain why: mongoose's populate() is not a real join. After the initial find, it issues separate follow-up queries to fetch the referenced authors by id, then stitches them onto your 1000 books in application memory. So your one .find().populate("author") means extra round trips plus an in-memory join, and that's before the post-processing you are going to apply on top. And God forbid you have more relational fields you want to populate, because each one adds its own queries.
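Conceptually, it amounts to something like the sketch below. This is not mongoose's exact internals, just the extra work made visible, and AuthorModel is an assumed model over the authors collection:

```js
// Roughly what .find().populate("author") boils down to:
const books = await BookModel.find().limit(1000); // query 1
const authorIds = books.map((b) => b.author);
const authors = await AuthorModel.find({
  _id: { $in: authorIds }, // query 2 (one more per populated path)
});
// ...then each book's author field is matched up in application memory
```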
What shines here is aggregate().
We run:
```js
const aggrResults = await BookModel.aggregate([
  { $limit: 1000 },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id", // note: no "$" prefix here
      as: "author",
    },
  },
  // ... rest of your pipeline
]);
```
Now a single query runs, and it is faster than .find().populate("author"). And we don't have to post-process the data ourselves. We are served on a silver platter and then fed with a silver spoon.
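One gotcha worth flagging: $lookup always puts the matches in an array, even when each book has exactly one author, so you will usually want an $unwind stage right after it. A sketch:

```js
const aggrResults = await BookModel.aggregate([
  { $limit: 1000 },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id",
      as: "author",
    },
  },
  // $lookup yields author as an array; flatten it to a single object
  { $unwind: "$author" },
  // ... rest of your pipeline
]);
```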
And this is just me touching the surface. Quoting Sir Isaac Newton, "What we know is a drop, what we don't know is an ocean." (from Dark, ofc)
But before you jump into aggregate()-ing everything, let me tell you about the caveats:
- You have to be extremely careful with the limits of your MongoDB server. Aggregation, while handy, requires more resources to run than find.
- If you are using managed MongoDB instances, like Atlas, you have to be careful about pricing, as costs spike when you run resource-expensive queries.
- This extends point 1: if you have a locally set up server, you must keep in mind the memory allocated to the process/server, or your deployment will crash while processing large queries (see the sketch below for one escape hatch).
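On that memory point: MongoDB caps each aggregation stage at 100 MB of RAM by default, and heavy $group or $sort stages can hit that ceiling. Mongoose lets you opt into spilling to disk instead of erroring out; slower, but it finishes. A sketch:

```js
// Allow memory-hungry stages to spill to temporary files on disk
// instead of failing at the 100 MB per-stage limit
const aggrResults = await BookModel.aggregate([
  // ... your pipeline
]).allowDiskUse(true);
```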
With that out of the way, get to replacing your complex find()s with aggregate().
I will cover aggregation in more detail in my following write-ups.
📝 Thank you for reading till the end. I am open to criticism, constructive or otherwise, as this is my first blog. Please help me get better.