I have seen a lot of developers start with .find() when working with MongoDB and stick to it. That's not inherently bad, as its simplicity gives you a fluid head start, but sticking to it as your app grows is what I am not a fan of.
Before jumping in, let's get one thing out of the way: find() is generally faster than aggregate(). If you are running a simple query on a collection with a limited number of documents, go with find(). For example (with mongoose, a JavaScript library that works as an ODM for MongoDB in Node):
```js
// .find()
const findResults = await BookModel.find({ author: "John Doe" });

// .aggregate()
const aggrResults = await BookModel.aggregate([
  { $match: { author: "John Doe" } },
]);
```
If you run the above code, you'll see a noticeable time difference (depending on collection size), with the aggregation taking longer than the find.
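If you want to see this for yourself, here is a minimal (and unscientific) timing sketch, using the same BookModel as the examples here, inside an async context:

```js
// Rough timing comparison of the two queries above
console.time("find");
await BookModel.find({ author: "John Doe" });
console.timeEnd("find");

console.time("aggregate");
await BookModel.aggregate([{ $match: { author: "John Doe" } }]);
console.timeEnd("aggregate");
```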
But that is not the reason I am writing this article, obviously. The problem with find() shows up when we want to post-process the data. For example, after finding the results, you want to group them by some field. Let's first weave an example:
Data
[ {"name": "The Silent Patient", "author": "Alex Michaelides", "rating": 4.3}, {"name": "The Maidens", "author": "Alex Michaelides", "rating": 4.0}, {"name": "Ares", "author": "Alex Michaelides", "rating": 3.9}, {"name": "1984", "author": "George Orwell", "rating": 4.7}, {"name": "Animal Farm", "author": "George Orwell", "rating": 4.5}, {"name": "Homage to Catalonia", "author": "George Orwell", "rating": 4.2}, {"name": "Norwegian Wood", "author": "Haruki Murakami", "rating": 4.4}, {"name": "Kafka on the Shore", "author": "Haruki Murakami", "rating": 4.6}, {"name": "1Q84", "author": "Haruki Murakami", "rating": 4.1}, {"name": "Killing Commendatore", "author": "Haruki Murakami", "rating": 4.0}, {"name": "Down and Out in Paris and London", "author": "George Orwell", "rating": 4.3}, {"name": "Hard-Boiled Wonderland and the End of the World", "author": "Haruki Murakami", "rating": 4.2} ]
With .find()
```js
// .find()
const findResults = await BookModel.find();

const byAuthor = findResults.reduce(
  (acc, r) => ({
    ...acc,
    [r.author]: [...(acc[r.author] ?? []), r],
  }),
  {},
);

const byRating = findResults.reduce(
  (acc, r) =>
    r.rating >= 4
      ? { ...acc, highRated: [...acc.highRated, r] }
      : r.rating <= 2
        ? { ...acc, lowRated: [...acc.lowRated, r] }
        : { ...acc, midRated: [...acc.midRated, r] },
  { highRated: [], lowRated: [], midRated: [] },
);

console.log({ byAuthor, byRating });
```
With .aggregate()
```js
// .aggregate()
const aggrResults = await BookModel.aggregate([
  {
    $facet: {
      byAuthor: [
        {
          $group: {
            _id: "$author",
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byAuthor: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byAuthor: { $arrayToObject: "$byAuthor" },
          },
        },
      ],
      byRating: [
        {
          $group: {
            _id: {
              $switch: {
                branches: [
                  { case: { $gte: ["$rating", 4] }, then: "highRated" },
                  { case: { $lte: ["$rating", 2] }, then: "lowRated" },
                ],
                default: "midRated",
              },
            },
            books: { $push: "$$ROOT" },
          },
        },
        {
          $group: {
            _id: null,
            byRating: { $push: { k: "$_id", v: "$books" } },
          },
        },
        {
          $project: {
            _id: 0,
            byRating: { $arrayToObject: "$byRating" },
          },
        },
      ],
    },
  },
]);

// each $facet sub-pipeline yields a single-element array
const [{ byAuthor, byRating }] = aggrResults; // assuming everything went right
console.log({ byAuthor: byAuthor[0].byAuthor, byRating: byRating[0].byRating });
```
📝 I am not including the results as they can be guessed.
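(If you'd rather not guess: given the data above, both versions end up with roughly this shape, books elided.)

```js
// Approximate shape of the grouped output for the sample data
const expectedShape = {
  byAuthor: {
    "Alex Michaelides": [/* 3 books */],
    "George Orwell": [/* 4 books */],
    "Haruki Murakami": [/* 5 books */],
  },
  byRating: {
    highRated: [/* ratings >= 4 */],
    midRated: [/* everything in between */],
    lowRated: [/* ratings <= 2; empty for this dataset */],
  },
};
```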
So, I gave an example that makes aggregation look complicated. Why go to such lengths when you can just find() it and go about your day? Well, as I said in the intro, for simple use cases like this one, with a dozen documents in your collection, always find() it.
Let's complicate the example a bit: now you have an author collection, and you want to bring in data from there as well. If you are used to mongoose, you may recall a nifty feature called .populate().
So our example becomes:
```js
const findResults = await BookModel.find().limit(1000).populate("author");
// ... all the post-processing stuff
```
Day 1: All good. Never better.
Day 2: Now you have a million documents.
What changes is that you make an API call, binge-watch One Piece, and then come back to meet a loading screen.
Let me explain why: mongoose's populate() is not a real join. After the initial find, it issues separate follow-up queries to fetch the referenced authors by id, then stitches them onto your 1000 books in application memory. So your one .find().populate("author") means extra round trips plus an in-memory join, and that's before the post-processing you are going to apply on top. And God forbid you have more relational fields you want to populate, because each one adds its own queries.
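Conceptually, it amounts to something like the sketch below. This is not mongoose's exact internals, just the extra work made visible, and AuthorModel is an assumed model over the authors collection:

```js
// Roughly what .find().populate("author") boils down to:
const books = await BookModel.find().limit(1000); // query 1
const authorIds = books.map((b) => b.author);
const authors = await AuthorModel.find({
  _id: { $in: authorIds }, // query 2 (one more per populated path)
});
// ...then each book's author field is matched up in application memory
```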
What shines here is aggregate().
We run:
```js
const aggrResults = await BookModel.aggregate([
  { $limit: 1000 },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id", // note: no "$" prefix here
      as: "author",
    },
  },
  // ... rest of your pipeline
]);
```
Now a single query runs, and it is faster than .find().populate("author"). And we don't have to post-process the data ourselves. We are served on a silver platter and then fed with a silver spoon.
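One gotcha worth flagging: $lookup always puts the matches in an array, even when each book has exactly one author, so you will usually want an $unwind stage right after it. A sketch:

```js
const aggrResults = await BookModel.aggregate([
  { $limit: 1000 },
  {
    $lookup: {
      from: "authors",
      localField: "author",
      foreignField: "_id",
      as: "author",
    },
  },
  // $lookup yields author as an array; flatten it to a single object
  { $unwind: "$author" },
  // ... rest of your pipeline
]);
```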
And this is just me touching the surface. Quoting Sir Isaac Newton, "What we know is a drop, what we don't know is an ocean." (from Dark, ofc)
But before you jump into aggregate()-ing everything, let me tell you about the caveats:
- You have to be extremely careful with the limits of your MongoDB server. Aggregation, while handy, requires more resources to run than find.
- If you are using managed MongoDB instances, like Atlas, you have to be careful about pricing, as costs spike when you run resource-expensive queries.
- This extends point 1: if you have a locally set up server, you must keep in mind the memory allocated to the process/server, or your deployment will crash while processing large queries (see the sketch below for one escape hatch).
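On that memory point: MongoDB caps each aggregation stage at 100 MB of RAM by default, and heavy $group or $sort stages can hit that ceiling. Mongoose lets you opt into spilling to disk instead of erroring out; slower, but it finishes. A sketch:

```js
// Allow memory-hungry stages to spill to temporary files on disk
// instead of failing at the 100 MB per-stage limit
const aggrResults = await BookModel.aggregate([
  // ... your pipeline
]).allowDiskUse(true);
```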
With that out of the way, get to replacing your complex find()s with aggregate().
I will cover aggregation in more detail in my following write-ups.
📝 Thank you for reading till the end. I am open to criticism, constructive or otherwise, as this is my first blog. Please help me get better.