Aggregation Framework
Quick Overview of
Quick Overview of Document-oriented Schemaless JSON-style documents Rich Queries Scales Horizontally db.users.find({ last_name: 'Smith', age: {$gt : 10} }); SELECT * FROM users WHERE last_name=‘Smith’ AND age > 10;
Computing Aggregations in Databases SQL-based RDBMS JOIN GROUP BY AVG(), COUNT(), SUM(), FIRST(), LAST(), etc. MongoDB 2.0 MapReduce MongoDB 2.2+ MapReduce Aggregation Framework
MapReduce var map = function() { ... emit(key, val); } var reduce = function(key, vals) { ... return resultVal; } Data Map() emit(k,v) Sort(k) Group(k) Reduce(k,values) k,v Finalize(k,v) k,v MongoDB map iterates on documents Document is $this 1 at time per shard Input matches output Can run multiple times
What’s wrong with just using MapReduce? Map/Reduce is very powerful, but often overkill Lots of users relying on it for simple aggregation tasks • •
What’s wrong with just using MapReduce? Easy to screw up JavaScript Debugging a M/R job sucks Writing more JS for simple tasks should not be necessary • • • (ಠ︿ಠ)
Aggregation Framework Declarative (no need to write JS) Implemented directly in C++ Expression Evaluation Return computed values Framework: We can extend it with new ops • • • • •
Input Data (collection) Filter Project Unwind Group Sort Limit Result (document)
db.article.aggregate( { $project : {author : 1,tags : 1}}, { $unwind : "$tags" }, { $group : {_id : “$tags”, authors:{ $addToSet:"$author"}} } ); An aggregation command looks like:
db.article.aggregate( { $project : {author : 1, tags : 1}}, { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet:"$author"} }} ); New Helper Method: .aggregate() Operator pipeline db.runCommand({ aggregate : "article", pipeline : [ {$op1, $op2, ...} ] }
{ "result" : [ { "_id" : "art", "authors" : [ "bill", "bob" ] }, { "_id" : "sports", "authors" : [ "jane", "bob" ] }, { "_id" : "food", "authors" : [ "jane", "bob" ] }, { "_id" : "science", "authors" : [ "jane", "bill", "bob" ] } ], "ok" : 1 } Output Document Looks like this: result: array of pipeline output ok: 1 for success, 0 otherwise
Pipeline Input to the start of the pipeline is a collection Series of operators - each one filters or transforms its input Passes output data to next operator in the pipeline Output of the pipeline is the result document • • • • ps -ax | tee processes.txt | more Kind of like UNIX:
Let’s do: 1. Tour of the pipeline operators 2. A couple examples based on common SQL aggregation tasks $match $unwind $group $project $skip $limit $sort
filters documents from pipeline with a query predicate filtered with: {$match: {author:”bob”}} $match {author: "bob", pageViews:5, title:"Lorem Ipsum..."} {author: "bill", pageViews:3, title:"dolor sit amet..."} {author: "joe", pageViews:52, title:"consectetur adipi..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:14, title:"magna aliquam..."} {author: "bob", pageViews:53, title:"claritas est..."} filtered with: {$match: {pageViews:{$gt:50}} {author:"bob",pageViews:5,title:"Lorem Ipsum..."} {author:"bob",pageViews:14,title:"magna aliquam..."} {author:"bob",pageViews:53,title:"claritas est..."} {author: "joe", pageViews:52, title:"consectetur adipiscing..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:53, title:"claritas est..."} Input:
$unwind { "_id" : ObjectId("4f...146"), "author" : "bob", "tags" :[ "fun","good","awesome"] } explode the “tags” array with: { $unwind : ”$tags” } { _id : ObjectId("4f...146"), author : "bob", tags:"fun"}, { _id : ObjectId("4f...146"), author : "bob", tags:"good"}, { _id : ObjectId("4f...146"), author : "bob", tags:"awesome"} produces output: Produce a new document for each value in an input array
Bucket a subset of docs together, calculate an aggregated output doc from the bucket $sum $max, $min $avg $first, $last $addToSet $push db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } } ); $group Output Calculation Operators:
db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } } ); _id: selects a field to use as bucket key for grouping Output field name Operation used to calculate the output value ($sum, $max, $avg, etc.) $group (cont’d) dot notation (nested fields) a constant a multi-key expression inside {...} • • • also allowed here:
An example with $match and $group SELECT SUM(price) FROM orders WHERE customer_id = 4; MongoDB: SQL: db.orders.aggregate( {$match : {“$customer_id” : 4}}, {$group : { _id : null, total: {$sum : “price”}}) English: Find the sum of all prices of the orders placed by customer #4
An example with $unwind and $group MongoDB: SQL: English: db.posts.aggregate( { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }} ); For all tags used in blog posts, produce a list of authors that have posted under each tag SELECT tag, author FROM post_tags LEFT JOIN posts ON post_tags.post_id = posts.id GROUP BY tag, author;
More operators - Controlling Pipeline Input $skip $limit $sort Similar to: .skip() .limit() .sort() in a regular Mongo query
$sort specified the same way as index keys: { $sort : { name : 1, age: -1 } } Must be used in order to take advantage of $first/$last with $group. order input documents
$limit limit the number of input documents {$limit : 5} $skip skips over documents {$skip : 5}
$project Use for: Add, Remove, Pull up, Push down, Rename Fields Building computed fields Reshape a document
$project (cont’d) Include or exclude fields {$project : { title : 1, author : 1} } Only pass on fields “title” and “author” {$project : { comments : 0} Exclude “comments” field, keep everything else
Moving + Renaming fields {$project : { page_views : “$pageViews”, catName : “$category.name”, info : { published : “$ctime”, update : “$mtime” } } } Rename page_views to pageViews Take nested field “category.name”, move it into top-level field called “catName” Populate a new sub-document into the output $project (cont’d)
db.article.aggregate( { $project : { name : 1, age_fixed : { $add:["$age", 2] } }} ); Building a Computed Field Output (computed field) Operands Expression $project (cont’d)
Lots of Available Expressions $project (cont’d) Numeric $add $sub $mod $divide $multiply Logical $eq $lte/$lt $gte/$gt $and $not $or $eq Dates $dayOfMonth $dayOfYear $dayOfWeek $second $minute $hour $week $month $isoDate Strings $substr $add $toLower $toUpper $strcasecmp
Example: $sort → $limit → $project→ $group MongoDB: SQL: English: Of the most recent 1000 blog posts, how many were posted within each calendar year? SELECT YEAR(pub_time) as pub_year, COUNT(*) FROM (SELECT pub_time FROM posts ORDER BY pub_time desc) GROUP BY pub_year; db.test.aggregate( {$sort : {pub_time: -1}}, {$limit : 1000}, {$project:{pub_year:{$year:["$pub_time"]}}}, {$group: {_id:"$pub_year", num_year:{$sum:1}}} )
Some Usage Notes In BSON, order matters - so computed fields always show up after regular fields We use $ in front of field names to distinguish fields from string literals in expressions “$name” “name” vs.
Some Usage Notes Use a $match,$sort and $limit first in pipeline if possible Cumulative Operators $group: be aware of memory usage Use $project to discard unneeded fields Remember the 16MB output limit
Aggregation vs. MapReduce Framework is geared towards counting/accumulating If you need something more exotic, use MapReduce No 16MB constraint on output size with MapReduce JS in M/R is not limited to any fixed set of expressions • • • •
thanks! ✌(-‿-)✌ questions? $$$ BTW: we are hiring! http://10gen.com/jobs $$$ @mpobrien github.com/mpobrien hit me up:

MongoDB Aggregation Framework

  • 1.
  • 2.
  • 3.
    Quick Overview of Document-oriented Schemaless JSON-styledocuments Rich Queries Scales Horizontally db.users.find({ last_name: 'Smith', age: {$gt : 10} }); SELECT * FROM users WHERE last_name=‘Smith’ AND age > 10;
  • 4.
    Computing Aggregations in Databases SQL-based RDBMS JOIN GROUPBY AVG(), COUNT(), SUM(), FIRST(), LAST(), etc. MongoDB 2.0 MapReduce MongoDB 2.2+ MapReduce Aggregation Framework
  • 5.
    MapReduce var map =function() { ... emit(key, val); } var reduce = function(key, vals) { ... return resultVal; } Data Map() emit(k,v) Sort(k) Group(k) Reduce(k,values) k,v Finalize(k,v) k,v MongoDB map iterates on documents Document is $this 1 at time per shard Input matches output Can run multiple times
  • 6.
    What’s wrong withjust using MapReduce? Map/Reduce is very powerful, but often overkill Lots of users relying on it for simple aggregation tasks • •
  • 7.
    What’s wrong withjust using MapReduce? Easy to screw up JavaScript Debugging a M/R job sucks Writing more JS for simple tasks should not be necessary • • • (ಠ︿ಠ)
  • 8.
    Aggregation Framework Declarative (no needto write JS) Implemented directly in C++ Expression Evaluation Return computed values Framework: We can extend it with new ops • • • • •
  • 9.
  • 10.
    db.article.aggregate( { $project :{author : 1,tags : 1}}, { $unwind : "$tags" }, { $group : {_id : “$tags”, authors:{ $addToSet:"$author"}} } ); An aggregation command looks like:
  • 11.
    db.article.aggregate( { $project :{author : 1, tags : 1}}, { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet:"$author"} }} ); New Helper Method: .aggregate() Operator pipeline db.runCommand({ aggregate : "article", pipeline : [ {$op1, $op2, ...} ] }
  • 12.
    { "result" : [ {"_id" : "art", "authors" : [ "bill", "bob" ] }, { "_id" : "sports", "authors" : [ "jane", "bob" ] }, { "_id" : "food", "authors" : [ "jane", "bob" ] }, { "_id" : "science", "authors" : [ "jane", "bill", "bob" ] } ], "ok" : 1 } Output Document Looks like this: result: array of pipeline output ok: 1 for success, 0 otherwise
  • 13.
    Pipeline Input to thestart of the pipeline is a collection Series of operators - each one filters or transforms its input Passes output data to next operator in the pipeline Output of the pipeline is the result document • • • • ps -ax | tee processes.txt | more Kind of like UNIX:
  • 14.
    Let’s do: 1. Tourof the pipeline operators 2. A couple examples based on common SQL aggregation tasks $match $unwind $group $project $skip $limit $sort
  • 15.
    filters documents frompipeline with a query predicate filtered with: {$match: {author:”bob”}} $match {author: "bob", pageViews:5, title:"Lorem Ipsum..."} {author: "bill", pageViews:3, title:"dolor sit amet..."} {author: "joe", pageViews:52, title:"consectetur adipi..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:14, title:"magna aliquam..."} {author: "bob", pageViews:53, title:"claritas est..."} filtered with: {$match: {pageViews:{$gt:50}} {author:"bob",pageViews:5,title:"Lorem Ipsum..."} {author:"bob",pageViews:14,title:"magna aliquam..."} {author:"bob",pageViews:53,title:"claritas est..."} {author: "joe", pageViews:52, title:"consectetur adipiscing..."} {author: "jane", pageViews:51, title:"sed diam..."} {author: "bob", pageViews:53, title:"claritas est..."} Input:
  • 16.
    $unwind { "_id" : ObjectId("4f...146"), "author": "bob", "tags" :[ "fun","good","awesome"] } explode the “tags” array with: { $unwind : ”$tags” } { _id : ObjectId("4f...146"), author : "bob", tags:"fun"}, { _id : ObjectId("4f...146"), author : "bob", tags:"good"}, { _id : ObjectId("4f...146"), author : "bob", tags:"awesome"} produces output: Produce a new document for each value in an input array
  • 17.
    Bucket a subsetof docs together, calculate an aggregated output doc from the bucket $sum $max, $min $avg $first, $last $addToSet $push db.article.aggregate( { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } } ); $group Output Calculation Operators:
  • 18.
    db.article.aggregate( { $group :{ _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } } ); _id: selects a field to use as bucket key for grouping Output field name Operation used to calculate the output value ($sum, $max, $avg, etc.) $group (cont’d) dot notation (nested fields) a constant a multi-key expression inside {...} • • • also allowed here:
  • 19.
    An example with$match and $group SELECT SUM(price) FROM orders WHERE customer_id = 4; MongoDB: SQL: db.orders.aggregate( {$match : {“$customer_id” : 4}}, {$group : { _id : null, total: {$sum : “price”}}) English: Find the sum of all prices of the orders placed by customer #4
  • 20.
    An example with$unwind and $group MongoDB: SQL: English: db.posts.aggregate( { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }} ); For all tags used in blog posts, produce a list of authors that have posted under each tag SELECT tag, author FROM post_tags LEFT JOIN posts ON post_tags.post_id = posts.id GROUP BY tag, author;
  • 21.
    More operators -Controlling Pipeline Input $skip $limit $sort Similar to: .skip() .limit() .sort() in a regular Mongo query
  • 22.
    $sort specified the sameway as index keys: { $sort : { name : 1, age: -1 } } Must be used in order to take advantage of $first/$last with $group. order input documents
  • 23.
    $limit limit the numberof input documents {$limit : 5} $skip skips over documents {$skip : 5}
  • 24.
    $project Use for: Add, Remove,Pull up, Push down, Rename Fields Building computed fields Reshape a document
  • 25.
    $project (cont’d) Include or excludefields {$project : { title : 1, author : 1} } Only pass on fields “title” and “author” {$project : { comments : 0} Exclude “comments” field, keep everything else
  • 26.
    Moving + Renamingfields {$project : { page_views : “$pageViews”, catName : “$category.name”, info : { published : “$ctime”, update : “$mtime” } } } Rename page_views to pageViews Take nested field “category.name”, move it into top-level field called “catName” Populate a new sub-document into the output $project (cont’d)
  • 27.
    db.article.aggregate( { $project :{ name : 1, age_fixed : { $add:["$age", 2] } }} ); Building a Computed Field Output (computed field) Operands Expression $project (cont’d)
  • 28.
    Lots of Available Expressions $project (cont’d) Numeric$add $sub $mod $divide $multiply Logical $eq $lte/$lt $gte/$gt $and $not $or $eq Dates $dayOfMonth $dayOfYear $dayOfWeek $second $minute $hour $week $month $isoDate Strings $substr $add $toLower $toUpper $strcasecmp
  • 29.
    Example: $sort →$limit → $project→ $group MongoDB: SQL: English: Of the most recent 1000 blog posts, how many were posted within each calendar year? SELECT YEAR(pub_time) as pub_year, COUNT(*) FROM (SELECT pub_time FROM posts ORDER BY pub_time desc) GROUP BY pub_year; db.test.aggregate( {$sort : {pub_time: -1}}, {$limit : 1000}, {$project:{pub_year:{$year:["$pub_time"]}}}, {$group: {_id:"$pub_year", num_year:{$sum:1}}} )
  • 30.
    Some Usage Notes InBSON, order matters - so computed fields always show up after regular fields We use $ in front of field names to distinguish fields from string literals in expressions “$name” “name” vs.
  • 31.
    Some Usage Notes Usea $match,$sort and $limit first in pipeline if possible Cumulative Operators $group: be aware of memory usage Use $project to discard unneeded fields Remember the 16MB output limit
  • 32.
    Aggregation vs. MapReduce Framework isgeared towards counting/accumulating If you need something more exotic, use MapReduce No 16MB constraint on output size with MapReduce JS in M/R is not limited to any fixed set of expressions • • • •
  • 33.
    thanks! ✌(-‿-)✌ questions? $$$ BTW:we are hiring! http://10gen.com/jobs $$$ @mpobrien github.com/mpobrien hit me up: