#mongodbdays @nleite #aggfwk Aggregation Framework Norberto Leite Senior Solutions Architect, MongoDB
Agenda • What is the Aggregation Framework? • The Aggregation Pipeline • Usage and Limitations • Aggregation and Sharding • Summary
What is the Aggregation Framework?
Aggregation Framework
Aggregation in Nutshell • We're storing our data in MongoDB • Our applications need to run ad-hoc queries for grouping, summarizations, reporting, etc. • We must have a way to reshape data easily to support these access patterns • You can use Aggregation Framework for this!
MapReduce is great, but… • Extremely versatile, powerful • Overkill for simple aggregation tasks • Averages • Summation • Grouping • Reshaping • High level of complexity • Difficult to program and debug
Aggregation Framework • Executes in native code – Written in C++ – JSON parameters • Plays nice with sharding • Flexible, functional, and simple – Operation pipeline – Computational expressions
Aggregation Pipeline
Pipeline
What is an Aggregation Pipeline? • A Series of Document Transformations – Executed in stages – Original input is a collection – Output as a cursor or a collection $match $project $group $sort • Rich Library of Functions – Filter, compute, group, and summarize data – Output of one stage sent to input of next – Operations executed in sequential order
Pipeline Operators • $sort • Order documents • $limit / $skip • Paginate documents • $redact • Restrict documents • $geoNear • Proximity sort documents • $let, $map • Define variables • $match • Filter documents • $project • Reshape documents • $group • Summarize documents • $unwind • Expand documents
Our Example Data { "_id" : ObjectId("54523d2d25784427c6fabce1"), "From" : "norberto@mongodb.com", "To" : "mongodb-user@googlegroups.com", "Date" : ISODate("2012-08-15T22:32:34Z"), "body" : { "text/plain" : ”Hello Munich, nice to see yalll!" }, "Subject" : ”Live From MongoDB World" }
$match • Filter documents – Uses existing query syntax – No $where (server side Javascript)
Matching Field Values { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { $match: { from: "hipster@somemail.com" }} { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
Matching with Query Operators { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { $match: { words: {$gt: 100} }} { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" }
$project • Reshape Documents – Include, exclude or rename fields – Inject computed fields – Create sub-document fields
Including and Excluding Fields { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"norberto@mongodb.com" }
Including and Excluding Fields { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"norberto@mongodb.com" }
Renaming and Computing Fields { $project: { spamIndex: { $mul: ["$words", "$replies"] }, user: "$from" }} { _id: 12345, spamIndex: 72.6666 , user: "norberto@mongodb.com" } { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
Creating Sub-Document Fields { $project: { subject: 1, stats: { replies: "$replies", from: "$from", date: "$date" }}} { _id: 375, subject: "Hello There", stats: { replies: 3, from: "norberto@mongodb.com", date: ISODate("2012-08-05") }} { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
$group • Group documents by value – Field reference, object, constant – Other output fields are computed • $max, $min, $avg, $sum • $addToSet, $push • $first, $last – Processes all data in memory by default
Calculating An Average { $group: { _id: "$from", avgWords: { $avg: "$words" } }} { _id: "norberto@mongodb.com", avgPages: 154 } { _id: "hipster@somemail.com", avgPages: 100 } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
Summing Fields and Counting { $group: { _id: "$from", words: { $sum: "$words" }, mails: { $sum: 1 } }} { _id: "norberto@mongodb.com", words: 308, mails: 2 } { _id: "hipster@somemail.com", words: 100, mails: 1 } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
$unwind • Operate on an array field – Create documents from array elements • Array replaced by element value • Missing/empty fields → no output • Non-array fields → error – Pipe to $group to aggregate
Collecting Distinct Values { $unwind: "$to" } { subject: "2.8 will be great!", to: "marc@mongodb.com", account : "mongodb mail” } { _id: 2222, subject: "2.8 will be great!", to: [ "marc@mongodb.com", "eliot@mongodb.com", "asya@mongodb.com", ], account: "mongodb mail" } { subject: "2.8 will be great!", to: "eliot@mongodb.com", account : "mongodb mail” } { subject: "2.8 will be great!", to: "asya@mongodb.com", account : "mongodb mail” }
$sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
$redact • Restrict access to Documents – Use document fields to define privileges – Apply conditional queries to validate users • Field Level Access Control – $$DESCEND, $$PRUNE, $$KEEP – Applies to root and subdocument fields
$redact Example Data { _id: 375, item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", Manufacturer: "Sony", security: 0, quantity: 12, list: 4999, pricing: { security: 1, sale: 2698, wholesale: { security: 2, amount: 2300 } } }
Query by Security Level security = 0 db.catalog.aggregate([ { $match: {item: /^.*XBR55X900A*/} }, { $redact: { $cond: { if: { $lte: [ "$security", ?? ] }, then: "$$DESCEND", else: "$$PRUNE" } } }]) { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony”, "security" : 0, "quantity" : 12, "list" : 4999 } { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony", "security" : 0, "quantity" : 12, "list" : 4999, "pricing" : { "security" : 1, "sale" : 2698, "wholesale" : { "security" : 2, "amount" : 2300 } } } security = 2
$geoNear • Order/Filter Documents by Location – Requires a geospatial index – Output includes physical distance – Must be first aggregation stage
$geonear Example Data { "_id" : 35089, "city" : “Sony”, "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL” }
Query by Proximity db.catalog.aggregate([ { $geoNear : { near: [ -86.000, 33.000 ], distanceField: "dist", maxDistance: .050, spherical: true, num: 3 } }]) { "_id" : "35089", "city" : "KELLYTON", "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL", "dist" : 0.0007971432165364155 }, { "_id" : "35010", "city" : "NEW SITE", "loc" : [ -85.951086, 32.941445 ], "pop" : 19942, "state" : "AL", "dist" : 0.0012479615347306806 }, { "_id" : "35072", "city" : "GOODWATER", "loc" : [ -86.078149, 33.074642 ], "pop" : 3813, "state" : "AL", "dist" : 0.0017333719627032555 }
Usage and Limitations
Usage • collection.aggregate([…], {<options>}) – Returns a cursor – Takes an optional document to specify aggregation options • allowDiskUse, explain – Use $out to send results to a Collection • db.runCommand({aggregate:<collection>, pipeline:[…]}) – Returns a document, limited to 16 MB
Collection db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 }
Database Command db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, 1 } {} }$ group: { _id: "$language", numTitles: { $sum: ] }) { result : [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], “ok” : 1 }
Limitations • Pipeline operator memory limits – Stages limited to 100 MB – Use “allowDiskUse” option to use disk for larger data sets • Some BSON types unsupported – Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope
Aggregation and Sharding
Sharding Result mongos Shard 1 (Primary) $match, $project, $group Shard 2 $match, $project, $group Shard 3 excluded Shard 4 $match, $project, $group • Workload split between shards – Shards execute pipeline up to a point – Primary shard merges cursors and continues processing* – Use explain to analyze pipeline split – Early $match may excuse shards – Potential CPU and memory implications for primary shard host * Prior to v2.6 second stage pipeline processing was done by mongos
Summary
Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing and reshaping data
Extending the Framework • Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253
Future Enhancements • Automatically move $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output
Enabling Developers • Doing more within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell
#mongodbdays #aggfwk #devs @mongodb Obrigado! Norberto Leite SA | Eng – norberto@mongodb.com

Aggregation Framework MongoDB Days Munich

  • 1.
    #mongodbdays @nleite #aggfwk Aggregation Framework Norberto Leite Senior Solutions Architect, MongoDB
  • 2.
    Agenda • Whatis the Aggregation Framework? • The Aggregation Pipeline • Usage and Limitations • Aggregation and Sharding • Summary
  • 3.
    What is theAggregation Framework?
  • 4.
  • 5.
    Aggregation in Nutshell • We're storing our data in MongoDB • Our applications need to run ad-hoc queries for grouping, summarizations, reporting, etc. • We must have a way to reshape data easily to support these access patterns • You can use Aggregation Framework for this!
  • 6.
    MapReduce is great,but… • Extremely versatile, powerful • Overkill for simple aggregation tasks • Averages • Summation • Grouping • Reshaping • High level of complexity • Difficult to program and debug
  • 7.
    Aggregation Framework •Executes in native code – Written in C++ – JSON parameters • Plays nice with sharding • Flexible, functional, and simple – Operation pipeline – Computational expressions
  • 8.
  • 9.
  • 10.
    What is anAggregation Pipeline? • A Series of Document Transformations – Executed in stages – Original input is a collection – Output as a cursor or a collection $match $project $group $sort • Rich Library of Functions – Filter, compute, group, and summarize data – Output of one stage sent to input of next – Operations executed in sequential order
  • 11.
    Pipeline Operators •$sort • Order documents • $limit / $skip • Paginate documents • $redact • Restrict documents • $geoNear • Proximity sort documents • $let, $map • Define variables • $match • Filter documents • $project • Reshape documents • $group • Summarize documents • $unwind • Expand documents
  • 12.
    Our Example Data { "_id" : ObjectId("54523d2d25784427c6fabce1"), "From" : "norberto@mongodb.com", "To" : "mongodb-user@googlegroups.com", "Date" : ISODate("2012-08-15T22:32:34Z"), "body" : { "text/plain" : ”Hello Munich, nice to see yalll!" }, "Subject" : ”Live From MongoDB World" }
  • 13.
    $match • Filterdocuments – Uses existing query syntax – No $where (server side Javascript)
  • 14.
    Matching Field Values { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { $match: { from: "hipster@somemail.com" }} { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
  • 15.
    Matching with QueryOperators { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { $match: { words: {$gt: 100} }} { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" }
  • 16.
    $project • ReshapeDocuments – Include, exclude or rename fields – Inject computed fields – Create sub-document fields
  • 17.
    Including and ExcludingFields { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"norberto@mongodb.com" }
  • 18.
    Including and ExcludingFields { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... } { $project: { _id: 0, subject: 1, from: 1 }} { subject: "Hello There", from:"norberto@mongodb.com" }
  • 19.
    Renaming and ComputingFields { $project: { spamIndex: { $mul: ["$words", "$replies"] }, user: "$from" }} { _id: 12345, spamIndex: 72.6666 , user: "norberto@mongodb.com" } { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
  • 20.
    Creating Sub-Document Fields { $project: { subject: 1, stats: { replies: "$replies", from: "$from", date: "$date" }}} { _id: 375, subject: "Hello There", stats: { replies: 3, from: "norberto@mongodb.com", date: ISODate("2012-08-05") }} { _id: 12345, subject: "Hello There", words: 218, from:"norberto@mongodb.com" to: [ "marc@mongodb.com", "sam@mongodb.com" ], account: "mongodb mail", date: ISODate("2012-08-05"), replies: 3, folder: "Inbox", ... }
  • 21.
    $group • Groupdocuments by value – Field reference, object, constant – Other output fields are computed • $max, $min, $avg, $sum • $addToSet, $push • $first, $last – Processes all data in memory by default
  • 22.
    Calculating An Average { $group: { _id: "$from", avgWords: { $avg: "$words" } }} { _id: "norberto@mongodb.com", avgPages: 154 } { _id: "hipster@somemail.com", avgPages: 100 } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
  • 23.
    Summing Fields andCounting { $group: { _id: "$from", words: { $sum: "$words" }, mails: { $sum: 1 } }} { _id: "norberto@mongodb.com", words: 308, mails: 2 } { _id: "hipster@somemail.com", words: 100, mails: 1 } { subject: "Hello There", words: 218, from: "norberto@mongodb.com" } { subject: "I love Hofbrauhaus", words: 90, from: "norberto@mongodb.com" } { subject: "MongoDB Rules!", words: 100, from: "hipster@somemail.com" }
  • 24.
    $unwind • Operateon an array field – Create documents from array elements • Array replaced by element value • Missing/empty fields → no output • Non-array fields → error – Pipe to $group to aggregate
  • 25.
    Collecting Distinct Values { $unwind: "$to" } { subject: "2.8 will be great!", to: "marc@mongodb.com", account : "mongodb mail” } { _id: 2222, subject: "2.8 will be great!", to: [ "marc@mongodb.com", "eliot@mongodb.com", "asya@mongodb.com", ], account: "mongodb mail" } { subject: "2.8 will be great!", to: "eliot@mongodb.com", account : "mongodb mail” } { subject: "2.8 will be great!", to: "asya@mongodb.com", account : "mongodb mail” }
  • 26.
    $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 27.
    $redact • Restrictaccess to Documents – Use document fields to define privileges – Apply conditional queries to validate users • Field Level Access Control – $$DESCEND, $$PRUNE, $$KEEP – Applies to root and subdocument fields
  • 28.
    $redact Example Data { _id: 375, item: "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", Manufacturer: "Sony", security: 0, quantity: 12, list: 4999, pricing: { security: 1, sale: 2698, wholesale: { security: 2, amount: 2300 } } }
  • 29.
    Query by SecurityLevel security = 0 db.catalog.aggregate([ { $match: {item: /^.*XBR55X900A*/} }, { $redact: { $cond: { if: { $lte: [ "$security", ?? ] }, then: "$$DESCEND", else: "$$PRUNE" } } }]) { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony”, "security" : 0, "quantity" : 12, "list" : 4999 } { "_id" : 375, "item" : "Sony XBR55X900A 55Inch 4K Ultra High Definition TV", "Manufacturer" : "Sony", "security" : 0, "quantity" : 12, "list" : 4999, "pricing" : { "security" : 1, "sale" : 2698, "wholesale" : { "security" : 2, "amount" : 2300 } } } security = 2
  • 30.
    $geoNear • Order/FilterDocuments by Location – Requires a geospatial index – Output includes physical distance – Must be first aggregation stage
  • 31.
    $geonear Example Data { "_id" : 35089, "city" : “Sony”, "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL” }
  • 32.
    Query by Proximity db.catalog.aggregate([ { $geoNear : { near: [ -86.000, 33.000 ], distanceField: "dist", maxDistance: .050, spherical: true, num: 3 } }]) { "_id" : "35089", "city" : "KELLYTON", "loc" : [ -86.048397, 32.979068 ], "pop" : 1584, "state" : "AL", "dist" : 0.0007971432165364155 }, { "_id" : "35010", "city" : "NEW SITE", "loc" : [ -85.951086, 32.941445 ], "pop" : 19942, "state" : "AL", "dist" : 0.0012479615347306806 }, { "_id" : "35072", "city" : "GOODWATER", "loc" : [ -86.078149, 33.074642 ], "pop" : 3813, "state" : "AL", "dist" : 0.0017333719627032555 }
  • 33.
  • 34.
    Usage • collection.aggregate([…],{<options>}) – Returns a cursor – Takes an optional document to specify aggregation options • allowDiskUse, explain – Use $out to send results to a Collection • db.runCommand({aggregate:<collection>, pipeline:[…]}) – Returns a document, limited to 16 MB
  • 35.
    Collection db.books.aggregate([ {$project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 }
  • 36.
    Database Command db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, 1 } {} }$ group: { _id: "$language", numTitles: { $sum: ] }) { result : [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], “ok” : 1 }
  • 37.
    Limitations • Pipelineoperator memory limits – Stages limited to 100 MB – Use “allowDiskUse” option to use disk for larger data sets • Some BSON types unsupported – Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope
  • 38.
  • 39.
    Sharding Result mongos Shard 1 (Primary) $match, $project, $group Shard 2 $match, $project, $group Shard 3 excluded Shard 4 $match, $project, $group • Workload split between shards – Shards execute pipeline up to a point – Primary shard merges cursors and continues processing* – Use explain to analyze pipeline split – Early $match may excuse shards – Potential CPU and memory implications for primary shard host * Prior to v2.6 second stage pipeline processing was done by mongos
  • 40.
  • 41.
    Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing and reshaping data
  • 42.
    Extending the Framework • Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253
  • 43.
    Future Enhancements •Automatically move $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output
  • 44.
    Enabling Developers •Doing more within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell
  • 45.
    #mongodbdays #aggfwk #devs@mongodb Obrigado! Norberto Leite SA | Eng – norberto@mongodb.com