Aggregation Framework

#mongodbdays Aggregation Framework Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo Tuesday, January 29, 13

Agenda
• State of Aggregation
• Pipeline
• Usage and Limitations
• Optimization
• Sharding
• (Expressions)
• Looking Ahead

State of Aggregation

State of Aggregation
• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping, common aggregations, etc.
• What are we using for this?

Data Warehousing

Data Warehousing
• SQL for reporting and analytics
• Infrastructure complications
  – Additional maintenance
  – Data duplication
  – ETL processes
  – Real time?

MapReduce

MapReduce
• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
  – Averages
  – Summation
  – Grouping

MapReduce in MongoDB
• Implemented with JavaScript
  – Single-threaded
  – Difficult to debug
• Concurrency
  – Appearance of parallelism
  – Write locks

Aggregation Framework

Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
  – Operation pipeline
  – Computational expressions
• Works well with sharding

Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
  – Replace pages of JavaScript
  – Longer aggregation pipelines
• Quick aggregations from the shell

Pipeline

Pipeline
• Process a stream of documents
  – Original input is a collection
  – Final output is a result document
• Series of operators
  – Filter or transform data
  – Input/output chain

    ps ax | grep mongod | head -n 1
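The input/output chain on this slide can be imitated in plain JavaScript. The following is only a sketch for intuition, not how MongoDB executes a pipeline: `runPipeline`, `match`, and `limit` are hypothetical helpers, and the stages work on an in-memory array rather than a collection.

```javascript
// Hypothetical helpers: each stage is a function from an array of documents
// to a new array of documents, much like one command in a Unix pipe.
const match = (predicate) => (docs) => docs.filter(predicate);
const limit = (n) => (docs) => docs.slice(0, n);

// Feed the output of each stage into the next one, left to right.
function runPipeline(docs, stages) {
  return stages.reduce((stream, stage) => stage(stream), docs);
}

const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const firstEnglish = runPipeline(books, [
  match((book) => book.language === "English"),
  limit(1),
]);
console.log(firstEnglish);
```

Modelling a stage as a function keeps the chain open-ended: adding another stage only means appending another function to the array.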

Pipeline Operators
• $match
• $sort
• $project
• $limit
• $group
• $skip
• $unwind

Example book data

    {
      _id: 375,
      title: "The Great Gatsby",
      ISBN: "9781857150193",
      available: true,
      pages: 218,
      chapters: 9,
      subjects: [ "Long Island", "New York", "1920s" ],
      language: "English"
    }

$match
• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)

Matching Field Values

Input:

    { title: "The Great Gatsby", pages: 218, language: "English" }
    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }

Stage:

    { $match: { language: "Russian" }}

Output:

    { title: "War and Peace", pages: 1440, language: "Russian" }

Matching with Query Operators

Input:

    { title: "The Great Gatsby", pages: 218, language: "English" }
    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }

Stage:

    { $match: { pages: { $gt: 1000 } }}

Output:

    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }
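To make the filtering behaviour concrete, here is a small in-memory sketch of the two query forms used on the $match slides. `matchesQuery` is a hypothetical helper that covers only plain equality and `$gt`; MongoDB's real query language supports far more.

```javascript
// Hypothetical matcher for the two query forms shown on the slides:
// plain equality ({ language: "Russian" }) and $gt ({ pages: { $gt: 1000 } }).
function matchesQuery(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const isGtCondition =
      condition !== null && typeof condition === "object" && "$gt" in condition;
    return isGtCondition
      ? doc[field] > condition.$gt
      : doc[field] === condition;
  });
}

const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const russianBooks = books.filter((b) => matchesQuery(b, { language: "Russian" }));
const longBooks = books.filter((b) => matchesQuery(b, { pages: { $gt: 1000 } }));

console.log(russianBooks.map((b) => b.title));
console.log(longBooks.map((b) => b.title));
```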

$project
• Reshape documents
• Include, exclude or rename fields
• Inject computed fields
• Create sub-document fields

Including and Excluding Fields

Input:

    {
      _id: 375,
      title: "Great Gatsby",
      ISBN: "9781857150193",
      available: true,
      pages: 218,
      subjects: [ "Long Island", "New York", "1920s" ],
      language: "English"
    }

Stage:

    { $project: {
      _id: 0,
      title: 1,
      language: 1
    }}

Output:

    {
      title: "Great Gatsby",
      language: "English"
    }

Renaming and Computing Fields

Input:

    {
      _id: 375,
      title: "Great Gatsby",
      ISBN: "9781857150193",
      available: true,
      pages: 218,
      chapters: 9,
      subjects: [ "Long Island", "New York", "1920s" ],
      language: "English"
    }

Stage:

    { $project: {
      avgChapterLength: {
        $divide: ["$pages", "$chapters"]
      },
      lang: "$language"
    }}

Output:

    {
      _id: 375,
      avgChapterLength: 24.2222,
      lang: "English"
    }

Creating Sub-Document Fields

Input:

    {
      _id: 375,
      title: "Great Gatsby",
      ISBN: "9781857150193",
      available: true,
      pages: 218,
      subjects: [ "Long Island", "New York", "1920s" ],
      language: "English"
    }

Stage:

    { $project: {
      title: 1,
      stats: {
        pages: "$pages",
        language: "$language"
      }
    }}

Output:

    {
      _id: 375,
      title: "Great Gatsby",
      stats: {
        pages: 218,
        language: "English"
      }
    }
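The reshaping rules from the $project slides (include with 1, drop _id with 0, rename via "$field", build sub-documents) can be sketched in plain JavaScript. `project` and `resolve` are hypothetical helpers written for this illustration; they ignore computed expressions and everything else the real operator handles.

```javascript
// Resolve a projection value against a document: "$name" strings become
// field lookups, and nested objects are resolved key by key.
function resolve(doc, value) {
  if (typeof value === "string" && value.startsWith("$")) {
    return doc[value.slice(1)];
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, resolve(doc, inner)])
    );
  }
  return value;
}

// Hypothetical in-memory version of a simple $project stage.
function project(doc, spec) {
  const out = {};
  // _id is carried along unless the spec explicitly sets it to 0.
  if (spec._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [key, value] of Object.entries(spec)) {
    if (key === "_id") continue;
    if (value === 1) {
      if (key in doc) out[key] = doc[key]; // plain inclusion
    } else {
      out[key] = resolve(doc, value); // rename or sub-document
    }
  }
  return out;
}

const book = {
  _id: 375,
  title: "Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  subjects: ["Long Island", "New York", "1920s"],
  language: "English",
};

const withStats = project(book, {
  title: 1,
  stats: { pages: "$pages", language: "$language" },
});
const titleAndLanguage = project(book, { _id: 0, title: 1, language: 1 });

console.log(withStats);
console.log(titleAndLanguage);
```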

$group
• Group documents by an ID
  – Field reference, object, constant
• Other output fields are computed
  – $max, $min, $avg, $sum
  – $addToSet, $push
  – $first, $last
• Processes all data in memory

Calculating an Average

Input:

    { title: "The Great Gatsby", pages: 218, language: "English" }
    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }

Stage:

    { $group: {
      _id: "$language",
      avgPages: { $avg: "$pages" }
    }}

Output:

    { _id: "Russian", avgPages: 1440 }
    { _id: "English", avgPages: 653 }

Summing Fields and Counting

Input:

    { title: "The Great Gatsby", pages: 218, language: "English" }
    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }

Stage:

    { $group: {
      _id: "$language",
      numTitles: { $sum: 1 },
      sumPages: { $sum: "$pages" }
    }}

Output:

    { _id: "Russian", numTitles: 1, sumPages: 1440 }
    { _id: "English", numTitles: 2, sumPages: 1306 }

Collecting Distinct Values

Input:

    { title: "The Great Gatsby", pages: 218, language: "English" }
    { title: "War and Peace", pages: 1440, language: "Russian" }
    { title: "Atlas Shrugged", pages: 1088, language: "English" }

Stage:

    { $group: {
      _id: "$language",
      titles: { $addToSet: "$title" }
    }}

Output:

    { _id: "Russian", titles: [ "War and Peace" ] }
    { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
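The three $group slides (average, sum and count, distinct values) can be checked by hand with a short in-memory sketch. `groupByLanguage` is a hypothetical function that hard-codes the grouping key and accumulators from the slides; it is meant to show the arithmetic, not how the server implements grouping.

```javascript
// Hypothetical in-memory equivalent of grouping by "$language" with
// $sum: 1, $sum: "$pages", $avg: "$pages", and $addToSet: "$title".
function groupByLanguage(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const group = groups.get(doc.language);
    group.numTitles += 1;           // $sum: 1
    group.sumPages += doc.pages;    // $sum: "$pages"
    group.titles.add(doc.title);    // $addToSet: "$title"
  }
  return [...groups.entries()].map(([language, group]) => ({
    _id: language,
    numTitles: group.numTitles,
    sumPages: group.sumPages,
    avgPages: group.sumPages / group.numTitles, // $avg: "$pages"
    titles: [...group.titles].sort(), // sorted only to make the output stable
  }));
}

const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const grouped = groupByLanguage(books);
console.log(grouped);
```

The sketch returns groups in first-seen order, which differs from the order on the slides; the order of groups should not be relied on unless a $sort follows.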

$unwind
• Applied to an array field
• Yield new documents for each array element
  – Array replaced by element value
  – Missing/empty fields → no output
  – Non-array fields → error
• Pipe to $group to aggregate array values

Yielding Multiple Documents from One

Input:

    {
      title: "The Great Gatsby",
      ISBN: "9781857150193",
      subjects: [ "Long Island", "New York", "1920s" ]
    }

Stage:

    { $unwind: "$subjects" }

Output:

    { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
    { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
    { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
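The $unwind rules listed on the slides (one output document per element, nothing for missing or empty arrays, an error for non-arrays) translate into a few lines of JavaScript. `unwind` is a hypothetical helper; treating `null` the same as a missing field is an assumption of this sketch.

```javascript
// Hypothetical in-memory version of { $unwind: "$<field>" }.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    // Missing field (and, by assumption here, null): no output.
    if (value === undefined || value === null) continue;
    if (!Array.isArray(value)) {
      throw new TypeError(`cannot unwind non-array field "${field}"`);
    }
    // An empty array simply produces no iterations, hence no output.
    for (const element of value) {
      out.push({ ...doc, [field]: element }); // array replaced by element value
    }
  }
  return out;
}

const gatsby = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([gatsby], "subjects");
console.log(unwound);
```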

$sort, $limit, $skip
• Sort documents by one or more fields
  – Same order syntax as cursors
  – Waits for earlier pipeline operator to return
  – In-memory unless early and indexed
• Limit and skip follow cursor behavior

Sort All the Documents in the Pipeline

Input:

    { title: "The Great Gatsby" }
    { title: "Brave New World" }
    { title: "Grapes of Wrath" }
    { title: "Animal Farm" }
    { title: "Lord of the Flies" }
    { title: "Fathers and Sons" }
    { title: "Invisible Man" }
    { title: "Fahrenheit 451" }

Stage:

    { $sort: { title: 1 }}

Output:

    { title: "Animal Farm" }
    { title: "Brave New World" }
    { title: "Fahrenheit 451" }
    { title: "Fathers and Sons" }
    { title: "Grapes of Wrath" }
    { title: "Invisible Man" }
    { title: "Lord of the Flies" }
    { title: "The Great Gatsby" }

Limit Documents Through the Pipeline

Input:

    { title: "The Great Gatsby" }
    { title: "Brave New World" }
    { title: "Grapes of Wrath" }
    { title: "Animal Farm" }
    { title: "Lord of the Flies" }
    { title: "Fathers and Sons" }
    { title: "Invisible Man" }
    { title: "Fahrenheit 451" }

Stage:

    { $limit: 5 }

Output:

    { title: "The Great Gatsby" }
    { title: "Brave New World" }
    { title: "Grapes of Wrath" }
    { title: "Animal Farm" }
    { title: "Lord of the Flies" }

Skip Over Documents in the Pipeline

Input:

    { title: "The Great Gatsby" }
    { title: "Brave New World" }
    { title: "Grapes of Wrath" }
    { title: "Animal Farm" }
    { title: "Lord of the Flies" }
    { title: "Fathers and Sons" }
    { title: "Invisible Man" }
    { title: "Fahrenheit 451" }

Stage:

    { $skip: 5 }

Output:

    { title: "Fathers and Sons" }
    { title: "Invisible Man" }
    { title: "Fahrenheit 451" }
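For a quick sanity check of the three slides on $sort, $limit, and $skip, the same operations can be reproduced on a plain array. This is an in-memory illustration only; the variable names are made up for the example, and the comparison is a simple code-unit string comparison.

```javascript
const books = [
  "The Great Gatsby",
  "Brave New World",
  "Grapes of Wrath",
  "Animal Farm",
  "Lord of the Flies",
  "Fathers and Sons",
  "Invisible Man",
  "Fahrenheit 451",
].map((title) => ({ title }));

// { $sort: { title: 1 } }: ascending by title; copy first so the input stays intact.
const sortedByTitle = [...books].sort((a, b) =>
  a.title < b.title ? -1 : a.title > b.title ? 1 : 0
);

// { $limit: 5 }: keep the first five documents in their current order.
const firstFive = books.slice(0, 5);

// { $skip: 5 }: drop the first five documents and pass on the rest.
const afterFive = books.slice(5);

console.log(sortedByTitle.map((b) => b.title));
console.log(firstFive.map((b) => b.title));
console.log(afterFive.map((b) => b.title));
```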

Usage and Limitations

Usage
• collection.aggregate() method
  – Mongo shell
  – Most drivers
• aggregate database command

Collection

    db.books.aggregate([
      { $project: { language: 1 }},
      { $group: { _id: "$language", numTitles: { $sum: 1 }}}
    ])

    {
      result: [
        { _id: "Russian", numTitles: 1 },
        { _id: "English", numTitles: 2 }
      ],
      ok: 1
    }

Database Command

    db.runCommand({
      aggregate: "books",
      pipeline: [
        { $project: { language: 1 }},
        { $group: { _id: "$language", numTitles: { $sum: 1 }}}
      ]
    })

    {
      result: [
        { _id: "Russian", numTitles: 1 },
        { _id: "English", numTitles: 2 }
      ],
      ok: 1
    }

Limitations
• Result limited by BSON document size
  – Final command result
  – Intermediate shard results
• Pipeline operator memory limits
• Some BSON types unsupported
  – Binary, Code, deprecated types

Sharding

Sharding
• Split the pipeline at first $group or $sort
  – Shards execute pipeline up to that point
  – mongos merges results and continues
• Early $match may exclude shards
• CPU and memory implications for mongos

Sharding

    [
      { $match: { /* filter by shard key */ }},
      { $project: { /* select fields */ }},
      { $group: { /* group by some field */ }},
      { $sort: { /* sort by some field */ }},
      { $project: { /* reshape result */ }}
    ]
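The split rule from the sharding slides ("split the pipeline at first $group or $sort") can be illustrated with a small helper. `splitForSharding` is hypothetical and purely illustrative: the slides do not say on which side the splitting stage itself runs, so this sketch assigns it to the merging side as an assumption, and the stage contents (including using `language` as a shard key) are invented for the example.

```javascript
// Hypothetical helper: divide a pipeline into the part every shard could run
// and the part that runs after results are merged.
function splitForSharding(pipeline) {
  const isSplitStage = (stage) => "$group" in stage || "$sort" in stage;
  const splitAt = pipeline.findIndex(isSplitStage);
  if (splitAt === -1) {
    // No $group or $sort: the whole pipeline can run on the shards.
    return { shardPart: pipeline, mergePart: [] };
  }
  return {
    shardPart: pipeline.slice(0, splitAt),
    // Assumption of this sketch: the split stage itself goes to the merger.
    mergePart: pipeline.slice(splitAt),
  };
}

const pipeline = [
  { $match: { language: "English" } },
  { $project: { language: 1, pages: 1 } },
  { $group: { _id: "$language", sumPages: { $sum: "$pages" } } },
  { $sort: { sumPages: -1 } },
  { $project: { sumPages: 1 } },
];

const { shardPart, mergePart } = splitForSharding(pipeline);
const stageName = (stage) => Object.keys(stage)[0];
console.log(shardPart.map(stageName));
console.log(mergePart.map(stageName));
```

Putting $match and $project before the first $group or $sort, as the slide's pipeline does, keeps the filtering work on the shards and reduces what the merging side has to process.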

Aggregation in a sharded cluster

Expressions

Expressions
• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested
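Nested expressions with "$field" references can be pictured as a tiny recursive evaluator. The sketch below is a hypothetical `evaluate` function that supports only four arithmetic operators; it is meant to show how nesting and field references compose, not to mirror the server's expression engine.

```javascript
// Hypothetical evaluator for a small subset of expressions.
function evaluate(expr, doc) {
  // "$name" refers to a field of the current document.
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)];
  }
  // Numbers, plain strings, booleans, and null are literals.
  if (expr === null || typeof expr !== "object") {
    return expr;
  }
  // Otherwise expect { $operator: [operand, ...] } and recurse into operands.
  const [operator, operands] = Object.entries(expr)[0];
  const values = operands.map((operand) => evaluate(operand, doc));
  switch (operator) {
    case "$add":
      return values.reduce((a, b) => a + b, 0);
    case "$multiply":
      return values.reduce((a, b) => a * b, 1);
    case "$subtract":
      return values[0] - values[1];
    case "$divide":
      return values[0] / values[1];
    default:
      throw new Error(`unsupported operator: ${operator}`);
  }
}

const book = { title: "The Great Gatsby", pages: 218, chapters: 9 };

const avgChapterLength = evaluate({ $divide: ["$pages", "$chapters"] }, book);
const nested = evaluate({ $add: [1, { $multiply: [2, "$chapters"] }] }, book);

console.log(avgChapterLength.toFixed(4));
console.log(nested);
```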

Boolean Operators
• Input array of one or more values
  – $and, $or
  – Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
  – null, undefined, zero ▶ false
  – Non-zero, strings, dates, objects ▶ true

    { $and: [true, false] } ▶ false
    { $or: ["foo", 0] } ▶ true
    { $not: null } ▶ true
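The coercion rules on this slide differ from JavaScript's own truthiness (for instance, an empty string is falsy in JavaScript but counts as true under the slide's rule that strings are true). A minimal sketch, with hypothetical helper names, that follows the slide's table rather than JavaScript's:

```javascript
// Coerce a value to boolean following the slide's table:
// null, undefined, and zero are false; everything else (besides false) is true.
const toBool = (value) =>
  !(value === null || value === undefined || value === 0 || value === false);

// every() and some() stop at the first deciding element (short-circuit).
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false]));
console.log(or(["foo", 0]));
console.log(not(null));
```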

Comparison Operators
• Compare numbers, strings, and dates
• Input array with two operands
  – $cmp, $eq, $ne
  – $gt, $gte, $lt, $lte

    { $cmp: [3, 4] } ▶ -1
    { $eq: ["foo", "bar"] } ▶ false
    { $ne: ["foo", "bar"] } ▶ true
    { $gt: [9, 7] } ▶ true

Arithmetic Operators
• Input array of one or more numbers
  – $add, $multiply
• Input array of two operands
  – $subtract, $divide, $mod

    { $add: [1, 2, 3] } ▶ 6
    { $multiply: [2, 2, 2] } ▶ 8
    { $subtract: [10, 7] } ▶ 3
    { $divide: [10, 2] } ▶ 5
    { $mod: [8, 3] } ▶ 2

String Operators
• $strcasecmp case-insensitive comparison
  – $cmp is case-sensitive
• $toLower and $toUpper case change
• $substr for sub-string extraction
• Not encoding aware (assumes ASCII alphabet)

    { $strcasecmp: ["foo", "bar"] } ▶ 1
    { $substr: ["foo", 1, 2] } ▶ "oo"
    { $toUpper: "foo" } ▶ "FOO"
    { $toLower: "BAR" } ▶ "bar"

Date Operators
• Extract values from date objects
  – $dayOfYear, $dayOfMonth, $dayOfWeek
  – $year, $month, $week
  – $hour, $minute, $second

    { $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
    { $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
    { $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
    { $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
    { $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
    { $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
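Several of these values can be cross-checked with JavaScript's built-in `Date` using its UTC accessors. This is only a rough comparison, not the server's implementation: JavaScript numbers months from 0 and weekdays from 0 (Sunday), so the sketch adds 1 to line up with the values shown for $month and $dayOfWeek.

```javascript
const date = new Date("2012-10-24T00:00:00.000Z");

const parts = {
  year: date.getUTCFullYear(),
  month: date.getUTCMonth() + 1,   // JavaScript months start at 0
  dayOfMonth: date.getUTCDate(),
  dayOfWeek: date.getUTCDay() + 1, // JavaScript: 0 = Sunday; shifted so 1 = Sunday
  hour: date.getUTCHours(),
  minute: date.getUTCMinutes(),
  second: date.getUTCSeconds(),
};

console.log(parts);
```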

Conditional Operators
• $cond ternary operator
• $ifNull

    { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
    { $ifNull: ["foo", "bar"] } ▶ "foo"
    { $ifNull: [null, "bar"] } ▶ "bar"

Looking Ahead

Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data

Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
  – https://jira.mongodb.org/browse/SERVER-3253

Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
  – Grouping input sorted by _id
  – Sorting with limited output

#mongodbdays Thank You Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo
