@h_ingo Analytics with MongoDB alone and with Hadoop Connector Henrik Ingo Solution Architect, MongoDB
The Science in Data Science • Collect data • Explore the data, use visualization • Use math • Make predictions • Test predictions – Collect even more data • Repeat...
Why MongoDB? When MongoDB?
5 NoSQL categories: Key Value (Redis), Wide Column (Cassandra), Document (MongoDB), Graph (Neo4j), Map Reduce (Hadoop)
MongoDB and the Enterprise IT Stack. Applications: CRM, ERP, Collaboration, Mobile, BI. Data Management: Online Data (MongoDB, RDBMS) and Offline Data (RDBMS, Hadoop, EDW). Infrastructure: OS & Virtualization, Compute, Storage, Network. Cross-cutting: Security & Auditing, Management & Monitoring.
How do we do it with MongoDB?
Collect data
Exponential Data Growth http://www.worldwidewebsize.com/
Volume Velocity Variety
Volume Velocity Variety • Upserts avoid unnecessary reads • Asynchronous writes • Spread writes over multiple shards • Writes buffered in RAM and flushed to disk in bulk
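As an illustration of these write patterns, a minimal mongo-shell sketch (collection and field names are hypothetical; the writeConcern option form assumes a 2.6+ shell):

// Upsert: increment a counter in one round trip; if no matching
// document exists it is created, so no prior read is needed.
db.counters.update(
  { _id: "site-1/2013-01-01" },
  { $inc: { hits: 1 } },
  { upsert: true }
)

// Unacknowledged (asynchronous) write: w: 0 returns without waiting
// for the server, trading durability guarantees for write throughput.
db.logs.insert(
  { host: "127.0.0.1", path: "/index.html", ts: new Date() },
  { writeConcern: { w: 0 } }
)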
Volume Velocity Variety: one MongoDB document vs. many RDBMS rows

{ _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ] }
Visualization
Visualization: d3js.org, …
Use math
Data Processing in MongoDB • Pre-aggregated documents • Aggregation Framework • Map/Reduce • Hadoop Connector
Pre-aggregated documents (design pattern)
Pre-Aggregation: Data for URL / Date

{ _id: "20101010/site-1/apache_pb.gif",
  metadata: {
    date: ISODate("2000-10-10T00:00:00Z"),
    site: "site-1",
    page: "/apache_pb.gif"
  },
  daily: 5468426,
  hourly: { "0": 227850, "1": 210231, ... "23": 20457 },
  minute: { "0": 3612, "1": 3241, ... "1439": 2819 } }
Pre-Aggregation: updating the counters for URL / Date (PyMongo)

query = { '_id': "20101010/site-1/apache_pb.gif" }
update = { '$inc': { 'hourly.12': 1, 'minute.739': 1 } }
db.stats.daily.update(query, update, upsert=True)
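Reading the pre-aggregated report back is then a single-document fetch; a brief shell sketch of what a dashboard would do:

var doc = db.stats.daily.findOne({ _id: "20101010/site-1/apache_pb.gif" })
doc.daily         // total hits for the day
doc.hourly["12"]  // hits during the 12:00 hour, ready to chart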
Aggregation framework
Dynamic Queries

Find all logs for a URL:
db.logs.find( { 'path': '/index.html' } )

Find all logs for a time range:
db.logs.find( { 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } )

Find all logs for a host over a range of dates:
db.logs.find( { 'host': '127.0.0.1', 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } )
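These queries only stay fast with matching indexes; a hedged sketch of the index definitions they suggest (the exact choices are an assumption, not from the slides):

db.logs.ensureIndex({ path: 1 })
db.logs.ensureIndex({ time: 1 })
db.logs.ensureIndex({ host: 1, time: 1 })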
Aggregation Framework: requests per day by URL

db.logs.aggregate( [
  { '$match': { 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } },
  { '$project': { 'path': 1,
                  'date': { 'y': { '$year': '$time' },
                            'm': { '$month': '$time' },
                            'd': { '$dayOfMonth': '$time' } } } },
  { '$group': { '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' },
                'hits': { '$sum': 1 } } }
] )
Aggregation Framework result

{ 'ok': 1,
  'result': [
    { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 1 }, 'hits': 124 },
    { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 2 }, 'hits': 245 },
    { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 3 }, 'hits': 322 },
    { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 4 }, 'hits': 175 },
    { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 5 }, 'hits': 94 }
  ] }
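To present the busiest days first, a '$sort' stage could be appended to the pipeline above (an assumed extension, not shown on the slide):

{ '$sort': { 'hits': -1 } }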
Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Scale-out • Declared in JSON, executes in C++ • Runs inside MongoDB on local data
Map Reduce in MongoDB
MongoDB Map/Reduce
Map Reduce – Map Phase: generate hourly rollups from log data

var map = function() {
  var key = {
    p: this.path,
    d: new Date( this.ts.getFullYear(), this.ts.getMonth(), this.ts.getDate(),
                 this.ts.getHours(), 0, 0, 0 )
  };
  emit( key, { hits: 1 } );
};
Map Reduce – Reduce Phase: generate hourly rollups from log data

var reduce = function(key, values) {
  var r = { hits: 0 };
  values.forEach(function(v) {
    r.hits += v.hits;
  });
  return r;
};
Map Reduce – Execution

query = { 'ts': { '$gte': new Date(2013, 0, 1), '$lte': new Date(2013, 0, 31) } }
db.logs.mapReduce( map, reduce, { 'query': query, 'out': { 'reduce': 'stats.monthly' } } )
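The rollups land in stats.monthly and can be queried like any other collection, e.g.:

db.stats.monthly.find({ '_id.p': '/index.html' })

Note that the 'reduce' output mode folds new results into existing documents with the same _id using the reduce function, so the job can be re-run incrementally on new data.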
MongoDB Map/Reduce Benefits • Runs inside MongoDB • Sharding supported • JavaScript – Pro: functionality, expressiveness – Con: overhead • Input can be a collection or query! • Output directly to document or collection • Easy, when you don’t want overhead of Hadoop
Hadoop Connector
MongoDB with Hadoop
How it works • The adapter examines the MongoDB input collection and calculates a set of splits from the data • Each split is assigned to a Hadoop node • In parallel, Hadoop pulls data from the splits on MongoDB (or BSON) and starts processing it locally • Hadoop merges the results and streams the output back to the MongoDB (or BSON) output collection
Read From MongoDB (or BSON)

mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
mapred.input.dir=hdfs:///tmp/messages.bson
mapred.input.dir=s3:///tmp/messages.bson
Write To MongoDB (or BSON)

mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
mapred.output.dir=hdfs:///tmp/results.bson
mapred.output.dir=s3:///tmp/results.bson
Document Example

{ "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  } }
Graph Sketch
Receiver / Sender Pairs

{"_id": {"t": "bob@enron.com", "f": "alice@enron.com"}, "count": 14}
{"_id": {"t": "bob@enron.com", "f": "eve@enron.com"}, "count": 9}
{"_id": {"t": "alice@enron.com", "f": "charlie@enron.com"}, "count": 99}
{"_id": {"t": "charlie@enron.com", "f": "bob@enron.com"}, "count": 48}
{"_id": {"t": "eve@enron.com", "f": "charlie@enron.com"}, "count": 20}
Map Phase – each document gets passed through the mapper function

@Override
public void map(NullWritable key, BSONObject val, final Context context) {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Reduce Phase – mapper output is grouped by key and passed to the Reducer

public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues, final Context pContext) {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = new BasicDBObjectBuilder().start()
        .add("f", pKey.from)
        .add("t", pKey.to)
        .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
Query Data

mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
Hadoop Connector Benefits • Full multi-core parallelism to process MongoDB data • mongo.input.query pushes a filter down to MongoDB so Hadoop reads only matching documents (see the example below) • Full integration with the Hadoop and JVM ecosystem: Mahout et al. • Can be used on Amazon Elastic MapReduce • Read and write backup files to the local filesystem, HDFS, and S3 • Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
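For example, in the same property style as the input/output slides above, a filter limiting the job to a single sender (the query itself is made up for illustration):

mongo.input.query={ "headers.From": "phillip.allen@enron.com" }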
Make predictions & test
A/B testing • Hey, it looks like teenage girls clicked a lot on that ad with a pink background... • Hypothesis: Given otherwise the same ad, teenage girls are more likely to click on ads with pink backgrounds than white • Test 50-50 pink vs white ads • Collect click stream stats in MongoDB or Hadoop • Analyze results
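A minimal shell sketch of the collect-and-analyze steps (collection and field names are assumptions):

// One event per ad impression, recording whether it was clicked
db.clicks.insert({ ad: "pink-bg", segment: "teen-f", clicked: true, ts: new Date() })

// Click-through ingredients per variant
db.clicks.aggregate([
  { '$group': { '_id': '$ad',
                'impressions': { '$sum': 1 },
                'clicks': { '$sum': { '$cond': [ '$clicked', 1, 0 ] } } } }
])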
Recommendations – social filtering • ”Customers who bought this book also bought” • Computed offline / nightly • As easy as it sounds! Google it: Amazon item-to-item algorithm
Personalization • ”Even if you are a teenage girl, you seem to be 60% more likely to click on blue ads than pink.” • User-specific recommendations are a hybrid of offline & online recommendations • User profile in MongoDB • May even be updated in real time
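A sketch of such a profile document kept fresh with in-place increments (the structure is an assumption):

db.profiles.update(
  { _id: "user123" },
  { $inc: { "clicks.blue": 1 }, $set: { last_seen: new Date() } },
  { upsert: true }
)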
@h_ingo Questions? Henrik Ingo Solution Architect, MongoDB