Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Data Platforms

Jeffrey Breen
Director, Think Big Academy
October 2014

NoSQL to Augment Hadoop Big Data Platforms

CONFIDENTIAL

Outline
• Introduction
• Hadoop and NoSQL: What? Where? Why? When?
• Document-Oriented NoSQL and Hadoop
• Example: Add Statefulness
• Example: Analytics Store
• Example: Secondary Index
  − Caution: contains code
• MongoDB Connector for Hadoop

Leading Provider of Big Data Solutions & Support
Delivering Business Value Through Big Data
• Exclusive Focus on Big Data Tools, Technologies, and Techniques
• Onshore Team-Based Engineering and Data Science Methodology
• Prebuilt, Proven Components to Accelerate Delivery & Lower Risk

Agile Methodology
Experiment-Driven Short Sprints with Quick Release Cycles
We Accelerate Your Time to Value
• Breaking Down Business and IT Barriers
• Discrete Projects with Beginning and End
• Early Releases to Validate ROI and Ensure Long Term Success
Diagram labels: Data Engineers, Data Scientists, Business Goals, Innovation and Value

Jeffrey Breen
Director, Think Big Academy
• Principal Consultant and Hands-on Architect
• IT guy, Data guy, Open Source guy
• Pilot and Airplane Geek
• Twitter: @JeffreyBreen
• jeffrey.breen@thinkbiganalytics.com

Hadoop and NoSQL
• Not “either-or”
  − When together? Where? For what?
• Hadoop
  − Not a database
  − Low cost storage with fault tolerance
  − Batch-oriented analytics (MapReduce, Hive, Pig)
  − Not good for random access and/or updates
• NoSQL
  − Real databases with CRUD
  − Optimized for fast, random access
  − Many shapes and sizes (key-value, tabular, graph, document-oriented)

Reference Architecture (diagram)

Document-Oriented NoSQL with Hadoop
• Advantages
  − Simple but flexible data model
  − Field-level indexing for fast querying
  − Easy and open APIs and data exchange formats
• Examples
  1. Add Statefulness. Preserve state between jobs and other stateless operations.
  2. Analytics Store. Provide high performance destination for calculations and metrics.
  3. Secondary Indexing. Add low-latency querying and access for high-latency data stores like HDFS.

Example: Add Statefulness
• Overview
  − Sometimes you just need a fast and safe place to store data between jobs, applications, iterations
• Scenarios
  − Data extraction jobs
  − Ingestion processing status
  − Broadcasting “last best” parameters in machine learning, genetic algorithms, and other model fitting

    {
      "process": "db-extractor",
      "system": "database1",
      "tables": {
        "table1": {
          "columns": ["ts"],
          "values": ["2014-03-25 03:15:23"]
        },
        "table2": {
          "columns": ["client_id"],
          "values": ["43110221"]
        }
      }
    }
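
A minimal sketch of how an extraction job might use a state document of the shape shown on this slide. Everything here is an assumption for illustration: `StateStore` is a hypothetical in-memory stand-in for a MongoDB collection, and `incremental_where` is an invented helper, not part of any library.

```python
from typing import Dict, Optional


class StateStore:
    """Hypothetical in-memory stand-in for a MongoDB collection of job state."""

    def __init__(self) -> None:
        self._docs: Dict[tuple, dict] = {}

    def load(self, process: str, system: str) -> Optional[dict]:
        # Return the saved state for this process and system, or None.
        return self._docs.get((process, system))

    def save(self, doc: dict) -> None:
        # Upsert: replace any earlier state for the same process and system.
        self._docs[(doc["process"], doc["system"])] = doc


def incremental_where(state: Optional[dict], table: str) -> str:
    """Build a WHERE clause that resumes after the last extracted values.

    Illustration only: real code should use bound parameters rather than
    string interpolation.
    """
    if state is None or table not in state["tables"]:
        return ""  # no saved state, so extract everything
    entry = state["tables"][table]
    clauses = [
        f"{col} > '{val}'"
        for col, val in zip(entry["columns"], entry["values"])
    ]
    return "WHERE " + " AND ".join(clauses)
```

The job would load the document at startup, extract only rows beyond the saved values, and save an updated document once the run succeeds, so the next run picks up where this one stopped.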

Example: Analytics Store
• Great place to store aggregates and other calculated metrics
• Can be populated from batch or streaming analytics
• Great for serving live dashboards and reporting

    {
      "metric": "session-length",
      "visitor": "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}",
      "visit-start": "2014-03-25 03:15:23",
      "data": {
        "value": 245.3,
        "units": "seconds"
      }
    }
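
A minimal sketch of how a batch or streaming job might produce a metric document like the one on this slide. The function name and the definition of session length (time from a visitor's first page view to the last) are assumptions made for illustration; the slides do not say how the metric was computed.

```python
from datetime import datetime
from typing import List

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def session_length_metric(visitor: str, timestamps: List[str]) -> dict:
    """Build a session-length document from a visitor's page-view timestamps."""
    times = sorted(datetime.strptime(t, TS_FORMAT) for t in timestamps)
    start, end = times[0], times[-1]
    return {
        "metric": "session-length",
        "visitor": visitor,
        "visit-start": start.strftime(TS_FORMAT),
        # Assumed definition: elapsed time between first and last page view.
        "data": {"value": (end - start).total_seconds(), "units": "seconds"},
    }
```

Because the result is a plain dictionary, it can be handed to a document store as is, and a dashboard can look it up by `metric` and `visitor` without touching the raw clickstream.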

Example: Secondary Indexing
• HDFS is optimized for scans; seeks are very expensive
• As in relational databases, secondary indexes can be created on specific elements
• Hive even has indexing built in, but keeps the results on HDFS (still not optimized for seeks)
• Solution: Use separate NoSQL database for secondary indexes

Sample Clickstream Data
• Sample Omniture clickstream files are available from Hortonworks
  − 420,000+ page views over 15 days
  − https://s3.amazonaws.com/hw-sandbox/tutorial8/RefineDemoData.zip
• Example records combine web page and visitor information, including geocoding:

    1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} U en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 0 ABC 0 120 ABC 0
    1331434006 2012-03-10 18:46:46 2850864012585216412 6917530841728651042 FAS-2.8-AS3 N 0 24.6.122.234 1 0 10 http://www.acme.com/SH55126545/VD55177927 {52B4FFFE-606A-1C2B-77E7-F62057879CC8} U en-us 574 0 0 U U Y 0 0 304 comcast.net 10/2/2012 18:17:59 6 480 45 2 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1 71 0 37 2 0 los gatos usa 807 ca 0 0 0 0 0 KGO 0 120 KGO

Time-Partitioned Data
• Time is a very common dimension on which to organize data
• Great for processing incoming data and for filtering any time-based queries…
• …but can complicate other access patterns

Hive partitions correspond to directories on HDFS:

    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=1/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=2/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=3/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=4/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=5/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=6/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=7/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=8/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=9/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0
    […]
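
A minimal sketch of why time partitioning helps time-based queries: a date range maps directly to a short list of directories. The warehouse path and `year=/month=/day=` layout follow the listing on this slide; the function names are invented for illustration.

```python
from datetime import date, timedelta
from typing import List

# Table location as shown in the slide's directory listing.
WAREHOUSE = "/apps/hive/warehouse/omniture_daily"


def partition_dir(day: date) -> str:
    """Return the partition directory that holds one day's data."""
    return f"{WAREHOUSE}/year={day.year}/month={day.month}/day={day.day}"


def partitions_for_range(start: date, end: date) -> List[str]:
    """List the partition directories covering an inclusive date range."""
    days = (end - start).days + 1
    return [partition_dir(start + timedelta(days=i)) for i in range(days)]
```

A query filtered on a date range only needs the directories this returns. A query filtered on city has no such shortcut, because city is not part of the path, so every partition has to be scanned.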

Welcome to the Long Tail
Top 10 ≃ Bottom 2000
Distribution of geographic locations detected in clickstream data:

    > sum(subset(df, rank <= 10)$count)
    [1] 36986
    > sum(subset(df, rank > max(df$rank) - 2000)$count)
    [1] 33971

• In this sample clickstream data set, the top 10 cities account for more traffic than the bottom 2,000 combined
• Optimizations are usually designed for the most common cases
  − “Biggest bang for the buck” due to size, frequency, etc.
  − What are the chances that the optimizations you pick to handle the most common cases work well for the long tail?
  − What if a new business opportunity depends on the long tail?
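
The same head-versus-tail comparison can be sketched in Python. This is an illustration rather than the code behind the slide: the function name is invented, and it assumes each location has a distinct rank position, which may differ from how the `rank` column in the R data frame was built.

```python
from typing import Iterable, Tuple


def head_and_tail_totals(
    counts: Iterable[int], head: int = 10, tail: int = 2000
) -> Tuple[int, int]:
    """Sum the `head` largest counts and the `tail` smallest counts."""
    ranked = sorted(counts, reverse=True)
    # Clamp the start index so a tail longer than the list covers all of it.
    tail_start = max(len(ranked) - tail, 0)
    return sum(ranked[:head]), sum(ranked[tail_start:])
```

Fed the per-city page-view counts, the two totals correspond to the two `sum(subset(...))` calls shown in the R session.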

Secondary Indexing in Hive
• Hive has built-in facilities to index data

    create index location on table omniture_daily(city, state, country)
      as 'COMPACT' with deferred rebuild;
    alter index location on omniture_daily rebuild;

• Index stores pointers to locations of each found record (path, file, and byte offset)
• However, resulting index is partitioned the same way as the underlying table
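
To make the idea of "pointers to locations" concrete, here is a minimal sketch of an offset index over tab-separated records held in memory. It is not Hive's implementation: the function names and the in-memory `bytes` input are assumptions for illustration, and a real index would also record the file path for each entry.

```python
import io
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_offset_index(
    data: bytes, key_fields: Sequence[int]
) -> Dict[Tuple[str, ...], List[int]]:
    """Map each key (the chosen columns) to byte offsets of matching lines."""
    index = defaultdict(list)
    offset = 0
    for line in io.BytesIO(data):
        fields = line.rstrip(b"\n").split(b"\t")
        key = tuple(fields[i].decode() for i in key_fields)
        index[key].append(offset)
        offset += len(line)  # the next line starts where this one ends
    return dict(index)


def read_at(data: bytes, offset: int) -> str:
    """Seek straight to a byte offset and return the record that starts there."""
    stream = io.BytesIO(data)
    stream.seek(offset)
    return stream.readline().decode().rstrip("\n")
```

Building the index costs one full scan, but afterwards a lookup by key returns offsets that can be read directly, which is the access pattern the rest of this example relies on.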

Exporting Hive Data as JSON
• Hive can easily read/write JSON data via a SerDe:
  − https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master

    add jar json-serde-1.1.9.2-Hive13-jar-with-dependencies.jar;

    create table json_export (
      city string,
      state string,
      country string,
      bucketname string,
      offsets array<bigint>,
      year int,
      month int,
      day int
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    insert into table json_export
    select * from default__omniture_daily_location__;

Callouts:
• ROW FORMAT SERDE: column parsing determined by Hive SerDe classes
• STORED AS: Hadoop’s InputFormat and OutputFormat

Sample Index Entry
Hive indices contain the physical location of the original data, including byte offsets:

    {
      "city": "taunton",
      "state": "ma",
      "country": "usa",
      "bucketname": "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
      "offsets": [ 4748045, 3522685 ],
      "year": 2012,
      "day": 10,
      "month": 3
    }

Exporting Index Data to Mongo
• Since our Hive index data is now stored on HDFS in JSON format, it’s very easy to load into Mongo directly.
• Don’t do this in production, but that’s what makes simple examples so much fun:

    $ hadoop fs -text /apps/hive/warehouse/json_export/000000_0 | mongoimport --host localhost --db clickstream --collection locidx
    connected to: localhost
    Sat Sep 27 10:30:22.325 100 16/second
    Sat Sep 27 10:30:24.448 check 9 12262
    Sat Sep 27 10:30:24.449 imported 12262 objects

Querying the Index in Mongo

    $ mongo localhost
    MongoDB shell version: 2.4.6
    connecting to: localhost
    > use clickstream;
    switched to db clickstream
    > db.locidx.find( {'state':'ma', 'city':'taunton'} );
    { "_id" : ObjectId("5426f42e6a6b0b1939528f80"), "bucketname" : "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0", "offsets" : [ 4748045, 3522685 ], "month" : 3, "state" : "ma", "year" : 2012, "day" : 10, "country" : "usa", "city" : "taunton" }

Callouts:
• bucketname: specific file on HDFS containing the records of interest
• offsets: byte offsets within that file containing the records of interest

Using the index data to retrieve the original data

    $ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=3522685&length=615'; echo
    1331431385 2012-03-10 18:03:05 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 20 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC

    $ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=4748045&length=615'; echo
    1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC
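
The step from an index entry to the WebHDFS requests can be sketched as a small helper. The function name is invented for illustration, and the defaults are assumptions taken from this example: port 50070 and a fixed read length of 615 bytes are what the slide's `curl` commands use, and real record lengths will vary.

```python
from typing import List
from urllib.parse import urlencode, urlsplit


def webhdfs_open_urls(
    entry: dict, http_port: int = 50070, length: int = 615
) -> List[str]:
    """Turn one index entry into WebHDFS OPEN URLs, one per byte offset."""
    # "bucketname" is an hdfs:// URI; reuse its host and path over HTTP.
    parts = urlsplit(entry["bucketname"])
    base = f"http://{parts.hostname}:{http_port}/webhdfs/v1{parts.path}"
    return [
        base + "?" + urlencode({"op": "OPEN", "offset": offset, "length": length})
        for offset in sorted(entry["offsets"])
    ]
```

Each returned URL can be fetched with any HTTP client that follows redirects, as `curl -L` does here, since the NameNode redirects the read to a DataNode that holds the block.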

So what’s the right way to do it?
Check out the MongoDB Connector for Hadoop
• Available at https://github.com/mongodb/mongo-hadoop
• Contains a “storage engine” to connect Hive directly to MongoDB for live querying
• Provides a Hive SerDe for direct access to static BSON files (i.e., backup files)
• Allows Hadoop Streaming jobs (Python, Perl, R, etc.) access to Mongo files
• And more

Work with the Leading Innovator in Big Data
Diagram labels: Data Scientists, Data Architects, Data Solutions
Think Big. Start Smart. Scale Fast.
