NoSQL to Augment Hadoop Big Data Platforms
Jeffrey Breen, Director, Think Big Academy
October 2014
CONFIDENTIAL
Outline
• Introduction
• Hadoop and NoSQL: What? Where? Why? When?
• Document-Oriented NoSQL and Hadoop
• Example: Add Statefulness
• Example: Analytics Store
• Example: Secondary Index
  − Caution: contains code
• MongoDB Connector for Hadoop
Leading Provider of Big Data Solutions & Support
Delivering Business Value Through Big Data
• Exclusive Focus on Big Data Tools, Technologies, and Techniques
• Onshore Team-Based Engineering and Data Science Methodology
• Prebuilt, Proven Components to Accelerate Delivery & Lower Risk
Agile Methodology: Experiment-Driven Short Sprints with Quick Release Cycles
We Accelerate Your Time to Value
• Breaking Down Business and IT Barriers
• Discrete Projects with Beginning and End
• Early Releases to Validate ROI and Ensure Long Term Success
Diagram labels: Data Engineers, Data Scientists, Business Goals, Innovation and Value
Jeffrey Breen
Director, Think Big Academy
Principal Consultant and Hands-on Architect
IT guy, Data guy, Open Source guy
Pilot and Airplane Geek
Twitter: @JeffreyBreen
jeffrey.breen@thinkbiganalytics.com
Hadoop and NoSQL
• Not “either-or”
  − When together? Where? For what?
• Hadoop
  − Not a database
  − Low cost storage with fault tolerance
  − Batch-oriented analytics (MapReduce, Hive, Pig)
  − Not good for random access and/or updates
• NoSQL
  − Real databases with CRUD
  − Optimized for fast, random access
  − Many shapes and sizes (key-value, tabular, graph, document-oriented)
Reference Architecture
Document-Oriented NoSQL with Hadoop
• Advantages
  − Simple but flexible data model
  − Field-level indexing for fast querying
  − Easy and open APIs and data exchange formats
• Examples
  1. Add Statefulness. Preserve state between jobs and other stateless operations.
  2. Analytics Store. Provide a high-performance destination for calculations and metrics.
  3. Secondary Indexing. Add low-latency querying and access for high-latency data stores like HDFS.
Example: Add Statefulness
• Overview
  − Sometimes you just need a fast and safe place to store data between jobs, applications, iterations
• Scenarios
  − Data extraction jobs
  − Ingestion processing status
  − Broadcasting “last best” parameters in machine learning, genetic algorithms, and other model fitting

    {
      "process": "db-extractor",
      "system": "database1",
      "tables": {
        "table1": { "columns": ["ts"], "values": ["2014-03-25 03:15:23"] },
        "table2": { "columns": ["client_id"], "values": ["43110221"] }
      }
    }
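A minimal sketch of how an extraction job might use such a state document, assuming the document shape shown above. The helper names `get_watermark` and `advance_watermark` are hypothetical, and a plain dict stands in for the document store so the example runs without a database.

```python
import copy

# State document in the shape shown on the slide.
state = {
    "process": "db-extractor",
    "system": "database1",
    "tables": {
        "table1": {"columns": ["ts"], "values": ["2014-03-25 03:15:23"]},
        "table2": {"columns": ["client_id"], "values": ["43110221"]},
    },
}


def get_watermark(doc, table):
    """Return {column: last_seen_value} for one table (hypothetical helper)."""
    entry = doc["tables"][table]
    return dict(zip(entry["columns"], entry["values"]))


def advance_watermark(doc, table, column, new_value):
    """Return a copy of the state document with one watermark replaced."""
    updated = copy.deepcopy(doc)
    entry = updated["tables"][table]
    entry["values"][entry["columns"].index(column)] = new_value
    return updated


# The next run reads where the last run stopped, extracts newer rows,
# then records the new high-water mark for the run after it.
last = get_watermark(state, "table1")
new_state = advance_watermark(state, "table1", "ts", "2014-03-26 03:15:23")
```

In a real deployment the read and the write would go to the NoSQL store, so the state survives between otherwise stateless jobs.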
Example: Analytics Store
• Great place to store aggregates and other calculated metrics
• Can be populated from batch or streaming analytics
• Great for serving live dashboards and reporting

    {
      "metric": "session-length",
      "visitor": "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}",
      "visit-start": "2014-03-25 03:15:23",
      "data": { "value": 245.3, "units": "seconds" }
    }
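As an illustration, a batch job could produce a metric document of this shape from one visitor's page-view timestamps. This is only a sketch: the function name `session_length_metric` is hypothetical, and defining session length as the time between the first and last hit is an assumption, not something the slides specify.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def session_length_metric(visitor, hit_times):
    """Build a session-length metric document from one visitor's hit timestamps."""
    times = sorted(datetime.strptime(t, FMT) for t in hit_times)
    # Assumption: session length = last hit minus first hit.
    seconds = (times[-1] - times[0]).total_seconds()
    return {
        "metric": "session-length",
        "visitor": visitor,
        "visit-start": times[0].strftime(FMT),
        "data": {"value": seconds, "units": "seconds"},
    }


doc = session_length_metric(
    "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}",
    ["2012-03-10 18:46:58", "2012-03-10 18:03:05"],
)
```

Each resulting document is small and self-describing, which is what makes it convenient for a dashboard to query directly.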
Example: Secondary Indexing
• HDFS is optimized for scans; seeks are very expensive
• As in relational databases, secondary indexes can be created on specific elements
• Hive even has indexing built in, but keeps the results on HDFS (still not optimized for seeks)
• Solution: Use a separate NoSQL database for secondary indexes
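The idea behind a secondary index can be sketched in a few lines: scan the data once, remember the byte offset of every record for each key, and later seek straight to those offsets. The example below is a toy, with an in-memory byte string standing in for a file on HDFS and hypothetical helper names `build_index` and `fetch`.

```python
import io
from collections import defaultdict

# Three tab-separated records standing in for a large file on HDFS.
data = (
    b"1331431385\ttaunton\tma\n"
    b"1331434006\tlos gatos\tca\n"
    b"1331434018\ttaunton\tma\n"
)


def build_index(raw, key_column):
    """One full scan: map each key value to the byte offsets of its records."""
    index = defaultdict(list)
    offset = 0
    for line in io.BytesIO(raw):
        fields = line.rstrip(b"\n").split(b"\t")
        index[fields[key_column].decode()].append(offset)
        offset += len(line)
    return dict(index)


def fetch(raw, offset):
    """Seek straight to one record instead of scanning the whole file."""
    stream = io.BytesIO(raw)
    stream.seek(offset)
    return stream.readline().decode().rstrip("\n")


index = build_index(data, key_column=1)
records = [fetch(data, off) for off in index["taunton"]]
```

The index itself is tiny compared with the data and is looked up by key, which is the access pattern a NoSQL database is built for.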
Sample Clickstream Data
• Sample Omniture clickstream files are available from Hortonworks
  − 420,000+ page views over 15 days
  − https://s3.amazonaws.com/hw-sandbox/tutorial8/RefineDemoData.zip
• Example records combine web page and visitor information, including geocoding:

    1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} U en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 0 ABC 0 120 ABC 0

    1331434006 2012-03-10 18:46:46 2850864012585216412 6917530841728651042 FAS-2.8-AS3 N 0 24.6.122.234 1 0 10 http://www.acme.com/SH55126545/VD55177927 {52B4FFFE-606A-1C2B-77E7-F62057879CC8} U en-us 574 0 0 U U Y 0 0 304 comcast.net 10/2/2012 18:17:59 6 480 45 2 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1 71 0 37 2 0 los gatos usa 807 ca 0 0 0 0 0 KGO 0 120 KGO
Time-Partitioned Data
• Time is a very common dimension on which to organize data
• Great for processing incoming data and for filtering any time-based queries…
• …but can complicate other access patterns

Hive partitions correspond to directories on HDFS:

    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=1/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=2/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=3/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=4/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=5/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=6/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=7/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=8/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=9/000000_0
    /apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0
    […]
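To make the trade-off concrete, the sketch below counts how many partition directories a query would have to read under this layout. The function names are hypothetical, and the assumption that the data covers 1-15 March 2012 is mine; the slides say only that the sample spans 15 days.

```python
def partition_path(year, month, day, table="omniture_daily",
                   warehouse="/apps/hive/warehouse"):
    """Directory holding one day's partition, following the layout above."""
    return f"{warehouse}/{table}/year={year}/month={month}/day={day}"


def paths_for_day_range(year, month, first_day, last_day):
    """A time filter prunes the scan to the matching directories only."""
    return [partition_path(year, month, d)
            for d in range(first_day, last_day + 1)]


# A query for 8-10 March 2012 touches three directories...
time_query = paths_for_day_range(2012, 3, 8, 10)

# ...while a query by city has no partition to prune on and must read them all
# (days 1-15 assumed here for illustration).
city_query = paths_for_day_range(2012, 3, 1, 15)
```

The time-based query reads a fifth of the data; the city-based query gets no help from the partitioning at all.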
Welcome to the Long Tail
Top 10 ≃ Bottom 2000: distribution of geographic locations detected in clickstream data

    > sum(subset(df, rank <= 10)$count)
    [1] 36986
    > sum(subset(df, rank > max(df$rank) - 2000)$count)
    [1] 33971

• In this sample clickstream data set, the top 10 cities account for more traffic than the bottom 2,000 combined
• Optimizations are usually designed for the most common cases
  − “Biggest bang for the buck” due to size, frequency, etc.
  − What are the chances that the optimizations you pick to handle the most common cases work well for the long tail?
  − What if a new business opportunity depends on the long tail?
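The same head-versus-tail comparison can be sketched outside R. The function name and the counts below are hypothetical and chosen only to keep the arithmetic easy to check; they are not taken from the clickstream data.

```python
def head_and_tail_totals(counts, head=10, tail=2000):
    """Sum traffic for the `head` busiest cities and the `tail` least busy."""
    ranked = sorted(counts, reverse=True)
    return sum(ranked[:head]), sum(ranked[-tail:])


# Hypothetical per-city page-view counts: 3 busy cities and 6 quiet ones.
counts = [500, 300, 200, 9, 8, 7, 3, 2, 1]
top, bottom = head_and_tail_totals(counts, head=3, tail=6)
```

With per-city counts from the real data set, calling the function with its defaults would mirror the two R expressions above.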
Secondary Indexing in Hive
• Hive has built-in facilities to index data:

    create index location on table omniture_daily(city, state, country)
      as 'COMPACT' with deferred rebuild;
    alter index location on omniture_daily rebuild;

• Index stores pointers to locations of each found record (path, file, and byte offset)
• However, the resulting index is partitioned the same way as the underlying table
Exporting Hive Data as JSON
• Hive can easily read/write JSON data via a SerDe:
  − https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master

    add jar json-serde-1.1.9.2-Hive13-jar-with-dependencies.jar;

    create table json_export (
      city string,
      country string,
      state string,
      bucketname string,
      offsets array<bigint>,
      year int,
      month int,
      day int
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE;

    insert into table json_export
      select * from default__omniture_daily_location__;

Callouts on the slide: “Column parsing determined by Hive SerDe classes” and “Hadoop’s InputFormat and OutputFormat”.
Sample Index Entry
Hive indices contain the physical location of the original data, including byte offsets:

    {
      "city": "taunton",
      "state": "ma",
      "country": "usa",
      "bucketname": "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
      "offsets": [ 4748045, 3522685 ],
      "year": 2012,
      "day": 10,
      "month": 3
    }
Exporting Index Data to Mongo
• Since our Hive index data is now stored on HDFS in JSON format, it’s very easy to load into Mongo directly.
• Don’t do this in production, but that’s what makes simple examples so much fun:

    $ hadoop fs -text /apps/hive/warehouse/json_export/000000_0 | mongoimport --host localhost --db clickstream --collection locidx
    connected to: localhost
    Sat Sep 27 10:30:22.325 100 16/second
    Sat Sep 27 10:30:24.448 check 9 12262
    Sat Sep 27 10:30:24.449 imported 12262 objects
Querying the Index in Mongo

    $ mongo localhost
    MongoDB shell version: 2.4.6
    connecting to: localhost
    > use clickstream;
    switched to db clickstream
    > db.locidx.find( {'state':'ma', 'city':'taunton'} );
    { "_id" : ObjectId("5426f42e6a6b0b1939528f80"), "bucketname" : "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0", "offsets" : [ 4748045, 3522685 ], "month" : 3, "state" : "ma", "year" : 2012, "day" : 10, "country" : "usa", "city" : "taunton" }

• bucketname: the specific file on HDFS containing the records of interest
• offsets: byte offsets within that file containing the records of interest
Using the index data to retrieve the original data

    $ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=3522685&length=615'; echo
    1331431385 2012-03-10 18:03:05 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 20 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC

    $ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=4748045&length=615'; echo
    1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC
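The step from an index entry to these WebHDFS requests can be sketched as follows. The function name `webhdfs_urls` is hypothetical; the HTTP port 50070 and the read length of 615 bytes are simply the values used in the curl commands above. The index stores offsets only, so the caller has to choose a length.

```python
from urllib.parse import urlencode, urlsplit


def webhdfs_urls(entry, length, http_port=50070):
    """Turn one index entry into WebHDFS OPEN URLs, one per byte offset."""
    parts = urlsplit(entry["bucketname"])  # hdfs://host:8020/path
    base = f"http://{parts.hostname}:{http_port}/webhdfs/v1{parts.path}"
    return [
        f"{base}?{urlencode({'op': 'OPEN', 'offset': off, 'length': length})}"
        for off in sorted(entry["offsets"])
    ]


# Index entry as returned by the Mongo query for Taunton, MA.
entry = {
    "bucketname": "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/"
                  "omniture_daily/year=2012/month=3/day=10/000000_0",
    "offsets": [4748045, 3522685],
}

urls = webhdfs_urls(entry, length=615)
```

Each URL can then be fetched with any HTTP client that follows redirects, as `curl -L` does above.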
So what’s the right way to do it?
Check out the MongoDB Connector for Hadoop
• Available at https://github.com/mongodb/mongo-hadoop
• Contains a “storage engine” to connect Hive directly to MongoDB for live querying
• Provides a Hive SerDe for direct access to static BSON files (i.e., backup files)
• Allows Hadoop Streaming jobs (python, perl, R, etc.) access to Mongo files
• And more
Work with the Leading Innovator in Big Data
DATA SCIENTISTS | DATA ARCHITECTS | DATA SOLUTIONS
Think Big. Start Smart. Scale Fast.

Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Data Platforms


Editor's Notes

  • #4 Think Big is a leading provider of big data solutions and analytic applications. We achieve this by working in lock-step with business leaders to align their goals with big data strategy and planning services, which become the roadmap for the data science and data engineering services we provide to implement big data projects.