Building a Business on Hadoop, HBase, and Open Source Distributed Computing

Building a Business on Open Source Distributed Computing company: www.visibletechnologies.com blog: www.roadtofailure.com twitter: @lusciouspear

Social Media and Scaling •Scalability Matters Now. •SM produces large, complex data •Anyone can collect the web •Make a Twitter in a few days •Easy to get TBs of data •Big Data enabling new fields for companies

What Visible Does •BI and Brand Management on Social Media •Listen, Monitor, Engage

Old Product: RDBMS •A few MSSQL servers on boxes •Lots of ETL •Several TB, inserts slow, deletes impossible, random fail

Why RDBMS Bad •Nonlinear scale cost •Used as a storage abstraction •Mainly Select, Join, Group, Count •Specialized Scale-Out ones ‘meh’ •Impedance Mismatch - Try to be High-Throughput, Low-Latency •Swiss-army knife, unstable, transactions, advanced SQL, tuning

Why OSS? •Previously all MS •It exists! •Scaling + Licensing = No •Can’t build a platform without source •It’s Enterprise Now!

Goals for New Platform •“Golden Timeline” •Search/Analyze *any* data •Linear Cost •Not Hacked Together •“Collect the Social Internet”

HOW TO SCALE •What makes you special? •What are you willing to sacrifice? •How will you structure the data?

Avoiding Impedance Mismatch •Most problems can be divided into High or Low latency •Get a lot of data eventually, or a little now •MapReduce vs. Sharding/Indexing

Ecosystem (diagram) •Compiled Processing: Pig, Cascading, Hive •Katta / Applications •Zookeeper •Raw Processing: MapReduce •Structured Storage: HBase •Unstructured Storage: Hadoop DFS

Simple Workflow (diagram) •Hadoop: Collect, Semantic Analysis, Unstructured Analysis •Hadoop + HBase: Structured Analysis, Store in HBase, Indexing, Store in Hadoop •Lucene + Solr + Katta: Pull Indexes, Load/Replicate Shards, Search

Unstructured Processing Cluster (diagram) •Internet (HTML, XML) -> Collect -> Store -> Semantic Analysis -> Unstructured Analysis -> Structured HBase Records

Hadoop + MR •Special: Crunch web-scale data fast •Sacrifice: Low-Latency, Transactions, Random Access, Updates •Structure: Chunked flat files
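
The trade-off on this slide (batch throughput over chunked flat files, paid for with latency and random access) can be illustrated with a toy, in-process imitation of the map, shuffle, and reduce phases. This is a minimal sketch in plain Python and not Hadoop's API; the function names and the sample chunks are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all values that were emitted for one key."""
    return key, sum(values)


def run_job(
    chunks: List[List[str]],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run mapper over every chunk, group by key (shuffle), then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for chunk in chunks:              # each chunk stands in for one file block
        for line in chunk:            # the whole chunk is read sequentially
            for key, value in mapper(line):
                groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: two "chunks" of a flat file.
chunks = [["iron maiden rules", "maiden rules"], ["iron man"]]
counts = run_job(chunks, map_words, reduce_counts)
```

Every chunk is scanned in full and nothing is updated in place, which is why this model suits "a lot of data eventually" rather than "a little now".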

Structured Processing Cluster (diagram) •Input: HBase Records from the Unstructured Cluster •Steps: Structured Analysis, Store in HBase, Indexing, Store in Hadoop •Data: Enriched Data, Lucene Index, Sharded Lucene Index •Output: to the Search Cluster

Document Structure
ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
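
As a rough illustration, the record on this slide could be modelled as a typed structure. This is a sketch under assumptions: the Python field names are hypothetical, PostDT is assumed to be a YYYYMMDD date, and ParentID is assumed to be optional for top-level posts. None of that is stated in the slides.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Document:
    """One collected post; field names mirror the slide's record."""
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: Optional[str]   # assumption: None for a top-level post
    permalink: str


def parse_post_dt(raw: str) -> date:
    """Parse PostDT, assuming the YYYYMMDD layout seen in the example."""
    return datetime.strptime(raw, "%Y%m%d").date()


# The sample values are copied from the slide.
doc = Document(
    content_id="00BAC189",
    title="Iron Maiden Rules",
    body="I think Janick Gers is an amazing guitarist blah blah",
    post_dt=parse_post_dt("20090718"),
    parent_id="0FDEADBEEF",
    permalink="www.roadtofailure.com/post?=20",
)
```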

HBase •Special: Scalable random/sequential access almost as fast as RDBMS •Sacrifice: Joins, Secondary Indexes, Transactions (kind of) •Structure: BigTable - column oriented
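
The BigTable-style layout can be pictured as rows kept sorted by key, each row holding "family:qualifier" cells. The toy class below is only an in-memory sketch of that idea, not the HBase client API; `ToyTable` and the column names `content:title` and `meta:post_dt` are hypothetical.

```python
import bisect
from typing import Dict, Iterator, List, Tuple


class ToyTable:
    """In-memory stand-in for a sorted, column-oriented table."""

    def __init__(self) -> None:
        self._keys: List[str] = []                   # row keys, kept sorted
        self._rows: Dict[str, Dict[str, str]] = {}   # row key -> cells

    def put(self, row_key: str, column: str, value: str) -> None:
        """Write one cell; new rows are inserted in sorted key order."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: str) -> Dict[str, str]:
        """Random access: fetch all cells of one row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Sequential access: rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("00BAC189", "content:title", "Iron Maiden Rules")
table.put("00BAC189", "meta:post_dt", "20090718")
table.put("00BAC18A", "content:title", "A reply")
table.put("0FDEADBEEF", "content:title", "Parent post")
```

Lookups work only by row key or key range; there is no join and no lookup by title, which matches the sacrifices listed on the slide.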

Search Cluster (diagram) •Pull Indexes from HDFS -> Load/Replicate Shards -> Search •Lucene Indexes at each stage

Katta + Solr •Special: Sharded search •Sacrifice: Consistency, high-throughput •Structure: Reverse (inverted) index
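
Sharded search over a reverse index can be sketched as several small term-to-document maps, with each query fanned out to every shard and the hits merged. This is a toy illustration and not the Katta, Solr, or Lucene API; the class names and the CRC-based shard assignment are assumptions made for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set


class Shard:
    """One reverse index: term -> set of document ids."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term: str) -> Set[str]:
        # .get() avoids creating empty entries for unknown terms
        return set(self.postings.get(term.lower(), set()))


class ShardedIndex:
    """Routes each document to one shard; queries hit all shards."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Shard] = [Shard() for _ in range(num_shards)]

    def add(self, doc_id: str, text: str) -> None:
        # Stable hash so the same document always lands on the same shard
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[slot].add(doc_id, text)

    def search(self, term: str) -> List[str]:
        hits: Set[str] = set()
        for shard in self.shards:      # fan out, then merge the results
            hits |= shard.search(term)
        return sorted(hits)


index = ShardedIndex(num_shards=3)
index.add("00BAC189", "Iron Maiden Rules")
index.add("00BAC18A", "Iron Man")
index.add("0FDEADBEEF", "Maiden voyage")
```

Each shard answers from its own copy of the data, so a freshly added document may not be visible everywhere at once in a real replicated setup; that is the consistency sacrifice named on the slide.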

BI •Group, Sort, Filter, Count, Sum •Semi-additive (Avg) rare but not hard •MapReduce Jobs •Faceted Search
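
Count and Sum can be merged across partitions directly, while an average cannot: averaging per-partition averages gives the wrong answer. A common workaround, sketched here under the assumption that each partition emits (sum, count) pairs, is to merge those pairs and divide only at the end. The function names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Partial = Tuple[float, int]   # (running sum, running count)


def partial_by_group(records: Iterable[Tuple[str, float]]) -> Dict[str, Partial]:
    """Per-partition step: accumulate (sum, count) for every group."""
    out: Dict[str, Partial] = {}
    for group, value in records:
        total, count = out.get(group, (0.0, 0))
        out[group] = (total + value, count + 1)
    return out


def merge_partials(partials: Iterable[Dict[str, Partial]]) -> Dict[str, Partial]:
    """Combine step: sums and counts are additive, so they merge safely."""
    merged: Dict[str, Partial] = {}
    for partial in partials:
        for group, (total, count) in partial.items():
            prev_total, prev_count = merged.get(group, (0.0, 0))
            merged[group] = (prev_total + total, prev_count + count)
    return merged


def averages(merged: Dict[str, Partial]) -> Dict[str, float]:
    """Final step: divide once, after all partitions are merged."""
    return {group: total / count for group, (total, count) in merged.items()}


# Hypothetical scores split across two partitions.
partition_a = [("brand_x", 4.0), ("brand_y", 1.0)]
partition_b = [("brand_x", 2.0), ("brand_x", 3.0)]
merged = merge_partials([partial_by_group(partition_a), partial_by_group(partition_b)])
result = averages(merged)
```

For `brand_x` the merged pair is (9.0, 3), whereas averaging the two per-partition averages (4.0 and 2.5) would give 3.25.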

Challenges •Scaling Search •Understanding Latency •What do we need ‘now’? Can customers wait for big data? •Monitoring

Recap: Rules for Scaling •RDBMS is not a Swiss-Army Knife •Know your sacrifices •Know your specialness •Know your data structure •Ponder Latency

What Next? •HBase Analytics? •“What would make a bank trust it” •Teach people to think about data

The End company: www.visibletechnologies.com blog: www.roadtofailure.com twitter: @lusciouspear bradfordstephens@gmail.com
