Building a Business on Open Source Distributed Computing
company: www.visibletechnologies.com
blog: www.roadtofailure.com
twitter: @lusciouspear
Social Media and Scaling
•Scalability Matters Now.
•Social media produces large, complex data
•Anyone can collect the web
•Make a Twitter in a few days
•Easy to get TBs of data
•Big Data enabling new fields for companies
What Visible Does
•BI and Brand Management on Social Media
•Listen, Monitor, Engage
Old Product: RDBMS
•A few MSSQL servers on boxes
•Lots of ETL
•Several TB, inserts slow, deletes impossible, random failures
Why RDBMS Bad
•Nonlinear scale cost
•Used as a storage abstraction
•Mainly Select, Join, Group, Count
•Specialized Scale-Out ones ‘meh’
•Impedance Mismatch - Try to be High-Throughput, Low-Latency
•Swiss-army knife, unstable, transactions, advanced SQL, tuning
Why OSS?
•Previously all MS
•It exists!
•Scaling + Licensing = No
•Can’t build a platform without source
•It’s Enterprise Now!
Goals for New Platform
•“Golden Timeline”
•Search/Analyze *any* data
•Linear Cost
•Not Hacked Together
•“Collect the Social Internet”
HOW TO SCALE
•What makes you special?
•What are you willing to sacrifice?
•How will you structure the data?
Avoiding Impedance Mismatch
•Most problems can be divided into High or Low latency
•Get a lot of data eventually, or a little now
•MapReduce vs. Sharding/Indexing
Ecosystem (layer diagram)
•Compiled Processing: Pig, Cascading, Hive
•Katta / Applications
•Zookeeper
•Raw Processing: MapReduce
•Structured Storage: HBase
•Unstructured Storage: Hadoop DFS
Simple Workflow (diagram)
•Hadoop: Collect, Semantic Analysis, Unstructured Analysis
•Hadoop + HBase: Structured Analysis, Store in HBase, Indexing, Store in Hadoop
•Lucene + Solr + Katta: Pull Indexes, Load/Replicate Shards, Search
Unstructured Processing Cluster (diagram)
•Stages shown: Internet, Collect, Store, Semantic Analysis, Unstructured Analysis, Structured HBase Records
•Formats shown: HTML, XML
Hadoop + MR
•Special: Crunch web-scale data fast
•Sacrifice: Low-Latency, Transactions, Random Access, Updates
•Structure: Chunked flat files
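To make the "chunked flat files" structure concrete, here is a minimal in-process sketch of the MapReduce pattern in Python. It is not Hadoop code: the sample chunks and the names `map_chunk` and `reduce_counts` are hypothetical, and a real job would run the map step on many machines rather than in a list comprehension.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each "chunk" stands in for one block of a large flat file.
chunks = [
    ["iron maiden rules", "maiden tour dates"],
    ["iron man review"],
]

def map_chunk(lines):
    """Map step: count words within a single chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# Each chunk is processed on its own, so this step could run in parallel.
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partials, Counter())

print(totals["iron"], totals["maiden"])
```

The sketch also shows the sacrifice named on the slide: every answer requires a pass over all chunks, so there is no cheap random access or in-place update.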
Structured Processing Cluster (diagram)
•Elements shown: Unstructured Cluster, HBase Records, Structured Analysis, Enriched Data, Store in HBase, Indexing, Lucene Index, Store in Hadoop, Sharded Lucene Index, Search Cluster
Document Structure
ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
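As an illustration only, the record above could be modeled as a small typed structure. The field names follow the slide; the Python types, the `Document` class name, and the assumption that PostDT is a YYYYMMDD string are mine, not the original schema.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Document:
    """One collected post; field names mirror the slide, types are assumed."""
    content_id: str
    title: str
    body: str
    post_dt: date
    parent_id: str
    permalink: str

def parse_post_dt(raw: str) -> date:
    """Parse a YYYYMMDD string such as '20090718' (format assumed from the slide)."""
    return datetime.strptime(raw, "%Y%m%d").date()

doc = Document(
    content_id="00BAC189",
    title="Iron Maiden Rules",
    body="I think Janick Gers is an amazing guitarist blah blah",
    post_dt=parse_post_dt("20090718"),
    parent_id="0FDEADBEEF",
    permalink="www.roadtofailure.com/post?=20",
)

print(doc.post_dt.isoformat())
```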
HBase
•Special: Scalable random/sequential access almost as fast as RDBMS
•Sacrifice: Joins, Secondary Indexes, Transactions (kind of)
•Structure: BigTable - column oriented
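A toy model may help show what a BigTable-style structure offers and what it gives up. The sketch below is not the HBase API: `TinyTable`, the `content:` and `meta:` column families, and the second row key are hypothetical. It keeps rows sorted by key with sparse columns per row, which supports lookup by key and range scans but offers no joins or secondary indexes.

```python
import bisect

class TinyTable:
    """Toy model of a BigTable-style table: rows sorted by key, sparse columns per row."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: look up one row by its key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Sequential access: yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

table = TinyTable()
table.put("00BAC189", "content:title", "Iron Maiden Rules")
table.put("00BAC189", "meta:post_dt", "20090718")
table.put("00BAC200", "content:title", "Another post")

print(table.get("00BAC189")["content:title"])
print([key for key, _ in table.scan("00BAC000", "00BAC190")])
```

Anything other than a row-key lookup or a key-range scan (for example, "all posts with a given title") would need a full scan or a separately maintained index.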
Search Cluster (diagram)
•Pull Indexes from HDFS
•Load/Replicate Shards
•Search
•Lucene Indexes
Search
Katta + Solr
•Special: Sharded search
•Sacrifice: Consistency, high-throughput
•Structure: Reverse index
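The idea of sharded search over a reverse (inverted) index can be sketched in a few lines. This is not Katta, Solr, or Lucene code: the `Shard` and `ShardedIndex` classes, the routing rule, and the sample documents are all hypothetical. Each shard holds its own term-to-document map, and a query is sent to every shard and the results merged.

```python
from collections import defaultdict

class Shard:
    """One index shard: a reverse (inverted) index from term to document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

class ShardedIndex:
    """Routes each document to one shard; queries fan out to all shards."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def add(self, doc_id, text):
        # Stable routing rule (sum of character codes), chosen only for illustration.
        shard = self.shards[sum(map(ord, doc_id)) % len(self.shards)]
        shard.add(doc_id, text)

    def search(self, term):
        hits = set()
        for shard in self.shards:   # scatter the query, then gather the hits
            hits |= shard.search(term)
        return sorted(hits)

index = ShardedIndex(shard_count=2)
index.add("doc1", "Iron Maiden Rules")
index.add("doc2", "Iron Man review")
index.add("doc3", "Maiden tour")

print(index.search("iron"))
print(index.search("Maiden"))
```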
BI
•Group, Sort, Filter, Count, Sum
•Semi-additive (Avg) rare but not hard
•MapReduce Jobs
•Faceted Search
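One common way to compute an average in a MapReduce-style job is to carry (sum, count) pairs, since sums and counts merge by addition while averages do not. The sketch below assumes that approach; it is not taken from the talk, and the records and the names `map_partial` and `merge_partials` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (brand, value) pairs, split into two chunks.
record_chunks = [
    [("acme", 10), ("acme", 30), ("globex", 5)],
    [("acme", 20), ("globex", 15)],
]

def map_partial(records):
    """Map step: per-chunk partial [sum, count] for each group key."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in records:
        partial[key][0] += value
        partial[key][1] += 1
    return partial

def merge_partials(partials):
    """Reduce step: sums and counts are additive, so partials merge by addition."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (subtotal, count) in partial.items():
            total[key][0] += subtotal
            total[key][1] += count
    return total

group_totals = merge_partials(map_partial(chunk) for chunk in record_chunks)

# The division happens only once, after all partials are merged.
averages = {key: subtotal / count for key, (subtotal, count) in group_totals.items()}

print(averages)
```

Averaging the per-chunk averages instead would give the wrong answer whenever chunks hold different numbers of records for a group, which is why the division is deferred to the end.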
Examples
Challenges
•Scaling Search
•Understanding Latency
•What do we need ‘now’? Can customers wait for big data?
•Monitoring
Recap: Rules for Scaling
•RDBMS is not a Swiss-Army Knife
•Know your sacrifices
•Know your specialness
•Know your data structure
•Ponder Latency
What Next?
•HBase Analytics?
•“What would make a bank trust it”
•Teach people to think about data
...
The End
company: www.visibletechnologies.com
blog: www.roadtofailure.com
twitter: @lusciouspear
bradfordstephens@gmail.com

Building a Business on Hadoop, HBase, and Open Source Distributed Computing