Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
About Me: 2005 Mobile Web & Voice Search, 2012 Reporting & Analytics, 2014 Solutions Engineering
Is this your Spark infrastructure? This system talks like a SQL database… but the performance is very different. (Diagram: SQL interface on Spark over HDFS)
Just in Time Data Warehouse w/ Spark, and more… (Diagram: Spark over HDFS)
Today’s Goal: Know when to use other data stores besides file systems.
Good: General Purpose Processing
Types of data sets to store in file systems:
• Archival data
• Unstructured data
• Social media and other web datasets
• Backup copies of data stores
Good: General Purpose Processing
Types of workloads:
• Batch workloads
• Ad hoc analysis (best practice: use in-memory caching; see the sketch below)
• Multi-step pipelines
• Iterative workloads
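A minimal sketch of the in-memory caching best practice for ad hoc analysis, using the Spark 1.x sqlContext API that the later slides use; the path and table name are hypothetical:

    // Register the data as a table and pin it in memory, so repeated
    // ad hoc queries avoid rereading the files from HDFS.
    val events = sqlContext.read.parquet("hdfs:///data/events")  // hypothetical path
    events.registerTempTable("events")
    sqlContext.cacheTable("events")  // lazy: materialized on first use

    // The first query populates the cache; subsequent ones hit memory.
    sqlContext.sql("select country, count(*) from events group by country").show()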
Good: General Purpose Processing
Benefits:
• Inexpensive storage
• Incredibly flexible processing
• Speed and scale
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2134823")
Will this command run in Spark? Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Bad: Random Access
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column (sketch below).
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
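For the indexed-database route, a hedged sketch of pushing a key lookup down to the database over JDBC, so the database's index answers it instead of a Spark file scan; the URL, credentials, and driver are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "reader")        // placeholder credentials
    props.setProperty("password", "secret")

    // Wrapping the lookup in a subquery lets the database's index on `id`
    // do the work; Spark receives only the matching row(s).
    val lookup = "(select * from my_large_table where id = 2134823) as t"
    sqlContext.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", lookup, props).show()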
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files (a sketch follows).
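A minimal compaction sketch for Option 2, assuming the table lives as many small Parquet files in one directory; the paths and target file count are placeholders:

    // Read the many small files produced by repeated inserts...
    val table = sqlContext.read.parquet("hdfs:///warehouse/myTable")

    // ...and rewrite them as a few larger files; afterwards, swap the
    // compacted directory in for the original.
    table.coalesce(8).write.parquet("hdfs:///warehouse/myTable_compacted")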
Good: Data Transformation / ETL
Use Spark to slice and dice your data files any way you need.
File storage is cheap: it is not an “anti-pattern” to store duplicate copies of your data.
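A small ETL sketch in that spirit: read raw JSON, reshape it, and write a columnar copy alongside the originals; paths and field names are hypothetical:

    // Raw logs in, analytics-friendly Parquet out; the source files stay put.
    val raw = sqlContext.read.json("hdfs:///raw/logs")
    val cleaned = raw.select("date", "userId", "url")
                     .filter("userId is not null")
    cleaned.write.partitionBy("date").parquet("hdfs:///etl/clean_logs")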
Bad: Frequent/Incremental Updates
Update statements are not supported yet. Why not?
• Random access: locate the row(s) in the files.
• Delete & insert: delete the old row and insert a new one.
• Update: file formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Bad: Frequent/Incremental Updates
Use case: up-to-date, live views of your SQL tables (a database snapshot plus an incremental SQL query).
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0.
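A hedged sketch of that live-view pattern: combine a periodic database snapshot with the incremental rows that arrived since, preferring the newer version of each key; the table layout and the `id` key are assumptions:

    import org.apache.spark.sql.functions.col

    val snapshot   = sqlContext.read.parquet("hdfs:///snapshots/accounts")  // periodic dump
    val increments = sqlContext.read.parquet("hdfs:///incoming/accounts")   // changed rows

    // Snapshot rows that were NOT updated...
    val unchanged = snapshot
      .join(increments.select("id").withColumnRenamed("id", "inc_id"),
            snapshot("id") === col("inc_id"), "left_outer")
      .filter(col("inc_id").isNull)
      .drop("inc_id")

    // ...plus every updated row gives an up-to-date view of the table.
    val live = unchanged.unionAll(increments)
    live.registerTempTable("accounts_live")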
Good: Connecting BI Tools (BI tools connect to Spark, with the data on HDFS)
Tip: Cache your tables for optimal performance.
Bad: External Reporting w/ Load
Too many concurrent requests will overload Spark.
Bad: External Reporting w/ Load
Solution: Write the results out to a database to handle the load.
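A hedged sketch of that handoff: Spark computes the report once and writes it to a database over JDBC, so concurrent report requests hit the database instead of Spark; URL, credentials, and table names are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "writer")        // placeholder credentials
    props.setProperty("password", "secret")

    // The heavy aggregation runs once in Spark...
    val report = sqlContext.sql(
      "select region, sum(revenue) as revenue from sales group by region")

    // ...and the small result table serves the reporting load from the DB.
    report.write.mode("overwrite")
      .jdbc("jdbc:postgresql://dbhost:5432/reports", "region_revenue", props)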
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
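A small MLlib sketch of the kind of iterative workload that benefits from in-memory caching; the input path and feature format are hypothetical:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse numeric feature vectors and cache them: k-means makes many
    // passes over the data, so keeping it in memory pays off.
    val points = sc.textFile("hdfs:///features/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Cluster into 10 groups with up to 20 iterations.
    val model = KMeans.train(points, 10, 20)
    println("Cost: " + model.computeCost(points))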
Bad: Searching Content w/ Load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results: a leading-wildcard LIKE forces a full scan.
Thank you
