Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016
About Me: 2005 Mobile Web & Voice Search, 2012 Reporting & Analytics, 2014 Solutions Engineering
Is this your Spark infrastructure? This system talks like a SQL database… but the performance is very different. (Diagram: SQL interface on Spark over HDFS)
Just in Time Data Warehouse w/ Spark, and more… (Diagram: Spark over HDFS)
Today’s Goal: Know when to use other data stores besides file systems.
Good: General Purpose Processing
Types of data sets to store in file systems:
• Archival data
• Unstructured data
• Social media and other web datasets
• Backup copies of data stores
Good: General Purpose Processing
Types of workloads:
• Batch workloads
• Ad hoc analysis (best practice: use in-memory caching; see the sketch below)
• Multi-step pipelines
• Iterative workloads
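A minimal sketch of the in-memory caching best practice for ad hoc analysis, using the Spark 1.x sqlContext API that the later slides use; the path and table name are hypothetical:

    // Register the data as a table and pin it in memory, so repeated
    // ad hoc queries avoid rereading the files from HDFS.
    val events = sqlContext.read.parquet("hdfs:///data/events")  // hypothetical path
    events.registerTempTable("events")
    sqlContext.cacheTable("events")  // lazy: materialized on first use

    // The first query populates the cache; subsequent ones hit memory.
    sqlContext.sql("select country, count(*) from events group by country").show()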
Good: General Purpose Processing
Benefits:
• Inexpensive storage
• Incredibly flexible processing
• Speed and scale
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2134823")
Will this command run in Spark? Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
Bad: Random Access
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column (sketch below).
• Key-value NoSQL stores retrieve the value of a key efficiently out of the box.
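For the indexed-database route, a hedged sketch of pushing a key lookup down to the database over JDBC, so the database's index answers it instead of a Spark file scan; the URL, credentials, and driver are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "reader")        // placeholder credentials
    props.setProperty("password", "secret")

    // Wrapping the lookup in a subquery lets the database's index on `id`
    // do the work; Spark receives only the matching row(s).
    val lookup = "(select * from my_large_table where id = 2134823) as t"
    sqlContext.read.jdbc("jdbc:postgresql://dbhost:5432/mydb", lookup, props).show()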
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files (a sketch follows).
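A minimal compaction sketch for Option 2, assuming the table lives as many small Parquet files in one directory; the paths and target file count are placeholders:

    // Read the many small files produced by repeated inserts...
    val table = sqlContext.read.parquet("hdfs:///warehouse/myTable")

    // ...and rewrite them as a few larger files; afterwards, swap the
    // compacted directory in for the original.
    table.coalesce(8).write.parquet("hdfs:///warehouse/myTable_compacted")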
Good: Data Transformation / ETL
Use Spark to slice and dice your data files any way you need.
File storage is cheap: it is not an “anti-pattern” to store duplicate copies of your data.
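A small ETL sketch in that spirit: read raw JSON, reshape it, and write a columnar copy alongside the originals; paths and field names are hypothetical:

    // Raw logs in, analytics-friendly Parquet out; the source files stay put.
    val raw = sqlContext.read.json("hdfs:///raw/logs")
    val cleaned = raw.select("date", "userId", "url")
                     .filter("userId is not null")
    cleaned.write.partitionBy("date").parquet("hdfs:///etl/clean_logs")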
Bad: Frequent/Incremental Updates
Update statements are not supported yet. Why not?
• Random access: locate the row(s) in the files.
• Delete & insert: delete the old row and insert a new one.
• Update: file formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
Bad: Frequent/Incremental Updates
Use case: up-to-date, live views of your SQL tables (a database snapshot plus an incremental SQL query).
Tip: Use CLUSTER BY for fast joins, or bucketing with Spark 2.0.
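A hedged sketch of that live-view pattern: combine a periodic database snapshot with the incremental rows that arrived since, preferring the newer version of each key; the table layout and the `id` key are assumptions:

    import org.apache.spark.sql.functions.col

    val snapshot   = sqlContext.read.parquet("hdfs:///snapshots/accounts")  // periodic dump
    val increments = sqlContext.read.parquet("hdfs:///incoming/accounts")   // changed rows

    // Snapshot rows that were NOT updated...
    val unchanged = snapshot
      .join(increments.select("id").withColumnRenamed("id", "inc_id"),
            snapshot("id") === col("inc_id"), "left_outer")
      .filter(col("inc_id").isNull)
      .drop("inc_id")

    // ...plus every updated row gives an up-to-date view of the table.
    val live = unchanged.unionAll(increments)
    live.registerTempTable("accounts_live")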
Good: Connecting BI Tools (BI tools connect to Spark, with the data on HDFS)
Tip: Cache your tables for optimal performance.
Bad: External Reporting w/ Load
Too many concurrent requests will overload Spark.
Bad: External Reporting w/ Load
Solution: Write the results out to a database to handle the load.
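A hedged sketch of that handoff: Spark computes the report once and writes it to a database over JDBC, so concurrent report requests hit the database instead of Spark; URL, credentials, and table names are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "writer")        // placeholder credentials
    props.setProperty("password", "secret")

    // The heavy aggregation runs once in Spark...
    val report = sqlContext.sql(
      "select region, sum(revenue) as revenue from sales group by region")

    // ...and the small result table serves the reporting load from the DB.
    report.write.mode("overwrite")
      .jdbc("jdbc:postgresql://dbhost:5432/reports", "region_revenue", props)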
Good: Machine Learning & Data Science
Use MLlib, GraphX, and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
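A small MLlib sketch of the kind of iterative workload that benefits from in-memory caching; the input path and feature format are hypothetical:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse numeric feature vectors and cache them: k-means makes many
    // passes over the data, so keeping it in memory pays off.
    val points = sc.textFile("hdfs:///features/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Cluster into 10 groups with up to 20 iterations.
    val model = KMeans.train(points, 10, 20)
    println("Cost: " + model.computeCost(points))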
Bad: Searching Content w/ Load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results: a leading-wildcard LIKE forces a full scan.
Thank you
