Best Practices for Building Robust Data Platform with Apache Spark and Delta
The document outlines best practices for building a robust data platform using Apache Spark and Delta Lake, emphasizing effective data strategy, performance tuning, and governance. Key recommendations include optimizing cluster sizing based on workload types, improving data processing speeds through techniques like compaction and adaptive query execution, and ensuring compliance with governance policies. It also discusses the importance of managing data quality and security to drive business value.
Best Practices for Building Robust Data Platform with Apache Spark & Delta
Vini Jaiswal, Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
Agenda
▪ Data Strategy - Optimizing the cost to drive business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
▪ Bringing it all together - A reference architecture
Data Challenges
● The data warehouse limits the potential of intelligence
● Data volume is growing rapidly
● More variety of data -> different applications
● Need for faster processing and scalability
● Data silos limit innovation
Promise of the Data Lake
1. Collect everything
2. Store it all in the data lake
3. Data Science & Machine Learning
Usual data lakes fall short of the ideal: no atomicity, no quality enforcement, no consistency / isolation ✗
Delta Lake brings reliability and high-quality data:
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by Apache Spark
● Unifies Streaming / Batch
Reference: https://youtu.be/qtCxNSmTejk
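A minimal PySpark sketch of those reliability features, assuming the open-source Delta Lake package is available on the cluster and using a hypothetical table path /tmp/delta/events:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package (io.delta:delta-core) is on the classpath.
spark = (SparkSession.builder
         .appName("delta-reliability-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID transactions: the write commits atomically or not at all.
spark.range(0, 1000).toDF("event_id") \
     .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending data with a mismatched schema raises an
# AnalysisException instead of silently corrupting the table.
# spark.range(0, 10).toDF("some_other_column") \
#      .write.format("delta").mode("append").save(path)

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```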
Getting the Data Right (example use case: audience segmentation)
Table categorization - align with business outcomes: is my data use case worthy? Is my data ready for Analytics / ML?
● Bronze - raw ingestion (CSV, JSON, TXT… data types)
● Silver - filtered, cleaned, augmented
● Gold - business-level aggregates
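A minimal PySpark sketch of that Bronze/Silver/Gold layering, assuming an existing SparkSession named spark; the paths under /mnt/lake and the columns event_id, event_ts and event_date are hypothetical:

```python
# Bronze: land raw files (CSV, JSON, TXT, ...) as-is in Delta for durability and replay.
raw = spark.read.json("/mnt/lake/raw/events/")
raw.write.format("delta").mode("append").save("/mnt/lake/bronze/events")

# Silver: filtered, cleaned, augmented records.
bronze = spark.read.format("delta").load("/mnt/lake/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter("event_ts IS NOT NULL"))
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/events")

# Gold: business-level aggregates ready for analytics / ML.
gold = silver.groupBy("event_date").count()
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/daily_event_counts")
```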
Best Practices for Cluster Sizing & Selection
1. Selection of instance types
   a. Workload type
   b. Use cases
2. Selection of node size
   a. Observe metrics
   b. Tweak workloads
Selection of Instance Types
● Memory Optimized (AWS: r5, Azure: Dsv2) - memory-intensive applications. Use case: ML workload with data caching.
● Compute Optimized (AWS: c5, Azure: Fsv2) - Structured Streaming, distributed analytics, data science applications. Use case: ETL with full file scans and no data reuse.
● Storage Optimized (AWS: i3, Azure: Lsv2) - use cases that require higher disk throughput and IO. Use case: analytics - storage-optimized i3 class with the Delta IO cache.
Reference for Azure types: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS types: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
Selection of node size - rule of thumb
1. Fewer big instances > more small instances
   a. Larger heap = longer GC pauses
   b. Run multiple executors per machine
2. Size based on the number of tasks initially, tweak later
   a. Run the job with a small cluster to get an idea of the number of tasks
   b. Observe cluster metrics for CPU, memory and network utilization
Observe the Spark UI & tweak the workloads
● Fully cached with room to spare? -> Decrease the number of instances
● Almost completely cached? -> Increase the cluster size
● Not even close to cached? -> Consider instances with SSDs instead of EBS, or use the R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
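For context on the caching referred to above, here is a hedged sketch of enabling the Databricks Delta (IO) cache and, by comparison, the Spark (RDD) cache; it assumes an existing SparkSession named spark and a hypothetical table path:

```python
# Delta/IO cache: caches copies of remote Parquet data on the workers' local
# SSDs (works best on storage-optimized instances such as i3 / Lsv2).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark (RDD) cache: keeps the DataFrame in executor memory instead; useful
# when the same data is reused many times and fits in RAM.
df = spark.read.format("delta").load("/mnt/lake/silver/events")  # hypothetical path
df.cache()
df.count()  # materialize the cache, then check the Storage tab in the Spark UI
```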
Observe Ganglia metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
Performance Symptoms
Can I make my Spark application run faster? Look for these four symptoms:
● Shuffle
● Spill
● Skew
● Small files
I found Shuffle, now what?
● Use a broadcast join
● Review the join order
Before (sort-merge join): query completion time 28 minutes, rows output: 2,509,189,313
After (broadcast join): 1.8 minutes, rows output: 1,023
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
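A hedged PySpark sketch of the broadcast-join hint referred to above; the table paths and the join key country_code are hypothetical:

```python
from pyspark.sql.functions import broadcast

large = spark.read.format("delta").load("/mnt/lake/silver/transactions")
small = spark.read.format("delta").load("/mnt/lake/silver/country_codes")

# Hint Spark to broadcast the small dimension table to every executor,
# turning an expensive sort-merge join (full shuffle of both sides) into
# a broadcast-hash join.
joined = large.join(broadcast(small), on="country_code", how="inner")

# Equivalent SQL hint:
# SELECT /*+ BROADCAST(c) */ *
# FROM transactions t JOIN country_codes c ON t.country_code = c.country_code
```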
I found Spill, now what?
● Increase shuffle partitions (in this example, from 16 to 48): set spark.sql.shuffle.partitions=48
● Reduce the number of cores: spark.executor.cores < total cores per worker
● Use a larger cluster with faster disks (SSDs)
The more spill you can remove, the larger the impact!
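Applied as configuration, the settings above could look like this (48 is the slide's example value, not a universal default; executor cores are set at cluster launch time):

```python
# More shuffle partitions -> smaller partitions per task -> less spill to disk.
spark.conf.set("spark.sql.shuffle.partitions", 48)

# Fewer concurrent tasks per executor leaves more memory per task.
# This is a cluster-level setting, e.g. at submit time:
#   spark-submit --conf spark.executor.cores=4 ...
```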
I found Skew, now what?
Symptoms
● Ganglia CPU usage stays low for a long time after the initial high usage
● Task duration: significant difference between the max and the 75th/25th percentile values
● Input size / records
What to do?
● Use a broadcast join
● Use a skew join
● Filter out large keys / salt the keys and set up multiple reduce steps
● Explicitly repartition the data on a different field
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
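One of the options above, salting the join key, could look roughly like this in PySpark; the tables, the key user_id and the salt width of 16 are all hypothetical:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # hypothetical; tune to the observed skew

facts = spark.read.format("delta").load("/mnt/lake/silver/clicks")
dims = spark.read.format("delta").load("/mnt/lake/silver/users")

# Spread the hot keys on the large side across SALT_BUCKETS random buckets...
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per bucket so every salted key still matches.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)])))

joined = (salted_facts
          .join(salted_dims, on=["user_id", "salt"], how="inner")
          .drop("salt"))
```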
Adaptive Query Execution
▪ Reduces the manual effort of tuning spark.sql.shuffle.partitions
▪ Turned off by default; set spark.sql.adaptive.enabled=true
▪ Dynamically changes sort-merge joins into broadcast-hash joins
▪ Dynamically optimizes skew joins
*Available in DBR 7.x / Spark 3.0
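Enabling it is a one-line configuration change (shown here for Spark 3.0 / DBR 7.x, where AQE ships but is off by default):

```python
# Let Spark re-optimize the plan at runtime: coalesce shuffle partitions,
# convert sort-merge joins to broadcast joins, and split skewed partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```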
I found a lot of small files, now what?
Upstream
● Fix the upstream application that is building tons of small files
● Use a separate tool to compact them before processing with Spark
Changes in the Spark application
● Write your own compaction job
● Delta solves this problem!
Compaction
● Improves read performance
● Solves the small-files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
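Along the lines of the Delta best-practices reference above, compaction rewrites a table or partition into fewer, larger files without changing the data; a hedged sketch in which the path, the partition filter and the target of 16 files are hypothetical:

```python
path = "/mnt/lake/bronze/events"   # hypothetical Delta table path
num_files = 16                     # hypothetical target file count
partition = "event_date = '2020-06-01'"

(spark.read
      .format("delta")
      .load(path)
      .where(partition)                    # optionally limit compaction to one partition
      .repartition(num_files)
      .write
      .option("dataChange", "false")       # mark the commit as compaction-only
      .option("replaceWhere", partition)   # overwrite only that partition
      .format("delta")
      .mode("overwrite")
      .save(path))
```

On Databricks, the OPTIMIZE SQL command achieves the same goal for Delta tables without a hand-written job.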
Auto Optimize
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
● Optimizes Apache Spark partition sizes
● Maximizes the throughput of data being written
● Compacts small files within partitions
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
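On Databricks, Auto Optimize is enabled per table through table properties; a hedged sketch for a hypothetical table named events:

```python
# Optimized Writes right-sizes files during the write; Auto Compaction
# compacts small files after each write.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```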
Data Governance with Delta Lake
● Create a retention policy to age out and erase raw data that may contain personal information
● Satisfy compliance requests using UPDATE / DELETE commands
● Create tables that don't contain personal data: high-level aggregates (e.g. # of users that took an action)
● Maintain a historical data repository that is easy to navigate, with pseudonymization
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
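A hedged sketch of how those compliance operations look with the Delta Lake Python API; the paths, column names and the 30-day/168-hour windows are hypothetical, not prescribed retention periods:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

events = DeltaTable.forPath(spark, "/mnt/lake/bronze/events")

# Satisfy a right-to-be-forgotten request for a single user.
events.delete(F.col("user_id") == "user-123")

# Age out raw records older than the retention policy (30 days here).
events.delete("ingest_date < date_sub(current_date(), 30)")

# Physically remove files that no longer belong to the table, once the
# time-travel retention window (168 hours here) has passed.
spark.sql("VACUUM delta.`/mnt/lake/bronze/events` RETAIN 168 HOURS")
```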
Governance - The Who/What/Where
Many players in the org: managing access, roles and responsibilities, as well as managing usage, is a must.
● Data Engineer - performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices, including source control, unit tests, and automation
● Data Scientist - drives product innovation with state-of-the-art machine learning models applied to big data
● Data/Business Analyst - improves business processes by providing standardized and ad-hoc business analysis; acts as an intermediary between the Analytics and Business teams
● Automated Jobs - perform automated jobs based on Data Engineering configs
Bringing it together - A reference pipeline
● Ingest: BLOB storage, DB/DW, streaming sources
● ETL/Data Processing: massively scalable data cleansing & transformation through Bronze, Silver and Gold Delta pipelines
● Data Science & ML: ML Runtime
● Serving: APIs, Jobs, Models, Notebooks, Dashboards
● Execution: Databricks Runtime - reliability & performance, optimized Spark clusters, storage
● Consumers and surrounding concerns: Business Unit, Operations & Security, Orchestration, CI/CD
The same reference pipeline, annotated with the themes of the talk: Data Strategy, Cost Optimization & Performance Tuning, Business Value, and Security.