Best Practices for Building Robust Data Platform with Apache Spark and Delta
The document outlines best practices for building a robust data platform using Apache Spark and Delta Lake, emphasizing effective data strategy, performance tuning, and governance. Key recommendations include optimizing cluster sizing based on workload types, improving data processing speeds through techniques like compaction and adaptive query execution, and ensuring compliance with governance policies. It also discusses the importance of managing data quality and security to drive business value.
Best Practices for Building Robust Data Platform with Apache Spark & Delta
Vini Jaiswal, Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
Agenda
▪ Data Strategy - Optimizing the cost to drive business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
▪ Bringing it all together - A reference architecture
Data Challenges
● The data warehouse limits the potential of intelligence
● Data volume is growing rapidly
● More variety of data -> different applications
● Need for faster processing and scalability
● Data silos limit innovation
Promise of the Data Lake
1. Collect everything
2. Store it all in the data lake
3. Data Science & Machine Learning
Usual data lakes fall short of the ideal: no atomicity, no quality enforcement, no consistency / isolation ✗
Delta Lake brings reliability and high-quality data:
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by Apache Spark
● Unifies Streaming / Batch
Reference: https://youtu.be/qtCxNSmTejk
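A minimal PySpark sketch of those reliability features, assuming the open-source Delta Lake package is available on the cluster and using a hypothetical table path /tmp/delta/events:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package (io.delta:delta-core) is on the classpath.
spark = (SparkSession.builder
         .appName("delta-reliability-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID transactions: the write commits atomically or not at all.
spark.range(0, 1000).toDF("event_id") \
     .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending data with a mismatched schema raises an
# AnalysisException instead of silently corrupting the table.
# spark.range(0, 10).toDF("some_other_column") \
#      .write.format("delta").mode("append").save(path)

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```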
Getting the Data Right (example use case: audience segmentation)
Table categorization - align with business outcomes: is my data use case worthy? Is my data ready for Analytics / ML?
● Bronze - raw ingestion (CSV, JSON, TXT… data types)
● Silver - filtered, cleaned, augmented
● Gold - business-level aggregates
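A minimal PySpark sketch of that Bronze/Silver/Gold layering, assuming an existing SparkSession named spark; the paths under /mnt/lake and the columns event_id, event_ts and event_date are hypothetical:

```python
# Bronze: land raw files (CSV, JSON, TXT, ...) as-is in Delta for durability and replay.
raw = spark.read.json("/mnt/lake/raw/events/")
raw.write.format("delta").mode("append").save("/mnt/lake/bronze/events")

# Silver: filtered, cleaned, augmented records.
bronze = spark.read.format("delta").load("/mnt/lake/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter("event_ts IS NOT NULL"))
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/events")

# Gold: business-level aggregates ready for analytics / ML.
gold = silver.groupBy("event_date").count()
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/daily_event_counts")
```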
Best Practices for Cluster Sizing & Selection
1. Selection of instance types
   a. Workload type
   b. Use cases
2. Selection of node size
   a. Observe metrics
   b. Tweak workloads
Selection of Instance Types
● Memory Optimized (AWS: r5, Azure: Dsv2) - memory-intensive applications. Use case: ML workload with data caching.
● Compute Optimized (AWS: c5, Azure: Fsv2) - Structured Streaming, distributed analytics, data science applications. Use case: ETL with full file scans and no data reuse.
● Storage Optimized (AWS: i3, Azure: Lsv2) - use cases that require higher disk throughput and IO. Use case: analytics - storage-optimized i3 class with the Delta IO cache.
Reference for Azure types: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS types: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
Selection of node size - rule of thumb
1. Fewer big instances > more small instances
   a. Larger heap = longer GC pauses
   b. Run multiple executors per machine
2. Size based on the number of tasks initially, tweak later
   a. Run the job with a small cluster to get an idea of the number of tasks
   b. Observe cluster metrics for CPU, memory and network utilization
Observe the Spark UI & tweak the workloads
● Fully cached with room to spare? -> Decrease the number of instances
● Almost completely cached? -> Increase the cluster size
● Not even close to cached? -> Consider instances with SSDs instead of EBS, or use the R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
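For context on the caching referred to above, here is a hedged sketch of enabling the Databricks Delta (IO) cache and, by comparison, the Spark (RDD) cache; it assumes an existing SparkSession named spark and a hypothetical table path:

```python
# Delta/IO cache: caches copies of remote Parquet data on the workers' local
# SSDs (works best on storage-optimized instances such as i3 / Lsv2).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Spark (RDD) cache: keeps the DataFrame in executor memory instead; useful
# when the same data is reused many times and fits in RAM.
df = spark.read.format("delta").load("/mnt/lake/silver/events")  # hypothetical path
df.cache()
df.count()  # materialize the cache, then check the Storage tab in the Spark UI
```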
Observe Ganglia metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
Performance Symptoms
Can I make my Spark application run faster? Look for these four symptoms:
● Shuffle
● Spill
● Skew
● Small files
I found Shuffle, now what?
● Use a broadcast join
● Review the join order
Before (sort-merge join): query completion time 28 minutes, rows output: 2,509,189,313
After (broadcast join): 1.8 minutes, rows output: 1,023
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
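A hedged PySpark sketch of the broadcast-join hint referred to above; the table paths and the join key country_code are hypothetical:

```python
from pyspark.sql.functions import broadcast

large = spark.read.format("delta").load("/mnt/lake/silver/transactions")
small = spark.read.format("delta").load("/mnt/lake/silver/country_codes")

# Hint Spark to broadcast the small dimension table to every executor,
# turning an expensive sort-merge join (full shuffle of both sides) into
# a broadcast-hash join.
joined = large.join(broadcast(small), on="country_code", how="inner")

# Equivalent SQL hint:
# SELECT /*+ BROADCAST(c) */ *
# FROM transactions t JOIN country_codes c ON t.country_code = c.country_code
```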
I found Spill, now what?
● Increase shuffle partitions (in this example, from 16 to 48): set spark.sql.shuffle.partitions=48
● Reduce the number of cores: spark.executor.cores < total cores per worker
● Use a larger cluster with faster disks (SSDs)
The more spill you can remove, the larger the impact!
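Applied as configuration, the settings above could look like this (48 is the slide's example value, not a universal default; executor cores are set at cluster launch time):

```python
# More shuffle partitions -> smaller partitions per task -> less spill to disk.
spark.conf.set("spark.sql.shuffle.partitions", 48)

# Fewer concurrent tasks per executor leaves more memory per task.
# This is a cluster-level setting, e.g. at submit time:
#   spark-submit --conf spark.executor.cores=4 ...
```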
I found Skew, now what?
Symptoms
● Ganglia CPU usage stays low for a long time after the initial high usage
● Task duration: significant difference between the max and the 75th/25th percentile values
● Input size / records
What to do?
● Use a broadcast join
● Use a skew join
● Filter out large keys / salt the keys and set up multiple reduce steps
● Explicitly repartition the data on a different field
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
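One of the options above, salting the join key, could look roughly like this in PySpark; the tables, the key user_id and the salt width of 16 are all hypothetical:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # hypothetical; tune to the observed skew

facts = spark.read.format("delta").load("/mnt/lake/silver/clicks")
dims = spark.read.format("delta").load("/mnt/lake/silver/users")

# Spread the hot keys on the large side across SALT_BUCKETS random buckets...
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per bucket so every salted key still matches.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)])))

joined = (salted_facts
          .join(salted_dims, on=["user_id", "salt"], how="inner")
          .drop("salt"))
```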
Adaptive Query Execution
▪ Reduces the manual effort of tuning spark.sql.shuffle.partitions
▪ Turned off by default; set spark.sql.adaptive.enabled=true
▪ Dynamically changes sort-merge joins into broadcast-hash joins
▪ Dynamically optimizes skew joins
*Available in DBR 7.x / Spark 3.0
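Enabling it is a one-line configuration change (shown here for Spark 3.0 / DBR 7.x, where AQE ships but is off by default):

```python
# Let Spark re-optimize the plan at runtime: coalesce shuffle partitions,
# convert sort-merge joins to broadcast joins, and split skewed partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```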
I found a lot of small files, now what?
Upstream
● Fix the upstream application that is building tons of small files
● Use a separate tool to compact them before processing with Spark
Changes in the Spark application
● Write your own compaction job
● Delta solves this problem!
Compaction
● Improves read performance
● Solves the small-files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
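Along the lines of the Delta best-practices reference above, compaction rewrites a table or partition into fewer, larger files without changing the data; a hedged sketch in which the path, the partition filter and the target of 16 files are hypothetical:

```python
path = "/mnt/lake/bronze/events"   # hypothetical Delta table path
num_files = 16                     # hypothetical target file count
partition = "event_date = '2020-06-01'"

(spark.read
      .format("delta")
      .load(path)
      .where(partition)                    # optionally limit compaction to one partition
      .repartition(num_files)
      .write
      .option("dataChange", "false")       # mark the commit as compaction-only
      .option("replaceWhere", partition)   # overwrite only that partition
      .format("delta")
      .mode("overwrite")
      .save(path))
```

On Databricks, the OPTIMIZE SQL command achieves the same goal for Delta tables without a hand-written job.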
Auto Optimize
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
● Optimizes Apache Spark partition sizes
● Maximizes the throughput of data being written
● Compacts small files within partitions
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
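On Databricks, Auto Optimize is enabled per table through table properties; a hedged sketch for a hypothetical table named events:

```python
# Optimized Writes right-sizes files during the write; Auto Compaction
# compacts small files after each write.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```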
Data Governance with Delta Lake
● Create a retention policy to age out and erase raw data that may contain personal information
● Satisfy compliance requests using UPDATE / DELETE commands
● Create tables that don't contain personal data: high-level aggregates (e.g. # of users that took an action)
● Maintain a historical data repository that is easy to navigate, with pseudonymization
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
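A hedged sketch of how those compliance operations look with the Delta Lake Python API; the paths, column names and the 30-day/168-hour windows are hypothetical, not prescribed retention periods:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

events = DeltaTable.forPath(spark, "/mnt/lake/bronze/events")

# Satisfy a right-to-be-forgotten request for a single user.
events.delete(F.col("user_id") == "user-123")

# Age out raw records older than the retention policy (30 days here).
events.delete("ingest_date < date_sub(current_date(), 30)")

# Physically remove files that no longer belong to the table, once the
# time-travel retention window (168 hours here) has passed.
spark.sql("VACUUM delta.`/mnt/lake/bronze/events` RETAIN 168 HOURS")
```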
Governance - The Who/What/Where
Many players in the org: managing access, roles and responsibilities, as well as managing usage, is a must.
● Data Engineer - performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices, including source control, unit tests, and automation
● Data Scientist - drives product innovation with state-of-the-art machine learning models applied to big data
● Data/Business Analyst - improves business processes by providing standardized and ad-hoc business analysis; acts as an intermediary between the Analytics and Business teams
● Automated Jobs - perform automated jobs based on Data Engineering configs
Bringing it together - A reference pipeline
● Ingest: BLOB storage, DB/DW, streaming sources
● ETL/Data Processing: massively scalable data cleansing & transformation through Bronze, Silver and Gold Delta pipelines
● Data Science & ML: ML Runtime
● Serving: APIs, Jobs, Models, Notebooks, Dashboards
● Execution: Databricks Runtime - reliability & performance, optimized Spark clusters, storage
● Consumers and surrounding concerns: Business Unit, Operations & Security, Orchestration, CI/CD
The same reference pipeline, annotated with the themes of the talk: Data Strategy, Cost Optimization & Performance Tuning, Business Value, and Security.