The document outlines a Spark SQL and DataFrames workshop led by Jules S. Damji, focusing on the features, benefits, and architecture of Apache Spark 2.3. It includes an agenda for the day covering topics such as DataFrames, Spark SQL labs, and developer certification, highlighting Spark's unified engine for diverse workloads and its performance advantages. Additionally, it discusses the evolution of Spark APIs and the importance of certification in validating skills for potential employers.
Get to know Databricks • Keep this URL open in a separate tab and use it for the labs: https://dbricks.co/spark-saturday-bayarea • Labs © Copyright Databricks; cannot be repurposed for commercial use!
Big Data Systems of Yesterday… MapReduce/Hadoop for general batch processing, plus specialized systems for new workloads (Storm, Pregel, Giraph, Dremel, Drill, Impala, Mahout, …); hard to combine in pipelines.
Apache Spark: The First Unified Analytics Engine. The Spark Core Engine and runtime (with Delta) uniquely combine Data & AI technologies: Big Data Processing (ETL + SQL + Streaming) and Machine Learning (MLlib + SparkR).
Databricks Unified Analytics Platform: the Databricks Workspace (notebooks, dashboards, APIs, jobs, models; end-to-end ML lifecycle) on top of the Databricks Runtime (Databricks Delta, ML frameworks) and the Databricks Cloud Service. Reliable & scalable, simple & integrated.
Common Spark Use Cases: ETL, Machine Learning, SQL Analytics, Streaming.
The Benefits of Apache Spark?
• SPEED: 100x faster than Hadoop for large-scale data processing
• EASE OF USE: simple APIs for operating on large data sets
• UNIFIED ENGINE: packaged with higher-level libraries (SQL, Streaming, ML, Graph)
Apache Spark at Massive Scale: 60 TB+ of compressed data, 250,000+ tasks in a single job, and a 4.5-6x CPU performance improvement over Hive. https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
Native Spark App in K8s
• New Spark scheduler backend
• The driver runs in a Kubernetes pod created by the submission client, and it creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377] [SPARK-18278]
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging.
Spark on Kubernetes
Supported (2.3):
• Supports Kubernetes 1.6 and up
• Supports cluster mode only
• Static resource allocation only
• Supports Java and Scala applications
• Can use container-local and remote dependencies that are downloadable
In roadmap (2.4):
• Client mode
• Dynamic resource allocation + external shuffle service
• Python and R support
• Submission client local dependencies + Resource Staging Server (RSS)
• Non-secured and Kerberized HDFS access (injection of Hadoop configuration)
Unified API Foundation for the Future: SparkSession, DataFrame, Dataset, MLlib, Structured Streaming
Major Themes in Apache Spark 2.x
• Faster: Tungsten Phase 2 speedups of 5-10x & the Catalyst Optimizer
• Smarter: Structured Streaming, a real-time engine on SQL / DataFrames
• Easier: unifying Datasets and DataFrames & SparkSessions
SparkSession: a unified entry point to Spark
• Conduit to Spark:
  – Creates Datasets/DataFrames
  – Reads/writes data
  – Works with metadata
  – Sets/gets Spark configuration
  – Used by the driver for cluster resource management
(a minimal example is sketched below)
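A minimal sketch of that entry point in Scala; the app name, config value, and JSON path are illustrative, not taken from the workshop labs:

import org.apache.spark.sql.SparkSession

// Build (or reuse) the session: the single conduit for configuration, data, and metadata.
val spark = SparkSession.builder()
  .appName("spark-saturday-demo")                   // hypothetical app name
  .config("spark.sql.shuffle.partitions", "8")      // set a Spark configuration value
  .getOrCreate()

// Create a DataFrame, register it in the catalog (metadata), and query it.
val ids = spark.range(1, 5).toDF("id")
ids.createOrReplaceTempView("ids")
spark.sql("SELECT count(*) AS n FROM ids").show()

// Reads and writes also go through the session, e.g. spark.read.json("people.json").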
Long Term
• RDD remains the low-level API in Spark, for control and a degree of type-safety in Java/Scala
• Datasets & DataFrames give richer semantics & optimizations, for semi-structured data and DSL-like operations
• New libraries will increasingly use these as the interchange format; examples: Structured Streaming, MLlib, GraphFrames, and Deep Learning Pipelines
The not-so-secret truth… Spark SQL is not about SQL; Spark SQL is about more than SQL.
Spark SQL: the whole story. It is about creating and running Spark programs faster:
• Write less code
• Read less data
• Do less work: the optimizer does the hard work
Using Catalyst in Spark SQL: a SQL AST, DataFrame, or Dataset becomes an Unresolved Logical Plan; Analysis (using the Catalog) resolves it into a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates candidate Physical Plans, a Cost Model selects one, and Code Generation turns it into RDDs.
• Analysis: analyzing a logical plan to resolve references
• Logical Optimization: optimizing the logical plan
• Physical Planning: generating and selecting a physical plan
• Code Generation: compiling parts of the query to Java bytecode
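A minimal sketch of how to inspect those phases, assuming a SparkSession named spark; explain(true) prints the parsed, analyzed, and optimized logical plans plus the selected physical plan for a query over toy data:

import org.apache.spark.sql.functions.sum

val edits = spark.range(0, 1000).selectExpr("id % 50 AS page", "id % 7 AS numRequests")
edits.groupBy("page")
  .agg(sum("numRequests").as("count"))
  .explain(true)   // prints == Parsed/Analyzed/Optimized Logical Plan == and == Physical Plan ==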
Catalyst Optimizations
Logical optimizations:
• Push filter predicates down to the data source, so irrelevant data can be skipped
• Parquet: skip entire blocks, turn comparisons into cheaper integer comparisons via dictionary encoding
• RDBMS: reduce the amount of data traffic by pushing down predicates
Physical optimizations:
• Catalyst compiles operations into a physical plan for execution and generates JVM bytecode
• Intelligently chooses between broadcast joins and shuffle joins to reduce network traffic (sketched below)
• Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
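A minimal sketch of the broadcast-versus-shuffle choice, assuming a SparkSession named spark; the Parquet paths and join key are illustrative:

import org.apache.spark.sql.functions.broadcast

val people = spark.read.parquet("/data/people")   // assumption: large fact table
val titles = spark.read.parquet("/data/titles")   // assumption: small lookup table

// The broadcast hint ships the small table to every executor,
// so the large table never has to be shuffled across the network.
val joined = people.join(broadcast(titles), Seq("title"))
joined.explain()   // physical plan shows BroadcastHashJoin rather than SortMergeJoin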
Columns: Predicate Pushdown
You write:
  SELECT firstName, lastName, SSN, COO, title FROM people WHERE firstName = 'jules' AND COO = 'tz';
Spark will push the predicate down to Postgres or Parquet, which then evaluates:
  SELECT <columns> FROM people WHERE <condition>
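A minimal sketch of the same pushdown through the DataFrame API against Parquet, assuming a SparkSession named spark and an illustrative path; the pushed predicates appear in the physical plan:

import spark.implicits._

val people = spark.read.parquet("/data/people")   // assumption: path and schema
people.filter($"firstName" === "jules" && $"COO" === "tz")
  .select("firstName", "lastName", "SSN", "COO", "title")
  .explain()   // look for PushedFilters: [EqualTo(firstName,jules), EqualTo(COO,tz)]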
Background: What is in an RDD?
• Dependencies
• Partitions (with optional locality info)
• Compute function: Partition => Iterator[T]
Opaque computation & opaque data (see the sketch below)
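A minimal sketch of why that is opaque, assuming a SparkContext named sc and an illustrative whitespace-delimited input file; Spark only sees a closure and generic records, so it can optimize neither:

val lines = sc.textFile("/data/pagecounts")   // assumption: input path
val parsed = lines.map { line =>
  // To Spark this compute function is a black-box Partition => Iterator[T]:
  // it cannot see which columns are used or what the predicate logic is.
  val fields = line.split(" ")
  (fields(0), fields(1), fields(2).toInt)
}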
Structured APIs in Spark
• SQL: syntax errors at runtime, analysis errors at runtime
• DataFrames: syntax errors at compile time, analysis errors at runtime
• Datasets: syntax errors at compile time, analysis errors at compile time
Analysis errors are reported before a distributed job starts.
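A minimal sketch of that difference, assuming a SparkSession named spark and a people.json file like the one used later in this deck:

import spark.implicits._

case class Person(name: String, age: Int)
val people = spark.read.json("people.json").as[Person]   // typed Dataset[Person]

people.toDF().select("agee")         // misspelled column: AnalysisException before any job runs
// people.filter(p => p.agee > 30)   // would not even compile: 'agee' is not a member of Person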
DataFrame API code.

// Convert RDD -> DataFrame with column names
val df = parsedRDD.toDF("project", "page", "numRequests")

// filter, groupBy, sum, and then agg()
df.filter($"project" === "en")
  .groupBy($"page")
  .agg(sum($"numRequests").as("count"))
  .limit(100)
  .show(100)

Sample input (project, page, numRequests): en, 23, 45 and en, 24, 200
Take a DataFrame → SQL table → query

df.createOrReplaceTempView("edits")
val results = spark.sql("""SELECT page, sum(numRequests) AS count
  FROM edits
  WHERE project = 'en'
  GROUP BY page
  LIMIT 100""")
results.show(100)
Easy to write code... Believe it!

from pyspark.sql.functions import avg

dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
dataDF = dataRDD.toDF(["name", "age"])

# Using RDD code to compute the aggregate average
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# Using the DataFrame API
dataDF.groupBy("name").agg(avg("age"))
Why structured APIs?

SQL:
  SELECT dept, avg(age) FROM data GROUP BY 1

DataFrame:
  data.groupBy("dept").avg("age")

RDD:
  data.map { case (dept, age) => dept -> (age, 1) }
    .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
    .map { case (dept, (age, c)) => dept -> age / c }
Dataset API in Spark 2.x: type-safe, operate on domain objects with compiled lambda functions

val df = spark.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 30)
Why: Build Your Skills - Certification
● The industry standard for Apache Spark certification, from the original creators at Databricks
  ○ Validate your overall knowledge of Apache Spark
  ○ Assure clients that you are up to date with the fast-moving Apache Spark project and the features in new releases
What: Build Your Skills - Certification
● Databricks Certification Exam
  ○ The test is approximately 3 hours and is proctored either online or at a test center
  ○ Series of randomly generated multiple-choice questions
  ○ Test fee is $300
  ○ Two editions: Scala & Python
  ○ Can be taken twice
How To Prepare for Certification
• Knowledge of Apache Spark basics
  • Structured Streaming, Spark architecture, MLlib, performance & debugging, Spark SQL, GraphFrames, programming languages (offered in Python or Scala)
• Experience developing Spark apps in production
• Courses and resources:
  • Databricks Apache Spark Programming 105 & 110
  • Getting Started with Apache Spark SQL
  • 7 Steps for a Developer to Learn Apache Spark
  • Spark: The Definitive Guide
Where To Sign Up for Certification
REGISTER: Databricks Certified Developer: Apache Spark 2.X
LOGISTICS: How to Take the Exam