Spark Saturday: Spark SQL & DataFrames Workshop w/ Apache Spark 2.3. Jules S. Damji, Apache Spark Developer & Community Advocate. Spark Saturday, Santa Clara, August 4th, 2018
SSID: Password:
I have used Apache Spark Before…
I have used SQL or Spark SQL Before…
I know the difference between DataFrames and RDDs…
Spark Community & Developer Advocate @ Databricks; Developer Advocate @ Hortonworks; Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest. https://www.linkedin.com/in/dmatrix @2twitme, jules@databricks.com
Agenda for the day. Morning: • Get to know Databricks • Overview of Spark Fundamentals & Architecture • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Break • Spark SQL Labs • Lunch. Afternoon: • Introduction to DataFrames & Datasets • DataFrames Labs • Break • Developer Certification
Know Thy Neighbor! :)
Get to know Databricks • Keep this URL open in a separate tab and use it for the labs: https://dbricks.co/spark-saturday-bayarea • Labs © Databricks. Cannot be repurposed for commercial use!
Why Apache Spark?
Big Data Systems of Yesterday… MapReduce/Hadoop for general batch processing, plus specialized systems for new workloads: Drill, Storm, Pregel, Giraph, Dremel, Mahout, Impala, . . . Hard to combine in pipelines.
Big Data Systems Today? MapReduce for general batch processing and specialized systems for new workloads (Pregel, Dremel, Millwheel, Drill, Giraph, Impala, Storm, S4, . . .), or a unified engine?
Faster, Easier to Use, Unified: from the first distributed processing engine, to specialized data processing engines, to a unified data processing engine.
Apache Spark Philosophy: a unified engine for complete data applications, with high-level user-friendly APIs (SQL, Streaming, ML, Graph, DL, …) for applications.
An Analogy …. New applications
Unified engine across diverse workloads & environments
Apache Spark: The First Unified Analytics Engine. Runtime: Delta, Spark Core Engine. Big Data Processing: ETL + SQL + Streaming. Machine Learning: MLlib + SparkR. Uniquely combines Data & AI technologies.
Databricks Unified Analytics Platform: DATABRICKS WORKSPACE (Notebooks, Dashboards, APIs, Jobs, Models), DATABRICKS RUNTIME (Databricks Delta, ML Frameworks), DATABRICKS CLOUD SERVICE. Reliable & Scalable, Simple & Integrated, end-to-end ML lifecycle.
Where Is Apache Spark Used?
Common Spark Use Cases: ETL, MACHINE LEARNING, SQL ANALYTICS, STREAMING
The Benefits of Apache Spark? SPEED: up to 100x faster than Hadoop MapReduce for large-scale data processing. EASE OF USE: simple APIs for operating on large data sets. UNIFIED ENGINE: packaged with higher-level libraries (SQL, Streaming, ML, Graph).
Spark in the Enterprise
Apache Spark at Massive Scale: 60TB+ of compressed data, 250,000+ tasks in a single job, 4.5-6x CPU performance improvement over Hive. https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
Apache Spark Architecture
Apache Spark Architecture: Deployment Modes • Local • Standalone • YARN • Mesos
Local Mode in Databricks: each student notebook attaches to its own container on an EC2 machine, where the driver and executor run together in a single JVM.
Standalone Mode: the driver and executors run in separate JVMs inside containers (e.g. 22 GB JVMs within 30 GB containers), each executor exposing multiple slots in which tasks run.
Spark Deployment Modes: as of Spark 2.3, Kubernetes as well.
Native Spark App in K8S • A new Spark scheduler backend • The driver runs in a Kubernetes pod created by the submission client, and creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377] [SPARK-18278] • Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging.
Spark on Kubernetes. Supported: • Kubernetes 1.6 and up • Cluster mode only • Static resource allocation only • Java and Scala applications • Container-local and remote dependencies that are downloadable. In the roadmap (2.4): • Client mode • Dynamic resource allocation + external shuffle service • Python and R support • Submission-client local dependencies + Resource Staging Server (RSS) • Non-secured and Kerberized HDFS access (injection of Hadoop configuration)
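Not from the deck, but for reference: a minimal spark-submit sketch of cluster-mode submission to Kubernetes, modeled on the Spark 2.3 documentation. The API-server address and container image name are placeholders you would replace.

# Submit the bundled SparkPi example to a Kubernetes cluster (Spark 2.3 supports cluster mode only).
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar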
Apache Spark Application Anatomy
Apache Spark Architecture: An Anatomy of an Application. A Spark application consists of • Jobs • Stages • Tasks
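To make the jobs/stages/tasks breakdown concrete, a small Scala sketch (the CSV path and column name are invented; an existing SparkSession named spark is assumed, as in spark-shell or a Databricks notebook): each action triggers a job, the shuffle introduced by groupBy splits that job into stages, and each stage runs one task per partition.

// Invented input; assumes an existing SparkSession `spark`.
val df = spark.read.option("header", "true").csv("/tmp/pageviews.csv")

val byProject = df.groupBy("project").count()   // lazy: no aggregation job has run yet

byProject.show()    // action -> a job; the shuffle from groupBy splits it into stages,
                    // and each stage runs one task per partition
byProject.count()   // another action -> another job over the same lineage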
A Spark Executor: a JVM running inside a container, with multiple slots (S) in which tasks (T) execute over DataFrame/RDD partitions.
Resilient Distributed Dataset (RDD)
What are RDDs?
A Resilient Distributed Dataset (RDD). 1. Distributed Data Abstraction: a logical model across distributed storage (S3, Blob, or HDFS).
2. Resilient & Immutable: RDD → T → RDD → T → RDD, where T = Transformation; each transformation produces a new immutable RDD.
3. Compile-time Type-safe Integer RDD String or Text RDD Double or Binary RDD
4. Unstructured/Structured Data: Text (logs, tweets, articles, social)
5. Lazy: RDD → T → RDD → T → RDD → A, where T = Transformation and A = Action; transformations only build up a lineage, and nothing executes until an action is invoked.
2 kinds of Actions: those that return results to the driver (collect, count, reduce, take, show, …) and those that write to external storage (saveAsTextFile to HDFS, S3, SQL, NoSQL, etc.).
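A short Scala sketch of the laziness and the two kinds of actions described above (paths are placeholders; sc is assumed to be an existing SparkContext, i.e. spark.sparkContext): transformations only record lineage, and nothing runs until an action either returns a result to the driver or writes to storage.

// Assumes `sc` is an existing SparkContext; paths are placeholders.
val lines   = sc.textFile("/tmp/access.log")        // transformation: lazy, nothing read yet
val errors  = lines.filter(_.contains("ERROR"))     // transformation: lazy
val shouted = errors.map(_.toUpperCase)             // transformation: lazy

// Kind 1: actions that return results to the driver
val numErrors = errors.count()
val sample    = errors.take(5)

// Kind 2: actions that write to external storage
shouted.saveAsTextFile("/tmp/errors-upper")         // e.g. HDFS or S3 in a real job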
Unified API Foundation for the Future: SparkSessions, DataFrame, Dataset, MLlib, Structured Streaming
Major Themes in Apache Spark 2.x. Faster: Tungsten Phase 2 (speedups of 5-10x) & the Catalyst Optimizer. Smarter: Structured Streaming, a real-time engine on SQL / DataFrames. Easier: unifying Datasets and DataFrames & SparkSessions.
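As a taste of the Structured Streaming theme, a minimal Scala sketch using the built-in rate test source (this example is not from the deck; it assumes an existing SparkSession named spark):

import org.apache.spark.sql.functions.window
import spark.implicits._              // for the $"..." column syntax

val stream = spark.readStream
  .format("rate")                     // built-in test source: (timestamp, value) rows
  .option("rowsPerSecond", "5")
  .load()

val counts = stream
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

val query = counts.writeStream
  .outputMode("complete")             // emit the full updated counts on each trigger
  .format("console")
  .start()
// query.awaitTermination()           // block in a standalone application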
SparkSession: a unified entry point to Spark • Conduit to Spark – creates Datasets/DataFrames – reads/writes data – works with metadata – sets/gets Spark configuration – the driver uses it for cluster resource management
SparkSession vs SparkContext: SparkSession subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
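A minimal Scala sketch of that single entry point (the app name, config value, and JSON path are arbitrary; in Databricks or spark-shell a SparkSession called spark already exists, so the builder is only needed in a standalone application):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-saturday-demo")                         // arbitrary name
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

val people = spark.read.json("/tmp/people.json")          // creates a DataFrame (placeholder path)
spark.catalog.listTables().show()                         // works with metadata
println(spark.conf.get("spark.sql.shuffle.partitions"))   // gets configuration
val sc = spark.sparkContext                               // the older entry point is still reachable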
DataFrame & Dataset Structure
Long Term • RDD remains the low-level API in Spark, for control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics & optimizations, for semi-structured data and DSL-like operations • New libraries will increasingly use these as the interchange format • Examples: Structured Streaming, MLlib, GraphFrames, and Deep Learning Pipelines
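A small Scala sketch of moving between the two levels (the data and column names are invented; an existing SparkSession named spark is assumed): drop down to the RDD API for fine-grained control, then come back to a DataFrame so Catalyst can optimize.

import spark.implicits._

val df = Seq(("en", "Spark", 45L), ("en", "Scala", 10L))
  .toDF("project", "page", "numRequests")

val rdd = df.rdd                                   // DataFrame -> RDD[Row] for low-level control
val en  = rdd.filter(_.getString(0) == "en")

val back = en
  .map(r => (r.getString(0), r.getString(1), r.getLong(2)))
  .toDF("project", "page", "numRequests")          // back to DataFrames so Catalyst can optimize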
Spark 1.6 vs Spark 2.x
Spark 1.6 vs Spark 2.x
DataFrames/Dataset, Spark SQL & Catalyst Optimizer
The not-so-secret truth… Spark SQL is not about SQL; it is about more than SQL.
Spark SQL: the whole story. It is about creating and running Spark programs faster: • Write less code • Read less data • Do less work: the optimizer does the hard work
Spark SQL Architecture: SQL, DataFrames, and Datasets all feed the same pipeline of Logical Plan, Optimizer (with Catalog), Physical Plan, and Code Generator, down to RDDs, on top of the Data Source API.
Using Catalyst in Spark SQL: a SQL AST, DataFrame, or Dataset becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning generates Physical Plans and a Cost Model selects one; Code Generation turns the Selected Physical Plan into RDDs. Analysis: analyzing a logical plan to resolve references. Logical Optimization: optimizing the logical plan. Physical Planning: generating and choosing a physical plan. Code Generation: compiling parts of the query to Java bytecode.
Catalyst Optimizations. Logical optimizations: push filter predicates down to the data source so irrelevant data can be skipped; for Parquet, skip entire blocks and turn comparisons into cheaper integer comparisons via dictionary encoding; for an RDBMS, reduce data traffic by pushing predicates down. Physical optimizations: Catalyst compiles operations into a physical plan for execution and generates JVM bytecode; it intelligently chooses between broadcast joins and shuffle joins to reduce network traffic; lower-level optimizations eliminate expensive object allocations and reduce virtual function calls.
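One way to see these optimizations is to ask Catalyst for its plans. A hedged Scala sketch (the Parquet path and column names are placeholders): with a Parquet source, the physical plan printed by explain typically lists the filter under PushedFilters and only the selected columns in the read schema.

import spark.implicits._

val events = spark.read.parquet("/tmp/events.parquet")   // placeholder path

events
  .filter($"date" > "2015-01-01")    // candidate for predicate pushdown
  .select("uid", "date")             // candidate for column pruning
  .explain(true)                     // prints the logical, optimized, and physical plans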
DataFrame Optimization: users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01")
Logical Plan: a filter on top of a join of the users table and the events file. Physical Plan: join of scan(users) with a filter over scan(events). Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan(users) with optimized scan(events).
Columns: Predicate pushdown. You write:
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "people")
  .load()
  .where($"name" === "michael")
Spark translates, for Postgres: SELECT * FROM people WHERE name = 'michael'
Columns: Predicate pushdown. You write: SELECT firstName, LastName, SSN, COO, title FROM people WHERE firstName = 'jules' AND COO = 'tz'; Spark will push it down to Postgres or Parquet as: SELECT <columns> FROM people WHERE <condition>
Foundational Spark 2.x Components: Spark Core (RDD) at the base; Catalyst & Tungsten; Spark SQL with DataFrame/Dataset; and on top, ML Pipelines, Structured Streaming, GraphFrames, DL Pipelines, TensorFrames. Data sources include { JSON }, JDBC, and more…
Spark SQL Lab (Pair Up!)
DataFrames & Datasets Spark 2.x APIs
Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T]. Both the computation and the data are opaque to Spark.
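Those three pieces can be inspected directly. A small Scala sketch with invented data, assuming sc is an existing SparkContext:

val nums    = sc.parallelize(1 to 1000, 8)     // 8 partitions
val doubled = nums.map(_ * 2)                  // the (opaque) compute function: Int => Int

println(doubled.getNumPartitions)              // partitions: 8
println(doubled.dependencies)                  // a narrow OneToOneDependency on `nums`
println(doubled.toDebugString)                 // the lineage Spark uses to recompute lost partitions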
Structured APIs in Spark: where errors are detected. SQL: syntax errors at runtime, analysis errors at runtime. DataFrames: syntax errors at compile time, analysis errors at runtime. Datasets: syntax errors at compile time, analysis errors at compile time. Analysis errors are reported before a distributed job starts.
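To illustrate the Dataset column of that comparison, a hedged Scala sketch (the Person case class mirrors the Dataset example later in the deck): a typo inside a lambda over a Dataset is caught by the Scala compiler, while the equivalent typo in a DataFrame expression only surfaces when the query is analyzed at runtime.

case class Person(name: String, age: Int)

import spark.implicits._
val ds = Seq(Person("Jim", 20), Person("Anne", 31)).toDS()

ds.filter(p => p.age > 30)                     // checked by the Scala compiler
// ds.filter(p => p.salary > 30)               // would not compile: Person has no `salary` field

val df = ds.toDF()
// df.filter($"salary" > 30)                   // compiles, but fails at analysis time when run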
Unification of APIs in Spark 2.0
DataFrame API code:
// convert RDD -> DF with column names
val df = parsedRDD.toDF("project", "page", "numRequests")
// filter, groupBy, sum, and then agg()
df.filter($"project" === "en").
  groupBy($"page").
  agg(sum($"numRequests").as("count")).
  limit(100).
  show(100)
Sample data (project, page, numRequests): (en, 23, 45), (en, 24, 200)
Take DataFrame → SQL Table → Query:
df.createOrReplaceTempView("edits")
val results = spark.sql("""SELECT page, sum(numRequests) AS count FROM edits WHERE project = 'en' GROUP BY page LIMIT 100""")
results.show(100)
Sample data (project, page, numRequests): (en, 23, 45), (en, 24, 200)
Easy to write code... Believe it!
from pyspark.sql.functions import avg
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
dataDF = dataRDD.toDF(["name", "age"])

# Using RDD code to compute the aggregate average
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))

# Using the DataFrame API
dataDF.groupBy("name").agg(avg("age"))

Sample data (name, age): (Jim, 20), (Anne, 31), (Jim, 30)
Why structured APIs? The same aggregation, three ways.
RDD:
data.map { case (dept, age) => dept -> (age, 1) }
    .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
    .map { case (dept, (age, c)) => dept -> age / c }
SQL:
select dept, avg(age) from data group by 1
DataFrame:
data.groupBy("dept").avg("age")
Dataset API in Spark 2.x. Type-safe: operate on domain objects with compiled lambda functions.
val df = spark.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 30)
Datasets: Lightning-fast Serialization with Encoders
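The deck does not show Encoder code, so here is a rough sketch (the Person case class is illustrative): an Encoder maps domain objects to and from Tungsten's compact binary format, and spark.implicits._ supplies encoders implicitly for case classes and primitives.

import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._                        // implicit encoders for case classes & primitives

case class Person(name: String, age: Int)

val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema)                   // the columnar layout Tungsten stores: name string, age int

val ds = spark.createDataset(Seq(Person("Jim", 20), Person("Anne", 31)))  // implicit encoder picked up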
DataFrames are Faster than RDDs
Datasets Use Less Memory than RDDs
Why & When to Use DataFrames & Datasets • Structured data with a schema • Code optimization & performance • Space efficiency with Tungsten • High-level APIs and DSL • Strong type-safety • Ease-of-use & readability • Express what-to-do
Source: michaelmalak
BLOG: http://dbricks.co/3-apis Spark Summit Talk: http://dbricks.co/summit-3aps
DataFrame Lab (Pair Up!)
Databricks Developer Certification for Apache Spark 2.x
Why: Build Your Skills - Certification ● The industry standard for Apache Spark certification, from the original creators at Databricks ○ Validate your overall knowledge of Apache Spark ○ Assure clients that you are up to date with the fast-moving Apache Spark project and the features in new releases
What: Build Your Skills - Certification ● Databricks Certification Exam ○ The test is approximately 3 hours and is proctored either online or at a test center ○ A series of randomly generated multiple-choice questions ○ Test fee is $300 ○ Two editions: Scala & Python ○ Can take it twice
How To Prepare for Certification • Knowledge of Apache Spark basics: Structured Streaming, Spark architecture, MLlib, performance & debugging, Spark SQL, GraphFrames, programming languages (offered in Python or Scala) • Experience developing Spark apps in production • Courses: Databricks Apache Spark Programming 105 & 110; Getting Started with Apache Spark SQL • 7 Steps for a Developer to Learn Apache Spark • Spark: The Definitive Guide
Where To Sign Up for Certification. REGISTER: Databricks Certified Developer: Apache Spark 2.X. LOGISTICS: How to Take the Exam
https://dbricks.co/developer-cert
Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • spark-packages.org
http://dbricks.co/spark-guide
https://databricks.com/company/careers
Do you have any questions for my prepared answers?
