Jump Start on Apache® Spark™ 2.x with Databricks Jules S. Damji Apache Spark Community Evangelist Spark Saturday Meetup Workshop
I have used Apache Spark Before…
I know the difference between DataFrame and RDDs…
Spark Community Evangelist & Developer Advocate @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest https://www.linkedin.com/in/dmatrix @2twitme
Agenda for the day
Morning: • Get to know Databricks • Overview of Spark Fundamentals & Architecture • What's New in Spark 2.x • Break • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook 1 • Lunch
Afternoon: • Introduction to DataFrames, Datasets and Spark SQL • Workshop Notebook 2 • Break • Introduction to Structured Streaming Concepts • Workshop Notebook 3 • Go Home…
Get to know Databricks 1. Get http://databricks.com/try-databricks 2. https://github.com/dmatrix/spark-saturday 3. [OR] Import Notebook: http://dbricks.co/ss_wkshp0
Why Apache Spark?
Big Data Systems of Yesterday… MapReduce/Hadoop: general batch processing. Specialized systems for new workloads: Storm, Impala, Drill, Pregel, Giraph, Dremel, Mahout, … Hard to combine in pipelines.
An Analogy …. New applications
Apache Spark Philosophy: a unified engine for complete data applications, with high-level user-friendly APIs: SQL, Streaming, ML, Graph, …
Unified engine across diverse workloads & environments
About Databricks TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009 PRODUCT: Unified Analytics Platform MISSION: Making Big Data Simple
Accelerate innovation by unifying data science, engineering and business. Unified Analytics Platform UNIFIED INFRASTRUCTURE UNIFIED EXPERIENCE ACROSS TEAMS UNIFIED ANALYTIC WORKFLOWS
The Unified Analytics Platform
Apache Spark Architecture
Apache Spark Architecture Deployment Modes • Local • Standalone • YARN • Mesos
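A minimal sketch (hypothetical application name, using the SparkSession API introduced later in this deck) of how the deployment mode is commonly selected through the master URL; in practice the master is usually supplied via spark-submit --master rather than hard-coded:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("deployment-mode-demo")      // hypothetical application name
  .master("local[*]")                   // Local mode: driver and executors in one JVM
  // .master("spark://host:7077")       // Standalone cluster manager
  // .master("yarn")                    // YARN
  // .master("mesos://host:5050")       // Mesos
  .getOrCreate()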
Local Mode in Databricks: each student notebook attaches to its own container on an EC2 machine, with the driver and executor running together in a single JVM.
Standalone Mode: the driver and executors run in separate JVMs (e.g. 22 GB JVMs inside 30 GB containers), and each executor provides multiple slots for running tasks.
Spark Deployment Modes
Apache Spark Architecture An Anatomy of an Application Spark Application • Jobs • Stages • Tasks
A Spark Executor: a JVM inside a container, with multiple slots; each slot runs tasks over partitions of a DataFrame/RDD.
Resilient Distributed Dataset (RDD)
What are RDDs? • … Distributed data abstraction • … Resilient & Immutable • … Lazy • … Compile-time type-safe • … Semi-structured or unstructured
A Resilient Distributed Dataset (RDD)
2 kinds of Actions: those that return a result to the driver (collect, count, reduce, take, show, …) and those that save data to external storage (saveAsTextFile to HDFS, S3, SQL, NoSQL, etc.)
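A minimal sketch contrasting the two kinds of actions, assuming an existing SparkSession named spark and a hypothetical output path:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Actions that return a result to the driver:
val n    = rdd.count()            // 5
val sum  = rdd.reduce(_ + _)      // 15
val head = rdd.take(2)            // Array(1, 2)

// Actions that write to external storage (HDFS, S3, ...):
rdd.saveAsTextFile("/tmp/numbers")   // hypothetical path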
How did we get here…? Where are we going…?
A Brief History 2010: Started @ UC Berkeley 2013: Databricks started & donated to ASF 2014: Spark 1.0 & libraries (SQL, ML, GraphX) 2015: DataFrames/Datasets, Tungsten, Catalyst Optimizer, ML Pipelines 2016-17: Apache Spark 2.0, 2.1, 2.2; Structured Streaming; Cost Based Optimizer; Deep Learning Pipelines. Easier, Smarter, Faster
Apache Spark 2.x • Steps to Bigger & Better Things… Builds on all we learned in the past 2 years
Chart: Spark version usage in Databricks at 06/2016, 12/2016 and 06/2017, showing adoption shifting from 1.5 and 1.6 toward 2.0 and 2.1.
Major Themes in Apache Spark 2.x Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer Smarter: Structured Streaming, a real-time engine on SQL / DataFrames Easier: Unifying Datasets and DataFrames, & SparkSessions
Unified API Foundation for the Future: SparkSessions, DataFrame, Dataset, MLlib, Structured Streaming
SparkSession – A Unified entry point to Spark • Conduit to Spark – Creates Datasets/DataFrames – Reads/writes data – Works with metadata – Sets/gets Spark Configuration – Driver uses for Cluster resource management
SparkSession vs SparkContext SparkSession subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
SparkSession – A Unified entry point to Spark
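A minimal sketch (hypothetical application name, reusing the people.json file from the Dataset example later in the deck) showing the bullets above in code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unified-entry-point")                      // hypothetical app name
  .getOrCreate()

val range  = spark.range(0, 10).toDF("id")             // creates Datasets/DataFrames
val people = spark.read.json("people.json")            // reads data
spark.catalog.listTables().show()                      // works with metadata
spark.conf.set("spark.sql.shuffle.partitions", "8")    // sets Spark configuration
val parts  = spark.conf.get("spark.sql.shuffle.partitions")  // gets it back
val sc     = spark.sparkContext                        // underlying SparkContext, still available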
DataFrame & Dataset Structure
Long Term • RDD as the low-level API in Spark • For control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics & optimizations • For semi-structured data and DSL like operations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming, MLlib, GraphFrames, and Deep Learning Pipelines
Spark 1.6 vs Spark 2.x
Spark 1.6 vs Spark 2.x
Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard compliant parser (with good error messages!) - Subqueries (correlated & uncorrelated) - Approximate aggregate stats - https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
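For illustration, a small sketch of the two subquery flavors against hypothetical registered tables sales and customers:

// Uncorrelated scalar subquery
spark.sql("""
  SELECT customer_id, amount
  FROM sales
  WHERE amount > (SELECT avg(amount) FROM sales)
""")

// Correlated subquery
spark.sql("""
  SELECT c.name
  FROM customers c
  WHERE EXISTS (SELECT 1 FROM sales s WHERE s.customer_id = c.id)
""")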
Chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs 1.6 (lower is better).
Other notable API improvements • DataFrame-based ML pipeline API becoming the main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • Improved R support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means • Structured Streaming Features & Production Readiness • https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
Workshop: Notebook on SparkSession • Import Notebook into your Spark 2.2 Cluster – http://dbricks.co/ss_wkshp1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession • Familiarize yourself with the Databricks Notebook environment • Work through each cell • Ctrl + <return> / Shift + Return • Try challenges • Break…
DataFrames/Dataset, Spark SQL & Catalyst Optimizer
The not so secret truth… Spark SQL is not about SQL; it is about more than SQL.
Spark SQL: The whole story It is about creating and running Spark programs faster: • Write less code • Read less data • Do less work: the optimizer does the hard work
Spark SQL Architecture: SQL, DataFrames and Datasets feed into a Logical Plan; the Optimizer (using the Catalog) produces a Physical Plan; the Code Generator compiles it down to RDDs, reading data through the Data Source API.
Using Catalyst in Spark SQL: SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs. Analysis: analyzing a logical plan to resolve references. Logical Optimization: logical plan optimization. Physical Planning: physical planning. Code Generation: compile parts of the query to Java bytecode.
Catalyst Optimizations Catalyst compiles operations into a physical plan for execution and generates JVM bytecode. LOGICAL OPTIMIZATIONS • Push filter predicates down to the data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons into cheaper integer comparisons via dictionary encoding • RDBMS: reduce the amount of data traffic by pushing down predicates PHYSICAL OPTIMIZATIONS • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
DataFrame Optimization users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01") Logical Plan: join the users table with the events file, then filter. Physical Plan with Predicate Pushdown and Column Pruning: the filter is pushed into an optimized scan of events, which is then joined with an optimized scan of users.
Columns: Predicate pushdown You write: spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "people") .load() .where($"name" === "michael") Spark translates, for Postgres: SELECT * FROM people WHERE name = 'michael'
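One way to see what is actually pushed down is to inspect the physical plan; a sketch against the same (hypothetical) Postgres table, where the pushed predicate should appear as a PushedFilters entry on the JDBC relation:

import spark.implicits._

val peopleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")   // hypothetical connection string
  .option("dbtable", "people")
  .load()

// The filter should show up inside the JDBC scan,
// not as a separate Spark filter over all rows.
peopleDF.where($"name" === "michael").explain()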
Foundational Spark 2.x Components: Spark Core (RDD) at the base; Catalyst and Spark SQL with the DataFrame/Dataset API on top; libraries built on them: ML Pipelines, Structured Streaming, GraphFrames; data sources: { JSON }, JDBC, and more…
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Datasets Spark 2.x APIs
Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T] Opaque computation & opaque data
Structured APIs in Spark: when errors are detected SQL: syntax errors at runtime, analysis errors at runtime DataFrames: syntax errors at compile time, analysis errors at runtime Datasets: syntax errors at compile time, analysis errors at compile time Analysis errors are reported before a distributed job starts
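A small sketch of where each API catches a mistake, reusing the Person example from the Dataset API slide below (assumes import spark.implicits._):

case class Person(name: String, age: Int)
val df = spark.read.json("people.json")
val ds = df.as[Person]

// DataFrame: a misspelled column compiles fine and only fails at runtime
// with an AnalysisException when the plan is analyzed:
// df.select("agee")

// Dataset: the same mistake in a typed lambda is rejected by the Scala compiler:
// ds.filter(p => p.agee > 30)      // does not compile

ds.filter(p => p.age > 30)          // checked at compile time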
Unification of APIs in Spark 2.0
Dataset API in Spark 2.x
Type-safe: operate on domain objects with compiled lambda functions
val df = spark.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 30)
Datasets: Lightning-fast Serialization with Encoders
DataFrames are Faster than RDDs
Datasets use less memory than RDDs
Why & When to use DataFrames & Datasets • Structured data with a schema • Code optimization & performance • Space efficiency with Tungsten • High-level APIs and DSL • Strong type-safety • Ease-of-use & readability • Say what-to-do, not how-to-do-it (a short comparison follows below)
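A small sketch with hypothetical data of the "what-to-do" point: with RDDs you spell out how to compute an average per key; with DataFrames you declare what you want and Catalyst plans the how:

import spark.implicits._
import org.apache.spark.sql.functions.avg

val personRDD = spark.sparkContext.parallelize(Seq(("alice", 29), ("bob", 31), ("alice", 35)))

// RDD: the "how"
val avgByNameRDD = personRDD
  .mapValues(age => (age, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// DataFrame: the "what"
val personDF = personRDD.toDF("name", "age")
val avgByNameDF = personDF.groupBy("name").agg(avg("age"))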
Source: michaelmalak
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Project Tungsten II
Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
Tungsten's Compact Row Format: the tuple (123, "data", "bricks") is laid out as a null bitmap, followed by fixed-width slots holding the value 123 and the offsets and field lengths of the variable-length fields, followed by the bytes of "data" and "bricks".
Encoders translate between domain objects, e.g. MyClass(123, "data", "bricks"), and Spark's internal Tungsten binary representation.
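A minimal sketch, with a hypothetical MyClass case class, of encoders doing that translation:

import spark.implicits._          // implicit Encoders for case classes and primitives
import org.apache.spark.sql.Encoders

case class MyClass(id: Int, name: String, company: String)   // hypothetical domain class

// The product encoder turns JVM objects into Tungsten's compact binary rows...
val ds = Seq(MyClass(123, "data", "bricks")).toDS()

// ...and back into objects only when asked for them.
val first: MyClass = ds.head()

// Encoders can also be obtained explicitly:
val enc = Encoders.product[MyClass]
println(enc.schema)               // the schema Spark derived from the case class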
Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.x Cluster – http://dbricks.co/sqlds_wkshp2 (optional) – http://dbricks.co/sqldf_wkshp2 (python) (optional) – http://dbricks.co/data_mounts (python) – http://dbricks.co/iotds_wkshp3 – https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset • Work through each Notebook cell • Try challenges • Break..
Introduction to Structured Streaming Concepts
building robust stream processing apps is hard
Complexities in stream processing COMPLEX DATA Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order COMPLEX SYSTEMS Diverse storage systems (Kafka, S3, Kinesis, RDBMS, …) System failures COMPLEX WORKLOADS Combining streaming with interactive queries Machine learning
Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems
you should not have to reason about streaming
Treat Streams as Unbounded Tables: new data in the data stream = new rows appended to an unbounded input table.
you should write simple queries & Spark should continuously update the answer
DataFrames, Datasets, SQL
input = spark.readStream .format("kafka") .option("subscribe", "topic") .load()
result = input .select("device", "signal") .where("signal > 15")
result.writeStream .format("parquet") .start("dest-path")
Logical Plan: Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet
Spark automatically streamifies! Spark SQL converts the batch-like query into a series of incremental execution plans operating on new batches of data: Kafka Source → Optimized Physical Plan (optimized operators, codegen, off-heap, etc.) → Parquet Sink, processing new data at t = 1, t = 2, t = 3, …
Streaming word count Anatomy of a Streaming Query
Anatomy of a Streaming Query: Step 1 spark.readStream .format("kafka") .option("subscribe", "input") .load() Source • Specify one or more locations to read data from • Built-in support for Files/Kafka/Socket, pluggable.
Anatomy of a Streaming Query: Step 2 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) Transformation • Using DataFrames, Datasets and/or SQL. • Internal processing always exactly-once.
Anatomy of a Streaming Query: Step 3 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Sink • Accepts the output of each batch. • Where supported, sinks are transactional and exactly-once (e.g. Files). • Use foreach to execute arbitrary code.
Anatomy of a Streaming Query: Output Modes spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Output mode – what is output • Complete: output the whole answer every time • Update: output changed rows • Append: output new rows only Trigger – when to output • Specified as a time; eventually will support data size • No trigger means as fast as possible
Anatomy of a Streaming Query: Checkpoint spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Checkpoint • Tracks the progress of a query in persistent storage • Can be used to restart the query if there is a failure.
Fault-tolerance with Checkpointing Checkpointing tracks progress (offsets) of consuming data from the source and intermediate state, in a write-ahead log. Offsets and metadata are saved as JSON. Can resume after changing your streaming transformations. End-to-end exactly-once guarantees.
Complex Streaming ETL
Traditional ETL Raw, dirty, un/semi-structured data is dumped as files. Periodic jobs run every few hours to convert raw data to structured data ready for further analytics.
Traditional ETL Hours of delay before taking decisions on the latest data. Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]
Streaming ETL w/ Structured Streaming Structured Streaming enables raw data to be available as structured data as soon as possible.
Streaming ETL w/ Structured Streaming Example: JSON data being received in Kafka. Parse nested JSON and flatten it. Store in a structured Parquet table. Get end-to-end failure guarantees. val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") val query = parsedData.writeStream .option("checkpointLocation", "/checkpoint") .partitionBy("date") .format("parquet") .start("/parquetTable")
Reading from Kafka Specify options to configure How? kafka.bootstrap.servers => broker1,broker2 What? subscribe => topic1,topic2,topic3 // fixed list of topics subscribePattern => topic* // dynamic list of topics assign => {"topicA":[0,1]} // specific partitions Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}} val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
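A sketch of the alternative subscription and offset options listed above (broker addresses and topic names hypothetical):

val byPatternDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")        // dynamic list of topics
  .option("startingOffsets", "earliest")        // instead of the default "latest"
  .load()

val byPartitionDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("assign", """{"topicA":[0,1]}""")     // specific partitions only
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()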
Reading from Kafka val rawDataDF = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() The rawDataDF DataFrame has the following columns: key | value | topic | partition | offset | timestamp [binary] | [binary] | "topicA" | 0 | 345 | 1486087873 [binary] | [binary] | "topicB" | 3 | 2890 | 1486086721
Transforming Data Cast binary value to string Name it column json val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*")
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") json { "timestamp": 1486087873, "device": "devA", …} { "timestamp": 1486082418, "device": "devX", …} data (nested) timestamp device … 1486087873 devA … 1486086721 devX … from_json("json") as "data"
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") data (nested) timestamp device … 1486087873 devA … 1486086721 devX … timestamp device … 1486087873 devA … 1486086721 devX … select("data.*") (not nested)
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions (see our blog post & tutorial)
Writing to Parquet Save parsed data as a Parquet table in the given path Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable") // path name
Checkpointing Enable checkpointing by setting the checkpoint location to save offset logs start actually starts a continuously running StreamingQuery in the Spark cluster val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
Streaming Query query is a handle to the continuously running StreamingQuery Used to monitor and manage the execution val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable")
Data Consistency on Ad-hoc Queries Data available for complex, ad-hoc analytics within seconds Parquet table is updated atomically, ensuring prefix integrity Even if distributed, ad-hoc queries will see either all updates from the streaming query or none; read more in our blog https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
More Kafka Support [Spark 2.2] Write out to Kafka DataFrame must have binary fields named key and value Direct, interactive and batch queries on Kafka Makes Kafka even more powerful as a storage platform! result.writeStream .format("kafka") .option("topic", "output") .start() val df = spark .read // not readStream .format("kafka") .option("subscribe", "topic") .load() df.createOrReplaceTempView("topicData") spark.sql("select value from topicData")
Amazon Kinesis [Databricks Runtime 3.0] Configure with options (similar to Kafka) How? region => us-west-2 / us-east-1 / ... awsAccessKey (optional) => AKIA... awsSecretKey (optional) => ... What? streamName => name-of-the-stream Where? initialPosition => latest (default) / earliest / trim_horizon spark.readStream .format("kinesis") .option("streamName", "myStream") .option("region", "us-west-2") .option("awsAccessKey", ...) .option("awsSecretKey", ...) .load()
Working With Time
Event Time Many use cases require aggregate statistics by event time E.g. what is the number of errors in each system in 1-hour windows? Many challenges: extracting event time from the data, handling late and out-of-order data The DStream APIs were insufficient for event-time processing
Event-time Aggregations Windowing is just another type of grouping in Structured Streaming Number of records every hour: parsedData .groupBy(window($"timestamp", "1 hour")) .count() Average signal strength of each device every 10 mins: parsedData .groupBy($"device", window($"timestamp", "10 mins")) .avg("signal") UDAFs are supported too!
Stateful Processing for Aggregations Aggregates have to be saved as distributed state between triggers Each trigger reads the previous state and writes updated state State is stored in memory, backed by a write-ahead log in HDFS/S3 Fault-tolerant, with an exactly-once guarantee!
Automatically handles Late Data (Diagram: hourly window counts evolving at 13:00, 14:00, 15:00, 16:00, 17:00; counts for older windows such as 12:00 - 13:00 keep growing as late records arrive.) Keeping state allows late data to update counts of old windows (red = state updated with late data) But the size of the state increases indefinitely if old windows are not dropped
Watermarking The watermark trails the max event time seen so far by the allowed lateness (10 mins here) Late data within the watermark is allowed to aggregate; data that is too late is dropped parsedDataDF .withWatermark("timestamp", "10 minutes") .groupBy(window($"timestamp", "5 minutes")) .count() Useful only in stateful operations (streaming aggregations, dropDuplicates, mapGroupsWithState, ...) Ignored in non-stateful streaming queries and batch queries
What else…
Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState applies a user-defined stateful function to a user-defined per-group state Direct support for per-key timeouts in event-time or processing-time Supports Scala and Java ds.groupByKey(_.id) .mapGroupsWithState(timeoutConf)(mappingWithStateFunc) def mappingWithStateFunc( key: K, values: Iterator[V], state: GroupState[S]): U = { // update or remove state // set timeouts // return mapped value }
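To make the skeleton concrete, a minimal sketch (hypothetical Event/RunningCount types; assumes import spark.implicits._ and a streaming Dataset[Event] named ds) that keeps a running count per key and drops state for idle keys via a processing-time timeout:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, value: Long)        // hypothetical input type
case class RunningCount(count: Long)             // hypothetical state type

def countEvents(
    key: String,
    values: Iterator[Event],
    state: GroupState[RunningCount]): (String, Long) = {
  if (state.hasTimedOut) {                       // no data for this key within the timeout
    val last = state.getOption.map(_.count).getOrElse(0L)
    state.remove()                               // drop the state for this key
    (key, last)
  } else {
    val updated = RunningCount(state.getOption.map(_.count).getOrElse(0L) + values.size)
    state.update(updated)
    state.setTimeoutDuration("10 minutes")       // per-key processing-time timeout
    (key, updated.count)
  }
}

val counts = ds
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countEvents)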
Other interesting operations Streaming Deduplication (watermarks to limit state): parsedDataDF.dropDuplicates("eventId") Stream-batch Joins: val batchDataDF = spark.read .format("parquet") .load("/additional-data") // join with the streaming DataFrame parsedDataDF.join(batchDataDF, "device") Stream-stream Joins: can use mapGroupsWithState; direct support coming soon!
More Info Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Databricks blog posts for more focused discussions https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html and more to come, stay tuned!!
Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • sparkhub.databricks.com • spark-packages.org
https://spark-summit.org/eu-2017/
http://dbricks.co/2sK35XT
Do you have any questions for my prepared answers?
Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.2 Cluster • http://dbricks.co/iotss_wkshp4 • Done!
The Unified Analytics Platform Data Engineering Line of Business DATABRICKS ENTERPRISE SECURITY (DBES) DATABRICKS WORKSPACE DATABRICKS WORKFLOWS DATABRICKS RUNTIME DATABRICKS SERVERLESS DATABRICKS I/O (DBIO) PEOPLE Data Science Streaming Deep Learning / ML and many others… APPLICATIONS Cloud Storage Data Warehouses Hadoop Storage Data Warehousing
