Jump Start on Apache® Spark™ 2.x with Databricks Jules S. Damji Apache Spark Community Evangelist Spark Saturday Meetup Workshop
I have used Apache Spark Before…
I know the difference between DataFrame and RDDs…
Spark Community Evangelist & Developer Advocate @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest https://www.linkedin.com/in/dmatrix @2twitme
Agenda for the day
Morning: • Get to know Databricks • Overview of Spark Fundamentals & Architecture • What's New in Spark 2.x • Break • Unified APIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook 1 • Lunch
Afternoon: • Introduction to DataFrames, Datasets and Spark SQL • Workshop Notebook 2 • Break • Introduction to Structured Streaming Concepts • Workshop Notebook 3 • Go Home…
Get to know Databricks 1. Get http://databricks.com/try-databricks 2. https://github.com/dmatrix/spark-saturday 3. [OR] Import Notebook: http://dbricks.co/ss_wkshp0
Why Apache Spark?
Big Data Systems of Yesterday… MapReduce/Hadoop: general batch processing. Specialized systems for new workloads: Storm, Impala, Drill, Pregel, Giraph, Dremel, Mahout, … Hard to combine in pipelines.
An Analogy …. New applications
Apache Spark Philosophy: a unified engine for complete data applications, with high-level user-friendly APIs: SQL, Streaming, ML, Graph, …
Unified engine across diverse workloads & environments
About Databricks TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009 PRODUCT: Unified Analytics Platform MISSION: Making Big Data Simple
Accelerate innovation by unifying data science, engineering and business. Unified Analytics Platform UNIFIED INFRASTRUCTURE UNIFIED EXPERIENCE ACROSS TEAMS UNIFIED ANALYTIC WORKFLOWS
The Unified Analytics Platform
Apache Spark Architecture
Apache Spark Architecture Deployment Modes • Local • Standalone • YARN • Mesos
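A minimal sketch (hypothetical application name, using the SparkSession API introduced later in this deck) of how the deployment mode is commonly selected through the master URL; in practice the master is usually supplied via spark-submit --master rather than hard-coded:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("deployment-mode-demo")      // hypothetical application name
  .master("local[*]")                   // Local mode: driver and executors in one JVM
  // .master("spark://host:7077")       // Standalone cluster manager
  // .master("yarn")                    // YARN
  // .master("mesos://host:5050")       // Mesos
  .getOrCreate()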
Local Mode in Databricks: each student notebook attaches to its own container on an EC2 machine, with the driver and executor running together in a single JVM.
Standalone Mode: the driver and executors run in separate JVMs (e.g. 22 GB JVMs inside 30 GB containers), and each executor provides multiple slots for running tasks.
Spark Deployment Modes
Apache Spark Architecture An Anatomy of an Application Spark Application • Jobs • Stages • Tasks
A Spark Executor: a JVM inside a container, with multiple slots; each slot runs tasks over partitions of a DataFrame/RDD.
Resilient Distributed Dataset (RDD)
What are RDDs? • … Distributed data abstraction • … Resilient & Immutable • … Lazy • … Compile-time type-safe • … Semi-structured or unstructured
A Resilient Distributed Dataset (RDD)
2 kinds of Actions: those that return a result to the driver (collect, count, reduce, take, show, …) and those that save data to external storage (saveAsTextFile to HDFS, S3, SQL, NoSQL, etc.)
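A minimal sketch contrasting the two kinds of actions, assuming an existing SparkSession named spark and a hypothetical output path:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Actions that return a result to the driver:
val n    = rdd.count()            // 5
val sum  = rdd.reduce(_ + _)      // 15
val head = rdd.take(2)            // Array(1, 2)

// Actions that write to external storage (HDFS, S3, ...):
rdd.saveAsTextFile("/tmp/numbers")   // hypothetical path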
How did we get here…? Where are we going…?
A Brief History 2010: Started @ UC Berkeley 2013: Databricks started & donated to ASF 2014: Spark 1.0 & libraries (SQL, ML, GraphX) 2015: DataFrames/Datasets, Tungsten, Catalyst Optimizer, ML Pipelines 2016-17: Apache Spark 2.0, 2.1, 2.2; Structured Streaming; Cost Based Optimizer; Deep Learning Pipelines. Easier, Smarter, Faster
Apache Spark 2.x • Steps to Bigger & Better Things… Builds on all we learned in the past 2 years
Chart: Spark version usage in Databricks at 06/2016, 12/2016 and 06/2017, showing adoption shifting from 1.5 and 1.6 toward 2.0 and 2.1.
Major Themes in Apache Spark 2.x Faster: Tungsten Phase 2 (speedups of 5-10x) & Catalyst Optimizer Smarter: Structured Streaming, a real-time engine on SQL / DataFrames Easier: Unifying Datasets and DataFrames, & SparkSessions
Unified API Foundation for the Future: SparkSessions, DataFrame, Dataset, MLlib, Structured Streaming
SparkSession – A Unified entry point to Spark • Conduit to Spark – Creates Datasets/DataFrames – Reads/writes data – Works with metadata – Sets/gets Spark Configuration – Driver uses for Cluster resource management
SparkSession vs SparkContext SparkSession subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
SparkSession – A Unified entry point to Spark
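A minimal sketch (hypothetical application name, reusing the people.json file from the Dataset example later in the deck) showing the bullets above in code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unified-entry-point")                      // hypothetical app name
  .getOrCreate()

val range  = spark.range(0, 10).toDF("id")             // creates Datasets/DataFrames
val people = spark.read.json("people.json")            // reads data
spark.catalog.listTables().show()                      // works with metadata
spark.conf.set("spark.sql.shuffle.partitions", "8")    // sets Spark configuration
val parts  = spark.conf.get("spark.sql.shuffle.partitions")  // gets it back
val sc     = spark.sparkContext                        // underlying SparkContext, still available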
DataFrame & Dataset Structure
Long Term • RDD as the low-level API in Spark • For control and certain type-safety in Java/Scala • Datasets & DataFrames give richer semantics & optimizations • For semi-structured data and DSL like operations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming, MLlib, GraphFrames, and Deep Learning Pipelines
Spark 1.6 vs Spark 2.x
Spark 1.6 vs Spark 2.x
Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard compliant parser (with good error messages!) - Subqueries (correlated & uncorrelated) - Approximate aggregate stats - https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
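For illustration, a small sketch of the two subquery flavors against hypothetical registered tables sales and customers:

// Uncorrelated scalar subquery
spark.sql("""
  SELECT customer_id, amount
  FROM sales
  WHERE amount > (SELECT avg(amount) FROM sales)
""")

// Correlated subquery
spark.sql("""
  SELECT c.name
  FROM customers c
  WHERE EXISTS (SELECT 1 FROM sales s WHERE s.customer_id = c.id)
""")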
Chart: Preliminary TPC-DS runtime in seconds, Spark 2.0 vs 1.6 (lower is better).
Other notable API improvements • DataFrame-based ML pipeline API becoming the main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • Improved R support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means • Structured Streaming Features & Production Readiness • https://databricks.com/blog/2017/07/11/introducing-apache-spark-2-2.html
Workshop: Notebook on SparkSession • Import Notebook into your Spark 2.2 Cluster – http://dbricks.co/ss_wkshp1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession • Familiarize yourself with the Databricks Notebook environment • Work through each cell • Ctrl + <return> / Shift + Return • Try challenges • Break…
DataFrames/Dataset, Spark SQL & Catalyst Optimizer
The not so secret truth… Spark SQL is not about SQL; it is about more than SQL.
Spark SQL: The whole story It is about creating and running Spark programs faster: • Write less code • Read less data • Do less work: the optimizer does the hard work
Spark SQL Architecture: SQL, DataFrames and Datasets feed into a Logical Plan; the Optimizer (using the Catalog) produces a Physical Plan; the Code Generator compiles it down to RDDs, reading data through the Data Source API.
Using Catalyst in Spark SQL: SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs. Analysis: analyzing a logical plan to resolve references. Logical Optimization: logical plan optimization. Physical Planning: physical planning. Code Generation: compile parts of the query to Java bytecode.
Catalyst Optimizations Catalyst compiles operations into a physical plan for execution and generates JVM bytecode. LOGICAL OPTIMIZATIONS • Push filter predicates down to the data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons into cheaper integer comparisons via dictionary encoding • RDBMS: reduce the amount of data traffic by pushing down predicates PHYSICAL OPTIMIZATIONS • Intelligently choose between broadcast joins and shuffle joins to reduce network traffic • Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
DataFrame Optimization users.join(events, users("id") === events("uid")).filter(events("date") > "2015-01-01") Logical Plan: join the users table with the events file, then filter. Physical Plan with Predicate Pushdown and Column Pruning: the filter is pushed into an optimized scan of events, which is then joined with an optimized scan of users.
Columns: Predicate pushdown You write: spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "people") .load() .where($"name" === "michael") Spark translates, for Postgres: SELECT * FROM people WHERE name = 'michael'
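One way to see what is actually pushed down is to inspect the physical plan; a sketch against the same (hypothetical) Postgres table, where the pushed predicate should appear as a PushedFilters entry on the JDBC relation:

import spark.implicits._

val peopleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")   // hypothetical connection string
  .option("dbtable", "people")
  .load()

// The filter should show up inside the JDBC scan,
// not as a separate Spark filter over all rows.
peopleDF.where($"name" === "michael").explain()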
Foundational Spark 2.x Components: Spark Core (RDD) at the base; Catalyst and Spark SQL with the DataFrame/Dataset API on top; libraries built on them: ML Pipelines, Structured Streaming, GraphFrames; data sources: { JSON }, JDBC, and more…
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Datasets Spark 2.x APIs
Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T] Opaque computation & opaque data
Structured APIs in Spark: when errors are detected SQL: syntax errors at runtime, analysis errors at runtime DataFrames: syntax errors at compile time, analysis errors at runtime Datasets: syntax errors at compile time, analysis errors at compile time Analysis errors are reported before a distributed job starts
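A small sketch of where each API catches a mistake, reusing the Person example from the Dataset API slide below (assumes import spark.implicits._):

case class Person(name: String, age: Int)
val df = spark.read.json("people.json")
val ds = df.as[Person]

// DataFrame: a misspelled column compiles fine and only fails at runtime
// with an AnalysisException when the plan is analyzed:
// df.select("agee")

// Dataset: the same mistake in a typed lambda is rejected by the Scala compiler:
// ds.filter(p => p.agee > 30)      // does not compile

ds.filter(p => p.age > 30)          // checked at compile time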
Unification of APIs in Spark 2.0
Dataset API in Spark 2.x
Type-safe: operate on domain objects with compiled lambda functions
val df = spark.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
val filterDS = ds.filter(p => p.age > 30)
Datasets: Lightning-fast Serialization with Encoders
DataFrames are Faster than RDDs
Datasets use less memory than RDDs
Why & When to use DataFrames & Datasets • Structured data with a schema • Code optimization & performance • Space efficiency with Tungsten • High-level APIs and DSL • Strong type-safety • Ease-of-use & readability • Say what-to-do, not how-to-do-it (a short comparison follows below)
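A small sketch with hypothetical data of the "what-to-do" point: with RDDs you spell out how to compute an average per key; with DataFrames you declare what you want and Catalyst plans the how:

import spark.implicits._
import org.apache.spark.sql.functions.avg

val personRDD = spark.sparkContext.parallelize(Seq(("alice", 29), ("bob", 31), ("alice", 35)))

// RDD: the "how"
val avgByNameRDD = personRDD
  .mapValues(age => (age, 1))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// DataFrame: the "what"
val personDF = personRDD.toDF("name", "age")
val avgByNameDF = personDF.groupBy("name").agg(avg("age"))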
Source: michaelmalak
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Project Tungsten II
Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
Tungsten's Compact Row Format: the tuple (123, "data", "bricks") is laid out as a null bitmap, followed by fixed-width slots holding the value 123 and the offsets and field lengths of the variable-length fields, followed by the bytes of "data" and "bricks".
Encoders translate between domain objects, e.g. MyClass(123, "data", "bricks"), and Spark's internal Tungsten binary representation.
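A minimal sketch, with a hypothetical MyClass case class, of encoders doing that translation:

import spark.implicits._          // implicit Encoders for case classes and primitives
import org.apache.spark.sql.Encoders

case class MyClass(id: Int, name: String, company: String)   // hypothetical domain class

// The product encoder turns JVM objects into Tungsten's compact binary rows...
val ds = Seq(MyClass(123, "data", "bricks")).toDS()

// ...and back into objects only when asked for them.
val first: MyClass = ds.head()

// Encoders can also be obtained explicitly:
val enc = Encoders.product[MyClass]
println(enc.schema)               // the schema Spark derived from the case class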
Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.x Cluster – http://dbricks.co/sqlds_wkshp2 (optional) – http://dbricks.co/sqldf_wkshp2 (python) (optional) – http://dbricks.co/data_mounts (python) – http://dbricks.co/iotds_wkshp3 – https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset • Work through each Notebook cell • Try challenges • Break..
Introduction to Structured Streaming Concepts
building robust stream processing apps is hard
Complexities in stream processing COMPLEX DATA Diverse data formats (json, avro, binary, …) Data can be dirty, late, out-of-order COMPLEX SYSTEMS Diverse storage systems (Kafka, S3, Kinesis, RDBMS, …) System failures COMPLEX WORKLOADS Combining streaming with interactive queries Machine learning
Structured Streaming stream processing on Spark SQL engine fast, scalable, fault-tolerant rich, unified, high level APIs deal with complex data and complex workloads rich ecosystem of data sources integrate with many storage systems
you should not have to reason about streaming
Treat Streams as Unbounded Tables: new data in the data stream = new rows appended to an unbounded input table.
you should write simple queries & Spark should continuously update the answer
DataFrames, Datasets, SQL
input = spark.readStream .format("kafka") .option("subscribe", "topic") .load()
result = input .select("device", "signal") .where("signal > 15")
result.writeStream .format("parquet") .start("dest-path")
Logical Plan: Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet
Spark automatically streamifies! Spark SQL converts the batch-like query into a series of incremental execution plans operating on new batches of data: Kafka Source → Optimized Physical Plan (optimized operators, codegen, off-heap, etc.) → Parquet Sink, processing new data at t = 1, t = 2, t = 3, …
Streaming word count Anatomy of a Streaming Query
Anatomy of a Streaming Query: Step 1 spark.readStream .format("kafka") .option("subscribe", "input") .load() Source • Specify one or more locations to read data from • Built-in support for Files/Kafka/Socket, pluggable.
Anatomy of a Streaming Query: Step 2 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) Transformation • Using DataFrames, Datasets and/or SQL. • Internal processing always exactly-once.
Anatomy of a Streaming Query: Step 3 spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode(OutputMode.Complete()) .option("checkpointLocation", "…") .start() Sink • Accepts the output of each batch. • Where supported, sinks are transactional and exactly-once (e.g. Files). • Use foreach to execute arbitrary code.
Anatomy of a Streaming Query: Output Modes spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Output mode – what is output • Complete: output the whole answer every time • Update: output changed rows • Append: output new rows only Trigger – when to output • Specified as a time; eventually will support data size • No trigger means as fast as possible
Anatomy of a Streaming Query: Checkpoint spark.readStream .format("kafka") .option("subscribe", "input") .load() .groupBy('value.cast("string") as 'key) .agg(count("*") as 'value) .writeStream .format("kafka") .option("topic", "output") .trigger("1 minute") .outputMode("update") .option("checkpointLocation", "…") .start() Checkpoint • Tracks the progress of a query in persistent storage • Can be used to restart the query if there is a failure.
Fault-tolerance with Checkpointing Checkpointing tracks progress (offsets) of consuming data from the source and intermediate state, in a write-ahead log. Offsets and metadata are saved as JSON. Can resume after changing your streaming transformations. End-to-end exactly-once guarantees.
Complex Streaming ETL
Traditional ETL Raw, dirty, un/semi-structured data is dumped as files. Periodic jobs run every few hours to convert raw data to structured data ready for further analytics.
Traditional ETL Hours of delay before taking decisions on the latest data. Unacceptable when time is of the essence [intrusion detection, anomaly detection, etc.]
Streaming ETL w/ Structured Streaming Structured Streaming enables raw data to be available as structured data as soon as possible.
Streaming ETL w/ Structured Streaming Example: JSON data being received in Kafka. Parse nested JSON and flatten it. Store in a structured Parquet table. Get end-to-end failure guarantees. val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") val query = parsedData.writeStream .option("checkpointLocation", "/checkpoint") .partitionBy("date") .format("parquet") .start("/parquetTable")
Reading from Kafka Specify options to configure How? kafka.bootstrap.servers => broker1,broker2 What? subscribe => topic1,topic2,topic3 // fixed list of topics subscribePattern => topic* // dynamic list of topics assign => {"topicA":[0,1]} // specific partitions Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}} val rawData = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load()
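A sketch of the alternative subscription and offset options listed above (broker addresses and topic names hypothetical):

val byPatternDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")        // dynamic list of topics
  .option("startingOffsets", "earliest")        // instead of the default "latest"
  .load()

val byPartitionDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("assign", """{"topicA":[0,1]}""")     // specific partitions only
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()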
Reading from Kafka val rawDataDF = spark.readStream .format("kafka") .option("kafka.bootstrap.servers", ...) .option("subscribe", "topic") .load() The rawDataDF DataFrame has the following columns: key | value | topic | partition | offset | timestamp [binary] | [binary] | "topicA" | 0 | 345 | 1486087873 [binary] | [binary] | "topicB" | 3 | 2890 | 1486086721
Transforming Data Cast binary value to string Name it column json val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*")
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") json { "timestamp": 1486087873, "device": "devA", …} { "timestamp": 1486082418, "device": "devX", …} data (nested) timestamp device … 1486087873 devA … 1486086721 devX … from_json("json") as "data"
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns val parsedDataDF = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") data (nested) timestamp device … 1486087873 devA … 1486086721 devX … timestamp device … 1486087873 devA … 1486086721 devX … select("data.*") (not nested)
Transforming Data Cast binary value to string Name it column json Parse json string and expand into nested columns, name it data Flatten the nested columns val parsedData = rawData .selectExpr("cast (value as string) as json") .select(from_json("json", schema).as("data")) .select("data.*") Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, ... 100s of functions (see our blog post & tutorial)
Writing to Parquet Save parsed data as a Parquet table in the given path Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data val query = parsedData.writeStream .option("checkpointLocation", ...) .partitionBy("date") .format("parquet") .start("/parquetTable") // path name
Checkpointing Enable checkpointing by setting the checkpoint location to save offset logs start actually starts a continuously running StreamingQuery in the Spark cluster val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable/")
Streaming Query query is a handle to the continuously running StreamingQuery Used to monitor and manage the execution val query = parsedData.writeStream .option("checkpointLocation", ...) .format("parquet") .partitionBy("date") .start("/parquetTable")
Data Consistency on Ad-hoc Queries Data available for complex, ad-hoc analytics within seconds Parquet table is updated atomically, ensuring prefix integrity Even if distributed, ad-hoc queries will see either all updates from the streaming query or none; read more in our blog https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
More Kafka Support [Spark 2.2] Write out to Kafka DataFrame must have binary fields named key and value Direct, interactive and batch queries on Kafka Makes Kafka even more powerful as a storage platform! result.writeStream .format("kafka") .option("topic", "output") .start() val df = spark .read // not readStream .format("kafka") .option("subscribe", "topic") .load() df.createOrReplaceTempView("topicData") spark.sql("select value from topicData")
Amazon Kinesis [Databricks Runtime 3.0] Configure with options (similar to Kafka) How? region => us-west-2 / us-east-1 / ... awsAccessKey (optional) => AKIA... awsSecretKey (optional) => ... What? streamName => name-of-the-stream Where? initialPosition => latest (default) / earliest / trim_horizon spark.readStream .format("kinesis") .option("streamName", "myStream") .option("region", "us-west-2") .option("awsAccessKey", ...) .option("awsSecretKey", ...) .load()
Working With Time
Event Time Many use cases require aggregate statistics by event time E.g. what is the number of errors in each system in 1-hour windows? Many challenges: extracting event time from the data, handling late and out-of-order data The DStream APIs were insufficient for event-time processing
Event-time Aggregations Windowing is just another type of grouping in Structured Streaming Number of records every hour: parsedData .groupBy(window($"timestamp", "1 hour")) .count() Average signal strength of each device every 10 mins: parsedData .groupBy($"device", window($"timestamp", "10 mins")) .avg("signal") UDAFs are supported too!
Stateful Processing for Aggregations Aggregates have to be saved as distributed state between triggers Each trigger reads the previous state and writes updated state State is stored in memory, backed by a write-ahead log in HDFS/S3 Fault-tolerant, with an exactly-once guarantee!
Automatically handles Late Data (Diagram: hourly window counts evolving at 13:00, 14:00, 15:00, 16:00, 17:00; counts for older windows such as 12:00 - 13:00 keep growing as late records arrive.) Keeping state allows late data to update counts of old windows (red = state updated with late data) But the size of the state increases indefinitely if old windows are not dropped
Watermarking The watermark trails the max event time seen so far by the allowed lateness (10 mins here) Late data within the watermark is allowed to aggregate; data that is too late is dropped parsedDataDF .withWatermark("timestamp", "10 minutes") .groupBy(window($"timestamp", "5 minutes")) .count() Useful only in stateful operations (streaming aggregations, dropDuplicates, mapGroupsWithState, ...) Ignored in non-stateful streaming queries and batch queries
What else…
Arbitrary Stateful Operations [Spark 2.2] mapGroupsWithState applies a user-defined stateful function to a user-defined per-group state Direct support for per-key timeouts in event-time or processing-time Supports Scala and Java ds.groupByKey(_.id) .mapGroupsWithState(timeoutConf)(mappingWithStateFunc) def mappingWithStateFunc( key: K, values: Iterator[V], state: GroupState[S]): U = { // update or remove state // set timeouts // return mapped value }
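To make the skeleton concrete, a minimal sketch (hypothetical Event/RunningCount types; assumes import spark.implicits._ and a streaming Dataset[Event] named ds) that keeps a running count per key and drops state for idle keys via a processing-time timeout:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, value: Long)        // hypothetical input type
case class RunningCount(count: Long)             // hypothetical state type

def countEvents(
    key: String,
    values: Iterator[Event],
    state: GroupState[RunningCount]): (String, Long) = {
  if (state.hasTimedOut) {                       // no data for this key within the timeout
    val last = state.getOption.map(_.count).getOrElse(0L)
    state.remove()                               // drop the state for this key
    (key, last)
  } else {
    val updated = RunningCount(state.getOption.map(_.count).getOrElse(0L) + values.size)
    state.update(updated)
    state.setTimeoutDuration("10 minutes")       // per-key processing-time timeout
    (key, updated.count)
  }
}

val counts = ds
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countEvents)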
Other interesting operations Streaming Deduplication (watermarks to limit state): parsedDataDF.dropDuplicates("eventId") Stream-batch Joins: val batchDataDF = spark.read .format("parquet") .load("/additional-data") // join with the streaming DataFrame parsedDataDF.join(batchDataDF, "device") Stream-stream Joins: can use mapGroupsWithState; direct support coming soon!
More Info Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Databricks blog posts for more focused discussions https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html and more to come, stay tuned!!
Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • Structured Streaming Programming Guide • Databricks Engineering Blogs • sparkhub.databricks.com • spark-packages.org
https://spark-summit.org/eu-2017/
http://dbricks.co/2sK35XT
Do you have any questions for my prepared answers?
Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.2 Cluster • http://dbricks.co/iotss_wkshp4 • Done!
The Unified Analytics Platform Data Engineering Line of Business DATABRICKS ENTERPRISE SECURITY (DBES) DATABRICKS WORKSPACE DATABRICKS WORKFLOWS DATABRICKS RUNTIME DATABRICKS SERVERLESS DATABRICKS I/O (DBIO) PEOPLE Data Science Streaming Deep Learning / ML and many others… APPLICATIONS Cloud Storage Data Warehouses Hadoop Storage Data Warehousing
