Modern Data Pipelines
Ryan Knight
James Ward
@_JamesWard
Ryan Knight
Architect at Starbucks
• Distributed Systems guru
• Scala, Akka, Cassandra Expert & Trainer
• Skis with his 5 boys in Park City, UT
• First time at jFokus

James Ward
Developer at Salesforce
• Back-end Developer
• Creator of WebJars
• Blog: www.jamesward.com
• Not a JavaScript Fan
• In love with FP
Agenda
• Modern Data Pipeline Overview
• Kafka
• Akka Streams
• Play Framework
• Flink
• Cassandra
• Spark Streaming
Code: github.com/jamesward/koober
Modern Data Pipelines
Real-Time, Distributed, Decoupled
Why Streaming Pipelines
Real-Time Value
• Allow the business to react to data in real time instead of in batches
Real-Time Intelligence
• Provide real-time information so that apps can use it to adapt their user interactions
Distributed data processing that is both scalable and resilient
Example use cases:
• Clickstream analysis
• Real-time anomaly detection
• Instant (< 10 s) feedback - e.g. real-time concurrent video viewers / page views
Data Pipeline Requirements
• Ability to process massive amounts of data
• Handle data from a wider variety of sources
• Highly Available
• Resilient - not just fault tolerant
• Distributed for Scale of Data and Transactions
• Elastic
• Uniformity - all-JVM based for easy deployment and management
Traditional ETL
Data Integration Today
Data Pipelines today
http://ferd.ca/queues-don-t-fix-overload.html
Backpressure
http://ferd.ca/queues-don-t-fix-overload.html
Data Hub / Stream Processing
Pipeline Architecture
[Diagram: a Web Client talks to a Play App, which writes events to Kafka; Flink and Spark Streaming consume the stream; Cassandra holds the cold data; Spark (core, streaming, graphx, mllib, ...) and a Spark Notebook analyze it]
Koober
github.com/jamesward/koober
Kafka
Distributed Commit Logs
What is Kafka?
Kafka is a distributed and partitioned commit log
Replacement for traditional message queues and publish subscribe
systems
Central Data Backbone or Hub
Designed to scale transparently with replication across the cluster
Core Principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers
4. Pull Not Push
Kafka Characteristics
Scalability of a filesystem
• Hundreds of MB/sec/server throughput
• Many TB per server
Durable - Guarantees of a database
• Messages strictly ordered (within a partition)
• All data persistent
Distributed by default
• Replication
• Partitioning model
Kafka is about logs
The Event Log
Append-Only Logging
Database of Facts
Disks are Cheap
Why Delete Data any more?
Replay Events
Append Only Logging
Logs: pub/sub done right
Kafka Overview
• Producers write data to brokers.
• Consumers read data from brokers.
• Brokers - Each server running Kafka is called a
broker.
• All this is distributed.
• Data
– Data is stored in topics.
– Topics are split into partitions, which are
replicated.
• Built in Parallelism and Scale
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
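A minimal producer sketch (not from the deck), using the standard Kafka Java client from Scala; the "rides" topic, key, and broker address are illustrative:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RideProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")   // illustrative broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Records with the same key always land in the same partition, preserving per-key order
  producer.send(new ProducerRecord("rides", "rider-42", """{"lat":59.33,"lng":18.06}"""))
  producer.flush()
  producer.close()
}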
Partitions
A topic consists of partitions.
Partition: ordered + immutable sequence of messages
that is continually appended to
Partition offsets
• Offset: messages in the partitions are each assigned a unique
(per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
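A matching consumer sketch under the same assumptions: each record carries its partition and offset, and consumers in the same group split the topic's partitions between them:

import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object RideConsumer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", "ride-analytics")   // illustrative consumer group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("rides"))
  while (true) {
    // Offsets are tracked per (topic, partition)
    for (record <- consumer.poll(Duration.ofMillis(500)).asScala)
      println(s"partition=${record.partition} offset=${record.offset} value=${record.value}")
  }
}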
[Diagram: a consumer group (C1) reading from the topic's partitions, each consumer tracking its own offsets]
Example: A Fault-tolerant CEO Hash Table
[Diagram: a sequence of hash-table operations, the resulting final state, and the Kafka log of operations from which that state can be rebuilt]
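A rough sketch of the replay idea, with an in-memory list standing in for the Kafka topic and made-up companies and CEOs; the final state is just a fold over the log of operations:

sealed trait Op
case class Put(company: String, ceo: String) extends Op
case class Remove(company: String) extends Op

object ReplayExample extends App {
  // In Kafka this append-only log would be a topic
  val log = Seq(
    Put("BigCo", "Alice"),
    Put("StartUp", "Bob"),
    Put("BigCo", "Carol"),   // a later fact supersedes the earlier one
    Remove("StartUp")
  )

  // Replaying the log rebuilds the current state
  val finalState = log.foldLeft(Map.empty[String, String]) {
    case (state, Put(company, ceo)) => state + (company -> ceo)
    case (state, Remove(company))   => state - company
  }

  println(finalState)   // Map(BigCo -> Carol)
}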
Heroku Kafka
• Managed Kafka Cloud Service
• https://www.heroku.com/kafka
Code
Akka Streams
Reactive Streams Built on Akka
Reactive Streams
A JVM standard for asynchronous stream processing with non-blocking back pressure
Akka Streams
• Powered by Akka Actors
• Implementation of Reactive Streams
• Actors can be used directly or just internally
• Stream processing functions: map, filter, fold, etc
Sink & Source
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system = ActorSystem()   // provides the materializer
val source = Source.repeat("hello, world")
val sink = Sink.foreach(println)
val graph = source.to(sink)           // a RunnableGraph
graph.run()
Code
Play Framework
Web Framework Built on Akka Streams
Play Framework
Scala & Java – Built on Akka Streams
Declarative Routing:
GET     /foo     controllers.Foo.index
Controllers Hold Stateless Functions:
class Foo extends Controller {
  def index = Action {
    Ok("hello, world")
  }
}
Reactive Requests
Don't block in wait states!
def doLater = Action.async {
Promise.timeout(Ok("hello, world"), 5.seconds)
}
def reactiveRest = Action.async {
ws.url("http://api.foo.com/bar").get().map { response =>
Ok(response.json)
}
}
WebSockets
Built on Akka Streams
def ws = WebSocket.accept[String, String] { request =>
  val sink = ...
  val source = ...
  Flow.fromSinkAndSource(sink, source)
}
Views
Serverside Templating with a Subset of Scala
app/views/blah.scala.html:
@(foo: String)
<html>
  <body>
    @foo
  </body>
</html>

Controller:
Action {
  Ok(views.html.blah("bar"))
}

Rendered output:
<html>
  <body>
    bar
  </body>
</html>
Demo & Code
Flink
Real-time Data Analytics
Flink
Real-time Data Analytics
• Bounded & Unbounded Data Sets
• Stream processing
• Distributed Core
• Fault Tolerant
• Clustered
• Flexible Windowing
Apache Flink
Continuous Processing for Unbounded Datasets
[Diagram: a running count() over an unbounded stream]
Windowing
Bounding with Time, Count, Session, or Data
[Diagram: count() evaluated per 1-second window]
Batch Processing
Stream Processing on Finite Streams
[Diagram: count() over a finite stream]
Data Processing
What can we do?
• Aggregate / Accumulate - fold(), reduce(), sum(), min()
• Transform - map(), flatMap()
• Filter - filter(), distinct()
• Sort - sortGroup(), sortPartition()
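A hedged sketch of these operations with Flink's Scala DataStream API, assuming an illustrative socket source that emits one pickup location per line, counted per 10-second window:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object PickupCounts extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val pickups: DataStream[String] = env.socketTextStream("localhost", 9999)

  pickups
    .filter(_.nonEmpty)             // filter
    .map(city => (city, 1))         // transform
    .keyBy(_._1)                    // partition by key
    .timeWindow(Time.seconds(10))   // bound the unbounded stream
    .sum(1)                         // aggregate within each window
    .print()

  env.execute("pickup-counts")
}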
Apache Flink
Architecture
Partitioning
Network Distribution
Demo & Code
Cassandra
Distributed NoSQL Database
Challenges with Relational Databases
• How do you scale and maintain high-availability with a
monolithic database?
• Is it possible to have ACID compliant distributed transactions?
• How can I synchronize a distributed data store?
• How do I resolve differing views of data?
Goals of a Distributed Database
• Strong consistency is not practical - give it up!
• Manual sharding & rebalancing is hard - Automatic
Sharding!
• Every moving part makes systems more complex
• Master / slave creates a Single Point of Failure / Bottleneck
- Simplify Architecture!
• Scaling up is expensive - Reduce Cost
• Leverage cloud / commodity hardware
What is Cassandra?
Distributed Database
✓ Individual DBs (nodes)
✓ Working in a cluster C*
✓ Nothing is shared
Cassandra Cluster
• Nodes in a peer-to-peer cluster
• No single point of failure
• Built in data replication
• Data is always available
• 100% Uptime
• Across data centers
• Failure avoidance
Multi-Data Center Design
Why Cassandra?
It has a flexible data model
Tables, wide rows, partitioned and distributed
✓ Data
✓ Blobs (documents, files, images)
✓ Collections (Sets, Lists, Maps)
✓ UDTs
Access it with CQL - a familiar, SQL-like syntax
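An illustrative sketch of such a model with the DataStax Java driver from Scala; the koober keyspace and rides table are made-up names, with one partition per rider and a wide row of timestamped points:

import com.datastax.driver.core.Cluster

object RidesSchema extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS koober
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""".stripMargin)

  // rider_id is the partition key; ts is a clustering column, giving a wide row per rider
  session.execute(
    """CREATE TABLE IF NOT EXISTS koober.rides (
      |  rider_id text,
      |  ts timestamp,
      |  lat double,
      |  lng double,
      |  PRIMARY KEY (rider_id, ts)
      |)""".stripMargin)

  cluster.close()
}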
Two knobs control Cassandra fault tolerance
Replication Factor (server side)
How many copies of the data should exist?
[Diagram: with RF=3, a client's write of A is stored on three replica nodes around the ring]
Two knobs control Cassandra fault tolerance
Consistency Level (client side)
How many replicas do we need to hear from before we acknowledge?
[Diagram: the same write of A acknowledged by one replica at CL=ONE versus a majority of replicas at CL=QUORUM]
Consistency Levels
Applies to both Reads and Writes (i.e. is set on each query)
ONE – one replica from any DC
LOCAL_ONE – one replica from local DC
QUORUM – 51% of replicas from any DC
LOCAL_QUORUM – 51% of replicas from local DC
ALL – all replicas
TWO – two replicas from any DC
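A sketch of choosing the consistency level per query with the DataStax driver, reusing the illustrative koober.rides table from earlier:

import scala.jdk.CollectionConverters._
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

object ConsistentRead extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("koober")

  // The consistency level is set on each statement, not on the table
  val stmt = new SimpleStatement("SELECT * FROM rides WHERE rider_id = ?", "rider-42")
    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)

  for (row <- session.execute(stmt).asScala)
    println(s"${row.getTimestamp("ts")}: ${row.getDouble("lat")}, ${row.getDouble("lng")}")

  cluster.close()
}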
Consistency Level and Speed
How many replicas we need to hear from affects how quickly we can read and write data in Cassandra.
[Diagram: a read of A at CL=QUORUM - some replicas acknowledge in 5 µs and 12 µs while another takes 300 µs]
Consistency Level and Availability
Consistency Level choice affects availability
For example, QUORUM can tolerate one replica being down and still be available (in RF=3)
[Diagram: a read of A at CL=QUORUM succeeding with two of the three replicas responding]
Reads in the cluster
As with writes, reads are coordinated - any node can act as the Coordinator Node
[Diagram: a client's read of A at CL=QUORUM routed through a coordinator node to the replicas]
Spark Cassandra Connector
Spark Cassandra Connector
Data locality-aware (speed)
Read from and Write to Cassandra
Cassandra Tables Exposed as RDD and DataFrames
Server-Side filters (where clauses)
Cross-table operations (JOIN, UNION, etc.)
Mapping of Java Types to Cassandra Types
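A hedged sketch with the Spark Cassandra Connector, again reusing the illustrative koober.rides table; the ride_counts target table is also made up:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object RideAnalysis extends App {
  val conf = new SparkConf()
    .setAppName("ride-analysis")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // A Cassandra table exposed as an RDD; the where() clause is pushed down to the server
  val rides = sc.cassandraTable("koober", "rides").where("rider_id = ?", "rider-42")
  println(s"rides for rider-42: ${rides.count()}")

  // Aggregate and write the result back to another table
  sc.cassandraTable("koober", "rides")
    .map(row => (row.getString("rider_id"), 1))
    .reduceByKey(_ + _)
    .saveToCassandra("koober", "ride_counts", SomeColumns("rider_id", "total"))

  sc.stop()
}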
Code
Spark Streaming
Stream Processing Built on Spark
Hadoop?
Hadoop Limitations
• Master / Slave Architecture
• Every Processing Step requires Disk IO
• Difficult API and Programming Model
• Designed for batch-mode jobs
• No event streaming / real-time
• Complex Ecosystem
What is Spark?
Fast and general compute engine for large-scale data processing
Fault Tolerant Distributed Datasets
Distributed Transformation on Datasets
Integrated Batch, Iterative and Streaming Analysis
In Memory Storage with Spill-over to Disk
Advantages of Spark
• Improves efficiency through:
• In-memory data sharing
• General computation graphs - Lazy Evaluates Data
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Improves usability through:
• Rich APIs in Java, Scala, and Python
• 2 to 5x less code
• Interactive shell
Spark Components
• Spark Master - a process which manages the resources of the Spark cluster (hosts the Spark Master UI on :7080)
• Application (Spark Driver) - your application code, which creates the SparkContext (hosts the Application UI on :4040)
• Workers - processes which shell out to create an Executor JVM
These processes are all separate and require networking to communicate.
Resilient Distributed Datasets (RDD)
• The primary abstraction in Spark
• Collection of data stored in the Spark Cluster
• Fault-tolerant
• Enables parallel processing on data sets
• In-Memory or On-Disk
RDD Operations
Transformations - similar to the Scala collections API
Produce new RDDs:
filter, flatMap, map, distinct, groupBy,
union, zip, reduceByKey, subtract
Actions - require materialization of the records to generate a value
collect: Array[T], count, fold, reduce, ...
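A small local sketch (not from the deck) showing lazy transformations followed by the actions that materialize them:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics extends App {
  val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

  // Transformations are lazy: nothing runs until an action is called
  val words  = sc.parallelize(Seq("kafka", "akka", "spark", "spark", "flink"))
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

  // Actions trigger the computation
  counts.collect().foreach(println)
  println(s"distinct words: ${words.distinct().count()}")

  sc.stop()
}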
DataFrame
• Distributed collection of data
• Similar to a Table in a RDBMS
• Common API for reading/writing data
• API for selecting, filtering, aggregating
and plotting structured data
DataFrame Part 2
• Sources such as Cassandra, structured data files, tables in
Hive, external databases, or existing RDDs.
• Optimization and code generation through the Spark SQL
Catalyst optimizer
• Decorator around RDD - Previously SchemaRDD
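A hedged DataFrame sketch; the rides.json file and its riderId / distanceKm columns are illustrative:

import org.apache.spark.sql.SparkSession

object DataFrameBasics extends App {
  val spark = SparkSession.builder().appName("df-basics").master("local[*]").getOrCreate()

  // Read structured data straight into a DataFrame; the schema is inferred
  val rides = spark.read.json("rides.json")

  rides.printSchema()
  rides.filter(rides("distanceKm") > 5)
    .groupBy("riderId")
    .count()
    .show()

  spark.stop()
}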
Spark Versus Spark Streaming
Spark Streaming Data Sources
Spark Streaming General Architecture
DStream Micro Batches
Windowing
Streaming Resiliency without Kafka
• Streaming uses aggressive checkpointing and in-memory data replication to improve
resiliency.
• Frequent checkpointing keeps RDD lineages down to a reasonable size.
• Checkpointing and replication are mandatory since streams don't have source data files from which to reconstruct lost RDD partitions (except for the directory-ingest case).
• Write Ahead Logging to prevent Data Loss
Direct Kafka Streaming w/ Kafka Direct API
• Use Kafka Direct Approach (No Receivers)
• Queries Kafka Directly
• Automatically Parallelizes based on Kafka Partitions
• (Mostly) Exactly Once Processing - Only Move Offset after
Processing
• Resiliency without copying data
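A sketch of the direct approach using the spark-streaming-kafka-0-10 integration; the topic, group id, and batch interval are illustrative:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectKafkaStream extends App {
  val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "spark-rides",
    "enable.auto.commit" -> (false: java.lang.Boolean)   // move offsets only after processing
  )

  // No receivers: one Spark partition per Kafka partition, read directly from the brokers
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("rides"), kafkaParams))

  stream.map(_.value).count().print()

  ssc.start()
  ssc.awaitTermination()
}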
Demo & Code