Graph Stream Processing: spinning fast, large-scale, complex analytics
Paris Carbone, PhD Candidate @ KTH, Committer @ Apache Flink
We want to analyse… large-scale, complex data, fast.
But why do we need large-scale, complex and fast data analysis? To answer big, complex questions, faster.
>Hej Siri_
Get me the best route to work right now… with the fewest human drivers.
Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza!
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
By 3000 AD, all of the above are just FIRST WORLD PROBLEMS.
By 30000 AD, they are FIRST EARTH WORLD PROBLEMS.
Still, fast analytics might save us some day…
• We can access patient movements, plus Facebook, Twitter and pretty much all social media interactions.
• Can we stop a pandemic?
• Or can we quickly predict where the virus will spread?
Now how do we analyse… large-scale, complex data, fast? With distributed, streaming, graph data:
• everything is a graph
• everything is many
• everything is a stream
it all started… as a first world problem, or rather, a first world question
but then things escalated quickly… machinery got cheaper, and we suddenly realised that we have big data
Thus, Distributed Graph Processing was born.

Map Reduce (DFS: distributed file system):
1. Store partitioned data
2. Send local computation (map)
3. Now shuffle it on disks
4. Merge the results (reduce)
5. Store the result back

Distributed Graph Processing:
1. Store updates to DFS
2. Load graph snapshot (in memory)
3. Compute a round (~superstep)
4. Store updates
5. …repeat
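To make the Map Reduce steps concrete, here is a toy, single-JVM imitation of the map, shuffle and reduce phases that computes vertex degrees from an edge list; the edge data and class name are made up for illustration, and a real job would read its input from and write its output to a DFS.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy, single-JVM imitation of the Map Reduce steps above: compute vertex
// degrees from an edge list. In a real job, the input, the shuffled
// intermediate data and the result would all live on a DFS.
public final class DegreeCount {
    public static void main(String[] args) {
        // 1. "Stored" partitioned data: a list of edges (src, dst).
        List<int[]> edges = List.of(new int[]{1, 3}, new int[]{1, 4}, new int[]{2, 4});

        Map<Integer, Long> degrees = edges.stream()
            // 2. map: each edge emits (src, 1) and (dst, 1)
            .flatMap(e -> Stream.of(new SimpleEntry<Integer, Long>(e[0], 1L),
                                    new SimpleEntry<Integer, Long>(e[1], 1L)))
            // 3 + 4. shuffle by key and reduce: sum the ones per vertex
            .collect(Collectors.groupingBy(Entry::getKey, Collectors.summingLong(Entry::getValue)));

        // 5. "Store" the result back (here: just print it), e.g. {1=2, 2=1, 3=1, 4=2}
        System.out.println(degrees);
    }
}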
Distributed Graph processing example
• We want to compute the Connected Components of a distributed graph.
• Basic computation element (map): the vertex
• Updates: messages to other vertices
[Figure: min-label propagation over a toy graph of vertices 1-8, which has two connected components]
ROUND 0: every vertex sends its own ID to its neighbours.
ROUND 1: each vertex adopts the minimum ID it has seen and propagates it; labels 1 and 6 start to spread.
ROUND 2: the minima keep spreading through their components.
ROUND 3: every vertex now carries label 1 or 6.
ROUND 4: No messages, DONE!
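A minimal, single-machine sketch of this vertex-centric computation in plain Java (not the API of Pregel, Giraph or Gelly); the exact toy graph on the slides is not fully recoverable, so the edges below are an assumption with the same two components.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of min-label propagation: every vertex starts with its own ID as its
// label and, in each round (superstep), adopts the smallest label among itself
// and its neighbours. When a round produces no updates, we are done.
public final class VertexCentricConnectedComponents {

    static Map<Integer, Integer> run(Map<Integer, List<Integer>> adjacency) {
        Map<Integer, Integer> labels = new HashMap<>();
        adjacency.keySet().forEach(v -> labels.put(v, v));   // ROUND 0: label = own ID

        boolean messagesSent = true;
        while (messagesSent) {                                // one loop iteration = one round
            messagesSent = false;
            Map<Integer, Integer> updates = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : adjacency.entrySet()) {
                int min = labels.get(e.getKey());
                for (int neighbour : e.getValue()) {          // "messages" = neighbours' labels
                    min = Math.min(min, labels.get(neighbour));
                }
                if (min < labels.get(e.getKey())) {           // improved label -> send an update
                    updates.put(e.getKey(), min);
                    messagesSent = true;
                }
            }
            updates.forEach(labels::put);                     // apply updates before the next round
        }
        return labels;                                        // no messages: DONE!
    }

    public static void main(String[] args) {
        // Assumed toy graph: component {1,2,3,4,5} and component {6,7,8}.
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, List.of(3, 4));     graph.put(2, List.of(4, 5));
        graph.put(3, List.of(1, 4));     graph.put(4, List.of(1, 2, 3));
        graph.put(5, List.of(2));        graph.put(6, List.of(7, 8));
        graph.put(7, List.of(6, 8));     graph.put(8, List.of(6, 7));
        System.out.println(run(graph));  // every vertex ends up labelled 1 or 6
    }
}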
Distributed Graph processing systems
• Examples of Load-Compute-Store systems: Pregel, GraphX (Spark), GraphLab, PowerGraph
• Same execution strategy, same problems:
• It's slow
• Too much re-computation ($, €) for nothing
• Real-world updates, anyone?
…and streaming came to make the real world fast and simple (not to mess everything up)
The Dataflow™:
• event records flow through the dataflow
• local state stays here
• local computation too
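As a small illustration of the dataflow idea (event records streaming through, with state and computation kept local), here is a minimal Flink DataStream sketch; the events and job name are made up, and exact signatures (e.g. open(Configuration)) vary across Flink versions.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Minimal sketch: events flow through the dataflow, each parallel task keeps
// local (keyed) state and performs its computation locally.
public class LocalStateCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("click", "click", "view", "click")            // event records
           .keyBy(new KeySelector<String, String>() {                  // partition by event type
               @Override public String getKey(String event) { return event; }
           })
           .flatMap(new RichFlatMapFunction<String, String>() {
               private transient ValueState<Long> count;               // local state stays here

               @Override public void open(Configuration parameters) {
                   count = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("count", Long.class));
               }

               @Override public void flatMap(String event, Collector<String> out) throws Exception {
                   long seen = (count.value() == null ? 0L : count.value()) + 1;  // local computation
                   count.update(seen);
                   out.collect(event + " seen " + seen + " times");
               }
           })
           .print();

        env.execute("local state example");
    }
}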
Streaming is so advanced that…
• sub-second latency and high throughput finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
…but what about complex problems?
* https://arxiv.org/abs/1506.08603
** http://dl.acm.org/citation.cfm?id=2824076
Can we make it happen?
• Problem: we can't keep an infinite graph (the whole universe) in memory and do complex stuff with it.
>it was never about the graph, silly; it was about answering complex questions, remember?
• Idea: keep a compact summary of the universe in memory and produce answers from that.
Examples of Summaries
• Spanners: distance estimation
• Sparsifiers: cut estimation
• Sketches: homomorphic properties
[Figure: graph → summary → algorithm → results R1, R2]
Distributed Graph streaming example: Connected Components on a stream of edge additions
[Figure: edges arrive one at a time; the summary keeps, for each vertex, the ID of its component, and merges two components whenever a new edge connects them]
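A minimal single-machine sketch of such a summary, assuming a plain union-find (disjoint-set) structure: each arriving edge merges the components of its endpoints, and only the summary, not the edges, is kept. The edge stream in main is made up for illustration.

import java.util.HashMap;
import java.util.Map;

// Connected-components summary over an edge stream: a disjoint-set keeps one
// representative per component; each edge addition merges at most two
// components. A distributed version would shard and merge such summaries.
public final class StreamingConnectedComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();

    private int find(int v) {
        parent.putIfAbsent(v, v);                            // unseen vertex = its own component
        int p = parent.get(v);
        if (p != v) {
            p = find(p);                                     // path compression
            parent.put(v, p);
        }
        return p;
    }

    public void addEdge(int u, int v) {                      // one edge addition from the stream
        int ru = find(u), rv = find(v);
        if (ru != rv) {
            parent.put(Math.max(ru, rv), Math.min(ru, rv));  // merge: keep the smaller ID
        }
    }

    public int componentOf(int v) { return find(v); }        // answered from the summary alone

    public static void main(String[] args) {
        StreamingConnectedComponents cc = new StreamingConnectedComponents();
        int[][] stream = {{3, 1}, {5, 2}, {4, 3}, {4, 1}, {7, 6}, {8, 7}, {8, 6}};
        for (int[] e : stream) cc.addEdge(e[0], e[1]);
        System.out.println(cc.componentOf(4));               // 1
        System.out.println(cc.componentOf(5));               // 2
        System.out.println(cc.componentOf(8));               // 6
    }
}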
But is this efficient? Sure, we can distribute the edges and the summaries. Any systems in mind?
Gelly-Stream: graph stream processing with Apache Flink
Gelly-Stream Overview
Gelly (on DataSet): ➤ Static Graphs ➤ Multi-Pass Algorithms ➤ Full Computations
Gelly-Stream (on DataStream): ➤ Dynamic Graphs ➤ Single-Pass Algorithms ➤ Approximate Computations
Both sit on Flink's distributed dataflow deployment.
Gelly-Stream Status
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms: Connected Components, Bipartiteness Check, Window Triangle Count, Triangle Count Estimation, Continuous Degree Aggregate
Wait, so now we can detect connected components right away? Solved! But how about our other issues now?
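A sketch of what a streaming connected-components job could look like with Gelly-Stream. Class names such as SimpleEdgeStream and the ConnectedComponents aggregation follow the gelly-streaming project, but the exact signatures here are assumptions; check the repository for the current API.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.streaming.SimpleEdgeStream;
import org.apache.flink.graph.streaming.library.ConnectedComponents;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.NullValue;

// Assumed Gelly-Stream API (see https://github.com/vasia/gelly-streaming).
public class StreamingCCJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // An edge-addition stream, e.g. "1,3" per line over a socket (source is made up).
        DataStream<Edge<Long, NullValue>> edges = env
            .socketTextStream("localhost", 9999)
            .map(new MapFunction<String, Edge<Long, NullValue>>() {
                @Override public Edge<Long, NullValue> map(String line) {
                    String[] f = line.split(",");
                    return new Edge<>(Long.parseLong(f[0]), Long.parseLong(f[1]), NullValue.getInstance());
                }
            });

        // Single-pass connected components: emit a merged component summary every 5000 ms.
        new SimpleEdgeStream<>(edges, env)
            .aggregate(new ConnectedComponents<Long, NullValue>(5000))
            .print();

        env.execute("Gelly-Stream connected components");
    }
}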
>Hej Siri_
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
>
Gelly-Stream to the rescue:

graphStream.filterVertices(DataScientists())
           .slice(Time.of(10, MINUTE), EdgeDirection.IN)
           .applyOnNeighbors(FindPairs())

Input check-ins: wendy checked_in glaze; steve checked_in glaze; tom checked_in joe’s_grill; sandra checked_in glaze; rafa checked_in joe’s_grill
[Figure: bipartite graph of people (wendy, steve, sandra, tom, rafa) and places (glaze, joe’s grill)]
Output pairs: {wendy, steve}, {steve, sandra}, {wendy, sandra}, {tom, rafa}
>Hej Siri_
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
> yes
The next step: large-scale, complex, fast, deep analytics
• Iterative model* on streams for deeper analytics
• More summaries
• Better out-of-core state integration
• Ad-hoc graph queries
* http://dl.acm.org/citation.cfm?id=2983551
Try out Gelly-Stream*, because all questions matter. @SenorCarbone
* https://github.com/vasia/gelly-streaming

Graph Stream Processing: spinning fast, large-scale, complex analytics