Graph Stream Processing: spinning fast, large-scale, complex analytics
Paris Carbone, PhD Candidate @ KTH, Committer @ Apache Flink
We want to analyse… large-scale, complex data, fast.
But why do we need large-scale, complex and fast data analysis? To answer big, complex questions, faster.
>Hej Siri_
Get me the best route to work right now… with the fewest human drivers.
Look up a pizza recipe all of my friends like but did not eat yesterday… or the day before yesterday. Oh! And no kebab pizza!
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
By 3000 AD, all of the above are just FIRST WORLD PROBLEMS.
By 30000 AD, they are FIRST EARTH WORLD PROBLEMS.
Still, fast analytics might save us some day…
• We can access patient movements, plus Facebook, Twitter and pretty much all social media interactions.
• Can we stop a pandemic?
• Or can we quickly predict where the virus will spread?
Now how do we analyse… large-scale, complex data, fast? With distributed, streaming, graph data:
• everything is a graph
• everything is many
• everything is a stream
it all started… as a first world problem, or rather, a first world question
but then things escalated quickly… machinery got cheaper, and we suddenly realised that we have big data
Thus, Distributed Graph Processing was born.

Map Reduce (DFS: distributed file system):
1. Store partitioned data
2. Send local computation (map)
3. Now shuffle it on disks
4. Merge the results (reduce)
5. Store the result back

Distributed Graph Processing:
1. Store updates to DFS
2. Load graph snapshot (in memory)
3. Compute a round (~superstep)
4. Store updates
5. …repeat
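To make the Map Reduce steps concrete, here is a toy, single-JVM imitation of the map, shuffle and reduce phases that computes vertex degrees from an edge list; the edge data and class name are made up for illustration, and a real job would read its input from and write its output to a DFS.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy, single-JVM imitation of the Map Reduce steps above: compute vertex
// degrees from an edge list. In a real job, the input, the shuffled
// intermediate data and the result would all live on a DFS.
public final class DegreeCount {
    public static void main(String[] args) {
        // 1. "Stored" partitioned data: a list of edges (src, dst).
        List<int[]> edges = List.of(new int[]{1, 3}, new int[]{1, 4}, new int[]{2, 4});

        Map<Integer, Long> degrees = edges.stream()
            // 2. map: each edge emits (src, 1) and (dst, 1)
            .flatMap(e -> Stream.of(new SimpleEntry<Integer, Long>(e[0], 1L),
                                    new SimpleEntry<Integer, Long>(e[1], 1L)))
            // 3 + 4. shuffle by key and reduce: sum the ones per vertex
            .collect(Collectors.groupingBy(Entry::getKey, Collectors.summingLong(Entry::getValue)));

        // 5. "Store" the result back (here: just print it), e.g. {1=2, 2=1, 3=1, 4=2}
        System.out.println(degrees);
    }
}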
Distributed Graph processing example
• We want to compute the Connected Components of a distributed graph.
• Basic computation element (map): the vertex
• Updates: messages to other vertices
[Figure: min-label propagation over a toy graph of vertices 1-8, which has two connected components]
ROUND 0: every vertex sends its own ID to its neighbours.
ROUND 1: each vertex adopts the minimum ID it has seen and propagates it; labels 1 and 6 start to spread.
ROUND 2: the minima keep spreading through their components.
ROUND 3: every vertex now carries label 1 or 6.
ROUND 4: No messages, DONE!
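A minimal, single-machine sketch of this vertex-centric computation in plain Java (not the API of Pregel, Giraph or Gelly); the exact toy graph on the slides is not fully recoverable, so the edges below are an assumption with the same two components.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of min-label propagation: every vertex starts with its own ID as its
// label and, in each round (superstep), adopts the smallest label among itself
// and its neighbours. When a round produces no updates, we are done.
public final class VertexCentricConnectedComponents {

    static Map<Integer, Integer> run(Map<Integer, List<Integer>> adjacency) {
        Map<Integer, Integer> labels = new HashMap<>();
        adjacency.keySet().forEach(v -> labels.put(v, v));   // ROUND 0: label = own ID

        boolean messagesSent = true;
        while (messagesSent) {                                // one loop iteration = one round
            messagesSent = false;
            Map<Integer, Integer> updates = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : adjacency.entrySet()) {
                int min = labels.get(e.getKey());
                for (int neighbour : e.getValue()) {          // "messages" = neighbours' labels
                    min = Math.min(min, labels.get(neighbour));
                }
                if (min < labels.get(e.getKey())) {           // improved label -> send an update
                    updates.put(e.getKey(), min);
                    messagesSent = true;
                }
            }
            updates.forEach(labels::put);                     // apply updates before the next round
        }
        return labels;                                        // no messages: DONE!
    }

    public static void main(String[] args) {
        // Assumed toy graph: component {1,2,3,4,5} and component {6,7,8}.
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, List.of(3, 4));     graph.put(2, List.of(4, 5));
        graph.put(3, List.of(1, 4));     graph.put(4, List.of(1, 2, 3));
        graph.put(5, List.of(2));        graph.put(6, List.of(7, 8));
        graph.put(7, List.of(6, 8));     graph.put(8, List.of(6, 7));
        System.out.println(run(graph));  // every vertex ends up labelled 1 or 6
    }
}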
Distributed Graph processing systems
• Examples of Load-Compute-Store systems: Pregel, GraphX (Spark), GraphLab, PowerGraph
• Same execution strategy, same problems:
• It's slow
• Too much re-computation ($, €) for nothing
• Real-world updates, anyone?
…and streaming came to make the real world fast and simple (not to mess everything up)
The Dataflow™:
• event records flow through the dataflow
• local state stays here
• local computation too
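As a small illustration of the dataflow idea (event records streaming through, with state and computation kept local), here is a minimal Flink DataStream sketch; the events and job name are made up, and exact signatures (e.g. open(Configuration)) vary across Flink versions.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Minimal sketch: events flow through the dataflow, each parallel task keeps
// local (keyed) state and performs its computation locally.
public class LocalStateCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("click", "click", "view", "click")            // event records
           .keyBy(new KeySelector<String, String>() {                  // partition by event type
               @Override public String getKey(String event) { return event; }
           })
           .flatMap(new RichFlatMapFunction<String, String>() {
               private transient ValueState<Long> count;               // local state stays here

               @Override public void open(Configuration parameters) {
                   count = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("count", Long.class));
               }

               @Override public void flatMap(String event, Collector<String> out) throws Exception {
                   long seen = (count.value() == null ? 0L : count.value()) + 1;  // local computation
                   count.update(seen);
                   out.collect(event + " seen " + seen + " times");
               }
           })
           .print();

        env.execute("local state example");
    }
}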
Streaming is so advanced that…
• sub-second latency and high throughput finally coexist
• it does fault tolerance without batch writes*
• late data** is handled gracefully
…but what about complex problems?
* https://arxiv.org/abs/1506.08603
** http://dl.acm.org/citation.cfm?id=2824076
Can we make it happen?
• Problem: we can't keep an infinite graph (the whole universe) in memory and do complex stuff with it.
>it was never about the graph, silly; it was about answering complex questions, remember?
• Idea: keep a compact summary of the universe in memory and produce answers from that.
Examples of Summaries
• Spanners: distance estimation
• Sparsifiers: cut estimation
• Sketches: homomorphic properties
[Figure: graph → summary → algorithm → results R1, R2]
Distributed Graph streaming example: Connected Components on a stream of edge additions
[Figure: edges arrive one at a time; the summary keeps, for each vertex, the ID of its component, and merges two components whenever a new edge connects them]
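A minimal single-machine sketch of such a summary, assuming a plain union-find (disjoint-set) structure: each arriving edge merges the components of its endpoints, and only the summary, not the edges, is kept. The edge stream in main is made up for illustration.

import java.util.HashMap;
import java.util.Map;

// Connected-components summary over an edge stream: a disjoint-set keeps one
// representative per component; each edge addition merges at most two
// components. A distributed version would shard and merge such summaries.
public final class StreamingConnectedComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();

    private int find(int v) {
        parent.putIfAbsent(v, v);                            // unseen vertex = its own component
        int p = parent.get(v);
        if (p != v) {
            p = find(p);                                     // path compression
            parent.put(v, p);
        }
        return p;
    }

    public void addEdge(int u, int v) {                      // one edge addition from the stream
        int ru = find(u), rv = find(v);
        if (ru != rv) {
            parent.put(Math.max(ru, rv), Math.min(ru, rv));  // merge: keep the smaller ID
        }
    }

    public int componentOf(int v) { return find(v); }        // answered from the summary alone

    public static void main(String[] args) {
        StreamingConnectedComponents cc = new StreamingConnectedComponents();
        int[][] stream = {{3, 1}, {5, 2}, {4, 3}, {4, 1}, {7, 6}, {8, 7}, {8, 6}};
        for (int[] e : stream) cc.addEdge(e[0], e[1]);
        System.out.println(cc.componentOf(4));               // 1
        System.out.println(cc.componentOf(5));               // 2
        System.out.println(cc.componentOf(8));               // 6
    }
}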
But is this efficient? Sure, we can distribute the edges and the summaries. Any systems in mind?
Gelly-Stream: graph stream processing with Apache Flink
Gelly-Stream Overview
Gelly (on DataSet): ➤ Static Graphs ➤ Multi-Pass Algorithms ➤ Full Computations
Gelly-Stream (on DataStream): ➤ Dynamic Graphs ➤ Single-Pass Algorithms ➤ Approximate Computations
Both sit on Flink's distributed dataflow deployment.
Gelly-Stream Status
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms: Connected Components, Bipartiteness Check, Window Triangle Count, Triangle Count Estimation, Continuous Degree Aggregate
Wait, so now we can detect connected components right away? Solved! But how about our other issues now?
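A sketch of what a streaming connected-components job could look like with Gelly-Stream. Class names such as SimpleEdgeStream and the ConnectedComponents aggregation follow the gelly-streaming project, but the exact signatures here are assumptions; check the repository for the current API.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.streaming.SimpleEdgeStream;
import org.apache.flink.graph.streaming.library.ConnectedComponents;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.NullValue;

// Assumed Gelly-Stream API (see https://github.com/vasia/gelly-streaming).
public class StreamingCCJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // An edge-addition stream, e.g. "1,3" per line over a socket (source is made up).
        DataStream<Edge<Long, NullValue>> edges = env
            .socketTextStream("localhost", 9999)
            .map(new MapFunction<String, Edge<Long, NullValue>>() {
                @Override public Edge<Long, NullValue> map(String line) {
                    String[] f = line.split(",");
                    return new Edge<>(Long.parseLong(f[0]), Long.parseLong(f[1]), NullValue.getInstance());
                }
            });

        // Single-pass connected components: emit a merged component summary every 5000 ms.
        new SimpleEdgeStream<>(edges, env)
            .aggregate(new ConnectedComponents<Long, NullValue>(5000))
            .print();

        env.execute("Gelly-Stream connected components");
    }
}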
>Hej Siri_
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
>
Gelly-Stream to the rescue:

graphStream.filterVertices(DataScientists())
           .slice(Time.of(10, MINUTE), EdgeDirection.IN)
           .applyOnNeighbors(FindPairs())

Input check-ins: wendy checked_in glaze; steve checked_in glaze; tom checked_in joe’s_grill; sandra checked_in glaze; rafa checked_in joe’s_grill
[Figure: bipartite graph of people (wendy, steve, sandra, tom, rafa) and places (glaze, joe’s grill)]
Output pairs: {wendy, steve}, {steve, sandra}, {wendy, sandra}, {tom, rafa}
>Hej Siri_
Siri, is it possible to re-unite all data scientists in the world? No matter if they use Spark or Flink or just IPython.
> yes
The next step: large-scale, complex, fast, deep analytics
• Iterative model* on streams for deeper analytics
• More summaries
• Better out-of-core state integration
• Ad-hoc graph queries
* http://dl.acm.org/citation.cfm?id=2983551
Try out Gelly-Stream*, because all questions matter. @SenorCarbone
* https://github.com/vasia/gelly-streaming

Graph Stream Processing: spinning fast, large-scale, complex analytics