Batch and Stream Graph Processing with Apache Flink
The document discusses batch and stream graph processing using Apache Flink, an open-source framework tailored for real-time data streaming and analysis. It highlights Flink's capabilities in handling both streaming and batch data through different APIs, introduces its graph processing library Gelly, and explains iterative graph processing and single-pass stream graph processing. Additionally, it covers key concepts like windowing, stateful operations, and challenges in maintaining graph structures during stream processing.
Why Stream Processing? •Most problems have streaming nature • Stream processing gives lower latency • Data volumes more easily tamed 4 Event stream
5.
Batch and Streaming Pipelinedand blocking operators Streaming Dataflow Runtime Batch Parameters DataSet DataStream Relational Optimizer Window Optimization Pipelined and windowed operators Schedule lazily Schedule eagerly Recompute whole operators Periodic checkpoints Streaming data movement Stateful operations DAG recovery Fully buffered streams DAG resource management Streaming Parameters
6.
Flink APIs 6 case classWord (word: String, frequency: Int) val lines: DataStream[String] = env.readFromKafka(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .keyBy("word”).timeWindow(Time.of(5,SECONDS)).sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ").map(word => Word(word,1))} .groupBy("word").sum("frequency”) .print() DataSet API (batch): DataStream API (streaming):
7.
Working with Windows 7 Whywindows? We are often interested in fresh data!15 38 65 88 110 120 #sec 40 80 SUM #2 0 SUM #1 20 60 100 120 15 38 65 88 1) Tumbling windows myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS)); #sec 40 80 SUM #3 SUM #2 0 SUM #1 20 60 100 15 38 38 65 65 88 myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS)); 2) Sliding windows window buckets/panes
8.
Working with Windows 7 Whywindows? We are often interested in fresh data! Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! 15 38 65 88 110 120 #sec 40 80 SUM #2 0 SUM #1 20 60 100 120 15 38 65 88 1) Tumbling windows myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS)); #sec 40 80 SUM #3 SUM #2 0 SUM #1 20 60 100 15 38 38 65 65 88 myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS)); 2) Sliding windows window buckets/panes
Meet Gelly • Java& Scala Graph APIs on top of Flink • graph transformations and utilities • iterative graph processing • library of graph algorithms • Can be seamlessly mixed with the DataSet Flink API to easily implement applications that use both record-based and graph-based analysis 10
12.
Hello, Gelly! ExecutionEnvironment env= ExecutionEnvironment.getExecutionEnvironment(); DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env); Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env); DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run( new ConnectedComponents(maxIterations)); val env = ExecutionEnvironment.getExecutionEnvironment val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env) val graph = Graph.fromDataSet(edges, env) val components = graph.run(new ConnectedComponents(maxIterations)) Java Scala 11
Iterative Graph Processing •Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations • Based on the BSP, vertex-centric model • scatter-gather • gather-sum-apply • vertex-centric (pregel)* • partition-centric* 14
Gather-Sum-Apply Iterations • Gather:compute one value per edge • Sum: combine the partial values of Gather to a single value • Apply: update the vertex value, based on the Sum and the current value 16 Gather ApplySum
18.
Library of Algorithms •PageRank* • Single Source Shortest Paths* • Label Propagation • Weakly Connected Components* • Community Detection • Triangle Count & Enumeration • Graph Summarization • val ranks = inputGraph.run(new PageRank(0.85, 20)) • *: both scatter-gather and GSA implementations 17
Batch vs. StreamGraph Processing 22 Batch Stream Input Graph static dynamic Analysis on a snapshot continuous Response after job completion immediately
35.
Graph Streaming Challenges •Maintain the graph structure • How to apply state updates efficiently? • Result updates • Re-run the analysis for each event? • Design an incremental algorithm? • Run separate instances on multiple snapshots? • Computation on most recent events only 23
36.
Single-Pass Graph Streaming •Each event is an edge addition • Maintain only a graph summary • Recent events are grouped in graph windows 24
Batch Connected Components •State: the graph and a component ID per vertex (initially equal to vertex ID) • Iterative Computation: For each vertex: • choose the min of neighbors’ component IDs and own component ID as new ID • if component ID changed since last iteration, notify neighbors 27
Stream Connected Components •State: a disjoint set data structure for the components • Computation: For each edge • if seen for the 1st time, create a component with ID the min of the vertex IDs • if in different components, merge them and update the component ID to the min of the component IDs • if only one of the endpoints belongs to a component, add the other one to the same component 32
Introducing Gelly-Stream 46 Gelly-Stream enrichesthe DataStream API with two new additional ADTs: • GraphStream: • A representation of a data stream of edges. • Edges can have state (e.g. weights). • Supports property streams, transformations and aggregations. • GraphWindow: • A “time-slice” of a graph stream. • It enables neighborhood aggregations
Graph Stream Aggregations 48 result aggregate propertystreamgraph stream (window) fold combine fold reduce local summaries global summary edges agg global aggregates can be persistent or transient graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))
61.
Graph Stream Aggregations 49 result aggregate propertystream graph stream (window) fold combine transform fold reduce map local summaries global summary edges agg graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))