GRAPH PROCESSING
Why Graph Processing? Graphs are everywhere!
Why Graph Processing?
Why Distributed Graph Processing? They are getting bigger!
Road Scale >24 million vertices >58 million edges *Route Planning in Road Networks - 2008
Social Scale: Facebook >1 billion vertices, ~1 trillion edges (*Facebook Engineering Blog); Twitter ~41 million vertices, >1.4 billion edges (*Twitter Graph - 2010)
Web Scale >50 billion vertices >1 trillion edges *NSA Big Graph Experiment- 2013
Brain Scale >100 billion vertices >100 trillion edges *NSA Big Graph Experiment- 2013
CHALLENGES IN PARALLEL GRAPH PROCESSING Lumsdaine, Andrew, et al. "Challenges in parallel graph processing." Parallel Processing Letters 17.01 (2007)
Challenges
1. Structure-driven computation → data transfer issues
2. Irregular structure → partitioning issues
*Concept borrowed from Cristina Abad's PhD defense slides
Overcoming the challenges 1 Extend Existing Paradigms 2 BUILD NEW FRAMEWORKS!
Build New Graph Frameworks! Key Requirements from Graph Processing Frameworks
1. Less pre-processing
2. Low and load-balanced computation
3. Low and load-balanced communication
4. Low memory footprint
5. Scalable with cluster size and graph size
PREGEL Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." ACM SIGMOD - 2010
Life of a Vertex Program
[Timeline: placement of vertices, then per-iteration computation and communication phases, with a barrier after each phase]
*Concept borrowed from LFGraph slides
[Figure] Sample graph with vertices A-E *Graph borrowed from LFGraph paper
Shortest Path Example (source: B, unit edge weights)
Iteration 0: B = 0; A, C, D, E = ∞
Iteration 1: A = 1 (message from B); C, D, E still ∞
Iteration 2: D = 2, E = 2 (messages from A)
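To make the superstep mechanics concrete, here is a minimal sketch of a Pregel-style vertex program for this example, assuming unit edge weights. The vertex API used here (value, is_source, out_edges, send_message, vote_to_halt) is a simplification for illustration, not Pregel's actual C++ interface.

```python
import sys

def compute(vertex, messages, superstep):
    # Superstep 0: the source starts at distance 0, everyone else at "infinity".
    if superstep == 0:
        vertex.value = 0 if vertex.is_source else sys.maxsize
    else:
        # Later supersteps: only the smallest incoming distance matters.
        best = min(messages, default=sys.maxsize)
        if best >= vertex.value:
            vertex.vote_to_halt()   # no improvement: go inactive
            return
        vertex.value = best
    # The value changed (or was just initialized): advertise
    # value + 1 (unit edge weights) along all outgoing edges.
    if vertex.value < sys.maxsize:
        for neighbor in vertex.out_edges:
            vertex.send_message(neighbor, vertex.value + 1)
    vertex.vote_to_halt()
```

A vertex is woken again only when a message arrives, so the computation terminates once no values change, as in the example above.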
Can we do better? Pregel vs. the goals:
Computation: 1 pass
Communication: ∝ #edge cuts
Pre-processing: cheap (hash)
Memory: high (out edges + buffered messages)
LFGRAPH – YES, WE CAN! Hoque, Imranul, and Indranil Gupta. "LFGraph: Simple and Fast Distributed Graph Analytics." ACM TRIOS - 2013
Feature: cheap hash-based vertex placement → low graph initialization time
Feature: publish-subscribe, fetch-once information flow → low communication overhead
Subscribe: Server 2 subscribes to vertex A, whose value its vertices need
Publish: publish list of Server 1: (Server 2, A)
LFGraph model: after each iteration, Server 1 ships the value of A to Server 2 once
Feature: stores only in-neighbor vertices → reduced memory footprint
In-neighbor storage: a local in-neighbor's value is read directly; a remote in-neighbor's value is read from the locally available copy (see the sketch below)
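A minimal sketch of the publish-list construction and the two read paths. The names (server_of, in_neighbors, the value stores) are our assumptions, not LFGraph's API; the point is that a remote vertex's value crosses the network once per iteration per subscribing server, however many local vertices read it.

```python
from collections import defaultdict

def build_publish_lists(vertices, in_neighbors, server_of):
    # publish[s][t] = vertices on server s whose values server t needs.
    publish = defaultdict(lambda: defaultdict(set))
    for v in vertices:
        for u in in_neighbors[v]:
            if server_of(u) != server_of(v):
                # server_of(v) subscribes to u; u's server will publish
                # u's value to server_of(v) once per iteration.
                publish[server_of(u)][server_of(v)].add(u)
    return publish

def read_in_neighbor(u, my_server, server_of, local_values, remote_values):
    # Local in-neighbor: read its value directly.
    # Remote in-neighbor: read the locally cached copy fetched last iteration.
    return local_values[u] if server_of(u) == my_server else remote_values[u]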
Iteration 0: B = 0; A, C, D, E = ∞
Iteration 1: A = 1 (value change in the duplicate store); the value of A is then shipped to Server 2
Iteration 2: D and E read A locally and become 2
Feature: single-pass computation → low computation overhead
Life of a Vertex Program
[Timeline: placement of vertices, then per-iteration computation and communication phases, with a barrier after each phase; LFGraph decouples the two phases]
*Concept borrowed from LFGraph slides
How Everything Works
GRAPHLAB Low, Yucheng, et al. "GraphLab: A New Framework for Parallel Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI) - 2010
[Figure] GraphLab model: ghost copies of remote neighbors (D and E on Server 1, A on Server 2) keep all needed values local
Can we do better? GraphLab vs. the goals:
Computation: 2 passes
Communication: ∝ #vertex ghosts
Pre-processing: cheap (hash)
Memory: high (in & out edges + ghost values)
POWERGRAPH Gonzalez, Joseph E., et al. "Powergraph: Distributed graph-parallel computation on natural graphs." USENIX OSDI - 2012.
[Figure] PowerGraph model: vertex A is split into mirrors A1 and A2 across servers
Can we do better? PowerGraph vs. the goals:
Computation: 2 passes
Communication: ∝ #vertex mirrors
Pre-processing: expensive (intelligent)
Memory: high (in & out edges + mirror values)
Communication Analysis
Pregel: external edge cuts
GraphLab: ghost vertices (in- and out-neighbors)
PowerGraph: mirrors (in- and out-neighbors)
LFGraph: external in-neighbors
Computation Balance Analysis
• Ideal power-law graphs have substantial load imbalance under hash partitioning.
• A power-law graph has vertices of degree d with probability proportional to d^(-α).
• Lower α means a denser graph with more high-degree vertices (see the sketch below).
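One quick way to see the imbalance claim concretely: sample an idealized power-law degree sequence and hash vertices to servers, then compare per-server in-edge load (a proxy for worker runtime, as the analysis assumes). This is our illustrative construction with arbitrary parameters, not the paper's experiment.

```python
import random
from collections import Counter

def powerlaw_degrees(n, alpha, dmax):
    # P(degree = d) proportional to d^(-alpha), for d in [1, dmax]
    ds = range(1, dmax + 1)
    return random.choices(ds, weights=[d ** -alpha for d in ds], k=n)

def imbalance(degrees, servers=8):
    # Hash-partition vertices; load = total in-edges per server.
    load = Counter()
    for v, d in enumerate(degrees):
        load[hash(v) % servers] += d
    return max(load.values()) / (sum(load.values()) / servers)

for alpha in (1.8, 2.2):  # lower alpha: denser graph, heavier tail
    print(alpha, round(imbalance(powerlaw_degrees(100_000, alpha, 10_000)), 2))
```

The max-to-mean load ratio reflects how much the slowest worker lags the average, which matters because barriers make each iteration only as fast as its slowest worker.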
Computation Balance Analysis
Computation Balance Analysis
Real World vs Power Law
Communication Balance Analysis
PageRank – Runtime w/o partition
PageRank – Runtime with partition
PageRank – Memory footprint
PageRank – Network Communication
Scalability
X-Stream: Edge-centric Graph Processing using Streaming Partitions *Some figures adapted from the authors' presentation
Motivation • Can sequential access be used instead of random access?! • Can large graph processing be done on a single machine?!
Sequential Access: Key to Performance!
Speedup of sequential access over random access in different media
(test bed: 64 GB RAM + 200 GB SSD + 3 TB magnetic disk)
Medium        | Read: Random / Sequential / Speedup (MB/s) | Write: Random / Sequential / Speedup (MB/s)
RAM (1 core)  | 567 / 2605 / 4.6x                          | 1057 / 2248 / 2.2x
RAM (16 core) | 14198 / 25658 / 1.9x                       | 10044 / 13384 / 1.4x
SSD           | 22.5 / 667.69 / 29.7x                      | 48.6 / 576.5 / 11.9x
Magnetic disk | 0.6 / 328 / 546.7x                         | 2 / 316.3 / 158.2x
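The flavor of these numbers can be reproduced with a simple microbenchmark. This is our sketch, not the paper's harness: it assumes a pre-created scratch file and ignores page-cache effects, so absolute numbers will differ (for disk media the cache must be cold or dropped).

```python
import os
import random
import time

PATH = "scratch.bin"   # hypothetical pre-created test file, e.g. 1 GB
BLOCK = 4096           # 4 KB reads

def bench(offsets):
    # Time reads at the given offsets; return throughput in MB/s.
    total = 0
    with open(PATH, "rb") as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total / (time.perf_counter() - start) / 2**20

n_blocks = os.path.getsize(PATH) // BLOCK
seq = bench(i * BLOCK for i in range(n_blocks))      # sequential order
offsets = [i * BLOCK for i in range(n_blocks)]
random.shuffle(offsets)
rnd = bench(offsets)                                  # random order
print(f"sequential {seq:.0f} MB/s, random {rnd:.0f} MB/s, "
      f"speedup {seq / rnd:.1f}x")
```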
How to Use Sequential Access? Sequential access → edge-centric processing
Vertex-Centric Scatter
for each vertex v:
    if v's state has updated:
        for each output edge e of v:
            scatter update on e

Vertex-Centric Gather
for each vertex v:
    for each input edge e of v:
        if e has an update:
            apply update to v's state
[Figure] Vertex-centric BFS on a sample 8-vertex graph with vertex state array V[1..8]: the edge list (1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6) must be sorted and indexed by source so each vertex can find its out-edges, i.e., random access into the edge list
Edge-Centric Scatter
for each edge e:
    if e.source has updated:
        scatter update on e

Edge-Centric Gather
for each update u on edge e:
    apply update u to e.destination
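Putting the two loops together, here is a runnable in-memory sketch of edge-centric BFS on the slide's 8-vertex graph (0-indexed here). It is our simplification: a single partition and no disk streaming, so it shows the access pattern rather than X-Stream's storage engine. Note that every scatter pass streams the whole edge list sequentially, which is exactly where the wasted reads on low-activity iterations come from.

```python
INF = float("inf")

def edge_centric_bfs(n_vertices, edges, source):
    state = [INF] * n_vertices
    state[source] = 0
    updated = {source}
    while updated:
        # Scatter: stream ALL edges sequentially, emitting updates only
        # for edges whose source changed last round (the rest are waste).
        updates = [(dst, state[src] + 1)
                   for src, dst in edges if src in updated]
        # Gather: apply each update to its destination vertex.
        updated = set()
        for dst, dist in updates:
            if dist < state[dst]:
                state[dst] = dist
                updated.add(dst)
    return state

# The slide's graph, 0-indexed (slide vertex 1 is index 0).
edges = [(0, 2), (0, 4), (1, 6), (1, 3), (2, 1), (2, 7), (3, 2), (3, 6),
         (3, 7), (4, 5), (5, 0), (7, 4), (7, 5)]
print(edge_centric_bfs(8, edges, source=0))
```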
Sequential Access via Edge-Centric! [Figure] Vertices kept in fast storage; edges and updates streamed from slow storage
Fast and Slow Storage
BFS, Edge-Centric: [Figure, same graph and edge list as before] every iteration streams the entire edge list, so there are lots of wasted reads. Most real-world graphs have small diameter; a large diameter makes X-Stream slow and wasteful
[Figure] A shuffled copy of the edge list is equivalent to the original: order is not important, so no pre-processing (sorting and indexing) is needed!
But, still … • Random access for vertices • Vertices may not fit into fast storage
Streaming Partitions
V = subset of vertices (subsets are mutually disjoint)
E = outgoing edges of V (constant set)
U = incoming updates to V (changes in each scatter phase)
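A sketch of how the three structures might be set up, following the definitions above. The modulo assignment and the names are our assumptions (X-Stream's actual vertex-to-partition assignment may differ); the invariants are that each partition's vertex set fits in fast storage, its edge file is written once, and its update file is refilled by every shuffle.

```python
def make_streaming_partitions(n_vertices, edges, k):
    part_of = lambda v: v % k       # cheap static assignment (assumed)
    V = [set() for _ in range(k)]   # disjoint vertex subsets
    E = [[] for _ in range(k)]      # out-edges of each V: constant set
    for v in range(n_vertices):
        V[part_of(v)].add(v)
    for src, dst in edges:
        # An edge lives in the partition of its SOURCE vertex, so the
        # scatter phase can read source values from the in-memory V.
        E[part_of(src)].append((src, dst))
    # U (incoming updates per partition) starts empty and is refilled
    # by the shuffle step of every scatter phase.
    U = [[] for _ in range(k)]
    return V, E, U
```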
Scatter and Shuffle: [Figure] for each partition Vi/Ei/Ui, load the vertex set into fast memory, stream the edge file through an input buffer, read each edge's source value, append updates to an output buffer, and shuffle the updates into the destination partitions' update files
Shuffle Stream Buffer with k partitions
Gather: [Figure] with the partition's vertex set in fast memory, stream its update file through an update buffer and apply each update to the destination vertex; no output is produced
Parallelism
• State is stored in vertices
• Partitions have disjoint vertex sets
→ Partitions can be computed in parallel (parallel scatter and gather)
Experimental Results
[Chart] X-Stream speedup over GraphChi (Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, RMAT27/WCC): mean speedup = 2.3, excluding GraphChi's pre-processing time
[Chart] X-Stream speedup over GraphChi on the same benchmarks: mean speedup = 3.7 when GraphChi's pre-processing time is included
[Chart] X-Stream total runtime vs. GraphChi's sharding time alone (seconds)
Disk Transfer Rates (PageRank on the Twitter workload)
Metric        | X-Stream    | GraphChi
Data moved    | 224 GB      | 322 GB
Time taken    | 398 seconds | 2613 seconds
Transfer rate | 578 MB/s    | 126 MB/s
SSD sustained: reads = 667 MB/s, writes = 576 MB/s
[Chart] Scalability with input data size (time in HH:MM:SS on RAM, SSD, and disk): 8M vertices / 128M edges in 8 s; 256M vertices / 4B edges in 33 min; 4B vertices / 64B edges in 26 h
Discussion
• Features like global values, aggregation functions, and asynchronous computation are missing from LFGraph. Will the overhead of adding them slow it down?
• LFGraph assumes all edge values are the same. If they are not, either the receiving vertices or the server must incorporate those values. What are the overheads?
• LFGraph uses one-pass computation, but it executes the vertex program at every vertex, active or inactive. What is the trade-off?
Discussion
• Independent computation and communication rounds may not always be preferable: why not use bandwidth whenever it is available?
• Fault tolerance is another feature missing from LFGraph. What would it cost?
• Only three benchmarks are used in the experiments. Is that enough evaluation?
• The scalability comparison with Pregel uses different experimental settings, and the memory comparison with PowerGraph is based on heap values from logs. Are these fair experiments?
Discussion
• Could the system become asynchronous?
• Could the scatter and gather phases be combined into one?
• X-Stream does not support iterating over the edges/updates of a single vertex. Can this be added?
• How well do they determine the number of partitions?
• Can the shuffle be optimized by counting each partition's updates during scatter?
Thank you for listening! Questions?
Backup Slides
Reason for Improvement
Qualitative Comparison
Goal           | Pregel                               | GraphLab                             | PowerGraph                            | LFGraph
Computation    | 2 passes, combiners                  | 2 passes                             | 2 passes                              | 1 pass
Communication  | ∝ #edge cuts                         | ∝ #vertex ghosts                     | ∝ #vertex mirrors                     | ∝ #external in-neighbors
Pre-processing | Cheap (hash)                         | Cheap (hash)                         | Expensive (intelligent)               | Cheap (hash)
Memory         | High (out edges + buffered messages) | High (in & out edges + ghost values) | High (in & out edges + mirror values) | Low (in edges + remote values)
Backup Slides
[Chart] Read bandwidth on SSD (MB/s) over a 5-minute window: X-Stream vs. GraphChi
[Chart] Write bandwidth on SSD (MB/s) over a 5-minute window: X-Stream vs. GraphChi
Scalability on Thread Count
Scalability on Number of I/O Devices
[Chart] GraphChi runtime breakdown per benchmark: fraction of runtime spent in compute + I/O vs. re-sorting shards
X-Stream not Always Perfect
Large Diameter Makes X-Stream Slow!
[Chart] In-memory X-Stream performance: BFS (32M vertices / 256M edges) runtime in seconds (lower is better) vs. thread count (1-16), compared with BFS-1 [HPC 2010] and BFS-2 [PACT 2011]
Ligra vs. X-Stream
Discussion
• The current implementation runs on a single machine. Can it be extended to clusters?
  – Would it still perform well?
  – How would it provide fault tolerance and synchronization?
• The waste rate is high (~65%). Could this be improved?
• Can the partitioning be more intelligent? Dynamic partitioning?
• Could all vertex-centric programs be converted to edge-centric ones?
• When does streaming outperform random access?


Editor's Notes

  • #2 Hi, I am Mayank and the second presenter for today is Shadi. We will be talking about Graph Processing.
  • #3 Let’s start from the very basic question. Why do we need graph processing? The answer is simple and I am sure all of you already know. Graphs are everywhere! We have the web, indexed web used by search engines, social networks-Facebook, Twitter, road networks and biological networks like brain networks, spread of disease networks, graphs representing relationship between genomic sequences and so on.
  • #5 Another simple question is why do we need distributed graph processing? Because these graphs are getting bigger! Let me give you an idea of the scale that we are talking about!
  • #6 Starting from the scale of road networks About 24 million vertices and 58 million edges in the US road network
  • #7 The social scale is bigger with about a billion vertices and nearly trillion edges
  • #8 The web scale is even bigger with an estimated 50 billion vertices
  • #9 However, the biggest scale is the brain network scale with over 100 billion vertices and 100 trillion edges. Moving on to the challenges that arise in parallel graph processing when dealing with such large scale graphs.
  • #10 Now, MapReduce is a very popular framework for batch processing of data. A natural question came up: why not simply use MapReduce for parallel graph processing? The framework would handle distribution, fault tolerance, speculative execution, and everything would be perfect. But there are some unique properties of graphs that make using MapReduce harder, or let's say expensive, and that paradigm is not a perfect match for graph processing. I will be talking briefly about this paper published in Parallel Processing Letters highlighting both software and hardware challenges for parallel graph processing. Remember, I am giving a brief history of how graph processing has evolved over the years. I will discuss only the software challenges here. Some of them also highlight why MapReduce is not a perfect fit for graph processing.
  • #11 The first is that graph computations are data-driven, in the sense that they incorporate the graph structure into the computation. This gives rise to two issues: It is not clear how the graph should be partitioned across the machines (remember, with MapReduce, partitioning files is easy). And since graph processing involves computation over the graph structure, these frameworks incur a high overhead in the resultant data transfer. If it is not clear how data transfer comes into the picture with graphs, it will become clear once I move on to how graphs are actually processed. The next challenge is that graph data is irregular. Real-world graphs like social graphs are power-law-like graphs (the reason why I say power-law-like and not power-law will also become clear in this presentation). This means that some vertices have high degrees while others have low degrees. This irregularity gives rise to two concerns: It is not clear how partitioning should be performed; some vertices require more processing while others require less, and how do you incorporate this into MapReduce? And the scalability of a MapReduce cluster will take a hit due to the difficulty of achieving good load balancing. The next challenge is achieving locality in graph processing. Since the data is irregular, it is difficult to predict the access patterns of vertices: which vertex will be processed when, which vertex will require more data to be fetched over the network, and how do you take these issues into account in MapReduce?
  • #12 Now, what possible solutions did the community come up with. The solutions proposed can be classified into two broad categories: Extend MapReduce and make changes to solve these challenges Come up with novel paradigms and frameworks. This talk is mainly about the second solution but let me give you just a brief example of how the community tried to shoehorn graph processing in MapReduce.
  • #13 Moving on to the second approach i.e. coming up with paradigms/frameworks that handle graph processing by themselves. Paradigms designed just for graph processing. Before I discuss graph processing frameworks, what are the key requirements from these frameworks.
  • #14 First Depends on number and distribution of vertices and edges across servers Second depends on quantity and distribution of data exchanged among vertices across servers Third for quick initialization of the graph
  • #15 This started with Google coming up with a vertex centric form of computation called Pregel. The idea is simple – you have to think like a vertex.
  • #16 Think like a vertex. If you visualize how graph processing works, there is some computation done at each vertex, which passes the results along its edges. I will explain this with an example. Let us first look at the life of a vertex-centric program. The entire computation is broken into iterations/supersteps. In each iteration there is a computation phase and a communication phase: during the computation phase the vertices perform computation, and during the communication phase they transfer the results to their neighbors. More concretely, in each superstep a vertex receives the messages sent to it in superstep S-1, performs some computation in a user-defined function compute(), and sends messages to its neighbors or to any other vertex with a known identifier (these messages are received in superstep S+1). The core idea is that the programmer only has to worry about what each vertex does, and the framework does the rest: it ensures that the vertex programs are executed in parallel and the messages are delivered to the right vertices. Fault tolerance is also handled by the framework, as are things like speculative execution of a vertex program and the overlapping of computation and communication.
  • #17 Let us take the example of single-source shortest paths to make this clear. This is a sample graph with five vertices; all edge weights are the same and equal to 1. Each vertex has an id and a value. The value denotes the shortest-path distance to the source. In superstep 0, no messages are received; the value is initialized to 0 (if the vertex is the source) or MAX (otherwise) and advertised to the neighbors. In the other supersteps, the vertex determines the minimum over all the messages received. If this minimum is smaller than the value the vertex currently holds, the value is updated and advertised to the neighbors.
  • #18 The aim is to compute the distance from a single source – in this case B. Initially, all vertices have this distance as infinity.
  • #19 In iteration zero, each vertex checks if it is the source. If it is, it makes its value 0. Otherwise, the value remains infinity. If the value of a vertex changes, it forwards the value plus the edge value along all outgoing edges. In this case, B sends a message with value 1 towards A.
  • #20 In the next iteration, A simply takes the minimum of values received in the messages (minimum because only the minimum value matters). If the minimum is smaller than the value that the vertex already had, it makes its value as that smaller value. In this case, A makes its value 1. Since, there was a change in its value. It propagates the change to its neighbours. If you all haven’t already guessed – this is basically breadth first search happening iteratively.
  • #21 This continues till no more values change.
  • #22 What was the problem with this simple yet powerful paradigm? The main problem is that vertices communicate with each other. Hence, A sends its values to both D and E even though they are on the same server. Pregel provides an abstraction called combiners where all messages being sent to a particular vertex from the same server are combined. For example, in the single source shortest paths case, the vertex does not need to receive all messages, it just needs to receive the message which has the smallest value. Combiners do this. However, they need an additional pass over messages.
  • #23 So came LFGraph with the aim of minimizing the three requirements memory, computation and communication.
  • #24 Next feature that I have already touched upon is the cheap hash based initial vertex placement. This reduces the time required for the initialization of the graph.
  • #25 Moving on to the four main features of LFGraph that help it achieve superior performance compared to the existing frameworks. The first is the fetch-once behavior achieved using the publish-subscribe lists: Server 2 needs to fetch the value of vertex A only once, unlike in Pregel, where each vertex receives the value in a message.
  • #26 The first step is partitioning. LFGraph chooses a cheap partitioning strategy, unlike PowerGraph and GraphLab, which perform expensive initial vertex partitioning. We will see the benefits of this cheap initial partitioning in the experiments section. The cheap partitioning used is simple hashing. Next, servers subscribe to vertices: Server 2 realizes that its vertices need the value of vertex A, so it communicates this to Server 1.
  • #27 Based on this communication, each server builds a publish list. For each server, values of which vertices need to be sent to that server. This is maintained for each server and at each server.
  • #28 Next, the vertex centric computations are performed on each server. And, at the end of the iteration, each server simply looks at the publish list and sends the values of vertices that each server needs. I did not discuss one important aspect of computation. When computation is performed, the values of vertices may change. This changed value is stored in one place while the original values are kept in another place. These two copies are required because a vertex may have edge to another vertex on the same server. For example, here the value of A may change during computation – so the vertices B and C should read the value that existed before the iteration started. The other way to do this is to lock the vertices. Locking is much more expensive than maintaining two copies. LFGraph opts for computation savings in this case at the cost of extra memory overhead. However, the authors show that the space overhead of keeping these values is not much.
  • #29 LFGraph needs to store only the in-neighbors. Out neighbors are not required since each vertex that has in neighbors on other servers uses the publish subscribe list to inform the server.
  • #30 The first step is partitioning. LFGraph chooses a cheap partitioning strategy, unlike PowerGraph and GraphLab, which perform expensive initial vertex partitioning. We will see the benefits of this cheap initial partitioning in the experiments section. The cheap partitioning used is simple hashing. Next, servers subscribe to vertices: Server 2 realizes that its vertices need the value of vertex A, so it communicates this to Server 1.
  • #31 In iteration zero, each vertex checks if it is the source. If it is, it makes its value 0. Otherwise, the value remains infinity. If the value of a vertex changes, it forwards the value plus the edge value along all outgoing edges. In this case, B sends a message with value 1 towards A.
  • #32 In the next iteration, A simply takes the minimum of values received in the messages (minimum because only the minimum value matters). If the minimum is smaller than the value that the vertex already had, it makes its value as that smaller value. In this case, A makes its value 1. Since, there was a change in its value. It propagates the change to its neighbours. If you all haven’t already guessed – this is basically breadth first search happening iteratively.
  • #33 This continues till no more values change.
  • #34 The third feature is the single pass computation. The framework just makes one pass over the vertices during the computation phase. An additional pass is obviously required in the communication phase. I hope I am clear about passes in computation. In each computation phase, the framework needs to go through the list of vertices and execute the vertex centric program once for each vertex.
  • #35 Think like a vertex Unlike Pregel, decouples computation and communication – independent growth of each other. If you visualize how graph processing works – there is some computation done at the vertex which passes along the computation results along the edges. I will explain this with an example. Let us first look at the life of a vertex centric program. The entire computation is broken down into iterations/supersteps. In each iteration, there is a computation phase and a communication phase. During the computation phase, the vertices perform computation and during the communication phase, they transfer the results of the computation to their neighbors. More concretely, In each superstep, a vertex receives messages sent to the vertex in superstep S-1, performs some computation in a user defined function compute() and sends messages to either its neighbors or any other vertex with a known identifier (these messages are received in superstep S+1) But the core idea is that the programmer should just have to worry about what each vertex does and the framework does the rest itself. The framework will ensure that these vertex programs are executed parallely and the messages are delivered to the right vertices. Fault tolerance is also handled by the framework, things like speculative execution of a vertex program are handled.
  • #36 Explain this?
  • #37 Front-end servers store the vertex program and the configuration file (graph data, number of servers, IPs, ports, number of computation and communication job servers, etc.) and send both to all the servers. Graph loaders load the graph in parallel across servers and store it in the storage engine. Each server then enters the barrier; the barrier waits for all servers to enter and then signals them to continue. The job server spawns multiple workers, each computing on a part of the graph (a shard), and enters the barrier when all worker threads complete. After the signal from the barrier server, the communication phase proceeds just like the computation phase. And repeat.
  • #39 Next came GraphLab. The idea here is to create ghost vertices, so that there is zero communication within an iteration. Ghost vertices are created for vertices that have an edge from a vertex on another server. In this example, A has edges to D and E, so ghost vertices D and E are created on Server 1. Similarly, a ghost vertex is created for vertex A on Server 2. All the values that the vertices on each server need are present on that server. However, if the value of a vertex changes after an iteration, it needs to be communicated to all its ghosts: if the value of A changes on Server 1 in iteration s, the new value is communicated to Server 2 after iteration s and before iteration s+1.
  • #40 The communication overhead here is proportional to the number of ghosts. Additionally, the servers need to store both in and out vertices to handle the ghost vertices. Also, there is a 2 pass computation with one pass for value updates and one for neighbor activation (i.e. determine if a vertex needs to perform any computation – performs computation only if it has an active neighbor/ i.e. it is receiving a message)
  • #42 Next came PowerGraph, the idea here is mirrors. Instead of creating ghost vertices, the vertices which have edges on multiple servers have mirrors on all the servers. For example, the vertex A creates two mirrors – one on Server 1 and another on Server 2. Now, just like ghost vertices, the values of mirrors need to be accumulated. It assigns a master mirror that collects all values and sends the final value to all mirrors.
  • #43 Therefore, here the communication overhead is twice the number of mirrors. All these are excellent approaches but can we do better? The problem is that there are three major requirements from a graph framework – low memory overhead, low computation overhead and low communication overhead. But all of them do not go hand in hand. Can we minimize each in a single framework. The frameworks that we discussed till now trade one for another. I will discuss the comparison between all the frameworks in a later slide.
  • #44 The authors used the equations they derived to calculate the expected communication overhead as the number of servers increases for real-world graphs. This is Twitter. LFGraph is better than all the others, but one interesting observation is that LFGraph is still increasing at 250 servers while Pregel plateaus: no matter how many servers there are, the probability of having neighbors on other servers is uniform.
  • #45 I mentioned before that cheap hash-based partitioning is better than expensive intelligent partitioning. Let us discuss exactly why this is the case. It has been shown in several studies that power-law graphs have substantial load imbalance if we use random hash-based partitioning. People have used this to motivate the need for an intelligent up-front partitioning strategy to reduce the imbalance. Remember that the more the imbalance, the longer the execution time: since there are barriers, each iteration is only as fast as the slowest worker. However, are real-world graphs the same as ideal power-law graphs?
  • #46 The runtime of a computation worker during an iteration is proportional to the total number of in-edges processed by that worker. If we use synthetic ideal power-law graphs, it is true that the imbalance is high.
  • #47 However, if we use real-world graphs, the imbalance is low. So what is the difference between ideal power-law graphs and real-world graphs?
  • #48 While ideal power-law graphs have a straight tail in their log-log plot, each of the real graphs has a funnel at its tail. This indicates that in real graphs there are many of the highest-degree vertices, so they spread out and become load-balanced across servers, because the servers are far fewer in number than the vertices. In an idealized power-law graph, on the other hand, there is only one vertex of the very highest degree, and this causes unbalanced load.
  • #49 If we plot the total number of vertices fetched during the communication phases, we observe very little imbalance across the workers showing that communication is balanced in LFGraph.
  • #50 Random – randomly place edges across machines. Batch (greedy partitioning with global coordination) – greedily place the next edge on the machine that minimizes the number of mirrors / the replication factor, using a global table. Oblivious (greedy partitioning without global coordination) – each machine decides locally, with no global coordination. LFGraph is 2x faster than the best PowerGraph variant.
  • #51 In a small cluster with 8 servers, LFGraph is between 4x and 100x faster than the PowerGraph variants. In a large cluster with 32 servers the improvement grows to 5x-380x. We observed that PowerGraph's intelligent partitioning schemes contributed 90%-99% of the overall runtime for the PageRank benchmark. In a small cluster, distributed graph processing is compute-heavy, so intelligent partitioning (e.g., in PowerGraph) has little effect. In a large cluster, intelligent partitioning can speed up iterations; however, the partitioning cost itself is directly proportional to cluster size and contributes sizably to runtime.
  • #52 Used smem tool to measure memory footprint for LFGraph Heap Space reported in the Debug Logs to estimate memory footprint LFGraph uses 8x to 12x less memory than PowerGraph.
  • #53 LFGraph transfers about 4x less data per server than PowerGraph. As the cluster size increases, there is first a quick rise in the total communication overhead (see Section 4 and Figure 3), so the per-server overhead rises at first. But as the total communication overhead plateaus, the growth in cluster size takes over and the per-server overhead drops. This creates the peak in between. *Emulab has full-bisection bandwidth.
  • #54 We create 10 synthetic graphs varying in number of vertices from 100M to 1B, and with up to 128B edges. We run the SSSP benchmark on them, with μ = 4 and σ = 1.3. However, Pregel and LFGraph use different numbers of servers and different configurations on these servers: 300 servers with 800 workers vs. 12 servers with 96 workers.
  • #56 Sequential access is by far faster than random access.
  • #57 Sequential access beats random access on every medium, and the speedup grows for slower media. Even in RAM it is faster, because of prefetching. Random writes do better than random reads: RAM has write-coalescing buffers in its architecture, and disks have write caches (absorbing writes, letting the next write proceed while the previous one is outstanding). Why is the 16-core speedup lower than the 1-core one? We need 16 cores to saturate the memory. Why does the speedup increase for slower media? Because of their architecture: moving a disk head is expensive, while RAM is designed for random access.
  • #59 State stored in vertices Scatter updates along outgoing edge Gather updates from incoming edges
  • #60 State stored in vertices Scatter updates along outgoing edge Gather updates from incoming edges
  • #62 State stored in vertices Scatter updates along outgoing edge Gather updates from incoming edges
  • #63 State stored in vertices Scatter updates along outgoing edge Gather updates from incoming edges Scatter and gather in a loop until termination criterion is met When edge set >> vertex set the benefit of streaming rather than accessing randomly increases
  • #64 Streaming data from slow storage
  • #66 Wasted edges!
  • #67 No index No clustering No sorting
  • #69 The union of the V sets must cover all vertices, and each V must fit in fast storage. The number of partitions is important for performance. Three streaming files (vertices, edges, updates).
  • #70 They have an extra in and out buffer for streaming
  • #71 Static allocation! Create an index array, then copy updates into the partition chunks.
  • #75 Why are some speedups lower? Higher waste, due either to the algorithm itself or to a bigger diameter. WCC: weakly connected components; belief propagation; PageRank; ALS: alternating least squares. GraphChi uses "shards": it partitions edges into sorted shards, and it was the first to bring in the notion of sequential access. The vertices and the in/out edges of a shard need to be in memory at the same time, hence more shards! It needs pre-sorting (based on source) and re-sorting of updates. X-Stream uses sequential scans and partitions edges into unsorted streaming partitions, without pre-sorting time.
  • #76 Mainly because of re-sorting (by destination), about 60%, and incomplete use of the available streaming bandwidth from the SSD (with more shards, the edges fit in memory as well).
  • #77 X-Stream returns answers before GraphChi finishes sharding in 3 out of 4 cases.
  • #78  X-Stream uses all available bandwidth from the storage device
  • #79 Weakly connected components. Log-log scale becomes linear-linear. Unordered graph input means it easily handles bigger graphs; 330M edges added at a time. WCC. Why 3 out of 16?
  • #82 For the number of partitions they assume a uniform distribution of updates. How realistic is that, for example in traversal algorithms?
  • #85 In LFGraph, computation and communication phases are independent. Independent growth of each phase + Communication can be batched In PowerGraph, the two phases overlap ->Communication negatively impacts computation
  • #86 Here is the table given in the paper pointing out the qualitative comparison between various frameworks. As you can see, LFGraph achieves the minimum overheads across the categories. I will not spend time on the table too much but it is clear.
  • #88 Gather: all reads, no writes (write = 0). Scatter: bursty. Sequential-access bandwidth. A GraphChi shard must fit all its vertices and edges in memory, while an X-Stream partition must fit only its vertices. So there are more GraphChi shards than X-Stream partitions, which makes GraphChi's access pattern more random.
  • #90 Wcc bfs traversal
  • #91 HDD RMAT:30 SSD RMAT:27 Take advantage of extra disk!
  • #94 Ratio of execution to streaming
  • #95 Less difference, since with more cores the gap to random access decreases.
  • #96 Pre-processing is done only once and used many times! X-Stream is not always the best, especially for traversal algorithms (high waste). Pre-processing takes 7-8x more time than X-Stream.
  • #97 There are wasted reads with streaming, but streaming is faster than random access. When to choose which? With more edges there is more random access, which makes things slower, but also more wastage at the same time. Intelligent partitioning: try to put vertices that have the same output vertex set together, and try to keep the sets' edge-list sizes equal for load balancing; dynamic partitioning can cause high overhead. That would cause overhead, and we want low overhead!