Modeling Orderings in Asynchronous Distributed-Memory Parallel Graph Algorithms Abstract Graph Machine Final Dissertation Defense Thejaka Amila Kanewala November 9, 2018
Agenda • Preliminaries • Motivation • Abstract Graph Machine (AGM) • Extended Abstract Graph Machine (EAGM) • Engineering an AGM • Engineering an EAGM • Results • Final Remarks 2Abstract Graph Machine
Graphs • A graph is an ordered pair, G = (V, E), where V is a set of vertices and E is a set of edges (E ⊆ V × V). Our focus.
Graph Applications (Graph Problems) • Single Source Shortest Paths • Breadth First Search • Connected Components • Maximal Independent Set • Minimum Spanning Tree • Graph Coloring • And many more … 4Abstract Graph Machine Multiple algorithms to solve a graph problem. E.g., - Dijkstra’s SSSP - Bellman-Ford - Etc.
Why Performance in Parallel Graph Algorithms is Challenging? • Low compute/communication ratio (caused by irregularity) • Synchronization overhead • Higher amount of work • Many dependencies: the input graph, the algorithm logic, the underlying runtime, etc.
Regular vs. Irregular • Regular: communication only takes place at the data boundary; low communication/computation ratio. • Irregular: communication takes place everywhere; high communication/computation ratio.
Synchronization Overhead • Many existing parallel graph algorithms are designed with a focus on shared-memory parallel architectures. • When we "extend" these algorithms for distributed-memory execution, they end up having many synchronization phases. → Synchronization overhead is significant when processing a large graph on many distributed nodes.
Shared-Memory Parallel Graph Algorithms • Shared-memory parallel algorithms can be described abstractly using the Parallel Random Access Machine (PRAM) model: independent processors, each with its own memory (e.g., registers), connected to a shared memory.
Shared-Memory Parallel Graph Algorithms • A computation consists of multiple parallel phases separated by synchronization steps. • These algorithms can be naturally extended to distributed-memory using Bulk Synchronous Parallel (BSP).
Shared-Memory to Distributed-Memory • The BSP-style execution runs many global barriers. • Synchronization overhead is significant, especially when executing the program on many nodes.
Asynchronous Graph Algorithms Level synchronous BFS, using “S” as the source. 11Abstract Graph Machine
Asynchronous Graph Algorithms Level synchronous BFS, using “S” as the source. 12Abstract Graph Machine Synchronize
Asynchronous Graph Algorithms Level-synchronous BFS, using "S" as the source. Synchronize. Once a label is set, no corrections → no additional work.
Asynchronous Graph Algorithms Asynchronous BFS, using “S” as the source. 14Abstract Graph Machine
Asynchronous Graph Algorithms Asynchronous BFS, using "S" as the source. Label correction: these labels need to be corrected as well, which generates a lot of redundant work.
Asynchronous Graph Algorithms • Asynchronous graph algorithms are better in the sense that they avoid the overhead of synchronization. • BUT, they tend to generate a lot of redundant work → high runtime.
How much Synchronization? • Low diameter: enough parallel work in each level; fewer barriers (diameter ~ barriers); e.g., Twitter, ER. Level-synchronous execution is better. • High diameter: not enough parallel work in a level; e.g., Road NW. Asynchronous execution is better. We need a way to control the amount of synchronization needed by a graph algorithm → the Abstract Graph Machine (AGM) is a model that can control the level of synchronization of a graph algorithm using an ordering.
SSSP as an Example • In an undirected weighted graph G = (V, E), the Single Source Shortest Path problem is to find a path from the source to every vertex such that the sum of edge weights along that path is minimized.
Algorithms

Chaotic (process a random vertex):

void Chaotic(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  Relax(source, 0);
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      Relax(u, d + weight[e]);
    }
  }
}

Dijkstra (put vertex in priority queue):

void Dijkstra(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  pq.insert(<source, 0>);
  While pq is not empty {
    <v, d> = vertex-distance pair with minimum distance in pq;
    Relax(v, d);
  }
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      pq.insert(<u, d + weight[e]>);
    }
  }
}

∆-Stepping (put vertex in ∆ bucket):

void DeltaStepping(Graph g, Vertex source) {
  For each Vertex v in G { state[v] <- INFINITY; }
  insert <source, 0> into the appropriate bucket in buckets;
  While not all buckets are empty {
    bucket = smallest non-empty bucket in buckets;
    For each <v, d> in bucket { Relax(v, d); }
  }
}
void Relax(Vertex v, Distance d) {
  If (d < state[v]) {
    state[v] <- d;
    For each edge e of v {
      Vertex u = target_vertex(e);
      insert <u, d + weight[e]> into the appropriate bucket in buckets;
    }
  }
}
SSSP Algorithms: Processing vs. Work Ordering • Chaotic: vertex-distance pairs generated by the Relax function are fed back to it, unordered. • ∆-Stepping: generated vertex-distance pairs are ordered based on ∆; each pair is inserted into the appropriate bucket, and pairs in the first bucket are processed first. • KLA: vertex-distance pairs are ordered based on the level from the source vertex; buckets are created based on a level interval k. • Dijkstra: vertex-distance pairs are ordered based on the distance; the smallest-distance pairs are processed first. The Relax state update is the Same across all four algorithms; only the Ordering of updates is Different.
Ordering of SSSP Algorithms • Chaotic: generated vertex-distance pairs are selected randomly and not ordered → a single, unordered partition. • ∆-Stepping: generated pairs are separated into several partitions based on a relation defined on the distance and ∆ → multiple partitions; unordered within a ∆ partition, ordered between ∆ partitions. • KLA: generated pairs are separated into partitions based on the level and k → multiple partitions; unordered within k, ordered between k. • Dijkstra: generated pairs are separated into partitions based on the distinct distance values → multiple partitions based on the distance, fully ordered.
Distributed-Memory Implementations (two ranks R0 and R1; vertices are distributed, e.g., even vertices in R0 and odd vertices in R1) • Chaotic: barriers are executed only at the beginning and at the end of the algorithm. • ∆-Stepping: a barrier is executed after processing each ∆ bucket. • KLA: a barrier is executed after processing k levels. • Dijkstra: a barrier is executed after processing each vertex-distance pair with a different distance.
Abstract Graph Machine for SSSP • A strict weak ordering relation is defined on work units. The strict weak ordering relation creates equivalence classes and induces an ordering on the generated equivalence classes.
Abstract Graph Machine • Abstract model that represents graph algorithms as a processing function and strict weak ordering • In AGM for SSSP, the (vertex, distance) pair is called a workitem • An AGM consists of:  A definition of a graph (vertices, edges, vertex attributes, edge attributes)  A set of states (of the computation, e.g., distances)  A definition of workitems  A processing function  A strict weak ordering relation  An initial workitem set (to start the algorithm) 24Abstract Graph Machine
The Abstract Graph Machine (AGM) Definition of graph Definition of workitems Set of states Processing function Strict weak ordering Initial set of workitems Including vertex and edge attributes SSSP: distance from source SSSP: relax SSSP: edge weight SSSP: starting vertex SSSP: depends on algorithm! 25Abstract Graph Machine
SSSP Algorithms in AGM • In general, an SSSP algorithm's workitem consists of a vertex and a distance; the set of workitems for the AGM is therefore the set of (vertex, distance) pairs. • A workitem w is a pair (e.g., w = <v, d>). The "[]" operator retrieves the values associated with w: w[0] returns the vertex associated with the workitem and w[1] returns the distance. • The state of the algorithm (the shortest distance calculated at a point in time) is stored in a state we call "distance".
The Processing Function for SSSP (relax() in AGM notation) • The input workitem is a pair: w[0] returns the vertex associated with the workitem and w[1] returns the associated distance. 1. Check the condition: if the input workitem's distance is better than the stored distance, go to 2. 2. Update the distance and go to 3. 3. Generate new workitems.
State Update & Work Generation 28Abstract Graph Machine State update logic. Work generation logic.
Strict Weak Ordering Relations (SSSP) • Chaotic SSSP strict weak ordering relation • Dijkstra's SSSP strict weak ordering relation • ∆-Stepping SSSP strict weak ordering relation • A similar ordering relation can be derived for KLA.
Algorithm Family • All algorithms share the same processing function but with different orderings. • The collection of algorithms that share the same processing function is called an algorithm family. 30Abstract Graph Machine
Distributed ∆-Stepping Algorithm • ∆-Stepping on two ranks (R0, R1). • Partitions are created by the global ordering, based on the ∆ relation; the partitions are ordered. • Work within a bucket (e.g., the first bucket created by the delta-stepping algorithm) is not ordered → huge opportunity for acceleration.
Extended AGM (EAGM): ∆-Stepping + Thread-Level Ordering • The AGM creates buckets with a global ordering, based on the ∆ relation; for a specific algorithm, work within a bucket is not ordered. • But why stay unordered? Each rank can process its bucket with a better ordering, and the ranks can do this in parallel.
Extended AGM (EAGM): Hybrid Hierarchical Ordering • As we distribute/parallelize, we create different spatial domains for processing. • Extended AGM: use a different ordering in each domain.
Extended AGM in General • An AGM takes a processing function and a strict weak ordering relation. • An EAGM takes a processing function and a hierarchy of strict weak ordering relations, one attached to each spatial domain: Global Ordering → Node Ordering → NUMA Ordering → Thread Ordering.
Extended AGM (EAGM): e.g., ∆-Stepping with NUMA-level ordering. • Work within each bucket is further ordered using NUMA-local priority queues (e.g., the ordering for bucket B1 in NUMA domain 1 of rank 0).
Extended AGM (EAGM): ∆-Stepping with node-level ordering. • Work within each bucket is further ordered using node-local priority queues (e.g., the ordering for bucket B1 in rank 0).
Engineering an AGM Abstract Graph Machine 37
AGM Graph Processing Framework

The processing function for SSSP (for SSSP, a "WorkItem" is a vertex and a distance; CAS = atomic compare-and-swap). After executing wi, the processing function generates new workitems and pushes them for ordering:

template<typename buckets>
void PF(const WorkItem& wi, int tid, buckets& outset) {
  Vertex v = std::get<0>(wi);
  Distance d = std::get<1>(wi);
  Distance old_dist = vdistance[v];
  while (d < old_dist) {
    Distance din = CAS(d, &vdistance[v]);
    if (din == old_dist) {
      FORALL_OUTEDGES_T(v, e, g, Graph) {
        Vertex u = boost::target(e, g);
        Weight we = boost::get(vweight, e);
        WorkItem wigen(u, (d + we));
        outset.push(wigen, tid);
      }
      break;
    } else {
      old_dist = din;
    }
  }
}

The strict weak ordering partitions WorkItems; in this case the ordering is based on the distance:

//================== Dijkstra Ordering =====================
template<int index>
struct dijkstra : public base_ordering {
public:
  static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_dijkstra;

  template <typename T>
  bool operator()(T i, T j) {
    return (std::get<index>(i) < std::get<index>(j));
  }

  eagm_ordering name() { return ORDERING_NAME; }
};
Invoking the Framework

// SSSP (AGM) algorithm
typedef agm<Graph, WorkItem, ProcessingFunction,
            StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);

Creating the AGM for SSSP. The template parameters are the WorkItem type, the processing function (pre-order) type, the strict weak ordering relation type, and the underlying runtime (the AGM framework abstracts out the runtime). The constructor arguments are the underlying runtime, the specific ordering function (e.g., delta ordering), the processing function instance, and the initial workitem set (e.g., the source vertex).
Mapping AGM to an Implementation Abstract Graph Machine 40 Q2. How should we arrange processing function and ordering ? Q1. How should we do ordering ? Abstract Actual ImplementationCommunication method
Ordering Implementation • Every compute node keeps a set of equivalence classes locally in a data structure. • Equivalence classes are ordered according to the strict weak ordering (swo), which is transitive. • Every equivalence class has: a representative workitem, and an append buffer that holds the workitems of the equivalence class, i.e., all the workitems that are not comparable to the representative. (The representation is discussed later; equivalence classes are processed by shared-memory parallel threads.)
Ordering Implementation: Inserts to an Equivalence Class • Find the equivalence class by comparing against the representative workitems (the inserted element is not comparable with the representative of its class). • If there is no such equivalence class, create one with the inserted workitem as the representative. → Must deal with concurrency.
Ordering Implementation: Processing an Equivalence Class • Find the equivalence class with the minimum representative workitem: every rank sends the representative workitem of the first equivalence class in its list (an all-reduce communicates the minimum workitems), and the smallest is selected. • Equivalence classes are not distributed uniformly: if a particular node does not have an equivalence class matching the global minimum representative, it inserts one.
Ordering Implementation: Termination of an Equivalence Class • While processing an equivalence class, more work can be added to the same equivalence class. • Every node keeps track of the number of workitems pushed into the equivalence class and the number of workitems processed from it. • The difference between those two numbers is globally reduced (sum); when the sum is zero, the framework decides that the equivalence class is fully processed. Counts are reduced only after making sure all messages have been exchanged (i.e., after an epoch).
AGM Implementation: Data Structure • A data structure is needed to hold the equivalence classes (later, this data structure will be extended for use at the EAGM levels). • Requirements: it should order nodes by representative workitems; fast lookups are important; and it must support concurrent insert/delete and lookup/search. Sounds like a Dictionary ADT.
Data Structure Candidates • Binary search trees (BST) with locks (e.g., red-black trees): insertions/deletions often rebalance the tree, so the whole data structure must be locked for insert/delete. • Linked list with locks and atomics: no need to lock the whole data structure (only the node being modified), but lookup time is linear; not much difference between the locked and lock-free versions. • Concurrent skip list: a probabilistic alternative that gives the same complexity guarantees as a BST and avoids rebalancing, but contention is much higher. • Partitioning scheme: explained on the next slide.
Data Structure: Partitioning Scheme • Every node maintains a set of vectors, one per thread; push back is O(1) (assuming no resize). • Every thread maintains the minimum workitem it has seen so far. • Find the global minimum representative, then partition by the global minimum using the strict weak ordering relation and move the partitioned work to a new vector (O(n)). • "Thread buckets" is a data structure with one vector per thread; it represents the next equivalence class to be processed.
Data Structure: Partitioning Scheme • Processing thread buckets (T0 T1 T2 T0 T1 T2 T0 …) avoids load-imbalance issues. • But workitems arriving for the bucket currently being processed cannot be inserted (we cannot afford a vector resize; if we needed it, we would need locks). • Workitems pushed for the same bucket are therefore sent directly.
Data Structure: Summary (pre-order processing function, SSSP delta ordering; two nodes, 16 shared-memory parallel threads, Graph500 scale 24)

Data Structure                   Execution Time (sec.)
Linked List (Atomics)            ~61
Binary Search Tree with Locks    ~82
Concurrent SkipList              ~89
Partitioning Scheme              ~44
Mapping AGM to an Implementation Abstract Graph Machine 50 Q2. How should we arrange processing function and ordering ? Q1. How should we do ordering ? Abstract Actual ImplementationCommunication method
Communication Paradigms • Push communication: R1 pushes workitems to R0. • Pull communication: R0 pulls workitems from R1. • GAS communication: gather, apply, and scatter phases across the ranks (R0-R3).
Placing the Processing Function: Pre-Order & Post-Order • Pre-order: after receiving a workitem, the processing function is executed and the generated workitems are inserted into the data structure for ordering. • Post-order: after receiving a workitem, it is first inserted into the data structure for ordering (and processed later).
Placing the Processing Function: Split-Order • The processing function contains logic for updating state(s) and for generating new work. → We split the processing function into two functions: a state-update function and a work-generation function. • The workitem is inserted for ordering only if the state was updated; new work is generated only if the state has not changed since. This allows us to prune work further.
Placing the Processing Function: Summary

Order        Run Time   Total Inserts   Additional WorkItems Processed
Split-Order  14.54      17,139,284      21,334
Pre-Order    62.55      409,388,039     2,469,025
Post-Order   44.09      398,344,581     2,284,918

Split-order gives the best timing: the total number of workitems pushed into the data structure is much smaller than in the other two implementations, and the number of additional relaxes is much smaller because of pruning. Post-order is slightly better than pre-order because of reduced contention.
Engineering an EAGM Abstract Graph Machine 56
EAGM Spatial Levels Abstract Graph Machine 57 T0 T1 T2 T3 T4 T5 N0 Global Spatial Level Node Spatial Level NUMA Spatial Level Thread Spatial Level numa0 numa1 T0 T1 T2 T3 T4 T5 N1 numa0 numa1 T0 T1 T2 T3 T4 T5 N2 numa0 numa1
EAGM Execution

// SSSP (AGM) algorithm
typedef agm<Graph, WorkItem, ProcessingFunction,
            StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);

// SSSP (EAGM) algorithm
typedef eagm<Graph, WorkItem, ProcessingFunction,
             EAGMConfig, RuntimeModelGen> sssp_eagm_t;
sssp_eagm_t ssspalgo(rtmodelgen, config, pf, initial);

An AGM takes a single strict weak ordering relation; an EAGM takes a hierarchy of orderings (the EAGMConfig), and the hierarchy corresponds to the memory hierarchy. Example EAGMConfig: globally delta (∆ = 5) ordering, node-level chaotic ordering, NUMA-level chaotic ordering, thread-level chaotic ordering.
EAGMConfig

// strict weak orderings
//================== Chaotic =====================
struct chaotic : public base_ordering {
public:
  static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_chaotic;

  template <typename T>
  bool operator()(T i, T j) { return false; }

  eagm_ordering name() { return ORDERING_NAME; }
};

With the chaotic ordering, no two workitems are comparable to each other.

CHAOTIC_ORDERING_T ch;
DELTA_ORDERING_T delta(agm_params.delta);
auto config = boost::graph::agm::create_eagm_config(delta, ch, ch, ch);

This configures a globally delta ordering with node-level, NUMA-level, and thread-level chaotic orderings.
EAGM Execution: • With a global ordering and chaotic orderings below it, execution is the same as the AGM: globally synchronous (barrier synchronization between equivalence classes) and asynchronous at the process, NUMA, and thread levels.
EAGM Execution: • With a global chaotic ordering, execution is globally asynchronous → a single global equivalence class. • With a node-level ordering, execution is node-level synchronous (equivalence classes at the process level, with local thread barriers) and asynchronous at the NUMA and thread levels.
EAGM Execution: • If we have non-chaotic orderings defined for every memory level, we get equivalence classes and barriers at each level: a global equivalence class with a global barrier, node equivalence classes, NUMA equivalence classes with NUMA barriers, and thread equivalence classes with thread barriers. • The memory hierarchy gives a spatial ordering; the barriers over time give a temporal ordering.
EAGM Implementation • The heart of the EAGM implementation is the nested data structure that holds equivalence classes: global-level classes contain node-level classes, which in turn contain NUMA-level classes (and so on down to threads).
EAGM Implementation: Data Structure • E.g., over time, the structure is ordered by Delta = 10 at the top level, by K = 2 at the next level, and by Delta = 5 below that.
EAGM Implementation: Static Optimizations • A chaotic ordering always creates a single equivalence class; therefore, we remove the nested levels for chaotic orderings, giving an optimized data structure. • The optimizations are performed statically at compile time with the help of template metaprogramming and template specialization.
EAGM Implementation: Static Optimizations • Removing nested levels for chaotic orderings: when every level below the global one is chaotic, this is really an AGM.
EAGM Implementation: Static Optimizations Abstract Graph Machine 67
Results Abstract Graph Machine 68
Single Source Shortest Paths: Pre/Post/Split Orderings • Weak-scaling results carried out on 2 Broadwell 16-core Intel Xeon processors; 2^30 vertices & 2^34 edges; shared-memory execution. • Split-order is ~5 times faster than pre-order & post-order.
Single Source Shortest Paths: Weak Scaling • Inputs: Graph500 graph, Graph500-proposed SSSP graph, and Erdos-Renyi graph. • The framework algorithms show better scaling compared to PowerGraph. • Thread-level Dijkstra ordering shows better scaling behavior for power-law graphs; globally synchronous delta-stepping scales better for ER graphs.
Single Source Shortest Paths: Strong Scaling (BRII+) • Relative speedup = time for the fastest sequential algorithm / parallel execution time on n PEs. • Synchronization overhead becomes significant when executing on a larger number of nodes. • Ordering by level alone does not reduce the work, but applying Dijkstra ordering at the thread level gives better performance.
SSSP More Results: Weak Scaling (BR II) • Neither PowerGraph nor PBGL-1 scales well at larger scales.
Breadth First Search: Weak Scaling • K(2) global ordering and level global ordering remove more redundant work compared to thread-level ordering.
BFS: In-Node Performance (Weak Scaling) • Global level-synchronous vs. node level-synchronous: within a node, both configurations execute in the same way; therefore, the performance of the two configurations is similar.
BFS: Strong Scaling Abstract Graph Machine 75 Globally level synchronous versions show better speed-up than thread level synchronous versions.
BFS: Road Networks • Strong scaling with road networks. Road networks have a high diameter (~850-900); therefore, the global level-synchronous version shows poor scaling because of the synchronization overhead, while the globally asynchronous, thread-level synchronous version fares better.
Other Graph Applications • Connected Components • Maximal Independent Set • Triangle Counting Abstract Graph Machine 77
Conclusions • Using the AGM abstraction, we showed that we can generate families of algorithms by keeping the processing function intact and changing the ordering. • With the EAGM, we can achieve spatial and temporal orderings. • We discussed the challenges in mapping the AGM framework to an implementation and described how we avoided them. • We developed efficient implementations of the AGM and EAGM frameworks. → Weak- and strong-scaling results show that AGM/EAGM algorithms outperform graph algorithms in other frameworks.
Thank You ! Abstract Graph Machine 79

ABSTRACT GRAPH MACHINE: MODELING ORDERINGS IN ASYNCHRONOUS DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS

  • 1.
    Modeling Orderings inAsynchronous Distributed-Memory Parallel Graph Algorithms Abstract Graph Machine Final Dissertation Defense Thejaka Amila Kanewala November 9, 2018
  • 2.
    Agenda • Preliminaries • Motivation •Abstract Graph Machine (AGM) • Extended Abstract Graph Machine (EAGM) • Engineering an AGM • Engineering an EAGM • Results • Final Remarks 2Abstract Graph Machine
  • 3.
    Graphs • Graph isan ordered pair, G = (V, E) where V is a set of vertices and E is a set of edges ( ). 3Abstract Graph Machine Our Focus.
  • 4.
    Graph Applications (GraphProblems) • Single Source Shortest Paths • Breadth First Search • Connected Components • Maximal Independent Set • Minimum Spanning Tree • Graph Coloring • And many more … 4Abstract Graph Machine Multiple algorithms to solve a graph problem. E.g., - Dijkstra’s SSSP - Bellman-Ford - Etc.
  • 5.
    Why Performance inParallel Graph Algorithms is Challenging ? • Low compute/communication ratio (caused by irregularity) • Synchronization overhead • Higher amount of work • Many dependencies - Input graph - The algorithm logic - The underlying runtime - Etc. 5Abstract Graph Machine
  • 6.
    Regular vs. Irregular Communicationonly take place at the data boundary Communication take place everywhere. Low communication/computation ratio. High communication/computation ratio. 6Abstract Graph Machine
  • 7.
    Synchronization Overhead • Manyexisting parallel graph algorithms are designed focusing shared-memory parallel architectures. • When we ”extend” these algorithms for distributed-memory execution, they end up having many synchronization phases.  Synchronization overhead is significant when processing a large graph on many distributed nodes. 7Abstract Graph Machine
  • 8.
    Shared-Memory Parallel GraphAlgorithms • Shared-memory parallel algorithms can be describe abstractly using the Parallel Random Access Machine (PRAM) model. Independent processors with its own memory (e.g., registers) Shared-memory 8Abstract Graph Machine
  • 9.
    Shared-Memory Parallel GraphAlgorithms . . . . . Consists of multiple phases like these Synchronize Synchronize These algorithms can be naturally extended to distributed-memory using Bulk Synchronous Parallel (BSP). 9Abstract Graph Machine
  • 10.
    Shared-Memory to Distributed-Memory . . . . . . . . . .Global Barrier Global Barrier Executes many global barriers. Synchronization overhead is significant, especially when executing the program on many nodes. 10Abstract Graph Machine
  • 11.
    Asynchronous Graph Algorithms Levelsynchronous BFS, using “S” as the source. 11Abstract Graph Machine
  • 12.
    Asynchronous Graph Algorithms Levelsynchronous BFS, using “S” as the source. 12Abstract Graph Machine Synchronize
  • 13.
    Asynchronous Graph Algorithms Levelsynchronous BFS, using “S” as the source. Synchronize Once a label is set, no corrections  0 additional work. 13Abstract Graph Machine
  • 14.
    Asynchronous Graph Algorithms Asynchronous BFS,using “S” as the source. 14Abstract Graph Machine
  • 15.
    Asynchronous Graph Algorithms Asynchronous BFS,using “S” as the source. Label correction Need to correct these as well. Generates lot of redundant work. 15Abstract Graph Machine
  • 16.
    Asynchronous Graph Algorithms •Asynchronous graph algorithms are better in the sense they avoid the overhead of synchronization. • BUT, they tend to generate lot of redundant work. - High runtime 16Abstract Graph Machine
  • 17.
    How much Synchronization? Low diameter Haveenough parallel work in each level Fewer barriers (diameter ~ barriers) E.g., Twitter, ER. Level synchronous execution is better. . . . High diameter Not enough parallel work in a level E.g., Road NW. Asynchronous execution is better. We need a way to control the amount of synchronization needed by a graph algorithm  Abstract Graph Machine (AGM) is a model that can control the level of synchronization of a graph algorithm using an ordering. 17Abstract Graph Machine
  • 18.
    SSSP as anExample • In an undirected weighted graph (G=(V, E)), the Single Source Shortest Path problem is to find a path to every vertex so that the sum of weights in that path is minimized. 18Abstract Graph Machine
  • 19.
    Algorithms void Chaotic(Graph g,Vertex source) { For each Vertex v in G { state[v] <- INFINITY; } Relax(source, 0); } void Relax(Vertex v, Distance d) { If (d < state[v]) { state[v] <- d; For each edge e of v { Vertex u = target_vertex(e); Relax(u, d+weight[e]); } } } void Dijkstra(Graph g, Vertex source) { For each Vertex v in G { state[v] <- INFINITY; } pq.insert(<source, 0>); While pq is not empty { <v, d> = vertex-distance pair with minimum distance in pq; Relax(v, d); } } void Relax(Vertex v, Distance d) { If (d < state[v]) { state[v] <- d; For each edge e of v { Vertex u = target_vertex(e); pq.insert(<u, d+weight[e]>); } } } void -Stepping(Graph g, Vertex source) { For each Vertex v in G { state[v] <- INFINITY; } insert <source, 0> to appropriate bucket from buckets; While all buckets are not empty { bucket = smallest non-empty bucket in buckets; For each <v, d> in bucket { Relax(v, d); } } } void Relax(Vertex v, Distance d) { If (d < state[v]) { state[v] <- d For each edge e of v { Vertex u = target_vertex(e); insert <u, d+weight[e]> to appropriate bucket in buckets; } } } Relax Relax Relax Process random vertex Put vertex in priority queue Put vertex in ∆ priority queue 19Abstract Graph Machine
  • 20.
    SSSP Algorithms ProcessingWork Chaotic ▲-Stepping KLA Dijkstras Vertex, distance pairs generated by the Relax function are fed to itself. Vertex distance pairs generated in by the Relax function are ordered based on ∆. Generated vertex-distance pair is inserted into appropriated, bucket and vertex, distance pairs in the first bucket is processed first. Vertex, distance pairs are ordered based on the level from the source vertex. Buckets are created based on level interval k. Vertex, distance pairs are ordered based on the distance. Smallest distant vertex, distance pairs are processed first. Relax State update Ordering of updates Same Same Different Different 20Abstract Graph Machine
  • 21.
    Ordering of SSSP Algorithms
• Chaotic: generated vertex-distance pairs are selected randomly and not ordered. A single partition; unordered.
• ∆-Stepping: pairs are separated into several partitions by a relation defined on the distance and ∆. Unordered within a ∆ partition, ordered between partitions.
• KLA: pairs are separated into partitions based on the level and k. Unordered within k levels, ordered between partitions.
• Dijkstra's: pairs are separated into partitions based on the distinct distance values. Fully ordered.
  • 22.
    Distributed-Memory Implementations
• Setup: two ranks (R0 and R1); vertices are distributed (e.g., even vertices on R0 and odd vertices on R1).
• Chaotic: barriers are executed only at the beginning and at the end of the algorithm.
• ∆-Stepping: a barrier is executed after processing each ∆ bucket.
• KLA: a barrier is executed after processing every k levels.
• Dijkstra's: a barrier is executed after processing each distinct distance value.
  • 23.
    Abstract Graph Machine for SSSP
• A strict weak ordering relation is defined on work units. The relation creates equivalence classes and induces an ordering on the generated equivalence classes.
  • 24.
    Abstract Graph Machine
• An abstract model that represents a graph algorithm as a processing function plus a strict weak ordering.
• In the AGM for SSSP, the (vertex, distance) pair is called a workitem.
• An AGM consists of:
 A definition of a graph (vertices, edges, vertex attributes, edge attributes)
 A set of states (of the computation, e.g., distances)
 A definition of workitems
 A processing function
 A strict weak ordering relation
 An initial workitem set (to start the algorithm)
  • 25.
    The Abstract Graph Machine (AGM)
• Definition of the graph, including vertex and edge attributes (SSSP: edge weights)
• Definition of workitems
• Set of states (SSSP: distance from the source)
• Processing function (SSSP: relax)
• Strict weak ordering (SSSP: depends on the algorithm!)
• Initial set of workitems (SSSP: the starting vertex)
  • 26.
    SSSP Algorithms in AGM
• In general, an SSSP algorithm's workitem consists of a vertex and a distance, so the set of workitems for the AGM is defined over (vertex, distance) pairs.
• A workitem w is a pair (e.g., w = <v, d>). The "[]" operator retrieves the values associated with w: w[0] returns the vertex and w[1] returns the distance.
• The state of the algorithm (the shortest distance calculated at a point in time) is stored in a state we call "distance".
  • 27.
    The Processing Function for SSSP
• The input workitem is a pair; w[0] returns the associated vertex and w[1] returns the associated distance.
• relax() in AGM notation:
 1. Check the condition: if the input workitem's distance is better than the stored distance, go to 2.
 2. Update the distance and go to 3.
 3. Generate new workitems.
  • 28.
    State Update & Work Generation
• The processing function combines state update logic and work generation logic.
  • 29.
    Strict Weak Ordering Relations (SSSP)
• The chaotic SSSP strict weak ordering relation
• Dijkstra's SSSP strict weak ordering relation
• The ∆-stepping SSSP strict weak ordering relation
• A similar ordering relation can be derived for KLA.
  • 30.
    Algorithm Family
• All the algorithms share the same processing function but use different orderings.
• The collection of algorithms that share the same processing function is called an algorithm family.
  • 31.
    Distributed ∆-Stepping Algorithm
• Partitions are created by the global ordering, based on the ∆-stepping relation.
• The first bucket is created by the ∆-stepping algorithm; work within a bucket is not ordered, but the partitions themselves are ordered.
• ∆-Stepping on two ranks (R0, R1): a huge opportunity for acceleration.
  • 32.
    Extended AGM (EAGM): ∆-Stepping + Thread-Level Ordering
• For a specific algorithm, work within a bucket is not ordered: the AGM creates buckets with a global ordering based on the ∆ relation, and we can process them in parallel.
• But why stay unordered? Each rank can process its bucket with a better ordering!
  • 33.
    Extended AGM (EAGM): Hybrid Hierarchical Ordering
• As we distribute and parallelize, we create different spatial domains for processing.
• The Extended AGM uses a different ordering in each domain.
  • 34.
    Extended AGM in General
• An AGM takes a processing function and a strict weak ordering relation.
• An EAGM takes a processing function and a hierarchy of strict weak ordering relations, one attached to each spatial domain: global ordering, node ordering, NUMA ordering, thread ordering.
  • 35.
    Extended AGM (EAGM): e.g., ∆-Stepping with NUMA-Level Ordering
• Work within each bucket is further ordered using NUMA-local priority queues (e.g., the ordering for bucket B1 in NUMA domain 1 of rank 0).
  • 36.
    Extended AGM (EAGM): ∆-Stepping with Node-Level Ordering
• Work within each bucket is further ordered using node-local priority queues (e.g., the ordering for bucket B1 in rank 0).
  • 37.
  • 38.
    AGM Graph Processing Framework
• The processing function for SSSP. For SSSP, a WorkItem is a (vertex, distance) pair. CAS is an atomic compare-and-swap. After executing wi, the processing function generates new workitems and pushes them for ordering.

    template<typename buckets>
    void PF(const WorkItem& wi, int tid, buckets& outset) {
      Vertex v = std::get<0>(wi);
      Distance d = std::get<1>(wi);
      Distance old_dist = vdistance[v];
      while (d < old_dist) {
        Distance din = CAS(d, &vdistance[v]);   // atomic compare & swap
        if (din == old_dist) {
          FORALL_OUTEDGES_T(v, e, g, Graph) {
            Vertex u = boost::target(e, g);
            Weight we = boost::get(vweight, e);
            WorkItem wigen(u, (d + we));
            outset.push(wigen, tid);            // push new work for ordering
          }
          break;
        } else {
          old_dist = din;
        }
      }
    }

• The strict weak ordering partitions WorkItems; in this case the ordering is based on the distance.

    //================== Dijkstra Ordering =====================
    template<int index>
    struct dijkstra : public base_ordering {
    public:
      static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_dijkstra;
      template <typename T>
      bool operator()(T i, T j) {
        return (std::get<index>(i) < std::get<index>(j));
      }
      eagm_ordering name() { return ORDERING_NAME; }
    };
  • 39.
    Invoking the Framework

    // SSSP (AGM) algorithm
    typedef agm<Graph, WorkItem, ProcessingFunction,
                StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
    sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);

• The template parameters select the WorkItem type, the (pre-order) processing function type, the strict weak ordering relation type, and the underlying runtime (the AGM framework abstracts out the runtime).
• The constructor takes the runtime instance, the specific ordering function (e.g., delta ordering), the processing function instance, and the initial workitem set (e.g., the source vertex).
  • 40.
    Mapping AGM to an Implementation
• Q1. How should we do the ordering?
• Q2. How should we arrange the processing function and the ordering?
• Going from the abstract model to an actual implementation also requires choosing a communication method.
  • 41.
    Ordering Implementation
• Every compute node keeps a set of equivalence classes locally in a data structure.
• Equivalence classes are ordered according to the strict weak ordering (swo); since the swo is transitive, ordering the representatives orders the classes.
• Every equivalence class has:
 A representative workitem
 An append buffer that holds all the workitems for the class, i.e., those not comparable to the representative
• The append buffers are processed by shared-memory parallel threads (discussed later).
  • 42.
    Ordering Implementation: Inserts to an Equivalence Class
• Find the equivalence class by comparing the inserted workitem with the representative workitems; the workitem belongs to a class when it is not comparable with that class's representative.
• If there is no such equivalence class, create one with the inserted workitem as the representative.
 Concurrency between inserting threads has to be handled.
  • 43.
    Ordering Implementation: Processing an Equivalence Class
• Find the equivalence class with the minimum representative workitem:
 Every rank sends the representative workitem of the first equivalence class in its list (an all-reduce communicates the minimum workitems).
 The smallest equivalence class is selected.
• Equivalence classes are not distributed uniformly: if a particular node does not have an equivalence class matching the global minimum representative, it inserts one.
  • 44.
    Ordering Implementation: Termination of an Equivalence Class
• While processing an equivalence class, more work can be added to that same class.
• Every node keeps track of the number of workitems pushed into the class and the number of workitems processed from it.
• The difference between those two numbers is globally reduced (sum); when the sum is zero, the framework decides the equivalence class is fully processed.
• Counts are reduced after making sure all messages have been exchanged (i.e., after an epoch).
  • 45.
    AGM Implementation: Data Structure
• A data structure is needed to hold the equivalence classes; later it will be extended for the EAGM levels.
• Requirements:
 Order nodes by representative workitems
 Fast lookups
 Concurrent insert/delete and lookup/search
• This sounds like a Dictionary ADT.
  • 46.
    Data Structure Candidates
• Binary Search Trees (BST) with locks (e.g., RB-trees): inserts/deletes often rebalance the tree, so the whole structure must be locked.
• Linked list with locks and atomics: no need to lock the whole structure (only the node being modified), but lookup time is linear. There is not much difference between the locked and lock-free versions.
• Concurrent SkipList: an alternative probabilistic structure that gives the same complexity guarantees as a BST and avoids rebalancing, but contention is much higher.
• Partitioning scheme: explained on the next slide.
  • 47.
    Data Structure: Partitioning Scheme
• Every node maintains a set of vectors, one per thread; every thread maintains the minimum workitem it has seen so far.
• Push back is O(1) (assuming no resize).
• To process: find the global minimum representative, then partition each vector by the global minimum under the strict weak ordering relation and move the partitioned work to a new vector (O(n)).
• "Thread buckets" is a data structure with one vector per thread that represents the next equivalence class to process.
  • 48.
    Data Structure: Partitioning Scheme (cont.)
• Processing thread buckets avoids load-imbalance issues.
• However, workitems arriving for the bucket currently being processed cannot be inserted into it (we cannot afford a vector resize; supporting that would require locks).
• Workitems pushed for the same bucket are therefore sent directly.
  • 49.
    Data Structure: Summary

    Data Structure                   Execution Time (sec.)
    Linked List (Atomics)            ~61
    Binary Search Trees with Locks   ~82
    Concurrent SkipList              ~89
    Partitioning Scheme              ~44

• Pre-order processing function, SSSP delta ordering.
• Two nodes, 16 shared-memory parallel threads, Graph500 scale 24.
  • 50.
    Mapping AGM to an Implementation (recap)
• Q1. How should we do the ordering?
• Q2. How should we arrange the processing function and the ordering?
  • 51.
    Communication Paradigms
• Push communication: e.g., R1 pushes workitems to R0.
• Pull communication: e.g., R0 pulls workitems from R1.
• GAS (Gather-Apply-Scatter) communication: each rank gathers, applies, and scatters.
  • 52.
    Placing the Processing Function: Pre-Order & Post-Order
• Pre-order: after receiving a workitem, the processing function is executed and the generated workitems are inserted into the data structure.
• Post-order: after receiving a workitem, it is inserted into the data structure for ordering, and the processing function runs when the workitem is scheduled.
  • 53.
    Placing the Processing Function: Split-Order
• The processing function contains logic for updating state and for generating new work.
• We split the processing function into two functions: a state update function and a work generation function.
 The workitem is inserted for ordering only if the state was updated.
 New work is generated only if the state has not changed since, which allows us to prune work further.
  • 54.
    Placing the Processing Function: Summary

    Order         Run Time   Total Inserts   Additional WorkItems Processed
    Split-Order   14.54      17139284        21334
    Pre-Order     62.55      409388039       2469025
    Post-Order    44.09      398344581       2284918

• Split-order has the best timing: the total workitems pushed into the data structure and the additional relaxes are much lower than in the other two implementations, because of pruning.
• Post-order is slightly better than pre-order because of reduced contention.
  • 55.
  • 56.
    EAGM Spatial Levels
• Global spatial level → node spatial level (N0, N1, N2) → NUMA spatial level (numa0, numa1) → thread spatial level (T0–T5).
  • 57.
    EAGM Execution

    // SSSP (AGM) algorithm
    typedef agm<Graph, WorkItem, ProcessingFunction,
                StrictWeakOrdering, RuntimeModelGen> sssp_agm_t;
    sssp_agm_t ssspalgo(rtmodelgen, ordering, pf, initial);

    // SSSP (EAGM) algorithm
    typedef eagm<Graph, WorkItem, ProcessingFunction,
                 EAGMConfig, RuntimeModelGen> sssp_eagm_t;
    sssp_eagm_t ssspalgo(rtmodelgen, config, pf, initial);

• The AGM takes a strict weak ordering relation; the EAGM takes a hierarchy of orderings (EAGMConfig) that corresponds to the memory hierarchy.
• E.g., globally Delta (=5) ordering with chaotic ordering at the node, NUMA, and thread levels.
  • 58.
    EAGMConfig

    // strict weak orderings
    //================== Chaotic =====================
    struct chaotic : public base_ordering {
    public:
      static const eagm_ordering ORDERING_NAME = eagm_ordering::enum_chaotic;
      template <typename T>
      bool operator()(T i, T j) { return false; }  // no two workitems are comparable
      eagm_ordering name() { return ORDERING_NAME; }
    };

    CHAOTIC_ORDERING_T ch;
    DELTA_ORDERING_T delta(agm_params.delta);
    auto config = boost::graph::agm::create_eagm_config(delta, ch, ch, ch);

• create_eagm_config takes the orderings in hierarchy order: globally delta ordering, then chaotic ordering at the node, NUMA, and thread levels.
  • 59.
    EAGM Execution: Globally Synchronous
• Execution is the same as the AGM: equivalence classes are separated over time by global barrier synchronization across N0–N2.
• Execution is globally synchronous but asynchronous within the process, NUMA, and thread levels.
  • 60.
    EAGM Execution: Node-Level Synchronous
• Execution is globally asynchronous: there is a single global equivalence class.
• Execution is node-level synchronous: equivalence classes exist at the process level, separated by local thread barriers.
• Execution is asynchronous at the NUMA and thread levels.
  • 61.
    EAGM Execution: Fully Ordered
• If "non-chaotic" orderings are defined for every memory level, there are global, node, NUMA, and thread equivalence classes, separated by global, NUMA, and thread barriers.
• Spatial ordering spans the memory hierarchy (N0–N2, numa0/numa1, T0–T5); temporal ordering spans execution time.
  • 62.
    EAGM Implementation
• The heart of the EAGM implementation is the nested data structure that holds equivalence classes: global-level classes contain node-level classes, which contain NUMA-level classes.
  • 63.
    EAGM Implementation: Data Structure
• E.g., order by Delta=10 at the outer level, by K=2 at the next level, and by Delta=5 at the inner level, with classes unfolding over time.
  • 64.
    EAGM Implementation: Static Optimizations
• A chaotic ordering always creates a single equivalence class; therefore, we remove the nested levels for chaotic orderings and use an optimized data structure.
• The optimizations are performed statically at compile time with the help of template meta-programming and template specialization.
  • 65.
    EAGM Implementation: Static Optimizations (cont.)
• Removing the nested levels for chaotic orderings: this case is really an AGM.
  • 66.
    EAGM Implementation: Static Optimizations (cont.)
  • 67.
  • 68.
    Single Source Shortest Paths: Pre/Post/Split Orderings
• Weak scaling results carried out on 2 Broadwell 16-core Intel Xeon processors; 2^30 vertices and 2^34 edges; shared-memory execution.
• Split-order is ~5 times faster than pre-order and post-order.
  • 69.
    Single Source Shortest Paths: Weak Scaling
• Graphs: Graph500 graph, Graph500-proposed SSSP graph, Erdős–Rényi graph.
• The framework's algorithms show better scaling than PowerGraph.
• Thread-level Dijkstra ordering shows better scaling behavior for power-law graphs; globally synchronous delta-stepping scales better for ER graphs.
  • 70.
    Single Source Shortest Paths: Strong Scaling (BR II+)
• Relative speedup = time for the fastest sequential algorithm / parallel execution time on n PEs.
• Synchronization overhead becomes significant when executing on a higher number of nodes.
• Ordering by level alone does not reduce the work, but applying Dijkstra ordering at the thread level gives better performance.
  • 71.
    SSSP More Results: Weak Scaling (BR II)
• Neither PowerGraph nor PBGL-1 scales well at larger scales.
  • 72.
    Breadth First Search: Weak Scaling
• K(2) global ordering and level global ordering eliminate more redundant work than thread-level ordering.
  • 73.
    BFS: In-Node Performance (Weak Scaling)
• Global-level synchronous vs. node-level synchronous: within a node, both configurations execute in the same way, so their performance is similar.
  • 74.
    BFS: Strong Scaling
• Global-level synchronous versions show better speedup than thread-level synchronous versions.
  • 75.
    BFS: Road Networks
• Strong scaling with road networks. Road networks have a high diameter (~850–900), so the global-level synchronous version shows poor scaling because of the synchronization overhead.
• Compared: global-level synchronous vs. globally asynchronous with thread-level synchronous.
  • 76.
    Other Graph Applications
• Connected Components
• Maximal Independent Set
• Triangle Counting
  • 77.
    Conclusions
• Using the AGM abstraction, we showed that we can generate families of algorithms by keeping the processing function intact and changing the ordering.
• With the EAGM we can achieve both spatial and temporal orderings.
• We discussed the challenges in mapping the AGM framework to an implementation and described how we avoided them.
• We produced an efficient implementation of the AGM and EAGM frameworks.
 Weak and strong scaling results show that AGM/EAGM algorithms outperform graph algorithms in other frameworks.
  • 78.
    Thank You!

Editor's Notes

  • #33 Work within each bucket is further ordered using thread-local priority queues (threadq).
  • #34 Interesting duality: expressing the algorithm with the AGM lets us pick an ordering that exposes parallelizability, but once parallelized, we can order in different ways. Usually it makes sense to start as loosely as possible and then add more ordering. Work within each bucket is further ordered using thread-local priority queues (threadq).