Integrative Parallel Programming in HPC. Victor Eijkhout. GA Tech, 2014/09/22.
Outline: Introduction, Motivating example, Type system, Demonstration, Other applications, Tasks and processes, Task execution, Research, Conclusion.
Introduction
My aims for a new parallel programming system:
1. There are many types of parallelism ⇒ uniform treatment of parallelism.
2. Data movement is more important than computation ⇒ while acknowledging the realities of hardware.
3. CS theory seems to ignore HPC-type parallelism ⇒ strongly theory based.
IMP: Integrative Model for Parallelism
Design of a programming system. One needs to distinguish: the programming model (how does it look in code), the execution model (how is it actually executed), and the data model (how is data placed and moved about). Three different vocabularies!
Programming model: sequential semantics. "[A]n HPF program may be understood (and debugged) using sequential semantics, a deterministic world that we are comfortable with. Once again, as in traditional programming, the programmer works with a single address space, treating an array as a single, monolithic object, regardless of how it may be distributed across the memories of a parallel machine." (Nikhil 1993) As opposed to: "[H]umans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations." (Sutter and Larus 2005)
Programming model: sequential semantics is close to the mathematics of the problem. Note: sequential semantics in the programming model does not mean BSP synchronization in the execution. Also note: sequential semantics is subtly different from SPMD (but at least SPMD puts you in the asynchronous mindset).
Execution model. Virtual machine: data flow. Data flow expresses the essential dependencies in an algorithm. Data flow applies to multiple parallelism models. But it would be a mistake to program data flow explicitly.
Data model. Distribution: mapping from processors to data. (Note: traditionally this is done the other way around.) Needed, and missing from existing systems such as UPC and HPF: distributions need to be first-class objects ⇒ we want an algebra of distributions; algorithms need to be expressed in distributions.
Integrative Model for Parallelism (IMP). A theoretical model for describing parallelism. A library (or maybe language) for describing operations on parallel data. A minimal, yet sufficient, specification of parallel aspects. Many aspects are formally derived (often as first-class objects), including messages and task dependencies. ⇒ Specify what, not how ⇒ Improve programmer productivity, code quality, efficiency and robustness.
Motivating example
1D example: 3-point averaging. Data parallel calculation: y_i = f(x_{i-1}, x_i, x_{i+1}). Each point has a dependency on three points, some of them on other processing elements.
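In the sequential semantics this is nothing but a loop over the array; a minimal sketch (function and array names are illustrative, and the division by 3 assumes f is the plain average):

// Sequential semantics of the 3-point averaging: every y[i] depends on
// x[i-1], x[i], x[i+1]. Interior points only; boundaries left untouched.
void threepoint_average(const double *x, double *y, int n) {
  for (int i = 1; i < n - 1; ++i)
    y[i] = (x[i-1] + x[i] + x[i+1]) / 3.0;
}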
α, β, γ distributions. Distribution: processor-to-elements mapping. α distribution: data assignment on input. γ distribution: data assignment on output. β distribution: `local data' assignment ⇒ β is dynamically defined from the algorithm.
Data flow. We get a dependency structure. Interpretation: tasks: a local task graph; message passing: messages. Note: this structure follows from the distributions of the algorithm, it is not programmed.
Algorithms in the Integrative Model. Kernel: a mapping between two distributed objects. An algorithm consists of kernels; each kernel consists of independent operations/tasks. Traditional elements of parallel programming are derived from the kernel specification.
Type system
Generalized data parallelism. Functions f : Real^k → Real applied to arrays y = f(x): y_i = f(x(I_f(i))). This defines a function I_f : N → 2^N, for instance I_f(i) = {i, i-1, i+1}.
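Such an I_f can be written down directly; a minimal sketch of the 3-point footprint above (the function name mirrors the slide's notation and is not part of any API):

#include <set>

// The dependency footprint of the 3-point averaging: y_i needs the input
// indices {i, i-1, i+1}. Name and return type are illustrative only.
std::set<long> I_f(long i) {
  return { i, i - 1, i + 1 };
}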
Distributions. A distribution is a (non-disjoint, non-unique) mapping from processors to sets of indices: d : P → 2^N. Distributed data: x(d) : p ↦ {x_i : i ∈ d(p)}. Operations on distributions: g : N → N ⇒ g(d) : p ↦ {g(i) : i ∈ d(p)}.
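These definitions map almost verbatim onto code; a toy sketch, assuming processors are numbered by ints and indices are longs (none of these types or names come from the IMP library):

#include <functional>
#include <map>
#include <set>

// Toy model of a distribution d : P -> 2^N, mapping processor numbers to
// sets of global indices. Not the IMP_distribution class.
using IndexSet     = std::set<long>;
using Distribution = std::map<int, IndexSet>;

// g(d) : p |-> { g(i) : i in d(p) } for a function g : N -> N.
Distribution apply(const Distribution &d, const std::function<long(long)> &g) {
  Distribution gd;
  for (const auto &[p, indices] : d)
    for (long i : indices)
      gd[p].insert(g(i));
  return gd;
}

// The shifted distributions d>>1 and d<<1 used on the next slide.
Distribution shift(const Distribution &d, long s) {
  return apply(d, [s](long i) { return i + s; });
}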
Algorithms in terms of distributions. If d is a distribution, and (in a slightly funky notation) arithmetic on distributed objects is understood pointwise, the motivating example becomes y(d) = x(d) + x(d≫1) + x(d≪1), and the β distribution is β = d ∪ d≫1 ∪ d≪1. To reiterate: the β distribution comes from the structure of the algorithm.
Transformations of distributions. How do you go from the α to the β distribution of a distributed object? x(β) = T(α,β) x(α), where T(α,β) = βα⁻¹. Define βα⁻¹ : P → 2^P by: q ∈ (βα⁻¹)(p) ⟺ α(q) ∩ β(p) ≠ ∅. `If q ∈ (βα⁻¹)(p), the task on q has data for the task on p.' OpenMP: task wait; MPI: message between q and p.
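The relation α(q) ∩ β(p) ≠ ∅ is directly computable; a sketch on the toy Distribution type from above, not the IMP implementation:

#include <algorithm>
#include <map>
#include <set>
#include <utility>
#include <vector>

using IndexSet     = std::set<long>;
using Distribution = std::map<int, IndexSet>;   // same toy type as above

// q is in (beta alpha^{-1})(p) exactly when alpha(q) intersects beta(p):
// processor q holds input data that processor p needs. Each such pair is
// an MPI message q -> p, or an OpenMP task dependency.
std::vector<std::pair<int,int>> message_pairs(const Distribution &alpha,
                                              const Distribution &beta) {
  std::vector<std::pair<int,int>> msgs;
  for (const auto &[p, needs] : beta)
    for (const auto &[q, owns] : alpha)
      if (std::any_of(needs.begin(), needs.end(),
                      [&owns](long i) { return owns.count(i) > 0; }))
        msgs.emplace_back(q, p);
  return msgs;
}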
Parallel computing with transformations. Let y(γ) be the distributed output; then the total needed input is β = I_f(γ), so y(γ) = f(x(β)), which with β = I_f(γ) is a local operation. However, the input is given as x(α), so y = f(Tx), where: y is distributed as y(γ); x is distributed as x(α); β = I_f(γ); T = βα⁻¹.
Data flow. q ∈ (βα⁻¹)(p): parts of a data flow graph can be realized with OMP tasks or MPI messages. The total data flow graph comes from all kernels and all processes in the kernels.

To summarize: the distribution language is global with sequential semantics; it leads to a data flow formulation; it can be interpreted in multiple parallelism modes; execution is likely to be efficient.
Demonstration
Can you code this? As a library / internal DSL: express distributions in a custom API, write the local operation in ordinary C/Fortran ⇒ easy integration in existing codes. As a programming language / external DSL: requires compiler technology ⇒ prospect for interactions between data movement and local code.
Approach taken. The program expresses the sequential semantics of kernels. A base class realizes the IMP concepts; one derived class turns IMP into MPI, another derived class turns IMP into OpenMP+tasks. Total: a few thousand lines.
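That layering might look roughly like the following sketch; the class and method names here are hypothetical, not the actual IMP source:

// Hypothetical sketch of the layering: the base class derives the
// communication pattern from the distributions; derived classes only
// supply the mechanism.
class imp_kernel_base {
public:
  virtual ~imp_kernel_base() {}
  void analyze() { /* derive beta and the q -> p dependencies from alpha, gamma */ }
  void execute() { move_data(); local_execute(); }
protected:
  virtual void move_data() = 0;   // mode-specific data movement
  void local_execute()     { /* call the user-supplied local function */ }
};

class mpi_kernel : public imp_kernel_base {
protected:
  void move_data() override { /* post sends/receives for the derived messages */ }
};

class omp_kernel : public imp_kernel_base {
protected:
  void move_data() override { /* express the derived dependencies as task waits */ }
};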
Code:
IMP_distribution *blocked =
  new IMP_distribution("disjoint-block",problem_environment,globalsize);
for (int step=0; step<=nsteps; ++step) {
  IMP_object *output_vector = new IMP_object( blocked );
  all_objects[step] = output_vector;
}
for (int step=1; step<=nsteps; ++step) {
  IMP_object *input_vector  = all_objects[step-1],
             *output_vector = all_objects[step];
  IMP_kernel *update_step = new IMP_kernel(input_vector,output_vector);
  update_step->localexecutefn = threepoint_execute;
  update_step->add_beta_oper( new ioperator(">>1") );
  update_step->add_beta_oper( new ioperator("<<1") );
  update_step->add_beta_oper( new ioperator("none") );
  queue->add_kernel( step,update_step );
}
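The slides assign threepoint_execute as the local execute function but do not show it; a hedged guess at what such a purely local kernel body could look like (the signature and the halo convention are assumptions, not the IMP interface):

// Hypothetical local function. It assumes IMP has already gathered the
// beta-distribution data into indata, including one halo point on each
// side, so no communication happens here.
void threepoint_execute(double *outdata, const double *indata, int localsize) {
  for (int i = 0; i < localsize; ++i)
    outdata[i] = ( indata[i] + indata[i+1] + indata[i+2] ) / 3.0;
}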
Inspector-executor:
queue->analyze_dependencies();
queue->execute();
Analysis is done once (expensive); execution happens multiple times (very efficient). (In an MPI context you can dispense with the queue and execute kernels directly.)
(Do I really have to put up performance graphs?)
(Plots: Gflop rate under strong scaling of the vector averaging, OpenMP vs. IMP; Gflop rate under weak scaling of the vector averaging, MPI vs. IMP.)
Summary: the motivating example in parallel language. Write the three-point averaging as y(u) = ( x(u) + x(u≫1) + x(u≪1) ) / 3. Global description, sequential semantics. Execution is driven by data flow, with no synchronization. The α-distribution is given by context; the β-distribution is u + u≫1 + u≪1. Messages and task dependencies are derived.
Other applications
N-body problems
Distributions of the N-body problem. Going up the levels: α(k−1) = α(k)/2, β(k) = 2α(k) ∪ ( 2α(k) + |α(k)| ). Redundant computation is never explicitly mentioned. (This can be coded; the code is essentially the same as the formulas.)
Tasks and processes
Task graph. A task is a local execution: Task = Kernel × P. Task numbering: ⟨i,p⟩ where i ≤ n, p ∈ P. Dependency edge: ⟨i,q⟩ → ⟨i+1,p⟩ iff q ∈ (βα⁻¹)(p); also written t′ = ⟨i,q⟩, t = ⟨i+1,p⟩, t′ ≺ t.
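The edge rule translates into a small graph construction; a toy sketch with types and names invented for illustration (IMP derives this structure from the distributions):

#include <map>
#include <set>
#include <utility>

// Toy representation of the task graph: a task <i,p> and, for each task,
// its predecessor set.
using Task = std::pair<int,int>;   // <kernel step i, processor p>

struct TaskGraph {
  std::map<Task, std::set<Task>> pred;
  void add_edge(const Task &from, const Task &to) { pred[to].insert(from); }
};

// Edges of one kernel step, given (beta alpha^{-1}) as a map p -> {q}.
void add_kernel_edges(TaskGraph &graph, int step,
                      const std::map<int, std::set<int>> &beta_alpha_inv) {
  for (const auto &[p, qs] : beta_alpha_inv)
    for (int q : qs)
      graph.add_edge({step, q}, {step + 1, p});
}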
Processors and synchronization. A processor C_p is a (non-disjoint) subset of tasks: Task = ∪_p C_p. For a task t ∈ T we define a task t′ as a synchronization point if t′ is an immediate predecessor on another processor: t ∈ C_p ∧ t′ ≺ t ∧ t′ ∈ C_{p′} ∧ p ≠ p′. If L ⊂ Task, the base B_L is B_L = { t ∈ L : pred(t) ⊄ L }.
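The base B_L is a one-pass computation on the task graph; a sketch reusing the toy Task/TaskGraph types from the previous block (illustrative, not IMP code):

#include <algorithm>
#include <set>

// Base of a task set L: the tasks in L with at least one predecessor outside L,
// i.e. the tasks where L has to wait for data produced elsewhere.
std::set<Task> base_of(const TaskGraph &graph, const std::set<Task> &L) {
  std::set<Task> B;
  for (const Task &t : L) {
    auto it = graph.pred.find(t);
    if (it == graph.pred.end()) continue;   // pred(t) empty: contained in L
    bool all_inside = std::all_of(it->second.begin(), it->second.end(),
                                  [&L](const Task &t0) { return L.count(t0) > 0; });
    if (!all_inside) B.insert(t);
  }
  return B;
}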
Local computations. A two-parameter covering {L_{k,p}}_{k,p} of T is called a set of local computations if:
1. the p index corresponds to the division in processors: C_p = ∪_k L_{k,p};
2. the k index corresponds to the partial ordering on tasks: the sets L_k = ∪_p L_{k,p} satisfy t ∈ L_k ∧ t′ ≺ t ⇒ t′ ∈ ∪_{ℓ≤k} L_ℓ;
3. the synchronization points synchronize only with previous levels: pred(B_{k,p}) ⊂ C_p ∪ ∪_{ℓ<k} L_ℓ.
For a given k, all L_{k,p} can be executed independently.
(Figure: three example coverings (a), (b), (c). Are these local computations? Yes, no, yes.)
Definitions can be given purely in terms of the task graph. The programmer decides how `thick' to make the L_{k,p} covering; communication-avoiding scheduling is formally derived.
Co-processors. Distributions can describe data placement. Our main worry is the latency of data movement: in IMP, data can be sent as early as possible; our communication-avoiding compiler transforms algorithms to maximize granularity.
Task execution
What is a task? A task is a finite state automaton with five states; transitions are triggered by receiving signals from other tasks:
requesting: each task starts out by posting a request for incoming data to each of its predecessors.
accepting: the requested data is in the process of arriving or being made available.
exec: the data dependencies are satisfied and the task can execute locally; in a refinement of this model there can be a separate exec state for each predecessor.
avail: data that was produced and that serves as origin for some dependency is published to all successor tasks.
used: all published origin data has been absorbed by the endpoint of the data dependency, and any temporary buffers can be released.
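The five states and the four control messages can be captured in a small state machine sketch; the mapping of signals to transitions below is a simplification chosen for illustration, not the actual IMP protocol:

enum class TaskState { requesting, accepting, exec, avail, used };
enum class Signal    { notifyReadyToSend, requestToSend, sendData, acknowledgeReceipt };

// Simplified: advance the state when the expected signal has been received
// from all relevant predecessors or successors.
TaskState advance(TaskState s, Signal sig) {
  switch (s) {
  case TaskState::requesting:
    return sig == Signal::notifyReadyToSend  ? TaskState::accepting : s;
  case TaskState::accepting:
    return sig == Signal::sendData           ? TaskState::exec      : s;
  case TaskState::exec:
    return sig == Signal::requestToSend      ? TaskState::avail     : s;
  case TaskState::avail:
    return sig == Signal::acknowledgeReceipt ? TaskState::used      : s;
  case TaskState::used:
    return s;                                // terminal: buffers can be released
  }
  return s;                                  // ignore signals that do not apply
}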
(Diagram: the states of a task on p, the control messages exchanged, and the states of the corresponding tasks on predecessor/successor processors q and s. The control messages notifyReadyToSend, requestToSend, sendData and acknowledgeReceipt drive the transitions requesting → accepting → exec → avail → used.)
How does a processor manage tasks? Theorem: if you get a request-to-send, you can release the send buffers of your predecessor tasks. Corollary: we have a functional model that does not need garbage collection.
Research
Open questions: many! The software is barely in the demonstration stage: it needs much more functionality. Theoretical questions: SSA, cost, scheduling. Practical questions: interaction with local code, heterogeneity, interaction with hardware. Applications: this works for traditional HPC and N-body, probably sorting and graph algorithms; beyond? Software-hardware co-design: the IMP model has semantics for data movement; hardware can be made more efficient using this.
Conclusion
The future's so bright, I gotta wear shades. IMP has the right abstraction level: global expression, yet natural derivation of practical concepts. The concept notation looks humanly possible: a basis for an expressive programming system. Global description without talking about processes/processors: a prospect for heterogeneous programming. All concepts are explicit: middleware for scheduling, resilience, et cetera. Applications to most conceivable scientific operations.