Lecture 2 More about Parallel Computing Vajira Thambawita
Parallel Computer Memory Architectures - Shared Memory • Multiple processors can work independently but share the same memory resources • Shared memory machines can be divided into two groups based upon memory access time: UMA: Uniform Memory Access, NUMA: Non-Uniform Memory Access
Parallel Computer Memory Architectures - Shared Memory: Uniform Memory Access (UMA) • Equal access and access times to memory for all processors • Most commonly represented today by Symmetric Multiprocessor (SMP) machines
Parallel Computer Memory Architectures - Shared Memory: Non-Uniform Memory Access (NUMA) • Not all processors have equal memory access time
Parallel Computer Memory Architectures - Distributed Memory • Processors have their own local memory (there is no concept of a global address space) • Each processor operates independently • Communications in message-passing systems are performed via send and receive operations
Parallel Computer Memory Architectures – Hybrid Distributed-Shared Memory • Used in the largest and fastest computers in the world today
Parallel Programming Models Shared Memory Model (without threads) • In this programming model, processes/tasks share a common address space, which they read and write to asynchronously.
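A minimal sketch of this model (not from the slides), using POSIX shared memory in C; the object name and stored value are assumptions for illustration. A second process that maps the same object would read or write the value asynchronously. On older glibc, link with -lrt.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const char *name = "/demo_shm";                 /* hypothetical object name */
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        ftruncate(fd, sizeof(int));

        /* Map the shared object into this process's address space. */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        *shared = 123;                                  /* visible to other processes mapping the same object */
        printf("wrote %d to shared memory\n", *shared);

        munmap(shared, sizeof(int));
        shm_unlink(name);
        return 0;
    }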
Parallel Programming Models Threads Model • This programming model is a type of shared memory programming. • In the threads model of parallel programming, a single "heavy weight" process can have multiple "light weight" concurrent execution paths. • Ex: POSIX Threads, OpenMP, Microsoft threads, Java threads, Python threads, CUDA threads for GPUs
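A minimal sketch of the threads model using OpenMP (one of the examples named above); the array size and contents are illustrative. Compile with an OpenMP flag, e.g. gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        /* Each light-weight thread sums part of the shared array;
         * the reduction clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }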
Parallel Programming Models Distributed Memory / Message Passing Model • A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. • Tasks exchange data through communications by sending and receiving messages. • Ex: • Message Passing Interface (MPI)
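A minimal MPI sketch of two tasks exchanging data via send and receive operations; the message value is illustrative. Run with, for example, mpirun -np 2.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Task 0 sends a value from its local memory to task 1. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Task 1 receives the value into its own local memory. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Task 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }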
Parallel Programming Models Data Parallel Model • May also be referred to as the Partitioned Global Address Space (PGAS) model. • Ex: Coarray Fortran, Unified Parallel C (UPC), X10
Parallel Programming Models Hybrid Model • A hybrid model combines more than one of the previously described programming models.
Parallel Programming Models SPMD and MPMD • Single Program Multiple Data (SPMD): all tasks execute a copy of the same program, typically on different data • Multiple Program Multiple Data (MPMD): tasks may execute different programs on different data
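A minimal SPMD sketch, assuming MPI: every task runs the same program but branches on its rank, so different tasks take on different roles.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            printf("Task 0 of %d: coordinating\n", size);          /* e.g. I/O, bookkeeping */
        else
            printf("Task %d of %d: computing my portion\n", rank, size);

        MPI_Finalize();
        return 0;
    }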
Designing Parallel Programs Automatic vs. Manual Parallelization • Fully Automatic • The compiler analyzes the source code and identifies opportunities for parallelism. • The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance. • Loops (do, for) are the most frequent target for automatic parallelization. • Programmer Directed • Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code. • May be used in conjunction with some degree of automatic parallelization.
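A minimal sketch of programmer-directed parallelization, using an OpenMP directive as one example of a compiler directive; the function and its arguments are hypothetical. Without the OpenMP compile flag the directive is simply ignored and the loop runs serially, which is part of the appeal of this approach.

    /* The directive tells the compiler that the loop iterations are
     * independent and may be divided among threads. */
    void scale(double *x, int n, double c) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = c * x[i];
    }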
Designing Parallel Programs Understand the Problem and the Program • An easy-to-parallelize problem • Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation. • A problem with little-to-no parallelism • Calculation of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) by use of the formula F(n) = F(n-1) + F(n-2), where each term depends on the two preceding terms, so the terms cannot be computed independently.
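A small illustrative C sketch of the iterative Fibonacci computation, showing the dependency that inhibits parallelism: each iteration needs the results of the previous ones, so the loop cannot be split across tasks.

    /* Iterative Fibonacci: each step depends on the two previous results. */
    long fib(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) {
            long next = a + b;        /* depends on earlier iterations */
            a = b;
            b = next;
        }
        return a;                     /* fib(0)=0, fib(1)=1, fib(2)=1, ... */
    }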
Designing Parallel Programs Partitioning • One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning. Two ways: • Domain decomposition • Functional decomposition
Designing Parallel Programs Domain Decomposition • The data associated with the problem is decomposed, and each parallel task then works on a portion of the data. • There are different ways to partition the data.
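A minimal sketch of one partitioning scheme, a block (contiguous-chunk) decomposition; N and ntasks are illustrative values.

    #include <stdio.h>

    int main(void) {
        const int N = 100, ntasks = 4;                /* illustrative sizes */
        int chunk = (N + ntasks - 1) / ntasks;        /* ceiling division */
        for (int t = 0; t < ntasks; t++) {
            int start = t * chunk;
            int end   = (start + chunk < N) ? start + chunk : N;
            printf("task %d owns elements [%d, %d)\n", t, start, end);
        }
        return 0;
    }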
Designing Parallel Programs Functional Decomposition The problem is decomposed according to the work that must be done
Designing Parallel Programs You DON'T need communications • Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. • Ex: Every pixel in a black-and-white image needs to have its color reversed You DO need communications • These require tasks to share data with each other • Ex: A 2-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data
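A serial 1-D sketch of a heat-diffusion update, showing why communication is needed: each new value depends on its left and right neighbors, so at a chunk boundary a parallel task would have to obtain values owned by a neighboring task. The grid size, constants, and boundary value are illustrative.

    #include <stdio.h>

    #define N 10

    int main(void) {
        double u[N] = {0}, unew[N] = {0};
        u[0] = 100.0;                       /* fixed boundary temperature */
        const double alpha = 0.25;          /* diffusion coefficient * dt/dx^2 */

        for (int step = 0; step < 100; step++) {
            /* Each update reads u[i-1] and u[i+1]: the neighbor dependency. */
            for (int i = 1; i < N - 1; i++)
                unew[i] = u[i] + alpha * (u[i-1] - 2.0 * u[i] + u[i+1]);
            for (int i = 1; i < N - 1; i++)
                u[i] = unew[i];
        }
        printf("u[1] = %f\n", u[1]);
        return 0;
    }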
Designing Parallel Programs Factors to Consider (designing your program's inter-task communications) • Communication overhead • Latency vs. Bandwidth • Visibility of communications • Synchronous vs. asynchronous communications • Scope of communications • Efficiency of communications
Designing Parallel Programs Granularity • In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. (Computation / Communication) • Periods of computation are typically separated from periods of communication by synchronization events. • Fine-grain Parallelism • Coarse-grain Parallelism
Designing Parallel Programs • Fine-grain Parallelism • Relatively small amounts of computational work are done between communication events • Low computation to communication ratio • Facilitates load balancing • Implies high communication overhead and less opportunity for performance enhancement • If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • Coarse-grain Parallelism • Relatively large amounts of computational work are done between communication/synchronization events • High computation to communication ratio • Implies more opportunity for performance increase • Harder to load balance efficiently
Designing Parallel Programs I/O • Rule #1: Reduce overall I/O as much as possible • If you have access to a parallel file system, use it. • Writing large chunks of data rather than small chunks is usually significantly more efficient. • Fewer, larger files perform better than many small files. • Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks. For example, Task 1 could read an input file and then communicate required data to other tasks. Likewise, Task 1 could perform the write operation after receiving required data from all other tasks. • Aggregate I/O operations across tasks - rather than having many tasks perform I/O, have a subset of tasks perform it.
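A minimal MPI sketch of confining I/O to one task, assuming the input is a single integer: one task would do the reading and then broadcast the result, so only a single task touches the file system.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, n = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* In a real program, task 0 would read n from an input file here. */
            n = 1000;
        }
        /* Distribute the value read by task 0 to every other task. */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("task %d has n = %d\n", rank, n);
        MPI_Finalize();
        return 0;
    }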
Designing Parallel Programs Debugging • TotalView from RogueWave Software • DDT from Allinea • Inspector from Intel
Performance Analysis and Tuning • LC's web pages at https://hpc.llnl.gov/software/development-environment-software • TAU: http://www.cs.uoregon.edu/research/tau/docs.php • HPCToolkit: http://hpctoolkit.org/documentation.html • Open|Speedshop: http://www.openspeedshop.org/ • Vampir / Vampirtrace: http://vampir.eu/ • Valgrind: http://valgrind.org/ • PAPI: http://icl.cs.utk.edu/papi/ • mpitrace: https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace • mpiP: http://mpip.sourceforge.net/ • memP: http://memp.sourceforge.net/
Summary
