Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
Dipl.-Inform. Jörn Dinkla  Java (J2SE, JEE)  Programming Languages  Scala, Groovy, Haskell  Parallel Computing  GPU Computing  Model driven  Eclipse-Plugins
Overview  Progress in computing  Traditional Hard- and Software  Theoretical Computer Science  Algorithms  Machines  Optimization  Parallelization  Parallel Hard- and Software
Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
Progress in Computing  Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
How was progress achieved?  Hardware  CPU, memory, disks, networks  Faster and larger  Software  New and better algorithms  Programming methods and languages
Traditional Hardware  Von Neumann architecture  (Diagram: CPU with cache, memory, and I/O connected by a bus)  John Backus 1977: “von Neumann bottleneck”
Improvements  Increasing Clock Frequency  Memory Hierarchy / Cache  Parallelizing ALU  Pipelining  Very-long Instruction Words (VLIW)  Instruction-Level parallelism (ILP)  Superscalar processors  Vector data types  Multithreaded  Multicore / Manycore
Moore‘s law  Guaranteed until 2020
Clock frequency  No increase since 2005
Physical Limits  Increasing the clock frequency → much higher energy consumption and heat dissipation  Limit to transistor size  Faster processors impossible!?
2005 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
Multicore  Transistor count doubles every 2-3 years  Single-core calculation speed: no increase  → Multicore  Efficient?
How to use the cores?  Multi-Tasking OS  Different tasks  Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
Traditional Software  Computation is expressed as an “algorithm”  “a step-by-step procedure for calculations”  algorithm = logic + control  Example: 1. Open file 2. For all records in the file: add the salary 3. Close file 4. Print out the sum of the salaries  Keywords  Sequential, serial, deterministic
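A minimal Java sketch of this example, assuming a hypothetical input file salaries.txt with one numeric salary per line (file name and record format are illustrative, not from the slide):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SalarySum {
        public static void main(String[] args) throws IOException {
            long sum = 0;
            // 1. open file, 2. for all records add the salary, 3. close file (try-with-resources)
            try (BufferedReader in = new BufferedReader(new FileReader("salaries.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sum += Long.parseLong(line.trim());   // hypothetical record format: one salary per line
                }
            }
            // 4. print out the sum of the salaries
            System.out.println("Sum of salaries: " + sum);
        }
    }

The control flow is strictly sequential and deterministic: one step after the other.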
Traditional Software  Improvements  Better algorithms  Programming languages (OO)  Development methods (agile)  Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
Architecture  Simplification: ignore the bus  (Diagram: CPU, memory, and I/O, shown with and without the connecting bus)
More than one CPU?  How should they communicate?  (Diagram: two CPUs, each with its own memory and I/O)
Message Passing  Distributed system  Loose coupling  (Diagram: two CPUs with local memory and I/O exchanging messages over a network)
Shared Memory  Tight coupling  (Diagram: two CPUs and I/O attached to one shared memory)
Shared Memory  Global vs. local  Memory hierarchy  (Diagram: two CPUs, each with local memory and I/O, plus a shared memory)
Overview: Memory  Unshared Memory  Message Passing  Actors  Shared Memory  Threads  Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS)  Transactional Memory
Sequential Algorithms  Random Access Machine (RAM)  Step by step, deterministic
Program (the PC steps through it):
    int sum = 0
    for i=0 to 4
        sum += mem[i]
    mem[5] = sum
Memory: addr 0..4 = 3, 7, 5, 1, 2; addr 5 = 18 (after execution)
Sequential Algorithms
Program:
    int sum = 0
    for i=0 to 4
        sum += mem[i]
Trace: mem[0..4] = 3, 7, 5, 1, 2 stays unchanged; the running sum (shown at addr 5) after each step: 0, 3, 10, 15, 16, 18
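A direct Java transcription of the RAM program above:

    public class SequentialSum {
        public static void main(String[] args) {
            int[] mem = {3, 7, 5, 1, 2, 0};   // addresses 0..5, mem[5] holds the result
            int sum = 0;
            for (int i = 0; i <= 4; i++) {    // for i=0 to 4
                sum += mem[i];
            }
            mem[5] = sum;
            System.out.println(mem[5]);       // prints 18
        }
    }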
More than one CPU  How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
Two Processors
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
Memory: mem[0..4] = 3, 7, 5, 1, 2; mem[5] = 18
 Lockstep   Memory access!
Flynn‘s Taxonomy  1966
                  Single Instruction   Multiple Instruction
Single Data            SISD                  MISD
Multiple Data          SIMD                  MIMD
Flynn‘s Taxonomy  SISD  RAM, Von Neumann  SIMD  Lockstep, vector processor, GPU  MISD  Fault tolerance  MIMD  Distributed system
Extension MIMD  How many programs?  SPMD  One program  Not in lockstep as in SIMD  MPMD  Many programs
Processes & Threads  Process  Operating system  Address space  IPC  Heavyweight  Contains 1..* threads  Thread  Smallest unit of execution  Lightweight
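A minimal Java sketch of one process starting two lightweight threads and waiting for them (thread names are illustrative):

    public class TwoThreads {
        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> System.out.println(Thread.currentThread().getName() + " running");
            Thread t1 = new Thread(work, "worker-1");
            Thread t2 = new Thread(work, "worker-2");
            t1.start();
            t2.start();
            t1.join();   // wait for both threads to finish
            t2.join();
        }
    }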
Overview: Algorithms  Sequential  Parallel  Concurrent  Distributed  Randomized  Quantum  (the categories overlap)
Computer Science  Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
Gap: Theory and Practice  Galactic algorithms  Written for abstract machines  PRAM, special networks, etc.  Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
Sequential algorithms  Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
Parallel algorithms  Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
Parallelization  Transforming a sequential into a parallel algorithm  Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the “golden” sequential code
Granularity  “Size” of the threads?  How much computation?  Coarse vs. fine grain  Right choice  Important for good performance  Algorithm design
Computational thinking  “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
Computational thinking  “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010  Expert level needed for parallelization!
Problems: Shared Memory  Destructive updates  i += 1  Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
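A small sketch of the problem: two Java threads increment a shared counter without synchronization, so updates are lost (the iteration count is illustrative):

    public class LostUpdates {
        static int counter = 0;   // shared mutable state, no synchronization

        public static void main(String[] args) throws InterruptedException {
            Runnable inc = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    counter += 1;   // read-modify-write, not atomic
                }
            };
            Thread a = new Thread(inc);
            Thread b = new Thread(inc);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(counter);   // usually less than 2000000 due to lost updates
        }
    }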
Problems: Shared Memory
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
Memory: mem[0..4] = 3, 7, 5, 1, 2; mem[5] = 18
 Which one writes mem[5] first?
Problems: Shared Memory
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync()
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum
 Synchronization needed
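A Java sketch of the same idea: each thread computes a partial sum and the write to mem[5] is protected. The sync() on the slide is abstract; a synchronized block here is one possible realization, not the only one:

    public class ParallelSum {
        static final int[] mem = {3, 7, 5, 1, 2, 0};

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> addRange(0, 2));   // PC 1
            Thread t2 = new Thread(() -> addRange(3, 4));   // PC 2
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(mem[5]);   // 18
        }

        static void addRange(int from, int to) {
            int sum = 0;
            for (int i = from; i <= to; i++) sum += mem[i];
            synchronized (mem) {          // synchronize the shared write
                mem[5] += sum;
            }
        }
    }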
Problems: Shared Memory  The memory barrier  When is a value read or written?  Optimizing compilers change semantics  int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a  Solutions (Java)  volatile  java.util.concurrent.atomic
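A sketch of the two Java solutions named above (the class is illustrative): volatile makes a flag’s value visible across threads, java.util.concurrent.atomic makes read-modify-write atomic:

    import java.util.concurrent.atomic.AtomicInteger;

    public class VisibilityAndAtomics {
        static volatile boolean done = false;          // writes become visible to other threads
        static final AtomicInteger hits = new AtomicInteger();

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                while (!done) {
                    hits.incrementAndGet();            // atomic increment, no lost updates
                }
            });
            worker.start();
            Thread.sleep(100);
            done = true;                               // without volatile the worker might never see this
            worker.join();
            System.out.println("increments: " + hits.get());
        }
    }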
Problems: Shared Memory  Thread safety  Reentrant code class X { int x; void inc() { x+=1; } }
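class X above is not thread-safe: x += 1 is a read-modify-write. Two common fixes, sketched with a synchronized method and with an AtomicInteger:

    import java.util.concurrent.atomic.AtomicInteger;

    class SynchronizedX {
        private int x;
        synchronized void inc() { x += 1; }    // only one thread at a time
        synchronized int get() { return x; }
    }

    class AtomicX {
        private final AtomicInteger x = new AtomicInteger();
        void inc() { x.incrementAndGet(); }    // lock-free atomic update
        int get() { return x.get(); }
    }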
Problems: Threads  Deadlock  A wants B, B wants A, both waiting  Starvation  A wants B, but never gets it  Race condition  A writes to mem, B reads/writes mem
Shared Mem: Solutions  Shared mutable state  Synchronize properly  Isolated mutable state  Don‘t share state  Immutable or unshared  Don‘t mutate state!
Solutions  Transactional Memory  Every access within transaction  See databases  Actor models  Message passing  Immutable state / pure functional
Speedup and Efficiency  Running time  T(1) with one processor  T(n) with n processors  Speedup  How much faster?  S(n) = T(1) / T(n)
Speedup and Efficiency  Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
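Worked example (illustrative numbers): with T(1) = 10 s and T(4) = 4 s, S(4) = 10 / 4 = 2.5 and E(4) = 2.5 / 4 = 0.625, i.e. on average the four processors do useful work only 62.5% of the time.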
Amdahl‘s Law  The serial fraction (1 - P) limits the speedup  S(n) = 1 / ((1 - P) + P / n)  For n → ∞: S → 1 / (1 - P)
Amdahl‘s Law  Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
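Worked example (illustrative numbers): with parallel fraction P = 0.9 and n = 8 processors, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7; even for n → ∞ the speedup stays below 1 / 0.1 = 10.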
P-Completeness  Is there an efficient parallel version of every algorithm?  No! Some problems are hardly parallelizable  P-completeness  Example: the Circuit Value Problem (CVP)
Optimization  What can I achieve?  When do I stop?  How many threads should I use?
Optimization  I/O bound  Thread is waiting for memory, disk, etc.  Computation bound  Thread is calculating the whole time  Watch processor utilization!
Optimization  I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads  Computation bound  Number of threads = Number of cores
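A Java sketch of this rule of thumb, sizing thread pools by the number of cores (the ×4 oversubscription for the I/O-bound pool is an illustrative choice, not a fixed rule):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();

            // computation bound: one thread per core
            ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

            // I/O bound: more threads than cores, since most of them are waiting
            ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

            System.out.println("cores = " + cores);
            cpuPool.shutdown();
            ioPool.shutdown();
        }
    }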
Processors  Multicore CPU  Graphics Processing Unit (GPU)  Field-Programmable Gate Array (FPGA)
GPU Computing  Finer granularity than CPU  Specialized processors  e.g. 512 cores on a Fermi  High memory bandwidth (192 GB/s)
CPU vs. GPU  Source: SGI
FPGA  Configurable hardware circuits  Programmed in Verilog, VHDL  Now: OpenCL  Much higher level of abstraction  Under development, promising  No performance test results yet (2011/12)
Networks / Cluster  Combinations of CPU, memory, network, GPU, FPGA  Vast possibilities
Example  2 nodes connected by a network  Each node: 2 CPUs with local caches  Global memory per node
Example  1 CPU with local cache  Connected by shared memory  2 GPUs, each with local (“device”) memory
Next Step: Hybrid  Hybrid / heterogeneous  Multi-core / many-core  Plus special-purpose hardware  GPU  FPGA
Optimal combination?  Which network gives the best performance?  Complicated  Technical restrictions  Motherboards with 4× PCI Express x16 slots  Power consumption  Cooling
Example: K-Computer  SPARC64 VIIIfx 2.0GHz  705024 Cores  10.51 Petaflop/s  No GPUs  #1 2011
Example: Tianhe-1A  14336 Xeon X5670  7168 Tesla M2050  2048 NUDT FT1000  2.57 petaflop/s  #2 2011
Example: HPC at home  Workstations and blades  8 x 512 cores = 4096 cores
Frameworks: Shared Mem  C/C++  OpenMP  POSIX Threads (pthreads)  Intel Threading Building Blocks  Windows Threads  Java  java.util.concurrent
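A sketch of the running sum example with java.util.concurrent: tasks are submitted as Callables to an ExecutorService and the partial results are combined via Futures:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.*;

    public class ExecutorSum {
        public static void main(String[] args) throws Exception {
            int[] mem = {3, 7, 5, 1, 2};
            ExecutorService pool = Executors.newFixedThreadPool(2);

            Callable<Integer> lower = () -> partialSum(mem, 0, 2);
            Callable<Integer> upper = () -> partialSum(mem, 3, 4);
            List<Future<Integer>> results = pool.invokeAll(Arrays.asList(lower, upper));

            int sum = 0;
            for (Future<Integer> f : results) sum += f.get();   // combine the partial sums
            System.out.println(sum);   // 18
            pool.shutdown();
        }

        static int partialSum(int[] a, int from, int to) {
            int s = 0;
            for (int i = from; i <= to; i++) s += a[i];
            return s;
        }
    }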
Frameworks: Actors  C/C++  Theron  Java / JVM  Akka  Scala  GPars (Groovy)
GPU Computing  NVIDIA CUDA (NVIDIA)  OpenCL (AMD, NVIDIA, Intel, Altera, Apple)  WebCL (Nokia, Samsung)
Advanced courses  Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory  See http://www.dinkla.com
Advanced courses  GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java  See http://www.dinkla.com
References: Practice  Mattson, Sanders, Massingill  Patterns for Parallel Programming  Breshears  The Art of Concurrency
References: Practice  Pacheco  An Introduction to Parallel Programming  Herlihy, Shavit  The Art of Multiprocessor Programming
References: Theory  Gibbons, Rytter  Efficient Parallel Algorithms  Lynch  Distributed Algorithms  Ben-Ari  Principles of Concurrent and Distributed Programming
References: GPU Computing  Scarpino  OpenCL in Action  Sanders, Kandrot  CUDA by Example
References: Background  Hennessy, Patterson  Computer Architecture: A Quantitative Approach
