Introduction to Parallel Computing Jörn Dinkla http://www.dinkla.com Version 1.1
Dipl.-Inform. Jörn Dinkla  Java (J2SE, JEE)  Programming Languages  Scala, Groovy, Haskell  Parallel Computing  GPU Computing  Model driven  Eclipse-Plugins
Overview  Progress in computing  Traditional Hard- and Software  Theoretical Computer Science  Algorithms  Machines  Optimization  Parallelization  Parallel Hard- and Software
Progress in Computing 1. New applications  Not feasible before  Not needed before  Not possible before 2. Better applications  Faster  More data  Better quality  precision, accuracy, exactness
Progress in Computing  Two ingredients  Hardware  Machine(s) to execute program  Software  Model / language to formulate program  Libraries  Methods
How was progress achieved?  Hardware  CPU, memory, disks, networks  Faster and larger  Software  New and better algorithms  Programming methods and languages
Traditional Hardware  Von Neumann architecture  (Diagram: CPU with cache, memory, and I/O connected by a bus)  John Backus 1977: “von Neumann bottleneck”
Improvements  Increasing Clock Frequency  Memory Hierarchy / Cache  Parallelizing ALU  Pipelining  Very-long Instruction Words (VLIW)  Instruction-Level parallelism (ILP)  Superscalar processors  Vector data types  Multithreaded  Multicore / Manycore
Moore‘s law  Guaranteed until 2020
Clock frequency  No increase since 2005
Physical Limits  Increasing the clock frequency → much higher energy consumption and heat dissipation  Limit to transistor size  Faster processors impossible!?
2005 “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” Herb Sutter Dr. Dobb’s Journal, March 2005
Multicore  Transistor count doubles every 2-3 years  Single-core calculation speed: no increase  → Multicore  Efficient?
How to use the cores?  Multi-Tasking OS  Different tasks  Speeding up same task  Assume 2 CPUs  Problem is divided in half  Each CPU calculates a half  Time taken is half of the original time?
Traditional Software  Computation is expressed as an “algorithm”  “a step-by-step procedure for calculations”  algorithm = logic + control  Example: 1. Open file 2. For all records in the file: add the salary 3. Close file 4. Print out the sum of the salaries  Keywords  Sequential, serial, deterministic
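A minimal Java sketch of this example, assuming a hypothetical input file salaries.txt with one numeric salary per line (file name and record format are illustrative, not from the slide):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SalarySum {
        public static void main(String[] args) throws IOException {
            long sum = 0;
            // 1. open file, 2. for all records add the salary, 3. close file (try-with-resources)
            try (BufferedReader in = new BufferedReader(new FileReader("salaries.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sum += Long.parseLong(line.trim());   // hypothetical record format: one salary per line
                }
            }
            // 4. print out the sum of the salaries
            System.out.println("Sum of salaries: " + sum);
        }
    }

The control flow is strictly sequential and deterministic: one step after the other.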
Traditional Software  Improvements  Better algorithms  Programming languages (OO)  Development methods (agile)  Limits  Theoretical Computer Science  Complexity theory (NP, P, NC)
Architecture  Simplification: ignore the bus  (Diagram: CPU, memory, and I/O, shown with and without the connecting bus)
More than one CPU?  How should they communicate?  (Diagram: two CPUs, each with its own memory and I/O)
Message Passing  Distributed system  Loose coupling  (Diagram: two CPUs with local memory and I/O exchanging messages over a network)
Shared Memory  Tight coupling  (Diagram: two CPUs and I/O attached to one shared memory)
Shared Memory  Global vs. local  Memory hierarchy  (Diagram: two CPUs, each with local memory and I/O, plus a shared memory)
Overview: Memory  Unshared Memory  Message Passing  Actors  Shared Memory  Threads  Memory hierarchies / hybrid  Partitioned Global Address Space (PGAS)  Transactional Memory
Sequential Algorithms  Random Access Machine (RAM)  Step by step, deterministic
Program (the PC steps through it):
    int sum = 0
    for i=0 to 4
        sum += mem[i]
    mem[5] = sum
Memory: addr 0..4 = 3, 7, 5, 1, 2; addr 5 = 18 (after execution)
Sequential Algorithms
Program:
    int sum = 0
    for i=0 to 4
        sum += mem[i]
Trace: mem[0..4] = 3, 7, 5, 1, 2 stays unchanged; the running sum (shown at addr 5) after each step: 0, 3, 10, 15, 16, 18
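A direct Java transcription of the RAM program above:

    public class SequentialSum {
        public static void main(String[] args) {
            int[] mem = {3, 7, 5, 1, 2, 0};   // addresses 0..5, mem[5] holds the result
            int sum = 0;
            for (int i = 0; i <= 4; i++) {    // for i=0 to 4
                sum += mem[i];
            }
            mem[5] = sum;
            System.out.println(mem[5]);       // prints 18
        }
    }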
More than one CPU  How many programs should run?  One  In lock-step  All processors do the same  In any order  More than one  Distributed system
Two Processors
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
Memory: mem[0..4] = 3, 7, 5, 1, 2; mem[5] = 18
 Lockstep   Memory access!
Flynn‘s Taxonomy  1966
                  Single Instruction   Multiple Instruction
Single Data            SISD                  MISD
Multiple Data          SIMD                  MIMD
Flynn‘s Taxonomy  SISD  RAM, Von Neumann  SIMD  Lockstep, vector processor, GPU  MISD  Fault tolerance  MIMD  Distributed system
Extension MIMD  How many programs?  SPMD  One program  Not in lockstep as in SIMD  MPMD  Many programs
Processes & Threads  Process  Operating system  Address space  IPC  Heavyweight  Contains 1..* threads  Thread  Smallest unit of execution  Lightweight
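A minimal Java sketch of one process starting two lightweight threads and waiting for them (thread names are illustrative):

    public class TwoThreads {
        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> System.out.println(Thread.currentThread().getName() + " running");
            Thread t1 = new Thread(work, "worker-1");
            Thread t2 = new Thread(work, "worker-2");
            t1.start();
            t2.start();
            t1.join();   // wait for both threads to finish
            t2.join();
        }
    }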
Overview: Algorithms  Sequential  Parallel  Concurrent  Distributed  Randomized  Quantum  (the categories overlap)
Computer Science  Theoretical Computer Science  A long time before 2005  1989: Gibbons, Rytter  1990: Ben-Ari  1996: Lynch
Gap: Theory and Practice  Galactic algorithms  Written for abstract machines  PRAM, special networks, etc.  Simplifying assumptions  No boundaries  Exact arithmetic  Infinite memory, network speed, etc.
Sequential algorithms  Implementing a sequential algorithm  Machine architecture  Programming language  Performance  Processor, memory and cache speed  Boundary cases  Sometimes hard
Parallel algorithms  Implementing a parallel algorithm  Adapt algorithm to architecture  No PRAM or sorting network!  Problems with shared memory  Synchronization  Harder!
Parallelization  Transforming a sequential into a parallel algorithm  Tasks  Adapt to architecture  Rewrite  Test correctness w.r.t. the “golden” sequential code
Granularity  “Size” of the threads?  How much computation?  Coarse vs. fine grain  Right choice  Important for good performance  Algorithm design
Computational thinking  “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” Cuny, Snyder, Wing 2010
Computational thinking  “… is the new literacy of the 21st Century.” Cuny, Snyder, Wing 2010  Expert level needed for parallelization!
Problems: Shared Memory  Destructive updates  i += 1  Parallel, independent processes  How do the others know that i increased?  Synchronization needed  Memory barrier  Complicated for beginners
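A small sketch of the problem: two Java threads increment a shared counter without synchronization, so updates are lost (the iteration count is illustrative):

    public class LostUpdates {
        static int counter = 0;   // shared mutable state, no synchronization

        public static void main(String[] args) throws InterruptedException {
            Runnable inc = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    counter += 1;   // read-modify-write, not atomic
                }
            };
            Thread a = new Thread(inc);
            Thread b = new Thread(inc);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(counter);   // usually less than 2000000 due to lost updates
        }
    }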
Problems: Shared Memory
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; mem[5] = sum
Memory: mem[0..4] = 3, 7, 5, 1, 2; mem[5] = 18
 Which one writes mem[5] first?
Problems: Shared Memory
PC 1: int sum = 0; for i=0 to 2: sum += mem[i]; mem[5] = sum; sync()
PC 2: int sum = 0; for i=3 to 4: sum += mem[i]; sync(); mem[5] += sum
 Synchronization needed
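A Java sketch of the same idea: each thread computes a partial sum and the write to mem[5] is protected. The sync() on the slide is abstract; a synchronized block here is one possible realization, not the only one:

    public class ParallelSum {
        static final int[] mem = {3, 7, 5, 1, 2, 0};

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> addRange(0, 2));   // PC 1
            Thread t2 = new Thread(() -> addRange(3, 4));   // PC 2
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(mem[5]);   // 18
        }

        static void addRange(int from, int to) {
            int sum = 0;
            for (int i = from; i <= to; i++) sum += mem[i];
            synchronized (mem) {          // synchronize the shared write
                mem[5] += sum;
            }
        }
    }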
Problems: Shared Memory  The memory barrier  When is a value read or written?  Optimizing compilers change semantics  int a = b + 5  Read b  Add 5 to b, store temporary in c  Write c to a  Solutions (Java)  volatile  java.util.concurrent.atomic
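A sketch of the two Java solutions named above (the class is illustrative): volatile makes a flag’s value visible across threads, java.util.concurrent.atomic makes read-modify-write atomic:

    import java.util.concurrent.atomic.AtomicInteger;

    public class VisibilityAndAtomics {
        static volatile boolean done = false;          // writes become visible to other threads
        static final AtomicInteger hits = new AtomicInteger();

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                while (!done) {
                    hits.incrementAndGet();            // atomic increment, no lost updates
                }
            });
            worker.start();
            Thread.sleep(100);
            done = true;                               // without volatile the worker might never see this
            worker.join();
            System.out.println("increments: " + hits.get());
        }
    }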
Problems: Shared Memory  Thread safety  Reentrant code class X { int x; void inc() { x+=1; } }
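class X above is not thread-safe: x += 1 is a read-modify-write. Two common fixes, sketched with a synchronized method and with an AtomicInteger:

    import java.util.concurrent.atomic.AtomicInteger;

    class SynchronizedX {
        private int x;
        synchronized void inc() { x += 1; }    // only one thread at a time
        synchronized int get() { return x; }
    }

    class AtomicX {
        private final AtomicInteger x = new AtomicInteger();
        void inc() { x.incrementAndGet(); }    // lock-free atomic update
        int get() { return x.get(); }
    }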
Problems: Threads  Deadlock  A wants B, B wants A, both waiting  Starvation  A wants B, but never gets it  Race condition  A writes to mem, B reads/writes mem
Shared Mem: Solutions  Shared mutable state  Synchronize properly  Isolated mutable state  Don‘t share state  Immutable or unshared  Don‘t mutate state!
Solutions  Transactional Memory  Every access within transaction  See databases  Actor models  Message passing  Immutable state / pure functional
Speedup and Efficiency  Running time  T(1) with one processor  T(n) with n processors  Speedup  How much faster?  S(n) = T(1) / T(n)
Speedup and Efficiency  Efficiency  Are all the processors used?  E(n) = S(n) / n = T(1) / (n * T(n))
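Worked example (illustrative numbers): with T(1) = 10 s and T(4) = 4 s, S(4) = 10 / 4 = 2.5 and E(4) = 2.5 / 4 = 0.625, i.e. on average the four processors do useful work only 62.5% of the time.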
Amdahl‘s Law  The serial fraction (1 - P) limits the speedup  S(n) = 1 / ((1 - P) + P / n)  For n → ∞: S → 1 / (1 - P)
Amdahl‘s Law  Corollary  Maximize the parallel part  Only parallelize when the parallel part is large enough
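Worked example (illustrative numbers): with parallel fraction P = 0.9 and n = 8 processors, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7; even for n → ∞ the speedup stays below 1 / 0.1 = 10.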
P-Completeness  Is there an efficient parallel version of every algorithm?  No! Some problems are hardly parallelizable  P-completeness  Example: the Circuit Value Problem (CVP)
Optimization  What can I achieve?  When do I stop?  How many threads should I use?
Optimization  I/O bound  Thread is waiting for memory, disk, etc.  Computation bound  Thread is calculating the whole time  Watch processor utilization!
Optimization  I/O bound  Use asynchronous/non-blocking I/O  Increase number of threads  Computation bound  Number of threads = Number of cores
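A Java sketch of this rule of thumb, sizing thread pools by the number of cores (the ×4 oversubscription for the I/O-bound pool is an illustrative choice, not a fixed rule):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();

            // computation bound: one thread per core
            ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

            // I/O bound: more threads than cores, since most of them are waiting
            ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

            System.out.println("cores = " + cores);
            cpuPool.shutdown();
            ioPool.shutdown();
        }
    }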
Processors  Multicore CPU  Graphics Processing Unit (GPU)  Field-Programmable Gate Array (FPGA)
GPU Computing  Finer granularity than CPU  Specialized processors  e.g. 512 cores on a Fermi  High memory bandwidth (192 GB/s)
CPU vs. GPU  Source: SGI
FPGA  Configurable hardware circuits  Programmed in Verilog, VHDL  Now: OpenCL  Much higher level of abstraction  Under development, promising  No performance test results yet (2011/12)
Networks / Cluster  Combinations of CPU, memory, network, GPU, FPGA  Vast possibilities
Example  2 nodes connected by a network  Each node: 2 CPUs with local caches  Global memory per node
Example  1 CPU with local cache  Connected by shared memory  2 GPUs, each with local (“device”) memory
Next Step: Hybrid  Hybrid / heterogeneous  Multi-core / many-core  Plus special-purpose hardware  GPU  FPGA
Optimal combination?  Which network gives the best performance?  Complicated  Technical restrictions  Motherboards with 4× PCI Express x16 slots  Power consumption  Cooling
Example: K-Computer  SPARC64 VIIIfx 2.0GHz  705024 Cores  10.51 Petaflop/s  No GPUs  #1 2011
Example: Tianhe-1A  14336 Xeon X5670  7168 Tesla M2050  2048 NUDT FT1000  2.57 petaflop/s  #2 2011
Example: HPC at home  Workstations and blades  8 x 512 cores = 4096 cores
Frameworks: Shared Mem  C/C++  OpenMP  POSIX Threads (pthreads)  Intel Threading Building Blocks  Windows Threads  Java  java.util.concurrent
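A sketch of the running sum example with java.util.concurrent: tasks are submitted as Callables to an ExecutorService and the partial results are combined via Futures:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.*;

    public class ExecutorSum {
        public static void main(String[] args) throws Exception {
            int[] mem = {3, 7, 5, 1, 2};
            ExecutorService pool = Executors.newFixedThreadPool(2);

            Callable<Integer> lower = () -> partialSum(mem, 0, 2);
            Callable<Integer> upper = () -> partialSum(mem, 3, 4);
            List<Future<Integer>> results = pool.invokeAll(Arrays.asList(lower, upper));

            int sum = 0;
            for (Future<Integer> f : results) sum += f.get();   // combine the partial sums
            System.out.println(sum);   // 18
            pool.shutdown();
        }

        static int partialSum(int[] a, int from, int to) {
            int s = 0;
            for (int i = from; i <= to; i++) s += a[i];
            return s;
        }
    }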
Frameworks: Actors  C/C++  Theron  Java / JVM  Akka  Scala  GPars (Groovy)
GPU Computing  NVIDIA CUDA (NVIDIA)  OpenCL (AMD, NVIDIA, Intel, Altera, Apple)  WebCL (Nokia, Samsung)
Advanced courses  Best practices for concurrency in Java  Java‘s java.util.concurrent  Actor models  Transactional Memory  See http://www.dinkla.com
Advanced courses  GPU Computing  NVIDIA CUDA  OpenCL  Using NVIDIA CUDA with Java  Using OpenCL with Java  See http://www.dinkla.com
References: Practice  Mattson, Sanders, Massingill  Patterns for Parallel Programming  Breshears  The Art of Concurrency
References: Practice  Pacheco  An Introduction to Parallel Programming  Herlihy, Shavit  The Art of Multiprocessor Programming
References: Theory  Gibbons, Rytter  Efficient Parallel Algorithms  Lynch  Distributed Algorithms  Ben-Ari  Principles of Concurrent and Distributed Programming
References: GPU Computing  Scarpino  OpenCL in Action  Sanders, Kandrot  CUDA by Example
References: Background  Hennessy, Patterson  Computer Architecture: A Quantitative Approach
