Parallel Computing Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-UA.0480-003 Performance Analysis
Defining Performance • Which airplane has the best performance? 0 100 200 300 400 500 Douglas DC-8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Passenger Capacity 0 2000 4000 6000 8000 10000 Douglas DC- 8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Cruising Range (miles) 0 500 1000 1500 Douglas DC-8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Cruising Speed (mph) 0 100000 200000 300000 400000 Douglas DC- 8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Passengers x mph
• For some program running on machine X, PerformanceX = 1 / Execution timeX • "X is n times faster than Y" PerformanceX / PerformanceY = n Standard Definition of Performance  Example: time taken to run a program  10s on A, 15s on B  Execution TimeB / Execution TimeA = 15s / 10s = 1.5  So A is 1.5 times faster than B
Speedup • Number of cores = p • Serial run-time = Tserial • Parallel run-time = Tparallel Copyright © 2010, Elsevier Inc. All rights Reserved Tserial Tparallel S =
Example Processor 1 time 100 time 1 2 3 4 25 25 25 25 time 1 2 3 4 35 35 35 35 ,0.4 25 100 pS balancingloadperfect ,85.2 35 100 pS Perfect parallelization! Does it ever occur? synch cost
Example (cont.) time 1 2 3 4 30 20 40 10 time 1 2 3 4 50 50 50 50 imbalanceload ,5.2 40 100 pS costsyncand imbalanceload ,0.2 50 100 pS closest to real life parallel programs Processor 1 time 100
A Glimpse at the Top 500 (Nov 2019 List) Rmax: maximum achieved Rpeak: theoretical maximum
Sources of Parallel Overheads • Overhead of creating threads/processes • Synchronization • Load imbalance • Communication • Extra computation • Memory access (for both sequential and parallel!)
Efficiency of a parallel program Copyright © 2010, Elsevier Inc. All rights Reserved E = Tserial Tparallel Speedup P = p P = can be number of threads, processes, or cores
Be Careful about T • Both Tserial and Tpar are wall-clock times, and as such they are not objective. They can be influenced by : – The skill of the programmer who wrote the implementations – The choice of compiler (e.g. GNU C++ versus Intel C++) – The compiler switches (e.g. turning optimization on/off) – The operating system – The type of filesystem holding the input data (e.g. EXT4, NTFS, etc.) – The time of day... (different workloads, network traffic, etc.)
Speedup Copyright © 2010, Elsevier Inc. All rights Reserved Program: Matrix Vector Multiplication Graph shows how speedup is affected by the number of processes and problem size.
Efficiency Copyright © 2010, Elsevier Inc. All rights Reserved Program: Matrix Vector Multiplication
Scalability  Scalability is the ability of a (software or hardware) system to handle a growing amount of work efficiently. • If we keep the efficiency fixed by increasing the number of processes/threads and without increasing problem size, the problem is strongly scalable. • If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the problem is weakly scalable. Copyright © 2010, Elsevier Inc. All rights Reserved
Let’s take a closer look at timing.
Taking Timings • What is time? • Start to finish? • A program segment of interest? • CPU time? • Wall clock time? Copyright © 2010, Elsevier Inc. All rights Reserved
• Elapsed Time (aka wall-clock time) – counts everything (disk and memory accesses, I/O , etc.) – a useful number, but often not good for comparison purposes • CPU time – doesn't count I/O or time spent running other programs – can be broken up into system time, and user time • Our focus: user CPU time – time spent executing the lines of code that are "in" our program Execution Time
Execution Time (Elapsed Time) I/O Time Disk and Memory timeCPU Time User CPU Time System CPU Time
Taking Timings In Linux: time prog Returns real Xs user Ys sys Zs Inside your C program: clock_t clock(void)returns the number of clock ticks elapsed since the program started #include <time.h> #include <stdio.h> int main() { clock_t start, end, total; int i; start = clock(); for(i=0; i< 10000000; i++) { } end = clock(); total= (double)(end – start) / CLOCKS_PER_SEC; printf("Total time taken by CPU: %fn", total); }
Let’s Look at Two Simple Metrics • Response time (aka Execution Time) – The time between the start and completion of a task • Throughput – Total amount of work done in a given time What is the relationship between execution time and throughput?
Timing for sequential programs IPC MIPS CPI Execution Time # insts
seconds program  cycles program  seconds cycle ET IC * CPI CT ET = IC X CPI X CT ET = Execution Time CPI = Cycles Per Instruction IC = Instruction Count Execution time for sequential program:
Example A program runs in 10 seconds on computer A, which has a 4 GHz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?“
CPI Example • Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 What machine is faster for this program, and by how much? [ 10-3 = milli, 10-6 = micro, 10-9 = nano, 10-12 = pico, 10-15 = femto ]
#Instructions Example • A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence?
MIPS Example • Two different compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. • Which sequence will be faster according to MIPS? • Which sequence will be faster according to execution time?
Pitfalls in timing in Parallel Machines
For Multithreaded Programs • You need to decide: Shall we use execution time or throughput? or both? • IPC (Instructions per Cycle) is not accurate here – small timing variations may lead to different execution time – Order at which threads enter critical section may vary – Different interrupt timing may lead to different scheduling decisions The total number of instructions executed may be different across different runs!
For Multithreaded Programs The total number of instructions executed may be different across different runs! This effect increases with the number of cores System-level code account for a significant fraction of the total execution time
Your Program Does Not Run in A Vacuum • OS at least is there. • Multi-programming and/or mulithreading setting is very common in multicore settings • Independent programs affect each other performance (why?)
How to check the performance of a parallel machine?
• Performance best determined by running a real application – Use programs typical of expected workload – Or, typical of expected class of applications – e.g., compilers/editors, scientific applications, graphics, etc. • Small benchmarks – nice for architects and designers – easy to standardize • Parallel Benchmarks: PARSEC, Rodinia, SPLASH-2 • SPEC (System Performance Evaluation Cooperative) – companies have agreed on a set of real program and inputs – valuable indicator of performance (and compiler technology) Benchmarks
Role of Benchmarks • help designer explore architectural designs • identify bottlenecks • compare different systems • conduct performance prediction
Example: PARSEC • Princeton Application Repository for Shared-Memory Computers • Benchmark Suite for Chip-Multiprocessors • Freely available at: http://parsec.cs.princeton.edu/ • Objectives: – Multithreaded Applications: Future programs must run on multiprocessors – Emerging Workloads: Increasing CPU performance enables new applications – Diverse: Multiprocessors are being used for more and more tasks – State-of-Art Techniques: Algorithms and programming techniques evolve rapidly
Example: PARSEC Program Application Domain Parallelization Blackscholes Financial Analysis Data-parallel Bodytrack Computer Vision Data-parallel Canneal Engineering Unstructured Dedup Enterprise Storage Pipeline Facesim Animation Data-parallel Ferret Similarity Search Pipeline Fluidanimate Animation Data-parallel Freqmine Data Mining Data-parallel Streamcluster Data Mining Data-parallel Swaptions Financial Analysis Data-parallel Vips Media Processing Data-parallel X264 Media Processing Pipeline
Example: Rodinia • A Benchmark Suite for Heterogeneous Computing: multicore CPU and GPU • University of Virginia
Conclusions • Performance evaluation is very important to assess programming quality as well as the underlying architecture and how they interact. • The following capture some aspects of the system but do not represent overall performance: MIPS, #instructions, #cycles, frequency • Execution time is what matters: system time, CPU time, I/O and memory time – To know whether your execution time is good, you need to compare it with sequential code, another parallel code, etc. • Scalability and efficiency measure the quality of your code. Powered by TCPDF (www.tcpdf.org)Powered by TCPDF (www.tcpdf.org)

Parallel Computing - Lec 6

  • 1.
    Parallel Computing Mohamed Zahran(aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-UA.0480-003 Performance Analysis
  • 2.
    Defining Performance • Whichairplane has the best performance? 0 100 200 300 400 500 Douglas DC-8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Passenger Capacity 0 2000 4000 6000 8000 10000 Douglas DC- 8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Cruising Range (miles) 0 500 1000 1500 Douglas DC-8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Cruising Speed (mph) 0 100000 200000 300000 400000 Douglas DC- 8-50 BAC/Sud Concorde Boeing 747 Boeing 777 Passengers x mph
  • 3.
    • For someprogram running on machine X, PerformanceX = 1 / Execution timeX • "X is n times faster than Y" PerformanceX / PerformanceY = n Standard Definition of Performance  Example: time taken to run a program  10s on A, 15s on B  Execution TimeB / Execution TimeA = 15s / 10s = 1.5  So A is 1.5 times faster than B
  • 4.
    Speedup • Number ofcores = p • Serial run-time = Tserial • Parallel run-time = Tparallel Copyright © 2010, Elsevier Inc. All rights Reserved Tserial Tparallel S =
  • 5.
    Example Processor 1 time 100 time 1 23 4 25 25 25 25 time 1 2 3 4 35 35 35 35 ,0.4 25 100 pS balancingloadperfect ,85.2 35 100 pS Perfect parallelization! Does it ever occur? synch cost
  • 6.
    Example (cont.) time 1 23 4 30 20 40 10 time 1 2 3 4 50 50 50 50 imbalanceload ,5.2 40 100 pS costsyncand imbalanceload ,0.2 50 100 pS closest to real life parallel programs Processor 1 time 100
  • 7.
    A Glimpse atthe Top 500 (Nov 2019 List) Rmax: maximum achieved Rpeak: theoretical maximum
  • 8.
    Sources of ParallelOverheads • Overhead of creating threads/processes • Synchronization • Load imbalance • Communication • Extra computation • Memory access (for both sequential and parallel!)
  • 9.
    Efficiency of aparallel program Copyright © 2010, Elsevier Inc. All rights Reserved E = Tserial Tparallel Speedup P = p P = can be number of threads, processes, or cores
  • 10.
    Be Careful aboutT • Both Tserial and Tpar are wall-clock times, and as such they are not objective. They can be influenced by : – The skill of the programmer who wrote the implementations – The choice of compiler (e.g. GNU C++ versus Intel C++) – The compiler switches (e.g. turning optimization on/off) – The operating system – The type of filesystem holding the input data (e.g. EXT4, NTFS, etc.) – The time of day... (different workloads, network traffic, etc.)
  • 11.
    Speedup Copyright © 2010,Elsevier Inc. All rights Reserved Program: Matrix Vector Multiplication Graph shows how speedup is affected by the number of processes and problem size.
  • 12.
    Efficiency Copyright © 2010,Elsevier Inc. All rights Reserved Program: Matrix Vector Multiplication
  • 13.
    Scalability  Scalability isthe ability of a (software or hardware) system to handle a growing amount of work efficiently. • If we keep the efficiency fixed by increasing the number of processes/threads and without increasing problem size, the problem is strongly scalable. • If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the problem is weakly scalable. Copyright © 2010, Elsevier Inc. All rights Reserved
  • 14.
    Let’s take acloser look at timing.
  • 15.
    Taking Timings • Whatis time? • Start to finish? • A program segment of interest? • CPU time? • Wall clock time? Copyright © 2010, Elsevier Inc. All rights Reserved
  • 16.
    • Elapsed Time(aka wall-clock time) – counts everything (disk and memory accesses, I/O , etc.) – a useful number, but often not good for comparison purposes • CPU time – doesn't count I/O or time spent running other programs – can be broken up into system time, and user time • Our focus: user CPU time – time spent executing the lines of code that are "in" our program Execution Time
  • 17.
    Execution Time (ElapsedTime) I/O Time Disk and Memory timeCPU Time User CPU Time System CPU Time
  • 18.
    Taking Timings In Linux: timeprog Returns real Xs user Ys sys Zs Inside your C program: clock_t clock(void)returns the number of clock ticks elapsed since the program started #include <time.h> #include <stdio.h> int main() { clock_t start, end, total; int i; start = clock(); for(i=0; i< 10000000; i++) { } end = clock(); total= (double)(end – start) / CLOCKS_PER_SEC; printf("Total time taken by CPU: %fn", total); }
  • 19.
    Let’s Look atTwo Simple Metrics • Response time (aka Execution Time) – The time between the start and completion of a task • Throughput – Total amount of work done in a given time What is the relationship between execution time and throughput?
  • 20.
    Timing for sequentialprograms IPC MIPS CPI Execution Time # insts
  • 21.
    seconds program  cycles program  seconds cycle ET IC * CPI CT ET= IC X CPI X CT ET = Execution Time CPI = Cycles Per Instruction IC = Instruction Count Execution time for sequential program:
  • 22.
    Example A program runsin 10 seconds on computer A, which has a 4 GHz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?“
  • 23.
    CPI Example • Supposewe have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 What machine is faster for this program, and by how much? [ 10-3 = milli, 10-6 = micro, 10-9 = nano, 10-12 = pico, 10-15 = femto ]
  • 24.
    #Instructions Example • Acompiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence?
  • 25.
    MIPS Example • Twodifferent compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. • Which sequence will be faster according to MIPS? • Which sequence will be faster according to execution time?
  • 26.
    Pitfalls in timingin Parallel Machines
  • 27.
    For Multithreaded Programs •You need to decide: Shall we use execution time or throughput? or both? • IPC (Instructions per Cycle) is not accurate here – small timing variations may lead to different execution time – Order at which threads enter critical section may vary – Different interrupt timing may lead to different scheduling decisions The total number of instructions executed may be different across different runs!
  • 28.
    For Multithreaded Programs Thetotal number of instructions executed may be different across different runs! This effect increases with the number of cores System-level code account for a significant fraction of the total execution time
  • 29.
    Your Program DoesNot Run in A Vacuum • OS at least is there. • Multi-programming and/or mulithreading setting is very common in multicore settings • Independent programs affect each other performance (why?)
  • 30.
    How to checkthe performance of a parallel machine?
  • 31.
    • Performance bestdetermined by running a real application – Use programs typical of expected workload – Or, typical of expected class of applications – e.g., compilers/editors, scientific applications, graphics, etc. • Small benchmarks – nice for architects and designers – easy to standardize • Parallel Benchmarks: PARSEC, Rodinia, SPLASH-2 • SPEC (System Performance Evaluation Cooperative) – companies have agreed on a set of real program and inputs – valuable indicator of performance (and compiler technology) Benchmarks
  • 32.
    Role of Benchmarks •help designer explore architectural designs • identify bottlenecks • compare different systems • conduct performance prediction
  • 33.
    Example: PARSEC • PrincetonApplication Repository for Shared-Memory Computers • Benchmark Suite for Chip-Multiprocessors • Freely available at: http://parsec.cs.princeton.edu/ • Objectives: – Multithreaded Applications: Future programs must run on multiprocessors – Emerging Workloads: Increasing CPU performance enables new applications – Diverse: Multiprocessors are being used for more and more tasks – State-of-Art Techniques: Algorithms and programming techniques evolve rapidly
  • 34.
    Example: PARSEC Program ApplicationDomain Parallelization Blackscholes Financial Analysis Data-parallel Bodytrack Computer Vision Data-parallel Canneal Engineering Unstructured Dedup Enterprise Storage Pipeline Facesim Animation Data-parallel Ferret Similarity Search Pipeline Fluidanimate Animation Data-parallel Freqmine Data Mining Data-parallel Streamcluster Data Mining Data-parallel Swaptions Financial Analysis Data-parallel Vips Media Processing Data-parallel X264 Media Processing Pipeline
  • 35.
    Example: Rodinia • ABenchmark Suite for Heterogeneous Computing: multicore CPU and GPU • University of Virginia
  • 36.
    Conclusions • Performance evaluationis very important to assess programming quality as well as the underlying architecture and how they interact. • The following capture some aspects of the system but do not represent overall performance: MIPS, #instructions, #cycles, frequency • Execution time is what matters: system time, CPU time, I/O and memory time – To know whether your execution time is good, you need to compare it with sequential code, another parallel code, etc. • Scalability and efficiency measure the quality of your code. Powered by TCPDF (www.tcpdf.org)Powered by TCPDF (www.tcpdf.org)