INTRODUCTION
TO PARALLEL
COMPUTING
WHAT IS PARALLEL COMPUTING?
• Traditionally, software has been written for serial
computation:
• To be run on a single computer having a single Central Processing
Unit (CPU);
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.
WHAT IS PARALLEL COMPUTING?
• In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem.
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
PARALLEL COMPUTING:
RESOURCES
• The compute resources can include:
• A single computer with multiple processors;
• A single computer with (multiple) processor(s) and some specialized
computer resources (GPU, …)
• An arbitrary number of computers connected by a network;
PARALLEL COMPUTING: THE
COMPUTATIONAL PROBLEM
• The computational problem usually demonstrates
characteristics such as the ability to be:
• Broken apart into discrete pieces of work that can be solved
simultaneously;
• Executed as multiple program instructions at any moment in time;
• Solved in less time with multiple compute resources than with a
single compute resource.
PARALLEL COMPUTING: WHAT FOR?
• Parallel computing is an evolution of serial computing that
attempts to emulate many complex, interrelated events
happening at the same time, yet within a sequence.
THE NEED FOR HIGH PERFORMANCE
COMPUTERS
Large scale applications in science and engineering rely on fast processing
Applications requiring high availability rely on parallel and distributed platforms for
redundancy.
Many of today's applications, such as weather prediction, aerodynamics and artificial
intelligence, are very computationally intensive and require vast amounts of processing
power.
So, it appears that the only way forward is to use PARALLELISM. The idea here is that if
several operations can be performed simultaneously then the total computation time is
reduced.
Today, commercial applications provide an equal or greater driving force in
the development of faster computers. These applications require the
processing of large amounts of data in sophisticated ways.
For example:
Databases, data mining
Web search engines, web-based business services
Medical imaging and diagnosis
Pharmaceutical design
Management of national and multi-national corporations
Financial and economic modeling
Advanced graphics and virtual reality, particularly in the entertainment industry
Networked video and multi-media technologies
Collaborative work environments
PARALLEL AND DISTRIBUTED COMPUTING
Parallel computing (processing):
the use of two or more processors (computers), usually within a single system,
working simultaneously to solve a single problem.
Distributed computing (processing):
any computing that involves multiple computers remote from each other that each
have a role in a computation problem or information processing.
Parallel programming:
the human process of developing programs that express what computations should be
executed in parallel.
A parallel computer is a multiple-processor computer capable of parallel
processing.
A supercomputer is a general-purpose computer capable of solving
individual problems at extremely high computational speeds
Ideally, if a single microcomputer takes time t to solve a problem, n computers working
cooperatively in parallel should take time t/n to solve the same problem; this ideal
speedup is the key advantage of parallel computing.
ADVANTAGES
Time Reduction:
With the help of parallel processing, a number of computations can be performed at
once, bringing down the time required to complete a project.
Complexity:
Parallel processing is particularly useful in projects that require complex computations,
such as weather modeling and digital special effects.
Cheapest Possible Solution Strategy:
Many workstations together cost much less than a single large mainframe-style machine.
Greater Reliability:
A parallel computer can keep working even if a processor fails (fault tolerance).
CONCEPTS AND
TERMINOLOGY
VON NEUMANN ARCHITECTURE
• For over 40 years, virtually all computers have followed a
common machine model known as the von Neumann computer,
named after the Hungarian mathematician John von Neumann.
• A von Neumann computer uses the stored-program
concept. The CPU executes a stored program that specifies
a sequence of read and write operations on the memory.
BASIC DESIGN
• Basic design
• Memory is used to store both program instructions and data
• Program instructions are coded data which tell the
computer to do something
• Data is simply information to be used by the
program
• A central processing unit (CPU) gets
instructions and/or data from memory,
decodes the instructions and then
sequentially performs them.
FLYNN'S CLASSICAL TAXONOMY
• There are different ways to classify parallel computers. One
of the more widely used classifications, in use since 1966, is
called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer
architectures according to how they can be classified along
the two independent dimensions of Instruction and Data.
Each of these dimensions can have only one of two possible
states: Single or Multiple.
FLYNN MATRIX
• The matrix below defines the 4 possible classifications
according to Flynn
SINGLE INSTRUCTION, SINGLE
DATA (SISD)
• A serial (non-parallel) computer
• Single instruction: only one instruction stream is being
acted on by the CPU during any one clock cycle
• Single data: only one data stream is being used as input
during any one clock cycle
• Deterministic execution
• This is the oldest and until recently, the most prevalent
form of computer
• Examples: most PCs, single CPU workstations and
mainframes
SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)
• A type of parallel computer
• Single instruction: All processing units execute the same instruction at any given clock cycle
• Multiple data: Each processing unit can operate on a different data element
• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network,
and a very large array of small-capacity instruction units.
• Best suited for specialized problems characterized by a high degree of regularity, such as image
processing.
• Synchronous (lockstep) and deterministic execution
• Examples:
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD
instructions and execution units.
MULTIPLE INSTRUCTION, SINGLE DATA (MISD)
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent
instruction streams.
• The MISD architecture is also known as a systolic array.
• Systolic arrays are hardware structures built for fast and efficient operation of
regular algorithms that perform the same task with different data at different
time instants.
MULTIPLE INSTRUCTION, MULTIPLE DATA (MIMD)
• Currently, the most common type of parallel computer. Most modern
computers fall into this category.
• Multiple Instruction: every processor may be executing a different instruction
stream
• Multiple Data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-
deterministic
• Examples: most current supercomputers, networked parallel computer "grids"
and multi-processor SMP computers, multi-core PCs.
GENERAL PARALLEL TERMINOLOGIES
• CPU
Contemporary CPUs consist of one or more cores - a distinct execution unit with its own
instruction stream. Cores within a CPU may be organized into one or more sockets, each socket
with its own distinct memory. When a CPU consists of two or more sockets, the hardware
infrastructure usually supports memory sharing across sockets.
• Node
A standalone "computer in a box." Usually comprised of multiple CPUs/processors/cores, memory,
network interfaces, etc. Nodes are networked together to comprise a supercomputer.
• Task
A logically discrete section of computational work. A task is typically a program or program-like set
of instructions that is executed by a processor.
• Parallel Task
A task that can be executed by multiple processors safely (yields correct results).
• Serial Execution
Execution of a program sequentially, one statement at a time. In the simplest sense, this is what
happens on a one-processor machine. However, virtually all parallel programs have sections that
must be executed serially.
• Parallel Execution
• Execution of a program by more than one task, with each task being able to
execute the same or different statement at the same moment in time.
• Shared Memory
• From a strictly hardware point of view, describes a computer architecture
where all processors have direct (usually bus based) access to common
physical memory. In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory and can directly address and
access the same logical memory locations regardless of where the physical
memory actually exists.
• Distributed Memory
• In hardware, refers to network based memory access for physical memory
that is not common. As a programming model, tasks can only logically "see"
local machine memory and must use communications to access memory on
other machines where other tasks are executing.
• Communications
• Parallel tasks typically need to exchange data. There are several ways this can
be accomplished, such as through a shared memory bus or over a network,
however the actual event of data exchange is commonly referred to as
communications regardless of the method employed.
• Synchronization
• The coordination of parallel tasks in real time, very often associated with
communications. Often implemented by establishing a synchronization point
within an application where a task may not proceed further until another
task(s) reaches the same or logically equivalent point.
• Synchronization usually involves waiting by at least one task, and can
therefore cause a parallel application's wall clock execution time to increase.
• Granularity
• In parallel computing, granularity (or grain size) of a task is a measure of the
amount of work (or computation) which is performed by that task.
• Granularity is a quantitative or qualitative measure of the ratio of computation time
to communication time (a simple form of this ratio is given below).
• In fine-grained parallelism, a program is broken down into a large number of
small tasks. These tasks are assigned individually to many processors. The
amount of work associated with a parallel task is low and the work is evenly
distributed among the processors. Hence, fine-grained parallelism facilitates load
balancing.
• In coarse-grained parallelism, a program is split into large tasks. Due to this,
a large amount of computation takes place in each processor. This might result in
load imbalance, wherein certain tasks process the bulk of the data while others
might be idle.
• Granularity is closely tied to the levels of processing
1. Instruction level
2. Loop level
3. Sub-routine level and
4. Program-level
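A simple quantitative form of the computation-to-communication ratio mentioned above (symbols assumed here, not from the slide):

\[ \text{granularity} = \frac{T_{\text{computation}}}{T_{\text{communication}}} \]

Fine-grained tasks have a low ratio (little work between communications); coarse-grained tasks have a high one.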
• Observed Speedup
• Observed speedup of a code which has been parallelized,
defined as:
(wall-clock time of serial execution) / (wall-clock time of parallel execution)
• One of the simplest and most widely used indicators for a
parallel program's performance.
• Parallel Overhead
• The amount of time required to coordinate parallel tasks, as opposed to doing
useful work. Parallel overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel compilers, libraries, tools, operating system, etc.
• Task termination time
• Scalability
• Refers to a parallel system's (hardware and/or software) ability to demonstrate a
proportionate increase in parallel speedup with the addition of more processors. Factors that
contribute to scalability include:
• Hardware - particularly memory-CPU bandwidth and network communications
• Application algorithm
• Parallel overhead
• Characteristics of your specific application and coding
PARALLEL
COMPUTER
MEMORY
ARCHITECTURES
MEMORY ARCHITECTURES
• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory
SHARED MEMORY
• Shared memory parallel computers vary widely, but generally have in common the
ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory
resources.
• Changes in a memory location effected by one processor are visible to all other
processors.
• Shared memory machines can be divided into two main classes based upon memory
access times: UMA and NUMA.
SHARED MEMORY : UMA VS. NUMA
• Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one
processor updates a location in shared memory, all the other processors know about
the update. Cache coherency is accomplished at the hardware level.
• Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache
Coherent NUMA
SHARED MEMORY: PROS AND CONS
• Advantages
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory
to CPUs
• Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding
more CPUs can geometrically increase traffic on the shared memory-CPU path and,
for cache coherent systems, geometrically increase traffic associated with
cache/memory management.
• The programmer is responsible for synchronization constructs that ensure "correct" access
to global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared
memory machines with ever increasing numbers of processors.
DISTRIBUTED MEMORY
• Like shared memory systems, distributed memory systems vary widely but share a
common characteristic: distributed memory systems require a communication network
to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address space across all
processors.
• Because each processor has its own local memory, it operates independently. Changes it
makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization
between tasks is likewise the programmer's responsibility.
DISTRIBUTED MEMORY: PROS AND CONS
• Advantages
• Memory is scalable with number of processors. Increase the number of processors
and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
• Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this
memory organization.
• Non-uniform memory access (NUMA) times
HYBRID DISTRIBUTED-SHARED MEMORY
• The largest and fastest computers in the world today employ both shared and
distributed memory architectures.
• The shared memory component is usually a cache coherent SMP machine.
Processors on a given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs. SMPs
know only about their own memory - not the memory on another SMP.
Therefore, network communications are required to move data from one SMP
to another.
• Current trends seem to indicate that this type of memory architecture will
continue to prevail and increase at the high end of computing for the
foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and
distributed memory architectures.
PARALLEL
PROGRAMMING
MODELS
OVERVIEW
• There are several parallel programming models in common
use:
• Shared Memory
• Threads
• Message Passing
• Data Parallel
• Hybrid
• Parallel programming models exist as an abstraction above
hardware and memory architectures.
OVERVIEW
• Which model to use is often a combination of what is
available and personal choice. There is no "best" model,
although there certainly are better implementations of
some models over others.
• The following sections describe each of the models
mentioned above, and also discuss some of their actual
implementations.
SHARED MEMORY MODEL
• In the shared-memory programming model, tasks share a common
address space, which they read and write asynchronously.
• Various mechanisms such as locks / semaphores may be used to
control access to the shared memory.
• An advantage of this model from the programmer's point of view is
that the notion of data "ownership" is lacking, so there is no need to
specify explicitly the communication of data between tasks.
• An important disadvantage in terms of performance is that it
becomes more difficult to understand and manage data locality.
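A minimal sketch of this model on a POSIX system (code written for illustration, not taken from the slides): two processes map the same region of memory, and a process-shared semaphore acts as the lock that controls access to it. Compile with something like cc -pthread shmodel.c.

/* Shared memory model sketch: two tasks, one common address space, one lock. */
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    sem_t lock;     /* semaphore used as a mutual-exclusion lock */
    long  counter;  /* shared data visible to both tasks         */
} shared_t;

int main(void) {
    /* One region of memory that both tasks see at the same logical addresses. */
    shared_t *shm = mmap(NULL, sizeof(shared_t), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shm == MAP_FAILED) return 1;
    sem_init(&shm->lock, 1 /* shared between processes */, 1);
    shm->counter = 0;

    pid_t pid = fork();                  /* now there are two tasks            */
    for (int i = 0; i < 100000; i++) {
        sem_wait(&shm->lock);            /* acquire the lock ...               */
        shm->counter++;                  /* ... update the shared location ... */
        sem_post(&shm->lock);            /* ... and release it                 */
    }
    if (pid != 0) {                      /* parent waits for the child         */
        wait(NULL);
        printf("counter = %ld (expected 200000)\n", shm->counter);
    }
    return 0;
}

Without the semaphore the two increments could interleave and lose updates, which is exactly the access-control problem the locks/semaphores bullet refers to.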
SHARED MEMORY MODEL:
IMPLEMENTATIONS
• On shared memory platforms, the native compilers translate user
program variables into actual memory addresses, which are global.
• On distributed memory platforms, no common implementations of this model
currently exist; such an implementation would have to provide a shared memory view
of data on top of physically distributed memory.
THREADS MODEL
• In the threads model of parallel programming, a single process can have
multiple, concurrent execution paths.
• Perhaps the simplest analogy that can be used to describe threads is
the concept of a single program that includes a number of subroutines running concurrently.
• Threads are commonly associated with shared memory architectures and
operating systems.
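A small sketch of the threads model using POSIX threads (one possible implementation, chosen here for illustration): a single process spawns several concurrent execution paths that all share the process's global data, with a mutex serializing the update of the shared result. Compile with something like cc -pthread threads.c.

/* Threads model sketch: one process, several concurrent execution paths. */
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;                       /* global data shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg) {             /* the "subroutine" each thread runs */
    long id = (long)arg, local = 0;
    for (long i = id * 1000; i < (id + 1) * 1000; i++)
        local += i;                               /* work on this thread's own range   */
    pthread_mutex_lock(&lock);                    /* protect the shared accumulator    */
    shared_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[4];
    for (long t = 0; t < 4; t++)                  /* spawn concurrent execution paths  */
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (int t = 0; t < 4; t++)
        pthread_join(tid[t], NULL);               /* wait for all paths to finish      */
    printf("sum of 0..3999 = %ld\n", shared_sum); /* 7998000 */
    return 0;
}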
THREADS MODEL: OPENMP
• OpenMP
• Jointly defined and endorsed by a group of major computer hardware and
software vendors.
• Portable / multi-platform, including Unix and Windows NT platforms
• Available in C/C++ implementations
• Can be very easy and simple to use - provides for "incremental parallelism"
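A minimal illustration of the "incremental parallelism" point (loop and variable names chosen here for illustration): adding a single directive turns an otherwise serial loop into a parallel one. Build with an OpenMP-enabled compiler, e.g. a -fopenmp style flag.

/* OpenMP sketch: one pragma parallelizes the loop; remove it and the code is serial. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* reduction(+:sum) gives every thread a private partial sum and
       combines them at the end, avoiding a data race on sum.        */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);

    printf("partial harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}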
MESSAGE PASSING MODEL
IMPLEMENTATIONS: MPI
• From a programming perspective, message passing implementations commonly
comprise a library of subroutines that are embedded in source code. The
programmer is responsible for determining all parallelism.
• Historically, a variety of message passing libraries have been available since the
1980s. These implementations differed substantially from each other making it
difficult for programmers to develop portable applications.
• In 1992, the MPI Forum was formed with the primary goal of establishing a
standard interface for message passing implementations.
• Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2
(MPI-2) was released in 1996. Both MPI specifications are available on the web at
www.mcs.anl.gov/Projects/mpi/standard.html.
MESSAGE PASSING MODEL
IMPLEMENTATIONS: MPI
• MPI is now the "de facto" industry standard for message passing, replacing
virtually all other message passing implementations used for production work.
Most, if not all, of the popular parallel computing platforms offer at least one
implementation of MPI. A few offer a full implementation of MPI-2.
• For shared memory architectures, MPI implementations usually don't use a
network for task communications. Instead, they use shared memory (memory
copies) for performance reasons.
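A minimal message-passing sketch with MPI (values and tag chosen here for illustration): rank 1 sends one integer to rank 0 over MPI_COMM_WORLD. Build and launch with the platform's MPI tools, e.g. an mpicc / mpirun pair, using at least two tasks.

/* Message passing sketch: explicit send on one task, matching receive on another. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which task am I?         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many tasks in total? */

    if (rank == 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);      /* explicit send    */
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* matching receive */
        printf("rank 0 of %d received %d from rank 1\n", size, value);
    }

    MPI_Finalize();
    return 0;
}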
OTHER MODELS
• Other parallel programming models besides those
previously mentioned certainly exist, and will continue to
evolve along with the ever changing world of computer
hardware and software.
• Only three of the more common ones are mentioned here.
• Hybrid
• Single Program Multiple Data
• Multiple Program Multiple Data
HYBRID
• In this model, any two or more parallel programming models are
combined.
• Currently, a common example of a hybrid model is the combination of
the message passing model (MPI) with the shared memory model
(OpenMP). This hybrid model lends itself well to the increasingly
common hardware environment of networked SMP machines.
• Another common example of a hybrid model is combining data
parallel with message passing. The data parallel implementations on
distributed memory architectures actually use message passing to
transmit data between tasks, transparently to the programmer.
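A sketch of the MPI + OpenMP combination described above (structure chosen here for illustration): OpenMP threads build a per-task partial result inside each MPI task, and message passing then combines the partial results across tasks.

/* Hybrid sketch: shared memory (OpenMP) within a task, message passing (MPI) between tasks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Shared-memory parallelism inside this MPI task */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;

    double total = 0.0;
    /* Message passing combines the per-task results on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("grand total = %.0f\n", total);
    MPI_Finalize();
    return 0;
}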
SINGLE PROGRAM MULTIPLE DATA
(SPMD)
• Single Program Multiple Data (SPMD):
• SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
• A single program is executed by all tasks simultaneously.
• At any moment in time, tasks can be executing the same or different instructions
within the same program.
• SPMD programs usually have the necessary logic programmed into them to allow
different tasks to branch or conditionally execute only those parts of the
program they are designed to execute. That is, tasks do not necessarily have to
execute the entire program - perhaps only a portion of it.
• All tasks may use different data
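A small SPMD sketch built on MPI (roles chosen here for illustration): every task runs this same program, but branches on its rank so that each task executes only the part it is responsible for, typically on its own portion of the data.

/* SPMD sketch: one program, many tasks, rank-dependent behaviour. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        printf("task 0: handling I/O and coordination\n");   /* only rank 0 runs this  */
    } else {
        printf("task %d of %d: computing its own slice\n",   /* the rest run this part */
               rank, size);
    }

    MPI_Finalize();
    return 0;
}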
MULTIPLE PROGRAM MULTIPLE
DATA (MPMD)
• Multiple Program Multiple Data (MPMD):
• Like SPMD, MPMD is actually a "high level" programming model that
can be built upon any combination of the previously mentioned
parallel programming models.
• MPMD applications typically have multiple executable object files
(programs). While the application is being run in parallel, each task can
be executing the same or different program as other tasks.
• All tasks may use different data
MULTI CORE SYSTEMS
• A multi-core processor is a single computing component with two or more independent actual CPUs
(called "cores").
• These independent CPUs are capable of executing instructions at the same time hence overall increasing
the speed at which programs can be executed.
• Manufacturers typically integrate the cores onto a single integrated circuit die.
• Processors were originally developed with only one core.
• A dual-core processor has two cores (e.g. Intel Core Duo)
• A quad-core processor contains four cores (e.g. Intel's quad-core i3, i5, and i7 processors)
• A hexa-core processor contains six cores (e.g. Intel Core i7 Extreme Edition 980X)
• An octo-core (or octa-core) processor contains eight cores (e.g. Intel Xeon E7-2820)
• A deca-core processor contains ten cores (e.g. Intel Xeon E7-2850)
PERFORMANCE
Two key goals to be achieved with the design of parallel applications are:
• Performance – the capacity to reduce the time needed to solve a problem as
the computing resources increase
• Scalability – the capacity to increase performance as the size of the problem
increases
The main factors limiting the performance and the scalability of an application can
be divided into:
• Architectural limitations
• Algorithmic limitations
PERFORMANCE METRICS FOR
PARALLEL APPLICATIONS
Some of the best known metrics are:
• Speedup
• Efficiency
• Redundancy
• Utilization
There are also some laws that try to explain and predict the potential
performance of a parallel application. The best known are:
• Amdahl’s law
• Gustafson-Barsis’ law
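In their usual textbook forms, with p processing units and f the fraction of the work that can be parallelized (symbols assumed here), the two laws read:

\[ S_{\text{Amdahl}}(p) = \frac{1}{(1-f) + f/p}, \qquad S_{\text{Gustafson}}(p) = p - (1-f)(p-1) \]

Amdahl's law keeps the problem size fixed and bounds the speedup by 1/(1-f); Gustafson-Barsis' law lets the problem size grow with p.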
SPEEDUP
Speedup is a measure of performance. It measures the ratio between the sequential
execution time and the parallel execution time.
Here, T(1) is the execution time with one processing unit and T(p) is the execution time
with p processing units.
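Written as a formula:

\[ S(p) = \frac{T(1)}{T(p)} \]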
EFFICIENCY
Efficiency is a measure of the usage of the computational capacity. It measures the
ratio between performance and the number of resources available to achieve that
performance.
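In formula form, with S(p), T(1) and T(p) as defined for speedup:

\[ E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)} \]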
REDUNDANCY
Redundancy measures the increase in the required computation when using more
processing units. It measures the ratio between the number of operations performed
by the parallel execution and by the sequential execution.
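In formula form, writing O(p) for the total number of operations performed when p processing units are used (notation assumed here, not from the slide):

\[ R(p) = \frac{O(p)}{O(1)} \]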
UTILIZATION
Utilization is a measure of how well the computational capacity is used. It measures
the ratio between the computational capacity used during parallel execution and the
capacity that was available.
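With the same assumed notation, utilization is commonly expressed as the product of redundancy and efficiency:

\[ U(p) = R(p)\,E(p) \]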
PIPELINING
PIPELINING
• A technique of decomposing a sequential process into
suboperations, with each subprocess being executed in a
special dedicated segment that operates concurrently with
all other segments.
• A pipeline can be visualized as a collection of processing
segments through which binary information flows.
• The name “pipeline” implies a flow of information analogous
to an industrial assembly line.
PIPELINING IS NATURAL!
• Laundry Example
• Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
SEQUENTIAL LAUNDRY
[Timeline figure, 6 PM to midnight: loads A, B, C and D are processed one after another in task order, each taking 30 min wash + 40 min dry + 20 min fold.]
• Sequential laundry takes 6 hours for 4 loads
PIPELINED LAUNDRY: START WORK ASAP
[Timeline figure, 6 PM onward: loads A, B, C and D overlap; after the first 30 min wash, the 40 min dryer stage runs back-to-back for all four loads, ending with a final 20 min fold.]
• Pipelined laundry takes 3.5 hours for 4 loads
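The two totals follow directly from the stage times (30 min wash, 40 min dry, 20 min fold):

\[ T_{\text{sequential}} = 4 \times (30 + 40 + 20) = 360\ \text{min} = 6\ \text{h}, \qquad T_{\text{pipelined}} = 30 + 4 \times 40 + 20 = 210\ \text{min} = 3.5\ \text{h} \]

In the pipelined case the 40-minute dryer is the slowest stage, so it alone sets the rate at which loads complete.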
PIPELINING LESSONS
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
IDEA OF PIPELINING IN COMPUTER
ROLE OF CACHE MEMORY
PIPELINE PERFORMANCE
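A standard way to quantify pipeline performance (symbols assumed here): with k stages, each taking one clock period t_p, the first task needs k cycles and every later task one more, so n tasks complete in (k + n - 1) cycles. Relative to a non-pipelined unit that needs time t_n per task, the speedup is

\[ S = \frac{n\,t_n}{(k + n - 1)\,t_p}, \qquad S \to \frac{t_n}{t_p} \approx k \ \ \text{as}\ n \to \infty \]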
PIPELINING AND PARALLEL PROCESSING
• Pipelining and parallel processing are both techniques used in computer architecture to improve the performance
of processors.
• Pipelining involves breaking down the execution of instructions into a series of stages. Each stage performs a
different part of the instruction, and multiple instructions can be processed simultaneously, each at a different
stage of execution. This allows for more efficient use of the processor's resources and can lead to faster overall
execution of instructions.
• Parallel processing, on the other hand, involves the simultaneous execution of multiple instructions or tasks. This
can be done using multiple processors or processor cores working together to handle different parts of a task at
the same time. Parallel processing can significantly speed up the execution of tasks that can be divided into
independent sub-tasks.
• In summary, pipelining improves the efficiency of processing individual instructions, while parallel processing
improves the overall throughput by executing multiple instructions or tasks simultaneously. Both techniques are
used to enhance the performance of modern computer systems.