PARALLEL & DISTRIBUTED COMPUTING CS469
LECTURE # 21
Faizan ul Mustafa
Lecturer | Dept. of Computer Science
GIFT University Gujranwala, Pakistan
faizanulmustafa@gift.edu.pk
Table of Contents
FLOPS, Speed-Up Calculation
Amdahl’s Law
Complexity, cost of complexity, portability, scalability
FLOPS
Floating-point operations per second.
A measure of theoretical peak performance.
Sockets are the physical slots on computer hardware into which CPU chips (and their cores) are installed. Normally there is one socket in a traditional computer; cluster computing machines have multiple sockets.
Sockets: This refers to the number of physical CPU sockets on the
system. Each socket holds one central processing unit (CPU) package.
Cores: This is the number of CPU cores per socket. A core is an
independent processing unit that can execute instructions.
Cycles per second: This is the clock speed of the processor, measured in
Hertz (Hz). One Hertz is equal to one cycle per second.
FLOPs per cycle: This is the number of floating-point operations that a
single core can perform in one clock cycle. This value depends on the
architecture of the processor and the specific instruction being executed.
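Putting these four quantities together gives the standard formula for theoretical peak performance:
Peak FLOPS = sockets * (cores per socket) * (cycles per second) * (FLOPs per cycle)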
FLOPS
Servers are the only computers that sometimes have more than one socket;
for most home computers (desktop or laptop), "sockets" will be 1.
Cores per socket depends on your CPU. It could be 2 (dual-core), 3, 4
(quad-core), 6 (hexa-core), or 8. There are some prototype CPUs with as
many as 80 cores.
"Clock cycles per second" refers to the speed of your CPU. Most modern
CPUs are rated in gigahertz, so 2 GHz would be 2,000,000,000 clock cycles
per second.
The number of FLOPs per cycle also depends on the CPU. One of the fastest
(home computer) CPUs is the Intel Core i7-970, capable of 4 double-precision
or 8 single-precision floating-point operations per cycle.
Test
The Intel Core i7-970 has 6 cores. If it is running at 3.46 GHz and can
perform 8 single-precision floating-point operations per cycle, calculate
the theoretical compute power of this machine.
Solution
The Intel Core i7-970 has 6 cores. If it is running at 3.46 GHz, the formula
would be:
1 (socket) * 6 (cores) * 3,460,000,000 (cycles per second) * 8 (single-precision
FLOPs per cycle) = 166,080,000,000 single-precision FLOPs per second, or,
with 4 double-precision FLOPs per cycle, 83,040,000,000 double-precision
FLOPs per second.
That is a theoretical peak of about 166 x 10^9 single-precision FLOPS,
i.e. roughly 166 GFLOPS.
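A minimal Python sketch of the same calculation (the peak_flops helper is illustrative, not from the slides):

# Theoretical peak FLOPS = sockets * cores per socket * clock (Hz) * FLOPs per cycle
def peak_flops(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    return sockets * cores_per_socket * clock_hz * flops_per_cycle

# Intel Core i7-970: 1 socket, 6 cores, 3.46 GHz
print(peak_flops(1, 6, 3.46e9, 8))  # single precision: 166,080,000,000 FLOPS
print(peak_flops(1, 6, 3.46e9, 4))  # double precision: 83,040,000,000 FLOPS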
Speed-Up Calculations
A machine is designed to execute different processes. Any machine that can
execute more processes simultaneously is more efficient. The ability of a
machine to run multiple processes in parallel in the same time can be
quantified by a speed-up calculation.
A speed-up calculation tells how much, theoretically, we speed up a
particular process / task in execution.
Theoretical speed-up calculation is addressed by Amdahl's Law.
Example
If 30% of the execution time may be the subject of a speedup, p will
be 0.3; if the improvement makes the affected part twice as fast, s
will be 2. Amdahl's law states that the overall speedup of applying
the improvement will be?
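A worked answer, using the standard statement of Amdahl's law (the formula is not on the slide, but is the usual form):
Speedup = 1 / ((1 - p) + p / s) = 1 / ((1 - 0.3) + 0.3 / 2) = 1 / (0.7 + 0.15) = 1 / 0.85 ≈ 1.18
So the improvement speeds up the whole task by about 18%, even though the affected part runs twice as fast.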
Example 2
Assume that we are given a serial task which is split into four
consecutive parts, whose percentages of execution time are p1 =
0.11, p2 = 0.18, p3 = 0.23, and p4 = 0.48 respectively. Then we are
told that the 1st part is not sped up, so s1 = 1, while the 2nd part is
sped up 5 times, so s2 = 5, the 3rd part is sped up 20 times, so s3 =
20, and the 4th part is sped up 1.6 times, so s4 = 1.6. By using
Amdahl's law, the overall speedup is?
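A worked answer, using the generalized form of Amdahl's law for multiple parts (stated here for completeness):
Speedup = 1 / (p1/s1 + p2/s2 + p3/s3 + p4/s4)
= 1 / (0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6) = 1 / (0.11 + 0.036 + 0.0115 + 0.3) = 1 / 0.4575 ≈ 2.19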
To achieve a better overall speedup, one should speed up the part of the task
that accounts for the larger share of the total execution time, rather than the
part with the largest available speedup. This can be demonstrated with the
following example.
Assume that a task has two independent parts, A and B. Part B takes roughly 25% of the
time of the whole computation. By working very hard, one may be able to make this
part 5 times faster, but this reduces the time of the whole computation only slightly. In
contrast, one may need to perform less work to make part A perform twice as fast. This
will make the computation much faster than by optimizing part B, even though part B's
speedup is greater in terms of the ratio (5 times versus 2 times).
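A short Python sketch of this comparison using Amdahl's law (the amdahl helper is illustrative, not from the slides):

def amdahl(p, s):
    # Overall speedup when a fraction p of the runtime is sped up by a factor s
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl(0.25, 5))  # optimizing part B (25% of runtime, 5x faster): 1.25x overall
print(amdahl(0.75, 2))  # optimizing part A (75% of runtime, 2x faster): 1.6x overall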
Complexity
In general, parallel applications are much more complex than corresponding
serial applications.
Not only do you have multiple instruction streams executing at the same
time, but you also have data flowing between them.
The costs of complexity are measured in programmer time in virtually every
aspect of the software development cycle:
Design
Coding
Debugging
Maintenance
Adhering to "good" software development practices is essential when
working with parallel applications.
Portability
Thanks to standardization in several APIs, such as MPI, POSIX threads, and
OpenMP, portability issues with parallel programs are not as serious as in
years past.
All of the usual portability issues associated with serial programs apply to
parallel programs. For example, if you use vendor "enhancements" to
Fortran, C or C++, portability will be a problem.
Even though standards exist for several APIs, implementations will differ in
a number of details, sometimes to the point of requiring code modifications
in order to effect portability.
Operating systems can play a key role in code portability issues.
Hardware architectures are characteristically highly variable and can affect
portability.
Resource Requirements
The primary intent of parallel programming is to decrease execution wall
clock time, however in order to accomplish this, more CPU time is required.
For example, a parallel code that runs in 1 hour on 8 processors actually
uses 8 hours of CPU time.
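In general (restating the example above, and ignoring parallel overhead):
total CPU time ≈ wall-clock time * number of processors.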
The amount of memory required can be greater for parallel codes than
serial codes, due to the need to replicate data and for overheads associated
with parallel support libraries and subsystems.
For short running parallel programs, there can actually be a decrease in
performance compared to a similar serial implementation. The overhead
costs associated with setting up the parallel environment, task creation,
communications and task termination can comprise a significant portion of
the total execution time for short runs.
Thank You