CHAPTER 4: PARALLEL PROGRAMMING LANGUAGES By Dr. Heman Pathak, Associate Professor, KGC - Dehradun
Parallel Processes Multicomputer • A multicomputer is constructed from multiple computers and an interconnection network. The processors on different computers interact by passing messages to each other. Multiprocessor • It is a computer system with two or more CPUs. It is a highly integrated system in which all CPUs share access to a single global memory.
Parallel Processes Uniform Memory Access (UMA) • It is a shared-memory architecture in which the access time to a memory location is independent of which processor makes the request or which memory chip contains the data.
PROGRAMMING PARALLEL PROCESSES Every parallel language must address certain issues, either explicitly or implicitly:  There must be a way to create parallel processes.  There must be a way to coordinate the activities of these processes.  When processes exchange results, they must communicate and synchronize with each other.  Communication and synchronization can be accomplished by sharing variables or by message passing.
PROGRAMMING PARALLEL PROCESSES Communication and synchronization can be accomplished by sharing variables or by message passing. There are two methods of synchronization: Synchronization for precedence: it guarantees that one event does not begin until another event has finished. Synchronization for mutual exclusion: it guarantees that only one process at a time enters a critical section of code where a shared data structure is manipulated.
Barrier Synchronization [Figure: four processes P1-P4 approach the barrier over time; all except P4 arrive and must wait; once all four arrive, they continue past the barrier.]
Mutual Exclusion Synchronization [Figure: Process 1 and Process 2 taking turns entering a critical section.]
An Illustrative Example for UMA Let's consider the problem of computing the variance of a list of real numbers. The values given are r1, r2, r3, ..., rn. Variance = Σ(ri − m)²/n, where m = Σri/n.  The parallel architecture used is UMA with 4 processors.  The n real numbers are stored in shared memory.  Four variables are also stored in shared memory:  one will contain the grand total  another the mean  a third the global sum of squares  a fourth the variance
An Illustrative Example for UMA  Four processes are created one for each processors.  Each process has a local temporary variable.  Each process adds its share of the n values  Accumulating the sum in its local temporary variable.  When the process is through computing its subtotal, it adds its subtotal to the shared variable, accumulating the grand total.  Since multiple processes are accessing the same global variable, that portion of code is a critical section, and the processes must enforce mutual exclusion.
An Illustrative Example for UMA  A barrier synchronization step, inserted in the algorithm after the critical section, ensures that no process continues until all processes have added their subtotals to the grand total.  In sequential code one process computes the average by dividing the grand total by n.  To compute the global sum of squares, the processes go through a process similar to that which computes the global sum.  Again, the processes must enforce mutual exclusion.  After another barrier synchronization, one process computes the variance by dividing the sun of squares by the number of values.
An Illustrative Example for Multicomputer Solve the same problem on a four-node multicomputer.  There is no shared memory; the n values are distributed among the local memories of the nodes.  Each node process has four variables:  two to accumulate sums  one to store the mean  and another to store the variance  Each node process initializes the two accumulators to 0 and adds up the values in its local memory.  At this point every process has a subtotal; the four subtotals must be combined to find the grand total.
An Illustrative Example for Multicomputer [Figure: nodes 0-3 hold subtotals 7, 8, 11, and 10. After the first exchange-and-add step the nodes hold 15, 15, 21, 21; after the second step every node holds the grand total 36.]
An Illustrative Example for Multicomputer The four subtotals must be combined to find the grand total.  After two exchange-and-add steps, every process has the grand total.  Every process can divide the grand total by n to determine the mean.
An Illustrative Example for Multicomputer A similar set of steps allows the processes to compute the variance of the list values. Every process has the result. One of the processes can pass the answer back to the program running on the front end, which then deallocates the hypercube, surrendering access to the nodes.
A Sample Application Integration: find the area under the curve 4/(1 + x²) between 0 and 1, which equals π.  The interval [0, 1] is divided into n subintervals of width 1/n.  For each of these intervals the algorithm computes the area of a rectangle whose height is such that the curve intersects the top of the rectangle at its midpoint.  The sum of the areas of the n rectangles approximates the area under the curve.
A Sample Application  This algorithm is data-parallel.  Since the areas of all the rectangles can be  computed simultaneously.  Computing the area of each rectangle requires the same amount of work: hence load balancing is insignificant.  If the language requires to divide the work among the processors, it can be done easily.
FORTRAN 90 Parallel Programming Language
FORTRAN 90 In 1978 the ANSI-accredited technical committee X3J3 began working on a new version of the FORTRAN language. In the early 1990s the resulting language, Fortran 90, was adopted as an ISO and ANSI standard.
FORTRAN 90 Fortran 90 is a superset of FORTRAN 77. It includes all the features of FORTRAN 77, plus  Array operations  Improved facilities for numerical computations  Syntax to allow processors to support short integers, packed logicals, very large character sets, and high-precision real and complex numbers  User-defined data types, structures, and pointers  Dynamic storage allocation  Modules to support abstract data types  Internal procedures and recursive procedures  Improvements to input-output facilities  New control structures  New intrinsic procedures  Terminal-oriented source form
FORTRAN 90 The committee also marked many language features as obsolescent, including  the arithmetic IF  some DO construct variations  the assigned GO TO  assigned formats  and the H edit descriptor The next revision of the FORTRAN standard may not contain these features.
FORTRAN 90 Programmer's Model The Fortran 90 programmer has a model of parallel computation similar to a PRAM: a CPU and a vector unit share a single memory.  The CPU executes sequential instructions, accessing variables stored in the shared memory.  To execute parallel operations, the CPU controls the vector unit, which also stores and fetches data to and from the shared memory.
FORTRAN 90 Language Features Fortran 90 gives the programmer the ability to specify the type of variables through type declaration statements such as • REAL A, B, C • INTEGER I Each type may have several kinds. For example, a real variable may be stored in 4 bytes or 8 bytes. The Fortran 90 programmer may specify explicitly the kind, as well as the type, of a variable, as in the following example: • REAL (KIND=LONG) PI Fortran 90 introduces the notion of an array constant. For example, • (/ 2, 3, 5, 7, 11 /) denotes a one-dimensional array with five elements. It is possible to construct an array of higher dimension by declaring an array constant, then changing its dimension with the RESHAPE function.
FORTRAN 90 Language Features An implied DO notation can simplify the specification of array constants. For example, the array constant • (/ 2, 4, 6, 8, 10 /) may be specified as (/ (I, I = 2, 10, 2) /) Fortran 90 also allows operations on arrays. When applied to an array, the unary intrinsic operators + and - return an array of the same dimensions, where the elements in the result array are found by applying the operator to the corresponding elements of the operand array. Numerical, relational, and logical binary intrinsic operators can manipulate arrays having the same dimensions. Each element in the result array is found by applying the operator to the corresponding elements of the operand arrays. A binary intrinsic operator can also combine an array with a scalar, resulting in an array of the same dimensions as the array operand.
FORTRAN 90 Language Features For example, given the array declarations • REAL, DIMENSION(100,50) :: X, Y • REAL, DIMENSION(100) :: Z the following are examples of legal array expressions: X + Y: array of shape (100,50), elements X(I,J) + Y(I,J) X + 1.0: array of shape (100,50), elements X(I,J) + 1.0 X .EQ. Y: logical array of shape (100,50), each element .TRUE. if X(I,J) .EQ. Y(I,J) and .FALSE. otherwise X(1:100,3) + Z: array of shape (100), elements X(I,3) + Z(I)
FORTRAN 90 Language Features Sometimes it is important to be able to perform an operation on a subset of the array elements. The WHERE statement allows the programmer to specify which array elements are to be active. For example, the statement WHERE (A > 0.0) A = SQRT(A) replaces every positive element of A with its square root.
FORTRAN 90 Language Features The WHERE construct divides the array elements into two sets, first performing one or more array assignments on the elements for which the expression is true, then performing one or more array assignments on the elements for which the expression is false. The syntax of the most general form of the WHERE construct is: WHERE (logical-array-expression) array-assignment-statements ELSEWHERE array-assignment-statements END WHERE Finally, new transformational functions allow the reduction of an array to a scalar value. For example, the function SUM returns the sum of the elements of the array passed to it as an argument.
SEQUENT C Parallel Programming Language
Sequent C  Sequent computers run the DYNIX operating system, a version of UNIX tailored for the multiprocessor environment.  In addition to the operating-system calls typically found in a UNIX system, DYNIX provides a set of routines to facilitate parallel processing.  The commercial parallel programming languages the Sequent hardware uses are simple extensions of sequential languages that allow programmers to declare shared variables that interact via mutual exclusion and barrier synchronization.  The resulting languages are primitive.
Sequent C - Shared Data  Parallel processes on the Sequent coordinate their activities by accessing shared data structures.  The keyword shared, placed before a global variable declaration, indicates that all processes are to share a single instance of that variable.  For example, if a 10-element global array is declared int a[10], then every active process has its own copy of the array; if one process modifies a value in its copy of a, no other process's copy will change.  On the other hand, if the array is declared shared int a[10], then all active processes share a single instance of the array, and changes made by one process can be detected by the other processes.
Sequent C - Parallel Processing Function  A program begins execution as a single process. This process is responsible for executing those parts of the program that are inherently sequential.  the original process forks a number of other processes, each process performing its share of the work.  The total number of processes accessing shared data cannot exceed the number of physical processors less one. Because there are at least as many CPUs as active processes, each process may execute on its own CPU.  This allows a major reduction in the execution time, assuming that the computer is not executing any other jobs.
Sequent C - Parallel Processing Function  When control reaches an inherently sequential portion of the computation, only the original process executes the code; the remaining processes wait until control reaches another portion of the computation that can be divided into pieces and executed concurrently. The program cycles through these two modes until termination.  Parallel programs executing on the Sequent alternate between sequential and parallel execution.  The transition from parallel to sequential execution is always delimited by a barrier synchronization. In addition, data dependencies may require the insertion of barrier synchronizations within parallel code.
Sequent C - Parallel Processing Function Sequent's microtasking library has seven key functions: 1. m_set_procs(p): The parent process initializes to p a shared variable that controls the number of processes created by a subsequent call to m_fork. The value of p cannot exceed the number of physical processors in the system minus one. The function also initializes barriers and locks.
Sequent C - Parallel Processing Function 2. m_fork(name[,arg,...]): The parent process creates a number of child processes. The parent process and the child processes begin executing function name with the arguments (if any) also specified in the call to m_fork. After all the processes (the parent and all the children) have completed execution of function name, the parent process resumes execution with the code after m_fork, while the child processes busy-wait until the next call to m_fork.
Sequent C - Parallel Processing Function 2. m_fork(name[,arg,...]) (continued): Because the child processes busy-wait between calls rather than being destroyed, the first call to m_fork is more expensive than subsequent calls: only the first call entails process creation.
Sequent C - Parallel Processing Function 3. m_get_myid: A process calls function m_get_myid to get its unique process number. If the total number of active processes is p, then the process number of the parent is 0, while the process numbers of the child processes range from 1 to p-1. 4. m_get_numprocs: Function m_get_numprocs returns the number of processes executing in parallel. Given the total number of processes and its own process number, a process can determine which portion of a computation is its responsibility.
Sequent C - Parallel Processing Function 5-6. m_lock, m_unlock: Functions m_lock and m_unlock ensure mutually exclusive execution of the code the two calls surround. Once a process has entered a block of code delimited by m_lock and m_unlock, no other process may enter until the first process has left. 7. m_kill_procs: Function m_kill_procs kills the child processes created by the first call to m_fork.
Sequent C - Monitor  Most parallel algorithms implemented on multiprocessors require a process to perform a series of operations on a shared data structure, as if it were an atomic operation.  For example, a process may need to fetch the value at the beginning of a linked list and advance the list pointer to the next list element.  When the hardware cannot perform the entire series of operations as an atomic operation, the process must have some way to enforce mutual exclusion, keeping all other processes from referencing the resource while it is being modified. The piece of code in which mutual exclusion must be enforced is called a critical section.
Sequent C - Monitor One way to structure accesses to shared resources is by using a monitor.  A monitor consists of variables representing the state of some resource, procedures that implement operations on it, and initialization code.  The values of the variables are initialized before any procedure in the monitor is called; these values are retained between procedure invocations and may be accessed only by procedures in the monitor.  Monitor procedures resemble ordinary procedures in the programming language, with one significant exception: the execution of procedures in the same monitor is guaranteed to be mutually exclusive. Hence monitors are a structured way of implementing mutual exclusion.
Sequent C - Monitor Programming languages that support monitors include Concurrent Pascal (Brinch Hansen 1975, 1977) and Modula (Wirth 1977a, 1977b, 1977c). Even if your parallel programming language does not support monitors, you can implement one yourself. For example, in the Sequent C language, you can implement a monitor by declaring a shared lock variable for each resource, putting an s_lock statement that accesses the variable at the start of each procedure, and putting an s_unlock statement at the end of each procedure. You also must have enough self-discipline to use only these procedures to access the shared resource.
