Introduction to Parallel Computing Part IIb
What is MPI? The Message Passing Interface (MPI) is a standardised interface, of which several implementations exist. The MPI standard specifies three forms of subroutine interfaces: (1) language-independent notation; (2) Fortran notation; (3) C notation.
MPI Features MPI implementations provide: • Abstraction of hardware implementation • Synchronous communication • Asynchronous communication • File operations • Time measurement operations
Implementations
• MPICH: Unix / Windows NT
• MPICH-T3E: Cray T3E
• LAM: Unix / SGI Irix / IBM AIX
• Chimp: SunOS / AIX / Irix / HP-UX
• WinMPI: Windows 3.1 (no network required)
Programming with MPI What is the difference between programming using the traditional approach and the MPI approach: 1. Use of MPI library 2. Compiling 3. Running
Compiling (1) When a program is written, compiling it differs slightly from the normal situation. Although the details vary between MPI implementations, there are two frequently used approaches.
Compiling (2)
First approach:
$ gcc myprogram.c -o myexecutable -lmpi
Second approach:
$ mpicc myprogram.c -o myexecutable
Running (1) In order to run an MPI-enabled application we generally use the command ‘mpirun’:
$ mpirun -np x myexecutable <parameters>
where x is the number of processes to use, and <parameters> are the arguments to the executable, if any.
Running (2) The ‘mpirun’ program takes care of creating the processes on the selected processors. By default, ‘mpirun’ decides which processors to use, usually based on a global configuration file. It is possible to specify processors explicitly, but such a specification may only be treated as a hint.
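For example, with MPICH the candidate hosts can typically be listed in a machine file that is passed to ‘mpirun’; the file name and flag below are illustrative, and the exact option name varies per implementation:

$ cat machines
node01
node02
node03
node04
$ mpirun -np 4 -machinefile machines myexecutable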
MPI Programming (1) Implementations of MPI support Fortran, C, or both. Here we only consider programming using the C library. The first step in writing a program using MPI is to include the correct header: #include "mpi.h"
MPI Programming (2)
#include "mpi.h"

int main (int argc, char *argv[])
{
  ...
  MPI_Init (&argc, &argv);
  ...
  MPI_Finalize();
  return ...;
}
MPI_Init int MPI_Init (int *argc, char ***argv) The MPI_Init procedure should be called before any other MPI procedure (except MPI_Initialized). It must be called exactly once, at program initialisation. It removes the arguments that are used by MPI from the argument array.
MPI_Finalize int MPI_Finalize (void) This routine cleans up all MPI state. It should be the last MPI routine to be called in a program; no other MPI routine may be called after MPI_Finalize. Pending communication should be finished before finalisation.
Using multiple processes When running an MPI-enabled program using multiple processes, each process runs an identical copy of the program. So there must be a way for a process to know which process it is. This situation is comparable to programming with the ‘fork’ statement. MPI defines two subroutines for this.
MPI_Comm_size int MPI_Comm_size (MPI_Comm comm, int *size) This call returns the number of processes involved in a communicator. To find out how many processes are used in total, call this function with the predefined global communicator MPI_COMM_WORLD.
MPI_Comm_rank int MPI_Comm_rank (MPI_Comm comm, int *rank) This procedure determines the rank (index) of the calling process in the communicator. Each process is assigned a unique number within a communicator.
MPI_COMM_WORLD MPI communicators are used to specify the group of processes a communication applies to. A communicator is shared by a group of processes. The predefined MPI_COMM_WORLD applies to all processes. Communicators can be duplicated, created and deleted. For most applications, use of MPI_COMM_WORLD suffices.
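As a hedged illustration (not needed for the remaining examples, which all use MPI_COMM_WORLD), a new communicator can for instance be derived from MPI_COMM_WORLD with MPI_Comm_split; the variable names are illustrative:

int rank;
MPI_Comm even_odd;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
// Split MPI_COMM_WORLD into two communicators: one for even ranks, one for odd ranks.
MPI_Comm_split (MPI_COMM_WORLD, rank % 2, rank, &even_odd);
// ... communicate within even_odd ...
MPI_Comm_free (&even_odd);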
Example ‘Hello World!’
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int size, rank;
  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  printf ("Hello world! from processor (%d/%d)\n", rank+1, size);
  MPI_Finalize();
  return 0;
}
Running ‘Hello World!’ $ mpicc -o hello hello.c $ mpirun -np 3 hello Hello world! from processor (1/3) Hello world! from processor (2/3) Hello world! from processor (3/3) $ _
MPI_Send int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) Synchronously sends a message to dest. Data is found in buf, which contains count elements of datatype. To identify the send, a tag has to be specified. The destination dest is the process rank in communicator comm.
MPI_Recv int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) Synchronously receives a message from source. Buffer must be able to hold count elements of datatype. The status field is filled with status information. MPI_Recv and MPI_Send calls should match; equal tag, count, datatype.
Datatypes
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_SHORT  unsigned short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_LONG   unsigned long int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
(http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html)
Example send / receive
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  MPI_Status s;
  int size, rank, i, j;
  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  if (rank == 0) // Master process
  {
    printf ("Receiving data . . .\n");
    for (i = 1; i < size; i++)
    {
      MPI_Recv ((void *)&j, 1, MPI_INT, i, 0xACE5, MPI_COMM_WORLD, &s);
      printf ("[%d] sent %d\n", i, j);
    }
  }
  else
  {
    j = rank * rank;
    MPI_Send ((void *)&j, 1, MPI_INT, 0, 0xACE5, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
Running send / receive $ mpicc -o sendrecv sendrecv.c $ mpirun -np 4 sendrecv Receiving data . . . [1] sent 1 [2] sent 4 [3] sent 9 $ _
MPI_Bcast int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) Synchronously broadcasts a message from root, to all processors in communicator comm (including itself). Buffer is used as source in root processor, as destination in others.
MPI_Barrier int MPI_Barrier (MPI_Comm comm) Blocks until all processes defined in comm have reached this routine. Use this routine to synchronize processes.
Example broadcast / barrier
int main (int argc, char *argv[])
{
  int rank, i;
  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  if (rank == 0)
    i = 27;
  MPI_Bcast ((void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf ("[%d] i = %d\n", rank, i);
  // Wait for every process to reach this code
  MPI_Barrier (MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
Running broadcast / barrier $ mpicc -o broadcast broadcast.c $ mpirun -np 3 broadcast [0] i = 27 [1] i = 27 [2] i = 27 $ _
MPI_Sendrecv
int MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
int MPI_Sendrecv_replace (void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
Send and receive in a single call; the second variant uses one buffer for both the outgoing and the incoming message.
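A hedged sketch (assuming the usual MPI_Init / MPI_Finalize skeleton from the earlier examples) that shifts a value around a ring of processes with MPI_Sendrecv_replace; combining the send and receive in one call avoids the deadlock that two plain blocking MPI_Send / MPI_Recv calls could cause:

int rank, size, value;
MPI_Status s;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
value = rank;
// Send to the right neighbour, receive from the left neighbour.
MPI_Sendrecv_replace (&value, 1, MPI_INT,
                      (rank + 1) % size, 0,
                      (rank + size - 1) % size, 0,
                      MPI_COMM_WORLD, &s);
printf ("[%d] received %d\n", rank, value);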
Other useful routines • MPI_Scatter • MPI_Gather • MPI_Type_vector • MPI_Type_commit • MPI_Reduce / MPI_Allreduce • MPI_Op_create
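The example on the next slide shows MPI_Scatter and MPI_Reduce in use. As an additional hedged sketch, MPI_Type_vector and MPI_Type_commit can describe strided data, such as a column of a matrix stored row by row; the matrix dimensions and variable names here are illustrative assumptions:

#define ROWS 4
#define COLS 5
double matrix[ROWS][COLS];
MPI_Datatype column;
// One element per row, stride of COLS elements: describes a single matrix column.
MPI_Type_vector (ROWS, 1, COLS, MPI_DOUBLE, &column);
MPI_Type_commit (&column);
// Send column 2 of the matrix to process 1 (tag 0); assumes at least two processes.
MPI_Send (&matrix[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
MPI_Type_free (&column);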
Example scatter / reduce
int main (int argc, char *argv[])
{
  int data[] = {1, 2, 3, 4, 5, 6, 7}; // Size must be >= #processors
  int rank, i = -1, j = -1;
  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf ("[%d] Received i = %d\n", rank, i);
  MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD, 0, MPI_COMM_WORLD);
  printf ("[%d] j = %d\n", rank, j);
  MPI_Finalize();
  return 0;
}
Running scatter / reduce $ mpicc -o scatterreduce scatterreduce.c $ mpirun -np 4 scatterreduce [0] Received i = 1 [0] j = 24 [1] Received i = 2 [1] j = -1 [2] Received i = 3 [2] j = -1 [3] Received i = 4 [3] j = -1 $ _
Some reduce operations
MPI_MAX    Maximum value
MPI_MIN    Minimum value
MPI_SUM    Sum of values
MPI_PROD   Product of values
MPI_LAND   Logical AND
MPI_BAND   Bit-wise AND
MPI_LOR    Logical OR
MPI_BOR    Bit-wise OR
MPI_LXOR   Logical exclusive OR
MPI_BXOR   Bit-wise exclusive OR
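MPI_Allreduce (mentioned under ‘Other useful routines’) combines a reduction with a broadcast of the result, so every process obtains the reduced value. A minimal hedged sketch, computing the sum of all ranks:

int rank, sum;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
// Every process contributes its rank; every process receives the total.
MPI_Allreduce (&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf ("[%d] sum of all ranks = %d\n", rank, sum);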
Measuring running time
double MPI_Wtime (void);

double timeStart, timeEnd;
...
timeStart = MPI_Wtime();
// Code to measure time for goes here.
timeEnd = MPI_Wtime();
...
printf ("Running time = %f seconds\n", timeEnd - timeStart);
Parallel sorting (1) Sorting a sequence of numbers using the binary-sort method. This method divides a given sequence into two halves (until only one element remains) and sorts both halves recursively. The two halves are then merged together to form a sorted sequence.
Binary sort pseudo-code
sorted-sequence BinarySort (sequence)
{
  if (# elements in sequence > 1)
  {
    seqA = first half of sequence
    seqB = second half of sequence
    BinarySort (seqA);
    BinarySort (seqB);
    sorted-sequence = merge (seqA, seqB);
  }
  else
    sorted-sequence = sequence
}
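A hedged C sketch of the same idea for an array of ints; the function names and the use of a scratch array are illustrative, not part of the slides:

#include <string.h> // memcpy

// Merge the sorted halves a[lo..mid-1] and a[mid..hi-1] using scratch space tmp.
static void merge (int *a, int *tmp, int lo, int mid, int hi)
{
  int i = lo, j = mid, k = lo;
  while (i < mid && j < hi)
    tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
  while (i < mid) tmp[k++] = a[i++];
  while (j < hi)  tmp[k++] = a[j++];
  memcpy (a + lo, tmp + lo, (hi - lo) * sizeof (int));
}

// Recursively sort a[lo..hi-1]; tmp must be at least as large as a.
static void binary_sort (int *a, int *tmp, int lo, int hi)
{
  if (hi - lo > 1)
  {
    int mid = lo + (hi - lo) / 2;
    binary_sort (a, tmp, lo, mid);
    binary_sort (a, tmp, mid, hi);
    merge (a, tmp, lo, mid, hi);
  }
}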
Merge two sorted sequences: for example, merging 1 2 5 7 with 3 4 6 8 gives 1 2 3 4 5 6 7 8.
Example binary sort (diagram: a sequence of eight numbers is recursively split into halves, the single-element halves are merged into sorted pairs, and the sorted halves are merged back into the fully sorted sequence).
Parallel sorting (2) This way of dividing the work and gathering the results lends itself naturally to a parallel implementation. Divide the work in two and give each half to a processor. Have each of these processors divide its work again, until either the data cannot be split any further or no more processors are available.
Implementation problems • Number of processors may not be a power of two • Number of elements may not be a power of two • How to achieve an even workload? • Data size is less than number of processors
Parallel matrix multiplication We use the following partitioning of the data (p = 4). (Diagram: the first matrix is split into blocks of consecutive rows, one block per process P1..P4; the second matrix is broadcast in full to every process.)
Implementation
1. Master (process 0) reads data
2. Master sends size of data to slaves
3. Slaves allocate memory
4. Master broadcasts second matrix to all other processes
5. Master sends respective parts of first matrix to all other processes
6. Every process performs its local multiplication
7. All slave processes send back their result
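A hedged, self-contained sketch of this scheme. For brevity it generates the matrices instead of reading them, assumes the dimension divides evenly over the processes, and uses the collectives MPI_Scatter / MPI_Gather in place of the individual sends described above; all variable names are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int size, rank, n = 0, rows, i, j, k;
  double *A = NULL, *B, *C = NULL, *Apart, *Cpart;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  if (rank == 0)                       // 1. Master 'reads' (here: generates) the data
  {
    n = 4 * size;                      //    dimension chosen so it divides evenly
    A = malloc (n * n * sizeof (double));
    C = malloc (n * n * sizeof (double));
    for (i = 0; i < n * n; i++) A[i] = 1.0;
  }

  MPI_Bcast (&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   // 2. Size of the data to the slaves

  rows  = n / size;                                // 3. Allocate memory
  B     = malloc (n * n * sizeof (double));
  Apart = malloc (rows * n * sizeof (double));
  Cpart = malloc (rows * n * sizeof (double));
  if (rank == 0)
    for (i = 0; i < n * n; i++) B[i] = 2.0;

  MPI_Bcast (B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);         // 4. Broadcast second matrix
  MPI_Scatter (A, rows * n, MPI_DOUBLE,                        // 5. Distribute rows of first matrix
               Apart, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  for (i = 0; i < rows; i++)                                   // 6. Local multiplication
    for (j = 0; j < n; j++)
    {
      double sum = 0.0;
      for (k = 0; k < n; k++)
        sum += Apart[i * n + k] * B[k * n + j];
      Cpart[i * n + j] = sum;
    }

  MPI_Gather (Cpart, rows * n, MPI_DOUBLE,                     // 7. Collect the results
              C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf ("C[0][0] = %f\n", C[0]);

  free (Apart); free (Cpart); free (B); free (A); free (C);
  MPI_Finalize();
  return 0;
}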
Multiplication 1000 x 1000 (plot: running time in seconds versus number of processors, up to about 60 processors; the measured time Tp is compared with the ideal T1 / p).
Multiplication 5000 x 5000 (plot: running time in seconds versus number of processors, up to about 35 processors; the measured time Tp is compared with the ideal T1 / p).
Gaussian elimination We use the following partitioning of the data (p = 4). (Diagram: the rows of the matrix are divided over the processes P1..P4.)
Implementation (1)
1. Master reads both matrices
2. Master sends size of the matrices to slaves
3. Slaves calculate their part and allocate memory
4. Master sends each slave its respective part
5. Set sweeping row to 0 in all processes
6. Sweep the matrix (see next sheet)
7. Slaves send back their results
Implementation (2) While the sweeping row is not past the final row:
A. Every process decides whether it owns the current sweeping row
B. The owner sends a copy of the row to every other process
C. All processes sweep their part of the matrix using the current row
D. The sweeping row is incremented
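A hedged sketch of this loop. It assumes a block distribution in which each process holds localRows consecutive rows of the n x (n+1) augmented matrix in the array local, with firstRow the global index of its first row; these names, and the use of MPI_Bcast with the owner as root for step B, are assumptions of the sketch (<stdlib.h> and <string.h> are needed for malloc and memcpy):

int row, r;
double *pivotRow = malloc ((n + 1) * sizeof (double));

for (row = 0; row < n; row++)                    // while sweeping row not past final row
{
  int owner = row / localRows;                   // A. decide who owns the sweeping row
  if (rank == owner)
    memcpy (pivotRow, &local[(row - firstRow) * (n + 1)], (n + 1) * sizeof (double));

  // B. the owner distributes a copy of the row to every other process
  MPI_Bcast (pivotRow, n + 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);

  // C. all processes sweep their part of the matrix using the current row
  for (r = 0; r < localRows; r++)
  {
    int global = firstRow + r;
    if (global > row)                            // eliminate only below the pivot row
    {
      double factor = local[r * (n + 1) + row] / pivotRow[row];
      int c;
      for (c = row; c <= n; c++)
        local[r * (n + 1) + c] -= factor * pivotRow[c];
    }
  }
}                                                // D. sweeping row is incremented
free (pivotRow);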
Programming hints • Keep it simple! • Avoid deadlocks • Write robust code, even at the cost of speed • Design in advance; debugging is more difficult (output printed by different processes gets interleaved) • Error handling requires synchronisation, you can’t just exit the program.
References (1)
MPI Forum home page: http://www.mpi-forum.org/index.html
Beginner's guide to MPI (see also /MPI/): http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html
MPICH: http://www-unix.mcs.anl.gov/mpi/mpich/
References (2) Miscellaneous
http://www.erc.msstate.edu/labs/hpcl/projects/mpi/
http://nexus.cs.usfca.edu/mpi/
http://www-unix.mcs.anl.gov/~gropp/
http://www.epm.ornl.gov/~walker/mpitutorial/
http://www.lam-mpi.org/
http://epcc.ed.ac.uk/chimp/
http://www-unix.mcs.anl.gov/mpi/www/www3/