The document discusses parallel program design and parallel programming techniques. It introduces parallel algorithm design based on four steps: partitioning, communication, agglomeration, and mapping. It also covers parallel programming tools including pthreads, OpenMP, and MPI. Common parallel constructs like private, shared, barrier, and reduction are explained. Examples of parallel programs using pthreads and OpenMP are provided.
Embedded and ParallelSystems Lab 3 Introduction ■ Why Use Parallel Computing? ● Save time ● Solve larger problems ● Provide concurrency ● Cost savings ● The rise of multi-core CPUs ◆ Intel® Core™2 Duo ◆ Intel® Core™2 Quad ◆ AMD Opteron ◆ AMD Phenom ◆ Xbox 360 ◆ PS3
Embedded and ParallelSystems Lab 4 Introduction ■ Parallel computing ● It is the use of a parallel computer to reduce the time needed to solve a single computational problem. ■ Parallel programming ● It is a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors. ■ Split a program into n parts that can run at the same time to reduce the execution time; the final result is the same as that of the original program.
Embedded and ParallelSystems Lab 5 Introduction ■ Serial computing (figure) Source : http://www.llnl.gov/computing/tutorials/parallel_comp
Embedded and ParallelSystems Lab 10 Introduction ■ Flynn's Classical Taxonomy ● SISD: Single Instruction, Single Data ● SIMD: Single Instruction, Multiple Data ● MISD: Multiple Instruction, Single Data ● MIMD: Multiple Instruction, Multiple Data
Embedded and ParallelSystems Lab 12 Introduction ■ MISD and MIMD architectures (figures) Source : http://www.llnl.gov/computing/tutorials/parallel_comp
Embedded and ParallelSystems Lab 13 Introduction ■ Amdahl's Law ● If a fraction p of a program can be parallelized and it runs on n processors, speedup(n) = 1 / ((1 - p) + p/n) ● Best you could ever hope to do (as n grows without bound): S = 1 / (1 - p)
Embedded and ParallelSystems Lab 14 Parallel Algorithm Design ■ Ian Foster ■ Four-step process for designing parallel algorithms 1. Partitioning 2. Communication 3. Agglomeration 4. Mapping ■ General principles of parallelization ● Maximize processor utilization ● Minimize communication overhead ● Load balancing
Embedded and ParallelSystems Lab 15 Parallel Algorithm Design ■ Partitioning ● Process of dividing the computation and the data into pieces. ● Domain decomposition ● Functional decomposition
Embedded and ParallelSystems Lab 16 Parallel Algorithm Design ■ Communication ● Local communication ● Global communication
Embedded and ParallelSystems Lab 17 Parallel Algorithm Design ■ Agglomeration ● Increasing locality (combining tasks that are connected by a channel eliminates that communication) ● Combining sending and receiving tasks
Embedded and ParallelSystems Lab 18 Parallel Algorithm Design ■ Mapping ● Process of assigning tasks to processors (figure: a task graph with tasks A to I mapped onto processors)
Embedded and ParallelSystems Lab 19 Foster's parallel algorithm design (figure: a problem is taken through Partitioning, Communication, Agglomeration, and Mapping, shown on a task graph with tasks A to I)
Embedded and ParallelSystems Lab 20 Parallel Example : matrix ■ C = A × B (figure: each of the processes P1 to P4 computes one part of the product, and the partial results are then merged)
Embedded and ParallelSystems Lab 21 Decision Tree Source : Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”
Embedded and ParallelSystems Lab 24 pthread ■ What is a Thread? ● A thread is a logical flow that runs in the context of a process. ● Multiple threads can run concurrently in a single process. ● Each thread has its own thread context ◆ a unique integer thread ID (TID) ◆ stack ◆ stack pointer ◆ program counter ◆ general-purpose registers ◆ condition codes Source : William W.-Y. Liang , “Linux System Programming”
Embedded and ParallelSystems Lab 25 Threads vs. Processes ■ Process: ● When a process executes a fork call, a new copy of the process is created with its own variables and its own PID. ● This new process is scheduled independently, and (in general) executes almost independently of the process that created it. ■ Thread: ● When we create a new thread in a process, the new thread of execution gets its own stack (and hence local variables) but shares global variables, file descriptors, signal handlers, and its current directory state with the process that created it. Source : William W.-Y. Liang , “Linux System Programming”
Embedded and ParallelSystems Lab 26 pthread Function
Function: int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*func)(void *), void *arg)
Purpose: Create a new thread of execution.
Parameters: thread: ID of the created thread; attr: thread attribute object, NULL for the default attributes; func: thread function; arg: argument passed to the thread
Return value: 0 if OK; a positive error number on error
Function: int pthread_join(pthread_t tid, void **thread_return)
Purpose: Blocks the calling thread until the specified thread terminates.
Parameters: tid: ID of the thread to wait for; thread_return: buffer for the returned value
Return value: 0 if OK; a positive error number on error
Embedded and ParallelSystems Lab 27 pthread Function
Function: void pthread_exit(void *retval)
Purpose: Terminates the calling thread.
Parameters: retval: thread return value; if not NULL, it is the value delivered through pthread_join's thread_return
Return value: none
Function: pthread_t pthread_self(void)
Purpose: Returns the current thread's ID.
Parameters: none
Return value: thread ID (unsigned long int)
Embedded and ParallelSystems Lab 28 Example: thread.c
#include <stdio.h>
#include <pthread.h>
char message[] = "Example:create new thread";
void *thread_function(void *arg){
    pthread_t tid = pthread_self();
    printf("thread_function is running\n");
    printf("new ID:%lu Argument is %s\n", (unsigned long)tid, (char*)arg);
    pthread_exit("new thread end\n");
}
int main(void){
    pthread_t new_thread;
    pthread_t master_thread = pthread_self();
    void *thread_result;
    pthread_create(&new_thread, NULL, thread_function, (void*)message);
    pthread_join(new_thread, &thread_result);
    printf("\nmaster ID:%lu the new thread return value is:%s\n", (unsigned long)master_thread, (char*)thread_result);
    return 0;
}
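■ On a typical Linux system this example can be built with a command along the lines of gcc -pthread thread.c -o thread (older setups may use -lpthread instead) and then run with ./thread.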
Embedded and ParallelSystems Lab 29 pthread Attribute
Function: int pthread_attr_init(pthread_attr_t *attr)
Purpose: Initialize a thread attributes object.
Parameters: attr: thread attribute object
Return value: 0 if OK; a positive error number on error
Function: int pthread_attr_destroy(pthread_attr_t *attr)
Purpose: Destroy a thread attributes object.
Parameters: attr: thread attribute object
Return value: 0 if OK; a positive error number on error
Embedded and ParallelSystems Lab 30 pthread Attribute
Attribute: function, and possible arguments (the default is listed first)
● detachstate: thread's detach state
 ◆ PTHREAD_CREATE_JOINABLE (default): when the thread terminates, its thread ID and exit status are kept until some thread in the process calls pthread_join on it
 ◆ PTHREAD_CREATE_DETACHED: when the thread terminates, all of its resources are released immediately
● schedpolicy: thread's scheduling policy
 ◆ SCHED_OTHER (default): no priority (normal scheduling)
 ◆ SCHED_FIFO: first in, first out
 ◆ SCHED_RR: round robin
● schedparam: thread's scheduling parameters
● inheritsched: thread's scheduling inheritance
 ◆ PTHREAD_INHERIT_SCHED (default): scheduling attributes are inherited from the creating thread
 ◆ PTHREAD_EXPLICIT_SCHED: scheduling attributes are taken from the thread attribute object (pthread_attr_t)
● scope: thread's scope; PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS, but Linux only has PTHREAD_SCOPE_SYSTEM
● guardsize: thread's guard size (default PAGESIZE bytes)
● stackaddr: thread's stack address
● stacksize: thread's stack size
Embedded and ParallelSystems Lab 31 Get pthread Attribute ■ int pthread_attr_getdetachstate(const pthread_attr_t *attr, int *detachstate); ■ int pthread_attr_getguardsize(const pthread_attr_t *attr, size_t *guardsize); ■ int pthread_attr_getinheritsched(const pthread_attr_t *attr, int *inheritsched); ■ int pthread_attr_getschedparam(const pthread_attr_t *attr, struct sched_param *param); ■ int pthread_attr_getschedpolicy(const pthread_attr_t *attr, int *policy); ■ int pthread_attr_getscope(const pthread_attr_t *attr, int *scope); ■ int pthread_attr_getstackaddr(const pthread_attr_t *attr, void **stackaddr); ■ int pthread_attr_getstacksize(const pthread_attr_t *attr, size_t *stacksize);
Embedded and ParallelSystems Lab 32 Set pthread Attribute ■ int pthread_attr_setdetachstate(pthread_attr_t *attr, int detachstate); ■ int pthread_attr_setguardsize(pthread_attr_t *attr, size_t guardsize); ■ int pthread_attr_setinheritsched(pthread_attr_t *attr, int inheritsched); ■ int pthread_attr_setschedparam(pthread_attr_t *attr, const struct sched_param *param); ■ int pthread_attr_setschedpolicy(pthread_attr_t *attr, int policy); ■ int pthread_attr_setscope(pthread_attr_t *attr, int scope); ■ int pthread_attr_setstackaddr(pthread_attr_t *attr, void *stackaddr); ■ int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);
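■ A minimal sketch (not from the original slides) of how these setters are combined with pthread_create; the worker function and the 1 MB stack size are illustrative assumptions:
#include <pthread.h>
#include <stdio.h>

/* hypothetical thread body, for illustration only */
static void *worker(void *arg) {
    printf("worker running\n");
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);                                      /* start from the default attributes */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);   /* resources released at exit, no join needed */
    pthread_attr_setstacksize(&attr, 1024 * 1024);                 /* request a 1 MB stack */

    pthread_create(&tid, &attr, worker, NULL);                     /* attr only affects this creation */
    pthread_attr_destroy(&attr);                                   /* safe to destroy after pthread_create */

    pthread_exit(NULL);   /* end the main thread without ending the process, so the detached thread can finish */
}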
Embedded and ParallelSystems Lab 33 OpenMP Directive Table
● atomic: Specifies a memory location that will be updated atomically.
● barrier: Synchronizes all threads in a team; all threads pause at the barrier until all threads execute the barrier.
● critical: Specifies that code is only executed on one thread at a time.
● flush: Specifies that all threads have the same view of memory for all shared objects.
● for: Causes the work done in a for loop inside a parallel region to be divided among threads.
● master: Specifies that only the master thread should execute a section of the program.
● ordered: Specifies that code under a parallelized for loop should be executed like a sequential loop.
● parallel: Defines a parallel region, which is code that will be executed by multiple threads in parallel.
● sections: Identifies code sections to be divided among all threads.
● single: Lets you specify that a section of code should be executed on a single thread, not necessarily the master thread.
● threadprivate: Specifies that a variable is private to a thread.
Source : http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx
Embedded and ParallelSystems Lab 34 OpenMP Clause Table
● copyin: Allows threads to access the master thread's value for a threadprivate variable.
● copyprivate: Specifies that one or more variables should be shared among all threads.
● default: Specifies the behavior of unscoped variables in a parallel region.
● firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable as it exists before the parallel construct.
● if: Specifies whether a loop should be executed in parallel or in serial.
● lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
● nowait: Overrides the barrier implicit in a directive.
● num_threads: Sets the number of threads in a thread team.
● ordered: Required on a parallel for statement if an ordered directive is to be used in the loop.
● private: Specifies that each thread should have its own instance of a variable.
● reduction: Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.
● schedule: Applies to the for directive; has four methods: static, dynamic, guided, runtime.
● shared: Specifies that one or more variables should be shared among all threads.
Source : http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx
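■ A small sketch (not from the slides) combining several of these clauses on one loop; the array, the thread count, and the schedule are illustrative choices:
#include <omp.h>
#include <stdio.h>
#define N 1000

int main(void) {
    double a[N];
    double sum = 0.0;
    int i;

    /* shared: a is visible to all threads; private: each thread has its own i;
       reduction: per-thread partial sums are combined into sum when the region ends;
       schedule(static): iterations are split into equal contiguous chunks */
    #pragma omp parallel for num_threads(4) default(none) shared(a) private(i) reduction(+:sum) schedule(static)
    for (i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}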
Embedded and ParallelSystems Lab 35 Reference ■ System Threads Reference http://www.unix.org/version2/whatsnew/threadsref.html ■ Semaphore http://www.mkssoftware.com/docs/man3/sem_init.3.asp ■ Richard Stones, Neil Matthew, “Beginning Linux Programming” ■ William W.-Y. Liang, “Linux System Programming”
Embedded and ParallelSystems Lab 40 Types of Work-Sharing Constructs ■ Loop:shares iterations of a loop across the team. Represents a type of "data parallelism". Source : http://www.llnl.gov/computing/tutorials/openMP/ ■ Sections:breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".
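■ A minimal sketch (assumed, not from the slides) of the sections construct used for functional parallelism; the two task functions are made up for illustration:
#include <omp.h>
#include <stdio.h>

/* hypothetical independent tasks */
void load_input(void)  { printf("loading on thread %d\n",  omp_get_thread_num()); }
void build_index(void) { printf("indexing on thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        load_input();      /* one thread runs this section */

        #pragma omp section
        build_index();     /* another thread runs this one at the same time */
    }
    return 0;
}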
Embedded and ParallelSystems Lab 41 Types of Work-Sharing Constructs ■ single: the enclosed code is executed by only one thread in the team (one of the threads, not necessarily the master thread). Source : http://www.llnl.gov/computing/tutorials/openMP/
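■ A minimal sketch (assumed, not from the slides) of how single behaves inside a parallel region:
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        /* every thread executes this statement */
        printf("thread %d working\n", omp_get_thread_num());

        /* only one thread (whichever reaches it first) executes the single block;
           the others wait at its implicit barrier unless nowait is added */
        #pragma omp single
        printf("single block executed by thread %d\n", omp_get_thread_num());
    }
    return 0;
}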
Embedded and ParallelSystems Lab 42 Loop work sharing
#pragma omp parallel for
for (int i = 0; i < 10000; i++)
    for (int j = 0; j < 100; j++)
        function(i);
is equivalent to
#pragma omp parallel
{   // the opening brace must start on a new line; it cannot follow "parallel" on the same line
    #pragma omp for
    for (int i = 0; i < 10000; i++)
        for (int j = 0; j < 100; j++)
            function(i);
}
parallel for requires the loop index to be an int and the number of iterations to be known in advance.
Execution on a dual-thread CPU:
Thread 0 (Master):
for (i = 0; i < 5000; i++)
    for (int j = 0; j < 100; j++)
        function(i);
Thread 1:
for (i = 5000; i < 10000; i++)
    for (int j = 0; j < 100; j++)
        function(i);
Embedded and ParallelSystems Lab 43 OpenMP example : log.cpp
#include <omp.h>
#pragma omp parallel for num_threads(2) // divide the for loop evenly between 2 threads
for (y=2;y<BufSizeY-2;y++)
 for (x=2;x<BufSizeX-2;x++)
  for (z=0;z<BufSizeBand;z++) {
   addr=(y*BufSizeX+x)*BufSizeBand+z;
   ans = (BYTE)(*(InBuf+addr))*16+
    (BYTE)(*(InBuf+((y*BufSizeX+x+1)*BufSizeBand+z)))*(-2) +
    (BYTE)(*(InBuf+((y*BufSizeX+x-1)*BufSizeBand+z)))*(-2) +
    (BYTE)(*(InBuf+(((y+1)*BufSizeX+x)*BufSizeBand+z)))*(-2)+
    (BYTE)(*(InBuf+(((y-1)*BufSizeX+x)*BufSizeBand+z)))*(-2)+
    (BYTE)(*(InBuf+((y*BufSizeX+x+2)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+((y*BufSizeX+x-2)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+(((y+2)*BufSizeX+x)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+(((y-2)*BufSizeX+x)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+(((y+1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1) +
    (BYTE)(*(InBuf+(((y+1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+(((y-1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1)+
    (BYTE)(*(InBuf+(((y-1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1);
   *(OutBuf+addr)=abs(ans)/8;
  }
Embedded and ParallelSystems Lab 44 Convert Log Image (figure: source image and the resulting output image)
Embedded and ParallelSystems Lab 46 OpenMP notice
■ Data dependence
int Fe[10];
Fe[0] = 0;
Fe[1] = 1;
#pragma omp parallel for num_threads(2)
for( i = 2; i < 10; ++i )
    Fe[i] = Fe[i-1] + Fe[i-2];
■ Race conditions
#pragma omp parallel
{
    #pragma omp for
    for( int i = 0; i < 1000000; ++i )
        sum += i;
}
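■ One common fix for the race condition above is a reduction clause (a critical or atomic directive would also work, at a higher synchronization cost); a minimal sketch, not part of the original slides:
#include <omp.h>
#include <stdio.h>

int main(void) {
    long long sum = 0;
    int i;

    /* each thread accumulates a private partial sum; the partial sums are
       combined into sum once, at the end of the region, so there is no race */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; ++i)
        sum += i;

    printf("sum = %lld\n", sum);   /* expected 499999500000 */
    return 0;
}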
Embedded and ParallelSystems Lab 47 OpenMP notice
■ DeadLock: thread 0 jumps over the barrier with goto, so the remaining threads wait at the barrier forever.
#pragma omp parallel
{
    int me;
    me = omp_get_thread_num();
    if (me == 0) goto Master;
    #pragma omp barrier
Master:
    #pragma omp single
    printf("done\n");
}
Embedded and ParallelSystems Lab 48 OpenMP example:matrix(1)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define RANDOM_SEED 2882 // random seed
#define VECTOR_SIZE 4 // square matrix: width equals height
#define MATRIX_SIZE (VECTOR_SIZE * VECTOR_SIZE) // total size of MATRIX
int main(int argc, char *argv[]){
    int i,j,k;
    int node_id;
    int *AA; // matrix A (also used to check the result sequentially)
    int *BB; // matrix B
    int *CC; // result matrix C
    int computing;
    int _vector_size = VECTOR_SIZE;
    int _matrix_size = MATRIX_SIZE;
    char c[10];
Embedded and ParallelSystems Lab 49 OpenMP example:matrix(2)
if(argc > 1){
    for( i = 1 ; i < argc ; ){
        if(strcmp(argv[i],"-s") == 0){
            _vector_size = atoi(argv[i+1]);
            _matrix_size = _vector_size * _vector_size;
            i += 2;
        }
        else{
            printf("the only supported argument is:\n");
            printf("-s: the size of the vector, ex: -s 256\n");
            return 0;
        }
    }
}
AA = (int *)malloc(sizeof(int) * _matrix_size);
BB = (int *)malloc(sizeof(int) * _matrix_size);
CC = (int *)malloc(sizeof(int) * _matrix_size);
Embedded and ParallelSystems Lab 50 OpenMP example:matrix(3) srand( RANDOM_SEED ); /* create matrix A and Matrix B */ for( i=0 ; i< _matrix_size ; i++){ AA[i] = rand()%10; BB[i] = rand()%10; } /* computing C = A * B */ #pragma omp parallel for private(computing, j , k) for( i=0 ; i < _vector_size ; i++){ for( j=0 ; j < _vector_size ; j++){ computing =0; for( k=0 ; k < _vector_size ; k++) computing += AA[ i*_vector_size + k ] * BB[ k*_vector_size + j ]; CC[ i*_vector_size + j ] = computing; } }
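■ With GCC-style compilers, OpenMP examples such as this one are typically built with a flag like -fopenmp (the exact flag depends on the compiler), and the number of threads can also be controlled through the OMP_NUM_THREADS environment variable.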
Embedded and ParallelSystems Lab 54 Reference ■ Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP” ■ Introduction to Parallel Computing http://www.llnl.gov/computing/tutorials/parallel_comp/ ■ OpenMP standard http://www.openmp.org/drupal/ ■ OpenMP MSDN tutorial http://msdn2.microsoft.com/en-us/library/tt15eb9t(VS.80).aspx ■ OpenMP tutorial http://www.llnl.gov/computing/tutorials/openMP/#DO ■ Kang Su Gatlin, Pete Isensee, “Reap the Benefits of Multithreading without All the Work”, MSDN Magazine
Embedded and ParallelSystems Lab 56 MPI ■ MPI is a language-independent communications protocol used to program parallel computers ■ Distributed memory ■ SPMD (Single Program, Multiple Data) ■ Fortran, C / C++
Embedded and ParallelSystems Lab 57 MPI Requirements and Supported Environments ■ Cluster Environment ● Windows ◆ Microsoft AD (Active Directory) server ◆ Microsoft cluster server ● Linux ◆ NFS (Network File System) ◆ NIS (Network Information Service), also known as yellow pages ◆ SSH ◆ MPICH 2
Embedded and ParallelSystems Lab 58 MPI Installation
Download mpich2-1.0.4p1.tar.gz from http://www-unix.mcs.anl.gov/mpi/mpich/
[shell]# tar -zxvf mpich2-1.0.4p1.tar.gz
[shell]# mkdir /home/yourhome/mpich2
[shell]# cd mpich2-1.0.4p1
[shell]# ./configure --prefix=/home/yourhome/mpich2   // installing into a directory you create yourself is recommended
[shell]# make
[shell]# make install
Next:
[shell]# cd ~yourhome   // go to your own home directory
[shell]# vi .mpd.conf   // create the file
with the content: secretword=<secretword> (pick whatever secretword you like)
Ex: secretword=abcd1234
Embedded and ParallelSystems Lab 59 MPI Installation
[shell]# chmod 600 .mpd.conf
[shell]# vi .bash_profile
Change PATH=$PATH:$HOME/bin
to PATH=$HOME/mpich2/bin:$PATH:$HOME/bin
Log in to the server again.
[shell]# vi mpd.hosts   // create the hosts list file in your home directory
ex:
cluster1
cluster2
cluster3
cluster4
Embedded and ParallelSystems Lab 61 Basic Structure of an MPI Program
#include "mpi.h"
MPI_Init(&argc, &argv);
Do some work or call MPI functions, for example: MPI_Send() / MPI_Recv()
MPI_Finalize();
Embedded and ParallelSystems Lab 62 MPI Ethernet Control and Data Flow Source : Douglas M. Pase, “Performance of Voltaire InfiniBand in IBM 64-Bit Commodity HPC Clusters,” IBM White Papers, 2005
Embedded and ParallelSystems Lab 64 MPI Function
Function: int MPI_Init(int *argc, char ***argv)
Purpose: Initializes the MPI execution environment; must be called before any other MPI function, and passes main's command-line arguments (argc, argv) to all processes.
Parameters: argc: number of arguments; argv: argument values
Return value: int: MPI_SUCCESS (0) on success
Function: int MPI_Finalize()
Purpose: Terminates the MPI execution environment; must be called after all MPI work is done.
Parameters: none
Return value: int: MPI_SUCCESS (0) on success
Embedded and ParallelSystems Lab 65 MPI Function
Function: int MPI_Comm_size(MPI_Comm comm, int *size)
Purpose: Gets the total number of processes in the communicator.
Parameters: comm: IN, MPI_COMM_WORLD; size: OUT, total number of processes
Return value: int: MPI_SUCCESS (0) on success
Function: int MPI_Comm_rank(MPI_Comm comm, int *rank)
Purpose: Gets the calling process's own process ID (rank).
Parameters: comm: IN, MPI_COMM_WORLD; rank: OUT, this process's ID
Return value: int: MPI_SUCCESS (0) on success
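■ Putting the four calls above together, a minimal MPI program might look like this sketch (assumed, not from the slides):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI environment */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's own ID */

    printf("process %d of %d\n", rank, size);

    MPI_Finalize();                          /* shut down MPI */
    return 0;
}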
Embedded and ParallelSystems Lab 66 MPI Function
Function: int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Purpose: Sends data to the specified process, using standard mode.
Parameters: buf: IN, the data (variable) to send; count: IN, how many elements to send; datatype: IN, datatype of the data being sent; dest: IN, destination process ID; tag: IN, message tag (channel); comm: IN, MPI_COMM_WORLD
Return value: int: MPI_SUCCESS (0) on success
Function: int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Purpose: Receives data from the specified process.
Parameters: buf: OUT, buffer for the received data; count: IN, how many elements to receive; datatype: IN, datatype of the data being received; source: IN, source process ID; tag: IN, message tag (channel); comm: IN, MPI_COMM_WORLD; status: OUT, receives the MPI_Status
Return value: int: MPI_SUCCESS (0) on success
Embedded and ParallelSystems Lab 67 MPI Function
■ Status: indicates the source process ID and the tag it sent; in C it is the MPI_Status type
typedef struct MPI_Status {
    int count;
    int cancelled;
    int MPI_SOURCE;   // source process ID
    int MPI_TAG;      // tag sent by the source
    int MPI_ERROR;    // error code
} MPI_Status;
Function: double MPI_Wtime()
Purpose: Returns the current time as a floating-point number of seconds; commonly used to measure how long a program runs.
Parameters: none
Return value: double: the current time
Embedded and ParallelSystems Lab 68 MPI Function
Function: int MPI_Type_commit(MPI_Datatype *datatype)
Purpose: Commits a datatype so that it can be used in communication.
Parameters: datatype: INOUT, the new datatype
Return value: int: MPI_SUCCESS (0) on success
Function: int MPI_Type_free(MPI_Datatype *datatype)
Purpose: Frees a datatype.
Parameters: datatype: INOUT, the datatype to release
Return value: int: MPI_SUCCESS (0) on success
Embedded and ParallelSystems Lab 69 MPI Function
Function: int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
Purpose: Builds a new datatype by simply resizing an existing datatype (MPI_Datatype), i.e., combining several elements of the same type into one.
Parameters: count: IN, size of the new type (how many oldtype elements it contains); oldtype: IN, the existing datatype (MPI_Datatype); newtype: OUT, the new datatype
Return value: int: MPI_SUCCESS (0) on success
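■ A small sketch (assumed, not from the slides) of the usual commit / use / free life cycle of a contiguous datatype; here four ints are packed into one "row" element, and the program needs at least 2 processes:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, row[4] = {1, 2, 3, 4};
    MPI_Datatype rowtype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(4, MPI_INT, &rowtype);   /* 4 ints are treated as one element */
    MPI_Type_commit(&rowtype);                   /* commit before using it in communication */

    if (rank == 0) {
        MPI_Send(row, 1, rowtype, 1, 0, MPI_COMM_WORLD);          /* send one "row" element */
    } else if (rank == 1) {
        MPI_Status stat;
        MPI_Recv(row, 1, rowtype, 0, 0, MPI_COMM_WORLD, &stat);
        printf("rank 1 received %d %d %d %d\n", row[0], row[1], row[2], row[3]);
    }

    MPI_Type_free(&rowtype);                     /* release the derived type */
    MPI_Finalize();
    return 0;
}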
Embedded and ParallelSystems Lab 72 MPI example : hello.c
if (rank == 0) {
    dest = 1;
    source = 1;
    strcpy(outmsg,"Who are you?");
    // send the message to process 1
    rc = MPI_Send(outmsg, 1, strtype, dest, tag, MPI_COMM_WORLD);
    printf("process %d has sent message: %s\n", rank, outmsg);
    // receive the message from process 1
    rc = MPI_Recv(inmsg, 1, strtype, source, tag, MPI_COMM_WORLD, &Stat);
    printf("process %d has received: %s\n", rank, inmsg);
}
else if (rank == 1) {
    dest = 0;
    source = 0;
    strcpy(outmsg,"I am process 1");
    rc = MPI_Recv(inmsg, 1, strtype, source, tag, MPI_COMM_WORLD, &Stat);
    printf("process %d has received: %s\n", rank, inmsg);
    rc = MPI_Send(outmsg, 1, strtype, dest, tag, MPI_COMM_WORLD);
    printf("process %d has sent message: %s\n", rank, outmsg);
}
Embedded and ParallelSystems Lab 73 MPI example : hello.c
endtime = MPI_Wtime();   // get the end time
// use MPI_CHAR to count how much data was actually received
rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d and use time is %f \n",
       rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG, endtime - starttime);
MPI_Type_free(&strtype);   // free the string datatype
MPI_Finalize();            // finalize MPI
}
Sample output:
process 0 has sent message: Who are you?
process 1 has received: Who are you?
process 1 has sent message: I am process 1
Task 1: Received 20 char(s) from task 0 with tag 1 and use time is 0.001302
process 0 has received: I am process 1
Task 0: Received 20 char(s) from task 1 with tag 1 and use time is 0.002133
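■ With an MPICH2 installation like the one set up earlier, this example is typically built and launched with something along the lines of mpicc hello.c -o hello and mpiexec -n 2 ./hello, after the mpd daemons have been started (for example with mpdboot -n 4 -f mpd.hosts); the exact commands depend on the MPI installation.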
Embedded and ParallelSystems Lab 74 openMP vs. MPI
              openMP   DSM      MPI
private data  Yes      Yes      Yes
share data    Yes      Yes      No
critical      Yes      Yes      Yes
atomic        Yes      Yes/No   Yes
barrier       Yes      Yes      No
reduction     Yes      Yes/No   No
Embedded and ParallelSystems Lab 75 MPI Function
Function: int MPI_Barrier(MPI_Comm comm)
Purpose: When a process reaches the barrier it blocks; once every process in the group has reached the barrier, they all unblock and continue.
Parameters: comm: IN, MPI_COMM_WORLD
Return value: int: MPI_SUCCESS (0) on success
■ Types of Collective Operations: ● Synchronization : processes wait until all members of the group have reached the synchronization point. ● Data Movement : broadcast, scatter/gather, all to all. ● Collective Computation (reductions) : one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Embedded and ParallelSystems Lab 76 MPI_Bcast
Function: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int source(root), MPI_Comm comm)
Purpose: Broadcasts a message so that every process receives the same message.
Parameters: buffer: INOUT, the message to send, and also the buffer that receives it; count: IN, how many elements to send; datatype: IN, datatype of the data; source (called root in the MPI standard): IN, the process responsible for sending the message; comm: IN, MPI_COMM_WORLD
Return value: int: MPI_SUCCESS (0) on success
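■ A minimal sketch (assumed, not from the slides) of MPI_Bcast: only the root initializes the value, and every process has it after the call:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 42;                                    /* only the root has the value at first */

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* afterwards every process sees n == 42 */

    printf("rank %d sees n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}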
Embedded and ParallelSystems Lab 77 MPI_Gather
Function: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int destine, MPI_Comm comm)
Purpose: Collects the messages sent by each process and delivers the combined result to the specified receiving process.
Parameters: sendbuf: IN, message to send; sendcount: IN, number of elements to send; sendtype: IN, send datatype; recvbuf: OUT, buffer for the received data; recvcount: IN, number of elements to receive; recvtype: IN, receive datatype; destine: IN, the process responsible for receiving the gathered data; comm: IN, MPI_COMM_WORLD
Return value: int: MPI_SUCCESS (0) on success
Embedded and ParallelSystems Lab 81 MPI_Reduce
Function: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int destine, MPI_Comm comm)
Purpose: Performs an operation on the data as it is collected (e.g. MPI_SUM to add the values) and delivers the result to the destine process.
Parameters: sendbuf: IN, message to send; recvbuf: OUT, buffer for the result; count: IN, number of elements to send/receive; datatype: IN, datatype of the data; op: IN, the operation to perform; destine: IN, process ID that receives the result; comm: IN, MPI_COMM_WORLD
Return value: int: MPI_SUCCESS (0) on success
Embedded and ParallelSystems Lab 82 MPI_Reduce
MPI Reduction Operation (meaning, C data types):
● MPI_MAX: maximum (integer, float)
● MPI_MIN: minimum (integer, float)
● MPI_SUM: sum (integer, float)
● MPI_PROD: product (integer, float)
● MPI_LAND: logical AND (integer)
● MPI_BAND: bit-wise AND (integer, MPI_BYTE)
● MPI_LOR: logical OR (integer)
● MPI_BOR: bit-wise OR (integer, MPI_BYTE)
● MPI_LXOR: logical XOR (integer)
● MPI_BXOR: bit-wise XOR (integer, MPI_BYTE)
● MPI_MAXLOC: max value and location (float, double and long double)
● MPI_MINLOC: min value and location (float, double and long double)
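■ A minimal sketch (assumed, not from the slides) that uses MPI_Reduce with MPI_SUM to add up partial sums computed by each process:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, i, partial = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process sums its own stripe of the numbers 1..100 */
    for (i = rank + 1; i <= 100; i += size)
        partial += i;

    /* MPI_SUM combines the partial sums; only rank 0 (the destine process) receives the total */
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %d\n", total);   /* expected 5050 for 1..100 */

    MPI_Finalize();
    return 0;
}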
Embedded and ParallelSystems Lab 83 MPI example : matrix.c(1)
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define RANDOM_SEED 2882 // random seed
#define MATRIX_SIZE 800 // square matrix: width equals height
#define NODES 4 // number of nodes; minimum is 1, do not use < 1
#define TOTAL_SIZE (MATRIX_SIZE * MATRIX_SIZE) // total size of MATRIX
#define CHECK
int main(int argc, char *argv[]){
    int i,j,k;
    int node_id;
    int AA[MATRIX_SIZE][MATRIX_SIZE];
    int BB[MATRIX_SIZE][MATRIX_SIZE];
    int CC[MATRIX_SIZE][MATRIX_SIZE];
Embedded and ParallelSystems Lab 84 MPI example : matrix.c(2)
#ifdef CHECK
    int _CC[MATRIX_SIZE][MATRIX_SIZE]; // sequential result, used to check the parallel result CC
#endif
    int check = 1;
    int print = 0;
    int computing = 0;
    double time,seqtime;
    int numtasks;
    int tag=1;
    int node_size;
    MPI_Status stat;
    MPI_Datatype rowtype;
    srand( RANDOM_SEED );
Embedded and ParallelSystems Lab 92 Reference ■ Top 500 http://www.top500.org/ ■ Maarten Van Steen, Andrew S. Tanenbaum, “Distributed Systems: Principles and Paradigms” ■ System Threads Reference http://www.unix.org/version2/whatsnew/threadsref.html ■ Semaphore http://www.mkssoftware.com/docs/man3/sem_init.3.asp ■ Richard Stones, Neil Matthew, “Beginning Linux Programming” ■ W. Richard Stevens, “Networking APIs: Sockets and XTI” ■ William W.-Y. Liang, “Linux System Programming” ■ Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP” ■ Introduction to Parallel Computing http://www.llnl.gov/computing/tutorials/parallel_comp/
Embedded and ParallelSystems Lab 93 Reference ■ Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP” ■ Introduction to Parallel Computing http://www.llnl.gov/computing/tutorials/parallel_comp/ ■ MPI standard http://www-unix.mcs.anl.gov/mpi/ ■ MPI http://www.llnl.gov/computing/tutorials/mpi/
Embedded and ParallelSystems Lab 94 Conclusion ■ Coming up with a good parallel algorithm is very difficult. ■ Development and debugging tools are generally lacking. ■ A new generation of languages ● IBM's X10, Sun's Fortress, Cray's Chapel ◆ X10 is a language that extends Java 1.4
async(place.factory.place(1)){
    for (int i = 1; i <= 10; i += 2)
        ans += i;
}
Embedded and ParallelSystems Lab 95 Reference ■ Top 500 http://www.top500.org/ ■ Maarten Van Steen, Andrew S. Tanenbaum, “Distributed Systems: Principles and Paradigms” ■ System Threads Reference http://www.unix.org/version2/whatsnew/threadsref.html ■ Semaphore http://www.mkssoftware.com/docs/man3/sem_init.3.asp ■ Richard Stones, Neil Matthew, “Beginning Linux Programming” ■ W. Richard Stevens, “Networking APIs: Sockets and XTI” ■ William W.-Y. Liang, “Linux System Programming” ■ Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP” ■ Introduction to Parallel Computing http://www.llnl.gov/computing/tutorials/parallel_comp/
Embedded and ParallelSystems Lab 96 Reference ■ MPI standard http://www-unix.mcs.anl.gov/mpi/ ■ MPI http://www.llnl.gov/computing/tutorials/mpi/ ■ OpenMP standard http://www.openmp.org/drupal/ ■ OpenMP MSDN tutorial http://msdn2.microsoft.com/en-us/library/tt15eb9t(VS.80).aspx ■ OpenMP tutorial http://www.llnl.gov/computing/tutorials/openMP/#DO ■ Kang Su Gatlin, Pete Isensee, “Reap the Benefits of Multithreading without All the Work”, MSDN Magazine ■ Gary Anthes, “Languages for Supercomputing Get 'Suped' Up”, Computerworld, March 12, 2007 ■ IBM X10 research http://domino.research.ibm.com/comm/research_projects.nsf/pages/x10.X10-presentations.html