Parallel Algorithm & Sorting in Parallel Programming Submitted By:- Richa kumari,14MT-CS12 Submitted To:- Dalpat songra
Contents: 1.1 Parallel algorithm 1.2 A Network for sorting 1.3 Sorting on a linear array 1.4 Sorting on the CRCW Model 1.5 Sorting on the CREW Model 1.6 Sorting on the EREW Model
1.1 Parallel Algorithm:-  A parallel algorithm or concurrent algorithm, as opposed to a traditional sequential algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.
Parallel Sorting:-  The fundamental operation of comparison- based sorting is compare-exchange.  The lower bound on any comparison-based sort of n numbers is Θ(nlog n) .  The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j
Sorting: Parallel Compare Exchange Operation A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai, aj}.
Quick Sort:-  Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.  Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two - one with all elements less than the pivot and other greater.  The process is recursively applied to each of the sublists.
Cont…  Average optimal sequential complexity: O(n log n)  Parallel efficiency limitations  Partitions are unbalanced  A single processor performs the initial partitioning
Example of quicksort  Let S = (6,5 ,9,2,4,3,5 , 1, 7,5,8 ). T he first call to procedure Q U I C K S O R T produces 5 as the median element of S, and hence S1 = {2,4,3,1,5,5} and S2 = {6,9,7,8,5}. Note that S1 = 6 and S2= 5. A recursive call to Q U I C K S O R T with S, as input produces the two subsequences {2,1,3} and {4,5,5}. The second call with S, as input produces {6,5,7}an d {9,8}. Further recursive calls complete the sorting of these sequences.
Quicksort algo….
COMPLEXITY OF QUICKSORT For some constant c, we can express the running time of procedure QUICKSORT as = O(n log n),
1.2 A NETWORK FOR SORTING  It is rather straightforward to use a collection of merging networks  to build a sorting network for the sequence S = {s1, s2, . . . , sn), where n is a power of 2. The idea is the following.  In a first stage, a rank of n/2 comparators is used to create n/2 sorted sequences each of length 2.  In a second stage, pairs of these are now merged into sorted sequences of length 4 using a rank of (2,2)- merging networks. Again, in a
Conti….  third stage, pairs of sequences of length 4 are merged using (4,4)-merging networks into sequences of length 8. The process continues until two sequences of length n/2 each are merged by an (n/2, n/2)-merging network to produce a single sorted sequence of length n. The resulting architecture is known as an odd-even sorting network and is  illustrated in Fig. for S = {8,4,7,2, 1,5,6,3). Note that, as in the case of merging, the odd-even sorting network is oblivious of its input.
FIG: ODD EVEN SORTING NETWORK FOR SEQUENCE OF EIGHT ELEMENTS
The odd-even sorting network is impractical for large input sequences : (i) The network is extremely fast. It can sort a sequence of length 2^20 within, on the order of, (20)2 time units. This is to be contrasted with the time required by procedure QUICKSORT, which would be in excess of 20 million time units.[(log n)^2] (ii) The number of comparators is too high. Again for n = 2^20, the network would need on the order of 400 million comparators.[n (log n)^2] (iii) The architecture is highly irregular and the wires linking the comparators have lengths that vary with n.
1.3 SORTING ON A LINEAR ARRAY: In this section we describe a parallel sorting algorithm for an SIMD computer where the processors are connected to form a linear array FIG: LINEAR ARRAY CONNECTION
Odd-Even Transposition Sort  Variation of bubble sort.  Operates in two alternating phases, even phase and odd phase.  Even phase Even-numbered processes exchange numbers with their right neighbour.  Odd phase Odd-numbered processes exchange numbers with their right neighbour.
 Odd-Even Transposition Sort - example Parallel time complexity: Tpar = O(n) (for P=n)
Algorithm
MERGE SPLIT:- • Now consider the second approach. If N processors, where N < n, • Assume that each of the N processors in the linear array holds a subsequence of S of length n/N. •The comparison-exchange operations of procedure ODD-EVEN TRANSPOSITION are now replaced with merge-split operations on subsequences. •Let Si denote the subsequence held by processor Pi. Initially, the Si are random subsequences of S.
Sorting sequence of twelve elements using procedure MERGE SPILIT:-
MERGE SPLIT ALGO:
Computational time complexity using n processors  Parallel quicksort - O(n) but unbalanced processor load, and communication can generate to O(nlogn)  parallel sorting in network-O(n log^4 n) Odd-even transposition sort- O(n^2)  Parallel mergesplit - O(nlogn) but unbalanced processor load and communication Parallel sorting Conclusions:
1.4 SORTING ON THE CRCW MODEL  By this algorithm write conflicts problem can be resolved.  we shall assume that write conflicts are created whenever several processors attempt to write potentially different integers into the same address. The conflict is resolved by storing the sum of these integers in that address.
Cont......  Assume that n^2 processors are available on such a CRCW computer to sort the sequence S = { s 1 , s2, . . . , sn).  If two elements si and sj are equal, then si is taken to be the larger of the two if i > j; otherwise sj is the larger.
Cont.... procedure CRCW SORT (S) Step 1: for i = 1 to n do in parallel for j = 1 to n do in parallel if (si > sj) or (si = sj and i > j ) then P(i, j) writes 1 in ci else P(i, j ) writes 0 in ci end if end for --- end for. Step 2: for i = 1 to n do in parallel P(i, 1 ) stores si in position 1 + ci of S end for
Example: Let S = (5,2,4, 5) n=4 so n2 =16 Processor 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0
 Update si array  i: 1+ci position  5: 1+2=3  2: 1+0=1  3:1+1=2  4:1+3=4 Cont...
Cont...... Analysis:- Each of steps 1 and 2 consists of an operation requiring constant time. Therefore Running Time t(n) = O(1).  Since p(n) = n2  The cost of procedure CRCW SORT is:- C(n)= O(n2) (which is not optimal)
1.5 SORTING ON THE CREW MODEL  Our purpose is to design an algorithm that is: 1. free of write conflicts. 2. uses a reasonable number of processors. 3. a running time that is small and adaptive. 4. a cost that is optimal.  Assume that a CREW SM SIMD computer with N processors PI, P2. . . , PN is to be used to sort the sequence S = {s1 s2 . . . , sn), where N < n.
procedure CREW SORT (S) Step 1: for i = 1 to N do in parallel Processor Pi (1.1) reads a distinct subsequence Si of S of size n/N (1.2) QUICKSORT (Si) (1.3) Si 1 <- Si (1.4) Pi 1 <- Pi end for. O((n/N)log(n/N)) Algorithm:-
Cont… Step 2 (2.1) u =1 (2.2) v = N (2.3) while v > 1 do (2.3.1) for m = 1 to |_v/2_| do in parallel (i) Pu+1 m <- Pu 2m-1 U pu 2m (ii) The processors in the set Pu+1 mperform CREW MERGE (su 2m-1, su 2m, su+1 m) end for (2.3.2) if v is odd then (1) pU+1 v/2 = pu v (ii) sU+1 v/2 = sU V end if (2.3.3) u = u + 1 (2.3.4) V = v/2 end while. O((n/N) + log n) time
Example  Let S = (2, 8, 5, 10, 15, 1, 12, 6, 14, 3, 11, 7, 9, 4, 13, 16) and N = 4. Here N<n Step1:- Subsequence Si created : n/N=>16/4= 4 And Quick sort apply for sorting elements S1 1 ={2,5,8,10} S2 1 = {1,6,12,15} S3 1= {3,7,11,14} S4 1 = {9,13,14,16} Step2:- u=1 & v=N=4 for (m=1 to v/2) P1 2=p1 1 U p2 1 =(p1,p2)=(1,2,5,6,8,10,12,15) P2 2= p3 1 U p4 1 =(p3,p4)=(3,4,7,9,11,13,14,16) 4/2=2 CREW MERGE ALGO USED
Cont.... The processors {P1, P2,P3, P4} cooperate to merge S1 2 and s2 2 into S1 3 = (1, 2,. . . , 16) by using CERW MERGE . Analysis:- the total running time of procedure CREW SOR'T is t(n) = O((n/N)log(n/N)) + O((n/N)log N + log n log N) = O((n/N)log n + log2n).  Since p(n) = N, the cost is given by:- c(n) = O(n log n + N log n^2).
1.6 SORTING ON THE EREW MODEL:-  Still, procedure CREW SORT tolerates multiple- read operations. Our purpose in this section is to deal with this third difficulty.  We assume throughout this section that N processors P1, P2 . . . , PN are available on an EREW SM SIMD computer to sort the sequence S = (s1, s2, . . . , sn)where N < n.
Cont….  since N < n, N=n1-x where 0<x<1.  Now mi =[ i(n/21/x)], for 1<=i<=21/x-1 .  The mi can be used to divide S into 21/x subsequence of size n/21/x .  These subsequences, denoted by S1,S2,..., Sj, Sj+1,........S2j, where j =2(1/x)-1  Every subdivision process can now be applied recursively to each of the subsequences Si until the entire sequence S is sorted in nondecreasing order.  K= 2(1/x)
Algorithm:- procedure EREW SORT (S) Step1 if |S| < k then QUICKSORT (S) else (1) for i = 1 to k - 1 do PARALLEL SELECT (S, |i |s|/k|) [obtain mi] end for (2) Si = (s E S: s<=mi ) (3) for i = 2 to k - 1 do Si ={s E S : mi-1<=s <=mi } end for
Cont.. (4) Sk<= { s E S : s >=mk-1) Step 2 for i = 1 to k/2 do in parallel EREW SORT (Si) end for Step 3 for i = (k/2) + 1 to k do in parallel EREW SORT (Si) end for end if.
Cont... Let S = {5,9, 12, 16, 18,2, 10, 13, 17,4,7, 18, 18, 11, 3, 17,20,19, 14, 8, 5, 17, 1, 11, 15, 10, 6) (i.e., n = 27)  Here N<n & N=n1-x => N=270.5 = 5 where 0<x<1 (x=0.5).  K=21/x => k= 21/0.5 = 22 = 4  During step 1 m1= 6 m2 = 11, and m3 = 17 are computed.  The four sub sequences S1 ,S2, S3 and S4 are created.  In step 5 the procedure is applied recursively and simultaneously to S1 and S2.  Compute m1 = 2, m2= 4, and m3= 5, and the four subsequence {1,2}, {3,4}, {5,5), and (6) are created each of which is already in sorted order.
Cont....
Cont....  Running Time t(n) = cnx + 2t(n/k) = O(nx log n).  Since p(n) = n1-x, the procedure's cost is given by c(n) = p(n) x t(n) = O(n log n), which is optimal.
Parallel sorting algorithm
Parallel sorting algorithm

Parallel sorting algorithm

  • 1.
    Parallel Algorithm &Sorting in Parallel Programming Submitted By:- Richa kumari,14MT-CS12 Submitted To:- Dalpat songra
  • 2.
    Contents: 1.1 Parallel algorithm 1.2A Network for sorting 1.3 Sorting on a linear array 1.4 Sorting on the CRCW Model 1.5 Sorting on the CREW Model 1.6 Sorting on the EREW Model
  • 3.
    1.1 Parallel Algorithm:- A parallel algorithm or concurrent algorithm, as opposed to a traditional sequential algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.
  • 4.
    Parallel Sorting:-  Thefundamental operation of comparison- based sorting is compare-exchange.  The lower bound on any comparison-based sort of n numbers is Θ(nlog n) .  The sorted list is partitioned with the property that each partitioned list is sorted and each element in processor Pi's list is less than that in Pj's list if i < j
  • 5.
    Sorting: Parallel CompareExchange Operation A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai,aj}, and Pj keeps max{ai, aj}.
  • 6.
    Quick Sort:-  Quicksortis one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.  Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two - one with all elements less than the pivot and other greater.  The process is recursively applied to each of the sublists.
  • 7.
    Cont…  Average optimalsequential complexity: O(n log n)  Parallel efficiency limitations  Partitions are unbalanced  A single processor performs the initial partitioning
  • 8.
    Example of quicksort Let S = (6,5 ,9,2,4,3,5 , 1, 7,5,8 ). T he first call to procedure Q U I C K S O R T produces 5 as the median element of S, and hence S1 = {2,4,3,1,5,5} and S2 = {6,9,7,8,5}. Note that S1 = 6 and S2= 5. A recursive call to Q U I C K S O R T with S, as input produces the two subsequences {2,1,3} and {4,5,5}. The second call with S, as input produces {6,5,7}an d {9,8}. Further recursive calls complete the sorting of these sequences.
  • 9.
  • 10.
    COMPLEXITY OF QUICKSORT Forsome constant c, we can express the running time of procedure QUICKSORT as = O(n log n),
  • 11.
    1.2 A NETWORKFOR SORTING  It is rather straightforward to use a collection of merging networks  to build a sorting network for the sequence S = {s1, s2, . . . , sn), where n is a power of 2. The idea is the following.  In a first stage, a rank of n/2 comparators is used to create n/2 sorted sequences each of length 2.  In a second stage, pairs of these are now merged into sorted sequences of length 4 using a rank of (2,2)- merging networks. Again, in a
  • 12.
    Conti….  third stage,pairs of sequences of length 4 are merged using (4,4)-merging networks into sequences of length 8. The process continues until two sequences of length n/2 each are merged by an (n/2, n/2)-merging network to produce a single sorted sequence of length n. The resulting architecture is known as an odd-even sorting network and is  illustrated in Fig. for S = {8,4,7,2, 1,5,6,3). Note that, as in the case of merging, the odd-even sorting network is oblivious of its input.
  • 13.
    FIG: ODD EVENSORTING NETWORK FOR SEQUENCE OF EIGHT ELEMENTS
  • 14.
    The odd-even sortingnetwork is impractical for large input sequences : (i) The network is extremely fast. It can sort a sequence of length 2^20 within, on the order of, (20)2 time units. This is to be contrasted with the time required by procedure QUICKSORT, which would be in excess of 20 million time units.[(log n)^2] (ii) The number of comparators is too high. Again for n = 2^20, the network would need on the order of 400 million comparators.[n (log n)^2] (iii) The architecture is highly irregular and the wires linking the comparators have lengths that vary with n.
  • 15.
    1.3 SORTING ONA LINEAR ARRAY: In this section we describe a parallel sorting algorithm for an SIMD computer where the processors are connected to form a linear array FIG: LINEAR ARRAY CONNECTION
  • 16.
    Odd-Even Transposition Sort Variation of bubble sort.  Operates in two alternating phases, even phase and odd phase.  Even phase Even-numbered processes exchange numbers with their right neighbour.  Odd phase Odd-numbered processes exchange numbers with their right neighbour.
  • 17.
     Odd-Even Transposition Sort- example Parallel time complexity: Tpar = O(n) (for P=n)
  • 18.
  • 19.
    MERGE SPLIT:- • Nowconsider the second approach. If N processors, where N < n, • Assume that each of the N processors in the linear array holds a subsequence of S of length n/N. •The comparison-exchange operations of procedure ODD-EVEN TRANSPOSITION are now replaced with merge-split operations on subsequences. •Let Si denote the subsequence held by processor Pi. Initially, the Si are random subsequences of S.
  • 20.
    Sorting sequence oftwelve elements using procedure MERGE SPILIT:-
  • 21.
  • 22.
    Computational time complexityusing n processors  Parallel quicksort - O(n) but unbalanced processor load, and communication can generate to O(nlogn)  parallel sorting in network-O(n log^4 n) Odd-even transposition sort- O(n^2)  Parallel mergesplit - O(nlogn) but unbalanced processor load and communication Parallel sorting Conclusions:
  • 23.
    1.4 SORTING ONTHE CRCW MODEL  By this algorithm write conflicts problem can be resolved.  we shall assume that write conflicts are created whenever several processors attempt to write potentially different integers into the same address. The conflict is resolved by storing the sum of these integers in that address.
  • 24.
    Cont......  Assume thatn^2 processors are available on such a CRCW computer to sort the sequence S = { s 1 , s2, . . . , sn).  If two elements si and sj are equal, then si is taken to be the larger of the two if i > j; otherwise sj is the larger.
  • 25.
    Cont.... procedure CRCW SORT(S) Step 1: for i = 1 to n do in parallel for j = 1 to n do in parallel if (si > sj) or (si = sj and i > j ) then P(i, j) writes 1 in ci else P(i, j ) writes 0 in ci end if end for --- end for. Step 2: for i = 1 to n do in parallel P(i, 1 ) stores si in position 1 + ci of S end for
  • 26.
    Example: Let S= (5,2,4, 5) n=4 so n2 =16 Processor 0 1 1 0 0 0 0 0 0 1 0 0 1 1 1 0
  • 27.
     Update siarray  i: 1+ci position  5: 1+2=3  2: 1+0=1  3:1+1=2  4:1+3=4 Cont...
  • 28.
    Cont...... Analysis:- Each ofsteps 1 and 2 consists of an operation requiring constant time. Therefore Running Time t(n) = O(1).  Since p(n) = n2  The cost of procedure CRCW SORT is:- C(n)= O(n2) (which is not optimal)
  • 29.
    1.5 SORTING ONTHE CREW MODEL  Our purpose is to design an algorithm that is: 1. free of write conflicts. 2. uses a reasonable number of processors. 3. a running time that is small and adaptive. 4. a cost that is optimal.  Assume that a CREW SM SIMD computer with N processors PI, P2. . . , PN is to be used to sort the sequence S = {s1 s2 . . . , sn), where N < n.
  • 30.
    procedure CREW SORT(S) Step 1: for i = 1 to N do in parallel Processor Pi (1.1) reads a distinct subsequence Si of S of size n/N (1.2) QUICKSORT (Si) (1.3) Si 1 <- Si (1.4) Pi 1 <- Pi end for. O((n/N)log(n/N)) Algorithm:-
  • 31.
    Cont… Step 2 (2.1)u =1 (2.2) v = N (2.3) while v > 1 do (2.3.1) for m = 1 to |_v/2_| do in parallel (i) Pu+1 m <- Pu 2m-1 U pu 2m (ii) The processors in the set Pu+1 mperform CREW MERGE (su 2m-1, su 2m, su+1 m) end for (2.3.2) if v is odd then (1) pU+1 v/2 = pu v (ii) sU+1 v/2 = sU V end if (2.3.3) u = u + 1 (2.3.4) V = v/2 end while. O((n/N) + log n) time
  • 32.
    Example  Let S= (2, 8, 5, 10, 15, 1, 12, 6, 14, 3, 11, 7, 9, 4, 13, 16) and N = 4. Here N<n Step1:- Subsequence Si created : n/N=>16/4= 4 And Quick sort apply for sorting elements S1 1 ={2,5,8,10} S2 1 = {1,6,12,15} S3 1= {3,7,11,14} S4 1 = {9,13,14,16} Step2:- u=1 & v=N=4 for (m=1 to v/2) P1 2=p1 1 U p2 1 =(p1,p2)=(1,2,5,6,8,10,12,15) P2 2= p3 1 U p4 1 =(p3,p4)=(3,4,7,9,11,13,14,16) 4/2=2 CREW MERGE ALGO USED
  • 33.
    Cont.... The processors {P1,P2,P3, P4} cooperate to merge S1 2 and s2 2 into S1 3 = (1, 2,. . . , 16) by using CERW MERGE . Analysis:- the total running time of procedure CREW SOR'T is t(n) = O((n/N)log(n/N)) + O((n/N)log N + log n log N) = O((n/N)log n + log2n).  Since p(n) = N, the cost is given by:- c(n) = O(n log n + N log n^2).
  • 34.
    1.6 SORTING ONTHE EREW MODEL:-  Still, procedure CREW SORT tolerates multiple- read operations. Our purpose in this section is to deal with this third difficulty.  We assume throughout this section that N processors P1, P2 . . . , PN are available on an EREW SM SIMD computer to sort the sequence S = (s1, s2, . . . , sn)where N < n.
  • 35.
    Cont….  since N< n, N=n1-x where 0<x<1.  Now mi =[ i(n/21/x)], for 1<=i<=21/x-1 .  The mi can be used to divide S into 21/x subsequence of size n/21/x .  These subsequences, denoted by S1,S2,..., Sj, Sj+1,........S2j, where j =2(1/x)-1  Every subdivision process can now be applied recursively to each of the subsequences Si until the entire sequence S is sorted in nondecreasing order.  K= 2(1/x)
  • 36.
    Algorithm:- procedure EREW SORT(S) Step1 if |S| < k then QUICKSORT (S) else (1) for i = 1 to k - 1 do PARALLEL SELECT (S, |i |s|/k|) [obtain mi] end for (2) Si = (s E S: s<=mi ) (3) for i = 2 to k - 1 do Si ={s E S : mi-1<=s <=mi } end for
  • 37.
    Cont.. (4) Sk<= {s E S : s >=mk-1) Step 2 for i = 1 to k/2 do in parallel EREW SORT (Si) end for Step 3 for i = (k/2) + 1 to k do in parallel EREW SORT (Si) end for end if.
  • 38.
    Cont... Let S ={5,9, 12, 16, 18,2, 10, 13, 17,4,7, 18, 18, 11, 3, 17,20,19, 14, 8, 5, 17, 1, 11, 15, 10, 6) (i.e., n = 27)  Here N<n & N=n1-x => N=270.5 = 5 where 0<x<1 (x=0.5).  K=21/x => k= 21/0.5 = 22 = 4  During step 1 m1= 6 m2 = 11, and m3 = 17 are computed.  The four sub sequences S1 ,S2, S3 and S4 are created.  In step 5 the procedure is applied recursively and simultaneously to S1 and S2.  Compute m1 = 2, m2= 4, and m3= 5, and the four subsequence {1,2}, {3,4}, {5,5), and (6) are created each of which is already in sorted order.
  • 39.
  • 40.
    Cont....  Running Timet(n) = cnx + 2t(n/k) = O(nx log n).  Since p(n) = n1-x, the procedure's cost is given by c(n) = p(n) x t(n) = O(n log n), which is optimal.