Parallel Processing & Distributed Systems - Chapter 12: Parallel Algorithms - Thoai Nam

Introduction to Parallel Algorithm Development

• Parallel algorithms largely depend on the target parallel platforms and architectures
• MIMD algorithm classification
  – Pre-scheduled data-parallel algorithms
  – Self-scheduled data-parallel algorithms
  – Control-parallel algorithms
• According to M. J. Quinn (1994), there are 7 design strategies for parallel algorithms

Reduction Problem

• Description: Given n values a0, a1, a2, ..., an-1 and an associative operation ⊕, use p processors to compute the sum S = a0 ⊕ a1 ⊕ a2 ⊕ ... ⊕ an-1
• Design strategy 1
  – "If a cost-optimal CREW PRAM algorithm exists and the way the PRAM processors interact through shared variables maps onto the target architecture, a PRAM algorithm is a reasonable starting point"
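For concreteness, a minimal sequential baseline, assuming ordinary addition stands in for the associative operation ⊕; the parallel algorithms that follow compute the same S with p processors.

#include <stdio.h>

int main(void) {
    double a[] = {3, 2, 7, 6, 0, 5, 4, 8};   /* sample values (illustrative) */
    int n = sizeof a / sizeof a[0];
    double s = a[0];
    for (int k = 1; k < n; k++)
        s = s + a[k];                        /* S = a0 (+) a1 (+) ... (+) a(n-1) */
    printf("S = %g\n", s);
    return 0;
}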
Cost-Optimal PRAM Algorithm for the Reduction Problem (cont'd)

Using p = n div 2 processors to add n numbers:

Global a[0..n-1], n, i, j, p;
Begin
  spawn(P0, P1, ..., Pp-1);
  for all Pi where 0 ≤ i ≤ p-1 do
    for j := 0 to ceiling(log n) - 1 do
      if i mod 2^j = 0 and 2i + 2^j < n then
        a[2i] := a[2i] ⊕ a[2i + 2^j];
      endif;
    endfor j;
  endforall;
End.

Note: the processors communicate in a binomial-tree pattern.
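A minimal shared-memory sketch of this binomial-tree combining, assuming OpenMP and ordinary addition as the operation; the barrier stands in for the PRAM's lockstep steps, and the sample size n = 16 and thread count are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 16;
    double a[16];
    for (int k = 0; k < n; k++) a[k] = k + 1.0;        /* sample data: 1..16 */

    int p = n / 2;                                     /* p = n div 2 "processors" */
    #pragma omp parallel num_threads(p)
    {
        int i = omp_get_thread_num();
        for (int stride = 1; stride < n; stride *= 2) {    /* stride = 2^j */
            if (i % stride == 0 && 2 * i + stride < n)
                a[2 * i] += a[2 * i + stride];         /* a[2i] := a[2i] (+) a[2i + 2^j] */
            #pragma omp barrier                        /* keep steps in lockstep, as on a PRAM */
        }
    }
    printf("sum = %g\n", a[0]);                        /* expect 136 for 1..16 */
    return 0;
}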
Solving the Reduction Problem on a Hypercube SIMD Computer (cont'd)

Using p processors to add n numbers (p << n)

Global j;
Local local.set.size, local.value[1..n div p + 1], sum, tmp;
Begin
  spawn(P0, P1, ..., Pp-1);
  {Allocate the workload for each processor}
  for all Pi where 0 ≤ i ≤ p-1 do
    if (i < n mod p) then local.set.size := n div p + 1
    else local.set.size := n div p;
    endif;
    sum := 0;
  endforall;
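A small plain-C sketch of this block partition, with illustrative sizes n and p: the first n mod p processors receive one extra value, so all n values are covered even when p does not divide n.

#include <stdio.h>

int main(void) {
    int n = 10, p = 4;                        /* example sizes (illustrative) */
    for (int i = 0; i < p; i++) {
        int local_set_size = (i < n % p) ? n / p + 1 : n / p;
        printf("P%d holds %d values\n", i, local_set_size);
    }
    return 0;
}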
Solving the Reduction Problem on a Hypercube SIMD Computer (cont'd)

  {Calculate the total sum by reducing along each dimension of the hypercube}
  for j := ceiling(log p) - 1 downto 0 do
    for all Pi where 0 ≤ i ≤ p-1 do
      if i < 2^j then
        tmp := [i + 2^j]sum;
        sum := sum ⊕ tmp;
      endif;
    endforall;
  endfor j;
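A hedged message-passing rendering of this dimension-by-dimension reduction, assuming MPI, a power-of-two number of ranks, and addition as the operation; rank 0 ends up with the total (in practice one would simply call MPI_Reduce, which performs a similar tree internally).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int i, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    MPI_Comm_size(MPI_COMM_WORLD, &p);           /* assume p is a power of 2 */

    double sum = i + 1.0;                        /* stand-in for the local partial sum */

    for (int dim = p / 2; dim >= 1; dim /= 2) {  /* dim = 2^j, j = log p - 1 .. 0 */
        if (i < 2 * dim) {                       /* only ranks still in the sub-cube take part */
            if (i >= dim) {
                /* upper half of the sub-cube: send the partial sum across dimension j */
                MPI_Send(&sum, 1, MPI_DOUBLE, i - dim, 0, MPI_COMM_WORLD);
            } else {
                double tmp;
                MPI_Recv(&tmp, 1, MPI_DOUBLE, i + dim, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += tmp;                      /* sum := sum (+) tmp */
            }
        }
    }
    if (i == 0) printf("total = %g\n", sum);     /* p*(p+1)/2 for this test data */
    MPI_Finalize();
    return 0;
}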
Solving the Reduction Problem on a 2D-Mesh SIMD Computer (cont'd)

• Example: compute the total sum on a 4×4 mesh
  [Figure: Stage 1 at steps i = 3, i = 2, i = 1; the row sums accumulate leftward into the first column]
Solving the Reduction Problem on a 2D-Mesh SIMD Computer (cont'd)

Summation (2D-mesh SIMD with l×l processors)

Global i;
Local tmp, sum;
Begin
  {Each processor finds the sum of its local values; code not shown}
  {Stage 1: Pi,1 computes the sum of all processors in row i}
  for i := l-1 downto 1 do
    for all Pj,i where 1 ≤ j ≤ l do   {processing elements in column i active}
      tmp := right(sum);
      sum := sum ⊕ tmp;
    endforall;
  endfor;
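A hedged MPI sketch of the same two-stage structure: Stage 1 reduces each row into its first column, and a second stage (not shown in this excerpt) reduces that column into the corner processor. Instead of modelling the nearest-neighbour hops, this sketch uses one sub-communicator per row and one for the first column, an assumption made for brevity; addition stands in for ⊕ and the number of ranks is assumed to be a perfect square l×l.

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int l = (int)(sqrt((double)size) + 0.5);     /* mesh side, assume l*l == size */
    int row = rank / l, col = rank % l;

    double sum = rank + 1.0;                     /* stand-in for the local partial sum */

    /* Stage 1: reduce along each row into the first column of that row */
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    double row_sum = 0.0;
    MPI_Reduce(&sum, &row_sum, 1, MPI_DOUBLE, MPI_SUM, 0, row_comm);

    /* Stage 2: reduce the first column's row sums into the corner processor */
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col == 0 ? 0 : 1, row, &col_comm);
    if (col == 0) {
        double total = 0.0;
        MPI_Reduce(&row_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, col_comm);
        if (row == 0) printf("total = %g\n", total);
    }

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}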
Solving the Reduction Problem on the UMA Multiprocessor Model (MIMD)

• Data can be accessed easily, as on a PRAM
• Processors execute asynchronously, so we must ensure that no processor accesses an "unstable" variable
• Variables used:

Global a[0..n-1],        {values to be added}
       p,                {number of processors, a power of 2}
       flags[0..p-1],    {set to 1 when a partial sum is available}
       partial[0..p-1],  {contains the partial sums}
       global_sum;       {result stored here}
Local  local_sum;
Solving the Reduction Problem on the UMA Multiprocessor Model (cont'd)

Summation (UMA multiprocessor model)

Begin
  for k := 0 to p-1 do flags[k] := 0;
  {Stage 1: each processor computes the partial sum of n/p values}
  for all Pi where 0 ≤ i < p do
    local_sum := 0;
    for j := i to n-1 step p do
      local_sum := local_sum ⊕ a[j];
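A hedged pthreads sketch of Stage 1 plus one possible combining stage; the combining stage is not shown in this excerpt, so that part is an assumption. Each thread sums a strided slice of a, publishes its partial sum, and sets its flag; the main thread then waits on the flags and accumulates the partials. C11 atomics guard the flags so that no "unstable" value is read; addition stands in for ⊕, and N, P, and the data are illustrative.

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define N 1000
#define P 4

static double a[N];
static double partial[P];
static atomic_int flags[P];          /* set to 1 when partial[i] is available */

static void *worker(void *arg) {
    int i = (int)(long)arg;
    double local_sum = 0.0;
    for (int j = i; j < N; j += P)   /* Stage 1: strided partial sum of n/p values */
        local_sum += a[j];
    partial[i] = local_sum;
    atomic_store(&flags[i], 1);      /* publish: partial sum is now stable */
    return NULL;
}

int main(void) {
    for (int j = 0; j < N; j++) a[j] = 1.0;      /* sample data, total = N */

    pthread_t tid[P];
    for (long i = 0; i < P; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    /* assumed combining stage: add each partial sum once its flag is set */
    double global_sum = 0.0;
    for (int i = 0; i < P; i++) {
        while (atomic_load(&flags[i]) == 0) { /* spin until ready */ }
        global_sum += partial[i];
    }
    printf("global_sum = %g\n", global_sum);

    for (int i = 0; i < P; i++) pthread_join(tid[i], NULL);
    return 0;
}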
Solving the Reduction Problem on the UMA Multiprocessor Model (cont'd)

• Algorithm complexity: O(n/p + p)
• What is the advantage of this algorithm compared with one that uses a critical section to compute the total sum?
• Design strategy 2:
  – Look for a data-parallel algorithm before considering a control-parallel algorithm
  – On an MIMD computer, we should exploit both data parallelism and control parallelism (try to develop an SPMD program if possible)
Broadcast Algorithm on a Hypercube SIMD

• If the amount of data is small, the best algorithm takes log p communication steps on a p-node hypercube
• Example: broadcasting a number on an 8-node hypercube
  [Figure: Step 1 sends the number via the 1st dimension of the hypercube (P0 to P1); Step 2 via the 2nd dimension (P0, P1 to P2, P3); Step 3 via the 3rd dimension (P0-P3 to P4-P7)]
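A hedged MPI sketch of this log p-step broadcast: at step k, every rank that already holds the value forwards it across dimension k of the hypercube (partner = rank XOR 2^k). It assumes a power-of-two number of ranks and a single value rooted at rank 0; MPI_Bcast performs a similar tree internally.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);           /* assume p is a power of 2 */

    double value = (rank == 0) ? 42.0 : 0.0;     /* only the root holds it initially */

    for (int mask = 1; mask < p; mask <<= 1) {   /* mask = 2^k: one hypercube dimension per step */
        if (rank < mask) {
            /* this rank already has the value: forward it across dimension k */
            MPI_Send(&value, 1, MPI_DOUBLE, rank | mask, 0, MPI_COMM_WORLD);
        } else if (rank < 2 * mask) {
            /* this rank receives the value for the first time in this step */
            MPI_Recv(&value, 1, MPI_DOUBLE, rank ^ mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    printf("rank %d has value %g\n", rank, value);
    MPI_Finalize();
    return 0;
}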
Broadcast Algorithm on a Hypercube SIMD (cont'd)

• The previous algorithm
  – Uses at most p/2 out of the p log p links of the hypercube
  – Requires time M log p to broadcast a message of length M
  ⇒ not efficient for broadcasting long messages
• Johnsson and Ho (1989) designed an algorithm that executes log p times faster by:
  – Breaking the message into log p parts
  – Broadcasting each part to all other nodes through a different binomial spanning tree
Johnsson and Ho's Broadcast Algorithm on a Hypercube SIMD (cont'd)

• Design strategy 3
  – As the problem size grows, use the algorithm that makes the best use of the available resources
Prefix Sums Problem on Multicomputers

• Finding the prefix sums of 16 values

                            Processor 0    Processor 1    Processor 2    Processor 3
(a) initial values          3  2  7  6     0  5  4  8     2  0  1  5     2  3  8  6
(b) local sums              18             17             8              19
(c) prefix sums of (b),
    held by every processor 18 35 43 62    18 35 43 62    18 35 43 62    18 35 43 62
(d) final prefix sums       3  5 12 18     18 23 27 35    37 37 38 43    45 48 56 62
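A hedged MPI sketch of the four stages, assuming 4 ranks, addition, and the sample values from the table: (a) each rank holds one block, (b) it computes its block sum, (c) an inclusive scan of the block sums gives each rank its offset from lower-ranked blocks, and (d) the local prefix sums shifted by that offset are the global prefix sums.

#include <stdio.h>
#include <mpi.h>

#define M 4   /* values per processor */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* (a) the 16 sample values from the table, distributed block-wise */
    int data[4][M] = { {3, 2, 7, 6}, {0, 5, 4, 8}, {2, 0, 1, 5}, {2, 3, 8, 6} };
    int *x = data[rank % 4];

    /* (b) local sum of this rank's block */
    int local_sum = 0;
    for (int j = 0; j < M; j++) local_sum += x[j];

    /* (c) inclusive scan of the block sums; subtracting the local sum
     *     yields the offset contributed by lower-ranked blocks */
    int scan = 0;
    MPI_Scan(&local_sum, &scan, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    int offset = scan - local_sum;

    /* (d) local prefix sums shifted by the offset */
    int prefix[M], running = offset;
    for (int j = 0; j < M; j++) {
        running += x[j];
        prefix[j] = running;
    }
    printf("rank %d: %d %d %d %d\n", rank, prefix[0], prefix[1], prefix[2], prefix[3]);
    MPI_Finalize();
    return 0;
}

Run with 4 ranks, rank 1 prints 18 23 27 35 and rank 3 prints 45 48 56 62, matching row (d) of the table.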