Figure 5.1 The execution profile of a hypothetical parallel program executing on eight processing elements (P0–P7). The profile indicates the time each processing element spends performing computation (both essential and excess), interprocessor communication, and idling.
Figure 5.2 Computing the global sum of 16 partial sums using 16 processing elements: (a) initial data distribution and the first communication step; (b) second communication step; (c) third communication step; (d) fourth communication step; (e) accumulation of the sum at processing element 0 after the final communication. Σ_i^j denotes the sum of the numbers with consecutive labels from i to j.
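As a rough illustration of the communication pattern in Figure 5.2, the following is a minimal MPI sketch (the use of MPI and all names are illustrative assumptions, not the text's own code): in each step, processing elements that are odd multiples of the current distance send their partial sums to their partners, so that after log p steps the total resides at processing element 0.

```c
/* Hedged sketch: global sum accumulated at rank 0 by pairwise exchanges,
 * mirroring the log p communication steps of Figure 5.2.
 * Build and run with, e.g.: mpicc gsum.c -o gsum && mpirun -np 16 ./gsum
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);        /* assumed to be a power of two */

    double partial = (double)rank;            /* stand-in for this element's partial sum */

    /* In the step with distance 'mask', ranks that are odd multiples of 'mask'
     * send their partial sum to the even multiple 'mask' positions below them. */
    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank & mask) {
            MPI_Send(&partial, 1, MPI_DOUBLE, rank - mask, 0, MPI_COMM_WORLD);
            break;                            /* this rank is done after sending */
        } else if (rank + mask < p) {
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + mask, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += incoming;
        }
    }

    if (rank == 0)
        printf("global sum = %g\n", partial); /* sum of ranks 0..15 is 120 */

    MPI_Finalize();
    return 0;
}
```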
Figure 5.3 Searching an unstructured tree for a node with a given label, 'S', on two processing elements using depth-first traversal. The two-processor version, with processing element 0 searching the left subtree and processing element 1 searching the right subtree, expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.
Figure 5.4 Example of edge detection: (a) an 8 × 8 image; (b) typical 3 × 3 templates for detecting edges; and (c) partitioning of the image across four processors, with shaded regions indicating image data that must be communicated from neighboring processors to processor 1.
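The templates in Figure 5.4(b) are small convolution masks (the Sobel-style gradient operators are the classic example). Below is a hedged serial sketch of applying one such mask; in the partitioned setting of Figure 5.4(c), each processor would run the same loops over its block after receiving the shaded boundary pixels from its neighbors. The mask values, image contents, and function names here are illustrative assumptions.

```c
/* Illustrative sketch: apply a 3x3 edge-detection template to an N x N image.
 * The mask below is the familiar Sobel-style horizontal-gradient template;
 * a parallel version runs the same loops per block after exchanging
 * one-pixel-wide boundary (halo) regions with neighboring processors.
 */
#include <stdio.h>

#define N 8    /* matches the 8 x 8 image of Figure 5.4(a); the size is illustrative */

static const int mask[3][3] = {    /* assumed template: horizontal gradient */
    { -1, 0, 1 },
    { -2, 0, 2 },
    { -1, 0, 1 }
};

void apply_template(const int in[N][N], int out[N][N]) {
    for (int i = 1; i < N - 1; i++) {          /* interior pixels only */
        for (int j = 1; j < N - 1; j++) {
            int acc = 0;
            for (int di = -1; di <= 1; di++)
                for (int dj = -1; dj <= 1; dj++)
                    acc += mask[di + 1][dj + 1] * in[i + di][j + dj];
            out[i][j] = acc;
        }
    }
}

int main(void) {
    int img[N][N] = {{0}}, grad[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = N / 2; j < N; j++)
            img[i][j] = 10;                    /* vertical step edge in the middle */
    apply_template(img, grad);
    printf("response at the edge: %d\n", grad[3][N / 2]);  /* nonzero only near the edge */
    return 0;
}
```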
Figure 5.5 Four processing elements simulating 16 processing elements to compute the sum of 16 numbers (first two steps): (a) four processors simulating the first communication step of 16 processors in four substeps; (b) four processors simulating the second communication step of 16 processors in four substeps. Σ_i^j denotes the sum of the numbers with consecutive labels from i to j.
Figure 5.6 (continued) Four processing elements simulating 16 processing elements to compute the sum of 16 numbers (last three steps): (c) simulation of the third step in two substeps; (d) simulation of the fourth step; (e) final result.
Figure 5.7 A cost-optimal way of computing the sum of 16 numbers using four processing elements: (a) the 16 numbers distributed four per processing element; (b) each processing element adds its four local numbers; (c)–(d) the four partial sums are combined in two communication steps, leaving the global sum Σ_0^15 at processing element 0.
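A hedged MPI-style sketch of the cost-optimal scheme of Figure 5.7: each processing element first adds its n/p numbers locally, and only the p partial sums are then combined (here via MPI_Reduce, which performs a tree-like combination internally). The names and the choice of MPI are assumptions for illustration, not the text's own code.

```c
/* Illustrative sketch of the cost-optimal sum of Figure 5.7:
 * Theta(n/p) local additions, then one reduction over the p partial sums.
 * Run with, e.g.: mpirun -np 4 ./costopt
 */
#include <mpi.h>
#include <stdio.h>

#define N 16                                   /* total number of values, as in the figure */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);         /* assumed to divide N evenly */

    int chunk = N / p;
    double local = 0.0;
    for (int k = 0; k < chunk; k++)            /* Theta(n/p) local additions */
        local += rank * chunk + k;             /* values 0..N-1, block-distributed */

    double total = 0.0;                        /* Theta(log p) combining steps inside MPI_Reduce */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 0..%d = %g\n", N - 1, total);  /* expect 120 for N = 16 */

    MPI_Finalize();
    return 0;
}
```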
Figure 5.8 A comparison of the speedups (S) obtained by the binary-exchange, 2-D transpose, and 3-D transpose algorithms, as a function of problem size n, on 64 processing elements with t_c = 2, t_w = 4, t_s = 25, and t_h = 2 (see Chapter ?? for details).
Figure 5.9 Speedup (S) versus the number of processing elements (p) for adding a list of numbers, shown for problem sizes n = 64, 192, 320, and 512, together with the linear-speedup reference.
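The shape of the curves in Figure 5.9 can be reproduced from a simple cost model. Assuming (as is common for this example, though not stated in the caption) unit cost per addition and per communication step, the parallel time is roughly n/p + 2 log p and the speedup is S = n / (n/p + 2 log p). The sketch below, with these assumed constants, prints such curves for the plotted problem sizes.

```c
/* Hedged sketch: speedup of parallel addition under the simple model
 * T_P = n/p + 2*log2(p). The model and its constants are assumptions used
 * only to illustrate the saturating curves of Figure 5.9.
 * Compile with: cc speedup.c -lm
 */
#include <math.h>
#include <stdio.h>

double speedup(double n, double p) {
    double t_serial = n;                        /* n - 1 additions, approximated as n */
    double t_parallel = n / p + 2.0 * log2(p);  /* local sums plus log p combining steps */
    return t_serial / t_parallel;
}

int main(void) {
    const int sizes[] = { 64, 192, 320, 512 };  /* the n values plotted in the figure */
    for (int s = 0; s < 4; s++) {
        printf("n = %3d:", sizes[s]);
        for (int p = 1; p <= 32; p *= 2)
            printf("  S(p=%2d)=%5.1f", p, speedup(sizes[s], p));
        printf("\n");
    }
    return 0;
}
```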
Figure 5.10 Variation of efficiency (E): (a) as the number of processing elements (p) is increased for a fixed problem size (W); and (b) as the problem size (W) is increased for a fixed number of processing elements (p). The phenomenon illustrated in graph (b) is not common to all parallel systems.
Figure 5.11 Superlinear(?) speedup in parallel depth-first search: (a) DFS with one processing element (P0); (b) DFS with two processing elements (P0 and P1).
Figure 5.12 Dependency graphs (a)–(d) for Problem ??.