Multithreaded Parallelism and Performance Measures

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4435 - CS 9624

Plan

1. Parallelism Complexity Measures
2. cilk_for Loops
3. Scheduling Theory and Implementation
4. Measuring Parallelism in Practice
5. Announcements

Parallelism Complexity Measures

The fork-join parallelism model

Example:

    int fib (int n) {
        if (n < 2) return n;
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return x + y;
        }
    }

The code is "processor oblivious": the computation dag unfolds dynamically at runtime. (Figure: the computation dag of fib(4).) We shall also call this model multithreaded parallelism.
Terminology

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement.

At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.

(Figure: a dag with an initial strand, a final strand, and continue, spawn, call, and return edges.)

Work and span

We define several performance measures, assuming an ideal situation: no cache issues, no interprocessor communication costs.

- T_p is the minimum running time on p processors.
- T_1 is called the work, that is, the sum of the numbers of instructions at the nodes.
- T_∞ is the minimum running time with infinitely many processors, called the span.

The critical path length

Assuming all strands run in unit time, the length of the longest path in the dag is equal to T_∞. For this reason, T_∞ is also referred to as the critical path length.

Work law

We have: T_p ≥ T_1 / p. Indeed, in the best case, p processors can perform at most p units of work per unit of time.
Span law

We have: T_p ≥ T_∞. Indeed, T_p < T_∞ would contradict the definitions of T_p and T_∞: no schedule on p processors can beat a schedule with infinitely many processors at its disposal.

Speedup on p processors

T_1 / T_p is called the speedup on p processors. A parallel program execution can have:

- linear speedup: T_1 / T_p = Θ(p),
- superlinear speedup: T_1 / T_p = ω(p) (not possible in this model, though it is possible in others),
- sublinear speedup: T_1 / T_p = o(p).

Parallelism

Because the span law dictates T_p ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is T_1 / T_∞, called the parallelism: the average amount of work per step along the span.

(Figure: the fib(4) dag with its strands numbered along a critical path of length 8.)

For fib(4), we have T_1 = 17 and T_∞ = 8, and thus T_1 / T_∞ = 2.125. What about T_1(fib(n)) and T_∞(fib(n))?
The Fibonacci example (2/2)

We have T_1(n) = T_1(n − 1) + T_1(n − 2) + Θ(1). Let's solve it.

- One verifies by induction that T_1(n) ≤ a F_n − b for b > 0 large enough to dominate the Θ(1) term and for a > 1. We can then choose a large enough to satisfy the initial conditions, whatever they are. On the other hand, we also have F_n ≤ T_1(n).
- Therefore T_1(n) = Θ(F_n) = Θ(ψ^n) with ψ = (1 + √5)/2, the golden ratio.

We have T_∞(n) = max(T_∞(n − 1), T_∞(n − 2)) + Θ(1).

- One easily checks that T_∞(n − 1) ≥ T_∞(n − 2), which implies T_∞(n) = T_∞(n − 1) + Θ(1).
- Therefore T_∞(n) = Θ(n).

Consequently, the parallelism is Θ(ψ^n / n).

Series composition

(Figure: two sub-dags A and B composed in series: B starts when A finishes.)

- Work: T_1(A ∪ B) = T_1(A) + T_1(B)
- Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallel composition

(Figure: two sub-dags A and B composed in parallel: they may run concurrently.)

Work? Span?
Parallel composition

- Work: T_1(A ∪ B) = T_1(A) + T_1(B)
- Span: T_∞(A ∪ B) = max(T_∞(A), T_∞(B))

Some results in the fork-join parallelism model

Algorithm                 Work         Span
Merge sort                Θ(n lg n)    Θ(lg^3 n)
Matrix multiplication     Θ(n^3)       Θ(lg n)
Strassen                  Θ(n^{lg 7})  Θ(lg^2 n)
LU-decomposition          Θ(n^3)       Θ(n lg n)
Tableau construction      Θ(n^2)       Ω(n^{lg 3})
FFT                       Θ(n lg n)    Θ(lg^2 n)
Breadth-first search      Θ(E)         Θ(d lg V)

We shall prove these results in the next lectures.

cilk_for Loops

For loop parallelism in Cilk++

(Figure: an n × n matrix A = (a_ij) and its transpose A^T.) The following loop transposes A in place, swapping a_ij and a_ji for j < i:

    cilk_for (int i=1; i<n; ++i) {
        for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

The iterations of a cilk_for loop execute in parallel.
Implementation of cilk_for loops in Cilk++

Up to details (next week!) the previous loop is compiled as follows, using a divide-and-conquer implementation:

    void recur(int lo, int hi) {
        if (hi > lo + 1) {  // more than one iteration left: split
            int mid = lo + (hi - lo)/2;
            cilk_spawn recur(lo, mid);
            recur(mid, hi);
            cilk_sync;
        } else if (hi == lo + 1) {  // a single iteration: run the body
            int i = lo;
            for (int j=0; j<i; ++j) {
                double temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }
    }

(Figure: the balanced binary spawn tree over iterations 1, ..., 8.)

Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.

- Span of the loop control: Θ(log(n))
- Max span of an iteration: Θ(n)
- Span: Θ(n)
- Work: Θ(n^2)
- Parallelism: Θ(n)

Parallelizing the inner loop

    cilk_for (int i=1; i<n; ++i) {
        cilk_for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

- Span of the outer loop control: Θ(log(n))
- Max span of an inner loop control: Θ(log(n))
- Span of an iteration: Θ(1)
- Span: Θ(log(n))
- Work: Θ(n^2)
- Parallelism: Θ(n^2 / log(n))

But! More on this next week...

Scheduling Theory and Implementation