
Algorithm Engineering (aka. How to Write Fast Code), CS260 – Lecture 8: What is Parallelism and Scheduling



  1. Algorithm Engineering (aka. How to Write Fast Code), CS260 – Lecture 8, Yan Gu: What is Parallelism and Scheduling. Many slides in this lecture are borrowed from the seventh lecture in 6.172 Performance Engineering of Software Systems at MIT. The credit is to Prof. Charles E. Leiserson, and the instructor appreciates the permission to use them in this course.

  2. Outline: Fork-Join Parallelism • Greedy Scheduler • Work-Stealing Scheduler (CS260: Algorithm Engineering, Lecture 8)

  3. Recall: Basics of Cilk
     int fib(int n) {
       if (n < 2) return n;
       int x, y;
       x = cilk_spawn fib(n-1);  // The named child function may execute in parallel with the parent caller.
       y = fib(n-2);
       cilk_sync;                // Control cannot pass this point until all spawned children have returned.
       return x + y;
     }
     ● Cilk keywords grant permission for parallel execution. They do not command parallel execution.
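
The last bullet is easiest to see through what Cilk calls the serial elision: deleting cilk_spawn and cilk_sync leaves an ordinary C function that computes the same result. A minimal sketch of that elided version:

     int fib_serial(int n) {
       if (n < 2) return n;
       int x, y;
       x = fib_serial(n-1);   /* cilk_spawn elided: the child simply runs in the caller */
       y = fib_serial(n-2);
                              /* cilk_sync elided: there are no outstanding children to wait for */
       return x + y;
     }

Because the keywords only grant permission, running the Cilk version on a single worker behaves like this serial code.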

  4. Execution Model
     int fib(int n) {
       if (n < 2) return n;
       else {
         int x, y;
         x = cilk_spawn fib(n-1);
         y = fib(n-2);
         cilk_sync;
         return x + y;
       }
     }
     Example: fib(4)

  5. Execution Model
     (same fib code and fib(4) example as slide 4)
     [Figure: the computation dag of fib(4), unfolding as the recursion proceeds.]
     “Processor oblivious”: the computation dag unfolds dynamically.

  6. How Much Parallelism? Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer. Assuming that each node executes in unit time, what is the parallelism of this computation?
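
A sketch of how that conversion can look, assuming an OpenCilk-style compiler for the cilk_spawn/cilk_sync keywords; GRAIN and body() are made-up placeholders for the grain size and the loop body:

     #include <cilk/cilk.h>

     #define GRAIN 64                   /* hypothetical grain size; real runtimes tune this */

     void body(long i) { (void)i; }     /* stands in for one iteration of the original loop */

     /* Recursive divide-and-conquer lowering of:
      *   cilk_for (long i = lo; i < hi; ++i) body(i);
      */
     void loop_dc(long lo, long hi) {
       if (hi - lo <= GRAIN) {          /* small range: run the iterations serially */
         for (long i = lo; i < hi; ++i) body(i);
         return;
       }
       long mid = lo + (hi - lo) / 2;
       cilk_spawn loop_dc(lo, mid);     /* the left half may run in parallel */
       loop_dc(mid, hi);                /* the right half runs in the caller */
       cilk_sync;                       /* wait for the spawned half before returning */
     }

For a loop of n unit-time iterations, this lowering gives work Θ(n) and span Θ(log n) from the spawn tree, hence parallelism Θ(n / log n).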

  7. Performance Measures: T_P = execution time on P processors; W = work = 18.

  8. Performance Measures: T_P = execution time on P processors; W = work = 18; D = span* = 9.
     (*Also called critical-path length or computational depth.)

  9. Performance Measures: T_P = execution time on P processors; W = work = 18; D = span* = 9.
     Work Law: T_P ≥ W/P
     Span Law: T_P ≥ D
     (*Also called critical-path length or computational depth.)
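
For instance, for this example dag (W = 18, D = 9) run on, say, P = 3 processors, the two laws together give T_3 ≥ max{W/P, D} = max{6, 9} = 9: no 3-processor schedule can finish in fewer than 9 steps, no matter how clever the scheduler is.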

  10. Series Composition (A followed by B)
      Work: W(A∪B) = W(A) + W(B)
      Span: D(A∪B) = D(A) + D(B)

  11. Parallel Composition (A alongside B)
      Work: W(A∪B) = W(A) + W(B)
      Span: D(A∪B) = max{D(A), D(B)}
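
As an illustration (with made-up numbers): if component A has W(A) = 10, D(A) = 4 and component B has W(B) = 8, D(B) = 6, then composing them in series gives W = 18, D = 10, while composing them in parallel gives W = 18, D = 6. The work adds up either way; only the parallel composition keeps the span at the larger of the two.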

  12. Speedup. Definition: W/T_P = speedup on P processors.
      ● If W/T_P < P, we have sublinear speedup.
      ● If W/T_P = P, we have (perfect) linear speedup.
      ● If W/T_P > P, we have superlinear speedup, which is not possible in this simple performance model, because of the Work Law (T_P ≥ W/P).
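
As a hypothetical example: a computation with W = 100 that runs in T_4 = 40 steps on 4 processors has speedup 100/40 = 2.5 < 4, which is sublinear; finishing in T_4 = 25 steps would be perfect linear speedup.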

  13. Parallelism. Because the Span Law dictates that T_P ≥ D, the maximum possible speedup given W and D is W/D = parallelism = the average amount of work per step along the span = 18/9 = 2.

  14. Example: fib(4). Assume for simplicity that each strand in fib(4) takes unit time to execute.
      [Figure: the fib(4) dag, with the eight strands along the critical path numbered 1 through 8.]
      Work: W = 17
      Span: D = 8
      Parallelism: W/D = 2.125
      Using many more than 2 processors can yield only marginal performance gains.
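
Concretely, since speedup W/T_P can never exceed W/D = 2.125 (by the Span Law), even 8 or 16 processors cannot speed fib(4) up by more than a factor of 2.125, which is why adding processors beyond 2 buys so little here.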

  15. Outline: Fork-Join Parallelism • Greedy Scheduler • Work-Stealing Scheduler (CS260: Algorithm Engineering, Lecture 8)

  16. Scheduling
      ● Fork-join parallelism allows the programmer to express potential parallelism in an application.
      ● The scheduler maps strands onto processors dynamically at runtime.
      ● Since the theory of distributed schedulers is complicated, we’ll first explore the ideas with a centralized scheduler.
      [Figure: a shared-memory machine: processors (P) with caches ($) connected by a network to memory and I/O.]

  17. Greedy Scheduling. IDEA: Do as much as possible on every step. Definition: A node is ready if all its predecessors have executed.

  18. Greedy Scheduling (P = 3). IDEA: Do as much as possible on every step. Definition: A node is ready if all its predecessors have executed.
      Complete step: ≥ P strands ready; run any P of them.

  19. Greedy Scheduling (P = 3). IDEA: Do as much as possible on every step. Definition: A node is ready if all its predecessors have executed.
      Complete step: ≥ P strands ready; run any P of them.
      Incomplete step: < P strands ready; run all of them.
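
To make the two step types concrete, here is a toy simulation (not from the lecture) of a greedy scheduler running a small made-up dag of unit-time nodes on P = 2 processors; the node count and edge list are invented for illustration:

     #include <stdio.h>

     #define N 6   /* number of unit-time nodes in the made-up dag */
     #define P 2   /* number of processors */

     /* Hypothetical dag: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 2 -> 4, 3 -> 5, 4 -> 5 */
     int edges[][2] = {{0,1},{0,2},{1,3},{2,3},{2,4},{3,5},{4,5}};
     int nedges = sizeof edges / sizeof edges[0];

     int npred[N];   /* unexecuted predecessors of each node */
     int done[N];

     int main(void) {
         for (int e = 0; e < nedges; e++) npred[edges[e][1]]++;
         int executed = 0, steps = 0;
         while (executed < N) {
             int ready[N], nready = 0;
             for (int i = 0; i < N; i++)            /* a node is ready when all of its    */
                 if (!done[i] && npred[i] == 0)     /* predecessors have already executed */
                     ready[nready++] = i;
             int run = nready < P ? nready : P;     /* complete step: run any P of them; */
             for (int r = 0; r < run; r++) {        /* incomplete step: run all of them  */
                 int v = ready[r];
                 done[v] = 1; executed++;
                 for (int e = 0; e < nedges; e++)   /* retire v's outgoing edges */
                     if (edges[e][0] == v) npred[edges[e][1]]--;
             }
             steps++;
         }
         printf("greedy executed %d nodes on P=%d processors in %d steps\n", N, P, steps);
         return 0;
     }

On this dag, W = 6 and D = 4 (the longest path has four nodes); the run mixes incomplete steps (1 ready node) with complete steps (2 ready nodes) and finishes in 4 steps.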

  20. Analysis of Greedy. Theorem [G68, B75, EZL89]: Any greedy scheduler achieves T_P ≤ W/P + D.
      Proof.
      ∙ # complete steps ≤ W/P, since each complete step performs P work.
      ∙ # incomplete steps ≤ D, since each incomplete step reduces the span of the unexecuted dag by 1. ■
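
Two quick sanity checks on this bound. For the toy dag simulated above (W = 6, D = 4, P = 2), the theorem promises T_2 ≤ 6/2 + 4 = 7 steps, and the greedy simulation finishes in 4. More generally, combining the theorem with the Work and Span Laws shows that greedy scheduling is a 2-approximation: any scheduler needs T_P ≥ max{W/P, D}, while greedy achieves T_P ≤ W/P + D ≤ 2·max{W/P, D}, i.e., within a factor of 2 of optimal.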

  21. Outline: Fork-Join Parallelism • Greedy Scheduler • Work-Stealing Scheduler (CS260: Algorithm Engineering, Lecture 8)

  22. Execution Model (revisited): same fib code and fib(4) example as slide 4.
      [Figure: the computation dag of fib(4).]
      “Processor oblivious”: the computation dag unfolds dynamically.

  23. Execution Model, fib(4): [Animation: P1 executes down one branch of the dag; the strands it spawns along the way become available for execution by other workers.]

  24. Execution Model, fib(4): Steal! [Animation: P2 and P3 steal the available strands while P1 keeps working.]

  25. Execution Model, fib(4): [Animation: P1, P2, and P3 each work on a different part of the dag.]

  26. Execution Model, fib(4): Can’t execute! [Animation: a strand waiting at a cilk_sync cannot execute until all of its spawned children have returned.]

  27. Execution Model, fib(4): [Animation: execution continues across P1, P2, and P3.]

  28. Cilk Runtime System: Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98]. [Deque animation, slides 28 to 36. This slide: Call! A called frame goes onto the bottom of the worker’s deque.]

  29. Cilk Runtime System: [Deque animation: Spawn! A spawned frame likewise goes onto the bottom.]

  30. Cilk Runtime System: [Deque animation: Spawn! Call! Spawn! Several workers keep pushing frames as they call and spawn.]

  31. Cilk Runtime System: [Deque animation: Return! A returning frame is popped from the bottom.]

  32. Cilk Runtime System: [Deque animation: Return!]

  33. Cilk Runtime System: [Deque animation: Steal!] When a worker runs out of work, it steals from the top of a random victim’s deque (see the sketch below).

  34. Cilk Runtime System: [Deque animation: Steal! The theft is in progress.]

  35. Cilk Runtime System: [Deque animation: the steal completes.]

  36. Cilk Runtime System: [Deque animation: Spawn! The thief resumes execution and continues to spawn.]
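
As a rough picture of how such a runtime is organized, here is a simplified sketch; it is not the actual Cilk runtime: the strand type, the fixed-size deques, the worker loop, and the mutex-per-deque locking are all made up for illustration, whereas real implementations use a carefully engineered protocol that avoids locks on the common path.

     #include <pthread.h>
     #include <stdio.h>
     #include <stdlib.h>

     #define NWORKERS 4
     #define CAP 1024

     typedef void (*strand_fn)(void);        /* a "strand" is just a function to run */

     typedef struct {
         strand_fn buf[CAP];
         int top, bottom;                    /* ready strands live in [top, bottom) */
         pthread_mutex_t lock;               /* stand-in for the real lock-free protocol */
     } deque_t;

     static deque_t deques[NWORKERS];

     /* The owner pushes and pops at the BOTTOM of its own deque, like a stack. */
     static void push_bottom(deque_t *d, strand_fn s) {
         pthread_mutex_lock(&d->lock);
         d->buf[d->bottom++ % CAP] = s;
         pthread_mutex_unlock(&d->lock);
     }

     static strand_fn pop_bottom(deque_t *d) {
         strand_fn s = NULL;
         pthread_mutex_lock(&d->lock);
         if (d->bottom > d->top) s = d->buf[--d->bottom % CAP];
         pthread_mutex_unlock(&d->lock);
         return s;
     }

     /* A thief steals from the TOP of a victim's deque. */
     static strand_fn steal_top(deque_t *d) {
         strand_fn s = NULL;
         pthread_mutex_lock(&d->lock);
         if (d->top < d->bottom) s = d->buf[d->top++ % CAP];
         pthread_mutex_unlock(&d->lock);
         return s;
     }

     static void dummy_strand(void) { }      /* placeholder for executing one strand */

     static void *worker(void *arg) {
         int id = (int)(long)arg;
         int idle = 0;
         while (idle < 1000) {                          /* crude termination for the demo */
             strand_fn s = pop_bottom(&deques[id]);     /* prefer work from your own deque */
             if (!s) {
                 int victim = rand() % NWORKERS;        /* out of work: pick a random victim */
                 if (victim != id) s = steal_top(&deques[victim]);
             }
             if (s) { s(); idle = 0; } else idle++;
         }
         return NULL;
     }

     int main(void) {
         pthread_t tid[NWORKERS];
         for (int i = 0; i < NWORKERS; i++) pthread_mutex_init(&deques[i].lock, NULL);
         for (int i = 0; i < 200; i++)                  /* seed all work on worker 0's deque */
             push_bottom(&deques[0], dummy_strand);
         for (long i = 0; i < NWORKERS; i++) pthread_create(&tid[i], NULL, worker, (void *)i);
         for (int i = 0; i < NWORKERS; i++) pthread_join(tid[i], NULL);
         printf("all seeded strands have been executed\n");
         return 0;
     }

Even in this toy version, the two properties the slides emphasize survive: each worker treats the bottom of its own deque as a stack, and stealing happens only from the top of a randomly chosen victim.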
