Lecture 11: HW3, Rest of Parallel Patterns, Load Balancing G63.2011.002/G22.2945.001 · November 16, 2010 D&C General
Outline
• Divide-and-Conquer
• General Data Dependencies
Divide and Conquer
$y_i = f_i(x_1, \dots, x_N)$ for $i \in \{1, \dots, M\}$.
Main purpose: A way of partitioning up fully dependent tasks.
[Figure: eight inputs x_0 ... x_7, intermediate rows u_0 ... u_7, v_0 ... v_7, w_0 ... w_7, and outputs y_0 ... y_7]
Processor allocation?
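As a minimal sketch of the pattern, consider the simplest fully dependent task: a single output that needs every input, here a sum. The function name dc_sum and the half-open index convention are assumptions made for this illustration, not part of the lecture. The two recursive calls work on disjoint halves, so they are the natural place to hand work to a second processor.

```c
#include <assert.h>
#include <stddef.h>

/* Divide-and-conquer sum of x[lo..hi) (half-open range).
 * The two recursive calls read disjoint halves of the input, so they
 * are independent of each other; only the final addition needs both. */
double dc_sum(const double *x, size_t lo, size_t hi)
{
    assert(lo <= hi);
    if (hi - lo == 0)
        return 0.0;              /* empty range */
    if (hi - lo == 1)
        return x[lo];            /* leaf: a single input */

    size_t mid = lo + (hi - lo) / 2;
    double left = dc_sum(x, lo, mid);    /* could run on processor 1 */
    double right = dc_sum(x, mid, hi);   /* could run on processor 2 */
    return left + right;                 /* merge step */
}
```

For N inputs the recursion tree has about log2(N) levels, which is why the question of how to allocate processors to the levels comes up right away.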
Divide and Conquer: Examples
• GEMM, TRMM, TRSM, GETRF (LU)
• FFT
• Sorting: Bucket sort, Merge sort
• N-body problems (Barnes-Hut, FMM)
• Adaptive Integration
More fun with work and span: D&C analysis lecture
Divide and Conquer: Issues
• “No idea how to parallelize that”
  • → Try D&C
• Non-optimal during partition, merge
  • But: Does not matter if deep levels do heavy enough processing
• Subtle to map to fixed-width machines (e.g. GPUs)
  • Varying data size along tree
• Bookkeeping nontrivial for non-$2^n$ sizes
• Side benefit: D&C is generally cache-friendly
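Two of the issues above can be seen in a small merge sort, sketched here under a few assumptions: the names dc_merge_sort, merge_halves and LEAF_CUTOFF are made up for this example, and the cutoff value is arbitrary. The leaves do a chunk of sequential work (insertion sort) so that the deep levels are heavy enough, and the split point handles sizes that are not powers of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEAF_CUTOFF 4  /* hypothetical threshold; tune per machine */

/* Sequential leaf work: sort a[0..n) in place. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

/* Merge sorted a[0..mid) and a[mid..n) via scratch space tmp[0..n). */
static void merge_halves(int *a, size_t mid, size_t n, int *tmp)
{
    size_t i = 0, j = mid, k = 0;
    while (i < mid && j < n)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < n)
        tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof *a);
}

/* Sort a[0..n); tmp must provide room for n ints. */
void dc_merge_sort(int *a, size_t n, int *tmp)
{
    assert(n == 0 || (a != NULL && tmp != NULL));
    if (n <= LEAF_CUTOFF) {
        insertion_sort(a, n);   /* heavy-enough leaf */
        return;
    }
    size_t mid = n / 2;         /* halves differ by one when n is odd */
    dc_merge_sort(a, mid, tmp);                  /* independent of ... */
    dc_merge_sort(a + mid, n - mid, tmp + mid);  /* ... this call */
    merge_halves(a, mid, n, tmp);                /* sequential merge */
}
```

The two recursive calls use disjoint parts of both a and tmp, so they could run concurrently. The merge at each level is the sequential, non-optimal part the slide refers to.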
General Dependency Graphs
B = f(A)
C = g(B)
E = f(C)
F = h(C)
G = g(E, F)
P = p(B)
Q = q(B)
R = r(G, P, Q)
[Figure: the same computation drawn as a directed graph with nodes A, B, C, E, F, G, P, Q, R and edges labeled f, g, h, p, q, r]
Great: All patterns discussed so far can be reduced to this one.
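One way to read the graph above is as input to a scheduler. The sketch below encodes the slide's dependencies as an edge list and computes, for each node, the earliest step at which it could run. It assumes unit-time tasks and unlimited processors, and the names edges and schedule_levels are made up for this example. The traversal is a standard topological sort (Kahn's algorithm), not something the lecture prescribes.

```c
#include <assert.h>
#include <stddef.h>

enum { A, B, C, E, F, G, P, Q, R, NUM_NODES };

/* Edges (from, to): node "to" needs the value of node "from". */
static const int edges[][2] = {
    {A, B}, {B, C}, {C, E}, {C, F}, {E, G}, {F, G},
    {B, P}, {B, Q}, {G, R}, {P, R}, {Q, R},
};
#define NUM_EDGES (sizeof edges / sizeof edges[0])

/* Fill level[v] with the earliest step at which v can run, assuming
 * unit-time tasks and unlimited processors (an idealization).
 * Returns the number of nodes scheduled: NUM_NODES if the graph is
 * acyclic, fewer if it contains a cycle. */
int schedule_levels(int level[NUM_NODES])
{
    int indegree[NUM_NODES] = {0};
    int ready[NUM_NODES];          /* FIFO of nodes whose inputs are done */
    int head = 0, tail = 0;

    assert(level != NULL);
    for (size_t e = 0; e < NUM_EDGES; ++e)
        indegree[edges[e][1]]++;
    for (int v = 0; v < NUM_NODES; ++v) {
        level[v] = 0;
        if (indegree[v] == 0)
            ready[tail++] = v;     /* no inputs: ready at step 0 */
    }
    while (head < tail) {
        int u = ready[head++];
        for (size_t e = 0; e < NUM_EDGES; ++e) {
            if (edges[e][0] != u)
                continue;
            int v = edges[e][1];
            if (level[u] + 1 > level[v])
                level[v] = level[u] + 1;   /* wait for the latest input */
            if (--indegree[v] == 0)
                ready[tail++] = v;         /* all inputs available */
        }
    }
    return tail;
}
```

Nodes that end up on the same level (for example C, P and Q) have no dependency on each other and can run concurrently.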
Cilk
Features:
• Adds keywords spawn, sync, (inlet, abort)
• Remove keywords → valid (seq.) C
Timeline:
• Developed at MIT, starting in '94
• Commercialized in '06
• Bought by Intel in '09
• Available in the Intel Compilers

    cilk int fib(int n)
    {
        if (n < 2)
            return n;
        else
        {
            int x, y;
            x = spawn fib(n - 1);
            y = spawn fib(n - 2);
            sync;
            return (x + y);
        }
    }

Efficient implementation?
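The "remove keywords → valid C" property can be checked by hand: deleting cilk, spawn and sync from the slide's fib leaves the plain C function below. The two helper functions are additions for illustration only. They count work (total calls) and span (longest chain of dependent calls) under the assumption that every call costs one unit and that both spawned children run in parallel; the names fib_work and fib_span are made up here.

```c
#include <assert.h>

/* Serial elision: the slide's Cilk code with the keywords removed. */
int fib(int n)
{
    assert(n >= 0);
    if (n < 2)
        return n;
    else {
        int x, y;
        x = fib(n - 1);
        y = fib(n - 2);
        return (x + y);
    }
}

/* Work: total number of fib calls, one unit each (an assumption). */
long fib_work(int n)
{
    if (n < 2)
        return 1;
    return 1 + fib_work(n - 1) + fib_work(n - 2);
}

/* Span: length of the longest chain of dependent calls when the two
 * spawned children run in parallel and sync waits for the slower one. */
int fib_span(int n)
{
    if (n < 2)
        return 1;
    int a = fib_span(n - 1);
    int b = fib_span(n - 2);
    return 1 + (a > b ? a : b);
}
```

Under this cost model the work grows exponentially with n while the span grows only linearly, so the ratio of work to span, which bounds the achievable speedup, is large.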
Work-Stealing
Cilk's Work-Stealing Scheduler
Each processor maintains a work deque of ready threads, and it manipulates the bottom of the deque like a stack.
When a processor runs out of work, it steals a thread from the top of a random victim's deque.
[Animation: a row of processors P, each with its own deque; the frames step through Spawn!, Return!, Steal!, and Spawn! again]
Why is Work-Stealing better than a Task Queue?
With material by Charles E. Leiserson (MIT)
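The deque discipline described above can be sketched as a small data structure. This is a single-threaded illustration only: it has no locking or atomics, so it is not Cilk's actual implementation, and the type work_deque, the function names, and the fixed capacity are all assumptions made for the example. Tasks are represented by plain integer ids.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64  /* hypothetical fixed capacity */

typedef struct {
    int tasks[DEQUE_CAP];  /* task ids, oldest at top, newest at bottom */
    size_t top;            /* index of the oldest task (thieves' end) */
    size_t bottom;         /* one past the newest task (owner's end) */
} work_deque;

void deque_init(work_deque *d)
{
    assert(d != NULL);
    d->top = d->bottom = 0;
}

/* Owner, on spawn: push a ready task at the bottom. Returns 0 if full. */
int push_bottom(work_deque *d, int task)
{
    if (d->top == d->bottom)
        d->top = d->bottom = 0;    /* empty: reuse storage from the start */
    if (d->bottom == DEQUE_CAP)
        return 0;
    d->tasks[d->bottom++] = task;
    return 1;
}

/* Owner, on return: take the newest task, like a stack pop.
 * Returns 0 if the deque is empty. */
int pop_bottom(work_deque *d, int *task)
{
    if (d->bottom == d->top)
        return 0;
    *task = d->tasks[--d->bottom];
    return 1;
}

/* Thief: take the oldest task from the top of a victim's deque.
 * Returns 0 if there is nothing to steal. */
int steal_top(work_deque *d, int *task)
{
    if (d->bottom == d->top)
        return 0;
    *task = d->tasks[d->top++];
    return 1;
}
```

In a recursive computation the task at the top is the oldest one and therefore tends to sit closest to the root of the spawn tree, so a single steal usually transfers a large piece of work. The owner, meanwhile, only touches the bottom of its own deque, whereas a central task queue is a point of contention for every processor.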
General Graphs: Issues
• Model can accommodate ‘speculative execution’
  • Launch many different ‘approaches’
  • Abort the others as soon as one satisfactory one emerges.
• Discover dependencies, make up schedule at run-time
  • Usually less efficient than the case of known dependencies
• Map-Reduce absorbs many cases that would otherwise be general
• On-line scheduling: complicated
• Not a good fit if a more specific pattern applies
• Good if inputs/outputs/functions are (somewhat) heavy-weight
Questions?