Lecture 14: Cilk
Shankar Balachandran (bshankar@ee.iitb.ac.in)
EE717
This lecture is partly based on Charles Leiserson's slides on Cilk.
Cilk
A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.
Example applications:
● virus shell assembly
● dense and sparse matrix computations
● graphics rendering
● friction-stir welding
● n-body simulation
● heuristic search
● artificial evolution
Shared-Memory Multiprocessor
[Figure: several processors (P), each with a cache ($), connected through a network to shared memory and I/O.]
In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!
Cilk Is Simple
• Cilk extends the C language with just a handful of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.
Fibonacci

C elision:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Cilk code:

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
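A common way to illustrate the serial elision (a sketch of the idea only, not how the Cilk compiler actually handles the keywords) is to define the three keywords away with the C preprocessor, after which the Cilk fib() above compiles as ordinary C and computes the same result:

#define cilk          /* a Cilk procedure becomes a plain C function */
#define spawn         /* a spawn becomes an ordinary function call   */
#define sync          /* a sync becomes an empty statement           */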
Basic Cilk Keywords

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

cilk identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: the named child Cilk procedure can execute in parallel with the parent caller.
sync: control cannot pass this point until all spawned children have returned.
Dynamic Multithreading

Example: fib(4)

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

[Figure: the computation dag of fib(4), one node per call: 4; 3, 2; 2, 1, 1, 0; 1, 0.]

The computation dag unfolds dynamically. Cilk is "processor oblivious".
Multithreaded Computation

[Figure: an example dag, indicating the initial thread, the final thread, and the spawn, return, and continue edges.]

• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.
Lecture 14: Performance Measures
Algorithmic Complexity Measures

T_P = execution time on P processors
T_1 = work
T_∞ = span*

Lower bounds:
• T_P ≥ T_1/P
• T_P ≥ T_∞

* Also called critical-path length or computational depth.
Speedup

Definition: T_1/T_P = speedup on P processors.

If T_1/T_P = Θ(P) ≤ P, we have linear speedup;
           = P, we have perfect linear speedup;
           > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T_P ≥ T_1/P.
Parallelism

Because we have the lower bound T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = parallelism = the average amount of work per step along the span.
Example: fib(4)

[Figure: the fib(4) dag with the eight threads on the critical path numbered 1 through 8.]

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Parallelism: T_1/T_∞ = 2.125

Using many more than 2 processors makes little sense.
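As a check on these numbers, here is a small, self-contained C sketch (not part of the lecture's Cilk code) that evaluates the work and span of the fib(n) dag under the unit-cost model above. It assumes each call with n >= 2 contributes three unit-time threads (before the first spawn, between the two spawns, and after the sync), and each base-case call contributes one.

#include <stdio.h>

/* Work: total number of unit-time threads in the dag. */
static long work(int n) {
    if (n < 2) return 1;                  /* base case: a single thread     */
    return 3 + work(n - 1) + work(n - 2); /* 3 local threads + both subdags */
}

/* Span: length of the longest path of threads through the dag. */
static long span(int n) {
    if (n < 2) return 1;
    long s1 = span(n - 1) + 2;  /* path through the first spawned child  */
    long s2 = span(n - 2) + 3;  /* path through the second spawned child */
    return (s1 > s2) ? s1 : s2;
}

int main(void) {
    int n = 4;
    long t1 = work(n), tinf = span(n);
    printf("work = %ld, span = %ld, parallelism = %.3f\n",
           t1, tinf, (double)t1 / (double)tinf);
    return 0;
}

For n = 4 this prints work = 17, span = 8, parallelism = 2.125, matching the slide.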
Lesson

Work and span can predict performance on large machines better than running times on small machines can.
Suggested Reading

Mark D. Hill and Michael R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, July 2008.
Parallelizing Vector Addition

C:

void vadd (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}
Parallelizing Vector Addition

C:

void vadd (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

C, divide and conquer:

void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}

Parallelization strategy:
1. Convert loops to recursion.
Parallelizing Vector Addition

Cilk:

cilk void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.

Side benefit: D&C is generally good for caches!
Vector Addition

cilk void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}
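A hypothetical driver for the procedure above, shown only as a sketch. The slides leave real and BASE unspecified, so real = double and BASE = 4 are assumptions made here for illustration (the typedef and #define would need to appear before vadd's definition):

typedef double real;
#define BASE 4

cilk int main (int argc, char *argv[]) {
    real A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    real B[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    spawn vadd(A, B, 8);  /* the recursive halves may run in parallel */
    sync;                 /* wait for the whole addition to complete  */
    /* every A[i] is now 9 */
    return 0;
}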
Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(lg n)
Parallelism: T_1/T_∞ = Θ(n/lg n)
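These bounds follow from standard divide-and-conquer recurrences. A brief derivation, assuming BASE and the per-call bookkeeping are Θ(1):

$$T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n)$$
$$T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n)$$
$$\frac{T_1(n)}{T_\infty(n)} = \Theta\!\left(\frac{n}{\lg n}\right)$$

Both spawned halves contribute to the work (hence the factor of 2), but they execute in parallel, so only one of them appears in the span recurrence.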
Another Parallelization

C:

void vadd1 (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

void vadd (real *A, real *B, int n) {
  int j;
  for (j=0; j<n; j+=BASE) {
    vadd1(A+j, B+j, min(BASE, n-j));
  }
}

Cilk:

cilk void vadd1 (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

cilk void vadd (real *A, real *B, int n) {
  int j;
  for (j=0; j<n; j+=BASE) {
    spawn vadd1(A+j, B+j, min(BASE, n-j));
  }
  sync;
}
Analysis

[Figure: the flat spawn structure; the parent spawns one vadd1 chunk of size BASE per loop iteration.]

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(n)
Parallelism: T_1/T_∞ = Θ(1)  PUNY!
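The reason the span degrades to Θ(n): the parent's spawning loop is itself serial, so all n/BASE spawn iterations lie on the critical path before the last chunk even begins. Roughly, assuming BASE = Θ(1):

$$T_\infty(n) = \Theta\!\left(\frac{n}{\mathrm{BASE}}\right) + \Theta(\mathrm{BASE}) = \Theta(n),$$

so the parallelism is only Θ(1), regardless of how many processors are available.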