Lecture 14: Cilk
Shankar Balachandran (bshankar@ee.iitb.ac.in)
EE717
This lecture is partly based on Charles Leiserson's slides on Cilk.
Cilk
A C language for programming dynamic multithreaded applications on shared-memory multiprocessors.
Example applications:
● virus shell assembly
● dense and sparse matrix computations
● graphics rendering
● friction-stir welding
● n-body simulation
● heuristic search
● artificial evolution
Shared-Memory Multiprocessor
[Figure: several processors (P), each with a cache ($), connected through a network to shared memory and I/O.]
In particular, over the next decade, chip multiprocessors (CMPs) will be an increasingly important platform!
Cilk Is Simple
• Cilk extends the C language with just a handful of keywords.
• Every Cilk program has a serial semantics.
• Not only is Cilk fast, it provides performance guarantees based on performance abstractions.
• Cilk is processor-oblivious.
• Cilk's provably good runtime system automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.
• Cilk supports speculative parallelism.
Fibonacci

C elision:

int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = fib(n-1);
    y = fib(n-2);
    return (x+y);
  }
}

Cilk code:

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

Cilk is a faithful extension of C. A Cilk program's serial elision is always a legal implementation of Cilk semantics. Cilk provides no new data types.
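A common way to illustrate the serial elision (a sketch of the idea only, not how the Cilk compiler actually handles the keywords) is to define the three keywords away with the C preprocessor, after which the Cilk fib() above compiles as ordinary C and computes the same result:

#define cilk          /* a Cilk procedure becomes a plain C function */
#define spawn         /* a spawn becomes an ordinary function call   */
#define sync          /* a sync becomes an empty statement           */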
Basic Cilk Keywords

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

cilk identifies a function as a Cilk procedure, capable of being spawned in parallel.
spawn: the named child Cilk procedure can execute in parallel with the parent caller.
sync: control cannot pass this point until all spawned children have returned.
Dynamic Multithreading

Example: fib(4)

cilk int fib (int n) {
  if (n<2) return (n);
  else {
    int x, y;
    x = spawn fib(n-1);
    y = spawn fib(n-2);
    sync;
    return (x+y);
  }
}

[Figure: the computation dag of fib(4), one node per call: 4; 3, 2; 2, 1, 1, 0; 1, 0.]

The computation dag unfolds dynamically. Cilk is "processor oblivious".
Multithreaded Computation

[Figure: an example dag, indicating the initial thread, the final thread, and the spawn, return, and continue edges.]

• The dag G = (V, E) represents a parallel instruction stream.
• Each vertex v ∈ V represents a (Cilk) thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return).
• Every edge e ∈ E is either a spawn edge, a return edge, or a continue edge.
Lecture 14: Performance Measures
Algorithmic Complexity Measures

T_P = execution time on P processors
T_1 = work
T_∞ = span*

Lower bounds:
• T_P ≥ T_1/P
• T_P ≥ T_∞

* Also called critical-path length or computational depth.
Speedup

Definition: T_1/T_P = speedup on P processors.

If T_1/T_P = Θ(P) ≤ P, we have linear speedup;
           = P, we have perfect linear speedup;
           > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T_P ≥ T_1/P.
Parallelism

Because we have the lower bound T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = parallelism = the average amount of work per step along the span.
Example: fib(4)

[Figure: the fib(4) dag with the eight threads on the critical path numbered 1 through 8.]

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Example: fib(4)

Assume for simplicity that each Cilk thread in fib() takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Parallelism: T_1/T_∞ = 2.125

Using many more than 2 processors makes little sense.
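As a check on these numbers, here is a small, self-contained C sketch (not part of the lecture's Cilk code) that evaluates the work and span of the fib(n) dag under the unit-cost model above. It assumes each call with n >= 2 contributes three unit-time threads (before the first spawn, between the two spawns, and after the sync), and each base-case call contributes one.

#include <stdio.h>

/* Work: total number of unit-time threads in the dag. */
static long work(int n) {
    if (n < 2) return 1;                  /* base case: a single thread     */
    return 3 + work(n - 1) + work(n - 2); /* 3 local threads + both subdags */
}

/* Span: length of the longest path of threads through the dag. */
static long span(int n) {
    if (n < 2) return 1;
    long s1 = span(n - 1) + 2;  /* path through the first spawned child  */
    long s2 = span(n - 2) + 3;  /* path through the second spawned child */
    return (s1 > s2) ? s1 : s2;
}

int main(void) {
    int n = 4;
    long t1 = work(n), tinf = span(n);
    printf("work = %ld, span = %ld, parallelism = %.3f\n",
           t1, tinf, (double)t1 / (double)tinf);
    return 0;
}

For n = 4 this prints work = 17, span = 8, parallelism = 2.125, matching the slide.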
Lesson

Work and span can predict performance on large machines better than running times on small machines can.
Suggested Reading

Mark D. Hill and Michael R. Marty, "Amdahl's Law in the Multicore Era," IEEE Computer, July 2008.
Parallelizing Vector Addition

C:

void vadd (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}
Parallelizing Vector Addition

C:

void vadd (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

C, divide and conquer:

void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    vadd (A, B, n/2);
    vadd (A+n/2, B+n/2, n-n/2);
  }
}

Parallelization strategy:
1. Convert loops to recursion.
Parallelizing Vector Addition

Cilk:

cilk void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk keywords.

Side benefit: D&C is generally good for caches!
Vector Addition

cilk void vadd (real *A, real *B, int n) {
  if (n<=BASE) {
    int i;
    for (i=0; i<n; i++) A[i] += B[i];
  } else {
    spawn vadd (A, B, n/2);
    spawn vadd (A+n/2, B+n/2, n-n/2);
    sync;
  }
}
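A hypothetical driver for the procedure above, shown only as a sketch. The slides leave real and BASE unspecified, so real = double and BASE = 4 are assumptions made here for illustration (the typedef and #define would need to appear before vadd's definition):

typedef double real;
#define BASE 4

cilk int main (int argc, char *argv[]) {
    real A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    real B[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    spawn vadd(A, B, 8);  /* the recursive halves may run in parallel */
    sync;                 /* wait for the whole addition to complete  */
    /* every A[i] is now 9 */
    return 0;
}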
Vector Addition Analysis

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(lg n)
Parallelism: T_1/T_∞ = Θ(n/lg n)
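These bounds follow from standard divide-and-conquer recurrences. A brief derivation, assuming BASE and the per-call bookkeeping are Θ(1):

$$T_1(n) = 2\,T_1(n/2) + \Theta(1) = \Theta(n)$$
$$T_\infty(n) = T_\infty(n/2) + \Theta(1) = \Theta(\lg n)$$
$$\frac{T_1(n)}{T_\infty(n)} = \Theta\!\left(\frac{n}{\lg n}\right)$$

Both spawned halves contribute to the work (hence the factor of 2), but they execute in parallel, so only one of them appears in the span recurrence.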
Another Parallelization

C:

void vadd1 (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

void vadd (real *A, real *B, int n) {
  int j;
  for (j=0; j<n; j+=BASE) {
    vadd1(A+j, B+j, min(BASE, n-j));
  }
}

Cilk:

cilk void vadd1 (real *A, real *B, int n) {
  int i;
  for (i=0; i<n; i++) A[i] += B[i];
}

cilk void vadd (real *A, real *B, int n) {
  int j;
  for (j=0; j<n; j+=BASE) {
    spawn vadd1(A+j, B+j, min(BASE, n-j));
  }
  sync;
}
Analysis

[Figure: the flat spawn structure; the parent spawns one vadd1 chunk of size BASE per loop iteration.]

To add two vectors of length n, where BASE = Θ(1):

Work: T_1 = Θ(n)
Span: T_∞ = Θ(n)
Parallelism: T_1/T_∞ = Θ(1)  PUNY!
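The reason the span degrades to Θ(n): the parent's spawning loop is itself serial, so all n/BASE spawn iterations lie on the critical path before the last chunk even begins. Roughly, assuming BASE = Θ(1):

$$T_\infty(n) = \Theta\!\left(\frac{n}{\mathrm{BASE}}\right) + \Theta(\mathrm{BASE}) = \Theta(n),$$

so the parallelism is only Θ(1), regardless of how many processors are available.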