CS 240A: Shared Memory & Multicore Programming with Cilk++



  1. CS 240A: Shared Memory & Multicore Programming with Cilk++
     • Multicore and NUMA architectures
     • Multithreaded Programming
     • Cilk++ as a concurrency platform
     • Work, Span, (potential) Parallelism
     Thanks to Charles E. Leiserson for some of these slides.

  2. Multicore Architecture
     [Diagram: several cores, each with a private cache ($), sharing memory, I/O, and a network interconnect.]
     Chip Multiprocessor (CMP)

  3. cc-NUMA Architectures
     [Diagram: AMD 8-way Opteron server (neumann@cs.ucsb.edu); each processor (CMP) has 2/4 cores, a memory bank local to that processor, and point-to-point interconnects.]

  4. cc-NUMA Architectures
     ∙ No Front Side Bus
     ∙ Integrated memory controller
     ∙ On-die interconnect among CMPs
     ∙ Main memory is physically distributed among CMPs (i.e., each piece of memory has an affinity to a CMP)
     ∙ NUMA: Non-Uniform Memory Access
       § For multi-socket servers only
       § Your desktop is safe (well, for now at least)
       § Triton nodes are not NUMA either

  5. Desktop Multicores Today
     This is your AMD Barcelona or Intel Core i7!
     On-die interconnect; private caches, so cache coherence is required.

  6. Multithreaded Programming
     ∙ POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE
     ∙ The “assembly language” of shared memory programming
     ∙ The programmer has to manually:
       § Create and terminate threads
       § Wait for threads to complete
       § Manage interaction between threads using mutexes, condition variables, etc.
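     A minimal sketch of what “manually” means here, using the standard pthread_create / pthread_join / mutex calls (the worker function and shared counter are invented for illustration):

         #include <pthread.h>
         #include <cstdio>

         static long counter = 0;                       // shared state
         static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

         void* worker(void*) {                          // hypothetical work function
             pthread_mutex_lock(&lock);                 // interaction managed by hand
             counter++;
             pthread_mutex_unlock(&lock);
             return NULL;
         }

         int main() {
             pthread_t threads[4];
             for (int i = 0; i < 4; ++i)                // create each thread explicitly
                 pthread_create(&threads[i], NULL, worker, NULL);
             for (int i = 0; i < 4; ++i)                // wait for each thread explicitly
                 pthread_join(threads[i], NULL);
             std::printf("counter = %ld\n", counter);
             return 0;
         }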

  7. Concurrency Platforms
     • Programming directly on PThreads is painful and error-prone.
     • With PThreads, you sacrifice either memory usage or load balance among processors.
     • A concurrency platform provides linguistic support and handles load balancing.
     • Examples:
       • Threading Building Blocks (TBB)
       • OpenMP
       • Cilk++

  8. Cilk vs PThreads
     How will the following code execute in PThreads? In Cilk?

         for (i = 1; i < 1000000000; i++) {
             spawn-or-fork foo(i);
         }
         sync-or-join;

     What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two models have different liveness properties:
     ∙ Cilk threads are spawned lazily: “may” parallelism
     ∙ PThreads are spawned eagerly: “must” parallelism
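     In Cilk++ syntax, the pseudocode above would read as follows (a sketch; foo is the slide's placeholder function):

         for (int i = 1; i < 1000000000; i++) {
             cilk_spawn foo(i);   // lazily exposes work; runs in parallel only if a worker steals it
         }
         cilk_sync;               // wait for all spawned calls to foo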

  9. Cilk vs OpenMP
     ∙ Cilk++ guarantees space bounds
       § On P processors, Cilk++ uses no more than P times the stack space of a serial execution.
     ∙ Cilk++ has a solution for global variables (called “reducers” / “hyperobjects”)
     ∙ Cilk++ has nested parallelism that works and provides guaranteed speed-up.
       § Indeed, the Cilk scheduler is provably optimal.
     ∙ Cilk++ has a race detector (cilkscreen) for debugging and software release.
     ∙ Keep in mind that platform comparisons are (and always will be) subject to debate.

  10. Complexity Measures
      T_P = execution time on P processors
      T_1 = work
      T_∞ = span*
      Work Law: T_P ≥ T_1/P
      Span Law: T_P ≥ T_∞
      * Also called critical-path length or computational depth.
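      A quick worked example (numbers invented for illustration): with T_1 = 100 and T_∞ = 10, on P = 4 processors the two laws give T_4 ≥ max{100/4, 10} = 25, no matter how clever the scheduler is.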

  11. Scheduling
      A strand is a sequence of instructions that doesn’t contain any parallel constructs.
      ∙ Cilk++ allows the programmer to express potential parallelism in an application.
      ∙ The Cilk++ scheduler maps strands onto processors dynamically at runtime.
      ∙ Since on-line schedulers are complicated, we’ll explore the ideas with an off-line scheduler.
      [Diagram: P processors, each with a cache ($), connected to memory, I/O, and a network.]

  12. Greedy Scheduling
      Idea: Do as much as possible on every step.
      Definition: A strand is ready if all its predecessors have executed.

  13. Greedy Scheduling
      Idea: Do as much as possible on every step.
      Definition: A strand is ready if all its predecessors have executed.
      Complete step (P = 3):
      ∙ ≥ P strands ready.
      ∙ Run any P.

  14. Greedy Scheduling
      Idea: Do as much as possible on every step.
      Definition: A strand is ready if all its predecessors have executed.
      Complete step (P = 3):
      ∙ ≥ P strands ready.
      ∙ Run any P.
      Incomplete step:
      ∙ < P strands ready.
      ∙ Run all of them.

  15. Analysis of Greedy
      Theorem: Any greedy scheduler achieves T_P ≤ T_1/P + T_∞.
      Proof (P = 3 in the figure):
      ∙ # complete steps ≤ T_1/P, since each complete step performs P work.
      ∙ # incomplete steps ≤ T_∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■
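      Continuing the worked example from slide 10 (invented numbers): with T_1 = 100, T_∞ = 10, and P = 4, a greedy scheduler guarantees T_4 ≤ 100/4 + 10 = 35, against the lower bound of 25 from the Work and Span Laws.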

  16. Optimality of Greedy
      Corollary: Any greedy scheduler achieves within a factor of 2 of optimal.
      Proof: Let T_P* be the execution time produced by the optimal scheduler. Since T_P* ≥ max{T_1/P, T_∞} by the Work and Span Laws, we have
          T_P ≤ T_1/P + T_∞ ≤ 2·max{T_1/P, T_∞} ≤ 2·T_P*. ■

  17. Linear Speedup
      Corollary: Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T_1/T_∞.
      Proof: Since P ≪ T_1/T_∞ is equivalent to T_∞ ≪ T_1/P, the Greedy Scheduling Theorem gives us
          T_P ≤ T_1/P + T_∞ ≈ T_1/P.
      Thus, the speedup is T_1/T_P ≈ P. ■
      Definition: The quantity T_1/(P·T_∞) is called the parallel slackness.

  18. Parallelism
      Because the Span Law dictates that T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
          T_1/T_∞ = parallelism = the average amount of work per step along the span.

  19. Great, how do we program it?
      ∙ Cilk++ is a faithful extension of C++
      ∙ Often uses divide-and-conquer
      ∙ Three (really two) hints to the compiler:
        § cilk_spawn: this function can run in parallel with the caller
        § cilk_sync: all spawned children must return before execution can continue
        § cilk_for: all iterations of this loop can run in parallel
        § The compiler translates cilk_for into cilk_spawn & cilk_sync under the covers
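      A minimal sketch of the spawn/sync style (the array-summing function and its serial cutoff are invented for illustration, not from the slides):

          // Divide-and-conquer sum of an array, illustrating cilk_spawn / cilk_sync.
          long sum(const int* a, int n) {
              if (n < 1000) {                       // small problem: run serially
                  long s = 0;
                  for (int i = 0; i < n; ++i) s += a[i];
                  return s;
              }
              long left = cilk_spawn sum(a, n/2);   // child may run in parallel with the caller
              long right = sum(a + n/2, n - n/2);   // the parent does the other half itself
              cilk_sync;                            // wait for the spawned child before using 'left'
              return left + right;
          }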

  20. Nested Parallelism
      Example: Quicksort

          template <typename T>
          void qsort(T begin, T end) {
              if (begin != end) {
                  T middle = partition(begin, end,
                      bind2nd(less<typename iterator_traits<T>::value_type>(), *begin));
                  cilk_spawn qsort(begin, middle);    // the named child function may execute
                                                      // in parallel with the parent caller
                  qsort(max(begin + 1, middle), end);
                  cilk_sync;                          // control cannot pass this point until
                                                      // all spawned children have returned
              }
          }
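      Note the design: only one of the two recursive calls is spawned; the parent executes the second half itself, so every strand does useful work. A hypothetical driver (not from the slides):

          std::vector<int> v(1000000);
          // ... fill v with data ...
          qsort(v.begin(), v.end());    // the template above, not std::qsort; sorts in place, possibly in parallel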

  21. Cilk++ Loops
      Example: Matrix transpose

          cilk_for (int i = 1; i < n; ++i) {
              cilk_for (int j = 0; j < i; ++j) {
                  B[i][j] = A[j][i];
              }
          }

      ∙ A cilk_for loop’s iterations execute in parallel.
      ∙ The index must be declared in the loop initializer.
      ∙ The end condition is evaluated exactly once, at the beginning of the loop.
      ∙ Loop increments should be a const value.
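      Slide 19 said the compiler turns cilk_for into cilk_spawn/cilk_sync under the covers; a plausible sketch of that lowering for the outer loop (the helper name, grain size, and the globals A, B, n are assumptions, not the actual generated code):

          const int GRAIN = 512;                    // assumed grain size
          void transpose_rows(int lo, int hi) {     // hypothetical compiler-generated helper
              if (hi - lo <= GRAIN) {               // small range: run the iterations serially
                  for (int i = lo; i < hi; ++i)
                      for (int j = 0; j < i; ++j)
                          B[i][j] = A[j][i];
              } else {
                  int mid = lo + (hi - lo) / 2;     // recursive bisection of the index range
                  cilk_spawn transpose_rows(lo, mid);
                  transpose_rows(mid, hi);
                  cilk_sync;
              }
          }
          // cilk_for (int i = 1; i < n; ++i) ...  effectively becomes  transpose_rows(1, n);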

  22. Serial Correctness
      The serialization is the code with the Cilk++ keywords replaced by null or C++ keywords.

      Cilk++ source:

          int fib (int n) {
              if (n<2) return (n);
              else {
                  int x,y;
                  x = cilk_spawn fib(n-1);
                  y = fib(n-2);
                  cilk_sync;
                  return (x+y);
              }
          }

      Serialization:

          int fib (int n) {
              if (n<2) return (n);
              else {
                  int x,y;
                  x = fib(n-1);
                  y = fib(n-2);
                  return (x+y);
              }
          }

      [Toolchain diagram: Cilk++ source → Cilk++ compiler → conventional compiler → linker → binary, linked against the Cilk++ runtime library; the serialization goes through a conventional compiler and conventional regression tests to yield reliable single-threaded code.]
      Serial correctness can be debugged and verified by running the multithreaded code on a single processor.

  23. Serialization
      How do we seamlessly switch between serial C++ and parallel Cilk++ programs? Add this to the beginning of your program:

          #ifdef CILKPAR
          #include <cilk.h>
          #else
          #define cilk_for for
          #define cilk_main main
          #define cilk_spawn
          #define cilk_sync
          #endif

      Compile!

          cilk++ -DCILKPAR -O2 -o parallel.exe main.cpp
          g++ -O2 -o serial.exe main.cpp
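      With these macros, one source file serves both builds; a hypothetical main.cpp:

          #include <cstdio>
          // ... the #ifdef CILKPAR block from above goes here ...

          int cilk_main(int argc, char* argv[]) {     // becomes plain main() in the serial build
              cilk_for (long i = 0; i < 100; ++i) {   // becomes a plain for loop when CILKPAR is off
                  // independent work per iteration
              }
              std::printf("done\n");
              return 0;
          }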

  24. Parallel Correctness

          int fib (int n) {
              if (n<2) return (n);
              else {
                  int x,y;
                  x = cilk_spawn fib(n-1);
                  y = fib(n-2);
                  cilk_sync;
                  return (x+y);
              }
          }

      [Toolchain diagram: Cilk++ source → Cilk++ compiler → conventional compiler → linker → binary → Cilkscreen race detector → parallel regression tests → reliable multithreaded code.]
      Parallel correctness can be debugged and verified with the Cilkscreen race detector, which guarantees to find inconsistencies with the serial code.

  25. Race Bugs
      Definition: A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

      Example:

          int x = 0;
          cilk_for (int i = 0; i < 2; ++i) {
              x++;                        // both iterations increment x in parallel
          }
          assert(x == 2);                 // may fail!

      [Dependency graph: x = 0, then the two x++ strands in parallel, then the assert.]
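      This particular race can be repaired with the “reducer” hyperobjects mentioned on slide 9. A sketch in the later Cilk Plus spelling of the API (the Cilk++ header and class names differed slightly):

          #include <cilk/reducer_opadd.h>
          #include <cassert>

          cilk::reducer_opadd<int> x(0);        // each strand updates a private view of x
          cilk_for (int i = 0; i < 2; ++i) {
              x += 1;                           // no race: the views are combined deterministically
          }
          assert(x.get_value() == 2);           // always succeeds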

  26. Race Bugs
      Definition: A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions performs a write.

      The same example with each x++ expanded into its load/increment/store strands:

          x = 0;
          // strand A:        // strand B (parallel to A):
          r1 = x;             r2 = x;
          r1++;               r2++;
          x = r1;             x = r2;
          assert(x == 2);

      If both strands read x before either writes it back, both store 1 and the assert fails.

  27. Types of Races
      Suppose that instruction A and instruction B both access a location x, and suppose that A ∥ B (A is parallel to B).

          A        B        Race Type
          read     read     none
          read     write    read race
          write    read     read race
          write    write    write race

      Two sections of code are independent if they have no determinacy races between them.
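      For instance, a read race in this classification (a sketch; the helper function is invented):

          #include <cstdio>

          int x = 0;
          void print_x() { std::printf("%d\n", x); }   // B: reads x

          void demo() {
              cilk_spawn print_x();   // B runs logically in parallel with what follows
              x = 1;                  // A: writes x  ->  read race: output may be 0 or 1
              cilk_sync;
          }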
