
Programming a multicore architecture without coherency and atomic operations
Jochem Rutgers, Marco Bekooij, Gerard Smit
2014-02-15

Parallel render example

One master thread:
    data = read_3d_model_from_file();
    go = 1;
    while (done != N) sleep();
    display_frame(frame);

N slave threads:
    while (!go) sleep();
    render_my_part_of_frame(data, frame);
    done++;

Parallel render Pthread example

One master thread:
    data = read_3d_model_from_file();
    pthread_barrier_wait();
    pthread_barrier_wait();
    display_frame(frame);

N slave threads:
    pthread_barrier_wait();
    render_my_part_of_frame(data, frame);
    pthread_barrier_wait();
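The slide abbreviates the Pthread calls. A minimal runnable sketch of the same two-barrier structure could look as follows. The names `Shared`, `SlaveArg`, `slave` and `render_frame` are hypothetical, the "model" is a single integer, and "rendering" is a multiplication; only the `pthread_*` calls are real POSIX API.

```cpp
#include <pthread.h>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the slide's data and frame.
struct Shared {
    pthread_barrier_t barrier;  // shared by the master and all slaves
    int data = 0;               // stands in for the 3D model
    std::vector<int> frame;     // one slot per slave
};

struct SlaveArg { Shared* shared; int index; };

// Slave thread: wait for the model, render one part, signal completion.
void* slave(void* p) {
    auto* arg = static_cast<SlaveArg*>(p);
    Shared& s = *arg->shared;
    pthread_barrier_wait(&s.barrier);                 // model is ready
    s.frame[arg->index] = s.data * (arg->index + 1);  // "render" my part
    pthread_barrier_wait(&s.barrier);                 // my part is done
    return nullptr;
}

// Master thread: returns the sum of all parts in place of display_frame().
int render_frame(int n_slaves) {
    Shared s;
    s.frame.assign(n_slaves, 0);
    pthread_barrier_init(&s.barrier, nullptr, n_slaves + 1);  // slaves + master
    std::vector<SlaveArg> args(n_slaves);
    std::vector<pthread_t> threads(n_slaves);
    for (int i = 0; i < n_slaves; ++i) {
        args[i] = {&s, i};
        pthread_create(&threads[i], nullptr, slave, &args[i]);
    }
    s.data = 10;                       // read_3d_model_from_file()
    pthread_barrier_wait(&s.barrier);  // release the slaves
    pthread_barrier_wait(&s.barrier);  // wait until every part is rendered
    int sum = 0;
    for (int v : s.frame) sum += v;    // display_frame(frame)
    for (pthread_t t : threads) pthread_join(t, nullptr);
    pthread_barrier_destroy(&s.barrier);
    return sum;
}
```

Each barrier is created for N + 1 participants, so the master takes part in both synchronization points. The barrier implementation itself still relies on the coherency and atomic operations that the rest of the talk tries to avoid.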

Programmers say... / hardware architects say...

Programmer: I want C, so I need sequential execution.
Architect: Can't do, use parallel execution instead.
Programmer: Hmm, then I need shared memory to use threads and pointers.
Architect: I'll give you distributed memory.
Programmer: Then at least supply hardware cache coherency.
Architect: Ok, but only with a weak memory model.
Programmer: I can't reason about state then, give me atomic operations.
Architect: Ok, but that's extremely expensive, so don't use them.

Programming a multicore architecture without coherency and atomic operations
... by starting from a functional language

Dependency-only description

Program definition:
    main h = cylinder 2 h
    cylinder r = * (* π (sqr r))
    sqr x = * x x

Evaluation sequence:
    main (* 3 3)
    cylinder 2 (* 3 3)
    * (* π (sqr 2)) (* 3 3)
    * (* π (* 2 2)) (* 3 3)
    * (* π (4)) (* 3 3)
    * (12.57...) (* 3 3)
    * (12.57...) 9
    113.10...

[Figure: the expression drawn as a graph of app nodes over main, cyl 2, * and 3]
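The program definition above can be transcribed almost literally into C++. This is a sketch, not the LambdaC++ encoding: the names `pi` and `main_term` are assumptions (`main` is reserved in C++), and `cylinder` returns a closure to mimic the partial application in `cylinder r = * (* π (sqr r))`.

```cpp
#include <cassert>
#include <cmath>

const double pi = std::acos(-1.0);

// sqr x = * x x
double sqr(double x) { return x * x; }

// cylinder r = * (* π (sqr r)); partially applied, it still expects the height
auto cylinder(double r) {
    return [r](double h) { return (pi * sqr(r)) * h; };
}

// main h = cylinder 2 h
double main_term(double h) { return cylinder(2.0)(h); }
```

Calling `main_term(3.0 * 3.0)` gives 36π, the 113.10... at the end of the evaluation sequence. C++ evaluates the argument `3.0 * 3.0` before the call, whereas the sequence on the slide reduces `(* 3 3)` late; since every term depends only on its arguments, both orders produce the same value.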

Dependency-only description
◮ Terms are constant
◮ Duplicates are identical
◮ No order in execution
◮ No memory/state
◮ No implicit behavior

... therefore ...
◮ Parallel description
◮ Shortcuts in synchronization
◮ Lossy work distribution
◮ Only atomic pointer writes
◮ atomic free

A λ-term's life
1. Memory allocation: private
2. Memory initialization (construction): r/w access, private
3. Add to expression: read-only, shared
4. Replace with result (indirect): pointer write, shared
5. Die: private
6. Garbage collect, free: private

From phases to rules to requirements
1. Memory allocation
2. Memory initialization (construction)
   Rule 1: construction must be completed → flush / fence
3. Add to expression
   Rule 2: pointer write is atomic, in total order → (flush)
   Rule 3: reads are in total order
4. Replace with result (indirect)
   (Rule 2 again)
5. Die
   Rule 4: all operations are completed → flush / fence
6. Garbage collect, free
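Phases 1 to 4 and Rules 1 and 2 can be sketched in portable C++ as follows. Everything here is an assumption made for illustration: the `Term` layout and the function names are hypothetical and not taken from LambdaC++, and C++ fences stand in for the slide's "flush / fence" because the language has no notion of a cache flush. `std::atomic` with relaxed ordering is used only to make the single-word pointer write well-defined; no read-modify-write operation (compare-and-swap, fetch-and-add) appears.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical term layout.
struct Term {
    double value = 0.0;                    // written only during construction
    std::atomic<Term*> indirect{nullptr};  // phase 4: pointer to the result
};

// Phases 1-3: allocate and construct privately, then add to the expression.
void publish(std::atomic<Term*>& slot, double v) {
    Term* t = new Term;                                   // 1: allocation, private
    t->value = v;                                         // 2: construction, private
    std::atomic_thread_fence(std::memory_order_release);  // Rule 1: construction completed
    slot.store(t, std::memory_order_relaxed);             // Rule 2: a single pointer write
}

// Phase 4: replace a term with its result through an indirection pointer.
void replace_with_result(Term* t, Term* result) {
    std::atomic_thread_fence(std::memory_order_release);   // result is fully constructed
    t->indirect.store(result, std::memory_order_relaxed);  // Rule 2 again
}

// Reader: follow indirections until a term without a result is reached.
const Term* resolve(const std::atomic<Term*>& slot) {
    const Term* t = slot.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    while (t) {
        const Term* next = t->indirect.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (!next) break;
        t = next;
    }
    return t;
}
```

Since terms are constant once shared, the only shared write is the pointer store. If two workers compute the same result and both call `replace_with_result`, either pointer is acceptable because duplicates are identical. Phases 5 and 6 (die, garbage collect) are left out of this sketch.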

λ-calculus in C++
◮ λ-terms implemented as C++ templates/classes
◮ gcc; ()-operator overloading gives FP-like syntax
◮ data type: (complex) doubles, large integers (GNU MP)
◮ one worker thread per core
◮ Haskell-like par and pseq
◮ local vs. global data and garbage collection
◮ mark–sweep GC (global GC is stop-the-world)
◮ ≈ 400 instructions in run-time per created λ-term
◮ ≈ 5500 LoC
◮ GPLv3
◮ https://sites.google.com/site/jochemrutgers/lambdacpp

[Figure: platform with cores 0, 1, 2, ..., i, ..., 31, each with its own I$ and D$; in-order NoC; DDR]

[Figure (b) parfib: speedup (0 to 24) versus number of processors (0 to 32). Series: linear speedup; ghc (x86); ghc (x86, hyperthreaded); LambdaC++ (x86); LambdaC++ (x86, hyperthreaded); LambdaC++ (MicroBlaze); LambdaC++ (x86), no mem bottleneck]

[Figure: time spent (fraction of execution time, 0 to 1) per benchmark. Categories: global GC; local GC; stalling on black hole; idle; running β-reduction]

Highlights
◮ Accept the hardware trends
◮ Another programming model might be more suitable
◮ Extreme example: FP is hardware-friendly...
◮ ... cache coherency and atomics are avoided

Thanks!
Jochem Rutgers
j.h.rutgers@utwente.nl
