Cilk for High Productivity Computing

Bradley C. Kuszmaul
Supercomputing Technologies Research Group, MIT CSAIL
November 14, 2006
Cilk for High Productivity Computing, SC|06

Cilk
A C language for dynamic multithreading with a provably good runtime system.
Platforms:
• AMD Opteron
• Sun UltraSparc
• SGI Altix
• Intel Pentium
Applications:
• virus shell assembly
• graphics rendering
• n-body simulation
• ★Socrates and Cilkchess
Cilk automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.

Example: Vector Addition
C:
    void vadd (real *A, real *B, int L, int H){
      int i; for (i=L; i<H; i++) A[i]+=B[i];
    }

Example: Vector Addition
Cilk:
    cilk void vadd (real *A, real *B, int L, int H){
      if (L+BASE>H) {
        int i; for (i=L; i<H; i++) A[i]+=B[i];
      } else {
        spawn vadd (A, B, L, (L+H)/2);
        spawn vadd (A, B, (L+H)/2, H);
        sync;
      }
    }
To expose parallelism, convert loops to recursion.
Side benefit: Divide-and-conquer is good for caches!
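A rough sketch of how much parallelism the recursion exposes. This is the usual divide-and-conquer analysis and is not on the slide. It assumes n = H - L elements, a constant grain size b = BASE of at least 2, and that each leaf runs the serial loop over fewer than b elements. T_1 denotes the total work and T_\infty the length of the longest chain of dependent steps (the span). The recurrences apply for n >= b.

```latex
\begin{align*}
\text{Work:}\quad & T_1(n) = 2\,T_1(n/2) + \Theta(1)
  && \Longrightarrow\quad T_1(n) = \Theta(n) \\
\text{Span:}\quad & T_\infty(n) = T_\infty(n/2) + \Theta(1)
  && \Longrightarrow\quad T_\infty(n) = \Theta\bigl(b + \log(n/b)\bigr) \\
\text{Parallelism:}\quad & \frac{T_1(n)}{T_\infty(n)}
  = \Theta\!\left(\frac{n}{b + \log(n/b)}\right)
\end{align*}
```

Under these assumptions the work stays linear, as in the original loop, while the span grows only logarithmically in n, so a larger BASE trades parallelism for lower spawn overhead.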

Example: Vector Addition
Cilk is a faithful extension of C.
A Cilk program’s serial elision (the C program obtained by deleting the Cilk keywords, here cilk, spawn, and sync) is always a legal implementation of Cilk semantics.
Cilk provides no new data types.
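One way to see the serial elision in practice is to define the Cilk keywords as empty macros, so that an ordinary C compiler accepts the Cilk source unchanged. The sketch below does this for vadd. The value of BASE, the definition of real as double, and the helpers vadd_loop and vadd_agrees are assumptions made for illustration; none of them appear on the slide.

```c
#include <assert.h>

/* Serial elision: with the Cilk keywords defined away, the Cilk source
   below is plain C and runs as an ordinary recursive function. */
#define cilk
#define spawn
#define sync

#define BASE 4        /* hypothetical grain size; the slide gives no value */
typedef double real;  /* assumption: the slide leaves `real` undefined */

/* Cilk version from the slide, compiled here as its serial elision. */
cilk void vadd(real *A, real *B, int L, int H) {
    if (L + BASE > H) {
        int i;
        for (i = L; i < H; i++) A[i] += B[i];
    } else {
        spawn vadd(A, B, L, (L + H) / 2);
        spawn vadd(A, B, (L + H) / 2, H);
        sync;
    }
}

/* Original C loop from the slide, kept for comparison. */
void vadd_loop(real *A, real *B, int L, int H) {
    int i;
    for (i = L; i < H; i++) A[i] += B[i];
}

/* Hypothetical self-check: returns 1 if the elided Cilk version and the
   plain loop produce identical results on n elements (0 <= n <= 64). */
int vadd_agrees(int n) {
    real a1[64], a2[64], b[64];
    int i;
    assert(n >= 0 && n <= 64);
    for (i = 0; i < n; i++) {
        a1[i] = a2[i] = (real)i;
        b[i] = 2.0 * (real)i;
    }
    vadd(a1, b, 0, n);
    vadd_loop(a2, b, 0, n);
    for (i = 0; i < n; i++) {
        if (a1[i] != a2[i]) return 0;
    }
    return 1;
}
```

Calling vadd_agrees(37), for example, should return 1, because the recursion only changes the order in which disjoint index ranges are visited, not what is added to each element.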


Cilk Productivity
I implemented all 6 HPC Challenge benchmarks.
Distance to Desktop: number of Cilk keywords added to the serial program.
  Benchmark      SLOC* (Cilk)   SLOC* (MPI)   Distance to Desktop
  STREAM                   58           658                    11
  PTRANS                   81          2261                    13
  RandomAccess            123          1883                    18
  HPL                     348         15608                    41
  DGEMM                    97           ??†                    19
  FFTE                    230          1747                    35
* “Source lines of code” omits comments and blank lines, but includes .h files (official count does not).
† MPI DGEMM uses the HPL parallel matrix multiplication. The framework is 184 SLOC.

Performance
(η in percent)
    P |  HPL          |  DGEMM        |  STREAM     |  PTRANS     |  FFTE
      |  Gflop/s   η  |  Gflop/s   η  |  GB/s    η  |  GB/s    η  |  Gflop/s   η
    1 |     5.2       |     5.1       |   0.8       |   0.7       |     0.7
    2 |     9.4   89  |     9.7   96  |   0.9   56  |   0.5   36  |     0.9   67
    4 |    17.3   85  |    19.7   97  |   1.8   57  |   0.9   33  |     1.8   68
    8 |    30.8   73  |    35.7   88  |   2.9   46  |   1.7   30  |     2.9   55
   16 |    52.5   63  |    64.9   80  |   4.0   32  |   3.3   29  |     4.0   38
   32 |    88.6   52  |   118.9   73  |   6.8   27  |   6.1   27  |     6.8   32
   64 |   101.6   30  |   248.0   76  |  14.0   28  |  11.6   26  |    14.0   33
  128 |               |   463.1   71  |  25.0   25  |  18.3   20  |    25.9   31
  256 |               |   943.0   73  |  44.2   22  |  27.2   15  |    49.5   29
  384 |               |  1195.9   61  |             |             |
  384 (column not recoverable from the source): 54.1, η 11
What is limiting the speedup? The language or the hardware?
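The slide does not define η, but the tabulated values are consistent with parallel efficiency: the rate on P processors divided by P times the single-processor rate, expressed as a percentage. The sketch below encodes that reading. The function name efficiency_pct is hypothetical, and the formula is an inference from the numbers rather than something the slide states.

```c
#include <assert.h>

/* Parallel efficiency in percent, assuming eta = 100 * R_P / (P * R_1),
   where R_P is the measured rate (Gflop/s or GB/s) on P processors and
   R_1 is the rate on one processor. This definition is inferred from the
   table, not given on the slide. */
double efficiency_pct(double rate_p, double rate_1, int p) {
    assert(p > 0);
    assert(rate_1 > 0.0);
    return 100.0 * rate_p / ((double)p * rate_1);
}
```

For DGEMM at P = 4, efficiency_pct(19.7, 5.1, 4) gives about 96.6, against a tabulated 97. For HPL at P = 2, efficiency_pct(9.4, 5.2, 2) gives about 90.4, against a tabulated 89. The small differences presumably come from the rates being rounded to one decimal place.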

Performance vs. MPI
Cilk beats the best reported Altix numbers for PTRANS and FFTE.
  Run (P)   |  HPL           |  PTRANS       |  RandomAccess  |  FFTE
            |  Gflop/s    η  |  GB/s     η   |  GUPS          |  Gflop/s   η
  Cilk32    |    88.6   52%  |   6.1    27%  |  0.15          |    6.8   32%
  MPI32     |   129.2   77%  |   2.6    11%  |  0.004         |    4.1   19%
  Cilk/MPI  |    0.68        |   2.35        |  37.5          |    1.65
  Cilk128   |                |  18.3    20%  |  0.11          |   25.9   31%
  MPI128    |   638.9   95%  |   7.5     8%  |  0.11          |   14.1   17%
  Cilk/MPI  |    ?           |   2.43        |  0.96          |    1.84
MPI performance taken from HPC web site for Altix 3700.

Conclusion
• Cilk is simple, faithfully extending the legacy C language with only a handful of new keywords.
  ◦ Cilk contains no new data types.
• Cilk encourages recursive programming.
  ◦ Divide-and-conquer exploits data locality for caches.
• Cilk scales down to run on one processor with nearly the efficiency of C.
  ◦ Fast C code ⇔ fast Cilk code.
• Cilk scales up provably well, guaranteeing near-perfect linear speedup, assuming that
  ◦ sufficient parallelism exists in the application, and
  ◦ the platform has adequate communication bandwidth.
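The "scales up provably well" claim can be read against the bound usually quoted for Cilk's randomized work-stealing scheduler. The bound comes from the Cilk literature rather than from this slide, and it holds in expectation. Here T_1 is the work (running time on one processor), T_\infty is the span (running time on unboundedly many processors), and T_P is the running time on P processors.

```latex
\[
  T_P \;\le\; \frac{T_1}{P} + O(T_\infty)
  \qquad\Longrightarrow\qquad
  \frac{T_1}{T_P} \approx P
  \quad \text{when } \frac{T_1}{T_\infty} \gg P .
\]
```

In this reading, "sufficient parallelism" means that the ratio T_1 / T_\infty is much larger than P, so the T_1 / P term dominates the bound.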

Cost of Programming
• Commodity codes are amortized over 10^4 to 10^6 more users than custom codes.
• Today’s custom scalable codes employ arcane programming models usable only by experts.
• Our research is focused on reinventing scalable computing as a seamless extension of commodity serial computing.

Current Research
• JCilk, a Java-based multithreaded language, fuses dynamic and persistent multithreading.
• Adaptive thread and job scheduling guarantees fair and efficient resource sharing.
• Transactional memory simplifies thread synchronization and improves performance compared with locking, especially for multicore processors.
• Cilk-DXM integrates Cilk with distributed transactional memory for clusters.
• Parallel data-race detectors can guarantee to find synchronization bugs efficiently.
• Cache-oblivious algorithms offer high performance for streaming file I/O through passive self-tuning.

World Wide Web
Cilk source code, programming examples, documentation, technical papers, tutorials, and up-to-date information can be found at:
http://supertech.csail.mit.edu/cilk

HPC Challenge (Class 2)
Most productivity: Most “elegant” implementation of two or more of seven parallel benchmarks:
• STREAM: vector addition & scaling
• PTRANS: matrix transpose
• RandomAccess: eponymous
• HPL: PLU decomposition
• DGEMM: matrix multiplication
• FFTE: fast Fourier transform
• b_eff: bandwidth and efficiency

Acknowledgments
Many thanks to the MIT Department of Earth, Atmospheric, and Planetary Sciences and to NASA for their donations of machine time to run these benchmarks.
Keith Randall helped implement HPL in Cilk.
