CS 5220: Introduction


  1. CS 5220: Introduction David Bindel 2017-08-22

  2. CS 5220: Applications of Parallel Computers http://www.cs.cornell.edu/courses/cs5220/2017fa/ Time: TR 8:40–9:55 Location: Gates G01 Instructor: David Bindel ( bindel@cs ) TA: Eric Hans Lee ( erichanslee@cs )

  3. Enrollment http://www.cs.cornell.edu/courseinfo/enrollment • Many CS classes (including 5220) limit pre-enrollment to ensure majors and MEng students can get in. • We almost surely will have enough space for all comers. • Enroll if you want access to class resources. • Enrolling as an auditor is OK. • If you will not take the class, please formally drop!

  4. The Computational Science & Engineering Picture [Diagram connecting Application, Analysis, and Computation]

  5. Applications Everywhere! These tools are used in more places than you might think: • Climate modeling • CAD tools (computers, buildings, airplanes, ...) • Control systems • Computational biology • Computational finance • Machine learning and statistical models • Game physics and movie special effects • Medical imaging • Information retrieval • ... Parallel computing shows up in all of these.

  6. Why Parallel Computing? • Scientific computing went parallel long ago • Want an answer that is right enough, fast enough • Either of those might imply a lot of work! • ... and we like to ask for more as machines get bigger • ... and we have a lot of data, too • Today: Hard to get a non-parallel computer! • Totient nodes (2015): 12-core compute nodes • Totient accelerators (2015): 60-core Xeon Phi 5110P • My laptop (late 2013): dual-core i5 + built-in graphics • Cluster access ≈ internet connection + credit card

  7. Lecture Plan Roughly three parts: 1. Basics: architecture, parallel concepts, locality and parallelism in scientific codes 2. Technology: OpenMP, MPI, CUDA/OpenCL, cloud systems, compilers and tools 3. Patterns: Monte Carlo, dense and sparse linear algebra and PDEs, graph partitioning and load balancing, fast multipole, fast transforms

  8. Objectives • Reason about code performance • Many factors: HW, SW, algorithms • Want simple “good enough” models • Learn about high-performance computing (HPC) • Learn parallel concepts and vocabulary • Experience parallel platforms (HW and SW) • Read/judge HPC literature • Apply model numerical HPC patterns • Tune existing codes for modern HW • Apply good software practices

  9. Prerequisites Basic logistical constraints: • Default class codes will be in C • Our focus is numerical codes. Fine if you’re not a numerical C hacker! • I want a diverse class • Most students have some holes • Come see us if you have concerns

  10. Coursework: Lecture (10%) • Lecture = theory + practical demos • 60 minutes lecture • 15 minutes mini-practicum • Bring questions for both! • Notes posted in advance • May be prep work for mini-practicum • Course evaluations are also required!

  11. Coursework: Homework (15%) • Five individual assignments plus “HW0” • Intent: Get everyone up to speed • Assigned Tues, due one week later

  12. Coursework: Small group assignments (45%) • Three projects done in groups of 1–3 • Analyze, tune, and parallelize a baseline code • Scope is 2–3 weeks

  13. Coursework: Final project (30%) • Groups are encouraged! • Bring your own topic or we will suggest • Flexible, but must involve performance • Main part of work in November–December

  14. Homework 0 • Posted on the class web page. • Complete and submit via CMS by 8/29.

  15. Questions?

  16. How Fast Can We Go? Speed records for the Linpack benchmark: http://www.top500.org Speed measured in flop/s (floating point ops / second): • Giga (10⁹) – a single core • Tera (10¹²) – a big machine • Peta (10¹⁵) – current top 10 machines (5 in US) • Exa (10¹⁸) – favorite of funding agencies

  17. Current Record: China’s Sunway TaihuLight • 93 petaflop/s (125 petaflop/s peak) • 15 MW (Linpack) – relatively energy efficient • Does not include custom chilled-water cooling unit • Based on SW26010 manycore RISC processors • Management processing element (MPE) = 64-bit RISC core • Computing processing element (CPE) cluster = 8 × 8 core mesh • Custom interconnect • Sunway Raise OS (Linux) • Custom compilers (Sunway OpenACC)

  18. Performance on TaihuLight (Dongarra, June 2016) • Theoretical peak: 125.4 petaflop/s • Linpack: 93 petaflop/s (74% of peak) • Three SC16 Gordon Bell finalists • Explicit PDE solves: 30–40 petaflop/s (25–30%) • Implicit solver: 1.5 petaflop/s (1%) • Numbers taken from June 2016, may have improved • Even with improvements: peak is not indicative!

  19. Second: Tianhe-2 (33.9 petaflop/s Linpack) Commodity nodes, custom interconnect: • Nodes consist of Xeon E5-2692 + Xeon Phi accelerators • Custom TH Express-2 interconnect • Intel compilers + Intel math kernel libraries • MPICH2 MPI with customized channel • Kylin Linux

  20. Alternate Benchmark: Graph 500 Graph processing benchmark (data-intensive) • Metric: traversed edges per second (TEPS) • K computer (Japan) tops the list (38.6 teraTEPS) • Sunway TaihuLight is second (23.8 teraTEPS) • Tianhe-2 is at #8 (2.1 teraTEPS)

  21. Punchline • Some high-end machines look like high-end clusters (except custom networks) • Achievable performance is ≪ peak performance • Achievable performance is application-dependent • Hard to achieve peak on more modest platforms, too!

  22. Parallel Performance in Practice So how fast can I make my computation? • Measuring performance of real applications is hard • Even figure of merit may be unclear (flops, TEPS, ...?) • Typically a few bottlenecks slow things down • And figuring out why they slow down can be tricky! • And we really care about time-to-solution • Sophisticated methods get answer in fewer flops • ... but may look bad in benchmarks (lower flop rates!) • In general: Peak > Linpack > Gordon Bell > Typical See also David Bailey’s comments: • Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers (1991) • Twelve Ways to Fool the Masses: Fast Forward to 2011 (2011)

  23. Quantifying Parallel Performance • Starting point: good serial performance • Strong scaling: compare parallel to serial time on the same problem instance as a function of the number of processors (p) • Speedup = serial time / parallel time • Efficiency = speedup / p • Ideally, speedup = p. Usually, speedup < p. • Barriers to perfect speedup: serial work (Amdahl’s law) and parallel overheads (communication, synchronization)
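
As a concrete illustration of these two definitions, here is a minimal C sketch that turns measured wall-clock times into speedup and efficiency; the timings and the function name are made up for illustration, not taken from the course code.

```c
#include <stdio.h>

/* Strong-scaling metrics from measured wall-clock times. */
static void report_scaling(double t_serial, double t_parallel, int p)
{
    double speedup    = t_serial / t_parallel;  /* ideally equals p   */
    double efficiency = speedup / p;            /* ideally equals 1.0 */
    printf("p = %3d: speedup = %5.2f, efficiency = %5.1f%%\n",
           p, speedup, 100.0 * efficiency);
}

int main(void)
{
    /* Hypothetical timings: a 10 s serial run vs. 2 s on 8 processors. */
    report_scaling(10.0, 2.0, 8);   /* prints speedup 5.00, efficiency 62.5% */
    return 0;
}
```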

  24. Amdahl’s Law Parallel scaling study where some serial code remains: t_p ≥ s·t_s + (1 − s)·t_s/p, where p = number of processors, s = fraction of work that is serial, t_s = serial time, t_p = parallel time. Amdahl’s law: speedup = t_s/t_p ≤ 1/(s + (1 − s)/p) < 1/s. So 1% serial work ⇒ max speedup < 100×, regardless of p.
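
To see the saturation numerically, the short sketch below (illustrative only, not course code) tabulates the Amdahl bound 1/(s + (1 − s)/p) for s = 0.01; the printed speedups climb toward, but never reach, 1/s = 100.

```c
#include <stdio.h>

int main(void)
{
    const double s = 0.01;   /* fraction of the work that is serial */
    for (int p = 1; p <= 1 << 20; p *= 16) {
        double bound = 1.0 / (s + (1.0 - s) / p);
        printf("p = %7d: max speedup = %6.2f\n", p, bound);
    }
    return 0;
}
```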

  25. A Little Experiment Let’s try a simple parallel attendance count: • Parallel computation: Rightmost person in each row counts number in row. • Synchronization: Raise your hand when you have a count • Communication: When all hands are raised, each row representative adds their count to a tally and says the sum (going front to back). (Somebody please time this.)
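
In code, the same pattern (independent per-row counts, a synchronization point, then a combine step) might look like the OpenMP reduction sketch below; the seating array and its dimensions are invented purely for illustration.

```c
#include <stdio.h>

#define NROWS 10
#define SEATS 12

int main(void)
{
    /* present[r][c] = 1 if the seat is occupied (dummy data here). */
    int present[NROWS][SEATS] = {{0}};
    present[0][0] = present[3][5] = present[7][2] = 1;

    int total = 0;
    /* Each thread counts some rows; the reduction plays the role of the
       row representatives adding their tallies into one sum. */
    #pragma omp parallel for reduction(+:total)
    for (int r = 0; r < NROWS; r++) {
        int row_count = 0;           /* the "rightmost person" counts the row */
        for (int c = 0; c < SEATS; c++)
            row_count += present[r][c];
        total += row_count;          /* combined by the reduction */
    }
    printf("Attendance: %d\n", total);
    return 0;
}
```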

  26. A Toy Analysis Parameters: n = number of students, r = number of rows, t_c = time to count one student, t_t = time to say tally. Model: t_s ≈ n·t_c and t_p ≈ n·t_c/r + r·t_t. How much could I possibly speed up?
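
For reference, here is the short optimization that the next two slides lean on, derived from the model above (my derivation, consistent with the slide’s formulas rather than quoted from it):

```latex
% Minimize t_p(r) = n t_c / r + r t_t over the number of rows r:
%   d t_p / d r = - n t_c / r^2 + t_t = 0   =>   r^* = sqrt(n t_c / t_t)
\[
  t_p(r^*) = 2\sqrt{n\,t_c\,t_t},
  \qquad
  \mathrm{speedup} \le \frac{t_s}{t_p(r^*)}
      = \frac{n t_c}{2\sqrt{n\,t_c\,t_t}}
      = \frac{1}{2}\sqrt{\frac{n t_c}{t_t}} .
\]
```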

  27. Modeling Speedup [Plot: predicted speedup as a function of the number of rows (0–12). Parameters: n = 80, t_c = 0.3, t_t = 1.]
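
Plugging the plot’s parameters into that bound as a sanity check (my arithmetic, not from the slide):

```latex
% With n = 80, t_c = 0.3, t_t = 1:
\[
  r^* = \sqrt{\frac{n t_c}{t_t}} = \sqrt{24} \approx 4.9,
  \qquad
  \mathrm{speedup}_{\max} = \frac{1}{2}\sqrt{24} \approx 2.45 .
\]
```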

  28. Modeling Speedup The bound speedup < (1/2)·√(n·t_c/t_t) is usually tight. Poor speed-up occurs because: • The problem size n is small • The communication cost is relatively large • The serial computation cost is relatively large Some of the usual suspects for parallel performance problems! Things would look better if I allowed both n and r to grow; that would be a weak scaling study.

  29. Summary: Thinking about Parallel Performance Today: • We’re approaching machines with peak exaflop rates • But codes rarely get peak performance • Better comparison: tuned serial performance • Common measures: speedup and efficiency • Strong scaling: study speedup with increasing p • Weak scaling: increase both p and n • Serial overheads and communication costs kill speedup • Simple analytical models help us understand scaling

  30. And in case you arrived late http://www.cs.cornell.edu/courses/cs5220/2017fa/ ... and please enroll and submit HW0!
