

1. Time to Start over? Software for Exascale
   William Gropp
   www.cs.illinois.edu/~wgropp

2. Why Is Exascale Different?
   • Extreme power constraints, leading to
     ♦ Clock rates similar to today’s systems
     ♦ A wide diversity of simple computing elements (simple for hardware but complex for algorithms and software)
     ♦ Memory per core and per FLOP will be much smaller
     ♦ Moving data anywhere will be expensive (time and power)
   • Faults that will need to be detected and managed
     ♦ Some detection may be the job of the programmer, as hardware detection takes power

3. Why Is Exascale Different?
   • Extreme scalability and performance irregularity
     ♦ Performance will require enormous concurrency (10^8 – 10^9)
     ♦ Performance is likely to be variable
       • Simple, static decompositions will not scale
   • A need for latency tolerant algorithms and programming
     ♦ Memory, processors will be 100s to 10000s of cycles away. Waiting for operations to complete will cripple performance
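A minimal sketch of what latency-tolerant programming can look like in practice, using MPI nonblocking point-to-point calls (an illustration, not the talk's code): post the exchange early, overlap it with work that needs no remote data, and wait only at the end. The halo-exchange setting, the buffer names, and the compute_interior/compute_boundary routines are assumptions made for the example.

```c
#include <mpi.h>

/* Assumed user routines: work that does not need the halo, and work that does. */
void compute_interior(void);
void compute_boundary(void);

void exchange_and_compute(double *send_buf, double *recv_buf, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the halo exchange early ... */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... and overlap it with work that needs no remote data */
    compute_interior();

    /* Pay only for whatever latency is left */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    compute_boundary();   /* uses recv_buf */
}
```

The point is not the specific calls but the structure: the program never sits idle waiting for an operation that was started long before its result is needed.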

4. Why is Everyone Worried?
   • Exascale makes all problems extreme:
     ♦ Power, data motion costs, performance irregularities, faults, extreme degree of parallelism, specialized functional units
   • Added to each of these is
     ♦ Complexity resulting from all of the above
   • These issues are not new, but may be impossible to ignore
     ♦ The “free ride” from rapid improvement in hardware performance is ending, if not already over

5. That “Kink” in #500 is Real
   [Chart: HPL Perf (TF) and fitted trend for the #500 Top500 system over time, log scale from 0.0001 to 1000 TF]
   • Extrapolation of recent data gives ~1PF HPL in 2018 on the #500 system
   • Extrapolation of older data gives ~1PF in 2015, ~7PF in 2018
   • #500 may be a better predictor of trends

6. Current Petascale Systems Already Complex
   • Typical processor
     ♦ 8 floating point units, 16 integer units
   • What is a “core”?
     ♦ Full FP performance requires use of short vector instructions
   • Memory
     ♦ Performance depends on location, access pattern
     ♦ “Saturates” on multicore chip
   • Specialized processing elements
     ♦ E.g., NVIDIA GPU (K20X); 2688 “cores” (or 56…)
   • Network
     ♦ 3- or 5-D Torus; latency, bandwidth, contention important
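One common way to reach those short vector instructions is to write loops the compiler can vectorize and to ask for it explicitly. The sketch below (an illustration, not from the slides) uses the OpenMP simd pragma and restrict on a simple axpy loop; the function name is an assumption.

```c
/* Sketch: an axpy loop written so the compiler can emit short-vector
 * (SIMD) instructions; restrict rules out aliasing, the pragma asks
 * for vectorization explicitly. */
void axpy(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With gcc, for example, -O3 plus -fopenmp or -fopenmp-simd enables the pragma. Without vectorization, a loop like this leaves most of a core’s floating-point capability idle, which is the slide’s point about “full FP performance”.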

7. Blue Waters: NSF’s Most Powerful System
   • 4,224 XK7 nodes and 22,640 XE6 nodes
     ♦ ~1/7 GPU+CPU, 6/7 CPU+CPU
     ♦ Peak perf >13PF: ~1/3 GPU+CPU, 2/3 CPU+CPU
   • 1.5 PB memory, >1 TB/sec I/O bandwidth
   • System sustains >1 PetaFLOPS on a wide range of applications
     ♦ From starting to read input from disk to results written to disk, not just computational kernels
     ♦ No Top500 run – does not represent application workload

8. How Do We Program These Systems?
   • There are many claims about how we can and cannot program extreme scale systems
     ♦ Confusion is rampant
     ♦ Incorrect statements and conclusions common
     ♦ Often reflects “I don’t want to do it that way” instead of “there’s a good reason why it can’t be done that way”
   • General impression
     ♦ The programming model influences the solutions used by programmers and algorithm developers
     ♦ In linguistics, this is the Sapir-Whorf or Whorfian hypothesis
   • We need to understand our terms first

9. How Should We Think About Parallel Programming?
   • Need a more formal way to think about programming
     ♦ Must be based on the realities of real systems
     ♦ Not the system that we wish we could build (see PRAM)
   • Not talking about a programming model
     ♦ Rather, first need to think about what an extreme scale parallel system can do
     ♦ System – the hardware and the software together

10. Separate the Programming Model from the Execution Model
   • What is an execution model?
     ♦ It’s how you think about how you can use a parallel computer to solve a problem
   • Why talk about this?
     ♦ The execution model can influence what solutions you consider (the Whorfian hypothesis)
     ♦ After decades where many computer scientists only worked with one execution model, we are now seeing new models and their impact on programming and algorithms

11. Examples of Execution Models
   • Von Neumann machine:
     ♦ Program counter
     ♦ Arithmetic Logic Unit
     ♦ Addressable memory
   • Classic vector machine:
     ♦ Add “vectors” – apply the same operation to a group of data with a single instruction
       • Arbitrary length (CDC Star 100), 64 words (Cray), 2 words (SSE)
   • GPUs with collections of threads (Warps)
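To make the “2 words (SSE)” case concrete, here is a small illustrative sketch (not from the slides) using SSE2 intrinsics: one instruction applies the same add to two packed doubles. The function name and the use of unaligned loads are assumptions for the example.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sketch: the two-word vector execution model. A single _mm_add_pd
 * applies the same addition to a pair of doubles at once. */
void add2(const double *a, const double *b, double *c)
{
    __m128d va = _mm_loadu_pd(a);           /* loads a[0], a[1]   */
    __m128d vb = _mm_loadu_pd(b);           /* loads b[0], b[1]   */
    _mm_storeu_pd(c, _mm_add_pd(va, vb));   /* c[i] = a[i] + b[i] */
}
```

The classic vector machines differ only in the length of the group: 64 words on a Cray, arbitrary length on the CDC Star 100.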

12. Programming Models and Systems
   • In the past, often a tight connection between the execution model and the programming approach
     ♦ Fortran: FORmula TRANslation to the von Neumann machine
     ♦ C: e.g., the “register” keyword and ++ operator match PDP-11 capabilities and needs
   • Over time, execution models and reality changed, but programming models rarely reflected those changes
     ♦ Rely on the compiler to “hide” those changes from the user – e.g., auto-vectorization for SSE(n)
   • Consequence: mismatch between users’ expectations and system abilities
     ♦ Can’t fully exploit the system because the user’s mental model of execution does not match the real hardware
     ♦ Decades of compiler research have shown this problem is extremely hard – can’t expect the system to do everything for you

13. Programming Models and Systems
   • Programming Model: an abstraction of a way to write a program
     ♦ Many levels
       • Procedural or imperative?
       • Single address space with threads?
       • Vectors as basic units of programming?
     ♦ Programming model often expressed with pseudo code
   • Programming System (my terminology):
     ♦ An API that implements parts or all of one or more programming models, enabling the precise specification of a program

14. Why the Distinction?
   • In parallel computing,
     ♦ Message passing is a programming model
       • Abstraction: a program consists of processes that communicate by sending messages. See “Communicating Sequential Processes”, CACM 21#8, 1978, by C.A.R. Hoare.
     ♦ The Message Passing Interface (MPI) is a programming system
       • Implements message passing and other parallel programming models, including:
       • Bulk Synchronous Programming
       • One-sided communication
       • Shared memory (between processes)
     ♦ CUDA/OpenACC/OpenCL are systems implementing a “GPU Programming Model”
       • Execution model involves teams, threads, synchronization primitives, different types of memory and operations
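As an illustration of MPI acting as a programming system for more than message passing, the hedged sketch below uses MPI-3 one-sided communication (RMA): rank 0 writes directly into rank 1’s exposed memory with no matching receive. It assumes at least two ranks; the buffer names and the fence-based synchronization are illustrative choices, not the only way to do this.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 0.0, value = 42.0;   /* illustrative payload */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one double through a window */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)   /* rank 0 writes into rank 1's memory */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* after this, rank 1's 'local' holds 42.0 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The same library also provides shared-memory windows between processes on a node (MPI_Win_allocate_shared), which is the “shared memory (between processes)” model listed above.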

15. The Devil Is in the Details
   • There is no unique execution model
     ♦ What level of detail do you need to design and implement your program?
       • Don’t forget – you decided to use parallelism because you could not get the performance you need without it
   • Getting what you need already?
     ♦ Great! It ain’t broke
   • But if you need more performance of any type (scalability, total time to solution, user productivity)
     ♦ Rethink your model of computation and the programming models and systems that you use

16. Rethinking Parallel Computing
   • Changing the execution model
     ♦ No assumption of performance regularity – but not unpredictable, just imprecise
       • Predictable within limits and most of the time
     ♦ Any synchronization cost amplifies irregularity – don’t include synchronizing communication as a desirable operation
     ♦ Memory operations are always costly, so moving the operation to the data may be more efficient
       • Some hardware designs provide direct support for this, not just software emulation
     ♦ Important to represent key hardware operations, which go beyond a simple single Arithmetic Logic Unit (ALU)
       • Remote update (RDMA)
       • Remote atomic operation (compare and swap)
       • Execute short code sequence (active messages, parcels)
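For instance, MPI-3 already exposes a remote atomic compare-and-swap, one way such hardware operations surface in a programming system. The sketch below is illustrative, not from the slides: it tries to flip an integer flag on a target rank from 0 to 1, e.g. to claim a work item, assuming a window over that flag was created elsewhere; the function name and flag layout are assumptions.

```c
#include <mpi.h>

/* Sketch: remote atomic compare-and-swap. Returns nonzero if this
 * process changed the target's flag from 0 to 1 (i.e., won the claim). */
int try_claim(MPI_Win win, int target)
{
    int expected = 0, desired = 1, previous;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Compare_and_swap(&desired, &expected, &previous,
                         MPI_INT, target, 0, win);
    MPI_Win_unlock(target, win);

    return previous == 0;
}
```

No code runs on the target process to service this; on networks with hardware atomics the operation can complete entirely in the NIC, which is the “direct support” the slide refers to.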

17. How Does This Change The Way You Should Look At Parallel Programming?
   • More dynamic. Plan for performance irregularity
     ♦ But still exploit as much regularity as possible to minimize the overhead of being dynamic
   • Recognize communication takes time, which is not precisely predictable
     ♦ Communication between cache and memory or between two nodes in a parallel system
     ♦ Contention in the system is hard to avoid
   • Think about the execution model
     ♦ Your abstraction of how a parallel machine works
     ♦ Include the hardware-supported features that you need for performance
   • Finally, use a programming system that lets you express the elements you need from the execution model

18. Challenges for Programming Models
   • Parallel programming models need to provide ways to coordinate resource allocation
     ♦ Numbers of cores/threads/functional units
     ♦ Assignment (affinity) of cores/threads
     ♦ Intranode memory bandwidth
     ♦ Internode memory bandwidth
   • They must also provide clean ways to share data
     ♦ Consistent memory models
     ♦ Decide whether it’s best to make it easy and transparent for the programmer (but slow) or fast but hard (or impossible, which is often the current state)
   • Remember, parallel programming is about performance
     ♦ You will always get higher programmer productivity with a single-threaded code

19. Solutions
   • All new: applications not well served by current systems
     ♦ e.g., not PDE simulations
   • MPI 4+: especially ensembles of MPI simulations
     ♦ After all, MPI took us from giga to tera to peta, despite claims that this was impossible
   • Addition and composition
     ♦ MPI+X, including more interesting X
     ♦ Includes embedded abstract data-structure-specific languages
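A minimal sketch of the MPI+X composition with X = OpenMP (an illustration, not the talk’s code): MPI couples the nodes while OpenMP threads work within each node. Only the main thread calls MPI here, so MPI_THREAD_FUNNELED suffices; the loop body is a stand-in for real per-node work.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* X: threaded work inside the node */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;                 /* stand-in for real per-node work */

    /* MPI: combine results across nodes */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}
```

“More interesting X” in the slide points beyond OpenMP, e.g. accelerator models or embedded data-structure-specific languages, composed with MPI in the same way.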
