Time to Start over? Software for Exascale William Gropp www.cs.illinois.edu/~wgropp
Why Is Exascale Different? • Extreme power constraints, leading to ♦ Clock Rates similar to today’s systems ♦ A wide diversity of simple computing elements (simple for hardware but complex for algorithms and software) ♦ Memory per core and per FLOP will be much smaller ♦ Moving data anywhere will be expensive (time and power) • Faults that will need to be detected and managed ♦ Some detection may be the job of the programmer, as hardware detection takes power 2
Why Is Exascale Different? • Extreme scalability and performance irregularity ♦ Performance will require enormous concurrency (10^8 – 10^9) ♦ Performance is likely to be variable • Simple, static decompositions will not scale • A need for latency tolerant algorithms and programming ♦ Memory, processors will be 100s to 10000s of cycles away. Waiting for operations to complete will cripple performance 3
Why is Everyone Worried? • Exascale makes all problems extreme: ♦ Power, data motion costs, performance irregularities, faults, extreme degree of parallelism, specialized functional units • Added to each of these is ♦ Complexity resulting from all of the above • These issues are not new, but may be impossible to ignore ♦ The “free ride” from rapid improvement in hardware performance is ending/over 4
That “Kink” in #500 is Real • [Chart: HPL Perf (TF) and Fit perf (TF) for the #500 system vs. year, log scale] • Extrapolation of recent data gives ~1PF HPL in 2018 on the #500 system • Extrapolation of older data gives ~1PF in 2015, ~7PF in 2018 • #500 may be a better predictor of trends 5
Current Petascale Systems Already Complex • Typical processor ♦ 8 floating point units, 16 integer units • What is a “core”? ♦ Full FP performance requires use of short vector instructions • Memory ♦ Performance depends on location, access pattern ♦ “Saturates” on multicore chip • Specialized processing elements ♦ E.g., NVIDIA GPU (K20X); 2688 “cores” (or 56…) • Network ♦ 3- or 5-D Torus, latency, bandwidth, contention important 6
Blue Waters: NSF’s Most Powerful System • 4,224 XK7 nodes and 22,640 XE6 nodes ♦ ~ 1/7 GPU+CPU, 6/7 CPU+CPU ♦ Peak perf >13PF: ~ 1/3 GPU+CPU, 2/3 CPU+CPU • 1.5 PB Memory, >1TB/Sec I/O Bandwidth • System sustains > 1 PetaFLOPS on a wide range of applications ♦ From starting to read input from disk to results written to disk, not just computational kernels ♦ No Top500 run – does not represent application workload 7
How Do We Program These Systems? • There are many claims about how we can and cannot program extreme scale systems ♦ Confusion is rampant ♦ Incorrect statements and conclusions common ♦ Often reflects “I don’t want to do it that way” instead of “there’s a good reason why it can’t be done that way” • General impression ♦ The programming model influences the solutions used by programmers and algorithm developers ♦ In Linguistics, this is the Sapir-Whorf or Whorfian hypothesis • We need to understand our terms first 8
How Should We Think About Parallel Programming? • Need a more formal way to think about programming ♦ Must be based on the realities of real systems ♦ Not the system that we wish we could build (see PRAM) • Not talking about a programming model ♦ Rather, first need to think about what an extreme scale parallel system can do ♦ System – the hardware and the software together 9
Separate the Programming Model from the Execution Model • What is an execution model? ♦ It’s how you think about how you can use a parallel computer to solve a problem • Why talk about this? ♦ The execution model can influence what solutions you consider (the Whorfian hypothesis) ♦ After decades where many computer scientists only worked with one execution model, we are now seeing new models and their impact on programming and algorithms 10
Examples of Execution Models • Von Neumann machine: ♦ Program counter ♦ Arithmetic Logic Unit ♦ Addressable Memory • Classic Vector machine: ♦ Add “vectors” – apply the same operation to a group of data with a single instruction • Arbitrary length (CDC Star 100), 64 words (Cray), 2 words (SSE) • GPUs with collections of threads (Warps) 11
Programming Models and Systems • In the past, often a tight connection between the execution model and the programming approach ♦ Fortran: FORmula TRANslation to von Neumann machine ♦ C: e.g., “register”, ++ operator match PDP-11 capabilities and needs • Over time, execution models and reality changed but programming models rarely reflected those changes ♦ Rely on compiler to “hide” those changes from the user – e.g., auto-vectorization for SSE(n) • Consequence: Mismatch between users’ expectations and system abilities ♦ Can’t fully exploit system because user’s mental model of execution does not match real hardware ♦ Decades of compiler research have shown this problem is extremely hard – can’t expect the system to do everything for you. 12
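To make the auto-vectorization point concrete, here is a minimal C sketch (illustrative, not from the slides): the same DAXPY-style loop written once in plain scalar C, where the compiler is trusted to auto-vectorize, and once with SSE2 intrinsics, where the vector execution model is stated explicitly.

    #include <emmintrin.h>   /* SSE2 intrinsics: two doubles per register */

    /* Scalar version: the programming model says nothing about vectors;
       the compiler may or may not map this loop onto SSE instructions. */
    void daxpy_scalar(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Explicit version: the same operation written against the vector
       execution model using SSE2 intrinsics. */
    void daxpy_sse2(int n, double a, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(a);        /* broadcast a into both lanes */
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            __m128d vx = _mm_loadu_pd(&x[i]);
            __m128d vy = _mm_loadu_pd(&y[i]);
            _mm_storeu_pd(&y[i], _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
        for (; i < n; i++)                  /* handle a leftover element */
            y[i] += a * x[i];
    }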
Programming Models and Systems • Programming Model: an abstraction of a way to write a program ♦ Many levels • Procedural or imperative? • Single address space with threads? • Vectors as basic units of programming? ♦ Programming model often expressed with pseudo code • Programming System: (my terminology) ♦ An API that implements parts or all of one or more programming models, enabling the precise specification of a program 13
Why the Distinction? • In parallel computing, ♦ Message passing is a programming model • Abstraction: A program consists of processes that communicate by sending messages. See “Communicating Sequential Processes”, CACM 21#8, 1978, by C.A.R. Hoare. ♦ The Message Passing Interface (MPI) is a programming system • Implements message passing and other parallel programming models, including: • Bulk Synchronous Programming • One-sided communication • Shared-memory (between processes) ♦ CUDA/OpenACC/OpenCL are systems implementing a “GPU Programming Model” • Execution model involves teams, threads, synchronization primitives, different types of memory and operations 14
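As a concrete illustration of the distinction, a minimal sketch of the message-passing model expressed through the MPI programming system (the ranks, tag, and value are illustrative choices, not taken from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* process 0 sends one int to process 1 with tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* process 1 receives the message; no shared memory is assumed */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }

The same MPI program could instead be written with one-sided (MPI_Put/MPI_Get) or shared-memory windows; the one programming system supports several programming models.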
The Devil Is in the Details • There is no unique execution model ♦ What level of detail do you need to design and implement your program? • Don’t forget – you decided to use parallelism because you could not get the performance you need without it • Getting what you need already? ♦ Great! It ain’t broke • But if you need more performance of any type (scalability, total time to solution, user productivity) ♦ Rethink your model of computation and the programming models and systems that you use 15
Rethinking Parallel Computing • Changing the execution model ♦ No assumption of performance regularity – but not unpredictable, just imprecise • Predictable within limits and most of the time ♦ Any synchronization cost amplifies irregularity – don’t include synchronizing communication as a desirable operation ♦ Memory operations are always costly, so moving the operation to the data may be more efficient • Some hardware designs provide direct support for this, not just software emulation ♦ Important to represent key hardware operations, which go beyond those of a simple, single Arithmetic Logic Unit (ALU) • Remote update (RDMA) • Remote atomic operation (compare and swap) • Execute short code sequence (active messages, parcels) 16
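A hedged sketch of how a programming system can expose such hardware operations, here via MPI-3 one-sided communication; the window layout, target rank 0, and values are illustrative assumptions, not from the slides:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        long *base, one = 1, oldval;    /* base[0]: counter, base[1]: flag */
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_allocate(2 * sizeof(long), sizeof(long), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);

        /* initialize the local window inside an exclusive lock epoch */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win);
        base[0] = base[1] = 0;
        MPI_Win_unlock(rank, win);
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        if (rank != 0)   /* remote atomic: fetch-and-add on rank 0's counter */
            MPI_Fetch_and_op(&one, &oldval, MPI_LONG, 0, 0, MPI_SUM, win);
        if (rank == 1)   /* remote update: RDMA-style put into rank 0's flag */
            MPI_Put(&one, 1, MPI_LONG, 0, 1, 1, MPI_LONG, win);
        MPI_Win_unlock(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }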
How Does This Change The Way You Should Look At Parallel Programming? • More dynamic. Plan for performance irregularity ♦ But still exploit as much regularity as possible to minimize the overhead of being dynamic • Recognize communication takes time, which is not precisely predictable ♦ Communication between cache and memory or between two nodes in a parallel system ♦ Contention in the system is hard to avoid • Think about the execution model ♦ Your abstraction of how a parallel machine works ♦ Include the hardware-supported features that you need for performance • Finally, use a programming system that lets you express the elements you need from the execution model. 17
Challenges for Programming Models • Parallel programming models need to provide ways to coordinate resource allocation ♦ Numbers of cores/threads/functional units ♦ Assignment (affinity) of cores/threads ♦ Intranode memory bandwidth ♦ Internode memory bandwidth • They must also provide clean ways to share data ♦ Consistent memory models ♦ Decide whether it’s best to make it easy and transparent for the programmer (but slow) or fast but hard (or impossible, which is often the current state) • Remember, parallel programming is about performance ♦ You will always get higher programmer productivity with a single-threaded code 18
Solutions • All new: Applications not well served by current systems ♦ e.g., not PDE simulations • MPI 4+: Especially ensembles of MPI simulations ♦ After all, MPI took us from giga to tera to peta, despite claims that it was impossible • Addition and Composition ♦ MPI+X, including more interesting X ♦ Includes embedded abstract data-structure-specific languages 19
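One hedged sketch of the MPI+X direction, with X = OpenMP: threads handle the intranode parallelism while MPI handles internode communication (the array size and the reduction are illustrative placeholders, not from the slides).

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                        /* illustrative local problem size */

    int main(int argc, char **argv)
    {
        int provided, rank;
        static double x[N];
        double local = 0.0, global = 0.0;

        /* ask the MPI programming system for the thread support that X needs */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++) {        /* X = OpenMP: intranode threads */
            x[i] = rank + i * 1.0e-6;
            local += x[i] * x[i];
        }

        /* MPI: combine the per-process partial sums across nodes */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("global sum of squares = %g\n", global);

        MPI_Finalize();
        return 0;
    }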