
Tomorrow's Exascale Systems: Not Just Bigger Versions of Today's Peta-Computers (PowerPoint presentation)



  1. Tomorrow's Exascale Systems: Not Just Bigger Versions of Today's Peta-Computers. Thomas Sterling, Professor of Informatics and Computing, Indiana University; Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST), School of Informatics and Computing, Indiana University. November 20, 2013

  2. Tianhe-2: Halfway to Exascale
  • China, 2013: the 30-PetaFLOPS dragon
  • Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou
  • Peak performance of 54.9 PFLOPS
    – 16,000 nodes containing 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
    – 162 cabinets in a 720 m² footprint
    – 1.404 PB total memory (88 GB per node)
    – Each Xeon Phi board uses 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
    – Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
    – 12.4 PB parallel storage system
    – 17.6 MW power consumption under load; 24 MW including (water) cooling
    – 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system
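
A quick consistency check of the figures above, using only the numbers quoted on this slide; the purely illustrative C++ sketch below recovers the per-processor core count and the per-node memory from the aggregate values.

```cpp
#include <cstdio>

// Sanity-check the Tianhe-2 figures quoted above using only this slide's numbers.
int main()
{
    const long long nodes        = 16000;
    const long long xeons        = 32000;
    const long long phis         = 48000;
    const long long phi_cores    = 57;
    const long long total_cores  = 3120000;
    const double    total_mem_GB = 1.404e6;   // 1.404 PB expressed in GB

    // Cores left over for the Xeons, and cores per Xeon.
    const long long xeon_cores = total_cores - phis * phi_cores;       // 384,000
    std::printf("cores per Xeon : %lld\n", xeon_cores / xeons);        // 12
    std::printf("memory per node: %.1f GB\n", total_mem_GB / nodes);   // ~87.8 GB ("88 GB")
    return 0;
}
```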

  3. Exaflops by 2019 (maybe): TOP500 performance-development chart (Sum, N=1, and N=500 trend lines) from 1994 projected out to 2020, with the Sum curve approaching 1 Eflop/s near the end of the decade. Courtesy of Erich Strohmaier, LBNL

  4. Elements of an MFE (Magnetic Fusion Energy) Integrated Model: Complex Multi-scale, Multi-physics Processes. Courtesy of Bill Tang, Princeton

  5. Progress in Turbulence Simulation Capability: Faster Computers → Improved Fusion Energy Physics Insights

  | Year | Computer used | PE # | Speed (TF) | Particle # | Time steps | Physics discovery (publication) |
  |------|---------------|------|------------|------------|------------|---------------------------------|
  | 1998 | Cray T3E (NERSC) | 10^2 | 10^-1 | 10^8 | 10^4 | Ion turbulence zonal flow (Science, 1998) |
  | 2002 | IBM SP (NERSC) | 10^3 | 10^0 | 10^9 | 10^4 | Ion transport scaling (PRL, 2002) |
  | 2007 | Cray XT3/4 (ORNL) | 10^4 | 10^2 | 10^10 | 10^5 | Electron turbulence (PRL, 2007); EP transport (PRL, 2008) |
  | 2009 | Jaguar/Cray XT5 (ORNL) | 10^5 | 10^3 | 10^11 | 10^5 | Electron transport scaling (PRL, 2009); EP-driven MHD modes |
  | 2012-13 (current) | Cray XT5 → Titan (ORNL); Tianhe-1A (China) | 10^5 | 10^4 | 10^12 | 10^5 | Kinetic-MHD; turbulence + EP + MHD |
  | 2018 (future) | Extreme-scale HPC systems | 10^6 | | 10^13 | 10^6 | Turbulence + EP + MHD + RF |

  * Example here of the GTC code (Z. Lin, et al.) delivering production runs at the TF scale in 2002 and the PF scale in 2009. Courtesy of Bill Tang, Princeton

  6. Practical Constraints for Exascale
  • Sustained performance
    – Exaflops
    – 100 Petabytes
    – 125 Petabytes/sec
    – Strong scaling
  • Reliability
    – One factor of availability
  • Generality
    – How good is it across a range of problems?
  • Cost
    – Deployment: $200M
    – Operational support
  • Productivity
    – User programmability
    – Performance portability
  • Power
    – Energy required to run the computer
    – Energy for cooling (removing heat from the machine)
    – 20 Megawatts
  • Size
    – Floor space: 4,000 sq. meters
    – Access way for power and signal cabling
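
These targets imply some useful budget ratios. The short sketch below, a back-of-the-envelope calculation using only the nominal figures from this slide (not measurements of any real machine), computes the implied memory per flop/s, bandwidth per flop, and energy per operation.

```cpp
#include <cstdio>

// Back-of-the-envelope ratios implied by the exascale targets on this slide.
int main()
{
    const double flops   = 1.0e18;   // 1 exaflops sustained
    const double mem_B   = 100e15;   // 100 PB of memory
    const double bw_Bps  = 125e15;   // 125 PB/s aggregate bandwidth
    const double power_W = 20e6;     // 20 MW

    std::printf("memory per flop/s   : %.3f bytes\n", mem_B / flops);        // ~0.100
    std::printf("bandwidth per flop  : %.3f bytes\n", bw_Bps / flops);       // ~0.125
    std::printf("energy per operation: %.1f pJ\n", power_W / flops * 1e12);  // ~20 pJ
    return 0;
}
```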

  7. Execution Model Phase Change
  • Guiding principles for system design and operation
    – Semantics, mechanisms, policies, parameters, metrics
    – Driven by technology opportunities and challenges
    – Historically, catalyzed by paradigm shift
  • Decision chain across system layers
    – For reasoning towards optimization of design and operation
  • Essential for co-design of all system layers
    – Architecture, runtime and OS, programming models
    – Reduces design complexity from O(N²) to O(N)
    – Enables holistic reasoning about concepts and tradeoffs
  • Empowers discrimination, commonality, portability
    – Establishes a phylum of HPC class systems
  Timeline of execution models: Von Neumann Model (1949), SIF-MOE Model (1968), Vector Model (1975), SIMD-array Model (1983), CSP Model (1991), ? Model (2020)

  8. Total Power: chart of total system power (MW, 0 to 10) versus time (1992 to 2012) for heavyweight, lightweight, and heterogeneous system classes. Courtesy of Peter Kogge, UND

  9. Technology Demands a New Response

  10. Total Concurrency: chart of total concurrency TC (flops/cycle, 10^1 to 10^7, log scale) versus time (1992 to 2012) for heavyweight, lightweight, and heterogeneous system classes. Courtesy of Peter Kogge, UND

  11. Performance Factors: SLOWER
  • Starvation
    – Insufficiency of concurrency of work
    – Impacts scalability and latency hiding
    – Affects programmability
  • Latency
    – Time-measured distance for remote access and services
    – Impacts efficiency
  • Overhead
    – Critical-path time of the additional work needed to manage tasks and resources
    – Impacts efficiency and the granularity required for scalability
  • Waiting for contention resolution
    – Delays due to simultaneous access requests to shared physical or logical resources
  Performance model: P = e(L,O,W) * S(s) * a(r) * U(E), where
    P – performance (ops)
    e – efficiency (0 < e < 1)
    s – application's average parallelism
    a – availability (0 < a < 1)
    r – reliability (0 < r < 1)
    U – normalization factor per compute unit
    E – watts per average compute unit
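
To make the multiplicative structure of this model concrete, here is a small illustrative evaluation. All input values are hypothetical, chosen only to show how the factors compose; U is interpreted here as an ops-per-compute-unit normalization, which is one reading of the definition above.

```cpp
#include <cstdio>

// Illustrative evaluation of P = e(L,O,W) * S(s) * a(r) * U(E).
// All inputs are made up; the point is that the factors multiply, so any
// one of them (efficiency, parallelism, availability) can cap performance.
int main()
{
    const double e = 0.25;    // efficiency after latency, overhead, waiting (0 < e < 1)
    const double S = 1.0e7;   // exploitable application parallelism (operations in flight)
    const double a = 0.95;    // availability (0 < a < 1), a function of reliability r
    const double U = 1.0e2;   // normalization: ops per average compute unit per unit time

    const double P = e * S * a * U;     // delivered performance in ops
    std::printf("P = %.3e ops\n", P);   // ~2.4e8 with these hypothetical numbers
    return 0;
}
```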

  12. The Negative Impact of Global Barriers in Astrophysics Codes
  Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations), running 1M particles over four time steps on 128 processes.
  Red indicates computation; blue indicates waiting for communication.

  13. Goals of a New Execution Model for Exascale
  • Serve as a discipline to govern future scalable system architectures, programming methods, and runtimes
  • Latency hiding at all system distances
    – Latency-mitigating architectures
  • Exploit parallelism in a diversity of forms and granularities
  • Provide a framework for efficient fine-grain synchronization and scheduling (dispatch)
  • Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
  • Support full virtualization for fault tolerance, power management, and continuous optimization
  • Self-aware infrastructure for power management
  • Semantics of failure response for graceful degradation
  • Complexity of operation as an emergent behavior from simplicity of design, high replication, and local adaptation for global optima in time and space

  14. ParalleX Execution Model
  • Lightweight multi-threading
    – Divides work into smaller tasks
    – Increases concurrency
  • Message-driven computation
    – Moves work to data
    – Keeps work local, stops blocking
  • Constraint-based synchronization
    – Declarative criteria for work
    – Event driven
    – Eliminates global barriers
  • Data-directed execution
    – Merger of flow control and data structure
  • Shared name space
    – Global address space
    – Simplifies random gathers
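
To make the message-driven idea concrete, here is a minimal sketch of what a ParalleX-style parcel might carry. The struct and its field names are hypothetical, chosen only to illustrate the concept; they do not reflect HPX's actual parcel layout.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a ParalleX-style parcel: an active message that
// moves work to data instead of moving data to work. Names are illustrative.
struct Parcel {
    std::uint64_t destination;           // global address of the target object (AGAS-style name)
    std::uint32_t action_id;             // which remote action (method) to invoke on arrival
    std::vector<std::uint8_t> payload;   // serialized arguments for the action
    std::uint64_t continuation;          // global address of an LCO to signal with the result
};

// On arrival, the runtime would look up action_id, spawn a lightweight thread
// at the locality owning `destination`, run it against the local data, and
// trigger `continuation` when the result becomes available.
```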

  15. ParalleX Addresses Critical Challenges (1)
  • Starvation
    – Lightweight threads provide an additional level of parallelism
    – Lightweight threads with rapid context switching for non-blocking execution
    – Low overhead permits finer granularity and more parallelism
    – Parallelism discovery at runtime through data-directed execution
    – Overlap of successive phases of computation for more parallelism
  • Latency
    – Lightweight thread context switching for non-blocking execution
    – Overlap of computation and communication to limit latency effects
    – Message-driven computation reduces latency by putting work near its data
    – Reduced number and size of global messages

  16. ParalleX Addresses Critical Challenges (2)
  • Overhead
    – Eliminates (mostly) global barriers
    – However, will ultimately require hardware support in the limit
    – Uses synchronization objects exhibiting high semantic power
    – Reduces context-switching time
    – Not all actions require thread instantiation
  • Waiting due to contention
    – Adaptive resource allocation with redundant resources (analogous to hardware multithreading)
    – Eliminates polling and reduces the number of sources of synchronization contention

  17. HPX Runtime Design
  • The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model:
    – Complexes (ParalleX threads) and ParalleX thread management
    – Parcel transport and parcel management
    – Local Control Objects (LCOs)
    – Active Global Address Space (AGAS)
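
As a concrete illustration of the lightweight-thread and LCO (future) facilities, here is a minimal sketch of task-parallel HPX code. It assumes a recent HPX release (header names and build boilerplate vary across versions), and the partial_sum kernel, chunk count, and reduction are invented for illustration; the deliberate over-decomposition into many more tasks than cores mirrors the approach described on the later LULESH slide.

```cpp
#include <hpx/hpx_main.hpp>   // lets a plain main() run inside the HPX runtime
#include <hpx/hpx.hpp>        // umbrella header: hpx::async, futures, when_all, ...

#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// A trivially parallel kernel standing in for real application work.
double partial_sum(std::vector<double> const& v, std::size_t lo, std::size_t hi)
{
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i)
        s += v[i];
    return s;
}

int main()
{
    std::vector<double> data(1 << 20, 1.0);

    // Over-decompose the work into many lightweight HPX tasks (ParalleX
    // "complexes"); the runtime schedules them across the worker threads.
    std::size_t const chunks = 256;
    std::size_t const chunk  = data.size() / chunks;

    std::vector<hpx::future<double>> parts;
    for (std::size_t c = 0; c < chunks; ++c)
        parts.push_back(hpx::async(
            partial_sum, std::cref(data), c * chunk, (c + 1) * chunk));

    // Constraint-based synchronization (an LCO): the reduction runs once all
    // of its inputs are ready, with no global barrier.
    double total =
        hpx::when_all(std::move(parts))
            .then([](hpx::future<std::vector<hpx::future<double>>> ready) {
                double s = 0.0;
                for (auto& p : ready.get())
                    s += p.get();
                return s;
            })
            .get();

    std::cout << "sum = " << total << std::endl;   // expect 1048576
    return 0;
}
```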

  18. Overlapping Computational Phases for Hydrodynamics
  Computational phases for LULESH (a mini-app representative of hydrodynamics codes). Red indicates work; white indicates waiting for communication.
  Overdecomposition: MPI used 64 processes, while HPX used roughly 1,000 (1E3) threads spread across 64 cores. Panels shown: MPI and HPX.

  19. Dynamic load balancing via message-driven work-queue execution for Adaptive Mesh Refinement (AMR)

  20. Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations

  21. Conclusions
  • HPC is in a (sixth) phase change
  • Ultra-high-scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
  • ParalleX is an example of an experimental execution model that addresses key challenges on the path to exascale
  • Early experiments are encouraging for enhancing the scaling of graph-based, numerically intensive, and knowledge-management applications
