RAMP for Exascale
RAMP Wrap, August 25, 2010
Kathy Yelick
NERSC Overview

NERSC represents science needs:
• Over 3,000 users, 400 projects, 500 code instances
• Over 1,600 publications in 2009
• Time is used by university researchers (65%), DOE labs (25%), and others

1 petaflop Hopper system, late 2010:
• High application performance
• Nodes: two 12-core AMD processors
• Low-latency Gemini interconnect
Science at NERSC

• Energy storage: catalysis for improved batteries and fuel cells
• Nano devices: new single-molecule switching element
• Fusion: simulations of fusion devices at ITER scale
• Combustion: new algorithms (AMR) coupled to experiments
• Materials: for solar panels and other applications
• Capture & Sequestration: EFRCs
• Climate modeling: work with users on scalability of cloud-resolving models
Algorithm Diversity

(Table: NERSC qualitative in-depth analysis of methods by science area. Science areas such as Accelerator Science, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, and Material Science are mapped against dense linear algebra, sparse linear algebra, spectral methods (FFTs), particle methods, structured grids/AMR, and unstructured grids.)
Numerical Methods at NERSC

• Caveat: survey data from ERCAP requests, based on PI input
• Allocation is based on hours allocated to projects that use the method

(Chart: percentage of projects and percentage of allocation by numerical method.)
NERSC Interest in Exascale

(Chart: peak teraflop/s of NERSC systems against the Top500 trend, 2006-2020.)
• Franklin (N5): 19 TF sustained, 101 TF peak; COTS/MPP + MPI
• Franklin (N5) +QC: 36 TF sustained, 352 TF peak
• Hopper (N6): >1 PF peak; COTS/MPP + MPI (+ OpenMP)
• NERSC-7: 10 PF peak; GPU (CUDA/OpenCL) or manycore (BG/Q, R)
• NERSC-8: 100 PF peak
• NERSC-9: exascale, 1 EF peak, + ???

Danger: dragging users into a local optimum for programming.
Exascale is really about Energy Efficient Computing

At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• 1 exaflop in 2018 at 20 MW is the DOE target

(Chart: projected system power, 2005-2020, "usual scaling" curve vs. the 20 MW goal.)
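As a rough check of why these numbers matter, a back-of-envelope sketch, assuming the $1M-per-MW figure is per year of operation (the usual reading, but an assumption here):

```c
#include <stdio.h>

/* Rough annual energy cost at an assumed $1M per MW-year. */
int main(void) {
    const double dollars_per_mw_year = 1.0e6;  /* slide's $1M per MW, assumed per year */
    const double usual_scaling_mw    = 200.0;  /* exaflop in 2018 with "usual" scaling */
    const double doe_target_mw       = 20.0;   /* DOE exascale power target */

    printf("usual scaling: ~$%.0fM/year\n", usual_scaling_mw * dollars_per_mw_year / 1e6);
    printf("DOE target:    ~$%.0fM/year\n", doe_target_mw    * dollars_per_mw_year / 1e6);
    return 0;
}
```

Under that assumption, the "usual scaling" machine costs on the order of $200M a year just to power, versus ~$20M at the DOE target.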
The Challenge

• Power is the leading constraint in HPC system design
• How do we build an exascale system without building a nuclear power plant next to the HPC center?
• How can we assure the system will be balanced for a reasonable science workload?
• How do we make it "programmable"?
Architecture Paths to Exascale

• Leading technology paths ("swim lanes"):
  – Multicore: maintain complex cores and replicate (x86 and POWER7)
  – Manycore/embedded: use many simpler, low-power cores from the embedded market (BlueGene)
  – GPU/accelerator: use highly specialized processors from the gaming space (NVIDIA Fermi, Cell)
• Risks in swim-lane selection:
  – Select too soon: users cannot follow
  – Select too late: fall behind the performance curve
  – Select incorrectly: subject users to multiple disruptive changes
• Users must be deeply engaged in this process
  – Cannot leave this up to vendors alone
Green Flash: Overview (John Shalf, PI)

• We present an alternative approach to developing systems to serve the needs of scientific computing
• Choose our science target first to drive design decisions
• Leverage new technologies driven by the consumer market
• Auto-tune software for performance, productivity, and portability
• Use hardware-accelerated architectural emulation to rapidly prototype designs (auto-tune the hardware too!)
• Requires a holistic approach: must innovate algorithms, software, and hardware together (co-design)

Goal: achieve a 100x energy-efficiency improvement over the mainstream HPC approach.
System Balance

• If you pay 5% more to double the FPUs and get a 10% improvement, it's a win (despite lowering your percent of peak performance)
• If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though percent of peak looks better)
• Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
• We have a fixed budget
  – Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
  – Balance means balancing your checkbook and balancing your power budget
  – Requires application co-design to make the right trade-offs (see the sketch below)
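A minimal sketch of the checkbook argument above, using the illustrative cost and speedup numbers from the bullets (hypothetical figures, not measurements of any real design):

```c
#include <stdio.h>

/* Compare design options by delivered performance per unit of budget,
 * not by percent of peak. */
static double perf_per_dollar(double relative_cost, double relative_perf) {
    return relative_perf / relative_cost;
}

int main(void) {
    /* Baseline design: cost 1.0, performance 1.0. */
    double base      = perf_per_dollar(1.00, 1.00);
    /* Double the FPUs: +5% cost, +10% performance (lower %-of-peak, but a win). */
    double more_fpus = perf_per_dollar(1.05, 1.10);
    /* Double memory BW: 2x the cost in power/dollars, +35% performance (a net loss). */
    double more_bw   = perf_per_dollar(2.00, 1.35);

    printf("baseline     : %.3f perf/$\n", base);       /* 1.000            */
    printf("2x FPUs      : %.3f perf/$\n", more_fpus);  /* ~1.048 -> win    */
    printf("2x memory BW : %.3f perf/$\n", more_bw);    /* ~0.675 -> loss   */
    return 0;
}
```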
The Complexity of Tradeoffs in Exascale System Design

(Figure: the feasible exascale systems lie at the intersection of the 20 MW power envelope, the $200M cost envelope, the bytes/core envelope, and the performance envelope.)
An Application Driver: Global Cloud Resolving Climate Model
Computational Requirements

• ~2 million horizontal subdomains; ~20 million total subdomains
• 100 terabytes of memory
  – 5 MB of memory per subdomain (~20 million x 5 MB ≈ 100 TB)
• 20 PF sustained (200 PF peak)
• Nearest-neighbor communication
• New discretization for the climate model: the CSU icosahedral code
• Must maintain 1000x faster than real time for practical climate simulation
An Application Driver: Seismic Exploration
Seismic Migration

• Reconstruct the earth's subsurface
• Focus on exploration, which requires a 10 km survey depth
• Studies the velocity contrast between different materials below the surface
Seismic RTM Algorithm

• Explicit finite-difference method used to approximate the wave equation
• 8th-order stencil code (a minimal sketch follows)
• Typical survey size is 20 x 30 x 10 km
  – Runtime on current clusters is ~1 month
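The production RTM kernel is not shown in the slides; below is a minimal, hedged sketch of one time step of an explicit wave-equation update with an 8th-order-in-space stencil. The array names, the `vel2dt2` field (velocity squared times dt squared), and the halo handling are illustrative assumptions, not the actual code:

```c
/* One interior update: p_next = 2*p - p_prev + vel2dt2 * laplacian_8th(p).
 * c0..c4 are the standard 8th-order central-difference weights for the second
 * derivative with unit grid spacing; in practice they are scaled by 1/dx^2
 * (or folded into vel2dt2).  Halo cells of width 4 are assumed to be filled. */
void rtm_step(int nx, int ny, int nz,
              const float *restrict p, const float *restrict p_prev,
              float *restrict p_next, const float *restrict vel2dt2)
{
    const float c0 = -205.0f/72.0f, c1 = 8.0f/5.0f,
                c2 = -1.0f/5.0f,    c3 = 8.0f/315.0f, c4 = -1.0f/560.0f;
    #define IDX(i,j,k) ((size_t)(k)*ny*nx + (size_t)(j)*nx + (i))
    for (int k = 4; k < nz-4; k++)
      for (int j = 4; j < ny-4; j++)
        for (int i = 4; i < nx-4; i++) {
            float lap = 3.0f * c0 * p[IDX(i,j,k)];   /* center term, 3 dimensions */
            for (int r = 1; r <= 4; r++) {
                const float c = (r==1) ? c1 : (r==2) ? c2 : (r==3) ? c3 : c4;
                lap += c * (p[IDX(i-r,j,k)] + p[IDX(i+r,j,k)]
                          + p[IDX(i,j-r,k)] + p[IDX(i,j+r,k)]
                          + p[IDX(i,j,k-r)] + p[IDX(i,j,k+r)]);
            }
            p_next[IDX(i,j,k)] = 2.0f*p[IDX(i,j,k)] - p_prev[IDX(i,j,k)]
                               + vel2dt2[IDX(i,j,k)] * lap;
        }
    #undef IDX
}
```

Each point reads 25 neighbors and does a few dozen flops, which is why the kernel is bandwidth- and locality-bound rather than flop-bound.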
Low Power Design Principles

• Small is beautiful
  – Large array of simple, easy-to-verify cores
  – Embrace the embedded market
• Slower clock frequencies give a cubic power improvement
• Emphasis on performance per watt
• Reduce waste by not adding features that are not advantageous to the science
• Parallel, manycore processors are the path to energy efficiency
Science Optimized Processor Design

• Make it programmable:
  – Hardware support for PGAS
    • Local store mapped into the global address space
    • Direct DMA support to and from the local store, bypassing the cache
  – Logical network topology looks like a full crossbar
    • Optimized for small transfers
• On-chip interconnect optimized for the problem's communication pattern
• Directly expose locality for optimized memory movement
Cost of Data Movement

(Figure: energy cost of data movement, including MPI, compared with the cost of a FLOP.)

• The cost of data movement is not projected to improve much
• Energy efficiency will require careful management of data locality
• It is important to know when you are on-chip and when data is off-chip!
Vertical Locality Management

• Movement of data up and down the cache hierarchy
  – A cache virtualizes the notion of on-chip vs. off-chip
  – Software-managed memory (local store) is hard to program (Cell)
• Virtual local store (see the sketch below)
  – Use a conventional cache for portability
  – Use software-managed memory only for performance-critical code
  – Repartition as needed
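A minimal sketch of how a performance-critical loop might use a virtual local store while the rest of the code keeps using the ordinary cache. The `vls_*` primitives are hypothetical stand-ins, implemented here with malloc/memcpy so the sketch runs anywhere; on the proposed hardware they would repartition cache as local store and issue DMA transfers:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical virtual-local-store primitives (not a real API).  Portable
 * stand-ins for: reserving part of the cache as local store, DMA-ing data
 * on-chip, and DMA-ing results back off-chip. */
static void *vls_reserve(size_t bytes)                  { return malloc(bytes); }
static void  vls_release(void *buf)                     { free(buf); }
static void  vls_get(void *d, const void *s, size_t n)  { memcpy(d, s, n); }
static void  vls_put(void *d, const void *s, size_t n)  { memcpy(d, s, n); }

void scale_blocks(float *a, size_t n, float s)
{
    enum { BLK = 1024 };                       /* block sized to the local store */
    float *buf = vls_reserve(BLK * sizeof(float));
    if (!buf) {                                /* fall back to the normal cache  */
        for (size_t i = 0; i < n; i++) a[i] *= s;
        return;
    }
    for (size_t base = 0; base < n; base += BLK) {
        size_t len = (n - base < BLK) ? (n - base) : BLK;
        vls_get(buf, a + base, len * sizeof(float));   /* explicit move on-chip  */
        for (size_t i = 0; i < len; i++) buf[i] *= s;  /* compute from local store */
        vls_put(a + base, buf, len * sizeof(float));   /* explicit move off-chip */
    }
    vls_release(buf);
}
```

The point of the design is that only this kind of hot loop pays the programming cost of explicit data movement; everything else stays on the cached, portable path.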
Horizontal Locality Management

• Movement of data between processors
  – 10x lower latency and 10x higher bandwidth on-chip
  – Need to minimize the distance of horizontal data movement
• Encode horizontal locality into the memory address (illustrated below)
  – Hardware hierarchy in which high-order bits encode the cabinet and low-order bits encode chip-level distance
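A minimal sketch of the kind of address layout this implies; the field widths and the three-way distance classification are illustrative assumptions, not the Green Flash design:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative global-address layout (field widths are assumptions):
 *   [ cabinet : 12 bits ][ node : 12 bits ][ core : 8 bits ][ offset : 32 bits ]
 * Comparing the high-order fields of two addresses tells the hardware or the
 * runtime roughly how far apart the data is, and thus how expensive the
 * horizontal move will be. */
#define OFFSET_BITS 32
#define CORE_BITS    8
#define NODE_BITS   12

static uint64_t make_gaddr(uint64_t cabinet, uint64_t node,
                           uint64_t core, uint64_t offset) {
    return (cabinet << (NODE_BITS + CORE_BITS + OFFSET_BITS))
         | (node    << (CORE_BITS + OFFSET_BITS))
         | (core    <<  OFFSET_BITS)
         | offset;
}

/* 0 = same chip (cheap), 1 = same cabinet, 2 = different cabinet (expensive). */
static int distance_class(uint64_t a, uint64_t b) {
    if ((a >> (NODE_BITS + CORE_BITS + OFFSET_BITS)) !=
        (b >> (NODE_BITS + CORE_BITS + OFFSET_BITS)))   return 2;
    if ((a >> (CORE_BITS + OFFSET_BITS)) !=
        (b >> (CORE_BITS + OFFSET_BITS)))               return 1;
    return 0;
}

int main(void) {
    uint64_t x = make_gaddr(3, 17, 5, 0x1000);
    uint64_t y = make_gaddr(3, 17, 6, 0x2000);
    printf("distance class: %d\n", distance_class(x, y)); /* 0: same chip */
    return 0;
}
```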
Application Analysis - Climate

• Analyze each loop within the climate code
  – Extract temporal reuse and bandwidth requirements
• Use traces to determine cache size and DRAM bandwidth requirements
• Ensure the memory hierarchy can support the application
Application Optimization - Climate

• Original code: 160 KB cache requirement, < 50% FP instructions
• Tuned code: 1 KB cache requirement, > 85% FP instructions

Loop optimization resulted in a 160x reduction in cache size and a 2x increase in execution speed.
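The actual climate loops are not shown in the slides; the following hedged example illustrates the general kind of transformation involved: fusing a producer loop and a consumer loop so that a large temporary array collapses to a scalar, shrinking the cache footprint and raising the fraction of floating-point instructions:

```c
#include <stddef.h>

/* Before: two passes over a column with a full-size temporary array (in must
 * hold n+1 elements).  The working set is ~3*n doubles, so for large n the
 * temporary streams through and falls out of cache between the two passes. */
void column_update_unfused(size_t n, const double *in, double *out, double *tmp)
{
    for (size_t i = 0; i < n; i++) tmp[i] = 0.5 * (in[i] + in[i + 1]);  /* pass 1 */
    for (size_t i = 0; i < n; i++) out[i] = tmp[i] * tmp[i];            /* pass 2 */
}

/* After: the loops are fused, the temporary collapses to a scalar, and each
 * input value is touched once.  The per-column cache footprint drops from
 * O(n) to O(1), and a larger share of the executed instructions are FP ops. */
void column_update_fused(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double t = 0.5 * (in[i] + in[i + 1]);
        out[i] = t * t;
    }
}
```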
Co-design Advantage - Climate
Co-design Advantage - Seismic

(Chart: co-design performance advantage for seismic RTM, in MPoints/sec, comparing an 8-core Nehalem, Fermi, and Green Wave.)
Co-design Advantage - Seismic

(Chart: co-design power advantage for seismic RTM, in MPoints/Watt, comparing an 8-core Nehalem, Fermi, and the Tensilica-based GreenWave.)
Extending to General Stencil Codes

• A generalized co-tuning framework is being developed
• The co-tuning framework has been applied to multiple architectures
  – Manycore and GPU support
• Significant advantage over tuning only the hardware or only the software
  – ~3x power and area advantage gained

(Figure: gradient stencil kernels.)
RAMP Infrastructure

• FPGA emulation is critical for HW/SW co-design
  – Enables full-application benchmarking
  – Serves as an autotuning target
  – Provides a feedback path from application developers to hardware architects
• RAMP gateware and methodology are needed to glue the processors together