RAMP for Exascale
RAMP Wrap, August 25, 2010
Kathy Yelick
NERSC Overview

NERSC represents science needs:
• Over 3,000 users, 400 projects, 500 code instances
• Over 1,600 publications in 2009
• Time is used by university researchers (65%), DOE labs (25%), and others

1 petaflop Hopper system, late 2010:
• High application performance
• Nodes: two 12-core AMD processors
• Low-latency Gemini interconnect
Science at NERSC

• Energy storage: catalysis for improved batteries and fuel cells
• Nano devices: new single-molecule switching element
• Fusion: simulations of fusion devices at ITER scale
• Combustion: new algorithms (AMR) coupled to experiments
• Materials: for solar panels and other applications
• Capture & Sequestration: EFRCs
• Climate modeling: work with users on scalability of cloud-resolving models
Algorithm Diversity

(Table: NERSC qualitative in-depth analysis of methods by science area. Science areas such as Accelerator Science, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, and Material Science are mapped against dense linear algebra, sparse linear algebra, spectral methods (FFTs), particle methods, structured grids/AMR, and unstructured grids.)
Numerical Methods at NERSC

• Caveat: survey data from ERCAP requests, based on PI input
• Allocation is based on hours allocated to projects that use the method

(Chart: percentage of projects and percentage of allocation by numerical method.)
NERSC Interest in Exascale

(Chart: peak teraflop/s of NERSC systems against the Top500 trend, 2006-2020.)
• Franklin (N5): 19 TF sustained, 101 TF peak; COTS/MPP + MPI
• Franklin (N5) +QC: 36 TF sustained, 352 TF peak
• Hopper (N6): >1 PF peak; COTS/MPP + MPI (+ OpenMP)
• NERSC-7: 10 PF peak; GPU (CUDA/OpenCL) or manycore (BG/Q, R)
• NERSC-8: 100 PF peak
• NERSC-9: exascale, 1 EF peak, + ???

Danger: dragging users into a local optimum for programming.
Exascale is really about Energy Efficient Computing

At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• 1 exaflop in 2018 at 20 MW is the DOE target

(Chart: projected system power, 2005-2020, "usual scaling" curve vs. the 20 MW goal.)
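As a rough check of why these numbers matter, a back-of-envelope sketch, assuming the $1M-per-MW figure is per year of operation (the usual reading, but an assumption here):

```c
#include <stdio.h>

/* Rough annual energy cost at an assumed $1M per MW-year. */
int main(void) {
    const double dollars_per_mw_year = 1.0e6;  /* slide's $1M per MW, assumed per year */
    const double usual_scaling_mw    = 200.0;  /* exaflop in 2018 with "usual" scaling */
    const double doe_target_mw       = 20.0;   /* DOE exascale power target */

    printf("usual scaling: ~$%.0fM/year\n", usual_scaling_mw * dollars_per_mw_year / 1e6);
    printf("DOE target:    ~$%.0fM/year\n", doe_target_mw    * dollars_per_mw_year / 1e6);
    return 0;
}
```

Under that assumption, the "usual scaling" machine costs on the order of $200M a year just to power, versus ~$20M at the DOE target.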
The Challenge

• Power is the leading constraint in HPC system design
• How do we build an exascale system without building a nuclear power plant next to the HPC center?
• How can we assure the system will be balanced for a reasonable science workload?
• How do we make it "programmable"?
Architecture Paths to Exascale

• Leading technology paths ("swim lanes"):
  – Multicore: maintain complex cores and replicate (x86 and POWER7)
  – Manycore/embedded: use many simpler, low-power cores from the embedded market (BlueGene)
  – GPU/accelerator: use highly specialized processors from the gaming space (NVIDIA Fermi, Cell)
• Risks in swim-lane selection:
  – Select too soon: users cannot follow
  – Select too late: fall behind the performance curve
  – Select incorrectly: subject users to multiple disruptive changes
• Users must be deeply engaged in this process
  – Cannot leave this up to vendors alone
Green Flash: Overview (John Shalf, PI)

• We present an alternative approach to developing systems to serve the needs of scientific computing
• Choose our science target first to drive design decisions
• Leverage new technologies driven by the consumer market
• Auto-tune software for performance, productivity, and portability
• Use hardware-accelerated architectural emulation to rapidly prototype designs (auto-tune the hardware too!)
• Requires a holistic approach: must innovate algorithms, software, and hardware together (co-design)

Goal: achieve a 100x energy-efficiency improvement over the mainstream HPC approach.
System Balance

• If you pay 5% more to double the FPUs and get a 10% improvement, it's a win (despite lowering your percent of peak performance)
• If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though percent of peak looks better)
• Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
• We have a fixed budget
  – Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
  – Balance means balancing your checkbook and balancing your power budget
  – Requires application co-design to make the right trade-offs (see the sketch below)
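A minimal sketch of the checkbook argument above, using the illustrative cost and speedup numbers from the bullets (hypothetical figures, not measurements of any real design):

```c
#include <stdio.h>

/* Compare design options by delivered performance per unit of budget,
 * not by percent of peak. */
static double perf_per_dollar(double relative_cost, double relative_perf) {
    return relative_perf / relative_cost;
}

int main(void) {
    /* Baseline design: cost 1.0, performance 1.0. */
    double base      = perf_per_dollar(1.00, 1.00);
    /* Double the FPUs: +5% cost, +10% performance (lower %-of-peak, but a win). */
    double more_fpus = perf_per_dollar(1.05, 1.10);
    /* Double memory BW: 2x the cost in power/dollars, +35% performance (a net loss). */
    double more_bw   = perf_per_dollar(2.00, 1.35);

    printf("baseline     : %.3f perf/$\n", base);       /* 1.000            */
    printf("2x FPUs      : %.3f perf/$\n", more_fpus);  /* ~1.048 -> win    */
    printf("2x memory BW : %.3f perf/$\n", more_bw);    /* ~0.675 -> loss   */
    return 0;
}
```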
The Complexity of Tradeoffs in Exascale System Design

(Figure: the feasible exascale systems lie at the intersection of the 20 MW power envelope, the $200M cost envelope, the bytes/core envelope, and the performance envelope.)
An Application Driver: Global Cloud Resolving Climate Model
Computational Requirements

• ~2 million horizontal subdomains; ~20 million total subdomains
• 100 terabytes of memory
  – 5 MB of memory per subdomain (~20 million x 5 MB ≈ 100 TB)
• 20 PF sustained (200 PF peak)
• Nearest-neighbor communication
• New discretization for the climate model: the CSU icosahedral code
• Must maintain 1000x faster than real time for practical climate simulation
An Application Driver: Seismic Exploration
Seismic Migration

• Reconstruct the earth's subsurface
• Focus on exploration, which requires a 10 km survey depth
• Studies the velocity contrast between different materials below the surface
Seismic RTM Algorithm

• Explicit finite-difference method used to approximate the wave equation
• 8th-order stencil code (a minimal sketch follows)
• Typical survey size is 20 x 30 x 10 km
  – Runtime on current clusters is ~1 month
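The production RTM kernel is not shown in the slides; below is a minimal, hedged sketch of one time step of an explicit wave-equation update with an 8th-order-in-space stencil. The array names, the `vel2dt2` field (velocity squared times dt squared), and the halo handling are illustrative assumptions, not the actual code:

```c
/* One interior update: p_next = 2*p - p_prev + vel2dt2 * laplacian_8th(p).
 * c0..c4 are the standard 8th-order central-difference weights for the second
 * derivative with unit grid spacing; in practice they are scaled by 1/dx^2
 * (or folded into vel2dt2).  Halo cells of width 4 are assumed to be filled. */
void rtm_step(int nx, int ny, int nz,
              const float *restrict p, const float *restrict p_prev,
              float *restrict p_next, const float *restrict vel2dt2)
{
    const float c0 = -205.0f/72.0f, c1 = 8.0f/5.0f,
                c2 = -1.0f/5.0f,    c3 = 8.0f/315.0f, c4 = -1.0f/560.0f;
    #define IDX(i,j,k) ((size_t)(k)*ny*nx + (size_t)(j)*nx + (i))
    for (int k = 4; k < nz-4; k++)
      for (int j = 4; j < ny-4; j++)
        for (int i = 4; i < nx-4; i++) {
            float lap = 3.0f * c0 * p[IDX(i,j,k)];   /* center term, 3 dimensions */
            for (int r = 1; r <= 4; r++) {
                const float c = (r==1) ? c1 : (r==2) ? c2 : (r==3) ? c3 : c4;
                lap += c * (p[IDX(i-r,j,k)] + p[IDX(i+r,j,k)]
                          + p[IDX(i,j-r,k)] + p[IDX(i,j+r,k)]
                          + p[IDX(i,j,k-r)] + p[IDX(i,j,k+r)]);
            }
            p_next[IDX(i,j,k)] = 2.0f*p[IDX(i,j,k)] - p_prev[IDX(i,j,k)]
                               + vel2dt2[IDX(i,j,k)] * lap;
        }
    #undef IDX
}
```

Each point reads 25 neighbors and does a few dozen flops, which is why the kernel is bandwidth- and locality-bound rather than flop-bound.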
Low Power Design Principles

• Small is beautiful
  – Large array of simple, easy-to-verify cores
  – Embrace the embedded market
• Slower clock frequencies give a cubic power improvement
• Emphasis on performance per watt
• Reduce waste by not adding features that are not advantageous to the science
• Parallel, manycore processors are the path to energy efficiency
Science Optimized Processor Design

• Make it programmable:
  – Hardware support for PGAS
    • Local store mapped into the global address space
    • Direct DMA support to and from the local store, bypassing the cache
  – Logical network topology looks like a full crossbar
    • Optimized for small transfers
• On-chip interconnect optimized for the problem's communication pattern
• Directly expose locality for optimized memory movement
Cost of Data Movement

(Figure: energy cost of data movement, including MPI, compared with the cost of a FLOP.)

• The cost of data movement is not projected to improve much
• Energy efficiency will require careful management of data locality
• It is important to know when you are on-chip and when data is off-chip!
Vertical Locality Management

• Movement of data up and down the cache hierarchy
  – A cache virtualizes the notion of on-chip vs. off-chip
  – Software-managed memory (local store) is hard to program (Cell)
• Virtual local store (see the sketch below)
  – Use a conventional cache for portability
  – Use software-managed memory only for performance-critical code
  – Repartition as needed
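A minimal sketch of how a performance-critical loop might use a virtual local store while the rest of the code keeps using the ordinary cache. The `vls_*` primitives are hypothetical stand-ins, implemented here with malloc/memcpy so the sketch runs anywhere; on the proposed hardware they would repartition cache as local store and issue DMA transfers:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical virtual-local-store primitives (not a real API).  Portable
 * stand-ins for: reserving part of the cache as local store, DMA-ing data
 * on-chip, and DMA-ing results back off-chip. */
static void *vls_reserve(size_t bytes)                  { return malloc(bytes); }
static void  vls_release(void *buf)                     { free(buf); }
static void  vls_get(void *d, const void *s, size_t n)  { memcpy(d, s, n); }
static void  vls_put(void *d, const void *s, size_t n)  { memcpy(d, s, n); }

void scale_blocks(float *a, size_t n, float s)
{
    enum { BLK = 1024 };                       /* block sized to the local store */
    float *buf = vls_reserve(BLK * sizeof(float));
    if (!buf) {                                /* fall back to the normal cache  */
        for (size_t i = 0; i < n; i++) a[i] *= s;
        return;
    }
    for (size_t base = 0; base < n; base += BLK) {
        size_t len = (n - base < BLK) ? (n - base) : BLK;
        vls_get(buf, a + base, len * sizeof(float));   /* explicit move on-chip  */
        for (size_t i = 0; i < len; i++) buf[i] *= s;  /* compute from local store */
        vls_put(a + base, buf, len * sizeof(float));   /* explicit move off-chip */
    }
    vls_release(buf);
}
```

The point of the design is that only this kind of hot loop pays the programming cost of explicit data movement; everything else stays on the cached, portable path.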
Horizontal Locality Management

• Movement of data between processors
  – 10x lower latency and 10x higher bandwidth on-chip
  – Need to minimize the distance of horizontal data movement
• Encode horizontal locality into the memory address (illustrated below)
  – Hardware hierarchy in which high-order bits encode the cabinet and low-order bits encode chip-level distance
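A minimal sketch of the kind of address layout this implies; the field widths and the three-way distance classification are illustrative assumptions, not the Green Flash design:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative global-address layout (field widths are assumptions):
 *   [ cabinet : 12 bits ][ node : 12 bits ][ core : 8 bits ][ offset : 32 bits ]
 * Comparing the high-order fields of two addresses tells the hardware or the
 * runtime roughly how far apart the data is, and thus how expensive the
 * horizontal move will be. */
#define OFFSET_BITS 32
#define CORE_BITS    8
#define NODE_BITS   12

static uint64_t make_gaddr(uint64_t cabinet, uint64_t node,
                           uint64_t core, uint64_t offset) {
    return (cabinet << (NODE_BITS + CORE_BITS + OFFSET_BITS))
         | (node    << (CORE_BITS + OFFSET_BITS))
         | (core    <<  OFFSET_BITS)
         | offset;
}

/* 0 = same chip (cheap), 1 = same cabinet, 2 = different cabinet (expensive). */
static int distance_class(uint64_t a, uint64_t b) {
    if ((a >> (NODE_BITS + CORE_BITS + OFFSET_BITS)) !=
        (b >> (NODE_BITS + CORE_BITS + OFFSET_BITS)))   return 2;
    if ((a >> (CORE_BITS + OFFSET_BITS)) !=
        (b >> (CORE_BITS + OFFSET_BITS)))               return 1;
    return 0;
}

int main(void) {
    uint64_t x = make_gaddr(3, 17, 5, 0x1000);
    uint64_t y = make_gaddr(3, 17, 6, 0x2000);
    printf("distance class: %d\n", distance_class(x, y)); /* 0: same chip */
    return 0;
}
```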
Application Analysis - Climate

• Analyze each loop within the climate code
  – Extract temporal reuse and bandwidth requirements
• Use traces to determine cache size and DRAM bandwidth requirements
• Ensure the memory hierarchy can support the application
Application Optimization - Climate

• Original code: 160 KB cache requirement, < 50% FP instructions
• Tuned code: 1 KB cache requirement, > 85% FP instructions

Loop optimization resulted in a 160x reduction in cache size and a 2x increase in execution speed.
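The actual climate loops are not shown in the slides; the following hedged example illustrates the general kind of transformation involved: fusing a producer loop and a consumer loop so that a large temporary array collapses to a scalar, shrinking the cache footprint and raising the fraction of floating-point instructions:

```c
#include <stddef.h>

/* Before: two passes over a column with a full-size temporary array (in must
 * hold n+1 elements).  The working set is ~3*n doubles, so for large n the
 * temporary streams through and falls out of cache between the two passes. */
void column_update_unfused(size_t n, const double *in, double *out, double *tmp)
{
    for (size_t i = 0; i < n; i++) tmp[i] = 0.5 * (in[i] + in[i + 1]);  /* pass 1 */
    for (size_t i = 0; i < n; i++) out[i] = tmp[i] * tmp[i];            /* pass 2 */
}

/* After: the loops are fused, the temporary collapses to a scalar, and each
 * input value is touched once.  The per-column cache footprint drops from
 * O(n) to O(1), and a larger share of the executed instructions are FP ops. */
void column_update_fused(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double t = 0.5 * (in[i] + in[i + 1]);
        out[i] = t * t;
    }
}
```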
Co-design Advantage - Climate
Co-design Advantage - Seismic

(Chart: co-design performance advantage for seismic RTM, in MPoints/sec, comparing an 8-core Nehalem, Fermi, and Green Wave.)
Co-design Advantage - Seismic

(Chart: co-design power advantage for seismic RTM, in MPoints/Watt, comparing an 8-core Nehalem, Fermi, and the Tensilica-based GreenWave.)
Extending to General Stencil Codes

• A generalized co-tuning framework is being developed
• The co-tuning framework has been applied to multiple architectures
  – Manycore and GPU support
• Significant advantage over tuning only the hardware or only the software
  – ~3x power and area advantage gained

(Figure: gradient stencil kernels.)
RAMP Infrastructure

• FPGA emulation is critical for HW/SW co-design
  – Enables full-application benchmarking
  – Serves as an autotuning target
  – Provides a feedback path from application developers to hardware architects
• RAMP gateware and methodology are needed to glue the processors together