

  1. ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial. Roxana Rusitoru, Systems Research Engineer, ARM.

  2. Motivation & background
     - Goal: an HPC-oriented core (characteristics suitable for HPC).
     - Why: ARM’s main focus has been mobile – we have little knowledge of what an ARM HPC core should look like.
     - Who: ARM and its partners can make more informed decisions if we/they are to create an HPC-oriented core.
     - How (first step):
       - Use fractional-factorial experimental design to explore micro-architectural features*.
       - HPC mini-applications & benchmarks.
       - Single-core, single-thread experiments.
     * Previously used by Dam Sunwoo et al. in “A Structured Approach to the Simulation, Analysis and Characterization of Smartphone Applications”.

  3. This study
     - This study is…
       - A design space exploration on ARMv8 in-order and out-of-order core configurations to determine the sensitivities of HPC applications with respect to micro-architectural changes.
       - A way to guide detailed micro-architectural investigations (it can point us in the right direction).
     - This study is not…
       - A way to produce an “ideal” core configuration that we can just use to create next-gen HPC cores.

  4. Infrastructure background
     - gem5
       - Event-based simulator used for computer systems architecture research.
       - Can run full-system simulations, with variable levels of detail.
       - Enables the exploration of various new and existing micro-architectural features, whilst running the same software stack as real hardware.
     - SimPoint
       - Provides a mechanism and methodology for extracting the most representative phases from a given workload.
       - Each SimPoint consists of a warm-up period and a region of interest; their size is given in number of instructions.
     - Fractional factorial
       - Relies on the sparsity-of-effects principle (only the main and low-order interactions are investigated).
       - This allows for a significant reduction in the number of experiments (a fraction of a full factorial); a small sketch follows below.
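A minimal Python sketch of the fractional-factorial reduction mentioned on this slide. The factor names A–G and the 2^(7-4) resolution-III design are illustrative assumptions, not the study's actual design: three base factors get a full factorial, and the remaining four are generated from (and therefore aliased with) interactions of the base factors.

    from itertools import product

    base = ["A", "B", "C"]                                      # full-factorial base factors
    generators = {"D": "AB", "E": "AC", "F": "BC", "G": "ABC"}  # generated (aliased) factors

    runs = []
    for levels in product([-1, +1], repeat=len(base)):
        run = dict(zip(base, levels))
        # Each generated factor takes the product of its parent factors' levels,
        # confounding its main effect with that interaction (sparsity-of-effects).
        for factor, parents in generators.items():
            level = 1
            for p in parents:
                level *= run[p]
            run[factor] = level
        runs.append(run)

    print(len(runs))   # 8 runs instead of 2**7 = 128 for the full factorial
    for r in runs:
        print(r)

Each of the eight runs would then map to one simulator configuration, with -1/+1 standing for the low/high setting of a micro-architectural parameter.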

  5. Methodology
     - Select a representative collection of HPC proxy applications and benchmarks.
     - Determine gem5-appropriate runtime parameters for those applications.
     - Gather and validate SimPoints.
     - Determine appropriate micro-architectural parameters and values.
     - Run fractional factorial experiments.
     - All our experiments are single-core, single-thread.
     - Figure-of-merit: IPC.
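A small, self-contained sketch of how a figure-of-merit such as IPC is typically turned into per-factor sensitivities in a coded two-level design: a factor's main effect is the mean IPC at its high level minus the mean IPC at its low level. The factor names, the 4-run half-fraction and the IPC numbers below are made-up placeholders, not results from this study.

    # Coded 2^(3-1) half-fraction over three hypothetical factors.
    runs = [
        {"fp_lat": -1, "phys_fp_regs": -1, "l1d_size": -1},
        {"fp_lat": +1, "phys_fp_regs": -1, "l1d_size": +1},
        {"fp_lat": -1, "phys_fp_regs": +1, "l1d_size": +1},
        {"fp_lat": +1, "phys_fp_regs": +1, "l1d_size": -1},
    ]
    ipc = [0.95, 0.88, 1.12, 1.01]   # one simulated IPC per run (illustrative)

    effects = {}
    for factor in runs[0]:
        hi = [y for run, y in zip(runs, ipc) if run[factor] == +1]
        lo = [y for run, y in zip(runs, ipc) if run[factor] == -1]
        effects[factor] = sum(hi) / len(hi) - sum(lo) / len(lo)

    # Larger |effect| means IPC is more sensitive to that factor.
    print(effects)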

  6. Applications
     - We chose problem sizes such that the total memory footprint is larger than the total maximum size of the caches.
     - For all applications we only ran the core loops.
     - For most applications, we used 1B-instruction SimPoints with 100M-instruction warm-up phases.
     - Software environment: AArch64 openSUSE HPC image.
     - Applications (parallel and serial): CoMD, CoMD-MPI, HPCG, HPCG-MPI, miniFE, Pathfinder, MCB, HPCC, and a hand-crafted DGEMM.
     - Libraries & tools: OpenMPI-1.7.3, GCC-4.9.0, GCC-4.9.1.
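For context on how SimPoint results are usually combined (standard SimPoint practice, not something stated on this slide): each representative interval carries a weight equal to the fraction of execution it represents, and a whole-workload metric is the weighted sum of the per-interval metrics. A minimal sketch with made-up weights and IPC values:

    # Each entry is one SimPoint: a 1B-instruction region of interest preceded
    # by a 100M-instruction warm-up, simulated in gem5. Values are illustrative.
    simpoints = [
        {"weight": 0.45, "ipc": 1.02},
        {"weight": 0.35, "ipc": 0.87},
        {"weight": 0.20, "ipc": 1.18},
    ]

    whole_workload_ipc = sum(sp["weight"] * sp["ipc"] for sp in simpoints)
    print(f"estimated whole-workload IPC: {whole_workload_ipc:.3f}")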

  7. What we changed
     [Annotated pipeline diagram: Fetch → Decode → Issue → Execute, with branch predictor, register file, L1I/L1D caches, I-TLB/D-TLB, L2/L3 caches and main memory. Parameters varied include:]
     - Fetch2-to-decode delay; issue limit to the execute stage.
     - RAS, BTB, global predictor and local predictor sizes.
     - Number of ALU units; FP instruction latency; number of physical FP/Int registers.
     - I-TLB/D-TLB size.
     - Cache size, latency, MSHRs, prefetchers, etc. (L1I, L1D, L2, L3).
     - Main memory: address mapping, page policy, memory model, tWR.
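As an illustration of how parameters like those above could be expressed as two-level factors for the fractional-factorial runs, here is a hypothetical factor table; every name and low/high value is a placeholder chosen for the example, not a setting taken from the study:

    # Hypothetical low/high levels for a subset of the varied parameters.
    factors = {
        "fp_instruction_latency": (2, 5),            # cycles
        "num_int_alus":           (2, 4),
        "num_phys_fp_regs":       (128, 256),
        "fetch2_to_decode_delay": (1, 3),             # cycles
        "l1d_size_kB":            (32, 64),
        "l2_mshrs":               (8, 16),
        "l1d_prefetcher":         ("none", "stride"),
        "dram_page_policy":       ("open", "close"),
        "btb_entries":            (2048, 4096),
    }

    for name, (low, high) in factors.items():
        print(f"{name}: low={low}, high={high}")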

  8. OoO study – fractional factorial results (based on an ARM Cortex-A57-like model)
     [Sensitivity chart, with results grouped into: core micro-architecture, L1 I-cache, L1 D-cache, L2 cache, L3 cache, memory.]

  9. OoO study – floating-point instruction latency

  10. In-order study – fractional factorial results (based on an ARM Cortex-A53-like model)
     [Sensitivity chart, with results grouped into: core micro-architecture, L1 I-cache, L1 D-cache, L2 cache, L3 cache, memory.]

  11. In-order study – front-end study

  12. Conclusions
     - High sensitivity to latency versus throughput.
     - For out-of-order cores, there is an increased sensitivity to having more FP physical registers.
     - For out-of-order cores, there is no sensitivity to an increased number of LD/ST/Int ALUs.
     - The in-order core shows sensitivity towards the L1, L2, L3 prefetchers and the memory model.
     - Little or no sensitivity towards L1, L2, L3 data cache size variations.
     - Negative sensitivity when changing the page policy.

  13. Summary
     - We investigated single-core configurations of both out-of-order and in-order processors.
     - This provided us with a good “within core” perspective.
     - Latency, and not throughput, matters most.
     - Further work: investigation into data cache size sensitivity; in-order core prefetcher investigation (on-going).
     - Future studies: multi-core study using multi-threaded applications (on-going); deep-dive into the memory system (on-going); SMT study.

  14. Future considerations
     - We had a methodology in place for single-core studies; however, is it the best way forward? What about multi-core studies?
       - Methodology (speed/accuracy), source and magnitude of sensitivities, scalability, figure-of-merit (currently IPC).
     - gem5: it is easy to go outside of the expected design space. Great for bug hunting, good for pushing the envelope, but is it relevant?

  15. Appendix

  16. Out-of-order sensitivity study parameters

  17. In-order sensitivity study parameters
