Multi2Sim Kepler: A Detailed Architectural GPU Simulator

Xun Gong, Rafael Ubal, David Kaeli
Northeastern University Computer Architecture Research Lab
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA

WHY USE SIMULATORS
• Designing and fabricating chips is expensive
• A significant part of the cost of delivering a new chip is design verification and validation
• It may take many years to fully test a new microarchitecture
• It is challenging to predict performance and power prior to silicon
• Simulators leverage software to evaluate models of proposed designs
• They support design space exploration
• They allow validation before hardware becomes available
• They allow software developers to evaluate and optimize performance

BACKGROUND
• GPUs have become pervasive in high-performance computing and data center environments
• Simulation is one of the key tools computer architects use to evaluate future designs
• Given the rapid growth in GPU computing, the research community requires accurate GPU simulation tools

BACKGROUND
• Multi2Sim: AMD Evergreen / Southern Islands
• GPGPU-Sim: NVIDIA Fermi
• NVIDIA Kepler: ?

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK
• A simulator for CPU, GPU, and heterogeneous systems
• Support for CPU architectures: x86, ARM, and MIPS
• Support for GPU architectures: AMD Southern Islands, NVIDIA Kepler
• Support for HSA Intermediate Language
• Based on C++11
• Large user base and open-source developer community
• Maintained through GitHub (https://github.com/multi2sim)

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK
Architecture | Disasm. | Emulation | Timing Simulation | Visual tool
ARM | In progress | – | – | ✓
MIPS | In progress | – | – | ✓
x86 | ✓ | ✓ | ✓ | ✓
AMD Southern Islands | ✓ | ✓ | ✓ | ✓
NVIDIA Kepler | In progress | ✓ | ✓ | ✓
HSA Intermediate Language | In progress | In progress | ✓ | ✓
• Available in Multi2Sim 5.0
• NVIDIA Kepler, Southern Islands, and x86 supported
• Three other CPU/GPU architectures in progress

INTRODUCTION MULTI2SIM SIMULATION FRAMEWORK
• Modular implementation
  • Four clearly different software modules per architecture (x86, MIPS, Kepler, ...)
  • Each module provides a standard interface for stand-alone execution or interaction with other modules
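
The idea of modules that share a standard interface, and can run alone or feed one another, can be sketched as follows. This is a minimal illustration in Python, not Multi2Sim's actual C++ class hierarchy: the `Module` base class, the `process` method, `run_chain`, and the string-based "instructions" are all hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class Module(ABC):
    """Hypothetical common interface shared by every simulation module."""

    @abstractmethod
    def process(self, items: Iterable[str]) -> Iterator[str]:
        """Consume a stream of items and yield a transformed stream."""


class Disassembler(Module):
    def process(self, items: Iterable[str]) -> Iterator[str]:
        # Toy decoding: normalize raw words into upper-case mnemonics.
        for word in items:
            yield word.strip().upper()


class Emulator(Module):
    def process(self, items: Iterable[str]) -> Iterator[str]:
        # Toy emulation: tag each instruction as functionally executed.
        for inst in items:
            yield f"{inst} [emulated]"


class TimingSimulator(Module):
    def process(self, items: Iterable[str]) -> Iterator[str]:
        # Toy timing model: charge exactly one cycle per instruction.
        for cycle, inst in enumerate(items, start=1):
            yield f"cycle {cycle}: {inst}"


def run_chain(modules: List[Module], source: Iterable[str]) -> List[str]:
    """Feed the output of each module into the next one in the list."""
    stream: Iterable[str] = source
    for module in modules:
        stream = module.process(stream)
    return list(stream)


binary = ["iadd ", "fmul", "exit"]
# Stand-alone execution: only the disassembler runs.
standalone = run_chain([Disassembler()], binary)
# Interaction with other modules: the same disassembler feeds the others.
chained = run_chain([Disassembler(), Emulator(), TimingSimulator()], binary)
```

Because every module exposes the same `process` signature in this sketch, the same disassembler object serves both the stand-alone and the chained use case without modification.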

Outline
• Introduction & Background
• CUDA Execution
• Kepler simulation
• Evaluation
• Conclusions

CUDA EXECUTION SIMULATION LEVEL
• SASS: NVIDIA Shader Assembly, the native GPU ISA
• PTX: an intermediate language defined by NVIDIA, at a higher level than SASS
• SASS code changes with each generation of NVIDIA GPU, while PTX code is architecture independent
✓ Multi2Sim Kepler is designed to support NVIDIA SASS

CUDA EXECUTION SIMULATION LEVEL
• PTX execution is very different from SASS execution

CUDA EXECUTION SIMULATION LEVEL
• It is important to run SASS
  • The number of registers is limited in SASS but unlimited in PTX
  • Schedulers face more restrictions when working at the SASS level
  • More ISA-specific issues can be considered when running SASS
  • SASS simulation is much closer to actual execution on recent GPUs such as Kepler
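
One way to see why the register limit matters: at the SASS level each thread uses a fixed number of physical registers, which bounds how many warps can be resident on an SMX at once. The sketch below is a simplified estimate, not part of Multi2Sim. It assumes NVIDIA's published limits for compute capability 3.5 parts (65,536 32-bit registers and 64 resident warps per SMX, 32 threads per warp) and deliberately ignores register allocation granularity and every other occupancy limiter (shared memory, block count).

```python
WARP_SIZE = 32             # threads per warp
REGS_PER_SMX = 65536       # assumed 32-bit registers per SMX (compute capability 3.5)
MAX_WARPS_PER_SMX = 64     # assumed resident warp limit per SMX


def register_limited_warps(regs_per_thread: int) -> int:
    """Resident warps per SMX when registers are the only constraint."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SMX, REGS_PER_SMX // regs_per_warp)


def register_limited_occupancy(regs_per_thread: int) -> float:
    """Fraction of the warp slots that can be filled, registers only."""
    return register_limited_warps(regs_per_thread) / MAX_WARPS_PER_SMX


occupancy_32 = register_limited_occupancy(32)
occupancy_64 = register_limited_occupancy(64)
warps_128 = register_limited_warps(128)
```

A PTX-level simulation, working with an unbounded virtual register set, has no equivalent of this constraint unless it reproduces the register allocation of the real compiler back end.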

CUDA EXECUTION CUDA SUPPORT ON MULTI2SIM
• The figure shows the modular organization of the CUDA execution framework, based on 4 software/hardware entities.
• In each case, we compare native execution with simulated execution.

CUDA EXECUTION SIMULATION CHALLENGES
• Driver & Runtime APIs
  • Implement our own CUDA Driver & Runtime APIs
• ISA level
  • Reverse engineer the whole Kepler ISA, since there is no public information
• Microarchitecture
  • Implement benchmarks to reverse engineer and test all hardware-related specifications

Outline
• Introduction & Background
• CUDA support on Multi2Sim
• Kepler simulation
• Evaluation
• Conclusions

KEPLER SIMULATION DISASSEMBLER & EMULATOR

KEPLER SIMULATION DISASSEMBLER & EMULATOR
• Disassembler
  • Reads a CUDA binary file and dumps a text-based output of all fragments of GPU ISA code found in the file
  • Outputs SASS (shader assembly) instructions one by one to the emulator
• Emulator
  • Reads instructions from the disassembler and reproduces the original behavior of a guest program
  • Provides instruction information to the timing simulator
  • Supports the CUDA SDK 6.5 benchmark suite (21 benchmarks supported); other benchmark suites will be supported in the future
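
The disassembler-to-emulator hand-off can be illustrated with a toy example. The 4-byte encoding below (opcode, destination, two sources) and the three opcodes are invented for this sketch and bear no relation to the real Kepler SASS encoding; the function names are likewise hypothetical.

```python
import struct
from typing import Iterable, Iterator, List, Tuple

# Hypothetical toy encoding: 4 bytes per instruction (opcode, dst, src0, src1).
OPCODES = {0: "EXIT", 1: "IADD", 2: "IMUL"}
Inst = Tuple[str, int, int, int]


def disassemble(binary: bytes) -> Iterator[Inst]:
    """Decode the binary and hand instructions out one by one."""
    for offset in range(0, len(binary), 4):
        opcode, dst, src0, src1 = struct.unpack_from("4B", binary, offset)
        yield OPCODES[opcode], dst, src0, src1


def to_text(inst: Inst) -> str:
    """Text-based dump of one decoded instruction."""
    name, dst, src0, src1 = inst
    if name == "EXIT":
        return "EXIT"
    return f"{name} R{dst}, R{src0}, R{src1}"


def emulate(insts: Iterable[Inst], regs: List[int]) -> int:
    """Execute instructions functionally; return how many were executed."""
    executed = 0
    for name, dst, src0, src1 in insts:
        executed += 1
        if name == "EXIT":
            break
        if name == "IADD":
            regs[dst] = regs[src0] + regs[src1]
        elif name == "IMUL":
            regs[dst] = regs[src0] * regs[src1]
    return executed


# IADD R2, R0, R1 ; IMUL R3, R2, R2 ; EXIT
binary = bytes([1, 2, 0, 1, 2, 3, 2, 2, 0, 0, 0, 0])
listing = [to_text(inst) for inst in disassemble(binary)]
regs = [3, 4, 0, 0]
count = emulate(disassemble(binary), regs)
```

Here `listing` plays the role of the stand-alone text dump, while `emulate` consumes the same decoded stream to update register state, mirroring the two uses of the disassembler listed above.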

KEPLER SIMULATION TIMING SIMULATOR

KEPLER SIMULATION TIMING SIMULATION

KEPLER SIMULATION TIMING SIMULATION
• Support for detailed architectural models of GPU hardware components
  • SMs, warp schedulers, execution units, memory, etc.
• Support for instruction pipeline exploration
  • Pipelines for different kinds of instructions, such as integer, floating point, and control flow
• Provides architecture-related statistics
  • Cache misses/hits, instructions retired, occupancy, etc.

KEPLER SIMULATION EMULATOR
• Produces CUDA kernel results
  • Emulates instructions and updates registers and memory
• Produces execution statistics
  • Number of executed grids and blocks
  • Dynamic instruction mix of the kernel, etc.
• Produces an ISA-level trace
  • Instruction emulation trace
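
A dynamic instruction mix can be derived directly from an instruction emulation trace. The sketch below assumes the trace is simply a list of opcode mnemonics; the grouping of mnemonics into categories is an assumption made for this example and is not the classification used by Multi2Sim.

```python
from collections import Counter
from typing import Dict, List

# Assumed grouping of a few mnemonics into categories (illustrative only).
CATEGORY = {
    "IADD": "integer", "IMUL": "integer",
    "FADD": "floating point", "FMUL": "floating point", "FFMA": "floating point",
    "LD": "memory", "ST": "memory",
    "BRA": "control flow", "EXIT": "control flow",
}


def instruction_mix(trace: List[str]) -> Dict[str, float]:
    """Percentage of executed instructions that fall in each category."""
    if not trace:
        return {}
    counts = Counter(CATEGORY.get(op, "other") for op in trace)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# A made-up eight-instruction trace, as an emulator might emit it.
trace = ["LD", "LD", "FFMA", "FFMA", "FFMA", "IADD", "BRA", "EXIT"]
mix = instruction_mix(trace)
```

Because the mix is computed over executed instructions rather than static code, a loop body contributes once per iteration, which is what makes the statistic "dynamic".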

KEPLER SIMULATION ARCHITECTURAL SIMULATION
• Models SMs, memory hierarchy, and other hardware details
• Maps thread blocks onto SMs and warp pools
• Emulates instructions and propagates state through the execution pipelines
• Models resource usage and contention
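
Mapping thread blocks onto SMs under a resource limit can be sketched as below. The round-robin policy and the single `max_blocks_per_sm` limit are assumptions made for illustration; the slides do not state which distribution policy the simulator or the hardware uses, and a real model would also account for registers, shared memory, and warp slots.

```python
from collections import deque
from typing import Dict, List, Tuple


def map_blocks(num_blocks: int, num_sms: int,
               max_blocks_per_sm: int) -> Tuple[Dict[int, List[int]], List[int]]:
    """Assign block IDs to SMs round-robin until every SM is full.

    Returns the resident blocks per SM and the blocks still waiting.
    """
    resident: Dict[int, List[int]] = {sm: [] for sm in range(num_sms)}
    pending = deque(range(num_blocks))
    progress = True
    while pending and progress:
        progress = False
        for sm in range(num_sms):
            if pending and len(resident[sm]) < max_blocks_per_sm:
                resident[sm].append(pending.popleft())
                progress = True
    return resident, list(pending)


# 10 blocks, 4 SMs, at most 2 resident blocks per SM.
resident, waiting = map_blocks(10, 4, 2)
```

The blocks left in `waiting` model contention: they can only be issued once a resident block finishes and frees its slot.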

KEPLER SIMULATION MULTI2SIM KEPLER ADVANTAGES
• Support for CPU-GPU heterogeneous simulation
• Support for NVIDIA Kepler native SASS execution
• Support for detailed NVIDIA Kepler microarchitectural exploration

EVALUATION
• Emulator
  • Statistics: number of instructions executed, instruction classification, percentage of each kind of instruction

EVALUATION
• Average execution time for different input sets on each benchmark
• In general, there is good fidelity with the K20X
• HM is an outlier, since it uses st.wt and ld.cv instructions, which change the cache policy

EVALUATION
• Input sizes: from 1K to 128K

EVALUATION
• Input sizes: from 128x128 to 1024x1024

EVALUATION
• Input sizes: from 32K to 1M

EVALUATION
• Performance achieved by changing the number of lanes for each pSPU per SMX
• MatrixTranspose shows greater speedup than VectorAdd because it is less memory sensitive
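
The reasoning that a less memory-sensitive kernel gains more from extra lanes can be illustrated with a toy cost model. Everything here is an assumption made for the sketch: a 32-thread warp takes `ceil(32 / lanes)` cycles to issue on a unit with `lanes` lanes, memory time is a fixed term unaffected by lane count, and the two workloads use invented numbers rather than measurements from the slides.

```python
import math

WARP_SIZE = 32


def estimated_cycles(compute_insts: int, memory_cycles: int, lanes: int) -> int:
    """Toy model: lane count scales compute issue time, not memory time."""
    issue_cycles_per_inst = math.ceil(WARP_SIZE / lanes)
    return compute_insts * issue_cycles_per_inst + memory_cycles


def speedup(compute_insts: int, memory_cycles: int,
            base_lanes: int, new_lanes: int) -> float:
    """Speedup of the toy model when going from base_lanes to new_lanes."""
    before = estimated_cycles(compute_insts, memory_cycles, base_lanes)
    after = estimated_cycles(compute_insts, memory_cycles, new_lanes)
    return before / after


# Invented workloads: one dominated by compute, one dominated by memory.
compute_heavy = speedup(compute_insts=1000, memory_cycles=200,
                        base_lanes=8, new_lanes=32)
memory_heavy = speedup(compute_insts=200, memory_cycles=1000,
                       base_lanes=8, new_lanes=32)
```

In this model the fixed memory term caps the achievable speedup, so the workload with the smaller memory share benefits more from the same increase in lanes.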

CONCLUSIONS
• Summary
  • Presented Multi2Sim Kepler, a detailed performance simulator supporting NVIDIA Kepler SASS execution
  • Provided example architectural studies exploring the Kepler GPU microarchitecture
  • Showed the benefits of the infrastructure by evaluating application characteristics
• Future work
  • Support more benchmarks
  • Implement new CUDA runtime and driver APIs
  • Improve the accuracy of our simulator, focusing on the memory model

Thank you! Questions?

* This work is supported in part by NSF Grant CNS-1525412, and through generous donations from NVIDIA, AMD and the Heterogeneous Systems Foundation.
