FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64

First Workshop on Modeling, Benchmarking and Simulation (MoBS) at the 32nd International Symposium on Computer Architecture. FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64 Cellular Architecture. Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang Gao. Department of Electrical and Computer Engineering, University of Delaware. June 4, 2005

Outline ● Cyclops-64 architecture: ● C64 supercomputer. ● C64 node. ● C64 architecture details. ● FAST design and implementation: ● Pipeline. ● Segmented memory. ● Memory and interconnect contention. ● Experience: ● Architecture design verification. ● Early system software development. ● Application development.

Cyclops-64 Supercomputer

Cyclops-64 Node Logical View

Cyclops-64 Node Main Features ● Clock frequency 500MHz. ● 75 processors: ● 2 thread units (in-order issue, out-of-order completion). ● 2 32KB SRAM banks. ● FPU and integer multiply-accumulate unit. ● 32KB I-cache shared by 5 processors. ● A-switch: ● 4 ports, 4GB/s per port. ● 1GB off-chip SDDRAM. ● Crossbar network: ● 96 ports, 4GB/s per port. ● Connects processors(80), I-cache(4), DRAM(4), A-switch(6), host-interface(2).

Cyclops-64 Architecture Relevant Features ● Integrates proc., memory and comm.: ● 150 thread units, 4.7MB on-chip SRAM, crossbar switch and A-switch device. ● No resource virtualization: ● Non-preemptive execution. ● OS will not interrupt user program. ● No HW virtual memory manager. ● Memory hierarchy visible by programmer. ● ISA provides: ● Support for thread level execution. ● In-memory atomic operations.
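The headline figures on the two slides above can be cross-checked with a little arithmetic. The sketch below only recombines numbers quoted on the slides; the variable names are illustrative, and the aggregate crossbar figure is a simple upper bound (ports times per-port rate), not a measured value.

```python
# Figures as quoted on the "Main Features" slide.
PROCESSORS = 75
THREAD_UNITS_PER_PROC = 2
SRAM_BANKS_PER_PROC = 2
SRAM_BANK_KB = 32
PORT_BANDWIDTH_GBS = 4
CROSSBAR_CLIENTS = {
    "processors": 80,
    "I-cache": 4,
    "DRAM": 4,
    "A-switch": 6,
    "host-interface": 2,
}

# Derived quantities.
thread_units = PROCESSORS * THREAD_UNITS_PER_PROC
sram_mb = PROCESSORS * SRAM_BANKS_PER_PROC * SRAM_BANK_KB / 1024
crossbar_ports = sum(CROSSBAR_CLIENTS.values())
aggregate_gbs = crossbar_ports * PORT_BANDWIDTH_GBS  # upper bound only

print(thread_units, "thread units")
print(round(sram_mb, 2), "MB on-chip SRAM")
print(crossbar_ports, "crossbar ports,", aggregate_gbs, "GB/s aggregate")
```

The results agree with the 150 thread units, roughly 4.7MB of SRAM, and 96 crossbar ports stated on the slides.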

Simulation Requirements ● Architecture team: ● Multicore technology. ● Design verification. ● Space exploration. ● System software group: ● Toolchain development & testing. ● Runtime system design. ● Users: ● Application development. ● Performance estimation.

How does FAST meet the requirements? ● Multichip, multithreaded C64 system. ● Functionally accurate (not cycle accurate). ● Timing sensitive. ● Binary compatible. ● Instrumentability. ● Execution driven.

How does FAST meet the requirements? ● Instruction execution. ● Exception handling. ● Segmented memory space. ● Execution trace and statistics. ● Memory and interconnect contention. ● A-switch device. ● Debugger.

Instruction Pipeline: Fetch and Decode ● Instruction fetch: accounts for the delay on PIB and I-cache misses, but not for branch prediction.

Instruction Pipeline: Read Registers ● 1 cycle for all 1- and 2-operand instructions, 2 cycles for 3-operand instructions (FMAx). ● In-order issue.

Instruction Pipeline: Execution ● RISC-like instruction execution model based on eXecution/Delay pairs. ● Independent instructions executed in parallel.

Instruction Pipeline: Instruction Timing
Instruction type                  x    d
Branch                            2    0
Count pop.                        1    1
Int. multiply                     1    5
Int. divide, remainder            1   33
Float. add, multiply, convert     1    5
Float. multiply-add               1   10
Float. divide                     1   33
Float. square root                1   56
Memory op (local SPM)             1    2
Memory op (global SRAM)           1   20
Memory op (off-chip DRAM)         1   36
Others                            1    0
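To make the x/d pairs concrete, here is a minimal sketch of how such a timing model could be applied to a single in-order thread. It assumes (this is an interpretation, not something the slides spell out) that x is the number of cycles the thread unit is busy issuing the instruction and d is the extra delay before its result can be read. The `schedule` function, the instruction-kind names, and the tuple layout are all hypothetical.

```python
# (x, d) pairs for a few instruction types from the timing table.
TIMING = {
    "branch": (2, 0),
    "int_mul": (1, 5),
    "fp_add": (1, 5),
    "fp_madd": (1, 10),
    "mem_spm": (1, 2),
    "mem_sram": (1, 20),
    "mem_dram": (1, 36),
    "other": (1, 0),
}


def schedule(program):
    """Return (issue_cycle, result_ready_cycle) per instruction.

    program: list of (kind, dst_register_or_None, source_registers).
    Assumed model: in-order issue; an instruction waits until the
    previous one has spent its x cycles and all sources are ready.
    """
    ready = {}        # register -> cycle at which its value is available
    next_issue = 0    # earliest cycle at which the thread may issue again
    timeline = []
    for kind, dst, srcs in program:
        x, d = TIMING[kind]
        issue = max([next_issue] + [ready.get(r, 0) for r in srcs])
        done = issue + x + d
        if dst is not None:
            ready[dst] = done
        next_issue = issue + x
        timeline.append((issue, done))
    return timeline


# Two independent global-SRAM loads overlap; the xor must wait for both.
program = [
    ("mem_sram", "r1", ()),
    ("mem_sram", "r2", ()),
    ("other", "r3", ("r1", "r2")),
    ("mem_sram", None, ("r3",)),
]
print(schedule(program))
```

Under these assumptions the two loads issue back to back at cycles 0 and 1, which is the "independent instructions executed in parallel" behaviour, while the dependent xor stalls until cycle 22.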

Instruction Pipeline: Commit ● Put results away. ● Out-of-order completion. ● Does not account for register-file write conflicts.

Commit (cont) ● Out-of-order execution: x/d model. ● But threads run synchronously; single clock signal. ● Special instructions imply inter-thread synchronization. ● FAST synchronizes thread execution at commit stage. ● If a thread is ahead, it waits. Slowest thread updates clock. [Figure: timeline over cycles 1 to 3. T1: commit, commit, wait. T2: wait, then commit at cycle 3.]
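The commit-stage rule (a thread that is ahead waits, and the thread furthest behind advances the clock) can be sketched as an event queue ordered by each thread's next commit time. This is an illustrative model only, not FAST's implementation; `commit_order` and the per-instruction latency lists are hypothetical.

```python
import heapq


def commit_order(threads):
    """Simulate commit-stage synchronization.

    threads: dict mapping a thread name to a list of per-instruction
    latencies in cycles. Returns (global_clock, thread) pairs in the
    order in which commits happen.
    """
    heap = []
    for name, latencies in threads.items():
        if latencies:
            heapq.heappush(heap, (latencies[0], name, 0))

    order = []
    while heap:
        # The thread with the earliest pending commit goes next; every
        # other thread is ahead of it and therefore waits.
        clock, name, index = heapq.heappop(heap)
        order.append((clock, name))
        latencies = threads[name]
        if index + 1 < len(latencies):
            next_commit = clock + latencies[index + 1]
            heapq.heappush(heap, (next_commit, name, index + 1))
    return order


# T1 commits at cycles 1 and 2; T2's single instruction commits at cycle 3.
print(commit_order({"T1": [1, 1], "T2": [3]}))
```

With these inputs the sketch reproduces the timeline on the slide: T1 commits twice and then has nothing to do, while T2 waits until cycle 3.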

Segmented Memory Space ● Non-uniform shared address space: ● 0x00000000: Global/Interleaved Memory (AULI). ● 0x40000000: Local/Scratchpad Memories, SPM 0 to SPM 149 (AULS). ● 0x80000000: Off-chip/DRAM Memory (AULD).
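A simulator for a segmented address space has to decide which memory an address targets before it can charge the right latency. The sketch below assumes 32-bit addresses and that each segment runs from its base address on the slide up to the next base; both are assumptions, as are the function and segment names. The delays are the d values from the instruction timing slide.

```python
# (base address, segment name, d cycles from the timing table),
# listed from the highest base to the lowest.
SEGMENTS = [
    (0x80000000, "off-chip DRAM", 36),
    (0x40000000, "local scratchpad (SPM)", 2),
    (0x00000000, "global interleaved SRAM", 20),
]


def classify(addr):
    """Return (segment_name, delay_cycles) for a 32-bit address."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError(f"address out of 32-bit range: {addr:#x}")
    for base, name, delay in SEGMENTS:
        if addr >= base:
            return name, delay
    raise AssertionError("unreachable: the last segment starts at 0")


for addr in (0x00001000, 0x40000010, 0x80000000):
    print(f"{addr:#010x} -> {classify(addr)}")
```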

Memory and Interconnect Contention

Contention in the Crossbar [Figure labels: LOAD X, STORE X, DATUM X]

Memory Bandwidth Limitation [Figure labels: STORE X, FULL]

Interconnect Contention [Figure labels: LOAD X, FULL]

A-Switch Device ● Under testing. ● Reads from / writes to memory. ● Not connected to the crossbar/memory module. ● Overhead not accounted for. ● Estimation for multi-chip programs less accurate.

Debugger ● Embedded assembly-level debugger. ● GDB-based source level debugger: ● Handles single-threaded programs. ● Work in progress for multi-chip multi-threaded programs.

Toolchain Verification C64 toolchain: supports C64 ISA, segmented memory space, etc.

System Software Toolchain Verification Methodology: ● Source: C/Fortran/assembly ● Generate executable. ● Run binary in FAST simulator. ● If expected result then toolchain OK. 2 Phases: ● Thorough ISA testing: manually inspect trace file. ● Toolchain coverage: large # test cases. From compiler regression testsuite to sizable applications.

Architecture Design Verification ● FAST generates execution trace. ● Trace from VHDL simulator. ● Compare program execution, instruction by instruction. ● Help validate VHDL simulator, i.e., chip design. ● Continue testing once hardware platform (emulator) is available. ● C32 DIMES – CYSIM (SC03, SC04) ● C64 Mrs. Cyclops – FAST (SC05)
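The instruction-by-instruction comparison can be pictured as a search for the first record where two traces disagree. The trace format is not given on the slides, so the sketch below assumes each trace is a sequence of comparable records, here hypothetical (pc, mnemonic, result) tuples; `first_divergence` is an illustrative name.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (index, record_a, record_b) at the first mismatch.

    Returns None when both traces are identical. A trace that ends
    early shows up as a mismatch against None.
    """
    pairs = zip_longest(trace_a, trace_b)
    for index, (rec_a, rec_b) in enumerate(pairs):
        if rec_a != rec_b:
            return index, rec_a, rec_b
    return None


# Hypothetical records: (program counter, mnemonic, written value).
functional_trace = [(0x100, "add", 3), (0x104, "mul", 12), (0x108, "xor", 9)]
vhdl_trace = [(0x100, "add", 3), (0x104, "mul", 13), (0x108, "xor", 9)]

print(first_divergence(functional_trace, vhdl_trace))
```

Reporting only the first divergence is a deliberate choice: once two executions disagree, later differences are usually consequences of the first one.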

System Software Development ● Early evaluation of multithreaded runtime system library: ● Test functionality of the TNT* library. ● Estimation of thread creation, termination, reuse times. ● Study of spin lock algorithms. ● CNET communication library development. (*) In Workshop of Massive Parallel Processing (IPDPS05)

Application Development ● FAST verification: Do the results match what the architecture is capable of? ● Microbenchmarks: ● TableToy ● Matrix-matrix-multiply

Experience ● TableToy: Measure Giga Updates Per Second (GUPS):
1. tmp1 = stable[j] (load)
2. tmp2 = table[i] (load)
3. val = tmp1 xor tmp2 (xor)
4. table[i] = val (store)
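The four-step update can be written out as a small runnable kernel. This is a sketch of the update loop only: the slides do not show how the indices i and j are generated, so they are passed in as a hypothetical list of pairs, and `table_toy` and `gups` are illustrative names.

```python
def table_toy(table, stable, index_pairs):
    """Apply the TableToy update for each (i, j) pair, in place.

    Each update is two loads, one xor, and one store.
    Returns the number of updates performed.
    """
    updates = 0
    for i, j in index_pairs:
        tmp1 = stable[j]     # load
        tmp2 = table[i]      # load
        val = tmp1 ^ tmp2    # xor
        table[i] = val       # store
        updates += 1
    return updates


def gups(updates, seconds):
    """Giga updates per second for a measured run."""
    return updates / seconds / 1e9


table = [0, 1, 2, 3]
stable = [5, 6]
count = table_toy(table, stable, [(0, 0), (3, 1), (0, 1)])
print(count, table)
```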

GUPS with TableToy Little throughput Pseudo random # generation Bank conflicts

GUPS with NewToy (SRAM) 75% maximum throughput

MUPS with NewToy (DRAM) DRAM maximum bandwidth 250 MUPS

MFLOPS Matrix-matrix Multiply ● Baseline: 17MFLOPS. ● Optimized: 215MFLOPS/thread, approx. 500MFLOPS/proc, ~50% maximum.
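The "~50% maximum" figure can be related to the clock frequency with a short calculation. The sketch assumes that a processor's FPU can complete one multiply-add (two floating-point operations) per cycle; that peak rate is an assumption used for illustration, not a number stated on the slides.

```python
CLOCK_MHZ = 500              # from the "Main Features" slide
FLOPS_PER_CYCLE = 2          # assumption: one multiply-add per cycle
THREADS_PER_PROC = 2         # from the "Main Features" slide
MEASURED_PER_THREAD = 215    # MFLOPS, optimized matrix multiply

peak_per_proc = CLOCK_MHZ * FLOPS_PER_CYCLE
measured_per_proc = MEASURED_PER_THREAD * THREADS_PER_PROC
fraction_of_peak = measured_per_proc / peak_per_proc

print(peak_per_proc, "MFLOPS assumed peak per processor")
print(measured_per_proc, "MFLOPS measured per processor")
print(f"{fraction_of_peak:.0%} of assumed peak")
```

Two threads at 215MFLOPS give 430MFLOPS per processor, which the slide rounds to approximately 500MFLOPS and about half of the assumed peak.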

ACKNOWLEDGMENTS ● Monty Denneau (IBM) ● Henry Warren (IBM) ● José Castaños (IBM) ● Christos Georgiou (IBM) ● ET International, Inc. ● Our sponsors ● Alban Douillet ● Brice Dobry ● Ge Gan ● Geoff Gerfin ● John Tully ● Weirong Zhu ● Wesley Toland ● Ziang Hu ● Fei Chen ● Hirofumi Sakane ● Yuhei ● Vishal Karna
