First Workshop on Modeling, Benchmarking and Simulation (MoBS) at the 32nd International Symposium on Computer Architecture FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64 Cellular Architecture Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang Gao Department of Electrical and Computer Engineering University of Delaware June 4, 2005
Outline ● Cyclops-64 architecture: ● C64 supercomputer. ● C64 node. ● C64 architecture details. ● FAST design and implementation: ● Pipeline. ● Segmented memory. ● Memory and interconnect contention. ● Experience: ● Architecture design verification. ● Early system software development. ● Application development.
Cyclops-64 Supercomputer
Cyclops-64 Node Logical View
Cyclops-64 Node Main Features ● Clock frequency 500MHz. ● 75 processors: ● 2 thread units (in-order issue, out-of-order completion). ● 2 32KB SRAM banks. ● FPU and integer multiply-accumulate unit. ● 32KB I-cache shared by 5 processors. ● A-switch: ● 4 ports, 4GB/s per port. ● 1GB off-chip SDDRAM. ● Crossbar network: ● 96 ports, 4GB/s per port. ● Connects processors(80), I-cache(4), DRAM(4), A-switch(6), host-interface(2).
Cyclops-64 Architecture Relevant Features ● Integrates processors, memory and communication: ● 150 thread units, 4.7MB on-chip SRAM, crossbar switch and A-switch device. ● No resource virtualization: ● Non-preemptive execution. ● OS will not interrupt user program. ● No HW virtual memory manager. ● Memory hierarchy visible to the programmer. ● ISA provides: ● Support for thread level execution. ● In-memory atomic operations.
Simulation Requirements ● Architecture team: ● Multicore technology. ● Design verification. ● Space exploration. ● System software group: ● Toolchain development & testing. ● Runtime system design. ● Users: ● Application development. ● Performance estimation.
How does FAST meet the requirements? ● Multichip, multithreaded C64 system. ● Functionally accurate (not cycle accurate). ● Timing sensitive. ● Binary compatible. ● Instrumentability. ● Execution driven.
How does FAST meet the requirements? ● Instruction execution. ● Exception handling. ● Segmented memory space. ● Execution trace and statistics. ● Memory and interconnect contention. ● A-switch device. ● Debugger.
Instruction Pipeline Fetch and Decode ● Instruction fetch: accounts for the delay on PIB and I-cache misses, but not for branch prediction.
Instruction Pipeline Read Registers ● 1 cycle for all 1- and 2-operand instructions; 2 cycles for 3-operand instructions (FMAx). ● In-order issue.
Instruction Pipeline Execution ● RISC-like instruction execution model based on eXecution/Delay (x/d) pairs. ● Independent instructions are executed in parallel.
Instruction Pipeline Instruction Timing
Instruction type                  x    d
Branch                            2    0
Count pop.                        1    1
Int. multiply                     1    5
Int. divide, remainder            1   33
Float. add, multiply, convert     1    5
Float. multiply-add               1   10
Float. divide                     1   33
Float. square root                1   56
Memory op (local SPM)             1    2
Memory op (global SRAM)           1   20
Memory op (off-chip DRAM)         1   36
Others                            1    0
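One way to read the x/d pairs in the timing table: an instruction keeps the thread's issue slot busy for x cycles, and its result becomes available d cycles after that. The Python sketch below follows that reading. The `TIMING` key names, the `schedule` function and the register-based dependency encoding are illustrative assumptions, not FAST's implementation.

```python
# x/d pairs copied from the timing table; the key names are illustrative.
TIMING = {
    "branch": (2, 0),
    "int_mul": (1, 5),
    "fp_add": (1, 5),
    "fp_madd": (1, 10),
    "mem_spm": (1, 2),
    "mem_sram": (1, 20),
    "mem_dram": (1, 36),
    "other": (1, 0),
}

def schedule(program):
    """Return one (issue_cycle, ready_cycle) pair per instruction.

    program: list of (kind, dest, sources) tuples.
    Assumed model: instructions issue in order once their source registers
    are ready, occupy the issue slot for x cycles, and produce their result
    d cycles after the x execution cycles end.
    """
    ready = {}       # register name -> cycle at which its value is ready
    next_issue = 0   # earliest cycle at which the thread can issue again
    timeline = []
    for kind, dest, sources in program:
        x, d = TIMING[kind]
        issue = max([next_issue] + [ready.get(r, 0) for r in sources])
        done = issue + x + d
        if dest is not None:
            ready[dest] = done
        next_issue = issue + x
        timeline.append((issue, done))
    return timeline

prog = [
    ("mem_sram", "r1", []),            # load from global SRAM
    ("int_mul", "r2", ["r3", "r4"]),   # independent of the load
    ("other", "r5", ["r1"]),           # must wait for the loaded value
]
timeline = schedule(prog)
```

In this example the multiply issues one cycle after the load and finishes long before it, which is the out-of-order completion the slides describe, while the dependent instruction stalls until the load's delay has elapsed.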
Instruction Pipeline Commit ● Put results away. ● Out-of-order completion. ● Does not account for write conflicts in the register file.
Commit (cont) ● Out-of-order execution: x/d model. ● But threads run synchronously; single clock signal. ● Special instructions imply inter-thread synchronization. ● FAST synchronizes thread execution at commit stage. ● If a thread is ahead, it waits. The slowest thread updates the clock. [Figure: commit timeline for threads T1 and T2, with each step marked commit or wait.]
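The commit-stage rule (a thread that is ahead waits, and the slowest thread advances the clock) can be sketched as follows. This is a minimal Python sketch under assumptions: each thread is reduced to a sorted list of cycles at which its instructions are ready to commit, and the function name `run_lockstep` and the log format are hypothetical.

```python
def run_lockstep(commit_times):
    """Order commits of several simulated threads against one global clock.

    commit_times: dict mapping a thread name to a sorted list of cycles at
    which that thread's instructions are ready to commit (an assumed input
    format). Returns the commits as (clock, thread) pairs.
    """
    pending = {t: list(c) for t, c in commit_times.items()}
    log = []
    while any(pending.values()):
        # The thread furthest behind (smallest pending cycle) sets the clock.
        clock = min(queue[0] for queue in pending.values() if queue)
        for thread in sorted(pending):
            queue = pending[thread]
            # A thread whose next commit lies in the future simply waits.
            while queue and queue[0] <= clock:
                log.append((clock, thread))
                queue.pop(0)
    return log

order = run_lockstep({"T1": [1, 2], "T2": [3]})
```

Here T2 is ahead of T1 and therefore waits during cycles 1 and 2; it commits only once the clock, driven by the slower thread, reaches cycle 3.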
Segmented Memory Space ● Non-uniform shared address space. ● AULD: off-chip DRAM memory, at 0x80000000. ● AULS: local/scratchpad memories (SPM 0, SPM 1, ..., SPM 149), at 0x40000000. ● AULI: global/interleaved memory, at 0x00000000.
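A simulator has to decide which segment an address falls in before it can charge the matching memory latency. The Python sketch below assumes that the three addresses on the slide are segment base addresses in a 32-bit space and that each segment extends up to the next base; the names `REGIONS` and `region_of` are hypothetical.

```python
# (base address, segment name), highest base first.
# Assumption: each slide address is the base of its segment.
REGIONS = [
    (0x80000000, "off-chip DRAM"),
    (0x40000000, "scratchpad"),
    (0x00000000, "global interleaved"),
]

def region_of(addr):
    """Return the name of the memory segment that contains addr."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address outside the assumed 32-bit space")
    for base, name in REGIONS:
        if addr >= base:
            return name

segment = region_of(0x40000010)
```

Mapping a scratchpad address to a particular SPM would also need the per-thread scratchpad size, which the slide does not give, so the sketch stops at the segment level.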
Memory and Interconnect Contention
Contention in the Crossbar LOAD X STORE X DATUM X
Memory Bandwidth Limitation STORE X FULL
Interconnect Contention LOAD X FULL
A-Switch Device ● Under testing. ● Reads from / writes to memory. ● Not connected to the crossbar/memory module. ● Overhead not accounted for. ● Estimation for multi-chip programs less accurate.
Debugger ● Embedded assembly-level debugger. ● GDB-based source level debugger: ● Handles single-threaded programs. ● Work in progress for multi-chip multi-threaded programs.
Toolchain Verification C64 toolchain: supports C64 ISA, segmented memory space, etc.
System Software Toolchain Verification Methodology: ● Source: C/Fortran/assembly. ● Generate executable. ● Run binary in FAST simulator. ● If the result is as expected, the toolchain is OK. 2 Phases: ● Thorough ISA testing: manually inspect trace file. ● Toolchain coverage: large number of test cases, from the compiler regression test suite to sizable applications.
Architecture Design Verification ● FAST generates execution trace. ● Trace from VHDL simulator. ● Compare program execution, instruction by instruction. ● Help validate VHDL simulator, i.e., chip design. ● Continue testing once hardware platform (emulator) is available. ● C32 DIMES – CYSIM (SC03, SC04) ● C64 Mrs. Cyclops – FAST (SC05)
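The instruction-by-instruction comparison of two execution traces can be sketched as a search for the first record on which they disagree. The record format used here, a `(pc, mnemonic)` tuple, and the function name `first_divergence` are assumptions for illustration; the slides do not describe the actual trace format of FAST or of the VHDL simulator.

```python
from itertools import zip_longest

def first_divergence(trace_a, trace_b):
    """Return (index, record_a, record_b) for the first mismatch, else None.

    A trace that ends early is reported as a mismatch against None.
    """
    for index, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return index, a, b
    return None

# Hypothetical traces: (program counter, mnemonic) per executed instruction.
fast_trace = [(0x100, "add"), (0x104, "load"), (0x108, "branch")]
vhdl_trace = [(0x100, "add"), (0x104, "load"), (0x10C, "store")]
mismatch = first_divergence(fast_trace, vhdl_trace)
```

Stopping at the first mismatch is a deliberate choice in this sketch: once the two executions diverge, later records usually differ as a consequence and add little information.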
System Software Development ● Early evaluation of multithreaded runtime system library: ● Test functionality of the TNT* library. ● Estimation of thread creation, termination, reuse times. ● Study of spin lock algorithms. ● CNET communication library development. (*) In Workshop on Massively Parallel Processing (IPDPS05)
Application Development ● FAST verification: Do the results match what the architecture is capable of? ● Microbenchmarks: ● TableToy ● Matrix-matrix-multiply
Experience ● TableToy: Measure Giga Updates Per Second (GUPS)
    tmp1 = stable[j]        (load)
    tmp2 = table[i]         (load)
    val = tmp1 xor tmp2     (xor)
    table[i] = val          (store)
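The TableToy update (two loads, one xor, one store) can be written out as a small runnable kernel. This Python sketch is illustrative only: how the indices `i` and `j` are generated is not shown on the slide, so they are passed in as a list of pairs, and the helper names `table_toy` and `gups` are hypothetical.

```python
def table_toy(table, stable, index_pairs):
    """Apply table[i] = stable[j] xor table[i] for each (i, j) pair.

    Modifies table in place and returns the number of updates performed.
    """
    updates = 0
    for i, j in index_pairs:
        tmp1 = stable[j]     # load
        tmp2 = table[i]      # load
        val = tmp1 ^ tmp2    # xor
        table[i] = val       # store
        updates += 1
    return updates

def gups(updates, seconds):
    """Giga updates per second for a measured (or simulated) run time."""
    return updates / seconds / 1e9

table = [0, 1, 2, 3]
stable = [5, 6]
count = table_toy(table, stable, [(0, 0), (3, 1), (0, 1)])
```

On a simulator the `seconds` argument would come from the simulated cycle count divided by the clock frequency rather than from wall-clock time.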
GUPS with TableToy ● Little throughput. ● Pseudo random number generation. ● Bank conflicts.
GUPS with NewToy (SRAM) 75% of maximum throughput
MUPS with NewToy (DRAM) DRAM maximum bandwidth 250 MUPS
MFLOPS Matrix-matrix Multiply ● Baseline: 17MFLOPS. ● Optimized: 215MFLOPS/thread, approx. 500MFLOPS/proc, ~50% of maximum.
ACKNOWLEDGMENTS ● Monty Denneau (IBM) ● Henry Warren (IBM) ● José Castaños (IBM) ● Christos Georgiou (IBM) ● ET International, Inc. ● Our sponsors ● Alban Douillet ● Brice Dobry ● Ge Gan ● Geoff Gerfin ● John Tully ● Weirong Zhu ● Wesley Toland ● Ziang Hu ● Fei Chen ● Hirofumi Sakane ● Yuhei ● Vishal Karna