FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64


  1. First Workshop on Modeling, Benchmarking and Simulation (MoBS) at the 32nd International Symposium on Computer Architecture. FAST: A Functionally Accurate Simulation Toolset for the Cyclops-64 Cellular Architecture. Juan del Cuvillo, Weirong Zhu, Ziang Hu, Guang Gao. Department of Electrical and Computer Engineering, University of Delaware. June 4, 2005

  2. Outline ● Cyclops-64 architecture: ● C64 supercomputer. ● C64 node. ● C64 architecture details. ● FAST design and implementation: ● Pipeline. ● Segmented memory. ● Memory and interconnect contention. ● Experience: ● Architecture design verification. ● Early system software development. ● Application development.

  3. Cyclops-64 Supercomputer

  4. Cyclops-64 Node Logical View

  5. Cyclops-64 Node Main Features ● Clock frequency 500MHz. ● 75 processors: ● 2 thread units (in-order issue, out-of-order completion). ● 2 32KB SRAM banks. ● FPU and integer multiply-accumulate unit. ● 32KB I-cache shared by 5 processors. ● A-switch: ● 4 ports, 4GB/s per port. ● 1GB off-chip SDDRAM. ● Crossbar network: ● 96 ports, 4GB/s per port. ● Connects processors (80), I-cache (4), DRAM (4), A-switch (6), host-interface (2).

  6. Cyclops-64 Architecture Relevant Features ● Integrates processing, memory and communication: ● 150 thread units, 4.7MB on-chip SRAM, crossbar switch and A-switch device. ● No resource virtualization: ● Non-preemptive execution. ● OS will not interrupt user program. ● No HW virtual memory manager. ● Memory hierarchy visible to the programmer. ● ISA provides: ● Support for thread level execution. ● In-memory atomic operations.

  7. Simulation Requirements ● Architecture team: ● Multicore technology. ● Design verification. ● Space exploration. ● System software group: ● Toolchain development & testing. ● Runtime system design. ● Users: ● Application development. ● Performance estimation.

  8. How does FAST meet the requirements? ● Multichip, multithreaded C64 system. ● Functionally accurate (not cycle accurate). ● Timing sensitive. ● Binary compatible. ● Instrumentability. ● Execution driven.

  9. How does FAST meet the requirements? ● Instruction execution. ● Exception handling. ● Segmented memory space. ● Execution trace and statistics. ● Memory and interconnect contention. ● A-switch device. ● Debugger.

  10. Instruction Pipeline: Fetch and Decode. Instruction fetch: accounts for the delay when the PIB and the I-cache miss, but not for branch prediction.

  11. Instruction Pipeline: Read Registers. 1 cycle for all 1- and 2-operand instructions, 2 cycles for 3-operand instructions (FMAx). In-order issue.

  12. Instruction Pipeline: Execution. RISC-like instruction execution model based on eXecution/Delay (x/d) pairs. Independent instructions are executed in parallel.

  13. Instruction Pipeline: Instruction Timing

      Instruction type                 x    d
      Branch                           2    0
      Count pop.                       1    1
      Int. multiply                    1    5
      Int. divide, remainder           1   33
      Float. add, multiply, convert    1    5
      Float. multiply-add              1   10
      Float. divide                    1   33
      Float. square root               1   56
      Memory op (local SPM)            1    2
      Memory op (global SRAM)          1   20
      Memory op (off-chip DRAM)        1   36
      Others                           1    0
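
The x/d pairs in the timing table can be turned into a small cycle estimator. The following is a minimal sketch, not FAST's implementation. It assumes that x is the number of cycles an instruction occupies the thread unit and that d is the additional delay before its result is available to a dependent instruction. The dictionary keys, the `total_cycles` function and the `(kind, dest, sources)` program format are hypothetical names chosen for illustration.

```python
# (x, d) pairs taken from the instruction timing table above.
TIMING = {
    "branch": (2, 0),
    "int_multiply": (1, 5),
    "float_add": (1, 5),
    "float_multiply_add": (1, 10),
    "mem_local_spm": (1, 2),
    "mem_global_sram": (1, 20),
    "mem_offchip_dram": (1, 36),
    "other": (1, 0),
}


def total_cycles(program):
    """Estimate the cycles one thread needs to run `program`.

    `program` is a list of (kind, dest, sources) tuples (hypothetical format);
    `dest` is a register name or None, `sources` is a list of register names.
    """
    clock = 0   # cycle at which the next instruction may issue
    ready = {}  # register name -> cycle at which its value is available
    for kind, dest, sources in program:
        x, d = TIMING[kind]
        # In-order issue: stall until every source operand is available.
        issue = max([clock] + [ready.get(reg, 0) for reg in sources])
        clock = issue + x          # thread unit is busy for x cycles
        if dest is not None:
            ready[dest] = clock + d  # result arrives d cycles later
    # The program is done once the last result has arrived.
    return max([clock] + list(ready.values()))
```

Under these assumptions, two independent global SRAM loads overlap and finish in 22 cycles rather than 42, while an instruction that consumes a load result stalls until the load's delay has elapsed.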

  14. Instruction Pipeline: Commit. Put results away. Out-of-order completion. Conflicts when writing to the register file are not accounted for.

  15. Commit (cont.) ● Out-of-order execution: x/d model. ● But threads run synchronously; single clock signal. ● Special instructions imply inter-thread synchronization. ● FAST synchronizes thread execution at the commit stage. ● If a thread is ahead, it waits. The slowest thread updates the clock. [Diagram: timeline of threads T1 and T2 over cycles 1 to 3, marking where each thread commits or waits.]
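
The commit-stage rule (a thread that is ahead waits, and the slowest thread moves the clock forward) can be illustrated with a toy scheduler. This is a sketch under stated assumptions rather than FAST's code: each thread is reduced to a list of per-instruction costs in cycles, and `simulate_commits` is a hypothetical name. Only the thread whose pending commit is earliest is allowed to proceed, so commits are observed in global clock order.

```python
def simulate_commits(costs_per_thread):
    """Return the commit order as a list of (thread_id, commit_cycle).

    `costs_per_thread[t]` lists the cycles each instruction of thread t takes
    (hypothetical input format for illustration).
    """
    local_time = [0] * len(costs_per_thread)  # per-thread clock
    next_instr = [0] * len(costs_per_thread)  # per-thread program counter
    commits = []
    while True:
        active = [t for t in range(len(costs_per_thread))
                  if next_instr[t] < len(costs_per_thread[t])]
        if not active:
            return commits
        # Threads whose pending commit lies further in the future wait;
        # the thread with the earliest pending commit advances the clock.
        t = min(active,
                key=lambda k: (local_time[k] + costs_per_thread[k][next_instr[k]], k))
        local_time[t] += costs_per_thread[t][next_instr[t]]
        next_instr[t] += 1
        commits.append((t, local_time[t]))
```

For example, `simulate_commits([[1, 1, 5], [3, 1]])` lets thread 0 commit twice, holds its long third instruction while thread 1 catches up, and produces commit cycles that never decrease.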

  16. Segmented Memory Space: non-uniform shared address space. ● AULD: off-chip DRAM memory, from 0x80000000. ● AULS: local scratchpad memories (SPM 0, SPM 1, ..., SPM 149), from 0x40000000. ● AULI: global interleaved memory, from 0x00000000.
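
A simulator has to decide which segment an address falls into before it can charge the matching memory latency. The sketch below assumes a 32-bit address space and assumes that the three addresses on the slide (0x00000000, 0x40000000, 0x80000000) are the base addresses of the global interleaved, scratchpad and off-chip DRAM segments. The `segment_of` function and the segment labels are hypothetical.

```python
# Segment bases as read from the memory map slide (assumed to be base
# addresses), listed from highest to lowest.
SEGMENTS = [
    (0x80000000, "off-chip DRAM"),
    (0x40000000, "scratchpad"),
    (0x00000000, "global interleaved"),
]


def segment_of(address):
    """Return (segment_name, offset_within_segment) for a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address outside the assumed 32-bit address space")
    for base, name in SEGMENTS:
        if address >= base:
            return name, address - base
    raise AssertionError("unreachable: the lowest base is 0")
```

How an offset inside the scratchpad segment maps to an individual SPM is not stated on the slide, so the sketch stops at the segment level.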

  17. Memory and Interconnect Contention

  18. Contention in the Crossbar [Figure labels: LOAD X, STORE X, DATUM X]

  19. Memory Bandwidth Limitation [Figure labels: STORE X, FULL]

  20. Interconnect Contention [Figure labels: LOAD X, FULL]

  21. A-Switch Device ● Under testing. ● Reads from / writes to memory. ● Not connected to the crossbar/memory module. ● Overhead not accounted for. ● Estimation for multi-chip programs less accurate.

  22. Debugger ● Embedded assembly-level debugger. ● GDB-based source level debugger: ● Handles single-threaded programs. ● Work in progress for multi-chip multi-threaded programs.

  23. Toolchain Verification C64 toolchain: supports C64 ISA, segmented memory space, etc.

  24. System Software Toolchain Verification. Methodology: ● Source: C/Fortran/assembly. ● Generate executable. ● Run binary in FAST simulator. ● If the result is as expected, the toolchain is OK. 2 phases: ● Thorough ISA testing: manually inspect trace file. ● Toolchain coverage: large number of test cases, from the compiler regression test suite to sizable applications.

  25. Architecture Design Verification ● FAST generates execution trace. ● Trace from VHDL simulator. ● Compare program execution, instruction by instruction. ● Help validate VHDL simulator, i.e., chip design. ● Continue testing once hardware platform (emulator) is available. ● C32 DIMES – CYSIM (SC03, SC04) ● C64 Mrs. Cyclops – FAST (SC05)
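
Comparing two execution traces instruction by instruction amounts to finding the first position where they disagree. The following sketch assumes each trace has already been parsed into a list of comparable records, for instance `(pc, mnemonic, result)` tuples. That record layout and the `first_divergence` name are hypothetical, since the slides do not describe the trace format used by FAST or the VHDL simulator.

```python
from itertools import zip_longest


def first_divergence(trace_a, trace_b):
    """Return (index, record_a, record_b) at the first mismatch, else None.

    A trace that ends early is reported as a mismatch against None.
    """
    for index, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        if a != b:
            return index, a, b
    return None
```

Reporting only the first mismatch is a deliberate choice here: once two simulators diverge, later differences are usually consequences of the first one.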

  26. System Software Development ● Early evaluation of multithreaded runtime system library: ● Test functionality of the TNT* library. ● Estimation of thread creation, termination, reuse times. ● Study of spin lock algorithms. ● CNET communication library development. (*) In Workshop of Massive Parallel Processing (IPDPS05)

  27. Application Development ● FAST verification: Do the results match what the architecture is capable of? ● Microbenchmarks: ● TableToy ● Matrix-matrix-multiply

  28. Experience ● TableToy: measures Giga Updates Per Second (GUPS). Each update performs four steps: tmp1 = stable[j] (load); tmp2 = table[i] (load); val = tmp1 xor tmp2 (xor); table[i] = val (store).
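
The four-step TableToy update can be written out as a runnable kernel. This is a sketch only: the slide does not say how the indices i and j are produced, so the seeded `random.Random` generator below is an assumption, as are the function names. The GUPS helper uses host wall-clock time as a stand-in, whereas a run under FAST would be timed in simulated cycles.

```python
import random
import time


def table_toy(table, stable, index_pairs):
    """Apply the load, load, xor, store update for each (i, j) pair."""
    for i, j in index_pairs:
        tmp1 = stable[j]    # load
        tmp2 = table[i]     # load
        val = tmp1 ^ tmp2   # xor
        table[i] = val      # store
    return len(index_pairs)


def random_index_pairs(count, table_size, stable_size, seed=0):
    """Hypothetical index generator: uniform pseudo-random (i, j) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(table_size), rng.randrange(stable_size))
            for _ in range(count)]


def measure_gups(table, stable, index_pairs):
    """Giga updates per second, measured with host wall-clock time."""
    start = time.perf_counter()
    updates = table_toy(table, stable, index_pairs)
    elapsed = time.perf_counter() - start
    return updates / elapsed / 1e9 if elapsed > 0 else 0.0
```

Because each update is an XOR against an unchanged `stable`, applying the same index sequence twice restores the original table, which gives a cheap correctness check for the kernel.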

  29. GUPS with TableToy. Little throughput. Pseudo-random number generation. Bank conflicts.

  30. GUPS with NewToy (SRAM). 75% of maximum throughput.

  31. MUPS with NewToy (DRAM). DRAM maximum bandwidth: 250 MUPS.

  32. MFLOPS Matrix-matrix Multiply. Baseline: 17 MFLOPS. Optimized: 215 MFLOPS/thread, approx. 500 MFLOPS/proc, ~50% of maximum.

  33. Acknowledgments ● Monty Denneau (IBM) ● Henry Warren (IBM) ● José Castaños (IBM) ● Christos Georgiou (IBM) ● ET International, Inc. ● Our sponsors ● Alban Douillet ● Brice Dobry ● Ge Gan ● Geoff Gerfin ● John Tully ● Weirong Zhu ● Wesley Toland ● Ziang Hu ● Fei Chen ● Hirofumi Sakane ● Yuhei ● Vishal Karna
