
The Open Source ProtoFlex Simulator



  1. The Open Source ProtoFlex Simulator Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at RAMP Retreat, June 2009

  2. The ProtoFlex Simulator
     • History
       – Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs
     • Key Features
       – Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)
       – Using hybrid simulation, runs real server apps + Solaris OS
       – Employs multithreading to virtualize # CPUs per FPGA core
     [Figure labels: Hybrid Simulation, Virtualization]

  3. Open Sourcing ProtoFlex
     • Why open source?
       – Demonstration of FPGAs as viable architecture research vehicle
       – Facilitate adoption of hybrid simulation & host multithreading
       – Encourage building on top of our work
     • What are we releasing?
       – Bluespec source HDL, Verilog and pre-generated netlists for SPARCV9 CPU model + interfaces
       – XUPV5 Reference Design for EDK 10.1
       – Virtutech Simics plug-ins for hybrid simulation
       – Top-level SW controller, user command-line interface
       – Documentation through online wiki

  4. Outline
     • Motivation
     • The ProtoFlex Simulator
       – High level components
       – UltraSPARC core model
     • Using ProtoFlex
     • XUPV5 Reference Design
     • Distribution Details

  5. The ProtoFlex Simulator
     • User perceives familiar SW-like UltraSPARC III simulator
       – Software user interface similar to Simics
       – Applications load directly from Simics checkpoints
       – Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring

  6. The ProtoFlex Simulator
     [Figure: system diagram. Labeled components: User, Linux PC (PFMON, SIMICS (I/O)), Ethernet, FPGA (FPGA Core, Interface, PowerPC (or uBlaze), Main Memory).]

  7. Our UltraSPARC III Core Model
     • ISA Specifications
       – 64-bit SPARCV9 ISA + US III extensions
       – 8 register windows, 4 global register files
       – 512-entry D-TLB, 128-entry I-TLB
     • Implementation
       – 14-stage, multi-threaded pipeline, switch context on each cycle
       – On Virtex-5, XST ~148MHz, placed & routed @ 100MHz
       – Parameterized non-blocking caches
       – FP + rare MMU instructions are SW-emulated by nearby uBlaze
       – 100% mirrors Virtutech Simics model
     [Figure: pipeline diagram. Stage labels: Context Scheduler, I-TLB Stage 1, I-TLB Stage 2, I-Fetch Address Generate, I-Fetch Tag Check, US III Decoder, 64-bit ALU Stage 1, 64-bit ALU Stage 2, D-TLB Stage 1, D-TLB Stage 2, D-TLB Stage 3, D-Cache Address Generate, D-Cache Tag Check, Multi-Cycle Instruction Unit, Writeback. Side labels: Nonblocking I-cache (BRAM), Integer RF (BRAM), Nonblocking D-cache (BRAM), Arbiter to DDR Memory.]
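The "switch context on each cycle" bullet can be pictured with a small software model. The sketch below is only an illustration of plain round-robin interleaving; the function names are hypothetical, and the slides do not say whether the real context scheduler is strictly round-robin. It shows that with 16 contexts, two instructions from the same context are issued 16 cycles apart, which is longer than the 14-stage pipeline depth.

```python
from collections import deque

PIPELINE_DEPTH = 14  # stage count taken from the slide


def interleave(num_contexts, cycles):
    """Return the context ID issued on each cycle under round-robin scheduling."""
    ready = deque(range(num_contexts))
    issued = []
    for _ in range(cycles):
        ctx = ready.popleft()  # take the next context in order
        issued.append(ctx)
        ready.append(ctx)      # it is eligible again after all the others
    return issued


def min_reissue_gap(schedule):
    """Smallest number of cycles between two issues from the same context."""
    last_seen = {}
    gaps = []
    for cycle, ctx in enumerate(schedule):
        if ctx in last_seen:
            gaps.append(cycle - last_seen[ctx])
        last_seen[ctx] = cycle
    return min(gaps)


schedule = interleave(num_contexts=16, cycles=64)
gap = min_reissue_gap(schedule)
print(schedule[:18])
print("gap:", gap, "covers pipeline depth:", gap >= PIPELINE_DEPTH)
```

In this model, a context's previous instruction has left the pipeline before its next one enters whenever the number of contexts is at least the pipeline depth.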

  8. Core Design Statistics
     • Runs 100MHz on V5
       – Synthesizes up to 148MHz using standard tools (ISE XST)
     • Logic usage
       – 23.5 KLUTs (11.3% LX330T)
     • BRAM usage
       – 120 BRAMs for 16-context configuration (37% LX330T)
     • Future optimizations
       – Paging structures to SRAM or DRAM can reduce BRAM by significant amount
       – Will release in future updates

  9. Outline
     • Motivation
     • The ProtoFlex Simulator
     • Using ProtoFlex
     • XUPv5 Reference Design
     • Distribution

  10. Using ProtoFlex
     • Add passive monitors
       – Counters, histograms
       – Roll your own Trackers
     • Trace-based simulation
       – Collect dynamic traces
       – Feed traces to functional-first timing model
     • Sampled Program Monitoring
       – Use MicroBlaze (or PPC) to monitor core/memory state
       – Unintrusive profiling w/o changes to target SW
     [Figure: the core pipeline diagram annotated with Counters, Histogram, and Tracker monitors, a Timing Model, and an FPGA Hard/Soft Core (PowerPC or MicroBlaze).]
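As a rough software analogue of the passive monitors listed on this slide, the sketch below keeps a counter and a bucketed histogram over a stream of observed values. The class name, the bucket size, and the latency trace are all hypothetical; the actual monitors are hardware modules attached to the pipeline and are not shown in the slides.

```python
from collections import Counter


class PassiveMonitor:
    """Observes values without modifying them (hypothetical software analogue)."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.count = 0
        self.histogram = Counter()

    def observe(self, value):
        # Count the event and record which bucket its value falls into.
        self.count += 1
        self.histogram[value // self.bucket_size] += 1


# Made-up trace of latencies in cycles, for illustration only.
latencies = [3, 12, 15, 40, 41, 7]
monitor = PassiveMonitor(bucket_size=10)
for latency in latencies:
    monitor.observe(latency)

print(monitor.count, dict(monitor.histogram))
```

The monitor only reads the values it is given, which mirrors the "passive" property: the observed stream is the same whether or not the monitor is attached.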

  11. Applications of ProtoFlex
     • Examples
       – Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]
       – Real-time stack trace profiling
       – CMP interconnect model (in progress)
       – Realistic CPU traffic generators (in progress)
     • … running real 16-CPU server workloads
       – Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K
     [Figure labels: Piranha CMP Cache (First-Order Timing Model); Statistics + Warmed Coherency & Tag States]

  12. What does the RTL look like?
     • We use Bluespec SystemVerilog (high-level, synthesizable HDL)
       – 4-8 weeks learning curve for normal HDL users
       – Once learned, easier to read/modify than conventional RTL
       – Requires BSV compiler (free for academics)
       – Paper in MEMOCODE’09 describes BSV coding/validation of core
     • Sample code:

       2-stage ALU:

         rule split_ALU_pipeline (True);
           …
           p1 = piperegs[DECODE];
           piperegs[ALU1] <= doALUStage1(p1, alu_ifc);
           p2 = piperegs[ALU1];
           piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
           …
         endrule

       1-stage ALU:

         rule merged_ALU_pipeline (True);
           …
           p1 = piperegs[DECODE];
           p_tmp1 = doALUStage1(p1, alu_ifc);
           p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
           piperegs[ALU] <= p_tmp2;
           …
         endrule
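The difference between the split and merged rules can be mimicked in software. The following Python model is a sketch under assumptions: `alu_stage1` and `alu_stage2` are placeholder functions standing in for `doALUStage1` and `doALUStage2`, and each loop iteration stands for one clock cycle in which all pipeline registers update together. It illustrates that both variants compute the same results, while the split version holds each result one cycle longer.

```python
def alu_stage1(x):
    return x + 1  # placeholder for the first half of the ALU work


def alu_stage2(x):
    return x * 2  # placeholder for the second half of the ALU work


def run_split(inputs):
    """2-stage ALU: a register sits between stage 1 and stage 2."""
    alu1 = alu2 = None
    outputs = []
    for item in list(inputs) + [None, None]:  # extra cycles drain the pipeline
        # Both registers update from the values held in the previous cycle.
        new_alu2 = alu_stage2(alu1) if alu1 is not None else None
        new_alu1 = alu_stage1(item) if item is not None else None
        alu1, alu2 = new_alu1, new_alu2
        outputs.append(alu2)
    return outputs


def run_merged(inputs):
    """1-stage ALU: both halves run back to back in a single cycle."""
    outputs = []
    for item in list(inputs) + [None]:
        alu = alu_stage2(alu_stage1(item)) if item is not None else None
        outputs.append(alu)
    return outputs


split_out = run_split([1, 2, 3])
merged_out = run_merged([1, 2, 3])
print(split_out)   # per-cycle contents of the ALU2 register
print(merged_out)  # per-cycle contents of the ALU register
```

Merging shortens latency by one cycle at the cost of a longer combinational path per cycle, which is the usual trade-off behind making pipeline depth a parameter.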

  13. Other Simulator Features
     • Changing Core Parameters
       – Number of CPU contexts
       – Cache sizes
       – Merge/split pipeline stages
       – Enable/disable modules for profiling & debugging
       – Clock frequency (tested @ 10 MHz – 100 MHz)
       – Set optimal LUTRAM size (16 = V2P, 64 = V5)
       – Choose LUTRAMs or BRAMs for any CPU state
     • System Parameters
       – UDP or TCP/IP (for PFMON-to-FPGA communication)
       – XUPv5, BEE2
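One way to picture this parameter space is as a small configuration record with sanity checks. The sketch below is purely illustrative: the class, field names, and default values are assumptions made for this example and do not reflect how the ProtoFlex sources actually expose these settings. Only the allowed ranges and choices (10-100 MHz tested clock, LUTRAM sizes 16 or 64, UDP or TCP/IP, XUPv5 or BEE2) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    # All field names and defaults are hypothetical.
    num_contexts: int = 16
    cache_kb: int = 64
    clock_mhz: int = 100
    lutram_size: int = 64     # slide: 16 = V2P, 64 = V5
    transport: str = "UDP"    # slide: UDP or TCP/IP
    platform: str = "XUPv5"   # slide: XUPv5, BEE2

    def problems(self):
        """Return a list of human-readable issues; empty means plausible."""
        issues = []
        if self.num_contexts < 1:
            issues.append("need at least one CPU context")
        if not 10 <= self.clock_mhz <= 100:
            issues.append("clock outside the tested 10-100 MHz range")
        if self.lutram_size not in (16, 64):
            issues.append("LUTRAM size should be 16 (V2P) or 64 (V5)")
        if self.transport not in ("UDP", "TCP/IP"):
            issues.append("transport should be UDP or TCP/IP")
        if self.platform not in ("XUPv5", "BEE2"):
            issues.append("platform should be XUPv5 or BEE2")
        return issues


print(CoreConfig().problems())
print(CoreConfig(clock_mhz=150, lutram_size=32).problems())
```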

  14. Outline
     • Motivation
     • The ProtoFlex Simulator
     • Using ProtoFlex
     • XUPv5 Reference Design
     • Distribution

  15. Platform Release: XUPV5
     • Why XUPv5?
       – Inexpensive (~$750), easily accessible
       – Standard tool flows (EDK, ISE)
       – Reference design portable to other platforms – just drop in our ‘pcores’
     • Supporting other platforms
       – Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)
       – Plan to release with future updates

  16. Required Equipment
     [Figure: a Linux PC (PFMON + Simics) connected over Ethernet to an FPGA Board (BlueSPARC).]


  18. XUPv5 Overview
     • Virtex-5 LX110T
     • DDR2 Memory – up to 2GB
     • 1MB SRAM
     • 1Gbps Ethernet
     • 3Gbps SATA
     • Serial Port

  19. Reference Design Block Diagram
     [Figure: block diagram of the XUPv5 reference design. Labeled components: BRAM (×3), Ethernet, Serial Port, PLB, BlueSPARC, MicroBlaze, Multi-Port Memory Controller, SRAM Controller, LX110T, SRAM, DRAM.]

  20. BlueSPARC
     • EDK IP core
       – connects to PLB & NPI
     • Runs @ 100MHz
     • 4 CPU contexts
     • 64KB I&D L1 caches
     [Figure: reference design block diagram]


  22. BlueSPARC BRAM
     • 81% utilization
       – Core 51% (76 out of 148)
       – Rest 30% (45 out of 148)
     [Figure: reference design block diagram]
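The BRAM percentages on this slide follow directly from the block counts. The short check below uses only the numbers given (76 and 45 blocks out of 148); the helper name `pct` is introduced here for illustration.

```python
TOTAL_BRAMS = 148  # block count given on the slide
CORE_BRAMS = 76
REST_BRAMS = 45


def pct(blocks):
    """Share of the available BRAM blocks, in percent."""
    return 100 * blocks / TOTAL_BRAMS


core_pct = pct(CORE_BRAMS)
rest_pct = pct(REST_BRAMS)
total_pct = pct(CORE_BRAMS + REST_BRAMS)
print(f"core {core_pct:.1f}%  rest {rest_pct:.1f}%  total {total_pct:.1f}%")
```

The core and rest shares round to the 51% and 30% shown on the slide. Their unrounded sum is slightly under 82%, so the quoted 81% corresponds to adding the two rounded figures.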


  24. Ethernet
     • 4 MB/sec bandwidth
     • 350 usec RTT latency
     • Socket Abstraction
       – using LWIP RAW interface
     [Figure: reference design block diagram]
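The two Ethernet figures can be combined into a first-order cost estimate for a PFMON-to-FPGA exchange. This is a sketch under stated assumptions: "4 MB/sec" is read as 4 × 10^6 bytes per second, each exchange is modeled as one round trip plus serialization time, and protocol overheads are ignored. The function name is hypothetical.

```python
BANDWIDTH_BYTES_PER_S = 4e6  # "4 MB/sec" from the slide, read as 10^6 bytes
RTT_S = 350e-6               # "350 usec RTT latency" from the slide


def message_time_s(payload_bytes):
    """One round trip plus serialization time; ignores protocol overheads."""
    return RTT_S + payload_bytes / BANDWIDTH_BYTES_PER_S


# Payload size at which serialization time equals the round-trip latency.
crossover_bytes = BANDWIDTH_BYTES_PER_S * RTT_S

for size in (64, 1400, 64 * 1024):
    print(f"{size:>6} B -> {message_time_s(size) * 1e6:8.1f} us")
print(f"crossover ~ {crossover_bytes:.0f} B")
```

Under this model, exchanges smaller than roughly 1400 bytes are dominated by latency rather than bandwidth, which suggests batching small requests where possible.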


  26. DDR2 Memory Controller
     • 1.5GB/s peak BW
     • 115ns latency
     • Multiple ports/interfaces
     [Figure: reference design block diagram]



  29. Linux PC
     • Software requirements
       – SuSE Linux 10.1
       – CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)
       – Simics 3.0.22
       – Hybrid simulation plug-in modules
       – ProtoFlex MONitor tool (PFMON)

  30. Linux PC
     • Runs PFMON (ProtoFlex MONitor)
       – Orchestrates communication between Simics & BlueSPARC
       – Provides CLI interface to simulator (like Simics Console)
     • Runs Simics
       – Handles I/O, FPGA Core and memory initialization
     [Figure: Linux PC running Simics and PFMON, connected to BlueSPARC.]
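The division of labor in hybrid simulation can be summarized as a routing decision. The sketch below is a hypothetical illustration assembled from statements in the slides: FP and rare MMU instructions are software-emulated on the MicroBlaze, Simics handles I/O, and everything else runs on the BlueSPARC pipeline. The event names and the `route` function are invented for this example and are not part of PFMON.

```python
from enum import Enum


class Handler(Enum):
    FPGA_CORE = "BlueSPARC pipeline"
    MICROBLAZE = "MicroBlaze software emulation"
    SIMICS = "Simics on the Linux PC"


def route(kind):
    """Pick a handler for an event kind (kind names are hypothetical)."""
    if kind in {"fp", "rare_mmu"}:
        return Handler.MICROBLAZE  # slides: SW-emulated by nearby uBlaze
    if kind == "io":
        return Handler.SIMICS      # slides: Simics handles I/O
    return Handler.FPGA_CORE       # common case stays on the FPGA core


for kind in ("alu", "load", "fp", "rare_mmu", "io"):
    print(f"{kind:>8} -> {route(kind).value}")
```

The design intent visible in the slides is that frequent behavior stays in hardware, while infrequent or complex behavior is delegated to software.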

  31. From RTL to Running System
     • Step 1: Bluespec → Verilog
       – Bluespec compiler
       – ~30 minutes
     • Step 2: Verilog → Bitstream
       – Xilinx EDK
       – ~3 hours
     • Step 3: Bitstream → Working System
       – Stream mem. image over Ethernet
       – ~5 minutes (for 512MB image)
     [Figure: flow from Bluespec code (1) to Verilog code (2) to Bitstream (3) to Working System.]
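Adding up the three step times gives a rough end-to-end turnaround from a source change to a running system. The sketch uses only the approximate durations quoted on this slide; the variable names are illustrative, and the effective image-streaming rate is derived from the "~5 minutes for 512MB" figure rather than measured.

```python
# Approximate step durations in minutes, as quoted on the slide.
STEP_MINUTES = {
    "Bluespec -> Verilog": 30,
    "Verilog -> Bitstream": 180,
    "Bitstream -> Working System": 5,
}

total_minutes = sum(STEP_MINUTES.values())
hours, minutes = divmod(total_minutes, 60)

IMAGE_MB = 512
effective_mb_per_s = IMAGE_MB / (STEP_MINUTES["Bitstream -> Working System"] * 60)

print(f"total: {hours} h {minutes} min")
print(f"effective image streaming rate: {effective_mb_per_s:.2f} MB/s")
```

Bitstream generation accounts for most of the total, so changes that avoid re-running step 2 are much cheaper to try than RTL changes.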
