The Open Source ProtoFlex Simulator
Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, Ken Mai
Computer Architecture Lab
RAMP Retreat, June 2009
The ProtoFlex Simulator
• History
  – Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs
• Key Features
  – Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)
  – Using hybrid simulation, runs real server apps + Solaris OS
  – Employs multithreading to virtualize the number of CPUs per FPGA core
[Figure panels: Hybrid Simulation; Virtualization]
Open Sourcing ProtoFlex
• Why open source?
  – Demonstration of FPGAs as viable architecture research vehicle
  – Facilitate adoption of hybrid simulation & host multithreading
  – Encourage building on top of our work
• What are we releasing?
  – Bluespec source HDL, Verilog and pre-generated netlists for SPARCV9 CPU model + interfaces
  – XUPv5 Reference Design for EDK 10.1
  – Virtutech Simics plug-ins for hybrid simulation
  – Top-level SW controller, user command-line interface
  – Documentation through online wiki
Outline
• Motivation
• The ProtoFlex Simulator
  – High level components
  – UltraSPARC core model
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution Details
The ProtoFlex Simulator
• User perceives familiar SW-like UltraSPARC III simulator
  – Software user interface similar to Simics
  – Applications load directly from Simics checkpoints
  – Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring
The ProtoFlex Simulator
[Figure: system diagram. A Linux PC (user interface via PFMON, Simics for I/O) connects over Ethernet to the FPGA, which holds the FPGA core, a PowerPC (or MicroBlaze), and main memory.]
Our UltraSPARC III Core Model
• ISA Specifications
  – 64-bit SPARCV9 ISA + US III extensions
  – 8 register windows, 4 global register files
  – 512-entry D-TLB, 128-entry I-TLB
• Implementation
  – 14-stage, multi-threaded pipeline, switch context on each cycle
  – On Virtex-5: XST ~148MHz, placed & routed @ 100MHz
  – Parameterized non-blocking caches
  – FP + rare MMU instructions are SW-emulated by nearby MicroBlaze
  – 100% mirrors Virtutech Simics model
[Figure: pipeline diagram. Context Scheduler; I-TLB Stages 1-2; I-Fetch Address Generate; I-Fetch Tag Check; Nonblocking I-cache (BRAM); US III Decoder; Integer RF (BRAM); 64-bit ALU Stages 1-2; D-TLB Stages 1-3; D-Cache Address Generate; D-Cache Tag Check; Nonblocking D-cache (BRAM); Multi-Cycle Instruction Unit; Writeback; Arbiter to DDR Memory]
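The "switch context on each cycle" bullet can be illustrated with a small Python model. This is only a sketch: it assumes strict round-robin issue, which the slides do not confirm as the actual Context Scheduler policy, and the function names are made up for the example. It shows how many instructions from one context can be in the pipeline at once for a given context count and pipeline depth.

```python
def interleave(num_contexts: int, pipeline_depth: int, cycles: int):
    """Return, for each cycle, the context occupying each stage (None = empty).

    Assumption: strict round-robin issue, one new context per cycle.
    """
    stages = [None] * pipeline_depth
    history = []
    for cycle in range(cycles):
        issued = cycle % num_contexts      # next context in round-robin order
        stages = [issued] + stages[:-1]    # every instruction advances one stage
        history.append(list(stages))
    return history


def max_in_flight_per_context(history):
    """Largest number of stages any single context occupies in any cycle."""
    return max(
        max(row.count(ctx) for ctx in set(row) if ctx is not None)
        for row in history
    )


if __name__ == "__main__":
    # 16 contexts on a 14-stage pipeline: never more than one instruction
    # per context in flight under this policy.
    print(max_in_flight_per_context(interleave(16, 14, 64)))
    # 4 contexts on the same pipeline: several instructions per context overlap.
    print(max_in_flight_per_context(interleave(4, 14, 64)))
```

Under this assumed policy, a context never overlaps with itself once the context count reaches the pipeline depth.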
Core Design Statistics
• Runs at 100MHz on Virtex-5
  – Synthesizes up to 148MHz using standard tools (ISE XST)
• Logic usage
  – 23.5 KLUTs (11.3% of LX330T)
• BRAM usage
  – 120 BRAMs for 16-context configuration (37% of LX330T)
• Future optimizations
  – Moving paging structures to SRAM or DRAM could reduce BRAM usage significantly
  – Will release in future updates
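The percentages above can be turned back into device totals as a quick sanity check. The sketch below uses only the numbers on the slide; the helper names are illustrative, and the replica estimate assumes nothing else on the chip consumes LUTs or BRAMs, which is not true of a full system.

```python
def implied_capacity(used: float, fraction: float) -> float:
    """Device total implied by 'used' being 'fraction' of the device."""
    return used / fraction


def max_copies(fractions) -> int:
    """Upper bound on replicas, limited by the scarcest resource.

    Assumption: the replicas are the only consumers of each resource.
    """
    return min(int(1 / f) for f in fractions)


lut_total = implied_capacity(23_500, 0.113)   # LUTs on the LX330T implied by the slide
bram_total = implied_capacity(120, 0.37)      # BRAMs on the LX330T implied by the slide

if __name__ == "__main__":
    print(f"implied LUTs:  {lut_total:,.0f}")
    print(f"implied BRAMs: {bram_total:.0f}")
    print(f"core replicas (upper bound): {max_copies([0.113, 0.37])}")
```

The result shows that BRAM, not logic, is the limiting resource in the 16-context configuration, which is consistent with the slide's focus on reducing BRAM usage.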
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
Using ProtoFlex
• Add passive monitors
  – Counters, histograms
  – Roll your own
• Trace-based simulation
  – Collect dynamic traces
  – Feed traces to functional-first timing model
• Sampled Program Monitoring
  – Use MicroBlaze (or PPC) to monitor core/memory state
  – Unintrusive profiling w/o changes to target SW
[Figure: the core pipeline diagram annotated with Counters, Histogram and Tracker monitors on the pipeline stages, a Timing Model beside the arbiter to DDR memory, and an FPGA hard/soft core (PowerPC or MicroBlaze)]
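As a software analogy for the passive monitors listed above, the following Python sketch attaches counters and a histogram to a stream of events without altering the stream. The event format (`kind`, `latency`) and the class name are hypothetical and chosen for the example; the slides do not describe the monitor interface in the RTL.

```python
from collections import Counter


class PassiveMonitor:
    """Counts events and buckets latencies; never modifies what it observes."""

    def __init__(self, bucket_width: int):
        self.bucket_width = bucket_width
        self.event_counts = Counter()        # counter per event kind
        self.latency_histogram = Counter()   # bucket lower bound -> occurrences

    def observe(self, event: dict) -> None:
        self.event_counts[event["kind"]] += 1
        if "latency" in event:
            bucket = (event["latency"] // self.bucket_width) * self.bucket_width
            self.latency_histogram[bucket] += 1


def run_with_monitor(events, monitor: PassiveMonitor):
    """Pass events through unchanged while the monitor watches them."""
    for event in events:
        monitor.observe(event)
        yield event


if __name__ == "__main__":
    trace = [
        {"kind": "dcache_miss", "latency": 115},
        {"kind": "dcache_hit", "latency": 2},
        {"kind": "dcache_miss", "latency": 130},
    ]
    monitor = PassiveMonitor(bucket_width=50)
    assert list(run_with_monitor(trace, monitor)) == trace  # stream is untouched
    print(dict(monitor.event_counts))
    print(dict(monitor.latency_histogram))
```

Because the monitor only reads events, it can be added or removed without changing the behavior of the simulated target, which is the property the "passive" label refers to.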
Applications of ProtoFlex
• Examples
  – Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]
  – Real-time stack trace profiling
  – CMP interconnect model (in progress)
  – Realistic CPU traffic generators (in progress)
• … running real 16-CPU server workloads
  – Oracle TPC-C, IBM DB2 TPC-C, TPC-H, SPEC2K
[Figure: Piranha CMP Cache (First-Order Timing Model), producing Statistics + Warmed Coherency & Tag States]
What does the RTL look like?
• We use Bluespec SystemVerilog (high-level, synthesizable HDL)
  – 4-8 week learning curve for typical HDL users
  – Once learned, easier to read/modify than conventional RTL
  – Requires BSV compiler (free for academics)
  – Paper in MEMOCODE’09 describes BSV coding/validation of core
• Sample code:

    // 2-stage ALU
    rule split_ALU_pipeline (True);
      …
      p1 = piperegs[DECODE];
      piperegs[ALU1] <= doALUStage1(p1, alu_ifc);
      p2 = piperegs[ALU1];
      piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
      …
    endrule

    // 1-stage ALU
    rule merged_ALU_pipeline (True);
      …
      p1 = piperegs[DECODE];
      p_tmp1 = doALUStage1(p1, alu_ifc);
      p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
      piperegs[ALU] <= p_tmp2;
      …
    endrule
Other Simulator Features
• Changing Core Parameters
  – Number of CPU contexts
  – Cache sizes
  – Merge/split pipeline stages
  – Enable/disable modules for profiling & debugging
  – Clock frequency (tested from 10 MHz to 100 MHz)
  – Set optimal LUTRAM size (16 = V2P, 64 = V5)
  – Choose LUTRAMs or BRAMs for any CPU state
• System Parameters
  – UDP or TCP/IP (for PFMON-to-FPGA communication)
  – XUPv5, BEE2
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
Platform Release: XUPv5
• Why XUPv5?
  – Inexpensive (~$750), easily accessible
  – Standard tool flows (EDK, ISE)
  – Reference design portable to other platforms: just drop in our ‘pcores’
• Supporting other platforms
  – Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)
  – Plan to release with future updates
Required Equipment
[Figure: a Linux PC (PFMON + Simics) connected over Ethernet to an FPGA board (BlueSPARC)]
XUPv5 Overview
• Virtex-5 LX110T
• DDR2 memory, up to 2GB
• 1MB SRAM
• 1Gbps Ethernet
• 3Gbps SATA
• Serial Port
Reference Design Block Diagram
[Figure: block diagram. On the LX110T: BlueSPARC, MicroBlaze, PLB, BRAMs, Ethernet, Serial Port, Multi-Port Memory Controller, SRAM Controller. Off-chip on the XUPv5 board: SRAM, DRAM.]
BlueSPARC
• EDK IP core
  – connects to PLB & NPI
• Runs @ 100MHz
• 4 CPU contexts
• 64KB I&D L1 caches
BlueSPARC
• 81% BRAM utilization on the LX110T
  – Core 51% (76 out of 148)
  – Rest 30% (45 out of 148)
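The utilization figures can be recomputed from the raw counts on the slide. This short sketch uses only those counts and shows that the quoted 81% is the sum of the two truncated percentages, while the exact combined figure is slightly higher.

```python
TOTAL_BLOCKS = 148   # denominator quoted on the slide
CORE_BLOCKS = 76     # used by the BlueSPARC core
REST_BLOCKS = 45     # used by the rest of the reference design


def percent(used: int, total: int) -> float:
    """Utilization as a percentage of the device total."""
    return 100.0 * used / total


core_pct = percent(CORE_BLOCKS, TOTAL_BLOCKS)
rest_pct = percent(REST_BLOCKS, TOTAL_BLOCKS)
total_pct = percent(CORE_BLOCKS + REST_BLOCKS, TOTAL_BLOCKS)

if __name__ == "__main__":
    print(f"core:  {core_pct:.1f}%")
    print(f"rest:  {rest_pct:.1f}%")
    print(f"total: {total_pct:.1f}% ({CORE_BLOCKS + REST_BLOCKS} of {TOTAL_BLOCKS})")
```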
Ethernet
• 4 MB/sec bandwidth
• 350 usec RTT latency
• Socket Abstraction
  – using lwIP RAW interface
DDR2 Memory Controller
• 1.5GB/s peak BW
• 115ns latency
• Multiple ports/interfaces
Linux PC
• Software requirements
  – SuSE Linux 10.1
  – CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)
  – Simics 3.0.22
  – Hybrid simulation plug-in modules
  – ProtoFlex MONitor tool (PFMON)
Linux PC
• Runs PFMON (ProtoFlex MONitor)
  – Orchestrates communication between Simics & BlueSPARC
  – Provides a command-line interface to the simulator (like the Simics console)
• Runs Simics
  – Handles I/O, FPGA core initialization, and memory initialization
[Figure: Linux PC running Simics and PFMON, with PFMON connected to BlueSPARC]
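The division of labor in hybrid simulation, as described across these slides, can be summarized as a routing decision: common instructions run on the FPGA core, FP and rare MMU instructions are emulated in software on the MicroBlaze, and I/O is handled by Simics on the PC. The Python sketch below only illustrates that routing. The instruction class names and the function are hypothetical and do not reflect the PFMON or BlueSPARC interfaces.

```python
from enum import Enum


class Executor(Enum):
    FPGA_CORE = "BlueSPARC pipeline on the FPGA"
    MICROBLAZE = "software emulation on the MicroBlaze"
    SIMICS = "Simics on the Linux PC"


# Hypothetical instruction classes, grouped per the slides' description.
_SOFTWARE_EMULATED = {"fp", "rare_mmu"}
_IO = {"io"}


def choose_executor(instr_class: str) -> Executor:
    """Pick where an instruction class executes in this illustrative model."""
    if instr_class in _IO:
        return Executor.SIMICS
    if instr_class in _SOFTWARE_EMULATED:
        return Executor.MICROBLAZE
    return Executor.FPGA_CORE   # everything else stays in hardware


if __name__ == "__main__":
    for cls in ["alu", "load", "fp", "rare_mmu", "io"]:
        print(f"{cls:9s} -> {choose_executor(cls).value}")
```

The intent of such a split is that only frequent behaviors need to be implemented in hardware, while rare ones fall back to slower but complete software.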
From RTL to Running System
• Step 1: Bluespec code → Verilog code
  – Bluespec compiler
  – ~30 minutes
• Step 2: Verilog code → Bitstream
  – Xilinx EDK
  – ~3 hours
• Step 3: Bitstream → Working System
  – Stream memory image over Ethernet
  – ~5 minutes (for 512MB image)
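A back-of-the-envelope check ties the step times together with the 4 MB/sec Ethernet bandwidth quoted on the Ethernet slide. The sketch below uses only figures from the slides. Treating the bandwidth as sustained and ignoring protocol overhead is an assumption, so the computed transfer time is a lower bound rather than a prediction.

```python
STEP_MINUTES = {
    "bluespec_to_verilog": 30,    # Bluespec compiler
    "verilog_to_bitstream": 180,  # Xilinx EDK, ~3 hours
    "load_memory_image": 5,       # stream 512MB image over Ethernet
}


def transfer_seconds(image_mb: float, bandwidth_mb_per_s: float = 4.0) -> float:
    """Lower bound on transfer time, assuming the link sustains its quoted bandwidth."""
    return image_mb / bandwidth_mb_per_s


def effective_mb_per_s(image_mb: float, minutes: float) -> float:
    """Average throughput implied by a measured load time."""
    return image_mb / (minutes * 60.0)


total_minutes = sum(STEP_MINUTES.values())

if __name__ == "__main__":
    print(f"end-to-end: {total_minutes} min ({total_minutes / 60:.1f} h)")
    print(f"512MB at 4 MB/sec: {transfer_seconds(512):.0f} s minimum")
    print(f"implied by ~5 min: {effective_mb_per_s(512, 5):.2f} MB/sec")
```

The gap between the 128-second lower bound and the quoted ~5 minutes suggests the load step includes work beyond raw data transfer, although the slides do not say what.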