FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović
https://fires.im  @firesimproject  sagark@eecs.berkeley.edu
The new datacenter hardware environment
• The end of Moore’s Law: Custom Silicon in the Cloud
• Faster networks, e.g. Silicon Photonics
• New datacenter architectures, e.g. disaggregation
• Deeper memory/storage hierarchies, e.g. 3DXPoint, HBM [1]
Disaggregated Datacenters
[Diagram from Gao et al., OSDI’16]
…and custom HW is changing faster than ever
• FPGAs
• Agile HW Design for ASICs [2]
What does our simulator need to do?
• Model hardware at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
• Run real software:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Be productive/usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems, architecture: real HW/SW co-design
Comparing existing HW “simulation” systems
(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator
A HW-accelerated DC simulator: DIABLO
• DIABLO, ASPLOS’15 [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • Part of RAMP collaboration [8]
• Need to hand-write abstract RTL models
  • Harder than writing “tapeout-ready” RTL
  • Need to validate against real HW
• Tied to an expensive custom host platform
  • $100k+ host platform, custom built
[Photo: DIABLO Prototype]
Comparing existing HW “simulation” systems
• Taping-out excels at:
  • Modeling reality: “single source of truth”
  • Scalability
• Hardware-accelerated simulators excel at:
  • Simulation rate
  • Ability to run real workloads (as fn. of sim rate)
• Software-based simulators excel at:
  • Ease-of-use
  • Ease-of-rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection
Useful trends throughout the architect’s stack
• Open ISA
• Open, Silicon-Proven SoC Implementations
• High-Productivity Hardware Design Language w/IR
• FPGAs in the Cloud
FireSim at a high level
• Server Simulations
  • Inherent parallelism: lots of gates
  • We have tapeout-proven RTL: automatically FAME-1 transform
  • Put RTL-derived sims on the FPGAs
• Network simulation
  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + host network
[Diagram: an f1.16xlarge instance. The host CPU runs the Switch Model, FPGAs (x8) run the Server Simulation(s), FPGAs attach to the host over Host PCIe, and instances connect over Host Ethernet (EC2 Network).]
Now, let’s build a datacenter-scale FireSim simulation!
Step 1: Server SoC in RTL
[Diagram: Server Blade Sim. containing 4 Rocket Cores (each with L1I and L1D), a shared L2, a NIC, and Other Peripherals. The build adds DRAM, a NIC Sim Endpoint, Other Periph. Sim Endpoints, and PCIe to Host.]
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
Step 2: FPGA Simulation of one server blade
[Diagram: FPGA Fabric containing the Server Blade Sim. (4 Rocket Cores with L1I/L1D, shared L2, NIC, Other Peripherals), a DRAM Model, a NIC Sim Endpoint, and Other Periph. Sim Endpoints, with FPGA-attached DRAM and PCIe to Host. The whole unit is labeled Server Blade Simulation.]
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
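The Sim Rate figures can be put in perspective against the 3.2 GHz modeled clock. The following is a minimal sketch, assuming the quoted rates are simulated target cycles per wall-clock second; the function names are illustrative and not part of FireSim, and the rates are the approximate values from the slide.

```python
# Slowdown of an FPGA-hosted simulation relative to the modeled hardware.
# Assumption: "Sim Rate" means target cycles simulated per wall-clock second.

TARGET_CLOCK_HZ = 3.2e9  # modeled Rocket core clock from the slide


def slowdown(sim_rate_hz, target_clock_hz=TARGET_CLOCK_HZ):
    """Return how many times slower than real time the simulation runs."""
    return target_clock_hz / sim_rate_hz


def wall_clock_seconds(target_seconds, sim_rate_hz,
                       target_clock_hz=TARGET_CLOCK_HZ):
    """Return the wall-clock time needed to simulate `target_seconds`."""
    return target_seconds * slowdown(sim_rate_hz, target_clock_hz)


if __name__ == "__main__":
    # Approximate rates quoted for a single simulated blade.
    for label, rate_hz in [("standalone", 150e6), ("networked", 40e6)]:
        print(f"{label}: {slowdown(rate_hz):.1f}x slowdown, "
              f"{wall_clock_seconds(1.0, rate_hz):.1f} s per simulated second")
```

Under these assumptions, one simulated second of a networked blade takes on the order of a minute or two of wall-clock time, which is what makes booting Linux and running real applications practical.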
Step 3: FPGA Simulation of 4 server blades
[Diagram: one FPGA hosting 4 Server Blade Simulations, each with its own DRAM channel. The build collapses this into an "FPGA (4 Sims)" block.]
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Cost:
- $0.49 per hour (spot)
- $1.65 per hour (on-demand)
Step 4: Simulating a 32 node rack
[Diagram: 8 "FPGA (4 Sims)" blocks attached to a Host Instance CPU that runs the ToR Switch Model.]
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2 us links
Resource Util.
- 8 FPGAs = 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Cost:
- $2.60 per hour (spot)
- $13.20 per hour (on-demand)
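The resource and cost numbers for the rack follow from simple packing arithmetic. The following is a rough sketch using only the figures on these slides (4 simulations per FPGA, 8 FPGAs per f1.16xlarge, and the quoted hourly prices, which are a snapshot and will have changed). The helper names are hypothetical, and the sketch ignores any additional host instances that a larger multi-rack topology might need for switch models.

```python
import math

# Figures taken from the slides; prices are a point-in-time snapshot.
SIMS_PER_FPGA = 4            # "(4 Sims)" per FPGA
FPGAS_PER_INSTANCE = 8       # 8 FPGAs = 1x f1.16xlarge
ON_DEMAND_PER_HOUR = 13.20   # USD per f1.16xlarge, on-demand
SPOT_PER_HOUR = 2.60         # USD per f1.16xlarge, spot


def size_rack(num_blades):
    """Return (fpgas, instances) needed to host `num_blades` simulations."""
    fpgas = math.ceil(num_blades / SIMS_PER_FPGA)
    instances = math.ceil(fpgas / FPGAS_PER_INSTANCE)
    return fpgas, instances


def cost_per_blade_hour(num_blades, price_per_instance_hour):
    """Return the hourly cost per simulated blade in USD."""
    _, instances = size_rack(num_blades)
    return instances * price_per_instance_hour / num_blades


if __name__ == "__main__":
    fpgas, instances = size_rack(32)
    print(f"32 blades -> {fpgas} FPGAs on {instances} f1.16xlarge")
    print(f"on-demand: ${cost_per_blade_hour(32, ON_DEMAND_PER_HOUR):.4f} "
          f"per blade-hour")
    print(f"spot:      ${cost_per_blade_hour(32, SPOT_PER_HOUR):.4f} "
          f"per blade-hour")
```

For the 32-node rack this reproduces the slide's "8 FPGAs = 1x f1.16xlarge" and works out to well under a dollar per simulated blade per hour at either price.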
Cycle-accurate Network Modeling
• For global cycle-accuracy, send a token on each link for each cycle, in each direction
• Each direction of a link has <link latency in cycles> tokens in-flight
  • e.g. 6400 tokens in flight on link for 2 us link latency @ 3.2 GHz
• Each token is <desired bandwidth / clock frequency> bits wide
  • e.g. 200 Gbps / 3.2 GHz ≈ 64 bit wide token sent per cycle
• Target transport agnostic (we provide Ethernet switch models)
• Host transport agnostic (shared mem, sockets, PCIe)
• Can “downgrade” to a zero-perf-impact functional network model (150+ MHz)
[Diagram: a Switch Port connected through a Link Model to the NIC Top-Level I/O on the FPGA. Each direction carries 64b tokens, with 6400 tokens in flight.]
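The token scheme above can be illustrated with a toy model. The following is a minimal sketch, not FireSim's implementation: it computes the two quantities from the slide and models one link direction as a FIFO pre-filled with empty tokens, so that a token entering at cycle t leaves at cycle t + latency. The class and function names are made up for illustration.

```python
from collections import deque


def tokens_in_flight(link_latency_s, clock_hz):
    """Link latency expressed in target cycles (one token per cycle)."""
    return round(link_latency_s * clock_hz)


def token_width_bits(bandwidth_bps, clock_hz):
    """Bits that must move per target cycle to sustain the bandwidth."""
    return bandwidth_bps / clock_hz


class LinkDirection:
    """One direction of a link, modeled as a fixed-latency token FIFO."""

    def __init__(self, latency_cycles):
        # Pre-fill with empty tokens so the receiver can always dequeue
        # exactly one token per cycle from cycle 0 onward.
        self.fifo = deque([None] * latency_cycles)

    def step(self, token_in):
        """Advance one cycle: enqueue the sender's token and return the
        token the receiver sees this cycle (None means an empty token)."""
        self.fifo.append(token_in)
        return self.fifo.popleft()


if __name__ == "__main__":
    print(tokens_in_flight(2e-6, 3.2e9))    # 2 us link at 3.2 GHz
    print(token_width_bits(200e9, 3.2e9))   # 62.5, which the slide rounds to 64

    # Toy link with a 4-cycle latency: data appears 4 cycles after sending.
    link = LinkDirection(latency_cycles=4)
    outputs = [link.step(f"flit{t}") for t in range(6)]
    print(outputs)
```

Because every endpoint consumes and produces exactly one token per target cycle, the simulations stay globally synchronized regardless of how fast each host executes; the pre-filled tokens are what let each side run ahead by up to one link latency before it has to wait.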