
SonicBOOM: The Third Generation Berkeley Out-of-Order Machine



  1. SonicBOOM – The Third Generation Berkeley Out-of-Order Machine Jerry Zhao, Ben Korpan, Abe Gonzalez, Krste Asanovic UC Berkeley jzh@berkeley.edu

  2. Goal of the BOOM project
     [Figure: core counts and widths: 2x 7-wide OOO "Vortex"; 72x 8-wide OOO "Skylake"; 4x 10-wide OOO "Sunny Lake"; 2x 3-wide OOO "Tempest"; 2x 9-wide OOO "Typhoon"; 4x 3-wide OOO "Tempest"]
     General-purpose performance is important across the entire computing ecosystem.
     BOOM Goals:
     • Build a high-performance open-source RISC-V out-of-order core
     • Support research in various aspects of high-performance SoC design (microarch, security, accelerators, etc.)

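  3. [Figure: BOOMv1 and BOOMv2 pipeline diagrams]
     • BOOMv1: 7-cycle branch-mispredict penalty. Labeled stages: fetch, fetch, dec, queues, iss, rrd, exec, wb; memory pipe: tlb, D$, D$, wb. Predictors: BTB, GShare.
     • BOOMv2: 10-cycle branch-mispredict penalty, 4-cycle load-use. Labeled stages: fetch, fetch, fetch, dec, dis, queues, iss, rrd, exec, wb; memory pipe: iss, rrd, tlb, D$, D$, wb. Predictors: BTB, GShare.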

  4. Open-source Performance Gap
     [Figure: bar chart of CoreMark/MHz, axis 0 to 9]
     • Ivy Bridge: 12+ stage, 4-wide OOO, 8.5 CoreMark/MHz
     • XuanTie 910: 12-stage, 3-wide OOO, 7.1 CoreMark/MHz
     • SiFive U74: 8-stage, 2-wide in-order, 5.1 CoreMark/MHz
     • WD SWERV: 9-stage, 2-wide in-order, 4.9 CoreMark/MHz
     • BOOMv1: 8-stage, 4-wide OOO, 4.9 CoreMark/MHz
     • BOOMv2: 10-stage, 4-wide OOO, 3.2 CoreMark/MHz
     • Rocket: 5-stage, 1-wide in-order, 2.3 CoreMark/MHz

  5. [Figure: BOOMv1, BOOMv2, and BOOMv3 pipeline diagrams]
     • BOOMv1: 7-cycle branch-mispredict penalty. Predictors: BTB, GShare.
     • BOOMv2: 10-cycle branch-mispredict penalty, 4-cycle load-use. Predictors: BTB, GShare.
     • BOOMv3 (SonicBOOM): 12-cycle branch-mispredict penalty, 4-cycle load-use. Labeled stages: fetch, fetch, fetch, fetch, dec, dis, queues, issue, rrd, exec, br, wb; memory pipe: issue, rrd, tlb, D$, D$, wb; a third issue pipe feeds a Custom RoCC Accelerator. Frontend blocks: SFB Recoder, uBTB, BTB, RAS, TAGE.

  6. SonicBOOM
     Frontend:
     • New TAGE-L branch predictor
     • New decoders for RISC-V compressed
     Execute:
     • Short-forwards-branch recoding
     • Superscalar branch resolution
     • Improved address-generation pipeline
     • Custom RoCC accelerators
     Memory:
     • Superscalar address generation
     • Superscalar load-store unit
     • Optimized load/store scheduling
     • L1 next-line prefetcher with line-fill buffers

  7. State-of-the-art Branch Prediction
     Challenges:
     • Superscalar fetch/predict
     • Speculative updates
     • Repair after misspeculation
     • Predictor pipelining
     SonicBOOM Instruction Fetch:
     • Variable-width (RVC) decode
     • L0/L1 BTBs
     • Pipelined TAGE + Loop predictor
     • Repaired return-address-stack
     [Figure: frontend block diagram with ICache, Instruction Decode Buffer, Control/Redirect Logic, Branch Predictor Pipeline, Generated Metadata, Global + Local Histories, Update + Repair]
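
The "speculative updates" and "repair after misspeculation" challenges can be illustrated with a much simpler predictor than TAGE-L. The sketch below uses a toy gshare predictor (the scheme named in the BOOMv1/BOOMv2 diagrams), not SonicBOOM's actual design. The class name, table size, and the idea of carrying a history snapshot with each branch are assumptions made for illustration.

```python
class GShare:
    """Toy gshare predictor with speculative history update and repair."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # 2-bit saturating counters, initialised to weakly not-taken.
        self.counters = [1] * (1 << index_bits)
        self.ghist = 0  # speculative global history register

    def _index(self, pc, hist):
        # Shift by 1 because RVC instructions are 2-byte aligned.
        return ((pc >> 1) ^ hist) & self.mask

    def predict(self, pc):
        """Predict a branch and update history speculatively.

        Returns (taken, snapshot); the snapshot is the metadata that must
        travel with the branch so that history can be repaired later.
        """
        snapshot = self.ghist
        taken = self.counters[self._index(pc, snapshot)] >= 2
        self.ghist = ((snapshot << 1) | int(taken)) & self.mask
        return taken, snapshot

    def resolve(self, pc, snapshot, predicted, actual):
        """Train the counter; on a mispredict, repair the global history."""
        i = self._index(pc, snapshot)
        if actual:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        if predicted != actual:
            # Drop history bits from wrong-path branches, insert the true outcome.
            self.ghist = ((snapshot << 1) | int(actual)) & self.mask
```

Restoring from a per-branch snapshot keeps the repair to a single step, at the cost of storing history metadata for every in-flight branch.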

  8. Improving Branch Performance
     Dynamic Predication:
     • Recode short-forwards-branches into "predicated" micro-ops
     • "POWER8"-style
     • 5.1 CM/MHz -> 6.2 CM/MHz
     Superscalar Branch Resolution:
     • BOOMv2: 1 branch/jump unit
     • BOOMv3: Every ALU is a branch unit
     • Correct prediction is cheap, misprediction is expensive
     • Single JMP unit to handle AUIPC/JAL instructions
     • +1 branch latency to find oldest mispredicted branch
     [Figure: BOOMv3 pipeline with four fetch stages, SFB Recoder, uBTB, BTB, RAS, TAGE, and two issue pipes (queue, issue, rrd, exec, br, wb)]
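
As a rough illustration of short-forwards-branch recoding, the sketch below rewrites a branch that skips a few instructions into a flag-setting micro-op followed by predicated micro-ops. The instruction encoding, the `set_flag_*` names, the `MAX_SFB_SHADOW` limit, and the assumption that the skipped region contains no control flow are all hypothetical. The slide does not describe SonicBOOM's recoder at this level of detail.

```python
from dataclasses import dataclass

MAX_SFB_SHADOW = 4  # hypothetical limit on how many instructions may be skipped


@dataclass
class Uop:
    op: str
    args: tuple
    # A predicated micro-op behaves as a no-op when the branch flag says "skip".
    predicated: bool = False


def recode_sfb(insts):
    """Recode short forward branches into predicated micro-ops.

    insts is a list of (op, args). A branch is ("beq" or "bne", (rs1, rs2, skip)),
    where skip is the number of following instructions it jumps over.
    Assumes the skipped instructions contain no branches or jumps.
    """
    out = []
    shadow = 0  # instructions remaining in the current branch shadow
    for op, args in insts:
        if shadow > 0:
            out.append(Uop(op, args, predicated=True))
            shadow -= 1
        elif op in ("beq", "bne") and 0 < args[2] <= MAX_SFB_SHADOW:
            # The branch becomes a micro-op that only computes the predicate.
            out.append(Uop("set_flag_" + op, args[:2]))
            shadow = args[2]
        else:
            out.append(Uop(op, args))
    return out


prog = [
    ("beq", ("x1", "x2", 2)),
    ("addi", ("x3", "x3", 1)),
    ("addi", ("x4", "x4", 1)),
    ("add", ("x5", "x3", "x4")),
]
uops = recode_sfb(prog)
```

After recoding there is no control-flow change left to mispredict: the two `addi` micro-ops always flow through the pipeline and simply take effect or not depending on the flag.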

  9. Advanced Load/Store Unit
     Superscalar memory access:
     • Addr-gen/translate/execute 2 loads per cycle
     • Banked DCache data arrays
     Improved L1 Data Cache:
     • Fully non-blocking (refill in parallel with writeback)
     • Line-fill-buffers with next-line-prefetcher
     • Improved memory scheduler
     [Figure: load/store unit block diagram with Memory Issue Queue, Register-read, DataGen (x3), AddrGen (x2), TLB, Load Queue, Store Queue, DCache Bank0 and Bank1, MSHRs, Probe + writeback, Next-line prefetch, Line Fill Buffers]
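
A minimal sketch of why banking matters for two loads per cycle: two loads can proceed together only if they map to different banks. The 8-byte interleaving, the 64-byte line size, and the greedy retry policy below are assumptions for illustration and are not taken from the slides.

```python
LINE_BYTES = 64  # assumed cache-line size
WORD_BYTES = 8   # assumed bank interleaving granularity
NUM_BANKS = 2    # Bank0 and Bank1, as in the slide's diagram


def bank_of(addr):
    """Bank selected by the address bits just above the word offset."""
    return (addr // WORD_BYTES) % NUM_BANKS


def schedule_loads(addrs):
    """Greedily issue at most one load per bank per cycle.

    Returns a list of cycles, each a list of the addresses issued in it.
    Loads that hit a busy bank retry in a later cycle.
    """
    pending, cycles = list(addrs), []
    while pending:
        busy, issued, retry = set(), [], []
        for addr in pending:
            bank = bank_of(addr)
            if bank in busy:
                retry.append(addr)
            else:
                busy.add(bank)
                issued.append(addr)
        cycles.append(issued)
        pending = retry
    return cycles


def next_line(addr):
    """Address a next-line prefetcher would request after a miss on addr."""
    return (addr // LINE_BYTES + 1) * LINE_BYTES


cycles = schedule_loads([0x00, 0x08, 0x10, 0x20])
```

In this example the first two loads fall in different banks and issue together, while the last two both map to Bank0 and serialize.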

  10. FPGA-accelerated Co-simulation
      Dromajo: simulator developed by Esperanto, checks correctness of RISC-V trace
      Fromajo: couple Dromajo to FireSim FPGA simulation of core
      • Committed instruction stream pulled from core
      • Committed instructions checked against Dromajo at 1 MHz
      • Cycle-exact, reproducible divergences
      • Works with other RISC-V cores (Ex: Ariane)
      [Figure: Test Application and Linux Kernel Image feed both the Dromajo Cosimulator (1 MHz, RISC-V simulation model) and the FireSim Simulation (100 MHz, RISC-V core)]
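
The checking loop behind this kind of co-simulation can be sketched as follows: each committed instruction from the core is compared with what a reference model produces for the same step, and the first mismatch is reported. The `Commit` record fields and the `cosimulate` function are hypothetical names. This is not the Dromajo or FireSim API.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class Commit(NamedTuple):
    """One committed instruction (the fields are an assumed minimal set)."""
    pc: int
    insn: int
    wdata: int


def cosimulate(
    core_commits: Iterable[Commit],
    ref_step: Callable[[], Commit],
) -> Optional[Tuple[int, Commit, Commit]]:
    """Step the reference model once per committed instruction.

    Returns (index, got, want) for the first divergence, or None if the
    core's commit stream matches the reference model throughout.
    """
    for n, got in enumerate(core_commits):
        want = ref_step()
        if got != want:
            return (n, got, want)
    return None


golden = [
    Commit(0x80000000, 0x00100093, 1),
    Commit(0x80000004, 0x00108113, 2),
    Commit(0x80000008, 0x002081B3, 3),
]
# Simulate a core that writes back a wrong value on its second instruction.
buggy = list(golden)
buggy[1] = buggy[1]._replace(wdata=99)
divergence = cosimulate(buggy, iter(golden).__next__)
```

Because the comparison happens at commit, the report points at the first architecturally visible error rather than at a later symptom such as a hang.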

  11. Finding a RISC-V Linux Bug
      Background:
      • PTWs are unordered w.r.t. loads/stores
      • SFENCE.VMA orders page-table updates with accesses
      Found Linux hang with SonicBOOM:
      • Kernel load launches a PTW to recently written PTE
      • No SFENCE between PTE write and PTW
      • Only materializes on a deeply speculating core
      • Patch in-progress
      [Figure: PTW and instruction reads+writes, with a Store-buffer between the instruction path and Memory]
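
The hazard can be sketched with a toy model in which ordinary loads see buffered stores but the page-table walker reads memory directly. The `ToyCore` class and its methods are hypothetical, and modeling SFENCE.VMA as a store-buffer drain is a simplification chosen for illustration, not a description of SonicBOOM's implementation.

```python
class ToyCore:
    """Toy model of a PTE store that is not yet visible to the page-table walker."""

    def __init__(self):
        self.memory = {}        # address -> value, as seen by the walker
        self.store_buffer = []  # (address, value) stores not yet in memory

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Ordinary loads forward from the youngest matching buffered store.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def ptw_read(self, pte_addr):
        # The walker is unordered with respect to loads/stores: it reads
        # memory and does not check the store buffer.
        return self.memory.get(pte_addr, 0)

    def sfence_vma(self):
        # Simplification: make earlier page-table stores visible to later walks.
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()


core = ToyCore()
PTE_ADDR = 0x1000
core.store(PTE_ADDR, 0xABC)           # kernel writes a new PTE
stale = core.ptw_read(PTE_ADDR)       # walk without a fence sees the old PTE
core.sfence_vma()
fresh = core.ptw_read(PTE_ADDR)       # walk after the fence sees the new PTE
```

The window only opens if a walk is launched while the PTE store is still buffered, which is consistent with the slide's note that the bug only materializes on a deeply speculating core.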

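  12. CoreMark IPC
      [Figure: bar chart of CoreMark/MHz, axis 0 to 9]
      • Ivy Bridge: 12+ stage, 4-wide OOO, 8.5 CoreMark/MHz
      • XuanTie 910: 12-stage, 3-wide OOO, 7.1 CoreMark/MHz
      • BOOMv3: 12-stage, 4-wide OOO, 6.2 CoreMark/MHz
      • SiFive U74: 8-stage, 2-wide in-order, 5.1 CoreMark/MHz
      • WD SWERV: 9-stage, 2-wide in-order, 4.9 CoreMark/MHz
      • BOOMv1: 8-stage, 4-wide OOO, 4.9 CoreMark/MHz
      • BOOMv2: 10-stage, 4-wide OOO, 3.2 CoreMark/MHz
      • Rocket: 5-stage, 1-wide in-order, 2.3 CoreMark/MHz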
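
To put the chart's numbers in relative terms, the short calculation below computes per-clock ratios between cores using only the CoreMark/MHz values listed on the slide. The dictionary and the `relative` helper are names introduced here for illustration.

```python
# CoreMark/MHz values as listed on the slide.
coremark_per_mhz = {
    "Ivy Bridge": 8.5,
    "XuanTie 910": 7.1,
    "BOOMv3": 6.2,
    "SiFive U74": 5.1,
    "WD SWERV": 4.9,
    "BOOMv1": 4.9,
    "BOOMv2": 3.2,
    "Rocket": 2.3,
}


def relative(core, baseline):
    """Per-clock CoreMark performance of core relative to baseline."""
    return coremark_per_mhz[core] / coremark_per_mhz[baseline]


v3_over_v2 = round(relative("BOOMv3", "BOOMv2"), 2)
v3_over_ivy = round(relative("BOOMv3", "Ivy Bridge"), 2)
```

By these figures BOOMv3 delivers roughly 1.9x the per-clock CoreMark score of BOOMv2 and about 0.73x that of Ivy Bridge.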

  13. SPEC17 Comparison
      • Evaluate SPEC17 intspeed, single-core performance
      • Target comparable branch-prediction accuracy and IPC
      Systems compared (Intel Xeon / AWS Graviton / SonicBOOM):
      • Microarchitecture: Skylake Server / Cortex A72 / BOOMv3
      • Branch Predictor: Undisclosed / Undisclosed / TAGE-L
      • L1 Cache Sizes (I/D): 64/64 KB / 48/32 KB / 32/32 KB
      • L2 Cache Size: 1 MB / 2 MB / 512 KB
      • L3 Cache Size: 24 MB / 0 MB / 4 MB
      • Compiler: gcc / gcc / gcc
      • OS: Ubuntu 18.04 Server / Ubuntu 18.04 / Buildroot Linux
      • Platform: AWS EC2 bare-metal / AWS EC2 bare-metal / FireSim simulation

  14. SPEC17 Branch Prediction Accuracy
      Equivalent to A72

  15. SPEC17 IPC

  16. Next steps
      Physical Implementation:
      • > 1 GHz possible according to preliminary results
      • Critical path in issue-units (issue-select/compaction)
      • Current SRAMs limit us to 1.4 GHz
      Improving performance:
      • Larger prefetchers between L2/LLC to hide L2 miss penalty
      • Instruction prefetcher
      • V-Extension support
