HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation


  1. HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation
     Rakesh Kumar¹, José Cano¹, Aleksandar Brankovic², Demos Pavlou³, Kyriakos Stavrou³, Enric Gibert⁴, Alejandro Martínez⁵, Antonio González⁶
     ¹ University of Edinburgh, UK; ² Intel; ³ 11pets; ⁴ Pharmacelera; ⁵ ARM; ⁶ Universitat Politècnica de Catalunya, Spain
     IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Santa Rosa, California, USA, April 24-25, 2017

  2. Outline
     • HW/SW co-designed processors
     • Building a simulation infrastructure
     • DARCO
     • Evaluation
     • Conclusions

  3. HW/SW co-designed processors
     [Figure: software stacks compared. Conventional processor: application programs, libraries, operating system, ISA, execution hardware. HW/SW co-designed processor: application programs, libraries, operating system, guest ISA, Translation Optimization Layer (software), host ISA, execution hardware.]
     • Simple host ISA, for energy efficiency
       – In-order cores; move complexity to software layer
     • Dynamic binary optimizations in software (TOL), for performance
       – Aggressive and speculative
       – Exploit application behavior at runtime

  4. HW/SW co-designed processors: History
     • IBM DAISY (1997)
       – Targets binary compatibility from PowerPC to VLIW architectures
     • IBM BOA (1999)
       – Targets high-frequency PowerPC through simple hardware design
     • Transmeta Crusoe (2000) and Efficeon (2003)
       – Execute x86 binaries on proprietary VLIW with low power consumption
       – Better energy efficiency than Intel Pentium III
     • Nvidia Denver (2014)
       – Executes ARMv8 binaries on proprietary in-order core
       – With dynamic optimizations applied, matches out-of-order Intel Haswell
     [Figure: timeline with DAISY at 1997, BOA at 1999, and further marks at 2000, 2003 and 2014]

  5. HW/SW co-designed processors: History
     • IBM DAISY (1997)
       – Targets binary compatibility from PowerPC to VLIW architectures
     • IBM BOA (1999)
       – Targets high-frequency PowerPC through simple hardware design
     • Transmeta Crusoe (2000) and Efficeon (2003)
       – Execute x86 binaries on proprietary VLIW with low power consumption
       – Better energy efficiency than Intel Pentium III
     • Nvidia Denver (2014)
       – Executes ARMv8 binaries on proprietary in-order core
       – With dynamic optimizations applied, matches out-of-order Intel Haswell
     [Figure: timeline with DAISY at 1997, BOA at 1999, and further marks at 2000, 2003 and 2014]
     Anything missing? No major academic project! Can lack of simulation infrastructure be the reason?

  6. Outline
     • Introduction
     • HW/SW co-designed processors
     • Building a simulation infrastructure
     • DARCO
     • Evaluation
     • Conclusions

  7. What will a simulation infrastructure enable?
     • Where to implement (HW or SW?) microarchitectural features like
       – Instruction decoding/reordering, memory disambiguation, register renaming, …
     [Figure: co-designed processor stack: application programs, libraries, operating system, guest ISA, Translation Optimization Layer (software), host ISA, execution hardware]

  8. What will a simulation infrastructure enable?
     • Where to implement (HW or SW?) microarchitectural features like
       – Instruction decoding/reordering, memory disambiguation, register renaming, …
     • How to reduce “startup delay”
       – One of the major problems of Transmeta products
     • When and where to translate/optimize the guest binaries
       – As soon as code becomes “hot”? Wait for a core to become idle? …
     • How to address speculative execution (memory, control)
       – Checkpointing granularity? Find points susceptible to speculation failure?
     • When and how to profile the execution
       – Overhead vs. opportunity for improvement

  9. What will a simulation infrastructure enable?
     • Where to implement (HW or SW?) microarchitectural features like
       – Instruction decoding/reordering, memory disambiguation, register renaming, …
     • How to reduce “startup delay”
       – One of the major problems of Transmeta products
     • When and where to translate/optimize the guest binaries
       – As soon as code becomes “hot”? Wait for a core to become idle? …
     • How to address speculative execution (memory, control)
       – Checkpointing granularity? Find points susceptible to speculation failure?
     • When and how to profile the execution
       – Overhead vs. opportunity for improvement
     A simulation infrastructure can help evaluate trade-offs and design choices
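The "translate as soon as code becomes hot" option can be illustrated with a per-block execution counter. This is only a minimal sketch under stated assumptions: `HotnessProfiler`, `promoted_blocks` and the threshold value are hypothetical names and numbers chosen for illustration, not taken from the slides or from any real TOL.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical value; a real threshold is a tuning choice


class HotnessProfiler:
    """Counts executions per guest basic block and flags each hot block once."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.promoted = set()

    def record(self, block_addr):
        """Return True exactly once, when the block reaches the threshold."""
        self.counts[block_addr] += 1
        if block_addr in self.promoted:
            return False
        if self.counts[block_addr] >= self.threshold:
            self.promoted.add(block_addr)
            return True
        return False


def promoted_blocks(trace, threshold=HOT_THRESHOLD):
    """Replay a trace of block addresses; list blocks in promotion order."""
    profiler = HotnessProfiler(threshold)
    return [addr for addr in trace if profiler.record(addr)]
```

Raising the threshold delays optimization (less translation work on cold code), while lowering it optimizes earlier at the cost of more time spent in the software layer; weighing that trade-off is the kind of question a simulator is meant to answer.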

  10. Simulation infrastructure: Complexity
      • Compilation framework
        – Code analysis/translation
        – Optimizations
        – Code generation
      • Runtime system
        – Profiling and instrumentation
        – Profile-guided optimizations
      • Microarchitectural simulator
        – Model components like pipeline, caches, …
        – Allow sampling

  11. Simulation infrastructure: Complexity
      • Compilation framework
        – Code analysis/translation
        – Optimizations
        – Code generation
      • Runtime system
        – Profiling and instrumentation
        – Profile-guided optimizations
      • Microarchitectural simulator
        – Model components like pipeline, caches, …
        – Allow sampling
      Simulation infrastructure complexity = Compilation framework + Runtime system + Microarchitectural simulator

  12. Simulation infrastructure: Requirements
      • Correctness
        – It should not change program behavior
      • Minimum software layer (TOL) overhead
        – TOL execution time must be small
      • Minimum emulation cost
        – Host to guest instruction ratio must be low
      • Support for multiple guest ISAs (front-ends)
        – Enables wider applicability
      • Plug and play support
        – Easy to include/evaluate new features
      • Debugging
        – Strong debug toolchain
      [Figure: multiple guest ISA front-ends (x86, ARM, Power) on top of the TOL, which runs on the hardware]
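The two cost requirements above can be made concrete with a little arithmetic. The sketch below is illustrative only: `RunStats` and its field names are hypothetical, and it uses host instruction counts as a stand-in for TOL execution time, which is an assumption and not necessarily the metric definition used by the authors.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    guest_insts: int     # guest (e.g. x86) instructions emulated
    host_app_insts: int  # host instructions executing translated application code
    host_tol_insts: int  # host instructions executing the TOL itself

    def emulation_cost(self) -> float:
        """Host instructions of translated code per guest instruction."""
        if self.guest_insts <= 0:
            raise ValueError("guest_insts must be positive")
        return self.host_app_insts / self.guest_insts

    def tol_overhead(self) -> float:
        """Fraction of all host instructions spent inside the TOL."""
        total = self.host_app_insts + self.host_tol_insts
        if total <= 0:
            raise ValueError("no host instructions recorded")
        return self.host_tol_insts / total
```

For example, 1,000 guest instructions translated into 1,500 host instructions, plus 500 host instructions of TOL work, give an emulation cost of 1.5 and a TOL overhead of 0.25.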

  13. Outline
      • Introduction
      • HW/SW co-designed processors
      • Building a simulation infrastructure
      • DARCO: Infrastructure for Research on HW/SW Co-designed Processors
      • Evaluation
      • Conclusions

  14. DARCO: The big picture
      [Figure: DARCO block diagram. x86 component: x86 binary, x86 OS, x86 functional emulator, authoritative x86 register and memory state. Co-designed component: Translation Optimization Layer (TOL), RISC functional emulator, emulated x86 register and memory state. Also shown: process tracker, state checker, controller and timing simulator, linked by data-and-instruction paths and commands paths.]
      • Models a processor that executes x86 code on a RISC host architecture
      • Four main components: Co-designed, x86, Timing Simulator, Controller

  15. DARCO: Co-designed Component
      [Figure: DARCO block diagram (x86 component, co-designed component, timing simulator, controller)]
      • Models the functionality of a HW/SW co-designed processor
      • Composed of TOL and host ISA functional emulator (user code)
      • Maintains emulated x86 architectural and memory states

  16. DARCO: x86 Component
      [Figure: DARCO block diagram (x86 component, co-designed component, timing simulator, controller)]
      • Provides a full-system functional emulator for the guest x86 ISA
      • Maintains authoritative x86 architectural and memory states
      • Filters instruction stream and passes user code to co-designed component

  17. DARCO: Timing Simulator
      [Figure: DARCO block diagram (x86 component, co-designed component, timing simulator, controller)]
      • Models a parameterized in-order core
      • Can distinguish application and TOL code
      • Includes power and energy modelling (McPAT)

  18. DARCO: Controller
      [Figure: DARCO block diagram (x86 component, co-designed component, timing simulator, controller), with the user connected to the controller]
      • Provides full control over the app execution and debugging utilities
      • Compares authoritative and emulated x86 states to ensure correctness
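The state comparison performed by the controller can be pictured as a diff between two snapshots. The following is a minimal sketch, assuming each state is a plain dictionary from register name (or memory address) to value; `diff_state` is a hypothetical helper, not part of DARCO.

```python
def diff_state(authoritative, emulated):
    """Return {name: (authoritative_value, emulated_value)} for every mismatch.

    A key missing on one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(authoritative) | set(emulated)):
        auth_value = authoritative.get(name)
        emu_value = emulated.get(name)
        if auth_value != emu_value:
            mismatches[name] = (auth_value, emu_value)
    return mismatches
```

An empty result means the emulated state agrees with the authoritative one at that point; any entry names the register or address where the two diverge, which is a natural starting point for debugging a translation or optimization.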

  19. DARCO: Starting execution
      [Figure: x86 component (XYZ exec space: code, data, reg file), co-designed component (TOL: code, reg file, XYZ data), tracker, controller and user, connected by commands]
      • Tracker: tracks the emulated application and passes user-level code to the co-designed component
      • Startup sequence
        – User: executes application XYZ
        – Controller: sends the proper command to the x86 component
        – x86 OS: starts application XYZ
        – Tracker: identifies application XYZ
        – Controller: requests the 1st code page from the x86 component and sends it to the TOL with the initial state
        – TOL: loads the code page and starts emulating
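The startup handshake can be replayed as an ordered event log. This is a rough sketch only: the function name, the message strings and the dictionary standing in for the TOL are all hypothetical, and attributing the code-page request to the controller is an assumption, since the slide text does not label that step unambiguously.

```python
def start_application(app_name, first_code_page, initial_state):
    """Replay the startup handshake; return (event log, TOL-side view)."""
    log = [
        ("user", f"execute {app_name}"),
        ("controller", "send start command to x86 component"),
        ("x86 OS", f"start {app_name}"),
        ("tracker", f"identify {app_name}"),
    ]
    # Assumption: the controller fetches the first code page from the x86
    # component and forwards it together with the initial state.
    tol = {"code_pages": [first_code_page], "state": dict(initial_state)}
    log.append(("controller", "forward first code page and initial state to TOL"))
    log.append(("TOL", "load code page and start emulating"))
    return log, tol
```

The point of the sketch is the ordering: the TOL only begins emulating after it has received both a code page and a copy of the initial state.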
