Software, Architecture, and VLSI Co-Design for Efficient Task-Based Parallel Runtimes

Christopher Torng, Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University


  1. Software, Architecture, and VLSI Co-Design for Efficient Task-Based Parallel Runtimes. Christopher Torng, Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University.

  2. [Section bar: Motivation • Task-Based Parallelism • Voltage Regulation • Rapid ASIC Design • Future Research] Emerging New Contexts Demand Better Hardware. Pushing intelligence to the edge: better local security; faster response times; lower data-movement energy; many more... (Source: Lanner) [Figure labels: Peak Performance; Energy Efficiency; Inference: 2.5 sec; Image: 28 x 28, MNIST dataset; TI MSP430; CR2032 coin; Standby mode: 10+ years. Source: Gobieski ASPLOS'19]

  3. Emerging New Contexts Demand Better Hardware. Cybersecurity; Machine Learning; Graph Analytics; Smart Healthcare; Smart Home; Augmented Reality; Virtual Reality; Autonomous Driving. How can we drastically improve performance and energy efficiency for these new emerging contexts?

  4. Motivating Trends in Computer Architecture. [Figure: log-scale plot over 1975 to 2015 of Transistors (Thousands), SPECint Performance, Frequency (MHz), Typical Power (W), and Number of Cores, with trend annotations ~15%/year and ~9%/year. Labeled processors: MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-Core Opteron, Intel 48-Core Prototype.] Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten.

  5. Excitement After Moore’s Law. [Figure: the computing stack. Layers: Application; Algorithm; Programming Language; Operating System; Compiler; Instruction Set Architecture; Microarchitecture; Register-Transfer Level; Gate Level; Circuits; Devices; Technology. The middle layers are labeled Computer Architecture. Above the stack: Smart Home, Autonomous Driving, AR / VR, Graph Analytics, Smart Healthcare, AI, Cybersecurity. Below the stack: Carbon Nanotubes, Quantum Computing, Biodegradable Computing, Energy Harvesting, Phase-Change Memory, Molecular Computing.]

  6. Building Future Computing Systems that Bridge Software, Architecture, and VLSI. Cross-Stack Co-Design for Task-Based Parallel Runtimes (ISCA’16, MICRO’17, RISCV’18); Cross-Stack Co-Design for Integrated Voltage Regulation (MICRO’14, IEEE TCAS I’18); Cross-Stack Co-Design for Rapid ASIC Design (IEEE MICRO’18, DAC’18, Hotchips’17); Future Research.

  8. Cross-Stack Co-Design for Task-Based Parallelism. Work-Stealing Runtimes; Static Asymmetry (Single-ISA Heterogeneous Architectures); Dynamic Asymmetry (Dynamic Voltage and Frequency Scaling). How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?

  9. Work-Stealing Runtimes. [Figure: per-core task queues on Core 0 to Core 3 holding Task C, Task D, Task E, and Task F; labels: Work in Progress, Steal Task E, Steal Task F.] Work stealing has good performance, space requirements, and communication overheads in both theory and practice. It is supported in many popular concurrency platforms, including Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task Parallel Library, Java’s Fork/Join Framework, and OpenMP.
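
The task-queue picture on the work-stealing slide can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the runtime from the talk: the `Worker` class, the `run` loop, the LIFO-pop/FIFO-steal policy, and random victim selection are all hypothetical choices made for illustration, and tasks here do not spawn children.

```python
from collections import deque
import random


class Worker:
    """One core with its own double-ended task queue (hypothetical model)."""

    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()
        self.done = []  # names of the tasks this worker executed

    def push(self, task):
        self.queue.append(task)

    def pop_local(self):
        # Assumed policy: the owner takes the newest task from the tail.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Assumed policy: a thief takes the oldest task from the victim's head.
        return victim.queue.popleft() if victim.queue else None


def run(workers, rng):
    """Round-robin simulation: each worker runs one task per step and
    steals from a random non-empty victim when its own queue is empty."""
    steals = 0
    while any(w.queue for w in workers):
        for w in workers:
            task = w.pop_local()
            if task is None:
                victims = [v for v in workers if v is not w and v.queue]
                if not victims:
                    continue  # nothing to steal this step
                task = w.steal_from(rng.choice(victims))
                steals += 1
            w.done.append(task)
    return steals


workers = [Worker(i) for i in range(4)]
for name in ["C", "D", "E", "F"]:
    workers[0].push(name)  # all work starts on core 0

steals = run(workers, random.Random(0))
executed = sorted(t for w in workers for t in w.done)
print("steals:", steals, "executed:", executed)
```

Because the three idle cores each find only one possible victim, the run is deterministic: core 0 keeps its newest task and the other three tasks are stolen, one per idle core.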

  10. Static Asymmetry vs. Dynamic Asymmetry. [Figure: energy vs. performance with operating points Fmin @ Vmin, Fnom @ Vnom, and Fmax @ Vmax. Static example: Samsung Exynos Octa Mobile Processor, with four little ARM cores (A7) and four big ARM cores (A15), each cluster with its own L2$. Dynamic example: Integrated Voltage Regulation, shown as Voltage (V) vs. Time (ns) with transitions annotated 150 ns and 120 ns.]

  11. How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime? Related work. Work-Stealing Runtimes and Static Asymmetry (Single-ISA Heterogeneous Architectures): Bender et al., "Online Scheduling of Parallel Programs on Heterogeneous Sys ...", Theory of Computing Systems 2002. Work-Stealing Runtimes and Dynamic Asymmetry (Dynamic Voltage and Frequency Scaling): Ribic et al., "Energy-Efficient Work-Stealing Language Runtimes", ASPLOS 2014. Static and Dynamic Asymmetry: Azizi et al., "Energy-performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis", ISCA 2010.

  12. Cross-Stack Co-Design for Task-Based Parallelism: Work-Stealing Runtimes, Static Asymmetry, Dynamic Asymmetry. Let’s start with some first-order modeling to build intuition. Techniques: Work-Pacing; Work-Mugging; Work-Sprinting. [Diagram: two little (L) and two big (B) cores.]

  13. Building Intuition by Exploring a 1 Big 1 Little System. [Figure: Normalized Power vs. Normalized Instructions Per Second (IPS) for a system with 1 big (B) and 1 little (L) core. Labeled points: Little Core at (1.0, 1.0); Four-Issue Big Core at (2.0, 6.0); further labels 3.0 and 7.0. Annotations: Same Power; 10% Performance Increase; 10% Energy Efficiency Increase.]
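
As a quick check on the numbers in this plot, the two labeled operating points can be combined arithmetically. The sketch below assumes a perfectly parallel workload so that IPS and power simply add; the helper names `efficiency` and `combine` are hypothetical.

```python
# Operating points read off the slide: (normalized IPS, normalized power).
LITTLE = (1.0, 1.0)
BIG = (2.0, 6.0)


def efficiency(ips, power):
    """Instructions per second per watt, in normalized units."""
    return ips / power


def combine(*cores):
    """Aggregate IPS and power, assuming perfectly parallel work."""
    return (sum(c[0] for c in cores), sum(c[1] for c in cores))


system = combine(BIG, LITTLE)
print("system (IPS, power):", system)
print("little IPS/W:", efficiency(*LITTLE))
print("big IPS/W:", round(efficiency(*BIG), 3))
print("system IPS/W:", round(efficiency(*system), 3))
```

The sums come to 3.0 IPS and 7.0 power, which agrees with the 3.0 and 7.0 labels on the slide. In this model the big core delivers twice the throughput of the little core at one third of its IPS per watt.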

  14. The Law of Equi-Marginal Utility. British economist Alfred Marshall (1842 - 1924): "Other things being equal, a consumer gets maximum satisfaction when he allocates his limited income to the purchase of different goods in such a way that the Marginal Utility derived from the last unit of money spent on each item of expenditure tend to be equal." Applied to cores: balance the ratio of utility (IPS) to cost (power). Arbitrage: "Buy Low, Sell High". [Figure: Normalized Power vs. Normalized Instructions Per Second (IPS), with slopes marked at 0.9 V and 1.3 V.]
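
One way to state the balancing rule for a big core and a little core is as a constrained optimization. The symbols below ($V_B$, $V_L$, $P_{\text{target}}$, $\lambda$) are notation assumed here for illustration; the slide gives the rule only in words.

```latex
\begin{aligned}
&\max_{V_B,\,V_L}\ \mathrm{IPS}_B(V_B) + \mathrm{IPS}_L(V_L)
\quad \text{subject to} \quad P_B(V_B) + P_L(V_L) = P_{\text{target}} \\[4pt]
&\text{Stationarity of the Lagrangian gives}\quad
\frac{d\,\mathrm{IPS}_B}{dV_B} = \lambda\,\frac{dP_B}{dV_B},
\qquad
\frac{d\,\mathrm{IPS}_L}{dV_L} = \lambda\,\frac{dP_L}{dV_L} \\[4pt]
&\Longrightarrow\quad
\frac{d\,\mathrm{IPS}_B / dV_B}{dP_B / dV_B}
= \frac{d\,\mathrm{IPS}_L / dV_L}{dP_L / dV_L}
= \lambda
\end{aligned}
```

At an interior optimum the marginal IPS per unit of power is the same on both cores. If one core offers more IPS per extra watt than the other, shifting power toward it raises total throughput at the same budget, which is the "buy low, sell high" arbitrage.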

  15. Systematic Approach for Balancing Marginal Utility. [Figure: Pareto-optimal frontier for the 1 Big 1 Little system, plotted as Normalized Energy Efficiency vs. Normalized IPS; each point is an individual (V_B, V_L) pair; the isopower line at nominal voltage is marked.] Assumptions: perfectly parallel application; ideal load balancing. Marginal utility-based optimization problem. Constraint: isopower line. Objective: maximize performance. (Solved numerically.)
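
The optimization problem on this slide can be sketched numerically. The talk does not give its voltage, frequency, or power model, so everything in the model below is an assumption: IPS is taken to scale linearly with voltage, power cubically with voltage (no leakage), and the search range for V_L is arbitrary. Only the nominal operating points (big: 2.0 IPS, 6.0 power; little: 1.0 IPS, 1.0 power) come from the talk. Function names are hypothetical.

```python
# Nominal operating points from the talk: normalized IPS and power at V = 1.0.
IPS_NOM = {"big": 2.0, "little": 1.0}
PWR_NOM = {"big": 6.0, "little": 1.0}


def ips(core, v):
    # Assumption: frequency, and hence IPS, scales linearly with voltage.
    return IPS_NOM[core] * v


def power(core, v):
    # Assumption: dynamic power ~ V^2 * f, so cubic in V; leakage is ignored.
    return PWR_NOM[core] * v ** 3


def big_voltage_on_isopower(v_little, budget):
    """Solve power(big, v_big) + power(little, v_little) = budget for v_big."""
    remaining = budget - power("little", v_little)
    if remaining <= 0:
        return None  # the little core alone exceeds the budget
    return (remaining / PWR_NOM["big"]) ** (1.0 / 3.0)


def best_pair(budget, v_min=0.70, v_max=1.90, steps=1201):
    """Grid-search V_little; V_big follows from the isopower constraint."""
    best = None
    for i in range(steps):
        v_l = v_min + (v_max - v_min) * i / (steps - 1)
        v_b = big_voltage_on_isopower(v_l, budget)
        if v_b is None:
            continue
        total = ips("big", v_b) + ips("little", v_l)
        if best is None or total > best[0]:
            best = (total, v_b, v_l)
    return best


budget = power("big", 1.0) + power("little", 1.0)    # nominal system power
nominal_ips = ips("big", 1.0) + ips("little", 1.0)   # nominal system IPS
total, v_b, v_l = best_pair(budget)
print(f"V_B={v_b:.3f}  V_L={v_l:.3f}  speedup={total / nominal_ips:.3f}")
```

Under these assumed models the search raises the little core's voltage and lowers the big core's, settling where the marginal IPS per unit power of the two cores is equal (here, V_L close to √3 times V_B). The exact gain depends entirely on the assumed model and should not be read as a result from the talk.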

  16. Cross-Stack Co-Design for Task-Based Parallelism: Work-Stealing Runtimes, Static Asymmetry, Dynamic Asymmetry. Let’s explore three specific techniques to balance marginal utility in a work-stealing runtime: Work-Pacing; Work-Mugging; Work-Sprinting. [Diagram: two little (L) and two big (B) cores.]

  17. Work-Pacing: Building Intuition. Balance performance/power across cores in the high-parallel (HP) region. System with both big cores active and both little cores active (2 Big, 2 Little). [Figures: Aggregate System IPS (Aggregate Throughput) and Marginal IPS/W for the B and L cores, each plotted against V_L with the paired V_B; Normalized Power vs. Normalized IPS. Core-activity legend: Busy, Steal Loop. Axis tick pairs (V_L, V_B): (0.70, 1.04), (1.00, 1.00), (1.30, 0.92), (1.60, 0.76), (1.90, 0.24).]
