Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff
ISCA 2018, Session 5A, June 5, 2018, Los Angeles, CA
[Figure slides: transistor-scaling and datacenter-power trends, building to a question mark.]
Sources: "Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016
Transistor scaling stops. Chip specialization runs out of steam. What's next?
Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/
Observation I: The Density of Emerging Memories Is Projected to Increase (ITRS logic roadmap)
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white): t=0 sec, 0% recurrence; t=2 sec, 38% recurrence; t=4 sec, 61% recurrence
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
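The block-level recurrence figures above can be reproduced in spirit with a small sketch (not the paper's mechanism): hash fixed-size tiles of each frame and count how many tiles are byte-identical to the same tile in the previous frame. Frame layout and tile size here are illustrative assumptions.

```python
import hashlib

def block_hashes(frame, block=8):
    """Split a 2D frame (list of rows of pixel values 0..255) into
    block x block tiles and hash each tile's bytes."""
    h, w = len(frame), len(frame[0])
    hashes = {}
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = bytes(frame[y][x]
                         for y in range(by, min(by + block, h))
                         for x in range(bx, min(bx + block, w)))
            hashes[(by, bx)] = hashlib.sha1(tile).digest()
    return hashes

def recurrence(prev, cur, block=8):
    """Fraction of tiles in `cur` whose hash matches the same tile in `prev`."""
    hp, hc = block_hashes(prev, block), block_hashes(cur, block)
    same = sum(1 for k in hc if hp.get(k) == hc[k])
    return same / len(hc)

# Two 8x8 "frames" where only one of the four 4x4 tiles changes:
f0 = [[0] * 8 for _ in range(8)]
f1 = [row[:] for row in f0]
f1[0][0] = 255
print(recurrence(f0, f1, block=4))  # → 0.75
```

A real encoder compares blocks after motion compensation; this sketch only shows why temporally close frames yield high reuse rates.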
▪ Search term commonality retrieves similar content: "intercontinental downtown los angeles" and "hotel in downtown los angeles near intercontinental" return largely the same results
Source: Google
▪ Power laws suggest highly recurrent processing of popular content
Source: Twitter
COREx: Compute-Reuse Architecture for Accelerators
Memoization: tables store past computation outputs, so the outputs of recurring inputs are reused instead of recomputed.
[Diagram: host processors and an accelerator (core + scratchpad) share an LLC/NoC; the DMA engine issues an input lookup to compute-reuse storage, and on a hit the fetched result replaces the core's computed result.]
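The memoization idea on this slide can be sketched in software. This is a hypothetical analogue, not the COREx hardware: a table maps a hash of the accelerator's input to the previously computed output, and on a hit the kernel is skipped entirely.

```python
import hashlib
import pickle

class MemoTable:
    """Software analogue of a compute-reuse table: hash of input -> stored output."""
    def __init__(self):
        self.table = {}
        self.hits = self.misses = 0

    def _key(self, inputs):
        # Hash the serialized inputs so table keys are small and fixed-size.
        return hashlib.sha256(pickle.dumps(inputs)).digest()

    def compute(self, inputs, kernel):
        k = self._key(inputs)
        if k in self.table:            # reuse: return the stored output
            self.hits += 1
            return self.table[k]
        self.misses += 1
        out = kernel(inputs)           # recompute on the "accelerator"
        self.table[k] = out            # install for future reuse
        return out

memo = MemoTable()
square_sum = lambda xs: sum(x * x for x in xs)
memo.compute((1, 2, 3), square_sum)   # miss: computes 14
memo.compute((1, 2, 3), square_sum)   # hit: returns stored 14
print(memo.hits, memo.misses)  # → 1 1
```

The reuse win is exactly the skipped `kernel` call on a hit; the hardware questions (where the table lives, how cheap the lookup is) are what the next slides address.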
Architectural Guidelines
▪ Accelerator memoization is natural:
o Little or no additional programming effort
o Built-in input-compute-output flow
▪ But not straightforward:
o High lookup costs
o Unnecessary accesses
o High access costs
▪ COREx key ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
Goal: extend specialization with workload-specific memoization
[Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine, alongside a general-purpose CMP and shared LLC.]
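Of the three key ideas, lookup filtering is the least self-explanatory. One standard way to realize it (used here as an illustrative stand-in, not necessarily the paper's exact filter) is a small Bloom filter consulted before the expensive memoization storage: it can say "definitely not present" cheaply, so most misses never touch the large table.

```python
import hashlib

class BloomFilter:
    """Cheap membership pre-check: no false negatives, rare false positives."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes):
        # Derive `hashes` independent bit positions from one keyed digest each.
        for i in range(self.hashes):
            d = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(d[:4], "big") % self.bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add(b"input-hash-A")
print(bf.might_contain(b"input-hash-A"))  # → True
print(bf.might_contain(b"input-hash-B"))  # almost certainly False
```

A `False` answer lets the accelerator start computing immediately, avoiding a useless access to the banked storage; only `True` answers pay for a real lookup.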
Top-Level Architecture
▪ New modules, attached over a COREx interconnect:
o Input Hashing Unit (IHU)
o Input Lookup Unit (ILU): associative cache of input hashes
o Computation History Table (CHT): RAM-array table, fetched on a hit
[Diagram: baseline SoC (accelerator core, DMA engine, scratchpad, general-purpose CMP, shared LLC, SoC interconnect) extended with the IHU, ILU, and CHT on memory chips.]
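The three units compose into one lookup flow, sketched below in software under stated simplifications: the IHU is a hash function, the ILU is modeled as a plain set standing in for the associative cache of hashes, and the CHT as a dict standing in for the RAM-array table of stored outputs. The toy kernel is hypothetical.

```python
import hashlib

def ihu(inputs: bytes) -> bytes:
    """Input Hashing Unit: reduce the input block to a fixed-size hash."""
    return hashlib.sha256(inputs).digest()

class Corex:
    def __init__(self, kernel):
        self.kernel = kernel
        self.ilu = set()   # Input Lookup Unit: which input hashes are present
        self.cht = {}      # Computation History Table: hash -> stored output

    def run(self, inputs: bytes):
        h = ihu(inputs)
        if h in self.ilu:              # ILU hit: fetch result from the CHT
            return self.cht[h], "hit"
        out = self.kernel(inputs)      # miss: run the accelerator kernel
        self.ilu.add(h)
        self.cht[h] = out              # install for future reuse
        return out, "miss"

cx = Corex(kernel=lambda b: bytes(x ^ 0xFF for x in b))  # toy "accelerator"
print(cx.run(b"\x01\x02")[1])  # → miss
print(cx.run(b"\x01\x02")[1])  # → hit
```

Splitting the ILU (small, associative, fast) from the CHT (large, RAM-array, banked) mirrors the slide's structure: the common case of a miss is decided in the small structure, and the large one is touched only on hits.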