
High Performance and Energy Efficient Machine Learning Accelerators - PowerPoint PPT Presentation

High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors (Invited Paper) Ram Krishnamurthy Senior Principal Engineer & Director of High Performance and Low Voltage


  1. High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors (Invited Paper). Ram Krishnamurthy, Senior Principal Engineer & Director of High Performance and Low Voltage Circuits Research Group, Circuit Research, Intel Labs, Hillsboro, Oregon, U.S.A.

  2. Era of Tera-scale Computing: Teraflops of performance operating on Terabytes of data. [Figure: performance (KIPS, MIPS, GIPS, TIPS) versus dataset size (Kilobytes, Megabytes, Gigabytes, Terabytes), with Single-core, Multi-core, and Terascale regimes. Workload labels: Text; Multi-Media; 3D & Video; Recognition, Mining, Synthesis; Models; Model-based Apps; Personal Media Creation and Management; Financial Analytics; Health; Entertainment, learning and virtual travel.]

  3. Motivation: ML in IoT Platforms

  4. Internet of Everything (IoE): Need end-to-end energy efficiency & security

  5. Tera-scale Microprocessors and SoCs. [Figure: block diagram with Special Purpose Engines, Graphics, Video, Cache, Scalable On-die Interconnect Fabric, Last Level Cache, Integrated Memory Controllers, and Off Die interconnect.] Maximize performance & efficiency: workload-based core activation & shutdown; independent V/F control regions; dynamic V/F control; scenario-based power allocation. Deliver best user experience under constraints.

  6. Motivation: IoT Technology Scaling Trends

  7. Moore's Law scaling. [Figure: transistor count (10^3 to 10^9) over time, marking 45nm (2007) and 14nm Trigate (2014).] More, better transistors + more cores. Continued benefits from Moore's Law. Source: Intel

  8. Performance/Energy Scaling Trends. Source: Intel

  9. Key Energy Challenge: Ongoing Scaling Requires Architecture Innovation. Single Core Plateau: single thread performance increased 28x from 1996 to 2004 but only 4.6x from 2004 to 2012; IPC gains now at ~3%/gen. Multicore Scalability Gap: [Figure: # cores (0 to 448) versus process node (45, 32, 22, 14, 10, 7), comparing cores available by process scaling with # cores to achieve 90% max performance (Amdahl's Law); the gap between the curves is labeled Dark Silicon.] System Integration.
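
The "# cores to achieve 90% max performance" curve on slide 9 refers to Amdahl's Law. As a minimal sketch of that relationship: if a fraction p of the work is parallel, the speedup on n cores is 1/((1-p) + p/n), the ceiling is 1/(1-p), and solving for 90% of the ceiling gives n = 9p/(1-p). The function names and the parallel fractions below are illustrative assumptions, not values taken from the slide.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def cores_for_fraction_of_max(p: float, target: float = 0.9) -> float:
    """Cores needed to reach `target` (0..1) of the asymptotic speedup 1/(1-p)."""
    # Solve 1/((1-p) + p/n) = target/(1-p) for n.
    return target * p / ((1.0 - target) * (1.0 - p))


# Hypothetical parallel fractions, chosen only for illustration.
for p in (0.90, 0.95, 0.99):
    n = cores_for_fraction_of_max(p)
    print(f"p={p:.2f}: about {n:.0f} cores, speedup {amdahl_speedup(p, n):.1f}x")
```

Under this model, extra cores beyond that count add less than 10% more performance, which is one way to read the gap the slide labels Dark Silicon.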

  10. Towards Energy Efficient Neuromorphic Computing. [Figure: three panels, Standard Computing, Brain Inspired Computing, and Neuromorphic Computing, with labels CPU, MEM, "if X then … else …", and Biological form.] "Intelligent" Applications

  11. Energy Efficiency Challenge: Neuromorphic Accelerators for Cognitive Computing and Machine Learning. FPGA: good for efficiency, but problematic for SW and system complexity.

  12. Biological Inspiration (iq.intel.com) • Brains exhibit energy efficient intelligence at 20W • One-shot, unsupervised learning & inference, creativity • High parallelism: 100 billion neurons • Rich connectivity: 100 trillion synapses • Supercomputer implementation of brain: ~100 server racks • ~1500x slower, and ~500 million times more power

  13. Neuromorphic Landscape is Growing. [Figure: logos grouped under THEORY, HARDWARE / SOFTWARE / SIMULATION, and APPLICATIONS / SOLUTIONS: Neurithmic … and more …]

  14. NTV Variable Precision FPU. H. Kaul, R. Krishnamurthy et al., ISSCC 2012

  15. K-Nearest Neighbor ML Accelerator. [Figure: block diagram. Labels: Global Control; 1024b Query Object Vector (Q); Reference Vector Storage (Reference Vector 0 to 127); Local Control; Partial Distance Compute; Accumulator (psum 0 to 127, valid 0 to 127); Minimum Sort Network; outputs {minaddr, minprecise, minvalid, minpsum}.] • On-die integrated special-purpose hardware accelerator for matching visual recognition vectors • 128x128x8b vector search for the top "k nearest neighbors" (kNN) • Data-dependent accuracy refinement to increase energy efficiency • Reconfigurable for k and distance metric (Euclidean/Manhattan)

  16. K-Nearest Neighbor ML Accelerator. [Figure: distance compute (q-r)^2 between the 128x8b query vector (q) and reference vector (r), processed from MSB to LSB on a narrow bit-width datapath, followed by a sort of the vector distances to find the nearest neighbor.] ● k-Nearest-Neighbor (kNN): power/performance limiter for computer vision and classification workloads ● Only closely matched vectors require higher precision → Adapt precision per vector to guarantee accuracy ● Majority of vectors eliminated with low precision → Increased performance, reduced area and energy

  17. Iterative Search Space Reduction. [Figure: example kNN operation (Euclidean); search space (valid vectors, 0 to 128) versus sort iteration (1 to 26), with markers where each k-th NN is found.] ● Up to 5.2X higher throughput from early elimination ● Up to 127X reduction for next nearest search space
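
The early-elimination idea on slides 16 and 17 can be illustrated in software. The sketch below is an assumption-laden model, not the circuit described in the talk: it handles only Manhattan distance with k=1, it recomputes the partial distance at every pass instead of accumulating refinements, and the bounding rule and function names (`manhattan_bounds`, `nearest_neighbor`) are hypothetical. Each pass looks at the top bits of every 8b element, derives a lower and upper bound on the true distance, and drops any vector whose lower bound already exceeds the best upper bound.

```python
import random


def manhattan_bounds(query, ref, dropped_bits):
    """Lower/upper bounds on the Manhattan distance using only the top bits."""
    slack = (1 << dropped_bits) - 1  # largest error the ignored low bits can cause
    lo = hi = 0
    for q, r in zip(query, ref):
        coarse = abs((q >> dropped_bits) - (r >> dropped_bits)) << dropped_bits
        lo += max(0, coarse - slack)
        hi += coarse + slack
    return lo, hi


def nearest_neighbor(query, refs, bits=8, step=2):
    """Return (index of nearest reference, valid-vector count after each pass)."""
    valid = list(range(len(refs)))
    history = [len(valid)]
    # Refine from MSBs to LSBs; the final pass drops no bits, so it is exact.
    levels = list(range(bits - step, 0, -step)) + [0]
    for dropped in levels:
        bounds = {i: manhattan_bounds(query, refs[i], dropped) for i in valid}
        best_hi = min(hi for _, hi in bounds.values())
        # A vector whose lower bound exceeds best_hi cannot be the nearest.
        valid = [i for i in valid if bounds[i][0] <= best_hi]
        history.append(len(valid))
    return min(valid, key=lambda i: (bounds[i][0], i)), history


# Illustrative data: 128 random reference vectors of 128 x 8b dimensions.
random.seed(0)
refs = [[random.randrange(256) for _ in range(128)] for _ in range(128)]
query = [random.randrange(256) for _ in range(128)]
index, history = nearest_neighbor(query, refs)
print(index, history)  # history[i] = vectors still valid after pass i
```

Because a vector is only dropped when its lower bound rules it out, the result matches a full-precision search, which mirrors the "same nearest-neighbor result" property the slides claim. How much the search space shrinks depends on the data; uniformly random vectors are a hard case.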

  18. kNN Accelerator Die Micrograph. [Figure: die micrograph, 682µm × 488µm, showing I/O and Control0/Control1, clock, distance compute, Accumulator0/Accumulator1, distributed sort network, and shared vector memory for the 128 × 128-D vector block (8b per dimension).] H. Kaul, R. Krishnamurthy et al., ISSCC 2016. Process: 14nm Tri-gate CMOS. Nominal Operation: 750mV, 338MHz, 25°C. Number of Transistors: 12.2M. Accelerator Area: 0.333mm².

  19. Performance Measurements. [Figure: performance versus k nearest neighbors.] ● 21.5M queries/s and 16 cycles/query (Manhattan, k=1) ● Average latency increase for each successive neighbor: 2 cycles (Manhattan) and 4 cycles (Euclidean)

  20. Power Measurements. [Figure: total power (mW) and total energy/query vector (nJ) versus k nearest neighbors; 14nm CMOS, 338MHz, 750mV, 25°C; annotated 73mW and 3.37nJ/query, 9.7TOPS/W.] ● 73mW total power, 3.37nJ/query (Manhattan, k=1) ● Average energy increase for each successive neighbor: 43pJ (Manhattan) and 87pJ (Euclidean)
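
The headline figures on slides 19 and 20 can be cross-checked with simple arithmetic. This is a rough sketch: the clock, cycle count, power, and throughput come from the slides, while the operation count per query (two operations per 8b dimension over 128 × 128 dimensions) is an assumption made here to relate nJ/query to TOPS/W, not a definition given in the talk.

```python
FREQ_HZ = 338e6          # nominal clock (slides 18 and 20)
CYCLES_PER_QUERY = 16    # Manhattan, k=1 (slide 19)
POWER_W = 73e-3          # total power at 750mV (slide 20)
THROUGHPUT_QPS = 21.5e6  # reported queries/s (slide 19)
# Assumption: 2 ops (difference + accumulate) per dimension, 128 x 128 dimensions.
OPS_PER_QUERY = 2 * 128 * 128


def energy_per_query_nj(power_w, queries_per_s):
    """Energy per query in nanojoules."""
    return power_w / queries_per_s * 1e9


def tops_per_watt(ops_per_query, energy_nj):
    """Tera-operations per second per watt, i.e. ops per picojoule."""
    return ops_per_query / (energy_nj * 1e-9) / 1e12


est_throughput = FREQ_HZ / CYCLES_PER_QUERY
est_energy = energy_per_query_nj(POWER_W, THROUGHPUT_QPS)
est_efficiency = tops_per_watt(OPS_PER_QUERY, 3.37)
print(f"{est_throughput / 1e6:.1f}M queries/s, {est_energy:.2f}nJ/query, "
      f"{est_efficiency:.1f}TOPS/W")
```

Each estimate lands within a few percent of the reported 21.5M queries/s, 3.37nJ/query, and 9.7TOPS/W; the small differences are plausibly rounding in the slide values.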

  21. Supply Voltage Scaling Measurements. [Figure: throughput (million query vectors/s, log scale) and total power (mW) versus supply voltage (350mV to 850mV) for Manhattan and Euclidean, k=1; 14nm CMOS, 25°C.] ● Robust NTV circuits enable 360mV-850mV operation ● 26.4M queries/s, 114mW at 850mV (Manhattan, k=1) ● 1.1M queries/s, 1.44mW at 360mV (Manhattan, k=1)

  22. Energy Scaling Measurements. [Figure: energy versus supply voltage (350mV to 850mV).] ● Peak efficiency of 1.23nJ/query or 26.5TOPS/W at 390mV (near-threshold) → 2.73X improvement over nominal
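
The voltage-scaling numbers on slides 21 and 22 can be turned into energy per query at each reported operating point. The sketch below only divides reported power by reported throughput (Manhattan, k=1); the dictionary layout and names are illustrative assumptions.

```python
# Operating points from the slides: (queries/s, total power in W), Manhattan, k=1.
POINTS = {
    "850mV": (26.4e6, 114e-3),
    "750mV": (21.5e6, 73e-3),
    "360mV": (1.1e6, 1.44e-3),
}


def energy_nj(queries_per_s, power_w):
    """Energy per query in nanojoules."""
    return power_w / queries_per_s * 1e9


energies = {v: energy_nj(*point) for v, point in POINTS.items()}
# Reported nominal energy/query divided by reported peak-efficiency energy/query.
improvement = 3.37 / 1.23

for voltage, nj in energies.items():
    print(f"{voltage}: {nj:.2f}nJ/query")
print(f"nominal vs. 390mV peak: {improvement:.2f}x")
```

The 360mV point works out to roughly 1.31nJ/query, slightly above the 1.23nJ/query reported at 390mV, which is consistent with the minimum-energy point sitting just above the lowest operating voltage.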

  23. "Extreme" energy efficiency. 2W - 100 GigaFLOPS; 20MW - ExaFLOPS. 10 year goal: ~300X improvement in energy efficiency, equal to 20 pJ/FLOP at the system level
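
The 20 pJ figure on slide 23 follows from dividing power by throughput at either end of the scale, reading the slide's two points as 20MW for an ExaFLOPS system and 2W for 100 GigaFLOPS. A small check (the function name is an illustrative choice):

```python
def picojoules_per_flop(power_w, flops_per_s):
    """Energy per floating-point operation in picojoules."""
    return power_w / flops_per_s * 1e12


exascale = picojoules_per_flop(20e6, 1e18)   # 20MW at 1 ExaFLOPS
embedded = picojoules_per_flop(2.0, 100e9)   # 2W at 100 GigaFLOPS
print(f"{exascale:.0f} pJ/FLOP, {embedded:.0f} pJ/FLOP")
```

Both points correspond to the same system-level energy per operation, which is why the slide states a single target.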

  24. Intelligence Inside

  25. kNN Accelerator Organization ● Iteratively search for kNN within 128x128-D vectors ● Distant vectors eliminated in early iterations ● Reconfigurable for Manhattan and Euclidean distance

  26. Organization: Distance Compute ● Narrow single-cycle datapath for distance compute ● Accumulate computed refinement to distance

  27. kNN Operation: Adaptive Precision ● Data-dependent precision for each vector → Reduces required compute and sort operations ● Same nearest-neighbor result as full-precision

  28. Average Search Space Reduction. [Figure: search space reduction (×) versus k nearest neighbor (k = 2 to 10) for Euclidean and Manhattan.] ● 10X-18X average reduction of starting search space for next nearest neighbor
