How to Evaluate Efficient Deep Neural Network Approaches
Vivienne Sze (@eems_mit), Massachusetts Institute of Technology
In collaboration with Yu-Hsin Chen, Joel Emer, Yannan Wu, Tien-Ju Yang, and the Google Mobile Vision Team
Slides available at https://tinyurl.com/SzeMITDL2020
Website: http://sze.mit.edu
Book: Efficient Processing of Deep Neural Networks
https://tinyurl.com/EfficientDNNBook

Part I: Understanding Deep Neural Networks
- Introduction
- Overview of Deep Neural Networks

Part II: Design of Hardware for Processing DNNs
- Key Metrics and Design Objectives
- Kernel Computation
- Designing DNN Accelerators
- Operation Mapping on Specialized Hardware

Part III: Co-Design of DNN Hardware and Algorithms
- Reducing Precision
- Exploiting Sparsity
- Designing Efficient DNN Models
- Advanced Technologies
How to Evaluate these DNN Approaches?
- There are many deep neural network (DNN) accelerators and approaches for efficient DNN processing. Too many to cover!
- We will instead focus on how to evaluate approaches for efficient processing of DNNs:
  - Approaches include the design of DNN accelerators and DNN models.
  - What are the key metrics that should be measured and compared?
TOPS or TOPS/W?
- TOPS = tera (10^12) operations per second.
- TOPS and TOPS/W (TOPS per watt) are commonly reported in the hardware literature to show the efficiency of a design.
- However, these numbers alone do not provide sufficient insight into hardware capabilities and limitations, especially if they are based on peak throughput/performance.
- Example: a high TOPS/W can be achieved with an inverter (ring oscillator), even though it performs no useful computation. (See the sketch below for how the peak numbers are derived.)
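To make the point concrete, here is a minimal sketch of how headline peak TOPS and TOPS/W figures are computed from datasheet-style parameters. The accelerator parameters are hypothetical, and counting 2 operations per MAC is an assumed (though common) convention:

    # Peak TOPS and TOPS/W from datasheet-style numbers (hypothetical values).
    num_macs = 16 * 1024   # parallel MAC units
    clock_hz = 1e9         # 1 GHz clock
    power_w = 2.0          # total power in watts
    ops_per_mac = 2        # 1 multiply + 1 add (assumed convention)

    peak_tops = num_macs * clock_hz * ops_per_mac / 1e12
    print(f"Peak: {peak_tops:.1f} TOPS, {peak_tops / power_w:.1f} TOPS/W")
    # Note: this is *peak* throughput; achieved throughput also depends on
    # PE utilization, which the TOPS number alone does not capture.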
Key Metrics: Much more than OPS/W! [Sze, CICC 2017]
- Accuracy: quality of result (e.g., on datasets such as MNIST, CIFAR-10, and ImageNet).
- Throughput: analytics on high-volume data; real-time performance (e.g., video at 30 fps).
- Latency: for interactive applications (e.g., autonomous navigation).
- Energy and Power: embedded devices have limited battery capacity; data centers have a power ceiling due to cooling cost.
- Hardware Cost: $$$.
- Flexibility: range of DNN models and tasks (e.g., computer vision, speech recognition).
- Scalability: scaling of performance with amount of resources.
Key Design Objectives of DNN Accelerators
Increase Throughput and Reduce Latency
- Reduce time per MAC:
  - Reduce critical path → increase clock frequency.
  - Reduce instruction overhead.
- Avoid unnecessary MACs (save cycles).
- Increase number of processing elements (PEs) → more MACs in parallel:
  - Increase area density of PEs or area cost of the system.
- Increase PE utilization* → keep PEs busy:
  - Distribute the workload to as many PEs as possible.
  - Balance the workload across PEs.
  - Provide sufficient memory bandwidth to deliver the workload to the PEs (reduce idle cycles).
- Low latency has the additional constraint of a small batch size.
*(100% utilization = peak performance)
These factors combine multiplicatively, as in the sketch after this list.
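A minimal sketch of how the factors above multiply out to an effective throughput; all numbers are hypothetical, and one MAC per PE per cycle is assumed:

    # Effective throughput = PEs x utilization x clock frequency.
    num_pes = 256         # processing elements
    clock_hz = 500e6      # clock frequency after critical-path optimization
    utilization = 0.60    # fraction of PEs kept busy (workload-dependent)

    effective = num_pes * utilization * clock_hz
    peak = num_pes * clock_hz
    print(f"Effective: {effective / 1e9:.1f} GMAC/s (peak: {peak / 1e9:.1f} GMAC/s)")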
Eyexam: Performance Evaluation Framework
- A systematic way of understanding the performance limits of DNN hardware as a function of specific characteristics of the DNN model and hardware design.
- Performance is plotted as MAC/cycle versus MAC/data (data reuse).
- Step 1: maximum workload parallelism (depends on the DNN model).
- Step 2: maximum dataflow parallelism → theoretical peak performance, set by the number of PEs.
[Chen, arXiv 2019: https://arxiv.org/abs/1807.07928]
Eyexam: Performance Evaluation Framework (cont.)
- Based on the roofline model [Williams, CACM 2009].
- The slope of the rising portion of the curve is the bandwidth (BW) to the PEs; the flat roof is the theoretical peak performance (number of PEs), in MAC/cycle.
- To the left of the inflection point, performance is bandwidth bounded; to the right, it is compute bounded. (A sketch of this calculation follows.)
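A minimal sketch of the roofline calculation, assuming hypothetical hardware parameters: attainable performance is the minimum of the compute roof and the bandwidth-limited slope.

    # Roofline model: performance bound vs. operational intensity (MAC/data).
    def attainable_macs_per_cycle(op_intensity, peak_macs_per_cycle, bw_data_per_cycle):
        """op_intensity: MACs performed per data element fetched (MAC/data)."""
        return min(peak_macs_per_cycle, bw_data_per_cycle * op_intensity)

    peak = 256   # MAC/cycle (number of PEs)
    bw = 16      # data elements delivered to the PEs per cycle

    for oi in [1, 4, 16, 64]:
        perf = attainable_macs_per_cycle(oi, peak, bw)
        bound = "BW bounded" if perf < peak else "compute bounded"
        print(f"MAC/data = {oi:3d}: {perf:5.0f} MAC/cycle ({bound})")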
Eyexam: Performance Evaluation Framework (cont.)
After Steps 1 and 2 establish the theoretical peak performance (number of PEs), the next steps tighten the bound for a real design:
- Step 3: number of active PEs under a finite PE array size.
- Step 4: number of active PEs under a fixed PE array dimension.
- Step 5: number of active PEs under a fixed storage capacity.
- The slope now reflects the BW to only the active PEs.
https://arxiv.org/abs/1807.07928
Eyexam: Performance Evaluation Framework (cont.)
The full set of steps, each evaluated at the workload's operational intensity (MAC/data):
- Step 1: maximum workload parallelism.
- Step 2: maximum dataflow parallelism → theoretical peak performance (number of PEs).
- Step 3: number of active PEs under a finite PE array size.
- Step 4: number of active PEs under a fixed PE array dimension.
- Step 5: number of active PEs under a fixed storage capacity.
- Step 6: lower active PE utilization due to insufficient average BW.
- Step 7: lower active PE utilization due to insufficient instantaneous BW.
A sketch of this successive tightening follows. https://arxiv.org/abs/1807.07928
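Conceptually, each Eyexam step can only lower the performance bound, so the final estimate is a running minimum across steps. A minimal sketch with hypothetical per-step bounds (not the full Eyexam analysis, which derives each bound from the model and hardware):

    # Eyexam-style bound tightening: running minimum over per-step bounds.
    step_bounds = {
        "1: workload parallelism":          1024,
        "2: dataflow parallelism (peak)":    256,
        "3: finite PE array size":           256,
        "4: fixed PE array dimension":       192,
        "5: fixed storage capacity":         192,
        "6: insufficient average BW":        160,
        "7: insufficient instantaneous BW":  144,
    }
    bound = float("inf")
    for step, macs_per_cycle in step_bounds.items():
        bound = min(bound, macs_per_cycle)
        print(f"After step {step:<35s} bound = {bound:6.0f} MAC/cycle")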
Key Design Objectives of DNN Accelerators
Reduce Energy and Power
- Reduce data movement, as it dominates energy consumption:
  - Exploit data reuse.
- Reduce energy per MAC:
  - Reduce switching activity and/or capacitance.
  - Reduce instruction overhead.
- Avoid unnecessary MACs.
- Power consumption is limited by heat dissipation, which limits the maximum number of MACs in parallel (i.e., throughput).

Relative energy cost per operation [Horowitz, ISSCC 2014]:
Operation             Energy (pJ)
8b Add                0.03
16b Add               0.05
32b Add               0.1
16b FP Add            0.4
32b FP Add            0.9
8b Multiply           0.2
32b Multiply          3.1
16b FP Multiply       1.1
32b FP Multiply       3.7
32b SRAM Read (8KB)   5
32b DRAM Read         640

The sketch below uses these costs to show why data movement dominates.
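A minimal sketch that applies the per-operation costs from the table above to hypothetical operation counts:

    # Energy estimate from operation counts x per-operation cost (pJ).
    energy_pj = {"8b_add": 0.03, "8b_mult": 0.2, "sram_read_32b": 5, "dram_read_32b": 640}

    counts = {                       # hypothetical workload
        "8b_mult": 1_000_000,        # MAC multiplies
        "8b_add": 1_000_000,         # MAC accumulations
        "sram_read_32b": 250_000,    # on-chip buffer accesses
        "dram_read_32b": 50_000,     # off-chip accesses
    }

    for op, n in counts.items():
        print(f"{op:14s}: {n * energy_pj[op] / 1e6:7.3f} uJ")
    # Even with 20x fewer DRAM reads than MACs, DRAM dominates (32 uJ vs.
    # 0.2 uJ for the multiplies): data movement, not computation, sets the
    # energy budget.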
DNN Processor Evaluation Tools
- We require a systematic way to:
  - Evaluate and compare a wide range of DNN processor designs.
  - Rapidly explore the design space.
- Timeloop [Parashar, ISPASS 2019]:
  - DNN mapping tool and performance simulator → produces action counts.
- Accelergy [Wu, ICCAD 2019]:
  - Early-stage energy estimation tool at the architecture level.
  - Estimates energy consumption based on architecture-level components (e.g., # of PEs, memory size, on-chip network).
  - Evaluates the architecture-level energy impact of emerging devices: plug-ins for different technologies.
- Flow: an architecture description and compound component descriptions feed Timeloop, whose action counts feed Accelergy, which dispatches to per-technology energy estimation plug-ins. (A conceptual sketch of this flow follows.)
- Open-source code available at: http://accelergy.mit.edu
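The core idea behind this style of estimation is that total energy is the sum over components of action count times energy per action, with per-action energies supplied by technology plug-ins. The sketch below is conceptual only, not the actual Accelergy API; all component names and numbers are illustrative:

    # Accelergy-style estimation: sum of action counts x per-action energy.
    energy_per_action_pj = {        # from a hypothetical technology plug-in
        ("pe", "mac"): 2.2,
        ("shared_buffer", "read"): 6.0,
        ("shared_buffer", "write"): 7.5,
        ("noc", "transfer"): 1.8,
    }

    action_counts = {               # e.g., produced by a mapping/performance tool
        ("pe", "mac"): 5_000_000,
        ("shared_buffer", "read"): 800_000,
        ("shared_buffer", "write"): 400_000,
        ("noc", "transfer"): 1_200_000,
    }

    total_pj = sum(n * energy_per_action_pj[k] for k, n in action_counts.items())
    print(f"Estimated energy: {total_pj / 1e6:.2f} uJ")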
Accelergy Estimation Validation
- Validated on Eyeriss [Chen, ISSCC 2016]:
  - Achieves 95% accuracy compared to post-layout simulations.
  - Accurately captures the energy breakdown at different granularities.

Energy breakdown, ground truth vs. Accelergy estimate:
Component       Ground Truth   Accelergy
PE Array        93.8%          93.0%
SharedBuffer    3.9%           3.6%
PsumRdNoC       1.3%           1.2%
PsumWrNoC       0.6%           0.6%
IfmapNoC        0.5%           0.5%
WeightsBuffer   0.2%           0.2%
WeightsNoC      0.1%           0.1%

[Wu, ICCAD 2019] Open-source code available at: http://accelergy.mit.edu
Performing the MAC with the Memory Storage Element (Analog Compute)
- Activations, weights, and/or partial sums are encoded with analog voltage, current, or resistance:
  - Activation is the input voltage (Vi); weight is the resistor conductance (Gi).
  - Increased sensitivity to circuit non-idealities: non-linearities and process, voltage, and temperature variations.
  - Requires A/D and D/A peripheral circuits to interface with the digital domain.
- Multiplication via the device I-V relationship: I1 = V1 × G1 and I2 = V2 × G2.
  - eNVM (RRAM, STT-RAM, PCM) uses a resistive device.
  - Flash and SRAM use a transistor (I-V curve) or a local capacitor.
- Accumulation: the output current is the partial sum, I = I1 + I2 = V1 × G1 + V2 × G2.
  - Via current summing or charge sharing.
(Image source: [Shafiee, ISCA 2016]. A numeric sketch of this MAC follows.)
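The analog MAC in numbers: multiply via Ohm's law (I = V × G), accumulate via current summing on a shared wire. A minimal behavioral sketch with hypothetical voltages and conductances; a real array would also have to model the non-idealities listed above:

    # Analog MAC: branch currents are products, their sum is the partial sum.
    v = [0.3, 0.7]       # input activations as voltages (V)
    g = [2e-6, 5e-6]     # weights as conductances (S)

    branch_currents = [vi * gi for vi, gi in zip(v, g)]   # multiplications
    i_out = sum(branch_currents)                          # accumulation (KCL)
    print(f"Partial-sum current: {i_out * 1e6:.2f} uA")
    # An ADC then converts i_out back to the digital domain.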
Processing In Memory (PIM*)    *a.k.a. In-Memory Computing (IMC)
- Implemented as a matrix-vector multiply:
  - Typically, the matrix is composed of the stored weights and the vector is composed of the input activations (a weight-stationary dataflow).
- Reduce weight data movement by moving compute into the memory:
  - Perform the MAC with the storage element or in the peripheral circuits.
  - Read out partial sums rather than weights → fewer accesses through the peripheral circuits.
- Increase weight bandwidth:
  - Multiple weights are accessed in parallel to keep the MACs busy (high utilization).
- Increase the amount of parallel MACs:
  - The storage element can have higher area density than a digital MAC.
  - Reduced routing capacitance.
- Array structure: DACs drive the input activations onto the rows; analog logic (mult/add/shift) operates on the columns of the array; ADCs read out the partial sums / output activations. (See the behavioral sketch below.)
(eNVM: [Yu, PIEEE 2018]; SRAM: [Verma, SSCS 2019])
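A behavioral sketch of PIM as a matrix-vector multiply: weights stay stationary in the array as conductances, activations enter on the rows via DACs, and each column sums its currents into one partial sum read by an ADC. Array sizes and the 8-bit ADC model are hypothetical, and analog non-idealities are ignored:

    import numpy as np

    rows, cols = 4, 3
    weights_g = np.random.uniform(1e-6, 8e-6, size=(rows, cols))  # conductances (S)
    activations_v = np.random.uniform(0.0, 1.0, size=rows)        # DAC outputs (V)

    # All rows x all columns evaluated in one step: each column current is a
    # partial sum (Kirchhoff's current law down the bitline).
    column_currents = activations_v @ weights_g                   # shape: (cols,)

    # Crude ADC model: quantize each column current to 8 bits of full scale.
    full_scale = activations_v.max() * weights_g.sum(axis=0).max()
    codes = np.round(column_currents / full_scale * 255).astype(int)
    print("Partial-sum ADC codes:", codes)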