A Case for Dynamic Activation Quantization in CNNs. Karl Taht, Surya Narayanan, Rajeev Balasubramonian. University of Utah
Overview • Background • Proposal • Search Space • Architecture • Results • Future Work
Improving CNN Efficiency • Stripes: Bit-Serial Deep Neural Network Computing • Per-layer bit precisions net significant savings with <1% accuracy loss • Brute force approach to find best quantization – retraining at each step! • Good end result, but expensive! • Weight-Entropy-Based Quantization for Deep Neural Networks • Quantize both weights and activations • Guided search to find optimal quantization (entropy and clustering) • Still requires retraining, still a passive approach Can we exploit adaptive reduced precision during inference?
Proposal: Adaptive Quantization Approach (AQuA) • Most images contain regions of irrelevant information for the classification task • Can we avoid such computations altogether? • Quantize whole regions to 0 bits • More simply – crop them!
Proposal: Activation Cropping
Proposal: Activation Cropping (Figure callouts: save computations here; add a lightweight predictor here)
Search Space – How to Crop • Exploit domain knowledge • Information is typically centered within the image (>55% in our tests) • Utilize a regular pattern • Less control logic required • Maps more easily to different hardware • Added bonus: while objects are centered, the majority of area (and thus computation) is on the outside! (Figure: N x N input image)
Proposal: Activation Cropping • Concept: Scale feature maps proportionally (Figure: crops at N = 25, 10, 8, 5, 2)
Search Space – Crop Directions • We consider 16 possible crops as permutations of top, bottom, left, and right crops, encoded as a vector: [ TOP , BOTTOM , LEFT , RIGHT ] • Unlike traditional pruning, AQuA can exploit image-based information to enhance pruning options. (Figure: example crop vectors [0 1 0 0], [1 0 0 0], [0 0 1 0], [0 0 0 1], [0 1 0 1], [1 0 1 1] applied to an image; a sketch of applying such a vector follows below.)
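To make the crop encoding concrete, here is a minimal NumPy sketch (not the authors' implementation) that applies a [top, bottom, left, right] crop vector to a CHW activation map; the function name crop_activation and the parameter n (rows/columns removed per flagged edge) are illustrative assumptions.

```python
import numpy as np

def crop_activation(fmap, crop_vec, n):
    """Crop a CHW activation map according to a [top, bottom, left, right] flag vector.

    crop_vec follows the slide's encoding; n is the number of rows/columns
    removed per flagged edge. Cropped regions are never computed downstream,
    which we model here by simply slicing them away.
    """
    top, bottom, left, right = crop_vec
    c, h, w = fmap.shape
    r0, r1 = top * n, h - bottom * n
    c0, c1 = left * n, w - right * n
    return fmap[:, r0:r1, c0:c1]

# Example: crop 2 rows from the top and 2 columns from the right
# of a 3x8x8 activation map, leaving a 3x6x6 map to compute on.
fmap = np.random.rand(3, 8, 8).astype(np.float32)
cropped = crop_activation(fmap, crop_vec=[1, 0, 0, 1], n=2)
print(cropped.shape)  # (3, 6, 6)
```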
Quantifying Potentials • To maintain original top-1 accuracy, 75% of images can tolerate some type of crop! • Greater savings with top-5 predictions • Technique is invariant to weight quantization (Figure: number of edges cropped per weight set)
Exploiting Energy Savings with ISAAC • Activation cropping can be applied to any architecture • We use the ISAAC accelerator due to its flexibility • Future work includes leveraging additional variable precision techniques (Figure: ISAAC crossbar with inputs along the rows, weights stored as 1-bit and 2-bit cells per column, and 8-bit outputs)
Weight Precision Savings (Figure: a 10-bit weight maps to 5 crossbar columns of 1-bit and 2-bit cells, requiring 5 ADC operations; a 16-bit weight maps to 8 columns, requiring 8 ADC operations, read through a multiplexed 8-bit ADC; see the sketch below.)
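The column and ADC counts on this slide follow from storing 2 bits of weight per cell; a tiny sketch of that arithmetic, with the assumed helper name adc_ops_per_read:

```python
import math

def adc_ops_per_read(weight_bits, bits_per_cell=2):
    """Columns (and thus ADC conversions) needed for one crossbar read,
    assuming ISAAC-style weight slicing at `bits_per_cell` bits per cell.
    """
    return math.ceil(weight_bits / bits_per_cell)

print(adc_ops_per_read(10))  # 5 columns -> 5 ADC operations
print(adc_ops_per_read(16))  # 8 columns -> 8 ADC operations
```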
“FlexPoint” Support • Can vary the shift amount to compute fixed-point results with different exponents (Figure: same crossbar layout as the previous slide, with 10-bit/16-bit weights across 5/8 columns and a multiplexed 8-bit ADC; a sketch of the shift follows below.)
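A minimal sketch of the shift idea, assuming integer partial sums and a shared per-tensor exponent; the helper name and the handling of negative exponents are illustrative assumptions, not ISAAC's actual datapath.

```python
def apply_flexpoint_shift(partial_sum, exponent):
    """Scale an integer partial sum by a shared (per-tensor) exponent.

    The crossbar produces integer partial sums; a variable shift amount
    realizes fixed-point results with different exponents. Positive
    exponents shift left, negative exponents shift right (dropping
    low-order bits).
    """
    return partial_sum << exponent if exponent >= 0 else partial_sum >> -exponent

# Example: the same integer sum interpreted under two exponents.
print(apply_flexpoint_shift(0b1011, 2))   # 44
print(apply_flexpoint_shift(0b1011, -1))  # 5
```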
Activation Quantization Savings • K-bit activations (inputs) require K time steps. (Figure: buffered k-bit inputs streamed one bit per time step into crossbar columns of 1-bit and 2-bit cells, producing 8-bit outputs over time steps 1..k.)
Activation Quantization Savings • Fewer computations mean increased throughput, reduced area requirements, and lower energy. • K-bit activations (inputs) require K time steps. (Figure: same bit-serial input streaming diagram as the previous slide; see the sketch below.)
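To make the K-time-step point concrete, here is a minimal bit-serial dot-product sketch in NumPy, a functional model rather than the crossbar hardware; the function name and the unsigned-activation assumption are illustrative. Lowering the activation bit width k (or cropping an activation to 0 bits) removes time steps outright.

```python
import numpy as np

def bit_serial_dot(activations, weights, k):
    """Bit-serial dot product over k-bit unsigned activations.

    Activations are fed one bit-plane per time step, so a k-bit
    activation takes k steps to process.
    """
    acc = 0
    for step in range(k):                       # one time step per bit
        bits = (activations >> step) & 1        # current bit-plane
        acc += int(np.dot(bits, weights)) << step
    return acc

a = np.array([5, 3, 7], dtype=np.int64)   # 3-bit activations
w = np.array([2, 1, 4], dtype=np.int64)
print(bit_serial_dot(a, w, k=3))           # 41
print(np.dot(a, w))                        # 41, same result
```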
Naive Approach – Crop Everything • Substantial energy savings at a cost to accuracy • Theoretically, can save over 33% energy and maintain original accuracy!
Overall Energy Savings • Adaptive quantization saves 33% on average compared to an uncropped baseline. • Technique can be applied in conjunction with weight quantization techniques with nearly identical relative savings
Future Work • Predict unimportant regions using a “0th” layer with just a few gradient-based kernels (Figure: original image vs. Sobel gradient highlighting unimportant regions; a sketch follows below.) • Use variable low-precision computations, not just cropping • Quantify energy and latency changes due to the additional prediction step, but fewer overall computations
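One possible shape for such a gradient-based predictor, sketched with SciPy's Sobel filter; the helper name predict_crop, the energy threshold, and the per-edge decision rule are assumptions for illustration, not the authors' design.

```python
import numpy as np
from scipy.ndimage import sobel

def predict_crop(image, n, threshold=0.05):
    """Hypothetical '0th-layer' predictor: use Sobel gradient energy to
    decide which edges of the image are unimportant enough to crop.

    n is the crop width in pixels; threshold is an assumed fraction of
    total gradient energy below which an edge strip is considered
    croppable. Returns a [top, bottom, left, right] crop vector.
    """
    gx, gy = sobel(image, axis=1), sobel(image, axis=0)
    energy = np.hypot(gx, gy)
    total = energy.sum() + 1e-9
    strips = {
        "top": energy[:n, :], "bottom": energy[-n:, :],
        "left": energy[:, :n], "right": energy[:, -n:],
    }
    return [int(strips[k].sum() / total < threshold)
            for k in ("top", "bottom", "left", "right")]

image = np.random.rand(224, 224).astype(np.float32)
print(predict_crop(image, n=28))  # a crop vector, e.g. [0, 0, 0, 0]
```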
Conclusion • Adaptive quantization saves 33% on average compared to an uncropped baseline. • Technique can be applied in conjunction with weight quantization techniques with nearly identical relative savings
Thank you! Questions?