Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization
Special Session - CODES+ISSS
Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze
Internet of Things Revolution
The data is noisy, real-world sensory input; it feeds aggregate analytics, processing, etc.; and it is consumed by humans.
… double temp = sensor_acquire(); …
Approximate computing: eliminate inefficiencies in systems by producing just-the-right quality.
Quantization: going back to basics
The data is noisy, real-world sensory input; it feeds aggregate analytics, processing, etc.; and it is consumed by humans.
(Figure: two SRAM and ALU pairs.)
This Talk: A “Limit Study” on Precision Scaling
Assumption: hardware that can dynamically and arbitrarily scale its precision
(Figure labels: float, double, 1 … n)
SW Scope: compute-heavy, regular applications
HW Scope: hardware accelerators
Talk Overview
1. How much precision is needed at different stages of a program?
2. How much energy can be saved (upper bound)?
3. How does this inform approximate computing research?
Talk Overview
1. How much precision is needed at different stages of a program?
   QAPPA - Precision Autotuner
2. How much energy can be saved?
3. How does this inform approximate computing research?
QAPPA: Quality Autotuner for Precision-Programmable Accelerators
Goal: Minimize instruction-level precision requirements given a quality target
Inputs: kernel.c, desired quality target
Outputs of the QAPPA framework: quality & energy savings, instruction-level precision requirements
Built on top of ACCEPT, the approximate C/C++ compiler: http://accept.rocks
QAPPA Autotuner Overview
Default (no savings)
(Figure: one precision bar per instruction, instruction 0 through instruction n, with a savings marker and an application quality scale running from bad to OK.)
QAPPA Autotuner Overview
Optimized: extraneous precision is shaved off
(Figure: one precision bar per instruction, instruction 0 through instruction n, with a savings marker and an application quality scale running from bad to OK.)
QAPPA 5-Step Description
(Pipeline figure) Annotated Program -> ACCEPT static analysis -> ILPC* -> ACCEPT error injection & instrumentation -> Approximate Binary -> Execution & Quality Assessment -> Quality Results -> Quality Autotuner
Side inputs to Execution & Quality Assessment: Program Inputs & Quality Metrics
Output: Configuration & Bit Savings
* Instruction-level Precision Configuration
1. Program Annotation
Key: use the APPROX type qualifier [*]

    void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter)
    {
        for (row) {
            for (col) {
                APPROX flt sum = 0
                int dstPos = …
                for (row_offset) {
                    for (col_offset) {
                        int srcPos = …
                        int fltPos = …
                        sum += in[srcPos] * filter[fltPos]
                    }
                }
                out[dstPos] = sum / normFactor
            }
        }
    }

[*] EnerJ, Sampson et al., PLDI’11
2. Static Analysis
Input: the APPROX-annotated conv2d kernel from step 1.
Output of ACCEPT: the Instruction-Level Precision Configuration (ILPC)

    conv2d:13:7:load:Int32
    conv2d:13:10:load:Float
    conv2d:13:11:fmul:Float
    conv2d:13:12:fadd:Float
    conv2d:15:1:fdiv:Float
    conv2d:15:7:store:Int32

ACCEPT identifies safe-to-approximate instructions from data annotations using flow analysis.
3. Error Injection
Instruction-Level Precision Configuration (ILPC) -> Instrumentation & Compilation -> Approximate Binary

    conv2d:13:7:load:Int4
    conv2d:13:10:load:Fix2.3
    conv2d:13:11:fmul:Fix2.3
    conv2d:13:12:fadd:Fix4.5
    conv2d:15:1:fdiv:Fix2.3
    conv2d:15:7:store:Int4

Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings.
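To make the "quality knob" idea concrete, here is a minimal Python sketch of the kind of error an instrumented instruction might inject. It assumes that a format such as Fix2.3 means 2 integer bits (sign included) and 3 fractional bits; the slides do not define the notation, and `quantize_fixed` and `approx_mac` are hypothetical names, not part of ACCEPT or QAPPA.

```python
import math

def quantize_fixed(value, int_bits, frac_bits):
    """Round value onto a signed fixed-point grid, saturating on overflow.

    Assumption: 'FixI.F' = I integer bits (sign included) + F fractional bits.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))        # most negative representable code
    highest = (1 << (total_bits - 1)) - 1    # most positive representable code
    code = math.floor(value * scale + 0.5)   # round half up to nearest step
    code = max(lowest, min(highest, code))   # saturate instead of wrapping
    return code / scale

def approx_mac(acc, a, b):
    """One multiply-accumulate with the precisions from the example ILPC."""
    product = quantize_fixed(a * b, 2, 3)      # fmul at Fix2.3
    return quantize_fixed(acc + product, 4, 5)  # fadd at Fix4.5

print(quantize_fixed(1.3, 2, 3))   # nearest step of 1/8
print(approx_mac(0.0, 0.5, 0.7))
```

Saturation and round-half-up are design choices of this sketch; a real accelerator could truncate or wrap instead, which would change the injected error.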
4. Quality Assessment
(Figure: the outputs of the Reference Binary and the Approximate Binary are fed to eval.py, which reports the quality, e.g. 10dB SNR.)
The programmer provides a quality assessment script to evaluate quality on the program output.
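A quality assessment script of this kind could look like the sketch below. It is not the authors' eval.py: the function names are hypothetical, outputs are assumed to be flat lists of numbers, and SNR is taken to be the usual ratio of reference signal power to error power, in dB.

```python
import math

def snr_db(reference, approximate):
    """SNR in dB of an approximate output against the precise reference."""
    if len(reference) != len(approximate):
        raise ValueError("outputs must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - a) ** 2 for r, a in zip(reference, approximate))
    if noise_power == 0:
        return math.inf           # bit-exact output
    if signal_power == 0:
        return -math.inf          # all-zero reference, nonzero error
    return 10.0 * math.log10(signal_power / noise_power)

def meets_target(reference, approximate, target_db):
    """True if the approximate output reaches the quality target."""
    return snr_db(reference, approximate) >= target_db

# Signal power 25, error power 0.25: a ratio of 100.
print(snr_db([3.0, 4.0], [3.0, 4.5]))
```

The autotuner only needs a single number per run, so any metric with a clear "higher is better" or "lower is better" ordering would fit the same slot.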
5. Autotuning Algorithm
Greedy iterative algorithm [*]: at each step, reduce the precision requirement of the instruction that impacts quality the least.
(Figure: search tree of candidate configurations)

    config k:          error = 0.10%
    config [k+1, i-1]: error = 5.91%    config [k+1, i]: error = 0.30%    config [k+1, i+1]: error = 0.12%
    config [k+2, i-1]: error = 5.91%    config [k+2, i]: error = 0.33%    config [k+2, i+1]: error = 1.6%

Finds a solution in O(m^2 n) worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels available to the instructions.
[*] Precimonious, Rubio-Gonzalez et al., SC’13
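The greedy loop described on this slide can be sketched as follows. This is a simplified illustration, not QAPPA's implementation: precision is modeled as a bit count per instruction, `evaluate_error` stands in for compiling, running, and scoring the approximate binary, and `toy_error` is a made-up error model used only to exercise the loop.

```python
def greedy_autotune(instructions, max_bits, evaluate_error, max_error, min_bits=1):
    """Repeatedly drop one bit from the instruction that hurts quality least."""
    config = {name: max_bits for name in instructions}
    while True:
        best_name, best_error = None, None
        for name in instructions:               # up to m trials per pass
            if config[name] <= min_bits:
                continue
            trial = dict(config)
            trial[name] -= 1
            error = evaluate_error(trial)
            if best_error is None or error < best_error:
                best_name, best_error = name, error
        # Stop when nothing can be reduced or even the best trial is too lossy.
        if best_name is None or best_error > max_error:
            return config
        config[best_name] -= 1                  # commit the least harmful step

# Hypothetical error model: each instruction contributes weight * 2^-bits.
WEIGHTS = {"load": 1.0, "fmul": 3.0, "fadd": 10.0}

def toy_error(config):
    return sum(WEIGHTS[name] * 2.0 ** -bits for name, bits in config.items())

result = greedy_autotune(list(WEIGHTS), 8, toy_error, 0.5)
print(result, toy_error(result))
```

Each pass tries at most m single-step reductions, and at most m times n steps can be committed, which gives the O(m^2 n) bound on quality evaluations quoted on the slide.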
5. Autotuning Algorithm
The autotuner greedily maximizes bit-savings as the quality target is lowered.
(Figure panels: precise, 60dB, 40dB, 20dB, 10dB)
PERFECT Application Study

    Application Domain              Kernels
    PERFECT Application 1           Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
    Space Time Adaptive Processing  Outer Product, System Solve, Inner Product
    Synthetic Aperture Radar        Interpolation 1, Interpolation 2, Back Projection
    Wide Area Motion Imaging        Debayer, Image Registration, Change Detection
    Required Kernels                FFT 1D, FFT 2D

Metric (all kernels): Signal to Noise Ratio (SNR), from 120dB to 10dB (0.0001% to 31.6% MSE)
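The two ends of the metric range line up if the percentage is read as a relative error amplitude e related to SNR by a factor of 20 per decade. This reading is inferred from the numbers on the slide, which does not state the conversion; the symbol e is introduced here for illustration.

```latex
\begin{align*}
\mathrm{SNR}_{\mathrm{dB}} &= -20 \log_{10} e
  \quad\Longleftrightarrow\quad e = 10^{-\mathrm{SNR}_{\mathrm{dB}}/20} \\
120\,\mathrm{dB} &\;\Rightarrow\; e = 10^{-6} = 0.0001\% \\
10\,\mathrm{dB} &\;\Rightarrow\; e = 10^{-0.5} \approx 31.6\%
\end{align*}
```

Under the same reading, 20dB corresponds to a 10% error and 100dB to a 0.001% error.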
Opportunity of Approximations
QAPPA Analyzes PERFECT Dynamic Instruction Mix
(Pie chart, slices as extracted: control 11%, load/store 27%, int arith 25%, int arith 4%, math 1%, fp arith 31%; each slice is classed as either Safe to approximate or Precise.)
Average Precision Reduction Achieved Across PERFECT Kernels
(Bar chart: dynamic precision reduction on safe-to-approximate instructions versus Target Application SNR; lower SNR targets are labeled Approximate, higher ones High Quality, and taller bars mean more savings.)

    Target Application SNR (dB)   10    20    40    60    80    100   120
    Precision reduction           83%   74%   57%   48%   40%   32%   26%
Average Precision Reduction Achieved Across PERFECT Kernels
(Same bar chart of dynamic precision reduction on safe-to-approximate instructions, 83% at 10dB down to 26% at 120dB, with the x-axis labeled Average SNR (dB) and an annotation: PERFECT Manual, 0.001% MSE.)
Average Precision Reduction Achieved Across PERFECT Kernels
(Same bar chart of dynamic precision reduction on safe-to-approximate instructions, 83% at 10dB down to 26% at 120dB, with the x-axis labeled Average SNR (dB) and an annotation: Approximate Computing, 10% MSE.)
Talk Overview
1. How much precision is needed at different stages of a program?
   QAPPA - Precision Autotuner
2. How much energy can be saved (upper bound)?
   Case Study of Precision Scaling Hardware Mechanisms
3. How does this inform approximate computing research?
Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
(Figure: datapath diagram with example bit patterns; stage labels include ser, quant, and de-ser.)
Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
Value Truncation, QUORA [MICRO’13]: Less Power
(Figure: datapath diagrams with example bit patterns; stage labels include ser, quant, and de-ser.)
Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
Value Truncation, QUORA [MICRO’13]: Less Power
Bit-Sliced, Stripes [MICRO’16]: Higher Throughput
(Figure: datapath diagrams with example bit patterns; stage labels include ser, quant, and de-ser.)