Approximating to the Last Bit
Thierry Moreau, Adrian Sampson, Luis Ceze
{moreau, luisceze}@cs.washington.edu, asampson@cornell.edu
WAX 2016, co-located with ASPLOS 2016
April 3rd, 2016
What this Talk is About: How many bits in a program are really that important?
  1 - AXE: Quality Tuning Framework
  2 - PERFECT Benchmark Study
Precision Tuning: More precision means a larger memory footprint, more data movement, and more energy used in computation.
[Figure labels: float, double; 1 to n]
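To make the idea of a precision between 1 and n bits concrete, here is a minimal sketch that keeps only the top bits of a float's mantissa. It assumes `float` is an IEEE 754 single-precision value with 23 explicit mantissa bits, and it uses plain truncation; the function name `truncate_mantissa` is hypothetical, and the talk does not say which reduction scheme AXE applies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Keep only the top `keep_bits` of the 23 explicit mantissa bits of an
   IEEE 754 single-precision value and zero the rest (truncation).
   Assumption: float is a 32-bit IEEE 754 binary32 value. */
float truncate_mantissa(float x, int keep_bits)
{
    assert(keep_bits >= 0 && keep_bits <= 23);

    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret without aliasing issues */

    uint32_t drop = 23u - (uint32_t)keep_bits;
    uint32_t mask = ~((1u << drop) - 1u);    /* ones everywhere except the dropped bits */
    bits &= mask;

    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

For example, `truncate_mantissa(1.75f, 1)` keeps a single mantissa bit, so the binary value 1.11 becomes 1.1. Sign and exponent are left untouched, which is why only the mantissa width acts as the precision knob in this sketch.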
AXE Precision Tuning Framework. Goal: maximize bit-savings given a quality target.
AXE Precision Tuning Framework: the AXE framework takes kernel.c and a quality target, and reports quality & bit-savings along with instruction-level precision requirements.
Built on top of ACCEPT, the approximate C/C++ compiler (http://accept.rocks).
AXE Precision Tuning Framework: Default (no bit-savings). [Figure: bit-savings per instruction, from instruction 0 to instruction n, with a quality indicator ranging from bad to OK]
AXE Precision Tuning Framework: Coarse-Grained Precision Reduction. [Figure: bit-savings per instruction, from instruction 0 to instruction n, with a quality indicator ranging from bad to OK]
AXE Precision Tuning Framework: Fine-Grained Precision Reduction. [Figure: bit-savings per instruction, from instruction 0 to instruction n, with a quality indicator ranging from bad to OK]
PERFECT Benchmark Suite
Application domains and kernels:
  PERFECT Application 1: Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
  Space Time Adaptive Processing: Outer Product, System Solve, Inner Product
  Synthetic Aperture Radar: Interpolation 1, Interpolation 2, Back Projection
  Wide Area Motion Imaging: Debayer, Image Registration, Change Detection
  Required Kernels: FFT 1D, FFT 2D
Metric: Signal to Noise Ratio (SNR), from 120dB to 10dB (0.0001% to 31.6% MSE)
1 - PERFECT Dynamic Instruction Mix: control 11%, load/store 27%, int arith 25%, int arith 4%, math 1%, fp arith 31%. Legend: safe to approximate vs. precise.
1 - PERFECT Dynamic Instruction Mix: Long-latency ops (math 1%, fp arith 31%) are all safe to approximate.
1 - PERFECT Dynamic Instruction Mix: Memory ops (load/store 27%) are mostly safe to approximate (mostly data vs. pointers).
1 - PERFECT Dynamic Instruction Mix: Control (11%) and address computation (int arith 25%) must remain precise.
2 - Bit-Savings over Approximate Instructions
[Plot: bit-savings (0% to 100%) vs. average SNR (dB), from approximate (low SNR) to high quality (high SNR)]
  Average SNR (dB): 10, 20, 40, 60, 80, 100, 120
  Bit-savings: 83%, 74%, 57%, 48%, 40%, 32%, 26%
  Annotations: Approximate Computing (10% MSE); PERFECT Manual (0.001% MSE)
Future Architectural Challenges: Mechanisms to translate bit-savings into energy savings? New data types/representations? ISA extensions?
Thank You!
Backup Slides
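Bit Savings: explore the opportunity for precision reduction in a hardware-agnostic way.
$$\text{BitSavings} = \sum_{\text{insn}_{\text{static}}} \frac{\text{precision}_{\text{ref}} - \text{precision}_{\text{approx}}}{\text{precision}_{\text{ref}}} \times \frac{\text{execs}}{\text{execs}_{\text{total}}}$$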
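The bit-savings metric can be sketched as a weighted sum over static instructions: each instruction contributes the fraction of bits it drops, weighted by its share of dynamic executions. The struct and function names below are hypothetical, and the sketch assumes that the total execution count is the sum over the instructions passed in.

```c
#include <assert.h>
#include <stddef.h>

/* Per-static-instruction statistics (hypothetical layout for illustration). */
typedef struct {
    int precision_ref;     /* bits used in the reference (precise) execution */
    int precision_approx;  /* bits kept in the approximate execution */
    unsigned long execs;   /* dynamic execution count of this static instruction */
} insn_stats;

/* BitSavings = sum over static instructions of
   (precision_ref - precision_approx) / precision_ref * execs / execs_total.
   Assumption: execs_total is the sum of execs over the given instructions. */
double bit_savings(const insn_stats *insns, size_t count)
{
    unsigned long execs_total = 0;
    for (size_t i = 0; i < count; i++)
        execs_total += insns[i].execs;
    if (execs_total == 0)
        return 0.0;  /* nothing executed, so nothing saved */

    double savings = 0.0;
    for (size_t i = 0; i < count; i++) {
        assert(insns[i].precision_ref > 0);
        assert(insns[i].precision_approx >= 0);
        assert(insns[i].precision_approx <= insns[i].precision_ref);

        double dropped = (double)(insns[i].precision_ref - insns[i].precision_approx)
                         / (double)insns[i].precision_ref;
        double weight = (double)insns[i].execs / (double)execs_total;
        savings += dropped * weight;
    }
    return savings;
}
```

With two equally hot 32-bit instructions, one reduced to 16 bits and one left at full precision, the result is 0.25: half the bits are saved on half of the executions. Because the metric only counts bits and executions, it stays independent of any particular hardware.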
Framework Overview: Annotated Program -> ACCEPT static analysis -> ILPC* -> ACCEPT error injection & instrumentation -> Approximate Binary -> Execution & Quality Assessment (with Program Inputs & Quality Metrics) -> Quality Results -> Autotuner -> Configuration. Output: Quality & Bit Savings.
* ILPC: Instruction-Level Precision Configuration
Built on top of ACCEPT, the approximate C/C++ compiler (http://accept.rocks).
Program Annotation
  void conv2d (pix *in, pix *out, flt *filter) {
    for (row) {
      for (col) {
        flt sum = 0
        int dstPos = …
        for (row_offset) {
          for (col_offset) {
            int srcPos = …
            int fltPos = …
            sum += in[srcPos] * filter[fltPos]
          }
        }
        out[dstPos] = sum / normFactor
      }
    }
  }
Program Annotation. Key: use the APPROX type qualifier.
  void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter) {
    for (row) {
      for (col) {
        APPROX flt sum = 0
        int dstPos = …
        for (row_offset) {
          for (col_offset) {
            int srcPos = …
            int fltPos = …
            sum += in[srcPos] * filter[fltPos]
          }
        }
        out[dstPos] = sum / normFactor
      }
    }
  }
Program Annotation: tips on annotating programs faster
  Original typedefs: typedef float flt; typedef int pix
  Annotated typedefs: typedef APPROX float flt; typedef APPROX int pix
Takeaways:
  Annotating data is intuitive (~10 mins to annotate a kernel)
  Variables used to index arrays cannot be safely approximated
Static Analysis: ACCEPT identifies safe-to-approximate instructions from the data annotations using flow analysis. For the annotated conv2d kernel, it produces the following Instruction-Level Precision Configuration (ILPC):
  conv2d:13:7:load:Int32
  conv2d:13:10:load:Float
  conv2d:13:11:fmul:Float
  conv2d:13:12:fadd:Float
  conv2d:15:1:fdiv:Float
  conv2d:15:7:store:Int32
Error Injection: ACCEPT takes the Instruction-Level Precision Configuration (ILPC) and produces the approximate binary.
  conv2d:13:7:load:Int32
  conv2d:13:10:load:Float
  conv2d:13:11:fmul:Float
  conv2d:13:12:fadd:Float
  conv2d:15:1:fdiv:Float
  conv2d:15:7:store:Int32
Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings.
Quality Assessment: eval.py compares the output of the reference binary with the output of the approximate binary and reports the quality (for example, 10dB SNR). The programmer provides this quality assessment script to evaluate quality on the program output.
Autotuner: a greedy iterative algorithm that reduces the precision requirement of the instruction that impacts quality the least.
  config k: error = 0.10%
  … config [k+1, i-1]: error = 5.91%; config [k+1, i]: error = 0.30%; config [k+1, i+1]: error = 0.12% …
  … config [k+2, i-1]: error = 5.91%; config [k+2, i]: error = 0.33%; config [k+2, i+1]: error = 1.6% …
Finds a solution in O(m^2 n) in the worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels for each instruction.
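A minimal sketch of such a greedy search follows, under stated assumptions: precision is lowered one bit at a time, quality is reported by a caller-supplied error function (lower is better), and the search stops as soon as the least harmful reduction would exceed the error budget. The names `greedy_tune`, `error_fn`, `toy_model`, and `toy_error` are hypothetical, and the toy error model exists only to make the sketch runnable; it is not how AXE measures quality.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: evaluates the program with one precision (in bits)
   per instruction and returns the measured output error; lower is better. */
typedef double (*error_fn)(const int *bits, size_t m, void *ctx);

/* Greedily lowers per-instruction precision in `bits` (in place) while the
   error stays within `max_error`. Returns the number of one-bit reductions. */
size_t greedy_tune(int *bits, size_t m, double max_error, error_fn eval, void *ctx)
{
    assert(bits != NULL && eval != NULL);
    size_t steps = 0;

    for (;;) {
        size_t best = m;        /* m means "no candidate found yet" */
        double best_err = 0.0;

        /* Try removing one bit from each instruction in turn. */
        for (size_t i = 0; i < m; i++) {
            if (bits[i] == 0)
                continue;       /* this knob cannot go any lower */
            bits[i]--;
            double err = eval(bits, m, ctx);
            bits[i]++;          /* undo the trial reduction */
            if (best == m || err < best_err) {
                best = i;
                best_err = err;
            }
        }

        /* Stop when no knob is left or the best candidate breaks the budget. */
        if (best == m || best_err > max_error)
            break;
        bits[best]--;           /* commit the least harmful reduction */
        steps++;
    }
    return steps;
}

/* Toy error model for illustration only: every dropped bit of instruction i
   adds cost_per_bit[i] to the error. */
typedef struct {
    const double *cost_per_bit;
    int ref_bits;
} toy_model;

double toy_error(const int *bits, size_t m, void *ctx)
{
    const toy_model *model = ctx;
    double err = 0.0;
    for (size_t i = 0; i < m; i++)
        err += model->cost_per_bit[i] * (double)(model->ref_bits - bits[i]);
    return err;
}
```

Each pass evaluates at most m candidate configurations, and at most m times n one-step reductions can be committed, which matches the O(m^2 n) worst case quoted on the slide. In the toy model, an instruction with a low cost per bit is drained first, while a costly one keeps its full precision.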