 
Compilation and Hardware Support for Approximate Acceleration
Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu
Theme: 2384.004

Approximate Computing
Aims to exploit application resilience to trade off quality for efficiency
✅ Accurate, ❌ Expensive
✅ Approximate, ✅ Cheap

Neural Networks as Approximate Accelerators
[Figure: CPU diagram]
Esmaeilzadeh et al. [MICRO 2012]

Neural Acceleration
float foo (float a, float b) {
  …
  return val;
}
[Figure: the region is approximated and accelerated on an NPU]
approximation, acceleration
compiler-support: ACCEPT*
HW-support: SNNAP**
3.8x speedup and 2.8x efficiency - 10% error
*Sampson et al. [UW-TR]
**Moreau et al. [HPCA2015]

Talk Outline
Introduction
Compiler Support with ACCEPT
SNNAP Accelerator design
Evaluation & Comparison with HLS

Compilation Overview
1. Region detection & program instrumentation (figure labels: code annotation, ACCEPT region detection)
2. ANN Training & topology search (figure labels: back prop., [training.data])
3. Code Generation (figure labels: ACCEPT code transformation, executes, SNNAP, CPU)

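The instrumentation in step 1 can be pictured as a wrapper that runs the precise region and records each input/output pair, which then becomes the training data for step 2. The following is a minimal sketch under that assumption, not ACCEPT's actual instrumentation: `region`, `instrumented_region`, `get_sample`, `sample_t`, and `MAX_SAMPLES` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES 1024   /* hypothetical capacity of the training log */

/* One observed (input, output) pair for a two-input, one-output region. */
typedef struct {
    float in[2];
    float out;
} sample_t;

sample_t training_log[MAX_SAMPLES];
size_t   training_count = 0;

/* Stand-in for the precise region being approximated (hypothetical). */
float region(float a, float b) {
    return a * b + 0.5f * a;
}

/* Instrumented wrapper: runs the precise code and records the pair.
   Samples beyond MAX_SAMPLES are silently dropped in this sketch. */
float instrumented_region(float a, float b) {
    float out = region(a, b);
    if (training_count < MAX_SAMPLES) {
        training_log[training_count].in[0] = a;
        training_log[training_count].in[1] = b;
        training_log[training_count].out   = out;
        training_count++;
    }
    return out;
}

/* Read back a recorded sample, e.g. to write it to a training file. */
const sample_t *get_sample(size_t i) {
    assert(i < training_count);
    return &training_log[i];
}
```

A training tool would then iterate over `get_sample(0)` through `get_sample(training_count - 1)` and fit the network on those pairs.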
Programming Model
float sobel (float* p);
. . .
float** src;
float** dst;
while (true) {
  src = read_from_camera();
  for (y=0; y < h; ++y) {
    for (x=0; x < w; ++x) {
      dst[y][x] = sobel(& src[y][x]);
    }
  }
  display(dst);
}

Programming Model (with annotations)
APPROX float sobel (APPROX float* p);
. . .
APPROX float** src;
APPROX float** dst;
while (true) {
  src = read_from_camera();
  for (y=0; y < h; ++y) {
    for (x=0; x < w; ++x) {
      dst[y][x] = sobel(& src[y][x]);
    }
  }
  display(ENDORSE(dst));
}
sobel: ✅ no side effects, ✅ executes often

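To make the annotation idea concrete, here is a small compilable sketch. It assumes `APPROX` behaves like a type qualifier and `ENDORSE` like an explicit cast from approximate to precise data; the empty macro definitions below are stand-ins so a plain C compiler accepts the code, not ACCEPT's real definitions. The 3x3-window signature, the `|gx| + |gy|` magnitude, and `is_edge` are illustrative assumptions and differ from the slide's `sobel(float* p)` prototype.

```c
#include <assert.h>

/* Stand-ins so the sketch compiles with a plain C compiler; a real ACCEPT
   setup would supply its own definitions of these annotations. */
#define APPROX
#define ENDORSE(x) (x)

/* Hypothetical pure kernel: gradient magnitude of a 3x3 window stored
   row-major in p[0..8]. Uses |gx| + |gy| instead of a square root. */
APPROX float sobel(APPROX const float *p) {
    assert(p != 0);
    float gx = (p[2] + 2.0f * p[5] + p[8]) - (p[0] + 2.0f * p[3] + p[6]);
    float gy = (p[6] + 2.0f * p[7] + p[8]) - (p[0] + 2.0f * p[1] + p[2]);
    if (gx < 0.0f) gx = -gx;
    if (gy < 0.0f) gy = -gy;
    return gx + gy;
}

/* Precise consumer: approximate values should be endorsed before they
   flow into precise code such as this comparison. */
int is_edge(float magnitude, float threshold) {
    return magnitude > threshold;
}
```

The kernel has no side effects and is called once per pixel, which matches the two properties the slide checks off for `sobel`. A call site would read `is_edge(ENDORSE(sobel(window)), 1.0f)`.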
Checking for Quality
Inputs: annotated program (sobel.c), quality metric d(y, y′), input data (training, test)
Outputs: Performance, Output Quality

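A quality metric d(y, y′) compares the precise output y with the approximate output y′. As one possible instance, the sketch below uses mean squared error (one of the error metrics listed in the evaluation) and a pass/fail check against a bound. The function names `mse` and `quality_ok` and the `max_error` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error between precise outputs y and approximate outputs y2. */
double mse(const float *y, const float *y2, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = (double)y[i] - (double)y2[i];
        sum += diff * diff;
    }
    return sum / (double)n;
}

/* Accept the approximate version only if its error stays within a bound. */
int quality_ok(const float *y, const float *y2, size_t n, double max_error) {
    return mse(y, y2, n) <= max_error;
}
```

In a workflow like the one on the slide, such a check would be run on held-out test inputs rather than on the inputs used for training.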
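Background: Multi-Layer Perceptrons
neural network: Input Layer (x0, x1, x2, x3), Hidden Layer 0 (x4, x5, x6), Hidden Layer 1 (x7, x8, x9), Output (y0, y1)
computing a single layer:
\[
\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix}
= f\left(
\begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix}
\begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix}
\right)
\]
\[
x_7 = f\left( \sum_{i=4}^{6} w_{i7} \cdot x_i \right)
\]
f: activation function
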
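The single-layer computation, each output being f applied to a weighted sum of the inputs, can be sketched in a few lines of C. The slide only names f as the activation function, so the "fast sigmoid" x / (1 + |x|) used here is a stand-in chosen to avoid a math-library dependency. The row-major weight layout and the names `activation` and `layer_forward` are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in activation: the "fast sigmoid" x / (1 + |x|). The slides only
   name f as the activation function; this choice is an assumption. */
float activation(float x) {
    float ax = x < 0.0f ? -x : x;
    return x / (1.0f + ax);
}

/* out[j] = f( sum_i w[i * n_out + j] * in[i] ), where w[i * n_out + j] is
   the weight from input neuron i to output neuron j (w_ij on the slide). */
void layer_forward(const float *w, const float *in, size_t n_in,
                   float *out, size_t n_out) {
    assert(n_in > 0 && n_out > 0);
    for (size_t j = 0; j < n_out; ++j) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i) {
            acc += w[i * n_out + j] * in[i];
        }
        out[j] = activation(acc);
    }
}
```

For the slide's example, `in` would hold (x4, x5, x6), `out` would receive (x7, x8, x9), and a full network evaluation would call `layer_forward` once per layer.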
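Background: Systolic Arrays
computing a single layer:
\[
\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix}
= f\left(
\begin{bmatrix} w_{67} & w_{57} & w_{47} \\ w_{68} & w_{58} & w_{48} \\ w_{69} & w_{59} & w_{49} \end{bmatrix}
\begin{bmatrix} x_6 \\ x_5 \\ x_4 \end{bmatrix}
\right)
\]
systolic array:
[Figure: inputs x4, x5, x6 stream into an array holding the weights (w49 w48 w47 / w59 w58 w57 / w69 w68 w67); f is applied to produce x7, x8, x9]
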
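One way to picture a systolic evaluation of a layer is a row of processing elements (PEs), each accumulating one output while the inputs march past, one PE per cycle. The cycle-by-cycle model below is a sketch of that general idea (a 1-D, output-stationary arrangement) and is not claimed to match SNNAP's exact dataflow. `systolic_layer`, `MAX_PE`, and the weight layout are assumptions. The function produces the pre-activation sums; f would be applied afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PE 8  /* hypothetical upper bound on array width */

/* Cycle-by-cycle model of a 1-D output-stationary systolic array.
   PE j holds the running sum for output j. Input i enters PE 0 at cycle i
   and moves one PE to the right every cycle, so PE j sees input i at cycle
   i + j. w[i * n_pe + j] is the weight from input i to output j. */
void systolic_layer(const float *w, const float *in, size_t n_in,
                    float *acc, size_t n_pe) {
    assert(n_pe > 0 && n_pe <= MAX_PE && n_in > 0);
    float  pipe_val[MAX_PE];   /* input value currently held by each PE */
    size_t pipe_idx[MAX_PE];   /* index of that input */
    int    pipe_valid[MAX_PE]; /* whether the PE holds a value this cycle */

    for (size_t j = 0; j < n_pe; ++j) {
        acc[j] = 0.0f;
        pipe_val[j] = 0.0f;
        pipe_idx[j] = 0;
        pipe_valid[j] = 0;
    }

    size_t total_cycles = n_in + n_pe - 1;
    for (size_t cycle = 0; cycle < total_cycles; ++cycle) {
        /* Shift inputs one PE to the right, last PE first. */
        for (size_t j = n_pe - 1; j > 0; --j) {
            pipe_val[j]   = pipe_val[j - 1];
            pipe_idx[j]   = pipe_idx[j - 1];
            pipe_valid[j] = pipe_valid[j - 1];
        }
        /* Feed the next input into PE 0, or a bubble once inputs run out. */
        if (cycle < n_in) {
            pipe_val[0] = in[cycle];
            pipe_idx[0] = cycle;
            pipe_valid[0] = 1;
        } else {
            pipe_valid[0] = 0;
        }
        /* Every PE holding a valid input does one multiply-accumulate. */
        for (size_t j = 0; j < n_pe; ++j) {
            if (pipe_valid[j]) {
                acc[j] += w[pipe_idx[j] * n_pe + j] * pipe_val[j];
            }
        }
    }
}
```

With three inputs and three PEs the model runs for 3 + 3 - 1 = 5 cycles. After the first PE fills, several PEs perform a multiply-accumulate in the same cycle, which is where the parallelism comes from.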
PU Micro-Architecture
[Figure: a processing unit (PU) built around the systolic array, with PU control, weight Storage, a chain of PEs, and the activation unit f; inputs x4, x5, x6 produce x7, x8, x9]
1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer

Multi-Processing Units
[Figure: a DMA Master and a scheduler connected over a bus to four PUs, each with its own control, Storage, PEs, and activation unit f]

CPU-SNNAP Integration
coherent reads & writes with accelerator coherency port
custom mastering interface
low-latency event signaling, sleep & wakeup
[Figure: CPU with $L1 and $L2 caches; ACP; DMA master; scheduler; bus; four PUs; signal labels SE, WF]

Talk Outline
Introduction
Programming model
SNNAP design:
  • Efficient neural network evaluation
  • Low-latency communication
Evaluation & Comparison with HLS

Evaluation
Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of f_CPU) vs. precise CPU execution

application    domain           error metric
blackscholes   option pricing   MSE
fft            DSP              MSE
inversek2j     robotics         MSE
jmeint         3D-modeling      miss rate
jpeg           compression      image diff
kmeans         ML               image diff
sobel          vision           image diff

Whole-Application Speedup
[Figure: bar chart of whole-application speedup per application and geometric mean; y-axis: Whole Application Speedup, 0.00 to 4.00; bar labels as extracted: 10.8, 38.1, 3.8, 2.7, 2.4, 2.3, 1.5, 1.3]

Energy Savings
Energy = Power * Runtime on (DRAM + SoC)
[Figure: bar chart of energy savings per application and geometric mean; y-axis: Energy Savings, 0.00 to 4.00; bar labels as extracted: 7.8, 28.0, 2.8, 2.2, 1.8, 1.7, 1.1, .9; annotation: +36%]

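The relation Energy = Power * Runtime links the energy-savings figure to the speedup figure. The short derivation below spells that out. The subscripts and the symbols P, T, E are notation introduced here, not taken from the slides.

```latex
\[
E = P \cdot T,
\qquad
\text{energy savings}
= \frac{E_{\mathrm{CPU}}}{E_{\mathrm{SNNAP}}}
= \frac{P_{\mathrm{CPU}} \, T_{\mathrm{CPU}}}{P_{\mathrm{SNNAP}} \, T_{\mathrm{SNNAP}}}
= \frac{\text{speedup}}{P_{\mathrm{SNNAP}} / P_{\mathrm{CPU}}}
\]
\[
\frac{P_{\mathrm{SNNAP}}}{P_{\mathrm{CPU}}}
= \frac{\text{speedup}}{\text{energy savings}}
= \frac{3.8}{2.8} \approx 1.36
\]
```

Plugging in the headline 3.8x speedup and 2.8x energy savings gives a power ratio of roughly 1.36. This would be consistent with the +36% annotation on the chart, assuming that annotation refers to the added power draw.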
Conclusion
float foo (float a, float b) {
  …
  return val;
}
[Figure: the region is approximated and accelerated on an NPU]
approximation, acceleration
compiler-support: ACCEPT
HW-support: SNNAP
3.8x speedup & 2.8x energy savings

Compilation and Hardware Support for Approximate Acceleration
Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu
ACCEPT: http://accept.rocks
SNNAP: upon request