SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration
Thierry Moreau, Hadi Esmaeilzadeh, Mark Wyse, Luis Ceze, Jacob Nelson, Mark Oskin, Adrian Sampson
Approximate Computing
Expose quality-performance trade-offs:
- Accurate but expensive
- Approximate but cheap
Domains include image processing, machine learning, search, physical simulation, multimedia, etc.
Neural Acceleration
[Figure: a hot, approximable code region, float foo(float a, float b) { ... return val; }, is approximated by a neural network and accelerated on a Neural Processing Unit (NPU); the CPU pipeline invokes the NPU in place of the original function]
Esmaeilzadeh et al. [MICRO 2012]
SNNAP
A neural processing unit on off-the-shelf programmable SoCs:
- 3.8x speedup and 2.8x efficiency gains
- Offers an alternative to HLS tools for neural acceleration
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Background: Compilation
1. Region detection & program annotation (code instrumentation)
2. ANN training (backpropagation) & topology search over collected training data
3. Code generation: a SNNAP-enabled binary for the CPU
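For illustration, a minimal sketch of what the generated code from step 3 might look like. The snnap_enqueue/snnap_flush/snnap_read names are illustrative assumptions, not SNNAP's actual API:

    /* Hypothetical sketch of compiler-generated invocation code (step 3).
     * The snnap_* calls are placeholder names for the runtime interface. */
    float foo(float a, float b) {
        float in[2] = { a, b };
        float out[1];
        snnap_enqueue(in, 2);   /* stage the NN inputs */
        snnap_flush();          /* invoke SNNAP and wait for completion */
        snnap_read(out, 1);     /* collect the NN's approximation of foo */
        return out[0];
    }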
Programming Model

    float sobel(float* p);
    . . .
    Image src;
    Image dst;
    while (true) {
        src = read_from_camera();
        for (y = 0; y < h; ++y) {
            for (x = 0; x < w; ++x) {
                dst.p[y][x] = sobel(&src.p[y][x]);
            }
        }
        display(dst);
    }
Programming Model

    APPROX float sobel(APPROX float* p);
    . . .
    APPROX Image src;
    APPROX Image dst;
    while (true) {
        src = read_from_camera();
        for (y = 0; y < h; ++y) {
            for (x = 0; x < w; ++x) {
                dst.p[y][x] = sobel(&src.p[y][x]);
            }
        }
        display(dst);
    }

sobel is a good acceleration target: ✅ no side effects ✅ executes often
ACCEPT: compilation framework for approximate programs
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Background: Multi-Layer Perceptrons
Computing a single layer: each neuron applies an activation function f to a weighted sum of the previous layer's outputs, e.g.

    x_7 = f\left( \sum_{i=4}^{6} w_{i7} \cdot x_i \right)

or, for the whole layer in matrix form:

    \begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix}
    = f\left( \begin{bmatrix} w_{47} & w_{57} & w_{67} \\ w_{48} & w_{58} & w_{68} \\ w_{49} & w_{59} & w_{69} \end{bmatrix}
    \begin{bmatrix} x_4 \\ x_5 \\ x_6 \end{bmatrix} \right)

[Figure: a multi-layer perceptron with an input layer (x_0 ... x_3), hidden layers 0 and 1, and an output layer (y_0, y_1)]
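As a concrete illustration, a minimal C sketch of one-layer evaluation (a sigmoid activation is assumed here, matching the sigmoid unit described later):

    #include <math.h>

    /* Sigmoid activation: f(x) = 1 / (1 + e^-x). */
    static float sigmoid(float x) {
        return 1.0f / (1.0f + expf(-x));
    }

    /* Evaluate one fully-connected layer: out[j] = f(sum_i w[j][i] * in[i]).
     * n_in/n_out are the layer widths; w is row-major, one row per output. */
    static void mlp_layer(const float *w, const float *in, float *out,
                          int n_in, int n_out) {
        for (int j = 0; j < n_out; ++j) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; ++i)
                acc += w[j * n_in + i] * in[i];
            out[j] = sigmoid(acc);
        }
    }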
Background: Systolic Arrays
The same single-layer computation maps naturally onto a systolic array: each processing element holds one synaptic weight w_ij, the inputs x_4, x_5, x_6 stream through the array while partial sums accumulate, and the activation unit f emits the outputs x_7, x_8, x_9.
[Figure: inputs streaming through a 3x3 grid of weight-holding processing elements, feeding a shared activation unit f]
PU Micro-Architecture
1. Processing elements (PEs) implemented in DSP logic form the systolic array
2. Local storage holds the synaptic weights
3. A sigmoid unit implements non-linear activation functions
4. A vertically micro-coded sequencer controls the PU
[Figure: PU block diagram with control, PE column, weight storage, and sigmoid unit]
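Hardware sigmoid units are commonly built as lookup tables; a minimal C model of that approach is sketched below. The table size, input range, and precision are assumptions for illustration, not SNNAP's actual parameters:

    #include <math.h>

    #define LUT_BITS 10
    #define LUT_SIZE (1 << LUT_BITS)   /* 1024 entries, an assumed size */
    #define SIG_MAX  8.0f              /* inputs clamped to [-8, 8) */

    static float sigmoid_lut[LUT_SIZE];

    /* Fill the table once; hardware would hold this in a small ROM/BRAM. */
    void sigmoid_lut_init(void) {
        for (int i = 0; i < LUT_SIZE; ++i) {
            float x = -SIG_MAX + 2.0f * SIG_MAX * i / LUT_SIZE;
            sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
        }
    }

    /* Table-based activation: clamp, scale to an index, look up. */
    float sigmoid_hw(float x) {
        if (x <= -SIG_MAX) return 0.0f;
        if (x >=  SIG_MAX) return 1.0f;
        int idx = (int)((x + SIG_MAX) * (LUT_SIZE / (2.0f * SIG_MAX)));
        return sigmoid_lut[idx];
    }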
Multi-Processing Units
Multiple PUs are replicated and coordinated by a scheduler: each PU has its own control, PEs, weight storage, and sigmoid unit, and all share a bus to the AXI master interface.
[Figure: scheduler connected over a bus to four PUs and the AXI master]
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
CPU-SNNAP Integration
Interface requirements:
- Low-latency data transfer
- Fast signaling
Design:
- Coherent reads & writes between the accelerator and the CPU's L1/L2 caches via the ARM Accelerator Coherency Port (ACP)
- A custom DMA mastering interface drives the ACP on SNNAP's behalf
- ARM SEV/WFE event instructions provide low-latency signaling, sleep & wakeup
[Figure: CPU with L1/L2 caches and SEV/WFE, connected through the ACP and DMA master to the SNNAP scheduler, bus, and PUs]
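A minimal sketch of the SEV/WFE signaling idiom on the CPU side, assuming a hypothetical memory-mapped doorbell and done flag (the addresses and register layout are invented for illustration; memory barriers are elided for brevity):

    #include <stdint.h>

    /* Hypothetical addresses and register layout -- illustration only. */
    #define SNNAP_DOORBELL ((volatile uint32_t *)0x43C00000u)
    #define SNNAP_DONE     ((volatile uint32_t *)0x43C00004u)

    /* Kick off an invocation: ring the doorbell, then issue SEV so the
     * accelerator side wakes immediately. */
    static inline void snnap_kick(void) {
        *SNNAP_DOORBELL = 1;
        __asm__ volatile("sev" ::: "memory");
    }

    /* Wait for completion: WFE stalls the core in a low-power state until
     * an event arrives, then the done flag is re-checked. */
    static inline void snnap_wait(void) {
        while (*SNNAP_DONE == 0)
            __asm__ volatile("wfe" ::: "memory");
    }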
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Evaluation
Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of f_CPU) vs. precise CPU execution

application  | domain         | error metric
blackscholes | option pricing | MSE
fft          | DSP            | MSE
inversek2j   | robotics       | MSE
jmeint       | 3D modeling    | miss rate
jpeg         | compression    | image diff
kmeans       | ML             | image diff
sobel        | vision         | image diff
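For reference, the MSE metric used for the first three benchmarks is the standard mean squared error between precise and approximate outputs; a minimal sketch:

    /* Mean squared error between precise and approximate outputs. */
    double mse(const float *precise, const float *approx, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = (double)precise[i] - (double)approx[i];
            sum += d * d;
        }
        return sum / n;
    }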
Speedup
[Bar chart: whole-application speedup per benchmark vs. precise CPU execution; five benchmarks fall between 1.3x and 2.7x, two outliers reach 10.8x and 38.1x, and the geometric mean is 3.8x]
Factors determining speedup:
- Amdahl's speedup (how much of the application the NN replaces)
- Cost of the replaced instructions on the CPU vs. cost of the NN on SNNAP

                   inversek2j    kmeans
Amdahl's speedup   >100x         1.47x
CPU cost           1660 cycles   29 cycles
NN hidden layers   1             2
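The interplay of these two factors can be written as a standard Amdahl-style model (my formulation, not an equation from the talk): with fraction f of runtime spent in the target region and per-invocation costs t_CPU and t_NN,

    \[
    \text{speedup} = \frac{1}{(1 - f) + f \cdot \dfrac{t_{NN}}{t_{CPU}}}
    \]

For inversek2j, t_CPU = 1660 cycles dwarfs the NN cost, so the realized speedup approaches the Amdahl limit; for kmeans, the Amdahl limit is only 1.47x and t_CPU is just 29 cycles, so even a small NN on SNNAP costs about as much as the original code and the net gain is marginal.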
Energy Savings
Energy = Power x Runtime, measured on DRAM + SoC; running SNNAP adds +36% power, but the runtime reduction more than compensates
[Bar chart: whole-application energy savings per benchmark; five benchmarks fall between 0.9x and 2.2x, two outliers reach 7.8x and 28.0x, and the geometric mean is 2.8x]
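These numbers are mutually consistent (a back-of-the-envelope check, assuming +36% is the average power overhead while SNNAP runs):

    \[
    \text{energy savings} = \frac{P_{CPU} \cdot T_{CPU}}{P_{SNNAP} \cdot T_{SNNAP}}
    \approx \frac{3.8}{1.36} \approx 2.8\times
    \]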
HW Acceleration
Neural acceleration with SNNAP vs. high-level synthesis (HLS) compilers: which one should you use?
HLS Comparison Study
Two paths to FPGA acceleration of a target function:
- Neural transform: the function is approximated by an NN that executes on SNNAP; no per-application FPGA netlist is compiled down
- HLS: the function itself is compiled down to a custom FPGA netlist
Comparison metric: resource-normalized throughput, accounting for
- pipeline invocation interval
- maximum frequency
- resource utilization
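One plausible way to read "resource-normalized throughput" from the three listed ingredients (my formulation, not necessarily the paper's exact metric): with II the pipeline invocation interval in cycles and f_max the achieved clock,

    \[
    \text{normalized throughput} = \frac{f_{max} / II}{\text{resource utilization}}
    \]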
HLS Comparison Study
[Bar chart, log scale: throughput of neural acceleration normalized to HLS; above 1.0 neural acceleration is better, below 1.0 HLS is better. Two benchmarks reach 43.7x and 7.9x, two sit near parity (1.6x, 1.3x), three favor HLS (0.5x, 0.4x, 0.2x); geometric mean 1.6x]

                  Neural Accel. | HLS
Precision                       | ✅
Virtualization    ✅            |
Performance       ~             | ~
Programmability   ✅            |
Conclusion
SNNAP: apply approximate computing on programmable SoCs through neural acceleration
- 3.8x speedup & 2.8x energy savings
- Neural acceleration is a viable alternative to HLS
SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration
Thierry Moreau (moreau@uw.edu), Hadi Esmaeilzadeh, Mark Wyse, Luis Ceze, Jacob Nelson, Mark Oskin, Adrian Sampson
http://sampa.cs.washington.edu/