SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration
Thierry Moreau, Hadi Esmaeilzadeh, Mark Wyse, Luis Ceze, Jacob Nelson, Mark Oskin, Adrian Sampson
Approximate Computing
Expose quality-performance trade-offs:
- Accurate but expensive
- Approximate but cheap
Domains include image processing, machine learning, search, physical simulation, multimedia, etc.
Neural Acceleration
[Figure: a hot, approximable code region, float foo(float a, float b) { ... return val; }, is approximated by a neural network and accelerated on a Neural Processing Unit (NPU); the CPU pipeline invokes the NPU in place of the original function]
Esmaeilzadeh et al. [MICRO 2012]
SNNAP
A neural processing unit on off-the-shelf programmable SoCs:
- 3.8x speedup and 2.8x efficiency gains
- Offers an alternative to HLS tools for neural acceleration
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Background: Compilation
1. Region detection & program annotation (code instrumentation)
2. ANN training (backpropagation) & topology search over collected training data
3. Code generation: a SNNAP-enabled binary for the CPU
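For illustration, a minimal sketch of what the generated code from step 3 might look like. The snnap_enqueue/snnap_flush/snnap_read names are illustrative assumptions, not SNNAP's actual API:

    /* Hypothetical sketch of compiler-generated invocation code (step 3).
     * The snnap_* calls are placeholder names for the runtime interface. */
    float foo(float a, float b) {
        float in[2] = { a, b };
        float out[1];
        snnap_enqueue(in, 2);   /* stage the NN inputs */
        snnap_flush();          /* invoke SNNAP and wait for completion */
        snnap_read(out, 1);     /* collect the NN's approximation of foo */
        return out[0];
    }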
Programming Model

    float sobel(float* p);
    . . .
    Image src;
    Image dst;
    while (true) {
        src = read_from_camera();
        for (y = 0; y < h; ++y) {
            for (x = 0; x < w; ++x) {
                dst.p[y][x] = sobel(&src.p[y][x]);
            }
        }
        display(dst);
    }
Programming Model

    APPROX float sobel(APPROX float* p);
    . . .
    APPROX Image src;
    APPROX Image dst;
    while (true) {
        src = read_from_camera();
        for (y = 0; y < h; ++y) {
            for (x = 0; x < w; ++x) {
                dst.p[y][x] = sobel(&src.p[y][x]);
            }
        }
        display(dst);
    }

sobel is a good acceleration target: ✅ no side effects ✅ executes often
ACCEPT: compilation framework for approximate programs
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Background: Multi-Layer Perceptrons
Computing a single layer: each neuron applies an activation function f to a weighted sum of the previous layer's outputs, e.g.

    x_7 = f\left( \sum_{i=4}^{6} w_{i7} \cdot x_i \right)

or, for the whole layer in matrix form:

    \begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix}
    = f\left( \begin{bmatrix} w_{47} & w_{57} & w_{67} \\ w_{48} & w_{58} & w_{68} \\ w_{49} & w_{59} & w_{69} \end{bmatrix}
    \begin{bmatrix} x_4 \\ x_5 \\ x_6 \end{bmatrix} \right)

[Figure: a multi-layer perceptron with an input layer (x_0 ... x_3), hidden layers 0 and 1, and an output layer (y_0, y_1)]
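As a concrete illustration, a minimal C sketch of one-layer evaluation (a sigmoid activation is assumed here, matching the sigmoid unit described later):

    #include <math.h>

    /* Sigmoid activation: f(x) = 1 / (1 + e^-x). */
    static float sigmoid(float x) {
        return 1.0f / (1.0f + expf(-x));
    }

    /* Evaluate one fully-connected layer: out[j] = f(sum_i w[j][i] * in[i]).
     * n_in/n_out are the layer widths; w is row-major, one row per output. */
    static void mlp_layer(const float *w, const float *in, float *out,
                          int n_in, int n_out) {
        for (int j = 0; j < n_out; ++j) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; ++i)
                acc += w[j * n_in + i] * in[i];
            out[j] = sigmoid(acc);
        }
    }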
Background: Systolic Arrays
The same single-layer computation maps naturally onto a systolic array: each processing element holds one synaptic weight w_ij, the inputs x_4, x_5, x_6 stream through the array while partial sums accumulate, and the activation unit f emits the outputs x_7, x_8, x_9.
[Figure: inputs streaming through a 3x3 grid of weight-holding processing elements, feeding a shared activation unit f]
PU Micro-Architecture
1. Processing elements (PEs) implemented in DSP logic form the systolic array
2. Local storage holds the synaptic weights
3. A sigmoid unit implements non-linear activation functions
4. A vertically micro-coded sequencer controls the PU
[Figure: PU block diagram with control, PE column, weight storage, and sigmoid unit]
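Hardware sigmoid units are commonly built as lookup tables; a minimal C model of that approach is sketched below. The table size, input range, and precision are assumptions for illustration, not SNNAP's actual parameters:

    #include <math.h>

    #define LUT_BITS 10
    #define LUT_SIZE (1 << LUT_BITS)   /* 1024 entries, an assumed size */
    #define SIG_MAX  8.0f              /* inputs clamped to [-8, 8) */

    static float sigmoid_lut[LUT_SIZE];

    /* Fill the table once; hardware would hold this in a small ROM/BRAM. */
    void sigmoid_lut_init(void) {
        for (int i = 0; i < LUT_SIZE; ++i) {
            float x = -SIG_MAX + 2.0f * SIG_MAX * i / LUT_SIZE;
            sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
        }
    }

    /* Table-based activation: clamp, scale to an index, look up. */
    float sigmoid_hw(float x) {
        if (x <= -SIG_MAX) return 0.0f;
        if (x >=  SIG_MAX) return 1.0f;
        int idx = (int)((x + SIG_MAX) * (LUT_SIZE / (2.0f * SIG_MAX)));
        return sigmoid_lut[idx];
    }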
Multi-Processing Units
Multiple PUs are replicated and coordinated by a scheduler: each PU has its own control, PEs, weight storage, and sigmoid unit, and all share a bus to the AXI master interface.
[Figure: scheduler connected over a bus to four PUs and the AXI master]
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
CPU-SNNAP Integration
Interface requirements:
- Low-latency data transfer
- Fast signaling
Design:
- Coherent reads & writes between the accelerator and the CPU's L1/L2 caches via the ARM Accelerator Coherency Port (ACP)
- A custom DMA mastering interface drives the ACP on SNNAP's behalf
- ARM SEV/WFE event instructions provide low-latency signaling, sleep & wakeup
[Figure: CPU with L1/L2 caches and SEV/WFE, connected through the ACP and DMA master to the SNNAP scheduler, bus, and PUs]
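A minimal sketch of the SEV/WFE signaling idiom on the CPU side, assuming a hypothetical memory-mapped doorbell and done flag (the addresses and register layout are invented for illustration; memory barriers are elided for brevity):

    #include <stdint.h>

    /* Hypothetical addresses and register layout -- illustration only. */
    #define SNNAP_DOORBELL ((volatile uint32_t *)0x43C00000u)
    #define SNNAP_DONE     ((volatile uint32_t *)0x43C00004u)

    /* Kick off an invocation: ring the doorbell, then issue SEV so the
     * accelerator side wakes immediately. */
    static inline void snnap_kick(void) {
        *SNNAP_DOORBELL = 1;
        __asm__ volatile("sev" ::: "memory");
    }

    /* Wait for completion: WFE stalls the core in a low-power state until
     * an event arrives, then the done flag is re-checked. */
    static inline void snnap_wait(void) {
        while (*SNNAP_DONE == 0)
            __asm__ volatile("wfe" ::: "memory");
    }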
Talk Outline Introduction Programming model SNNAP design: • Efficient neural network evaluation • Low-latency communication Evaluation & Comparison with HLS
Evaluation
Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of f_CPU) vs. precise CPU execution

application  | domain         | error metric
blackscholes | option pricing | MSE
fft          | DSP            | MSE
inversek2j   | robotics       | MSE
jmeint       | 3D modeling    | miss rate
jpeg         | compression    | image diff
kmeans       | ML             | image diff
sobel        | vision         | image diff
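For reference, the MSE metric used for the first three benchmarks is the standard mean squared error between precise and approximate outputs; a minimal sketch:

    /* Mean squared error between precise and approximate outputs. */
    double mse(const float *precise, const float *approx, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = (double)precise[i] - (double)approx[i];
            sum += d * d;
        }
        return sum / n;
    }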
Speedup
[Bar chart: whole-application speedup per benchmark vs. precise CPU execution; five benchmarks fall between 1.3x and 2.7x, two outliers reach 10.8x and 38.1x, and the geometric mean is 3.8x]
Factors determining speedup:
- Amdahl's speedup (how much of the application the NN replaces)
- Cost of the replaced instructions on the CPU vs. cost of the NN on SNNAP

                   inversek2j    kmeans
Amdahl's speedup   >100x         1.47x
CPU cost           1660 cycles   29 cycles
NN hidden layers   1             2
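The interplay of these two factors can be written as a standard Amdahl-style model (my formulation, not an equation from the talk): with fraction f of runtime spent in the target region and per-invocation costs t_CPU and t_NN,

    \[
    \text{speedup} = \frac{1}{(1 - f) + f \cdot \dfrac{t_{NN}}{t_{CPU}}}
    \]

For inversek2j, t_CPU = 1660 cycles dwarfs the NN cost, so the realized speedup approaches the Amdahl limit; for kmeans, the Amdahl limit is only 1.47x and t_CPU is just 29 cycles, so even a small NN on SNNAP costs about as much as the original code and the net gain is marginal.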
Energy Savings
Energy = Power x Runtime, measured on DRAM + SoC; running SNNAP adds +36% power, but the runtime reduction more than compensates
[Bar chart: whole-application energy savings per benchmark; five benchmarks fall between 0.9x and 2.2x, two outliers reach 7.8x and 28.0x, and the geometric mean is 2.8x]
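These numbers are mutually consistent (a back-of-the-envelope check, assuming +36% is the average power overhead while SNNAP runs):

    \[
    \text{energy savings} = \frac{P_{CPU} \cdot T_{CPU}}{P_{SNNAP} \cdot T_{SNNAP}}
    \approx \frac{3.8}{1.36} \approx 2.8\times
    \]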
HW Acceleration
Neural acceleration with SNNAP vs. high-level synthesis (HLS) compilers: which one should you use?
HLS Comparison Study
Two paths to FPGA acceleration of a target function:
- Neural transform: the function is approximated by an NN that executes on SNNAP; no per-application FPGA netlist is compiled down
- HLS: the function itself is compiled down to a custom FPGA netlist
Comparison metric: resource-normalized throughput, accounting for
- pipeline invocation interval
- maximum frequency
- resource utilization
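One plausible way to read "resource-normalized throughput" from the three listed ingredients (my formulation, not necessarily the paper's exact metric): with II the pipeline invocation interval in cycles and f_max the achieved clock,

    \[
    \text{normalized throughput} = \frac{f_{max} / II}{\text{resource utilization}}
    \]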
HLS Comparison Study
[Bar chart, log scale: throughput of neural acceleration normalized to HLS; above 1.0 neural acceleration is better, below 1.0 HLS is better. Two benchmarks reach 43.7x and 7.9x, two sit near parity (1.6x, 1.3x), three favor HLS (0.5x, 0.4x, 0.2x); geometric mean 1.6x]

                  Neural Accel. | HLS
Precision                       | ✅
Virtualization    ✅            |
Performance       ~             | ~
Programmability   ✅            |
Conclusion
SNNAP: apply approximate computing on programmable SoCs through neural acceleration
- 3.8x speedup & 2.8x energy savings
- Neural acceleration is a viable alternative to HLS
SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration
Thierry Moreau (moreau@uw.edu), Hadi Esmaeilzadeh, Mark Wyse, Luis Ceze, Jacob Nelson, Mark Oskin, Adrian Sampson
http://sampa.cs.washington.edu/