
Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing
Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark A. Horowitz
Stanford University
Kevin Loughlin and Ian Neal

The Problem
● Many workloads require specialized hardware to meet performance expectations.
  ○ Ex. image processing (more on that shortly…)
● Unfortunately, performant specialized hardware comes at the cost of flexibility.
  ○ In some cases, power-efficiency is sacrificed as well!
● How can we create specialized HW that balances these 3 factors?
  ○ Reasonably performant
  ○ Reasonably power-efficient
  ○ Reasonably flexible (programmable)
Convolution Engine, Qadeer et al.

Example: Image Processing
● Image (and video) processing calls for specialized HW
  ○ General-purpose HW is not optimized for such high data parallelism
● Traditional Solution: Single Instruction Multiple Data (SIMD) Units
  ○ Extremely flexible (programmable), but way too slow
● Alternative: GPUs
  ○ Still flexible, and more performant than SIMD units, but consume way too much power
● Another Alternative: ASIC Accelerators
  ○ Very performant and power-efficient, but at the cost of flexibility: each applies to only 1 algorithm

Image Processing in Action
Source: Powell, Victor. “Image Kernels: Explained Visually.” http://setosa.io/ev/image-kernels/

Key Insight: Image Processing is Convolution-like
● Convolution: apply a mapping function to a stencil (chunk) of data, perform a reduction on the result, then shift the stencil and repeat
  ○ Iterative map-then-reduce: a common occurrence in image processing
● Use this insight to abstract over image processing algorithms
  ○ Rather than build an ASIC for 1 algorithm, build specialized HW for the class of algorithms
  ○ Allow users to program this specialized HW based on specific application needs
● Idea: Convolution Engine (CE)
  ○ An architecture that yields reasonable performance, power, and flexibility numbers for convolution-like algorithms
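The map-then-reduce pattern above can be sketched in a few lines of Python. This is only an illustration of the abstraction, not the CE hardware or its ISA: the function name `stencil_map_reduce` and its parameters are assumptions made for this example, and the stencil is applied without flipping the coefficients (correlation form), which matches convolution for symmetric kernels.

```python
from functools import reduce
import operator


def stencil_map_reduce(data, coeffs, map_fn, reduce_fn):
    """Slide a stencil over `data`; at each position map, then reduce.

    Hypothetical helper for illustration only.
    """
    width = len(coeffs)
    out = []
    for start in range(len(data) - width + 1):
        window = data[start:start + width]  # current stencil position
        mapped = [map_fn(x, c) for x, c in zip(window, coeffs)]  # map step
        out.append(reduce(reduce_fn, mapped))  # reduce step
    return out


# Filter-style use: map = multiply, reduce = add.
pixels = [1, 2, 3, 4, 5]
box = [1, 1, 1]
sums = stencil_map_reduce(pixels, box, operator.mul, operator.add)
print(sums)
```

Swapping `map_fn` or `reduce_fn` (for example, absolute difference with a sum, or multiply with a max) changes the algorithm without changing the sliding-stencil structure, which is the property the CE idea relies on.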

Design: Improving Efficiency
● Register file overheads (1D and 2D registers)
  ○ Shift registers are a natural extension of moving stencils
● Load/store unit
  ○ Multiple memory access widths, unaligned accesses
● Keeping things simple
  ○ Interface Units (IF) arrange data as needed for the map operation
  ○ Functional Units are just 2-input ALUs on pre-arranged data
● Complex Graph Fusion Unit (CGFU)
  ○ Combine up to 9 different convolution instructions into one “super instruction” in the reduce stage
● Lightweight SIMD Unit for all else
  ○ No multiplication, just add/subtract-type instructions
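To see why shift registers suit a moving stencil, here is a minimal software model, assuming a 1D register of fixed width. The class `ShiftRegister1D` and the `loads` counter are hypothetical names for this sketch and do not describe the actual CE register file.

```python
from collections import deque


class ShiftRegister1D:
    """Hypothetical model of a 1D shift register holding one stencil."""

    def __init__(self, width):
        self.cells = deque(maxlen=width)  # oldest value drops out on shift
        self.loads = 0                    # values fetched from memory so far

    def shift_in(self, value):
        self.cells.append(value)
        self.loads += 1

    def full(self):
        return len(self.cells) == self.cells.maxlen

    def window(self):
        return list(self.cells)


def sliding_sums(row, width):
    """Sum every `width`-wide window, loading each element only once."""
    reg = ShiftRegister1D(width)
    sums = []
    for value in row:
        reg.shift_in(value)  # moving the stencil costs one new load
        if reg.full():
            sums.append(sum(reg.window()))
    return sums, reg.loads


row = [3, 1, 4, 1, 5, 9]
sums, loads = sliding_sums(row, 3)
print(sums, loads)
```

In this model each stencil move costs a single new load, whereas refetching all three elements for each of the four windows would take 12 loads for the same row.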

Design: Providing Flexibility
● CE is a processor extension
  ○ Small set of new ISA instructions
  ○ Issued through compiler intrinsics in C code
● Configuration registers for kernel-constant values
  ○ Convolution size, ALU operation, etc.
● Completely software controlled
  ○ Can interleave non-CE instructions before the next convolution iteration
● Chained processors (slices) can be used for more complex convolutions

Evaluation
● 3 different algorithms
  ○ H.264 motion estimation (video encoding)
  ○ SIFT (feature detection)
  ○ Demosaic (interpret camera input)
● Measure “custom” ASIC vs CE vs SIMD
● Vary programmability of CE as well
  ○ Fixed kernel (equivalent to custom ASIC)
  ○ Multiple kernel sizes (more flexibility in interface units, register files, reduction stage)
  ○ Multiple flows (different dimensions, access patterns, but same operations)
  ○ Multiple arithmetic operations (full flexibility)
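As an illustration of how motion estimation fits the convolution-like pattern, the sketch below assumes the common sum-of-absolute-differences (SAD) cost for block matching; the slides do not name the metric, so treat that choice, the 1D simplification, and the function names `sad` and `best_match` as assumptions of this example.

```python
def sad(block, candidate):
    """Sum of absolute differences: map = |a - b|, reduce = +."""
    return sum(abs(a - b) for a, b in zip(block, candidate))


def best_match(block, search_row):
    """Return (offset, cost) of the lowest-SAD window in search_row."""
    width = len(block)
    best_offset, best_cost = None, None
    for offset in range(len(search_row) - width + 1):
        cost = sad(block, search_row[offset:offset + width])
        if best_cost is None or cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset, best_cost


current_block = [10, 20, 30]
reference_row = [0, 9, 21, 30, 50, 10]
offset, cost = best_match(current_block, reference_row)
print(offset, cost)
```

The stencil still slides across the data, but the map step is an absolute difference rather than a multiply, which corresponds to the "multiple arithmetic operations" level of programmability listed above.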

Evaluation: ASIC vs CE vs SIMD

Evaluation: Varying Flexibility

Key Results
● 8-15x less energy use than SIMD
● 2-3x more energy use than custom ASIC
● Within 6x performance of custom ASIC, 7x better than SIMD
● All programmable versions do better performance-wise than SIMD

Conclusion
● Better performance and power than SIMD
● Worse than fixed-application ASIC
● Moderate amount of flexibility
  ○ The greater the degree of programmability, the more performance gains are lost

Discussion Points
● Is image processing a broad enough domain for the authors to claim that their architecture applies to a “wide range” of applications?
● The authors compared CE to SIMD units and ASICs. Should a GPU comparison have also been given?
● CE is not the best option in any of the 3 relevant categories (performance, power, and flexibility). Is there sufficient motivation to use it?
