Convolution Engine
Balancing Efficiency & Flexibility in Specialized Computing
Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark A. Horowitz
Stanford University
Kevin Loughlin and Ian Neal
The Problem
● Many workloads require specialized hardware to meet performance expectations.
  ○ Ex. image processing (more on that shortly…)
● Unfortunately, performant specialized hardware comes at the cost of flexibility.
  ○ In some cases, power-efficiency is sacrificed as well!
● How can we create specialized HW that balances these 3 factors?
  ○ Reasonably performant
  ○ Reasonably power-efficient
  ○ Reasonably flexible (programmable)
Convolution Engine Qadeer et al. 2
Example: Image Processing
● Image (and video) processing calls for specialized HW
  ○ General-purpose HW is not optimized for such high data parallelism
● Traditional Solution: Single Instruction Multiple Data (SIMD) Units
  ○ Extremely flexible (programmable), but way too slow
● Alternative: GPUs
  ○ Still flexible, and more performant than SIMD Units, but consume way too much power
● Another Alternative: ASIC Accelerators
  ○ Very performant and power-efficient, but at the cost of flexibility: each applies to only 1 algorithm
Image Processing in Action
Source: Powell, Victor. “Image Kernels: Explained Visually.” http://setosa.io/ev/image-kernels/
Key Insight: Image Processing is Convolution-like
● Convolution: Apply a mapping function to a stencil (chunk) of data, perform a reduction on the result, then shift the stencil and repeat
  ○ Iterative map-then-reduce: common occurrence in image processing
● Use this insight to abstract over image processing algorithms
  ○ Rather than build an ASIC for 1 algorithm, build specialized HW for the class of algorithms
  ○ Allow users to program this specialized HW based on specific application needs
● Idea: Convolution Engine (CE)
  ○ An architecture that yields reasonable performance, power, and flexibility numbers for convolution-like algorithms
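The map-then-reduce pattern above can be sketched in a few lines of Python. This is only an illustration of the abstraction, not the CE hardware or its ISA: the function name `stencil_map_reduce` and the sample data are assumptions made for this example, and the stencil is applied without flipping. Swapping the map function turns a multiply-accumulate filter into a sum of absolute differences (the kind of operation used in motion estimation) without changing the surrounding loop.

```python
from functools import reduce
import operator


def stencil_map_reduce(data, stencil, map_fn, reduce_fn):
    """Slide `stencil` over `data`; at each position, map pairwise, then reduce.

    Hypothetical helper for illustration only (1D, no padding).
    """
    n = len(stencil)
    out = []
    for start in range(len(data) - n + 1):
        window = data[start:start + n]
        # Map: combine each data element with its matching stencil element.
        mapped = [map_fn(d, s) for d, s in zip(window, stencil)]
        # Reduce: collapse the mapped values into one output value.
        out.append(reduce(reduce_fn, mapped))
    return out


signal = [1, 2, 3, 4, 5]

# Multiply-then-add: an ordinary filtering step.
filtered = stencil_map_reduce(signal, [1, 0, -1], operator.mul, operator.add)

# Absolute-difference-then-add: sum of absolute differences (SAD).
sad = stencil_map_reduce(signal, [2, 3, 4], lambda a, b: abs(a - b), operator.add)

print(filtered)  # [-2, -2, -2]
print(sad)       # [3, 0, 3]
```

Only the two function arguments differ between the calls, which is the sense in which one piece of hardware could serve a class of algorithms.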
Design: Improving Efficiency
● Register file overheads (1D and 2D registers)
  ○ Shift registers are a natural extension of moving stencils
● Load/store unit
  ○ Multiple memory access widths, unaligned accesses
● Keeping things simple
  ○ Interface Units (IF) arrange data as needed for map operation
  ○ Functional Units are just 2-input ALUs on pre-arranged data
● Complex Graph Fusion Unit (CGFU)
  ○ Combine up to 9 different convolution instructions into one “super instruction” in reduce
● Lightweight SIMD Unit for all else
  ○ No multiplication, just add/subtract-type instructions
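To see why shift registers fit a moving stencil, here is a minimal software sketch of a 1D shift register. It is an assumption-laden toy, not the CE register file: the class `ShiftRegister1D`, the `loads` counter, and the function `filter_with_shift_register` are hypothetical names introduced for this example. Each input element is loaded once and then reused by every stencil position that overlaps it.

```python
from collections import deque


class ShiftRegister1D:
    """Toy model of a 1D shift register (illustration only)."""

    def __init__(self, width):
        # A bounded deque drops the oldest element when a new one is shifted in.
        self.cells = deque(maxlen=width)
        self.loads = 0  # number of elements fetched from "memory"

    def shift_in(self, value):
        self.cells.append(value)
        self.loads += 1

    def full(self):
        return len(self.cells) == self.cells.maxlen

    def window(self):
        return list(self.cells)


def filter_with_shift_register(data, coeffs):
    """Multiply-accumulate over a sliding window fed by the shift register."""
    reg = ShiftRegister1D(len(coeffs))
    out = []
    for value in data:
        reg.shift_in(value)  # one new load per output position
        if reg.full():
            out.append(sum(d * c for d, c in zip(reg.window(), coeffs)))
    return out, reg.loads


outputs, loads = filter_with_shift_register([1, 2, 3, 4, 5], [1, 0, -1])
print(outputs)  # [-2, -2, -2]
print(loads)    # 5
```

In this toy run, refetching the whole window for each of the 3 outputs would take 9 loads, while the shift register needs 5.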
Design: Providing Flexibility
● CE is a processor extension
  ○ Small set of new ISA instructions
  ○ Issued through C code compiler intrinsics
● Configuration registers for kernel-constant values
  ○ Convolution size, ALU operation, etc.
● Completely software controlled
  ○ Can interleave non-CE instructions before next convolution iteration
● Chained processors (slices) can be used for more complex convolution
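The programming model above (configure once, then issue one convolution step per iteration with ordinary code in between) might look roughly like the following software model. Everything here is hypothetical: `ToyCE`, `set_config`, `load_coeffs`, and `convolve_step` are names invented for this sketch and do not correspond to the real CE intrinsics, which the slides do not list.

```python
import operator


class ToyCE:
    """Hypothetical software stand-in for a configurable convolution unit."""

    # Assumed operation names; the real configuration encoding is not given.
    MAP_OPS = {"mul": operator.mul, "absdiff": lambda a, b: abs(a - b)}

    def __init__(self):
        self.size = 0
        self.map_fn = None
        self.coeffs = []

    def set_config(self, size, map_op):
        # Kernel-constant values are written once, before the loop.
        self.size = size
        self.map_fn = self.MAP_OPS[map_op]

    def load_coeffs(self, coeffs):
        if len(coeffs) != self.size:
            raise ValueError("coefficient count must match configured size")
        self.coeffs = list(coeffs)

    def convolve_step(self, window):
        # One map-then-reduce step; the reduction is fixed to addition here.
        return sum(self.map_fn(d, c) for d, c in zip(window, self.coeffs))


ce = ToyCE()
ce.set_config(size=3, map_op="absdiff")
ce.load_coeffs([2, 3, 4])

row = [1, 2, 3, 4, 5]
best_pos, best_sad = None, None
for pos in range(len(row) - 2):
    sad = ce.convolve_step(row[pos:pos + 3])
    # Ordinary (non-CE) code interleaved between convolution steps:
    if best_sad is None or sad < best_sad:
        best_pos, best_sad = pos, sad

print(best_pos, best_sad)  # 1 0
```

The point of the sketch is the control flow: because software drives each step, the best-match bookkeeping can run between steps without any support from the accelerator.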
Evaluation
● 3 different algorithms
  ○ H.264 motion estimation (video encoding)
  ○ SIFT (feature detection)
  ○ Demosaic (interpret camera input)
● Measure “custom” ASIC vs CE vs SIMD
● Vary programmability of CE as well
  ○ Fixed kernel (equivalent to custom ASIC)
  ○ Multiple kernel sizes (more flexibility in interface units, register files, reduction stage)
  ○ Multiple flows (different dimensions, access patterns, but same operations)
  ○ Multiple arithmetic operations (full flexibility)
Evaluation: ASIC vs CE vs SIMD
Evaluation: Varying Flexibility
Key Results
● 8-15x less energy use than SIMD
● 2-3x more energy use than custom ASIC
● Within 6x performance of custom ASIC, 7x better than SIMD
● All programmable versions do better performance-wise than SIMD
Conclusion
● Better performance and power than SIMD
● Worse than fixed-application ASIC
● Moderate amount of flexibility
  ○ The greater the degree of programmability, the more performance gains are lost
Discussion Points
● Is image processing a broad enough domain for the authors to claim that their architecture applies to a “wide range” of applications?
● The authors compared CE to SIMD units and ASICs. Should a GPU comparison have also been given?
● CE doesn’t optimize any of the 3 relevant categories (performance, power, and flexibility). Is there sufficient motivation to use it?