
Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing



  1. Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing
     Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, Mark A. Horowitz
     Stanford University
     Kevin Loughlin and Ian Neal

  2. The Problem
     ● Many workloads require specialized hardware to meet performance expectations.
       ○ Ex. image processing (more on that shortly…)
     ● Unfortunately, performant specialized hardware comes at the cost of flexibility.
       ○ In some cases, power-efficiency is sacrificed as well!
     ● How can we create specialized HW that can balance these 3 factors?
       ○ Reasonably performant
       ○ Reasonably power-efficient
       ○ Reasonably flexible (programmable)
     Convolution Engine Qadeer et al. 2

  3. Example: Image Processing
     ● Image (and video) processing calls for specialized HW
       ○ General-purpose HW not optimized for such high data parallelism
     ● Traditional Solution: Single Instruction Multiple Data (SIMD) Units
       ○ Extremely flexible (programmable), but way too slow
     ● Alternative: GPUs
       ○ Still flexible, and more performant than SIMD Units, but consume way too much power
     ● Another Alternative: ASIC Accelerators
       ○ Very performant and power-efficient, but at the cost of flexibility: each applies to only 1 algorithm

  4. Image Processing in Action
     Source: Powell, Victor. “Image Kernels: Explained Visually.” http://setosa.io/ev/image-kernels/

  5. Key Insight: Image Processing is Convolution-like
     ● Convolution: Apply a mapping function to a stencil (chunk) of data, perform a reduction on the result, then shift stencil and repeat
       ○ Iterative map-then-reduce: common occurrence in image processing
     ● Use this insight to abstract over image processing algorithms
       ○ Rather than build an ASIC for 1 algorithm, build specialized HW for the class of algorithms
       ○ Allow users to program this specialized HW based on specific application needs
     ● Idea: Convolution Engine (CE)
       ○ An architecture that yields reasonable performance, power, and flexibility numbers for convolution-like algorithms
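The map-then-reduce pattern described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the CE hardware or its programming interface; the names `convolve_2d`, `map_fn`, and `reduce_fn` are made up for this example. Like most image-processing code, it slides the kernel without flipping it.

```python
from functools import reduce
import operator


def convolve_2d(image, kernel, map_fn=operator.mul, reduce_fn=operator.add):
    """Slide a stencil over `image`; map each (pixel, coefficient) pair, then reduce."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):          # shift the stencil down
        row = []
        for c in range(out_w):      # shift the stencil right
            # Map step: combine every stencil element with its coefficient.
            mapped = [map_fn(image[r + i][c + j], kernel[i][j])
                      for i in range(kh) for j in range(kw)]
            # Reduce step: collapse the mapped values into one output value.
            row.append(reduce(reduce_fn, mapped))
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1],
       [1, 1]]
print(convolve_2d(image, box))  # [[12, 16], [24, 28]]
```

Swapping `map_fn` and `reduce_fn` (for example, absolute difference and sum, or multiply and max) changes the algorithm without changing the loop structure, which is the property the slide relies on.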

  6. Design: Improving Efficiency
     ● Register file overheads (1D and 2D registers)
       ○ Shift registers are a natural extension of moving stencils
     ● Load/store unit
       ○ Multiple memory access widths, unaligned accesses
     ● Keeping things simple
       ○ Interface Units (IF) arrange data as needed for map operation
       ○ Functional Units are just 2-input ALUs on pre-arranged data
     ● Complex Graph Fusion Unit (CGFU)
       ○ Combine up to 9 different convolution instructions into one “super instruction” in reduce
     ● Lightweight SIMD Unit for all else
       ○ No multiplication, just add/subtract-type instructions
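To see why shift registers suit moving stencils, here is a rough software analogy using a fixed-length `deque`. It is an assumption-laden sketch, not a model of the actual CE register files: the function name `stencils_from_shift_register` is hypothetical, and real hardware would expose the window to the functional units in parallel rather than yield tuples.

```python
from collections import deque


def stencils_from_shift_register(pixels, width):
    """Yield each `width`-wide stencil, loading only one new pixel per step."""
    reg = deque(maxlen=width)  # fixed-size register; appending drops the oldest element
    for p in pixels:
        reg.append(p)          # shift in one new element
        if len(reg) == width:  # register is full: a complete stencil is available
            yield tuple(reg)


windows = list(stencils_from_shift_register([10, 20, 30, 40, 50], 3))
print(windows)  # [(10, 20, 30), (20, 30, 40), (30, 40, 50)]
```

Each new stencil reuses `width - 1` elements already held in the register, so only one new value has to be fetched per shift.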

  7. Design: Providing Flexibility
     ● CE is a processor extension
       ○ Small set of new ISA instructions
       ○ Issued through C code compiler intrinsics
     ● Configuration registers for kernel-constant values
       ○ Convolution size, ALU operation, etc.
     ● Completely software controlled
       ○ Can interleave non-CE instructions before next convolution iteration
     ● Chained processors (slices) can be used for more complex convolution
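The configure-once, step-many usage pattern this slide describes might look roughly like the toy model below. Everything here is hypothetical: `ToyConvolutionEngine`, `set_config`, and `convolve_step` are invented stand-ins and do not correspond to the real CE instructions or compiler intrinsics, which the slides do not list.

```python
import operator


class ToyConvolutionEngine:
    """Hypothetical software stand-in; NOT the real CE ISA or intrinsics."""

    def __init__(self):
        self.config = {}

    def set_config(self, size, map_op, reduce_op):
        # Kernel-constant values are written once, like configuration registers.
        self.config = {"size": size, "map_op": map_op, "reduce_op": reduce_op}

    def convolve_step(self, data, coeffs, position):
        # One convolution iteration: map over the stencil, then reduce.
        size = self.config["size"]
        window = data[position:position + size]
        mapped = [self.config["map_op"](x, k) for x, k in zip(window, coeffs)]
        acc = mapped[0]
        for value in mapped[1:]:
            acc = self.config["reduce_op"](acc, value)
        return acc


engine = ToyConvolutionEngine()
engine.set_config(size=3, map_op=operator.mul, reduce_op=operator.add)

data = [1, 2, 3, 4, 5]
coeffs = [1, 0, -1]
outputs = []
for pos in range(len(data) - 3 + 1):
    outputs.append(engine.convolve_step(data, coeffs, pos))
    # Ordinary host code could run here before the next step,
    # mirroring the interleaving of non-CE instructions.
print(outputs)  # [-2, -2, -2]
```

Because control stays in the software loop, the host program decides when each step runs and what happens between steps.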

  8. Evaluation
     ● 3 different algorithms
       ○ H.264 motion estimation (video encoding)
       ○ SIFT (feature detection)
       ○ demosaic (interpret camera input)
     ● Measure “custom” ASIC vs CE vs SIMD
     ● Vary programmability of CE as well
       ○ Fixed kernel (equivalent to custom ASIC)
       ○ Multiple kernel sizes (more flexibility in interface units, register files, reduction stage)
       ○ Multiple flows (different dimensions, access patterns, but same operations)
       ○ Multiple arithmetic operations (full flexibility)
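As an illustration of how a benchmark like motion estimation fits the map-then-reduce mold, the sketch below does a brute-force block search using the sum of absolute differences (SAD), a common block-matching cost. The slides do not say which cost function the evaluation used, so SAD is an assumption here, and the helper names `sad` and `best_match` are made up for the example.

```python
def sad(block_a, block_b):
    """Map: absolute difference per pixel pair. Reduce: sum."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_match(reference, block):
    """Slide `block` over `reference`; return (cost, row, col) of the lowest SAD."""
    bh, bw = len(block), len(block[0])
    best = None
    for r in range(len(reference) - bh + 1):
        for c in range(len(reference[0]) - bw + 1):
            window = [row[c:c + bw] for row in reference[r:r + bh]]
            cost = sad(window, block)
            if best is None or cost < best[0]:
                best = (cost, r, c)
    return best


reference = [[0, 0, 0, 0],
             [0, 5, 6, 0],
             [0, 7, 8, 0],
             [0, 0, 0, 0]]
block = [[5, 6],
         [7, 8]]
print(best_match(reference, block))  # (0, 1, 1)
```

The loop structure is the same sliding stencil as an ordinary filter; only the map operation (absolute difference instead of multiply) differs.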

  9. Evaluation: ASIC vs CE vs SIMD

  10. Evaluation: Varying Flexibility

  11. Key Results
     ● 8-15x less energy use than SIMD
     ● 2-3x more energy use than custom ASIC
     ● Within 6x performance of custom ASIC, 7x better than SIMD
     ● All programmable versions do better performance-wise than SIMD

  12. Conclusion
     ● Better performance and power than SIMD
     ● Worse than fixed-application ASIC
     ● Moderate amount of flexibility
       ○ The greater the degree of programmability, the more performance gains are lost

  13. Discussion Points
     ● Is image processing a broad enough domain to claim that they apply their architecture to a “wide range” of applications?
     ● The authors compared CE to SIMD units and ASICs. Should a GPU comparison have also been given?
     ● CE doesn’t optimize any of the 3 relevant categories (performance, power, and flexibility). Is there sufficient motivation to use it?
