
Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing


  1. Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing
     Wajahat Qadeer and Rehan Hameed (did the heavy lifting but could not come today), Ofer Shacham (that’s me), Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz
     http://www.c2s2.org
     Stanford University

  2. Smile, you’re on camera
     - By show of hands, who here has an (HD) camera on them?
     - How many CPUs/GPUs in the room?
     - How many of those xPUs are used for the image processing?
     ISCA'13 shacham@alumni.stanford.edu

  3. Imaging and video systems
     - High computational requirements, low power budget
     - Stills: ~10M pixels x 10 frames per second
     - Video: ~2M pixels x 30 frames per second
     - ~400 math operations per pixel (just for the image acquisition)
     - On CPU… not enough horsepower
     - On GPU… too much power
     - Typically use special-purpose custom HW
     - About 500X better performance, 500X lower energy than CPU
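
A quick sketch of the arithmetic behind these numbers. It only multiplies out the slide's round figures; the function name `ops_per_second` is made up for illustration.

```python
# Back-of-the-envelope compute demand, using the slide's round numbers.
OPS_PER_PIXEL = 400  # "~400 math operations per pixel" from the slide


def ops_per_second(pixels_per_frame: float, frames_per_second: float) -> float:
    """Math operations per second needed to keep up with the sensor."""
    return pixels_per_frame * frames_per_second * OPS_PER_PIXEL


stills = ops_per_second(10e6, 10)  # ~10M pixels x 10 frames per second
video = ops_per_second(2e6, 30)    # ~2M pixels x 30 frames per second

print(f"stills: {stills / 1e9:.0f} Gops/s")
print(f"video:  {video / 1e9:.0f} Gops/s")
```

Under these figures both modes need tens of billions of operations per second, which is the load the slide says a CPU cannot sustain within a camera's power budget.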

  4. Example: H.264 encoder on RISC vs. ASIC
     - By coupling compute and storage closely together, ASICs are orders of magnitude more efficient in both performance and energy
     [Figure: energy (µJ, log scale from 100 to 10,000,000) of RISC vs. ASIC for four sub-kernels: IME, FME, IP, CABAC. Annotated gap: 2-3 orders of magnitude.]
     * R. Hameed et al., Understanding Sources of Inefficiency in General-Purpose Chips. ISCA ’10

  5. We are solving the wrong problem!
     - Yes, ASIC is 1000X more efficient than general purpose
     - Yes, general purpose is more programmable than ASIC
     - Yes, we can make each one marginally better
     - But those are good answers to all the wrong questions!
     - The right questions:
       - Why is the RISC energy so high?
       - What type of computation can we make efficient?
       - Can we make it just 100X better but keep it programmable?

  6. Anatomy of a RISC instruction
     [Figure: energy of one ADD instruction, 70 pJ. Labeled parts: I-Cache access 25 pJ; register file access 4 pJ; control overheads (instruction decode, sequencing, pipeline management, clocking, ...).]
     - Energy of a 32-bit ADD ≈ 0.5 pJ
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

  7. Other instructions overhead
     [Figure: instruction sequence LD, LD, ADD, ST, BR. Every instruction pays I-Cache access (25 pJ), register file access (4 pJ), and control; the instructions around the ADD are labeled "Overhead instructions".]
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

  8. D-Cache accesses overhead
     [Figure: the same LD, LD, ADD, ST, BR sequence. LD and ST each pay an additional 25 pJ D-Cache access on top of I-Cache access (25 pJ), register file access (4 pJ), and control.]
     * Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V
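
To put a number on the overhead, here is a rough tally of the five-instruction sequence. It is a sketch under one loud assumption: the 70 pJ shown for the ADD is taken as the total cost of issuing any instruction, which the slides do not state for LD, ST, or BR. The names `INSTR_PJ`, `SEQUENCE`, and `sequence_energy_pj` are illustrative only.

```python
# Energy figures read off the slides (45 nm, 0.9 V embedded RISC).
INSTR_PJ = 70.0   # ASSUMPTION: 70 pJ is the total cost of issuing any one instruction
DCACHE_PJ = 25.0  # extra D-Cache access paid by each LD/ST
ADD_PJ = 0.5      # the 32-bit addition itself

# The five-instruction sequence from the slides: only the ADD does useful work.
SEQUENCE = ["LD", "LD", "ADD", "ST", "BR"]


def sequence_energy_pj(instructions):
    """Total energy of an instruction sequence under the assumptions above."""
    total = 0.0
    for op in instructions:
        total += INSTR_PJ
        if op in ("LD", "ST"):
            total += DCACHE_PJ
    return total


total_pj = sequence_energy_pj(SEQUENCE)
useful_share = ADD_PJ / total_pj
print(f"total: {total_pj:.0f} pJ, useful arithmetic: {useful_share:.2%}")
```

With these inputs the sequence costs a few hundred pJ to perform half a pJ of arithmetic, which is the gap the following slides try to close.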

  9. SIMD machines give some improvement
     - SIMD units amortize overhead and improve performance
     [Figure: a scalar ADD and a SIMD ADD each pay I-Cache, RF, and control once per instruction.]
     - Achieves 10X better energy and performance AND is programmable
     - Can we do 100X and keep it programmable?

  10. Energy efficiency in a programmable environment
      - Each memory and instruction fetch must be amortized by hundreds of operations
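
A small sketch of why "hundreds" is the right order of magnitude. It assumes a fixed 69.5 pJ of per-instruction overhead (70 pJ minus the 0.5 pJ ADD from the earlier slide) that does not grow as more operations are packed into one instruction; real hardware would not be that kind, so treat the percentages as optimistic. `arithmetic_share` is a made-up name.

```python
OVERHEAD_PJ = 69.5  # ASSUMPTION: 70 pJ per instruction minus the 0.5 pJ ADD, held constant
OP_PJ = 0.5         # energy of one 32-bit ADD, from the slides


def arithmetic_share(ops_per_instruction: int) -> float:
    """Fraction of energy spent on arithmetic when one fetch drives N operations."""
    useful = ops_per_instruction * OP_PJ
    return useful / (OVERHEAD_PJ + useful)


for n in (1, 10, 100, 1000):
    print(f"{n:>5} ops/instruction -> {arithmetic_share(n):.1%} of energy is arithmetic")
```

Even under this generous model, the arithmetic only starts to dominate once each instruction drives on the order of a hundred operations or more.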

  11. What we want to see
      [Figure: one LD (D-Cache, I-Cache, register file, control), then many OP instructions (I-Cache, register file, control), then one ST (D-Cache, I-Cache, register file, control).]
      - D-Cache accesses much narrower than functional path
      - Many ALU instructions per LD/ST instruction
      - Many ops per instruction

  12. Image processing looks like convolution
      - Most of the computation is performed over (overlapping) stencils
      - Looks like convolution:

        $$(Img \otimes f)[n, m] = \sum_{l=-c}^{c} \sum_{k=-c}^{c} Img[k, l] \cdot f[n-k, m-l]$$

      [Figure: input image ("In") combined with coefficients to produce the output image ("Out").]
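
For readers who prefer loops to sums, a minimal pure-Python sketch of a 2D stencil convolution. It indexes the kernel by the offset (k, l) and the image by (n - k, m - l), which is the usual stencil-centered way of writing the same sum; border pixels are simply left at zero, an arbitrary choice for the sketch. `convolve2d` is an illustrative name, not something from the talk.

```python
def convolve2d(img, f, c):
    """Stencil convolution over the interior of `img`.

    `img` is a list of rows; `f` is a (2c+1) x (2c+1) kernel indexed so that
    f[k + c][l + c] is the coefficient for offset (k, l).
    """
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for n in range(c, height - c):
        for m in range(c, width - c):
            acc = 0
            for k in range(-c, c + 1):
                for l in range(-c, c + 1):
                    acc += img[n - k][m - l] * f[k + c][l + c]
            out[n][m] = acc
    return out


img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box = [[1, 1, 1] for _ in range(3)]  # 3x3 kernel of ones, so c = 1
out = convolve2d(img, box, 1)
print(out[1][1], out[2][2])
```

Neighboring outputs reuse most of the same input pixels, which is the overlap between stencils that the slide points at.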


  15. It does not have to be convolution
      - It only looks like convolution:

        $$(Img \overset{CE}{\otimes} f)[n, m] = \mathrm{Reduce}_{l=-c}^{c} \, \mathrm{Reduce}_{k=-c}^{c} \Big( \mathrm{map}\big(Img[k, l],\ f[n-k, m-l]\big) \Big)$$

      [Figure: input image ("In") and coefficients pass through a map stage and then a reduce stage to produce the output image ("Out").]
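
The generalization can be sketched by passing the map and reduce operators in as functions. This is only an illustration of the idea, assuming the same stencil-centered indexing as an ordinary convolution loop; `conv_engine` and `abs_diff` are hypothetical names.

```python
import operator
from functools import reduce


def conv_engine(img, f, c, n, m, map_fn, reduce_fn):
    """One output of the generalized 'convolution' at position (n, m)."""
    mapped = [
        map_fn(img[n - k][m - l], f[k + c][l + c])
        for k in range(-c, c + 1)
        for l in range(-c, c + 1)
    ]
    return reduce(reduce_fn, mapped)


def abs_diff(a, b):
    return abs(a - b)


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1] * 3 for _ in range(3)]
fives = [[5] * 3 for _ in range(3)]

# map = multiply, reduce = add: an ordinary convolution
weighted_sum = conv_engine(img, ones, 1, 1, 1, operator.mul, operator.add)
# map = absolute difference, reduce = add: a sum of absolute differences
sad_value = conv_engine(img, fives, 1, 1, 1, abs_diff, operator.add)
print(weighted_sum, sad_value)
```

Swapping the two function arguments is all it takes to move between kernels here; the data access pattern stays the same.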

  16. Let’s look at some convolution-like workloads
      - De-mosaic:
        - Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient
      * Y. Cheng et al., An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.

  17. Let’s look at more convolution-like workloads
      - H.264 (high definition) video encoder:
        - IME: 2D sum of absolute differences (SAD)
        - FME: half-pixel interpolation, quarter-pixel interpolation, 2D SAD
      [Figure: H.264 encoder pipeline: video frames, inter prediction (integer motion estimation and fractional motion estimation), intra prediction, CABAC entropy encoder, compressed bit stream. Annotation: "90% of execution time is here".]

  18. The main computation behind H.264
      - Trying to find the best match for a stencil within a small neighborhood
      [Figure: previous frame and current frame.]
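
A minimal sketch of that search, assuming an exhaustive scan scored by sum of absolute differences (SAD). The talk does not say how the neighborhood is scanned, so the brute-force loop and the names `sad` and `best_match` are illustrative choices.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between `cur` and the same-sized window of `ref`."""
    return sum(
        abs(cur[r][c] - ref[top + r][left + c])
        for r in range(len(cur))
        for c in range(len(cur[0]))
    )


def best_match(cur, ref):
    """Try every window position in `ref`; return (sad, top, left) of the best one."""
    rows = len(ref) - len(cur) + 1
    cols = len(ref[0]) - len(cur[0]) + 1
    return min(
        (sad(cur, ref, top, left), top, left)
        for top in range(rows)
        for left in range(cols)
    )


previous_frame = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
current_block = [[9, 8], [7, 6]]
print(best_match(current_block, previous_frame))
```

Every candidate position repeats the same subtract, absolute value, and add pattern over overlapping pixels, which is why this workload is described as convolution-like.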

  19. The convolution engine must support different ops

      | Kernel | Map | Reduce | Stencil size | Data flow |
      |---|---|---|---|---|
      | IME SAD | Abs diff | Add | 4x4 | 2D convolution |
      | FME ½-pixel up-sample | Multiply | Add | 6 | 1D horizontal & vertical conv. |
      | FME ¼-pixel up-sample | Average | None | -- | 2D matrix operation |
      | SIFT Gaussian blur | Multiply | Add | 9, 13, 15 | 1D horizontal & vertical conv. |
      | SIFT DoG | Subtract | None | -- | 2D matrix operation |
      | SIFT extrema | Compare | Logic AND | 3 | 1D horizontal & vertical conv. |
      | Demosaic interpolation | Multiply | Complex | 3 | 1D horizontal & vertical conv. |

  20. Convolution Engine: an architecture for convolution-like kernels
      [Figure: coefficients sit in a 2D register file (rows 0-15, columns 0-15) and the stencil neighborhood in a 2D shift register file (rows 0-15, columns 0-31). Both feed a wide 64-lane SIMD array of ALUs (the “map” unit), followed by a flexible arithmetic/logical reduction (the “reduce” step).]

  21. Example from H.264’s motion estimation: mapping sum of absolute differences (SAD)
      [Figure: current frame pixels are loaded into the 2D register file (rows 0-15, columns 0-15) and reference frame pixels into the 2D shift register file (rows 0-15, columns 0-31). Both feed the wide 64-lane SIMD “map” unit, followed by the flexible arithmetic/logical “reduce” step.]

  22. Example from H.264’s motion estimation: mapping sum of absolute differences (SAD)
      [Figure: same datapath, with current frame pixels in the 2D register file and reference frame pixels in the 2D shift register file. The ALUs’ instruction in the wide 64-lane SIMD “map” unit is set to |a-b|; the “reduce” step is still a flexible arithmetic/logical reduction.]

  23. Example from H.264’s motion estimation: mapping sum of absolute differences (SAD)
      [Figure: same datapath. The “map” ALUs compute |a-b|, the reduction tree in the “reduce” step is set to summation, and the reference pixels in the 2D shift register file shift left.]

  24. Example from H.264’s motion estimation: mapping sum of absolute differences (SAD)
      [Figure: the reference pixels have shifted left by one; the shift register file columns now read 1, 2, ..., 31, 0. The “map” ALUs compute |a-b| and the “reduce” step sums the results.]

  25. Example from H.264’s motion estimation: mapping sum of absolute differences (SAD)
      [Figure: the reference pixels have shifted left by two; the shift register file columns now read 2, 3, ..., 31, 0, 1. The “map” ALUs compute |a-b| and the “reduce” step sums the results.]
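
The shift, map, reduce sequence in these slides can be mimicked in software on a single row. This is a toy model, not the hardware: it uses short rows instead of the 16-column and 32-column register files on the slides, and it rotates the reference row so that column 0 wraps to the end, as the figures appear to draw it. `sad_by_shifting` is a made-up name.

```python
def sad_by_shifting(current_row, reference_row, num_offsets):
    """SAD of `current_row` against successive left shifts of the reference row."""
    ref = list(reference_row)
    width = len(current_row)
    results = []
    for _ in range(num_offsets):
        # "map": each lane computes |a - b| on its pair of pixels
        mapped = [abs(a - b) for a, b in zip(current_row, ref[:width])]
        # "reduce": the summation tree adds up the lanes
        results.append(sum(mapped))
        # shift the reference pixels left by one; column 0 wraps around to the end
        ref = ref[1:] + ref[:1]
    return results


current = [10, 11, 12, 13]
reference = [0, 10, 11, 12, 13, 0, 0, 0]
print(sad_by_shifting(current, reference, 3))
```

Each new search position costs one shift rather than a fresh load of the reference pixels, which is how the design keeps data-cache traffic low relative to the number of ALU operations.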
