FPGAs for Image Processing: A DSL and program transformations


  1. FPGAs for Image Processing: A DSL and program transformations
     Rob Stewart, Greg Michaelson, Idress Ibrahim, Deepayan Bhowmik, Andy Wallace, Paulo Garcia
     Heriot-Watt University, 10 May 2016

  2. What I will say
     1. The EPSRC Rathlin project is interested in remote image processing.
     2. We’ve developed a DSL for FPGAs called RIPL.
     3. Dataflow IR transformations between RIPL and the FPGA help.
     Low-powered, accelerated remote image processing.

  3. FPGAs vs GPUs
     FPGAs: ✦ energy efficient ✦ sometimes faster ✪ hard to program ✪ hard to optimise
     GPUs: ✦ fast floating point ✦ fast SIMD parallelism ✪ uses lots of energy ✪ poor performance with irregular memory access

  4. FPGAs vs CPUs
     "A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.

  5. Block RAM on an FPGA

  6. DSPs on an FPGA

  7. RIPL in an FPGA

  8. RIPL in an FPGA

  9. Part 1 of 4: RIPL skeletons.

  10. A RIPL program
      program =
        image1 = imread 512 512;
        image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
        image3 = imap image2 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
        image4 = map image3 (λ [x] -> [min 255 (x + 50)]);
        image4;
      out
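
To make the pipeline concrete, here is a minimal Python sketch of what the program above computes on a stream of pixels. It is a software model only, not RIPL compiler output. The function names, the integer division, and the border policy (a missing neighbour is replaced by the centre pixel) are assumptions; the slides do not say how RIPL treats image borders.

```python
def imap_avg3(pixels):
    """3-tap average over the pixel stream: ([.-1] + [.] + [.+1]) / 3."""
    n = len(pixels)
    out = []
    for i, centre in enumerate(pixels):
        left = pixels[i - 1] if i > 0 else centre       # assumed border policy
        right = pixels[i + 1] if i < n - 1 else centre  # assumed border policy
        out.append((left + centre + right) // 3)        # assumed integer division
    return out


def brighten(pixels, amount=50, ceiling=255):
    """Per-pixel map: add `amount`, saturating at `ceiling` (min 255 (x + 50))."""
    return [min(ceiling, x + amount) for x in pixels]


def program(image1):
    """Two smoothing passes followed by a saturating brightness increase."""
    image2 = imap_avg3(image1)
    image3 = imap_avg3(image2)
    image4 = brighten(image3)
    return image4
```

Each stage only reads a small neighbourhood of its input, which is what lets the stages run as a pipeline over a pixel stream instead of over whole frames.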

  11. Memory efficient skeletons
      RIPL: λ [.] ([.-1] + [.] + [.+1]) / 3
      [Figure: three-element buffer for [.-1], [.] and [.+1], with the init and stream state transitions.]
      Images are just streams of pixels.

  12. RIPL skeletons
      map        : I(M,N) → ([P]_A → [P]_A) → I(M,N)
      imap       : I(M,N) → (P_i → P) → I(M,N)
      scaleRow   : I(M,N) → ([P]_A → [P]_B) → I(M∗(B/A), N)
      scaleCol   : I(M,N) → ([P]_A → [P]_B) → I(M, N∗(B/A))
      filter2D   : I(M,N) → (x,y) : (Int, Int) → [K]_(x∗y) → I(M,N)
      zipWith    : I(M,N) → I(M,N) → ([P]_A → [P]_A → [P]_A) → I(M,N)
      unzip      : I(M,N) → (P_i → P) → (P_i → P) → (I(M,N), I(M,N))
      foldScalar : I(M,N) → Int → (P → Int → Int) → Int
      foldVector : I(M,N) → Int → a : Int → (P → [Int]_a → [Int]_a) → [Int]_a
      transpose  : I(M,N) → I(N,M)
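
A few of these signatures can be modelled in plain Python to show how the image dimensions change. This is an illustrative sketch, not the RIPL implementation: it assumes an image is a list of rows, that M is the row width, and that the chunk size A divides M. The function names are hypothetical, and `zip_with` is simplified to a chunk size of one pixel.

```python
def chunks(seq, size):
    """Split a row into consecutive chunks of `size` pixels."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def scale_row(image, a, f):
    """scaleRow: f maps every chunk of `a` pixels in a row to `b` pixels,
    so the row width becomes M * (b / a). With b == a this behaves like map."""
    return [[p for chunk in chunks(row, a) for p in f(chunk)] for row in image]


def zip_with(image_x, image_y, f):
    """zipWith, simplified to combine the two images pixel by pixel."""
    return [[f(p, q) for p, q in zip(row_x, row_y)]
            for row_x, row_y in zip(image_x, image_y)]


def fold_scalar(image, init, f):
    """foldScalar: reduce all pixels to one Int; f takes (pixel, accumulator)."""
    acc = init
    for row in image:
        for p in row:
            acc = f(p, acc)
    return acc


def transpose(image):
    """transpose: I(M,N) -> I(N,M)."""
    return [list(col) for col in zip(*image)]
```

For example, `scale_row(image, 2, lambda c: [sum(c) // 2])` halves the width of every row, matching the `M∗(B/A)` term with A = 2 and B = 1.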

  13. RIPL to FPGAs
      1. Use algorithmic skeletons.
      2. Compile RIPL → pipelined parallel dataflow graphs.
      3. Optimise: apply dataflow transformations.
      4. Compile dataflow graph → hardware description in Verilog.
      5. Synthesise the Verilog for an FPGA.
      6. Send the bitstream to the FPGA.

  14. Part 2 of 4: RIPL to dataflow.

  15. RIPL to dataflow

  16. RIPL's dataflow constraints
      [Figure: comparison of the SDF, CSDF and DPN dataflow models in terms of memory bound, runtime scheduling and expressiveness.]

  17. RIPL's small step dataflow semantics
      A skeleton implementation is a set of transition rules:
      $\langle \sigma_x, S \rangle \xrightarrow{[a,b] \mapsto [c,d]} \langle \sigma_y, S' \rangle$
      • Transition from $\sigma_x$ to $\sigma_y$
      • Start with internal state $S$, end with $S'$
      • Consumes $[a,b]$ pixels, generates $[c,d]$ pixels
      • "What" is computed is defined by the RIPL programmer

  18. RIPL's small step dataflow semantics
      image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
      [Figure: three-element buffer for [.-1], [.] and [.+1], with the init and stream state transitions.]
      Example transitions:
      $\langle \sigma_0, [0,0,0] \rangle \xrightarrow{[23,27] \mapsto \emptyset} \langle \sigma_1, [27,23,0] \rangle$
      $\langle \sigma_1, [27,23,0] \rangle \xrightarrow{[28] \mapsto [27]} \langle \sigma_1, [27,23,28] \rangle$
      $\langle \sigma_1, [27,23,28] \rangle \xrightarrow{[34] \mapsto [28]} \langle \sigma_1, [34,23,28] \rangle$
      $\langle \sigma_1, [34,23,28] \rangle \xrightarrow{[92] \mapsto [51]} \langle \sigma_1, [34,92,28] \rangle$
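
The transition-rule idea can be sketched in Python as a step function that takes a state, an internal buffer and the consumed pixels, and returns the next state, the next buffer and the produced pixels. This is a simplified model under stated assumptions: it uses a shift register instead of the indexed buffer drawn on the slide, integer division, and hypothetical state names `sigma0` and `sigma1`. It is not meant to reproduce the slide's numbers.

```python
def step(state, window, pixels):
    """One transition <state, window> --pixels |-> produced--> <state', window'>."""
    if state == "sigma0":
        # init rule: consume two pixels, produce none
        first, second = pixels
        return "sigma1", [0, first, second], []
    # stream rule: consume one pixel, produce one (average around the midpoint)
    (new,) = pixels
    window = window[1:] + [new]
    return "sigma1", window, [sum(window) // 3]


def run(stream):
    """Drive the rules over a pixel stream of at least two pixels."""
    it = iter(stream)
    state, window, out = "sigma0", [0, 0, 0], []
    state, window, produced = step(state, window, [next(it), next(it)])
    out += produced
    for pixel in it:
        state, window, produced = step(state, window, [pixel])
        out += produced
    return out
```

The buffer never grows beyond three pixels regardless of image size, which is the sense in which such a skeleton is memory efficient on a pixel stream.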

  19. Part 3 of 4: optimising dataflow.

  20. Dataflow profiling
      Find bottlenecks using the open-source TURNUS tool:
      • critical dataflow path
      • actors with high computational latency
      • low clock frequency

  21. | Actor | Slice LUT | Slice registers | Block RAM/FIFO | DSP48E | FMax (MHz) |
      |---|---|---|---|---|---|
      | Naive | 3664 | 8777 | 88 | 49 | 55.41 |
      | Final_XY | 76 | 80 | 0 | 0 | 721.48 |
      | Centre_XY | 182 | 199 | 0 | 0 | 530.81 |
      | Stream_to_YUV | 90 | 287 | 24 | 0 | 420.07 |
      | update_model | 1042 | 2399 | 30 | 0 | 148.74 |
      | YUV2RGB | 300 | 957 | 7 | 0 | 126.71 |
      | displacement | 545 | 1326 | 2 | 9 | 73.40 |
      | update_weight | 556 | 1544 | 14 | 4 | 66.46 |
      | kArray_derv | 437 | 1074 | 1 | 18 | 55.44 |
      | kArray_evaluation | 460 | 1148 | 1 | 18 | 55.41 |
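
One way to read these figures: the Naive row's FMax (55.41 MHz) coincides with the lowest per-actor FMax. The short sketch below picks out the slowest actor from the values above. Treating that actor as the clock bottleneck of the whole design is an interpretation of the numbers, not something the slide states, and the helper name is hypothetical.

```python
# Per-actor FMax values (MHz) as listed above, excluding the Naive row.
fmax_mhz = {
    "Final_XY": 721.48,
    "Centre_XY": 530.81,
    "Stream_to_YUV": 420.07,
    "update_model": 148.74,
    "YUV2RGB": 126.71,
    "displacement": 73.40,
    "update_weight": 66.46,
    "kArray_derv": 55.44,
    "kArray_evaluation": 55.41,
}


def slowest_actor(fmax_by_actor):
    """Return (name, fmax) of the actor with the lowest maximum clock frequency."""
    return min(fmax_by_actor.items(), key=lambda item: item[1])


bottleneck_name, bottleneck_fmax = slowest_actor(fmax_mhz)
```

Sorting the same dictionary by value gives a simple priority order for which actors to transform first.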

  22. Manual dataflow transformation
      "Profile Guided Dataflow Transformation for FPGAs & CPUs". R. Stewart, D. Bhowmik, G. Michaelson, A. Wallace. Special Issue on Dataflow, The Journal of Signal Processing Systems, Springer, 2015.

      | Functionality | Transformation | Registers | Slice LUTs | BRAM | DSP | Clock (MHz) |
      |---|---|---|---|---|---|---|
      | Stream to YUV | None | 90 | 287 | 24 | 0 | 420.0 |
      | Stream to YUV | Loop elimination | 27 | 85 | 0 | 0 | 386.7 |
      | YUV to RGB | None | 300 | 957 | 7 | 0 | 126.7 |
      | YUV to RGB | Actor fusion | 99 | 353 | 0 | 0 | 182.8 |
      | Displacement | None | 545 | 1326 | 2 | 9 | 73.4 |
      | Displacement | Task parallelism | 791 | 1210 | 7 | 9 | 110.0 |
      | Update weight | None | 556 | 1544 | 14 | 4 | 66.5 |
      | Update weight | Fission | 12352 | 19878 | 55 | 128 | 72.5 |
      | Update weight | Just square root (none) | 346 | 548 | 0 | 4 | 72.5 |
      | Update weight | Square root lookup | 139 | 227 | 32 | 0 | 368.2 |
      | Update weight | Combined | 7907 | 38544 | 1028 | 0 | 225.9 |
      | k-array derive | None | 437 | 1074 | 1 | 18 | 55.4 |
      | k-array derive | Loop promotion | 4447 | 12484 | 5 | 144 | 52.7 |
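
Two of the transformations named above, actor fusion and the square root lookup, can be illustrated in software. The sketch below is an analogy in Python under stated assumptions, not the authors' hardware implementation: actors are modelled as per-token functions, and the lookup table covers a hypothetical 10-bit input range. In the figures above the lookup variant uses block RAM and no DSPs, which matches the general idea of trading arithmetic for memory.

```python
import math


def fuse(first, second):
    """Actor fusion: replace two connected actors with one actor doing both
    steps, so no FIFO is needed between them."""
    def fused(token):
        return second(first(token))
    return fused


# Square root lookup: precompute every result once, then index into the table.
# The 10-bit range (0..1023) is a hypothetical choice for illustration.
SQRT_TABLE = [math.isqrt(v) for v in range(1024)]


def sqrt_lookup(v):
    """Integer square root by table lookup instead of arithmetic."""
    return SQRT_TABLE[v]
```

A fused actor is used exactly like the originals, for example `fuse(scale, offset)(pixel)` for two hypothetical per-pixel actors `scale` and `offset`.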

  23. Interactive dataflow transformation
      • Task parallel decomposition video: http://goo.gl/awBWg4
      • Data parallel fan out/fan in video: http://goo.gl/0iwVCM

  24. Part 4 of 4: evaluation.

  25. Power performance

      | Sub-module | Power (W) |
      |---|---|
      | Camera 24 MHz | 0.043330 |
      | Camera 100 MHz | 0.106660 |
      | Visual Saliency 50 MHz | 0.045940 |
      | Visual Saliency 85 MHz | 0.078098 |
      | Visual Saliency 100 MHz | 0.091310 |

  26. Space performance

      | Resource | Usage | Occupation |
      |---|---|---|
      | DSP48E1s | 3 | 1% |
      | FIFO36E1s | 2 | 1% |
      | External IOB33s | 80 | 40% |
      | RAMB18E1s | 135 | 48% |
      | RAMB36E1s | 26 | 18% |
      | Slices | 2812 | 21% |
      | Slice Registers | 4989 | 4% |
      | Slice LUTs | 7357 | 13% |
      | Slice LUT-Flip Flop pairs | 8457 | 15% |

  27. Throughput performance

      | Platform | Processing time (ms) | Frame rate (FPS) |
      |---|---|---|
      | FPGA | 19.0623 | 52 |
      | CPU | 189 | 5 |

      Current experiments show RIPL performance of 50-160 FPS.
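
The frame rates above are consistent with taking the reciprocal of the per-frame processing time and dropping the fraction. The sketch below redoes that arithmetic and derives the FPGA-to-CPU speedup; the assumption that the slide's frame rates were computed this way, and the function name, are mine.

```python
def frames_per_second(processing_time_ms):
    """Frame rate implied by a per-frame processing time in milliseconds."""
    return 1000.0 / processing_time_ms


fpga_ms = 19.0623
cpu_ms = 189.0

fpga_fps = frames_per_second(fpga_ms)  # about 52.5 frames per second
cpu_fps = frames_per_second(cpu_ms)    # about 5.3 frames per second
speedup = cpu_ms / fpga_ms             # a little under 10x
```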

  28. Our contribution
      • A new image processing DSL for FPGAs.
      • Small step operational dataflow semantics for skeletons.
      • Identified profiling metrics that matter for FPGAs.
      • A graphical dataflow transformations framework.
      • FPGA-based image processing system architecture.

  29. Future work
      • Evaluate RIPL's expressivity for real-world computer vision.
      • Many dataflow implementations for each skeleton.
      • Machine learning to construct and prune the search space of all possible dataflow representations of a single RIPL program.
      • Integrate transformations with the dataflow profiling tool.
      • Automated compiler-based transformation.
      Thanks. R.Stewart@hw.ac.uk
