FPGAs for Image Processing: A DSL and program transformations
Rob Stewart, Greg Michaelson, Idress Ibrahim, Deepayan Bhowmik, Andy Wallace, Paulo Garcia
Heriot-Watt University, 10 May 2016
What I will say
1. The EPSRC Rathlin project is interested in remote image processing.
2. We've developed a DSL for FPGAs called RIPL.
3. Dataflow IR transformations between RIPL and the FPGA help.
Low-power, accelerated remote image processing.
FPGAs vs GPUs
FPGAs:
✦ energy efficient
✦ sometimes faster
✪ hard to program
✪ hard to optimise
GPUs:
✦ fast floating point
✦ fast SIMD parallelism
✪ uses lots of energy
✪ poor performance with irregular memory access
FPGAs vs CPUs
"A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation". D. Thomas et al. Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.
Block RAM on an FPGA
DSPs on an FPGA
RIPL in an FPGA
Part 1 of 4: RIPL skeletons.
A RIPL program
program =
  image1 = imread 512 512;
  image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
  image3 = imap image2 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
  image4 = map image3 (λ [x] -> [min 255 (x + 50)]);
  image4;
out
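To make the behaviour of the program above concrete, here is a minimal Python sketch of what it appears to compute: two passes of a 3-pixel horizontal mean followed by a saturating brightness increase. It is not RIPL and not the RIPL compiler's output. The function names, the row-wise interpretation of `[.-1]` and `[.+1]`, the edge clamping and the integer division are all assumptions made for illustration.

```python
def imap_mean3(image):
    """Replace each pixel with the mean of itself and its two row neighbours.

    Assumption: neighbours that fall outside the row are clamped to the edge
    pixel, and division is integer division.
    """
    out = []
    for row in image:
        n = len(row)
        out.append([
            (row[max(i - 1, 0)] + row[i] + row[min(i + 1, n - 1)]) // 3
            for i in range(n)
        ])
    return out


def map_brighten(image, delta=50, limit=255):
    """Add delta to every pixel, saturating at limit (min 255 (x + 50))."""
    return [[min(limit, p + delta) for p in row] for row in image]


def program(image1):
    """Mirror the four image bindings of the RIPL program."""
    image2 = imap_mean3(image1)
    image3 = imap_mean3(image2)
    image4 = map_brighten(image3)
    return image4
```

For example, `program([[250, 250, 250]])` saturates every pixel at 255, which is the effect of the `min 255` guard in the last skeleton call.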
Memory efficient skeletons
RIPL: λ [.] ([.-1] + [.] + [.+1]) / 3
[Figure: state transition diagram for this skeleton. The init transition consumes 2 pixels and produces 0; the stream transition consumes 1 pixel and produces 1, with [.] as the midpoint of the buffered pixels [.-1], [.], [.+1].]
Images are just streams of pixels.
RIPL skeletons
map        : I(M,N) → ([P]A → [P]A) → I(M,N)
imap       : I(M,N) → (P_i → P) → I(M,N)
scaleRow   : I(M,N) → ([P]A → [P]B) → I(M∗(B/A), N)
scaleCol   : I(M,N) → ([P]A → [P]B) → I(M, N∗(B/A))
filter2D   : I(M,N) → (x,y) : (Int,Int) → [K](x∗y) → I(M,N)
zipWith    : I(M,N) → I(M,N) → ([P]A → [P]A → [P]A) → I(M,N)
unzip      : I(M,N) → (P_i → P) → (P_i → P) → (I(M,N), I(M,N))
foldScalar : I(M,N) → Int → (P → Int → Int) → Int
foldVector : I(M,N) → Int → a : Int → (P → [Int]a → [Int]a) → [Int]a
transpose  : I(M,N) → I(N,M)
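The signatures above encode how each skeleton changes the image shape. As a rough illustration, the Python sketch below models two of them on a list-of-rows image: `scale_row` maps every chunk of A pixels to B pixels, so the width changes by a factor of B/A, and `fold_scalar` reduces the whole image to one integer. The Python names and the list-of-rows representation are assumptions for this sketch, not part of RIPL.

```python
def scale_row(image, a, b, f):
    """scaleRow sketch: f turns each chunk of `a` row pixels into `b` pixels.

    The output width is width * (b / a); the row count is unchanged.
    """
    out = []
    for row in image:
        if len(row) % a != 0:
            raise ValueError("row width must be a multiple of the chunk size")
        new_row = []
        for i in range(0, len(row), a):
            chunk = f(row[i:i + a])
            if len(chunk) != b:
                raise ValueError("f must return exactly b pixels")
            new_row.extend(chunk)
        out.append(new_row)
    return out


def fold_scalar(image, init, f):
    """foldScalar sketch: fold f(pixel, accumulator) over the pixel stream."""
    acc = init
    for row in image:
        for pixel in row:
            acc = f(pixel, acc)
    return acc
```

For instance, `scale_row(img, 2, 1, lambda c: [(c[0] + c[1]) // 2])` halves the width by averaging adjacent pixel pairs, and `fold_scalar(img, 0, lambda p, acc: acc + p)` sums all pixels.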
RIPL to FPGAs
1. Use algorithmic skeletons.
2. Compile RIPL → pipelined parallel dataflow graphs.
3. Optimise: apply dataflow transformations.
4. Compile dataflow graph → hardware description in Verilog.
5. Synthesise Verilog for an FPGA.
6. Send bitstream to the FPGA.
Part 2 of 4: RIPL to dataflow.
RIPL to dataflow
RIPL's dataflow constraints
[Figure: comparison of the dataflow models of computation SDF, CSDF and DPN against memory boundedness, scheduling and runtime expressiveness.]
RIPL's small step dataflow semantics
A skeleton implementation is a set of transition rules.
\[
\langle \sigma_x, S \rangle \xrightarrow{[a,b] \,\mapsto\, [c,d]} \langle \sigma_y, S' \rangle
\]
• Transition from σ_x to σ_y
• Start with internal state S, end with S′
• Consumes [a, b] pixels, generates [c, d] pixels
• "What" is computed is defined by the RIPL programmer
RIPL's small step dataflow semantics
image2 = imap image1 (λ [.] -> ([.-1] + [.] + [.+1]) / 3);
[Figure: state transition diagram for this skeleton. The init transition consumes 2 pixels and produces 0; the stream transition consumes 1 pixel and produces 1.]
Example trace:
\[ \langle \sigma_0, [0,0,0] \rangle \xrightarrow{[23,27] \,\mapsto\, \emptyset} \langle \sigma_1, [27,23,0] \rangle \]
\[ \langle \sigma_1, [27,23,0] \rangle \xrightarrow{[28] \,\mapsto\, [27]} \langle \sigma_1, [27,23,28] \rangle \]
\[ \langle \sigma_1, [27,23,28] \rangle \xrightarrow{[34] \,\mapsto\, [28]} \langle \sigma_1, [34,23,28] \rangle \]
\[ \langle \sigma_1, [34,23,28] \rangle \xrightarrow{[92] \,\mapsto\, [51]} \langle \sigma_1, [34,92,28] \rangle \]
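The transition-rule view can be sketched as a small state machine in Python. This is an illustration of the idea only: the state names, the two-pixel sliding window and the integer division are assumptions, and the buffer layout differs from the circular buffer in the trace above, so the values it produces are not meant to reproduce that trace. What it does show is that the actor holds a constant amount of state while the image streams through.

```python
def step(sigma, window, tokens, pos):
    """Fire one transition; return (sigma', window', new position, outputs)."""
    if sigma == "init":
        # init: consume 2 pixels, produce nothing, move to the stream state.
        return "stream", (tokens[pos], tokens[pos + 1]), pos + 2, []
    # stream: consume 1 pixel, produce 1 pixel, stay in the stream state.
    prev, mid = window
    nxt = tokens[pos]
    return "stream", (mid, nxt), pos + 1, [(prev + mid + nxt) // 3]


def run(tokens):
    """Fire transitions until too few input pixels remain for the next one."""
    sigma, window, pos, out = "init", (), 0, []
    while pos + (2 if sigma == "init" else 1) <= len(tokens):
        sigma, window, pos, produced = step(sigma, window, tokens, pos)
        out.extend(produced)
    return out, (sigma, window)
```

Calling `run([23, 27, 28, 34, 92])` fires one init transition and three stream transitions; the internal state never grows beyond two pixels.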
Part 3 of 4: optimising dataflow.
Dataflow profiling
Find bottlenecks using the open source TURNUS tool:
• critical dataflow path
• actors with high computational latency
• low clock frequency
Actor               Slice LUT   Slice registers   Block RAM/FIFO   DSP48E   FMax (MHz)
Naive               3664        8777              88               49       55.41
Final_XY            76          80                0                0        721.48
Centre_XY           182         199               0                0        530.81
Stream_to_YUV       90          287               24               0        420.07
update_model        1042        2399              30               0        148.74
YUV2RGB             300         957               7                0        126.71
displacement        545         1326              2                9        73.40
update_weight       556         1544              14               4        66.46
kArray_derv         437         1074              1                18       55.44
kArray_evaluation   460         1148              1                18       55.41
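One way to read the profiling table: the FMax of the naive design (55.41 MHz) coincides with the FMax of its slowest actor, so ranking actors by FMax points at the first candidates for transformation. The short Python sketch below does that ranking with the per-actor figures copied from the table; the function and variable names are made up for this example and are not part of TURNUS.

```python
# FMax in MHz for each actor, copied from the profiling table.
ACTOR_FMAX_MHZ = {
    "Final_XY": 721.48,
    "Centre_XY": 530.81,
    "Stream_to_YUV": 420.07,
    "update_model": 148.74,
    "YUV2RGB": 126.71,
    "displacement": 73.40,
    "update_weight": 66.46,
    "kArray_derv": 55.44,
    "kArray_evaluation": 55.41,
}


def slowest_actor(fmax_by_actor):
    """Return (name, fmax) for the actor with the lowest clock frequency."""
    name = min(fmax_by_actor, key=fmax_by_actor.get)
    return name, fmax_by_actor[name]


def rank_by_fmax(fmax_by_actor):
    """List actor names from slowest to fastest."""
    return sorted(fmax_by_actor, key=fmax_by_actor.get)
```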
Manual dataflow transformation
"Profile Guided Dataflow Transformation for FPGAs & CPUs". R. Stewart, D. Bhowmik, G. Michaelson, A. Wallace. Special Issue on Dataflow, in The Journal of Signal Processing Systems, Springer, 2015.

Functionality    Transformation            Registers   Slice LUTs   BRAM   DSP   Clock (MHz)
Stream to YUV    None                      90          287          24     0     420.0
Stream to YUV    Loop elimination          27          85           0      0     386.7
YUV to RGB       None                      300         957          7      0     126.7
YUV to RGB       Actor fusion              99          353          0      0     182.8
Displacement     None                      545         1326         2      9     73.4
Displacement     Task parallelism          791         1210         7      9     110.0
Update weight    None                      556         1544         14     4     66.5
Update weight    Fission                   12352       19878        55     128   72.5
Update weight    Just square root (none)   346         548          0      4     72.5
Update weight    Square root lookup        139         227          32     0     368.2
Update weight    Combined                  7907        38544        1028   0     225.9
k-array derive   None                      437         1074         1      18    55.4
k-array derive   Loop promotion            4447        12484        5      144   52.7
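The "Square root lookup" row trades arithmetic for memory: DSP use drops from 4 to 0 while BRAM rises from 0 to 32, and the clock rises from 72.5 to 368.2 MHz. The general idea of such a lookup transformation can be sketched in Python as below. This is not the implementation from the paper; the 16-bit input range and integer result are assumptions chosen only to keep the example small.

```python
import math

# Assumption: inputs are unsigned 16-bit integers, so the table has 65536 entries.
MAX_INPUT = 65535


def build_sqrt_lut(max_input=MAX_INPUT):
    """Precompute the integer square root of every possible input value."""
    return [math.isqrt(v) for v in range(max_input + 1)]


SQRT_LUT = build_sqrt_lut()


def sqrt_lookup(value):
    """Replace the square root computation with a single table read."""
    return SQRT_LUT[value]
```

On an FPGA the table would live in block RAM, which is consistent with the BRAM increase in that row; the per-pixel work becomes one memory read.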
Interactive dataflow transformation Task parallel decomposition video http://goo.gl/awBWg4 Data parallel fan out/fan in video http://goo.gl/0iwVCM
Part 4 of 4: evaluation.
Power performance
Sub-module                Power (W)
Camera 24MHz              0.043330
Camera 100MHz             0.106660
Visual Saliency 50MHz     0.045940
Visual Saliency 85MHz     0.078098
Visual Saliency 100MHz    0.091310
Space performance
Resource                     Usage   Occupation
DSP48E1s                     3       1%
FIFO36E1s                    2       1%
External IOB33s              80      40%
RAMB18E1s                    135     48%
RAMB36E1s                    26      18%
Slices                       2812    21%
Slice Registers              4989    4%
Slice LUTs                   7357    13%
Slice LUT-Flip Flop pairs    8457    15%
Throughput performance
        Processing time (ms)   Frame rate (FPS)
FPGA    19.0623                52
CPU     189                    5
Current experiments show RIPL performance of 50-160 FPS.
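The frame rates in the table follow from the per-frame processing times, as the short Python check below shows. The helper names are invented for this example; the two processing times are the ones reported in the table.

```python
FPGA_MS = 19.0623  # per-frame processing time on the FPGA, from the table
CPU_MS = 189.0     # per-frame processing time on the CPU, from the table


def frames_per_second(processing_time_ms):
    """Convert a per-frame processing time in milliseconds to frames/second."""
    return 1000.0 / processing_time_ms


def speedup(slow_ms, fast_ms):
    """Ratio of the two processing times."""
    return slow_ms / fast_ms
```

Here `frames_per_second(FPGA_MS)` is about 52.5 and `frames_per_second(CPU_MS)` is about 5.3, matching the truncated values 52 and 5, and `speedup(CPU_MS, FPGA_MS)` is about 9.9.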
Our contribution
• A new image processing DSL for FPGAs.
• Small step operational dataflow semantics for skeletons.
• Identified profiling metrics that matter for FPGAs.
• A graphical dataflow transformations framework.
• FPGA-based image processing system architecture.
Future work
• Evaluate RIPL's expressivity for real-world computer vision.
• Many dataflow implementations for each skeleton.
• Machine learning to construct & prune the search space of all possible dataflow representations of a single RIPL program.
• Integrate transformations with the dataflow profiling tool.
• Automated compiler-based transformation.
Thanks. R.Stewart@hw.ac.uk