
Automatic Streamization of Image Processing Applications, LCPC 2014



  1. Automatic Streamization of Image Processing Applications
     LCPC 2014
     Pierre Guillou, Fabien Coelho, François Irigoin
     MINES ParisTech, PSL Research University
     Hillsboro, OR, September 15, 2014
     Sections: DSL & Streaming Language; Compilation and Execution Model; Optimizations; Experimental Results

  2. Context
     - Image processing applications
     - Computing systems:
       - CPUs (multi/many cores)
       - Accelerators (GPUs, FPGAs, ...)

  3. DSL → Streaming Language → Manycore Accelerator
     Domain Specific Languages:
     - High-level
     - Easy-to-use
     - Hardware agnostic
     - C embedded language: FREIA
     Streaming languages:
     - Easily target multi/many-core architectures
     - Image processing applications
     - Verbose
     - Examples: StreamIt, Sigma-C

  4. Manycore Processor
     [Figure: block diagram showing the host (CPU, host RAM), a PCI-Express link, I/O clusters, compute clusters, a DDR interface and attached DDR3]
     Kalray MPPA-256:
     - 256 VLIW cores
     - 2 MB/cluster
     - 10 W

  5. Outline
     1. DSL & Streaming Language
     2. Compilation and Execution Model
     3. Optimizations
     4. Experimental Results

  6. Image Processing DSL: FREIA
     FRamework for Embedded Image Applications:
     - Sequential embedded C code
     - High-level image processing operators
     Example:
         freia_aipo_erode_8c(im1, im0, kernel);   // morphological
         freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
         freia_aipo_and(im3, im2, im0);           // arithmetic
     [Figure: dataflow graph of the example: im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]

  7. Image Operators
     - Arithmetic operators (unary, binary): + − × / min max = & | ∼
     - Morphological operators: selection + min/max/avg
     - Reduction operators: min/max/sum

  8. Sigma-C Agents
     [Figure: agent foo with two inputs (input 0, input 1) and one output]
         agent foo() {
           interface {                 // define I/O channels
             in<int> in0, in1;         // 2 input integer channels
             out<int> out0;            // 1 output integer channel
             spec{in0[2], in1,         // define flow scheduling
                  out0[3]};
           }
           void start() exchange       // DO SOMETHING!
               (in0 i0[2], in1 i1, out0 o[3]) {
             o[0] = i0[0], o[1] = i1, o[2] = i0[1];
           }
         }
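The `spec` of agent foo says that each firing consumes two values from `in0` and one from `in1`, and produces three values on `out0`. A minimal plain-C sketch of that firing rule is given here; `foo_fire` and `foo_run` are hypothetical names, and flat arrays stand in for the Sigma-C channels, which is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one firing of agent foo:
 * 2 values from in0 and 1 from in1 go in, 3 values come out. */
void foo_fire(const int i0[2], int i1, int o[3])
{
    o[0] = i0[0];
    o[1] = i1;
    o[2] = i0[1];
}

/* Fire the agent as long as both inputs hold enough data for a full
 * firing. Arrays stand in for channels (assumption). Returns the
 * number of firings; out0 must hold 3 values per firing. */
size_t foo_run(const int *in0, size_t n0,
               const int *in1, size_t n1, int *out0)
{
    assert(in0 != NULL && in1 != NULL && out0 != NULL);
    size_t firings = (n0 / 2 < n1) ? n0 / 2 : n1;
    for (size_t k = 0; k < firings; k++)
        foo_fire(&in0[2 * k], in1[k], &out0[3 * k]);
    return firings;
}
```

With four values on `in0` and two on `in1`, the agent fires twice and interleaves the inputs in the order given by the exchange function.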

  9. From Agents to Subgraphs
     [Figure: subgraph bar containing Agent 1, Agent 2, Subgraph 3, Agent 4 and Agent 5]
         subgraph bar() {
           interface {                          // define I/O channels
             in<int> in0[2];
             out<int> out0, out1;
             spec{ { in0[][3]; out0 }; { out1[2] } };
           }
           map {
             agent a1 = new Agent1();           // instantiate agents
             agent a3 = new Subgraph3();
             ...
             connect(in0[0], a1.input0);        // I/O connections
             ...
             connect(a5.output, out1);
             connect(a1.output0, a2.input);     // internal connections
             ...
             connect(a3.output, a5.input1);
           }
         }

  10. Input & Output
      From FREIA sequential C code:
          freia_aipo_erode_8c(im1, im0, kernel);   // morphological
          freia_aipo_dilate_8c(im2, im1, kernel);  // morphological
          freia_aipo_and(im3, im2, im0);           // arithmetic
      To Sigma-C subgraph:
          subgraph foo() {
            int16_t kernel[9] = {0,1,0, 0,1,0, 0,1,0};
            ...
            agent ero = new img_erode(kernel);
            agent dil = new img_dilate(kernel);
            agent and = new img_and_img();
            ...
            connect(ero.output, dil.input);
            connect(dil.output, and.input);
            ...
          }
      [Figure: dataflow graph im0 → ero → im1 → dil → im2 → and → im3, with im0 also feeding and]

  11. From DSL Code to Streaming Code
      1. Build sequences of basic image operations
         - composed operator inlining
         - partial evaluation
         - loop unrolling
      2. Extract and optimize image expressions → DAG
         - common subexpression elimination
         - removal of unused image computations
         - copy propagation
      3. Generate target code
         - one DAG maps to one subgraph
         - one vertex maps to one agent
         - subgraph activation
      4. Use image operator library

  12. Execution Scheme
      [Figure: execution timeline with four lanes: control code, host run-time, accelerator run-time, compute cores. Labels: launch a, launch b; load from HD, transfer, write on HD; stream images, store result; agent 1 a ... agent n a, agent 1 b]

  13. Mapping Sigma-C Graphs
      Graph throughput constraints:
      - Slowest node in critical path ⟹ split slow nodes, merge fast nodes
      Agent constraints:
      - #agents ≤ 256 (1 agent / compute core)
      - 2 MB for 16 cores → mem(1 agent) ≤ 128 kB
      - Fixed iteration overhead → pack pixels
      Mapping constraints:
      - NoC comms between clusters → use few clusters
      - Constant activation time → use few large graphs
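The per-agent memory bound follows from dividing the 2 MB of a cluster by its 16 cores. The sketch below spells out that arithmetic and, as an illustration only, derives how many image rows would fit in such a budget. The function names are hypothetical, the 2-byte pixel size used in the usage note is an assumption (the slides do not state the pixel type), and the result is an upper bound since the same memory must presumably also hold code and stack.

```c
#include <assert.h>
#include <stddef.h>

/* Memory available to one agent when a cluster's memory is shared
 * evenly between its cores (1 agent per core). */
size_t agent_budget_bytes(size_t cluster_bytes, unsigned cores)
{
    assert(cores > 0);
    return cluster_bytes / cores;
}

/* Upper bound on the number of full image rows an agent could buffer.
 * bytes_per_pixel is an assumption supplied by the caller. */
size_t rows_within_budget(size_t budget_bytes, size_t width,
                          size_t bytes_per_pixel)
{
    assert(width > 0 && bytes_per_pixel > 0);
    return budget_bytes / (width * bytes_per_pixel);
}
```

For a 2 MB cluster with 16 cores the budget is 131072 bytes (128 kB); with 512-pixel rows and an assumed 2 bytes per pixel, at most 128 rows fit, which is less than a 512 × 512 image and suggests why agents work on rows rather than whole images.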

  14. Agent Granularity
      [Figure: bar chart of normalized execution times per pixel on MPPA-256; legend: 128, 256, 512, 640; applications: anr999, deblocking, licensePlate, retina, toggle, GMEAN]
      - Fixed iteration overhead → pack pixels
      - Small memory → avoid large structures
      - Stencil ops → manage overlap
      ⟹ operate on image rows

  15. Optimization of Morphological Agents
      Morphological agents are the bottlenecks:
      - 3 × 3 boolean matrix mask for selecting neighbors
      - min, max or avg on selected neighbors
      - Often combined in deep pipelines
      Some optimizations have been implemented:
      - Agent buffer of 3 rows fed in a round-robin manner
      - Innermost loop written in VLIW assembly code
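The 3-row round-robin buffer can be illustrated with a small streaming erosion in plain C. This is a sketch under stated assumptions, not the Sigma-C agent from the talk: the names (`erode_agent`, `erode_push`, `erode_flush`), the `int16_t` pixel type, the `MAX_W` limit and the border policy (neighbors outside the image are ignored) are all hypothetical choices.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 640  /* assumed maximum row width */

typedef struct {
    int16_t rows[3][MAX_W]; /* ring buffer: the 3 most recent rows */
    int width;
    int received;           /* rows pushed so far */
    uint8_t mask[9];        /* 3x3 boolean neighbor selection, row-major */
} erode_agent;

void erode_init(erode_agent *a, int width, const uint8_t mask[9])
{
    assert(width > 0 && width <= MAX_W);
    a->width = width;
    a->received = 0;
    memcpy(a->mask, mask, 9);
}

/* Erode row `mid`: min over the selected neighbors that lie inside the
 * image. `up` or `down` is NULL on the first or last image row. A pixel
 * with no selected in-image neighbor keeps INT16_MAX. */
void erode_row(const erode_agent *a, const int16_t *up,
               const int16_t *mid, const int16_t *down, int16_t *out)
{
    const int16_t *src[3] = { up, mid, down };
    for (int x = 0; x < a->width; x++) {
        int16_t m = INT16_MAX;
        for (int dy = 0; dy < 3; dy++) {
            if (src[dy] == NULL)
                continue;
            for (int dx = 0; dx < 3; dx++) {
                int xx = x + dx - 1;
                if (a->mask[3 * dy + dx] && xx >= 0 && xx < a->width
                    && src[dy][xx] < m)
                    m = src[dy][xx];
            }
        }
        out[x] = m;
    }
}

/* Store the incoming row in slot (row index mod 3). Once two rows are
 * known, emit the eroded row above the incoming one and return 1. */
int erode_push(erode_agent *a, const int16_t *row, int16_t *out)
{
    int r = a->received++;
    memcpy(a->rows[r % 3], row, (size_t)a->width * sizeof *row);
    if (r == 0)
        return 0;
    const int16_t *up = (r >= 2) ? a->rows[(r - 2) % 3] : NULL;
    erode_row(a, up, a->rows[(r - 1) % 3], a->rows[r % 3], out);
    return 1;
}

/* After the last input row: emit the last image row (no row below). */
int erode_flush(erode_agent *a, int16_t *out)
{
    int r = a->received;
    if (r == 0)
        return 0;
    const int16_t *up = (r >= 2) ? a->rows[(r - 2) % 3] : NULL;
    erode_row(a, up, a->rows[(r - 1) % 3], NULL, out);
    return 1;
}
```

Each incoming row overwrites the oldest of the three slots, so the agent never holds more than three rows regardless of image height; the output lags the input by one row, which `erode_flush` makes up for at the end of the image.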

  16. Bottleneck Reduction: Graph Transformation
      Data Parallelization of Morphological Agents
      [Figure: three configurations of a morphological stage: (a) one morpho agent on whole rows; (b) split, two morpho agents on two half-rows, join; (c) split, three morpho agents on three thirds of a row, join]
      [Figure: bar chart of normalized execution time for cases (a), (b), (c); applications: anr999, antibio, deblocking, licensePlate, oop, retina, toggle, GMEAN]
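Splitting a row between several morphological agents requires each part to also read the pixels just outside its own range, because a 3 × 3 stencil looks one pixel to the left and right. The sketch below computes, for part `p` of `parts`, the output range and the wider input range including that one-pixel overlap. `row_part` and `split_row` are hypothetical names, and the even split with a one-pixel halo is an assumption about how such a split could be done, not a description of the generated Sigma-C code.

```c
#include <assert.h>

/* Half-open pixel ranges [begin, end) for one part of a split row. */
typedef struct {
    int out_begin, out_end; /* pixels this part produces */
    int in_begin, in_end;   /* pixels this part must read (with overlap) */
} row_part;

/* Split a row of `width` pixels into `parts` nearly equal pieces and
 * return piece `p`, extended by one pixel on each interior side so a
 * 3x3 stencil has its left and right neighbors available. */
row_part split_row(int width, int parts, int p)
{
    assert(width > 0 && parts > 0 && parts <= width);
    assert(p >= 0 && p < parts);
    row_part r;
    r.out_begin = (int)((long)p * width / parts);
    r.out_end = (int)((long)(p + 1) * width / parts);
    r.in_begin = (r.out_begin > 0) ? r.out_begin - 1 : 0;
    r.in_end = (r.out_end < width) ? r.out_end + 1 : width;
    return r;
}
```

The output ranges tile the row exactly, so a join agent can concatenate the parts in order, while the input ranges of neighboring parts overlap by two pixels.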

  17. Reduce Number of Used Cores: Graph Transformation
      Aggregation of Arithmetic Agents
      - #agents ≤ 256
      - Fast agents can be aggregated to use fewer cores
      - Arithmetic operators are fast: good candidates for aggregation
      [Figure: bar chart of normalized execution time; legend: no compound agent, 2 operators/compound agent, 4 operators/compound agent; applications: antibio, burner, licensePlate, oop, retina, toggle, GMEAN]
      ⟹ fewer cores used / same execution time

  18. Reduce Control Overhead: Enlarge Graphs
      While Unrolling for Convergent Transformations
          do {
            p = c;    // p and c depend on the processed image
            ...       // a converging operation
            freia_aipo_global_vol(img, &c);
          } while (c != p);
      [Figure: bar chart of normalized execution time; legend: without unrolling, unrolling factor 2, u.f. 4, u.f. 8, u.f. 16; applications: antibio, burner, retina, GMEAN]
      #control overhead ↘, #agents ↗, #speculative execution ↗
      ⟹ tradeoff: unroll by 8
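The tradeoff can be made concrete with a toy model in plain C: each launch executes `unroll` copies of the loop body without an intermediate test, and the convergence test runs once per launch. Everything here is a hypothetical stand-in: `step` replaces the elided converging operation, `volume` plays the role of `freia_aipo_global_vol`, and the sketch assumes the operation is a fixed point once converged, so the extra (speculative) steps do not change the result.

```c
#include <assert.h>

#define N 3  /* toy image size (assumption) */

/* Toy converging operation: lower every value above floor_v by 1. */
void step(int img[N], int floor_v)
{
    for (int i = 0; i < N; i++)
        if (img[i] > floor_v)
            img[i]--;
}

/* Sum of all pixel values, used as the convergence measure. */
int volume(const int img[N])
{
    int v = 0;
    for (int i = 0; i < N; i++)
        v += img[i];
    return v;
}

typedef struct {
    int launches; /* convergence tests, i.e. control round trips */
    int steps;    /* loop bodies executed, including speculative ones */
} run_stats;

/* Run the convergent loop with the body unrolled `unroll` times. */
run_stats run_unrolled(int img[N], int floor_v, int unroll)
{
    assert(unroll >= 1);
    run_stats s = { 0, 0 };
    int c = volume(img);
    int p = c;
    do {
        for (int u = 0; u < unroll; u++) { /* one enlarged graph */
            p = c;
            step(img, floor_v);
            c = volume(img);
            s.steps++;
        }
        s.launches++;                      /* one test per launch */
    } while (c != p);
    return s;
}
```

On the image {3, 5, 1} with floor 1, the non-unrolled loop needs 5 launches for 5 steps, unrolling by 2 needs 3 launches for 6 steps, and unrolling by 8 needs a single launch for 8 steps: fewer control round trips are paid for with larger graphs and steps executed after convergence.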

  19. Benchmark Suite
                            |------- #operators -------|
      Apps.           LoC   arith  morpho  red   Total   #subg  #clust  image size
      anr999           87       1      20    2      23       1       2  224 × 288
      antibio         200       8      41   25      74       8       6  256 × 256
      burner          510      18     410    3     431       3      16  256 × 256
      deblocking      161      23       9    2      34       2      10  512 × 512
      licensePlate    203       4      65    0      69       1       5  640 × 383
      oop             442       7      10    0      17       1       2  350 × 288
      retina          469      15      38    3      56       3       4  256 × 256
      toggle          143       8       6    1      15       1       1  512 × 512

  20. Target Systems
      Target hardware        Kind       Backend   Max W
      SPoC                   FPGA       FPGA         26
      Terapix                FPGA       FPGA         26
      Intel dual-core        2c CPU     OpenCL       65
      AMD quad-core          4c CPU     OpenCL       60
      NV Geforce GTX 8800    GPU        OpenCL      120
      NV Quadro 600          GPU        OpenCL       40
      NV Tesla 2050C         GPU        OpenCL      240
      Kalray MPPA-256        Manycore   Sigma-C      10

  21. Relative Execution Times
      [Figure: bar chart of relative execution times, axis from 0.01 to 10; targets: Kalray MPPA-256, SPoC, Terapix, Intel dual-core, AMD quad-core, NVIDIA GeForce 8800 GTX, NVIDIA Quadro 600, NVIDIA Tesla C 2050; applications: anr999, antibio, burner, deblocking, licensePlate, oop, retina, toggle, GMEAN]
      Reference: MPPA = 1.0
