Coelho & Irigoin, MINES ParisTech
API-Compilation for Image Hardware Accelerators
Fabien Coelho & François Irigoin
ANR project: FREIA
software environment for image application development on modern architectures
Terapix Hardware Accelerator
[Figure: Terapix architecture block diagram; labels not recoverable from the extraction]
• µP + 128 SIMD PE array, 1024 pixels per PE, neighbor communications
• computation in parallel with communication (in or out) through double buffering
• issues: small memory implies tiles, 5.3 pixels/cycle bandwidth with DDR
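The benefit of overlapping computation with communication can be estimated with a rough cost model. The sketch below is only an illustration: the function names, the cycle counts and the assumption of a single transfer channel doing either a load or a store at a time are hypothetical, not Terapix measurements.

```c
#include <assert.h>

/* Hypothetical cost model, all times in cycles; not actual Terapix timing. */
static long max_long(long a, long b) { return a > b ? a : b; }

/* No overlap: every tile is loaded, computed and stored in sequence. */
long sequential_cycles(int tiles, long in, long compute, long out)
{
    assert(tiles >= 1);
    return tiles * (in + compute + out);
}

/* Double buffering: while tile k is computed, the (assumed single) transfer
 * channel stores the result of tile k-1 and loads tile k+1. */
long double_buffered_cycles(int tiles, long in, long compute, long out)
{
    assert(tiles >= 1);
    long total = in;                      /* the first load cannot be hidden */
    for (int k = 0; k < tiles; k++) {
        long comm = 0;
        if (k > 0) comm += out;           /* store the previous result */
        if (k < tiles - 1) comm += in;    /* prefetch the next tile */
        total += max_long(compute, comm); /* the slower side sets the pace */
    }
    return total + out;                   /* the last store cannot be hidden */
}
```

With 4 tiles, 10 cycles per load, 30 per computation and 10 per store, the model gives 200 cycles without overlap and 140 with double buffering.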
SPoC Hardware Accelerator
Vector unit: 2 paths, 5 image ops + reductions, 4 pixels/cycle bandwidth
[Figure: one vector unit, with 16-bit pixel paths through MORPH, THR, MX, MES and ALU blocks]
Pipeline of 8 units
[Figure: the 8 chained vector units]
Portability vs Performance?
Portability: write one generic code
Performance: re-write code for every accelerator
(Pure) Library Approach?
• domain-specific API, optimized (by hand)
• small library: not enough operator aggregation, missed opportunities
• large library: cost? portability? VSIPL has 1000s of functions
(Pure) Compiler Approach?
• start from source, inline functions, loop fusion...
• issues: complexity, impact of stencils, conditions for borders...
Mixed Library/Compiler Approach
Input: small domain-specific image-level API in plain C
• basic/composed operators relevant to application developers
• library implemented (optimized?) by hand, quickly available
Locality: hardware and runtime handle loop fusion details!
• SPoC: delay lines with cyclic buffers
• Terapix: overlapping tiling induces redundant computations, µ-code
Compilation: get ops, merge ops, schedule, allocate
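The redundant computation induced by overlapping tiling can be quantified with a small sketch. It assumes a chain of 3x3 (8-connexity) operators, each of which invalidates one row at the top and one at the bottom of a tile, and it only tiles along rows. The function names and the sizes in the usage note are hypothetical, and image borders are treated like any other tile edge for simplicity.

```c
#include <assert.h>

/* Rows of a tile that are still valid after `depth` chained 3x3 operators:
 * each operator loses one row at the top and one at the bottom. */
int valid_rows(int tile_rows, int depth)
{
    assert(tile_rows > 0 && depth >= 0);
    int v = tile_rows - 2 * depth;
    return v > 0 ? v : 0;
}

/* Number of overlapping tiles needed to cover the image,
 * or 0 when the tile is too small for this depth. */
int tile_count(int image_rows, int tile_rows, int depth)
{
    assert(image_rows > 0);
    int v = valid_rows(tile_rows, depth);
    if (v == 0) return 0;
    return (image_rows + v - 1) / v;    /* ceiling division */
}

/* Rows actually processed, including the recomputed overlap rows. */
long processed_rows(int image_rows, int tile_rows, int depth)
{
    return (long) tile_count(image_rows, tile_rows, depth) * tile_rows;
}
```

For a 512-row image, 64-row tiles and a chain of depth 10, this gives 12 tiles and 768 processed rows, that is 1.5 times the image height. Deeper chains per tile save transfers but increase this overhead.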
ANR999: running example excerpt
    // SKIPPED declarations and inits
    freia_common_rx_image(in, &fin);        // INPUT
    freia_global_min(in, &min);             // COMPUTE
    freia_global_vol(in, &vol);
    freia_dilate(od, in, 8, 10);
    freia_gradient(og, in, 8, 10);
    printf("min=%d, vol=%d\n", min, vol);   // OUTPUT
    freia_common_tx_image(od, &fout);
    freia_common_tx_image(og, &fout);
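To make the operators of the excerpt concrete, here is a minimal reference sketch of their usual mathematical-morphology meaning on a tiny image. It is not the FREIA implementation: the `ref_*` names, the fixed 5x5 size, the reading of "vol" as the sum of pixel values, and the choice to ignore neighbors outside the image are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { W = 5, H = 5, N = W * H };   /* tiny fixed size, illustration only */

/* Minimum pixel value over the whole image. */
uint16_t ref_global_min(const uint16_t *img)
{
    uint16_t m = img[0];
    for (int i = 1; i < N; i++)
        if (img[i] < m) m = img[i];
    return m;
}

/* "Volume": sum of all pixel values (assumed meaning of vol). */
uint32_t ref_global_vol(const uint16_t *img)
{
    uint32_t v = 0;
    for (int i = 0; i < N; i++) v += img[i];
    return v;
}

/* One 8-connexity step: each pixel becomes the max (dilate != 0) or the
 * min (dilate == 0) of its 3x3 neighborhood; outside pixels are ignored. */
static void ref_step8(uint16_t *out, const uint16_t *in, int dilate)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            uint16_t r = in[y * W + x];
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= H || xx < 0 || xx >= W) continue;
                    uint16_t p = in[yy * W + xx];
                    if (dilate ? p > r : p < r) r = p;
                }
            out[y * W + x] = r;
        }
}

/* Apply `depth` chained steps; out and in must not alias. */
static void ref_chain8(uint16_t *out, const uint16_t *in, int depth, int dilate)
{
    uint16_t tmp[N];
    assert(depth >= 0);
    memcpy(out, in, sizeof tmp);
    for (int d = 0; d < depth; d++) {
        ref_step8(tmp, out, dilate);
        memcpy(out, tmp, sizeof tmp);
    }
}

void ref_dilate8(uint16_t *out, const uint16_t *in, int depth)
{
    ref_chain8(out, in, depth, 1);
}

void ref_erode8(uint16_t *out, const uint16_t *in, int depth)
{
    ref_chain8(out, in, depth, 0);
}

/* Morphological gradient: dilation minus erosion, pixel by pixel. */
void ref_gradient8(uint16_t *out, const uint16_t *in, int depth)
{
    uint16_t d[N], e[N];
    ref_dilate8(d, in, depth);
    ref_erode8(e, in, depth);
    for (int i = 0; i < N; i++) out[i] = (uint16_t)(d[i] - e[i]);
}
```

Written this way, a depth-10 dilation is a chain of 10 elementary 3x3 steps, and a gradient contains a full dilation chain and a full erosion chain.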
Compilation Strategy
Standard techniques for low-cost implementation
1. Build large basic blocks of elementary operations (generic): inlining, scalar constant propagation, loop unrolling, dead-code elimination
2. Build and optimize DAGs of image operations (generic): constant propagation, CSE, SDC, copy propagation
3. Generate code for target (specific):
   SPoC: DAG splitting and scheduling, compaction, cutting
   Terapix: DAG splitting, scheduling, memory allocation
   OpenCL: DAG splitting, simple operation aggregation
2.1 Build Image Expression DAG
[Figure: image expression DAG from the Video Survey application]
• expression DAG of simple image operations: morpho, ALU, threshold, measure, copies, scalar ops
• arrows: image and scalar dependencies
2.2 Optimize DAG
[Figure: ANR999 DAG before optimization. freia_gradient (connexity=8, depth=10) expands to freia_erode (10 E8) and freia_dilate (10 D8) followed by a subtraction giving g; the separate freia_dilate (connexity=8, depth=10) expands to 10 more D8 giving d; vol and min read the input i]
[Figure: ANR999 DAG after optimization. One chain of 10 D8 and one chain of 10 E8 remain, producing d and, through the subtraction, g; vol and min read i]
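Judging from the operator counts on this slide, the optimization shares the dilation chain between freia_dilate and freia_gradient, which is what common subexpression elimination (CSE) does on a DAG. The sketch below shows one simple way to do this by value numbering. The node encoding and the function names are hypothetical and are not the PIPS data structures.

```c
#include <assert.h>

/* Hypothetical node encoding, not the PIPS internal representation. */
enum op { OP_INPUT, OP_D8, OP_E8, OP_SUB, OP_MIN, OP_VOL };

struct node {
    enum op op;
    int a, b;       /* indices of the input nodes, -1 when unused */
};

/* Value-numbering CSE over nodes listed in topological order.
 * rep[i] receives the index of the node kept to compute the value of i.
 * Returns the number of distinct nodes left. */
int cse(const struct node *n, int count, int *rep)
{
    int kept = 0;
    for (int i = 0; i < count; i++) {
        assert(n[i].a < i && n[i].b < i);       /* inputs come first */
        int a = n[i].a >= 0 ? rep[n[i].a] : -1;
        int b = n[i].b >= 0 ? rep[n[i].b] : -1;
        rep[i] = i;
        for (int j = 0; j < i; j++) {
            /* skip merged nodes, other ops, and never merge two inputs */
            if (rep[j] != j || n[j].op != n[i].op || n[i].op == OP_INPUT)
                continue;
            int ja = n[j].a >= 0 ? rep[n[j].a] : -1;
            int jb = n[j].b >= 0 ? rep[n[j].b] : -1;
            if (ja == a && jb == b) { rep[i] = j; break; }
        }
        if (rep[i] == i) kept++;
    }
    return kept;
}

/* Append a chain of `depth` nodes of kind op reading node `from`;
 * returns the index of the last node of the chain. */
static int chain(struct node *n, int *count, enum op op, int from, int depth)
{
    for (int d = 0; d < depth; d++) {
        n[*count] = (struct node){ op, from, -1 };
        from = (*count)++;
    }
    return from;
}

/* Unoptimized DAG of the running example; returns the node count. */
int build_example(struct node *n)
{
    int count = 0;
    n[count++] = (struct node){ OP_INPUT, -1, -1 };
    int e = chain(n, &count, OP_E8, 0, 10);     /* gradient: erosions */
    int d = chain(n, &count, OP_D8, 0, 10);     /* gradient: dilations */
    n[count++] = (struct node){ OP_SUB, d, e };
    chain(n, &count, OP_D8, 0, 10);             /* standalone dilate */
    n[count++] = (struct node){ OP_MIN, 0, -1 };
    n[count++] = (struct node){ OP_VOL, 0, -1 };
    return count;
}
```

On this encoding of the running example, the 34 nodes (input included) shrink to 24 because the 10 dilations of the standalone dilate are merged with those of the gradient. The quadratic search keeps the sketch short; a hash table would be the usual choice for larger DAGs.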
3. Target-dependent code generator
mostly NP-complete problems: greedy heuristics to split the DAG and schedule ops
[Figure: ANR999 DAG split for each target. SPoC: spoc helper 0 and spoc helper 1. Terapix: terapix helper 0 to terapix helper 3. OpenCL: OpenCL helper 0 and OpenCL helper 1]
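The simplest instance of greedy splitting is a linear chain of operations mapped onto a fixed-length pipeline such as the 8 SPoC units. The sketch below covers only that special case, with hypothetical names. The actual generator works on whole DAGs, where the problem is much harder, as the slide notes.

```c
#include <assert.h>

enum { SPOC_STAGES = 8 };   /* "Pipeline of 8 units" on the SPoC slide */

/* Greedily cut a linear chain of `ops` operations into pipeline calls of at
 * most `stages` operations each. sizes[k] receives the length of call k.
 * Returns the number of calls, or -1 if max_calls is too small. */
int split_chain(int ops, int stages, int *sizes, int max_calls)
{
    assert(ops >= 0 && stages > 0);
    int calls = 0;
    while (ops > 0) {
        if (calls == max_calls) return -1;
        int take = ops < stages ? ops : stages;   /* fill the pipeline first */
        sizes[calls++] = take;
        ops -= take;
    }
    return calls;
}
```

A chain of 10 operations on 8 stages is cut into two calls of 8 and 2 operations.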
Performance
aggregated speedups for 9 applications

Hardware               Target              H/L    L/C    H/C
FPGA accelerators      SPoC                6.5   14.2   91.5
                       Terapix            20.5    2.3   47.6
Multi-cores (OpenCL)   Intel dual-core     0.9    2.0    1.9
                       AMD quad-core       1.3    2.7    3.5
GPGPU NVIDIA (OpenCL)  GeForce 8800 GTX      -    7.8      -
                       Quadro 600            -   22.1      -
                       Tesla C 2050          -   10.2      -

H: one thread on host, L: library version, C: compiled version
Implementation in PIPS: add 5% to code base
• source-to-source, easier to debug output
• phase 1: reuse (more or less) standard phases, 155000 LOCs
• phase 2: DAG building, optimization, utils, 4000 LOCs
• phase 3: code generation for three targets, 4400 LOCs (SPoC 1900 LOCs, Terapix 1400 LOCs, OpenCL 1100 LOCs)
http://pips4u.org/
Benefits: cost-effective reusable applications!
Portability: through small common API
Performance: through high-level coarse-grain low-cost compilation
Key success factors
Co-design of API / compiler / runtime / hardware
• overlapping tiling moved from compiler to runtime
• double buffers moved from runtime to compiler
• borders management moved to runtime and hardware
Source-to-source eases development and testing
Functional simulators help testing
Applicability
Apps: quite static (but not only!) structure and behavior
API: one data type, few dozen ops, a lot of parallelism
Hardware: well suited, hides loop fusion...
Future Work
• Kalray MPPA data-flow model target?
• new applications? new transformations?
• consider other application domains?
Questions?
Hardware Accelerators
• more or less domain specific
• ASIC, FPGA, GPGPU, multi-cores...
• embedded? real-time? systems
Motivation?
• better execution time
• lower energy footprint
• (hide) intellectual property
• product life time: up to 30 years
Two accelerators: Terapix (128 PE SIMD) and SPoC (chained vector)
2.2 Optimize DAG (1)
[Figure: image expression DAG from the Deblocking application, shown twice, apparently before and after optimization]