Methodology for mapping image processing algorithms on massively parallel processors
An NVIDIA GPU specific approach
Florian Gouin, Corinne Ancourt (firstname.name@mines-paristech.fr)
MINES ParisTech – PSL Research University, Paris
Centre de Recherche en Informatique
22/06/2017
French community of compilation – 12th meeting – Saint Germain au Mont d’Or
Context and motivation Application case Mapping methodology Experiments Conclusion
Motivation
Image processing domain
Figure: Image processing examples
Motivation
Image processing domain
General trends for today and tomorrow:
- Data source volume is growing exponentially
- Data sources are multiplying
- Available computing time tends to shrink for real-time processing
- Image processing algorithms are increasingly complex
Motivation
Architectural evolution
Figure: Processor frequency wall
Figure: Flynn’s taxonomy (instructions versus data: SISD, MISD, SIMD, MIMD, with SIMT added)
Figure: NVIDIA Kepler processor – 192 cores architecture
Motivation
Why do we need a methodology?
Parallel thinking is not trivial. The following methodology has been developed to provide:
- assistance for GPU developers,
- improved software production for industry,
- support for engineers from other domains,
- assistance in optimising software for a specific GPU architecture.
Results from tools and compilers can be limited in some cases:
- dynamic control code
- intensive function calls
- pointer arithmetic
- object-oriented languages
- ...
Content
1. Application case
2. Mapping methodology
3. Experiments
4. Conclusion
Optical Flow algorithms
Optical Flow: definition
Principle:
- Motion quantification of each pixel taken from two distinct pictures.
Image processing application:
- spatial characterization
- temporal characterization
Examples of applications:
- Motion estimation
- Image stabilization
- Image segmentation
- Moving object tracking
- SLAM algorithms
- ...
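To make the principle concrete, the sketch below estimates the motion of a single pixel between two pictures by brute-force block matching: it tries every displacement within a search radius and keeps the one whose neighbourhood differs least. This is a deliberately naive illustration, not the SimpleFlow algorithm; the `Gray` type, the `pixelMotion` function, and the `search` and `half` parameters are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical minimal grayscale image (illustrative, not an OpenCV type).
struct Gray {
    int width, height;
    std::vector<std::uint8_t> px;  // row-major pixel values
    Gray(int w, int h)
        : width(w), height(h), px(static_cast<std::size_t>(w) * h, 0) {}
    // Reads outside the image are clamped to the nearest border pixel.
    int at(int x, int y) const {
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return px[static_cast<std::size_t>(y) * width + x];
    }
};

// Returns the displacement (dx, dy) of pixel (x, y) from picture `a` to
// picture `b`: the offset within [-search, search] that minimises the sum of
// absolute differences over a (2*half+1) x (2*half+1) window.
std::pair<int, int> pixelMotion(const Gray& a, const Gray& b,
                                int x, int y, int search, int half) {
    assert(a.width == b.width && a.height == b.height);
    long best = std::numeric_limits<long>::max();
    std::pair<int, int> motion{0, 0};
    for (int dy = -search; dy <= search; ++dy) {
        for (int dx = -search; dx <= search; ++dx) {
            long cost = 0;
            for (int wy = -half; wy <= half; ++wy)
                for (int wx = -half; wx <= half; ++wx)
                    cost += std::abs(a.at(x + wx, y + wy) -
                                     b.at(x + dx + wx, y + dy + wy));
            if (cost < best) {  // keep the first strictly better candidate
                best = cost;
                motion = {dx, dy};
            }
        }
    }
    return motion;
}
```

Every pixel is processed independently of the others in this sketch, which is the kind of regularity that makes per-pixel motion estimation a natural candidate for massively parallel hardware.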
Optical Flow algorithms
Optical Flow: industrial application example
Figure: Example of motion flow analysis. Tesla Motor Company automatic drive.
SimpleFlow algorithm
Algorithm data
The SimpleFlow algorithm [1] is available in the OpenCV extensions.
- Approximately 600 lines of code
- Sequential algorithm
- Dynamic control code
Approximate runtime for a pair of 2-million-pixel images:
- 200 s on an NVIDIA Jetson TX1, ARM Cortex A57 (1.9 GHz) + A53 (1.3 GHz)
- 50 s on a desktop computer, Intel Core i7 4770S (8 logical cores at 3.1 GHz)
Ideal runtime: 40 ms
Language and library: C++ with the OpenCV library
[1] Michael W. Tao et al. “SimpleFlow: A Non-iterative, Sublinear Optical Flow Algorithm”. In: Computer Graphics Forum (Eurographics 2012) 31.2 (May 2012). URL: http://graphics.berkeley.edu/papers/Tao-SAN-2012-05/
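The gap between the measured runtimes and the 40 ms target can be expressed as a required speed-up factor. The helper below is a small sketch of that arithmetic; `requiredSpeedup` is a hypothetical name, and both arguments are assumed to be in milliseconds.

```cpp
#include <cassert>

// Speed-up factor needed to bring a measured runtime down to a target
// runtime. Both values are assumed to use the same unit (milliseconds here).
double requiredSpeedup(double measuredMs, double targetMs) {
    assert(measuredMs > 0.0 && targetMs > 0.0);
    return measuredMs / targetMs;
}
```

With the figures quoted on this slide, `requiredSpeedup(200000.0, 40.0)` gives a factor of 5000 for the Jetson TX1 CPU, and `requiredSpeedup(50000.0, 40.0)` gives a factor of 1250 for the desktop CPU.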
SimpleFlow algorithm
Simplified call graph
Figure: Simplified call graph. Each function is marked as a SimpleFlow function, an OpenCV function, or a C++ std library function.
Functions shown: calcOpticalFlowSF, selectPointsToRecalcFlow, upscaleOpticalFlow, buildPyramidWithResizeMethod, ones, GaussianBlur, mixChannels, removeOcclusions, calcIrregularityMat, calcConfidence, extrapolateFlow, calcOpticalFlowSingleScaleSF, crossBilateralFilter, resize, zeros, max, dist, min, cvRound, extrapolateValueInRect, multiply, wd, wc, split, sum, copyMakeBorder, exp
SimpleFlow algorithm
Application example
Figure: Image 1 (t)
Figure: Image 2 (t + δ)
Figure: X coordinate pixel motions
Figure: Y coordinate pixel motions
Overview - Macroscopic scale
Figure: Methodology overview, from source code to CPU+GPU source code. Stages: code analyses, loop nest transformations for SIMT architectures, loop optimisations, GPU specialisation, GPU mapping.
Code analyses
Figure: Code analysis workflow.
- Inputs: application executable file, source code
- Steps: profiling, compilation
- Detections: function call detection, loop detection, array detection, branch detection
- Analyses: block identification, loop iteration analysis, array accesses analysis, dependence analysis
- Runtime measurements: global runtime, function runtime, loop runtime
- Loop mining
- Outputs: parallel loops, sequential loops
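The distinction between parallel and sequential loops that dependence analysis produces can be illustrated with two small loops. In this sketch, which uses hypothetical function names and a `reversed` flag added only for demonstration, the first loop has independent iterations and gives the same result in any execution order, while the second carries a dependence from one iteration to the next and does not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parallel loop: iteration i reads and writes only element i, so the
// iterations can run in any order (here: forward or reversed).
std::vector<int> scale(const std::vector<int>& in, int k, bool reversed) {
    const std::size_t n = in.size();
    std::vector<int> out(n);
    for (std::size_t s = 0; s < n; ++s) {
        const std::size_t i = reversed ? n - 1 - s : s;
        out[i] = k * in[i];
    }
    return out;
}

// Sequential loop: iteration i reads v[i - 1], which iteration i - 1 has
// just written (a loop-carried dependence). Reversing the iteration order
// changes the result, so this loop cannot be parallelised as written.
std::vector<int> runningSum(std::vector<int> v, bool reversed) {
    const std::size_t n = v.size();
    for (std::size_t s = 1; s < n; ++s) {
        const std::size_t i = reversed ? n - s : s;
        v[i] += v[i - 1];
    }
    return v;
}
```

Comparing the forward and reversed results is only an empirical hint on one input, not a proof; an actual dependence analysis reasons about the array accesses themselves.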
Loop nest transformations for SIMT architectures
GPU loop identification (input: parallel loops and sequential loops; output: GPU loop nests)
GPU loop pattern:
- Block loops: $1 \le \#b \le 3$ loops ($b_0$, $b_1$, $b_2$), each parallel (//)
- Thread loops: $0 \le \#t \le 3$ loops ($t_0$, $t_1$, $t_2$), each parallel (//) or sequential (↓)
GPU loop size:
- $b = b_0 \times b_1 \times b_2$, with $b_0 < 2147483647$, $b_1 < 65535$, $b_2 < 65535$
- $t = t_0 \times t_1 \times t_2$, with $t < 1024$, $t_0 < 1024$, $t_1 < 1024$, $t_2 < 64$
- $t \bmod 32 = 0$
- $t > 4 \times 32$
- $b \gg t$
GPU memory size:
- Global memory footprint < GPU memory
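The three criteria on this slide can be read as predicates over a candidate loop nest. The following is a minimal sketch under stated assumptions: the `LoopNest` structure and the function names are hypothetical, unused loop dimensions are assumed to have a trip count of 1, the thread-size criteria are assumed to apply only when at least one thread loop exists, and the bounds are copied from the slide as strict inequalities. The qualitative guideline relating b and t is not checked.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical description of a candidate loop nest (illustrative names).
struct LoopNest {
    int blockLoops;                 // #b: number of block loops
    int threadLoops;                // #t: number of thread loops
    std::array<std::int64_t, 3> b;  // trip counts b0, b1, b2 (1 if unused)
    std::array<std::int64_t, 3> t;  // trip counts t0, t1, t2 (1 if unused)
};

// GPU loop pattern: 1 <= #b <= 3 and 0 <= #t <= 3.
bool matchesPattern(const LoopNest& n) {
    return n.blockLoops >= 1 && n.blockLoops <= 3 &&
           n.threadLoops >= 0 && n.threadLoops <= 3;
}

// GPU loop size, with the bounds exactly as written on the slide.
bool matchesSize(const LoopNest& n) {
    for (std::int64_t v : n.b) assert(v >= 1);
    for (std::int64_t v : n.t) assert(v >= 1);
    const bool blocksOk =
        n.b[0] < 2147483647 && n.b[1] < 65535 && n.b[2] < 65535;
    // Assumption: without thread loops, only the block bounds apply.
    if (n.threadLoops == 0) return blocksOk;
    const bool dimsOk = n.t[0] < 1024 && n.t[1] < 1024 && n.t[2] < 64;
    if (!dimsOk) return false;  // also keeps the product below small
    const std::int64_t t = n.t[0] * n.t[1] * n.t[2];
    return blocksOk && t < 1024 && t % 32 == 0 && t > 4 * 32;
}

// GPU memory size: the global memory footprint must fit in GPU memory.
bool fitsMemory(std::uint64_t footprintBytes, std::uint64_t gpuMemoryBytes) {
    return footprintBytes < gpuMemoryBytes;
}
```

For example, a nest with block trip counts (128, 64, 1) and thread trip counts (32, 8, 1) gives t = 256, which satisfies every size criterion in this sketch, whereas thread trip counts (32, 4, 1) give t = 128 and fail the t > 4 × 32 criterion.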
Loop nest transformations for SIMT architectures
GPU loop identification (input: parallel loops and sequential loops; output: GPU loop nests)
Table: loop transformations (strip mining, parallel reduction, fusion, fission, tiling, interchange, splitting, coalescing) against the criteria they address (GPU loop pattern, GPU loop size, GPU memory size).
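Two of the transformations named on this slide, strip mining and coalescing, are sketched below on a trivial per-pixel loop. The function names and the `strip` parameter are hypothetical; the sketch only shows that each transformed loop computes the same result as the reference loop, and makes no claim about which criterion each transformation addresses in the methodology.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference loop: one iteration per pixel.
std::vector<int> brighten(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] + 1;
    return out;
}

// Strip mining: the single loop becomes an outer loop over strips and an
// inner loop of fixed size `strip`, a shape that resembles a block loop
// wrapping a thread loop. The guard handles a partial last strip.
std::vector<int> brightenStripMined(const std::vector<int>& in,
                                    std::size_t strip) {
    assert(strip > 0);
    const std::size_t n = in.size();
    std::vector<int> out(n);
    const std::size_t strips = (n + strip - 1) / strip;  // ceiling division
    for (std::size_t b = 0; b < strips; ++b) {
        for (std::size_t t = 0; t < strip; ++t) {
            const std::size_t i = b * strip + t;
            if (i < n) out[i] = in[i] + 1;
        }
    }
    return out;
}

// Coalescing: a rows x cols nest is flattened into one loop, and the 2-D
// indices are recovered from the flat index.
std::vector<int> brightenCoalesced(const std::vector<int>& in,
                                   std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<int> out(in.size());
    for (std::size_t k = 0; k < rows * cols; ++k) {
        const std::size_t y = k / cols;
        const std::size_t x = k % cols;
        out[y * cols + x] = in[y * cols + x] + 1;
    }
    return out;
}
```

Choosing `strip` as a multiple of 32 would keep the inner loop size compatible with the thread-count criteria listed for GPU loop size.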