
PolyMage: High-Performance Compilation for Heterogeneous Stencils



  1. PolyMage: High-Performance Compilation for Heterogeneous Stencils. Uday Bondhugula (with Ravi Teja Mullapudi and Vinay Vasista), Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India. Dagstuhl seminar, Apr 12-17, 2015; talk given Apr 15, 2015.

  2. Domain-Specific Languages. A DSL and compiler for optimizing image processing pipelines.

  3. Domain-Specific Languages. A DSL and compiler for optimizing image processing pipelines. Common objections: too specialized; you need to learn a new language! (Image: a dodo, highly specialized, but extinct.)

  4. Domain-Specific Languages. A DSL and compiler for optimizing image processing pipelines. Too specialized? Need to learn a new language? But DSLs can be embedded in existing languages, and can grow and become more general-purpose. A DSL compiler can "see" across routines, allowing whole-program optimization, and can generate optimized code for multiple targets. (Image: a dodo, generalized to adapt.)

  5. Introduction: Image Processing Pipelines. Graphs of interconnected processing stages. Figure: the Harris corner detection pipeline, with stages I_in, I_x, I_y, I_xx, I_xy, I_yy, S_xx, S_xy, S_yy, det, trace, and harris.

  6. Introduction: Computation Patterns. Point-wise (g -> f): each output pixel depends only on the corresponding input pixel, e.g. a weighted combination of the colour channels of g:
  f(x, y) = w_r · g(x, y, R) + w_g · g(x, y, G) + w_b · g(x, y, B)
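
A minimal NumPy sketch of the point-wise pattern (illustrative only, not PolyMage code; the function name and the luma weights below are my own choices):

import numpy as np

def to_grey(g, w=(0.299, 0.587, 0.114)):
    # g has shape (H, W, 3); every output pixel depends only on the same input pixel
    return w[0] * g[..., 0] + w[1] * g[..., 1] + w[2] * g[..., 2]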

  7. Introduction: Computation Patterns. Stencil (g -> f):
  f(x, y) = Σ_{σx=−1..+1} Σ_{σy=−1..+1} g(x + σx, y + σy) · w(σx, σy)
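
A minimal NumPy sketch of the 3x3 stencil pattern (my own illustrative code; boundaries are simply left at zero):

import numpy as np

def stencil3x3(g, w):
    # f(x, y) = sum over sx, sy in {-1, 0, +1} of g(x+sx, y+sy) * w(sx, sy)
    f = np.zeros(g.shape, dtype=float)
    for sx in (-1, 0, 1):
        for sy in (-1, 0, 1):
            # shifted view of g aligned with the interior of f
            f[1:-1, 1:-1] += w[sx + 1, sy + 1] * g[1 + sx:g.shape[0] - 1 + sx,
                                                   1 + sy:g.shape[1] - 1 + sy]
    return f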

  8. Introduction: Computation Patterns. Downsample (g -> f):
  f(x, y) = Σ_{σx=−1..+1} Σ_{σy=−1..+1} g(2x + σx, 2y + σy) · w(σx, σy)
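
A minimal NumPy sketch of the downsample pattern, i.e. filter then decimate by 2 (my own illustrative code; borders are skipped for simplicity):

import numpy as np

def downsample(g, w):
    H, W = g.shape
    f = np.zeros((H // 2, W // 2))
    for x in range(1, H // 2 - 1):
        for y in range(1, W // 2 - 1):
            win = g[2 * x - 1:2 * x + 2, 2 * y - 1:2 * y + 2]  # 3x3 window centred at (2x, 2y)
            f[x, y] = np.sum(win * w)
    return f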

  9. Introduction: Computation Patterns. Upsample (g -> f):
  f(x, y) = Σ_{σx=−1..+1} Σ_{σy=−1..+1} g((x + σx)/2, (y + σy)/2) · w(σx, σy, x, y)
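
The simplest instance of the upsample pattern, 2x replication, as a one-line NumPy sketch (illustrative; the slide's weights w(σx, σy, x, y) generalise this to interpolating filters whose taps depend on pixel parity):

import numpy as np

def upsample(g):
    return np.repeat(np.repeat(g, 2, axis=0), 2, axis=1)  # f(x, y) = g(x // 2, y // 2)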

  10. Introduction: Example: pyramid blending pipeline. Figure: a pyramid blending pipeline built from repeated downsampling (↓x, ↓y) of the two inputs and the mask M, Laplacian stages (L), per-level blending (X), and upsampling (↑x, ↑y) with summation (+) to reconstruct the output. Image courtesy: Kyros Kutulakos.
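
A compact NumPy sketch of Laplacian pyramid blending in the spirit of the figure (my own illustrative code, assuming both images and the mask have dimensions divisible by 2**levels; a 2x2 box filter and nearest-neighbour upsampling stand in for the real Gaussian ↓ and ↑ stages):

import numpy as np

def down(img):
    # 2x2 box filter followed by decimation (stand-in for a ↓x ↓y stage)
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def up(img):
    # nearest-neighbour 2x upsampling (stand-in for the ↑x ↑y stages)
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def blend(a, b, mask, levels=3):
    # Blend images a and b using mask in [0, 1], level by level
    la, lb, lm = [], [], []
    for _ in range(levels):
        a2, b2, m2 = down(a), down(b), down(mask)
        la.append(a - up(a2))            # Laplacian level of a
        lb.append(b - up(b2))            # Laplacian level of b
        lm.append(mask)                  # mask at the same resolution
        a, b, mask = a2, b2, m2
    out = mask * a + (1 - mask) * b      # blend the coarsest residual
    for L_a, L_b, m in reversed(list(zip(la, lb, lm))):
        out = up(out) + m * L_a + (1 - m) * L_b
    return out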

  11. Introduction: Where are image processing pipelines used? On images uploaded to social networks like Facebook and Google+; on all camera-enabled devices; in everyday workloads from data-center to mobile-device scale; in computational photography, computer vision, medical imaging, and more. (Example: Google+ Auto Enhance.)

  12. Introduction: Naive vs Optimized Implementation. Harris corner detection on 16 cores, execution time in ms: naive C implementation (Seq) 354.56; naive parallelization with OpenMP and vector pragmas under icc (Par) 53.91, about 7x; manual optimization for locality, parallelism, and vector intrinsics (Tuned) 12.3, about 29x. Manually optimizing pipelines is hard.

  13. Introduction: Naive vs Optimized Implementation. Same data as above: Seq 354.56 ms, Par 53.91 ms (7x), Tuned 12.3 ms (29x) for Harris corner detection on 16 cores. Goal: reach the performance levels of manual tuning, without the pain.

  14. Our Approach: PolyMage. High-level language (DSL embedded in Python): allows expressing common patterns intuitively; enables compiler analysis and optimization. Automatic optimizing code generator: uses domain-specific cost models to apply complex combinations of scaling, alignment, tiling, and fusion to optimize for parallelism and locality.

  15. Approach: Harris Corner Detection. The pipeline DAG of the figure on slide 5, expressed in the PolyMage DSL:

R, C = Parameter(Int), Parameter(Int)
I = Image(Float, [R+2, C+2])
x, y = Variable(), Variable()
row, col = Interval(0, R+1, 1), Interval(0, C+1, 1)

c  = Condition(x, '>=', 1) & Condition(x, '<=', R) & \
     Condition(y, '>=', 1) & Condition(y, '<=', C)
cb = Condition(x, '>=', 2) & Condition(x, '<=', R-1) & \
     Condition(y, '>=', 2) & Condition(y, '<=', C-1)

Iy = Function(varDom = ([x, y], [row, col]), Float)
Iy.defn = [ Case(c, Stencil(I(x, y), 1.0/12,
                  [[-1, -2, -1], [ 0, 0, 0], [ 1, 2, 1]])) ]
Ix = Function(varDom = ([x, y], [row, col]), Float)
Ix.defn = [ Case(c, Stencil(I(x, y), 1.0/12,
                  [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])) ]

Ixx = Function(varDom = ([x, y], [row, col]), Float)
Ixx.defn = [ Case(c, Ix(x, y) * Ix(x, y)) ]
Iyy = Function(varDom = ([x, y], [row, col]), Float)
Iyy.defn = [ Case(c, Iy(x, y) * Iy(x, y)) ]
Ixy = Function(varDom = ([x, y], [row, col]), Float)
Ixy.defn = [ Case(c, Ix(x, y) * Iy(x, y)) ]

Sxx = Function(varDom = ([x, y], [row, col]), Float)
Syy = Function(varDom = ([x, y], [row, col]), Float)
Sxy = Function(varDom = ([x, y], [row, col]), Float)
for pair in [(Sxx, Ixx), (Syy, Iyy), (Sxy, Ixy)]:
    pair[0].defn = [ Case(cb, Stencil(pair[1], 1,
                       [[1, 1, 1], [1, 1, 1], [1, 1, 1]])) ]

det = Function(varDom = ([x, y], [row, col]), Float)
d = Sxx(x, y) * Syy(x, y) - Sxy(x, y) * Sxy(x, y)
det.defn = [ Case(cb, d) ]
trace = Function(varDom = ([x, y], [row, col]), Float)
trace.defn = [ Case(cb, Sxx(x, y) + Syy(x, y)) ]
harris = Function(varDom = ([x, y], [row, col]), Float)
coarsity = det(x, y) - 0.04 * trace(x, y) * trace(x, y)
harris.defn = [ Case(cb, coarsity) ]
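
For comparison, a plain NumPy/SciPy reference of the same pipeline (my own illustrative sketch; boundary handling via correlate's default mode differs from the Case conditions in the spec above):

import numpy as np
from scipy.ndimage import correlate

def harris_reference(I):
    # I is the padded (R+2) x (C+2) float image, as in the spec above
    ky = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float) / 12.0
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float) / 12.0
    box = np.ones((3, 3))
    Iy, Ix = correlate(I, ky), correlate(I, kx)   # gradients
    Sxx = correlate(Ix * Ix, box)                 # 3x3 box sums
    Syy = correlate(Iy * Iy, box)
    Sxy = correlate(Ix * Iy, box)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - 0.04 * trace * trace             # corner response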

  16. Compiler. Our Approach: PolyMage. High-level language (DSL embedded in Python): allows expressing common patterns intuitively; enables compiler analysis and optimization. Automatic optimizing code generator: uses domain-specific cost models to apply complex combinations of scaling, alignment, tiling, and fusion to optimize for parallelism and locality.

  17. Compiler: Polyhedral Representation. A three-stage 1-D pipeline (f_in -> f_1 -> f_2 -> f_out) used as a running example:

x = Variable()
f_in = Image(Float, [18])
f_1 = Function(varDom = ([x], [Interval(0, 17, 1)]), Float)
f_1.defn = [ f_in(x) + 1 ]
f_2 = Function(varDom = ([x], [Interval(1, 16, 1)]), Float)
f_2.defn = [ f_1(x-1) + f_1(x+1) ]
f_out = Function(varDom = ([x], [Interval(2, 15, 1)]), Float)
f_out.defn = [ f_2(x-1) + f_2(x+1) ]
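
The semantics of this spec, written out as plain Python for illustration (my own rendering; it computes every stage in full, which is exactly what the default schedule below does):

f_in  = {x: float(x) for x in range(18)}                 # Image(Float, [18]); values arbitrary here
f_1   = {x: f_in[x] + 1 for x in range(0, 18)}           # Interval(0, 17, 1)
f_2   = {x: f_1[x-1] + f_1[x+1] for x in range(1, 17)}   # Interval(1, 16, 1)
f_out = {x: f_2[x-1] + f_2[x+1] for x in range(2, 16)}   # Interval(2, 15, 1)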

  18. Compiler: Polyhedral Representation: Domains. Same pipeline as above; each function is computed over a rectangular domain: f_1 over [0, 17], f_2 over [1, 16], f_out over [2, 15].

  19. Compiler: Polyhedral Representation: Dependence vectors.

Function                              Dependence vectors
f_out(x) = f_2(x - 1) + f_2(x + 1)    (1, 1), (1, -1)
f_2(x)   = f_1(x - 1) + f_1(x + 1)    (1, 1), (1, -1)
f_1(x)   = f_in(x) + 1

  20. Compiler: Polyhedral Representation: Live-outs. Same dependence table as above; f_out is the live-out of the pipeline (its values must survive the pipeline's execution), while f_1 and f_2 are intermediates.
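
An illustrative sketch (names and sign convention are my own) of how the vectors in the table arise: numbering the functions along a stage dimension, an access p(x + o) made by s(x) yields the dependence vector (1, -o) from producer to consumer:

# consumer -> (producer, constant offsets at which the producer is read)
accesses = {
    "f_out": ("f_2",  [-1, +1]),
    "f_2":   ("f_1",  [-1, +1]),
    "f_1":   ("f_in", [0]),
}
for consumer, (producer, offsets) in accesses.items():
    vectors = [(1, -o) for o in offsets]   # (stage distance, x distance)
    print(f"{consumer} depends on {producer}: {vectors}")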

  21. Compiler: Scheduling Criteria. The schedule for the f_in -> f_1 -> f_2 -> f_out pipeline is evaluated against three criteria: locality, storage, and parallelism.

  22-24. Compiler: Scheduling Criteria: Default schedule (shown step by step over slides 22-24). Each function is computed in full before its consumer begins: all of f_1, then all of f_2, then all of f_out. This gives full parallelism within each stage, but poor producer-consumer locality and full-sized intermediate storage.

  25. Compiler: Scheduling Criteria: Parallelogram tiling. The stage and x dimensions are tiled with skewed (parallelogram) tiles that respect the (1, 1) and (1, -1) dependences, improving producer-consumer locality.
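
A minimal, sequential sketch of parallelogram tiling on the 1-D pipeline above (my own illustrative code; PolyMage itself generates optimized C++). Skewing x by the stage number t turns the dependence vectors (1, 1) and (1, -1) into (1, 2) and (1, 0), so rectangular tiles of the skewed space, i.e. parallelograms in the original space, can be executed one after another:

N, TILE = 18, 4
f_in = [float(x) for x in range(N)]
buf = {1: {}, 2: {}, 3: {}}                       # buf[t][x]: value of stage t at point x
domain = {1: range(0, 18), 2: range(1, 17), 3: range(2, 16)}

def compute(t, x):
    if t == 1:
        buf[1][x] = f_in[x] + 1                   # f_1
    elif t == 2:
        buf[2][x] = buf[1][x-1] + buf[1][x+1]     # f_2
    else:
        buf[3][x] = buf[2][x-1] + buf[2][x+1]     # f_out

# Visit tiles left to right in the skewed space xs = x + t; within a tile, sweep
# stages, then points. Tiles could also run in pipelined parallel fashion once
# their left neighbour is done (not shown here).
for xs0 in range(0, N + 3, TILE):                 # N + 3 covers the maximum skew (t = 3)
    for t in (1, 2, 3):
        for xs in range(xs0, xs0 + TILE):
            x = xs - t
            if x in domain[t]:
                compute(t, x)

result = [buf[3][x] for x in domain[3]]           # live-out values of f_out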
