Automatically Scheduling Halide Image Processing Pipelines

  1. Automatically Scheduling Halide Image Processing Pipelines. Ravi Teja Mullapudi (CMU), Andrew Adams (Google), Dillon Sharlet (Google), Jonathan Ragan-Kelley (Stanford), Kayvon Fatahalian (CMU)

  2. High demand for efficient image processing

  5. Scheduling image processing algorithms. Algorithm description (written by image processing algorithm developers): Var x, y; Func f, g; g(x,y) = f(x,y) + …; h(x) = g(x,y) + … Schedule (machine mapping): parallelize y loop; tile output dims; vectorize y loop. Result: implementation.

  6. Few developers have the skill set to author highly optimized schedules. Algorithm description (written by image processing algorithm developers): Var x, y; Func f, g; g(x,y) = f(x,y) + …; h(x) = g(x,y) + … Schedule (machine mapping): parallelize y loop; tile output dims; vectorize y loop. Result: implementation, > 10x faster.

  7. Contribution: automatic scheduling of image processing pipelines. Algorithm description (written by image processing algorithm developers): Var x, y; Func f, g; g(x,y) = f(x,y) + …; h(x) = g(x,y) + … Scheduling algorithm: generates expert-quality schedules in seconds. Result: implementation, > 10x faster.

  8. Why is it challenging to schedule image processing pipelines?

  11. Algorithm: 3x3 box blur (in → bx → out). bx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3; out(x, y) = (bx(x, y-1) + bx(x, y) + bx(x, y+1)) / 3
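
The two definitions above translate directly into code. The following is a minimal Python sketch, not Halide code. It assumes clamp-to-edge boundary handling (the slides do not state a boundary condition), and the helper names `make_input`, `bx`, and `out` are chosen here for illustration.

```python
def make_input(image):
    """Wrap a list-of-rows image as a function in(x, y).

    Out-of-range coordinates are clamped to the edge (an assumption).
    """
    height, width = len(image), len(image[0])

    def inp(x, y):
        xc = min(max(x, 0), width - 1)
        yc = min(max(y, 0), height - 1)
        return image[yc][xc]

    return inp


def bx(inp, x, y):
    # Horizontal pass: bx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3
    return (inp(x - 1, y) + inp(x, y) + inp(x + 1, y)) / 3


def out(inp, x, y):
    # Vertical pass: out(x, y) = (bx(x, y-1) + bx(x, y) + bx(x, y+1)) / 3
    return (bx(inp, x, y - 1) + bx(inp, x, y) + bx(inp, x, y + 1)) / 3


# A ramp whose value depends only on x; the blur leaves interior pixels unchanged.
ramp = [[3 * x for x in range(5)] for _ in range(5)]
result = out(make_input(ramp), 2, 2)
```

Written this way, the code only states what each pixel is. It says nothing about evaluation order or where bx is stored, which is the part a schedule decides.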

  17. A basic (slow) schedule: compute all pixels of bx, in parallel; then compute all pixels of out, in parallel. bx is held in an intermediate buffer between the two stages. (Diagram: in, bx, out over x and y.)
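
As a rough illustration of this schedule, the sketch below computes every pixel of bx into a full-size buffer before computing any pixel of out. It is plain sequential Python, so the "in parallel" part is only noted in comments. Clamp-to-edge boundaries and the function name `blur_breadth_first` are assumptions made for the example.

```python
def blur_breadth_first(image):
    """Two-stage blur with a full-size intermediate buffer.

    Returns the output image and the number of intermediate values
    that are live at once.
    """
    height, width = len(image), len(image[0])

    def clamp(value, upper):
        return min(max(value, 0), upper)

    # Stage 1: every pixel of bx. Rows are independent, so a real
    # implementation could compute them in parallel.
    bx = [
        [
            (image[y][clamp(x - 1, width - 1)]
             + image[y][x]
             + image[y][clamp(x + 1, width - 1)]) / 3
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Stage 2: every pixel of out, reading the whole bx buffer back.
    out = [
        [
            (bx[clamp(y - 1, height - 1)][x]
             + bx[y][x]
             + bx[clamp(y + 1, height - 1)][x]) / 3
            for x in range(width)
        ]
        for y in range(height)
    ]
    return out, width * height
```

The second return value is the size of the intermediate buffer: it grows with the image, so for large images bx is written to and read back from main memory.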

  18. Low performance: bandwidth bound. The intermediate buffer is a large in-memory buffer. (Diagram: in, bx, out over x and y.)

  20. Tiling to improve data locality: for each 3x3 tile, in parallel: compute required pixels of bx; compute pixels of out in tile. (Diagram: a 3x3 tile of out and the required pixels of bx.)

  22. Tiling to improve data locality: for each 3x3 tile, in parallel: compute required pixels of bx; compute pixels of out in tile. Intermediate buffer: fits in fast on-chip storage.
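
The tiled loop nest can be sketched as follows. This is an illustrative Python version, again sequential and with assumed clamp-to-edge boundaries. The name `blur_tiled` and the dictionary used as the per-tile scratch buffer are choices made for this example.

```python
def blur_tiled(image, tile_w=3, tile_h=3):
    """Tiled two-stage blur.

    Returns the output image and the largest per-tile scratch size.
    """
    height, width = len(image), len(image[0])

    def clamp(value, upper):
        return min(max(value, 0), upper)

    def bx_at(x, y):
        yc = clamp(y, height - 1)
        return (image[yc][clamp(x - 1, width - 1)]
                + image[yc][clamp(x, width - 1)]
                + image[yc][clamp(x + 1, width - 1)]) / 3

    out = [[0.0] * width for _ in range(height)]
    max_scratch = 0
    # Tiles are independent of each other, so they could run in parallel.
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            y_end = min(ty + tile_h, height)
            x_end = min(tx + tile_w, width)
            # Required pixels of bx: the tile's rows plus one row above and below.
            scratch = {
                (x, y): bx_at(x, y)
                for y in range(ty - 1, y_end + 1)
                for x in range(tx, x_end)
            }
            max_scratch = max(max_scratch, len(scratch))
            # Pixels of out in this tile, read only from the small scratch buffer.
            for y in range(ty, y_end):
                for x in range(tx, x_end):
                    out[y][x] = (scratch[(x, y - 1)]
                                 + scratch[(x, y)]
                                 + scratch[(x, y + 1)]) / 3
    return out, max_scratch
```

With 3x3 tiles the scratch buffer never holds more than 15 values (5 rows of 3), regardless of image size, which is what lets the intermediate data stay in fast storage.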

  29. Tiling introduces redundant work: pixels of bx required by adjacent tiles are computed twice. (Diagram: in, bx, out over x and y.)

  30. Larger tiles reduce redundant work: for each 3x6 tile, in parallel: compute required pixels of bx; compute pixels in tile of out.
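
The effect of tile size on redundant work can be counted directly. The sketch below assumes the blur from the earlier slides, where each tile of out needs one extra row of bx above and one below, and it reads "3x6" as 3 pixels wide by 6 pixels tall (the slide does not say which dimension is which). The function name `bx_evaluations` is hypothetical.

```python
def bx_evaluations(width, height, tile_w, tile_h):
    """Count how many bx pixels a tiled schedule computes in total."""
    total = 0
    for ty in range(0, height, tile_h):
        # Rows of out in this tile, plus one bx row above and one below.
        rows = min(tile_h, height - ty) + 2
        for tx in range(0, width, tile_w):
            cols = min(tile_w, width - tx)
            total += rows * cols
    return total


untiled = 18 * 18                              # each bx pixel computed once
tiles_3x3 = bx_evaluations(18, 18, 3, 3)       # 5 rows of bx per 3 rows of out
tiles_3x6 = bx_evaluations(18, 18, 3, 6)       # 8 rows of bx per 6 rows of out
```

For an 18x18 image this gives 324 evaluations untiled, 540 with 3x3 tiles, and 432 with 3x6 tiles: the overhead falls from about 67% to about 33% as the tile grows taller.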

  32. Goal: balance parallelism, locality, and work. For each 3x6 tile, in parallel: compute required pixels of bx; compute pixels in tile of out.

  33. Represent image processing pipelines as graphs. DAG representation of the two-stage blur pipeline: in → bx → out.
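
One simple way to hold such a graph in code is a mapping from each stage to the stages it reads from. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The dictionary layout and the helper names `evaluation_order` and `consumers` are illustrative choices, not taken from the talk.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it reads from (its producers).
blur_dag = {
    "in": set(),
    "bx": {"in"},
    "out": {"bx"},
}


def evaluation_order(dag):
    """Return the stages so that every producer comes before its consumers."""
    return list(TopologicalSorter(dag).static_order())


def consumers(dag, stage):
    """Return the stages that read from `stage`."""
    return sorted(s for s, producers in dag.items() if stage in producers)
```

For the blur pipeline, `evaluation_order(blur_dag)` yields in, bx, out, and `consumers(blur_dag, "bx")` yields only out.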

  34. Real-world pipelines are complex graphs. Local Laplacian filters: 100 stages [Paris et al. 2010, Aubry et al. 2011]. Google Nexus HDR+ mode: over 2000 stages!

  38. Key aspects of scheduling: deciding which stages to interleave for better data locality; picking tile sizes to trade off locality and re-computation; maintaining the ability to execute in parallel.

  39. An Algorithm for Scheduling Image Processing Pipelines

  44. Algorithm. Input: DAG of pipeline stages (in, A, B, C, D, E). Output: optimized schedule: for each 8x128 tile in parallel: compute required pixels of A; compute pixels in tile of B. For each 8x8 tile in parallel: compute required pixels of C; compute required pixels of D; compute pixels in tile of E. Resulting groups: A,B (tile size 8 x 128) and C,D,E (tile size 8 x 8).
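
The output on this slide is, in effect, a list of groups, each with a tile size. A possible in-memory form is sketched below. The `Group` class and `render` function are hypothetical names introduced here, and the sketch only prints the schedule text. It does not compute one.

```python
from dataclasses import dataclass


@dataclass
class Group:
    stages: list   # stages in evaluation order; the last one is the group's output
    tile: tuple    # tile size as written on the slide, e.g. (8, 128)


def render(groups):
    """Print-friendly form of a grouped, tiled schedule."""
    lines = []
    for group in groups:
        lines.append(f"for each {group.tile[0]}x{group.tile[1]} tile in parallel")
        for stage in group.stages[:-1]:
            lines.append(f"  compute required pixels of {stage}")
        lines.append(f"  compute pixels in tile of {group.stages[-1]}")
    return "\n".join(lines)


schedule = [Group(["A", "B"], (8, 128)), Group(["C", "D", "E"], (8, 8))]
text = render(schedule)
```

Calling `print(text)` reproduces the seven-line schedule shown on the slide.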

  45. Scheduling the DAG for better locality: Which stages should be grouped together? How should the stages in each group be tiled?

  46. When to group stages? Candidate schedule: for each 3x3 tile in parallel: compute required pixels of A; compute pixels in tile of B (group A,B, tile size 3 x 3). Compute all pixels of C, in parallel; compute all pixels of D, in parallel; compute all pixels of E, in parallel. Grouping A and B together can either improve or degrade performance.

  47. Quantifying the cost of a group. Schedule: for each 3x3 tile in parallel: compute required pixels of A; compute pixels in tile of B (group A,B, tile size 3 x 3). Compute all pixels of C, in parallel; compute all pixels of D, in parallel; compute all pixels of E, in parallel. Cost = Cost of arithmetic + Cost of memory

  48. Quantifying the cost of a group. Schedule: for each 3x3 tile in parallel: compute required pixels of A; compute pixels in tile of B (group A,B, tile size 3 x 3). Compute all pixels of C, in parallel; compute all pixels of D, in parallel; compute all pixels of E, in parallel. Cost = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)

  49. Quantifying the cost of a group (A,B, tile size 3 x 3): for each 3x3 tile in parallel: compute required pixels of A; compute pixels in tile of B. Cost = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)
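
The cost formula is simple enough to write down directly. In the sketch below, the value of `LOAD_COST` and all operation counts are made-up numbers chosen only to show how the two terms trade off. None of them come from the talk.

```python
LOAD_COST = 40  # hypothetical relative cost of one memory access


def group_cost(arithmetic_ops, memory_accesses, load_cost=LOAD_COST):
    # Cost = (number of arithmetic operations) + (number of memory accesses) x (LOAD COST)
    return arithmetic_ops + memory_accesses * load_cost


# Hypothetical counts for one pair of stages under three schedules.
ungrouped = group_cost(arithmetic_ops=1000, memory_accesses=1000)
grouped_good = group_cost(arithmetic_ops=1500, memory_accesses=200)
grouped_bad = group_cost(arithmetic_ops=5000, memory_accesses=950)
```

With these numbers, grouping that adds 50% more arithmetic but removes most memory traffic is much cheaper (9500 versus 41000), while grouping that recomputes heavily and saves few accesses is more expensive (43000). This is the sense in which grouping can either improve or degrade performance.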

  54. Estimating cost using interval analysis. Group A,B, tile size 3 x 3 (stages in, A, B). Cost = (Number of arithmetic operations) + (Number of memory accesses) x (LOAD COST)

  55. Estimating cost using interval analysis. Group A,B, tile size 3 x 3 (stages in, A, B). Cost = (Number of tiles) x (Cost per tile)
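
A rough sketch of how interval analysis can feed this cost estimate is shown below. It is a simplified model written for illustration, not the cost model from the talk. It assumes A and B have the same stencils as bx and out in the blur example, 3 arithmetic operations per pixel, A kept in fast storage, and one memory access per input pixel read and per output pixel written. All names and constants are hypothetical.

```python
import math


def required_region(region, footprint):
    """Grow an inclusive ((x0, x1), (y0, y1)) region by a stencil footprint."""
    (x0, x1), (y0, y1) = region
    (dx0, dx1), (dy0, dy1) = footprint
    return ((x0 + dx0, x1 + dx1), (y0 + dy0, y1 + dy1))


def area(region):
    (x0, x1), (y0, y1) = region
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def group_cost(width, height, tile_w, tile_h, load_cost):
    a_reads_in = ((-1, 1), (0, 0))   # assumed: A reads in at x-1 .. x+1
    b_reads_a = ((0, 0), (-1, 1))    # assumed: B reads A at y-1 .. y+1
    ops_a = ops_b = 3                # assumed arithmetic operations per pixel

    # Work backwards from one tile of B to the regions of A and in it needs.
    b_region = ((0, tile_w - 1), (0, tile_h - 1))
    a_region = required_region(b_region, b_reads_a)
    in_region = required_region(a_region, a_reads_in)

    arithmetic = area(a_region) * ops_a + area(b_region) * ops_b
    memory = area(in_region) + area(b_region)   # loads of in, stores of B
    cost_per_tile = arithmetic + memory * load_cost

    num_tiles = math.ceil(width / tile_w) * math.ceil(height / tile_h)
    return num_tiles * cost_per_tile
```

For a 3x3 tile of B the analysis gives a 3x5 region of A and a 5x5 region of in. On a 12x12 image with a load cost of 40, `group_cost(12, 12, 3, 3, 40)` evaluates to 16 tiles times 1432 per tile.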

  56. Search for best tile sizes. Group A,B, candidate tile size 1 x 6 (stages in, A, B).
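
A brute-force version of such a search might look like the sketch below. It reuses the same simplified, assumed cost model (blur-like stencils for A and B, 3 operations per pixel) and adds two hypothetical constraints so that the answer is not simply "one tile covering the whole image": the intermediate pixels of A must fit in a fast-storage budget, and there must be enough tiles to run in parallel. The candidate sizes and all parameter names are assumptions made for this example.

```python
import math
from itertools import product


def tile_cost(width, height, tile_w, tile_h, load_cost):
    """Return (total cost, number of tiles, pixels of A needed per tile)."""
    a_pixels = tile_w * (tile_h + 2)            # one extra row above and below
    in_pixels = (tile_w + 2) * (tile_h + 2)     # one extra column left and right
    b_pixels = tile_w * tile_h
    per_tile = 3 * a_pixels + 3 * b_pixels + load_cost * (in_pixels + b_pixels)
    tiles = math.ceil(width / tile_w) * math.ceil(height / tile_h)
    return tiles * per_tile, tiles, a_pixels


def best_tile(width, height, sizes, load_cost, fast_capacity, min_tiles):
    """Try every (tile_w, tile_h) pair and keep the cheapest feasible one."""
    best = None
    for tile_w, tile_h in product(sizes, repeat=2):
        cost, tiles, a_pixels = tile_cost(width, height, tile_w, tile_h, load_cost)
        if a_pixels > fast_capacity:   # intermediate must fit in fast storage
            continue
        if tiles < min_tiles:          # keep enough tiles to run in parallel
            continue
        if best is None or cost < best[0]:
            best = (cost, (tile_w, tile_h))
    return best


choice = best_tile(12, 12, sizes=[1, 2, 3, 4, 6, 12], load_cost=40,
                   fast_capacity=32, min_tiles=4)
```

The search returns both the estimated cost and the chosen tile size. Loosening `fast_capacity` or `min_tiles` lets larger tiles win, since they do less redundant work.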
