Resource Allocation for Hardware Implementations of Map


  1. Resource Allocation for Hardware Implementations of Map. Richard Townsend, Martha A. Kim, Stephen A. Edwards (Columbia University). ASBD Workshop, June 15, 2014

  2. Functional Programs to Functional Hardware: functional program (Haskell) → compiler → circuit (Verilog)

  3. Functional Programs to Functional Hardware: map f [x1, x2, ..., xn] over an ordered list → ? (this talk)

  4. Functional Programs to Functional Hardware: map f [x1, x2, ..., xn] over an ordered list → ? (this talk); order-dependent operations: scan, fold
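
In software terms, the property that separates map from scan and fold can be sketched in Python (purely illustrative; the flow on slide 2 starts from Haskell). Each f(xi) depends only on xi, so the elements can be evaluated in any order and written back by index. Fold and scan thread an accumulator from one element to the next, so with a non-associative operator such as subtraction their result depends on order. `map_any_order` is a hypothetical helper, not part of the authors' compiler.

```python
import random
from functools import reduce
from itertools import accumulate
from operator import sub


def map_any_order(f, xs, seed=0):
    """Apply f to every element in a shuffled order, then restore list order."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)   # arbitrary evaluation order
    results = [None] * len(xs)
    for i in order:
        results[i] = f(xs[i])            # f(x_i) depends only on x_i
    return results


def square(x):
    return x * x


xs = [1, 2, 3, 4, 5]

# Map: any evaluation order gives the same list once results sit at their index.
assert map_any_order(square, xs) == list(map(square, xs))

# Fold and scan: each step consumes the previous result, so order matters.
print(reduce(sub, xs))              # -13, i.e. ((((1-2)-3)-4)-5)
print(reduce(sub, reversed(xs)))    # -5, i.e. ((((5-4)-3)-2)-1)
print(list(accumulate(xs, sub)))    # [1, -1, -4, -8, -13]
```

Reversing the input changes the fold's answer but not the map's, which is why the talk can treat map's elements as independent work for parallel hardware.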

  5. Functional Map vs. MapReduce

  6. Functional Map vs. MapReduce [Figure: 4 × 4 grid of index pairs, (0,0) through (3,3)]

  7. Functional Map vs. MapReduce [Figure: 4 × 4 grid of index pairs, (0,0) through (3,3), labeled Ordered]

  8. Functional Map vs. MapReduce [Figure: 4 × 4 grid of index pairs, (0,0) through (3,3), labeled Ordered and Unordered]
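
One software analogue of the Ordered vs. Unordered contrast, offered as an assumption rather than the talk's definition: Python's `concurrent.futures.Executor.map` returns results in input order, like a functional map, while `as_completed` hands back futures in completion order, so each result has to carry its key, much as MapReduce's key/value pairs do. `slow_square` and its delays are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x):
    """Hypothetical f whose latency varies by input (larger x finishes sooner)."""
    time.sleep(0.01 * (5 - x))
    return x * x


xs = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Ordered (functional map): results come back in the order of xs,
    # even though later elements finish first.
    ordered = list(pool.map(slow_square, xs))

    # Unordered: results arrive in completion order, so each one is paired
    # with its key (the input) to stay meaningful.
    futures = {pool.submit(slow_square, x): x for x in xs}
    unordered = [(futures[fut], fut.result()) for fut in as_completed(futures)]

print(ordered)            # [1, 4, 9, 16]
print(sorted(unordered))  # [(1, 1), (2, 4), (3, 9), (4, 16)]
```

The ordered version must buffer early finishers until earlier elements are done; that reassembly cost is what the hardware structure in the following slides has to pay for.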

  9. Structure of a Hardware Implementation [Diagram: a single functional unit f]

  10. Structure of a Hardware Implementation [Diagram: five functional units f]

  11. Structure of a Hardware Implementation [Diagram: five functional units f]

  12. Structure of a Hardware Implementation [Diagram: five functional units f; inputs x1 through x10]

  13. Structure of a Hardware Implementation [Diagram: five functional units f; inputs x1 through x7 queued]

  14. Structure of a Hardware Implementation [Diagram: x1, x2, x3, x4, x5 at the five functional units; x6 through x10 waiting]

  15. Structure of a Hardware Implementation [Diagram: results f(x1) and f(x3); x2, x4, x5 at their units; x6 through x10 waiting]

  16. Structure of a Hardware Implementation [Diagram: x6 enters a functional unit; results f(x1) and f(x3); x2, x4, x5 at their units; x7 through x10 waiting]

  17. Structure of a Hardware Implementation [Diagram: x2, x4, x5, x6 at functional units; results f(x1) and f(x3); x7 through x10 waiting]

  18. Structure of a Hardware Implementation [Diagram: as in slide 17, plus the result f(x7)]

  19. Structure of a Hardware Implementation [Diagram: as in slide 18, annotated "still processing"]

  20. Structure of a Hardware Implementation [Diagram: as in slide 18, annotated "still processing" and "stuck in buffer"]

  21. Structure of a Hardware Implementation [Diagram: as in slide 18, annotated "still processing", "stuck in buffer", and "holding and stalling"]

  22. Structure of a Hardware Implementation [Diagram: five functional units f]

  23. Structure of a Hardware Implementation: more functional units give more parallelism; more buffers give better utilization. [Diagram: five functional units f]
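
A minimal cycle-level sketch of this structure, useful for reproducing the effects annotated on slides 19 to 21. It is not the authors' simulator, and every modeling choice is an assumption: elements are issued in list order to any idle functional unit, f takes a caller-supplied number of cycles per element (at least 1), results must leave in list order, and each unit has its own output buffer of `slots_per_unit` results. A unit that finishes out of order with a full buffer holds its result and stalls, so it cannot accept new input. `simulate_map` is a hypothetical name.

```python
from collections import deque


def simulate_map(latencies, n_units, slots_per_unit):
    """Return the cycle in which the last in-order result of map leaves.

    latencies[i] is the (hypothetical) number of cycles f takes on element i;
    every latency must be at least 1.
    """
    n = len(latencies)
    busy = {}                                # unit -> [element, cycles left]
    holding = {}                             # unit -> finished element it cannot pass on
    out = [deque() for _ in range(n_units)]  # per-unit output buffers
    next_in = next_out = 0
    cycle = 0
    while next_out < n:
        # Issue: every unit that is neither busy nor stalled takes the next element.
        for u in range(n_units):
            if u not in busy and u not in holding and next_in < n:
                busy[u] = [next_in, latencies[next_in]]
                next_in += 1
        cycle += 1
        # Compute: a unit whose element finishes this cycle now holds the result.
        for u in list(busy):
            busy[u][1] -= 1
            if busy[u][1] == 0:
                holding[u] = busy.pop(u)[0]
        # Output: emit results strictly in list order, from buffers or units.
        emitted = True
        while emitted:
            emitted = False
            for u in range(n_units):
                if out[u] and out[u][0] == next_out:
                    out[u].popleft()
                elif holding.get(u) == next_out:
                    del holding[u]           # next in order: skip the buffer
                else:
                    continue
                next_out += 1
                emitted = True
        # Buffer: park out-of-order results; a unit with a full buffer stalls.
        for u in list(holding):
            if len(out[u]) < slots_per_unit:
                out[u].append(holding.pop(u))
    return cycle


print(simulate_map([1] * 10, 5, 0))      # 2
print(simulate_map([5, 1, 1, 1], 2, 0))  # 6
print(simulate_map([5, 1, 1, 1], 2, 2))  # 5
```

With one slow element at the front and no buffers, the second unit finishes f(x2), holds it, and stalls until f(x1) is done, so the run takes 6 cycles. With two buffer slots per unit, that unit keeps working on x3 and x4 and the run finishes in 5 cycles, a small instance of buffers improving utilization.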

  24. Multiple Possible Configurations... Which to Choose? Area = 15

  25. Multiple Possible Configurations... Which to Choose? Each buffer is 50% the size of a functional unit; area = 15

  26. Multiple Possible Configurations... Which to Choose? [Diagram: 15 functional units; area = 15]

  27. Multiple Possible Configurations... Which to Choose? [Diagram: one functional unit with 28 buffers; area = 1 + 28 × 1/2 = 15]

  28. Multiple Possible Configurations... Which to Choose? [Diagram: intermediate configurations with area = 15: 3 functional units plus 24 buffers (24 × 1/2 = 12), and 5 functional units plus 20 buffers (20 × 1/2 = 10)]
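
The area arithmetic on slides 24 to 28 can be enumerated directly. A minimal sketch, assuming a functional unit costs 1 unit of area and a buffer 1/2 (as the slides state), at least one functional unit, and that all leftover area goes to buffers. `configurations` is a hypothetical helper. `Fraction` keeps the 1/2 costs exact.

```python
from fractions import Fraction


def configurations(total_area, fu_area=Fraction(1), buffer_area=Fraction(1, 2)):
    """Return (functional units, buffers) pairs that fit within total_area.

    Each configuration has at least one functional unit and spends the
    leftover area on as many buffers as fit.
    """
    configs = []
    units = 1
    while units * fu_area <= total_area:
        buffers = (total_area - units * fu_area) // buffer_area  # an int
        configs.append((units, buffers))
        units += 1
    return configs


configs = configurations(15)
print(len(configs))                                     # 15
print(configs[0], configs[2], configs[4], configs[-1])  # (1, 28) (3, 24) (5, 20) (15, 0)
```

The slides count buffers in total. Dividing by the number of units gives the "buffer slots per functional unit" axis used in the later plots; for example, 24 buffers across 3 units is 8 slots each.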

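  29. Workload Structure [Diagram: functional units f]

  30. Workload Structure [Diagram: functional units f, labeled Best-case]

  31. Workload Structure [Timeline diagram: f invocations over time for the best case]

  32. Workload Structure [Timeline diagram: f invocations over time for the best case; the average case is marked ?]

  33. Workload Structure [Timeline diagram: f invocations over time for the best case; the average case is marked ?; worst case labeled]

  34. Workload Structure [Timeline diagram: f invocations over time for the best case and the worst case; the average case is marked ?]

  35. Workload Structure [Timeline diagram: f invocations over time for the best case and the worst case; the average case is marked ?]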

  36. Optimal Resource Allocation: Simulator Results [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 250]

  37. Optimal Resource Allocation: Simulator Results [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 250; annotation: Maximizing Functional Units, 2x speedup]

  38. Optimal Resource Allocation: Simulator Results [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 250; annotation: Maximizing Buffers, 3x speedup]

  39. Optimal Resource Allocation [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 250; diagram of five functional units f]

  40. Optimal Resource Allocation [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 250; diagram of five functional units f; annotations: Slower, Fewer Buffers]
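
Choosing among the configurations is an argmin over the design space. A minimal sketch, assuming every functional unit gets the same number of buffer slots and that the caller supplies `completion_time(units, slots_per_unit)`, for example by running a simulator on a representative workload. `best_allocation` and both cost functions in the example are hypothetical stand-ins, not the authors' model or data.

```python
from fractions import Fraction


def best_allocation(total_area, completion_time,
                    fu_area=Fraction(1), slot_area=Fraction(1, 2)):
    """Return (time, units, slots_per_unit) minimizing completion_time.

    Every unit gets the same number of buffer slots; leftover area that
    cannot buy a whole slot for every unit goes unused. Ties keep the
    configuration with fewer functional units.
    """
    best = None
    max_units = int(total_area // fu_area)
    for units in range(1, max_units + 1):
        slots_per_unit = (total_area - units * fu_area) // (units * slot_area)
        t = completion_time(units, slots_per_unit)
        if best is None or t < best[0]:
            best = (t, units, slots_per_unit)
    return best


# Two degenerate cost functions that recover the corners of the design space.
print(best_allocation(15, lambda u, s: -s))   # (-28, 1, 28): maximize buffers
print(best_allocation(15, lambda u, s: -u))   # (-15, 15, 0): maximize functional units
```

A real study would pass a simulator-backed `completion_time` and sweep `total_area`; the two lambdas only show how the extremes annotated on slides 37 and 38 fall out of the same search.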

  41. Why Are There Multiple Optima? [Plot: completion time (%) vs. buffer slots per functional unit, 0 to 30]

  42. Why Are There Multiple Optima? [Plot as in slide 41, annotated with the numbers 12, 5, 7, and 10]

  43. Why Are There Multiple Optima? [Plot as in slide 41, annotated with the numbers 12 and 11, 5 and 6, 7 and 6, 10 and 11]

  44. Performance Scales with Area [Plot: completion time (%) vs. total area, 0 to 60; curves labeled Minimum and Ideal]

  45. Performance Scales with Area [Plots vs. total area, 0 to 60: completion time (%); buffer slots per functional unit (0 to 30); functional units (0 to 10)]

  46. Performance Scales with Area [Plots as in slide 45, with the functional-unit axis extended to 12 and functional-unit icons f]

  47. Conclusions: The area trade-off is important... [Plot: completion time (%) vs. buffer slots per functional unit]

  48. Conclusions: The area trade-off is important... and non-obvious. [Plot as in slide 47; diagram of alternative functional-unit configurations, marked ?]

  49. Conclusions: The area trade-off is important... and non-obvious. A model helps explore the design space. [Plot and diagram as in slide 48; diagram of functional units f with inputs x1 through x10]

  50. Conclusions: The area trade-off is important... and non-obvious. A model helps explore the design space. Synthesize an efficient hardware implementation of map. [Plot and diagrams as in slide 49]

  51. Conclusions: The area trade-off is important... and non-obvious. A model helps explore the design space. Synthesize an efficient hardware implementation of map. Enhance our abstraction. [Plot and diagrams as in slide 49; plot of completion time (%) vs. total area, 0 to 60, for Map, Fold, and Scan]
