
Stubby: A Transformation-based Optimizer for MapReduce Workflows

Harold Lim, Herodotos Herodotou, Shivnath Babu
Duke University

MapReduce Workflow
[Figure: example workflow of jobs J1 to J7 over base datasets D0 1 and D0 2 and derived datasets D1 to D7]


  1. Transformations
     • Transformations + Annotations allow Stubby to support different interfaces by being external to any interface
     • Annotations ensure only valid transformations are considered
     • Transformations can be combined (whole >> sum of parts!)
     • Stubby considers 5 types of transformations (more to come)
     [Figure: a transformation rewrites jobs J1 (M1, R1) and J2 (M2, R2), which read D0 1 and D0 2 and write D1 and D2, into a single job J1-2]

  7. Intra-Job Vertical Packing
     • Transforms a MapReduce job into a Map-only job
     • Group/Partition requirements of both jobs are now enforced at the same time
     [Figure: records such as <51,2>, <51,1>, <50,1> flow through two jobs. Before the transformation, the first job uses hash(O,Z) and sort(O,Z) with J.K2 = {O,Z}, and the second job uses hash(O) and sort(O) with J.K2 = {O}. After the transformation, the first job uses hash(O) and sort(O,Z), and the second job runs as a Map-only job.]
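
The idea in the figure can be illustrated with a minimal Python sketch. This is not Stubby's implementation. It assumes records are (O, Z) pairs, that the first job counts records per (O, Z) and the second job totals those counts per O (both aggregations are hypothetical), and it uses `o % num_partitions` as a stand-in for hash(O). Because each partition holds complete O groups and is sorted on (O, Z), both groupings can be formed in the same task with no second shuffle.

```python
from itertools import groupby

def partition_and_sort(records, num_partitions):
    """Partition (O, Z) records on O only, then sort each partition on (O, Z)."""
    partitions = [[] for _ in range(num_partitions)]
    for o, z in records:
        partitions[o % num_partitions].append((o, z))  # stand-in for hash(O)
    return [sorted(p) for p in partitions]

def reduce_first_job(partition):
    """Hypothetical first job: count records per (O, Z) group."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(partition)]

def reduce_second_job(oz_counts):
    """Hypothetical second job, run in the same task: total count per O."""
    return [
        (o, sum(count for _, count in group))
        for o, group in groupby(oz_counts, key=lambda item: item[0][0])
    ]

def run_packed(records, num_partitions=2):
    """Apply both group-bys per partition without a second shuffle."""
    results = []
    for partition in partition_and_sort(records, num_partitions):
        results.extend(reduce_second_job(reduce_first_job(partition)))
    return sorted(results)

# Example records taken from the slide's figure, plus one repeat.
packed_result = run_packed([(51, 2), (51, 1), (50, 1), (51, 2)])
```

Partitioning on (O, Z) instead would scatter records with the same O across partitions, so the second grouping would need its own shuffle.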

  12. Intra-job Vertical Packing (2)
     • Can have positive / negative effect on performance -> Need cost-based approach
     • + Eliminates inter-task data transfer
     • + Eliminates sorting overhead
     • + Eliminates writing output to disk
     • - Forces dependencies of configurations (e.g., parallelism)
     • - Resource contention (more functions in a task)
     [Figure: speedup chart, y-axis from 0 to 3, with regions labeled "Performance Degradation" and "Performance Improvement"]

  15. Inter-job Vertical Packing
     • Merges a map-only job with another job
     • If combine intra-job + inter-job -> 2 MapReduce jobs to 1 MapReduce job
     • Again, not always a good thing
     • + Eliminates writing to disk
     • - Forces dependencies
     [Figure: a MapReduce job (M, R) and a map-only job (M) are transformed into a single job]
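
As a rough illustration of merging a map-only job into another job, the sketch below pipes a reduce function's output straight through the map-only job's map function, so the intermediate dataset is never materialized. It is a toy in-memory model, not Stubby's code. The word-count job, the `keep_frequent` filter, and the choice to attach the map-only job after the reduce are all assumptions made for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in sorted(groups.items()):
        output.extend(reduce_fn(key, values))
    return output

def run_map_only(records, map_fn):
    """A map-only job: no shuffle and no reduce."""
    return [out for record in records for out in map_fn(record)]

def pack_into_reduce(reduce_fn, map_only_fn):
    """Build a reduce function that feeds its output to the map-only job's map."""
    def packed_reduce(key, values):
        for record in reduce_fn(key, values):
            yield from map_only_fn(record)
    return packed_reduce

# Hypothetical jobs: word count followed by a map-only filter.
def count_map(word):
    return [(word, 1)]

def count_reduce(word, ones):
    return [(word, sum(ones))]

def keep_frequent(record):
    return [record] if record[1] >= 2 else []

words = ["a", "b", "a"]
two_jobs = run_map_only(run_mapreduce(words, count_map, count_reduce), keep_frequent)
one_job = run_mapreduce(words, count_map, pack_into_reduce(count_reduce, keep_frequent))
```

Both runs give the same result, but the packed version makes the two functions share one task, which is the "forces dependencies" cost noted on the slide.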

  21. Horizontal Packing
     • Combines concurrently running jobs into a single job
     • + Read dataset once
     • + Share overhead of launching jobs
     • - Extra overhead of sorting/partitioning combined map output
     • - Share limited memory resources per task (can spill more)
     [Figure: three concurrent MapReduce jobs (M, R each) are transformed into one job containing all three map and reduce functions]
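
One common way to realize this kind of packing is to tag each map output key with the job it belongs to, so a single shuffle can route values to the right reduce function. The sketch below assumes that tagging scheme; the slides do not say how Stubby implements it, and the two example jobs are hypothetical.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Tiny in-memory MapReduce that returns {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:  # the input is scanned exactly once
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def horizontally_pack(jobs):
    """Combine {name: (map_fn, reduce_fn)} into one map and one reduce function."""
    def packed_map(record):
        for name, (map_fn, _) in jobs.items():
            for key, value in map_fn(record):
                yield (name, key), value  # tag the key with its job name

    def packed_reduce(tagged_key, values):
        name, key = tagged_key
        return jobs[name][1](key, values)

    return packed_map, packed_reduce

# Two hypothetical jobs that read the same dataset.
jobs = {
    "sum_by_parity": (lambda r: [(r % 2, r)], lambda key, values: sum(values)),
    "count_all": (lambda r: [("all", 1)], lambda key, values: sum(values)),
}
records = [1, 2, 3, 4]
packed_map, packed_reduce = horizontally_pack(jobs)
packed_result = run_job(records, packed_map, packed_reduce)
```

The combined map output is larger than either job's output alone, which matches the extra sorting and memory pressure listed as drawbacks.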

  26. Partition Function
     • Changes how map outputs are partitioned and sorted
     • Enables partition pruning
     • Enables vertical packing transformation
     [Figure: a producer job partitioning with hash(O) feeds a consumer job with filter = {0<=O<100}; the transformation switches the producer to range(O) partitioning with split points (100, 200, …)]
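
Partition pruning with range partitioning can be sketched in a few lines of Python. The example assumes integer keys, half-open filter ranges, and a concrete list of split points chosen only for illustration; the function names are hypothetical.

```python
from bisect import bisect_right

def range_partition(key, split_points):
    """Index of the range partition holding `key`; split points are lower bounds."""
    return bisect_right(split_points, key)

def partitions_for_range(low, high, split_points):
    """Partitions that can contain integer keys in the half-open range [low, high)."""
    first = bisect_right(split_points, low)
    last = bisect_right(split_points, high - 1)
    return list(range(first, last + 1))

# Assumed split points for the example; the slide shows (100, 200, ...).
split_points = [100, 200, 300]

# A consumer with filter 0 <= O < 100 only has to read one partition.
needed = partitions_for_range(0, 100, split_points)
```

With hash(O) partitioning, keys in the filter range can land in any partition, so the consumer has to read all of them.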

  31. Configuration Transformation
     • Changes the configuration of a MapReduce job
     • Many configurations affect performance (e.g., sort buffer, compression, combiner, reduce tasks, etc.)
     • Impact of a configuration depends on other transformations (interaction)
     [Figure: the same job under two configurations: Memory Buffer 512MB vs. 128MB, and 2 Reduce Tasks vs. 4 Reduce Tasks]
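
A configuration transformation leaves the job's functions alone and only produces a new set of settings. The sketch below models that with an immutable record; the field names are hypothetical and are not real Hadoop property names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JobConfig:
    """Hypothetical subset of the settings named on the slide."""
    memory_buffer_mb: int
    reduce_tasks: int
    compress_output: bool = False

def configuration_transformation(config, **changes):
    """Return a new configuration; the original stays unchanged."""
    return replace(config, **changes)

# The two alternatives shown in the figure.
original = JobConfig(memory_buffer_mb=512, reduce_tasks=2)
candidate = configuration_transformation(original, memory_buffer_mb=128, reduce_tasks=4)
```

Keeping configurations immutable makes it easy to hold several candidate plans side by side while they are being costed.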

  36. Next
     [Figure: roadmap diagram. Center: MapReduce Workflow Optimization Challenges. Surrounding labels: Many Interfaces, Information Spectrum, Large Plan Space, Interactions, Transformations, Annotations]

  38. Optimization Process
     • Optimization unit localizes interactions among plan space choices
     • Top-Down because producer jobs affect the input datasets of consumer jobs
     • Dynamically generated because previous optimization unit transforms workflow
     [Figure: the optimizer steps through optimization units U(1), U(2), …, U(4) of the workflow. After U(1), jobs J1 and J2 appear as J1-2; at U(4), jobs J3 to J7 appear as a single job J3-7 producing D6 and D7]
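
The top-down order can be sketched as a topological sort over job dependencies, so that every producer is visited before its consumers. The edges below are read off the workflow figure and should be treated as an assumption. The sketch uses `graphlib`, which is in the Python standard library from version 3.9, and it does not model the dynamic regeneration of units after each transformation.

```python
from graphlib import TopologicalSorter

# Job -> jobs whose output it reads (edges assumed from the slide's figure).
producers = {
    "J1": set(),
    "J2": set(),
    "J3": {"J1", "J2"},
    "J4": {"J3"},
    "J5": {"J4"},
    "J6": {"J4"},
    "J7": {"J5", "J6"},
}

def top_down_order(producers):
    """List jobs so that every producer precedes its consumers."""
    return list(TopologicalSorter(producers).static_order())

order = top_down_order(producers)
```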

  44. Divide and Conquer
     • Divide workflow into Optimization Units to have smaller plan spaces
     • Issue: Interactions among plan space choices
     • Insight: Divide based on Dataset and Resource dependencies
     • Divide into producer-consumer relationships
     • Transformations on producer jobs affect transformations on consumer jobs
     • E.g., partition function on J5 -> vertical packing on J7; compressing D5 forces J7 to decompress
     • Concurrent jobs use the same cluster resources
     • E.g., this affects configuration and horizontal packing transformations
     [Figure: jobs J5 (M5, R5) and J6 (M6, R6) produce D5 and D6, which J7 (M7, R7) consumes to produce D7]
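
The two kinds of dependency can be told apart mechanically: two jobs have a dataset dependency when one is reachable from the other through producer-consumer edges, and otherwise they may run concurrently and compete for resources. The sketch below applies that rule to the three jobs in the figure. It is an illustration of the distinction only; how Stubby actually builds its optimization units is not spelled out on the slides.

```python
from itertools import combinations

def ancestors_of(producers, job):
    """All jobs that `job` depends on, directly or transitively."""
    seen, stack = set(), list(producers[job])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(producers[current])
    return seen

def classify_pairs(producers):
    """Split job pairs into dataset-dependent pairs and concurrent pairs."""
    ancestors = {job: ancestors_of(producers, job) for job in producers}
    dataset, concurrent = [], []
    for a, b in combinations(sorted(producers), 2):
        if a in ancestors[b] or b in ancestors[a]:
            dataset.append((a, b))
        else:
            concurrent.append((a, b))
    return dataset, concurrent

# Job -> producers inside the unit shown in the figure.
unit = {"J5": set(), "J6": set(), "J7": {"J5", "J6"}}
dataset_pairs, concurrent_pairs = classify_pairs(unit)
```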

  52. Within an Optimization Unit
     • Enumerate all valid combinations of packing transformations
     • Use Starfish’s What-If Engine [Herodotou VLDB 2011] for costing
     • Use Recursive Random Search [Ye SIGMETRICS 03] to find configurations with best cost for each combination p_i
     [Figure: four packing alternatives p1 to p4 for the unit in which J3 (M3, R3) reads D1 and D2 and writes D3, and the map-only job J4 (M4) reads D3 and writes D4]
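
The overall loop (enumerate valid packing combinations, search configurations for each, keep the cheapest) can be sketched as follows. Everything concrete here is an assumption made for illustration: the validity rule, the configuration parameters, and `toy_cost`, which merely stands in for a what-if cost estimate. The search is plain random sampling, not Recursive Random Search.

```python
import itertools
import random

PACKINGS = ["intra_job_vertical", "inter_job_vertical", "horizontal"]

def is_valid(combo):
    """Hypothetical annotation check: inter-job packing needs a map-only job first."""
    return "inter_job_vertical" not in combo or "intra_job_vertical" in combo

def enumerate_combinations():
    """All valid subsets of the packing transformations."""
    return [
        combo
        for size in range(len(PACKINGS) + 1)
        for combo in itertools.combinations(PACKINGS, size)
        if is_valid(combo)
    ]

def toy_cost(combo, config):
    """Made-up cost model standing in for a what-if cost estimate."""
    base = 100 - 15 * len(combo)
    return base + 3 * abs(config["reduce_tasks"] - 4) + abs(config["buffer_mb"] - 256) / 64

def random_search(combo, rng, samples=50):
    """Sample configurations at random and keep the cheapest (cost, config)."""
    best = None
    for _ in range(samples):
        config = {
            "reduce_tasks": rng.randint(1, 16),
            "buffer_mb": rng.choice([64, 128, 256, 512]),
        }
        cost = toy_cost(combo, config)
        if best is None or cost < best[0]:
            best = (cost, config)
    return best

def optimize_unit(seed=0):
    """Pick the packing combination and configuration with the lowest sampled cost."""
    rng = random.Random(seed)
    plans = [(random_search(combo, rng), combo) for combo in enumerate_combinations()]
    (cost, config), combo = min(plans, key=lambda plan: plan[0][0])
    return combo, config, cost

best_combo, best_config, best_cost = optimize_unit()
```

Searching configurations separately for each combination reflects the point made earlier in the deck that the best configuration depends on which other transformations were applied.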
