
Applied research group Systems+database people building prototypes, - PowerPoint PPT Presentation



  3. Applied research group
      • Systems+database people building prototypes, publishing papers
      • Collaborating with Big Data product group at MS
      • Shipping our code to production
      • Open-sourcing our code: Apache Hadoop, REEF, Heron

  5. Resource management; Distributed tiered storage; Query optimization; Stream processing; Log analytics

  15. [Diagram: three Node Managers] 1. Request, 2. Allocation, 3. Start task. Do we really need a Resource Manager?
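
The three numbered steps on this slide (request, allocation, start task) can be sketched as a toy simulation. This is a minimal illustration and not YARN's API: `ResourceManager`, `NodeManager`, `request_container`, and `start_task` are hypothetical names, and each node is assumed to expose a fixed number of free slots.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical node agent with a fixed number of task slots."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def start_task(self, task: str) -> None:
        # Step 3: the application starts its task on the allocated node.
        self.running.append(task)


class ResourceManager:
    """Hypothetical central scheduler that tracks free slots per node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def request_container(self):
        # Steps 1 and 2: receive a request, answer with an allocation.
        for node in self.nodes:
            if node.free_slots > 0:
                node.free_slots -= 1
                return node
        return None  # cluster is full; the caller has to retry later


nodes = [NodeManager("N1", 1), NodeManager("N2", 1), NodeManager("N3", 1)]
rm = ResourceManager(nodes)

placements = []
for task in ["t1", "t2", "t3", "t4"]:
    node = rm.request_container()      # 1. Request -> 2. Allocation
    if node is not None:
        node.start_task(task)          # 3. Start task
    placements.append((task, node.name if node else None))

print(placements)
```

In this sketch every allocation goes through the single `ResourceManager` object, which is the centralization the slide's question is probing.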

  18. Hadoop 1 World vs. Hadoop 2 World (stack diagram, layers from top to bottom)
      Users
      Application Frameworks: Hive / Pig and ad-hoc apps (both worlds)
      Programming Model(s): MR v1 (Hadoop 1); MR v2, Tez, Giraph, Storm, Spark, Dryad, Scope on YARN, Heron, REEF (Hadoop 2)
      Cluster OS (Resource Management): Hadoop 1.x (MapReduce) (Hadoop 1); YARN (Hadoop 2)
      File System: HDFS 1 (Hadoop 1); HDFS 2 (Hadoop 2)
      Hardware
      • monolithic
      • Reuse of RM component
      • layering abstractions

  19. But is all this good enough for the Microsoft clusters?

  20. High resource utilization; Scalability; Production jobs and predictability; Workload heterogeneity

  21. [Figure: utilization, axis from 0 to 100%]

  25. • Wide variety

  26. • Predictability: deadlines; recurring; >60%; over-provisioned

  28. 4 Hadoop committers in CISL; 404 patches as of last night
      • Rayon/Morpheus:
      • Mercury/Yaq:
      • YARN Federation:
      • Medea:

  29. [Hadoop 3.0; ATC 2015, EuroSys 2016]

  40. [Animation: jobs j1 and j2, the RM, and nodes N1 and N2]
      • Feedback delays: idle between allocations

      Workload:   5 sec    10 sec   50 sec   Mixed-5-50   Cosmos-gm
                  60.59%   78.35%   92.38%   78.54%       83.38%

      • Actual
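
A back-of-the-envelope model can illustrate why feedback delays hurt short tasks most. Assume, hypothetically, that each slot sits idle for a fixed delay between one task finishing and the next allocation arriving; the slot is then busy for a fraction `duration / (duration + delay)` of the time. The 3-second delay below is an illustrative assumption, not a measured value, and this model is not claimed to be how the percentages on the slide were produced.

```python
def slot_utilization(task_duration_s: float, feedback_delay_s: float) -> float:
    """Fraction of time a slot is busy if it idles for a fixed delay
    between consecutive tasks (simplified, hypothetical model)."""
    return task_duration_s / (task_duration_s + feedback_delay_s)


ASSUMED_DELAY_S = 3.0  # illustrative assumption, not taken from the slides

utilization = {
    duration: slot_utilization(duration, ASSUMED_DELAY_S)
    for duration in (5, 10, 50)
}

for duration, value in utilization.items():
    print(f"{duration:>2} sec tasks: {value:.1%} busy")
```

Under this model, shorter tasks lose a larger share of each slot to idle gaps, which matches the direction of the trend across the 5 sec, 10 sec, and 50 sec workloads on the slide.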

  41. • Introduce task queuing at nodes
        • Mask feedback delays
        • Improve cluster utilization
        • Improve task throughput (by up to 40%)
      • Container types
        • GUARANTEED and OPPORTUNISTIC
        • Keep guarantees for important jobs
        • Use opportunistic execution to improve utilization
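
The interplay of node-side queuing and the two container types can be sketched as follows. This is a simplified illustration under stated assumptions, not the YARN implementation: the `Node` class, its slot model, and the rule that a GUARANTEED task evicts a running OPPORTUNISTIC task and pushes it back to the front of the local queue are all hypothetical choices made for the example.

```python
from collections import deque
from enum import Enum


class ContainerType(Enum):
    GUARANTEED = "GUARANTEED"
    OPPORTUNISTIC = "OPPORTUNISTIC"


class Node:
    """Hypothetical node with a fixed number of slots and a local queue."""

    def __init__(self, slots: int):
        self.slots = slots
        self.running = []      # list of (task, ContainerType)
        self.queue = deque()   # opportunistic tasks waiting for a slot

    def submit(self, task: str, ctype: ContainerType) -> None:
        if len(self.running) < self.slots:
            self.running.append((task, ctype))
            return
        if ctype is ContainerType.OPPORTUNISTIC:
            # No free slot: wait locally instead of going back to the RM.
            self.queue.append(task)
            return
        # GUARANTEED on a full node: evict one opportunistic task (assumed policy).
        for i, (victim, vtype) in enumerate(self.running):
            if vtype is ContainerType.OPPORTUNISTIC:
                self.running[i] = (task, ctype)
                self.queue.appendleft(victim)
                return
        raise RuntimeError("no capacity left for a guaranteed container")

    def finish(self, task: str) -> None:
        self.running = [(t, c) for t, c in self.running if t != task]
        # The freed slot is refilled from the local queue right away,
        # which is what masks the feedback delay in this sketch.
        if self.queue:
            self.running.append((self.queue.popleft(), ContainerType.OPPORTUNISTIC))


node = Node(slots=1)
node.submit("o1", ContainerType.OPPORTUNISTIC)  # runs immediately
node.submit("o2", ContainerType.OPPORTUNISTIC)  # queued at the node
node.submit("g1", ContainerType.GUARANTEED)     # evicts o1, which is requeued first
node.finish("g1")                               # o1 restarts from the local queue
print(node.running, list(node.queue))
```

The point of the sketch is that a freed slot is refilled locally in `finish`, without waiting for a new allocation from the RM, while guaranteed work is never delayed by opportunistic work.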

  52. [Animation: jobs j1 and j2, the RM, and nodes N1 and N2]

  54. • So all we need to do is use long queues?

  56. • Long queues can be detrimental for job completion times, despite the utilization gains
      • Proper queue management techniques are required

  60. [Diagram: nodes N1, N2, N3]

  62. • Place tasks to node queues
      • Prioritize task execution (queue reordering)
      • Bound queue lengths
      Yaq improves median job completion time by 1.7x over YARN

  68. [Diagram: RM and nodes N1, N2, N3, annotated with queue length and queue wait time]
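
One way to read the two annotations on this slide (queue length and queue wait time) is as alternative signals for choosing which node queue receives the next task. The sketch below is a minimal illustration under assumptions: `NodeQueue`, `place_by_length`, and `place_by_wait` are hypothetical names, each node is assumed to run one task at a time, and task durations are assumed to be known in advance.

```python
from dataclasses import dataclass, field


@dataclass
class NodeQueue:
    name: str
    queued_task_durations: list = field(default_factory=list)  # seconds

    @property
    def queue_length(self) -> int:
        return len(self.queued_task_durations)

    @property
    def estimated_wait(self) -> float:
        # Assumes one slot per node and known task durations (hypothetical).
        return sum(self.queued_task_durations)


def place_by_length(nodes):
    """Pick the node with the fewest queued tasks."""
    return min(nodes, key=lambda n: n.queue_length)


def place_by_wait(nodes):
    """Pick the node whose queue is expected to drain soonest."""
    return min(nodes, key=lambda n: n.estimated_wait)


nodes = [
    NodeQueue("N1", [50.0]),              # one long task
    NodeQueue("N2", [5.0, 5.0]),          # two short tasks
    NodeQueue("N3", [10.0, 10.0, 10.0]),  # three medium tasks
]

by_length = place_by_length(nodes).name
by_wait = place_by_wait(nodes).name
print(by_length, by_wait)
```

With these made-up queues the two signals disagree: the shortest queue holds a single long task, while the queue with the shortest expected wait holds two short ones.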

  73. [Diagram: RM with j2: 5 tasks, j3: 9 tasks, j1: 21 tasks; nodes N1, N2, N3]
      Job-aware policies:
      • Shortest Remaining Job First (SRJF)
      • Least Remaining Tasks First (LRTF)
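
Queue reordering with Least Remaining Tasks First can be sketched using the remaining-task counts from this slide (j2: 5, j3: 9, j1: 21). The contents of the node queue and the function name `reorder_lrtf` are made up for illustration; SRJF would sort on an estimate of remaining work per job instead, which is not shown here.

```python
# Remaining tasks per job, as listed on the slide.
remaining_tasks = {"j1": 21, "j2": 5, "j3": 9}

# Tasks waiting at one node as (task_id, job_id), in arrival order (hypothetical).
node_queue = [("t1", "j1"), ("t2", "j3"), ("t3", "j2"), ("t4", "j1")]


def reorder_lrtf(queue, remaining):
    """Order queued tasks so that jobs with the fewest remaining tasks run first.

    sorted() is stable, so tasks of the same job keep their arrival order.
    """
    return sorted(queue, key=lambda item: remaining[item[1]])


reordered = reorder_lrtf(node_queue, remaining_tasks)
print(reordered)
```

The ordering depends on per-job state (`remaining_tasks`) rather than only on what sits in the local queue, which is what makes the policy job-aware.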

  74. lower throughput; longer job completion times

  75. • 1.7x improvement in median JCT over YARN

  76. • Container types; distributed scheduling; any distributed scheduler; over-commitment; multi-tenancy
      • Pricing

  77. cluster utilization; queue management techniques; job completion time
