• Applied research group: systems + database people building prototypes, publishing papers
• Collaborating with the Big Data product group at MS: shipping our code to production
• Open-sourcing our code: Apache Hadoop, REEF, Heron
• Resource management
• Distributed tiered storage
• Query optimization
• Stream processing
• Log analytics
[Figure: YARN architecture — a central Resource Manager (RM) and per-machine Node Managers. Container lifecycle: 1. Request, 2. Allocation, 3. Start task.]

Do we really need a Resource Manager?
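The three-step protocol is simple enough to sketch. The toy model below shows an RM matching a request to a free node before the task can start; all class and method names are illustrative, not the real YARN API.

```java
// Toy model of the three-step protocol on the slide; all names are
// illustrative, not the real YARN API.
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class AllocationProtocolSketch {
    record Request(String jobId) {}                       // 1. Request
    record Allocation(String jobId, String node) {}       // 2. Allocation

    static class ResourceManager {
        private final Queue<String> freeNodes = new ArrayDeque<>();
        ResourceManager(String... nodes) { freeNodes.addAll(List.of(nodes)); }

        // The RM matches a pending request to a free node, if any.
        Allocation allocate(Request req) {
            String node = freeNodes.poll();
            return node == null ? null : new Allocation(req.jobId(), node);
        }
    }

    static void startTask(Allocation a) {                 // 3. Start task
        System.out.printf("starting a task of %s on %s%n", a.jobId(), a.node());
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager("N1", "N2", "N3");
        Allocation a = rm.allocate(new Request("j1"));    // step 1 -> step 2
        if (a != null) startTask(a);                      // step 2 -> step 3
    }
}
```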
[Figure: the Hadoop 1 and Hadoop 2 software stacks side by side]
• Hadoop 1 World: monolithic. Users and application frameworks (Hive/Pig, ad-hoc apps) run directly on Hadoop 1.x (MapReduce), which sits on HDFS 1 and the hardware.
• Hadoop 2 World: reuse of the RM component; layering of abstractions. Programming models (Hive/Pig on MR v2, Tez, Giraph, Storm, Spark, Dryad, Heron, REEF, Scope, ad-hoc apps) run on YARN as the cluster OS (resource management), on top of HDFS 2.
But is all this good enough for the Microsoft clusters?
• Scalability
• High resource utilization
• Production jobs and predictability
• Workload heterogeneity
[Chart: cluster utilization over time, 0–100%]
• Wide variety of workloads
• Recurring jobs with deadlines (>60% of the workload)
• Predictability: jobs are over-provisioned
4 Hadoop committers in CISL; 404 patches as of last night
• Rayon/Morpheus
• Mercury/Yaq
• YARN Federation
• Medea
Mercury/Yaq [Hadoop 3.0; ATC 2015, EuroSys 2016]
[Figure: the RM assigns tasks of jobs j1 and j2 to nodes N1 and N2; a node sits idle until its next allocation arrives]
• Feedback delays: nodes are idle between allocations
• Slot utilization by task duration:

| Task duration / workload | 5 sec   | 10 sec  | 50 sec  | Mixed-5-50 | Cosmos-gm |
| Utilization              | 60.59%  | 78.35%  | 92.38%  | 78.54%     | 83.38%    |

• Cosmos-gm: workload derived from actual Cosmos production traces
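A back-of-the-envelope model (my own simplification, not from the deck) makes the trend plausible: if a node runs tasks of duration T but sits idle for an average feedback delay D between allocations, slot utilization is roughly

U ≈ T / (T + D)

With D ≈ 3–4 s, 5-second tasks give about 60% and 50-second tasks about 92%, in line with the 60.59% and 92.38% measured above.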
• Introduce task queuing at nodes
  • Mask feedback delays
  • Improve cluster utilization
  • Improve task throughput (by up to 40%)
• Container types: GUARANTEED and OPPORTUNISTIC (sketched below)
  • Keep guarantees for important jobs
  • Use opportunistic execution to improve utilization
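A minimal sketch of this node-side behavior, assuming a simple slot model; class and method names are mine, not Mercury's or YARN's API. GUARANTEED containers start immediately, while OPPORTUNISTIC ones fill idle slots or wait in the node's queue.

```java
// Minimal sketch (not Mercury's implementation) of a node that runs
// GUARANTEED containers immediately and queues OPPORTUNISTIC ones
// to fill idle capacity, as described on the slide.
import java.util.ArrayDeque;
import java.util.Deque;

public class NodeQueueSketch {
    enum ContainerType { GUARANTEED, OPPORTUNISTIC }
    record Task(String jobId, ContainerType type) {}

    static class NodeManager {
        private final int slots;
        private int running = 0;
        private final Deque<Task> opportunisticQueue = new ArrayDeque<>();

        NodeManager(int slots) { this.slots = slots; }

        void submit(Task t) {
            if (t.type() == ContainerType.GUARANTEED) {
                // Guarantees are always honored (a real implementation would
                // preempt opportunistic containers if the node is full).
                running++;
            } else if (running < slots) {
                running++;                       // fill an idle slot right away
            } else {
                opportunisticQueue.add(t);       // queue to mask feedback delays
            }
        }

        void taskFinished() {
            running--;
            // Start a queued task locally: no round-trip to the RM needed.
            if (opportunisticQueue.poll() != null) running++;
        }
    }

    public static void main(String[] args) {
        NodeManager nm = new NodeManager(2);
        nm.submit(new Task("j1", ContainerType.GUARANTEED));
        nm.submit(new Task("j2", ContainerType.OPPORTUNISTIC));
        nm.submit(new Task("j2", ContainerType.OPPORTUNISTIC)); // queued
        nm.taskFinished(); // the queued opportunistic task starts immediately
    }
}
```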
[Figure: with task queuing at the nodes, the RM pushes tasks of j1 and j2 into queues at N1 and N2, so a node can start its next task without waiting for a new allocation]
So all we need to do is use long queues?
Long queues can be detrimental to job completion times, despite the utilization gains. Proper queue management techniques are required.
[Figure: task queues at nodes N1, N2, N3]
Queue management techniques:
• Place tasks to node queues
• Prioritize task execution (queue reordering)
• Bound queue lengths

Yaq improves median job completion time by 1.7x over YARN.
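The sketches that follow illustrate these techniques under stated assumptions; none is Yaq's actual implementation. First, bounding queue lengths: a hypothetical tryEnqueue that rejects tasks once a node's queue hits its bound, so placement can fall back to another node or to the RM.

```java
// Illustrative sketch (not Yaq's code) of bounding per-node queue length.
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedQueueSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int bound;

    BoundedQueueSketch(int bound) { this.bound = bound; }

    /** Returns false when the queue is full, so placement can pick another node. */
    boolean tryEnqueue(String taskId) {
        if (queue.size() >= bound) return false;
        queue.add(taskId);
        return true;
    }

    public static void main(String[] args) {
        BoundedQueueSketch q = new BoundedQueueSketch(2);
        System.out.println(q.tryEnqueue("t1")); // true
        System.out.println(q.tryEnqueue("t2")); // true
        System.out.println(q.tryEnqueue("t3")); // false: bound reached
    }
}
```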
[Figure: the RM places tasks based on per-node queue length and queue wait time at N1, N2, N3]
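A sketch of wait-time-based placement, assuming the RM estimates a node's queue wait as queue length times mean task duration (both the estimate and the names are my simplification):

```java
// Sketch (assumptions mine) of queue-wait-time-based placement: the RM
// places a task on the node with the smallest estimated queue wait.
import java.util.Comparator;
import java.util.List;

public class PlacementSketch {
    record NodeStats(String name, int queueLength, double meanTaskSeconds) {
        double estimatedWait() { return queueLength * meanTaskSeconds; }
    }

    static NodeStats pickNode(List<NodeStats> nodes) {
        return nodes.stream()
                    .min(Comparator.comparingDouble(NodeStats::estimatedWait))
                    .orElseThrow();
    }

    public static void main(String[] args) {
        List<NodeStats> nodes = List.of(
            new NodeStats("N1", 4, 10.0),   // est. wait 40 s
            new NodeStats("N2", 2, 30.0),   // est. wait 60 s
            new NodeStats("N3", 6,  5.0));  // est. wait 30 s  <- chosen
        System.out.println(pickNode(nodes).name()); // N3
    }
}
```

Note that raw queue length alone would pick N2 here; weighting by task duration is what makes the wait-time signal more accurate.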
Queue reordering is job-aware:
• Shortest Remaining Job First (SRJF)
• Least Remaining Tasks First (LRTF)

[Figure: the RM tracks remaining tasks per job (j2: 5 tasks, j3: 9 tasks, j1: 21 tasks) to reorder the node queues at N1, N2, N3]
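The two policies reduce to different sort keys over the queued tasks. A sketch with illustrative fields (remaining-task counts as on the slide; the remaining-work estimates are hypothetical):

```java
// Sketch of the two job-aware reordering policies named on the slide.
// Fields and estimates are illustrative; Yaq's actual bookkeeping differs.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReorderingSketch {
    record QueuedTask(String jobId, int remainingTasks, double remainingWorkSeconds) {}

    // Least Remaining Tasks First: the job with the fewest tasks left goes first.
    static final Comparator<QueuedTask> LRTF =
        Comparator.comparingInt(QueuedTask::remainingTasks);

    // Shortest Remaining Job First: the least estimated remaining work goes first.
    static final Comparator<QueuedTask> SRJF =
        Comparator.comparingDouble(QueuedTask::remainingWorkSeconds);

    public static void main(String[] args) {
        List<QueuedTask> queue = new ArrayList<>(List.of(
            new QueuedTask("j1", 21, 210.0),
            new QueuedTask("j2",  5,  50.0),
            new QueuedTask("j3",  9,  90.0)));
        queue.sort(LRTF);   // j2's tasks run first, matching the slide's example
        System.out.println(queue.get(0).jobId()); // j2
    }
}
```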
Bounding queue lengths is a trade-off: queues that are too short give lower throughput, while queues that are too long give longer job completion times.
• 1.7x improvement in median job completion time over YARN
• Container types beyond distributed scheduling: any distributed scheduler, over-commitment, multi-tenancy
• Pricing
Queue management techniques improve cluster utilization and job completion time.