Detecting Bottlenecks in Parallel DAG-based Data Flow Programs
Björn Lohrmann, Dominic Battré, Matthias Hovestadt, Alexander Stanik, Daniel Warneke
Email: {firstname}.{lastname}@tu-berlin.de
Complex and Distributed IT-Systems, Technische Universität Berlin
Introduction (1)
IaaS clouds offer virtual machines on demand.
Why use clouds for data processing?
■ Fast and (almost) unlimited scale-out
■ Pricing model
  ♦ Pay-as-you-go
  ♦ 10 nodes for 1 day = 1 node for 10 days
■ No long-term obligations
15.11.2010 Björn Lohrmann - Detecting Bottlenecks in Parallel DAG-based Data Flow Programs
Introduction (2)
Frameworks are required for effective use of clouds.
[Figure: framework concerns (Parallelization, Job Modelling, Job Scheduling, VM Management, Job Deployment, Job Monitoring) and example frameworks (Eucalyptus, Hadoop, Nephele, etc.)]
Prerequisites
● Jobs are modelled as directed acyclic graphs (DAGs)
  ■ Vertices are tasks
  ■ Edges are communication channels
● Each task has 1..n parallel task instances
● Communication is unidirectional and blocking
[Figure: example job DAG with Task 1, Task 2, Task 3 and Task 4]
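The job model on this slide can be sketched in a few lines of Python. This is only an illustration, not Nephele's API: the class `JobGraph`, its methods and the diamond-shaped wiring of Task 1 to Task 4 are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class JobGraph:
    """Hypothetical job model: tasks are vertices, channels are directed edges."""
    parallelism: dict = field(default_factory=dict)  # task name -> number of instances
    channels: list = field(default_factory=list)     # (sender, receiver) pairs

    def add_task(self, name, instances=1):
        # Each task has 1..n parallel task instances.
        if instances < 1:
            raise ValueError("each task needs at least one instance")
        self.parallelism[name] = instances

    def connect(self, sender, receiver):
        # Channels are unidirectional: data flows only from sender to receiver.
        if sender not in self.parallelism or receiver not in self.parallelism:
            raise KeyError("both endpoints must be tasks of the job")
        self.channels.append((sender, receiver))

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove tasks without incoming channels.
        indegree = {task: 0 for task in self.parallelism}
        for _, receiver in self.channels:
            indegree[receiver] += 1
        ready = [task for task, degree in indegree.items() if degree == 0]
        removed = 0
        while ready:
            task = ready.pop()
            removed += 1
            for sender, receiver in self.channels:
                if sender == task:
                    indegree[receiver] -= 1
                    if indegree[receiver] == 0:
                        ready.append(receiver)
        # If a cycle exists, its tasks never reach indegree 0.
        return removed == len(self.parallelism)


# Assumed topology: Task 1 feeds Task 2 and Task 3, which both feed Task 4.
job = JobGraph()
for name in ("Task 1", "Task 2", "Task 3", "Task 4"):
    job.add_task(name)
job.connect("Task 1", "Task 2")
job.connect("Task 1", "Task 3")
job.connect("Task 2", "Task 4")
job.connect("Task 3", "Task 4")
```

Keeping the instance count per task separate from the channel list means the degree of parallelism can change without touching the graph structure.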
Overview
Key question of this talk:
● Given a DAG-shaped job, how many task instances should I assign to each task?
Our approach:
● Begin with 1 instance for each task
● Iteratively detect bottlenecks and add instances where necessary
[Figure: example job DAG with Task 1 to Task 5; Task 5 appears twice]
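The iterative approach can be sketched as a small loop. Everything here is a simplification for illustration: `tune_parallelism`, the callback `find_bottleneck`, the toy rates and the choice to add exactly one instance per round are assumptions, and a real `find_bottleneck` would have to run and monitor the job.

```python
def tune_parallelism(tasks, find_bottleneck, max_rounds=100):
    """Start with one instance per task and grow the reported bottleneck."""
    parallelism = {task: 1 for task in tasks}
    for _ in range(max_rounds):
        bottleneck = find_bottleneck(parallelism)
        if bottleneck is None:
            break  # no bottleneck reported, keep the current configuration
        parallelism[bottleneck] += 1  # assumption: one extra instance per round
    return parallelism


# Toy stand-in for a monitored job run (hypothetical numbers):
# records per second that a single instance of each task can process.
RATE_PER_INSTANCE = {"Task 1": 100, "Task 2": 25, "Task 3": 50}
TARGET_RATE = 100


def toy_find_bottleneck(parallelism):
    # Check tasks from the sink towards the source.
    for task in ("Task 3", "Task 2", "Task 1"):
        if parallelism[task] * RATE_PER_INSTANCE[task] < TARGET_RATE:
            return task
    return None


result = tune_parallelism(RATE_PER_INSTANCE, toy_find_bottleneck)
print(result)
```

With these toy rates the loop settles on one instance of Task 1, four of Task 2 and two of Task 3.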
Bottlenecks
Negative effects of bottlenecks:
■ Input starvation
■ Output blockage
Low throughput of workflow
Low resource utilization
Time and money wasted
[Figure: example job DAG with Task 1 to Task 5; Task 5 appears twice]
Bottlenecks
Types:
● CPU
  ■ Enough input available
  ■ Throughput limited by CPU
  ■ Lack of input for subsequent tasks
● I/O
  ■ Transport infrastructure is overloaded (NICs, switches, etc.)
  ■ Forces tasks to wait
[Figure: two example task chains (Task 1 to Task 3 and Task 1 to Task 2), each task labelled "CPU"]
Bottleneck Detection
● Monitor the job at runtime:
  ■ Continuously measure CPU load and I/O wait on task instances
  ■ Aggregate to task statistics
● Continuously analyze task statistics:
  ■ Traverse task nodes in reverse topological order and check for CPU bottlenecks
  ■ If none is found, traverse edges in reverse topological order and check for I/O bottlenecks
  ■ If a bottleneck is found: report it!
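The analysis steps above can be sketched as follows. This is a minimal sketch under several assumptions: the function name `detect_bottleneck`, aggregation by arithmetic mean, the ordering of edges by their receiving task, and the input dictionaries are all hypothetical; the 0.9 threshold mirrors the 90% value given on the Implementation slide. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from statistics import mean

THRESHOLD = 0.9  # fraction of time, taken from the 90% value on the Implementation slide


def detect_bottleneck(channels, cpu_load, io_wait):
    """Return ("CPU", task), ("I/O", channel) or None.

    channels: list of (sender, receiver) pairs
    cpu_load: task -> per-instance CPU load fractions
    io_wait:  channel -> per-instance fractions of time spent waiting to send
    """
    # Build the predecessor map expected by TopologicalSorter.
    predecessors = {}
    for sender, receiver in channels:
        predecessors.setdefault(sender, set())
        predecessors.setdefault(receiver, set()).add(sender)
    order = list(TopologicalSorter(predecessors).static_order())
    reverse_order = order[::-1]  # sinks first, sources last

    # Step 1: check tasks for CPU bottlenecks (instance stats aggregated by mean).
    for task in reverse_order:
        if mean(cpu_load[task]) >= THRESHOLD:
            return ("CPU", task)

    # Step 2: only if no CPU bottleneck was found, check the channels.
    # Assumption: edges are ordered by the position of their receiving task.
    position = {task: index for index, task in enumerate(reverse_order)}
    by_receiver = sorted(channels, key=lambda c: (position[c[1]], position[c[0]]))
    for channel in by_receiver:
        if mean(io_wait[channel]) >= THRESHOLD:
            return ("I/O", channel)
    return None


chain = [("Task 1", "Task 2"), ("Task 2", "Task 3")]
no_wait = {channel: [0.0] for channel in chain}

# Task 1 and Task 2 are both busy; Task 2 is reported because it is visited first.
first = detect_bottleneck(
    chain,
    {"Task 1": [0.95], "Task 2": [0.92, 0.96], "Task 3": [0.2]},
    no_wait,
)

# No CPU bottleneck, but senders on the first channel wait almost all the time.
second = detect_bottleneck(
    chain,
    {"Task 1": [0.3], "Task 2": [0.4, 0.5], "Task 3": [0.2]},
    {("Task 1", "Task 2"): [0.95], ("Task 2", "Task 3"): [0.1, 0.2]},
)
```

The function stops at the first hit, so at most one bottleneck is reported per analysis pass.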
Implementation
● Based on the Nephele framework
  ■ Java framework
  ■ 1 master, n workers
  ■ Task instance = Java thread
● Analysis of thread state statistics:
  ■ Threshold for CPU bottleneck:
    ♦ USR + SYS + BLK >= 90% of time
  ■ Threshold for I/O bottleneck:
    ♦ WAIT caused by sending on channel >= 90% of time
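The two thresholds can be written as a small check. This is a sketch in Python rather than the Java used by Nephele; the names `ThreadStats` and `classify` and the sample values are made up for illustration, and each field is assumed to hold the fraction of observed time spent in that state.

```python
from dataclasses import dataclass


@dataclass
class ThreadStats:
    """Hypothetical per-thread statistics, each a fraction of observed time."""
    usr: float        # time in user mode
    sys: float        # time in system mode
    blk: float        # time in the BLK state
    wait_send: float  # time in WAIT caused by sending on a channel


def classify(stats, threshold=0.9):
    # CPU bottleneck: USR + SYS + BLK >= 90% of time.
    if stats.usr + stats.sys + stats.blk >= threshold:
        return "CPU"
    # I/O bottleneck: WAIT caused by sending on a channel >= 90% of time.
    if stats.wait_send >= threshold:
        return "I/O"
    return None


busy = classify(ThreadStats(usr=0.80, sys=0.10, blk=0.05, wait_send=0.02))
blocked = classify(ThreadStats(usr=0.03, sys=0.02, blk=0.0, wait_send=0.93))
idle = classify(ThreadStats(usr=0.40, sys=0.10, blk=0.0, wait_send=0.30))
```

On the slide, the CPU threshold applies to tasks and the I/O threshold to channels; the sketch folds both into one record per thread only to keep the example short.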
Evaluation
Setup:
● Private compute cloud
● Hosts with two Intel Xeon 2.66 GHz CPUs, 32 GB RAM and 1 Gbit/s Ethernet
● KVM guests with one virtual CPU and 2 GB RAM
● Eucalyptus framework for VM allocation/deallocation
[Figure: demo job DAG with the tasks File Reader, OCR, PDF Creator, Inverted Index, PDF Writer and Index Writer]
Evaluation (2)
Phase 1: Fine tuning
Evaluation (3)
Phase 2: Scale-out
Conclusion
● Bottleneck detection is useful for scaling out jobs in the cloud while maintaining high resource utilization
● We presented a simple approach to gather and analyze relevant statistics
● Right now, manual adaptation and job re-runs are necessary to eliminate bottlenecks
● Future work:
  ■ Dynamically and automatically adjust parallelization at runtime