HotCloud’17
Lube: Mitigating Bottlenecks in Wide Area Data Analytics
Hao Wang, Baochun Li — iQua, University of Toronto
Wide Area Data Analytics
[Diagram: a single datacenter (DC) with a Master node co-located with the Namenode, and Worker nodes co-located with Datanodes]
Wide Area Data Analytics
Why wide area data analytics?
• Data volume
• User distribution
• Regulation policy
[Diagram: DC #1 … DC #n, each with a Master/Namenode and Workers/Datanodes]
Problems
• Widely shared resources
  ‣ Fluctuating available provision
• Distributed runtime environment
  ‣ Heterogeneous utilizations
Fluctuating WAN Bandwidths
[Plot: pair-wise bandwidth (Mbps) over two days (Jan 1–Jan 2) to five nodes — 10.6.3.3 (VC), 10.8.3.3 (CT), 10.12.3.32 (TR), 10.4.3.5 (WT), 10.2.3.4 (TR)]
Measured by iperf on the SAVI testbed: https://www.savinetwork.ca/
Heterogeneous Memory Utilization
Nodes in different DCs may have different resource utilizations.
[Plot: memory utilization over time (s) for node_1–node_4]
Running the Berkeley Big Data Benchmark on AWS EC2, 4 nodes across 4 regions; collected by jvmtop.
Runtime Bottlenecks
Bottlenecks emerge at runtime, driven by fluctuation and heterogeneity:
• At any time
• On any node
• In any resource
Impact on data analytics performance:
• Long completion times
• Low resource utilization
• Invalidated optimizations
Optimization of Data Analytics
Existing optimization methods do not consider runtime bottlenecks:
• Clarinet [OSDI’16] considers the heterogeneity of available WAN bandwidth
• Iridium [SIGCOMM’15] trades off between completion time and WAN bandwidth usage
• Geode [NSDI’15] saves WAN usage via data placement and query plan selection
• SWAG [SoCC’15] reorders jobs across datacenters
“Much of this performance work has been motivated by three widely-accepted mantras about the performance of data analytics — network, disk and straggler.”
— Making Sense of Performance in Data Analytics Frameworks, NSDI’15, Kay Ousterhout
Mitigating Bottlenecks at Runtime
Mitigating bottlenecks raises three questions:
• How to detect bottlenecks?
• How to overcome the scheduling delay?
• How to enforce the bottleneck mitigation?
[Diagram: a resource queue and a task queue on a bottlenecked worker]
Architecture of Lube
Three major components:
• Lightweight performance monitors (network I/O, JVM, disk I/O, more metrics)
• Bottleneck detecting module: an online bottleneck detector with a training model on the Lube client, reporting to the bottleneck info cache on the Lube master
• Bottleneck-aware scheduler: the Lube scheduler draws on the available worker pool of (worker, intensity) pairs and the submitted task queue
[Diagram: Lube client (performance monitors + online bottleneck detector) and Lube master (bottleneck info cache, available worker pool, pool update, Lube scheduler, bottleneck-aware scheduling)]
Detecting Bottlenecks — ARIMA
Autoregressive (AR) + moving average (MA) over historical states, predicting the current state:
y_t = θ_0 + φ_1 y_{t-1} + φ_2 y_{t-2} + … + φ_p y_{t-p} + ε_t − θ_1 ε_{t-1} − θ_2 ε_{t-2} − … − θ_q ε_{t-q}
where y_t is the current state, φ and θ are coefficients, and ε is the random error.
Input: historical states (time_1, mem_util), (time_2, mem_util), …, (time_t-1, mem_util)
Model: ARIMA(p, d, q)
Output: the current state (time_t, mem_util); a minimal forecasting sketch follows.
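A minimal sketch (not the authors' code) of how an ARIMA model can forecast the next memory-utilization sample from a monitored history, assuming a Python environment with statsmodels installed; the history values and the (p, d, q) order are illustrative, not the paper's settings.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical history of memory-utilization samples collected by a monitor.
mem_util_history = [0.41, 0.43, 0.47, 0.52, 0.50, 0.55, 0.61, 0.58, 0.63, 0.66]

# Fit ARIMA(p, d, q); the order (2, 1, 1) here is an illustrative choice.
model = ARIMA(mem_util_history, order=(2, 1, 1))
fitted = model.fit()

# One-step-ahead forecast: the predicted utilization at the next time step.
next_util = float(np.asarray(fitted.forecast(steps=1))[0])
print(f"predicted next memory utilization: {next_util:.2f}")
```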
Detecting Bottlenecks — HMM
Hidden Markov Model
• Hidden states: Q = {q_1, q_2, …, q_i, q_j, …}
• Observation states: O = {O_1, O_2, …, O_k, …, O_d}
• Transition probability: A(a_ij), from hidden state q_i to q_j
• Emission probability: B(b_j(k)), emitting observation O_k in hidden state q_j
Observations here are per-timestamp metric vectors: {time_stamp: mem, net, cpu, disk}
To make HMM online — Sliding Hidden Markov Model (SlidHMM):
• A sliding window for new observations
• A moving-average approximation for outdated observations
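A minimal sketch of sliding-window HMM inference over multi-metric observations, assuming the hmmlearn library. This is not the authors' SlidHMM: re-fitting on each window stands in for the moving-average approximation of outdated observations, and the window length and state count are hypothetical.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

WINDOW = 60  # hypothetical sliding-window length (number of recent samples)

def infer_bottleneck_state(observations):
    """observations: list of [mem, net, cpu, disk] utilization samples over time."""
    window = np.asarray(observations[-WINDOW:])   # keep only the recent window
    # The window must hold at least n_components samples for fitting to succeed.
    model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    model.fit(window)                              # re-fit on the sliding window
    states = model.predict(window)                 # decode the hidden-state sequence
    return states[-1]                              # most recent hidden state
```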
Bottleneck-Aware Scheduling
Built-in task schedulers consider:
• Data-locality
Bottleneck-aware scheduler considers:
• Data-locality
• Bottlenecks at runtime
Observation: a single worker node may stay bottlenecked continuously, while all nodes are rarely bottlenecked at the same time — so a greedy placement choice pays off (see the sketch below).
[Plots over time (s): memory utilization of executor processes, network utilization of datanode processes, CPU utilization of executor processes, disk (SSD) utilization of datanode processes]
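A minimal sketch (hypothetical names, not Spark's TaskScheduler) of one greedy bottleneck-aware placement decision: prefer data-local workers, then break ties by the lowest predicted bottleneck intensity reported by the detector.

```python
def pick_worker(task, available_workers, bottleneck_intensity):
    """
    task.preferred_locations : set of worker ids holding the task's input data
    available_workers        : iterable of worker ids with free slots
    bottleneck_intensity     : dict worker_id -> predicted intensity in [0, 1]
    """
    # Keep only data-local workers; fall back to any available worker if none.
    local = [w for w in available_workers if w in task.preferred_locations]
    candidates = local or list(available_workers)
    # Greedy choice: the candidate with the lowest predicted bottleneck intensity.
    return min(candidates, key=lambda w: bottleneck_intensity.get(w, 0.0))
```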
Implementation & Deployment Implementation • Spark-1.6.1 (scheduler) APIs: • redis database (cache) Master Node Lube Scheduler • Python scikit-learn, Keras (ML) HGET worker_id time Master Redis Server HSET worker_id {time: {metric: val_ob, val_inf}} Deployment Bottleneck Detection Module Worker Nodes • 37 EC2 m4.2xlarge instances SUBSCRIBE metric_1 metric_2 … Worker Redis Server • 9 regions PUBLISH + HSET • Berkeley Big Data Benchmark metric {time: val} … nethogs jvmtop iotop (e.g, iotop {time: I/O}) • An 1.1 TB dataset 12
Evaluation — Accuracy
hit rate = #((time, detection) ∩ (time, observation)) / #(time, detection)
(the calculation is sketched below)
[Plots: hit rates (%) of ARIMA and SlidHMM for Query-1 through Query-4]
ARIMA ignores nonlinear patterns.
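A minimal sketch of the hit-rate calculation above: the fraction of detected (time, bottleneck) pairs that also appear in the observed ground truth.

```python
def hit_rate(detections, observations):
    """detections, observations: sets of (time, bottleneck) tuples."""
    if not detections:
        return 0.0
    return len(detections & observations) / len(detections)
```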
Evaluation — Completion Times
[CDF plots of task completion times for Query-1 through Query-4: Pure Spark vs. Lube-SlidHMM vs. Lube-ARIMA]
Task completion times:
• Lube-ARIMA: average 12.454 s, 75th percentile 22.075 s
• Lube-SlidHMM: average 14.783 s, 75th percentile 27.469 s
Evaluation — Completion Times
Query completion times
• Lube-ARIMA
• Lube-SlidHMM
• Reduce median query response time by up to 33%
Control groups for overhead
• ARIMA + Spark
• SlidHMM + Spark
• Negligible overhead
[Box plots of query completion times (s) for Query-1 through Query-4: Pure Spark, ARIMA + Spark, SlidHMM + Spark, Lube-ARIMA, Lube-SlidHMM]
Conclusion
• Runtime performance bottleneck detection
  ‣ ARIMA, HMM
• A simple greedy bottleneck-aware task scheduler
  ‣ Jointly consider data-locality and bottlenecks
• Lube, a closed-loop framework mitigating bottlenecks at runtime
The End Thank You
Discussion
Bottleneck detection models
• More performance metrics could be explored
• More efficient models for time-series prediction, e.g., reinforcement learning, LSTM
Bottleneck-aware scheduling
• Fine-grained scheduling with specific resource awareness
WAN conditions
• We measure pair-wise WAN bandwidths with a cron job running iperf locally
• Try to exploit support from SDN interfaces