OptimusCloud: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud
Ashraf Mahgoub (1), Alexander Medoff (1), Rakesh Kumar (2), Subrata Mitra (3), Ana Klimovic (4), Somali Chaterji (1), Saurabh Bagchi (1)
1: Purdue University; 2: Microsoft; 3: Adobe Research; 4: Google Research
Supported by NIH R01 AI123037-01 (2016-21), WHIN center (2018-22)

Agenda
• Introduction
• Challenges in Online Tuning for Key-Value Stores
• Dynamic Workloads
• Prior Work
• Proposed Approach
• Benefits of Heterogeneous Configurations
• Use Cases and Evaluation
• Conclusion

Introduction
• OptimusCloud’s goal: achieving cost and performance efficiency for a cloud-hosted distributed key-value store using online configuration tuning
• OptimusCloud considers two sets of configuration parameters:
– Key-value store parameters: cache size, number of reading/writing threads, compaction method/throughput, etc.
– Cloud VM parameters: VM size/type, which controls the number of cores, memory size, network bandwidth, etc.

Challenges in Online Tuning for Key-Value Stores
• Combining both sets of configuration parameters (key-value store + VM type/size) produces a large configuration space:
– 25+ performance tuning parameters
– 133 instance types/sizes
– Prices vary by a factor of 5,000X
• Dependency between key-value store and VM configurations:
– For example, the cache size of Cassandra is limited by the available RAM in the cloud VM
• OptimusCloud performs joint optimization while taking into account the dependencies between the two spaces to achieve globally optimized performance
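The dependency between the two spaces can be illustrated with a minimal Python sketch. The VM names, memory sizes, candidate cache sizes, and the 50% headroom rule are all illustrative assumptions rather than values from the talk; the only point is that a joint search has to discard (VM, cache size) pairs that the VM cannot host.

```python
from itertools import product

# Hypothetical VM catalog: name -> RAM in GiB (illustrative values only).
VM_RAM_GIB = {"small": 4, "medium": 16, "large": 32}

# Hypothetical candidate cache sizes for the key-value store, in GiB.
CACHE_SIZES_GIB = [1, 2, 8, 16]

# Assumed rule of thumb: the cache may use at most this fraction of VM RAM.
MAX_CACHE_FRACTION = 0.5


def feasible_joint_configs(vm_ram, cache_sizes, max_fraction):
    """Return (vm, cache_gib) pairs whose cache fits within the VM's RAM budget."""
    return [
        (vm, cache)
        for vm, cache in product(vm_ram, cache_sizes)
        if cache <= vm_ram[vm] * max_fraction
    ]


configs = feasible_joint_configs(VM_RAM_GIB, CACHE_SIZES_GIB, MAX_CACHE_FRACTION)
for vm, cache in configs:
    print(f"{vm}: cache={cache} GiB")
```

Of the 12 raw combinations in this toy catalog, only 9 survive the memory constraint, which is why the two parameter sets cannot be tuned independently.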

Cassandra’s Performance on Different VM Types/Sizes
Takeaways:
❑ Best configurations vary across different VM types/sizes
❑ Therefore, jointly tuning key-value store and cloud VM parameters is crucial to achieve cost-optimal performance

OptimusCloud’s Overview

Dynamic workloads and online reconfiguration
• Dynamic workloads:
– Workload characteristics (e.g., read-to-write ratio, request rate) change over time, sometimes unpredictably
– New characteristics cause current configurations to perform sub-optimally, necessitating reconfigurations
• Impact of online reconfiguration:
– Changing configurations at runtime usually requires a server restart, causing downtime and a degradation in performance
– For fast-changing workloads, frequent reconfiguration of the overall cluster could severely degrade performance
• Q: Can we reconfigure only a subset of the nodes in the cluster? Which subset?
– This leads to a heterogeneous configuration

Why are heterogeneous configurations beneficial?
Best configurations to optimize Perf/$:
• Write-Heavy -> All C4.L
• Read-Heavy -> 2 C4.L & 2 R4.XL

OptimusCloud’s Solution
• Heterogeneous configurations reduce reconfiguration downtime and avoid overprovisioning
• However, heterogeneity increases the size of the configuration space
– Consider a cluster of N=20 nodes and I=15 configurations
– Homogeneous: we have I=15 possible configurations
– Heterogeneous: we have $\binom{N+I-1}{I-1} \approx 1.3 \times 10^{9}$ possible configurations
• OptimusCloud uses the concept of Complete-Sets to reduce the size of the search space
– Complete-Set: the minimum subset of nodes for which the union of their data records covers all the records in the database at least once
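The two counts can be checked with a short Python sketch. It reads the heterogeneous formula as the number of ways to decide how many of the N nodes receive each of the I configurations, with nodes treated as interchangeable; that reading is an assumption drawn from the formula itself, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def homogeneous_count(num_configs):
    """All nodes share one configuration, so there are I choices."""
    return num_configs


def heterogeneous_count(num_nodes, num_configs):
    """Multisets of size N drawn from I configurations: C(N + I - 1, I - 1)."""
    return comb(num_nodes + num_configs - 1, num_configs - 1)


# Sanity-check the closed form by brute force on a small cluster.
small_n, small_i = 4, 3
brute_force = sum(1 for _ in combinations_with_replacement(range(small_i), small_n))
assert brute_force == heterogeneous_count(small_n, small_i)

N, I = 20, 15
print(homogeneous_count(I))
print(heterogeneous_count(N, I))
```

For N=20 and I=15 the heterogeneous count evaluates to 1,391,975,640, which matches the order of magnitude quoted on the slide.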

Complete-Sets
• The concept of a Complete-Set relies on selecting the fastest replica for a given request
– Dynamic Snitch (Cassandra) or Adaptive Replica Selection (Elasticsearch)
• Consistency-Level (CL) defines how many replicas need to reply to a request before it is satisfied
– Therefore, the slowest of the replicas that must reply will dominate the response latency
– The servers within a Complete-Set must be upgraded to the faster configuration upon a workload change for the cluster performance to improve
• OptimusCloud keeps the configurations homogeneous within the same Complete-Set, while allowing different Complete-Sets to have different configurations
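These ideas can be sketched in Python under loudly stated assumptions: nodes sit on a ring, each record range is replicated on RF consecutive nodes, and the node count is a multiple of RF. This placement model, the latency numbers, and every function name are hypothetical and are not taken from the talk. Under that model, taking every RF-th node yields a set that covers all records, and with CL=1 the latency of a request is set by its fastest replica.

```python
def replicas_of(range_id, num_nodes, rf):
    """Assumed placement: range k lives on nodes k, k+1, ..., k+rf-1 (mod N)."""
    return {(range_id + j) % num_nodes for j in range(rf)}


def complete_sets(num_nodes, rf):
    """Group nodes by index modulo rf (assumes num_nodes is a multiple of rf)."""
    if num_nodes % rf != 0:
        raise ValueError("this sketch assumes num_nodes is a multiple of rf")
    return [set(range(offset, num_nodes, rf)) for offset in range(rf)]


def covers_all_records(nodes, num_nodes, rf):
    """True if every record range has at least one replica in `nodes`."""
    return all(replicas_of(k, num_nodes, rf) & nodes for k in range(num_nodes))


def request_latency(replica_latencies, cl):
    """With fastest-replica selection, latency is set by the cl-th fastest replica."""
    return sorted(replica_latencies)[cl - 1]


N, RF = 6, 3
sets = complete_sets(N, RF)
assert all(covers_all_records(s, N, RF) for s in sets)
print(sets)

# Upgrade only the first Complete-Set to a faster (hypothetical) configuration.
FAST_MS, SLOW_MS = 2.0, 10.0
node_latency = {n: (FAST_MS if n in sets[0] else SLOW_MS) for n in range(N)}
worst_cl1 = max(
    request_latency([node_latency[n] for n in replicas_of(k, N, RF)], cl=1)
    for k in range(N)
)
print(worst_cl1)
```

In this toy model, upgrading a single Complete-Set (two of six nodes) is enough for every request at CL=1 to find a fast replica, whereas upgrading an arbitrary pair of nodes would leave some ranges served only by slow replicas.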

How does partitioning the cluster into Complete-Sets reduce the search space?
• First, we show that the number of Complete-Sets in any cluster is at most the Replication-Factor (RF) (proof is given in the paper)
– RF is low in practice (3 or 5)
• Second, when the number of reconfigured Complete-Sets equals the Consistency-Level (CL <= RF), all requests are served from nodes with optimized configurations
• With S Complete-Sets, the size of the search space is reduced to $\binom{S+I-1}{I-1} = 680$ possible configurations for a cluster with RF=3 (compared to $1.3 \times 10^{9}$)
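With the space reduced to a few hundred candidates, exhaustive enumeration becomes feasible. The Python sketch below enumerates every assignment of I=15 configurations to S=3 Complete-Sets, treating the Complete-Sets as interchangeable, as the formula does. The scoring function is a placeholder assumption standing in for a performance model; it is not the talk's predictor.

```python
from itertools import combinations_with_replacement
from math import comb

S, I = 3, 15  # Complete-Sets (RF=3) and candidate configurations

# Each candidate is a multiset of S configuration indices, one per Complete-Set.
candidates = list(combinations_with_replacement(range(I), S))
assert len(candidates) == comb(S + I - 1, I - 1)


def placeholder_score(assignment):
    """Hypothetical Perf/$ proxy: higher indices are faster but cost more."""
    perf = max(assignment)  # pretend the fastest set serves requests (CL=1)
    cost = sum(assignment) + len(assignment)
    return perf / cost


best = max(candidates, key=placeholder_score)
print(len(candidates), best)
```

Under this toy score the winner is a heterogeneous assignment (one Complete-Set on the fastest configuration, the others on the cheapest), and scanning all 680 candidates is trivial compared with the node-level space.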

Using data-placement info to identify Complete-Sets
First,

Applications
1. MG-RAST:
– Real workload traces from the largest metagenomics analysis portal
– Its workload does not have any discernible daily or weekly pattern, as the requests come from all across the globe
– Workload can change drastically over a few minutes (accurately predictable for 5 min)
2. Bus-Tracking:
– Real workload traces from a bus-tracking mobile application
– Traces show a daily pattern of workload switches
– Workload is accurately predictable for longer look-ahead periods (e.g., 2 hours)
3. HPC:
– Simulated workload traces from data analytics jobs submitted to a shared HPC queue
– Using profiling techniques, job execution times can be predicted with high accuracy and for long look-ahead periods

Performance Prediction Accuracy

Baselines
1. Homogeneous-Static: the single best configuration to use for the entire duration of the predicted workload. Impractical because it assumes perfect knowledge of the future workload
2. CherryPick [NSDI-17]: uses Bayesian Optimization to find a homogeneous cloud configuration for a representative job/phase of the workload
3. Selecta [ATC-18]: uses SVD techniques to select the optimized homogeneous cloud configuration for different jobs/phases of the workload
4. SOPHIA [ATC-19]: uses Genetic Algorithms and performance modeling to find optimized homogeneous configurations for key-value store parameters

Evaluation: Cassandra
Figure: three bar-chart panels, MG-RAST, HPC, and Bus-Tracking (each with Cluster-Size=6, RF=3, CL=1, 16GB/server), showing Normalized Ops/s/$ and Latency (P99, sec) for Homogeneous-Static, CherryPick, Selecta, SOPHIA, and OptimusCloud.
• OptimusCloud achieves up to 86% better Perf/$ over the Homogeneous-Static configuration due to its online reconfiguration capability.
• Compared to SOPHIA, OptimusCloud achieves up to 212% better Perf/$, as SOPHIA considers only homogeneous configurations for key-value store parameters, without considering online reconfiguration for the cloud VM type/size.
• OptimusCloud achieves improvements of up to 173% and 130% over CherryPick and Selecta due to its ability to find heterogeneous configurations, which minimizes the reconfiguration downtime and avoids overprovisioning.

Tolerance to Prediction Errors
Figure: HPC (RF=3, CL=1, Cluster-Size=6, 16GB/server); % Improvement over Homogeneous-Static versus % Noise (0% to 50%), with one curve for the noisy workload predictor and one for the noisy throughput predictor.
• OptimusCloud’s improvement over Homogeneous-Static decreases with increasing levels of noise, as the selected configurations deviate from the best configurations.
• OptimusCloud is more sensitive to errors in the throughput predictor than to errors in the workload predictor, as shown by the steeper downward slope of the noisy throughput predictor curve.

Conclusion
• For cost-optimal performance of a distributed key-value store in the cloud, it is critical to jointly tune key-value store and cloud configurations.
• OptimusCloud provides the insight that it is optimal to create heterogeneous configurations, and for this it determines at runtime the minimum number of servers to reconfigure.
• Using a novel concept of Complete-Sets, OptimusCloud provides a technique to reduce the large search space introduced by heterogeneity.
• Configurations found by OptimusCloud outperform those found by prior works (CherryPick, Selecta, and SOPHIA) in both Perf/$ and tail latency (P99).
