ML for Resource Management
Arjun Karuvally, Priyanka Mary Mammen
Introduction
● Big data analytics on the cloud is crucial for industry and is growing rapidly
● A number of techniques are used for data processing: MapReduce, SQL-like languages, deep learning, and in-memory analytics
● A cluster of virtual machines is the execution environment for these types of jobs
● Different analytic jobs have diverse behavior and resource requirements
Problem Statement
● The task of resource management is to find the right cloud configuration for an application
● This configuration includes the number of VMs, number of CPUs, CPU speed per core, RAM, disk count, disk speed, network capacity, etc.
● Any technique used for resource management in the cloud needs to create a performance model
● This performance model indicates which cloud configuration is best for the particular job being run
Motivation
● Choosing the right configuration for an application is essential to service quality and commercial competitiveness
● Many jobs are recurring, meaning that similar workloads are executed repeatedly
● Choosing poorly can result in a slowdown of 2-3x on average and 12x in the worst case
Challenges
● Evaluating all possible cloud configurations to find the best one is prohibitively expensive
● Each workload has its own preferred cloud configuration, so it is difficult to come up with one configuration for all workloads
● The resource requirements to achieve a certain objective (execution time or running cost) for a specific workload are opaque
● Running time and cost have a complex relation to the resources of cloud instances
CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics
Features
● Uses Bayesian Optimization to build performance models for various applications
● Models are just accurate enough to find a near-optimal configuration with only a few test runs
● Bayesian Optimization makes it possible to reach a near-optimal configuration, with a good confidence interval, from a minimum number of samples
Problem Formulation
● For a given application workload, the objective is to find an optimal or near-optimal cloud configuration that satisfies a performance requirement
● The problem is formulated mathematically as a cost minimization over cloud configurations
● The cloud configuration is represented by x; C represents the cost
● P is the price per unit time for VMs using x; T is the running time function
Problem Formulation
● The unknown required to compute the cost is the function T for different configurations x
● Since evaluating T is expensive, Bayesian Optimization is used to directly search for an approximate solution of the equation at significantly smaller cost
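The formulation referenced above can be reconstructed from the definitions on this slide (the bound T_max stands in for the stated performance requirement; this is a sketch, not the slide's exact notation):

```latex
\begin{aligned}
\min_{x}\quad & C(x) = P(x)\, T(x) \\
\text{s.t.}\quad & T(x) \le T_{\max}
\end{aligned}
```

Here the total cost C(x) of a run is simply the per-unit-time price P(x) of the VMs in configuration x multiplied by the running time T(x), and the constraint encodes the performance requirement.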
Bayesian Optimization
● Bayesian Optimization is used to solve optimization problems like the previous equation, where the objective function C is unknown but can be observed through experiments
● The cost C can be modeled as a stochastic process (e.g., a Gaussian process); a confidence interval can be computed from one or more samples of C
● Observational noise can be incorporated into the computation of the confidence interval of the objective function
● By integrating these, CherryPick can learn the objective function quickly and take samples only in the areas most likely to contain the minimum point
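The loop above can be sketched in a few dozen lines. This is a minimal, self-contained illustration under stated assumptions: a 1-D toy cost function stands in for "run the workload on configuration x and measure its cost", the surrogate is a zero-mean Gaussian process with a squared-exponential kernel (CherryPick itself uses a Matern 5/2 kernel), and the candidate configurations are a discretized grid.

```python
import numpy as np
from math import erf, sqrt, pi

def kernel(a, b, length=0.2):
    """Squared-exponential covariance between two point sets."""
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-6):
    """Gaussian-process posterior mean and std at candidate points."""
    K = kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = kernel(x_obs, x_cand)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ y_obs
    var = 1.0 - np.sum(Ks * (K_inv @ Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected amount by which we beat `best`."""
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + np.array([erf(v / sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return sigma * (z * Phi + phi)

def cost(x):
    """Stand-in for actually running the workload on configuration x."""
    return (x - 0.3) ** 2 + 0.1

cands = np.linspace(0.0, 1.0, 101)   # discretized configurations
x_obs = np.array([0.0, 1.0])         # initial samples
y_obs = cost(x_obs)
for _ in range(5):                   # a few BO iterations
    mu, sigma = gp_posterior(x_obs, y_obs, cands)
    ei = expected_improvement(mu, sigma, y_obs.min())
    x_next = cands[np.argmax(ei)]    # sample where EI is largest
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, cost(x_next))

best = x_obs[np.argmin(y_obs)]
print(best, y_obs.min())
```

Note how EI concentrates new samples where the confidence interval still allows an improvement over the current best, which is exactly why only a few test runs are needed.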
Working of BO
Prior and Acquisition function
● The prior is given assuming a Gaussian process
● The acquisition function is given using Expected Improvement
● Φ and φ are the standard normal cumulative distribution function and standard normal probability density function, respectively
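The Expected Improvement formula referenced above (shown as an image on the original slide) is the standard one for minimization; with posterior mean μ(x), posterior standard deviation σ(x), and current best observed cost c⁻, it can be written as:

```latex
z = \frac{c^{-} - \mu(x)}{\sigma(x)}, \qquad
\mathrm{EI}(x) = \sigma(x)\,\bigl(z\,\Phi(z) + \varphi(z)\bigr)
```

where Φ and φ are the standard normal CDF and PDF named on the slide. The next configuration to test is the one maximizing EI(x).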
Design options and decisions
● Prior function: a Gaussian process is chosen as the prior
● C is described using a mean function and a kernel (covariance) function
● The Matern kernel with parameter 5/2 is chosen as the covariance function between inputs because it does not require strong smoothness
● Acquisition function: Expected Improvement is chosen as the acquisition function
Design options and decisions
● Stopping condition: stop when the EI is less than a threshold (10%) and at least N cloud configurations have been observed
● Starting points: a quasi-random sequence is used to generate the starting points
● Encoding cloud configurations: x is a vector of the number of VMs, number of cores, CPU speed per core, average RAM per core, disk count, disk speed, and network capacity of the VM
● Most of the features are normalized and discretized
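The encoding step above can be sketched as follows. The feature list follows the slide, but the min/max ranges and the discretization of disk speed into {slow, fast} are illustrative assumptions, not values from the paper.

```python
# Encode a cloud configuration as a normalized feature vector in [0, 1]^7.
# Feature names follow the slide; the ranges below are assumptions
# for illustration only.
FEATURES = [
    # (name, min, max)
    ("num_vms",      1,   64),
    ("num_cores",    1,   32),
    ("cpu_ghz",      1.0, 4.0),
    ("ram_per_core", 1.0, 8.0),  # GB per core
    ("disk_count",   1,   8),
    ("disk_speed",   0,   1),    # discretized: 0 = slow, 1 = fast
    ("network_gbps", 1,   25),
]

def encode(config):
    """Map a dict of raw configuration values to a normalized vector."""
    vec = []
    for name, lo, hi in FEATURES:
        vec.append((config[name] - lo) / (hi - lo))
    return vec

x = encode({"num_vms": 16, "num_cores": 8, "cpu_ghz": 2.5,
            "ram_per_core": 4.0, "disk_count": 2, "disk_speed": 1,
            "network_gbps": 10})
print(x)
```

Normalizing every feature to a common [0, 1] scale keeps any single dimension (e.g., disk count vs. RAM in GB) from dominating the kernel's distance computation.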
Handling uncertainties in clouds
● Cloud resources are shared by multiple users, so different workloads may interfere with one another
● Failures and resource overloading can impact the completion time of a job
Implementation
Experimental Setup
● Benchmark applications on Spark and Hadoop exercise different CPU/disk/RAM/network resources
● TPC-DS: a recent benchmark for big data systems that models a decision support workload
● TPC-H: another SQL benchmark that contains a number of ad-hoc decision support queries processing large amounts of data
● Terasort: a common benchmarking application for big data analytics
● SparkReg: machine learning workloads on top of Spark
● SparkKm: a clustering machine learning workload
Experimental Setup
● Cloud configurations: four families in Amazon EC2: M4 (general purpose), C4 (compute optimized), R3 (memory optimized), I2 (disk optimized)
● EI threshold = 10%, N = 6, and 3 initial samples; the EI threshold is chosen to give a good tradeoff between search cost and accuracy
● Baselines: exhaustive search, coordinate descent
● Metrics: running cost, search cost
Results
● CherryPick finds the optimal configuration with low search time
Results
● It reaches better configurations with more stability than random search on a similar budget
Results
● CherryPick achieves running costs similar to a linear-predictor-based model, but with lower search cost and time
Results
● CherryPick can tune EI to trade off between search cost and accuracy
Results
● Effectiveness of CherryPick:
  - Scaling with workload size
  - Navigation of the search space
  - Estimation of running time vs. cluster size
Discussion
● Reliance on good representative workloads
● Larger search spaces: complexity depends only on the number of samples, not the number of candidates
● Choice of prior: with a Gaussian process prior, the assumption is that the final function is a sample from a Gaussian distribution
Shortcomings of CherryPick
● Model accuracy: it tries to accurately model the performance metric, which requires more data
● Cold start: the Bayesian approximation requires initial data to build the performance space
● Fragility: overly sensitive to initial parameters (initial points, kernel function, process)
Scout: An Experienced Guide to Find the Best Cloud Configuration
Exploration and Exploitation
● Any search-based method has two aspects: exploration and exploitation
● Exploration: gather new information about the search space by executing a new cloud configuration
● Exploitation: choose the most promising configuration based on the information gathered so far
● Additional exploration incurs high cost, while exploitation without exploration leads to suboptimal solutions: the exploration-exploitation dilemma
Features
● Search process efficiency: performance and workload characterization derived from historical data of previous workloads
● Search process effectiveness: uses comprehensive performance data for prediction, including low-level performance information
● Search process reliability: uses separate sets for unevaluated and evaluated configurations, and historical data to create a model for the current workload
Methodology
● Low-level information is incorporated into the feature vector of the configuration
● The set of all possible configurations is split into unevaluated and evaluated sets
● To search for the next best configuration given a starting configuration, a function f(F(S_i), F(S_j), L_i) is learned
● This is a classification function that labels a candidate as "better", "fair", or "worse"
Search Strategy
● Given <F(S_i), L_i>, we can obtain the prediction classes for the unevaluated configurations
● The next configuration is chosen such that the expected performance improves
● Due to the use of historical data, the search space is smaller and more exploitation is possible
● The search stops when it can no longer find a better configuration
● It also stops if it fails to find better solutions due to an inaccurate performance model
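The search loop above can be sketched as follows. The pairwise "classifier" here is a hand-coded stub standing in for the model f(F(S_i), F(S_j), L_i) that Scout trains on historical data, and the toy cost model, configuration list, and the preference for more RAM per core are assumptions for illustration only.

```python
def predict(current, candidate, low_level_metrics):
    """Stub for the learned classifier f(F(S_i), F(S_j), L_i):
    labels a candidate relative to the current configuration."""
    # Pretend the low-level metrics revealed memory pressure, so more
    # RAM per core is predicted to help (illustrative assumption).
    if candidate["ram_per_core"] > current["ram_per_core"]:
        return "better"
    if candidate["ram_per_core"] == current["ram_per_core"]:
        return "fair"
    return "worse"

def run(config):
    """Stub for executing the workload: returns (cost, low-level metrics)."""
    # Toy cost model: heavy paging below 4 GB/core inflates the cost.
    penalty = 50 if config["ram_per_core"] < 4 else 0
    return config["price"] + penalty, {"page_faults": penalty}

def scout_search(configs, start):
    """Evaluate configurations predicted 'better' until none remain."""
    evaluated = [start]
    unevaluated = [c for c in configs if c != start]
    best_cost, metrics = run(start)
    best = start
    while True:
        promising = [c for c in unevaluated
                     if predict(best, c, metrics) == "better"]
        if not promising:        # stop: no better configuration predicted
            return best, best_cost
        nxt = promising[0]
        unevaluated.remove(nxt)
        evaluated.append(nxt)
        c, metrics = run(nxt)
        if c < best_cost:
            best, best_cost = nxt, c

configs = [{"ram_per_core": r, "price": p}
           for r, p in [(2, 10), (4, 12), (8, 20)]]
best, best_cost = scout_search(configs, configs[0])
print(best, best_cost)
```

Because the classifier prunes candidates predicted "worse" or "fair", the loop only spends runs on configurations expected to improve performance, which is the exploitation bias the historical data buys.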
Experimental Setup
● Workloads: diverse workloads (CPU-intensive, memory-heavy, IO-intensive, and network-intensive) such as PageRank, sorting, recommendation, and OLAP, run on Apache Hadoop and Apache Spark
● Deployment choices: single-node as well as multiple-node settings
● Parameters:
  1) Labelled classes: "better+", "better", "fair", "worse", and "worse+"
  2) Probability threshold: 0.5
  3) Misprediction tolerance: 3 and 4 for single and multiple nodes, respectively