
BOAT: Building Auto-Tuners with Structured Bayesian Optimization (presentation slides)



  1. BOAT: Building Auto-Tuners with Structured Bayesian Optimization. Valentin Dalibard, Michael Schaarschmidt, Eiko Yoneki. Presented by Jesse Mu.

  2. Parameters in large-scale systems range from coarse to fine: number of cluster nodes, ML hyperparameters, compiler flags. How do we optimize parameters θ? Minimize some cost function f(θ), where cost is runtime, memory, I/O, etc.
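To make the setup concrete, here is a toy sketch of a configuration θ and a cost function f(θ) measured as runtime; the workload and parameter names are invented for illustration.

```python
import time

def workload(buffer_size: int) -> None:
    # Stand-in for a real system run; buffer_size plays the role of θ.
    data = list(range(100_000))
    chunks = [data[i:i + buffer_size] for i in range(0, len(data), buffer_size)]
    _ = [sum(chunk) for chunk in chunks]

def f(theta: int) -> float:
    # f(θ): the cost of configuration θ, here wall-clock runtime in seconds.
    start = time.perf_counter()
    workload(buffer_size=theta)
    return time.perf_counter() - start

print(f(16), f(1024))   # two evaluations of the cost function
```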

  3. Auto-tuning (optimization):
     ● Grid search: θ ∈ [1, 2, 3, …]
     ● Evolutionary approaches
     ● Hill-climbing
     ● Bayesian optimization (e.g. SPEARMINT)

  4. Auto-tuning (optimization) in distributed systems:
     ● Grid search, evolutionary approaches, and hill-climbing require 1000s of evaluations of the cost function!
     ● Bayesian optimization (e.g. SPEARMINT) fails in high dimensions! (a minimal generic loop is sketched below)
     ● Structured Bayesian optimization (this work: BespOke Auto-Tuners, i.e. BOAT)
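For contrast with the methods listed above, here is a minimal sketch of a generic Bayesian optimization loop (an illustration, not SPEARMINT's or BOAT's implementation): a Gaussian process surrogate is refit after each measurement, and a lower-confidence-bound rule picks the next configuration, so far fewer cost evaluations are needed than grid or evolutionary search.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def cost(theta: float) -> float:
    # Stand-in for a real measurement (runtime, memory, I/O, ...).
    return (theta - 3.7) ** 2 + np.random.normal(scale=0.1)

X, y = [[0.0], [10.0]], [cost(0.0), cost(10.0)]          # two seed evaluations
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):
    gp.fit(np.array(X), np.array(y))                      # refit the surrogate
    cand = np.random.uniform(0.0, 10.0, size=(1000, 1))   # candidate configs
    mu, sigma = gp.predict(cand, return_std=True)
    lcb = mu - 2.0 * sigma                                # optimistic cost bound
    theta = float(cand[np.argmin(lcb), 0])                # most promising config
    X.append([theta])
    y.append(cost(theta))

print("best config:", X[int(np.argmin(y))], "cost:", min(y))
```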

  5. Gaussian Processes [figure panels: data, prior, posterior]. From Carl Rasmussen’s 4F13 lectures: http://mlg.eng.cam.ac.uk/teaching/4f13/1718/gp%20and%20data.pdf
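A small sketch of what the data/prior/posterior panels show, using a plain-NumPy Gaussian process with a squared-exponential kernel (the kernel and data here are assumptions, not the lecture's):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2)).
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

xs = np.linspace(0.0, 5.0, 100)

# Prior: functions drawn from GP(0, k) before seeing any data.
prior_cov = rbf(xs, xs) + 1e-8 * np.eye(len(xs))          # jitter for stability
prior_samples = np.random.multivariate_normal(np.zeros(len(xs)), prior_cov, size=3)

# Posterior: condition the prior on observed data (X, y).
X = np.array([1.0, 2.5, 4.0])
y = np.array([0.5, -0.2, 0.8])
K = rbf(X, X) + 1e-6 * np.eye(len(X))                     # noisy observations
Ks = rbf(xs, X)
post_mean = Ks @ np.linalg.solve(K, y)
post_cov = rbf(xs, xs) - Ks @ np.linalg.solve(K, Ks.T)
```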

  6. The acquisition function: e.g. expected increase over the maximum performance so far (balancing exploration vs. exploitation).
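One standard instance of such an acquisition function is expected improvement; the sketch below computes it from a GP posterior (this particular formula is a common choice, not necessarily the talk's exact one):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # mu, sigma: GP posterior mean/std at candidate configs.
    # f_best: best (maximum) performance observed so far.
    # EI(x) = E[max(f(x) - f_best, 0)]: large when the mean beats f_best
    # (exploitation) or when uncertainty is high (exploration).
    sigma = np.maximum(sigma, 1e-12)       # guard against zero variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```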

  7. Bayesian Optimization [diagram: a Gaussian process serves as the surrogate model].

  8. Structured Bayesian Optimization (SBO) [diagram: the Gaussian process is replaced by a developer-specified, semi-parametric model of performance, built from observed performance + arbitrary runtime characteristics].
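One way to read "semi-parametric" in code (my illustration, not BOAT's API): a developer-specified parametric trend plus a GP fit to its residuals, so the generic component absorbs whatever the structure gets wrong.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def parametric_trend(thetas, w):
    # Developer-specified structure, e.g. cost ~ w0 + w1 / theta.
    return w[0] + w[1] / thetas

class SemiParametricModel:
    def __init__(self, w):
        self.w = w                                    # parametric component
        self.gp = GaussianProcessRegressor(normalize_y=True)

    def fit(self, thetas, costs):
        resid = costs - parametric_trend(thetas, self.w)
        self.gp.fit(thetas.reshape(-1, 1), resid)     # GP absorbs model error

    def predict(self, thetas):
        return parametric_trend(thetas, self.w) + self.gp.predict(thetas.reshape(-1, 1))

model = SemiParametricModel(w=[1.0, 9.0])
model.fit(np.array([1.0, 2.0, 4.0, 8.0]), np.array([10.2, 5.4, 3.1, 2.0]))
print(model.predict(np.array([3.0])))
```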

  9. Probabilistic models for SBO [figure comparing model classes]: purely parametric models are too restrictive, fully generic models (plain GPs) are too generic, semi-parametric models are just right.

  10. Semi-parametric models in SBO:
      ● Specify the parametric component only (the GP comes for free)
      ● e.g. predict GC rate from the JVM eden size, with prior: malloc rate ~ Uniform(0, 5000) (a sketch of this model follows below)
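A hypothetical sketch of the eden-size example from the list above. The functional form gc_rate ≈ malloc_rate / eden_size and all units are assumptions; only the Uniform(0, 5000) prior comes from the slide.

```python
import numpy as np

def log_prior(malloc_rate):
    # From the slide: malloc rate ~ Uniform(0, 5000).
    return 0.0 if 0.0 < malloc_rate < 5000.0 else -np.inf

def log_likelihood(malloc_rate, eden_sizes, observed_gc_rates, noise=0.5):
    # GCs fire when eden fills, so gc_rate ≈ malloc_rate / eden_size
    # (assumed functional form; Gaussian observation noise also assumed).
    predicted = malloc_rate / eden_sizes
    return -0.5 * np.sum(((observed_gc_rates - predicted) / noise) ** 2)

def log_posterior(malloc_rate, eden_sizes, observed_gc_rates):
    return log_prior(malloc_rate) + log_likelihood(malloc_rate, eden_sizes, observed_gc_rates)

# With runtime measurements of GC rate at a few eden sizes, the posterior
# over malloc_rate pins down the model after only a handful of observations.
eden = np.array([256.0, 512.0, 1024.0])
gc = np.array([4.1, 1.9, 1.0])
rates = np.linspace(1.0, 4999.0, 5000)
best = rates[np.argmax([log_posterior(r, eden, gc) for r in rates])]
print("MAP malloc rate ≈", best)
```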

  11. Composing semi-parametric models: models are wired into a dataflow DAG, and inference exploits the conditional independence between models (a sketch follows below).
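A toy sketch of the dataflow-DAG idea (the structure is invented for illustration): one model predicts a runtime characteristic, a second consumes it to predict the objective, and each model depends only on its parents, which is the conditional independence inference exploits.

```python
class GCModel:
    # Predicts a runtime characteristic (GC invocations per second).
    def predict(self, eden_size: float, malloc_rate: float) -> float:
        return malloc_rate / eden_size

class LatencyModel:
    # Consumes the GC model's output to predict the final objective.
    def predict(self, gc_rate: float, gc_pause_ms: float) -> float:
        return gc_rate * gc_pause_ms        # added latency per second, in ms

# Dataflow DAG: (eden_size, malloc_rate) -> gc_rate -> latency.
# Each model depends only on its parents, so the two models can be
# fit and updated independently.
gc_rate = GCModel().predict(eden_size=512.0, malloc_rate=1024.0)
latency = LatencyModel().predict(gc_rate=gc_rate, gc_pause_ms=12.0)
print(f"GC rate: {gc_rate:.2f}/s, added latency: {latency:.1f} ms per second")
```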

  12. SBO: Summary. The developer provides:
      1. Configuration space (i.e. possible params) [standard]
      2. Objective function + runtime measurements [standard]
      3. Semi-parametric model of the system [new]
      Key: try the generic system first, before optimizing with structure (a hypothetical interface sketch follows below).
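A hypothetical interface for the three ingredients above; this is not BOAT's real API, and all names are invented. The proposal step falls back to random search here; a structured model would replace it with an acquisition function over the model's posterior.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

@dataclass
class Tuner:
    config_space: Dict[str, Tuple[float, float]]     # 1. possible params (standard)
    objective: Callable[[Dict[str, float]], float]   # 2. measured cost (standard)
    model: Optional[object] = None                   # 3. semi-parametric model (new)

    def propose(self) -> Dict[str, float]:
        # Fallback: random proposal. With a structured model, this is where an
        # acquisition function over the model's posterior would go instead.
        return {k: random.uniform(lo, hi) for k, (lo, hi) in self.config_space.items()}

    def run(self, iterations: int = 50) -> Dict[str, float]:
        best_cfg, best_cost = None, float("inf")
        for _ in range(iterations):
            cfg = self.propose()
            c = self.objective(cfg)                  # one cost measurement
            if c < best_cost:
                best_cfg, best_cost = cfg, c
        return best_cfg

tuner = Tuner(
    config_space={"eden_size_mb": (64.0, 4096.0)},
    objective=lambda cfg: 1000.0 / cfg["eden_size_mb"],   # toy stand-in cost
)
print(tuner.run())
```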

  13. Evaluation: Cassandra GC. The best parameters found outperform the Cassandra defaults by 63%; existing systems converge, but take 6x longer.

  14. Evaluation: neural net SGD. Load balancing and worker allocation over 10 machines = 30 params. Default configuration: 9.82s; OpenTuner: 8.71s; BOAT: 4.31s (≈2.3x faster than the default). Existing systems don’t converge!

  15. Review: overall, a good, unsurprising contribution.
      ● Theory
        ○ Unsurprising that expert-developed models optimize better! Tradeoff: developer hours vs. machine hours.
        ○ The Cassandra GC system converges in 2 iterations, so the model is near-perfect. What happens when the parametric model is wrong?
          ■ More details are needed on the tradeoff between the parametric model and a generic GP.
          ■ Compare OpenTuner, which builds an ensemble of multiple search techniques.
      ● Implementation
        ○ Cross-validation?
        ○ Key for system adoption: make the interface as high-level as possible.
      ● Evaluation
        ○ What happens when # params >> 30?
        ○ “DAGModels help debugging”... how?
