
Scalable Bandit Methods for Hyper-parameter Tuning
Kirthevasan Kandasamy, Carnegie Mellon University
Guest Lecture - Scalable Machine Learning for Big Data Biology
University of Pittsburgh, Pittsburgh, PA, November 3, 2017


Outline
◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces
(N.B.: Part II is a shameless plug for my research.)


Part 2.1: Multi-fidelity Bandits
Motivating question: what if we have cheap approximations to f?
1. Hyper-parameter tuning: train & validate with a subset of the data, and/or stop early before convergence. E.g. bandwidth (ℓ) selection in kernel density estimation.
2. Computational astrophysics: cosmological simulations and numerical computations with less granularity.
3. Autonomous driving: simulation vs. real-world experiments.


Multi-fidelity Methods
For specific applications:
◮ Industrial design (Forrester et al. 2007)
◮ Hyper-parameter tuning (Agarwal et al. 2011, Klein et al. 2015, Li et al. 2016)
◮ Active learning (Zhang & Chaudhuri 2015)
◮ Robotics (Cutler et al. 2014)
Multi-fidelity bandits & optimisation (Huang et al. 2006, Forrester et al. 2007, March & Wilcox 2012, Poloczek et al. 2016)
. . . with theoretical guarantees (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)


Multi-fidelity Bandits for Hyper-parameter Tuning
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?
E.g. train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•).
Approximations come from a continuous 2D "fidelity space" of (N, T) values.


Multi-fidelity Bandits (Kandasamy et al. ICML 2017)
A fidelity space Z and domain X:
- Z ← all (N, T) values; X ← all hyper-parameter values.
- g : Z × X → R, where g([N, T], x) ← CV accuracy when training with N data for T iterations at hyper-parameter x.
- Denote f(x) = g(z•, x), where z• = [N•, T•] ∈ Z.
- End goal: find x⋆ = argmax_x f(x).
- A cost function λ : Z → R+, e.g. λ(z) = λ(N, T) = O(N²T).
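The set-up above can be made concrete with a small sketch. The function `g`, the bias of the low fidelities, and the cost model are all illustrative stand-ins (only the shapes f(x) = g(z•, x) and λ(N, T) = O(N²T) come from the slides):

```python
# Toy sketch of the multi-fidelity set-up; the specific functions are
# illustrative, not the ones used in the talk.
N_max, T_max = 15000, 100          # z_bullet = (N_max, T_max)

def g(z, x):
    """Stand-in for CV accuracy at fidelity z = (N, T) and hyper-parameter x.
    Lower fidelities give a biased estimate of the full-fidelity value."""
    N, T = z
    bias = 0.1 * (1 - N / N_max) + 0.05 * (1 - T / T_max)
    return -(x - 0.5) ** 2 + 1.0 - bias   # peaks at x = 0.5 at full fidelity

def f(x):
    """f(x) = g(z_bullet, x): the quantity we ultimately want to maximise."""
    return g((N_max, T_max), x)

def cost(z):
    """lambda(z) = N^2 * T, matching the O(N^2 T) cost model on the slide."""
    N, T = z
    return N ** 2 * T

# A cheap fidelity costs a small fraction of the full-fidelity evaluation:
print(cost((5000, 20)) / cost((N_max, T_max)))   # ≈ 0.0222, i.e. 1/45
```

The point of the cost function is exactly this ratio: many evaluations at (5K, 20) can be bought for the price of one evaluation at (N•, T•).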


Algorithm: BOCA (Kandasamy et al. ICML 2017)
Model g ∼ GP(0, κ) and compute the posterior GP:
  mean μ_{t−1} : Z × X → R,   std-dev σ_{t−1} : Z × X → R+.
At each step t:
(1) x_t ← maximise the upper confidence bound for f(x) = g(z•, x):
      x_t = argmax_{x ∈ X}  μ_{t−1}(z•, x) + β_t^{1/2} σ_{t−1}(z•, x)
(2) Z_t ≈ {z•} ∪ { z : σ_{t−1}(z, x_t) ≥ γ(z) },  where γ(z) = ξ(z) (λ(z)/λ(z•))^q
(3) z_t = argmin_{z ∈ Z_t} λ(z)   (the cheapest z in Z_t)
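A minimal sketch of one BOCA iteration over finite grids may help. The posterior mean `mu`, std-dev `sigma`, and threshold `gamma` below are crude stand-ins (a real implementation would fit a GP to the observed g values and use the paper's ξ(z)); only the three-step structure follows the slide:

```python
import numpy as np

# Stand-in grids for the domain X and the 2-D fidelity space Z = (N, T).
X = np.linspace(0, 1, 50)
Z = [(n, t) for n in (5000, 10000, 15000) for t in (20, 60, 100)]
z_bullet = (15000, 100)

mu    = lambda z, x: -(x - 0.5) ** 2                 # stand-in posterior mean
sigma = lambda z, x: 0.1 + 0.05 * (z != z_bullet)    # stand-in posterior std-dev
gamma = lambda z: 0.12                               # stand-in threshold gamma(z)
cost  = lambda z: z[0] ** 2 * z[1]                   # lambda(z) = N^2 * T
beta_t = 2.0

# (1) choose x_t by maximising the UCB at the highest fidelity z_bullet
ucb = [mu(z_bullet, x) + np.sqrt(beta_t) * sigma(z_bullet, x) for x in X]
x_t = X[int(np.argmax(ucb))]

# (2) keep z_bullet plus the fidelities still uncertain at x_t
Z_t = [z_bullet] + [z for z in Z if z != z_bullet and sigma(z, x_t) >= gamma(z)]

# (3) evaluate at the cheapest fidelity in Z_t
z_t = min(Z_t, key=cost)
print(x_t, z_t)
```

With these stand-ins, every low fidelity is still uncertain at x_t, so the algorithm picks the cheapest one, (5000, 20): the intended behaviour early on, before the cheap fidelities have been resolved and the query "graduates" to z•.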


Theoretical Results for BOCA
[Figure: two g(z, x) surfaces over Z × X with f(x), x⋆ and z• marked - a "good" setting and a "bad" setting.]
Theorem (informal): BOCA achieves better simple regret than GP-UCB. The improvements are larger in the "good" setting than in the "bad" setting.

Experiment: SVM with 20 Newsgroups
Tune two hyper-parameters for the SVM. The dataset has N• = 15K data and we use T• = 100 iterations, but we can choose N ∈ [5K, 15K] or T ∈ [20, 100] (a 2D fidelity space).
[Figure: CV accuracy (y-axis, 0.89-0.915) vs. spent cost (x-axis, 500-2000).]


Experiment: Cosmological Inference on Type Ia Supernovae Data
Estimate the Hubble constant, dark matter fraction & dark energy fraction by maximising the likelihood on N• = 192 data. This requires numerical integration on a grid of size G• = 10⁶. Approximate with N ∈ [50, 192] or G ∈ [10², 10⁶] (a 2D fidelity space).
[Figure: regret (y-axis, 0.02-0.1) vs. spent cost (x-axis, 1000-3500).]


Hyper-band: A Multi-fidelity Method with Incremental Resource Allocation (Li et al. 2016)
E.g. training a neural network with gradient descent for several iterations: if the CV error is bad after the early iterations, it will likely be bad at the end.
Successive Halving (with finite X):
1. Allocate a small resource R to each x ∈ X, e.g. train all hyper-parameter configurations for 100 iterations.
2. Drop the half of the x's that are performing worst.
3. Repeat steps 1 & 2 until one arm is left.
Can be extended to infinite X. Does not fall within the GP/Bayesian framework.
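The three steps above can be sketched directly. The `evaluate` function below is a toy stand-in for "train configuration x with the given resource and return its CV accuracy"; the halving loop itself follows the slide:

```python
import math

def evaluate(x, resource):
    """Toy stand-in for CV accuracy: best at x = 0.3, improving with resource."""
    return 1.0 - (x - 0.3) ** 2 - 1.0 / resource

def successive_halving(xs, base_resource=100):
    resource = base_resource
    while len(xs) > 1:
        # 1. allocate the current resource to every surviving configuration
        scores = {x: evaluate(x, resource) for x in xs}
        # 2. drop the worst-performing half
        xs = sorted(xs, key=scores.get, reverse=True)[:math.ceil(len(xs) / 2)]
        # 3. grow the resource and repeat until one arm is left
        resource *= 2
    return xs[0]

configs = [0.05, 0.2, 0.3, 0.55, 0.7, 0.9]
print(successive_halving(configs))   # → 0.3
```

Note that each survivor is re-trained with a larger resource in the next round; the "incremental" part of Hyper-band is that in practice this training resumes from the previous checkpoint rather than restarting.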


Hyper-band (cont'd)
When compared to Bayesian methods:
◮ Pro: incremental resource allocation (no need to retrain all models from the beginning).
◮ Con: cannot use correlation between arms (e.g. if x₁ has high CV accuracy, then x₂ close to x₁ is also likely to do well).
Experiments: [figures not shown]

Outline
◮ Part I: Bandits in the Bayesian Paradigm
  1. Gaussian processes
  2. Algorithms: Upper Confidence Bound (UCB) & Thompson Sampling (TS)
◮ Part II: Scaling up Bandits
  1. Multi-fidelity bandits: cheap approximations to an expensive experiment
  2. Parallelising function evaluations
  3. High-dimensional input spaces


Part 2.2: Parallelising Arm Pulls
- Sequential evaluations with one worker.
- Parallel evaluations with M workers (asynchronous).
- Parallel evaluations with M workers (synchronous).


Why Parallelisation?
◮ Computational experiments: infrastructure with 100s-1000s of CPUs or GPUs.
Prior work: (Ginsbourger et al. 2011, Janusevskis et al. 2012, Wang et al. 2016, González et al. 2015, Desautels et al. 2014, Contal et al. 2013, Shah and Ghahramani 2015, Kathuria et al. 2016, Wang et al. 2017, Wu and Frazier 2016, Hernandez-Lobato et al. 2017)
Shortcomings of prior work:
◮ Asynchronicity
◮ Theoretical guarantees
◮ Computational & conceptual simplicity


Review: Sequential Thompson Sampling in GP Bandits
Thompson Sampling (TS) (Thompson, 1933):
1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose x_t = argmax_x g(x).
4) Evaluate f at x_t.
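One round of the loop above can be sketched on a 1-D grid. The squared-exponential kernel, its length-scale, and the toy objective are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def k(a, b, ell=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def f(x):
    """Toy unknown function we are optimising."""
    return np.sin(3 * x)

grid = np.linspace(0, 1, 100)
X_obs = np.array([0.1, 0.9])             # points evaluated so far
y_obs = f(X_obs)
noise = 1e-6

# 1) posterior GP given the observations (standard GP regression formulas)
K = k(X_obs, X_obs) + noise * np.eye(len(X_obs))
Ks = k(grid, X_obs)
mu = Ks @ np.linalg.solve(K, y_obs)
cov = k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)

# 2) draw one sample g from the posterior (Cholesky of jittered covariance)
L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(grid)))
g = mu + L @ rng.standard_normal(len(grid))

# 3) choose the next point by maximising the sampled function
x_t = grid[int(np.argmax(g))]

# 4) evaluate f at x_t; (x_t, y_t) would be appended before the next round
y_t = f(x_t)
print(x_t, y_t)
```

Because the next point maximises a random posterior sample rather than a fixed acquisition, TS explores exactly in proportion to the posterior probability that each point is the maximiser.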

Parallelised Thompson Sampling (Kandasamy et al. arXiv 2017)
Asynchronous: asyTS. At any given time:
1. (x′, y′) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ∼ GP.
4. Re-deploy the worker at argmax g.
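The asynchronous loop can be illustrated with a small event-driven simulation. Here `draw_and_maximise` is a stand-in for "sample g from the posterior GP and take its argmax", and `evaluate` is a stand-in for the expensive experiment returning a value and a random run time; only the scheduling logic mirrors the slide:

```python
import heapq
import random

random.seed(0)

def draw_and_maximise(history):
    """Stand-in for: fit posterior to history, sample g, return argmax g."""
    return random.random()

def evaluate(x):
    """Stand-in experiment: returns (value, run time)."""
    return -(x - 0.6) ** 2, random.uniform(0.5, 2.0)

M, budget = 4, 12
history = []                        # completed (x, y) pairs
queue = []                          # min-heap of (finish_time, worker, x, y)

# seed each of the M workers with an initial point
for w in range(M):
    x = draw_and_maximise(history)
    y, dt = evaluate(x)
    heapq.heappush(queue, (dt, w, x, y))

while len(history) < budget:
    t, w, x, y = heapq.heappop(queue)     # 1. wait for a worker to finish
    history.append((x, y))                # 2. update the posterior with (x', y')
    x_new = draw_and_maximise(history)    # 3. draw a sample, take its argmax
    y_new, dt = evaluate(x_new)           # 4. re-deploy the freed worker
    heapq.heappush(queue, (t + dt, w, x_new, y_new))

best = max(history, key=lambda p: p[1])
print(len(history), best)
```

The key point the simulation makes: no worker ever idles waiting for the others, which is where asyTS gains over synchronous batch methods when evaluation times vary.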
