
Multi-fidelity Bayesian Optimisation
Kirthevasan Kandasamy
Carnegie Mellon University
Facebook Inc., Menlo Park, CA. September 26, 2017.
Slides: www.cs.cmu.edu/~kkandasa/talks/fb-mf-slides.pdf


Multi-fidelity Bandit Optimisation in 2 Fidelities (1 Approximation) (Kandasamy et al. NIPS 2016b)

◮ Optimise f = f^(2): x⋆ = argmax_x f^(2)(x).
◮ But we also have an approximation f^(1) to f^(2).
◮ f^(1) costs λ^(1), f^(2) costs λ^(2), with λ^(1) < λ^(2). "Cost" could be computation time, money, etc.
◮ f^(1), f^(2) ∼ GP(0, κ).
◮ ‖f^(2) − f^(1)‖_∞ ≤ ζ^(1), where ζ^(1) is known.

At time t: determine the point x_t ∈ X and the fidelity m_t ∈ {1, 2} to query.

End goal: maximise f^(2); we do not care about the maximum of f^(1).

Simple regret: S(Λ) = f^(2)(x⋆) − max_{t : m_t = 2} f^(2)(x_t), where the capital Λ is the amount of the resource spent, e.g. seconds or dollars.

There is no reward for querying f^(1), but we can use its cheap evaluations to guide the search for x⋆ at f^(2).
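To make the bookkeeping behind S(Λ) concrete, here is a minimal Python sketch (not from the paper) that replays a query history and reports the simple regret once the capital is exhausted; the tuple layout, the f2_opt argument and all names are illustrative assumptions.

def simple_regret(history, f2_opt, capital):
    """Simple regret S(Lambda) after spending `capital` units of cost.

    history: list of (x_t, m_t, cost_t, value_t) tuples in query order,
    where value_t = f^(m_t)(x_t); f2_opt = f^(2)(x_star), which is only
    known in synthetic experiments. Layout and names are illustrative.
    """
    spent, best = 0.0, float("-inf")
    for x_t, m_t, cost_t, value_t in history:
        spent += cost_t
        if spent > capital:          # capital Lambda exhausted
            break
        if m_t == 2:                 # only fidelity-2 queries earn reward
            best = max(best, value_t)
    return f2_opt - best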

Challenges

[Figure: f^(2) with the band f^(2) ± ζ^(1), its approximation f^(1), the maximiser x⋆ of f^(2), and the maximiser x⋆^(1) of f^(1).]

◮ f^(1) is not just a noisy version of f^(2).
◮ We cannot just maximise f^(1): x⋆^(1) is suboptimal for f^(2).
◮ We need to explore f^(2) sufficiently well around the high-valued regions of f^(1), but not over too large a region.

Key Message: We will explore X using f^(1) and use f^(2) mostly in a promising region X_α.

MF-GP-UCB (Kandasamy et al. NIPS 2016b)
Multi-fidelity Gaussian Process Upper Confidence Bound

◮ Construct an upper confidence bound ϕ_t for f^(2) and choose the point x_t = argmax_{x ∈ X} ϕ_t(x), where µ^(m)_{t−1}, σ^(m)_{t−1} are the posterior GP mean and std-dev for f^(m):
    ϕ^(1)_t(x) = µ^(1)_{t−1}(x) + β_t^{1/2} σ^(1)_{t−1}(x) + ζ^(1)
    ϕ^(2)_t(x) = µ^(2)_{t−1}(x) + β_t^{1/2} σ^(2)_{t−1}(x)
    ϕ_t(x) = min{ ϕ^(1)_t(x), ϕ^(2)_t(x) }
◮ Choose the fidelity: m_t = 1 if β_t^{1/2} σ^(1)_{t−1}(x_t) > γ^(1), and m_t = 2 otherwise.
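The following is a minimal sketch of one MF-GP-UCB step, not the authors' implementation: it assumes a finite candidate grid over X, one scikit-learn GP per fidelity, and placeholder values for β_t, ζ^(1) and γ^(1).

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def mf_gp_ucb_step(X1, y1, X2, y2, candidates, beta, zeta1, gamma1):
    """One MF-GP-UCB step: pick the next point x_t and fidelity m_t.

    X1, y1 / X2, y2: past queries at f^(1) / f^(2) (2-d arrays, non-empty).
    candidates: finite grid over the domain X used to maximise phi_t.
    beta, zeta1, gamma1: confidence parameter, bias bound, threshold
    (illustrative placeholders, not the paper's recommended settings).
    """
    gp1 = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X1, y1)
    gp2 = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X2, y2)
    mu1, s1 = gp1.predict(candidates, return_std=True)
    mu2, s2 = gp2.predict(candidates, return_std=True)
    # Upper confidence bounds on f^(2) obtained from each fidelity.
    phi1 = mu1 + np.sqrt(beta) * s1 + zeta1
    phi2 = mu2 + np.sqrt(beta) * s2
    phi = np.minimum(phi1, phi2)          # phi_t(x) = min{phi^(1), phi^(2)}
    i = int(np.argmax(phi))
    x_t = candidates[i]
    # Query the cheap fidelity while it is still uncertain at x_t.
    m_t = 1 if np.sqrt(beta) * s1[i] > gamma1 else 2
    return x_t, m_t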

Theoretical Results for MF-GP-UCB

GP-UCB, κ an SE kernel (Srinivas et al. 2010): w.h.p.
    S(Λ) = f^(2)(x⋆) − max_{t : m_t = 2} f^(2)(x_t) ≲ sqrt( vol(X) / Λ )

MF-GP-UCB, κ an SE kernel (Kandasamy et al. NIPS 2016b): w.h.p., for all α > 0,
    S(Λ) ≲ sqrt( vol(X_α) / Λ ) + sqrt( vol(X) / Λ^{2−α} )
where X_α = { x : f^(2)(x⋆) − f^(1)(x) ≤ C_α ζ^(1) }.

A good approximation (small ζ^(1)) ⇒ vol(X_α) ≪ vol(X).

MF-GP-UCB with multiple approximations
Things work out.

Experiment: Viola & Jones Face Detection
22 threshold values for each cascade (d = 22). Fidelities with dataset sizes (300, 3000) (M = 2).
[Results plot omitted.]

Experiment: Cosmological Maximum Likelihood Inference
◮ Type Ia supernovae data.
◮ Maximum likelihood inference for 3 cosmological parameters: the Hubble constant H_0, the dark energy fraction Ω_Λ, and the dark matter fraction Ω_M.
◮ Likelihood: Robertson-Walker metric (Robertson 1936); requires numerical integration for each point in the dataset.

3 cosmological parameters (d = 3). Fidelities: integration on grids of size (10^2, 10^4, 10^6) (M = 3).
[Results plot omitted.]
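As an aside on how an integration grid acts as a fidelity, here is a hedged toy sketch in Python: a likelihood-like quantity computed by numerical integration whose accuracy and cost grow with grid resolution. The integrand and parameter are illustrative stand-ins, not the actual Robertson-Walker computation.

import numpy as np

def toy_log_likelihood(theta, grid_size):
    """Toy stand-in for the cosmology likelihood: grid_size sets the
    resolution of a numerical integral, so coarse grids are cheap but
    inaccurate and fine grids are expensive but accurate."""
    ts = np.linspace(0.0, 1.0, grid_size)
    integral = np.trapz(np.exp(-theta * ts ** 2), ts)   # trapezoidal rule
    return -integral

# The three fidelities from the slide: grids of size 10^2, 10^4 and 10^6.
for m, grid in enumerate((10 ** 2, 10 ** 4, 10 ** 6), start=1):
    print(f"fidelity m={m}, grid={grid}: {toy_log_likelihood(0.5, grid):.8f}")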

MF-GP-UCB Synthetic Experiment: Hartmann-3D (d = 3, M = 3)
[Histogram of query frequencies for Hartmann-3D: number of queries at each fidelity m = 1, 2, 3, plotted against f^(3)(x).]

Outline
1. A finite number of approximations (Kandasamy et al. NIPS 2016b)
   - Formalism, intuition and challenges
   - Algorithm
   - Theoretical results
   - Experiments
2. A continuous spectrum of approximations (Kandasamy et al. ICML 2017)
   - Formalism
   - Algorithm
   - Theoretical results
   - Experiments

Why continuous approximations?
- Use an arbitrary amount of data?
- Iterative algorithms: use an arbitrary number of iterations?

E.g. train an ML model with N• data and T• iterations, but use N < N• data and T < T• iterations to approximate the cross-validation performance at (N•, T•).

Approximations come from a continuous 2D "fidelity space" (N, T).

Scientific studies: simulations and numerical computations at varying, continuous levels of granularity.

Multi-fidelity Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)

A fidelity space Z and a domain X:
  Z ← all (N, T) values;  X ← all hyper-parameter values.

g : Z × X → R, where g([N, T], x) ← cross-validation accuracy when training with N data for T iterations at hyper-parameter x.

We wish to optimise f(x) = g(z•, x), where z• ∈ Z; here z• = [N•, T•].

End Goal: Find x⋆ = argmax_x f(x).

A cost function λ : Z → R+, e.g. λ(z) = λ(N, T) = O(N^2 T).

[Figure: g(z, x) over Z × X, the slice f(x) = g(z•, x) with maximiser x⋆, and the cost λ(z) over Z.]
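Below is a minimal sketch of this setup under stated assumptions: g([N, T], x) is approximated by training a scikit-learn SGDClassifier on N subsampled points for T epochs and scoring on a held-out set (a simplification of cross-validation), x is a single regularisation strength, the data are random placeholders, and z_star stands in for z•.

import numpy as np
from sklearn.linear_model import SGDClassifier

# Placeholder data; in practice these come from the problem at hand.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(5000, 10)), rng.integers(0, 2, 5000)
X_val, y_val = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)

def g(z, x):
    """Approximate validation accuracy at fidelity z = (N, T) and
    hyper-parameter x (here: the regularisation strength alpha)."""
    N, T = z
    idx = rng.choice(len(X_train), size=N, replace=False)  # N training points
    model = SGDClassifier(alpha=x, max_iter=T, tol=None)   # T passes over data
    model.fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

def cost(z):
    """Cost model lambda(N, T) = O(N^2 T) from the slide (constant dropped)."""
    N, T = z
    return N ** 2 * T

z_star = (len(X_train), 100)      # (N_bullet, T_bullet): the target fidelity
f = lambda x: g(z_star, x)        # f(x) = g(z_star, x) is what we optimise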

Multi-fidelity Simple Regret (Kandasamy et al. ICML 2017)

End Goal: Find x⋆ = argmax_x f(x).

Simple regret after capital Λ: S(Λ) = f(x⋆) − max_{t : z_t = z•} f(x_t), where Λ is the amount of a resource spent, e.g. computation time or money.

There is no reward for maximising the low fidelities, but we use cheap evaluations at z ≠ z• to speed up the search for x⋆.

BOCA: Bayesian Optimisation with Continuous Approximations (Kandasamy et al. ICML 2017)

Model g ∼ GP(0, κ) and compute the posterior GP: mean µ_{t−1} : Z × X → R and std-dev σ_{t−1} : Z × X → R+.

(1) x_t ← maximise the upper confidence bound for f(x) = g(z•, x):
    x_t = argmax_{x ∈ X} µ_{t−1}(z•, x) + β_t^{1/2} σ_{t−1}(z•, x)
(2) Z_t ≈ {z•} ∪ { z : σ_{t−1}(z, x_t) ≥ γ(z) }, where γ(z) = ξ(z) (λ(z)/λ(z•))^q.
(3) z_t = argmin_{z ∈ Z_t} λ(z)  (the cheapest z in Z_t).
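Here is a minimal sketch of one BOCA iteration, not the authors' code: it assumes finite candidate grids over X and Z, a single scikit-learn GP on the joint space (an isotropic RBF on the stacked [z, x] vector, a special case of a product kernel κ_Z · κ_X), and caller-supplied cost(z) and gamma(z) functions standing in for λ(z) and γ(z).

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def boca_step(ZX, y, x_cands, z_cands, z_star, cost, gamma, beta):
    """One BOCA step on past observations {((z_i, x_i), y_i)}.

    ZX: 2-d array of past [z, x] query locations; y: observed g values.
    x_cands, z_cands: finite grids over X and Z for the argmax/argmin.
    cost(z) ~ lambda(z); gamma(z) = xi(z) * (lambda(z)/lambda(z_star))**q,
    both supplied by the caller. Grids and parameters are illustrative.
    """
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(ZX, y)

    # (1) Maximise the UCB of f(x) = g(z_star, x) over the x candidates.
    zx_at_zstar = np.hstack([np.tile(z_star, (len(x_cands), 1)), x_cands])
    mu, sd = gp.predict(zx_at_zstar, return_std=True)
    x_t = x_cands[int(np.argmax(mu + np.sqrt(beta) * sd))]

    # (2) Fidelities whose posterior std-dev at x_t still exceeds gamma(z),
    #     always keeping z_star in the set; (3) query the cheapest of them.
    zx_at_xt = np.hstack([z_cands, np.tile(x_t, (len(z_cands), 1))])
    _, sd_z = gp.predict(zx_at_xt, return_std=True)
    candidates = [z for z, s in zip(z_cands, sd_z) if s >= gamma(z)]
    candidates.append(np.asarray(z_star))
    z_t = min(candidates, key=cost)
    return z_t, x_t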

Theoretical Results for BOCA

g ∼ GP(0, κ), where κ : (Z × X)^2 → R factorises as κ([z, x], [z′, x′]) = κ_X(x, x′) · κ_Z(z, z′).

[Figure: two example functions g(z, x), a "good" case and a "bad" case for multi-fidelity optimisation.]
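A minimal NumPy sketch of such a product kernel, assuming squared-exponential factors and illustrative length scales (the talk's κ_Z and κ_X need not be SE):

import numpy as np

def se_kernel(a, b, lengthscale):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def product_kernel(zx1, zx2, dim_z, ls_z=0.5, ls_x=0.2):
    """kappa([z, x], [z', x']) = kappa_X(x, x') * kappa_Z(z, z').

    zx1, zx2: arrays whose first dim_z columns are the fidelity z and whose
    remaining columns are the domain point x; length scales are illustrative.
    """
    k_z = se_kernel(zx1[:, :dim_z], zx2[:, :dim_z], ls_z)
    k_x = se_kernel(zx1[:, dim_z:], zx2[:, dim_z:], ls_x)
    return k_x * k_z

With SE factors this is the covariance that the BOCA sketch above uses implicitly, via an RBF on the stacked [z, x] vector.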
