High Dimensional Bayesian Optimisation and Bandits via Additive Models
Kirthevasan Kandasamy, Jeff Schneider, Barnabás Póczos
ICML '15, July 8, 2015
Bandits & Optimisation

Motivating example: maximum likelihood inference in computational astrophysics. A cosmological simulator maps parameters (e.g. the Hubble constant, the baryonic density) to an observation, and we seek the parameter values that maximise the likelihood of the observed data.

More generally, the objective is an expensive black-box function. Further examples: hyper-parameter tuning in ML, optimal control strategies in robotics.
f : [0,1]^D → R is an expensive, black-box, nonconvex function. Let x* = argmax_x f(x).

[Figure: a 1-D example of f on [0,1], with the maximiser x* and the value f(x*) marked.]
Optimisation ≅ minimise the simple regret:

$S_T = f(x^*) - \max_{t=1,\dots,T} f(x_t)$.
Bandits ≅ minimise the cumulative regret:

$R_T = \sum_{t=1}^{T} \big( f(x^*) - f(x_t) \big)$.
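To make the two notions concrete, here is a minimal sketch of how one would measure them after T queries; it assumes a cheap stand-in objective f (in practice f is expensive and x* is unknown, so regret is only computable in synthetic experiments):

```python
import numpy as np

def regrets(f, xs, x_star):
    """Simple and cumulative regret of the queries x_1, ..., x_T.

    Only computable in synthetic settings where f is cheap and the
    maximiser x_star is known.
    """
    vals = np.array([f(x) for x in xs])
    f_opt = f(x_star)
    simple = f_opt - vals.max()            # S_T = f(x*) - max_t f(x_t)
    cumulative = (f_opt - vals).sum()      # R_T = sum_t (f(x*) - f(x_t))
    return simple, cumulative
```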
Gaussian Process (Bayesian) Optimisation

Model f ∼ GP(0, κ) and, after each evaluation, obtain the posterior GP.

[Figure: GP posterior mean and confidence band over [0,1].]

At step t, maximise an acquisition function φ_t: x_t = argmax_x φ_t(x).

GP-UCB: $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$ (Srinivas et al. 2010).

Other choices of φ_t: Expected Improvement (GP-EI), Thompson sampling, etc.

[Figure: the acquisition φ_t with its maximiser, here x_t = 0.828.]
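A minimal sketch of one GP-UCB step, assuming an SE kernel, a fixed grid of candidate points, and an illustrative noise level; the hyper-parameters (h, A, eta) and the grid-based maximisation are simplifying choices, not the paper's setup:

```python
import numpy as np

def se_kernel(X1, X2, h=0.1, A=1.0):
    """SE kernel matrix: A * exp(-||x - x'||^2 / (2 h^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return A * np.exp(-sq_dists / (2 * h ** 2))

def gp_ucb_step(X, y, candidates, beta_t, eta=1e-3):
    """Choose x_t = argmax_x mu_{t-1}(x) + beta_t^{1/2} sigma_{t-1}(x).

    X, y       : points queried so far and their observed values
    candidates : grid over [0,1]^D on which the acquisition is maximised
    """
    K = se_kernel(X, X) + eta * np.eye(len(X))   # noisy Gram matrix
    Ks = se_kernel(candidates, X)
    mu = Ks @ np.linalg.solve(K, y)              # posterior mean
    # posterior variance (prior variance is A = 1 with the default kernel)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    ucb = mu + np.sqrt(beta_t * np.maximum(var, 0.0))
    return candidates[np.argmax(ucb)]
```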
Scaling to Higher Dimensions

Two key challenges:
- Statistical difficulty: nonparametric sample complexity is exponential in D.
- Computational difficulty: optimising φ_t to within ζ accuracy requires O(ζ^{-D}) effort.

Existing work:
- Chen et al. 2012: f depends on a small number of variables; identify those variables, then run GP-UCB.
- Wang et al. 2013: f varies only along a lower-dimensional subspace; run GP-EI on a random subspace.
- Djolonga et al. 2013: f varies only along a lower-dimensional subspace; find the subspace, then run GP-UCB.

All of these perform BO on a low-dimensional subspace, but the subspace assumption is too strong in realistic settings.
Additive Functions

Structural assumption:

$f(x) = f^{(1)}(x^{(1)}) + f^{(2)}(x^{(2)}) + \dots + f^{(M)}(x^{(M)})$,

where $x^{(j)} \in \mathcal{X}^{(j)} = [0,1]^d$ with $d \ll D$, and the coordinate groups are disjoint: $x^{(i)} \cap x^{(j)} = \emptyset$.

E.g. $f(x_{\{1,\dots,10\}}) = f^{(1)}(x_{\{1,3,9\}}) + f^{(2)}(x_{\{2,4,8\}}) + f^{(3)}(x_{\{5,6,10\}})$; coordinate 7 appears in no group. Call $\{\mathcal{X}^{(j)}\}_{j=1}^{M} = \{(1,3,9), (2,4,8), (5,6,10)\}$ the "decomposition".

Assume each $f^{(j)} \sim GP(0, \kappa^{(j)})$. Then $f \sim GP(0, \kappa)$, where

$\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x^{(1)\prime}) + \dots + \kappa^{(M)}(x^{(M)}, x^{(M)\prime})$.

Given observations $(X, Y) = \{(x_i, y_i)\}_{i=1}^{T}$ and a test point $x_\dagger$, each component has a Gaussian posterior:

$f^{(j)}(x_\dagger^{(j)}) \mid X, Y \sim \mathcal{N}\big(\mu^{(j)}, \sigma^{(j)2}\big)$.
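A sketch of how this additive kernel could be assembled, assuming SE kernels on each group; the decomposition below is the 0-indexed version of the {(1,3,9), (2,4,8), (5,6,10)} example, and the bandwidth is illustrative:

```python
import numpy as np

def se_kernel(X1, X2, h=0.1):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * h ** 2))

def additive_kernel(X1, X2, groups):
    """kappa(x, x') = sum_j kappa^(j)(x^(j), x'^(j)) over disjoint groups."""
    return sum(se_kernel(X1[:, g], X2[:, g]) for g in groups)

# 0-indexed decomposition for the example above; coordinate 7
# (index 6) is deliberately left out of every group.
groups = [[0, 2, 8], [1, 3, 7], [4, 5, 9]]
```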
Outline

1. GP-UCB
2. The Add-GP-UCB algorithm
   - Bounds on S_T: from exponential in D to linear in D.
   - An easy-to-optimise acquisition function.
   - Performs well even when f is not additive.
3. Experiments
4. Conclusion & some open questions
GP-UCB

$x_t = \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$

Squared exponential (SE) kernel: $\kappa(x, x') = A \exp\big(-\|x - x'\|^2 / (2h^2)\big)$.

Theorem (Srinivas et al. 2010). Let f ∼ GP(0, κ) with an SE kernel. Then w.h.p.

$S_T \in O\bigg(\sqrt{\frac{D^D (\log T)^D}{T}}\bigg)$.
GP-UCB on an additive κ

Suppose f ∼ GP(0, κ) where

$\kappa(x, x') = \kappa^{(1)}(x^{(1)}, x^{(1)\prime}) + \dots + \kappa^{(M)}(x^{(M)}, x^{(M)\prime})$

and each κ^{(j)} is an SE kernel. It can be shown that

$S_T \in O\bigg(\sqrt{\frac{D^2 d^d (\log T)^d}{T}}\bigg)$.

But the acquisition $\varphi_t = \mu_{t-1} + \beta_t^{1/2} \sigma_{t-1}$ is still D-dimensional!
Add-GP-UCB

$\tilde{\varphi}_t(x) = \sum_{j=1}^{M} \underbrace{\mu^{(j)}_{t-1}(x^{(j)}) + \beta_t^{1/2} \sigma^{(j)}_{t-1}(x^{(j)})}_{\tilde{\varphi}^{(j)}_t(x^{(j)})}$

Maximise each $\tilde{\varphi}^{(j)}_t$ separately: this requires only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort (vs $O(\zeta^{-D})$ for GP-UCB).

Theorem. Let $f^{(j)} \sim GP(0, \kappa^{(j)})$ and $f = \sum_j f^{(j)}$. Then w.h.p.

$S_T \in O\bigg(\sqrt{\frac{D^2 d^d (\log T)^d}{T}}\bigg)$.
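A sketch of this computational trick, assuming a hypothetical callback group_acq(j, grid) that returns the posterior mean and standard deviation of f^(j) on a grid of d-dimensional points; each φ̃^(j) is maximised on its own small grid and the winning coordinates are stitched into one D-dimensional query:

```python
import itertools
import numpy as np

def maximise_add_acquisition(group_acq, groups, D, beta_t, n_grid=25):
    """Maximise each phi~^(j) separately: O(M * n_grid^d) acquisition
    evaluations instead of the n_grid^D a joint grid search would need."""
    x_next = np.empty(D)
    axis = np.linspace(0.0, 1.0, n_grid)
    for j, g in enumerate(groups):
        # d-dimensional grid over [0,1]^d for this group only
        grid = np.array(list(itertools.product(axis, repeat=len(g))))
        mu, sigma = group_acq(j, grid)       # posterior of f^(j) on the grid
        ucb = mu + np.sqrt(beta_t) * sigma   # phi~^(j) = mu^(j) + beta_t^(1/2) sigma^(j)
        x_next[g] = grid[np.argmax(ucb)]     # fill in this group's coordinates
    return x_next
```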
Summary of Theoretical Results (for the SE kernel)

- GP-UCB, no assumption on f: $S_T \in O\big(D^{D/2} (\log T)^{D/2}\, T^{-1/2}\big)$; maximising $\varphi_t$ takes $O(\zeta^{-D})$ effort.
- GP-UCB on additive f: $S_T \in O\big(D\, T^{-1/2}\big)$; maximising $\varphi_t$ still takes $O(\zeta^{-D})$ effort.
- Add-GP-UCB on additive f: $S_T \in O\big(D\, T^{-1/2}\big)$; maximising $\tilde{\varphi}_t$ takes only $O(\mathrm{poly}(D)\, \zeta^{-d})$ effort.
Add-GP-UCB in two dimensions: $f(x_{\{1,2\}}) = f^{(1)}(x_{\{1\}}) + f^{(2)}(x_{\{2\}})$.

[Figure: contour plot of f over [0,1]^2 with the 1-D components f^{(1)} and f^{(2)} on the margins; the two 1-D acquisitions are maximised separately, giving x_t^{(1)} = 0.869 and x_t^{(2)} = 0.141.]