Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification
D. Calandriello*¹, L. Carratino*², A. Lazaric³, M. Valko¹, L. Rosasco²ꞌ⁴ (* equal contribution)
¹ DeepMind, ² MaLGa - UniGe, ³ Facebook, ⁴ MIT - IIT
Bayesian/Bandit Optimization

Set of candidates A = {x_1, ..., x_A} ⊂ R^d, unknown reward function f : A → R

for t = 1, ..., T:
  (1) Select candidate x_t using model u_t (ideally u_t ≈ f)
  (2) Receive noisy feedback y_t = f(x_t) + η_t
  (3) Update model u_t

Performance measure: cumulative regret w.r.t. the best x*:  R_T = Σ_{t=1}^{T} f(x*) − f(x_t)

Use a Gaussian process / kernelized bandit to model f
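The three-step loop above can be sketched generically. This is a minimal illustration, not the paper's method: `feedback` and `upper_bound` are hypothetical callbacks standing in for the environment and for whatever model produces the scores u_t.

```python
import numpy as np

def bandit_optimize(candidates, feedback, upper_bound, T):
    """Generic Bayesian/bandit optimization loop.

    candidates  : array of shape (A, d) -- the candidate set A
    feedback    : function x -> noisy reward y_t = f(x) + eta_t
    upper_bound : function (candidates, history) -> scores u_t(x) for all x
    """
    history = []  # pairs (x_t, y_t) observed so far
    for t in range(T):
        scores = upper_bound(candidates, history)  # model-based scores u_t
        x_t = candidates[np.argmax(scores)]        # (1) select candidate
        y_t = feedback(x_t)                        # (2) noisy feedback
        history.append((x_t, y_t))                 # (3) data for model update
    return history
```

Any UCB-style strategy fits this template; the algorithms on the following slides differ only in how `upper_bound` is computed and how often its internal model is refreshed.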
Gaussian Process Optimization

Well studied: exploration vs exploitation → no-regret (low error)
Open question: performance vs scalability?
Batch BKB: no-regret and scalable

(Image from Berkeley's CS 188)
Why Scalable GP Optimization is Hard

Experimental scalability: sequential vs batch
Computational scalability: exact GP vs approximate GP
Batching and approximations increase regret
Landscape of No-Regret GP Optimization

                          sequential         batched
exact GP, Õ(T³):          GP-UCB, IGP-UCB    GP-BUCB, GP-TS, Async-TS
approximate GP, Õ(T²):    BKB                —
approximate GP, Õ(T):     —                  Batch BKB (ours)

Our solution, Batch BKB: approximate GP with a new adaptive schedule for batch size and approximation updates
Choosing Good Candidates with GP-UCB

X_t = {x_1, ..., x_t}, Y_t = {y_1, ..., y_t}

Exact GP-UCB:  u_t(·) = μ(· | X_t, Y_t) + β_t σ(· | X_t)
[Sri+10]: u_t is a valid UCB.

Sparse GP-UCB:  ũ_t(·) = μ̃(· | X_t, Y_t, D_t) + β̃_t σ̃(· | X_t, D_t), with D_t ⊂ X_t the inducing points
[Cal+19]: ũ_t is a valid UCB if D_t is updated at every t.
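The exact GP-UCB index u_t(·) = μ(· | X_t, Y_t) + β_t σ(· | X_t) can be sketched directly from the GP posterior formulas. A minimal numpy version, assuming an RBF kernel and illustrative values for β_t and the noise level (neither is prescribed by the slide):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def gp_ucb_scores(X_t, y_t, cand, beta=2.0, noise=0.1):
    """Exact GP-UCB index u_t(x) = mu(x | X_t, Y_t) + beta * sigma(x | X_t)."""
    K = rbf(X_t, X_t) + noise * np.eye(len(X_t))  # regularized kernel matrix
    K_inv = np.linalg.inv(K)                      # O(t^3): the bottleneck
    k_star = rbf(cand, X_t)                       # cross-covariances
    mu = k_star @ K_inv @ y_t                     # posterior mean
    var = rbf(cand, cand).diagonal() - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star)
    sigma = np.sqrt(np.maximum(var, 0.0))         # posterior std deviation
    return mu + beta * sigma
```

The O(t³) inversion of the full kernel matrix is exactly what the sparse variant avoids: it replaces X_t by the inducing points D_t inside μ̃ and σ̃.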
Performance vs Scalability

Better performance: collect more feedback, update inducing points (resparsify)
  ũ_t(·) = μ̃(· | X_t, Y_t, D_t) + β̃_t σ̃²(· | X_t, D_t)
Worse scalability: experimental cost, resparsification cost
Improve scalability: batching feedback (GP-BUCB), batching resparsification?
Delayed Resparsification

New adaptive batching rule: do not resparsify until Σ_{i ∈ Batch} σ̃²(x_i) ≥ 1

"Not too big" Lemma: ũ_t remains a valid UCB
"Not too small" Lemma: batch-size = Ω(t)

[Figure: batch size of BBKB vs t — batches grow roughly linearly, reaching thousands of points by t = 12000]
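The adaptive rule above can be sketched by grouping selections into batches that close once their accumulated posterior variance reaches the threshold. This is an illustration of the batching logic only (the helper name and threshold handling are ours, not the authors' implementation); since posterior variances shrink over time, later batches grow longer, which is the intuition behind the "not too small" lemma.

```python
def adaptive_batches(variances, threshold=1.0):
    """Given the posterior variances sigma~^2(x_t) of the selected points
    (in selection order), return the batches induced by the rule:
    resparsify only once the batch's variance sum reaches the threshold."""
    batches, current, acc = [], [], 0.0
    for t, v in enumerate(variances):
        current.append(t)
        acc += v                     # variance under the stale sparse model
        if acc >= threshold:         # rule triggers: resparsify, start anew
            batches.append(current)
            current, acc = [], 0.0
    if current:                      # trailing, still-open batch
        batches.append(current)
    return batches
```

For example, with variances halving every batch, each batch is twice as long as the previous one, so resparsifications become progressively rarer while the UCB stays valid.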
Batch-BKB Theorem

With high probability, Batch-BKB achieves no-regret with time complexity Õ(T d²_eff), where d_eff ≪ T is the effective dimension / degrees of freedom of the GP.

Comparisons:
- Same regret as GP-UCB/IGP-UCB with better scalability (from Õ(T³) to Õ(T d²_eff))
- Larger batches than GP-BUCB
- Better regret and better scalability than Async-TS
In Practice: Scalability

Cadata: A = 20640, d = 8, T = 2000
NAS-bench-101: A = 12416, d = 19, T = 12000

[Figure: wall-clock time (sec) vs t on both datasets. Methods compared: Batch-GPUCB, BKB, Global-BBKB, GPUCB, async-TS on Cadata; eps-Greedy, Regularized evolution, Global-BBKB on NAS-bench-101]