
Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification
D. Calandriello* (1), L. Carratino* (2), A. Lazaric (3), M. Valko (1), L. Rosasco (2,4)
*Equal contribution. (1) DeepMind, (2) MaLGa - UniGe, (3) Facebook, (4) MIT - IIT



  1. Bayesian/Bandit Optimization. Set of candidates A = {x_1, ..., x_A} ⊂ R^d and an unknown reward function f: A → R. For t = 1, ..., T: (1) select a candidate x_t using a model u_t (ideally u_t ≈ f); (2) receive noisy feedback y_t = f(x_t) + η_t; (3) update the model u_t. Performance measure: cumulative regret w.r.t. the best candidate x*, R_T = Σ_{t=1}^T [f(x*) − f(x_t)]. Use a Gaussian process/kernelized bandit to model f. (See the loop sketch below.)
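To make the protocol concrete, here is a minimal, self-contained Python sketch of steps (1)-(3) and the regret accounting. The reward function, noise level, and the placeholder ucb_score are illustrative assumptions, not part of the paper; a GP-based score is sketched under the GP-UCB slide below.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(50, 2))   # candidate set: A points in R^d
f = lambda x: -np.sum(x ** 2)          # unknown reward (known here only for the demo)

def noisy_reward(x, noise=0.1):
    return f(x) + noise * rng.standard_normal()   # y_t = f(x_t) + eta_t

def ucb_score(X_t, Y_t, candidates):
    # stand-in for the model u_t; see the GP-UCB sketch further below
    return rng.uniform(size=len(candidates))

X_t, Y_t, R_T = [], [], 0.0
f_star = max(f(x) for x in A)          # f(x*), the best achievable reward
for t in range(100):
    x_t = A[np.argmax(ucb_score(X_t, Y_t, A))]   # (1) select candidate
    y_t = noisy_reward(x_t)                      # (2) receive noisy feedback
    X_t.append(x_t); Y_t.append(y_t)             # (3) update the model's data
    R_T += f_star - f(x_t)                       # cumulative regret R_T
```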

  2. Gaussian Process Optimization. Well studied: exploration vs. exploitation → no-regret (low error). Open question: performance vs. scalability. Batch-BKB achieves both: no-regret and scalable. (Image from Berkeley's CS 188.)

  3. Why Scalable GP Optimization is Hard. Experimental scalability: sequential vs. batch feedback. Computational scalability: exact GP vs. approximate GP. The catch: both batching and approximations increase regret.

  4. Landscape of No-Regret GP Optimization. Our solution, Batch-BKB, combines an approximate GP with a new adaptive schedule for batch size and approximation updates, and runs in Õ(T):

                           sequential                 batched
      exact GP, Õ(T³):     GP-UCB, IGP-UCB, GP-TS     GP-BUCB, Async-TS
      approximate GP:      BKB, Õ(T²)                 Batch-BKB, Õ(T) (ours)

  5. Choosing good candidates with GP-UCB. Let X_t = {x_1, ..., x_t} and Y_t = {y_1, ..., y_t}. Exact GP-UCB: u_t(·) = µ(·|X_t, Y_t) + β_t σ(·|X_t). [Sri+10]: u_t is a valid UCB. Sparse GP-UCB: ũ_t(·) = µ̃(·|X_t, Y_t, D_t) + β̃_t σ̃(·|X_t, D_t), with D_t ⊂ X_t a set of inducing points. [Cal+19]: ũ_t is a valid UCB if D_t is updated at every step t. (Both scores are sketched below.)
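A minimal sketch of the exact GP-UCB score u_t under a Gaussian kernel. Treating β_t as a fixed constant is an illustrative assumption; [Sri+10] derive a schedule for it.

```python
import numpy as np

def gauss_kernel(X, Z, lengthscale=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_ucb(candidates, X_t, Y_t, beta_t=2.0, noise=0.1):
    # exact GP posterior: invert the t x t kernel matrix
    K_inv = np.linalg.inv(gauss_kernel(X_t, X_t) + noise ** 2 * np.eye(len(X_t)))
    k_c = gauss_kernel(candidates, X_t)                     # cross-kernel
    mu = k_c @ K_inv @ Y_t                                  # posterior mean mu(.|X_t, Y_t)
    var = 1.0 - np.einsum('ij,jk,ik->i', k_c, K_inv, k_c)   # posterior variance
    return mu + beta_t * np.sqrt(np.maximum(var, 0.0))      # u_t(.)
```

The t × t inverse is what makes the exact score expensive as t grows; the sparse score replaces X_t with the much smaller D_t, as sketched under the next slide.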

  6. Performance vs Scalability. Better performance: collect more feedback and update the inducing points (resparsify), i.e., recompute ũ_t(·) = µ̃(·|X_t, Y_t, D_t) + β̃_t σ̃(·|X_t, D_t). Worse scalability: experimental cost and resparsification cost. Can we improve scalability by batching the feedback (as in GP-BUCB) and batching the resparsification? (An inducing-point posterior is sketched below.)
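To show where the savings come from, here is a minimal sketch of an inducing-point posterior in the DTC/Nyström style. The specific formulas, the jitter term, and the constant β̃_t are illustrative assumptions, not the paper's exact BKB estimator (BKB additionally prescribes how D_t is sampled and how β̃_t is set).

```python
import numpy as np

def gauss_kernel(X, Z, lengthscale=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def sparse_gp_ucb(candidates, X_t, Y_t, D_t, beta_t=2.0, noise=0.1):
    m = len(D_t)                                       # m = |D_t| inducing points, m << t
    K_nm = gauss_kernel(X_t, D_t)                      # t x m cross-kernel
    K_mm = gauss_kernel(D_t, D_t) + 1e-8 * np.eye(m)   # jitter for stability
    A = K_nm.T @ K_nm + noise ** 2 * K_mm              # m x m system instead of t x t
    k_cm = gauss_kernel(candidates, D_t)
    mu = k_cm @ np.linalg.solve(A, K_nm.T @ Y_t)       # approximate posterior mean
    var = (1.0
           - np.einsum('ij,ji->i', k_cm, np.linalg.solve(K_mm, k_cm.T))
           + noise ** 2 * np.einsum('ij,ji->i', k_cm, np.linalg.solve(A, k_cm.T)))
    return mu + beta_t * np.sqrt(np.maximum(var, 0.0))
```

Every solve here is against an m × m matrix, so the cost is roughly O(t m² + m³) rather than O(t³), which is exactly why rebuilding D_t (resparsifying) less often pays off.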

  7. Delayed Resparsification. New adaptive batching rule: do not resparsify until Σ_{i ∈ Batch} σ̃²(x_i) ≥ 1. "Not too big" lemma: ũ_t stays a valid UCB. "Not too small" lemma: batch-size = Ω(t). [Plot: BBKB batch size vs. t, growing roughly linearly up to ~4000 by t = 12000.] (See the sketch below.)
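A minimal, self-contained sketch of the rule above: keep selecting with the model frozen at the last resparsification, accumulate the approximate posterior variances of the batch, and only resparsify once their sum crosses the threshold. All helpers are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(-1, 1, size=(100, 2))

def frozen_variance(x):
    # stand-in for sigma~^2(x) under the model frozen at the last resparsification
    return 0.05

def resparsify():
    # stand-in for re-sampling the inducing points D_t and refitting u~_t
    pass

batch, acc_var, n_updates = [], 0.0, 0
for t in range(1000):
    x_t = candidates[rng.integers(len(candidates))]  # select with the frozen model
    batch.append(x_t)
    acc_var += frozen_variance(x_t)
    if acc_var >= 1.0:             # adaptive rule: sum of sigma~^2 over the batch >= 1
        resparsify()               # expensive update, amortized over the whole batch
        n_updates += 1
        batch, acc_var = [], 0.0   # start a new batch
```

Since σ̃² shrinks as data accumulates, each successive batch takes longer to reach the threshold, which is the intuition behind the Ω(t) batch-size lemma.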

  8. Batch-BKB Theorem. With high probability, Batch-BKB achieves no-regret with time complexity O(T d²_eff), where d_eff ≪ T is the effective dimension (degrees of freedom) of the GP. Comparisons: same regret as GP-UCB/IGP-UCB with better scalability (from O(T³) to O(T d²_eff)); larger batches than GP-BUCB; better regret and better scalability than Async-TS.

  9. In practice: Scalability. [Plots: runtime (sec) vs. t. Left, Cadata (A = 20640, d = 8, T = 2000): Global-BBKB against Batch-GPUCB, BKB, GPUCB, and async-TS. Right, NAS-bench-101 (A = 12416, d = 19, T = 12000): Global-BBKB against eps-Greedy and Regularized evolution.]
