Scalable Hyperparameter Transfer Learning


  1. Scalable Hyperparameter Transfer Learning. Valerio Perrone†, Rodolphe Jenatton†, Cédric Archambeau*, Matthias Seeger*. AWS AI†/Amazon Research*, Berlin

  2. Co-authors: R. Jenatton, C. Archambeau, M. Seeger. Most of the material is from V. Perrone, R. Jenatton, M. Seeger, C. Archambeau, "Scalable Hyperparameter Transfer Learning", NeurIPS 2018.

  3. Tuning deep neural nets for optimal performance. LeNet5 [LBBH98]. The search space X is large and diverse. Architecture: # hidden layers, activation functions, ... Model complexity: regularization, dropout, ... Optimisation parameters: learning rates, momentum, batch size, ...

  4. Two straightforward approaches (figure by Bergstra and Bengio, 2012). Exhaustive search on a regular or random grid. Complexity is exponential in the number of hyperparameters p. Wasteful of resources, but easy to parallelise. Memoryless: past evaluations do not inform future ones.
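The memoryless baseline above fits in a few lines; a minimal sketch (the function name and the quadratic test objective are illustrative, not from the slides):

```python
import numpy as np

def random_search(f, bounds, n_iter=50, seed=0):
    """Memoryless random search: sample uniformly in the box `bounds`
    and keep the best point seen so far; no past evaluation guides the next."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds).T  # bounds: list of (low, high) per dimension
    best_x, best_y = None, np.inf
    for _ in range(n_iter):
        x = rng.uniform(lo, hi)
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y
```

Grid search would replace the uniform sampling with a fixed lattice, which is why its cost is exponential in the dimension p.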

  5. Hyperparameter transfer learning

  9. Motivation. Transfer learning: exploit evaluations of related past tasks, e.g. a given ML algorithm tuned over different datasets. Can we do it in the absence of meta-data? Scalability: both with respect to the number of evaluations, Σ_{t=1}^T N_t, and the number of tasks, T.

  10. Black-box global optimisation
  The function f to optimise can be non-convex. The number of hyperparameters p is moderate (typically < 20). Our goal is to solve the following optimisation problem:
  x* = argmin_{x ∈ X} f(x)
  Evaluating f(x) is expensive. No analytical form or gradient. Evaluations may be noisy.

  12. Example: tuning deep neural nets [SLA12, SRS+15, KFB+16]. LeNet5 [LBBH98]. f(x) is the validation loss of the neural net as a function of its hyperparameters x. Evaluating f(x) is very costly (up to weeks!).

  13. Bayesian (black-box) optimisation [MTZ78, SSW+16]
  x* = argmin_{x ∈ X} f(x)
  Canonical algorithm:
  - Surrogate model M of f  # cheaper to evaluate
  - Set of evaluated candidates C = {}
  - While some BUDGET available:
    - Select candidate x_new ∈ X using M and C  # exploration/exploitation
    - Collect evaluation y_new of f at x_new  # time-consuming
    - Update C = C ∪ {(x_new, y_new)}
    - Update M with C  # update surrogate model
    - Update BUDGET
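The canonical loop can be sketched in Python; a minimal illustration, assuming a generic `surrogate` object with `fit`/`predict` methods and an `acquisition` function to be minimised (these interfaces are my own, not from the slides):

```python
import numpy as np

def bayesian_optimisation(f, bounds, surrogate, acquisition,
                          budget=20, n_cand=256, seed=0):
    """Canonical BO loop: fit the surrogate M on the evaluated candidates C,
    pick the next x by minimising the acquisition over random candidates,
    evaluate f (the time-consuming step), and update C and the budget."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds).T
    X, y = [], []  # the set C of evaluated candidates
    X.append(rng.uniform(lo, hi))  # one random point so M can be fit
    y.append(f(X[-1]))
    for _ in range(budget - 1):
        surrogate.fit(np.array(X), np.array(y))
        cand = rng.uniform(lo, hi, size=(n_cand, len(lo)))
        mu, sigma = surrogate.predict(cand)
        x_new = cand[np.argmin(acquisition(mu, sigma))]  # explore/exploit
        X.append(x_new)
        y.append(f(x_new))  # time-consuming
    i = int(np.argmin(y))
    return X[i], y[i]
```

Here the acquisition is optimised by random sampling for brevity; in practice one would use gradients, which is where the auto-differentiation discussed later comes in.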

  18. Bayesian (black-box) optimisation with Gaussian processes
  1) Learn a probabilistic model of f, which is cheap to evaluate: y_i | f(x_i) ~ Gaussian(f(x_i), ς²), f(x) ~ GP(0, K).
  2) Given the observations y = (y_1, ..., y_n), compute the predictive mean and the predictive standard deviation.
  3) Repeatedly query f by balancing exploitation against exploration.
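Step 2 has a standard closed form; a sketch with an RBF kernel (the kernel choice and hyperparameter values are illustrative assumptions, not from the talk):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel K between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise=1e-2, lengthscale=1.0):
    """Predictive mean and std of a zero-mean GP with kernel K,
    given noisy observations y_i ~ N(f(x_i), noise)."""
    K = rbf_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(X_star, X_star, lengthscale).diagonal() - (v ** 2).sum(0)
    return mu, np.sqrt(np.maximum(var, 0.0))
```

Far from the data, the predictive std reverts to the prior, which is exactly the uncertainty the acquisition function exploits in step 3.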

  21. Where is the minimum of f ( x )?

  22. Bayesian optimisation in practice (image credit: Javier González)

  23. Bayesian optimization with transfer learning
  Problem statement: T functions {f_t(x)}_{t=1}^T with observations D_t = {(x_n^t, y_n^t)}_{n=1}^{N_t}. May or may not have meta-data (or contextual features) for {f_t(x)}_{t=1}^T. Goal: optimize some fixed f_{t0}(x) while exploiting {D_t}_{t=1}^T (this is not multi-objective!).
  Previous work: multitask GP (Swersky et al. 2013, Poloczek et al. 2016); GP + filtering evaluations by task similarity (Feurer et al. 2015); various ensemble-based approaches, with GPs (Feurer et al. 2018) or feedforward NNs (Schilling et al. 2015).

  25. What is wrong with the Gaussian process surrogate? Scaling is O(N³) in the number of evaluations N.

  26. Adaptive Bayesian linear regression (ABLR) [Bis06]
  The model: P(y | w, z, β) = ∏_n N(φ_z(x_n)^T w, β⁻¹), P(w | α) = N(0, α⁻¹ I_D).
  The predictive distribution: P(y* | x*, D) = ∫ P(y* | x*, w) P(w | D) dw = N(μ_t(x*), σ_t²(x*)).
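This predictive distribution is available in closed form; a sketch in NumPy, where `Phi` stands for the stacked feature vectors φ_z(x_n) (the function name and default hyperparameter values are mine, for illustration):

```python
import numpy as np

def ablr_predict(Phi, y, Phi_star, alpha=1.0, beta=10.0):
    """Closed-form predictive distribution of Bayesian linear regression:
    w ~ N(0, alpha^-1 I_D), y_n ~ N(phi_z(x_n)^T w, beta^-1)."""
    D = Phi.shape[1]
    A = alpha * np.eye(D) + beta * Phi.T @ Phi      # posterior precision of w
    m = beta * np.linalg.solve(A, Phi.T @ y)        # posterior mean of w
    mu = Phi_star @ m                               # predictive mean
    var = (Phi_star * np.linalg.solve(A, Phi_star.T).T).sum(1) + 1.0 / beta
    return mu, var
```

The cost is cubic only in the number of features D, not in the number of evaluations N, which is the scalability argument against the plain GP surrogate.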

  27. Multi-task ABLR for transfer learning
  1) Multi-task extension of the model: P(y_t | w_t, z, β_t) = ∏_n N(φ_z(x_n^t)^T w_t, β_t⁻¹), P(w_t | α_t) = N(0, α_t⁻¹ I_D).
  2) Shared features φ_z(x): explicit feature set (e.g., RBF); random kitchen sinks [RR+07]; learned by a feedforward neural net.
  3) Multi-task objective: ρ(z, {α_t, β_t}_{t=1}^T) = −Σ_{t=1}^T log P(y_t | z, α_t, β_t).

  28. Examples of φ_z
  Feedforward neural networks: φ_z(x) = a_L(Z_L a_{L−1}(... Z_2 a_1(Z_1 x) ...)). z consists of all {Z_l}_{l=1}^L.
  Random Fourier features: φ_z(x) = √(2/D) cos(σ⁻¹ U x + b), with U ~ N(0, I) and b ~ U([0, 2π]). z only consists of 1/σ.
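The random Fourier features map follows directly from the formula; a sketch (the function name and defaults are illustrative):

```python
import numpy as np

def random_fourier_features(X, D=64, sigma=1.0, seed=0):
    """phi_z(x) = sqrt(2/D) cos((1/sigma) U x + b), with U ~ N(0, I) and
    b ~ U([0, 2*pi]). U and b are drawn once and kept fixed; only 1/sigma
    is a learnable parameter z."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[1], D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ U / sigma + b)
```

Inner products of these features approximate an RBF kernel with bandwidth σ, which is what makes the linear model in the features behave like an approximate GP.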

  29. Pictorial summary of ABLR

  30. Posterior inference
  Hyperparameters: {α_t, β_t}_{t=1}^T for each task t; z for the shared basis function.
  Empirical Bayesian approach: marginalize out the Bayesian linear regression parameters {w_t}_{t=1}^T; jointly learn the hyperparameters of the model, {α_t, β_t}_{t=1}^T and z.
  Minimize ρ(z, {α_t, β_t}_{t=1}^T) = −Σ_{t=1}^T log P(y_t | X_t, α_t, β_t, z).

  31. Posterior inference (cont'd)
  We have closed forms for the posterior mean and variance:
  μ_t(x*_t; D_t, α_t, β_t, z) = (β_t/α_t) φ_z(x*_t)^T K_t⁻¹ Φ_t^T y_t
  σ_t²(x*_t; D_t, α_t, β_t, z) = (1/α_t) φ_z(x*_t)^T K_t⁻¹ φ_z(x*_t) + 1/β_t
  and the marginal likelihood:
  ρ(z, {α_t, β_t}_{t=1}^T) = −Σ_{t=1}^T [ (N_t/2) log β_t − (β_t/2) (||y_t||² − (β_t/α_t) ||c_t||²) − Σ_{i=1}^D log([L_t]_ii) ]
  with the Cholesky factorization K_t = (β_t/α_t) Φ_t^T Φ_t + I_D = L_t L_t^T and c_t = L_t⁻¹ Φ_t^T y_t.
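The per-task term of ρ can be sketched directly from the Cholesky factorization (up to the constant (N_t/2) log 2π; the function name is mine):

```python
import numpy as np

def ablr_neg_log_marginal(Phi, y, alpha, beta):
    """Negative log marginal likelihood -log P(y | X, alpha, beta, z) for
    one task, up to an additive constant, via the Cholesky factor of
    K = (beta/alpha) Phi^T Phi + I_D = L L^T, with c = L^-1 Phi^T y."""
    N, D = Phi.shape
    K = (beta / alpha) * Phi.T @ Phi + np.eye(D)
    L = np.linalg.cholesky(K)
    c = np.linalg.solve(L, Phi.T @ y)
    return (-0.5 * N * np.log(beta)
            + 0.5 * beta * (y @ y - (beta / alpha) * c @ c)
            + np.log(np.diag(L)).sum())
```

The D×D Cholesky makes each task cost O(N_t D² + D³), which is what keeps the objective scalable in the number of evaluations.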

  32. Leveraging MXNet
  In Bayesian optimization, derivatives are needed for:
  Posterior inference: (z, {α_t, β_t}_{t=1}^T) ↦ ρ(z, {α_t, β_t}_{t=1}^T)
  Acquisition functions A, typically of the form (e.g., EI, PI, UCB, ...): x* ↦ A(μ_t(x*; D_t, α_t, β_t, z), σ_t²(x*; D_t, α_t, β_t, z))
  Leverage MXNet (Seeger et al. 2017): auto-differentiation; backward operator for Cholesky; can use any φ_z.
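As one concrete acquisition of the form A(μ, σ²), expected improvement for minimisation can be sketched as follows (an illustration in NumPy/SciPy, not the talk's MXNet implementation):

```python
import numpy as np
from scipy.special import erf

def expected_improvement(mu, sigma, y_best):
    """EI for minimisation: E[max(y_best - y, 0)] under the surrogate's
    predictive distribution N(mu(x*), sigma(x*)^2)."""
    sigma = np.maximum(sigma, 1e-12)      # guard against zero predictive std
    z = (y_best - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    cdf = 0.5 * (1.0 + erf(z / np.sqrt(2.0)))
    return (y_best - mu) * cdf + sigma * pdf
```

Because EI is a smooth function of μ and σ², auto-differentiation through the predictive equations yields gradients of A with respect to x*, enabling gradient-based acquisition optimisation.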

  33. Optimization of the marginal likelihood
  Optimization properties: number of tasks T ≈ a few tens; number of points per task N_t ≫ 1; not the standard SGD regime.
  We apply L-BFGS jointly over all parameters z and {α_t, β_t}_{t=1}^T.
  Warm-start parameters: re-convergence in very few steps.
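The warm-started L-BFGS fit can be sketched with SciPy in place of the talk's MXNet-based implementation (an assumption for illustration; `objective` must return both the value and its gradient, e.g. via auto-differentiation):

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(objective, theta0):
    """Joint L-BFGS over all parameters (z and {alpha_t, beta_t});
    theta0 is the warm start, e.g. the optimum from the previous
    Bayesian optimisation iteration."""
    res = minimize(objective, theta0, method="L-BFGS-B", jac=True)
    return res.x, res.nit  # optimum and number of iterations taken
```

Re-running from the previous optimum typically converges in very few steps, since each new evaluation changes the objective only slightly.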
