Bayesian Meta-Learning (CS 330)
Reminders
- Homework 2 due next Friday.
- Project group form due today.
- Project proposal due in one week.
- Project proposal presentations in one week. (full schedule released on Friday)
Plan for Today
- Why be Bayesian?
- Bayesian meta-learning approaches
  - black-box approaches
  - optimization-based approaches
- How to evaluate Bayesian meta-learners.

Goals for the end of lecture:
- Understand the interpretation of meta-learning as Bayesian inference
- Understand techniques for representing uncertainty over parameters and predictions
Disclaimers
- Bayesian meta-learning is an active area of research (like most of the class content): more questions than answers.
- This lecture covers some of the most advanced and mathiest topics of the course, so ask questions!
Recap from last week: computation graph perspective
- Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$
- Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$, where $\phi_i$ is obtained by gradient descent on $\mathcal{D}^{tr}_i$ starting from $\theta$
- Non-parametric: $y^{ts} = \mathrm{softmax}(-d(f_\theta(x^{ts}), c_n))$, where $c_n = \frac{1}{K}\sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbb{1}(y = n)\, f_\theta(x)$
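As a concrete reference for the non-parametric column, here is a minimal sketch (in PyTorch) of computing class prototypes and the softmax over negative distances; the `embed` network, the tensor shapes, and the Euclidean choice of $d$ are assumptions, not from the slides.

```python
import torch
import torch.nn.functional as F

def nonparametric_predict(embed, support_x, support_y, query_x, n_classes):
    """Prototypical-network-style prediction: softmax over negative distances
    to class prototypes c_n averaged from the support (training) set."""
    z_support = embed(support_x)              # f_theta(x) for D^tr_i, shape [N*K, d]
    z_query = embed(query_x)                  # f_theta(x^ts), shape [Q, d]
    prototypes = torch.stack([                # c_n = (1/K) * sum over support points with label n
        z_support[support_y == n].mean(dim=0) for n in range(n_classes)
    ])                                        # shape [n_classes, d]
    d = torch.cdist(z_query, prototypes)      # Euclidean distances, shape [Q, n_classes]
    return F.softmax(-d, dim=-1)              # y^ts = softmax(-d)
```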
Recap from last week: algorithmic properties perspective
- Expressive power: the ability of $f$ to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.
- Consistency: the learned learning procedure will solve the task with enough data. Why? Reduces reliance on meta-training tasks, good OOD task performance.

These properties are important for most applications!
Recap from last week: algorithmic properties perspective
- Expressive power: the ability of $f$ to represent a range of learning procedures. Why? Scalability, applicability to a range of domains.
- Consistency: the learned learning procedure will solve the task with enough data. Why? Reduces reliance on meta-training tasks, good OOD task performance.
- Uncertainty awareness: the ability to reason about ambiguity during learning. Why? Active learning, calibrated uncertainty, RL. Principled Bayesian approaches (*this lecture*).
Plan for Today
- Why be Bayesian?
- Bayesian meta-learning approaches
  - black-box approaches
  - optimization-based approaches
- How to evaluate Bayesian meta-learners.
Mul,-Task & Meta-Learning Principles Training and tes8ng must match. Tasks must share “structure.” What does “structure” mean? sta8s8cal dependence on shared latent informa8on θ If you condi8on on that informa8on, - task parameters become independent i.e. ϕ i 1 ⊥ ⊥ ϕ i 2 ∣ θ and are not otherwise independent ϕ i 1 ⊥ ⊥ / ϕ i 2 - hence, you have a lower entropy i.e. ℋ ( p ( ϕ i | θ )) < ℋ ( p ( ϕ i )) Thought exercise #1 : If you can iden8fy (i.e. with meta-learning) , θ when should learning be faster than learning from scratch? ϕ i Thought exercise #2 : what if ℋ ( p ( ϕ i | θ )) = 0 ∀ i ? 9
Mul,-Task & Meta-Learning Principles Training and tes8ng must match. Tasks must share “structure.” What does “structure” mean? sta8s8cal dependence on shared latent informa8on θ What informa8on might contain… θ …in a toy sinusoid problem? corresponds to family of sinusoid funcAons θ (everything but phase and amplitude) …in mul8-language machine transla8on? θ corresponds to the family of all language pairs Note that is narrower than the space of all possible funcAons. θ Thought exercise #3 : What if you meta-learn without a lot of tasks? “meta-overfiKng” to the family of training func8ons 10
Recall parametric approaches: use a deterministic $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$ (i.e. a point estimate).

Why/when is this a problem? Few-shot learning problems may be ambiguous (even with the prior).

Can we learn to generate hypotheses about the underlying function, i.e. sample from $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$?

Important for:
- safety-critical few-shot learning (e.g. medical imaging)
- learning to actively learn
- learning to explore in meta-RL

Active learning w/ meta-learning: Woodward & Finn '16, Konyushkova et al. '17, Bachman et al. '17
Plan for Today
- Why be Bayesian?
- Bayesian meta-learning approaches
  - black-box approaches
  - optimization-based approaches
- How to evaluate Bayesian meta-learners.
Computation graph perspective (recap):
- Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$
- Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$
- Non-parametric: $y^{ts} = \mathrm{softmax}(-d(f_\theta(x^{ts}), c_n))$, where $c_n = \frac{1}{K}\sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbb{1}(y = n)\, f_\theta(x)$

Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$.

For example:
- probability values of a discrete categorical distribution
- mean and variance of a Gaussian
- means, variances, and mixture weights of a mixture of Gaussians
- for multi-dimensional $y^{ts}$: parameters of a sequence of distributions (i.e. an autoregressive model)

Then, optimize with maximum likelihood.
Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$.

For example:
- probability values of a discrete categorical distribution
- mean and variance of a Gaussian
- means, variances, and mixture weights of a mixture of Gaussians
- for multi-dimensional $y^{ts}$: parameters of a sequence of distributions (i.e. an autoregressive model)

Then, optimize with maximum likelihood.

Pros:
+ simple
+ can combine with a variety of methods

Cons:
- can't reason about uncertainty over the underlying function [to determine how uncertainty across datapoints relates]
- only a limited class of distributions over $y^{ts}$ can be expressed
- tends to produce poorly-calibrated uncertainty estimates

Thought exercise #4: Can you do the same maximum likelihood training for $\phi$?
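To make Version 0 concrete for regression, here is a minimal sketch of a head that outputs the mean and log-variance of a Gaussian over $y^{ts}$, trained by maximum likelihood (i.e. minimizing the Gaussian negative log-likelihood). The module and variable names are illustrative placeholders, not from the slides.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps the meta-learner's hidden state (already conditioned on D^tr_i and x^ts)
    to the mean and log-variance of a Gaussian over y^ts."""
    def __init__(self, hidden_dim, y_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, y_dim)
        self.log_var = nn.Linear(hidden_dim, y_dim)

    def forward(self, h):
        return self.mean(h), self.log_var(h)

def gaussian_nll(y, mean, log_var):
    """Maximum likelihood = minimizing the Gaussian negative log-likelihood."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).sum(dim=-1).mean()
```

Note this only places a distribution over $y^{ts}$ given a fixed task embedding; it does not give a distribution over the underlying function, which is exactly the first con above.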
The Bayesian Deep Learning Toolbox: a broad one-slide overview (CS 236 provides a thorough treatment)

Goal: represent distributions with neural networks.

- Latent variable models + variational inference (Kingma & Welling '13, Rezende et al. '14): approximate the likelihood of a latent variable model with a variational lower bound
- Bayesian ensembles (Lakshminarayanan et al. '17): particle-based representation; train separate models on bootstraps of the data
- Bayesian neural networks (Blundell et al. '15): explicit distribution over the space of network parameters
- Normalizing flows (Dinh et al. '16): invertible function from a latent distribution to the data distribution
- Energy-based models & GANs (LeCun et al. '06, Goodfellow et al. '14): estimate an unnormalized density

We'll see how we can leverage the first two. The others could be useful in developing new methods.
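As one concrete instance of the second entry, here is a minimal sketch of a bootstrap ensemble: train several copies of a model on resampled data and read uncertainty off the spread of their predictions. The `make_model` and `train_step` callables are hypothetical placeholders under these assumptions, not an API from the cited papers.

```python
import torch

def train_bootstrap_ensemble(make_model, train_step, xs, ys, n_members=5, n_steps=1000):
    """Particle-based approximation: each ensemble member is trained on its own
    bootstrap resample of the data."""
    members = []
    n = xs.shape[0]
    for _ in range(n_members):
        model = make_model()
        idx = torch.randint(0, n, (n,))              # sample n indices with replacement
        for _ in range(n_steps):
            train_step(model, xs[idx], ys[idx])      # ordinary supervised update
        members.append(model)
    return members

def ensemble_predict(members, x):
    preds = torch.stack([m(x) for m in members])     # shape [n_members, ...]
    return preds.mean(dim=0), preds.std(dim=0)       # predictive mean and spread
```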
Background: The Variational Lower Bound

Observed variable $x$, latent variable $z$.

ELBO: $\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x, z)] + \mathcal{H}(q(z \mid x))$

Can also be written as: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$

- $p$: model, with model parameters $\theta$; $p(x \mid z)$ represented with a neural net, $p(z)$ represented as $\mathcal{N}(0, I)$
- $q(z \mid x)$: inference network, variational distribution, with variational parameters $\phi$

Problem: need to backprop through sampling, i.e. compute the derivative of $\mathbb{E}_q$ w.r.t. $q$.

Reparametrization trick: for Gaussian $q(z \mid x)$, sample $z = \mu_q + \sigma_q \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.

Can we use amortized variational inference for meta-learning?
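Here is a minimal sketch of the reparametrization trick and a one-sample estimate of the ELBO, assuming a diagonal-Gaussian $q(z \mid x)$ and a standard-normal prior; the `encoder` and `decoder` interfaces are placeholders.

```python
import torch

def sample_reparam(mu_q, log_var_q):
    """z = mu_q + sigma_q * eps with eps ~ N(0, I): the sample is now a
    differentiable function of the variational parameters."""
    eps = torch.randn_like(mu_q)
    return mu_q + torch.exp(0.5 * log_var_q) * eps

def elbo(x, encoder, decoder):
    mu_q, log_var_q = encoder(x)              # parameters of q(z | x)
    z = sample_reparam(mu_q, log_var_q)
    log_px_z = decoder.log_prob(x, z)         # 1-sample estimate of E_q[log p(x | z)]
    # Analytic KL( q(z|x) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * (mu_q ** 2 + log_var_q.exp() - log_var_q - 1).sum(dim=-1)
    return (log_px_z - kl).mean()             # maximize this lower bound
```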
Bayesian black-box meta-learning with standard, deep variational inference

Use an inference network (a neural net) to produce a distribution over task-specific parameters $\phi_i$, which are then used to predict $y^{ts}$ from $x^{ts}$. What should $q$ condition on? The task training data $\mathcal{D}^{tr}_i$.

Standard VAE (observed variable $x$, latent variable $z$):
$$\max \; \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$$
$p$: model, represented by a neural net; $q$: inference network, variational distribution.

Meta-learning (observed variables $\mathcal{D}^{tr}_i$, $x^{ts}$, $y^{ts}$; latent variable $\phi_i$):
$$\max \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr})}[\log p(y^{ts} \mid x^{ts}, \phi)] - D_{KL}(q(\phi \mid \mathcal{D}^{tr}) \,\|\, p(\phi))$$

What about the meta-parameters $\theta$? Condition the inference network and the prior on $\theta$ (and optionally the likelihood as well):
$$\max_\theta \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr}, \theta)}[\log p(y^{ts} \mid x^{ts}, \phi)] - D_{KL}(q(\phi \mid \mathcal{D}^{tr}, \theta) \,\|\, p(\phi \mid \theta))$$

Final objective (for completeness):
$$\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\Big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)] - D_{KL}(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)) \Big]$$
Bayesian black-box meta-learning with standard, deep variational inference

$$\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\Big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)] - D_{KL}(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)) \Big]$$

Pros:
+ can represent non-Gaussian distributions over $y^{ts}$
+ produces a distribution over functions

Cons:
- can only represent Gaussian distributions $p(\phi_i \mid \theta)$
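A minimal sketch of one meta-training step for this objective, assuming $\phi_i$ is treated as a low-dimensional task code with a diagonal-Gaussian $q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$ and a `torch.distributions.Normal` prior $p(\phi_i \mid \theta)$; all module names and interfaces are placeholder assumptions, not the method of any specific paper.

```python
import torch
from torch.distributions import Normal, kl_divergence

def meta_train_step(task_batch, inference_net, predictor, prior, optimizer):
    """One amortized-VI step: maximize, averaged over a batch of tasks,
    E_q[log p(y^ts | x^ts, phi)] - KL( q(phi | D^tr, theta) || p(phi | theta) ).
    theta = the parameters of inference_net, predictor, and the prior."""
    loss = 0.0
    for (x_tr, y_tr, x_ts, y_ts) in task_batch:
        mu, log_var = inference_net(x_tr, y_tr)          # q(phi_i | D^tr_i, theta), shape [code_dim]
        sigma = torch.exp(0.5 * log_var)
        phi = mu + sigma * torch.randn_like(mu)          # reparametrized sample of the task code
        log_lik = predictor.log_prob(y_ts, x_ts, phi)    # log p(y^ts_i | x^ts_i, phi_i)
        kl = kl_divergence(Normal(mu, sigma), prior).sum(dim=-1)  # KL to p(phi_i | theta)
        loss = loss - (log_lik - kl)                     # negative ELBO for this task
    loss = loss / len(task_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The Gaussian form of $q$ and the Normal prior make the KL analytic but are exactly the restriction listed in the cons above.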
What about Bayesian optimization-based meta-learning?

Recasting Gradient-Based Meta-Learning as Hierarchical Bayes (Grant et al. '18)

Treat $\theta$ as meta-parameters and $\phi_i$ as task-specific parameters, and compute a MAP estimate of $\phi_i$ (empirical Bayes).

How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case).

This provides a Bayesian interpretation of MAML. But we can't sample from $p(\phi_i \mid \theta, \mathcal{D}^{tr}_i)$!
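To see the connection concretely, here is a minimal sketch of inner-loop adaptation: with `lam = 0` and a few steps it is early-stopped gradient descent from $\theta$ (the MAML-style inner loop), which per the slide acts like MAP inference under a Gaussian prior centered at $\theta$; setting `lam > 0` makes that quadratic prior explicit. The function signature, `nll_fn` interface, and `lam` are illustrative assumptions, and the result is a point estimate, so we still cannot sample $\phi_i$.

```python
import torch

def map_adapt(theta, nll_fn, x_tr, y_tr, lr=0.01, n_steps=5, lam=0.0):
    """Inner-loop adaptation from meta-parameters theta (a list of tensors).
    lam = 0: early-stopped gradient descent, which approximates MAP inference
    under a Gaussian prior centered at theta (Santos '96; Grant et al. '18).
    lam > 0: adds that Gaussian prior explicitly as a quadratic penalty."""
    phi = [p.detach().clone().requires_grad_(True) for p in theta]
    for _ in range(n_steps):
        loss = nll_fn(phi, x_tr, y_tr)                   # -log p(D^tr_i | phi)
        loss = loss + lam * sum(((p - t.detach()) ** 2).sum() for p, t in zip(phi, theta))
        grads = torch.autograd.grad(loss, phi)
        # First-order update for simplicity (no meta-gradient through the step)
        phi = [(p - lr * g).detach().requires_grad_(True) for p, g in zip(phi, grads)]
    return phi   # a point (MAP-style) estimate, not a sample from p(phi_i | theta, D^tr_i)
```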