Bayesian Meta-Learning (CS 330)
Logistics: Homework 2 is due next Wednesday. The project proposal is due in two weeks. Poster presentation: Tuesday 12/3 at 1:30 pm.
Disclaimers: Bayesian meta-learning is an active area of research (like most of the class content), with more questions than answers. This lecture covers some of the most advanced topics of the course, so ask questions!
Recap from last time: the computation graph perspective.

Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$

Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$, where $\phi_i$ is obtained by fine-tuning from $\theta$ on $\mathcal{D}^{tr}_i$

Non-parametric: $y^{ts} = \mathrm{softmax}\big(-d\big(f_\theta(x^{ts}),\, c_n\big)\big)$, where $c_n = \frac{1}{K} \sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbf{1}(y = n)\, f_\theta(x)$
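For concreteness, here is a minimal sketch of the non-parametric prediction above, assuming PyTorch and a generic embedding network `f_theta`; the helper name and shapes are illustrative, not from the lecture:

```python
import torch
import torch.nn.functional as F

def nonparametric_predict(f_theta, support_x, support_y, query_x, num_classes):
    """Prototype-style prediction: softmax over negative distances to class means c_n."""
    z_support = f_theta(support_x)          # embed the support set D^tr_i
    z_query = f_theta(query_x)              # embed the query points x^ts
    # c_n = (1/K) * sum of f_theta(x) over support examples with label n
    prototypes = torch.stack([
        z_support[support_y == n].mean(dim=0) for n in range(num_classes)
    ])
    d = torch.cdist(z_query, prototypes)    # d(f_theta(x^ts), c_n), Euclidean here
    return F.softmax(-d, dim=-1)            # distribution over classes for y^ts
```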
Recap from last time: the algorithmic properties perspective.

- Expressive power: the ability of $f$ to represent a range of learning procedures. Why? Scalability and applicability to a range of domains.
- Consistency: the learned learning procedure will solve the task given enough data. Why? Reduces reliance on meta-training tasks; good out-of-distribution (OOD) task performance.
- Uncertainty awareness: the ability to reason about ambiguity during learning. Why? Active learning, calibrated uncertainty, RL, principled Bayesian approaches. *This lecture.*

These properties are important for most applications!
Plan for today:
- Why be Bayesian?
- Bayesian meta-learning approaches
- How to evaluate Bayesians.
Multi-Task & Meta-Learning Principles: training and testing must match, and tasks must share "structure."

What does "structure" mean? Statistical dependence on shared latent information $\theta$. If you condition on that information, the task parameters become independent, i.e. $\phi_{i_1} \perp\!\!\!\perp \phi_{i_2} \mid \theta$, even though they are not otherwise independent, $\phi_{i_1} \not\perp\!\!\!\perp \phi_{i_2}$. Hence you have lower entropy, i.e. $\mathcal{H}(p(\phi_i \mid \theta)) < \mathcal{H}(p(\phi_i))$, as sketched in the hierarchical model below.

Thought exercise #1: If you can identify $\theta$ (i.e. with meta-learning), when should learning $\phi_i$ be faster than learning from scratch?
Thought exercise #2: What if $\mathcal{H}(p(\phi_i \mid \theta)) = 0$?
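One standard way to formalize this, a sketch of the usual hierarchical model rather than something stated verbatim on the slide:

```latex
% Shared latent \theta, per-task parameters \phi_i, per-task data \mathcal{D}_i
p(\theta, \phi_{1:N}, \mathcal{D}_{1:N})
   = p(\theta)\, \prod_{i=1}^{N} p(\phi_i \mid \theta)\, p(\mathcal{D}_i \mid \phi_i)

% Conditioned on \theta, the task parameters factorize (conditional independence):
p(\phi_{i_1}, \phi_{i_2} \mid \theta) = p(\phi_{i_1} \mid \theta)\, p(\phi_{i_2} \mid \theta)

% Marginally, integrating out \theta couples the tasks, so in general
% p(\phi_{i_1}, \phi_{i_2}) \neq p(\phi_{i_1})\, p(\phi_{i_2}),
% and conditioning reduces uncertainty: \mathcal{H}(p(\phi_i \mid \theta)) < \mathcal{H}(p(\phi_i)).
```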
Multi-Task & Meta-Learning Principles: training and testing must match; tasks must share "structure," i.e. statistical dependence on shared latent information $\theta$.

What information might $\theta$ contain?
- In the toy sinusoid problem: $\theta$ corresponds to the family of sinusoid functions (everything but phase and amplitude).
- In the machine translation example: $\theta$ corresponds to the family of all language pairs.

Note that $\theta$ is narrower than the space of all possible functions.

Thought exercise #3: What if you meta-learn without a lot of tasks? "Meta-overfitting."
Recall that parametric approaches use a deterministic (i.e. point) estimate of $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$.

Why/when is this a problem? Few-shot learning problems may be ambiguous, even with a prior. Can we instead learn to generate hypotheses about the underlying function, i.e. sample from $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$?

Important for:
- safety-critical few-shot learning (e.g. medical imaging)
- learning to actively learn
- learning to explore in meta-RL

Active learning w/ meta-learning: Woodward & Finn '16, Konyushkova et al. '17, Bachman et al. '17
Plan for today:
- Why be Bayesian?
- Bayesian meta-learning approaches
- How to evaluate Bayesians.
Recall the computation graph perspective: black-box, optimization-based, and non-parametric approaches (equations recapped earlier).

Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$. For example:
- the probability values of a discrete categorical distribution
- the mean and variance of a Gaussian
- the means, variances, and mixture weights of a mixture of Gaussians
- for multi-dimensional $y^{ts}$: the parameters of a sequence of distributions (i.e. an autoregressive model)

Then, optimize with maximum likelihood (a short Gaussian-output sketch follows below).

Pros:
+ simple
+ can combine with a variety of methods

Cons:
- can't reason about uncertainty over the underlying function [to determine how uncertainties across datapoints relate]
- only a limited class of distributions over $y^{ts}$ can be expressed
- tends to produce poorly-calibrated uncertainty estimates

Thought exercise #4: Can you do the same maximum likelihood training for $\phi$?
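As referenced above, here is a minimal sketch of the Gaussian case of Version 0, assuming PyTorch; the class and function names are made up for illustration:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Version 0: the network outputs the parameters of a distribution over y^ts
    (here, the mean and log-variance of a Gaussian)."""
    def __init__(self, feature_dim, y_dim):
        super().__init__()
        self.mean = nn.Linear(feature_dim, y_dim)
        self.log_var = nn.Linear(feature_dim, y_dim)

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def nll_loss(mean, log_var, y_ts):
    """Maximum likelihood: minimize the Gaussian negative log-likelihood of y^ts."""
    dist = torch.distributions.Normal(mean, torch.exp(0.5 * log_var))
    return -dist.log_prob(y_ts).mean()
```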
The Bayesian Deep Learning Toolbox: a broad one-slide overview (CS 236 provides a thorough treatment).

Goal: represent distributions with neural networks.
- Latent variable models + variational inference (Kingma & Welling '13, Rezende et al. '14): approximate the likelihood of a latent variable model with a variational lower bound
- Bayesian ensembles (Lakshminarayanan et al. '17): particle-based representation; train separate models on bootstraps of the data (see the sketch after this list)
- Bayesian neural networks (Blundell et al. '15): explicit distribution over the space of network parameters
- Normalizing flows (Dinh et al. '16): invertible function from a latent distribution to the data distribution
- Energy-based models & GANs (LeCun et al. '06, Goodfellow et al. '14): estimate an unnormalized density

We'll see how we can leverage the first two. The others could be useful in developing new methods.
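For instance, the Bayesian-ensemble entry above boils down to something like this sketch, assuming PyTorch and user-supplied `make_model` / `train_step` helpers; details differ across papers:

```python
import torch

def train_bootstrap_ensemble(make_model, train_step, xs, ys, num_models=5, num_steps=1000):
    """Particle-based uncertainty: train separate models on bootstrap resamples of the data."""
    models = []
    n = xs.shape[0]
    for _ in range(num_models):
        model = make_model()                      # fresh random initialization
        idx = torch.randint(0, n, (n,))           # sample n points with replacement
        for _ in range(num_steps):
            train_step(model, xs[idx], ys[idx])   # user-supplied optimization step
        models.append(model)
    return models

# Predictive uncertainty comes from disagreement across ensemble members,
# e.g. the mean and variance of [m(x) for m in models].
```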
Background: the variational lower bound. Observed variable $x$, latent variable $z$.

$\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x, z)] + \mathcal{H}(q(z \mid x))$

The ELBO can also be written as: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$

- $p$: the model, with model parameters $\theta$; $p(x \mid z)$ is represented with a neural net and $p(z)$ as $\mathcal{N}(0, I)$
- $q(z \mid x)$: the inference network (variational distribution), with variational parameters $\phi$

Problem: we need to backprop through sampling, i.e. compute the derivative of $\mathbb{E}_q$ with respect to $q$.

Reparametrization trick: for Gaussian $q(z \mid x)$, sample $z = \mu_q + \sigma_q \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.

Can we use amortized variational inference for meta-learning?
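A minimal sketch of this amortized ELBO with the reparameterization trick, assuming PyTorch; `encoder` and `decoder` are generic stand-ins, and the decoder is assumed to return a `torch.distributions` object:

```python
import torch

def elbo(encoder, decoder, x):
    """Single-sample ELBO: E_q(z|x)[log p(x|z)] - KL(q(z|x) || p(z))."""
    mu_q, log_var_q = encoder(x)                          # variational parameters of q(z|x)
    eps = torch.randn_like(mu_q)                          # eps ~ N(0, I)
    z = mu_q + torch.exp(0.5 * log_var_q) * eps           # reparameterization: z = mu + sigma * eps
    log_px_given_z = decoder(z).log_prob(x).sum(dim=-1)   # log p(x | z)
    q = torch.distributions.Normal(mu_q, torch.exp(0.5 * log_var_q))
    p = torch.distributions.Normal(torch.zeros_like(mu_q), torch.ones_like(mu_q))
    kl = torch.distributions.kl_divergence(q, p).sum(dim=-1)
    return (log_px_given_z - kl).mean()                   # maximize this (minimize its negative)
```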
Bayesian black-box meta-learning with standard, deep variational inference: a neural network produces $q(\phi_i \mid \mathcal{D}^{tr}_i)$. What should $q$ condition on?

Standard VAE (observed variable $x$, latent variable $z$), ELBO:
$\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$
where $p$ is the model (a neural net) and $q$ is the inference network (variational distribution).

Meta-learning (observed variable $y^{ts}$ given $x^{ts}$, latent variable $\phi$):
$\max \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr})}\big[\log p(y^{ts} \mid x^{ts}, \phi)\big] - D_{KL}\big(q(\phi \mid \mathcal{D}^{tr}) \,\|\, p(\phi)\big)$

What about the meta-parameters $\theta$? Condition $q$ (and optionally $p$) on $\theta$:
$\max_\theta \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr}, \theta)}\big[\log p(y^{ts} \mid x^{ts}, \phi)\big] - D_{KL}\big(q(\phi \mid \mathcal{D}^{tr}, \theta) \,\|\, p(\phi \mid \theta)\big)$

Final objective (for completeness):
$\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\Big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}\big[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)\big] - D_{KL}\big(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)\big) \Big]$
(a single-task code sketch follows below)

Pros:
+ can represent non-Gaussian distributions over $y^{ts}$
+ produces a distribution over functions

Cons:
- can only represent Gaussian distributions $p(\phi_i \mid \theta)$
  (not always restricting: e.g. if $p(y^{ts}_i \mid x^{ts}_i, \phi_i, \theta)$ is also conditioned on $\theta$)
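As referenced above, a minimal sketch of the final objective for a single task, assuming PyTorch; `q_net`, `predict`, and the Gaussian prior parameters are illustrative stand-ins for the meta-parameters $\theta$, and `predict` is assumed to return a `torch.distributions` object:

```python
import torch

def meta_elbo(q_net, predict, prior_mu, prior_log_var, x_tr, y_tr, x_ts, y_ts):
    """Amortized VI for one task:
    E_{q(phi | D^tr, theta)}[log p(y^ts | x^ts, phi)] - KL(q(phi | D^tr, theta) || p(phi | theta))."""
    mu_q, log_var_q = q_net(x_tr, y_tr)                    # q conditions on the task's D^tr
    q = torch.distributions.Normal(mu_q, torch.exp(0.5 * log_var_q))
    phi = q.rsample()                                      # reparameterized sample of task parameters
    log_lik = predict(x_ts, phi).log_prob(y_ts).sum()      # log p(y^ts | x^ts, phi)
    p_phi = torch.distributions.Normal(prior_mu, torch.exp(0.5 * prior_log_var))  # learned prior p(phi | theta)
    kl = torch.distributions.kl_divergence(q, p_phi).sum()
    return log_lik - kl   # average over tasks T_i and maximize w.r.t. all meta-parameters theta
```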
What about Bayesian optimization-based meta-learning?

Recall: Recasting Gradient-Based Meta-Learning as Hierarchical Bayes (Grant et al. '18) views adaptation as computing a MAP estimate of the task-specific parameters under the meta-parameters (empirical Bayes).

How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case). This provides a Bayesian interpretation of MAML (a sketch of the inner loop follows below).

But we can't sample from $p(\phi_i \mid \theta, \mathcal{D}^{tr}_i)$! (Next: hybrid variational inference.)
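The inner loop this interpretation refers to is just a few gradient steps from the meta-learned initialization; a minimal sketch assuming PyTorch, with the meta-gradient through adaptation omitted for clarity:

```python
import torch

def map_adapt(theta, loss_fn, x_tr, y_tr, inner_lr=0.01, num_steps=5):
    """Early-stopped gradient descent from the meta-learned initialization theta.
    Per Santos '96 / Grant et al. '18, this approximates MAP inference on phi_i under a
    Gaussian prior centered at theta (exact in the linear case, approximate otherwise)."""
    phi = [p.detach().clone().requires_grad_(True) for p in theta]
    for _ in range(num_steps):                  # stopping early implicitly regularizes toward theta
        loss = loss_fn(phi, x_tr, y_tr)         # task training loss on D^tr_i
        grads = torch.autograd.grad(loss, phi)
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi   # a single point (MAP) estimate: we still cannot sample from p(phi_i | theta, D^tr_i)
```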