Bayesian Meta-Learning (CS 330)
Logistics: Homework 2 is due next Wednesday. The project proposal is due in two weeks. Poster presentation: Tuesday 12/3 at 1:30 pm.
Disclaimers: Bayesian meta-learning is an active area of research (like most of the class content), with more questions than answers. This lecture covers some of the most advanced topics of the course, so ask questions!
Recap from last time: the computation graph perspective.

Black-box: $y^{ts} = f_\theta(\mathcal{D}^{tr}_i, x^{ts})$

Optimization-based: $y^{ts} = f_{\phi_i}(x^{ts})$, where $\phi_i$ is obtained by fine-tuning from $\theta$ on $\mathcal{D}^{tr}_i$

Non-parametric: $y^{ts} = \mathrm{softmax}\big(-d\big(f_\theta(x^{ts}),\, c_n\big)\big)$, where $c_n = \frac{1}{K} \sum_{(x,y) \in \mathcal{D}^{tr}_i} \mathbf{1}(y = n)\, f_\theta(x)$
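For concreteness, here is a minimal sketch of the non-parametric prediction above, assuming PyTorch and a generic embedding network `f_theta`; the helper name and shapes are illustrative, not from the lecture:

```python
import torch
import torch.nn.functional as F

def nonparametric_predict(f_theta, support_x, support_y, query_x, num_classes):
    """Prototype-style prediction: softmax over negative distances to class means c_n."""
    z_support = f_theta(support_x)          # embed the support set D^tr_i
    z_query = f_theta(query_x)              # embed the query points x^ts
    # c_n = (1/K) * sum of f_theta(x) over support examples with label n
    prototypes = torch.stack([
        z_support[support_y == n].mean(dim=0) for n in range(num_classes)
    ])
    d = torch.cdist(z_query, prototypes)    # d(f_theta(x^ts), c_n), Euclidean here
    return F.softmax(-d, dim=-1)            # distribution over classes for y^ts
```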
Recap from last time: the algorithmic properties perspective.

- Expressive power: the ability of $f$ to represent a range of learning procedures. Why? Scalability and applicability to a range of domains.
- Consistency: the learned learning procedure will solve the task given enough data. Why? Reduces reliance on meta-training tasks; good out-of-distribution (OOD) task performance.
- Uncertainty awareness: the ability to reason about ambiguity during learning. Why? Active learning, calibrated uncertainty, RL, principled Bayesian approaches. *This lecture.*

These properties are important for most applications!
Plan for today:
- Why be Bayesian?
- Bayesian meta-learning approaches
- How to evaluate Bayesians.
Multi-Task & Meta-Learning Principles: training and testing must match, and tasks must share "structure."

What does "structure" mean? Statistical dependence on shared latent information $\theta$. If you condition on that information, the task parameters become independent, i.e. $\phi_{i_1} \perp\!\!\!\perp \phi_{i_2} \mid \theta$, even though they are not otherwise independent, $\phi_{i_1} \not\perp\!\!\!\perp \phi_{i_2}$. Hence you have lower entropy, i.e. $\mathcal{H}(p(\phi_i \mid \theta)) < \mathcal{H}(p(\phi_i))$, as sketched in the hierarchical model below.

Thought exercise #1: If you can identify $\theta$ (i.e. with meta-learning), when should learning $\phi_i$ be faster than learning from scratch?
Thought exercise #2: What if $\mathcal{H}(p(\phi_i \mid \theta)) = 0$?
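One standard way to formalize this, a sketch of the usual hierarchical model rather than something stated verbatim on the slide:

```latex
% Shared latent \theta, per-task parameters \phi_i, per-task data \mathcal{D}_i
p(\theta, \phi_{1:N}, \mathcal{D}_{1:N})
   = p(\theta)\, \prod_{i=1}^{N} p(\phi_i \mid \theta)\, p(\mathcal{D}_i \mid \phi_i)

% Conditioned on \theta, the task parameters factorize (conditional independence):
p(\phi_{i_1}, \phi_{i_2} \mid \theta) = p(\phi_{i_1} \mid \theta)\, p(\phi_{i_2} \mid \theta)

% Marginally, integrating out \theta couples the tasks, so in general
% p(\phi_{i_1}, \phi_{i_2}) \neq p(\phi_{i_1})\, p(\phi_{i_2}),
% and conditioning reduces uncertainty: \mathcal{H}(p(\phi_i \mid \theta)) < \mathcal{H}(p(\phi_i)).
```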
Multi-Task & Meta-Learning Principles: training and testing must match; tasks must share "structure," i.e. statistical dependence on shared latent information $\theta$.

What information might $\theta$ contain?
- In the toy sinusoid problem: $\theta$ corresponds to the family of sinusoid functions (everything but phase and amplitude).
- In the machine translation example: $\theta$ corresponds to the family of all language pairs.

Note that $\theta$ is narrower than the space of all possible functions.

Thought exercise #3: What if you meta-learn without a lot of tasks? "Meta-overfitting."
Recall that parametric approaches use a deterministic (i.e. point) estimate of $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$.

Why/when is this a problem? Few-shot learning problems may be ambiguous, even with a prior. Can we instead learn to generate hypotheses about the underlying function, i.e. sample from $p(\phi_i \mid \mathcal{D}^{tr}_i, \theta)$?

Important for:
- safety-critical few-shot learning (e.g. medical imaging)
- learning to actively learn
- learning to explore in meta-RL

Active learning w/ meta-learning: Woodward & Finn '16, Konyushkova et al. '17, Bachman et al. '17
Plan for today:
- Why be Bayesian?
- Bayesian meta-learning approaches
- How to evaluate Bayesians.
Recall the computation graph perspective: black-box, optimization-based, and non-parametric approaches (equations recapped earlier).

Version 0: Let $f$ output the parameters of a distribution over $y^{ts}$. For example:
- the probability values of a discrete categorical distribution
- the mean and variance of a Gaussian
- the means, variances, and mixture weights of a mixture of Gaussians
- for multi-dimensional $y^{ts}$: the parameters of a sequence of distributions (i.e. an autoregressive model)

Then, optimize with maximum likelihood (a short Gaussian-output sketch follows below).

Pros:
+ simple
+ can combine with a variety of methods

Cons:
- can't reason about uncertainty over the underlying function [to determine how uncertainties across datapoints relate]
- only a limited class of distributions over $y^{ts}$ can be expressed
- tends to produce poorly-calibrated uncertainty estimates

Thought exercise #4: Can you do the same maximum likelihood training for $\phi$?
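As referenced above, here is a minimal sketch of the Gaussian case of Version 0, assuming PyTorch; the class and function names are made up for illustration:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Version 0: the network outputs the parameters of a distribution over y^ts
    (here, the mean and log-variance of a Gaussian)."""
    def __init__(self, feature_dim, y_dim):
        super().__init__()
        self.mean = nn.Linear(feature_dim, y_dim)
        self.log_var = nn.Linear(feature_dim, y_dim)

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def nll_loss(mean, log_var, y_ts):
    """Maximum likelihood: minimize the Gaussian negative log-likelihood of y^ts."""
    dist = torch.distributions.Normal(mean, torch.exp(0.5 * log_var))
    return -dist.log_prob(y_ts).mean()
```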
The Bayesian Deep Learning Toolbox: a broad one-slide overview (CS 236 provides a thorough treatment).

Goal: represent distributions with neural networks.
- Latent variable models + variational inference (Kingma & Welling '13, Rezende et al. '14): approximate the likelihood of a latent variable model with a variational lower bound
- Bayesian ensembles (Lakshminarayanan et al. '17): particle-based representation; train separate models on bootstraps of the data (see the sketch after this list)
- Bayesian neural networks (Blundell et al. '15): explicit distribution over the space of network parameters
- Normalizing flows (Dinh et al. '16): invertible function from a latent distribution to the data distribution
- Energy-based models & GANs (LeCun et al. '06, Goodfellow et al. '14): estimate an unnormalized density

We'll see how we can leverage the first two. The others could be useful in developing new methods.
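For instance, the Bayesian-ensemble entry above boils down to something like this sketch, assuming PyTorch and user-supplied `make_model` / `train_step` helpers; details differ across papers:

```python
import torch

def train_bootstrap_ensemble(make_model, train_step, xs, ys, num_models=5, num_steps=1000):
    """Particle-based uncertainty: train separate models on bootstrap resamples of the data."""
    models = []
    n = xs.shape[0]
    for _ in range(num_models):
        model = make_model()                      # fresh random initialization
        idx = torch.randint(0, n, (n,))           # sample n points with replacement
        for _ in range(num_steps):
            train_step(model, xs[idx], ys[idx])   # user-supplied optimization step
        models.append(model)
    return models

# Predictive uncertainty comes from disagreement across ensemble members,
# e.g. the mean and variance of [m(x) for m in models].
```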
Background: the variational lower bound. Observed variable $x$, latent variable $z$.

$\log p(x) \ge \mathbb{E}_{q(z \mid x)}[\log p(x, z)] + \mathcal{H}(q(z \mid x))$

The ELBO can also be written as: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$

- $p$: the model, with model parameters $\theta$; $p(x \mid z)$ is represented with a neural net and $p(z)$ as $\mathcal{N}(0, I)$
- $q(z \mid x)$: the inference network (variational distribution), with variational parameters $\phi$

Problem: we need to backprop through sampling, i.e. compute the derivative of $\mathbb{E}_q$ with respect to $q$.

Reparametrization trick: for Gaussian $q(z \mid x)$, sample $z = \mu_q + \sigma_q \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.

Can we use amortized variational inference for meta-learning?
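A minimal sketch of this amortized ELBO with the reparameterization trick, assuming PyTorch; `encoder` and `decoder` are generic stand-ins, and the decoder is assumed to return a `torch.distributions` object:

```python
import torch

def elbo(encoder, decoder, x):
    """Single-sample ELBO: E_q(z|x)[log p(x|z)] - KL(q(z|x) || p(z))."""
    mu_q, log_var_q = encoder(x)                          # variational parameters of q(z|x)
    eps = torch.randn_like(mu_q)                          # eps ~ N(0, I)
    z = mu_q + torch.exp(0.5 * log_var_q) * eps           # reparameterization: z = mu + sigma * eps
    log_px_given_z = decoder(z).log_prob(x).sum(dim=-1)   # log p(x | z)
    q = torch.distributions.Normal(mu_q, torch.exp(0.5 * log_var_q))
    p = torch.distributions.Normal(torch.zeros_like(mu_q), torch.ones_like(mu_q))
    kl = torch.distributions.kl_divergence(q, p).sum(dim=-1)
    return (log_px_given_z - kl).mean()                   # maximize this (minimize its negative)
```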
Bayesian black-box meta-learning with standard, deep variational inference: a neural network produces $q(\phi_i \mid \mathcal{D}^{tr}_i)$. What should $q$ condition on?

Standard VAE (observed variable $x$, latent variable $z$), ELBO:
$\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}\big(q(z \mid x) \,\|\, p(z)\big)$
where $p$ is the model (a neural net) and $q$ is the inference network (variational distribution).

Meta-learning (observed variable $y^{ts}$ given $x^{ts}$, latent variable $\phi$):
$\max \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr})}\big[\log p(y^{ts} \mid x^{ts}, \phi)\big] - D_{KL}\big(q(\phi \mid \mathcal{D}^{tr}) \,\|\, p(\phi)\big)$

What about the meta-parameters $\theta$? Condition $q$ (and optionally $p$) on $\theta$:
$\max_\theta \; \mathbb{E}_{q(\phi \mid \mathcal{D}^{tr}, \theta)}\big[\log p(y^{ts} \mid x^{ts}, \phi)\big] - D_{KL}\big(q(\phi \mid \mathcal{D}^{tr}, \theta) \,\|\, p(\phi \mid \theta)\big)$

Final objective (for completeness):
$\max_\theta \; \mathbb{E}_{\mathcal{T}_i}\Big[ \mathbb{E}_{q(\phi_i \mid \mathcal{D}^{tr}_i, \theta)}\big[\log p(y^{ts}_i \mid x^{ts}_i, \phi_i)\big] - D_{KL}\big(q(\phi_i \mid \mathcal{D}^{tr}_i, \theta) \,\|\, p(\phi_i \mid \theta)\big) \Big]$
(a single-task code sketch follows below)

Pros:
+ can represent non-Gaussian distributions over $y^{ts}$
+ produces a distribution over functions

Cons:
- can only represent Gaussian distributions $p(\phi_i \mid \theta)$
  (not always restricting: e.g. if $p(y^{ts}_i \mid x^{ts}_i, \phi_i, \theta)$ is also conditioned on $\theta$)
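As referenced above, a minimal sketch of the final objective for a single task, assuming PyTorch; `q_net`, `predict`, and the Gaussian prior parameters are illustrative stand-ins for the meta-parameters $\theta$, and `predict` is assumed to return a `torch.distributions` object:

```python
import torch

def meta_elbo(q_net, predict, prior_mu, prior_log_var, x_tr, y_tr, x_ts, y_ts):
    """Amortized VI for one task:
    E_{q(phi | D^tr, theta)}[log p(y^ts | x^ts, phi)] - KL(q(phi | D^tr, theta) || p(phi | theta))."""
    mu_q, log_var_q = q_net(x_tr, y_tr)                    # q conditions on the task's D^tr
    q = torch.distributions.Normal(mu_q, torch.exp(0.5 * log_var_q))
    phi = q.rsample()                                      # reparameterized sample of task parameters
    log_lik = predict(x_ts, phi).log_prob(y_ts).sum()      # log p(y^ts | x^ts, phi)
    p_phi = torch.distributions.Normal(prior_mu, torch.exp(0.5 * prior_log_var))  # learned prior p(phi | theta)
    kl = torch.distributions.kl_divergence(q, p_phi).sum()
    return log_lik - kl   # average over tasks T_i and maximize w.r.t. all meta-parameters theta
```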
What about Bayesian optimization-based meta-learning?

Recall: Recasting Gradient-Based Meta-Learning as Hierarchical Bayes (Grant et al. '18) views adaptation as computing a MAP estimate of the task-specific parameters under the meta-parameters (empirical Bayes).

How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case). This provides a Bayesian interpretation of MAML (a sketch of the inner loop follows below).

But we can't sample from $p(\phi_i \mid \theta, \mathcal{D}^{tr}_i)$! (Next: hybrid variational inference.)
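The inner loop this interpretation refers to is just a few gradient steps from the meta-learned initialization; a minimal sketch assuming PyTorch, with the meta-gradient through adaptation omitted for clarity:

```python
import torch

def map_adapt(theta, loss_fn, x_tr, y_tr, inner_lr=0.01, num_steps=5):
    """Early-stopped gradient descent from the meta-learned initialization theta.
    Per Santos '96 / Grant et al. '18, this approximates MAP inference on phi_i under a
    Gaussian prior centered at theta (exact in the linear case, approximate otherwise)."""
    phi = [p.detach().clone().requires_grad_(True) for p in theta]
    for _ in range(num_steps):                  # stopping early implicitly regularizes toward theta
        loss = loss_fn(phi, x_tr, y_tr)         # task training loss on D^tr_i
        grads = torch.autograd.grad(loss, phi)
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi   # a single point (MAP) estimate: we still cannot sample from p(phi_i | theta, D^tr_i)
```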