
Optimization-Based Meta-Learning (finishing from last time) and Non-Parametric Few-Shot Learning



  1. Optimization-Based Meta-Learning (finishing from last time) and Non-Parametric Few-Shot Learning (CS 330)

  2. Logistics: Homework 1 due; Homework 2 out this Wednesday. Fill out poster presentation preferences! (Tues 12/3 or Weds 12/4.) Course project details & suggestions posted; proposal due Monday 10/28.

  3. Plan for Today: Optimization-Based Meta-Learning - recap & discussion of advanced topics
 Non-Parametric Few-Shot Learning - Siamese networks, matching networks, prototypical networks
 Properties of Meta-Learning Algorithms - comparison of approaches

  4. Recap from Last Time. Fine-tuning: start from pre-trained parameters $\theta$, take a gradient step on the training data for the new task, $\phi \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr})$, and evaluate at test time. MAML optimizes for an effective initialization for fine-tuning: $\min_\theta \sum_{\text{task } i} \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr}_i),\, \mathcal{D}^{ts}_i\big)$. Discussed: performance on extrapolated tasks, expressive power. (A minimal code sketch follows below.)
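A minimal, hedged sketch of the MAML objective above in JAX, for a toy linear-regression task. The toy data, model, and learning-rate values (inner_lr, outer_lr) are illustrative choices, not taken from the slides.

```python
import jax
import jax.numpy as jnp

def loss(params, data):
    # Mean-squared error of a linear model; stands in for L(theta, D).
    w, b = params
    x, y = data
    return jnp.mean((x @ w + b - y) ** 2)

def inner_update(params, train_data, inner_lr=0.1):
    # phi = theta - alpha * grad_theta L(theta, D_tr)
    grads = jax.grad(loss)(params, train_data)
    return jax.tree_util.tree_map(lambda p, g: p - inner_lr * g, params, grads)

def maml_objective(params, train_data, test_data):
    # Outer objective: L(theta - alpha * grad L(theta, D_tr), D_ts)
    return loss(inner_update(params, train_data), test_data)

# Toy task: y = 2x + 1, split into support (D_tr) and query (D_ts) sets.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (10, 1))
y = 2.0 * x + 1.0
train_data, test_data = (x[:5], y[:5]), (x[5:], y[5:])
params = (jnp.zeros((1, 1)), jnp.zeros(1))

# One outer (meta) step; jax.grad differentiates through the inner step,
# so this is the full second-order MAML gradient.
outer_lr = 0.01
meta_grads = jax.grad(maml_objective)(params, train_data, test_data)
params = jax.tree_util.tree_map(lambda p, g: p - outer_lr * g, params, meta_grads)
```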

  5. Probabilistic Interpretation of Optimization-Based Inference. Key idea: acquire the task-specific parameters $\phi$ through optimization; the meta-parameters $\theta$ serve as a prior. One form of prior knowledge: the initialization for fine-tuning. The task-specific parameters are obtained as a MAP estimate under the prior induced by $\theta$ (empirical Bayes). How do we compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos '96] (exact in the linear case, approximate in the nonlinear case). Thus MAML approximates hierarchical Bayesian inference. Grant et al., ICLR '18
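For reference, a sketch of the two levels of the hierarchical-Bayes view above; the exact notation is a reconstruction of the standard empirical-Bayes formulation, not copied verbatim from the slides:

$$\max_\theta \; \log p(\mathcal{D}_1, \ldots, \mathcal{D}_n \mid \theta) = \max_\theta \; \sum_i \log \int p(\mathcal{D}_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i$$

$$\phi_i \approx \hat{\phi}_i^{\text{MAP}} = \arg\max_{\phi} \; \log p(\mathcal{D}^{tr}_i \mid \phi) + \log p(\phi \mid \theta)$$

Running gradient descent on $-\log p(\mathcal{D}^{tr}_i \mid \phi)$ from the initialization $\theta$ and stopping early approximately corresponds to this MAP estimate under a Gaussian prior $p(\phi \mid \theta)$ centered at $\theta$.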

  6. Optimization-Based Inference. Key idea: acquire the task-specific parameters $\phi$ through optimization, $\phi \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr})$; the meta-parameters serve as a prior, one form of which is the initialization for fine-tuning. Gradient descent + early stopping (MAML): implicit Gaussian prior. Other forms of priors? Gradient descent with an explicit Gaussian prior (Rajeswaran et al., implicit MAML '19); Bayesian linear regression on learned features (Harrison et al., ALPaCA '18); closed-form or convex optimization on learned features - ridge regression, logistic regression, support vector machine (Bertinetto et al., R2-D2 '19; Lee et al., MetaOptNet '19). These give the current SOTA on few-shot image classification. (A sketch of the closed-form inner loop follows below.)
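To make the "closed-form optimization on learned features" idea concrete, here is a hedged sketch in the spirit of R2-D2 / ALPaCA, not the authors' code: the inner loop is a ridge-regression solve, which is cheap and differentiable, so the meta-gradient flows into the encoder without unrolling gradient steps. The tanh encoder, lam value, and toy sizes are all stand-ins.

```python
import jax
import jax.numpy as jnp

def ridge_head(features, targets, lam=1.0):
    # Closed-form solve of (F^T F + lam * I) W = F^T Y; differentiable w.r.t. features.
    d = features.shape[1]
    return jnp.linalg.solve(features.T @ features + lam * jnp.eye(d), features.T @ targets)

def encode(enc_params, x):
    # Placeholder encoder; in practice this would be a convolutional network.
    return jnp.tanh(x @ enc_params)

def outer_loss(enc_params, support, query, num_classes):
    xs, ys = support
    xq, yq = query
    # Inner loop: fit a linear head on the support features in closed form.
    W = ridge_head(encode(enc_params, xs), jax.nn.one_hot(ys, num_classes))
    # Outer loop: evaluate the fitted head on the query set; backprop into the encoder.
    logits = encode(enc_params, xq) @ W
    return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(yq.shape[0]), yq])

# Toy 3-way, 2-shot episode with random data, just to exercise the shapes.
key = jax.random.PRNGKey(0)
enc_params = 0.1 * jax.random.normal(key, (16, 8))
support = (jax.random.normal(key, (6, 16)), jnp.array([0, 0, 1, 1, 2, 2]))
query = (jax.random.normal(key, (3, 16)), jnp.array([0, 1, 2]))
meta_grad = jax.grad(outer_loss)(enc_params, support, query, 3)
```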

  7. Optimization-Based Inference. Key idea: acquire the task-specific parameters through optimization. Challenge: how do you choose an architecture that is effective for the inner gradient step? Idea: progressive neural architecture search + MAML (Kim et al., Auto-Meta) - finds a highly non-standard architecture (deep & narrow), different from architectures that work well for standard supervised learning. MiniImagenet 5-way 5-shot: MAML with the basic architecture, 63.11%; MAML + AutoMeta, 74.65%.

  8. Optimization-Based Inference. Key idea: acquire the task-specific parameters through optimization. Challenge: bi-level optimization can exhibit instabilities. Idea: automatically learn an inner vector learning rate and tune the outer learning rate (Li et al., Meta-SGD; Behl et al., AlphaMAML) - see the sketch below. Idea: optimize only a subset of the parameters in the inner loop (Zhou et al., DEML; Zintgraf et al., CAVIA). Idea: decouple the inner learning rate and batch-norm statistics per step (Antoniou et al., MAML++). Idea: introduce context variables for increased expressive power (Finn et al., bias transformation; Zintgraf et al., CAVIA). Takeaway: a range of simple tricks can help optimization significantly.
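A minimal sketch of the first trick in the list above, a learned per-parameter inner learning rate in the spirit of Meta-SGD. The linear model, toy data, and initial alpha values are placeholders, not from the papers.

```python
import jax
import jax.numpy as jnp

def loss(params, data):
    w, b = params
    x, y = data
    return jnp.mean((x @ w + b - y) ** 2)

def inner_update(params, alphas, train_data):
    # phi = theta - alpha (elementwise) * grad L(theta, D_tr); alpha has the same
    # shape as theta and is itself meta-learned by the outer loop.
    grads = jax.grad(loss)(params, train_data)
    return jax.tree_util.tree_map(lambda p, a, g: p - a * g, params, alphas, grads)

def outer_loss(meta_params, train_data, test_data):
    params, alphas = meta_params
    return loss(inner_update(params, alphas, train_data), test_data)

# The outer gradient is taken w.r.t. both the initialization and the learning rates.
params = (jnp.zeros((1, 1)), jnp.zeros(1))
alphas = (0.1 * jnp.ones((1, 1)), 0.1 * jnp.ones(1))
x = jnp.linspace(-1.0, 1.0, 10).reshape(10, 1)
y = 3.0 * x - 0.5
meta_grads = jax.grad(outer_loss)((params, alphas), (x[:5], y[:5]), (x[5:], y[5:]))
```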

  9. Optimization-Based Inference. Key idea: acquire the task-specific parameters through optimization. Challenge: backpropagating through many inner gradient steps is compute- and memory-intensive. Idea: [crudely] approximate the inner-loop Jacobian as the identity (Finn et al., first-order MAML '17; Nichol et al., Reptile '18) - see the sketch below. Takeaway: works for simple few-shot problems, but (anecdotally) not for more complex meta-learning problems. Can we compute the meta-gradient without differentiating through the optimization path? -> whiteboard. Idea: derive the meta-gradient using the implicit function theorem (Rajeswaran, Finn, Kakade, Levine. Implicit MAML '19).
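A hedged sketch of the first-order approximation mentioned above (not the authors' implementation): the query-set gradient is evaluated at the adapted parameters phi, but the Jacobian of the inner update is treated as the identity, so the gradient is applied directly to theta. The linear model and toy data are illustrative.

```python
import jax
import jax.numpy as jnp

def loss(params, data):
    w, b = params
    x, y = data
    return jnp.mean((x @ w + b - y) ** 2)

def fomaml_meta_grad(params, train_data, test_data, inner_lr=0.1):
    # Inner step: phi = theta - alpha * grad L(theta, D_tr).
    inner_grads = jax.grad(loss)(params, train_data)
    adapted = jax.tree_util.tree_map(lambda p, g: p - inner_lr * g, params, inner_grads)
    # First-order approximation: use grad L(phi, D_ts) as the update direction for theta,
    # dropping the second-order term d(phi)/d(theta).
    return jax.grad(loss)(adapted, test_data)

# Toy usage; the outer update would be theta <- theta - outer_lr * meta_grad.
params = (jnp.zeros((1, 1)), jnp.zeros(1))
x = jnp.linspace(-1.0, 1.0, 10).reshape(10, 1)
y = 2.0 * x + 1.0
meta_grad = fomaml_meta_grad(params, (x[:5], y[:5]), (x[5:], y[5:]))
```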

  10. Optimization-Based Inference. Can we compute the meta-gradient without differentiating through the optimization path? Idea: derive the meta-gradient using the implicit function theorem (Rajeswaran, Finn, Kakade, Levine. Implicit MAML). This trades memory for computation and allows second-order optimizers in the inner loop. A very recent development (NeurIPS '19), so all the typical caveats about recent work apply.
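A hedged sketch of the implicit-gradient idea for a model small enough to form the Hessian explicitly; the actual method uses a proximal inner problem solved approximately plus conjugate gradient, and the quadratic losses and lam value here are purely illustrative. With phi* minimizing L_tr(phi) + (lam/2)||phi - theta||^2, the implicit function theorem gives the meta-gradient (I + (1/lam) H_tr(phi*))^{-1} grad L_ts(phi*), with no backprop through the optimization path.

```python
import jax
import jax.numpy as jnp

def train_loss(phi, data):
    x, y = data
    return jnp.mean((x @ phi - y) ** 2)

test_loss = train_loss  # same form, evaluated on the query set

def implicit_meta_grad(phi_star, train_data, test_data, lam=1.0):
    # Meta-gradient = (I + (1/lam) * H_tr(phi*))^{-1} grad_phi L_ts(phi*),
    # where H_tr is the Hessian of the task training loss at the inner solution phi*
    # (phi* itself depends on theta through the proximal inner problem).
    H = jax.hessian(train_loss)(phi_star, train_data)
    g_test = jax.grad(test_loss)(phi_star, test_data)
    return jnp.linalg.solve(jnp.eye(phi_star.shape[0]) + H / lam, g_test)

# Toy usage with a 3-dimensional linear model; in practice phi_star comes from
# (approximately) solving the regularized inner problem, here we just plug in a point.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 3))
y = x @ jnp.ones(3)
phi_star = jnp.array([0.5, -0.2, 0.1])
g = implicit_meta_grad(phi_star, (x[:4], y[:4]), (x[4:], y[4:]))
```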

  11. Optimization-Based Inference. Key idea: acquire the task-specific parameters through optimization. Takeaways: construct a bi-level optimization problem. + positive inductive bias at the start of meta-learning; + consistent procedure, tends to extrapolate better; + maximally expressive with a sufficiently deep network; + model-agnostic (easy to combine with your favorite architecture); - typically requires second-order optimization; - usually compute- and/or memory-intensive. Can we embed a learning procedure without second-order optimization?

  12. Plan for Today: Optimization-Based Meta-Learning - recap & discussion of advanced topics
 Non-Parametric Few-Shot Learning - Siamese networks, matching networks, prototypical networks
 Properties of Meta-Learning Algorithms - comparison of approaches

  13. So far: learning parametric models. In low-data regimes, non-parametric methods are simple and work well. At meta-test time, few-shot learning is a low-data regime; during meta-training, we still want to be parametric. Can we use parametric meta-learners that produce effective non-parametric learners? Note: some of these methods precede the parametric approaches.

  14. Non-parametric methods. Key idea: use a non-parametric learner. Compare the test datapoint with the training images in $\mathcal{D}^{tr}_i$. In what space do you compare? With what distance metric? Pixel space, $\ell_2$ distance?

  15. In what space do you compare? With what distance metric? Pixel space, $\ell_2$ distance? [Zhang et al., arXiv:1801.03924]

  16. Non-parametric methods. Key idea: use a non-parametric learner. Compare the test datapoint with the training images in $\mathcal{D}^{tr}_i$. In what space do you compare? With what distance metric? Not pixel space with $\ell_2$ distance: learn to compare using the meta-training data!

  17. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images are the same class (label 0 for this pair). Koch et al., ICML '15

  18. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images are the same class (label 1 for this pair). Koch et al., ICML '15

  19. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images are the same class (label 0 for this pair). Koch et al., ICML '15

  20. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images are the same class. At meta-test time, compare the test image to each image in $\mathcal{D}^{tr}_j$ (see the sketch below). Meta-training is binary classification while meta-test is N-way classification. Can we match meta-train & meta-test? Koch et al., ICML '15
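A minimal sketch of the meta-test procedure described above (not the Koch et al. implementation): a learned pairwise "same class?" score is computed between the query image and every support image, and the query inherits the label of the best match. The encoder, score function, and flattened 16-dimensional "images" are stand-ins.

```python
import jax
import jax.numpy as jnp

def embed(params, x):
    # Stand-in for the Siamese network's shared encoder.
    return jnp.tanh(x @ params)

def pair_score(params, x1, x2):
    # Stand-in for the "same class?" head: higher means more likely the same class.
    return -jnp.sum(jnp.abs(embed(params, x1) - embed(params, x2)))

def predict_nway(params, support_x, support_y, query_x):
    # Compare the query against every support image; take the label of the best match.
    scores = jax.vmap(lambda xk: pair_score(params, query_x, xk))(support_x)
    return support_y[jnp.argmax(scores)]

# Toy 5-way, 1-shot usage with random inputs.
key = jax.random.PRNGKey(0)
params = 0.1 * jax.random.normal(key, (16, 8))
support_x = jax.random.normal(key, (5, 16))
support_y = jnp.array([0, 1, 2, 3, 4])
query_x = jax.random.normal(key, (16,))
print(predict_nway(params, support_x, support_y, query_x))
```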

  21. Non-parametric methods. Key idea: use a non-parametric learner. Can we match meta-train & meta-test? Yes: nearest neighbors in a learned embedding space. Embed the support set $\mathcal{D}^{tr}_i$ and the test input (convolutional encoder plus bidirectional LSTM), then predict $\hat{y}^{ts} = \sum_{x_k, y_k \in \mathcal{D}^{tr}} f_\theta(x^{ts}, x_k)\, y_k$. Trained end-to-end on $\mathcal{D}^{ts}_i$, so meta-train and meta-test time match. Vinyals et al., Matching Networks, NeurIPS '16
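A hedged sketch of the matching-networks readout above. The full model also conditions the embeddings on the support set via the LSTM, which is omitted here; the cosine-similarity attention and tanh encoder are simplifying stand-ins.

```python
import jax
import jax.numpy as jnp

def embed(params, x):
    return jnp.tanh(x @ params)          # stand-in for the convolutional encoder

def matching_predict(params, support_x, support_y_onehot, query_x):
    zs = embed(params, support_x)        # (k, d) support embeddings
    zq = embed(params, query_x)          # (d,)   query embedding
    # Attention weights from cosine similarity: f_theta(x_ts, x_k).
    sims = (zs @ zq) / (jnp.linalg.norm(zs, axis=1) * jnp.linalg.norm(zq) + 1e-8)
    attn = jax.nn.softmax(sims)
    # y_hat^ts = sum_k f_theta(x_ts, x_k) * y_k, with y_k one-hot.
    return attn @ support_y_onehot

# Toy 3-way, 2-shot episode with random 16-dimensional inputs.
key = jax.random.PRNGKey(0)
params = 0.1 * jax.random.normal(key, (16, 8))
support_x = jax.random.normal(key, (6, 16))
support_y_onehot = jax.nn.one_hot(jnp.array([0, 0, 1, 1, 2, 2]), 3)
query_x = jax.random.normal(key, (16,))
print(matching_predict(params, support_x, support_y_onehot, query_x))
```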

  22. Non-parametric methods. Key idea: use a non-parametric learner. General algorithm, amortized approach vs. non-parametric approach (matching networks):
1. Sample task $\mathcal{T}_i$ (or a mini-batch of tasks).
2. Sample disjoint datasets $\mathcal{D}^{tr}_i$, $\mathcal{D}^{test}_i$ from $\mathcal{D}_i$.
3. Amortized: compute $\phi_i \leftarrow f_\theta(\mathcal{D}^{tr}_i)$. Non-parametric: compute $\hat{y}^{ts} = \sum_{x_k, y_k \in \mathcal{D}^{tr}} f_\theta(x^{ts}, x_k)\, y_k$ (the task-specific parameters are integrated out, hence non-parametric).
4. Amortized: update $\theta$ using $\nabla_\theta \mathcal{L}(\phi_i, \mathcal{D}^{test}_i)$. Non-parametric: update $\theta$ using $\nabla_\theta \mathcal{L}(\hat{y}^{ts}, y^{ts})$.
Matching networks perform the comparisons independently. What if there is more than one shot? Can we aggregate class information to create a prototypical embedding?

  23. Non-parametric methods. Key idea: use a non-parametric learner. Prototypical networks: compute a class prototype $c_n = \frac{1}{K} \sum_{(x, y) \in \mathcal{D}^{tr}_i} \mathbb{1}(y = n)\, f_\theta(x)$ and classify with $p_\theta(y = n \mid x) = \frac{\exp(-d(f_\theta(x), c_n))}{\sum_{n'} \exp(-d(f_\theta(x), c_{n'}))}$, where $d$ is the Euclidean or cosine distance. Snell et al., Prototypical Networks, NeurIPS '17
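A minimal sketch of the prototype computation and classification rule above, with squared Euclidean distance and a stand-in encoder; the names and toy sizes are illustrative.

```python
import jax
import jax.numpy as jnp

def embed(params, x):
    return jnp.tanh(x @ params)                      # stand-in encoder f_theta

def proto_predict(params, support_x, support_y, query_x, num_classes):
    zs = embed(params, support_x)                    # (k, d)
    onehot = jax.nn.one_hot(support_y, num_classes)  # (k, n)
    # c_n = (1/K) * sum of f_theta(x) over the class-n support points.
    protos = (onehot.T @ zs) / onehot.sum(axis=0)[:, None]
    zq = embed(params, query_x)                      # (d,)
    d2 = jnp.sum((protos - zq) ** 2, axis=1)         # squared Euclidean distance d
    return jax.nn.softmax(-d2)                       # p_theta(y = n | x)

# Toy 3-way, 2-shot episode.
key = jax.random.PRNGKey(0)
params = 0.1 * jax.random.normal(key, (16, 8))
support_x = jax.random.normal(key, (6, 16))
support_y = jnp.array([0, 0, 1, 1, 2, 2])
query_x = jax.random.normal(key, (16,))
print(proto_predict(params, support_x, support_y, query_x, 3))
```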

  24. Non-parametric methods. So far: Siamese networks, matching networks, prototypical networks - embed, then nearest neighbors. Challenge: what if you need to reason about more complex relationships between datapoints? Idea: learn a non-linear relation module on the embeddings, i.e. learn $d$ in prototypical networks (Sung et al., Relation Net). Idea: learn an infinite mixture of prototypes (Allen et al., IMP, ICML '19). Idea: perform message passing on the embeddings (Garcia & Bruna, GNN).

  25. Plan for Today: Optimization-Based Meta-Learning - recap & discussion of advanced topics
 Non-Parametric Few-Shot Learning - Siamese networks, matching networks, prototypical networks
 Properties of Meta-Learning Algorithms - comparison of approaches. How can we think about how these methods compare?
