Non-Parametric Few-Shot Learning (CS 330)


  1. Non-Parametric Few-Shot Learning, CS 330

  2. Logistics: Homework 1 due tonight; Homework 2 out soon. Fill out the project group form if you haven't already. Project suggestions & project spreadsheet posted.

  3. Plan for Today: Non-Parametric Few-Shot Learning (Siamese networks, matching networks, prototypical networks; a case study of few-shot medical image diagnosis); Properties of Meta-Learning Algorithms (comparison of approaches); Example Meta-Learning Applications (imitation learning, drug discovery, motion prediction, language generation). Goals by the end of lecture: the basics of non-parametric few-shot learning techniques (& how to implement them); the trade-offs between black-box, optimization-based, and non-parametric meta-learning; familiarity with applied formulations of meta-learning.

  4. Recap: Black-Box Meta-Learning. [Figure: a neural network f_θ maps the training set D_i^tr to task parameters φ_i, which are used to predict y^ts from x^ts.] Key idea: parametrize the learner as a neural network. + expressive; - challenging optimization problem.

  5. Recap: Optimization-Based Meta-Learning. [Figure: φ_i is obtained by gradient descent ∇_θ L on D_i^tr, then used to predict y^ts from x^ts.] Key idea: embed optimization inside the inner learning process. + the structure of optimization is embedded into the meta-learner; - typically requires second-order optimization. Today: can we embed a learning procedure without second-order optimization?

  6. So far: learning parametric models. In low-data regimes, non-parametric methods are simple and work well. During meta-test time: few-shot learning <-> low-data regime. During meta-training: we still want to be parametric. Can we use parametric meta-learners that produce effective non-parametric learners? Note: some of these methods precede the parametric approaches.

  7. Non-parametric methods. Key idea: use a non-parametric learner. Compare the test datapoint with the training data D_i^tr. In what space do you compare? With what distance metric? Pixel space with l2 distance?

  8. In what space do you compare? With what distance metric? Pixel space with l2 distance? No: l2 distance in pixel space is a poor match to perceptual similarity (Zhang et al., arXiv:1801.03924).

  9. Non-parametric methods. Key idea: use a non-parametric learner. Compare the test image with the training images in D_i^tr. In what space do you compare? With what distance metric? Not pixel space with l2 distance: learn to compare using the meta-training data!

  10. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images belong to the same class (label 0 for this pair). Koch et al., ICML '15

  11. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images belong to the same class (label 1 for this pair). Koch et al., ICML '15

  12. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images belong to the same class (label 0 for this pair). Koch et al., ICML '15

  13. Non-parametric methods. Key idea: use a non-parametric learner. Train a Siamese network to predict whether or not two images belong to the same class (label 1 for this pair). Meta-training: binary classification. Meta-test time: compare the test image to each image in D_j^tr, which is N-way classification. Can we match meta-train & meta-test? Koch et al., ICML '15
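The verification setup above can be sketched as follows. This is a minimal illustration, not the Koch et al. architecture: a hand-set weight vector `w` and bias `b` stand in for learned parameters, and embeddings are assumed to be pre-computed by a shared encoder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_class_prob(emb_a, emb_b, w, b):
    # Siamese-style verification head: both images go through the same
    # (shared-weight) encoder; the head scores the component-wise L1
    # distance between the embeddings and squashes it to a probability
    # that the pair shares a class.
    d = np.abs(emb_a - emb_b)
    return sigmoid(b - w @ d)

def nway_predict(test_emb, support_embs, w, b):
    # Meta-test time: compare the test embedding against one support
    # embedding per class and pick the best-matching class.
    probs = np.array([same_class_prob(test_emb, s, w, b) for s in support_embs])
    return int(np.argmax(probs))
```

Note the mismatch this slide highlights: training optimizes a binary (same/different) objective, while the N-way prediction at test time is assembled out of many independent pairwise comparisons.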

  14. Non-parametric methods. Key idea: use a non-parametric learner. Can we match meta-train & meta-test? Yes: nearest neighbors in a learned embedding space. A convolutional encoder embeds the training images, a bidirectional LSTM embeds the test image in the context of the training set D_i^tr, and the prediction is a weighted vote: y_hat^ts = sum over (x_k, y_k) in D_i^tr of f_θ(x^ts, x_k) y_k. Trained end-to-end; meta-train & meta-test time match. Vinyals et al., Matching Networks, NeurIPS '16
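The prediction rule above can be sketched in a few lines. This is a simplified version: embeddings are assumed pre-computed, and the attention f_θ is taken to be a softmax over cosine similarities (the full model also uses the context embeddings described on the slide, omitted here).

```python
import numpy as np

def matching_net_predict(query_emb, support_embs, support_labels, n_classes):
    # Attention weights: softmax over cosine similarity between the query
    # embedding and each support embedding.
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    a = np.exp(sims - sims.max())
    a /= a.sum()
    # y_hat^ts = sum_k a_k * one_hot(y_k): an attention-weighted vote
    # over the one-hot support labels.
    onehot = np.eye(n_classes)[support_labels]
    return a @ onehot
```

Because the output is a distribution over however many classes appear in the support set, the same trained encoder can be queried with different N at meta-test time.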

  15. Non-parametric methods. Key idea: use a non-parametric learner. General algorithm: 1. Sample task T_i (or a mini-batch of tasks). 2. Sample disjoint datasets D_i^tr, D_i^test from D_i. 3. Black-box approach: compute φ_i ← f_θ(D_i^tr). Non-parametric approach (matching networks): compute y_hat^ts = sum over (x_k, y_k) in D_i^tr of f_θ(x^ts, x_k) y_k (the task parameters φ are integrated out, hence "non-parametric"). 4. Black-box: update θ using ∇_θ L(φ_i, D_i^test). Non-parametric: update θ using ∇_θ L(y_hat^ts, y^ts). Matching networks perform each comparison independently. What if there is more than one shot? Can we aggregate class information to create a prototypical embedding?

  16. Non-parametric methods. Key idea: use a non-parametric learner. Prototypical networks: form one prototype per class, c_n = (1/K) * sum over (x, y) in D_i^tr of 1(y = n) f_θ(x), and classify with a softmax over negative distances to the prototypes, p_θ(y = n | x) = exp(−d(f_θ(x), c_n)) / sum over n' of exp(−d(f_θ(x), c_n')), where d is Euclidean or cosine distance. (Note the minus sign in both numerator and denominator: closer prototypes get higher probability.) Snell et al., Prototypical Networks, NeurIPS '17
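The two equations above can be written down directly. A minimal NumPy sketch, assuming pre-computed embeddings f_θ(x) and squared Euclidean distance for d:

```python
import numpy as np

def prototypes(support_embs, support_labels, n_classes):
    # c_n: mean embedding of the K support examples belonging to class n.
    return np.stack([support_embs[support_labels == n].mean(axis=0)
                     for n in range(n_classes)])

def protonet_probs(query_emb, protos):
    # p(y = n | x) = softmax_n( -d(f(x), c_n) ), d = squared Euclidean.
    neg_d = -((protos - query_emb) ** 2).sum(axis=1)
    e = np.exp(neg_d - neg_d.max())  # subtract max for numerical stability
    return e / e.sum()
```

Everything here is feedforward and differentiable, so the encoder f_θ is trained end-to-end by cross-entropy on the query-set predictions.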

  17. Non-parametric methods. So far: Siamese networks, matching networks, prototypical networks: embed, then nearest neighbors. Challenge: what if you need to reason about more complex relationships between datapoints? Idea: learn a non-linear relation module on embeddings, i.e. learn d in ProtoNet (Sung et al., Relation Net). Idea: learn an infinite mixture of prototypes (Allen et al., IMP, ICML '19). Idea: perform message passing on embeddings (Garcia & Bruna, GNN).

  18. Case Study. Machine Learning for Healthcare Conference 2019; NeurIPS 2018 ML4H Workshop. Link: https://arxiv.org/abs/1811.03066

  19. Problem: Few-Shot Learning for Dermatological Disease Diagnosis. Dermnet dataset (http://www.dermnet.com/). Challenges: hard to get data; the data is long-tailed; significant intra-class variability. Goal: acquire an accurate classifier on all classes (top 200 classes only!).

  20. Prototypical Clustering Networks for Few-Shot Classification. Problem formulation: different image classes = different diseases; 150 base classes (the classes with the most data) and 50 novel classes; test on all 200 classes. Approach: Prototypical Networks, plus (a) learn multiple prototypes per class (to handle intra-class variability) and (b) incorporate unlabeled support examples via k-means on the learned embedding. Note: unlike black-box & optimization-based meta-learning, ProtoNets can train for N-way classification and test for >N-way classification. (Side note if you read the paper: they flipped the standard notation of K and N.)
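The multiple-prototypes idea can be sketched as: run k-means on each class's support embeddings, then score a class by its closest prototype. This is an illustrative simplification, not the paper's full training procedure (which also folds in unlabeled examples); all names here are hypothetical.

```python
import numpy as np

def class_prototypes(support_embs, k, n_iter=10, seed=0):
    # k-means on one class's support embeddings -> k prototypes per class,
    # which can capture intra-class variability (e.g. visually distinct
    # presentations of the same disease).
    rng = np.random.default_rng(seed)
    centers = support_embs[rng.choice(len(support_embs), k, replace=False)]
    for _ in range(n_iter):
        # assign each embedding to its nearest current center
        assign = ((support_embs[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        # move each center to the mean of its assigned embeddings
        for j in range(k):
            if (assign == j).any():
                centers[j] = support_embs[assign == j].mean(axis=0)
    return centers

def class_score(query_emb, protos):
    # Score a class by the negative squared distance to its nearest prototype.
    return -((protos - query_emb) ** 2).sum(axis=1).min()
```

With a single prototype per class (k = 1) this reduces to a standard ProtoNet; multiple prototypes let one class cover several separated modes in embedding space.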

  21. Evaluation. Compare: PN: standard ProtoNets, trained on the 150 base classes, pre-trained on ImageNet. FT N-*NN: ImageNet pre-training, ResNet fine-tuned on N classes, *-nearest neighbors in the resulting embedding space. FT 200-*CE: ImageNet pre-trained, fine-tuned on all 200 classes with balancing (a very strong baseline: it accesses more information during training and requires re-training for new classes). Evaluation metric: mean class accuracy (mca), i.e. the average of the per-class accuracies across the 200 classes. Results (for k = 5 and k = 10): PCN > PN; PCN > FT N-*NN; PCN ≈ FT 200-*CE, without requiring re-training. More visualizations and analysis in the paper!

  22. Plan for Today: Non-Parametric Few-Shot Learning (Siamese networks, matching networks, prototypical networks; case study of few-shot medical image diagnosis); Properties of Meta-Learning Algorithms (comparison of approaches); Example Meta-Learning Applications (imitation learning, drug discovery, motion prediction, language generation). How can we think about how these methods compare?

  23. Black-box vs. Optimization-based vs. Non-Parametric: the computation graph perspective. Black-box: y_hat^ts = f_θ(D_i^tr, x^ts). Optimization-based: y_hat^ts = f_{φ_i}(x^ts), where φ_i is produced by running gradient descent on D_i^tr. Non-parametric: y_hat^ts = softmax(−d(f_θ(x^ts), c_n)), where c_n = (1/K) * sum over (x, y) in D_i^tr of 1(y = n) f_θ(x). Note (again): you can mix & match components of the computation graph: gradient descent on a relation-net embedding (Rusu et al., LEO '19); condition on data & run gradient descent (Jiang et al., CAML '19); MAML, but initialize the last layer as a ProtoNet during meta-training (Triantafillou et al., Proto-MAML '19).

  24. Black-box vs. Optimization-based vs. Non-Parametric: the algorithmic properties perspective. Expressive power: the ability of f to represent a range of learning procedures. Why it matters: scalability and applicability to a range of domains. Consistency: the learned learning procedure monotonically improves with more data. Why it matters: reduced reliance on meta-training tasks and good performance on out-of-distribution tasks. Recall: these properties are important for most applications!

  25. Black-box vs. Optimization-based vs. Non-Parametric.
Black-box: + complete expressive power; + easy to combine with a variety of learning problems (e.g. SL, RL); - not consistent; - challenging optimization (no inductive bias at the initialization); - often data-inefficient.
Optimization-based: + consistent, reduces to GD; ~ expressive for very deep models*; + positive inductive bias at the start of meta-learning; + handles varying & large K well; + model-agnostic; - requires second-order optimization; - usually compute- and memory-intensive.
Non-parametric: + expressive for most architectures; ~ consistent under certain conditions; + entirely feedforward; + computationally fast & easy to optimize; - harder to generalize to varying K; - hard to scale to very large K; - so far, limited to classification.
Generally, well-tuned versions of each perform comparably on existing few-shot benchmarks! (This likely says more about the benchmarks than the methods.) Which method to use depends on your use case. *for supervised learning settings
