THOUGHTS ON PROGRESS MADE AND CHALLENGES AHEAD IN FEW-SHOT LEARNING
Hugo Larochelle
Google Brain
• People are good at it
• Machines are getting better at it
• Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum
RELATED WORK: ONE-SHOT LEARNING
• One-shot learning has been studied before
‣ One-Shot learning of object categories (2006) Fei-Fei Li, Rob Fergus and Pietro Perona
‣ Knowledge transfer in learning to recognize visual objects classes (2004) Fei-Fei Li
‣ Object classification from a single example utilizing class relevance pseudo-metrics (2004) Michael Fink
‣ Cross-generalization: learning novel classes from a single example by feature replacement (2005) Evgeniy Bart and Shimon Ullman
• These largely relied on hand-engineered features and algorithms
‣ with recent progress in end-to-end deep learning, we hope to jointly learn a representation and algorithm better suited for few-shot learning
META-LEARNING
• $D_{\text{train}}$ and $D_{\text{test}}$ together form an episode
[Figure: an episode ($D_{\text{train}}$, $D_{\text{test}}$), the Meta-learner (A), the Learner (M) and the Loss]
If you don’t evaluate on never-seen problems/datasets… it’s not meta-learning!
LEARNING PROBLEM STATEMENT
• Assuming a probabilistic model M over labels, the cost per episode can be written as
$$C(D_{\text{train}}, D_{\text{test}}) = \frac{1}{|D_{\text{test}}|} \sum_{(x_t, y_t) \in D_{\text{test}}} -\log p(y_t \mid x_t, D_{\text{train}})$$
• Here $p(y \mid x, D_{\text{train}})$ jointly represents the meta-learner A (which processes $D_{\text{train}}$) and the learner M (which processes x)
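The per-episode cost can be illustrated with a minimal NumPy sketch. It assumes the predictive distribution $p(y \mid x_t, D_{\text{train}})$ has already been computed for every test point; the function name `episode_cost` and the toy numbers are illustrative only and do not come from the slides.

```python
import numpy as np

def episode_cost(probs, labels):
    """Average negative log-likelihood over D_test.

    probs:  array [n_test, n_classes]; row t holds p(y | x_t, D_train)
    labels: int array [n_test]; the true labels y_t
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Probability assigned to the correct label of each test example.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

# Toy episode with two test points and two classes (made-up numbers).
probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
labels = np.array([0, 1])
cost = episode_cost(probs, labels)
```

Any of the meta-learners discussed in this talk could supply `probs`; the cost itself does not depend on how the distribution was produced.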
CHOOSING A META-LEARNER
• How to parametrize learning algorithms (meta-learners $p(y \mid x, D_{\text{train}})$)?
• Two approaches to defining a meta-learner
‣ Take inspiration from a known learning algorithm
- kNN/kernel machine: Matching networks (Vinyals et al. 2016)
- Gaussian classifier: Prototypical Networks (Snell et al. 2017)
- Gradient Descent: Meta-Learner LSTM (Ravi & Larochelle, 2017), MAML (Finn et al. 2017)
‣ Derive it from a black box neural network
- SNAIL (Mishra et al. 2018)
MATCHING NETWORKS
• Training a “pattern matcher” (kNN/kernel machine)
$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$$
$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}$$
‣ f and g are neural networks (potentially with f = g)
• Matching networks for one shot learning (2016) Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
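The attention-weighted prediction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings $f(\hat{x})$ and $g(x_i)$ are assumed to be precomputed vectors, cosine similarity is assumed for c, and the name `matching_predict` is hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity, assumed here as the comparison function c.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def matching_predict(query_emb, support_emb, support_labels, n_classes):
    """Return y_hat = sum_i a(x_hat, x_i) y_i as a class distribution."""
    scores = np.array([cosine(query_emb, s) for s in support_emb])
    scores = scores - scores.max()                 # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()   # softmax over the support set
    one_hot = np.eye(n_classes)[support_labels]    # labels y_i as one-hot rows
    return attn @ one_hot

# Toy one-shot, two-class episode with already-embedded inputs.
support_emb = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
support_labels = np.array([0, 1])
y_hat = matching_predict(np.array([1.0, 0.1]), support_emb, support_labels, 2)
```

In this toy case the query lies closer to the first support example, so `y_hat` puts more mass on class 0.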
PROTOTYPICAL NETWORKS
• Training a “prototype extractor” (Gaussian classifier)
$$p_\phi(y = k \mid x) = \frac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$$
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$$
$$S_k = \{(x_i, y_i) \mid y_i = k,\ (x_i, y_i) \in D_{\text{train}}\}$$
‣ trained by minimizing the negative log-probability $J(\phi)$
‣ $\phi \equiv \Theta$
[Figure: a point x and the prototypes $c_1$, $c_2$, $c_3$]
• Prototypical Networks for Few-shot Learning (2017) Jake Snell, Kevin Swersky and Richard Zemel
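The prototype computation and the resulting class distribution can be sketched in NumPy. The sketch assumes the embeddings $f_\phi(x)$ are already computed and uses squared Euclidean distance for d, which is an assumption here since the slide leaves d unspecified; the function names are illustrative.

```python
import numpy as np

def prototypes(emb, labels, n_classes):
    """c_k: mean embedding of the D_train examples with label k."""
    return np.stack([emb[labels == k].mean(axis=0) for k in range(n_classes)])

def proto_probs(query_emb, protos):
    """p(y = k | x): softmax over negative distances to the prototypes."""
    # Squared Euclidean distance between every query and every prototype.
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy 2-way, 2-shot episode with already-embedded inputs.
emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(emb, labels, n_classes=2)
p = proto_probs(np.array([[1.0, 1.0]]), protos)
```

Here the prototypes are (0, 1) and (4, 1), so the query at (1, 1) is assigned almost entirely to class 0.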
META-LEARNER LSTM
• Training an “initialize and gradient descent procedure” applied on some learner M
[Figure: diagram relating $D_{\text{train}}$, $D_{\text{test}}$ and the cost $C(D_{\text{train}}, D_{\text{test}})$]
• Optimization as a Model for Few-Shot Learning (2017) Sachin Ravi and Hugo Larochelle
• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017) Chelsea Finn, Pieter Abbeel and Sergey Levine
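The idea of starting from an initialization and running gradient descent on the learner can be sketched as below. This is only the inner adaptation loop under simplifying assumptions: the learner M is a linear softmax classifier, and the initialization `W0` and step size `lr` are fixed by hand, whereas in the methods cited above they would be meta-learned across episodes. The outer meta-training loop is omitted, and all names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(W, X, y):
    """Average negative log-likelihood of labels y under the learner W."""
    p = softmax(X @ W)
    return float(-np.mean(np.log(p[np.arange(len(y)), y])))

def adapt(W0, X_train, y_train, lr=0.5, steps=5):
    """Run a few gradient steps on D_train, starting from W0."""
    W = W0.copy()
    Y = np.eye(W.shape[1])[y_train]  # one-hot targets
    for _ in range(steps):
        grad = X_train.T @ (softmax(X_train @ W) - Y) / len(y_train)
        W -= lr * grad
    return W

# Toy episode: D_train adapts the learner, D_test measures the cost.
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([0, 1])
X_test = np.array([[0.9, 0.1], [0.2, 0.8]])
y_test = np.array([0, 1])

W0 = np.zeros((2, 2))  # hand-picked here; meta-learned in practice
W = adapt(W0, X_train, y_train)
cost_before = nll(W0, X_test, y_test)
cost_after = nll(W, X_test, y_test)
```

The quantity `cost_after` plays the role of $C(D_{\text{train}}, D_{\text{test}})$: meta-training would adjust the initialization (and possibly the update rule) to make it small on average over episodes.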
SIMPLE NEURAL ATTENTIVE LEARNER
• Using a convolutional/attentional network to represent $p(y \mid x, D_{\text{train}})$
‣ alternates between dilated convolutional layers and attentional layers
‣ when inputs are images, a convolutional embedding network is used to map to a vector space
[Figure (a): Dense Block (dilation rate R, D filters): inputs, shape [T, C] → causal conv, kernel 2, dilation R, D filters → concatenate → outputs, shape [T, C + D]]
[Figure (b): Attention Block (key size K, value size V): inputs, shape [T, C] → affine, output size K (query); affine, output size K (keys); affine, output size V (values) → matmul, masked softmax → matmul → concatenate → outputs, shape [T, C + V]]
• A Simple Neural Attentive Meta-Learner (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel
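The attention block from the figure can be sketched in NumPy as follows. This is a simplified reading of the diagram, not the authors' code: the affine maps are written as plain matrix products without bias terms, the logits are scaled by the square root of the key size (an assumption, since the slide does not show it), and the weight names `Wq`, `Wk`, `Wv` are hypothetical.

```python
import numpy as np

def attention_block(x, Wq, Wk, Wv):
    """x: [T, C]; Wq, Wk: [C, K]; Wv: [C, V]. Returns outputs of shape [T, C + V]."""
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(k.shape[1])
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)   # masked softmax
    read = w @ v                           # [T, V]
    return np.concatenate([x, read], axis=1)

rng = np.random.default_rng(0)
T, C, K, V = 4, 3, 2, 5
x = rng.normal(size=(T, C))
Wq, Wk, Wv = rng.normal(size=(C, K)), rng.normal(size=(C, K)), rng.normal(size=(C, V))
out = attention_block(x, Wq, Wk, Wv)
```

Because of the mask, changing a later input leaves the outputs at earlier positions untouched, which is what lets the network read the sequence of (example, label) pairs in order.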
AND SO MUCH MORE!!! bit.ly/2PikS82
EXPERIMENT
• Mini-ImageNet (split used in Ravi & Larochelle, 2017)
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets $D_{\text{train}}$ are generated by randomly picking 5 classes from class subset

5-class accuracy
Model                               1-shot            5-shot
Baseline-finetune                   28.86 ± 0.54%     49.79 ± 0.79%
Baseline-nearest-neighbor           41.08 ± 0.70%     51.04 ± 0.65%
Matching Network                    43.40 ± 0.78%     51.09 ± 0.71%
Matching Network FCE                43.56 ± 0.84%     55.31 ± 0.73%
Meta-Learner LSTM (OURS)            43.44 ± 0.77%     60.60 ± 0.71%
Prototypical Nets (Snell et al.)    49.42 ± 0.78%     68.20 ± 0.66%
MAML (Finn et al.)                  48.70 ± 1.84%     63.10 ± 0.92%
SNAIL (Mishra et al.)               55.71 ± 0.99%     68.88 ± 0.98%
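Episode generation for an N-way, k-shot experiment can be sketched as below. This is a minimal illustration using only the standard library: the data layout (`examples_by_class`), the function name `sample_episode`, and the number of test examples per class (`n_query=15`) are assumptions and are not stated on the slide.

```python
import random

def sample_episode(examples_by_class, n_way=5, k_shot=1, n_query=15, rng=random):
    """Sample one episode: pick n_way classes, then split their examples."""
    classes = rng.sample(sorted(examples_by_class), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint examples for D_train (k_shot) and D_test (n_query).
        picked = rng.sample(examples_by_class[cls], k_shot + n_query)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test

# Toy stand-in for a 20-class split, with 20 example ids per class.
examples_by_class = {
    f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(20)
}
d_train, d_test = sample_episode(examples_by_class, rng=random.Random(0))
```

Labels are re-indexed from 0 to `n_way - 1` within each episode, so a model cannot rely on a fixed class identity across episodes.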
REMAINING CHALLENGES
• Going beyond supervised classification
‣ unsupervised learning, structured output, interactive learning
• Going beyond Mini-ImageNet
‣ coming up with a realistic definition of distributions over problems/datasets
• Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples, Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, Hugo Larochelle (Google)
META-DATASET
• Learning across many tasks requires learning over many datasets
[Figure: sample images from (a) ImageNet, (b) Omniglot, (c) Aircraft, (d) Birds, (e) DTD, (f) Quick Draw, (g) Fungi, (h) VGG Flower, (i) Traffic Signs, (j) MSCOCO; a subset of the datasets is marked “Held out for testing”]