

  1. A few meta learning papers Guy Gur-Ari Machine Learning Journal Club, September 2017

  2. Meta Learning • Mechanisms for faster, better adaptation to new tasks • 'Integrate prior experience with a small amount of new information' • Examples: an image classifier applied to new classes, a game player applied to new games, … • Related: one-shot learning, catastrophic forgetting • Learning how to learn (instead of designing by hand)

  3. Meta Learning • Mechanisms for faster, better adaptation to new tasks • Learning how to learn (instead of designing by hand) • Each task is a single training sample • Performance metric: generalization to new tasks • Higher-order derivatives show up, but first-order approximations sometimes work well
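To make the task-as-training-sample setup concrete, here is a minimal skeleton of the meta-training and meta-testing loops; `sample_task`, `meta_update`, `adapt`, and `evaluate` are hypothetical callables, not from any of the papers below.

```python
# Hypothetical skeleton of the meta-learning setup described above.
def meta_train(model, sample_task, meta_update, n_iters):
    for _ in range(n_iters):
        task = sample_task()              # each task plays the role of one training sample
        model = meta_update(model, task)  # one meta-gradient step on this task
    return model

def meta_test(model, sample_task, adapt, evaluate, n_tasks):
    # Performance metric: average loss on *new* tasks after brief adaptation.
    losses = [evaluate(adapt(model, sample_task())) for _ in range(n_tasks)]
    return sum(losses) / n_tasks
```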

  4. Transfer Learning (ad-hoc meta-learning)

  5. Learning to learn by gradient descent by gradient descent Andrychowicz et al. 1606.04474

  6. Basic idea • Replace the hand-designed update rule with a learned one: the target ('optimizee') has loss function f(θ), and a recurrent neural network m with parameters φ produces the updates to the target parameters: θ_{t+1} = θ_t + g_t, where [g_t, h_{t+1}] = m(∇_θ f(θ_t), h_t, φ) [1606.04474]
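A minimal PyTorch sketch of this update rule, assuming the paper's coordinatewise 2-layer LSTM; class and attribute names here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the learned optimizer m(grad_t, h_t; phi). The LSTM is applied
# coordinatewise: every target parameter is a separate batch element, so the
# RNN weights phi are shared across parameters (cf. slide 13 below).
class RNNOptimizer(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, num_layers=2)
        self.to_update = nn.Linear(hidden_size, 1)  # maps hidden state to g_t

    def forward(self, grad, hidden=None):
        # grad: shape (1, n_params, 1) -- (seq_len, batch, features)
        out, hidden = self.lstm(grad, hidden)
        return self.to_update(out), hidden          # g_t and the new hidden state
```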

  7. Vanilla RNN refresher • h_t = tanh(W_h h_{t−1} + W_x x_t), y_t = W_y h_t • Trained by backpropagation through time: unroll the recurrence over steps t − 1, t, t + 1, … and backpropagate through the unrolled graph [Karpathy]
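The two equations in runnable form, as a NumPy toy with arbitrary sizes and random weights for illustration:

```python
import numpy as np

# Minimal vanilla RNN matching the slide's equations:
#   h_t = tanh(W_h h_{t-1} + W_x x_t),   y_t = W_y h_t
rng = np.random.default_rng(0)
H, X, Y = 8, 4, 3                       # hidden, input, output sizes (arbitrary)
W_h = rng.normal(scale=0.1, size=(H, H))
W_x = rng.normal(scale=0.1, size=(H, X))
W_y = rng.normal(scale=0.1, size=(Y, H))

def rnn_forward(xs):
    """Unroll over a sequence; BPTT would backpropagate through this loop."""
    h = np.zeros(H)
    ys = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)  # h_t depends on h_{t-1}: the recurrence
        ys.append(W_y @ h)              # y_t = W_y h_t
    return ys

ys = rnn_forward([rng.normal(size=X) for _ in range(5)])
```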

  8. Meta loss function • Ideal: L(φ) = E_f [ f(θ*(f, φ)) ], where θ*(f, φ) are the optimal target parameters for a given optimizer • In practice: L(φ) = E_f [ Σ_t w_t f(θ_t) ] with w_t ≡ 1, where ∇_t = ∇_θ f(θ_t) and the RNN hidden state h_t is the state of a 2-layer LSTM [1606.04474]

  9. Meta loss function • L(φ) = E_f [ Σ_t w_t f(θ_t) ], with ∇_t = ∇_θ f(θ_t) and w_t ≡ 1 • The recurrent network can use trajectory information, similar to momentum • Including the historical losses, rather than only the final one, also helps with backpropagation through time [1606.04474]

  10. Training protocol • Sample a random task f • Train the optimizer on f by gradient descent (100 optimization steps, unrolled 20 at a time) • Repeat [1606.04474]
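A hedged sketch of this protocol, reusing the hypothetical RNNOptimizer interface from above on a toy 10-parameter target; the grad.detach() implements the 'no 2nd-order derivatives' simplification noted on slide 13 below.

```python
import torch

def train_optimizer(opt_net, meta_opt, sample_task, n_steps=100, unroll=20):
    f = sample_task()                            # random target ("optimizee") task
    theta = torch.randn(10, requires_grad=True)  # toy 10-parameter target
    hidden, meta_loss = None, 0.0
    for t in range(1, n_steps + 1):
        loss = f(theta)
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        # detach(): target gradients are treated as constants w.r.t. phi (slide 13)
        g, hidden = opt_net(grad.detach().view(1, -1, 1), hidden)
        theta = theta + g.view(-1)               # theta_{t+1} = theta_t + g_t
        meta_loss = meta_loss + f(theta)         # L(phi) = sum_t f(theta_t), w_t = 1
        if t % unroll == 0:                      # truncated backprop through time
            meta_opt.zero_grad()
            meta_loss.backward()
            meta_opt.step()                      # gradient step on phi
            theta = theta.detach().requires_grad_()
            hidden = tuple(h.detach() for h in hidden)
            meta_loss = 0.0
```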

  11. Test optimizer performance • Sample new tasks • Apply optimizer for some steps, compute average loss • Compare with existing optimizers (ADAM, RMSProp) [1606.04474]

  12. Computational graph • [Figure: the unrolled computational graph used for computing the gradient of the optimizer with respect to φ; the optimizer cell (parameters φ) appears once per unrolled step] [1606.04474]

  13. Simplifying assumptions • No 2nd-order derivatives: set ∇_φ ∇_θ f = 0, i.e. treat the target gradients as constants with respect to φ • RNN weights are shared between target parameters, so the result is independent of parameter ordering • Each parameter has a separate hidden state [1606.04474]

  14. Experiments Variability is in initial target parameters and choice of mini-batches [1606.04474]

  15. Experiments Separate optimizers for convolutional and fully-connected layers [1606.04474]

  16. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks Finn, Abbeel, Levine 1703.03400

  17. Basic idea • Start with a class of tasks T_i with distribution p(T) • Train one model θ that can be quickly fine-tuned to new tasks ('few-shot learning') • How? Explicitly require that a single training step significantly improves the loss • Meta loss function, optimized over θ: min_θ Σ_{T_i ∼ p(T)} L_{T_i}(θ − α ∇_θ L_{T_i}(θ)) [1703.03400]
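A minimal PyTorch sketch of one step on this meta loss, assuming `model_params` is a list of tensors and `loss_fn`, `tasks`, and `alpha` are illustrative names; the inner learning rate α and the per-task train/test split follow the paper's setup.

```python
import torch

# One meta-step on the MAML objective:
#   theta' = theta - alpha * grad L_Ti(theta);  meta-loss = sum_i L_Ti(theta')
def maml_step(model_params, tasks, loss_fn, alpha=0.01):
    meta_loss = 0.0
    for task in tasks:                                    # T_i ~ p(T)
        x_train, y_train, x_test, y_test = task
        loss = loss_fn(model_params, x_train, y_train)
        grads = torch.autograd.grad(loss, model_params, create_graph=True)
        # Inner step kept inside the autograd graph, so the meta-gradient can
        # differentiate through it (this is where second derivatives enter).
        adapted = [p - alpha * g for p, g in zip(model_params, grads)]
        # Meta-loss is evaluated on held-out samples from the same task.
        meta_loss = meta_loss + loss_fn(adapted, x_test, y_test)
    return meta_loss   # minimize w.r.t. model_params with any outer optimizer
```

Minimizing the returned meta_loss with an outer optimizer over model_params differentiates through the inner update; create_graph=True is what makes the second derivatives of slide 19 appear.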

  18. The meta-gradient is evaluated on new samples drawn from each task (to avoid overfitting?) [1703.03400]

  19. Comments • Can be adapted to any scenario that uses gradient descent (e.g. regression, reinforcement learning) • Involves taking second derivatives • A first-order approximation still works well [1703.03400]
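The first-order approximation drops those second-derivative terms; in the sketch above this amounts to computing the inner gradients without create_graph, so the meta-gradient treats them as constants. Again a hedged sketch, with the same hypothetical names:

```python
import torch

def fomaml_step(model_params, tasks, loss_fn, alpha=0.01):
    meta_loss = 0.0
    for task in tasks:
        x_train, y_train, x_test, y_test = task
        loss = loss_fn(model_params, x_train, y_train)
        grads = torch.autograd.grad(loss, model_params)  # no create_graph:
        # inner grads are constants, so d(adapted)/d(theta) is the identity and
        # no second derivatives appear in the meta-gradient.
        adapted = [p - alpha * g for p, g in zip(model_params, grads)]
        meta_loss = meta_loss + loss_fn(adapted, x_test, y_test)
    return meta_loss
```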

  20. Regression experiment • Single task = regress a sine function with a given underlying amplitude and phase • Model is a fully-connected network with 2 hidden layers • 'Pretrained' baseline = a single model trained on many tasks simultaneously [1703.03400]
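An illustrative task sampler for this experiment; the amplitude, phase, and input ranges below are my reading of the paper's setup and should be treated as assumptions.

```python
import numpy as np

# Each task: y = A * sin(x - phi), with amplitude A and phase phi drawn per task.
def sample_sine_task(rng, n_points=10):
    A = rng.uniform(0.1, 5.0)          # amplitude range (assumed from the paper)
    phi = rng.uniform(0.0, np.pi)      # phase range (assumed from the paper)
    x = rng.uniform(-5.0, 5.0, size=n_points)
    return x, A * np.sin(x - phi)
```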

  21. Classification experiment • Each few-shot classification problem (over a small sampled set of classes) is a single task [1703.03400]

  22. RL experiment • Reward = negative squared distance from the goal position • For each task, the goal is placed randomly [1703.03400]
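The slide's reward written as a function; the goal-sampling range here is illustrative, not the paper's exact value.

```python
import numpy as np

def reward(state, goal):
    return -np.sum((state - goal) ** 2)  # negative squared distance to goal

rng = np.random.default_rng(0)
goal = rng.uniform(-1.0, 1.0, size=2)    # per-task random goal (assumed range)
```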

  23. Overcoming catastrophic forgetting in neural networks Kirkpatrick et al. 1612.00796

  24. Basic idea • Catastrophic forgetting: when a model is trained on task A followed by task B, it typically forgets A • Idea: after training on A, 'freeze' the parameters that are important for A with a quadratic penalty: L(θ) = L_B(θ) + (λ/2) Σ_i F_i (θ_i − θ*_{A,i})², where θ*_A are the optimal parameters for task A, λ is a hyperparameter, and F_i ≈ ∂²L_A / ∂θ_i² is the diagonal of the Fisher information matrix [1612.00796]
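A minimal sketch of this penalty, assuming `params`, `star_params` (θ*_A), and `fisher_diag` are matching lists of tensors; the names are illustrative.

```python
# EWC loss: L(theta) = L_B(theta) + (lam/2) * sum_i F_i * (theta_i - theta*_A,i)^2
def ewc_loss(loss_B, params, star_params, fisher_diag, lam):
    penalty = 0.0
    for p, p_star, F in zip(params, star_params, fisher_diag):
        penalty = penalty + (F * (p - p_star) ** 2).sum()
    return loss_B + 0.5 * lam * penalty
```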

  25. Why Fisher information?
L(θ) = −log p(θ | D_A, D_B) = −log p(D_B | θ) − log p(θ) − log p(D_A | θ) + log p(D_A, D_B) ∼ L_B(θ) − log p(D_A | θ)
−log p(D_A | θ) = −Σ_i log p_θ(x_i) ∼ −Σ_x p_A(x) log p_θ(x)
Now suppose p_{θ*} = p_A. Then
−Σ_x p_{θ*}(x) log p_{θ* + dθ}(x) = S(p_{θ*}) + (1/2) dθᵀ F dθ + ···
with F_ij = E_{x ∼ p_θ}[ ∇_{θ_i} log p_θ(x) · ∇_{θ_j} log p_θ(x) ]

  26. Why Fisher information? • Conclusion: L(θ) ∼ L_B(θ) + (1/2) dθᵀ F dθ, with dθ = θ − θ*_A
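In practice the diagonal F_i is estimated from squared log-likelihood gradients at θ*_A; a hedged sketch, assuming a hypothetical `log_likelihood(params, x)` returning log p_θ(x).

```python
import torch

# Diagonal Fisher estimate at theta*_A, averaged over data from task A:
#   F_i = E_x[ (d log p_theta(x) / d theta_i)^2 ]
def fisher_diagonal(params, data_A, log_likelihood):
    fisher = [torch.zeros_like(p) for p in params]
    for x in data_A:
        grads = torch.autograd.grad(log_likelihood(params, x), params)
        for F, g in zip(fisher, grads):
            F += g ** 2                  # accumulate squared score per parameter
    return [F / len(data_A) for F in fisher]
```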

  27. MNIST experiment
