Meta Learning Shengchao Liu
Background
• Meta Learning (a.k.a. Learning to Learn)
  • A fast-learning algorithm: one that adapts quickly from the source tasks to the target tasks
• Key terminology
  • Support Set & Query Set
  • C-Way K-Shot Learning: C classes, each with K samples
  • Pre-training & Fine-tuning
Meta-Learning taxonomy
• Metric-Based: Siamese NN, Relation Network, Matching Network, Prototypical Networks, Meta GNN
• Model-Based: MANN, Meta Networks, HyperNetworks
• Gradient-Based: MAML (FOMAML), Reptile, ANIL
1. Metric-Based
• Similar idea to the nearest-neighbors algorithm:
  $p_\theta(y \mid x, S) = \sum_{(x_i, y_i) \in S} k_\theta(x, x_i)\, y_i$, where $k_\theta$ is the kernel function
• Siamese Neural Networks for One-shot Image Recognition, ICML 2015
• Learning to Compare: Relation Network for Few-Shot Learning, CVPR 2018
• Matching Networks for One Shot Learning, NIPS 2016
• Prototypical Networks for Few-Shot Learning, NeurIPS 2017
• Few-Shot Learning with Graph Neural Networks, ICLR 2018
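As a concrete illustration of this weighted nearest-neighbor prediction rule, here is a minimal NumPy sketch; the Gaussian kernel on embedding distances and the `embed` function are assumptions for illustration, not any particular paper's method.

```python
import numpy as np

def kernel_predict(x, support_x, support_y, embed, n_classes, gamma=1.0):
    """p(y | x, S) = sum_i k_theta(x, x_i) * y_i, with one-hot labels y_i."""
    z_x = embed(x)                                      # embed the query
    z_s = np.stack([embed(xi) for xi in support_x])     # embed the support set
    # assumed kernel: Gaussian (RBF) on embedding distances, normalized over S
    k = np.exp(-gamma * np.sum((z_s - z_x) ** 2, axis=1))
    k = k / k.sum()
    y_onehot = np.eye(n_classes)[support_y]             # (|S|, n_classes)
    return k @ y_onehot                                 # class probabilities for x
```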
Siamese Neural Network
• Few-shot learning
• Twin network: the two inputs go through the same encoder with shared weights
• L1 distance between the two embeddings as the metric
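A minimal PyTorch sketch of the twin-network idea: a shared encoder and an L1 (absolute-difference) comparison scored by a small head. The MLP encoder and layer sizes are illustrative assumptions; the original paper uses a CNN encoder.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, in_dim=784, hid=256):
        super().__init__()
        # twin branches share the same encoder weights
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU())
        self.head = nn.Linear(hid, 1)   # scores the element-wise L1 distance

    def forward(self, x1, x2):
        h1, h2 = self.encoder(x1), self.encoder(x2)
        return torch.sigmoid(self.head(torch.abs(h1 - h2)))  # P(same class)
```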
Siamese Neural Network
Relation Network
• Few-shot learning
• Similar to the Siamese Network
• Difference: the two embeddings are concatenated and fed to a CNN that serves as the relation module
Matching Network
• Given a training (support) set with k samples per class: $S = \{(x_i, y_i)\}_{i=1}^{k}$
• Goal: $P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$, with attention
  $a(\hat{x}, x_i) = \dfrac{\exp[\mathrm{cosine}(f(\hat{x}), g(x_i))]}{\sum_{j=1}^{k} \exp[\mathrm{cosine}(f(\hat{x}), g(x_j))]}$
• Two embedding methods are tested for $f, g$
• Episodic training
• Support set (C-Way K-Shot)
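A short PyTorch sketch of this prediction rule, assuming the query embedding $f(\hat{x})$ and the support embeddings $g(x_i)$ have already been computed (the names `query_emb`, `support_emb`, `support_y` are mine):

```python
import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_emb, support_y, n_classes):
    """P(y_hat | x_hat, S) = sum_i softmax_i(cosine(f(x_hat), g(x_i))) * y_i."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_emb, dim=1)  # (|S|,)
    a = F.softmax(sims, dim=0)                      # attention weights over the support set
    y_onehot = F.one_hot(support_y, n_classes).float()
    return a @ y_onehot                             # predicted class distribution
```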
Matching Network
• Simple Embedding: $f = g$, with some CNN model
• Full Context Embedding (FCE):
  • $g(x_i)$ applies a bidirectional LSTM
  • $f(\hat{x})$ applies an attention-LSTM:
    1. First encode $\hat{x}$ through the CNN to get $f'(\hat{x})$
    2. Then an attention-LSTM is trained with read attention over the full support set $S$:
       $\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1})$
       $h_k = \hat{h}_k + f'(\hat{x})$
       $r_k = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \cdot g(x_i)$,
       where $a(h_{k-1}, g(x_i)) = \exp\{h_{k-1}^{\top} g(x_i)\} \,/\, \sum_{j=1}^{|S|} \exp\{h_{k-1}^{\top} g(x_j)\}$
    3. Finally $f(\hat{x}) = h_K$, where $K$ is the number of read steps
Prototypical Network
• For each class $k$:
  • Sample a support set $S_k$ and compute the prototype: $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$
  • Sample a query set and classify with: $p(y = k \mid x) = \dfrac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$
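These two steps translate almost directly into code; a minimal PyTorch sketch, assuming precomputed embeddings and using the squared Euclidean distance:

```python
import torch

def proto_predict(query_emb, support_emb, support_y, n_classes):
    """Prototypes are per-class means; classify queries by softmax over negative distances."""
    prototypes = torch.stack([support_emb[support_y == k].mean(dim=0)
                              for k in range(n_classes)])      # c_k, shape (C, d)
    dists = torch.cdist(query_emb, prototypes) ** 2            # squared Euclidean d(f(x), c_k)
    return torch.softmax(-dists, dim=1)                        # p(y = k | x), shape (Q, C)
```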
Prototypical Network
Prototypical Network
• When viewed as a clustering algorithm, Bregman divergences
  $d_\varphi(z, z') = \varphi(z) - \varphi(z') - (z - z')^{\top} \nabla \varphi(z')$
  achieve the minimum distance to the cluster center (the class mean)
• Equivalent to a linear model when the squared Euclidean distance is used
• Comparison between Matching Network & Prototypical Network:
  • Equivalent in one-shot learning, but not in K-shot learning
  • Matching Network: $P(\hat{y} \mid \hat{x}) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$ with the cosine-softmax attention $a$
  • Prototypical Network: $p(y = k \mid x) = \dfrac{\exp(-d(f_\phi(x), c_k))}{\sum_{k'} \exp(-d(f_\phi(x), c_{k'}))}$
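As a quick sanity check of the Euclidean case, taking the convex function $\varphi(z) = \|z\|^2$ (so $\nabla\varphi(z') = 2z'$) in the Bregman definition above recovers the squared Euclidean distance:

$d_\varphi(z, z') = \|z\|^2 - \|z'\|^2 - (z - z')^{\top}(2z') = \|z\|^2 - 2 z^{\top} z' + \|z'\|^2 = \|z - z'\|^2$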
Meta GNN
Meta GNN
• For the $k$-th layer:
  • Node update: $x_i^{k} = \mathrm{GCN}(x_i^{k-1})$
  • Learned adjacency: $A_{i,j}^{k} = \phi(x_i^{k}, x_j^{k}) = \mathrm{MLP}(|x_i^{k} - x_j^{k}|)$
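A small PyTorch sketch of the learned adjacency $A_{i,j} = \mathrm{MLP}(|x_i - x_j|)$; the hidden size and the row-softmax normalization are my assumptions for illustration:

```python
import torch
import torch.nn as nn

class EdgeModule(nn.Module):
    """Builds a dense adjacency from pairwise absolute feature differences."""
    def __init__(self, dim, hid=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, x):                                   # x: (N, dim) node features
        diff = (x.unsqueeze(1) - x.unsqueeze(0)).abs()      # (N, N, dim), |x_i - x_j|
        logits = self.mlp(diff).squeeze(-1)                 # (N, N) raw edge scores
        return torch.softmax(logits, dim=1)                 # row-normalized adjacency A
```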
Metric-Based
• Comments:
  • Performance depends heavily on the metric (kernel / embedding) function
  • Robustness: performance degrades when the new task diverges from the source tasks
Meta-Learning taxonomy (recap): next up, Model-Based methods
2. Model-Based
• Goal: to learn a base model $f_\theta$
• Solution: learn a meta model that parameterizes $f_\theta$
• Meta-Learning with Memory-Augmented Neural Networks, ICML 2016
• Meta Networks, ICML 2017
• HyperNetworks, arXiv 2016
Memory-Augmented Neural Networks (MANN)
• Basic idea (built on the Neural Turing Machine):
  • Store the useful information of the new task in an external memory
  • The true label is presented with a one-step offset: the label of the previous input is given together with the current input
Memory-Augmented Neural Networks (MANN)
• Example (figure)
Addressing Mechanism
• At step $t$: the key vector $k_t$ is generated from the input $x_t$; the memory matrix is $M_t$; the read vector is $r_t$
• Read weights $w_t^{r}$, usage weights $w_t^{u}$, write weights $w_t^{w}$
• Read:
  $w_t^{r}(i) = \mathrm{softmax}\left( \dfrac{k_t \cdot M_t(i)}{\|k_t\| \, \|M_t(i)\|} \right)$
  $r_t = \sum_{i=1}^{N} w_t^{r}(i)\, M_t(i)$
• Write (Least Recently Used Access, LRUA):
  $w_t^{u} = \gamma\, w_{t-1}^{u} + w_t^{r} + w_t^{w}$
  $w_t^{w} = \sigma(\alpha)\, w_{t-1}^{r} + (1 - \sigma(\alpha))\, w_{t-1}^{lu}$
  $w_t^{lu}(i) = \begin{cases} 0, & \text{if } w_t^{u}(i) > m(w_t^{u}, n) \\ 1, & \text{otherwise} \end{cases}$, where $m(w_t^{u}, n)$ is the $n$-th smallest element of $w_t^{u}$
  $M_t(i) = M_{t-1}(i) + w_t^{w}(i)\, k_t, \ \forall i$
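A NumPy sketch of the read and LRUA write rules above. The decay $\gamma$, the number of least-used slots $n$, and the ordering of read vs. write within a step are simplifying assumptions; $\alpha$ would be a learned scalar in the real model.

```python
import numpy as np

def cosine_read(k_t, M_t):
    """w_r(i) = softmax(cosine(k_t, M_t(i)));  r_t = sum_i w_r(i) * M_t(i)."""
    sims = (M_t @ k_t) / (np.linalg.norm(M_t, axis=1) * np.linalg.norm(k_t) + 1e-8)
    w_r = np.exp(sims - sims.max()); w_r /= w_r.sum()
    return w_r, w_r @ M_t

def lrua_write(M_prev, k_t, w_r_prev, w_u_prev, alpha, gamma=0.95, n=1):
    """Write to the least recently used slot(s), then update the usage weights."""
    # least-used weights: 1 on the n smallest-usage slots, 0 elsewhere
    w_lu_prev = (w_u_prev <= np.sort(w_u_prev)[n - 1]).astype(float)
    sig = 1.0 / (1.0 + np.exp(-alpha))                  # sigma(alpha): interpolation gate
    w_w = sig * w_r_prev + (1 - sig) * w_lu_prev        # write weights
    M_t = M_prev + np.outer(w_w, k_t)                   # M_t(i) = M_{t-1}(i) + w_w(i) k_t
    w_r, _ = cosine_read(k_t, M_t)
    w_u = gamma * w_u_prev + w_r + w_w                  # usage: decayed + read + write
    return M_t, w_u
```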
Meta-Learning taxonomy (recap): next up, Gradient-Based methods
3. Gradient-Based
• Model-Based (recap):
  • Goal: to learn a base model $f_\theta$
  • Solution: learn a meta model to parameterize $f_\theta$
• Gradient-Based:
  • Goal: to learn a base model $f_\theta$
  • Solution: learn to parameterize $f_\theta$ without a meta model
3. Gradient-Based
• Learning to learn with gradients
• MAML (Model-Agnostic Meta-Learning) & FOMAML, ICML 2017
• Reptile, arXiv 2018
• ANIL (Almost No Inner Loop), ICLR 2020
MAML
• Model-Agnostic Meta-Learning (MAML)
• Motivation:
  • Find model parameters that are sensitive to changes in the task
  • So that small changes in the parameters yield large improvements on a new task
MAML
• Inner loop:
  • Sample a batch of tasks $\tau_i \sim p(\tau)$
  • For each task, sample $K$ examples and take a gradient step: $\theta_i' = \theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)$
• Outer loop:
  • Meta-objective: $\min_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'}) = \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
  • SGD: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'}) = \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
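A compact PyTorch sketch of one MAML meta-update with a single inner gradient step. Here `loss_fn(model, batch, params=None)` is a hypothetical helper that evaluates the task loss, optionally with an overridden parameter list (e.g., via `torch.func.functional_call`); `create_graph=True` is what keeps the second-order terms.

```python
import torch

def maml_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
    """tasks: iterable of (support_batch, query_batch) pairs for tasks tau_i ~ p(tau)."""
    params = list(model.parameters())
    meta_loss = 0.0
    for support, query in tasks:
        # inner loop: one SGD step on the support set, keeping the graph for 2nd-order grads
        grads = torch.autograd.grad(loss_fn(model, support), params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]       # theta'_i
        # outer objective: loss of the adapted parameters on the query set
        meta_loss = meta_loss + loss_fn(model, query, params=adapted)
    meta_grads = torch.autograd.grad(meta_loss, params)                # grad through the inner step
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= beta * g                                              # outer SGD step on theta
```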
FOMAML
• MAML involves a gradient through a gradient:
  $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}\big(f_{\theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)}\big)$
• First-order approximation, a.k.a. first-order MAML (FOMAML):
  • Omit the second-order derivatives
  • Still compute the meta-gradient at the post-update parameters $\theta_i'$:
    $\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta_i'} \ell_{\tau_i}(f_{\theta_i'})$ (i.e., treating $\partial \theta_i' / \partial \theta$ as the identity)
• Almost the same performance, but roughly 33% faster
• Note: this meta-objective resembles multi-task learning.
MAML (recap, for comparison with FOMAML)
• Inner loop: $\theta_i' = \theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)$
• Outer loop: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'})$, differentiating through the inner step
FOMAML
• Inner loop:
  • Sample a batch of tasks $\tau_i \sim p(\tau)$
  • For each task, sample $K$ examples
  • SGD: $\theta_i' = \theta - \alpha \nabla_\theta \ell_{\tau_i}(f_\theta)$
• Outer loop:
  • Meta-objective: $\min_\theta \sum_{\tau_i \sim p(\tau)} \ell_{\tau_i}(f_{\theta_i'})$
  • SGD: $\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta_i'} \ell_{\tau_i}(f_{\theta_i'})$
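For contrast with the MAML sketch above, a first-order variant: the inner step is taken without building a graph, the adapted parameters are detached, and the query-set gradient evaluated at $\theta_i'$ is applied directly to $\theta$. The same hypothetical `loss_fn(model, batch, params=...)` helper is assumed.

```python
import torch

def fomaml_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
    params = list(model.parameters())
    meta_grads = [torch.zeros_like(p) for p in params]
    for support, query in tasks:
        # inner loop: plain SGD step; no create_graph, so no second-order terms
        grads = torch.autograd.grad(loss_fn(model, support), params)
        adapted = [(p - alpha * g).detach().requires_grad_(True)      # theta'_i, cut from theta
                   for p, g in zip(params, grads)]
        # meta-gradient is computed at theta'_i ...
        outer = torch.autograd.grad(loss_fn(model, query, params=adapted), adapted)
        meta_grads = [mg + og for mg, og in zip(meta_grads, outer)]
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= beta * g                                              # ... but applied to theta
```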
Reptile
• Same motivation:
  • Pre-training: learn an initialization
  • Fine-tuning: the initialization can be quickly adapted to new tasks
Reptile
• For each iteration:
  • Sample a task $\tau$ and get the corresponding loss $\ell_\tau$
  • Compute $\tilde{\theta} = U_\tau^{k}(\theta)$, i.e., $k$ steps of SGD/Adam on $\ell_\tau$
  • Update $\theta \leftarrow \theta + \epsilon\, (\tilde{\theta} - \theta)$, or with a batch of $n$ tasks, $\theta \leftarrow \theta + \epsilon\, \frac{1}{n} \sum_{i=1}^{n} (\tilde{\theta}_i - \theta)$
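A PyTorch sketch of one Reptile iteration on a single sampled task; `task_loader` (yielding batches of the sampled task) and `loss_fn(model, batch)` are assumed helpers.

```python
import copy
import torch

def reptile_step(model, task_loader, loss_fn, k=5, inner_lr=0.01, epsilon=0.1):
    theta = copy.deepcopy(model.state_dict())                 # snapshot theta
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    batches = iter(task_loader)
    for _ in range(k):                                        # theta_tilde = U_tau^k(theta)
        opt.zero_grad()
        loss_fn(model, next(batches)).backward()
        opt.step()
    with torch.no_grad():                                     # theta <- theta + eps * (theta_tilde - theta)
        for name, p in model.named_parameters():
            p.copy_(theta[name] + epsilon * (p - theta[name]))
```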
Reptile
• If $k = 1$, Reptile is similar to joint training, $\min_\theta \mathbb{E}_\tau[L_\tau]$:
  $g_{\mathrm{Reptile},\, k=1} = \theta - \tilde{\theta} = \theta - U_\tau(\theta) = \theta - (\theta - \nabla_\theta L_\tau(\theta)) = \nabla_\theta L_\tau(\theta)$
• If $k > 1$, Reptile diverges from $\min_\theta \mathbb{E}_\tau[L_\tau]$:
  $\theta - U_\tau^{k}(\theta) \neq \nabla_\theta L_\tau(\theta)$ in general
ANIL
• ANIL (Almost No Inner Loop)
• Investigates why MAML works: rapid learning or feature reuse?
ANIL
• ANIL: only update the head (the last layer) in the inner loop; the body (feature extractor) is updated only in the outer loop
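A sketch of the ANIL inner loop, assuming the network is split into a `body` (feature extractor) and a linear `head`: only the head parameters are adapted in the inner loop, while the body (and the head initialization) are trained by the outer loop as in MAML.

```python
import torch
import torch.nn.functional as F

def anil_inner_loop(body, head, support_x, support_y, alpha=0.01, steps=5):
    """Adapt only the head; body parameters receive no inner-loop updates."""
    features = body(support_x)                        # reused features (no inner adaptation)
    w, b = [p.clone() for p in head.parameters()]     # assumes head is nn.Linear -> (weight, bias)
    for _ in range(steps):
        loss = F.cross_entropy(F.linear(features, w, b), support_y)
        gw, gb = torch.autograd.grad(loss, (w, b), create_graph=True)
        w, b = w - alpha * gw, b - alpha * gb         # inner SGD step on the head only
    return w, b                                       # adapted head, used for the outer-loop loss
```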