  1. Multi-Task & Meta-Learning Basics CS 330

  2. Logistics Homework 1 posted today, due Wednesday, October 9 Fill out paper preferences by tomorrow. TensorFlow review session tomorrow, 4:30 pm in Gates B03

  3. Plan for Today
Multi-Task Learning - Models & training - Challenges - Case study of real-world multi-task learning
 — short break — 
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches (the last two are the topic of Homework 1!)

  4. Multi-Task Learning Basics

  5. Some notation
Single-task supervised learning (more formally this time): given a dataset $\mathcal{D} = \{(x, y)_k\}$ (e.g., $x$ an image and $y$ a label such as tiger/lynx/cat, or $x$ a paper and $y$ the length of the paper), fit a model $f_\theta(y \mid x)$ by solving $\min_\theta \mathcal{L}(\theta, \mathcal{D})$.
Typical loss: negative log-likelihood, $\mathcal{L}(\theta, \mathcal{D}) = -\mathbb{E}_{(x, y) \sim \mathcal{D}}[\log f_\theta(y \mid x)]$.
What is a task? A task is defined by its data-generating distributions and loss: $\mathcal{T}_i \triangleq \{p_i(x), p_i(y \mid x), \mathcal{L}_i\}$, with corresponding datasets $\mathcal{D}_i^{\text{tr}}$ and $\mathcal{D}_i^{\text{test}}$ (we will use $\mathcal{D}_i$ as shorthand for $\mathcal{D}_i^{\text{tr}}$).
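A minimal sketch of this loss (assuming PyTorch and a classifier that outputs logits; all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def nll_loss(model, x, y):
    """Monte-Carlo estimate of L(theta, D) = -E_{(x,y)~D}[log f_theta(y|x)].

    `model` maps a batch of inputs to per-class logits; the softmax of the
    logits is f_theta(y|x), so cross-entropy on the batch is exactly the
    averaged negative log-likelihood.
    """
    logits = model(x)                  # shape: (batch, num_classes)
    return F.cross_entropy(logits, y)  # -mean_k log softmax(logits_k)[y_k]
```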

  6. Examples of Tasks
Recall: a task is $\mathcal{T}_i \triangleq \{p_i(x), p_i(y \mid x), \mathcal{L}_i\}$, with datasets $\mathcal{D}_i^{\text{tr}}$ and $\mathcal{D}_i^{\text{test}}$ ($\mathcal{D}_i$ as shorthand for $\mathcal{D}_i^{\text{tr}}$).
Multi-task classification: $\mathcal{L}_i$ is the same across all tasks. E.g., per-language handwriting recognition, personalized spam filter.
Multi-label learning: $\mathcal{L}_i$ and $p_i(x)$ are the same across all tasks. E.g., CelebA attribute recognition, scene understanding.
When might $\mathcal{L}_i$ vary across tasks? - mixed discrete, continuous labels across tasks - if you care more about one task than another

  7. Replace $f_\theta(y \mid x)$ with a task-conditioned model $f_\theta(y \mid x, z_i)$ (e.g., tasks on the same paper input $x$: length of paper, summary of paper, paper review), where $z_i$ is a task descriptor: e.g., a one-hot encoding of the task index, or whatever meta-data you have - personalization: user features/attributes - language description of the task - formal specifications of the task
Objective: $\min_\theta \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)$
A model decision and an algorithm decision: How should we condition on $z_i$? How should we optimize our objective?
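As a concrete sketch (PyTorch, illustrative names, assuming $z_i$ is a one-hot encoding of the task index), conditioning can be as simple as concatenating the descriptor onto the input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedNet(nn.Module):
    """f_theta(y | x, z_i): one network for all tasks, conditioned on a
    one-hot task descriptor z_i concatenated onto the input x."""

    def __init__(self, x_dim, num_tasks, hidden_dim=64, num_classes=10):
        super().__init__()
        self.num_tasks = num_tasks
        self.net = nn.Sequential(
            nn.Linear(x_dim + num_tasks, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x, task_idx):
        z = F.one_hot(task_idx, self.num_tasks).float()  # z_i as one-hot
        return self.net(torch.cat([x, z], dim=-1))
```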

  8. Conditioning on the task
Let's assume $z_i$ is the task index. Question: How should you condition on the task in order to share as little as possible?

  9. Conditioning on the task
One extreme: multiplicative gating over per-task subnetworks, $y = \sum_j \mathbf{1}(z_i = j)\, y_j$, where $y_j$ is the output of the $j$-th subnetwork on $x$. With no shared parameters, this amounts to independent training within a single network!
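A sketch of this extreme (PyTorch, illustrative names): the one-hot gate selects exactly one per-task subnetwork, so nothing is shared across tasks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedIndependentNets(nn.Module):
    """y = sum_j 1(z_i = j) * y_j: one subnetwork per task, selected by a
    one-hot gate on the task index, i.e. no parameters shared across tasks."""

    def __init__(self, x_dim, y_dim, num_tasks, hidden_dim=64):
        super().__init__()
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, y_dim))
            for _ in range(num_tasks)
        ])

    def forward(self, x, task_idx):
        # y_j for every subnetwork, stacked: (batch, num_tasks, y_dim)
        ys = torch.stack([net(x) for net in self.subnets], dim=1)
        gate = F.one_hot(task_idx, len(self.subnets)).float()  # 1(z_i = j)
        return (gate.unsqueeze(-1) * ys).sum(dim=1)  # selects y_{z_i}
```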

  10. The other extreme
Concatenate $z_i$ with the input and/or the activations. All parameters are shared, except the parameters directly following $z_i$.

  11. An Alternative View on the Multi-Task Objective
Split $\theta$ into shared parameters $\theta^{\text{sh}}$ and task-specific parameters $\theta^{i}$. Then our objective is: $\min_{\theta^{\text{sh}}, \theta^{1}, \ldots, \theta^{T}} \sum_{i=1}^{T} \mathcal{L}_i(\{\theta^{\text{sh}}, \theta^{i}\}, \mathcal{D}_i)$
Choosing how to condition on $z_i$ is equivalent to choosing how & where to share parameters.

  12. Conditioning: Some Common Choices
1. Concatenation-based conditioning 2. Additive conditioning. These are actually the same! Diagram sources: distill.pub/2018/feature-wise-transformations/
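Why they are the same: a linear layer applied to the concatenation $[x; z_i]$ decomposes as $W [x; z_i] = W_x x + W_z z_i$, which is exactly additive conditioning. A quick numerical check (PyTorch sketch; shapes are arbitrary):

```python
import torch

x, z = torch.randn(5, 8), torch.randn(5, 3)  # inputs and task descriptors
W = torch.randn(4, 8 + 3)                    # layer applied to [x; z]
W_x, W_z = W[:, :8], W[:, 8:]                # split into x-part and z-part

concat_out = torch.cat([x, z], dim=-1) @ W.T  # concatenation-based
additive_out = x @ W_x.T + z @ W_z.T          # additive conditioning
assert torch.allclose(concat_out, additive_out, atol=1e-5)
```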

  13. Conditioning: Some Common Choices
3. Multi-head architecture (Ruder '17) 4. Multiplicative conditioning: more expressive (recall multiplicative gating). Why might multiplicative conditioning be a good idea? It generalizes independent networks and independent heads. Diagram sources: distill.pub/2018/feature-wise-transformations/
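A sketch of multiplicative conditioning in the style of the feature-wise transformations linked above (PyTorch; names and layer sizes are illustrative): a learned, task-dependent gate scales the hidden features element-wise.

```python
import torch
import torch.nn as nn

class MultiplicativeConditioning(nn.Module):
    """Hidden features are scaled element-wise by a task-dependent gate.
    A gate of hard zeros/ones would recover multiplicative gating, which is
    why this generalizes independent networks and independent heads."""

    def __init__(self, x_dim, num_tasks, hidden_dim=64, y_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU())
        self.task_gate = nn.Embedding(num_tasks, hidden_dim)  # gate per task
        self.head = nn.Linear(hidden_dim, y_dim)

    def forward(self, x, task_idx):
        h = self.encoder(x)
        gamma = torch.sigmoid(self.task_gate(task_idx))  # (batch, hidden_dim)
        return self.head(gamma * h)  # multiplicative interaction with z_i
```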

  14. Conditioning: More Complex Choices
Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert '16
Multi-Task Attention Network. Liu, Johns, Davison '18
Deep Relation Networks. Long, Wang '15
Sluice Networks. Ruder, Bingel, Augenstein, Sogaard '17

  15. Conditioning Choices Unfortunately, these design decisions are like neural network architecture tuning: - problem dependent - largely guided by intuition or knowledge of the problem - currently more of an art than a science

  16. Optimizing the objective
Objective: $\min_\theta \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)$
Basic version:
1. Sample a mini-batch of tasks $\mathcal{B} \sim \{\mathcal{T}_i\}$
2. Sample a mini-batch of datapoints for each task, $\mathcal{D}_i^{b} \sim \mathcal{D}_i$
3. Compute the loss on the mini-batch: $\hat{\mathcal{L}}(\theta, \mathcal{B}) = \sum_{\mathcal{T}_k \in \mathcal{B}} \mathcal{L}_k(\theta, \mathcal{D}_k^{b})$
4. Backpropagate the loss to compute the gradient $\nabla_\theta \hat{\mathcal{L}}$
5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)
Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
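The five steps as a runnable sketch (PyTorch; the dataset layout, model signature, and hyperparameters are illustrative assumptions). Sampling tasks first, then datapoints within each sampled task, is what keeps task sampling uniform regardless of per-task data quantities:

```python
import random
import torch

def train_multitask(model, task_datasets, loss_fns, num_steps=1000,
                    tasks_per_batch=4, points_per_task=32, lr=1e-3):
    """task_datasets[i]: list of (x, y) tensor pairs for task i.
    loss_fns[i]: the per-task loss L_i(predictions, y)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_steps):
        # 1. Sample a mini-batch of tasks B ~ {T_i}
        batch = random.sample(range(len(task_datasets)), tasks_per_batch)
        loss = 0.0
        for k in batch:
            # 2. Sample a mini-batch of datapoints D_k^b ~ D_k
            pts = random.sample(task_datasets[k], points_per_task)
            x = torch.stack([p[0] for p in pts])
            y = torch.stack([p[1] for p in pts])
            task_idx = torch.full((len(pts),), k, dtype=torch.long)
            # 3. Accumulate the mini-batch loss L_hat(theta, B)
            loss = loss + loss_fns[k](model(x, task_idx), y)
        opt.zero_grad()
        loss.backward()  # 4. Backpropagate to get grad_theta L_hat
        opt.step()       # 5. Apply the gradient (e.g. Adam)
```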

  17. Challenges

  18. Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work the best (e.g., on Multi-Task CIFAR-100, independent training beats recent state-of-the-art multi-task approaches).
Why? - optimization challenges: caused by cross-task interference, and tasks may learn at different rates - limited representational capacity: multi-task networks often need to be much larger than their single-task counterparts

  19. If you have negative transfer, share less across tasks.
It's not just a binary decision! "Soft parameter sharing": train per-task networks ($x \to y_1, \ldots, x \to y_T$) with constrained weights, e.g.
$\min_{\theta^{\text{sh}}, \theta^{1}, \ldots, \theta^{T}} \sum_{i=1}^{T} \mathcal{L}_i(\{\theta^{\text{sh}}, \theta^{i}\}, \mathcal{D}_i) + \sum_{t'=1}^{T} \| \theta^{t} - \theta^{t'} \|$
+ allows for more fluid degrees of parameter sharing
- yet another set of design decisions / hyperparameters
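A sketch of the soft-sharing penalty (PyTorch; the coefficient lam and the pairwise norm are illustrative choices): per-task networks keep their own weights, but corresponding parameters are pulled toward one another, giving a tunable degree of sharing.

```python
import torch

def soft_sharing_penalty(task_nets, lam=0.01):
    """Sum over task pairs (t, t') of ||theta_t - theta_t'||, computed
    parameter-by-parameter across structurally identical networks."""
    penalty = 0.0
    for t in range(len(task_nets)):
        for t2 in range(t + 1, len(task_nets)):
            for p, q in zip(task_nets[t].parameters(),
                            task_nets[t2].parameters()):
                penalty = penalty + (p - q).norm()
    return lam * penalty

# Usage: total_loss = sum(per_task_losses) + soft_sharing_penalty(task_nets)
```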

  20. Challenge #2: Overfitting
You may not be sharing enough! Multi-task learning acts as a form of regularization. Solution: Share more.

  21. Case study Goal : Make recommendations for YouTube

  22. Case study
Goal: Make recommendations for YouTube
Conflicting objectives: - videos that users will rate highly - videos that users will share - videos that users will watch
Implicit bias caused by feedback: a user may have watched a video simply because it was recommended!

  23. Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos 2. Rank the candidates 3. Serve the top-ranking videos to the user
Candidate videos: pooled from multiple candidate generation algorithms - matching topics of the query video - videos most frequently watched with the query video - and others
Ranking: the central topic of this paper

  24. The Ranking Problem
Input: query video, candidate video, user & context features
Model output: engagement and satisfaction with the candidate video
Engagement: - binary classification tasks such as clicks - regression tasks for quantities related to time spent
Satisfaction: - binary classification tasks such as clicking "like" - regression tasks such as rating
A weighted combination of the engagement & satisfaction predictions yields the ranking score; the score weights are manually tuned.
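A minimal sketch of that final scoring step (the task names and weight values below are purely illustrative, not the paper's):

```python
# Hypothetical per-task predictions for one candidate video.
predictions = {"click_prob": 0.31, "expected_watch_time": 42.0,
               "like_prob": 0.12, "expected_rating": 3.8}

# Manually tuned combination weights (illustrative values only).
weights = {"click_prob": 1.0, "expected_watch_time": 0.05,
           "like_prob": 2.0, "expected_rating": 0.5}

ranking_score = sum(weights[k] * predictions[k] for k in predictions)
```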

  25. The Architecture
Basic option: "Shared-Bottom Model" (i.e. the multi-head architecture). Drawback: it can harm learning when the correlation between tasks is low.

  26. The Architecture
Instead, allow different parts of the network to "specialize" using a form of soft parameter sharing: "Multi-gate Mixture-of-Experts (MMoE)", built from expert neural networks $f_1, \ldots, f_n$.
Decide which experts to use for input $x$, task $k$: $g^{k}(x) = \mathrm{softmax}(W_{g^{k}} x)$
Compute features from the selected experts: $f^{k}(x) = \sum_{i=1}^{n} g^{k}(x)_i f_i(x)$
Compute the output: $y_k = h^{k}(f^{k}(x))$
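A sketch of the MMoE forward pass as formulated above (PyTorch; layer sizes and names are illustrative assumptions, not the production model): a shared pool of experts, with one softmax gate and one output head per task.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Multi-gate Mixture-of-Experts: shared experts f_i, per-task gates
    g^k(x) = softmax(W_k x), task features f^k(x) = sum_i g^k(x)_i f_i(x),
    and per-task heads y_k = h^k(f^k(x))."""

    def __init__(self, x_dim, num_experts, num_tasks, hidden_dim=64, out_dim=1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU())
            for _ in range(num_experts)])
        self.gates = nn.ModuleList([nn.Linear(x_dim, num_experts, bias=False)
                                    for _ in range(num_tasks)])
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, out_dim)
                                    for _ in range(num_tasks)])

    def forward(self, x):
        # All expert outputs: (batch, num_experts, hidden_dim)
        f = torch.stack([expert(x) for expert in self.experts], dim=1)
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)  # g^k(x)
            f_k = (w * f).sum(dim=1)                          # f^k(x)
            outputs.append(head(f_k))                         # y_k
        return outputs  # one prediction per task
```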

  27. Experiments
Set-up: - Implementation in TensorFlow, on TPUs - Train in temporal order, running training continuously to consume newly arriving data - Offline AUC & squared-error metrics - Online A/B testing in comparison to the production system: live metrics based on time spent, survey responses, rate of dismissals - Model computational efficiency matters
Results: found a 20% chance of gating polarization during distributed training -> use dropout on the experts

  28. Plan for Today
Multi-Task Learning - Models & training - Challenges - Case study of real-world multi-task learning
 — short break — 
Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches (the last two are the topic of Homework 1!)

  29. Meta-Learning Basics

  30. Two ways to view meta-learning algorithms
Mechanistic view ➢ Deep neural network model that can read in an entire dataset and make predictions for new datapoints ➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task ➢ This view makes it easier to implement meta-learning algorithms
Probabilistic view ➢ Extract prior information from a set of (meta-training) tasks that allows efficient learning of new tasks ➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters ➢ This view makes it easier to understand meta-learning algorithms

  31. Problem definitions
Standard supervised learning: $\arg\max_\phi \log p(\phi \mid \mathcal{D}) = \arg\max_\phi \log p(\mathcal{D} \mid \phi) + \log p(\phi)$, where $\mathcal{D} = \{(x_k, y_k)\}$ is the training data (inputs $x$, e.g. images, with labels $y$), $\phi$ are the model parameters, $\log p(\mathcal{D} \mid \phi)$ is the data likelihood, and $\log p(\phi)$ is a regularizer (e.g., weight decay).
What is wrong with this? ➢ The most powerful models typically require large amounts of labeled data ➢ Labeled data for some tasks may be very limited

  32. Problem definitions (figure; image adapted from Ravi & Larochelle)

  33. The meta-learning problem

  34. A Quick Example (figure: given a small training set, the model predicts the test label for a test input)

  35. How do we train this thing?
Key idea: "our training procedure is based on a simple machine learning principle: test and train conditions must match" (Vinyals et al., Matching Networks for One-Shot Learning)
