Multi-Task Learning & Transfer Learning Basics
CS 330
Logistics
- Homework 1 posted Monday 9/21, due Wednesday 9/30 at midnight.
- TensorFlow review session tomorrow at 6:00 pm PT.
- Project guidelines posted early next week.
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
(short break here)
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning
Goals for the end of the lecture:
- Know the key design decisions when building multi-task learning systems
- Understand the difference between multi-task learning and transfer learning
- Understand the basics of transfer learning
Multi-Task Learning
Some notation
Single-task [supervised] learning: dataset $\mathcal{D} = \{(x, y)_k\}$, model $f_\theta(y \mid x)$, objective $\min_\theta \mathcal{L}(\theta, \mathcal{D})$
(e.g. x = an image, y in {tiger, lynx, cat}; or x = a paper, y = length of the paper)
Typical loss: negative log likelihood: $\mathcal{L}(\theta, \mathcal{D}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log f_\theta(y \mid x)]$
What is a task? (more formally this time)
A task: $\mathcal{T}_i \triangleq \{p_i(x),\ p_i(y \mid x),\ \mathcal{L}_i\}$, i.e. data-generating distributions plus a loss
Corresponding datasets: $\mathcal{D}_i^{tr}$, $\mathcal{D}_i^{test}$ (will use $\mathcal{D}_i$ as shorthand for $\mathcal{D}_i^{tr}$)
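To make the notation concrete, here is a minimal sketch of the negative-log-likelihood loss above, written in PyTorch purely for illustration (the course homework uses TensorFlow; the model, dimensions, and data here are made up):

```python
import torch
import torch.nn.functional as F

# Hypothetical model f_theta(y | x): a linear classifier over 3 labels (e.g. tiger/lynx/cat).
f_theta = torch.nn.Linear(64, 3)

def nll_loss(model, x, y):
    """L(theta, D) = -E_{(x,y)~D}[ log f_theta(y | x) ], estimated on a mini-batch."""
    log_probs = F.log_softmax(model(x), dim=-1)          # log f_theta(y | x)
    return -log_probs.gather(1, y.unsqueeze(1)).mean()   # average negative log likelihood

x = torch.randn(32, 64)           # batch of inputs
y = torch.randint(0, 3, (32,))    # batch of labels
loss = nll_loss(f_theta, x, y)    # same value as F.cross_entropy(f_theta(x), y)
```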
Examples of Tasks
Recall a task: $\mathcal{T}_i \triangleq \{p_i(x),\ p_i(y \mid x),\ \mathcal{L}_i\}$, with datasets $\mathcal{D}_i^{tr}$, $\mathcal{D}_i^{test}$ ($\mathcal{D}_i$ shorthand for $\mathcal{D}_i^{tr}$).
Multi-task classification: $\mathcal{L}_i$ same across all tasks
- e.g. per-language handwriting recognition
- e.g. personalized spam filter
Multi-label learning: $\mathcal{L}_i$ and $p_i(x)$ same across all tasks
- e.g. CelebA attribute recognition
- e.g. scene understanding
When might $\mathcal{L}_i$ vary across tasks?
- mixed discrete, continuous labels across tasks
- multiple metrics that you care about
Multi-task model: $f_\theta(y \mid x, z_i)$, where $z_i$ is a task descriptor (e.g. for the same x = a paper, different tasks produce the length of the paper, a summary of the paper, or a paper review).
$z_i$ could be a one-hot encoding of the task index, or whatever meta-data you have:
- personalization: user features/attributes
- language description of the task
- formal specifications of the task
Vanilla MTL Objective: $\min_\theta \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)$
Decisions on the model, the objective, and the optimization:
- How should we condition on $z_i$?
- What objective should we use?
- How to optimize our objective?
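A sketch of a task-conditioned model $f_\theta(y \mid x, z_i)$ and the vanilla MTL objective, assuming $z_i$ is a one-hot task index (PyTorch-style, illustrative only; `loss_fns`, `task_batches`, and the dimensions are placeholders):

```python
import torch
import torch.nn.functional as F

NUM_TASKS = 10  # T (assumed)

class TaskConditionedModel(torch.nn.Module):
    """f_theta(y | x, z_i): here z_i is simply concatenated with the input x."""
    def __init__(self, x_dim=32, z_dim=NUM_TASKS, y_dim=5, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(x_dim + z_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, y_dim))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def vanilla_mtl_loss(model, task_batches, loss_fns):
    """Vanilla MTL objective: sum_{i=1}^T L_i(theta, D_i), each term on a batch from task i."""
    total = 0.0
    for i, (x, y) in enumerate(task_batches):
        z = F.one_hot(torch.full((x.shape[0],), i), num_classes=NUM_TASKS).float()
        total = total + loss_fns[i](model(x, z), y)
    return total
```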
Model: How should the model be conditioned on $z_i$? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?
Conditioning on the task
Let's assume $z_i$ is the one-hot task index.
Question: How should you condition on the task in order to share as little as possible? (raise your hand)
Conditioning on the task
One answer: multiplicative gating with one subnetwork per task, each producing $y_1, \ldots, y_T$ from x, combined as $y = \sum_j \mathbf{1}(z_i = j)\, y_j$
-> independent training within a single network, with no shared parameters!
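A sketch of this extreme: one subnetwork per task, selected by the one-hot $z_i$, so nothing is shared (PyTorch, illustrative; the dimensions are assumptions):

```python
import torch

class GatedMultiTaskModel(torch.nn.Module):
    """y = sum_j 1(z_i = j) * y_j: each task j has its own subnetwork producing y_j.
    With a one-hot gate, tasks train independently inside one 'network'."""
    def __init__(self, num_tasks, x_dim=32, y_dim=5, hidden=64):
        super().__init__()
        self.subnets = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(x_dim, hidden), torch.nn.ReLU(),
                                torch.nn.Linear(hidden, y_dim))
            for _ in range(num_tasks)])

    def forward(self, x, z_onehot):
        ys = torch.stack([net(x) for net in self.subnets], dim=1)  # (batch, T, y_dim)
        return (z_onehot.unsqueeze(-1) * ys).sum(dim=1)            # select the task's output
```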
The other extreme
Concatenate $z_i$ with the input and/or the activations.
-> all parameters are shared (except the parameters directly following $z_i$, if $z_i$ is one-hot)
An Alternative View on the Multi-Task Architecture
Split $\theta$ into shared parameters $\theta^{sh}$ and task-specific parameters $\theta^i$.
Then, our objective is: $\min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \sum_{i=1}^{T} \mathcal{L}_i(\{\theta^{sh}, \theta^i\}, \mathcal{D}_i)$
Choosing how to condition on $z_i$ is equivalent to choosing how & where to share parameters.
Conditioning: Some Common Choices
1. Concatenation-based conditioning: concatenate $z_i$ with the input/activations
2. Additive conditioning: add a linear transformation of $z_i$ to the activations
These are actually equivalent! Question: why are they the same thing? (raise your hand)
Hint: consider the application of the fully-connected layer that follows the concatenation.
Diagram sources: distill.pub/2018/feature-wise-transformations/
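A quick numerical check of the equivalence, under the assumption that a fully-connected layer is applied to the concatenation $[x; z_i]$: the layer's weight matrix splits into a part acting on $x$ and a part that only adds a $z_i$-dependent bias (PyTorch, illustrative):

```python
import torch

x_dim, z_dim, out_dim = 8, 3, 5
fc = torch.nn.Linear(x_dim + z_dim, out_dim)   # the layer following the concatenation

x = torch.randn(4, x_dim)
z = torch.randn(4, z_dim)

# 1. Concatenation-based conditioning: fc([x; z])
concat_out = fc(torch.cat([x, z], dim=-1))

# 2. Additive conditioning: split the same weights into W_x and W_z
W_x, W_z = fc.weight[:, :x_dim], fc.weight[:, x_dim:]
additive_out = x @ W_x.T + (z @ W_z.T + fc.bias)   # a z-dependent additive shift

print(torch.allclose(concat_out, additive_out, atol=1e-6))  # True
```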
Conditioning: Some Common Choices
3. Multi-head architecture (Ruder '17)
4. Multiplicative conditioning
Why might multiplicative conditioning be a good idea?
- more expressive per layer
- recall: multiplicative gating
Multiplicative conditioning generalizes independent networks and independent heads.
Diagram sources: distill.pub/2018/feature-wise-transformations/
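Illustrative sketches of these two choices (PyTorch; the architecture sizes are assumptions, not from the slides):

```python
import torch

class MultiHeadModel(torch.nn.Module):
    """3. Multi-head: a shared bottom (theta_sh) with one output head (theta_i) per task."""
    def __init__(self, num_tasks, x_dim=32, y_dim=5, hidden=64):
        super().__init__()
        self.shared = torch.nn.Sequential(torch.nn.Linear(x_dim, hidden), torch.nn.ReLU())
        self.heads = torch.nn.ModuleList(
            [torch.nn.Linear(hidden, y_dim) for _ in range(num_tasks)])

    def forward(self, x, task_idx):
        return self.heads[task_idx](self.shared(x))

class MultiplicativelyConditionedModel(torch.nn.Module):
    """4. Multiplicative conditioning: a learned per-task gate rescales shared activations."""
    def __init__(self, num_tasks, x_dim=32, y_dim=5, hidden=64):
        super().__init__()
        self.shared = torch.nn.Sequential(torch.nn.Linear(x_dim, hidden), torch.nn.ReLU())
        self.gates = torch.nn.Embedding(num_tasks, hidden)  # one gate vector per task
        self.out = torch.nn.Linear(hidden, y_dim)

    def forward(self, x, task_idx):
        h = self.shared(x)
        gate = self.gates(torch.as_tensor(task_idx))  # element-wise, per-task modulation
        return self.out(h * gate)
```

In the multiplicative version, the per-task gate can zero out different subsets of hidden units for different tasks, carving out task-specific sub-paths within the shared layers.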
Conditioning: More Complex Choices
- Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert '16
- Multi-Task Attention Network. Liu, Johns, Davison '18
- Deep Relation Networks. Long, Wang '15
- Sluice Networks. Ruder, Bingel, Augenstein, Sogaard '17
Conditioning Choices
Unfortunately, these design decisions are like neural network architecture tuning:
- problem dependent
- largely guided by intuition or knowledge of the problem
- currently more of an art than a science
Model: How should the model be conditioned on $z_i$? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?
Vanilla MTL Objective: $\min_\theta \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)$
Often want to weight tasks differently: $\min_\theta \sum_{i=1}^{T} w_i \mathcal{L}_i(\theta, \mathcal{D}_i)$
- manually, based on importance or priority
- dynamically adjust $w_i$ throughout training
How to choose $w_i$?
a. various heuristics: e.g. encourage gradients to have similar magnitudes (Chen et al. GradNorm. ICML 2018)
b. use task uncertainty (e.g. see Kendall et al. CVPR 2018)
c. aim for monotonic improvement towards a Pareto optimal solution (e.g. see Sener et al. NeurIPS 2018)
   - $\theta^a$ dominates $\theta^b$ if $\mathcal{L}_i(\theta^a) \le \mathcal{L}_i(\theta^b)\ \forall i$ and $\sum_i \mathcal{L}_i(\theta^a) \neq \sum_i \mathcal{L}_i(\theta^b)$
   - $\theta^\star$ is Pareto optimal if there exists no $\theta$ that dominates $\theta^\star$ (at $\theta^\star$, improving one task will always require worsening another)
d. optimize for the worst-case task loss: $\min_\theta \max_i \mathcal{L}_i(\theta, \mathcal{D}_i)$ (e.g. for task robustness, or for fairness)
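Sketches of a few of these choices (PyTorch, illustrative; the uncertainty-based weighting below is one common formulation in the spirit of Kendall et al., not necessarily their exact loss):

```python
import torch

def weighted_mtl_loss(task_losses, weights):
    """min_theta sum_i w_i * L_i: weights chosen manually or adjusted during training."""
    return sum(w * L for w, L in zip(weights, task_losses))

def worst_case_mtl_loss(task_losses):
    """(d) min_theta max_i L_i: optimize the worst-case task loss (robustness / fairness)."""
    return torch.stack(list(task_losses)).max()

class UncertaintyWeighting(torch.nn.Module):
    """(b) Learn a log-variance s_i per task and use loss = sum_i exp(-s_i) * L_i + s_i,
    so noisier (higher-uncertainty) tasks are automatically down-weighted."""
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        losses = torch.stack(list(task_losses))
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()
```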
Model: How should the model be conditioned on $z_i$? What parameters of the model should be shared?
Objective: How should the objective be formed?
Optimization: How should the objective be optimized?
Optimizing the objective
Vanilla MTL Objective: $\min_\theta \sum_{i=1}^{T} \mathcal{L}_i(\theta, \mathcal{D}_i)$
Basic Version:
1. Sample a mini-batch of tasks $\mathcal{B} \sim \{\mathcal{T}_i\}$
2. Sample a mini-batch of datapoints for each task: $\mathcal{D}_i^b \sim \mathcal{D}_i$
3. Compute the loss on the mini-batch: $\hat{\mathcal{L}}(\theta, \mathcal{B}) = \sum_{\mathcal{T}_k \in \mathcal{B}} \mathcal{L}_k(\theta, \mathcal{D}_k^b)$
4. Backpropagate the loss to compute the gradient $\nabla_\theta \hat{\mathcal{L}}$
5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)
Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
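The basic version above as a single training step (PyTorch-style sketch; `tasks[i].sample_batch` and the model's task-conditioning interface are assumed placeholders, not a real API):

```python
import random
import torch

def mtl_train_step(model, optimizer, tasks, loss_fns, num_sampled_tasks=4, batch_size=32):
    # 1. Sample a mini-batch of tasks B ~ {T_i}, uniformly regardless of dataset sizes
    task_ids = random.sample(range(len(tasks)), num_sampled_tasks)
    total_loss = 0.0
    for i in task_ids:
        # 2. Sample a mini-batch of datapoints D_i^b ~ D_i (assumed dataset interface)
        x, y = tasks[i].sample_batch(batch_size)
        # 3. Accumulate the per-task loss into L_hat(theta, B)
        total_loss = total_loss + loss_fns[i](model(x, i), y)
    # 4. Backpropagate to compute grad_theta L_hat
    optimizer.zero_grad()
    total_loss.backward()
    # 5. Apply the gradient with your favorite optimizer (e.g. Adam)
    optimizer.step()
    return float(total_loss.detach())
```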
Challenges
Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work the best.
e.g. on Multi-Task CIFAR-100, independent training outperforms multi-head architectures, a cross-stitch architecture, and several recent approaches (Yu et al. Gradient Surgery for Multi-Task Learning. 2020)
Why?
- optimization challenges: caused by cross-task interference; tasks may learn at different rates
- limited representational capacity: multi-task networks often need to be much larger than their single-task counterparts
If you have negative transfer, share less across tasks.
It's not just a binary decision!
"Soft parameter sharing": $\min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \sum_{i=1}^{T} \mathcal{L}_i(\{\theta^{sh}, \theta^i\}, \mathcal{D}_i) + \sum_{t'=1}^{T} \|\theta^t - \theta^{t'}\|$
i.e. otherwise-separate per-task networks ($y^1, \ldots, y^T$, each computed from x) with constrained weights
+ allows for more fluid degrees of parameter sharing
- yet another set of design decisions / hyperparameters
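A sketch of the soft-sharing penalty, assuming one separately-parameterized network per task with identical architectures (PyTorch, illustrative):

```python
import torch

def soft_sharing_penalty(task_models, strength=1e-2):
    """Penalize || theta_t - theta_t' || between pairs of per-task networks,
    softly tying their weights together instead of hard-sharing them."""
    models = list(task_models)
    penalty = 0.0
    for t in range(len(models)):
        for t2 in range(t + 1, len(models)):
            for p_t, p_t2 in zip(models[t].parameters(), models[t2].parameters()):
                penalty = penalty + (p_t - p_t2).norm()
    return strength * penalty

# Usage (sketch): total_loss = sum(task_losses) + soft_sharing_penalty(task_nets)
```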
Challenge #2: Overfitting
You may not be sharing enough! Multi-task learning <-> a form of regularization.
Solution: Share more.
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
(short break here)
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning
Case study
Goal: Make recommendations for YouTube
Case study
Goal: Make recommendations for YouTube
Conflicting objectives:
- videos that users will rate highly
- videos that users will share
- videos that users will watch
Implicit bias caused by feedback: a user may have watched a video simply because it was recommended!
Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve top-ranking videos to the user
Candidate videos: pool videos from multiple candidate generation algorithms
- matching topics of the query video
- videos most frequently watched with the query video
- and others
Ranking: central topic of this paper
The Ranking Problem
Input: query video, candidate video, user & context features
Model output: engagement and satisfaction with the candidate video
Engagement:
- binary classification tasks, like clicks
- regression tasks for tasks related to time spent
Satisfaction:
- binary classification tasks, like clicking "like"
- regression tasks for tasks such as rating
Weighted combination of engagement & satisfaction predictions -> ranking score (score weights manually tuned)
Question: Are these objectives reasonable? What are some of the issues that might come up? (answer in chat)
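A toy sketch of how the per-task predictions might be combined into a single ranking score with manually tuned weights (the task names and numbers here are invented for illustration, not taken from the paper):

```python
def ranking_score(engagement_preds, satisfaction_preds, weights):
    """Weighted combination of engagement & satisfaction predictions -> ranking score."""
    score = 0.0
    for name, pred in {**engagement_preds, **satisfaction_preds}.items():
        score += weights[name] * pred
    return score

# Hypothetical usage:
score = ranking_score(
    engagement_preds={"click_prob": 0.42, "expected_watch_minutes": 7.5},
    satisfaction_preds={"like_prob": 0.10, "predicted_rating": 4.1},
    weights={"click_prob": 1.0, "expected_watch_minutes": 0.2,
             "like_prob": 2.0, "predicted_rating": 0.5})
```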
The Architecture
Basic option: "Shared-Bottom Model" (i.e. multi-head architecture)
-> can harm learning when correlation between tasks is low