+ Unifying Perspectives on Knowledge Sharing: From Atomic to Parameterised Domains and Tasks
Task-CV @ ECCV 2016
Timothy Hospedales, University of Edinburgh & Queen Mary University of London
With Yongxin Yang, Queen Mary University of London
+ Today’s Topics
- Distributed definitions of tasks/domains, and the different problem settings that arise
- A flexible approach to task/domain transfer:
  - Generalizes existing approaches
  - Generalizes multiple problem settings
  - Covers shallow and deep models
+ Why Transfer Learning?
[Diagram: conventional IID learning trains a separate model per task or domain (Data 1 → Model 1, Data 2 → Model 2, Data 3 → Model 3), whereas lifelong learning shares one body of knowledge across them.]
- But… humans seem to generalize across tasks, e.g. Crawl => Walk => Run => Scooter => Bike => Motorbike => Driving.
+ Taxonomy of Research Issues
- Sharing setting: sequential / one-way, multi-task, lifelong learning
- Transfer across: task transfer, domain transfer
- Feature/label space: homogeneous, heterogeneous
- Labeling assumption: supervised, unsupervised
- Sharing approach: model-based, instance-based, feature-based
- Balancing challenge: positive transfer strength, negative transfer robustness
+ Overview
- A review of some classic methods
- A general framework
- Example problems and settings
- Going deeper
- Open questions
+ Some Classic Methods – 1: Model Adaptation
An example of simple sequential transfer:
- Learn a source task y = f_s(x, w_s):
  \min_{w_s} \sum_i (y_i - w_s^T x_i)^2 + \lambda\, w_s^T w_s
- Learn a new target task y = f_t(x, w):
  \min_{w} \sum_i (y_i - w^T x_i)^2 + \lambda (w - w_s)^T (w - w_s)
- Regularize the new task toward the old task (…rather than toward zero)
[Figure: source and target weight vectors in (w_1, w_2) space; the target solution is pulled toward the source.]
E.g., Yang, ACM MM, 2007
+ Some Classic Methods – 1: Model Adaptation
An example of simple sequential transfer:
- Learn a new target task y = f_t(x, w):
  \min_{w} \sum_i (y_i - w^T x_i)^2 + \lambda (w - w_s)^T (w - w_s)
- Limitations:
  ✘ Assumes the source task is related
  ✘ Only sequential, one-way transfer
E.g., Yang, ACM MM, 2007
(A minimal sketch of this objective follows below.)
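To make the adaptation objective concrete, here is a minimal NumPy sketch assuming a squared loss, so both problems have closed-form ridge-style solutions; the function and variable names are illustrative and not from the cited paper.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Source task: min_w sum_i (y_i - w^T x_i)^2 + lam * w^T w (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def adapt_from_source(X, y, w_s, lam):
    """Target task: min_w sum_i (y_i - w^T x_i)^2 + lam * (w - w_s)^T (w - w_s).
    Setting the gradient to zero gives (X^T X + lam*I) w = X^T y + lam * w_s,
    i.e. the target solution is pulled toward the source weights w_s."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_s)

# Toy usage: plentiful source data, only a few target examples.
rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(200, 5)), rng.normal(size=200)
Xt, yt = rng.normal(size=(10, 5)), rng.normal(size=10)
w_source = fit_ridge(Xs, ys, lam=1.0)
w_target = adapt_from_source(Xt, yt, w_source, lam=10.0)
```

A larger lam keeps the target model closer to the source; lam → 0 recovers independent target learning.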
+ Some Classic Methods – 2: Regularized Multi-Task
An example of simple multi-task transfer:
- Learn a set of tasks { y = f_t(x, w_t) } from data { x_{i,t}, y_{i,t} }:
  \min_{w_0, w_t} \sum_{i,t} (y_{i,t} - w_t^T x_{i,t})^2 + \lambda (w_t - w_0)^T (w_t - w_0),  t = 1..T
- Regularize each task towards the mean of all tasks
[Figure: per-task weight vectors in (w_1, w_2) space clustered around a shared mean.]
E.g., Evgeniou & Pontil, KDD’04; Salakhutdinov, CVPR’11; Khosla, ECCV’12
+ Some Classic Methods – 2: Regularized Multi-Task
An example of simple multi-task transfer:
- Learn a set of tasks { y = f_t(x, w_t) } from data { x_{i,t}, y_{i,t} }:
  \min_{w_0, w_t} \sum_{i,t} (y_{i,t} - w_t^T x_{i,t})^2 + \lambda (w_t - w_0)^T (w_t - w_0),  t = 1..T
  Or, reparameterizing each task’s weights as a shared part plus a task-specific part:
  \min_{w_0, w_t} \sum_{i,t} (y_{i,t} - (w_t + w_0)^T x_{i,t})^2,  t = 1..T
- Summary:
  ✔ Now multi-task
  ✗ Tasks and their mean are inter-dependent: must be jointly optimized
  ✗ Still assumes all tasks are (equally) related
(A minimal sketch of this objective follows below.)
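Here is a minimal sketch of the mean-regularized objective with a squared loss and plain gradient descent; the cited works use other losses and solvers (e.g. an SVM formulation), so treat the names and optimizer here as illustrative assumptions.

```python
import numpy as np

def mtl_mean_regularised(Xs, ys, lam=1.0, lr=0.01, iters=500):
    """Jointly fit per-task weights w_t regularized toward a shared w_0:
    min_{w_0, w_t} sum_{i,t} (y_{i,t} - w_t^T x_{i,t})^2 + lam * ||w_t - w_0||^2.
    Xs, ys: lists with one (n_t, d) design matrix and one (n_t,) target vector per task."""
    d, T = Xs[0].shape[1], len(Xs)
    w0, W = np.zeros(d), np.zeros((T, d))
    for _ in range(iters):
        grad_w0 = np.zeros(d)
        for t, (X, y) in enumerate(zip(Xs, ys)):
            resid = X @ W[t] - y
            grad_t = 2 * X.T @ resid + 2 * lam * (W[t] - w0)  # task loss + pull toward w0
            grad_w0 += -2 * lam * (W[t] - w0)                 # w0 is pulled toward the task mean
            W[t] = W[t] - lr * grad_t
        w0 -= lr * grad_w0
    return w0, W
```

The coupling through w0 is exactly why the tasks must be optimized jointly, as the slide notes.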
+ Some Classic Methods – 3: Task Clustering
Relaxing the relatedness assumption through task clustering:
- Learn a set of tasks { y = f_t(x, w_t) } from data { x_{i,t}, y_{i,t} }
- Assume the tasks form K similar groups
- Regularize each task towards its nearest group:
  \min_{w_k, w_t} \sum_{i,t} (y_{i,t} - w_t^T x_{i,t})^2 + \min_{k'} \lambda (w_t - w_{k'})^T (w_t - w_{k'}),  k = 1..K, t = 1..T
[Figure: task weight vectors in (w_1, w_2) space forming two groups, each regularized toward its group centre.]
E.g., Evgeniou et al, JMLR, 2005; Kang et al, ICML, 2011
+ Some Classic Methods – 3: Task Clustering
Multi-task transfer without assuming all tasks are related:
- Assume the tasks form similar groups:
  \min_{w_k, w_t} \sum_{i,t} (y_{i,t} - w_t^T x_{i,t})^2 + \min_{k'} \lambda (w_t - w_{k'})^T (w_t - w_{k'}),  k = 1..K, t = 1..T
- Summary:
  ✔ Doesn’t require all tasks to be related => more robust to negative transfer
  ✔ Benefits from “more specific” transfer
  ✗ What about task-specific vs. task-independent knowledge?
  ✗ How to determine the number of clusters K?
  ✗ What if tasks share at the level of “parts”?
  ✗ Optimization is hard
(A minimal sketch of this objective follows below.)
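The min over cluster assignments makes this objective non-convex; a simple way to approximate it is a k-means-style alternation between assigning tasks to centres and updating weights. The sketch below assumes a squared loss and this alternating scheme (the cited papers use different formulations), with illustrative names throughout.

```python
import numpy as np

def cluster_regularised_mtl(Xs, ys, K=2, lam=1.0, lr=0.01, outer=20, inner=50):
    """Alternate: (a) assign each task to its nearest cluster centre,
    (b) take gradient steps on each w_t, pulled toward its assigned centre,
    (c) recompute each centre as the mean of its member tasks."""
    d, T = Xs[0].shape[1], len(Xs)
    rng = np.random.default_rng(0)
    W = np.zeros((T, d))
    C = rng.normal(scale=0.1, size=(K, d))                 # cluster centres
    for _ in range(outer):
        assign = np.array([np.argmin(((W[t] - C) ** 2).sum(axis=1)) for t in range(T)])
        for _ in range(inner):
            for t, (X, y) in enumerate(zip(Xs, ys)):
                resid = X @ W[t] - y
                W[t] -= lr * (2 * X.T @ resid + 2 * lam * (W[t] - C[assign[t]]))
        for k in range(K):
            if (assign == k).any():
                C[k] = W[assign == k].mean(axis=0)         # update centre from member tasks
    return W, C, assign
```

The number of clusters K remains a free hyperparameter, matching the open question on the slide.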
+ Some Classic Methods – 4: Task Factoring
- Learn a set of tasks { y = f_t(x, w_t) } from data { x_{i,t}, y_{i,t} }
- Assume they are related by a factor-analysis / latent-task structure
- Notation: inputs are now triples { x_i, y_i, z_i }, where z_i is a binary (1-hot) task indicator vector
- Single-task learning in weight-stacking notation, with the per-task weight vectors stacked as the columns of W:
  y = f_t(x, W) = w_t^T x = (W z)^T x
  \min_W \sum_i (y_i - (W z_i)^T x_i)^2 + \lambda \|W\|^2
- Factor-analysis MTL: factorize W = PQ:
  y = (W z)^T x = (P Q z)^T x
  \min_{P,Q} \sum_i (y_i - (P Q z_i)^T x_i)^2 + \lambda \|P\| + \omega \|Q\|
E.g., Kumar, ICML’12; Passos, ICML’12
+ Some Classic Methods – 4: Task Factoring
- Learn a set of tasks y = f_t(x, W) from triples { x_i, y_i, z_i }
- Assume they are related by a factor-analysis / latent-task structure:
  y = w_t^T x = (W z)^T x = (P Q z)^T x
  \min_{P,Q} \sum_i (y_i - (P Q z_i)^T x_i)^2 + \lambda \|P\| + \omega \|Q\|
- What does it mean?
  - W: DxT matrix of all task parameters
  - P: DxK matrix of basis / latent tasks
  - Q: KxT matrix of low-dimensional task models
  - => Each task is a low-dimensional linear combination of basis tasks.
+ Some Classic Methods – 4: Task Factoring
- Learn a set of tasks y = f_t(x, W) from triples { x_i, y_i, z_i }
- Assume they are related by a factor-analysis / latent-task structure:
  y = w_t^T x = (W z)^T x = (P Q z)^T x
  \min_{P,Q} \sum_i (y_i - (P Q z_i)^T x_i)^2 + \lambda \|P\| + \omega \|Q\|
- What does it mean?
  - z: (1-hot binary) activates a column of Q
  - P: DxK matrix of basis / latent tasks
  - Q: KxT matrix of task models
  - => Tasks lie on a low-dimensional manifold
  - => Knowledge sharing by jointly learning the manifold
  - P specifies the manifold; Q gives each task’s position on it
[Figure: task weight vectors w_1, w_2, w_3 lying on a low-dimensional manifold defined by P, with coordinates given by Q.]
+ Some Classic Methods – 4: Task Factoring
- Summary:
  - Tasks lie on a low-dimensional manifold
  - Each task is a low-dimensional linear combination of basis tasks
  y = w_t^T x = (W z)^T x = (P Q z)^T x
  \min_{P,Q} \sum_i (y_i - (P Q z_i)^T x_i)^2 + \lambda \|P\| + \omega \|Q\|
  ✔ Can flexibly share or not share: similarity between two columns (tasks) of Q
  ✔ Can share piecewise: two columns (tasks) of Q similar in some rows only
  ✔ Can represent globally shared knowledge: a uniform row in Q => all tasks activate the same basis of P
(A minimal sketch of the factorized objective follows below.)
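A minimal sketch of the factorized objective with W = PQ, a squared loss, and plain L2 penalties on both factors; note that GO-MTL itself uses a sparsity penalty on the task codes and a more careful optimizer, so this is only an illustrative, assumption-laden version with made-up names.

```python
import numpy as np

def factored_mtl(X, y, task, T, K=3, lam=1e-3, omega=1e-3, lr=0.01, iters=1000):
    """Factorized MTL in the spirit of GO-MTL: W = P Q, so each task's weight
    vector w_t = P q_t is a K-dimensional combination of latent basis tasks.
    X: (n, d) features, y: (n,) targets, task: (n,) integer task index in [0, T)."""
    d = X.shape[1]
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(d, K))       # basis / latent tasks
    Q = rng.normal(scale=0.1, size=(K, T))       # per-task combination weights
    for _ in range(iters):
        Wt = P @ Q                               # (d, T): all task weight vectors
        pred = np.einsum('nd,dn->n', X, Wt[:, task])
        resid = pred - y
        gW = np.zeros((T, d))                    # grad of the loss w.r.t. each task's weights
        np.add.at(gW, task, 2 * resid[:, None] * X)
        gW = gW.T                                # (d, T), matching W = P Q
        P -= lr * (gW @ Q.T + 2 * lam * P)       # chain rule through W = P Q
        Q -= lr * (P.T @ gW + 2 * omega * Q)
    return P, Q
```

Each column of Q is a K-dimensional code for one task, and P @ Q[:, t] recovers that task's full D-dimensional weight vector.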
+ Overview
- A review of some classic methods
- A general framework
- Example problems and settings
- Going deeper
- Open questions
+ MTL Transfer as a Neural Network
- Consider a two-sided neural network:
  - Left: data input x
  - Right: task indicator z
  - Output unit y: inner product of the two sides’ representations
- Equivalent to task regularization [Evgeniou, KDD’04] if:
  - Q = W is a (trainable) FC layer; P is a (fixed) identity matrix
  - z: 1-hot task encoding plus a bias bit => the shared knowledge
  - Linear activation
  y = (w_t + w_0)^T x
  \min_{w_0, w_t} \sum_{i,t} (y_{i,t} - (w_t + w_0)^T x_{i,t})^2,  t = 1..T
[Yang & Hospedales, ICLR’15]
+ MTL Transfer as a Neural Network
- Consider a two-sided neural network:
  - Left: data input x
  - Right: task indicator z
  - Output unit y: inner product of the representations on each side
- Equivalent to task factor analysis [Kumar, ICML’12, GO-MTL] if:
  - Both FC layers P & Q are trained (constraining the task description/parameters)
  - z: 1-hot task encoding
  - Linear activation
  y = (W z)^T x
  \min_{P,Q} \sum_i (y_i - (P Q z_i)^T x_i)^2 = \min_{P,Q} \sum_i (y_i - (P^T x_i)^T (Q z_i))^2
- Encompasses 5+ classic MTL/MDL approaches!
+ MTL Transfer as a Neural Network: Interesting Things
- Generalizes many existing frameworks…
- Can do regression & classification (choice of activation on y)
- Can do multi-task and multi-domain learning
- As a neural network, the left side x can be any CNN, trained end-to-end
  y = (W z)^T x
  \min_{P,Q} \sum_i (y_i - (P^T x_i)^T (Q z_i))^2   (z: task/domain ID; x: data)
+ MTL Transfer as a Neural Network: Interesting Things
- Non-linear activation on the hidden layers:
  - Representation learning on both the task side and the data side
  - Exploits a non-linear task subspace (cf. GO-MTL’s linear task subspace)
  - The final classifier can be non-linear in feature space
  y = \sigma(P^T x)^T \sigma(Q z)
  \min_{P,Q} \sum_i (y_i - \sigma(P^T x_i)^T \sigma(Q z_i))^2   (z: task/domain ID; x: data)
[Figure: task parameters w_1, w_2, w_3 lying on a non-linear manifold.]
(A minimal two-sided network sketch follows below.)
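A minimal PyTorch sketch of the two-sided network, assuming a one-hot task/domain indicator and a squared loss; the class and variable names are illustrative. With linear activations and trainable P, Q this matches the factorized (GO-MTL-style) special case; fixing the data-side layer to the identity and adding a bias bit to z would recover the mean-regularized case.

```python
import torch
import torch.nn as nn

class TwoSidedNet(nn.Module):
    """Two-sided network: data x is encoded by P, a task/domain indicator z is
    encoded by Q, and the output is the inner product of the two codes."""
    def __init__(self, d, n_tasks, k, nonlinear=True):
        super().__init__()
        self.P = nn.Linear(d, k, bias=False)         # data-side projection
        self.Q = nn.Linear(n_tasks, k, bias=False)   # task-side projection (task embedding)
        self.act = nn.Sigmoid() if nonlinear else nn.Identity()

    def forward(self, x, z):
        # x: (batch, d) features; z: (batch, n_tasks) one-hot task/domain indicator
        return (self.act(self.P(x)) * self.act(self.Q(z))).sum(dim=-1)

# Toy end-to-end training loop; x could equally be the output of a CNN trunk.
d, T, k = 16, 4, 8
net = TwoSidedNet(d, T, k)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
x = torch.randn(32, d)
z = torch.eye(T)[torch.randint(0, T, (32,))]
y = torch.randn(32)
for _ in range(100):
    opt.zero_grad()
    loss = ((net(x, z) - y) ** 2).mean()
    loss.backward()
    opt.step()
```

Because the task side is just another input, the same model handles multi-task and multi-domain settings simply by changing what z encodes, as the preceding slides note.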
+ Overview
- A review of some classic methods
- A general framework
- Example problems and settings
- Going deeper
- Open questions