
Adaptive Adversarial Multi-task Representation Learning, Yuren Mao - PowerPoint PPT Presentation



  1. ICML 2020. WHU, China. Adaptive Adversarial Multi-task Representation Learning. Yuren Mao¹, Weiwei Liu², Xuemin Lin¹. 1. University of New South Wales, Australia. 2. Wuhan University, China.

  2. Overview: the Adaptive AMTRL (Adversarial Multi-task Representation Learning) algorithm. [Figure: AMTRL architecture with input, shared layers, task-specific layers for tasks 1 through T, and a discriminator attached through a gradient reversal layer; forward and backward propagation paths annotated, contrasting the original MTRL with adaptive AMTRL.] The method combines an augmented Lagrangian formulation of the min-max objective, a discriminator-based task relatedness measure, and a relatedness-based weighting strategy, yielding better performance. [Figure: (a) three 2-d Gaussian distributions, (b) discriminator, (c) relatedness changing curve.] A PAC bound for AMTRL (slide 8) shows that the generalization error is negligible and that the number of tasks does not matter.

  3. Content
  • Adversarial Multi-task Representation Learning (AMTRL)
  • Adaptive AMTRL
  • PAC Bound and Analysis
  • Experiments

  4. Adversarial Multi-task Representation Learning
  Adversarial Multi-task Representation Learning (AMTRL) has achieved success in various applications, ranging from sentiment analysis to question answering systems. It solves the min-max problem

  $$\min_h L(h, \lambda) = L_S(h) + \lambda L_{adv}$$

  Empirical loss:

  $$L_S(h) = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \ell_t\!\left(f_t(g(x_i^t)),\, y_i^t\right)$$

  Loss of the adversarial module:

  $$L_{adv} = \max_{\Phi} \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} e_t \, \Phi(g(x_i^t))$$

  [Figure: network with input, shared layers g, task-specific layers f_1, ..., f_T, and a discriminator Φ attached through a gradient reversal layer; forward and backward propagation paths annotated. A sketch of this architecture follows below.]
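A minimal PyTorch sketch of the architecture and losses above: shared layers g, task-specific heads f_t, and a discriminator Φ trained through a gradient reversal layer so that a single backward pass realizes the min-max objective. All layer shapes and names here are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward
    pass, so the shared layers are trained to *maximize* the
    discriminator's loss while the discriminator minimizes it."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class AMTRL(nn.Module):
    def __init__(self, in_dim, hid_dim, n_tasks, n_classes):
        super().__init__()
        # shared layers g
        self.g = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # task-specific layers f_1, ..., f_T
        self.heads = nn.ModuleList(
            nn.Linear(hid_dim, n_classes) for _ in range(n_tasks))
        # discriminator Phi predicting which task a representation came from
        self.phi = nn.Linear(hid_dim, n_tasks)

    def forward(self, x, t):
        z = self.g(x)
        task_logits = self.heads[t](z)               # feeds l_t(f_t(g(x)), y)
        adv_logits = self.phi(GradReverse.apply(z))  # feeds L_adv via e_t
        return task_logits, adv_logits
```

Per batch, the total loss is then cross_entropy(task_logits, y) + λ · cross_entropy(adv_logits, task_ids): the discriminator parameters descend this loss while, through the reversal, g ascends it.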

  5. Adaptive AMTRL
  Adaptive AMTRL aims to minimize the task-averaged empirical risk while enforcing the representations of all tasks to share an identical distribution. We formulate this as a constrained optimization problem

  $$\min_h L_S(h) \quad \text{s.t.} \quad L_{adv} - c = 0,$$

  and propose to solve it with an augmented Lagrangian method:

  $$\min_h \; L_S(h) + \lambda (L_{adv} - c) + \frac{r}{2} (L_{adv} - c)^2.$$

  λ and r are updated during the training process, as sketched below.
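The slide does not give the exact update schedule for λ and r, so the following is a standard method-of-multipliers sketch under that assumption: λ moves by the scaled constraint violation between epochs, and the penalty coefficient r grows slowly.

```python
class AugmentedLagrangian:
    """Sketch of the augmented Lagrangian bookkeeping for the constraint
    L_adv - c = 0; `objective` is what gets minimized w.r.t. the network
    weights h, and `update` adjusts the multiplier between epochs."""
    def __init__(self, c, lam=0.0, r=1.0, r_growth=1.05):
        self.c, self.lam, self.r, self.r_growth = c, lam, r, r_growth

    def objective(self, loss_s, loss_adv):
        v = loss_adv - self.c  # constraint violation
        return loss_s + self.lam * v + 0.5 * self.r * v ** 2

    def update(self, loss_adv):
        # classic multiplier step: lambda <- lambda + r * violation,
        # with r grown to progressively tighten the constraint
        self.lam += self.r * (loss_adv - self.c)
        self.r *= self.r_growth
```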

  6. Relatedness for AMTRL
  Relatedness between task i and task j:

  $$R_{ij} = \min\left\{ \frac{\sum_{n=1}^{N} e_j \Phi(g(x_n^i)) + e_i \Phi(g(x_n^j))}{\sum_{n=1}^{N} e_i \Phi(g(x_n^i)) + e_j \Phi(g(x_n^j))}, \; 1 \right\}$$

  Relatedness matrix:

  $$R = \begin{bmatrix} R_{11} & R_{12} & \cdots & R_{1T} \\ R_{21} & R_{22} & \cdots & R_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ R_{T1} & R_{T2} & \cdots & R_{TT} \end{bmatrix}$$

  [Figure: (a) three 2-d Gaussian distributions, (b) discriminator, (c) relatedness changing curve.]
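A numpy sketch of R_ij as defined above, assuming `probs[t]` holds an (N, T) array of discriminator outputs softmax(Φ(g(x))) for the N samples of task t, so that e_j Φ(g(x_n^i)) is `probs[i][n, j]`:

```python
import numpy as np

def relatedness(probs, i, j):
    # numerator: mass the discriminator assigns across tasks i and j
    cross = probs[i][:, j].sum() + probs[j][:, i].sum()
    # denominator: mass it assigns to the samples' own tasks
    own = probs[i][:, i].sum() + probs[j][:, j].sum()
    return min(cross / own, 1.0)

def relatedness_matrix(probs):
    T = len(probs)
    return np.array([[relatedness(probs, i, j) for j in range(T)]
                     for i in range(T)])
```

Note that R_ii = 1 by construction, matching the diagonal of the matrix above.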

  7. Adaptive AMTRL
  In multi-task learning, tasks regularize each other and improve the generalization of some tasks. The weight of each task influences the effect of this regularization. This paper proposes a weighting strategy for AMTRL based on the proposed task relatedness (a sketch follows below):

  $$w = \frac{\mathbf{1} R}{\mathbf{1} R \mathbf{1}'},$$

  where $\mathbf{1}$ is a 1×T vector of all ones and R is the relatedness matrix. Combining the augmented Lagrangian method with this weighting strategy, the optimization objective of our adaptive AMTRL method is

  $$\min_h \; \frac{1}{T} \sum_{t=1}^{T} w_t L_{S_t}(f_t \circ g) + \lambda (L_{adv} - c) + \frac{r}{2} (L_{adv} - c)^2.$$
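A numpy sketch of the weighting rule w = 1R / (1R1′) above. Since R's entries are nonnegative, w is simply the normalized vector of column sums of the relatedness matrix, so tasks more related to the others receive larger weights:

```python
import numpy as np

def task_weights(R):
    scores = np.ones(R.shape[0]) @ R  # 1 R : one relatedness score per task
    return scores / scores.sum()      # divide by 1 R 1'
```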

  8. PAC Bound and Analysis
  Assuming the representation of each task shares an identical distribution, we have the following generalization error bound:

  $$L_D(h) - L_S(h) \le \frac{c_1 \rho \, G_a(G^*(X_1))}{n} + \frac{c_2 Q \sup_{g \in G^*} \| g(X_1) \|}{n} + \sqrt{\frac{9 \ln(2/\delta)}{2nT}}$$

  The left-hand side is the generalization error; the last term, the only one that depends on the number of tasks T, is negligible.
  • The generalization error bound for AMTRL is tighter than that for MTRL.
  • The number of tasks only slightly influences the generalization bound of AMTRL.

  9. Experiments - Relatedness Evolution
  Sentiment analysis and topic classification. Mean relatedness of task t:

  $$\bar{R}_t = \frac{1}{T} \sum_{k=1}^{T} R_{tk}$$

  [Figures: relatedness evolution curves for sentiment analysis and for topic classification.]
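The curves on this slide track the quantity above over training; as a numpy sketch, it is just the row mean of the relatedness matrix from slide 6:

```python
import numpy as np

def mean_relatedness(R):
    # bar{R}_t = (1/T) * sum_k R[t, k], one value per task per checkpoint
    return R.mean(axis=1)
```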

  10. Experiments - Classification Accuracy
  Sentiment analysis and topic classification. [Figures: classification accuracy results for sentiment analysis and for topic classification.]

  11. Experiments - Influence of the Number of Tasks
  Sentiment analysis. Relative error:

  $$er_{rel} = \frac{er_{MTL}}{\frac{1}{T} \sum_{t=1}^{T} er_t^{STL}}$$

  [Figure: error rate for the task 'appeal' as the number of tasks varies.]
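A numpy sketch of the relative error metric above, where `er_mtl` is the multi-task error rate and `er_stl` holds the T single-task error rates (names are illustrative):

```python
import numpy as np

def relative_error(er_mtl, er_stl):
    # er_rel = er_MTL / ((1/T) * sum_t er_t^STL)
    return er_mtl / np.asarray(er_stl).mean()
```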

  12. THANK YOU
