Parameter-Efficient Transfer Learning for NLP
N. Houlsby, A. Giurgiu*, S. Jastrzębski*, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly
Imagine doing Transfer Learning for NLP
Ingredients:
● A large pretrained model (BERT)
● Fine-tuning
Fine-tuning stores a separate full copy of BERT for every task (BERT → Task 1, ..., BERT → Task N): a problem for large N.
Instead, keep one shared BERT and add a small per-task module (BERT + Adapter 1 → Task 1, ..., BERT + Adapter N → Task N); see the sketch below.
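A minimal Python sketch (not from the talk) of the parameter bookkeeping behind this slide: full fine-tuning stores one complete copy of BERT per task, while the adapter approach shares a single frozen backbone and stores only small per-task modules. The model and adapter sizes below are illustrative assumptions.

```python
# Illustrative numbers: ~110M parameters for BERT-base (public figure),
# and a hypothetical per-task adapter budget of a few million parameters.
BERT_PARAMS = 110_000_000
ADAPTER_PARAMS_PER_TASK = 4_000_000  # assumption, for illustration only

def stored_parameters(n_tasks: int) -> tuple[int, int]:
    """Total parameters stored for n_tasks under each strategy."""
    full_finetuning = n_tasks * BERT_PARAMS                            # one full copy per task
    with_adapters = BERT_PARAMS + n_tasks * ADAPTER_PARAMS_PER_TASK    # one shared backbone
    return full_finetuning, with_adapters

for n in (1, 10, 100):
    full, adapted = stored_parameters(n)
    print(f"N={n:3d}: full fine-tuning {full/1e9:5.1f}B params, adapters {adapted/1e9:5.2f}B params")
```

The full-fine-tuning cost grows linearly with the number of tasks, while the adapter cost stays close to the size of a single model.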
BERT + Adapters
● Solution: train tiny adapter modules at each layer
● Each adapter is a bottleneck (see the sketch below)
[Figure: adapter modules inserted into each Transformer layer of BERT]
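A minimal PyTorch sketch of the bottleneck adapter named on this slide (module names and sizes are illustrative; in the paper, two such adapters are inserted in every Transformer layer and only they, the layer norms, and the task head are trained while the rest of BERT stays frozen):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up, add a skip connection."""
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # project to a small bottleneck
        self.act = nn.GELU()                                 # nonlinearity (choice is illustrative)
        self.up = nn.Linear(bottleneck_size, hidden_size)    # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection means a near-zero-initialized adapter starts out
        # close to the identity, preserving the pretrained network's behaviour.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: apply to a sub-layer's output inside a frozen BERT layer.
h = torch.randn(8, 128, 768)  # (batch, sequence, hidden)
h = Adapter()(h)
```

Because the bottleneck is much narrower than the hidden size, each adapter adds only a small fraction of the parameters of the layer it augments.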
Results on GLUE Benchmark
[Figure: GLUE performance vs. number of trained parameters per task. Annotations: "Fewer parameters, similar performance" (adapters) and "Fewer parameters, degraded performance" (baseline).]
0.4% accuracy drop for a 96.4% reduction in the # of parameters/task
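To connect this number to the "30x" figure on the next slide, a back-of-the-envelope check (a sketch, not from the slides):

```python
# A 96.4% reduction in trained parameters per task means each task trains
# only ~3.6% of what full fine-tuning would, i.e. on the order of 30x fewer.
reduction = 0.964
fraction_trained = 1.0 - reduction     # ~0.036
ratio = 1.0 / fraction_trained         # ~28x
print(f"{fraction_trained:.3f} of the parameters, ~{ratio:.0f}x fewer per task")
```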
Conclusions
1. If we move towards a single-model future, we need to improve the parameter-efficiency of transfer learning.
2. We propose a module that drastically reduces the # of params/task for NLP, e.g. by 30x at only a 0.4% accuracy drop.
Related work (@ ICML): "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning", A. Stickland & I. Murray
Please come to our poster today at 6:30 PM (#102)