IN5550 – Neural Methods in Natural Language Processing
Ensembles, transfer and multi-task learning
Erik Velldal
University of Oslo
31 March 2020
This session
◮ No new bricks.
◮ Taking what we already have, putting it together in new ways.
◮ Ensemble learning
  ◮ Training several models to do one task.
◮ Multi-task learning
  ◮ Training one model to do several tasks.
◮ Transfer learning
  ◮ Training a model for a new task based on a model for some other task.
Standard approach to model selection
◮ Train a bunch of models.
◮ Keep the model with the best performance on the development set.
◮ Discard the rest.
◮ Some issues:
  ◮ Best on dev. is not necessarily best on held-out data.
  ◮ ANNs generally have low bias and high variance; they can be unstable and prone to overfitting.
  ◮ Different models might have non-overlapping errors.
◮ Ensemble methods may help.
Ensemble learning
◮ Combine multiple models to obtain better performance than any of the individual base models alone.
◮ The base models in the ensemble can be based on the same or different learning algorithms.
◮ Several meta-heuristics are available for how to create the base models and how to combine their predictions, e.g.:
  ◮ Boosting
  ◮ Bagging
  ◮ Stacking
Examples of ensembling

Boosting
◮ The base learners are generated sequentially:
  ◮ Incrementally build the ensemble by training each new model to emphasize training instances that previous models misclassified.
◮ Combine predictions through a weighted majority vote (classification) or average (regression).

Bagging (Bootstrap AGGregating)
◮ The base learners are generated independently:
  ◮ Create multiple bootstrapped versions of the training data by sampling with replacement, training a separate model for each.
◮ Combine (‘aggregate’) predictions by voting or averaging (see the sketch below).
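For concreteness, a minimal bagging sketch in PyTorch (not from the slides): each base learner is trained on a bootstrap resample, and predictions are combined by averaging class probabilities. The helper names (make_model, bag_train, bag_predict) and all hyperparameters are illustrative placeholders.

    import torch

    def bag_train(X_train, y_train, make_model, n_models=5, epochs=10):
        """Train n_models base learners, each on a bootstrap resample."""
        n = X_train.size(0)
        models = []
        for _ in range(n_models):
            idx = torch.randint(0, n, (n,))          # sample with replacement
            Xb, yb = X_train[idx], y_train[idx]
            model = make_model()                     # fresh base learner
            opt = torch.optim.Adam(model.parameters())
            loss_fn = torch.nn.CrossEntropyLoss()
            for _ in range(epochs):
                opt.zero_grad()
                loss_fn(model(Xb), yb).backward()
                opt.step()
            models.append(model)
        return models

    def bag_predict(models, X):
        """Aggregate by averaging the predicted class probabilities."""
        with torch.no_grad():
            probs = torch.stack([m(X).softmax(dim=-1) for m in models])
        return probs.mean(dim=0).argmax(dim=-1)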
Examples of ensembling

Stacking
◮ Train several base-level models on the complete training set,
◮ then train a meta-model with the base models’ predictions as features (see the sketch below).
◮ Often used with heterogeneous ensembles.

Drawbacks of ensembling
◮ ANNs are often applied in ensembles to squeeze out some extra F1 points.
◮ But their high leaderboard ranks come at a high computational cost:
  ◮ Must train, store, and apply several separate models.
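A correspondingly minimal stacking sketch, assuming the base models are already trained and that a held-out split is available for fitting the meta-model; the function names and the simple linear meta-model are illustrative choices, not prescribed by the slides.

    import torch

    def stack_features(base_models, X):
        """Concatenate the base models' class probabilities into meta-features."""
        with torch.no_grad():
            return torch.cat([m(X).softmax(dim=-1) for m in base_models], dim=-1)

    def train_meta(base_models, X_held_out, y_held_out, n_classes, epochs=100):
        """Fit a simple linear meta-model on top of the base models' predictions."""
        feats = stack_features(base_models, X_held_out)
        meta = torch.nn.Linear(feats.size(-1), n_classes)
        opt = torch.optim.Adam(meta.parameters())
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(meta(feats), y_held_out).backward()
            opt.step()
        return meta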
Distillation
◮ Models with high accuracy/F1 tend to have a large number of parameters.
◮ Often too inefficient to deploy in real systems.
◮ Knowledge distillation is a technique for reducing model complexity while retaining much of the performance.
◮ Idea: train a (smaller) student model to mimic the behaviour of a (larger) teacher model.
◮ The student is typically trained using the output probabilities of the teacher as soft labels (see the sketch below).
◮ Can be used to distill an ensemble into a single model.
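One common way to realize the soft-label idea is a temperature-scaled soft target term mixed with the ordinary hard-label loss; the sketch below assumes that formulation, and the temperature T and mixing weight alpha are illustrative values.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Mix a soft-label term (teacher probabilities as targets) with the
        usual hard-label cross-entropy; T softens both distributions."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)          # rescale so the soft term's gradients stay comparable
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard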
ML as a one-trick pony
◮ Standard single-task models: [figure]
◮ Ensembles: [figure]
Enter multi-task learning
◮ Train one model to solve multiple tasks.
◮ Each task has its own loss function, but the model weights are (partly) shared.
◮ The training examples for the different tasks can be distinct (the tasks take turns during training) or the same.
◮ Most useful for closely related tasks.
◮ Example: PoS tagging and syntactic chunking.
Standard single-model approach
◮ [figure]
Multi-task approach
◮ Often one task will be considered the main task, and the others so-called supporting or auxiliary tasks (see the sketch below).
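A rough sketch of the multi-task setup for the tagging/chunking example: a shared BiLSTM encoder with one head per task, where batches from the two tasks take turns during training. The architecture, the sizes, and the mixed_batches iterator are all illustrative assumptions, not the exact model from the slides.

    import torch
    import torch.nn as nn

    class MultiTaskTagger(nn.Module):
        """Shared embedding + BiLSTM encoder with one classification head per task."""
        def __init__(self, vocab_size, emb_dim, hidden, n_pos, n_chunk):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                                   bidirectional=True)
            self.pos_head = nn.Linear(2 * hidden, n_pos)      # auxiliary task
            self.chunk_head = nn.Linear(2 * hidden, n_chunk)  # main task

        def forward(self, tokens, task):
            states, _ = self.encoder(self.embed(tokens))
            head = self.pos_head if task == "pos" else self.chunk_head
            return head(states)

    model = MultiTaskTagger(vocab_size=10000, emb_dim=100, hidden=128,
                            n_pos=17, n_chunk=23)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()

    # The two tasks take turns: mixed_batches is a hypothetical iterator that
    # yields (tokens, labels, task) batches drawn from either task's data.
    for tokens, labels, task in mixed_batches:
        loss = loss_fn(model(tokens, task).flatten(0, 1), labels.flatten())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()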
Hierarchical / cascading multi-task learning
◮ Observation: while relying on similar underlying information, tagging intuitively seems more low-level than chunking.
◮ Cascading architecture with selective sharing of parameters (see the sketch below).
◮ Note that the units of classification for the main and aux. tasks can be different, e.g. sentence- vs. word-level.
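A minimal sketch of the cascaded variant, under the assumption that the auxiliary PoS head is attached to a lower encoder layer and the chunking head to a higher one built on top of it; the concrete architecture and sizes are again illustrative.

    import torch.nn as nn

    class CascadedTagger(nn.Module):
        """Lower layer supervised for PoS (aux), higher layer for chunking (main)."""
        def __init__(self, vocab_size, emb_dim, hidden, n_pos, n_chunk):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lower = nn.LSTM(emb_dim, hidden, batch_first=True,
                                 bidirectional=True)
            self.upper = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                 bidirectional=True)
            self.pos_head = nn.Linear(2 * hidden, n_pos)     # low-level, aux task
            self.chunk_head = nn.Linear(2 * hidden, n_chunk) # high-level, main task

        def forward(self, tokens):
            low, _ = self.lower(self.embed(tokens))   # shared low-level states
            high, _ = self.upper(low)                 # built on top of them
            return self.pos_head(low), self.chunk_head(high)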
Transfer learning
◮ Learn a model M1 for task A, and re-use (parts of) M1 in another model M2 to be (re-)trained for task B.
◮ Example: transfer learning with tagging as the source task and chunking as the target (destination) task (see the sketch below).
◮ Can you think of any examples of transfer learning we’ve seen so far?
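A sketch of the transfer step, assuming a hypothetical pretrained tagger (M1) with embed and encoder submodules like the earlier sketches; its parameters are reused in a new chunking model (M2) with a freshly initialized head.

    import torch.nn as nn

    # Hypothetical: `tagger` is a model M1 already trained for PoS tagging (task A),
    # with `embed` and `encoder` submodules as in the sketches above.

    class ChunkerFromTagger(nn.Module):
        """Model M2: reuse M1's embeddings and encoder, add a fresh chunking head."""
        def __init__(self, pretrained, hidden, n_chunk):
            super().__init__()
            self.embed = pretrained.embed        # transferred parameters
            self.encoder = pretrained.encoder    # transferred parameters
            self.chunk_head = nn.Linear(2 * hidden, n_chunk)  # new, randomly initialized

        def forward(self, tokens):
            states, _ = self.encoder(self.embed(tokens))
            return self.chunk_head(states)

    # `hidden` must match the hidden size of the pretrained encoder.
    chunker = ChunkerFromTagger(tagger, hidden=128, n_chunk=23)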
Related notions
◮ Self-supervised learning:
  ◮ Making use of unlabeled data while learning in a supervised manner.
  ◮ E.g. word embeddings, trained by predicting words in context.
  ◮ Pretrained LMs are the most widely used instance of transfer in NLP.
◮ Transfer is sometimes applied for domain adaptation:
  ◮ Same task, but different domains or genres.
◮ Transfer can also be used as part of distillation.
TL/MTL and regularization
◮ MTL can be seen as a regularizer in its own right; it keeps the weights from specializing too much to just one task.
◮ With transfer, on the other hand, there is often a risk of unlearning too much of the pre-trained information:
  ◮ ‘Catastrophic forgetting’ (McCloskey & Cohen, 1989; Ratcliff, 1990).
◮ May need to introduce regularization for the transferred layers.
  ◮ Extreme case: frozen weights (infinite regularization).
  ◮ Not unusual to only re-train selected parameters / higher layers.
◮ Other strategies: gradual unfreezing, reduced or layer-specific learning rates (in addition to early stopping, dropout, L2, etc.); see the sketch below.
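Continuing the hypothetical chunker from the transfer sketch above, freezing and layer-specific learning rates can be set up roughly like this; the learning-rate values are placeholders.

    import torch

    # Extreme case: freeze the transferred encoder entirely (infinite regularization).
    for p in chunker.encoder.parameters():
        p.requires_grad = False

    # Milder alternative: train everything, but give the transferred layers a much
    # lower learning rate than the new task-specific head.
    optimizer = torch.optim.Adam([
        {"params": chunker.encoder.parameters(), "lr": 1e-5},
        {"params": chunker.chunk_head.parameters(), "lr": 1e-3},
    ])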
When is TL/MTL most useful?
◮ When low-level features learned for task A could be helpful for learning task B.
◮ When you have limited labeled data for your main/target task and want to tap into a larger dataset for some other related aux/source task.
TL/MTL in NLP
◮ TL/MTL is particularly well-suited for neural models:
  ◮ Representation learners, with a modular design.
◮ Intuitively very well-suited for NLP too:
  ◮ Due to the complexity of the overall task of NLP (understanding language), it has been split up into innumerable sub-tasks.
  ◮ Typically rather small labeled data sets, but closely related tasks.
◮ We’ve unfortunately not seen huge boosts (unlike e.g. computer vision).
◮ But TL/MTL is still a very active area of research.
◮ Most promising so far: transfer of pre-trained word or sentence embeddings as input representations.
◮ Lots of current research on the representational transferability of different encoding architectures and objectives.
Next:
◮ More about transfer as pre-training
◮ Contextual word embeddings
◮ Universal sentence embeddings