Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Fei Huang, Kewei Tu
School of Information Science and Technology, ShanghaiTech University
DAMO Academy, Alibaba Group
Motivation
• Most previous work on sequence labeling has focused on monolingual models.
• Training and serving multiple monolingual models online is resource-consuming.
• A unified multilingual model is smaller, easier to serve, and more generalizable.
• However, the accuracy of existing unified multilingual models is inferior to that of monolingual models.
Our Solution: Knowledge Distillation
Background: Knowledge Distillation
[Figure: the teacher and the student both read the data; each produces an output distribution (Q^u and Q^t in the figure), and the student is updated to minimize a cross-entropy (XE) loss between the two distributions.]
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
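To make the standard (token-level) distillation loss concrete, here is a minimal PyTorch sketch: the student is trained to match the teacher's softened output distribution with a cross-entropy loss. The temperature value, tensor shapes, and names below are illustrative, not taken from the paper.

```python
# A minimal sketch of standard (token-level) knowledge distillation
# (Hinton et al., 2014): the student matches the teacher's softened
# output distribution via cross-entropy. Temperature and shapes are
# illustrative.
import torch
import torch.nn.functional as F

def token_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """XE between the teacher's softened distribution and the student's,
    averaged over all token positions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Example: a batch of 2 sentences, 5 tokens each, 10 possible labels.
student_logits = torch.randn(2, 5, 10, requires_grad=True)
teacher_logits = torch.randn(2, 5, 10)
loss = token_kd_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only
```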
Background: Sequence Labeling
• Neural sequence labeling with a BiLSTM-CRF architecture (Lample et al., 2016).
• There are exponentially many possible label sequences for a sentence.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
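Although there are exponentially many label sequences, a linear-chain CRF sums over all of them in polynomial time with the forward algorithm. A minimal sketch (the BiLSTM encoder is omitted; the emission and transition scores below are random placeholders):

```python
# A minimal sketch of the linear-chain CRF layer in a BiLSTM-CRF labeler:
# the forward algorithm sums over the exponentially many label sequences
# in O(n * L^2) time. The encoder is omitted; scores are placeholders.
import torch

def crf_log_partition(emissions, transitions):
    """emissions: (seq_len, num_labels) unary scores (e.g. from a BiLSTM).
    transitions: (num_labels, num_labels) label-bigram scores.
    Returns log Z, the log-sum-exp of the scores of all label sequences."""
    alpha = emissions[0]  # scores of all length-1 prefixes
    for t in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

emissions = torch.randn(6, 9)    # 6 tokens, 9 labels
transitions = torch.randn(9, 9)
print(crf_log_partition(emissions, transitions))  # log Z over 9^6 sequences
```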
Top-K Distillation
• Approximate the teacher's sequence-level distribution with its top-K (k-best) label sequences, which are used as pseudo targets for the student.
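A minimal sketch of Top-K distillation under the assumption above: the teacher's k-best label sequences act as pseudo targets and the student CRF maximizes their average log-likelihood. The k-best lists and all scores are made up for illustration; in practice the sequences come from k-best Viterbi decoding of the teacher.

```python
# A minimal sketch of Top-K distillation: k-best teacher sequences as
# pseudo targets for the student CRF. All tensors are illustrative.
import torch

def crf_sequence_log_prob(emissions, transitions, labels):
    """Log-probability of one label sequence under a linear-chain CRF."""
    score = emissions[torch.arange(len(labels)), labels].sum()
    score = score + transitions[labels[:-1], labels[1:]].sum()
    alpha = emissions[0]  # forward algorithm for log Z
    for t in range(1, emissions.size(0)):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return score - torch.logsumexp(alpha, dim=0)

# Student CRF parameters (in practice produced by its encoder + CRF layer).
emissions = torch.randn(5, 7, requires_grad=True)    # 5 tokens, 7 labels
transitions = torch.randn(7, 7, requires_grad=True)

# Pretend k = 3 label sequences decoded from the teacher.
kbest = [torch.tensor([0, 2, 2, 1, 3]),
         torch.tensor([0, 2, 4, 1, 3]),
         torch.tensor([0, 5, 2, 1, 3])]

# Top-K: every one of the k sequences gets equal weight.
loss = -sum(crf_sequence_log_prob(emissions, transitions, y) for y in kbest) / len(kbest)
loss.backward()
```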
Top-WK Distillation
• Weighted Top-K: each of the teacher's k-best label sequences is weighted by its probability under the teacher.
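Top-WK only changes the weighting: instead of uniform weights, each k-best sequence is weighted by its teacher probability, renormalized over the k sequences (the exact scheme here is an assumption). A tiny self-contained sketch; `student_logp` stands in for the student-CRF sequence log-probabilities computed as in the Top-K sketch, and all numbers are illustrative.

```python
# A minimal sketch of the Top-WK weighting: like Top-K, but each k-best
# sequence is weighted by its (renormalized) teacher probability.
import torch

teacher_logp = torch.tensor([-1.2, -2.0, -3.5])  # teacher log-probs of the k-best sequences
weights = torch.softmax(teacher_logp, dim=0)     # renormalize over the k sequences

student_logp = torch.randn(3, requires_grad=True)  # student log-probs of the same sequences

loss = -(weights * student_logp).sum()  # weighted sequence-level cross-entropy
loss.backward()
```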
Posterior Distillation
• Distill the teacher's token-wise posterior (marginal) label distributions, computed with the forward-backward algorithm, into the student.
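A minimal sketch of posterior distillation, assuming it matches the token-wise posterior marginals of the teacher and student CRFs computed with the forward-backward algorithm; the cross-entropy matching criterion and all tensors below are illustrative, not necessarily the paper's exact objective.

```python
# A minimal sketch of posterior distillation: match teacher and student
# token-wise CRF marginals (forward-backward). Tensors are illustrative.
import torch

def crf_marginals(emissions, transitions):
    """Token-wise posterior marginals p(y_t = j | x) of a linear-chain CRF."""
    n, num_labels = emissions.shape
    alphas = [emissions[0]]
    for t in range(1, n):  # forward pass
        alphas.append(torch.logsumexp(alphas[-1].unsqueeze(1) + transitions, dim=0)
                      + emissions[t])
    betas = [torch.zeros(num_labels)]
    for t in range(n - 2, -1, -1):  # backward pass
        betas.append(torch.logsumexp(transitions
                                     + (emissions[t + 1] + betas[-1]).unsqueeze(0), dim=1))
    alpha, beta = torch.stack(alphas), torch.stack(betas[::-1])
    log_z = torch.logsumexp(alpha[-1], dim=0)
    return torch.exp(alpha + beta - log_z)  # (seq_len, num_labels), rows sum to 1

teacher_marg = crf_marginals(torch.randn(5, 7), torch.randn(7, 7))  # fixed teacher

student_emissions = torch.randn(5, 7, requires_grad=True)
student_transitions = torch.randn(7, 7, requires_grad=True)
student_marg = crf_marginals(student_emissions, student_transitions)

# Match the student's marginals to the teacher's, token by token.
loss = -(teacher_marg * torch.log(student_marg + 1e-12)).sum(dim=-1).mean()
loss.backward()
```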
Structure-Level Knowledge Distillation
Results
• Monolingual teacher models outperform the multilingual student models.
• Our approaches outperform the baseline multilingual model.
• Top-WK+Posterior performs in between Top-WK and Posterior.
Zero-shot Transfer
KD with weaker teachers
k Value in Top-K
Conclusion
• Two structure-level KD methods: Top-K and Posterior distillation.
• Our approaches improve the performance of multilingual models across 4 tasks and 25 datasets.
• Our distilled model has stronger zero-shot transfer ability on the NER and POS tagging tasks.