

  1. Structure-Level Knowledge Distillation For Multilingual Sequence Labeling. Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Fei Huang, Kewei Tu. School of Information Science and Technology, ShanghaiTech University; DAMO Academy, Alibaba Group.

  2. Motivation
  • Most previous work on sequence labeling has focused on monolingual models.
  • Training and serving many monolingual models online is resource-consuming.
  • A unified multilingual model is smaller, easier to serve, and more generalizable.
  • However, existing unified multilingual models are less accurate than monolingual models.

  3. Our Solution: Knowledge Distillation

  4. Background: Knowledge Distillation. [Figure: a teacher model trained on the data.] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

  5. Background: Knowledge Distillation. [Figure: both the teacher and the student produce label distributions over the data, and a cross-entropy (XE) loss compares the student's distribution with the teacher's.] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

  6. Background: Knowledge Distillation. [Figure: the student is updated to minimize the cross-entropy (XE) loss between its distribution and the teacher's.] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
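
  A minimal sketch of the token-level distillation loss these slides describe, assuming a PyTorch setup; the tensor names, shapes, and temperature are illustrative, not the talk's exact implementation.

      import torch
      import torch.nn.functional as F

      def token_kd_loss(student_logits, teacher_logits, temperature=1.0):
          # Cross-entropy between the teacher's (softened) label distribution
          # and the student's, averaged over tokens; inputs have shape
          # [num_tokens, num_labels].
          teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
          student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
          return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

  Minimizing this loss is the "update" step in the figure: the student's distribution is pulled toward the teacher's soft targets rather than only toward hard gold labels.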

  7. Background: Sequence Labeling. Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.

  8. Background: Sequence Labeling. There is an exponential number of possible label sequences. Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
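
  To make the exponential-growth point precise, a standard linear-chain CRF (as in the BiLSTM-CRF of Lample et al.) defines a distribution over whole label sequences; the notation below is a generic formulation, not necessarily the slide's own symbols, with ψ a potential combining emission and transition scores:

      P(\mathbf{y} \mid \mathbf{x})
        = \frac{\exp\!\left( \sum_{t=1}^{n} \psi(y_{t-1}, y_t, \mathbf{x}, t) \right)}
               {\sum_{\mathbf{y}' \in \mathcal{Y}^{n}} \exp\!\left( \sum_{t=1}^{n} \psi(y'_{t-1}, y'_t, \mathbf{x}, t) \right)}

  For a sentence of length n and label set Y, the denominator sums over |Y|^n candidate label sequences, which is why it is computed with dynamic programming (the forward algorithm) rather than by enumeration, and why sequence-level distillation cannot simply enumerate the teacher's full output distribution.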

  9. Top-K Distillation: distill from the teacher's top-k label sequences.
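
  A hedged sketch of the Top-K idea, assuming the teacher CRF exposes a k-best Viterbi decoder and the student exposes a sequence-level negative log-likelihood; `k_best_viterbi` and `neg_log_likelihood` are hypothetical helper names, not an API from the paper.

      def top_k_loss(student, teacher, sentence, k=3):
          # Use the teacher's k highest-scoring label sequences as pseudo-targets
          # and average the student's sequence-level losses over them
          # (uniform weights; the weighted variant is on the next slide).
          sequences, _scores = teacher.k_best_viterbi(sentence, k)
          losses = [student.neg_log_likelihood(sentence, y) for y in sequences]
          return sum(losses) / len(losses)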

  10. Top-WK Distillation (weighted Top-K).
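
  Continuing the sketch above for the weighted variant: each of the teacher's k-best sequences is weighted by its teacher probability, renormalized over the k sequences so more likely sequences contribute more. The helper names and the renormalization detail are assumptions.

      def top_wk_loss(student, teacher, sentence, k=3):
          # scores are assumed to be the teacher's probabilities of each sequence.
          sequences, scores = teacher.k_best_viterbi(sentence, k)
          total = sum(scores)
          return sum((s / total) * student.neg_log_likelihood(sentence, y)
                     for s, y in zip(scores, sequences))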

  11. Posterior Distillation: distill from the teacher's token-level posterior distributions.
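
  A rough sketch of posterior distillation, assuming each CRF's token-level posterior marginals P(y_t | x) have already been computed with the forward-backward algorithm (tensors of shape [sentence_length, num_labels]); the function below shows only the matching loss, not the marginal computation.

      import torch

      def posterior_loss(student_marginals, teacher_marginals, eps=1e-12):
          # Token-wise cross-entropy between teacher and student marginal
          # label distributions, averaged over the tokens of the sentence.
          return -(teacher_marginals * (student_marginals + eps).log()).sum(-1).mean()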

  12. Structure-Level Knowledge Distillation.
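
  One plausible reading of the overall recipe, stated as a sketch rather than the paper's exact objective: the multilingual student is trained on the gold labels of all languages plus one or both distillation terms from the corresponding monolingual teacher. The interpolation weights below are illustrative assumptions, not reported values.

      def student_objective(gold_nll, top_wk_term, posterior_term,
                            alpha=0.5, beta=0.5):
          # gold_nll: supervised CRF loss on the gold labels;
          # top_wk_term / posterior_term: the distillation losses sketched above;
          # alpha and beta are assumed hyperparameters.
          return gold_nll + alpha * top_wk_term + beta * posterior_term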

  13. Results
  • Monolingual teacher models outperform the multilingual student models.
  • Our approaches outperform the baseline multilingual model.
  • Top-WK+Posterior performs between Top-WK and Posterior.

  14. Zero-shot Transfer.

  15. KD with Weaker Teachers.

  16. k Value in Top-K.

  17. Conclusion
  • Two structure-level KD methods: Top-K and Posterior distillation.
  • Our approaches improve the performance of multilingual models on 4 tasks across 25 datasets.
  • Our distilled model has stronger zero-shot transfer ability on the NER and POS tagging tasks.
