
Pseudo-Masked Language Models for Unified Language Model Pre-Training - PowerPoint PPT Presentation



  1. Pseudo-Masked Language Models for Unified Language Model Pre-Training. ICML 2020. Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon.

  2. Unified Pre-Training Framework (pre-training tasks and the downstream tasks they serve)
     • Bidirectional LM (encoder; e.g., BERT, RoBERTa): all tokens can see each other. Language understanding: intent classification, entity recognition, question answering.
     • Unidirectional (left-to-right) LM (decoder; e.g., GPT, UniLM): a token can only see its left context. Language generation (text generation): story/news generation.
     • Sequence-to-Sequence LM (encoder-decoder; e.g., T5, BART): 1) the given input is bidirectionally encoded; 2) the output is unidirectionally decoded. Language generation (sequence-to-sequence): summary generation, question generation, response generation, machine translation.
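To make the three attention patterns concrete, here is a minimal NumPy sketch (illustrative only, not code from the slides; the function and variable names are my own). A 1 at row i, column j means token i may attend to token j.

    import numpy as np

    def attention_masks(src_len, tgt_len):
        """Build the three self-attention masks over a length src_len + tgt_len sequence."""
        n = src_len + tgt_len

        # Bidirectional LM: every token attends to every token.
        bidirectional = np.ones((n, n), dtype=int)

        # Unidirectional (left-to-right) LM: token i attends only to positions <= i.
        unidirectional = np.tril(np.ones((n, n), dtype=int))

        # Sequence-to-sequence LM: source tokens attend to all source tokens;
        # target tokens attend to the source plus their own left context.
        seq2seq = np.zeros((n, n), dtype=int)
        seq2seq[:src_len, :src_len] = 1
        seq2seq[src_len:, :src_len] = 1
        seq2seq[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=int))

        return bidirectional, unidirectional, seq2seq

All three patterns run on the same Transformer; only the self-attention mask changes, which is the property the unified framework exploits.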

  3. UniLM v1
     • Unified modeling: a single model is used as
       • Bidirectional encoder — NLU: text classification, entity recognition, question answering, …
       • Unidirectional decoder — NLG: synthetic text generation, …
       • Encoder-decoder — NLG (sequence-to-sequence): text summarization, question generation, …
     • Multitask-style pre-training
     Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS 2019.

  4. Motivation of UniLM v2
     • In UniLM v1, there is one training example for each type of LM:
       • Three types of LMs
       • Three forward passes with different self-attention masks (bidirectional, unidirectional, and sequence-to-sequence training examples in the batch)
     • Question: how can we train multiple LMs in one forward pass?

  5. Pseudo-Masked Language Model (example: 𝑦2 and the span 𝑦4 𝑦5 are masked out of 𝑦1 … 𝑦6; the context is 𝑦1, 𝑦3, 𝑦6)
     • Bidirectional LM task (for NLU):
       1. Bidirectionally encode the context tokens.
       2. Predict all masked spans at the same time.
     • Sequence-to-Sequence LM task (for NLG):
       1. Bidirectionally encode the context tokens.
       2. Predict the masked spans one by one (e.g., 𝑦4, 𝑦5 → 𝑦2):
          t=1: predict 𝑦4, 𝑦5, then encode 𝑦4, 𝑦5 (i.e., fill in what has just been predicted).
          t=2: predict 𝑦2.
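Written as probabilities (my notation, not taken verbatim from the paper; C = {𝑦1, 𝑦3, 𝑦6} denotes the unmasked context), the two tasks factorize the prediction of the masked tokens differently:

    % Bidirectional LM (autoencoding): masked tokens are predicted at once,
    % conditionally independently given the context C.
    p_{\mathrm{AE}}(y_2, y_4, y_5 \mid C) = p(y_2 \mid C)\, p(y_4 \mid C)\, p(y_5 \mid C)

    % Sequence-to-sequence LM (partially autoregressive),
    % factorization order (y_4, y_5) \to y_2: the span is predicted first,
    % then y_2 conditions on the filled-in span.
    p_{\mathrm{PAR}}(y_2, y_4, y_5 \mid C) = p(y_4, y_5 \mid C)\, p(y_2 \mid C, y_4, y_5)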

  6. Pseudo-Masked Language Model — Observation 1: the context encoding can be reused. In the example above, the context tokens 𝑦1, 𝑦3, 𝑦6 are encoded bidirectionally in exactly the same way for the bidirectional LM task and for the sequence-to-sequence LM task, so one forward pass over the context can serve both.

  7. Pseudo-Masked Language Model — Observation 2: the masked positions play three roles.
     (1) Context masks [M]: used by the bidirectional LM task, which predicts all masked spans at the same time.
     (2) Pseudo masks [P]: used by the sequence-to-sequence LM task, which predicts the masked spans one by one (e.g., 𝑦4, 𝑦5 → 𝑦2; at t=1 predict 𝑦4, 𝑦5 from their [P] positions, at t=2 predict 𝑦2).
     (3) Original tokens: the tokens of an already-predicted span (e.g., 𝑦4, 𝑦5) are filled in as ordinary tokens so that later steps (predicting 𝑦2) can condition on them.
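As an illustration of how one training example might be assembled from these three roles (a hedged sketch, not the authors' implementation; the function name, span format, and ordering of the appended tokens are my own simplifications), the key point is that the [M] tokens, [P] tokens, and original tokens of a masked position all reuse that position's index:

    def build_pmlm_input(tokens, masked_spans):
        """Sketch: assemble the pseudo-masked input for one training example.

        tokens:       the original sequence, e.g. ["y1", "y2", "y3", "y4", "y5", "y6"]
        masked_spans: masked spans in factorization order, as inclusive 0-based
                      (start, end) pairs, e.g. [(3, 4), (1, 1)] for (y4 y5) -> y2.

        Returns parallel lists of input tokens and position ids: the sequence with
        masked positions replaced by [M], followed, for each span, by its [P]
        placeholders and its original tokens, all reusing the span's position ids.
        """
        masked = {i for start, end in masked_spans for i in range(start, end + 1)}

        # Context part: original tokens, with masked positions replaced by [M].
        input_tokens = [("[M]" if i in masked else tok) for i, tok in enumerate(tokens)]
        position_ids = list(range(len(tokens)))

        # Appended part: per span, [P] tokens (targets of the span-by-span
        # prediction) and the original tokens (context for later spans).
        for start, end in masked_spans:
            span_positions = list(range(start, end + 1))
            input_tokens += ["[P]"] * len(span_positions) + [tokens[i] for i in span_positions]
            position_ids += span_positions + span_positions

        return input_tokens, position_ids

    # Example matching the slides: mask the span y4 y5 (predicted first) and y2.
    toks, pos = build_pmlm_input(["y1", "y2", "y3", "y4", "y5", "y6"], [(3, 4), (1, 1)])
    print(list(zip(toks, pos)))

The self-attention mask (not shown here) then controls which of these copies each prediction is allowed to see.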

  8. Bidirectional LM (Autoencoding) + Sequence-to-Sequence LM (Partially Autoregressive)
     [Figure: token embeddings (original tokens, [M] context masks, [P] pseudo masks) and position embeddings; tokens occupying the same original position share the same position embedding.]
     (TL;DR) UniLM v2: unified pre-training of a bidirectional LM (via autoencoding) and a sequence-to-sequence LM (via partially autoregressive modeling) with a Pseudo-Masked Language Model, for language understanding and generation.
     • Transformer self-attention treats tokens that share a position embedding as the same "token" at that position.
     • The pseudo-masked LM can efficiently realize different pre-training objectives, such as AE (autoencoding), AR (autoregressive), PAR (partially autoregressive), AE + AR, and AE + PAR; among these, AE + PAR performs best.
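A tiny illustration of the shared-position point (illustrative only; the embedding table and ids below are made up, with the ids taken from the build_pmlm_input sketch above):

    import numpy as np

    # Hypothetical position-embedding table: 32 positions, hidden size 8.
    rng = np.random.default_rng(0)
    position_table = rng.standard_normal((32, 8))

    # Position ids from the sketch above: [M] for y4 (index 3) and [P] for y4
    # (index 6) both carry position id 3, so they receive identical position
    # embeddings and self-attention sees them as alternatives for the same slot.
    position_ids = [0, 1, 2, 3, 4, 5, 3, 4, 3, 4, 1, 1]
    position_embeddings = position_table[position_ids]
    assert np.allclose(position_embeddings[3], position_embeddings[6])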

  9. Pre-Training Objectives
     • Autoencoding (AE)
     • Autoregressive (AR)
     • Partially autoregressive (PAR)
     Goal: encourage the pre-trained model to learn and use global context (long-distance dependencies).
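For reference, a sketch of how the AE and PAR objectives can be written (my notation, not copied from the paper, so details may differ: x is a sentence from corpus 𝒟, M its set of masked positions partitioned into spans M_1, …, M_K predicted in that order, x_{\M} the unmasked context, and x_{\M_{≥k}} everything except the spans not yet predicted at step k):

    % Autoencoding: masked tokens are predicted independently given the context.
    \mathcal{L}_{\mathrm{AE}} = -\sum_{x \in \mathcal{D}} \log \prod_{m \in M} p(x_m \mid x_{\setminus M})

    % Partially autoregressive: spans are predicted one at a time; tokens inside
    % a span are predicted simultaneously.
    p(x_M \mid x_{\setminus M}) = \prod_{k=1}^{K} \prod_{m \in M_k} p\big(x_m \mid x_{\setminus M_{\geq k}}\big)
    \mathcal{L}_{\mathrm{PAR}} = -\sum_{x \in \mathcal{D}} \mathbb{E}_{M}\big[\log p(x_M \mid x_{\setminus M})\big]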

  10. Takeaway Message of UniLM v2
     • The pseudo-masked language model efficiently realizes unified pre-training:
       • Two types of LM tasks within one forward pass
         • Bidirectional LM (for NLU)
         • Sequence-to-sequence LM (for NLG)
       • Learns different word dependencies
         • Between context and mask predictions
         • Between mask predictions

  11. Benchmark Datasets
     • Natural language understanding (bidirectional encoding)
       • Question answering (SQuAD)
       • GLUE: General Language Understanding Evaluation
     • Natural language generation (sequence-to-sequence modeling)
       • Abstractive summarization: CNN/DailyMail, Gigaword, XSum
       • Question generation (SQuAD)

  12. UniLMv2-Base for NLU Tasks
     • Table: Results of BASE-size models on the development set of the GLUE benchmark. We report Matthews correlation coefficient (MCC) for CoLA, Pearson correlation coefficient (PCC) for STS, and accuracy (Acc) for the rest. Metrics of UniLMv2 are averaged over five runs for the tasks.
     • Table: Results of BASE-size pre-trained models on the SQuAD v1.1/v2.0 development sets. We report F1 scores and exact match (EM) scores. Results of UniLMv2 are averaged over five runs.

  13. UniLMv2-Base for NLG Tasks (Abstractive Summarization)
     • Table: Abstractive summarization results on CNN/DailyMail and XSum. The evaluation metric is the F1 version of ROUGE (RG) scores. We also present the number of parameters (#Param) and the corpus size (#Corpus) for the methods using pre-trained models.

  14. UniLMv2-Base for NLG Tasks (Question Generation)
     • Table: Question generation results. MTR is short for METEOR, and RG for ROUGE. The official split is from (Du & Cardie, 2018), while the reversed split is the same as in (Zhao et al., 2018).

  15. Effect of Pre-Training Objectives
     • AE: autoencoding
     • AR: autoregressive
     • PAR: partially autoregressive
     • Table: Comparisons between the pre-training objectives. All models are pre-trained over Wikipedia and BookCorpus for one million steps with a batch size of 256. Results in the second block are averaged over five runs for each task. We report F1 and exact match (EM) scores for SQuAD, and accuracy (Acc) for MNLI and SST-2.

  16. Thanks! https://github.com/microsoft/unilm
