BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Source: NAACL-HLT 2019 | Speaker: Ya-Fang Hsiao | Advisor: Jia-Ling Koh | Date: 2019/09/02
CONTENTS: 1. Introduction  2. Related Work  3. Method  4. Experiment  5. Conclusion
1 Introduction
Introduction. BERT: Bidirectional Encoder Representations from Transformers. A language model assigns a probability to a word sequence via the chain rule: $P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, x_2, \dots, x_{i-1})$. BERT is a pre-trained language model, introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
2 Related Work
Related Work: pre-trained language models. Feature-based approach: ELMo. Fine-tuning approach: OpenAI GPT.
Related Work: pre-trained language models. Feature-based approach: ELMo. Fine-tuning approach: OpenAI GPT. Limitations: 1. both rely on unidirectional language models; 2. both use the same left-to-right objective function. BERT (Bidirectional Encoder Representations from Transformers) instead pre-trains with two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017). A sequence-to-sequence encoder-decoder architecture; RNNs are hard to parallelize.
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017). Encoder-Decoder.
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017). Encoder and decoder are each a stack of 6 identical layers (x6); the self-attention layers can be computed in parallel.
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017). Self-Attention: each token is projected to a query (used to match against other tokens), a key (to be matched against), and a value (the information to be extracted).
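To make the query/key/value roles concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch; the projection matrices and sizes are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q = x @ w_q                              # queries: used to match against other tokens
    k = x @ w_k                              # keys: what each token offers to be matched
    v = x @ w_v                              # values: the information to be extracted
    d_k = k.size(-1)
    scores = q @ k.T / d_k ** 0.5            # (seq_len, seq_len) attention scores
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                       # weighted sum of values

# Toy example: 3 tokens, d_model = 4
x = torch.randn(3, 4)
w_q, w_k, w_v = (torch.randn(4, 4) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape (3, 4)
```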
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017). Multi-Head Attention.
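Multi-head attention runs several such attention operations in parallel on lower-dimensional projections and concatenates their outputs. A short sketch using PyTorch's built-in module; the sizes match BERT-BASE but are otherwise illustrative, and inputs follow the (seq_len, batch, embed_dim) convention:

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12)  # 12 heads of size 64 each
x = torch.randn(10, 1, 768)        # (seq_len=10, batch=1, embed_dim=768)
out, weights = mha(x, x, x)        # self-attention: query = key = value = x
print(out.shape, weights.shape)    # torch.Size([10, 1, 768]) torch.Size([1, 10, 10])
```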
"Attention Is All You Need", Transformers, Vaswani et al. (NIPS 2017): overall Transformer architecture.
BERT: Bidirectional Encoder Representations from Transformers. BERT-BASE (L=12, H=768, A=12, 110M parameters); BERT-LARGE (L=24, H=1024, A=16, 340M parameters). L is the number of Transformer layers, H the hidden size, A the number of self-attention heads; the feed-forward size is 4H.
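For reference, the two model sizes can be written as configurations. This is a sketch assuming the HuggingFace transformers library, where L, H, and A map to num_hidden_layers, hidden_size, and num_attention_heads, and the feed-forward size is intermediate_size = 4H:

```python
from transformers import BertConfig

# BERT-BASE: L=12, H=768, A=12 (~110M parameters)
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=4 * 768)

# BERT-LARGE: L=24, H=1024, A=16 (~340M parameters)
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4 * 1024)
```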
3 Method
Framework. Pre-training: the model is trained on unlabeled data over different pre-training tasks. Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
Input. [CLS]: classification token; [SEP]: separator token. Pre-training corpus: BooksCorpus and English Wikipedia. Token embeddings: WordPiece embeddings with a 30,000-token vocabulary. Segment embeddings: learned embeddings indicating whether a token belongs to sentence A or sentence B. Position embeddings: learned positional embeddings.
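The three embeddings are summed element-wise for every input token. A minimal sketch of that composition; the embedding tables, token ids, and sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb    = nn.Embedding(vocab_size, hidden)  # WordPiece vocabulary
segment_emb  = nn.Embedding(2, hidden)           # sentence A (0) or sentence B (1)
position_emb = nn.Embedding(max_len, hidden)     # learned positions

# "[CLS] my dog [SEP] it barks [SEP]" as WordPiece ids (ids are made up for illustration)
token_ids   = torch.tensor([[101, 2026, 3899, 102, 2009, 17554, 102]])
segment_ids = torch.tensor([[0,    0,    0,    0,   1,    1,     1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
# shape: (batch=1, seq_len=7, hidden=768); BERT additionally applies LayerNorm and dropout
```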
Pre-training. Two unsupervised tasks: 1. Masked Language Model (MLM); 2. Next Sentence Prediction (NSP).
Task 1. MLM (Masked Language Model). 15% of all WordPiece tokens in each sequence are selected at random for prediction. Each selected token is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) left unchanged 10% of the time. (Hung-Yi Lee - BERT ppt)
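A minimal sketch of the 15% selection and 80/10/10 replacement rule described above, in plain Python; the [MASK] string and toy vocabulary are stand-ins:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Return masked tokens plus the target positions, following BERT's 80/10/10 rule."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:          # select ~15% of tokens for prediction
            targets[i] = tok                     # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"             # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab) # 10%: replace with a random token
            # else: 10% keep the token unchanged
    return masked, targets

tokens = ["my", "dog", "is", "hairy"]
print(mask_tokens(tokens, vocab=["cat", "apple", "runs", "blue"]))
```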
Task 2. NSP (Next Sentence Prediction). Sentence B is the actual next sentence 50% of the time (IsNext) and a random sentence from the corpus 50% of the time (NotNext). Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP], Label = IsNext. Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP], Label = NotNext. (Hung-Yi Lee - BERT ppt)
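The NSP pairs are built by taking the true next sentence half of the time and a random sentence from the corpus otherwise. A rough sketch of that sampling; the corpus format (documents as lists of tokenized sentences) is an assumption:

```python
import random

def make_nsp_example(docs, doc_idx, sent_idx):
    """docs: list of documents, each a list of sentences (token lists)."""
    sent_a = docs[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(docs[doc_idx]):
        sent_b, label = docs[doc_idx][sent_idx + 1], "IsNext"   # the actual next sentence
    else:
        rand_doc = random.randrange(len(docs))                  # a random sentence elsewhere
        sent_b, label = random.choice(docs[rand_doc]), "NotNext"
    return ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"], label
```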
Fine-Tuning. All parameters are fine-tuned using labeled data from the downstream tasks.
Task 1 (b) Single Sentence Classification. Input: a single sentence; output: a class. A linear classifier on top of the [CLS] output is trained from scratch while BERT itself is fine-tuned. Examples: sentiment analysis, document classification. (Hung-Yi Lee - BERT ppt)
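A minimal sketch of this setup, assuming a recent version of the HuggingFace transformers library: BERT is loaded pre-trained and fine-tuned end to end, while the linear classifier on the [CLS] output is trained from scratch. The model name and label count are illustrative.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")   # fine-tuned end to end
classifier = torch.nn.Linear(768, 2)                    # trained from scratch (e.g. pos/neg)

inputs = tokenizer("this movie was great", return_tensors="pt")
cls_vec = bert(**inputs).last_hidden_state[:, 0]        # [CLS] token representation
logits = classifier(cls_vec)                            # class scores
```

Sentence-pair tasks (Task 3 below) reuse the same [CLS] head; the tokenizer simply takes two sentences, e.g. tokenizer(sentence_a, sentence_b, return_tensors="pt").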
Task 2 (d) Single Sentence Tagging. Input: a single sentence; output: a class for each token. A linear classifier is applied to every token's output representation. Example: slot filling. (Hung-Yi Lee - BERT ppt)
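Tagging applies the same idea per token: a linear layer maps every output vector to a tag. A short sketch under the same assumptions as above; the label count is illustrative:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
tagger = torch.nn.Linear(768, 5)                 # e.g. 5 slot labels, trained from scratch

inputs = tokenizer("book a flight to taipei", return_tensors="pt")
hidden = bert(**inputs).last_hidden_state        # (1, seq_len, 768), one vector per token
tag_logits = tagger(hidden)                      # (1, seq_len, 5): a label score per token
```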
Task 3 (a) Sentence Pair Classification. Input: two sentences separated by [SEP]; output: a class predicted from the [CLS] representation by a linear classifier. Example: natural language inference. (Hung-Yi Lee - BERT ppt)
Task 4 (c) Question Answering. Document: $D = \{d_1, d_2, \dots, d_N\}$; query: $Q = \{q_1, q_2, \dots, q_M\}$. The QA model outputs two integers $(s, e)$, and the answer is the span $A = \{d_s, \dots, d_e\}$, e.g. $s = 17, e = 17$ or $s = 77, e = 79$. (Hung-Yi Lee - BERT ppt)
Task 4 (c) Question Answering. Two vectors, one for the start position and one for the end position, are learned from scratch. Each is dot-producted with every document token's output representation and passed through a softmax; the highest-scoring positions give the span, e.g. s = 2 and e = 3, so the answer is "d2 d3". (Hung-Yi Lee - BERT ppt)
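A sketch of the span-selection head described above: a start vector and an end vector, learned from scratch, are dot-producted with every document token representation and softmaxed, and the argmax positions give (s, e). The tensor shapes and token positions below are toy assumptions:

```python
import torch

# Toy BERT output for the sequence [CLS] q1 q2 [SEP] d1 d2 d3 [SEP] (positions are assumptions)
hidden = torch.randn(1, 8, 768)
doc_hidden = hidden[:, 4:7]                        # keep only the document tokens d1, d2, d3

start_vec = torch.nn.Parameter(torch.randn(768))   # start vector, learned from scratch
end_vec   = torch.nn.Parameter(torch.randn(768))   # end vector, learned from scratch

start_scores = torch.softmax(doc_hidden @ start_vec, dim=-1)   # dot product + softmax
end_scores   = torch.softmax(doc_hidden @ end_vec,   dim=-1)

s = start_scores.argmax(dim=-1).item()             # index of the start token in the document
e = end_scores.argmax(dim=-1).item()               # index of the end token; answer is d_s ... d_e
```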
4 Experiment
Experiments. Fine-tuning results on 11 NLP tasks, including the GLUE benchmark, SQuAD, and SWAG.
Implementation: LeeMeng, 進擊的 BERT (PyTorch).
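The implementation slides follow LeeMeng's PyTorch tutorial. As a rough stand-in (not the tutorial's exact code), loading a pre-trained BERT with a packaged classification head via the HuggingFace transformers library looks like this; the model name and label count are assumptions:

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Sentence-pair input: the tokenizer inserts [CLS] and [SEP] automatically
inputs = tokenizer("the man went to the store", "he bought milk", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)        # torch.Size([1, 2])
```

Unlike the sketch in Task 1 (b), this uses the library's bundled classification head instead of a hand-written linear layer; both are fine-tuned the same way.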
5 Conclusion
References
BERT paper: http://bit.ly/BERTpaper
Language model development: http://bit.ly/nGram2NNLM
Language model pre-training methods: http://bit.ly/ELMo_OpenAIGPT_BERT
Attention Is All You Need: http://bit.ly/AttIsAllUNeed
Hung-Yi Lee, Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
The Illustrated Transformer: http://bit.ly/illustratedTransformer
Transformer explained in detail: http://bit.ly/explainTransformer
github/codertimo, BERT (PyTorch): http://bit.ly/BERT_pytorch
Pytorch.org, BERT: http://bit.ly/pytorchorgBERT
Implementing sentence-pair classification: http://bit.ly/implementpaircls