Neural Network based NLP: Its Progresses and Challenges
Dr. Ming Zhou, Microsoft Research Asia
CLSW 2020, City University of Hong Kong, May 30, 2020
Figure (credited to AMiner, Tsinghua University, 2019): major neural NLP areas, including NMT, text generation, MRC, multi-modality, question answering, and conversational systems.
• Neural NLP (NN-NLP)
• Word embedding (Mikolov et al., 2013)
• Sentence embedding
• Encoder-Decoder with attention (Bahdanau et al., 2014)
• Transformer (Vaswani et al., 2017)
A feed-forward neural network computes hidden features layer by layer:
h_0 = f(W_0 x), h_1 = f(W_1 h_0), h_2 = f(W_2 h_1), y = g(W_3 h_2),
where f and g are activation functions.
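As a concrete illustration, here is a minimal NumPy sketch of this layer-by-layer computation; the layer sizes and the choice of ReLU and softmax activations are assumptions made for the example, not part of the slide.

```python
import numpy as np

def relu(v):
    # f: a common choice of hidden-layer activation function
    return np.maximum(0.0, v)

def softmax(v):
    # g: output activation turning scores into probabilities
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input features
W0 = rng.normal(size=(8, 4))      # layer weights (sizes are illustrative)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(3, 8))

h0 = relu(W0 @ x)                 # h_0 = f(W_0 x)
h1 = relu(W1 @ h0)                # h_1 = f(W_1 h_0)
h2 = relu(W2 @ h1)                # h_2 = f(W_2 h_1)
y = softmax(W3 @ h2)              # y = g(W_3 h_2)
print(y)
```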
Word embedding tries to map words from a discrete space into a semantic space, in which semantically similar words have similar embedding vectors. "You shall know a word by the company it keeps." Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013.
CBOW (Continuous Bag-of-Words): use the context words in a window to predict the central word. Skip-gram (Continuous Skip-gram): use the central word to predict the context words in a window. (Figure: context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} around the central word w_t.) Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013.
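A minimal sketch of training both variants with the gensim library; this is an assumption about tooling (the original work used the word2vec C tool and far larger corpora), and the toy corpus is invented for illustration.

```python
from gensim.models import Word2Vec   # assumes gensim >= 4.0 is installed

# Toy corpus; in practice word2vec is trained on billions of tokens.
corpus = [
    ["economic", "growth", "has", "slowed", "down", "in", "recent", "years"],
    ["recent", "years", "saw", "slower", "economic", "growth"],
]

# sg=1 -> Skip-gram (central word predicts context); sg=0 -> CBOW (context predicts central word)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(skipgram.wv["growth"][:5])           # the learned embedding vector of a word
print(skipgram.wv.most_similar("growth"))  # nearest words by cosine similarity
```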
Figure: 2-D visualization of word embeddings, where semantically related words cluster together (e.g., electronic products, leaders of China, leaders of companies, psychological reaction words, comparative adjectives).
Sentence Embedding
Encoder-Decoder (Cho et al., 2014): the encoder reads the source sentence X = (Economic, growth, has, slowed, down, in, recent, years, .) and compresses it into a single fixed-length vector; the decoder then generates the target sentence Y = (近, 几年, 经济, 发展, 减, 慢, 了, 。), i.e. the Chinese translation, word by word until </S>.
Encoder-Decoder with Attention (Bahdanau et al., 2014): the same encoder-decoder setup, but at each decoding step the decoder computes attention weights over all encoder hidden states and uses their weighted sum as the context for predicting the next target word, instead of relying on a single fixed-length vector.
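A minimal NumPy sketch of one attention step: a weight per source position, then a weighted sum of the encoder hidden states. Dot-product scoring is used here for brevity; Bahdanau et al. score with a small additive network, but the weighted-sum idea is the same, and the dimensions are assumptions for the example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 9, 16                     # 9 source tokens ("economic ... years ."), hidden size assumed
H = rng.normal(size=(T, d))      # encoder hidden states, one per source word
s = rng.normal(size=d)           # current decoder hidden state

scores = H @ s                   # relevance of each source position to the current target word
weights = softmax(scores)        # attention weights, which sum to 1
context = weights @ H            # weighted sum of encoder states, fed into the decoder
print(weights.round(2), context.shape)
```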
Transformer (Vaswani et al., 2017): the encoder reads economic/0 growth/1 has/2 slowed/3 down/4 in/5 recent/6 years/7 (tokens with position indices); the decoder generates 近/0 几年/1 ,/2 经济/3 发展/4 ... while attending to the encoder outputs.
Transformer encoder layer: self-attention followed by a feed-forward network (FFN), each wrapped with a residual link; stacking such layers produces the source hidden states.
Transformer decoder layer: self-attention over the generated prefix, attention to the source hidden states, and an FFN, each wrapped with a residual link.
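The same block structure is available off the shelf; a sketch with PyTorch's built-in layers, where the model dimensions and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)  # self-attention + FFN, residual links
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)  # adds attention to source hidden states

src = torch.randn(8, 1, d_model)   # (source length, batch, hidden): the English tokens
tgt = torch.randn(5, 1, d_model)   # target tokens generated so far

memory = encoder_layer(src)        # source hidden states
out = decoder_layer(tgt, memory)   # decoder attends to itself and to the source hidden states
print(out.shape)                   # torch.Size([5, 1, 512])
```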
• Pre-trained models
Pre-training stage: learn task-agnostic general knowledge from a large-scale corpus (texts, text-image pairs, text-video pairs) by self-supervised learning (autoregressive LM, auto-encoding), producing a pre-trained model (monolingual, multilingual, or multimodal).
Fine-tuning stage: transfer the learnt knowledge to downstream tasks by discriminative training, yielding a fine-tuned model for each task (Task 1, Task 2, ..., Task N).
Downstream task types: classification, sequence labeling, structure prediction, sequence generation (e.g., POS/NER/parsing, question answering, text summarization, machine translation, image retrieval, video captioning, ...).
• Embed task-agnostic general knowledge
• Transfer learnt knowledge to downstream tasks
• Hold state-of-the-art results on (almost) all NLP tasks
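A hedged sketch of the two stages with the Hugging Face transformers library: load weights produced by the pre-training stage, then fine-tune them discriminatively on a downstream classification task. The checkpoint name and the toy data are assumptions for illustration, not the setup described in the talk.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-training stage already done elsewhere: we only download the task-agnostic weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tuning stage: discriminative training on a (toy) downstream dataset.
batch = tokenizer(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss   # task-specific classification loss on top of the encoder
loss.backward()
optimizer.step()
print(float(loss))
```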
A simplified example of self-attention in Transformer
Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision. Two typical objectives are the Autoregressive (AR) LM and Auto-encoding (AE), illustrated at (a) the word level and (b) the sentence level on the example sentence "LM is a typical task in natural language processing".
Unsupervised (self-supervised) learning with a pre-trained model (e.g., a multilayer Transformer): given the input "An apple is a sweet, edible [MASK] produced by an apple tree.", the model builds contextualized representations and predicts a distribution over the vocabulary for the masked position (e.g., "fruit" vs. "company"); the loss compares this prediction with the ground truth word "fruit".
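The masked-word prediction above can be reproduced with the transformers fill-mask pipeline; using this particular public BERT checkpoint is an assumption about tooling, since the slide only illustrates the principle.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("An apple is a sweet, edible [MASK] produced by an apple tree."):
    # Each candidate comes with a probability; "fruit" should rank near the top.
    print(cand["token_str"], round(cand["score"], 3))
```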
BERT-based Sentence Pair Matching: given the final hidden vector C ∈ R^H of the first input token ([CLS]), fine-tune BERT by a standard classification loss with C and W: log(softmax(C W^T)), where W ∈ R^{K×H} is a classification layer and K is the number of labels.
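A minimal PyTorch sketch of exactly this head, with a random stand-in for the [CLS] vector C; the hidden size H = 768, K = 2 labels, and the label convention are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

H, K = 768, 2                                # hidden size and number of labels (assumed)
C = torch.randn(1, H)                        # final hidden vector of [CLS] for one sentence pair
W = torch.randn(K, H, requires_grad=True)    # classification layer W in R^{K x H}

log_probs = F.log_softmax(C @ W.t(), dim=-1) # log(softmax(C W^T))
label = torch.tensor([1])                    # 1 = "the sentences match" (assumed convention)
loss = F.nll_loss(log_probs, label)          # standard classification loss used for fine-tuning
loss.backward()
print(float(loss))
```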
Timeline of Pre-trained Models for Natural Language (GREEN: monolingual pre-trained models; BLUE: multilingual pre-trained models; U: for understanding tasks; G: for generation tasks)
Models: Word2Vec (2013), CoVe (2017), ELMo (Peters et al., 2018), ULMFiT, GPT, BERT (2018), MT-DNN, MASS, UniLM, Unicoder, XLM, BART, mBART, ProphetNet
Downstream NLP tasks: search engine, machine translation, semantic parsing, question answering, chatbot & dialogue, paraphrase classification, text entailment, sentiment analysis, ...
https://arxiv.org/abs/2003.08271
Connections and Differences Between (Monolingual) Pre-trained Models
Model Name | Model Usage | Model Backbone | Model Contribution
GPT (OpenAI) | Understanding & Generation | Transformer Encoder | 1st unidirectional pre-trained LM based on Transformer
BERT (Google) | Understanding | Transformer Encoder | 1st bidirectional pre-trained LM based on Transformer
MT-DNN (MS) | Understanding | Transformer Encoder | use multiple understanding tasks in pre-training
MASS (MS) | Generation | Separate Transformer Encoder-Decoder | use masked span prediction for generation tasks
UniLM (MS) | Understanding & Generation | Unified Transformer Encoder-Decoder | unify understanding and generation tasks in pre-training with different attention masks
RoBERTa (FB) | Understanding | Transformer Encoder | use better pre-training tricks, such as dynamic masking, large batches, removing NSP, data sampling
ERNIE (Baidu) | Understanding | Transformer Encoder | prove noun phrase masking and entity masking are better than word masking
SpanBERT (FB) | Understanding | Transformer Encoder | prove random span masking is better than others
XLNet (Google) | Understanding | Transformer Encoder | unify autoregressive LM and autoencoding tasks in pre-training with the two-stream self-attention
T5 (Google) | Generation | Separate Transformer Encoder-Decoder | use a separate encoder-decoder for understanding and generation tasks and prove it is the best choice; compare different hyper-parameters and show the best settings
BART (FB) | Generation | Separate Transformer Encoder-Decoder | try different text noising methods for generation tasks
ELECTRA (Google) | Understanding | Transformer Generator-Discriminator | use a simple but effective GAN-style pre-training task
ProphetNet (MS) | Generation | Separate Transformer Encoder-Decoder | use future n-gram prediction for generation tasks with the n-stream self-attention
Research directions around the pre-training/fine-tuning pipeline (Large-scale Corpus → Pre-training → Pre-trained Model → Fine-tuning on task-specific datasets → Downstream Tasks):
• Pre-training tasks
• Pre-trained model structures
• Pre-trained model compression
• Pre-training acceleration
• Fine-tuning strategies
• Knowledge distillation
• Inference acceleration
GREEN: efforts for performance; BLUE: efforts for practical usage
https://gluebenchmark.com/
CoLA: The Corpus of Linguistic Acceptability
SST-2: The Stanford Sentiment Treebank
MRPC: The Microsoft Research Paraphrase Corpus
STS-B: The Semantic Textual Similarity Benchmark
QQP: The Quora Question Pairs
MNLI: The Multi-Genre Natural Language Inference Corpus
QNLI: The Stanford Question Answering Dataset
RTE: The Recognizing Textual Entailment
WNLI: The Winograd Schema Challenge
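The tasks above can also be pulled programmatically; a sketch with the datasets library (an assumption about tooling, since GLUE can equally be downloaded from the website directly).

```python
from datasets import load_dataset

# Any of: "cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])   # e.g. {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```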
UniLM (Dong et al., 2019)