Neural Network based NLP: Its Progresses and Challenges
Dr. Ming Zhou, Microsoft Research Asia
CLSW 2020, City University of Hong Kong, May 30, 2020
Figure (credited to AMiner, Tsinghua University, 2019): major neural NLP areas, including NMT, text generation, MRC, multi-modality, question answering, and conversational systems.
• Neural NLP (NN-NLP)
• Word embedding (Mikolov et al., 2013)
• Sentence embedding
• Encoder-Decoder with attention (Bahdanau et al., 2014)
• Transformer (Vaswani et al., 2017)
A feed-forward neural network computes hidden features layer by layer:
h_0 = f(W_0 x), h_1 = f(W_1 h_0), h_2 = f(W_2 h_1), y = g(W_3 h_2),
where f and g are activation functions.
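As a concrete illustration, here is a minimal NumPy sketch of this layer-by-layer computation; the layer sizes and the choice of ReLU and softmax activations are assumptions made for the example, not part of the slide.

```python
import numpy as np

def relu(v):
    # f: a common choice of hidden-layer activation function
    return np.maximum(0.0, v)

def softmax(v):
    # g: output activation turning scores into probabilities
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # input features
W0 = rng.normal(size=(8, 4))      # layer weights (sizes are illustrative)
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(3, 8))

h0 = relu(W0 @ x)                 # h_0 = f(W_0 x)
h1 = relu(W1 @ h0)                # h_1 = f(W_1 h_0)
h2 = relu(W2 @ h1)                # h_2 = f(W_2 h_1)
y = softmax(W3 @ h2)              # y = g(W_3 h_2)
print(y)
```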
Word embedding tries to map words from a discrete space into a semantic space, in which semantically similar words have similar embedding vectors. "You shall know a word by the company it keeps." Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013.
CBOW (Continuous Bag-of-Words): use the context words in a window to predict the central word. Skip-gram (Continuous Skip-gram): use the central word to predict the context words in a window. (Figure: context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} around the central word w_t.) Mikolov et al. Efficient Estimation of Word Representations in Vector Space. 2013.
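A minimal sketch of training both variants with the gensim library; this is an assumption about tooling (the original work used the word2vec C tool and far larger corpora), and the toy corpus is invented for illustration.

```python
from gensim.models import Word2Vec   # assumes gensim >= 4.0 is installed

# Toy corpus; in practice word2vec is trained on billions of tokens.
corpus = [
    ["economic", "growth", "has", "slowed", "down", "in", "recent", "years"],
    ["recent", "years", "saw", "slower", "economic", "growth"],
]

# sg=1 -> Skip-gram (central word predicts context); sg=0 -> CBOW (context predicts central word)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=100)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(skipgram.wv["growth"][:5])           # the learned embedding vector of a word
print(skipgram.wv.most_similar("growth"))  # nearest words by cosine similarity
```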
Figure: 2-D visualization of word embeddings, where semantically related words cluster together (e.g., electronic products, leaders of China, leaders of companies, psychological reaction words, comparative adjectives).
Sentence Embedding
Encoder-Decoder (Cho et al., 2014): the encoder reads the source sentence X = (Economic, growth, has, slowed, down, in, recent, years, .) and compresses it into a single fixed-length vector; the decoder then generates the target sentence Y = (近, 几年, 经济, 发展, 减, 慢, 了, 。), i.e. the Chinese translation, word by word until </S>.
Encoder-Decoder with Attention (Bahdanau et al., 2014): the same encoder-decoder setup, but at each decoding step the decoder computes attention weights over all encoder hidden states and uses their weighted sum as the context for predicting the next target word, instead of relying on a single fixed-length vector.
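A minimal NumPy sketch of one attention step: a weight per source position, then a weighted sum of the encoder hidden states. Dot-product scoring is used here for brevity; Bahdanau et al. score with a small additive network, but the weighted-sum idea is the same, and the dimensions are assumptions for the example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 9, 16                     # 9 source tokens ("economic ... years ."), hidden size assumed
H = rng.normal(size=(T, d))      # encoder hidden states, one per source word
s = rng.normal(size=d)           # current decoder hidden state

scores = H @ s                   # relevance of each source position to the current target word
weights = softmax(scores)        # attention weights, which sum to 1
context = weights @ H            # weighted sum of encoder states, fed into the decoder
print(weights.round(2), context.shape)
```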
Transformer (Vaswani et al., 2017): the encoder reads economic/0 growth/1 has/2 slowed/3 down/4 in/5 recent/6 years/7 (tokens with position indices); the decoder generates 近/0 几年/1 ,/2 经济/3 发展/4 ... while attending to the encoder outputs.
Transformer encoder layer: self-attention followed by a feed-forward network (FFN), each wrapped with a residual link; stacking such layers produces the source hidden states.
Transformer decoder layer: self-attention over the generated prefix, attention to the source hidden states, and an FFN, each wrapped with a residual link.
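The same block structure is available off the shelf; a sketch with PyTorch's built-in layers, where the model dimensions and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)  # self-attention + FFN, residual links
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)  # adds attention to source hidden states

src = torch.randn(8, 1, d_model)   # (source length, batch, hidden): the English tokens
tgt = torch.randn(5, 1, d_model)   # target tokens generated so far

memory = encoder_layer(src)        # source hidden states
out = decoder_layer(tgt, memory)   # decoder attends to itself and to the source hidden states
print(out.shape)                   # torch.Size([5, 1, 512])
```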
• Pre-trained models
Pre-training stage: learn task-agnostic general knowledge from a large-scale corpus (texts, text-image pairs, text-video pairs) by self-supervised learning (autoregressive LM, auto-encoding), producing a pre-trained model (monolingual, multilingual, or multimodal).
Fine-tuning stage: transfer the learnt knowledge to downstream tasks by discriminative training, yielding a fine-tuned model for each task (Task 1, Task 2, ..., Task N).
Downstream task types: classification, sequence labeling, structure prediction, sequence generation (e.g., POS/NER/parsing, question answering, text summarization, machine translation, image retrieval, video captioning, ...).
• Embed task-agnostic general knowledge
• Transfer learnt knowledge to downstream tasks
• Hold state-of-the-art results on (almost) all NLP tasks
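A hedged sketch of the two stages with the Hugging Face transformers library: load weights produced by the pre-training stage, then fine-tune them discriminatively on a downstream classification task. The checkpoint name and the toy data are assumptions for illustration, not the setup described in the talk.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-training stage already done elsewhere: we only download the task-agnostic weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tuning stage: discriminative training on a (toy) downstream dataset.
batch = tokenizer(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss   # task-specific classification loss on top of the encoder
loss.backward()
optimizer.step()
print(float(loss))
```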
A simplified example of self-attention in Transformer
Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision. Two typical objectives are the Autoregressive (AR) LM and Auto-encoding (AE), illustrated at (a) the word level and (b) the sentence level on the example sentence "LM is a typical task in natural language processing".
Unsupervised (self-supervised) learning with a pre-trained model (e.g., a multilayer Transformer): given the input "An apple is a sweet, edible [MASK] produced by an apple tree.", the model builds contextualized representations and predicts a distribution over the vocabulary for the masked position (e.g., "fruit" vs. "company"); the loss compares this prediction with the ground truth word "fruit".
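The masked-word prediction above can be reproduced with the transformers fill-mask pipeline; using this particular public BERT checkpoint is an assumption about tooling, since the slide only illustrates the principle.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("An apple is a sweet, edible [MASK] produced by an apple tree."):
    # Each candidate comes with a probability; "fruit" should rank near the top.
    print(cand["token_str"], round(cand["score"], 3))
```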
BERT-based Sentence Pair Matching: given the final hidden vector C ∈ R^H of the first input token ([CLS]), fine-tune BERT by a standard classification loss with C and W: log(softmax(C W^T)), where W ∈ R^{K×H} is a classification layer and K is the number of labels.
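A minimal PyTorch sketch of exactly this head, with a random stand-in for the [CLS] vector C; the hidden size H = 768, K = 2 labels, and the label convention are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

H, K = 768, 2                                # hidden size and number of labels (assumed)
C = torch.randn(1, H)                        # final hidden vector of [CLS] for one sentence pair
W = torch.randn(K, H, requires_grad=True)    # classification layer W in R^{K x H}

log_probs = F.log_softmax(C @ W.t(), dim=-1) # log(softmax(C W^T))
label = torch.tensor([1])                    # 1 = "the sentences match" (assumed convention)
loss = F.nll_loss(log_probs, label)          # standard classification loss used for fine-tuning
loss.backward()
print(float(loss))
```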
Timeline of Pre-trained Models for Natural Language (GREEN: monolingual pre-trained models; BLUE: multilingual pre-trained models; U: for understanding tasks; G: for generation tasks)
Models: Word2Vec (2013), CoVe (2017), ELMo (Peters et al., 2018), ULMFiT, GPT, BERT (2018), MT-DNN, MASS, UniLM, Unicoder, XLM, BART, mBART, ProphetNet
Downstream NLP tasks: search engine, machine translation, semantic parsing, question answering, chatbot & dialogue, paraphrase classification, text entailment, sentiment analysis, ...
https://arxiv.org/abs/2003.08271
Connections and Differences Between (Monolingual) Pre-trained Models
Model Name | Model Usage | Model Backbone | Model Contribution
GPT (OpenAI) | Understanding & Generation | Transformer Encoder | 1st unidirectional pre-trained LM based on Transformer
BERT (Google) | Understanding | Transformer Encoder | 1st bidirectional pre-trained LM based on Transformer
MT-DNN (MS) | Understanding | Transformer Encoder | use multiple understanding tasks in pre-training
MASS (MS) | Generation | Separate Transformer Encoder-Decoder | use masked span prediction for generation tasks
UniLM (MS) | Understanding & Generation | Unified Transformer Encoder-Decoder | unify understanding and generation tasks in pre-training with different attention masks
RoBERTa (FB) | Understanding | Transformer Encoder | use better pre-training tricks, such as dynamic masking, large batches, removing NSP, data sampling
ERNIE (Baidu) | Understanding | Transformer Encoder | prove noun phrase masking and entity masking are better than word masking
SpanBERT (FB) | Understanding | Transformer Encoder | prove random span masking is better than others
XLNet (Google) | Understanding | Transformer Encoder | unify autoregressive LM and autoencoding tasks in pre-training with the two-stream self-attention
T5 (Google) | Generation | Separate Transformer Encoder-Decoder | use a separate encoder-decoder for understanding and generation tasks and prove it is the best choice; compare different hyper-parameters and show the best settings
BART (FB) | Generation | Separate Transformer Encoder-Decoder | try different text noising methods for generation tasks
ELECTRA (Google) | Understanding | Transformer Generator-Discriminator | use a simple but effective GAN-style pre-training task
ProphetNet (MS) | Generation | Separate Transformer Encoder-Decoder | use future n-gram prediction for generation tasks with the n-stream self-attention
Research directions around the pre-training/fine-tuning pipeline (Large-scale Corpus → Pre-training → Pre-trained Model → Fine-tuning on task-specific datasets → Downstream Tasks):
• Pre-training tasks
• Pre-trained model structures
• Pre-trained model compression
• Pre-training acceleration
• Fine-tuning strategies
• Knowledge distillation
• Inference acceleration
GREEN: efforts for performance; BLUE: efforts for practical usage
https://gluebenchmark.com/
CoLA: The Corpus of Linguistic Acceptability
SST-2: The Stanford Sentiment Treebank
MRPC: The Microsoft Research Paraphrase Corpus
STS-B: The Semantic Textual Similarity Benchmark
QQP: The Quora Question Pairs
MNLI: The Multi-Genre Natural Language Inference Corpus
QNLI: The Stanford Question Answering Dataset
RTE: The Recognizing Textual Entailment
WNLI: The Winograd Schema Challenge
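The tasks above can also be pulled programmatically; a sketch with the datasets library (an assumption about tooling, since GLUE can equally be downloaded from the website directly).

```python
from datasets import load_dataset

# Any of: "cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])   # e.g. {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```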
UniLM (Dong et al., 2019)