Introduction to Machine Learning
Language Models and Transfer Learning
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from various sources (see reference page)
What is a Language Model
o A statistical language model is a probability distribution over sequences of words.
o Given such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence.
o Main problem: data sparsity (most long word sequences never appear in the training corpus).
[Slide from https://en.wikipedia.org/wiki/Language_model.]
Unigram Model: Bag of Words
o General probability distribution (chain rule): P(w_1, ..., w_m) = ∏_{i=1}^m P(w_i | w_1, ..., w_{i-1})
o Unigram model assumption: P(w_1, ..., w_m) ≈ ∏_{i=1}^m P(w_i)
o Essentially a bag-of-words model
o Estimation of unigram parameters: count word frequencies in the document (see the sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
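A minimal sketch of unigram estimation on a toy corpus; the corpus and whitespace tokenization are illustrative assumptions, not from the slides:

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()  # toy corpus (assumption)

counts = Counter(corpus)
total = sum(counts.values())

# Maximum-likelihood unigram probabilities: P(w) = count(w) / total tokens
unigram_p = {w: c / total for w, c in counts.items()}

# Probability of a sentence under the unigram (bag-of-words) assumption
def sentence_prob(words):
    p = 1.0
    for w in words:
        p *= unigram_p.get(w, 0.0)  # unseen words get probability 0: the data sparsity problem
    return p

print(sentence_prob("the cat sat".split()))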
n-gram Model
o n-gram assumption: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-(n-1)}, ..., w_{i-1})
o Estimation of n-gram parameters: P(w_i | w_{i-(n-1)}, ..., w_{i-1}) = count(w_{i-(n-1)}, ..., w_i) / count(w_{i-(n-1)}, ..., w_{i-1}) (see the sketch below)
[Slide from https://en.wikipedia.org/wiki/Language_model.]
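A minimal bigram (n = 2) version of the same counting estimator; the toy corpus is again an assumption made for illustration:

from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()  # toy corpus (assumption)

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

# MLE bigram probability: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, followed by "cat" twice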
Word2Vec
o Word2Vec: learns distributed representations of words
o Continuous bag-of-words (CBOW)
  o Predicts the current word from a window of surrounding context words
o Continuous skip-gram
  o Uses the current word to predict the surrounding window of context words (see the pair-generation sketch below)
  o Slower than CBOW, but works better for infrequent words
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
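A small sketch of how skip-gram training pairs can be generated from a sentence with a context window; the window size and the sentence are assumptions for illustration:

sentence = "the quick brown fox jumps over the lazy dog".split()  # example sentence (assumption)
window = 2  # context window size (assumption)

# Skip-gram training pairs: (center word, context word)
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]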
Skip-gram Word2Vec
o Vocabulary of all words: w_1, ..., w_V
o Parameters of the skip-gram word2vec model:
  o Word (input) embedding for each word: v_w
  o Context (output) embedding for each word: v'_w
o Assumption: P(context word c | center word w) = exp(v'_c · v_w) / Σ_{c'} exp(v'_{c'} · v_w) (see the sketch below)
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
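A numpy sketch of the softmax assumption above, with tiny random embedding tables standing in for trained parameters; the vocabulary size and dimension are assumptions:

import numpy as np

V, d = 10, 4  # vocabulary size and embedding dimension (assumptions)
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(V, d))     # v_w: one row per word
context_emb = rng.normal(size=(V, d))  # v'_w: separate context embeddings

def skipgram_probs(center_id):
    # Scores for every candidate context word: v'_c . v_center
    scores = context_emb @ word_emb[center_id]
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()        # softmax over the vocabulary

p = skipgram_probs(3)
print(p.sum())  # 1.0: a proper distribution over context words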
Distributed Representations of Words
o The trained word embeddings of the skip-gram word2vec model are the distributed representations.
o Semantic relations between words appear as geometric relations in the embedding space (see the sketch below).
[Slide from https://www.tensorflow.org/tutorials/representation/word2vec.]
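A tiny illustration of "semantics as geometry" using cosine similarity; the 2-D vectors below are hand-crafted stand-ins for trained embeddings, not the output of word2vec:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy hand-crafted 2-D "embeddings" (assumption, not trained vectors),
# arranged so that related words point in similar directions.
emb = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

print(cosine(emb["cat"], emb["dog"]))  # high: semantically related words
print(cosine(emb["cat"], emb["car"]))  # low: unrelated words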
Word Embeddings in Transfer Learning
o Transfer learning setting:
  o Labeled data for the target task are limited
  o Unlabeled text corpora are enormous
o Pretrained word embeddings can be transferred to other supervised tasks, e.g., POS tagging, NER, question answering, machine translation, sentiment classification (see the sketch below).
[Slide from Matt Gormley.]
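A minimal PyTorch sketch of transferring pretrained embeddings into a downstream classifier; the pretrained matrix, sizes, and task model here are illustrative assumptions:

import torch
import torch.nn as nn

V, d, num_classes = 10000, 300, 5  # vocab size, embedding dim, label count (assumptions)
pretrained = torch.randn(V, d)     # stand-in for word2vec/GloVe vectors loaded from disk

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize the embedding layer from pretrained vectors;
        # freeze=False lets the downstream task fine-tune them.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.fc = nn.Linear(d, num_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids).mean(dim=1)  # average the word vectors in each sentence
        return self.fc(x)

model = Classifier()
logits = model(torch.randint(0, V, (2, 7)))  # batch of 2 sentences, 7 tokens each
print(logits.shape)  # torch.Size([2, 5])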
SOTA Language Models: ELMo
o Embeddings from Language Models (ELMo)
o Fits the full conditional probability in the forward direction: p(t_1, ..., t_N) = ∏_{k=1}^N p(t_k | t_1, ..., t_{k-1})
o Fits the conditional probability in both directions using LSTMs; the backward direction models p(t_k | t_{k+1}, ..., t_N)
[Slide from https://arxiv.org/pdf/1802.05365.pdf and https://arxiv.org/abs/1810.04805.]
SOTA Language Models: OpenAI GPT & BERT
o Use the Transformer rather than the LSTM to model language
o OpenAI GPT: single direction (left-to-right)
o BERT: bi-directional (see the attention-mask sketch below)
[Slide from https://arxiv.org/abs/1810.04805.]
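A small numpy sketch contrasting the two directionality choices as attention masks; this is an illustration of the idea, not code from either paper, and the sequence length is an assumption:

import numpy as np

T = 5  # sequence length (assumption)

# GPT-style causal mask: position i may attend only to positions <= i (single direction)
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# BERT-style mask: every position may attend to every other position (bi-directional)
bidirectional_mask = np.ones((T, T), dtype=bool)

print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))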
SOTA Language Models: BERT
o Additional language modeling task (next sentence prediction): predict whether the second sentence actually follows the first in the source text (see the sketch below).
[Slide from https://arxiv.org/abs/1810.04805.]
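A rough sketch of how such sentence-pair training examples could be constructed; the two toy documents are assumptions, and this simplifies BERT's actual sampling procedure:

import random

# Toy "corpus": two documents, each a list of sentences (assumption)
doc_a = ["the cat sat on the mat", "it then fell asleep"]
doc_b = ["stock prices rose sharply", "analysts were surprised"]

def make_nsp_example():
    # Positive pair (IsNext): two consecutive sentences from the same document
    if random.random() < 0.5:
        return doc_a[0], doc_a[1], 1
    # Negative pair (NotNext): second sentence drawn from a different document
    return doc_a[0], random.choice(doc_b), 0

print(make_nsp_example())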
SOTA Language Models: BERT
o Instead of just extracting embeddings or hidden-layer outputs, BERT can be fine-tuned end-to-end on specific supervised learning tasks (see the sketch below).
[Slide from https://arxiv.org/abs/1810.04805.]
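One common way to do this today is the Hugging Face transformers library, which the slides do not mention; the following is a hedged sketch under that assumption, with a toy sentiment label:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained BERT with a fresh classification head on top of the [CLS] token
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # toy sentiment label (assumption)

# The forward pass returns the classification loss; backpropagating it
# fine-tunes all BERT weights, not just the new head.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
print(outputs.logits.shape)  # torch.Size([1, 2])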
The Transformer and Attention Mechanism
o An encoder-decoder structure
o Our focus: the encoder and the attention mechanism
[Slide from https://jalammar.github.io/illustrated-transformer/.]
The Transformer and Attention Mechanism
o Self-attention
  o Ignores the positions of words and assigns weights globally across the sentence
  o Can be parallelized, in contrast to the LSTM
  o E.g., the attention weights related to the word "it_"
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Self-attention Mechanism
o Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / √d_k) V, where the queries Q, keys K, and values V are linear projections of the input embeddings (see the sketch below).
[Slide from https://jalammar.github.io/illustrated-transformer/.]
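A numpy sketch of single-head scaled dot-product self-attention over a toy sequence; the projection matrices are random stand-ins for learned parameters, and the sizes are assumptions:

import numpy as np

T, d_model, d_k = 4, 8, 8  # sequence length, model dim, key dim (assumptions)
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))      # token embeddings for one sentence
W_q = rng.normal(size=(d_model, d_k))  # learned projections in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1: how much one token attends to every other token

output = weights @ V  # (T, d_k): every position is computed in parallel, unlike an LSTM
print(weights.round(2))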
Self-attention Mechanism
o More: https://jalammar.github.io/illustrated-transformer/
[Slide from https://jalammar.github.io/illustrated-transformer/.]
Take-home Message
o Language models suffer from data sparsity
o Word2vec models language probability with distributed word-embedding parameters
o ELMo, OpenAI GPT, and BERT model language using deep neural networks
o Pre-trained language models, or their parameters, can be transferred to supervised learning problems in NLP
o Self-attention has the advantage over the LSTM that it can be parallelized and considers interactions across the whole sentence
References
o Wikipedia. Language model: https://en.wikipedia.org/wiki/Language_model
o TensorFlow. Vector Representations of Words: https://www.tensorflow.org/tutorials/representation/word2vec
o Matt Gormley. 10-601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Matthew E. Peters et al. Deep Contextualized Word Representations: https://arxiv.org/pdf/1802.05365.pdf
o Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
o Jay Alammar. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/