N-Grams and Language Models
Munindar P. Singh (NCSU), Natural Language Processing, Fall 2020


1. Language Models
◮ Assignment of probabilities to sequences of words
◮ Can be used incrementally to predict the next word
◮ N-gram: a sequence of n words (bigram, trigram, ...)
  ◮ The size of the corpus constrains n
  ◮ Can go high on web-scale data
  ◮ In 2006, Google released 10^9 (1, 2, 3, 4, 5)-grams occurring ≥ 40 times in a corpus of 10^12 words (1.3 × 10^6 unique)
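
As a concrete illustration (not from the slides), the short Python sketch below enumerates the n-grams of a tokenized sentence; the function name ngrams is a hypothetical choice.

```python
# Sketch: enumerating the n-grams of a token sequence (the name `ngrams` is illustrative).
def ngrams(tokens, n):
    """Return all n-grams of `tokens` as tuples of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I am Sam".split(), 2))  # [('I', 'am'), ('am', 'Sam')]
```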

2. Predicting a Word
Language models: bigram, trigram, n-gram
◮ Sequence of words: w_1 ... w_n
◮ w_i^j means w_i ... w_j
◮ Chain rule: P(w_1^n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1^{n-1})
◮ Not quite usable. Why?
  ◮ Language use is creative
  ◮ Huge amount of data needed to get enough coverage
◮ Bigram: assume P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})
◮ Trigram: look at two words in the past
◮ n-gram: look at n − 1 words in the past
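
To make the bigram approximation concrete, here is a minimal Python sketch that scores a sentence as a product of P(w_i | w_{i−1}) terms; the probability table is hand-filled for illustration (its values happen to be the MLE estimates from the toy corpus on the next slide).

```python
# Bigram approximation: P(w_1 ... w_n) ≈ ∏ P(w_i | w_{i-1}).
# The table below is illustrative, not learned by this code.
bigram_prob = {
    ("<s>", "I"): 2 / 3,
    ("I", "am"): 2 / 3,
    ("am", "Sam"): 1 / 2,
    ("Sam", "</s>"): 1 / 2,
}

def sentence_prob(words):
    """Multiply conditional bigram probabilities across the sentence."""
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigram => zero (see Sparsity)
    return p

print(sentence_prob("<s> I am Sam </s>".split()))  # ≈ 0.111
```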

3. Maximum Likelihood Estimation (MLE)
Technique to estimate probabilities
◮ Symbols for sentence start <s> and end </s>
◮ Obtain a corpus
◮ Calculate relative frequencies (bigram count ÷ unigram count)
◮ P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})
Example:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
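
A minimal sketch of the MLE recipe on the slide's toy corpus: count unigrams and bigrams, then divide.

```python
from collections import Counter

# MLE bigram estimate: P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1}).
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p_mle(word, prev):
    """Relative frequency of `word` following `prev`."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("I", "<s>"))   # 2/3: 'I' starts 2 of the 3 sentences
print(p_mle("Sam", "am"))  # 1/2
```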

4. Evaluation
◮ Extrinsic
  ◮ Real-world usage
◮ Intrinsic
  ◮ From the data itself, based on held-out data
  ◮ Split into training and test data
  ◮ Safer to split into training, development (devset), and test data
  ◮ n-fold testing
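
As a sketch of the held-out setup (the 80/10/10 split fractions are an assumption, not from the slides):

```python
import random

# Shuffle sentences and split them into training, development, and test sets.
def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    n_dev = int(len(sents) * dev_frac)
    n_test = int(len(sents) * test_frac)
    return sents[n_dev + n_test:], sents[:n_dev], sents[n_dev:n_dev + n_test]

train, dev, test = split_corpus([f"sentence {i}" for i in range(100)])
```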

5. Perplexity
Lower is better
◮ Nth root of the inverse probability of the test set:
  PP(W) = P(w_1 ... w_N)^(−1/N)
        = (1 / P(w_1 ... w_N))^(1/N)
        = (∏_{i=1}^{N} 1 / P(w_i | w_1 ... w_{i−1}))^(1/N)
◮ Weighted average branching factor of a language
  ◮ Branching factor: the number of possible next words that can follow any word
  ◮ Weighted by probability
◮ Calculate for the Sam I Am stanza
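
A sketch of the perplexity computation for a bigram model, done in log space; it assumes a conditional-probability function such as p_mle from the MLE sketch above (and, without smoothing, it fails on any zero-probability bigram).

```python
import math

# PP(W) = (∏_i 1 / P(w_i | w_{i-1}))^(1/N), computed via the average negative log probability.
def perplexity(words, cond_prob):
    """`words` includes <s> and </s>; cond_prob(word, prev) -> P(word | prev)."""
    neg_log = 0.0
    n = 0
    for prev, cur in zip(words[:-1], words[1:]):
        neg_log -= math.log(cond_prob(cur, prev))
        n += 1
    return math.exp(neg_log / n)

# Example, reusing p_mle from the MLE sketch:
# perplexity("<s> I am Sam </s>".split(), p_mle)
```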

6. Sparsity
◮ Rare n-grams may not appear in the corpus
◮ Zero count ⇒ estimated probability of zero

7. Unknown (Out-of-Vocabulary) Words
◮ Closed vocabulary
  ◮ Assume all unknown words are the same <UNK>
◮ Open vocabulary
  ◮ Treat all rare words as the same <UNK>
  ◮ Treat the top N most frequent words as words and replace the rest by <UNK>
◮ The number of unknown words can be over-estimated when a language has complex inflected forms
  ◮ Stemming can reduce (apparent) unknowns but is a coarse approach
◮ Perplexity can be lowered by making the vocabulary smaller
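
A sketch of the open-vocabulary recipe: keep the top-N most frequent word types and map everything else to <UNK> (the threshold top_n and the helper names are assumptions for illustration).

```python
from collections import Counter

# Keep the top-N word types; everything else becomes <UNK>.
def build_vocab(tokenized_sentences, top_n=10000):
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    return {w for w, _ in counts.most_common(top_n)}

def replace_unknowns(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

# vocab = build_vocab(training_sentences)
# test_tokens = replace_unknowns(test_tokens, vocab)
```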

8. Smoothing
◮ Calculate for the Sam I Am stanza as a corpus
◮ Adjusted counts, c*
◮ Discounting (i.e., reducing) the nonzero counts
  ◮ Frees up some probability mass to assign to the zero counts
◮ Laplace: add 1 to each count
  ◮ Simple
  ◮ Invented by Pierre-Simon Laplace in the early days of Bayesian reasoning
  ◮ Since there are so many zero-count bigrams, Laplace takes away too much probability mass from the nonzero counts
◮ Add-k smoothing (k < 1)
  ◮ Requires tuning, via devset
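
A sketch of add-k smoothing (Laplace when k = 1), reusing the unigram and bigram counters from the MLE sketch; V is the vocabulary size.

```python
# Add-k smoothed bigram estimate:
# P(w_n | w_{n-1}) = (count(w_{n-1} w_n) + k) / (count(w_{n-1}) + k * V).
def p_add_k(word, prev, unigrams, bigrams, k=1.0):
    V = len(unigrams)  # vocabulary size (number of distinct word types)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

# With the Sam-I-Am counts: p_add_k("Sam", "am", unigrams, bigrams) -> (1 + 1) / (2 + 12)
```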

9. Backoff
◮ Backoff: reduce context when there is insufficient data
  ◮ If not enough trigrams, use the bigram (of the last two words)
  ◮ If not enough bigrams, use the unigram
◮ Interpolation: combine all n-gram estimators
  ◮ Linear combination of probabilities estimated from unigram, bigram, and trigram counts
  ◮ Use a held-out corpus to estimate the interpolation weights
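
A sketch of linear interpolation over unigram, bigram, and trigram estimators; the λ weights shown are placeholders that would be tuned on the held-out data.

```python
# Interpolated trigram estimate:
# P(w | w2 w1) = λ1·P_uni(w) + λ2·P_bi(w | w1) + λ3·P_tri(w | w2 w1), with λ1 + λ2 + λ3 = 1.
def p_interpolated(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """w2 w1 w is the trigram; p_uni, p_bi, p_tri are estimator callables."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w1) + l3 * p_tri(w, w1, w2)
```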

10. Kneser-Ney Smoothing
◮ Based on an empirical observation
  ◮ Get counts of n-grams from one corpus
  ◮ Get counts of the same n-grams from a held-out corpus
  ◮ The average counts in the second corpus are lower by about 0.75 (or 0.80) for bigrams
  ◮ Bigrams of count zero are more popular in the second corpus
  ◮ Bigrams of count 1 average about 0.5
◮ Gale and Church: reduce by 0.75 for bigrams with counts of 3 or higher and place that probability mass on bigrams of count 0 and 1
◮ Kneser-Ney
  ◮ P(continuation) ∝ the number of distinct contexts in which a word appears, i.e., the number of distinct bigrams in which it occurs as the second word
  ◮ Interpolate based on P(continuation)
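
A sketch of interpolated Kneser-Ney for bigrams, using the 0.75 discount mentioned on the slide; P_continuation(w) is proportional to the number of distinct words that precede w.

```python
from collections import Counter, defaultdict

# Interpolated Kneser-Ney bigram estimate:
#   P_KN(w | prev) = max(count(prev w) - d, 0) / count(prev ·) + λ(prev) · P_continuation(w)
#   P_continuation(w) = |{v : count(v w) > 0}| / |distinct bigram types|
#   λ(prev) = d · |{w : count(prev w) > 0}| / count(prev ·)
def make_kneser_ney(bigrams, d=0.75):
    """`bigrams` is a Counter over (prev, word) pairs, as in the MLE sketch."""
    preceders = defaultdict(set)   # word -> distinct left contexts
    followers = defaultdict(set)   # prev -> distinct right contexts
    context_total = Counter()      # prev -> total count of bigrams starting with prev
    for (prev, word), c in bigrams.items():
        preceders[word].add(prev)
        followers[prev].add(word)
        context_total[prev] += c
    n_bigram_types = len(bigrams)

    def p_kn(word, prev):
        p_cont = len(preceders[word]) / n_bigram_types
        lam = d * len(followers[prev]) / context_total[prev]
        return max(bigrams[(prev, word)] - d, 0) / context_total[prev] + lam * p_cont
    return p_kn

# p_kn = make_kneser_ney(bigrams)   # bigram Counter from the MLE sketch
# p_kn("Sam", "am")
```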
