Foundations of Language Science and Technology: Statistical Language Models Dietrich Klakow
Using Language Models
How Speech Recognition Works
[Diagram: the speech signal passes through feature extraction to give a stream of feature vectors A; the acoustic model supplies P(A|W) and the language model supplies P(W); the search finds \hat{W} = \arg\max_W [P(A|W) P(W)]; the output is the recognized word sequence \hat{W}]
Guess the next word: What's in your hometown newspaper ???
Guess the next word: What's in your hometown newspaper today
Guess the next word: It's raining cats and ???
Guess the next word: It's raining cats and dogs
Guess the next word: President Bill ???
Guess the next word: President Bill Gates
Information Retrieval
• Language models were introduced to information retrieval in 1998 by Ponte & Croft
[Diagram: a query Q and a collection of documents D_1 ... D_7; each document is scored by P(Q|D_i), e.g. P(Q|D_2), and the documents are ranked according to P(Q|D_i)]
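In this setting each document D_i is scored by the probability its own language model assigns to the query. A minimal Python sketch of such query-likelihood ranking, assuming a unigram document model smoothed linearly with a collection model (the function and parameter names, e.g. score_query_likelihood and lam, are illustrative, not from the slides):

import math
from collections import Counter

def score_query_likelihood(query_terms, doc_terms, collection_counts,
                           collection_size, lam=0.5):
    # Unigram document model, linearly smoothed with the collection model.
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_prob = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(term, 0) / collection_size
        p = lam * p_doc + (1.0 - lam) * p_coll
        log_prob += math.log(p) if p > 0 else float("-inf")
    return log_prob

Documents would then be sorted by this score in descending order to produce the ranking according to P(Q|D_i).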
Measuring the Quality of Language Models
Definition of Perplexity
PP = P(w_1 \dots w_N)^{-1/N} = \exp\left( -\frac{1}{N} \sum_{w,h} N(w,h) \log P(w|h) \right)
P(w|h): language model
N(w,h): frequency of the sequence w,h in some test corpus
N: size of the test corpus
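A minimal Python sketch of this definition, assuming the language model is available as a function prob(w, h) that returns P(w|h) > 0 for every event in the test corpus (names are illustrative):

import math
from collections import Counter

def perplexity(test_tokens, prob, order=2):
    """Compute PP = exp(-1/N * sum_{w,h} N(w,h) log P(w|h)) on a token list."""
    # Count how often each (history, word) pair occurs in the test corpus.
    counts = Counter()
    for i in range(order - 1, len(test_tokens)):
        h = tuple(test_tokens[i - order + 1:i])
        w = test_tokens[i]
        counts[(h, w)] += 1
    n = sum(counts.values())            # size of the test corpus, N
    log_sum = sum(c * math.log(prob(w, h)) for (h, w), c in counts.items())
    return math.exp(-log_sum / n)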
Interpretation
• Calculate the perplexity of the uniform distribution (white board)
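The whiteboard calculation is not reproduced on the slide; a short worked version for the uniform distribution over a vocabulary of size V:

P(w|h) = \frac{1}{V} \;\Rightarrow\; PP = \exp\left( -\frac{1}{N} \sum_{w,h} N(w,h) \log\frac{1}{V} \right) = \exp(\log V) = V

using \sum_{w,h} N(w,h) = N. Perplexity can thus be read as the effective number of choices the model faces at each word position.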
Perplexity and Word Error Rate
• Perplexity and word error rate are correlated within error bars
Estimating the Parameters of a Language Model
Goal
• Minimize perplexity on training data
PP = \exp\left( -\frac{1}{N_{\mathrm{Train}}} \sum_{w,h} N_{\mathrm{Train}}(w,h) \log P(w|h) \right)
• Define the likelihood L = -\log(PP):
L = \frac{1}{N_{\mathrm{Train}}} \sum_{w,h} N_{\mathrm{Train}}(w,h) \log P(w|h)
• Minimizing perplexity ⇔ maximizing likelihood
• How to take the normalization constraint into account?
Calculating the maximum likelihood estimate (white board)
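The whiteboard derivation is likewise not on the slide; a standard version maximizes L subject to the normalization constraint \sum_w P(w|h) = 1 for every history h, using Lagrange multipliers \mu_h:

\frac{\partial}{\partial P(w|h)} \left[ \sum_{w',h'} N_{\mathrm{Train}}(w',h') \log P(w'|h') - \sum_{h'} \mu_{h'} \left( \sum_{w'} P(w'|h') - 1 \right) \right] = \frac{N_{\mathrm{Train}}(w,h)}{P(w|h)} - \mu_h = 0

so P(w|h) \propto N_{\mathrm{Train}}(w,h); normalizing over w gives the estimator on the next slide.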
Maximum likelihood estimator
P(w|h) = \frac{N_{\mathrm{Train}}(w,h)}{N_{\mathrm{Train}}(h)}
• What's the problem?
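The problem the slide is asking about is that any event unseen in training receives probability zero, so the model assigns infinite perplexity to a test set containing it. A minimal bigram sketch (toy data, illustrative names):

# Maximum-likelihood bigram model: any bigram unseen in training gets
# probability 0, which makes the test-set perplexity infinite.
from collections import Counter

train = "the cat sat on the mat".split()
bigrams = Counter(zip(train[:-1], train[1:]))
unigrams = Counter(train[:-1])

def p_ml(w, h):
    return bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

print(p_ml("sat", "cat"))   # 1.0 -> seen in training
print(p_ml("dog", "the"))   # 0.0 -> unseen bigram, zero probability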
Backing-off and Smoothing
Absolute Discounting
• See white board
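The whiteboard formulation is not on the slide; a standard interpolated form of absolute discounting subtracts a constant d (0 < d ≤ 1) from every seen count and gives the freed probability mass to a backing-off distribution β:

P_{\mathrm{abs}}(w|h) = \frac{\max\bigl( N_{\mathrm{Train}}(h,w) - d,\, 0 \bigr)}{N_{\mathrm{Train}}(h)} + \frac{d \, \bigl|\{ w' : N_{\mathrm{Train}}(h,w') > 0 \}\bigr|}{N_{\mathrm{Train}}(h)} \, \beta(w)

The discounting parameter d is the quantity whose influence the next slide examines.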
Influence of Discounting Parameter
Possible further Improvements
Linear Smoothing
P(w_0|w_{-1}) = \lambda_1 \frac{N_{\mathrm{Train}}(w_{-1} w_0)}{N_{\mathrm{Train}}(w_{-1})} + \lambda_2 \frac{N_{\mathrm{Train}}(w_0)}{N_{\mathrm{Train}}} + (1 - \lambda_1 - \lambda_2) \frac{1}{V}
V: size of the vocabulary
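A minimal Python sketch of exactly this interpolation for a bigram model; the weights lam1 and lam2 are placeholders and would in practice be tuned, e.g. on held-out data (names are illustrative):

from collections import Counter

def train_linear_smoothed(tokens, vocab_size, lam1=0.6, lam2=0.3):
    # Collect bigram and unigram counts from the training tokens.
    bigram = Counter(zip(tokens[:-1], tokens[1:]))
    unigram = Counter(tokens)
    n_train = len(tokens)

    def prob(w0, w_prev):
        # Linear interpolation of bigram, unigram, and uniform distributions.
        p_bi = bigram[(w_prev, w0)] / unigram[w_prev] if unigram[w_prev] else 0.0
        p_uni = unigram[w0] / n_train
        p_uniform = 1.0 / vocab_size
        return lam1 * p_bi + lam2 * p_uni + (1.0 - lam1 - lam2) * p_uniform

    return prob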
Marginal Backing-Off (Kneser-Ney Smoothing)
• Dedicated backing-off distributions
• Usually about 10% to 20% reduction in perplexity
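The slide does not spell out the dedicated backing-off distribution; the standard Kneser-Ney choice replaces raw unigram counts with continuation counts, i.e. the number of distinct histories a word follows:

\beta_{\mathrm{KN}}(w) = \frac{\bigl|\{ h : N_{\mathrm{Train}}(h,w) > 0 \}\bigr|}{\sum_{w'} \bigl|\{ h : N_{\mathrm{Train}}(h,w') > 0 \}\bigr|}

A word then scores high in the backing-off distribution because it appears after many different histories, not merely because it is frequent.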
Class Language Models
• Automatically group words into classes
• Map all words in the language model to classes
• Dramatic reduction in the number of parameters to estimate
• Usually used in linear interpolation with the word language model (a common form is sketched below)
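A common form of such a class model, assuming a hard class assignment c(·) (the exact variant used in the lecture is not given on the slide):

P_{\mathrm{class}}(w_0|w_{-1}) = P\bigl( w_0 \mid c(w_0) \bigr) \, P\bigl( c(w_0) \mid c(w_{-1}) \bigr)

With C classes and a vocabulary of size V, the bigram table shrinks from roughly V^2 to C^2 + V parameters; the class model is then linearly interpolated with the word-based model, as listed in the summary.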
Summary
• How to build a state-of-the-art plain vanilla language model:
• Trigram
• Absolute discounting
• Marginal backing-off (Kneser-Ney smoothing)
• Linear interpolation with a class model