Language Modeling

Hsin-min Wang

References:
1. X. Huang et al., Spoken Language Processing, Chapter 11
2. R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, August 2000
3. Joshua Goodman's (Microsoft Research) public presentation material
4. S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. ASSP, March 1987
5. R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," ICASSP 1995
Acoustic vs. Linguistic

$\hat{W} = \arg\max_W P(W|X) = \arg\max_W P(W)\,P(X|W)$

■ Acoustic pattern matching and knowledge about language are equally important in recognizing and understanding natural speech
  – Lexical knowledge (vocabulary definition and word pronunciation) is required, as are the syntax and semantics of the language (the rules that determine what sequences of words are grammatically well-formed and meaningful)
  – In addition, knowledge of the pragmatics of language (the structure of extended discourse, and what people are likely to say in particular contexts) can be important to achieving the goal of spoken language understanding systems
Language Modeling - Formal vs. Probabilistic

■ The formal language model – grammar and parsing
  – The grammar is a formal specification of the permissible structures for the language
  – The parsing technique is the method of analyzing the sentence to see if its structure is compliant with the grammar
■ The probabilistic (or stochastic) language model
  – Stochastic language models take a probabilistic viewpoint of language modeling
    • The probabilistic relationship among a sequence of words can be derived and modeled from the corpora
  – Avoids the need to create broad-coverage formal grammars
  – Stochastic language models play a critical role in building a working spoken language system
  – N-gram language models are the most widely used
N-gram Language Models - Applications

■ N-gram language models are widely used in many application domains
  – Speech recognition
  – Spelling correction
  – Handwriting recognition
  – Optical character recognition (OCR)
  – Machine translation
  – Document classification and routing
  – Information retrieval
N-gram Language Models

■ For a word sequence W, P(W) can be decomposed into a product of conditional probabilities (chain rule):

  $P(W) = P(w_1, w_2, \ldots, w_m)$
       $= P(w_1)\,P(w_2|w_1)\,P(w_3|w_1, w_2) \cdots P(w_m|w_1, w_2, \ldots, w_{m-1})$
       $= P(w_1) \prod_{i=2}^{m} P(w_i|w_1, w_2, \ldots, w_{i-1})$

  where $w_1, w_2, \ldots, w_{i-1}$ is the history of $w_i$

  – In reality, the probability $P(w_i|w_1, w_2, \ldots, w_{i-1})$ is impossible to estimate for even moderate values of i (data sparseness problem)
  – A practical solution is to assume that $P(w_i|w_1, w_2, \ldots, w_{i-1})$ depends only on the several previous words $w_{i-N+1}, w_{i-N+2}, \ldots, w_{i-1}$: the N-gram language models
N-gram Language Models (cont.)

■ If the word depends on the previous two words, we have a trigram: $P(w_i|w_{i-2}, w_{i-1})$
■ Similarly, we can have a unigram, $P(w_i)$, or a bigram, $P(w_i|w_{i-1})$
■ To calculate P(Mary loves that person):
  – In trigram models, we would take
    P(Mary loves that person) = P(Mary|<s>) P(loves|<s>,Mary) P(that|Mary,loves) P(person|loves,that) P(</s>|that,person)
  – In bigram models, we would take
    P(Mary loves that person) = P(Mary|<s>) P(loves|Mary) P(that|loves) P(person|that) P(</s>|person)
  – In unigram models, we would take
    P(Mary loves that person) = P(Mary) P(loves) P(that) P(person)
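As a small added illustration (not from the original slides), the Python sketch below pads the example sentence with <s> and </s> markers and lists the conditional-probability factors an n-gram model of a given order would multiply, following the slide's convention of a single start marker with truncated early histories. The function name and implementation are assumptions made for this example.

```python
def ngram_factors(sentence, n):
    """List the conditional-probability terms P(w | history) that an
    n-gram model of order n multiplies together to score a sentence."""
    words = sentence.split()
    if n == 1:
        # Unigram models use no history and no sentence markers on this slide.
        return [f"P({w})" for w in words]
    padded = ["<s>"] + words + ["</s>"]        # single start/end marker, as in the slide's example
    factors = []
    for i in range(1, len(padded)):
        history = padded[max(0, i - n + 1):i]  # at most the n-1 previous words
        factors.append(f"P({padded[i]}|{','.join(history)})")
    return factors


print(ngram_factors("Mary loves that person", 2))
# ['P(Mary|<s>)', 'P(loves|Mary)', 'P(that|loves)', 'P(person|that)', 'P(</s>|person)']
print(ngram_factors("Mary loves that person", 3))
# ['P(Mary|<s>)', 'P(loves|<s>,Mary)', 'P(that|Mary,loves)',
#  'P(person|loves,that)', 'P(</s>|that,person)']
```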
N-gram Probability Estimation

■ The trigram can be estimated by observing from a text corpus the frequencies or counts of the word pair C(w_{i-2}, w_{i-1}) and the triplet C(w_{i-2}, w_{i-1}, w_i) as follows:

  $P(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$

  – This estimation is based on the maximum likelihood (ML) principle
    • This assignment of probabilities yields the trigram model that assigns the highest probability to the training data of all possible trigram models

■ The bigram can be estimated as $P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$

■ The unigram can be estimated as $P(w_i) = \frac{C(w_i)}{\text{corpus size}}$
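The following sketch is an added illustration, not the slides' own code: it collects n-gram counts from a tokenized corpus and computes the relative-frequency (ML) estimates above. The toy corpus and function names are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ml_trigram(w2, w1, w, tri_counts, bi_counts):
    """P(w | w2, w1) = C(w2, w1, w) / C(w2, w1), the ML estimate."""
    denom = bi_counts[(w2, w1)]
    return tri_counts[(w2, w1, w)] / denom if denom else 0.0

def ml_bigram(w1, w, bi_counts, uni_counts):
    """P(w | w1) = C(w1, w) / C(w1)."""
    denom = uni_counts[(w1,)]
    return bi_counts[(w1, w)] / denom if denom else 0.0

def ml_unigram(w, uni_counts, corpus_size):
    """P(w) = C(w) / corpus size."""
    return uni_counts[(w,)] / corpus_size

tokens = "<s> Mary loves that person </s> <s> Mary loves music </s>".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

print(ml_bigram("Mary", "loves", bi, uni))           # C(Mary,loves)/C(Mary) = 2/2 = 1.0
print(ml_trigram("<s>", "Mary", "loves", tri, bi))   # 2/2 = 1.0
print(ml_unigram("Mary", uni, len(tokens)))          # 2/11 ≈ 0.18
```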
Maximum Likelihood Estimation of N-gram Probability

■ Given a training corpus T and the language model $\Lambda$:

  Corpus: $T = w_1 w_2 \ldots w_k \ldots w_L$
  Vocabulary: $W = \{w_1, w_2, \ldots, w_V\}$

  $p(T|\Lambda) \cong \prod_k p(w_k \,|\, \text{history of } w_k)$

  N-grams with the same history h are collected together:

  $\forall h \in T,\ p(w_i|h) = \lambda_{hw_i}$, so $p(T|\Lambda) = \prod_h \prod_{w_i} \lambda_{hw_i}^{N_{hw_i}}$,
  subject to $\sum_{w_i} \lambda_{hw_i} = 1$, where $N_{hw_i} = C[hw_i]$ is the count of the n-gram $hw_i$ in T

  Example corpus fragment (Chinese, word-segmented):
  … 陳水扁 總統 訪問 美國 紐約 … 陳水扁 總統 在 巴拿馬 表示 …
  (… President Chen Shui-bian visited New York, USA … President Chen Shui-bian stated in Panama …)
  P(總統 | 陳水扁) = ?  i.e., P(President | Chen Shui-bian) = ?
Maximum Likelihood Estimation of N-gram Probability (cont.)

■ Taking the logarithm of $p(T|\Lambda)$, we have

  $\Phi(\Lambda) = \log p(T|\Lambda) = \sum_h \sum_{w_i} N_{hw_i} \log \lambda_{hw_i}$

■ For any pair $(h, w_i)$, try to maximize $\Phi(\Lambda)$ subject to $\sum_{w_i} \lambda_{hw_i} = 1,\ \forall h$. Introducing Lagrange multipliers $l_h$:

  $\bar{\Phi}(\Lambda) = \Phi(\Lambda) + \sum_h l_h \left( \sum_{w_j} \lambda_{hw_j} - 1 \right)$

  $\frac{\partial \bar{\Phi}(\Lambda)}{\partial \lambda_{hw_i}} = \frac{\partial \left[ \sum_h \sum_{w_i} N_{hw_i} \log \lambda_{hw_i} + \sum_h l_h \left( \sum_{w_j} \lambda_{hw_j} - 1 \right) \right]}{\partial \lambda_{hw_i}} = \frac{N_{hw_i}}{\lambda_{hw_i}} + l_h = 0$

  $\Rightarrow \frac{N_{hw_1}}{\lambda_{hw_1}} = \frac{N_{hw_2}}{\lambda_{hw_2}} = \cdots = \frac{N_{hw_V}}{\lambda_{hw_V}} = -l_h$

  $\Rightarrow l_h = -\sum_{w_s} N_{hw_s}$

  $\therefore \hat{\lambda}_{hw_i} = \frac{N_{hw_i}}{-l_h} = \frac{N_{hw_i}}{\sum_{w_s} N_{hw_s}} = \frac{C[hw_i]}{C[h]}$
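As an added sanity check on this result (not part of the original slides), the snippet below compares the training log-likelihood under the relative-frequency estimates with the log-likelihood under a perturbed distribution for a single history; the counts are invented for illustration.

```python
import math

# Toy counts N_{h,w} for a single history h (illustrative values only).
counts = {"a": 3, "b": 1, "c": 1}
total = sum(counts.values())

def log_likelihood(probs):
    """sum_w N_{h,w} * log p(w|h): the per-history training log-likelihood."""
    return sum(n * math.log(probs[w]) for w, n in counts.items())

ml = {w: n / total for w, n in counts.items()}          # relative frequencies
perturbed = {"a": 0.5, "b": 0.3, "c": 0.2}              # any other valid distribution

print(log_likelihood(ml) >= log_likelihood(perturbed))  # True: the ML estimate wins
```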
A Simple Bigram Example

[The original slide's bigram count/probability figures are not recoverable from this text.]
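Since the slide's figures are unavailable, the sketch below is a stand-in illustration (author's own, not the original example): it estimates bigram probabilities from a two-sentence toy corpus and then scores sentences with them. The corpus and sentences are assumptions.

```python
from collections import Counter

corpus = [
    "<s> Mary loves that person </s>",
    "<s> that person loves Mary </s>",
]
tokens = [w for line in corpus for w in line.split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
# Drop the artificial </s> <s> transition between the two sentences.
del bigrams[("</s>", "<s>")]

def p_bigram(w_prev, w):
    """ML bigram estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def sentence_prob(sentence):
    """Probability of a padded sentence under the bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for w_prev, w in zip(words, words[1:]):
        prob *= p_bigram(w_prev, w)
    return prob

print(p_bigram("Mary", "loves"))                # 1/2 = 0.5
print(sentence_prob("Mary loves that person"))  # (1/2)(1/2)(1/2)(1)(1/2) = 0.0625
print(sentence_prob("person loves Mary"))       # 0.0: the bigram (<s>, person) was never seen
```

The zero probability assigned to the last sentence, caused by a single unseen bigram, is exactly the data-sparseness problem that motivates the smoothing methods mentioned on the next slide.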
Major Issues for N-gram LM

■ Evaluation
  – How can you tell a good language model from a bad one?
  – Run a speech recognizer or adopt other statistical measurements
■ Smoothing
  – Deal with the data sparseness of real training data
  – Various approaches have been proposed
■ Adaptation
  – Dynamic adjustment of the language model parameters, such as the n-gram probabilities, the vocabulary size, and the choice of words in the vocabulary
  – E.g., P(table|the operating) vs. P(system|the operating)
How to Evaluate a Language Model?

Given two language models, how can we compare them?

■ Use them in a recognizer and find the one that leads to the lower recognition error rate
  – The best way to evaluate a language model
  – Expensive!
■ Use information theory (entropy & perplexity) to get an estimate of how good a language model might be
  – Perplexity: the geometric mean of the number of words that can follow a history (or word) after the language model has been applied
Entropy

■ The information derivable from outcome $x_i$ depends on its probability $P(x_i)$, and the amount of information is defined as

  $I(x_i) = \log \frac{1}{P(x_i)}$

■ The entropy H(X) of the random variable X is defined as the average amount of information:

  $H(X) = E[I(X)] = \sum_{S} P(x_i)\,I(x_i) = \sum_{S} P(x_i) \log \frac{1}{P(x_i)} = -E[\log P(X)]$

  – The entropy H(X) attains its maximum value when the random variable X has a uniform distribution, i.e., $P(x_i) = \frac{1}{N}\ \forall i$
  – The entropy H(X) is nonnegative and becomes zero only if the probability function is deterministic, i.e., $P(x_i) = 1$ for some $x_i$
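As an added quick illustration (not from the slides), the sketch below computes H(X) for a skewed, a uniform, and a deterministic distribution over four outcomes, showing that the uniform case reaches the maximum log2 N bits and the deterministic case gives zero.

```python
import math

def entropy(probs):
    """H(X) = -sum_i P(x_i) * log2 P(x_i), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

skewed = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(skewed))                # ~1.357 bits
print(entropy(uniform))               # 2.0 bits = log2(4), the maximum for N = 4
print(entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 bits: deterministic outcome
```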
Cross-Entropy

■ The entropy of a language is

  $H(\text{language}) = -\sum_i P(E_i) \log_2 P(E_i)$, where $E_i$ is a language event

■ It can be proved that

  $-\sum_i P(E_i) \log_2 P(E_i) \le -\sum_i P(E_i) \log_2 P(E_i|\text{Model})$

  – The right-hand side is the cross-entropy of a model with respect to the correct model
  – Better models will have lower cross-entropy

■ The entropy of a language with a vocabulary of size |V|, on a per-word basis, is

  $H = -\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i)$

  – If every word is equally likely,

    $-\sum_{i=1}^{|V|} \frac{1}{|V|} \log_2 \frac{1}{|V|} = \log_2 |V| \ge H$ (the true entropy)
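A small numeric check of the inequality above (author's addition, not slide material): the cross-entropy of a true distribution with respect to any other model distribution is never smaller than the entropy of the true distribution itself. The two example distributions are made up.

```python
import math

def cross_entropy(p, q):
    """-sum_i p_i * log2 q_i: cross-entropy of model q w.r.t. true distribution p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]      # "correct" distribution
q = [0.4, 0.4, 0.2]      # some model distribution

print(cross_entropy(p, p))   # entropy of p, ~1.485 bits
print(cross_entropy(p, q))   # cross-entropy, ~1.522 bits >= entropy
```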
Logprob

■ For a language with a vocabulary of size |V|, the cross-entropy of a model with respect to the correct model, on a per-word basis, is

  $-\sum_{i=1}^{|V|} P(w_i) \log_2 P(w_i|\text{Model})$

■ Given a text corpus $W = w_1, w_2, \ldots, w_N$, the cross-entropy of a model can be estimated by the logprob (LP), defined as

  $LP(\text{Model}) = -\frac{1}{N} \log_2 P(W|\text{Model}) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i|\text{Model})$

  $LP(\text{Model}) \ge H$ (the true entropy)

  – The goal is to find a model whose logprob is as close as possible to the true entropy
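As a concrete added illustration (not from the slides), the sketch below computes the logprob and the corresponding perplexity, 2^LP (the geometric-mean interpretation from the earlier evaluation slide), from the per-word probabilities a model assigns to a test text; the probability values are invented for the example.

```python
import math

def logprob(word_probs):
    """LP = -(1/N) * sum_i log2 P(w_i | Model), estimated on a test text."""
    n = len(word_probs)
    return -sum(math.log2(p) for p in word_probs) / n

# Per-word probabilities P(w_i | Model) that some model assigns to a
# 5-word test text (illustrative values only).
probs = [0.1, 0.05, 0.2, 0.01, 0.1]

lp = logprob(probs)
print(lp)          # ~3.986 bits per word
print(2 ** lp)     # perplexity ~15.8: the model is about as uncertain as a
                   # uniform choice among ~16 words at each position
```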