Language Model Adaptation



  1. Language Model Adaptation
     Hsin-min Wang
     References:
     • X. Huang et al., Spoken Language Processing (2001), Chapter 11.
     • M. Bacchiani and B. Roark, "Unsupervised language model adaptation," ICASSP 2003.
     • Marcello Federico, "Efficient language model adaptation through MDI estimation," Eurospeech 1999.
     • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Using information retrieval methods for language model adaptation," Eurospeech 2001.
     • Langzhou Chen, Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, "Unsupervised language model adaptation for broadcast news," ICASSP 2003.

  2. Definition of the Speech Recognition Problem
     For a given acoustic observation X = X_1 X_2 ... X_n, the goal of speech recognition is to find the corresponding word sequence W = w_1 w_2 ... w_m that has the maximum posterior probability P(W|X):

     \hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W)\, P(X \mid W)}{P(X)} = \arg\max_W P(W)\, P(X \mid W)

     where W = w_1 w_2 ... w_i ... w_m and each w_i \in V = \{v_1, v_2, ..., v_N\}. P(W) is the language modeling term and P(X|W) is the acoustic modeling term.
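
To make the decoding rule concrete, here is a minimal sketch with two hypothetical word sequences and made-up acoustic and language model log-probabilities (none of these numbers or names come from the slides); the best hypothesis maximizes the sum of the two log scores:

# Hypothetical hypotheses with assumed log P(X|W) (acoustic) and log P(W) (language) scores.
hypotheses = {
    "recognize speech": {"log_p_x_given_w": -12.0, "log_p_w": -4.5},
    "wreck a nice beach": {"log_p_x_given_w": -11.5, "log_p_w": -7.0},
}

# W_hat = argmax_W P(W) P(X|W), computed as a sum of log-probabilities.
best = max(hypotheses,
           key=lambda w: hypotheses[w]["log_p_w"] + hypotheses[w]["log_p_x_given_w"])
print(best)  # -> "recognize speech"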

  3. Language Model (LM) Adaptation
     Why Language Model Adaptation?
     • Dynamic adjustment of the language model parameters, such as n-gram probabilities, vocabulary size, and the choice of words in the vocabulary, is important since the topic changes from time to time.
     What is Language Model Adaptation?
     • Language model adaptation attempts to obtain language models for a new domain with a small amount of adaptation data (cf. acoustic model adaptation).
     How to Adapt N-gram Probabilities?
     • The most widely used approaches are model interpolation and count mixing.

  4. MAP LM Adaptation

  5. MAP
     The model parameters \theta are assumed to be a random vector in the space \Theta. Given an observation sample X, the MAP estimate is obtained as the mode of the posterior distribution of \theta, denoted g(\cdot \mid X):

     \theta_{MAP} = \arg\max_{\theta} g(\theta \mid X) = \arg\max_{\theta} f(X \mid \theta)\, g(\theta)

  6. MAP Estimation for N-gram LM
     Let w_k be the probability of observing the k-th discrete event e_k among a set of K possible outcomes \{e_k \mid k = 1, ..., K\}, with \sum_{k=1}^{K} w_k = 1. Then the probability of observing a sequence of i.i.d. discrete observations X = (x_1, ..., x_T) is

     p(x_1, ..., x_T \mid w_1, ..., w_K) = \prod_{k=1}^{K} w_k^{n_k},

     where n_k = \sum_{t=1}^{T} 1(x_t = e_k) is the number of occurrences of the k-th event in the sequence, with 1(\cdot) as the indicator function.

  7. MAP Estimation for N-gram LM (cont.)
     The prior distribution of (w_1, ..., w_K) can be assumed to be a Dirichlet density

     p(w_1, ..., w_K \mid \nu_1, ..., \nu_K) \propto \prod_{k=1}^{K} w_k^{\nu_k - 1},

     where \{\nu_k > 0 \mid k = 1, ..., K\} is the set of hyperparameters. So

     p(w_1, ..., w_K \mid x_1, ..., x_T) \propto \prod_{k=1}^{K} w_k^{n_k + \nu_k - 1}

     \Rightarrow \log p(w_1, ..., w_K \mid x_1, ..., x_T) = \Psi + \sum_{j=1}^{K} (n_j + \nu_j - 1) \log w_j + l \left( \sum_{j=1}^{K} w_j - 1 \right),

     where \Psi collects terms independent of the w_j and l is a Lagrange multiplier enforcing the sum-to-one constraint. Differentiating with respect to w_k:

     \frac{n_k + \nu_k - 1}{w_k} + l = 0 \;\Rightarrow\; w_k = -\frac{n_k + \nu_k - 1}{l}

     Since \sum_{k=1}^{K} w_k = 1, we have l = -\sum_{j=1}^{K} (n_j + \nu_j - 1), and therefore

     \therefore\; w_k = \frac{n_k + \nu_k - 1}{\sum_{j=1}^{K} n_j + \sum_{j=1}^{K} \nu_j - K}
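
A minimal sketch of this closed-form MAP estimate for a multinomial with a Dirichlet prior; the function name and the toy counts and hyperparameters below are illustrative assumptions, not values from the slides:

def map_multinomial(counts, hyperparams):
    """MAP estimate w_k = (n_k + nu_k - 1) / (sum_j n_j + sum_j nu_j - K)
    for a multinomial with a Dirichlet(nu_1, ..., nu_K) prior."""
    K = len(counts)
    denom = sum(counts) + sum(hyperparams) - K
    return [(n + nu - 1.0) / denom for n, nu in zip(counts, hyperparams)]

# Toy example: 3 events, observed counts [5, 3, 2], prior slightly favoring event 1.
print(map_multinomial([5, 3, 2], [2.0, 1.0, 1.0]))  # ~[0.545, 0.273, 0.182]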

  8. MAP N-gram LM Adaptation
     Let the count of a word w_i in n-gram history h in the adaptation sample be denoted c(h w_i), and let c(h) = \sum_i c(h w_i). Let the corresponding counts from the general-domain sample be denoted \tilde{c}(h w_i) and \tilde{c}(h). Let \tilde{P}(w_i \mid h) and P(w_i \mid h) denote the probability of w_i in history h estimated from the general-domain sample and the adaptation sample, respectively.

     If we choose \nu_i = \frac{\alpha}{\beta}\, \tilde{c}(h)\, \tilde{P}(w_i \mid h) + 1, then

     \hat{P}(w_i \mid h) = \frac{\frac{\alpha}{\beta}\, \tilde{c}(h)\, \tilde{P}(w_i \mid h) + c(h w_i)}{\frac{\alpha}{\beta}\, \tilde{c}(h) \sum_{j=1}^{K} \tilde{P}(w_j \mid h) + c(h)} = \frac{\alpha\, \tilde{c}(h w_i) + \beta\, c(h w_i)}{\alpha\, \tilde{c}(h) + \beta\, c(h)}

     This is the count mixing approach.
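
A small sketch of the resulting count-mixing estimate, assuming counts are stored in plain dictionaries keyed by (history, word) pairs; the interface and the mixing weights are illustrative assumptions:

def count_mixing_prob(word, history, adapt_counts, general_counts, alpha=0.5, beta=0.5):
    """P_hat(w|h) = (alpha * c_general(h,w) + beta * c_adapt(h,w))
                    / (alpha * c_general(h) + beta * c_adapt(h))."""
    c_a = adapt_counts.get((history, word), 0)
    c_g = general_counts.get((history, word), 0)
    # History totals c(h) and c~(h), obtained by summing over all words following h.
    h_a = sum(c for (h, _), c in adapt_counts.items() if h == history)
    h_g = sum(c for (h, _), c in general_counts.items() if h == history)
    return (alpha * c_g + beta * c_a) / (alpha * h_g + beta * h_a)

Increasing beta relative to alpha biases the estimate toward the adaptation counts; with alpha = beta the two corpora contribute according to their raw counts.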

  9. MAP N-gram LM Adaptation (cont.)
     If we choose \nu_i = \frac{\lambda}{1 - \lambda}\, c(h)\, \tilde{P}(w_i \mid h) + 1, then

     \hat{P}(w_i \mid h) = \frac{\frac{\lambda}{1-\lambda}\, c(h)\, \tilde{P}(w_i \mid h) + c(h w_i)}{\frac{\lambda}{1-\lambda}\, c(h) \sum_{j=1}^{K} \tilde{P}(w_j \mid h) + c(h)} = \lambda\, \tilde{P}(w_i \mid h) + (1 - \lambda)\, P(w_i \mid h)

     The MAP estimate reduces to the model interpolation approach.
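
The interpolation form is even simpler to apply once both component models expose conditional probabilities; a sketch with an assumed callable interface (names and the value of lambda are illustrative):

def interpolated_prob(word, history, p_adapt, p_general, lam=0.3):
    """P_hat(w|h) = lambda * P_general(w|h) + (1 - lambda) * P_adapt(w|h),
    where p_adapt and p_general are callables returning conditional probabilities."""
    return lam * p_general(word, history) + (1.0 - lam) * p_adapt(word, history)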

  10. MDI LM Adaptation

  11. MDI LM Adaptation
      Minimum Discrimination Information (MDI)
      • A new LM is estimated so that it is "as close as possible" to a general background LM.
      • Given a background model P_B(h, w) and an adaptation corpus A, we want to find a model P_A(h, w) that satisfies the following set of linear constraints

        \sum_{hw \in V^n} P_A(h, w)\, \delta_i(hw) = \hat{P}_A(S_i), \quad i = 1, ..., M,

        where the \delta_i(\cdot) are indicator functions of features S_i \subset V^n and the \hat{P}_A(S_i) are empirical estimates of the features on A,

        and that minimizes the Kullback-Leibler distance between P_A(h, w) and P_B(h, w):

        P_A(h, w) = \arg\min_{Q(\cdot)} \sum_{hw \in V^n} Q(h, w) \log \frac{Q(h, w)}{P_B(h, w)}

  12. MDI LM Adaptation (cont.)
      The MDI model can be trained with the GIS (Generalized Iterative Scaling) algorithm, which performs the following iterations:

      P_A^{(0)}(h, w) = P_B(h, w)

      P_A^{(r+1)}(h, w) = P_A^{(r)}(h, w) \prod_{i=1}^{M} \left( \frac{\hat{P}_A(S_i)}{P_A^{(r)}(S_i)} \right)^{\delta_i(hw)}

      where

      P_A^{(r)}(S_i) = \sum_{hw \in V^n} P_A^{(r)}(h, w)\, \delta_i(hw), \quad i = 1, ..., M

      (It is assumed that each hw \in V^n satisfies exactly k features.)
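
A schematic sketch of these multiplicative updates for a small discrete event space, written in the spirit of iterative proportional scaling rather than as a faithful GIS implementation (the explicit renormalization and all names are assumptions for illustration):

def iterative_scaling(p_background, features, targets, iterations=50):
    """Rescale p_background so that the model mass on each feature set
    approaches its empirical target on the adaptation data.

    p_background: dict event -> probability (the initial model P_A^(0) = P_B)
    features:     list of sets; features[i] holds the events with delta_i(event) = 1
    targets:      list of empirical feature probabilities P_hat_A(S_i)
    """
    p = dict(p_background)
    for _ in range(iterations):
        for event_set, target in zip(features, targets):
            model_mass = sum(p[e] for e in event_set)   # P_A^(r)(S_i)
            ratio = target / model_mass
            for e in event_set:                         # exponent delta_i(e) = 1
                p[e] *= ratio
        total = sum(p.values())                         # keep a proper distribution
        p = {e: v / total for e, v in p.items()}
    return p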

  13. MDI LM Adaptation (cont.)
      Given that the adaptation sample is typically small, we assume that only unigram features can be reliably estimated. The constraints

      \sum_{hw \in V^n} P_A(h, w)\, \delta_i(hw) = \hat{P}_A(S_i), \quad i = 1, ..., M

      become

      \sum_{hw \in V^n} P_A(h, w)\, \delta_{\hat{w}}(hw) = \hat{P}_A(\hat{w}), \quad \forall \hat{w} \in V,

      where \delta_{\hat{w}}(hw) = 1 if w = \hat{w} and 0 otherwise. Starting from P_A^{(0)}(h, w) = P_B(h, w) and iterating

      P_A^{(r+1)}(h, w) = P_A^{(r)}(h, w) \prod_{i=1}^{M} \left( \frac{\hat{P}_A(S_i)}{P_A^{(r)}(S_i)} \right)^{\delta_i(hw)}

      yields

      P_A(h, w) = P_B(h, w)\, \alpha(w), \quad \text{where } \alpha(w) = \frac{\hat{P}_A(w)}{P_B(w)},

      and hence

      P_A(w \mid h) = \frac{P_B(w \mid h)\, P_B(h)\, \alpha(w)}{\sum_{\hat{w} \in V} P_B(\hat{w} \mid h)\, P_B(h)\, \alpha(\hat{w})} = \frac{P_B(w \mid h)\, \alpha(w)}{\sum_{\hat{w} \in V} P_B(\hat{w} \mid h)\, \alpha(\hat{w})}

  14. MDI LM Adaptation (cont.)

      \alpha(w) = \left( \frac{\hat{P}_A(w)}{P_B(w)} \right)^{\gamma}

      where \gamma ranges from 0 to 1.
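
Putting the last two slides together, unigram-constrained MDI adaptation amounts to rescaling the background model by alpha(w) and renormalizing within each history; a minimal sketch under that reading (the data layout and names are assumptions):

def mdi_unigram_adapt(p_background, p_adapt_unigram, p_background_unigram, gamma=0.5):
    """Rescale a background conditional LM P_B(w|h) by
    alpha(w) = (P_hat_A(w) / P_B(w)) ** gamma and renormalize per history.

    p_background: dict (history, word) -> P_B(w|h)
    p_adapt_unigram, p_background_unigram: dict word -> unigram probability
    """
    alpha = {w: (p_adapt_unigram[w] / p_background_unigram[w]) ** gamma
             for w in p_background_unigram}
    scaled = {(h, w): p * alpha.get(w, 1.0) for (h, w), p in p_background.items()}
    # Renormalize so that sum_w P_A(w|h) = 1 for each history h.
    norm = {}
    for (h, _), p in scaled.items():
        norm[h] = norm.get(h, 0.0) + p
    return {(h, w): p / norm[h] for (h, w), p in scaled.items()}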

  15. Unsupervised LM Adaptation for Broadcast News Using IR Methods

  16. Introduction
      • Unsupervised language model adaptation is an outstanding challenge for speech recognition, especially for complex tasks such as broadcast news transcription, where the content of any given show is related to multiple topics.
      • Because of the dynamic nature of the task, it is not possible to select adaptation data in advance.
      • The problem can be seen as tuning a general LM toward specific topics without domain-specific training data.
      • Information retrieval techniques have been proposed to address this problem: the speech recognition hypothesis is used as a query to extract articles or text segments on related topics.

  17. Adaptation Method Overview
      The adaptation algorithm can be divided into two parts:
      • Extraction of the adaptation corpus
        – Initial hypothesis segmentation
        – Keyword selection
        – Retrieving relevant articles
      • LM adaptation
        – MAP adaptation
        – MDI adaptation
        – Dynamic mixture model

  18. Keyword Selection
      • The content words carrying the most relevant topic information are selected as query terms.
      • The relevance of word w_i to story s_j is given by the following score:

        R(w_i, s_j) = \sum_{v \in k_j} \log \frac{p(w_i, v)}{p(w_i)\, p(v)}

        where p(w_i, v) is the probability that w_i and v appear in the same story, and k_j is the set of all words in story s_j.
      • All words with a relevance score higher than an empirically determined threshold are selected.
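
A sketch of this relevance score, assuming the joint and marginal word probabilities have already been estimated from a story-segmented corpus; all function and variable names are illustrative:

import math

def relevance(word, story_words, p_joint, p_marginal):
    """R(w_i, s_j) = sum over v in k_j of log( p(w_i, v) / (p(w_i) p(v)) ),
    i.e. a sum of pointwise mutual information terms over the story's words.

    story_words: set of words appearing in story s_j
    p_joint:     dict (w, v) -> probability that w and v occur in the same story
    p_marginal:  dict w -> marginal probability of w
    """
    return sum(math.log(p_joint[(word, v)] / (p_marginal[word] * p_marginal[v]))
               for v in story_words if (word, v) in p_joint)

def select_keywords(candidates, story_words, p_joint, p_marginal, threshold):
    # Keep the content words whose relevance to the story exceeds the empirical threshold.
    return [w for w in candidates
            if relevance(w, story_words, p_joint, p_marginal) > threshold]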
