Minimum Bayes-Risk Methods in Automatic Speech Recognition
Vaibhava Goel – IBM
William Byrne – Johns Hopkins University
Pattern Recognition in Speech and Language Processing – Chapter 2
Outline • Minimum Bayes-Risk Classification Framework – Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR • Practical MBR Procedures for ASR – Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Outline • Segmental MBR Procedures – Segmental Voting – ROVER – e-ROVER • Experimental Results – Parameter Tuning within the MBR Classification Rule – Utterance Level MBR Word and Keyword Recognition – ROVER and e-ROVER for Multilingual ASR • Summary
• Minimum Bayes-Risk Classification Framework – Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR • Practical MBR Procedures for ASR – Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Minimum Bayes-Risk Classification Framework
• Definitions:
  – $A$ : acoustic observation sequence
  – $W$ : word string
  – $\mathcal{W}_h^A$ : the hypothesis space of the observation $A$
  – $\delta : A \rightarrow \mathcal{W}_h^A$ : ASR classifier
  – $l(W, W')$ : loss function, where $W'$ is a mistranscription of $W$
  – $P(W, A)$ : true distribution of speech and language
Minimum Bayes-Risk Classification Framework
How to measure classifier performance? Using the Bayes risk:
$$E_{P(W,A)}\big[\, l(W, \delta(A)) \,\big] = \sum_A \sum_W l(W, \delta(A))\, P(W, A) \qquad (2.1)$$
$\delta$ is chosen to minimize the Bayes risk:
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_W l(W, W')\, P(W \mid A) \qquad (2.2)$$
(Ideally we would like $\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} l(W_c, W')$, where $W_c$ is the correct transcription of $A$, but $W_c$ is not known.)
Let $\mathcal{W}_e^A$ be the subset of word strings with nonzero posterior: $\mathcal{W}_e^A = \{\, W \mid P(W \mid A) > 0 \,\}$.
Equation 2.2 can then be rewritten as
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A) \qquad (2.4)$$
Letting $S(W') = \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A)$, we have $\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} S(W')$.
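A minimal sketch of the classification rule in Equation 2.4, assuming the hypothesis space, evidence distribution, and loss function are given as small in-memory Python structures (all names and the toy values below are illustrative, not from the chapter):

```python
# Sketch of the MBR decision rule in Equation 2.4.
def mbr_decode(hypothesis_space, evidence, loss):
    """Return argmin_{W'} sum_W loss(W, W') * P(W | A).

    hypothesis_space : iterable of candidate word strings W'
    evidence         : dict mapping word string W -> posterior P(W | A)
    loss             : function loss(W, W_prime) -> non-negative cost
    """
    def expected_loss(w_prime):
        return sum(p * loss(w, w_prime) for w, p in evidence.items())
    return min(hypothesis_space, key=expected_loss)

# Toy usage: three competing transcriptions with illustrative posteriors.
evidence = {"a b c": 0.5, "a b d": 0.3, "a c c": 0.2}
zero_one = lambda w, w_prime: 0.0 if w == w_prime else 1.0
print(mbr_decode(evidence.keys(), evidence, zero_one))  # -> "a b c"
```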
Minimum Bayes-Risk Classification Framework
Since the observations in $\mathcal{W}_e^A$ serve as the evidence used by the MBR classifier, $\mathcal{W}_e^A$ is referred to as the evidence space, and $P(W \mid A)$ is referred to as the evidence distribution.
How to define the loss function? Two ways:
  – loss function $l(X, Y) = l_{LRT}$ ⇒ classifier $\delta(A)$ = likelihood ratio based hypothesis testing (LRT)
  – loss function $l(X, Y) = l_{0/1}$ ⇒ classifier $\delta(A)$ = maximum a-posteriori (MAP) classification
Likelihood Ratio Based Hypothesis Testing
If $\mathcal{W}_e = \{H_n, H_a\}$ and $\mathcal{W}_h = \{H_n, H_a\}$, define
$$l_{LRT}(X, Y) = \begin{cases} 0 & \text{if } X = H_n,\; Y = H_n \\ t_1 & \text{if } X = H_a,\; Y = H_n \\ t_2 & \text{if } X = H_n,\; Y = H_a \\ 0 & \text{if } X = H_a,\; Y = H_a \end{cases}$$
Then
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A)
= \arg\min_{W' \in \{H_n, H_a\}} \big[\, l(H_n, W')\, P(H_n \mid A) + l(H_a, W')\, P(H_a \mid A) \,\big]$$
$$= \arg\min \big\{\, l(H_n, H_n) P(H_n \mid A) + l(H_a, H_n) P(H_a \mid A),\;\; l(H_n, H_a) P(H_n \mid A) + l(H_a, H_a) P(H_a \mid A) \,\big\}$$
(the first entry is the cost of choosing $H_n$, the second that of choosing $H_a$)
$$= \arg\min \big\{\, t_1 P(H_a \mid A),\; t_2 P(H_n \mid A) \,\big\}
= \arg\min \big\{\, t_1 P(A \mid H_a) P(H_a),\; t_2 P(A \mid H_n) P(H_n) \,\big\}$$
$$= \begin{cases} H_n & \text{if } t_2\, P(A \mid H_n)\, P(H_n) > t_1\, P(A \mid H_a)\, P(H_a) \\ H_a & \text{otherwise} \end{cases}
\;=\; \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > \dfrac{t_1\, P(H_a)}{t_2\, P(H_n)} \\ H_a & \text{otherwise} \end{cases}$$
Likelihood Ratio Based Hypothesis Testing
$$\delta_{LRT}(A) = \begin{cases} H_n & \text{if } \dfrac{P(A \mid H_n)}{P(A \mid H_a)} > t \\ H_a & \text{otherwise} \end{cases} \qquad (2.6)$$
The threshold $t$ is set in an application-specific manner; it determines the balance between false rejection and false acceptance.
$H_n$ : null class
$H_a$ : alternative class
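A minimal sketch of the decision rule in Equation 2.6, assuming the two class-conditional likelihoods and the threshold $t$ are supplied by the caller; the numeric values are illustrative only:

```python
# Sketch of the likelihood ratio test of Equation 2.6.
def lrt_classify(lik_null, lik_alt, t):
    """Return 'H_n' if P(A|H_n)/P(A|H_a) exceeds the threshold t, else 'H_a'."""
    return "H_n" if lik_null / lik_alt > t else "H_a"

# Raising t makes the test less willing to accept the null class,
# trading false acceptances for false rejections.
print(lrt_classify(lik_null=0.02, lik_alt=0.005, t=3.0))  # -> 'H_n'
```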
Maximum A-Posteriori Probability Classification
Define
$$l_{0/1}(W, W') = \begin{cases} 1 & \text{if } W' \neq W \\ 0 & \text{otherwise} \end{cases}$$
Then
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A)
= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \neq W'} P(W \mid A)$$
$$= \arg\min_{W' \in \mathcal{W}_h^A} \big( 1 - P(W' \mid A) \big)
= \arg\max_{W' \in \mathcal{W}_h^A} P(W' \mid A)$$
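A short sketch illustrating that under the 0/1 loss the MBR rule reduces to MAP selection; the posterior values are toy numbers chosen only for illustration:

```python
# Under the 0/1 loss, the expected loss of W' is 1 - P(W'|A), so minimizing
# it is the same as maximizing the posterior.
posteriors = {"a b c": 0.5, "a b d": 0.3, "a c c": 0.2}

def expected_zero_one_loss(w_prime):
    return sum(p for w, p in posteriors.items() if w != w_prime)  # = 1 - P(w'|A)

mbr_choice = min(posteriors, key=expected_zero_one_loss)
map_choice = max(posteriors, key=posteriors.get)
assert mbr_choice == map_choice  # both pick "a b c"
```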
Previous Studies of Application Sensitive ASR
• The use of risk minimization in automatic speech recognition has not been extensive.
• Early investigations into minimum Bayes-risk training criteria for speech recognizers were performed by Nadas.
• However, our focus in this chapter is on minimum-risk classification rather than estimation.
Previous Studies of Application Sensitive ASR
• Stolcke et al. proposed an approximation to a minimum Bayes-risk classifier for generating minimum word error rate hypotheses from recognition N-best lists.
• Other researchers have proposed posterior probability and confidence based hypothesis selection strategies for word error rate reduction.
• Minimum Bayes-Risk Classification Framework – Likelihood Ratio Based Hypothesis Testing – Maximum A-Posteriori Probability Classification – Previous Studies of Application Sensitive ASR • Practical MBR Procedures for ASR – Summation over Hidden State Sequences – MBR Recognition with N-best Lists – MBR Recognition with Lattices
Practical MBR Procedures for ASR
• Why is the MBR classifier difficult to implement?
  – The evidence and hypothesis spaces in Equation 2.4 tend to be quite large.
  – The problem of large spaces is worsened by the fact that an ASR recognizer often has to process many consecutive utterances.
  – While there are efficient dynamic programming techniques for the MAP recognizer, such methods are not yet available for an MBR recognizer under an arbitrary loss function.
Practical MBR Procedures for ASR
• How to implement it?
  – Two implementations:
    • N-best list rescoring procedure (a minimal rescoring sketch follows below)
    • Search over a recognition lattice
  – Segment long acoustic data into sentence or phrase length utterances.
  – Restrict the evidence and hypothesis spaces to manageable sets of word strings.
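A hedged sketch of N-best list MBR rescoring under a word error (Levenshtein) loss, assuming each N-best entry carries a log score that can be normalized into an evidence distribution over the list; the hypotheses and scores below are invented for illustration:

```python
# Sketch of N-best MBR rescoring: the N-best list serves both as the
# hypothesis space and, after normalization, as the evidence distribution.
import math

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance by dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def mbr_rescore(nbest):
    """nbest: list of (word_string, log_score). Returns the MBR hypothesis."""
    # Convert log scores to a normalized evidence distribution over the list.
    max_log = max(s for _, s in nbest)
    weights = [math.exp(s - max_log) for _, s in nbest]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    def expected_wer_loss(w_prime):
        return sum(p * edit_distance(w.split(), w_prime.split())
                   for (w, _), p in zip(nbest, posteriors))

    return min((w for w, _ in nbest), key=expected_wer_loss)

nbest = [("the cat sat", -10.2), ("the cat sad", -10.5), ("a cat sat", -11.0)]
print(mbr_rescore(nbest))  # -> "the cat sat"
```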
Summation over Hidden State Sequences
• A computational issue associated with the use of HMMs in the evidence distribution will be addressed.
• How to obtain the true distribution?
$$P(W \mid A) = \frac{P(W)\, P(A \mid W)}{P(A)} \qquad (2.12)$$
Here $P(W)$ is approximated using a language model, usually a Markov chain based N-gram model. $P(A \mid W)$ is usually approximated using an HMM called the acoustic model.
Let $S$ be the set of all the states in the acoustic HMM $P(A \mid W)$, and let $\chi$ denote the set of all possible state sequences that could generate $A$. The probability $P(A \mid W)$ is computed as
$$P(A \mid W) = \sum_{X \in \chi} P(A, X \mid W) = \sum_{X \in \chi} P(X \mid W)\, P(A \mid X, W) \qquad (2.13)$$
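A minimal sketch of the summation in Equation 2.13 computed with the forward algorithm for a small discrete-output HMM, so that $P(A \mid W)$ is obtained without enumerating every state sequence; the HMM parameters here are toy values, not from a real acoustic model:

```python
# Forward algorithm: P(A|W) = sum_X P(X|W) P(A|X,W) computed recursively.
import numpy as np

def forward_likelihood(init, trans, emit, observations):
    """P(A|W) for a discrete-output HMM.

    init  : (S,)   initial state probabilities
    trans : (S, S) transition probabilities trans[i, j] = P(j | i)
    emit  : (S, V) emission probabilities  emit[s, o] = P(o | s)
    observations : sequence of symbol indices (standing in for acoustic frames A)
    """
    alpha = init * emit[:, observations[0]]       # alpha_1(s)
    for o in observations[1:]:
        alpha = (alpha @ trans) * emit[:, o]      # sum over previous states
    return alpha.sum()                            # sum over final states

init = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3], [0.0, 1.0]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood(init, trans, emit, [0, 0, 1]))
```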
Summation over Hidden State Sequences
The summation over all possible hidden state sequences is too expensive. A computationally feasible alternative is to modify Equation 2.4 as
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W \mid A)$$
$$= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, \frac{\sum_{X \in \chi} P(W)\, P(X \mid W)\, P(A \mid X, W)}{P(A)}$$
$$= \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} \sum_{X \in \chi} l(W, W')\, P(W, X, A)$$
$$\approx \arg\min_{(W', X') \in \mathcal{W}_h^A \times \chi^A} \sum_{(W, X) \in \mathcal{W}_e^A \times \chi^A} l\big((W, X), (W', X')\big)\, P(W, X, A)$$
where $\chi^A$ is a sparse sampling of the most likely state sequences in $\chi$.
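One possible realization of the sparse sampling $\chi^A$, sketched here as keeping only the single most likely (Viterbi) state sequence per word string; a real system might instead use the paths encoded in a lattice or N-best list. The parameters are toy values:

```python
# Viterbi decoding: the single most likely state sequence and its joint
# probability P(A, X|W), usable as a one-path sparse sampling of chi.
import numpy as np

def viterbi_path(init, trans, emit, observations):
    """Return (best state sequence, its joint probability P(A, X|W))."""
    S, T = len(init), len(observations)
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = init * emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans        # scores[i, j]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, observations[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                     # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

init = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3], [0.0, 1.0]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi_path(init, trans, emit, [0, 0, 1]))
```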
Summation over Hidden State Sequences
For convenience we write $W$ rather than $(W, X)$, $\mathcal{W}_h^A$ rather than $\mathcal{W}_h^A \times \chi^A$, and $\mathcal{W}_e^A$ rather than $\mathcal{W}_e^A \times \chi^A$, so that
$$\delta(A) = \arg\min_{W' \in \mathcal{W}_h^A} \sum_{W \in \mathcal{W}_e^A} l(W, W')\, P(W, A) \qquad (2.15)$$