Discriminative Keyword Spotting


  1. Discriminative Keyword Spotting. Joseph Keshet, The Hebrew University; David Grangier, IDIAP Research Institute; Samy Bengio, Google Inc.

  2. Outline: • Problem Definition • Keyword Spotting with HMMs • Discriminative Keyword Spotting (derivation, analysis, feature functions) • Experimental Results

  3.–4. Problem Definition. Goal: find a keyword in a speech signal. Example: the utterance "he's bought it", with phone sequence /h iy z bcl b ao t ix tcl/.

  5. Problem Definition. Notation: the keyword k (e.g. "bought") is given as a phoneme sequence p̄ (e.g. /bcl b ao t/); the alignment sequence s̄ = (s_1, s_2, s_3, s_4, e_4) gives the start time of each keyword phoneme and the end time of the last; the speech signal is a sequence of acoustic feature vectors x̄ = (x_1, x_2, x_3, ..., x_T).

  6. Problem Definition. The keyword spotter receives a speech signal x̄ and a keyword p̄ given as a phoneme sequence (e.g. p̄ = /b ao t/); it outputs a detection decision (yes/no) based on the confidence f(x̄, p̄), together with the predicted alignment s̄′ of the keyword within the signal.

  7.–11. Fat is Good. The performance of a keyword spotting system is measured by a Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the detection threshold is varied:
     true positive rate = (detected utterances with the keyword) / (total utterances with the keyword)
     false positive rate = (detected utterances without the keyword) / (total utterances without the keyword)
  The curve is summarized by the area under the curve, A; a perfect spotter achieves A = 1, so a "fatter" curve (larger A) is better.
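
As a rough illustration of how these rates and the area A might be computed from a set of spotter confidences, here is a minimal Python sketch (the score arrays and the threshold sweep are illustrative assumptions, not part of the slides):

```python
import numpy as np

def roc_curve(pos_scores, neg_scores, num_thresholds=200):
    """Sweep a detection threshold over spotter confidences f(x, p).

    pos_scores: confidences for utterances that contain the keyword.
    neg_scores: confidences for utterances that do not contain it.
    Returns (false positive rates, true positive rates).
    """
    lo = min(pos_scores.min(), neg_scores.min())
    hi = max(pos_scores.max(), neg_scores.max())
    thresholds = np.linspace(lo, hi, num_thresholds)
    tpr = np.array([(pos_scores >= t).mean() for t in thresholds])  # detected / total with keyword
    fpr = np.array([(neg_scores >= t).mean() for t in thresholds])  # detected / total without keyword
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Area A under the ROC curve via the trapezoidal rule; A = 1 is perfect."""
    order = np.argsort(fpr)
    return np.trapz(tpr[order], fpr[order])
```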

  12. HMM-based Keyword Spotting

  13.–15. HMM-based Keyword Spotting: Whole Word Modeling. A single HMM models the whole keyword (e.g. "bought") and is combined with a garbage model; the hidden state sequence q̄ is aligned with the acoustic feature vectors x̄, extracted every 10 ms. [Rahim et al., 1997; Rohlicek et al., 1989]

  16. HMM-based Keyword Spotting: Phoneme-Based. The keyword model (e.g. "bought", /b ao t/) is flanked by garbage models that absorb the rest of the utterance (e.g. /h iy/ and /ih t/ in "he's bought it"); the phoneme sequence p̄ and state sequence q̄ are aligned with the acoustic feature vectors x̄, extracted every 10 ms. [Bourlard et al., 1994; Manos & Zue, 1997; Rohlicek et al., 1993]
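
The slides describe the HMM architecture but not the decision rule; as a hedged sketch, HMM-based spotters commonly threshold a log-likelihood ratio between the keyword-plus-garbage model and the garbage model alone (the log_likelihood interface below is hypothetical, not from the slides):

```python
def hmm_keyword_score(x, keyword_model, garbage_model):
    """Sketch of a typical HMM-based spotting score: how much better the
    keyword-plus-garbage model explains the utterance than garbage alone.

    Both models are assumed to expose a log_likelihood(x) method computed by
    Viterbi or forward decoding (hypothetical interface).
    """
    score = keyword_model.log_likelihood(x) - garbage_model.log_likelihood(x)
    return score  # detection: score >= threshold; sweeping the threshold traces the ROC curve
```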

  17. HMM-based Keyword Spotting: Large Vocabulary Based. • Imposes linguistic constraints on the garbage model • But does a human listener need a large vocabulary in order to recognize one word? (Cardillo et al., 2002; Rose & Paul, 1990; Szoke et al., 2005; Weintraub, 1995)

  18. HMM Approaches to Keyword Spotting. • They do not specifically address the goal of maximizing the area under the ROC curve for the keyword spotting task.

  19. Discriminative Approach

  20.–25. Learning Paradigm. Discriminative learning from examples. The training set is S = {(p̄_1, x̄_1^+, x̄_1^-, s̄_1), ..., (p̄_m, x̄_m^+, x̄_m^-, s̄_m)}, where p̄_i is the keyword (a phoneme sequence), x̄_i^+ is an utterance in which the keyword is uttered, x̄_i^- is an utterance in which the keyword is not uttered, and s̄_i is the alignment of the keyword within x̄_i^+.
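
A minimal sketch of how one quadruple from S might be represented in code (the class and field names are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingExample:
    """One training quadruple (p̄, x̄+, x̄-, s̄) from the set S."""
    keyword_phonemes: list          # p̄: the keyword as a phoneme sequence, e.g. ['bcl', 'b', 'ao', 't']
    utterance_with: np.ndarray      # x̄+: acoustic feature vectors (T x d) of an utterance containing the keyword
    utterance_without: np.ndarray   # x̄-: acoustic feature vectors of an utterance not containing the keyword
    alignment: list                 # s̄: start frame of each keyword phoneme in x̄+, plus the end frame
```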

  26.–28. Learning Paradigm (cont.). The training set S is fed to the discriminative keyword spotting algorithm, which selects a keyword spotter f(x̄, p̄) from the class of keyword spotting functions parameterized by a weight vector w ∈ R^n: f(x̄, p̄) = max_{s̄} w · φ(x̄, p̄, s̄).
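
A minimal sketch of the scoring rule f(x̄, p̄) = max_{s̄} w · φ(x̄, p̄, s̄). The slides do not spell out the search over alignments, which in practice is done with dynamic programming; the exhaustive enumeration below is only meant to make the definition concrete (function and argument names are illustrative):

```python
import itertools
import numpy as np

def spot_score(x, p, w, phi):
    """f(x, p) = max over alignments s of w . phi(x, p, s).

    x: acoustic feature matrix (T x d), p: keyword phoneme sequence,
    w: weight vector in R^n, phi: callable returning the n feature values
    for a candidate alignment s (a strictly increasing tuple of boundary frames).
    """
    num_frames = len(x)
    best_score, best_alignment = float('-inf'), None
    # Enumerate all increasing boundary sequences s_1 < ... < s_{|p|+1}.
    # Exponential in general; real systems replace this with dynamic programming.
    for s in itertools.combinations(range(num_frames), len(p) + 1):
        score = float(np.dot(w, phi(x, p, s)))
        if score > best_score:
            best_score, best_alignment = score, s
    return best_score, best_alignment
```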

  29. Feature Functions. We define 7 feature functions of the form φ_j(x̄, p̄, s̄) ∈ R, each mapping the acoustic features x̄, the keyword p̄ (a phoneme sequence), and a suggested alignment s̄ to a scalar that measures the confidence that the keyword appears in the signal with that alignment.

  30.–31. Feature Functions I. Cumulative spectral change around the boundaries:
     φ_j(x̄, p̄, s̄) = Σ_{i=2}^{|p̄|−1} d(x_{s_i−j}, x_{s_i+j}),   j ∈ {1, 2, 3, 4},
  i.e., for each internal boundary s_i, the distance between the acoustic frames j positions before and after the boundary.
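
A minimal sketch of one of these spectral-change feature functions, assuming the alignment s̄ is given as a list of boundary frames and using a squared Euclidean distance for d (the choice of distance is an assumption, not stated on the slide):

```python
import numpy as np

def spectral_change_feature(x, s, j):
    """φ_j: cumulative spectral change around the internal phoneme boundaries.

    x: acoustic feature matrix (T x d), s: boundary frame indices for the keyword,
    j: frame offset in {1, 2, 3, 4}. For each internal boundary s_i, compares the
    frames j steps before and after the boundary.
    """
    def d(a, b):
        return float(np.sum((a - b) ** 2))  # squared Euclidean distance (an assumption)
    return sum(d(x[s_i - j], x[s_i + j]) for s_i in s[1:-1])  # skip the first and last boundaries
```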

  32.–34. Feature Functions II. Cumulative confidence in the phoneme sequence:
     φ_5(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} Σ_{t=s_i}^{s_{i+1}−1} g(x_t, p_i),
  where g : X × Y → R is a static frame-based phoneme classifier and g(x_t, p_i) is the confidence that phoneme p_i was uttered at frame x_t. [Dekel, Keshet, Singer, '04]
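
A minimal sketch of the cumulative-confidence feature, taking the frame-based phoneme classifier g as a callable (its implementation, e.g. the classifier of Dekel, Keshet & Singer, is outside this sketch):

```python
def phoneme_confidence_feature(x, p, s, g):
    """φ_5: sum of frame-level confidences that each keyword phoneme was uttered
    in the frames assigned to it by the alignment.

    x: acoustic feature matrix, p: keyword phoneme sequence,
    s: boundary frames with s[i] the start of p[i] and s[len(p)] the end,
    g: frame-based phoneme classifier, g(frame, phoneme) -> confidence score.
    """
    total = 0.0
    for i, phoneme in enumerate(p):
        for t in range(s[i], s[i + 1]):  # frames t = s_i, ..., s_{i+1} - 1
            total += g(x[t], phoneme)
    return total
```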

  35.–37. Feature Functions III. Phoneme duration model:
     φ_6(x̄, p̄, s̄) = Σ_{i=1}^{|p̄|} log N(s_{i+1} − s_i; μ̂_{p_i}, σ̂_{p_i}),
  where μ̂_{p_i} is the average duration of phoneme p_i and σ̂_{p_i} is the standard deviation of its duration (statistics of phoneme p_i).
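
A minimal sketch of the duration feature, assuming the per-phoneme duration statistics μ̂ and σ̂ are given as dictionaries estimated from aligned training data:

```python
import numpy as np

def duration_feature(p, s, mu, sigma):
    """φ_6: log-likelihood of the phoneme durations under per-phoneme Gaussians.

    p: keyword phoneme sequence, s: boundary frames (len(p) + 1 entries),
    mu, sigma: dicts mapping each phoneme to its mean and standard deviation
    of duration (in frames), estimated from training data.
    """
    total = 0.0
    for i, phoneme in enumerate(p):
        dur = s[i + 1] - s[i]
        # log N(dur; mu, sigma)
        total += (-0.5 * np.log(2.0 * np.pi * sigma[phoneme] ** 2)
                  - (dur - mu[phoneme]) ** 2 / (2.0 * sigma[phoneme] ** 2))
    return total
```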
