Discriminative Keyword Spotting
Joseph Keshet, The Hebrew University
David Grangier, IDIAP Research Institute
Samy Bengio, Google Inc.
Outline
• Problem Definition
• Keyword Spotting with HMMs
• Discriminative Keyword Spotting
  – derivation
  – analysis
  – feature functions
• Experimental Results
Problem Definition
Goal: find a keyword in a speech signal.
Example utterance: "he's bought it", with phoneme transcription h iy z bcl b ao t ix tcl.
Problem Definition
Notation:
• keyword: $k$ = bought
• keyword phoneme sequence: $\bar{p}$ = (bcl, b, ao, t)
• acoustic feature vectors: $\bar{x} = (x_1, x_2, x_3, \ldots, x_T)$
• alignment sequence: $\bar{s} = (s_1, s_2, s_3, s_4, e_4)$
Problem Definition
The spotter $f(\bar{x}, \bar{p})$ takes two inputs:
• the speech signal $\bar{x}$
• the keyword as a phoneme sequence, e.g. $\bar{p}$ = /b ao t/
It produces two outputs:
• the predicted decision: keyword detection (yes/no)
• the predicted alignment $\bar{s}'$
Fat is Good
The performance of a keyword spotting system is measured by a Receiver Operating Characteristic (ROC) curve.
• true positive rate = (detected utterances with keywords) / (total utterances with keywords)
• false positive rate = (detected utterances without keywords) / (total utterances without keywords)
[Figure: ROC curve, true positive rate plotted against false positive rate. A denotes the area under the curve; A = 1 is the largest possible area.]
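The two rates and the area A can be illustrated with a short Python sketch. It assumes the spotter emits a real-valued confidence per utterance and that a detection is declared when the confidence reaches a threshold; the threshold sweep and the function names `roc_points` and `area_under_curve` are illustrative assumptions, not taken from the slides.

```python
def roc_points(pos_scores, neg_scores):
    """Sweep a detection threshold and return (fpr, tpr) points.

    pos_scores: confidences for utterances that contain the keyword.
    neg_scores: confidences for utterances that do not contain it.
    The thresholding rule (detect when score >= threshold) is an assumption.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for b in thresholds:
        tpr = sum(s >= b for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= b for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under (fpr, tpr) points ordered by increasing fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy confidences (made up for illustration only).
with_keyword = [0.9, 0.8, 0.4]
without_keyword = [0.7, 0.3, 0.2]
A = area_under_curve(roc_points(with_keyword, without_keyword))
print(f"A = {A:.3f}")
```

In this toy case A equals the fraction of (with keyword, without keyword) utterance pairs that the scores rank correctly, which is why a "fat" curve corresponds to a spotter that separates the two kinds of utterance well.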
HMM-based Keyword Spotting
HMM-based Keyword Spotting
Whole Word Modeling
[Figure: a whole-word model for the keyword "bought" and a garbage model; state sequence $\bar{q}$ over the acoustic frames $\bar{x}$ (10 ms).]
[Rahim et al, 1997; Rohlicek et al, 1989]
HMM-based Keyword Spotting
Phoneme-Based
[Figure: garbage, "bought", garbage models in sequence over the phonemes h iy b ao t ih t; phoneme sequence $\bar{p}$, state sequence $\bar{q}$, acoustic frames $\bar{x}$ (10 ms).]
[Bourlard et al, 1994; Manos & Zue, 1997; Rohlicek et al, 1993]
HMM-based Keyword Spotting
Large Vocabulary Based
• Linguistic constraints on the garbage model
• Does a human listener need to have a large vocabulary in order to recognize one word?
(Cardillo et al, 2002; Rose & Paul, 1990; Szoke et al, 2005; Weintraub, 1995)
HMM Approaches to Keyword Spotting
• They do not directly address the goal of the keyword spotting task: maximizing the area under the ROC curve.
Discriminative Approach
Learning Paradigm
Discriminative learning from examples:
$S = \{(\bar{p}_1, \bar{x}_1^+, \bar{x}_1^-, \bar{s}_1), \ldots, (\bar{p}_m, \bar{x}_m^+, \bar{x}_m^-, \bar{s}_m)\}$
• $\bar{p}_i$: keyword (phoneme sequence)
• $\bar{x}_i^+$: utterance in which the keyword is uttered
• $\bar{x}_i^-$: utterance in which the keyword is not uttered
• $\bar{s}_i$: alignment of the keyword and the utterance with keyword
Discriminative keyword spotting takes $S$ and returns a keyword spotter $f(\bar{x}, \bar{p})$ from the class of all keyword spotting functions $F_w$:
$f(\bar{x}, \bar{p}) = \max_{\bar{s}} \mathbf{w} \cdot \phi(\bar{x}, \bar{p}, \bar{s}), \quad \mathbf{w} \in \mathbb{R}^n$
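The form of the keyword spotter, a maximum over alignments of a linear score, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: an alignment is taken to be the start frame of each phoneme plus the end frame of the last one, the maximum is found by exhaustive enumeration purely for clarity (the slides do not say how the search is carried out), and `toy_phi` is a made-up two-dimensional feature map standing in for the feature functions.

```python
from itertools import combinations


def dot(w, v):
    """Inner product w . v of two equal-length sequences."""
    return sum(wi * vi for wi, vi in zip(w, v))


def candidate_alignments(num_frames, num_phonemes):
    """All strictly increasing tuples of num_phonemes + 1 frame boundaries.

    Assumed convention: s[i] is the first frame of phoneme i and s[-1] is
    the frame just after the last phoneme.
    """
    return combinations(range(num_frames + 1), num_phonemes + 1)


def spot(w, phi, x, p):
    """Return (confidence, alignment) with confidence = max_s w . phi(x, p, s)."""
    best_score, best_s = float("-inf"), None
    for s in candidate_alignments(len(x), len(p)):
        score = dot(w, phi(x, p, s))
        if score > best_score:
            best_score, best_s = score, s
    return best_score, best_s


def toy_phi(x, p, s):
    """Hypothetical feature map: frames here are already phoneme labels.

    Returns [matching frames, -mismatching frames] inside the alignment.
    """
    matches = mismatches = 0
    for i, phoneme in enumerate(p):
        for t in range(s[i], s[i + 1]):
            if x[t] == phoneme:
                matches += 1
            else:
                mismatches += 1
    return [matches, -mismatches]


frames = ["sil", "b", "b", "ao", "t", "sil"]
keyword = ["b", "ao", "t"]
confidence, alignment = spot([1.0, 1.0], toy_phi, frames, keyword)
print(confidence, alignment)
```

Exhaustive enumeration grows combinatorially with the utterance length, so it is only suitable for toy inputs like the one above.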
Feature Functions
We define 7 feature functions of the form $\phi_j : (\bar{x}, \bar{p}, \bar{s}) \mapsto \mathbb{R}$.
Inputs:
• $\bar{x}$: sequence of acoustic features
• $\bar{p}$: keyword (phoneme sequence)
• $\bar{s}$: suggested alignment
Output: confidence in the keyword and suggested alignment.
Feature Functions I
Cumulative spectral change around the boundaries:
$\phi_j(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=2}^{|\bar{p}|-1} d(x_{-j+s_i}, x_{j+s_i}), \quad j \in \{1, 2, 3, 4\}$
[Figure: frames at $-j+s_i$, $s_i$ and $j+s_i$ around the boundary $s_i$.]
Feature Functions II
Cumulative confidence in the phoneme sequence:
$\phi_5(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \sum_{t=s_i}^{s_{i+1}-1} g(x_t, p_i)$
We build a static frame-based phoneme classifier $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $g(x_t, p_i)$ is the confidence that phoneme $p_i$ was uttered at frame $x_t$ [Dekel, Keshet, Singer, '04].
[Figure: frames between the boundaries $s_{i-1}$, $s_i$ and $s_{i+1}$, with $p_{i-1}$ = t and $p_i$ = eh.]
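The double sum can be sketched in Python as below. The frame-based classifier $g$ is not specified on the slides beyond its signature, so `toy_g` is a hypothetical stand-in that simply looks up a precomputed confidence; the alignment is again assumed to hold one start frame per phoneme plus a final end frame.

```python
def phi_phoneme_confidence(x, p, s, g):
    """Sum g(x_t, p_i) over every frame t assigned to phoneme p_i.

    x: list of frames.
    p: keyword phoneme sequence.
    s: alignment with len(p) + 1 boundaries; phoneme i spans s[i] .. s[i+1] - 1.
    g: frame-based phoneme classifier returning a real-valued confidence.
    """
    total = 0.0
    for i, phoneme in enumerate(p):
        for t in range(s[i], s[i + 1]):
            total += g(x[t], phoneme)
    return total


def toy_g(frame, phoneme):
    """Hypothetical classifier: each frame is a dict of phoneme confidences."""
    return frame[phoneme]


x = [
    {"b": 0.5, "ao": -0.5},
    {"b": 0.25, "ao": 0.75},
    {"b": -1.0, "ao": 1.0},
]
p = ["b", "ao"]
print(phi_phoneme_confidence(x, p, (0, 1, 3), toy_g))
print(phi_phoneme_confidence(x, p, (0, 2, 3), toy_g))
```

Because the sum runs over frames, longer segments contribute more terms, so the value depends on where the boundaries are placed as well as on how confident the classifier is.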
Feature Functions III
Phoneme duration model:
$\phi_6(\bar{x}, \bar{p}, \bar{s}) = \sum_{i=1}^{|\bar{p}|} \log \mathcal{N}(s_{i+1} - s_i;\ \hat{\mu}_{p_i}, \hat{\sigma}_{p_i})$
Statistics of phoneme $p_i$:
• $\hat{\mu}_{p_i}$: average length of phoneme $p_i$
• $\hat{\sigma}_{p_i}$: standard deviation of the length of phoneme $p_i$
[Figure: phoneme durations $s_i - s_{i-1}$ and $s_{i+1} - s_i$ between consecutive boundaries.]
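A small Python sketch of the duration feature, assuming $\mathcal{N}$ denotes the Gaussian density with the given mean and standard deviation. The per-phoneme statistics in `stats` are made-up numbers for illustration, and the acoustic sequence $\bar{x}$ is left out of the signature because the formula does not use it.

```python
import math


def log_gaussian(value, mean, std):
    """Log of the Gaussian density N(value; mean, std)."""
    return (-0.5 * math.log(2.0 * math.pi * std ** 2)
            - (value - mean) ** 2 / (2.0 * std ** 2))


def phi_duration(p, s, stats):
    """Sum of log N(s_{i+1} - s_i; mu_{p_i}, sigma_{p_i}) over the phonemes.

    p: keyword phoneme sequence.
    s: alignment with len(p) + 1 boundaries (in frames).
    stats: dict mapping phoneme -> (mean duration, std of duration), in frames.
    """
    total = 0.0
    for i, phoneme in enumerate(p):
        mean, std = stats[phoneme]
        total += log_gaussian(s[i + 1] - s[i], mean, std)
    return total


# Hypothetical duration statistics, in frames.
stats = {"b": (2.0, 1.0), "ao": (5.0, 2.0)}
p = ["b", "ao"]
typical = phi_duration(p, (0, 2, 7), stats)   # durations equal to the means
unusual = phi_duration(p, (0, 6, 7), stats)   # "b" too long, "ao" too short
print(typical, unusual)
```

Alignments whose segment lengths are close to the typical duration of each phoneme receive a higher value than alignments with implausibly long or short segments.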