Speech Processing 15-492/18-492 Speech Recognition Template - PowerPoint PPT Presentation

Speech Processing 15-492/18-492 Speech Recognition Template matching

Speech Recognition by Templates • A little history … • Matching Templates • DTW (Dynamic Time Warping) • Beyond template matching

Radio Rex (1922) • Toys always lead technology … • Call “Rex” and he comes out of his kennel • (Crystalradio.com and Rhys Jones)

Toy ASR “Tricks” • Radio Rex – Recognizes vowel formants in “EH” • Voice activated toy train – Multilingual stop/go: hashire/tomate • Toy “pets” don’t need perfect ASR

Template Matching • Record templates from user – Store in library • Record ASR example – Compare against each library template • Select closest example • For example … – On a voice dialing system

Voice Dialing System • Library – Mom – Dad – Bob – Mario’s Pizza – Let’s Go Bus Information System

Matching in Time Domain • Duration – Will discriminate some examples – But Mom, Bob and Dad will be confused • What about spectral properties?

Matching in Frequency Domain [figure panels: Mom, Bob]

Different deliveries • We change durations – Two utterances are never the same • When it fails we change our delivery – Become more articulate – “clearer”

Dynamic Time Warping [figure labels: Template, Sample Speech]

DTW algorithm • Grid: template frames (i-1, i) on one axis, sample frames (j-1, j) on the other • For each square (i, j) – Cost(i, j) = Dist(template[i], sample[j]) + smallest_of(Cost(i-1, j), Cost(i, j-1), Cost(i-1, j-1)) – Cost of a neighbouring square is the accumulated score already stored there • Remember which choice you took (count path)
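The per-square rule above can be sketched in Python. This is a minimal illustration rather than the course's implementation: it assumes non-empty sequences, treats the three terms inside smallest_of as the accumulated costs of the neighbouring squares (as in standard DTW), and uses hypothetical names (`dtw`, `abs_diff`) with toy 1-D features in place of real spectral frames.

```python
import math


def dtw(template, sample, dist):
    """Return (total_cost, path) for the cheapest alignment.

    path is a list of (i, j) index pairs from (0, 0) to the final square.
    """
    n, m = len(template), len(sample)
    # cost[i][j] = cheapest accumulated cost of any path ending at square (i, j)
    cost = [[math.inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]  # remembered choice for each square

    for i in range(n):
        for j in range(m):
            local = dist(template[i], sample[j])
            if i == 0 and j == 0:
                cost[i][j] = local
                continue
            # Allowed predecessors: previous template frame, previous
            # sample frame, or both at once (diagonal).
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates)
            cost[i][j] = local + best_cost
            back[i][j] = best_prev

    # Follow the remembered choices back from the final square.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return cost[n - 1][m - 1], path


def abs_diff(a, b):
    """Toy frame distance for 1-D features."""
    return abs(a - b)


# The sample holds the middle value one frame longer than the template.
total, path = dtw([1, 2, 3], [1, 2, 2, 3], abs_diff)
print(total, path)
```

The path length (here four steps) is what makes it possible to compare scores across templates of different lengths.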

Multiple Templates • Compare against each • Find closest • Need to normalize scores – (divide by length of matches)
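A small worked example of why normalization matters. The numbers are invented for illustration, and "length of match" is assumed here to mean the number of steps on the DTW path; other normalizers (such as template length) are also plausible readings.

```python
def normalized_score(total_cost, path_length):
    """Average cost per step along the alignment path."""
    if path_length <= 0:
        raise ValueError("path_length must be positive")
    return total_cost / path_length


# Hypothetical raw DTW results as (total_cost, path_length) pairs.
raw = {
    "short_template": (12.0, 10),
    "long_template": (18.0, 30),
}

# Raw totals favour the short template simply because it has fewer steps.
by_raw = min(raw, key=lambda word: raw[word][0])
# Per-step cost shows the long template actually fits better.
by_normalized = min(raw, key=lambda word: normalized_score(*raw[word]))

print(by_raw, by_normalized)
```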

Matching Templates • Template Library: Word0, Word1, Word2, … • Sample
BestScore = Infinity
For Word in Templates
    Score = dtw(Template[Word], Sample);
    if (Score < BestScore)
        BestScore = Score;
        BestWord = Word;
DoAction(Action[BestWord])

DTW issues • What happens with no-matches – Need to deal with none of the above • What happens with more templates – Harder to choose between – Once variance greater than differences • Choose templates that are very different
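One common way to handle "none of the above" is a rejection threshold on the best score. The sketch below is an assumption, not something the slides specify: the threshold value, the function names, the toy 1-D templates, and the choice of combined length as normalizer are all hypothetical.

```python
import math


def dtw_score(template, sample):
    """Length-normalized DTW cost between two 1-D feature sequences."""
    n, m = len(template), len(sample)
    # Extra row and column of infinities act as the grid border.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - sample[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)  # assumed normalizer: combined length


def recognize(sample, templates, reject_above):
    """Return the closest template's name, or None for "none of the above"."""
    best_word, best_score = None, math.inf
    for word, template in templates.items():
        score = dtw_score(template, sample)
        if score < best_score:
            best_word, best_score = word, score
    return best_word if best_score <= reject_above else None


templates = {"stop": [1, 5, 1], "go": [4, 4, 8, 8]}
print(recognize([1, 5, 5, 1], templates, reject_above=0.5))  # stretched "stop"
print(recognize([20, 30, 20], templates, reject_above=0.5))  # far from both
```

In practice the threshold would have to be tuned on recordings of both in-library and out-of-library words.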

DTW/Template Applications • Voice dialer • Simple command and control • Speaker ID

Speaker ID • Template Library: Speaker0, Speaker1, Speaker2, … • Sample
BestScore = Infinity
For Speaker in Templates
    Score = dtw(Template[Speaker], Sample);
    if (Score < BestScore)
        BestScore = Score;
        BestSpeaker = Speaker;

DTW • Advantages – Works well for small number of templates (<20) – Language independent – Speaker specific – Easy to train (end user controls it) • Disadvantages – Limited number of templates – Speaker specific – Need actual training examples

More reliable matching • Distance metric – Euclidean • But some distances are bigger than others – Silence is pretty similar – Fricatives are much larger • A longer fricative might give a large score • A longer vowel might give a smaller score

More reliable matching • Having multiple template examples – Individual matches or – Average them together • DTW align all of the examples • Collect statistics as a Gaussian – Mean and standard deviation for each coeff
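Collecting the statistics can be sketched as follows, assuming DTW has already mapped one frame from each recording to the same template position. The function name and the toy two-coefficient frames are hypothetical, and population standard deviation is an arbitrary choice; the slides do not say which estimator is used.

```python
import statistics


def frame_statistics(aligned_frames):
    """Per-coefficient mean and standard deviation.

    aligned_frames: equal-length feature vectors that DTW mapped to the
    same template position across several recordings.
    """
    columns = list(zip(*aligned_frames))  # one tuple per coefficient
    means = [statistics.fmean(column) for column in columns]
    stdevs = [statistics.pstdev(column) for column in columns]
    return means, stdevs


# Three recordings, two coefficients per frame (toy values).
frames = [[1.0, 10.0], [3.0, 10.0], [2.0, 13.0]]
means, stdevs = frame_statistics(frames)
print(means, stdevs)
```

Repeating this for every template position yields a sequence of Gaussians in place of a single recorded template.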

More reliable distances • Instead of Euclidean distance – Doesn’t care about the standard deviation • Use Mahalanobis distance – Cares about means and standard deviation
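The difference between the two metrics shows up in a two-coefficient example. This sketch assumes independent coefficients (a diagonal covariance, so only a per-coefficient standard deviation is needed) and non-zero standard deviations; the values are invented for illustration.

```python
import math


def euclidean(x, mean):
    """Plain Euclidean distance: every coefficient counts equally."""
    return math.sqrt(sum((a - m) ** 2 for a, m in zip(x, mean)))


def mahalanobis_diag(x, mean, stdev):
    """Mahalanobis distance under a diagonal-covariance assumption.

    Each difference is scaled by that coefficient's standard deviation,
    so deviations in a naturally noisy coefficient count for less.
    """
    return math.sqrt(
        sum(((a - m) / s) ** 2 for a, m, s in zip(x, mean, stdev))
    )


mean = [0.0, 0.0]
stdev = [1.0, 10.0]  # the second coefficient varies a lot in training

a = [3.0, 0.0]  # off by 3 in the stable coefficient
b = [0.0, 3.0]  # off by 3 in the noisy coefficient

print(euclidean(a, mean), euclidean(b, mean))
print(mahalanobis_diag(a, mean, stdev), mahalanobis_diag(b, mean, stdev))
```

Euclidean distance rates both frames as equally far from the mean, while the Mahalanobis version treats the deviation in the noisy coefficient as much less significant.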

Extending Template matching • String word templates together – Need to find word segmentation • Word0 Word1 Word2 … • But there are many words …

Extending template model • String phoneme templates together – A template model for each phoneme • Phoneme Templates: Phone0, Phone1, Phone2, … • Sample: k ae t
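Stringing phoneme templates together amounts to concatenating their frame sequences according to a pronunciation. The sketch below uses the "k ae t" example from the slide; the function name and the toy 1-D feature values are hypothetical, and real templates would hold a spectral vector per frame.

```python
def word_template(pronunciation, phone_templates):
    """Concatenate per-phoneme templates into one word-level template."""
    frames = []
    for phone in pronunciation:
        frames.extend(phone_templates[phone])
    return frames


# Toy 1-D "features" standing in for per-frame spectral vectors.
phone_templates = {"k": [7, 7], "ae": [2, 3, 2], "t": [9]}

cat = word_template(["k", "ae", "t"], phone_templates)
print(cat)
```

The resulting sequence can be matched against a sample with DTW like any recorded word template, so new words need only a pronunciation rather than a new recording.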

Summary • Speech Recognition by Templates – Good for simple small vocabulary tasks • Dynamic Time Warping (DTW) – Can match different durational examples • Averaging over multiple models • Distance metrics – Euclidean vs Mahalanobis
