Speech Processing 11-492/18-492

Speech Recognition: Acoustic modeling, Pronunciation dictionary

Acoustic Modeling
- Speech and Signal Variability
- Measuring Error
- Pronunciation lexicons

Variability in Speech Signal
- “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
- Homophones: same pronunciation
  - “wright”, “right”, “write”: / r ay t /
  - “ford or”, “four door”: / f ao r d ao r /
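The homophone problem can be illustrated with a toy lexicon. This is a minimal sketch: the phone strings for “wright/right/write” and the phrase-level string / f ao r d ao r / come from the slide, while the per-word splits for “ford”, “or”, “four” and “door” and the function names are assumptions made for illustration.

```python
# Toy lexicon. Per-word entries for ford/or/four/door are assumed splits
# of the phrase-level phone string given on the slide.
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford": "f ao r d",
    "or": "ao r",
    "four": "f ao r",
    "door": "d ao r",
}


def phones_for(phrase):
    """Concatenate the phone strings of each word in a phrase."""
    return " ".join(LEXICON[word] for word in phrase.lower().split())


def homophone_groups(lexicon):
    """Group words that share exactly the same phone string."""
    groups = {}
    for word, phones in lexicon.items():
        groups.setdefault(phones, []).append(word)
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) > 1}


print(homophone_groups(LEXICON))
print(phones_for("ford or") == phones_for("four door"))
```

Because the phone strings are identical, the acoustic model alone cannot separate these word sequences.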

Style Variability
- Different articulation in different situations
  - Clear vs conversational
  - Whisper vs shouting
  - Talking to machine, talking to others
  - Frustrated speech

Speaker variability
- Gender, age, dialect, health
- Speaker dependent systems
- Speaker independent systems
- Speaker adaptive systems
  - Enrolment stage (acoustics and language)

Environment Variability
- Different background noises
  - Office vs outside
- Different applications, different environments
  - Desktop dictation, to warehouse pick
- Single speaker vs multispeaker
- Background music

Channel Variability
- Telephone vs desktop
  - 8 kHz vs 16 kHz
- Mobile vs desktop
  - Close-talking vs far-field
- Cell phone vs landline vs VOIP

Measuring Speech Recognition Error
- Word Error Rate
  - Substitutions: word is replaced
  - Deletions: word is missed out
  - Insertions: word is added

WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
- WER requires:
  - Transcription (the correct word string)
  - Alignment between ASR output and transcript
    - Not just left to right matching
- Sometimes Accuracy is given
  - 100 - WER
  - NOT the percentage of words correct (insertions also count as errors)
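The alignment step can be sketched with a standard minimum edit distance over words. This is one common way to compute WER, not necessarily the scoring tool used in the course; the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent, using a minimum edit distance alignment over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = fewest substitutions, deletions and insertions
    # needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


# Position-by-position matching would count 3 errors here;
# the alignment finds a single deletion.
print(word_error_rate("a b c", "b c"))
```

The example shows why simple left-to-right matching is not enough: one dropped word at the start would otherwise make every following word look wrong.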

Word Error Rate
- Can get > 100%
  - But something is very wrong
- Outputting “the” only, ignoring the speech
  - Sometimes gives WER < 100%
- All words are treated equal
  - “This specimen” vs “The specimen”
  - “Is absent” vs “Is present”
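A worked example of the first two points, using made-up sentences. Here S, D and I are shorthand for the substitution, deletion and insertion counts, and N for the number of words in the correct sentence.

```latex
% Reference: "the cat" (N = 2); recognizer outputs "the the the the the".
% Best alignment: 1 match, 1 substitution (cat -> the), 3 insertions.
\[
\mathrm{WER} = 100\% \times \frac{S + D + I}{N}
             = 100\% \times \frac{1 + 0 + 3}{2} = 200\%
\]

% Reference: "the cat sat on the mat" (N = 6); recognizer outputs "the" six times.
% Best alignment: 2 matches, 4 substitutions.
\[
\mathrm{WER} = 100\% \times \frac{4 + 0 + 0}{6} \approx 66.7\%
\]
```

In the second case the recognizer ignores the audio entirely and still scores below 100%, so a WER figure needs to be read alongside the actual output.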

Signal Acquisition
- High signal quality
- Lower sample rate will increase WER
  - 8 kHz baseline
  - 16 kHz: -10% WER

End-Point Detection
- Long silence will likely increase WER
  - It will recognize phantom words
- Need to find the speech in the signal
  - VAD (Voice Activity Detection)
  - Find beginning and end of speech
- Typically do continuous recognition
  - Recognize while listening
  - But need end point (have to wait)
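A very simple energy-based end-point detector gives the flavor of VAD. This is a sketch only: the frame length of 160 samples (10 ms at 16 kHz), the fixed threshold and the function names are assumptions, and practical detectors typically add adaptive thresholds and smoothing.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[start:start + frame_len]) / frame_len
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) of the speech region, or None.

    A frame counts as speech when its energy exceeds the (assumed) threshold.
    The region runs from the first speech frame to the end of the last one.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic signal: 3 silent frames, 3 loud frames, 2 silent frames.
signal = [0.0] * 480 + [0.5, -0.5] * 240 + [0.0] * 320
print(find_endpoints(signal))
```

Trimming the signal to the detected region before recognition avoids decoding the leading and trailing silence.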

Feature normalization
- Sometimes do normalization
  - Remove mean from MFCCs
  - Can make recognition more reliable in noise
- Often include deltas and delta deltas
- Sometimes do feature reduction
  - Principal Component Analysis
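Mean removal and delta features can be sketched on a small matrix of frames. The sketch assumes MFCC frames are already available as lists of floats. The delta here is a plain two-frame difference with the edge frames repeated, which is only one of several formulas in use, and the function names are made up.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def deltas(frames):
    """Differences (next - previous) / 2; the first and last frames are repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(frames):
    """Append deltas and delta-deltas to each frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Three frames of two (made-up) cepstral coefficients.
mfcc = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = cepstral_mean_normalize(mfcc)
print(normalized)
print(add_dynamic_features(normalized)[1])
```

A constant offset added to every frame, such as a fixed channel effect in the cepstral domain, disappears after the mean is subtracted.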

What phones/segments
- Need the best set for discrimination
  - Not necessarily the same as linguistic phones
- More phones means more training
  - And needs to have consistent lexicon
- Extra phones
  - t vs dx
  - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
  - Stops as closures and bursts
  - Schwas: ax and ix
  - Syllabics: el, em, en
  - Accents/Tones: ah1, ah0, …

Context dependency
- Care about the contexts of each phone
  - Post vocalic /r/ and /n/ /m/ affect vowel
  - Utterance start and end affect phonemes
- Need more than simple phone models

Tri-phone Models
- Have models for each phone and context
  - 43^3 contexts, about 80K models
- Not all contexts have enough examples
  - oy (oy) oy very rare
  - sh (ax) n very common
- Merge tri-phones that are similar
  - E.g. t(ih)n with d(ih)n
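The model count and the left(center)right notation can be checked with a few lines of code. The inventory size of 43 is from the slide. The "sil" boundary symbol and the function name are assumptions for illustration.

```python
NUM_PHONES = 43  # phone inventory size quoted on the slide


def triphones(phones, boundary="sil"):
    """List left(center)right labels for a phone sequence.

    The utterance is padded with an assumed boundary symbol on both sides.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


print(NUM_PHONES ** 3)  # 79507, i.e. about 80K
print(triphones("p eh n s ih l".split()))
```

Most of the 79507 possible labels never occur often enough in training data to be modeled on their own.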

Find phones to merge
- Using phonetic features
  - Most similar features, most similar acoustics
  - Stops, voicing, vowel type …
- Usually automatic clustering of triphones
  - Using CART trees indexed by phonetic features
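The idea of tying triphones with similar contexts can be sketched with a fixed feature table. This is a deliberate simplification and not the CART procedure: a decision tree picks its phonetic questions from the training data, whereas the grouping below is hard-coded. The feature table, class names and function names are all hypothetical.

```python
# Hypothetical, hand-written feature table for a few phones (illustration only).
PHONE_CLASS = {
    "t": "alveolar_stop",
    "d": "alveolar_stop",
    "p": "labial_stop",
    "b": "labial_stop",
    "n": "nasal",
    "m": "nasal",
}


def cluster_key(left, center, right):
    """Triphones share a key when their contexts fall in the same class."""
    return (PHONE_CLASS.get(left, left), center, PHONE_CLASS.get(right, right))


def cluster_triphones(triphone_list):
    """Group (left, center, right) triples that would share one model."""
    clusters = {}
    for left, center, right in triphone_list:
        label = f"{left}({center}){right}"
        clusters.setdefault(cluster_key(left, center, right), []).append(label)
    return clusters


seen = [("t", "ih", "n"), ("d", "ih", "n"), ("p", "ih", "n"), ("t", "ih", "m")]
for key, members in cluster_triphones(seen).items():
    print(key, members)
```

With this table, t(ih)n and d(ih)n end up in the same cluster, matching the merge example on the previous slide.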

Adaptation
- Change behavior after use
- Human adaptation
  - They will change how they speak
- Channel adaptation
  - Cepstral normalization
- Model adaptation
  - Move the means (or weights on means)

Adaptation
- Assume recognition is correct
  - (Maybe with some threshold)
- Modify model to make answer more correct
- Adaptation to speaker characteristics
- Adaptation to speaker style
- Can improve accuracy by a few %

Pronunciation lexicon
- Need list of words and their pronunciation
  - Pencil: p eh n s ih l
  - Two: t uw
  - Too: t uw
  - …
- Need pronunciation of ALL words
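The requirement that ALL words have a pronunciation can be checked mechanically. This is a minimal sketch using the three entries from the slide. The function names and the test vocabulary are made up.

```python
# Lexicon entries taken from the slide.
LEXICON = {
    "pencil": ["p", "eh", "n", "s", "ih", "l"],
    "two": ["t", "uw"],
    "too": ["t", "uw"],
}


def missing_words(vocabulary, lexicon):
    """Return the vocabulary words that have no pronunciation entry."""
    return sorted({word.lower() for word in vocabulary} - set(lexicon))


def pronounce(sentence, lexicon):
    """Look up each word; raises KeyError for a word the lexicon lacks."""
    return [lexicon[word] for word in sentence.lower().split()]


print(missing_words(["Pencil", "two", "pen"], LEXICON))
print(pronounce("two pencil", LEXICON))
```

Running such a check over the recognizer's vocabulary before training flags words that still need an entry.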

What’s a word
- Basic words are clear
- What about morphological variants
  - walk, walks, walked, walking
- Multi-word words
  - Los Angeles, New York
- Contractions
  - Wanna, gonna …
- Yes, ALL words that you will recognize

Pronunciation variants
- Homographs (same writing, different pronunciation)
  - bass: / b ae s / (fish), / b ey s / (music)
  - project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
- Natural variants
  - route: / r uw t / and / r aw t /
  - coupon: / k uw p ao n / and / k y uw p ao n /
  - water: / w ao t er / and / w ao dx er /
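One way to hold variants is to let each lexicon entry map to a list of pronunciations. This is a sketch: the phone strings come from the slide, while the data layout, the " | " word separator and the function name are assumptions.

```python
from itertools import product

# Each word maps to all of its listed pronunciations (phone strings from the slide).
VARIANTS = {
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
    "bass": ["b ae s", "b ey s"],
}


def phrase_pronunciations(phrase, lexicon):
    """Enumerate every phone string a phrase can take, one variant per word.

    Words are separated by " | " in the result.
    """
    options = [lexicon[word] for word in phrase.lower().split()]
    return [" | ".join(choice) for choice in product(*options)]


for pron in phrase_pronunciations("water route", VARIANTS):
    print(pron)
```

The number of phrase-level pronunciations is the product of the variant counts per word, so variants are best limited to those that really occur.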
