Speech Processing 11-492/18-492
  Speech Recognition
  Acoustic modeling
  Pronunciation dictionary
Acoustic Modeling
  Speech and Signal Variability
  Measuring Error
  Pronunciation lexicons
Variability in Speech Signal
  “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
  Homophones: same pronunciation
    “wright” “right” “write” / r ay t /
    “ford or” “four door” / f ao r d ao r /
Style Variability
  Different articulation in different situations
    Clear vs Conversational
    Whisper vs shouting
    Talking to machine, talking to others
    Frustrated speech
Speaker variability
  Gender, age, dialect, health
  Speaker dependent systems
  Speaker independent systems
  Speaker adaptive systems
    Enrolment stage (acoustics and language)
Environment Variability
  Different background noises
    Office vs Outside
  Different applications, different environments
    From desktop dictation to warehouse picking
  Single speaker vs Multispeaker
  Background music
Channel Variability
  Telephone vs Desktop
    8 kHz vs 16 kHz
  Mobile vs Desktop
  Close-talking vs far-field
  Cell Phone vs Landline vs VoIP
Measuring Speech Recognition Error
  Word Error Rate
    Substitutions: word is replaced
    Deletions: word is missed out
    Insertions: word is added
  WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)
Word Error Rate
  WER requires:
    Transcription (the correct word string)
    Alignment between ASR output and transcript
      Not just left to right matching
  Sometimes Accuracy is given
    100 - WER
    NOT the number of words correct
Word Error Rate
  Can get > 100%
    But something is very wrong
  Outputting “the” only, ignoring the speech
    Sometimes gives WER < 100%
  All words are treated equal
    “This specimen” vs “The specimen”
    “Is absent” vs “Is present”
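
The WER definition and the alignment requirement above can be sketched as a word-level edit distance. This is a minimal illustration, not a scoring tool: the function name `word_error_rate` is made up here, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent between a reference transcript and ASR output.

    Assumption: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn the first i reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[-1][-1] / len(ref)


# One substitution in four words, but it flips the meaning.
print(word_error_rate("the specimen is absent", "the specimen is present"))
# Three insertions against a two-word reference: WER above 100%.
print(word_error_rate("is absent", "this is absent now please"))
# Ignoring the speech and outputting "the" still scores below 100%.
print(word_error_rate("the cat sat on the mat", "the"))
```

Because the denominator is the reference length, insertions can push the score above 100%, and a single substitution costs the same whether it changes "this" to "the" or "absent" to "present".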
Signal Acquisition
  High signal quality
  Lower sample rate will increase WER
    8 kHz baseline
    16 kHz -10%
End-Point Detection
  Long silence will likely increase WER
    It will recognize phantom words
  Need to find the speech in the signal
    VAD (Voice Activity Detection)
    Find beginning and end of speech
  Typically do continuous recognition
    Recognize while listening
    But need end point (have to wait)
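
One simple way to find the beginning and end of speech is to threshold short-time energy. The sketch below is an illustration only: the frame length, the threshold, and the function names are assumptions, and practical VADs are considerably more robust to noise.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[start:start + frame_len]) / frame_len
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start_sample, end_sample) of the speech region, or None.

    Assumptions: samples are floats in [-1, 1]; frame_len=160 is 10 ms
    at 16 kHz; the threshold is a hypothetical value, not a tuned one.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    # Everything between the first and last active frame is kept as speech.
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic signal: silence, a loud alternating burst, then silence.
signal = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
print(find_endpoints(signal))
```

This version needs the whole signal before it can report the end point, which is the "have to wait" trade-off noted above.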
Feature normalization
  Sometimes do normalization
    Remove mean from MFCCs
    Can make recognition more reliable in noise
  Often include deltas and delta deltas
  Sometimes do feature reduction
    Principal Component Analysis
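
Mean removal and delta features can be sketched on plain lists of feature vectors. The function names are made up for this example, and the delta here is a simple two-point difference; toolkits typically use a regression over a wider window.

```python
def mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    count = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


def deltas(frames):
    """Two-point difference (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, last)]
        result.append([(b - a) / 2 for a, b in zip(previous, following)])
    return result


def with_deltas(frames):
    """Append deltas and delta-deltas to each static feature vector."""
    first = deltas(frames)
    second = deltas(first)
    return [s + d + dd for s, d, dd in zip(frames, first, second)]


# Toy 2-dimensional "MFCC" frames, invented for illustration.
mfccs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_normalize(mfccs)
features = with_deltas(normalized)
print(normalized)
print(len(features[0]))  # static + delta + delta-delta dimensions
```

Each output frame is three times the static dimension, which is one reason feature reduction is sometimes applied afterwards.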
What phones/segments
  Need the best set for discrimination
    Not necessarily the same as Linguistic Phones
  More phones means more training
    And needs a consistent lexicon
  Extra phones
    t vs dx
    t vs nx: / t w eh n t iy / vs / t w eh nx iy /
    Stops as closures and bursts
    Schwas: ax and ix
    Syllabics: el, em, en
    Accents/Tones: ah1, ah0, …
Context dependency
  Care about the contexts of each phone
    Post-vocalic /r/, /n/ and /m/ affect the vowel
    Utterance start and end affect phonemes
  Need more than simple phone models
Tri-phone Models
  Have models for each phone and context
    43^3 contexts, about 80K models
  Not all contexts have enough examples
    oy (oy) oy very rare
    sh (ax) n very common
  Merge tri-phones that are similar
    E.g. t(ih)n with d(ih)n
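
The count above follows from 43 phones in each of three positions. The sketch below checks the arithmetic and expands a phone string into tri-phones in the left(center)right notation used on this slide. Padding utterance edges with a `sil` symbol is an assumption of this example.

```python
from collections import Counter

NUM_PHONES = 43
print(NUM_PHONES ** 3)  # number of possible left/center/right combinations


def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left(center)right tri-phone labels.

    Assumption: utterance edges are padded with a boundary symbol.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Counting tri-phones over training data shows which contexts are too rare
# to train on their own. This single word is a stand-in for real data.
counts = Counter(to_triphones("t w eh n t iy".split()))
print(counts.most_common())
```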
Find phones to merge
  Using phonetic features
    Most similar features, most similar acoustics
    Stops, voicing, vowel type …
  Usually automatic clustering of triphones
    Using CART trees indexed by phonetic features
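
To illustrate the idea of merging by phonetic features, the sketch below groups tri-phones whose left and right contexts share the same manner and place class. It is a fixed hand-written rule, not the data-driven CART clustering mentioned above, and the small feature table is hypothetical and deliberately ignores voicing.

```python
from collections import defaultdict

# Hypothetical (manner, place) table covering only a few phones.
PHONE_FEATURES = {
    "t": ("stop", "alveolar"),
    "d": ("stop", "alveolar"),
    "p": ("stop", "labial"),
    "b": ("stop", "labial"),
    "n": ("nasal", "alveolar"),
    "m": ("nasal", "labial"),
}


def context_class(phone):
    """Feature class of a context phone; unknown phones form their own class."""
    return PHONE_FEATURES.get(phone, (phone,))


def cluster_triphones(triphones):
    """Group left(center)right labels that share center and context classes."""
    clusters = defaultdict(list)
    for label in triphones:
        left, rest = label.split("(")
        center, right = rest.split(")")
        key = (context_class(left), center, context_class(right))
        clusters[key].append(label)
    return list(clusters.values())


print(cluster_triphones(["t(ih)n", "d(ih)n", "p(ih)n"]))
```

A decision tree would instead choose which feature questions to ask from the training data, splitting only where the acoustics differ enough.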
Adaptation
  Change behavior after use
  Human adaptation
    They will change how they speak
  Channel adaptation
    Cepstral Normalization
  Model adaptation
    Move the means (or weights on means)
Adaptation
  Assume recognition is correct
    (Maybe with some threshold)
  Modify model to make answer more correct
  Adaptation to speaker characteristics
  Adaptation to speaker style
  Can improve accuracy by a few %
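
One common way to "move the means" is a MAP-style interpolation between the existing mean and the mean of the new speaker's data. The sketch below shows that formula for a single Gaussian mean. It is not necessarily the method used in this course: the function name, the prior weight `tau`, and the assumption that every frame belongs to this Gaussian (the "assume recognition is correct" step) are all choices made for the example.

```python
def adapt_mean(prior_mean, frames, tau=10.0):
    """Shift a Gaussian mean toward adaptation frames.

    new_mean = (tau * prior_mean + sum(frames)) / (tau + number_of_frames)

    tau is a hypothetical prior weight: larger values trust the original
    model more, so more adaptation data is needed to move the mean.
    """
    count = len(frames)
    sums = [sum(frame[d] for frame in frames) for d in range(len(prior_mean))]
    return [
        (tau * prior_mean[d] + sums[d]) / (tau + count)
        for d in range(len(prior_mean))
    ]


# With as many frames as the prior weight, the mean moves halfway.
speaker_frames = [[1.0, 2.0]] * 10
print(adapt_mean([0.0, 0.0], speaker_frames, tau=10.0))
```

With no adaptation frames the mean is unchanged, and with many frames it approaches the speaker's own average.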
Pronunciation lexicon
  Need list of words and their pronunciation
    Pencil  p eh n s ih l
    Two     t uw
    Too     t uw
    …
  Need pronunciation of ALL words
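
A lexicon can be held as a mapping from words to phone lists, which also makes it easy to check that every word to be recognized has an entry. This is a minimal sketch using the entries above. The helper name `out_of_vocabulary` and the lowercasing of words are assumptions.

```python
LEXICON = {
    "pencil": ["p", "eh", "n", "s", "ih", "l"],
    "two": ["t", "uw"],
    "too": ["t", "uw"],
}


def out_of_vocabulary(words, lexicon):
    """Return the words that have no pronunciation in the lexicon."""
    return [word for word in words if word.lower() not in lexicon]


# "to" is missing even though its homophones "two" and "too" are present.
print(out_of_vocabulary(["Two", "pencil", "to"], LEXICON))
```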
What’s a word
  Basic words are clear
  What about morphological variants
    walk, walks, walked, walking
  Multi-word words
    Los Angeles, New York
  Contractions
    Wanna, gonna …
  Yes, ALL words that you will recognize
Pronunciation variants
  Homographs: (same writing, different pronunciation)
    bass: / b ae s / (fish) / b ey s / (music)
    project: N / p r aa jh eh k t / V / p r ax jh eh k t /
  Natural variants
    route: / r uw t / and / r aw t /
    coupon: / k uw p ao n / and / k y uw p ao n /
    water: / w ao t er / and / w ao dx er /
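
Variants mean a lexicon entry holds a list of pronunciations rather than one, and the same pronunciation may belong to several words. The sketch below uses pronunciations from these slides. The dictionary layout and the helper `words_for` are assumptions for illustration.

```python
# Each word maps to all of its pronunciations, written as phone strings.
VARIANT_LEXICON = {
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
    "right": ["r ay t"],
    "write": ["r ay t"],
    "wright": ["r ay t"],
}


def words_for(pronunciation, lexicon):
    """Return every word that can be pronounced this way (homophones)."""
    return sorted(
        word for word, variants in lexicon.items() if pronunciation in variants
    )


print(VARIANT_LEXICON["route"])                # both natural variants
print(words_for("r ay t", VARIANT_LEXICON))    # homophones share one phone string
```

The acoustics alone cannot choose between the words returned by `words_for`; that decision is left to the surrounding word context.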