Speech Processing 15-492/18-492: Speech Recognition, Acoustic Modeling and Pronunciation Dictionary

  1. Speech Processing 15-492/18-492: Speech Recognition
     - Acoustic modeling
     - Pronunciation dictionary

  2. Acoustic Modeling
     - Speech and signal variability
     - Measuring error
     - Pronunciation lexicons

  3. Variability in Speech Signal
     - “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
     - Homophones: same pronunciation
       - “wright”, “right”, “write”: / r ay t /
       - “ford or”, “four door”: / f ao r d ao r /
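
The homophone examples above can be made concrete with a small sketch. The toy lexicon below is an assumption built only from the phone strings on the slide, and `group_homophones` is an illustrative name, not part of any toolkit.

```python
from collections import defaultdict

# Toy lexicon using the pronunciations quoted on the slide (illustrative only).
LEXICON = {
    "wright": "r ay t",
    "right": "r ay t",
    "write": "r ay t",
    "ford or": "f ao r d ao r",
    "four door": "f ao r d ao r",
}


def group_homophones(lexicon):
    """Map each phone string to the sorted list of entries that share it."""
    groups = defaultdict(list)
    for entry, phones in lexicon.items():
        groups[phones].append(entry)
    return {phones: sorted(entries) for phones, entries in groups.items()}


HOMOPHONES = group_homophones(LEXICON)
for phones, entries in HOMOPHONES.items():
    print(f"/ {phones} / -> {entries}")
```

Because every entry in a group has the same phone string, the acoustics alone cannot tell the entries apart; something beyond the phone sequence has to pick between them.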

  4. Style Variability
     - Different articulation in different situations
     - Clear vs conversational
     - Whisper vs shouting
     - Talking to machine, talking to others
     - Frustrated speech

  5. Speaker variability
     - Gender, age, dialect, health
     - Speaker-dependent systems
     - Speaker-independent systems
     - Speaker-adaptive systems
       - Enrolment stage (acoustics and language)

  6. Environment Variability
     - Different background noises
       - Office vs outside
     - Different applications, different environments
       - Desktop dictation to warehouse pick
     - Single speaker vs multispeaker
     - Background music

  7. Channel Variability
     - Telephone vs desktop
       - 8 kHz vs 16 kHz
     - PDA vs desktop
     - Close-talking vs far-field
     - Cell phone vs landline

  8. Measuring Speech Recognition Error
     - Word Error Rate (WER)
       - Substitutions: word is replaced
       - Deletions: word is missed out
       - Insertions: word is added
     - $\mathrm{WER} = 100\% \times \dfrac{\mathrm{Subs} + \mathrm{Dels} + \mathrm{Ins}}{\text{words in correct sentence}}$
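
The WER definition above translates directly into a few lines of code. This is a minimal sketch that assumes the three error counts are already known; the function name `wer_from_counts` is made up for illustration.

```python
def wer_from_counts(subs, dels, ins, ref_words):
    """Word error rate in percent, from error counts and reference length."""
    if ref_words <= 0:
        raise ValueError("the reference must contain at least one word")
    return 100.0 * (subs + dels + ins) / ref_words


# Hypothetical counts: 1 substitution and 1 deletion against a 10-word reference.
print(wer_from_counts(subs=1, dels=1, ins=0, ref_words=10))
```

The denominator is the length of the correct transcript, not of the recognizer output, which is why insertions can push the value past 100%.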

  9. Word Error Rate
     - WER requires:
       - Transcription (the correct word string)
       - Alignment between ASR output and transcript
       - Not just left-to-right matching
     - Sometimes accuracy is given
       - 100 - WER
       - NOT number of words correct
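
To see why an alignment is needed rather than position-by-position matching, the sketch below compares the two. It assumes the usual minimum-edit-distance alignment over words with unit costs, which the slide does not spell out; the function names and the example sentences are illustrative.

```python
def word_errors(reference, hypothesis):
    """Fewest substitutions + deletions + insertions turning reference into hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # Before processing reference word i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # hyp is empty: delete all i reference words
        for j, h in enumerate(hyp, start=1):
            substitute = prev[j - 1] + (r != h)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def positional_mismatches(reference, hypothesis):
    """Naive left-to-right comparison, shown only for contrast."""
    ref, hyp = reference.split(), hypothesis.split()
    same = sum(r == h for r, h in zip(ref, hyp))
    return max(len(ref), len(hyp)) - same


REF = "the cat sat on the mat"
HYP = "cat sat on the mat"  # the first word was missed
print(word_errors(REF, HYP), positional_mismatches(REF, HYP))
```

With one dropped word at the start, the aligned count is a single deletion, while the naive comparison marks every position as wrong.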

  10. Word Error Rate
      - Can get > 100%
        - But something is very wrong
      - Outputting “the” only, ignoring the speech
        - Sometimes gives WER < 100%
      - All words are treated equal
        - “This specimen” vs “The specimen”
        - “Is absent” vs “Is present”
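
Two worked cases illustrate these points. Both sentence pairs are made up for illustration and are not from the lecture. In the first, the correct transcript is the single word "yes" and the recognizer outputs "oh yes please" (two insertions). In the second, the transcript is "the cat sat on the mat" and the recognizer outputs only "the" (five deletions).

```latex
\begin{align*}
\mathrm{WER}_1 &= 100\% \times \frac{0 + 0 + 2}{1} = 200\% \\
\mathrm{WER}_2 &= 100\% \times \frac{0 + 5 + 0}{6} \approx 83.3\%
\end{align*}
```

The second recognizer ignores the speech entirely and still scores better than the first, which is one reason WER should be read with care.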

  11. Signal Acquisition
      - High signal quality
        - Lower sample rate will increase WER
        - 8 kHz baseline
        - 16 kHz: -10%

  12. End-Point Detection
      - Long silence will likely increase WER
        - It will recognize phantom words
      - Need to find the speech in the signal
        - VAD (Voice Activity Detection)
        - Find beginning and end of speech
      - Typically do continuous recognition
        - Recognize while listening
        - But need end point (have to wait)
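
Finding the beginning and end of speech can be sketched with a simple energy threshold. This is an assumption made for illustration: the slide does not say which VAD method is meant, and practical detectors are more elaborate. The frame length and threshold below are toy values.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(samples, frame_len, threshold):
    """Return (start, end) sample indices of the speech region, or None if silent."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Toy signal: silence, a burst of "speech", silence.
SIGNAL = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
print(find_endpoints(SIGNAL, frame_len=4, threshold=0.01))
```

Only the samples between the two returned indices would be passed to the recognizer, so the leading and trailing silence cannot be decoded as phantom words.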

  13. Feature normalization
      - Sometimes do normalization
        - Remove mean from MFCCs
        - Can make recognition more reliable in noise
      - Often include deltas and delta-deltas
      - Sometimes do feature reduction
        - Principal Component Analysis
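
Mean removal and deltas can be sketched on plain lists of feature vectors. The code assumes a non-empty utterance and uses a simple central difference for the deltas; real front ends typically use a longer regression window, so treat this as an illustration rather than a reference implementation. All function names are made up.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def deltas(frames):
    """Central difference between neighbouring frames; edge frames are repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        before = frames[max(t - 1, 0)]
        after = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(before, after)])
    return out


def add_deltas(frames):
    """Append deltas and delta-deltas to each frame (triples the dimension)."""
    d = deltas(frames)
    dd = deltas(d)
    return [f + a + b for f, a, b in zip(frames, d, dd)]


FRAMES = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 frames, 2 coefficients
print(mean_normalize(FRAMES))
print(add_deltas(FRAMES)[1])
```

After normalization each coefficient averages to zero over the utterance, which removes a constant offset such as a fixed channel effect.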

  14. What phones/segments
      - Need the best set for discrimination
      - Not necessarily the same as linguistic phones
      - More phones means more training
      - And needs to have a consistent lexicon
      - Extra phones
        - t vs dx
        - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
        - Stops as closures and bursts
        - Schwas: ax and ix
        - Syllabics: el, em, en
        - Accents/tones: ah1, ah0, …

  15. Context dependency
      - Care about the contexts of each phone
        - Post-vocalic /r/ and /n/ /m/ affect vowel
        - Utterance start and end affect phonemes
      - Need more than simple phone models

  16. Tri-phone Models
      - Have models for each phone and context
        - 43^3 contexts, about 80K models
      - Not all contexts have enough examples
        - oy(oy)oy very rare
        - sh(ax)n very common
      - Merge tri-phones that are similar
        - E.g. t(ih)n with d(ih)n
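
The count and the left(center)right notation above can be reproduced in a short sketch. The 43-phone inventory size comes from the slide; padding utterance edges with a `sil` symbol is an assumption made here for illustration.

```python
NUM_PHONES = 43  # inventory size quoted on the slide


def count_triphone_contexts(num_phones):
    """Every phone can appear with any left and any right neighbour."""
    return num_phones ** 3


def to_triphones(phones, boundary="sil"):
    """Label each phone as left(center)right, padding the edges with `boundary`."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


print(count_triphone_contexts(NUM_PHONES))
print(to_triphones(["t", "ih", "n"]))
```

The exact count is 79,507, which the slide rounds to about 80K; most of those contexts never occur often enough in training data to be modeled on their own.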

  17. Find phones to merge
      - Using phonetic features
        - Most similar features, most similar acoustics
        - Stops, voicing, vowel type …
      - Usually automatic clustering of triphones
        - Using CART trees indexed by phonetic features
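
The idea of merging by phonetic features can be sketched without a full CART tree. The code below is a deliberate simplification, not the tree-based clustering named on the slide: it ties triphones whose contexts fall into the same hand-written broad class. The feature table is hypothetical and covers only a handful of phones.

```python
# Hypothetical, heavily simplified feature table: phone -> (manner, place).
# Voicing is left out on purpose so that t and d share a class.
BROAD_CLASS = {
    "t": ("stop", "alveolar"),
    "d": ("stop", "alveolar"),
    "p": ("stop", "labial"),
    "b": ("stop", "labial"),
    "n": ("nasal", "alveolar"),
    "m": ("nasal", "labial"),
}


def tie_key(left, center, right):
    """Contexts are replaced by their broad class; unknown phones stay as-is."""
    return (BROAD_CLASS.get(left, left), center, BROAD_CLASS.get(right, right))


def tie_triphones(triphones):
    """Group (left, center, right) triphones that share a tie key."""
    tied = {}
    for tri in triphones:
        tied.setdefault(tie_key(*tri), []).append(tri)
    return list(tied.values())


GROUPS = tie_triphones([("t", "ih", "n"), ("d", "ih", "n"), ("p", "ih", "n")])
print(GROUPS)
```

Here t(ih)n and d(ih)n end up sharing one model while p(ih)n stays separate. A data-driven tree would instead choose which feature questions to ask based on the training data.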

  18. Adaptation
      - Change behavior after use
      - Human adaptation
        - They will change how they speak
      - Channel adaptation
        - Cepstral normalization
      - Model adaptation
        - Move the means (or weights on means)

  19. Adaptation
      - Assume recognition is correct
        - (Maybe with some threshold)
      - Modify model to make answer more correct
        - Adaptation to speaker characteristics
        - Adaptation to speaker style
        - Can improve accuracy by a few %
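
One way to picture "assume recognition is correct, maybe with some threshold" is a mean update that only uses confidently recognized frames. The slides do not name a specific method; the sketch below assumes a MAP-style interpolation between the old mean and the new data, and the names `adapt_mean`, `min_confidence` and `tau` are illustrative.

```python
def adapt_mean(old_mean, observations, confidences, min_confidence=0.9, tau=10.0):
    """Shift a Gaussian mean toward observations from confidently recognized frames.

    tau behaves like a count of pseudo-observations backing the old mean:
    a larger tau means slower adaptation.
    """
    kept = [x for x, c in zip(observations, confidences) if c >= min_confidence]
    dims = len(old_mean)
    sums = [sum(x[d] for x in kept) for d in range(dims)]
    return [(tau * old_mean[d] + sums[d]) / (tau + len(kept)) for d in range(dims)]


# Two confident frames pull the mean; the low-confidence outlier is ignored.
NEW_MEAN = adapt_mean(
    old_mean=[0.0],
    observations=[[2.0], [2.0], [100.0]],
    confidences=[0.95, 0.99, 0.2],
    tau=2.0,
)
print(NEW_MEAN)
```

If no frame passes the threshold, the mean is returned unchanged, so a badly recognized utterance cannot drag the model away.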

  20. Pronunciation lexicon
      - Need list of words and their pronunciation
        - Pencil p eh n s ih l
        - Two t uw
        - Too t uw
        - …
      - Need pronunciation of ALL words
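
A lexicon of this kind is essentially a lookup table from words to phone sequences. The sketch below uses only the three entries shown on the slide; the function name `phones_for` and the choice to raise an error on a missing word are assumptions for illustration.

```python
LEXICON = {
    "pencil": ["p", "eh", "n", "s", "ih", "l"],
    "two": ["t", "uw"],
    "too": ["t", "uw"],
}


def phones_for(sentence, lexicon=LEXICON):
    """Concatenate the phone sequences of every word; fail on unknown words."""
    phones = []
    for word in sentence.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(lexicon[word])
    return phones


print(phones_for("Two pencil"))
```

The hard failure on an unknown word mirrors the requirement on the slide: a word without a lexicon entry cannot be recognized at all.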

  21. What’s a word
      - Basic words are clear
      - What about morphological variants
        - walk, walks, walked, walking
      - Multi-word words
        - Los Angeles, New York
      - Contractions
        - Wanna, gonna …
      - Yes, ALL words that you will recognize

  22. Pronunciation variants
      - Homographs (same writing, different pronunciation)
        - bass: / b ae s / (fish), / b ey s / (music)
        - project: N / p r aa jh eh k t /, V / p r ax jh eh k t /
      - Natural variants
        - route: / r uw t / and / r aw t /
        - coupon: / k uw p ao n / and / k y uw p ao n /
        - water: / w ao t er / and / w ao dx er /
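
When words have several pronunciations, a word sequence maps to several phone sequences. The sketch below enumerates them for a toy lexicon taken from the slide's examples; the `|` word separator and the function name `pronunciations` are choices made here for readability.

```python
from itertools import product

VARIANTS = {
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
}


def pronunciations(sentence, lexicon=VARIANTS):
    """All phone strings for a word sequence, one per combination of variants."""
    options = [lexicon[word] for word in sentence.lower().split()]
    return [" | ".join(choice) for choice in product(*options)]


for p in pronunciations("water route"):
    print(p)
```

The number of combinations multiplies with every multi-variant word, so two words with two variants each already give four candidate phone strings.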
