Speech Processing 15-492/18-492 Speech Recognition Acoustic - PowerPoint PPT Presentation

Speech Processing 15-492/18-492 Speech Recognition Acoustic modeling Pronunciation dictionary

Acoustic Modeling
- Speech and Signal Variability
- Measuring Error
- Pronunciation lexicons

Variability in Speech Signal
- “Mr Wright should write to Ms Wright right away about his Ford or four door Honda.”
- Homophones: same pronunciation
  - “wright” “right” “write” / r ay t /
  - “ford or” “four door” / f ao r d ao r /

Style Variability
- Different articulation in different situations
- Clear vs Conversational
- Whisper vs shouting
- Talking to machine, talking to others
- Frustrated speech

Speaker variability
- Gender, age, dialect, health
- Speaker dependent systems
- Speaker independent systems
- Speaker adaptive systems
  - Enrolment stage (acoustics and language)

Environment Variability
- Different background noises
  - Office vs Outside
- Different applications, different environments
  - Desktop dictation, to Warehouse pick
- Single speaker vs Multispeaker
- Background music

Channel Variability
- Telephone vs Desktop
  - 8 kHz vs 16 kHz
- PDA vs Desktop
- Close-talking vs far-field
- Cell Phone vs Landline

Measuring Speech Recognition Error
- Word Error Rate
  - Substitutions: word is replaced
  - Deletions: word is missed out
  - Insertions: word is added

WER = 100% x (Subs + Dels + Ins) / (words in correct sentence)

Word Error Rate
- WER requires:
  - Transcription (the correct word string)
  - Alignment between ASR output and Transcript
  - Not just left to right matching
- Sometimes Accuracy is given
  - 100 - WER
  - NOT number of words correct
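
The alignment requirement can be illustrated with a word-level minimum-edit-distance alignment. This is only a sketch: the function names `align_counts` and `wer` and the example sentences are made up for illustration, and real scoring tools also normalize case and punctuation before aligning.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) from a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = fewest edits needed to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diag = cost[i - 1][j - 1] + int(ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to label each edit.
    subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref[i - 1] != hyp[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate in percent; assumes a non-empty reference."""
    subs, dels, ins = align_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / len(reference.split())


reference = "mr wright should write to ms wright"
hypothesis = "mr right should write ms wright now"
counts = align_counts(reference, hypothesis)
score = wer(reference, hypothesis)
accuracy = 100.0 - score
```

A plain left-to-right comparison of these two sentences would count four mismatches, while the alignment finds one substitution, one deletion and one insertion.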

Word Error Rate
- Can get > 100%
  - But something is very wrong
- Outputting “the” only, ignoring the speech
  - Sometimes gives WER < 100%
- All words are treated equal
  - “This specimen” vs “The specimen”
  - “Is absent” vs “Is present”
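
These edge cases can be checked numerically. The sketch below uses a compact edit-distance count; `word_errors`, `wer_percent` and the test sentences are hypothetical names and data chosen for illustration.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (word level)."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, row[j] = row[j], min(
                prev_diag + int(r != h),  # match or substitution
                row[j] + 1,               # deletion
                row[j - 1] + 1,           # insertion
            )
    return row[-1]


def wer_percent(reference, hypothesis):
    """WER in percent; assumes a non-empty reference."""
    return 100.0 * word_errors(reference, hypothesis) / len(reference.split())


# Many insertions against a short reference push WER above 100%.
over_100 = wer_percent("yes", "the the the")

# Outputting only "the" still scores below 100% on this reference.
only_the = wer_percent("the specimen is absent", "the")

# Both errors cost one word, although the second one reverses the meaning.
harmless = wer_percent("the specimen is absent", "this specimen is absent")
serious = wer_percent("the specimen is absent", "the specimen is present")
```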

Signal Acquisition
- High quality signal
  - Lower sample rate will increase WER
  - 8 kHz baseline
  - 16 kHz: -10%

End-Point Detection
- Long silence will likely increase WER
  - It will recognize phantom words
- Need to find the speech in the signal
  - VAD (Voice Activity Detection)
  - Find beginning and end of speech
- Typically do continuous recognition
  - Recognized while listening
  - But need end point (have to wait)
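
One simple way to find the beginning and end of speech is to threshold short-time energy. The sketch below is an assumption-laden illustration, not the method used in any particular recognizer: the frame length (160 samples, i.e. 10 ms at 16 kHz), the energy threshold and the minimum run length are all made-up values, and practical VADs use more robust features.

```python
def find_endpoints(samples, frame_len=160, threshold=0.01, min_speech_frames=3):
    """Return (start_sample, end_sample) of detected speech, or None if none found."""
    # Mean energy of each non-overlapping frame.
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)

    first = last = None
    run = 0  # length of the current run of active frames
    for idx, energy in enumerate(energies):
        run = run + 1 if energy > threshold else 0
        # Only trust runs that are long enough, so isolated clicks are ignored.
        if run >= min_speech_frames:
            if first is None:
                first = idx - run + 1
            last = idx
    if first is None:
        return None
    return first * frame_len, (last + 1) * frame_len


# Synthetic signal: silence, a loud segment, then silence again.
signal = [0.0] * 800 + [0.5, -0.5] * 400 + [0.0] * 800
endpoints = find_endpoints(signal)

# A single loud frame is too short to count as speech.
click = [0.0] * 800 + [0.5] * 160 + [0.0] * 800
click_endpoints = find_endpoints(click)
```

Only the samples between the two endpoints would then be passed to the recognizer, which avoids decoding long silences.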

Feature normalization
- Sometimes do normalization
  - Remove mean from MFCCs
  - Can make recognition more reliable in noise
- Often include deltas and delta deltas
- Sometimes do feature reduction
  - Principal Component Analysis
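
Mean removal and delta features can be sketched on plain lists of feature frames. The function names are hypothetical, the input is assumed to be already computed MFCC-like vectors, and the delta here is a simple two-frame difference; toolkits typically use a regression over a wider window.

```python
def subtract_mean(frames):
    """Remove the per-coefficient mean computed over the whole utterance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Half the difference between the next and previous frame (edges repeated)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(frames):
    """Append deltas and delta-deltas to every frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy two-coefficient "cepstra" for three frames.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = subtract_mean(frames)
full = add_dynamic_features(normalized)
```

Each output frame is three times as long as the input frame: the static coefficients, then deltas, then delta-deltas.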

What phones/segments
- Need the best set for discrimination
- Not necessarily the same as Linguistic Phones
- More phones means more training
- And needs to have consistent Lexicon
- Extra phones
  - t vs dx
  - t vs nx: / t w eh n t iy / vs / t w eh nx iy /
  - Stops as closures and bursts
  - Schwas: ax and ix
  - Syllabics: el, em, en
  - Accents/Tones: ah1, ah0, …

Context dependency
- Care about the contexts of each phone
  - Post vocalic /r/ and /n/ /m/ affect vowel
  - Utterance start and end affect phonemes
- Need more than simple phone models

Tri-phone Models
- Have models for each phone and context
  - 43^3 contexts, about 80K models
- Not all contexts have enough examples
  - oy(oy)oy very rare
  - sh(ax)n very common
- Merge tri-phones that are similar
  - E.g. t(ih)n with d(ih)n
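
The model count and the sparsity problem can be made concrete by enumerating triphones in a few pronunciations. This is a sketch: the `left(center)right` labels follow the notation on the slide, while the `sil` boundary symbol and the four example pronunciations are assumptions added for illustration.

```python
from collections import Counter

# 43 phones in each of three positions.
possible_triphones = 43 ** 3


def triphones(phones):
    """Label every phone with its left and right neighbour, e.g. 'sh(ax)n'."""
    padded = ["sil"] + list(phones) + ["sil"]  # 'sil' marks utterance boundaries
    return [
        f"{padded[i - 1]}({padded[i]}){padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical training pronunciations.
training = [
    "n ey sh ax n".split(),
    "s t ey sh ax n".split(),
    "t ih n".split(),
    "d ih n".split(),
]

counts = Counter()
for pron in training:
    counts.update(triphones(pron))

# Triphones seen fewer than twice are candidates for merging.
rare = sorted(t for t, c in counts.items() if c < 2)
```

Even in this tiny sample only a handful of the roughly 80K possible triphones occur at all, and most of those occur once.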

Find phones to merge
- Using phonetic features
  - Most similar features, most similar acoustics
  - Stops, voicing, vowel type …
- Usually automatic clustering of triphones
  - Using CART trees indexed by phonetic features
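
The idea of merging by phonetic features can be sketched with a hand-written feature table. Note that this is a deliberate simplification and not CART clustering: a real system learns which feature questions to ask from data, whereas here the grouping rule (same manner and place for both context phones) and the `FEATURES` table are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical feature table: phone -> (manner, place). Voicing is left out
# on purpose, so t and d fall into the same class.
FEATURES = {
    "t": ("stop", "alveolar"),
    "d": ("stop", "alveolar"),
    "p": ("stop", "labial"),
    "b": ("stop", "labial"),
    "n": ("nasal", "alveolar"),
    "m": ("nasal", "labial"),
}


def tied_key(triphone):
    """Describe a triphone by its centre phone and the features of its contexts."""
    left, center, right = triphone
    return (FEATURES[left], center, FEATURES[right])


def merge_similar(triphone_list):
    """Group triphones whose contexts have identical features."""
    groups = defaultdict(list)
    for tri in triphone_list:
        groups[tied_key(tri)].append(tri)
    return list(groups.values())


groups = merge_similar([
    ("t", "ih", "n"),
    ("d", "ih", "n"),
    ("p", "ih", "n"),
    ("b", "ih", "m"),
])
```

With this table t(ih)n and d(ih)n share one model, while p(ih)n and b(ih)m stay separate because their contexts differ in place of articulation.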

Adaptation
- Change behavior after use
- Human adaptation
  - They will change how they speak
- Channel adaptation
  - Cepstral Normalization
- Model adaptation
  - Move the means (or weights on means)

Adaptation
- Assume recognition is correct
  - (Maybe with some threshold)
- Modify model to make answer more correct
  - Adaptation to speaker characteristics
  - Adaptation to speaker style
  - Can improve accuracy by a few %
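
A minimal sketch of "move the means" under a confidence threshold follows. The slides do not say which update rule is meant; the interpolation below is a MAP-style mean update, one common choice, and the parameter names `threshold` and `tau` with their default values are assumptions.

```python
def adapt_mean(mean, frames, confidence, threshold=0.8, tau=10.0):
    """Shift a Gaussian mean towards frames that were recognized as this model.

    mean:       current mean vector of the model
    frames:     feature vectors the recognizer aligned to this model
    confidence: recognizer confidence for the utterance (0..1, hypothetical scale)
    tau:        weight of the old mean, expressed as a pseudo-count of frames
    """
    # Skip adaptation when the recognition result is not trusted.
    if confidence < threshold or not frames:
        return list(mean)
    n = len(frames)
    return [
        (tau * m + sum(f[d] for f in frames)) / (tau + n)
        for d, m in enumerate(mean)
    ]


old_mean = [0.0, 0.0]
speaker_frames = [[2.0, 4.0]] * 10

adapted = adapt_mean(old_mean, speaker_frames, confidence=0.9)
unchanged = adapt_mean(old_mean, speaker_frames, confidence=0.5)
```

With ten adaptation frames and `tau=10` the mean moves halfway towards the speaker's data; more data moves it further.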

Pronunciation lexicon
- Need list of words and their pronunciation
  - Pencil p eh n s ih l
  - Two t uw
  - Too t uw
  - …
- Need pronunciation of ALL words
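
A lexicon in the one-entry-per-line form shown above can be loaded into a dictionary and checked against the recognizer's vocabulary. The sketch assumes a whitespace-separated format (word first, then phones); `load_lexicon` and `missing_words` are hypothetical helper names.

```python
LEXICON_TEXT = """\
pencil p eh n s ih l
two t uw
too t uw
"""


def load_lexicon(text):
    """Parse 'word phone phone ...' lines into {word: [phones]}."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *phones = line.split()
        lexicon[word.lower()] = phones
    return lexicon


def missing_words(lexicon, vocabulary):
    """Words the recognizer should know but that have no pronunciation."""
    return sorted({w.lower() for w in vocabulary} - set(lexicon))


lexicon = load_lexicon(LEXICON_TEXT)
missing = missing_words(lexicon, ["Two", "pencil", "pen"])
```

Any word reported as missing cannot be recognized until a pronunciation is added for it.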

What’s a word
- Basic words are clear
- What about morphological variants
  - walk, walks, walked, walking
- Multi-word words
  - Los Angeles, New York
- Contractions
  - Wanna, gonna …
- Yes, ALL words that you will recognize

Pronunciation variants
- Homographs: (same writing, different pronunciation)
  - bass: / b ae s / (fish) / b ey s / (music)
  - project: N / p r aa jh eh k t / V / p r ax jh eh k t /
- Natural variants
  - route: / r uw t / and / r aw t /
  - coupon: / k uw p ao n / and / k y uw p ao n /
  - water: / w ao t er / and / w ao dx er /
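
Variants fit naturally into a lexicon that maps each word to a list of pronunciations. The sketch below uses entries taken from the slides; the data layout and the `homophones` helper are assumptions for illustration.

```python
from collections import defaultdict

# Each word maps to one or more pronunciations (space-separated phones).
LEXICON = {
    "bass": ["b ae s", "b ey s"],
    "route": ["r uw t", "r aw t"],
    "water": ["w ao t er", "w ao dx er"],
    "wright": ["r ay t"],
    "right": ["r ay t"],
    "write": ["r ay t"],
    "two": ["t uw"],
    "too": ["t uw"],
}


def homophones(lexicon):
    """Map each pronunciation shared by several words to those words."""
    by_pron = defaultdict(list)
    for word, prons in lexicon.items():
        for pron in prons:
            by_pron[pron].append(word)
    return {p: sorted(ws) for p, ws in by_pron.items() if len(ws) > 1}


# Words with more than one pronunciation (homographs or natural variants).
multi_pron = sorted(w for w, prons in LEXICON.items() if len(prons) > 1)
shared = homophones(LEXICON)
```

The acoustic model alone cannot separate the words listed under one shared pronunciation; choosing between them falls to the language model.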
