Speech Processing 15-492/18-492: Speech Recognition Systems, Other ASR Techniques

  1. Speech Processing 15-492/18-492: Speech Recognition Systems, Other ASR techniques

  2. ASR Systems
     - How good are they?
       - Expected ASR
       - Factors that make things worse
     - How good do they need to be?
       - What can you do with low WER?

  3. ASR Tasks

  4. What makes it worse
     - Channel
       - Telephone vs wide band
       - Close-talking vs far-field
     - Style:
       - Command and control
       - Limited information getting
       - Limited domain but general speech
       - Machine-directed vs human-directed speech
       - Broadcast (performance) vs conversational
       - Single vs dialog vs multiperson

  5. Expected WER: Real-time
     - Command and control
       - Limited vocabulary and directed speech
       - < 10% (< 5% for some users)
     - Simple dialog
       - Machine-directed speech with interested users
       - < 20% (but sometimes works with < 30%)
     - Dictation
       - Single speaker, well performed
       - < 5% for some users, > 30% for (short-term) users
     - Speech-to-speech translation
       - Machine mediated, target domain
       - < 20% (but will vary for different people)
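
The percentages above are word error rates. As a reminder of what is being measured, WER is commonly defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, taken over the minimum-cost word alignment. A minimal sketch follows, assuming whitespace-tokenized text with no normalization of case or punctuation; the function name `word_error_rate` is chosen here for illustration and is not from the slides.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("turn the lights on", "turn lights on"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.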

  6. Expected WER: offline
     - Broadcast News
       - Large vocabulary, well performed
       - < 10% but not real-time (maybe 100 times real time)
     - Conversational speech (Call Home)
       - Large vocabulary, not well performed
       - > 40% WER (depends on particular users and conversations)
     - Information retrieval
       - Large vocabulary, very varied content
       - > 60% can still give useful results
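
"100 times real time" can be read as a real-time factor: processing time divided by audio duration. The sketch below only illustrates that arithmetic; the function name and the example numbers are assumptions, not measurements from the slides.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of decoding time to audio duration (1.0 means exactly real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical case: a 30-second clip that takes 50 minutes to decode.
    print(real_time_factor(processing_seconds=3000.0, audio_seconds=30.0))  # 100.0
```

At that factor, one hour of broadcast audio would need roughly 100 hours of compute on the same machine, which is why this setting is listed as offline.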

  7. Other uses
     - TV show subtitling for the deaf
     - Court transcription
     - Medical dictation
     - Air traffic control transcription

  8. Other ASR techniques
     - Including articulatory/phonetic features (Metze)
     - Build recognizers for
       - Voiced/unvoiced
       - Nasality
       - Closures (quiet part of stops)
       - Aspiration (fricatives)
       - Tongue position
     - Run all in parallel and “join” them
     - Combine with more standard approaches
     - Can be more robust to speaking style

  9. Multi-engine Recognition
     - Use three recognizers and combine results
       - ROVER
       - Combine scores per sentence
       - Combine lattices
       - Confusion networks
     - Cross adaptation
       - Interleave systems with adaptation
     - It usually works better when the systems are different (and both of them are good)
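
The voting idea behind ROVER-style combination can be sketched as a per-slot majority vote. This is a simplification under stated assumptions: real ROVER first aligns the hypotheses into a word transition network, whereas the sketch assumes the three hypotheses are already aligned to the same number of slots, with a hypothetical `GAP` marker where a system produced no word. The names `GAP` and `vote` are illustrative only.

```python
from collections import Counter

GAP = ""  # hypothetical marker: this system emitted no word in this slot


def vote(aligned_hypotheses: list[list[str]]) -> list[str]:
    """Pick the most frequent entry in each slot; drop slots won by GAP."""
    lengths = {len(h) for h in aligned_hypotheses}
    if len(lengths) != 1:
        raise ValueError("hypotheses must be pre-aligned to equal length")

    combined = []
    for slot in zip(*aligned_hypotheses):
        # Ties go to the entry seen first, i.e. the earliest-listed system.
        winner, _count = Counter(slot).most_common(1)[0]
        if winner != GAP:
            combined.append(winner)
    return combined


if __name__ == "__main__":
    hyps = [
        ["turn", "the", "lights", "on"],
        ["turn", GAP, "lights", "on"],
        ["burn", "the", "light", "on"],
    ]
    print(vote(hyps))  # ['turn', 'the', 'lights', 'on']
```

Each system makes a different error here, which is the situation where voting helps; if all three made the same error, the vote would simply reproduce it.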

  10. Whispered Speech
      - Doesn’t disturb other people
      - Can use throat mike
      - Works in noisy environments

  11. Muscle Movement
      - EMG: Electromyographic signals
        - Recognize muscle impulses
      - Can work in noisy environments
      - Can work without you making a noise

  12. Articulatory Movement
      - Attach metal studs to:
        - Lips, teeth, tongue, velum
      - Record movement in magnetic field
        - Non-intrusive

  13. EMA: Electromagnetic articulograph

  14. ASR Summary
      - ASR requires:
        - Acoustic model: HMMs trained from lots of data
        - Pronunciation lexicon: list of pronunciations for words
        - Language model: trigrams trained from lots of data
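
To make "trigrams trained from lots of data" concrete, here is a minimal maximum-likelihood sketch: count each word triple and divide by the count of its two-word history. It assumes whitespace tokenization and uses no smoothing or backoff, which a practical language model would need for unseen word sequences. The function names and the `<s>`/`</s>` padding symbols are illustrative choices, not taken from the slides.

```python
from collections import Counter


def train_trigrams(sentences: list[str]) -> tuple[Counter, Counter]:
    """Count trigrams and their two-word histories over padded sentences."""
    trigram_counts: Counter = Counter()
    history_counts: Counter = Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            trigram_counts[tuple(words[i - 2 : i + 1])] += 1
            history_counts[tuple(words[i - 2 : i])] += 1
    return trigram_counts, history_counts


def trigram_prob(trigrams: Counter, histories: Counter,
                 w1: str, w2: str, w3: str) -> float:
    """Unsmoothed estimate of P(w3 | w1 w2); 0.0 if the history is unseen."""
    seen = histories[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0


if __name__ == "__main__":
    tri, hist = train_trigrams(["call home now", "call home later"])
    print(trigram_prob(tri, hist, "call", "home", "now"))  # 0.5
```

With only two training sentences, almost every trigram is unseen and gets probability zero, which illustrates why the slide stresses lots of data.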

  15. ASR Trade-offs
      - More/better training data
        - Well transcribed and closest to target system
      - Better signal
        - Better microphone, no noise
      - Better speaker
        - Interested party, knows how to speak
      - Time and memory
        - Bigger systems do better
        - Greater CPU does better

  16. Homework 1
      - Build a speech recognition system
        - An acoustic model
        - A pronunciation lexicon
        - A language model
      - Note it takes time to build
      - What is your initial WER?
        - How did you improve it?
      - Submitted by 3:30pm Monday 29th Sep
