Speech Processing 15-492/18-492 Speech Recognition Systems Other ASR techniques
ASR Systems
  - How good are they?
    - Expected ASR
    - Factors that make things worse
  - How good do they need to be?
    - What can you do with low WER?
ASR Tasks
What makes it worse
  - Channel
    - Telephone vs wide band
    - Close-talking vs far-field
  - Style:
    - Command and Control
    - Limit information getting
    - Limit domain but general speech
    - Machine directed vs human directed speech
    - Broadcast (performance) vs conversational
    - Single vs dialog vs multiperson
Expected WER: Real-time
  - Command and Control
    - Limited vocabulary and directed speech
    - < 10% (< 5% for some users)
  - Simple Dialog
    - Machine directed speech with interested users
    - < 20% (but sometimes works with < 30%)
  - Dictation
    - Single speaker, well performed
    - < 5% for some users
    - > 30% for (short term) users
  - Speech-to-Speech Translation
    - Machine mediated, target domain
    - < 20% (but will vary for different people)
Expected WER: offline
  - Broadcast News
    - Large vocabulary, well performed
    - < 10% but not real-time (maybe 100 times real time)
  - Conversational Speech (Call Home)
    - Large vocabulary, not well performed
    - > 40% WER (depends on particular users and conversations)
  - Information retrieval
    - Large vocabulary, very varied content
    - > 60% can still give useful results
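The WER figures on these slides are word error rates: the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows, assuming whitespace-tokenized text with no case or punctuation normalization; the function name `word_error_rate` is illustrative and not part of any course toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("the") and one substitution ("off" -> "of") over 4 words.
    print(word_error_rate("turn the lights off", "turn lights of"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.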
Other uses
  - TV show subtitling for the deaf
  - Court transcription
  - Medical dictation
  - Air traffic control transcription
Other ASR techniques
  - Including Articulatory/Phonetic Features (Metze)
    - Build recognizers for
      - Voiced/unvoiced
      - Nasality
      - Closures (quiet part of stops)
      - Aspiration (Fricatives)
      - Tongue position
    - Run all in parallel and “join” them
    - Combine with more standard approaches
    - Can be more robust to speaking style
Multi-engine Recognition
  - Use three recognizers and combine results
    - ROVER
      - Combine scores per sentence
    - Combine lattices
      - Confusion networks
    - Cross adaptation
      - Interleave systems with adaptation
  - It usually works better when the systems are different
    - (and each of them is good)
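The idea of combining several recognizers by voting can be sketched as below. This is a deliberately simplified illustration, not ROVER itself: it assumes the hypotheses have already been aligned slot by slot (real systems build that alignment first), and the gap marker `<eps>` and function name `vote` are hypothetical.

```python
from collections import Counter

GAP = "<eps>"  # hypothetical marker meaning "this system has no word here"


def vote(aligned_hyps):
    """Majority vote per slot over pre-aligned hypotheses.

    Assumption: every hypothesis is a list of the same length, padded
    with GAP where a system produced no word for that slot.
    """
    if not aligned_hyps:
        return []
    length = len(aligned_hyps[0])
    if any(len(hyp) != length for hyp in aligned_hyps):
        raise ValueError("hypotheses must be pre-aligned to the same length")

    result = []
    for slot in zip(*aligned_hyps):
        # most_common keeps first-seen order on ties, so earlier systems
        # in the list win when the vote is split.
        word, _count = Counter(slot).most_common(1)[0]
        if word != GAP:
            result.append(word)
    return result


if __name__ == "__main__":
    hyps = [
        ["turn", "the", "light", "off"],
        ["turn", GAP, "lights", "off"],
        ["turn", "the", "lights", "of"],
    ]
    print(" ".join(vote(hyps)))
```

In this example each system makes a different error, so the vote recovers a sentence none of them produced alone, which is why diversity between the systems matters.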
Whispered Speech
  - Doesn’t disturb other people
  - Can use throat mike
  - Works in noisy environment
Muscle Movement
  - EMG: Electromyographic Signals
    - Recognize muscle impulses
  - Can work in noisy environments
  - Can work without you making a noise
Articulatory Movement
  - Attach metal studs to:
    - Lips, teeth, tongue, velum
  - Record movement in magnetic field
    - Non-intrusive
EMA: Electromagnetic Articulograph
ASR Summary
  - ASR requires:
    - Acoustic model
      - HMMs trained from lots of data
    - Pronunciation lexicon
      - List of pronunciations for words
    - Language model
      - Trigrams trained from lots of data
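To make "trigrams trained from lots of data" concrete, here is a minimal sketch of a maximum-likelihood trigram estimate: the probability of a word given the two words before it is the trigram count divided by the count of its two-word history. The function names, the `<s>`/`</s>` padding symbols and the toy sentences are assumptions for illustration; practical language models also apply smoothing or backoff, which this sketch leaves out.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word histories.

    Assumption: sentences are whitespace-tokenized strings; each is padded
    with two "<s>" start symbols and one "</s>" end symbol.
    """
    trigram_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            trigram_counts[tuple(words[i - 2:i + 1])] += 1
            history_counts[tuple(words[i - 2:i])] += 1
    return trigram_counts, history_counts


def trigram_prob(trigram_counts, history_counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for an unseen history."""
    seen = history_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / seen if seen else 0.0


if __name__ == "__main__":
    data = ["turn the lights off", "turn the lights on", "turn the radio off"]
    tri, hist = train_trigrams(data)
    print(trigram_prob(tri, hist, "turn", "the", "lights"))
    print(trigram_prob(tri, hist, "the", "lights", "off"))
```

With only three training sentences, most word sequences get probability zero, which is the practical reason the slide stresses lots of data.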
ASR Trade-offs
  - More/better training data
    - Well transcribed and closest to target system
  - Better signal
    - Better microphone, no noise
  - Better speaker
    - Interested party, know how to speak
  - Time and memory
    - Bigger systems do better
    - Greater CPU does better
Homework 1
  - Build a speech recognition system
    - An acoustic model
    - A pronunciation lexicon
    - A language model
  - Note it takes time to build
  - What is your initial WER?
    - How did you improve it?
  - Submitted by 3:30pm Monday 29th Sep