Speech Processing 15-492/18-492 Speech Recognition Systems Other ASR techniques

ASR Systems
- How good are they?
  - Expected ASR
  - Factors that make things worse
- How good do they need to be?
  - What can you do with low WER?

ASR Tasks

What makes it worse
- Channel
  - Telephone vs wide band
  - Close-talking vs far-field
- Style:
  - Command and Control
  - Limited information getting
  - Limited domain but general speech
  - Machine directed vs human directed speech
  - Broadcast (performance) vs conversational
  - Single vs dialog vs multiperson

Expected WER: Real-time
- Command and Control
  - Limited vocabulary and directed speech
  - < 10% (< 5% for some users)
- Simple Dialog
  - Machine directed speech with interested users
  - < 20% (but sometimes works with < 30%)
- Dictation
  - Single speaker, well performed
  - < 5% for some users, > 30% for (short term) users
- Speech-to-Speech Translation
  - Machine mediated, target domain
  - < 20% (but will vary for different people)
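The WER figures above are word error rates: substitutions, deletions and insertions, divided by the number of words in the reference transcript. A minimal sketch of that calculation follows, using a word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; scoring tools normally also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("the") and one substitution ("off" -> "of") over 4 words.
    print(word_error_rate("turn the lights off", "turn lights of"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.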

Expected WER: offline
- Broadcast News
  - Large vocabulary, well performed
  - < 10% but not real-time (maybe 100 times real time)
- Conversational Speech (Call Home)
  - Large vocabulary, not well performed
  - > 40% WER (depends on particular users and conversations)
- Information retrieval
  - Large vocabulary, very varied content
  - > 60% can still give useful results
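"100 times real time" can be read as a real-time factor: processing time divided by audio duration. The short sketch below shows that arithmetic. The function names and the 30-minute broadcast are illustrative assumptions, not figures from the slides.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (1.0 means real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_hours(audio_minutes: float, rtf: float) -> float:
    """Hours of compute needed for the given audio at a real-time factor."""
    return audio_minutes * rtf / 60.0


if __name__ == "__main__":
    # A hypothetical 30-minute broadcast decoded at 100 times real time.
    print(processing_hours(30, 100), "hours")
```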

Other uses
- TV show subtitling for the deaf
- Court transcription
- Medical dictation
- Air traffic control transcription

Other ASR techniques
- Including Articulatory/Phonetic Features (Metze)
  - Build recognizers for
    - Voiced/unvoiced
    - Nasality
    - Closures (quiet part of stops)
    - Aspiration (Fricatives)
    - Tongue position
  - Run all in parallel and “join” them
  - Combine with more standard approaches
  - Can be more robust to speaking style
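One way to picture the "join" step is a weighted sum of log scores from the feature detectors and the standard acoustic model. The sketch below assumes that form. The stream names, weights and probabilities are invented for illustration, and the slides do not specify how the streams are actually combined.

```python
import math
from typing import Dict


def join_scores(stream_log_probs: Dict[str, float],
                weights: Dict[str, float]) -> float:
    """Weighted sum of per-stream log probabilities (log-linear combination)."""
    return sum(weights[name] * lp for name, lp in stream_log_probs.items())


# Hypothetical scores for two candidate phones on one frame.
# "standard" stands for a conventional acoustic model, "voiced" for a
# voiced/unvoiced detector run in parallel with it.
candidates = {
    "b": {"standard": math.log(0.4), "voiced": math.log(0.9)},
    "p": {"standard": math.log(0.6), "voiced": math.log(0.1)},
}
weights = {"standard": 1.0, "voiced": 0.5}  # assumed stream weights

best = max(candidates, key=lambda phone: join_scores(candidates[phone], weights))

if __name__ == "__main__":
    print(best)
```

In this toy case the standard model alone prefers /p/, but the voicing detector shifts the combined score toward /b/.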

Multi-engine Recognition
- Use three recognizers and combine results
  - ROVER
  - Combine scores per-sentence
  - Combine lattices
  - Confusion networks
- Cross adaptation
  - Interleave systems with adaptation
- It usually works better when the systems are different (and all of them are good)
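The voting idea behind ROVER-style combination can be sketched as a majority vote per word slot. This is a simplification: it assumes the three hypotheses have already been aligned to the same length, with an empty string marking a slot where a system output nothing, whereas the real procedure builds that alignment itself and can also use confidence scores. The names `NULL` and `vote` are illustrative.

```python
from collections import Counter
from typing import List, Sequence

NULL = ""  # assumed marker for "no word in this slot"


def vote(aligned_hyps: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per slot over pre-aligned hypotheses of equal length."""
    if len({len(h) for h in aligned_hyps}) != 1:
        raise ValueError("hypotheses must be non-empty in number and equally long")
    result = []
    for slot in zip(*aligned_hyps):
        # Ties go to the word seen first, i.e. the earlier-listed system.
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:
            result.append(word)
    return result


if __name__ == "__main__":
    hyps = [
        ["turn", "the", "light", "off"],
        ["turn", "a", "lights", "off"],
        ["turn", "the", "lights", "of"],
    ]
    print(" ".join(vote(hyps)))
```

Each system here makes a different error, so the vote recovers a sentence none of them produced alone, which is why differing systems tend to combine better.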

Whispered Speech
- Doesn’t disturb other people
- Can use throat mike
- Works in noisy environment

Muscle Movement
- EMG: Electromyographic Signals
  - Recognize muscle impulses
- Can work in noisy environments
- Can work without you making a noise

Articulatory Movement
- Attach metal studs to:
  - Lips, teeth, tongue, velum
- Record movement in magnetic field
  - Non-intrusive

EMA: Electromagnetic Articulograph

ASR Summary
- ASR requires:
  - Acoustic model
    - HMMs trained from lots of data
  - Pronunciation lexicon
    - List of pronunciations for words
  - Language model
    - Trigrams trained from lots of data
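To make "trigrams trained from lots of data" concrete, here is a minimal sketch of a trigram language model estimated by counting. It uses plain maximum-likelihood estimates with no smoothing or backoff, which a real system would need for unseen word sequences. The function names, the `<s>`/`</s>` padding symbols and the three training sentences are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_trigrams(sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count trigrams and their two-word contexts over whitespace tokens."""
    trigrams: Counter = Counter()
    contexts: Counter = Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for a, b, c in zip(words, words[1:], words[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts


def trigram_prob(trigrams: Counter, contexts: Counter,
                 a: str, b: str, c: str) -> float:
    """Maximum-likelihood P(c | a, b); 0.0 if the context was never seen."""
    seen = contexts[(a, b)]
    return trigrams[(a, b, c)] / seen if seen else 0.0


if __name__ == "__main__":
    tri, ctx = train_trigrams([
        "turn the lights on",
        "turn the lights off",
        "turn the radio on",
    ])
    # "lights" follows "turn the" in 2 of the 3 training sentences.
    print(trigram_prob(tri, ctx, "turn", "the", "lights"))
```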

ASR Trade-offs
- More/better training data
  - Well transcribed and closest to target system
- Better signal
  - Better microphone, no noise
- Better speaker
  - Interested party, know how to speak
- Time and memory
  - Bigger systems do better
  - Greater CPU does better

Homework 1
- Build a speech recognition system
  - An acoustic model
  - A pronunciation lexicon
  - A language model
- Note it takes time to build
- What is your initial WER?
  - How did you improve it?
- Submitted by 3:30pm Monday 29th Sep
