Speech Processing 11-492/18-492 Spoken Term Detection/Key Word Spotting

Listening for Keywords
* No need to use push-to-talk
* Always on

Examples

Uses
* Activate a computer/task
  – “Computer, locate Commander Riker”
* Robot/device control
  – “Next slide”
* Broadcast News, Meetings
  – Tell me when “Microsoft” is mentioned
* “Triggers” vs “(General) Keyword spotting”

Google Now
* “Okay Google schedule”
* Always on
* Hands free
* Uses battery all the time
* But “Okay Google” is only said when it's meant

How to do it
* Full ASR
  – Run full ASR all the time
  – Post-process it to find the keyword
  – Very computationally expensive
* Model for Keyword
  – Build an acoustic model just for the keyword
  – Run DTW (or similar) on the acoustics
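The "Model for Keyword" route can be sketched in a few lines. This is a minimal illustration only, assuming the keyword template and the incoming audio have already been converted to sequences of feature vectors (for example MFCC frames). The names `dtw_distance` and `spot_keyword` are hypothetical, and sliding a fixed window the same length as the template is a simplification, not the lecture's method.

```python
import math

def dtw_distance(template, query):
    """Length-normalised DTW cost between two sequences of feature vectors."""
    n, m = len(template), len(query)
    # cost[i][j] = cheapest alignment of template[:i] with query[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(template[i - 1], query[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # query frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)

def spot_keyword(template, stream, threshold, hop=1):
    """Slide a template-sized window over the stream; report low-cost windows."""
    win = len(template)
    hits = []
    for start in range(0, len(stream) - win + 1, hop):
        score = dtw_distance(template, stream[start:start + win])
        if score <= threshold:
            hits.append((start, score))
    return hits

# Toy 1-D "features": the keyword pattern 0,1,2 is buried in background frames of 5.
template = [[0.0], [1.0], [2.0]]
stream = [[5.0]] * 4 + [[0.0], [1.0], [2.0]] + [[5.0]] * 3
print(spot_keyword(template, stream, threshold=0.1))
```

A real detector would compare windows of several lengths, since DTW is meant to absorb differences in speaking rate.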

How to measure its success
* False Positives
  – Find examples that aren't there
* False Negatives
  – Miss examples that are there
* What is the relative cost of the error?
  – FN:
    • Trigger: person will say it again
    • KWS: it's lost
  – FP:
    • Trigger: an extra command will be interpreted
    • KWS: time wasted in looking at the example to discard it
* Change your thresholds
  – Trigger: fewer FP
  – KWS: fewer FN
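The effect of the relative error costs on the threshold can be made concrete with a small sketch. The detection scores, cost weights, and function names below are invented for illustration. The only idea taken from the slide is that a trigger should weight false positives more heavily and general keyword spotting should weight false negatives more heavily.

```python
def count_errors(scored, threshold):
    """scored: list of (score, is_true_keyword). A detection fires when score >= threshold."""
    fp = sum(1 for s, truth in scored if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored if s < threshold and truth)
    return fp, fn

def best_threshold(scored, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest weighted error cost."""
    def total_cost(t):
        fp, fn = count_errors(scored, t)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)

# Hypothetical detector outputs: (confidence, whether the keyword was really spoken).
scored = [(0.95, True), (0.80, True), (0.70, False), (0.60, True),
          (0.40, False), (0.30, True), (0.20, False)]
candidates = [0.25, 0.5, 0.75, 0.9]

# Trigger: a false positive runs an unwanted command, so it costs more.
print(best_threshold(scored, cost_fp=5, cost_fn=1, candidates=candidates))
# Keyword spotting: a missed mention is lost for good, so a false negative costs more.
print(best_threshold(scored, cost_fp=1, cost_fn=5, candidates=candidates))
```

With these made-up numbers the trigger setting selects a higher threshold than the keyword spotting setting, which matches the direction the slide recommends.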

Hot Spots
* Only look in good places
* Speech vs non-speech
* Target speaker vs other speakers
* “Long” speech vs (very) short speech
* Prosodically interesting parts
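As one possible reading of "speech vs non-speech" and "long vs short", the sketch below marks hot spots with a plain frame-energy threshold and drops runs that are too short. Energy thresholding is a swapped-in, deliberately crude stand-in for a real voice activity detector. The function names, the threshold, and the choice to discard short runs are assumptions for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def hot_spots(samples, frame_len, energy_threshold, min_frames):
    """Return (start_frame, end_frame) runs, end exclusive, of high-energy frames
    that last at least min_frames."""
    energies = frame_energies(samples, frame_len)
    spots, start = [], None
    # A trailing -inf sentinel closes a run that reaches the end of the signal.
    for i, e in enumerate(energies + [float("-inf")]):
        if e > energy_threshold and start is None:
            start = i
        elif e <= energy_threshold and start is not None:
            if i - start >= min_frames:
                spots.append((start, i))
            start = None
    return spots

# Toy signal: silence, a 3-frame burst, silence, a 1-frame blip, silence.
samples = [0.0] * 8 + [1.0] * 12 + [0.0] * 4 + [1.0] * 4 + [0.0] * 4
print(hot_spots(samples, frame_len=4, energy_threshold=0.5, min_frames=2))
```

Only the frames inside the returned runs would then be passed to the more expensive keyword detector.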

Noise Cancellation
* Remove known (irrelevant) channels
  – Remove TV feed from ASR stream
  – Remove others from conference call

Boosting
* (For Keyword Spotting)
* Words are defined by the company they keep
* Words will typically appear more than once
  – Near to each other
* Recognition with lattices (i.e. choices)
  – If a document has one occurrence, boost others
* If related words in document
  – Boost others
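A toy version of the "one occurrence boosts others" idea follows. It assumes the lattice search has already produced candidate detections of the same keyword as (time, score) pairs within one document. The additive rule, the `window` and `boost` parameters, and the name `boost_hits` are all hypothetical; the slide does not specify how the boost is computed.

```python
def boost_hits(hits, window, boost):
    """hits: list of (time_sec, score) candidates for one keyword in one document.
    Each score is raised by `boost` per other candidate within `window` seconds,
    capped at 1.0."""
    boosted = []
    for i, (t, s) in enumerate(hits):
        neighbours = sum(1 for j, (u, _) in enumerate(hits)
                         if j != i and abs(u - t) <= window)
        boosted.append((t, min(1.0, s + boost * neighbours)))
    return boosted

# Two weak candidates close together support each other; the isolated one is unchanged.
hits = [(10.0, 0.5), (12.0, 0.5), (300.0, 0.5)]
print(boost_hits(hits, window=30.0, boost=0.25))
```

The same pattern could be applied to related words by counting their detections as neighbours too.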

Choose your Trigger Word
* Something unlikely to appear elsewhere
* Something easy to recognize
* Something not confusable
* Something easy to remember
* Something relevant
* Good examples
  – “Affirmative” and “negative” (vs “yes” and “no”)
  – “Okay Google”
  – “Nebuchadnezzar”
* Bad examples
  – “huh”, “sass”

IARPA Babel Project
* 4 teams
  – CMU/JHU
  – BBN
  – IBM
  – ICSI (and others)
* 35 languages over 5 years
  – Low resource languages
  – Pashto, Bengali, Vietnamese, Cantonese, ...
* 100 hours, 10 hours, and 0 hours

0 Data Case
* No labeled data in “unknown” language
  – So can't build initial ASR engine
  – Build index in the audio domain
* Keywords are spoken
  – “Look for 'apple computers'”
* Issues
  – Cross speaker mapping
  – (Use of synthesis – but need data)

Spoken Term Detection
* “Old” goal but popular again
  – Not in fact much easier than full ASR
  – You can constrain the problem though
    • Limited keywords, train people
* Can be used for search
  – But need good ASR in language
