Speech Processing 11-492/18-492 Spoken Term Detection/Key Word Spotting
Listening for Keywords
No need to use push-to-talk
Always on
Examples
Uses
Activate a computer/task
  – “Computer, locate Commander Riker”
Robot/device control
  – “Next slide”
Broadcast News, Meetings
  – Tell me when “Microsoft” is mentioned
“Triggers” vs “(General) Keyword spotting”
Google Now
“Okay Google schedule”
Always on
Hands free
Uses battery all the time
But “Okay Google” is only said when it's meant
How to do it
Full ASR
  – Run full ASR all the time
  – Post process it to find keyword
  – Very computationally expensive
Model for Keyword
  – Build an acoustic model just for keyword
  – Run DTW (or similar) on the acoustics
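The "Model for Keyword" route can be sketched as a sliding-window DTW match. This is a minimal illustration, not the course's implementation: the function names `dtw_distance` and `spot_keyword`, the Euclidean frame distance, the fixed window length and the threshold value are all assumptions made for the example. Frames are plain tuples standing in for acoustic feature vectors such as MFCCs.

```python
import math

def dtw_distance(query, segment):
    """Length-normalised DTW cost between two sequences of feature vectors."""
    n, m = len(query), len(segment)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of query[:i] with segment[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], segment[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def spot_keyword(query, stream, threshold, hop=1):
    """Slide a query-sized window over the stream; keep windows scoring at or below threshold."""
    win = len(query)
    hits = []
    for start in range(0, len(stream) - win + 1, hop):
        score = dtw_distance(query, stream[start:start + win])
        if score <= threshold:
            hits.append((start, score))
    return hits

# Toy one-dimensional "features": the keyword template appears at frame 2.
query = [(0.0,), (1.0,), (2.0,)]
stream = [(5.0,), (5.0,), (0.0,), (1.0,), (2.0,), (5.0,)]
hits = spot_keyword(query, stream, threshold=0.1)
```

A fixed window the size of the template is the simplest choice; a real spotter would usually allow the matched region to stretch or shrink, since speakers vary in speaking rate.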
How to measure its success
False Positives
  – Find examples that aren't there
False Negatives
  – Miss examples that are there
What is the relative cost of the error
  – FN:
    • Trigger: person will say it again
    • KWS: it's lost
  – FP:
    • Trigger: an extra command will be interpreted
    • KWS: time wasted in looking at example to discard it
Change your thresholds
  – Trigger: fewer FP
  – KWS: fewer FN
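The threshold trade-off above can be made concrete with a small sketch. The helper names `count_errors` and `best_threshold`, the toy scores and the 5:1 cost ratios are all assumptions chosen for illustration; real costs depend on the application.

```python
def count_errors(scores, labels, threshold):
    """Count (false positives, false negatives) when accepting scores >= threshold.

    scores: detector confidences; labels: True where the keyword really occurs.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return fp, fn

def best_threshold(scores, labels, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest weighted error cost."""
    # Every observed score is a candidate, plus "reject everything".
    candidates = sorted(set(scores)) + [float("inf")]

    def total_cost(t):
        fp, fn = count_errors(scores, labels, t)
        return fp_cost * fp + fn_cost * fn

    return min(candidates, key=total_cost)

scores = [0.2, 0.4, 0.6, 0.8]
labels = [False, True, False, True]

# Trigger: a false positive runs an unwanted command, so weight FP heavily.
trigger_threshold = best_threshold(scores, labels, fp_cost=5, fn_cost=1)
# KWS: a missed mention is lost for good, so weight FN heavily.
kws_threshold = best_threshold(scores, labels, fp_cost=1, fn_cost=5)
```

With these toy numbers the trigger setting lands on the higher threshold (0.8) and the keyword-spotting setting on the lower one (0.4), matching the slide: triggers trade misses for fewer false alarms, general keyword spotting does the reverse.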
Hot Spots
Only look in good places
Speech vs non-speech
Target Speaker vs Other speakers
“Long” speech vs (very) short speech
Prosodically interesting parts
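As one possible reading of "speech vs non-speech", a crude energy gate can decide which frames are worth passing to the keyword detector. This is only a sketch: the function name `speech_frames`, the 160-sample frame length and the energy threshold are assumptions, and practical systems use far more robust voice activity detection.

```python
def speech_frames(samples, frame_len=160, energy_threshold=0.01):
    """Return indices of frames whose mean energy exceeds the threshold."""
    active = []
    for idx in range(len(samples) // frame_len):
        frame = samples[idx * frame_len:(idx + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_threshold:
            active.append(idx)
    return active

# Silence, one loud frame, silence: only the middle frame is a hot spot.
samples = [0.0] * 160 + [0.5] * 160 + [0.0] * 160
hot = speech_frames(samples)
```

Only the frames listed in `hot` would then be searched, which saves computation and removes regions where false positives could otherwise occur.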
Noise Cancellation
Remove known (irrelevant) channels
  – Remove TV feed from ASR stream
  – Remove Others from conference call
Boosting (For Keyword Spotting)
Words are defined by the company they keep
Words will typically appear more than once
  – Near to each other
Recognition with lattices (i.e. choices)
  – If a document has one occurrence
  – Boost others
If related words in document
  – Boost others
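The "one occurrence boosts the others" idea might be sketched as a rescoring pass over candidate hits. Everything here is a hypothetical illustration: the `(doc_id, keyword, score)` tuple layout, the function name `boost_hits`, and the `confident` and `boost` values are assumptions, and a real system would rescore lattice posteriors directly.

```python
def boost_hits(hits, confident=0.8, boost=0.25):
    """Raise weak candidates for a keyword in documents that already have a confident hit.

    hits: list of (doc_id, keyword, score) with scores in [0, 1].
    """
    # (document, keyword) pairs that have at least one confident detection.
    anchored = {(doc, kw) for doc, kw, score in hits if score >= confident}
    rescored = []
    for doc, kw, score in hits:
        if (doc, kw) in anchored and score < confident:
            score = min(1.0, score + boost)  # additive boost, capped at 1.0
        rescored.append((doc, kw, score))
    return rescored

hits = [
    ("doc1", "microsoft", 0.9),   # confident occurrence
    ("doc1", "microsoft", 0.5),   # weak candidate in the same document: boosted
    ("doc2", "microsoft", 0.5),   # weak candidate with no support: unchanged
]
rescored = boost_hits(hits)
```

The same pattern extends to the "related words" case by anchoring on a set of related keywords per document rather than on the keyword itself.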
Choose your Trigger Word
Something unlikely to appear elsewhere
Something easy to recognize
Something not confusable
Something easy to remember
Something relevant
Good examples
  – Affirmative and negative (vs yes and no)
  – “Okay Google”
  – “Nebuchadnezzar”
Bad examples
  – “huh” “sass”
IARPA Babel Project
4 teams
  – CMU/JHU
  – BBN
  – IBM
  – ICSI (and others)
35 languages over 5 years
  – Low resource languages
  – Pashto, Bengali, Vietnamese, Cantonese,...
100 hours, 10 hours, and 0 hours
0 Data Case
No labeled data in “unknown” language
  – So can't build initial ASR engine
  – Build index in the audio domain
Keywords are spoken
  – “Look for 'apple computers'”
Issues
  – Cross speaker mapping
  – (Use of synthesis – but need data)
Spoken Term Detection
“Old” goal but popular again
  – Not in fact much easier than full ASR
  – You can constrain the problem though
    • Limited keywords, train people
Can be used for search
  – But need good ASR in language