

  1. Speech Processing 11-492/18-492: Spoken Term Detection / Keyword Spotting

  2. Listening for Keywords
     - No need to use push-to-talk
     - Always on

  3. Examples

  4. Uses
     - Activate a computer/task
       - “Computer, locate Commander Riker”
     - Robot/device control
       - “Next slide”
     - Broadcast News, Meetings
       - Tell me when “Microsoft” is mentioned
     - “Triggers” vs “(General) Keyword spotting”

  5. Google Now
     - “Okay Google schedule”
     - Always on
     - Hands free
     - Uses battery all the time
     - But “Okay Google” is only said when it's meant

  6. How to do it
     - Full ASR
       - Run full ASR all the time
       - Post-process it to find the keyword
       - Very computationally expensive
     - Model for the keyword
       - Build an acoustic model just for the keyword
       - Run DTW (or similar) on the acoustics
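A minimal sketch of the first route (full ASR plus post-processing), assuming the recognizer already produces word-level hypotheses with timestamps; the `Word` structure and `find_keyword` helper below are illustrative, not part of any particular toolkit.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str      # hypothesized word
    start: float   # start time in seconds
    end: float     # end time in seconds
    conf: float    # recognizer confidence in [0, 1]

def find_keyword(hyp_words, keyword, min_conf=0.5):
    """Scan an ASR word sequence for a (possibly multi-word) keyword."""
    kw = keyword.lower().split()
    hits = []
    for i in range(len(hyp_words) - len(kw) + 1):
        span = hyp_words[i:i + len(kw)]
        if [w.text.lower() for w in span] == kw:
            score = min(w.conf for w in span)   # weakest word limits the hit
            if score >= min_conf:
                hits.append((span[0].start, span[-1].end, score))
    return hits

# Toy usage: one hit for "okay google" in a made-up hypothesis.
hyp = [Word("uh", 0.0, 0.2, 0.4), Word("okay", 0.3, 0.6, 0.9),
       Word("google", 0.6, 1.1, 0.8), Word("schedule", 1.2, 1.8, 0.7)]
print(find_keyword(hyp, "okay google"))   # [(0.3, 1.1, 0.8)]
```

The expense noted on the slide comes from running the recognizer everywhere; the post-processing itself is cheap.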

  7. How to measure its success
     - False positives
       - Find examples that aren't there
     - False negatives
       - Miss examples that are there
     - What is the relative cost of the error?
       - FN:
         - Trigger: person will say it again
         - KWS: it's lost
       - FP:
         - Trigger: an extra command will be interpreted
         - KWS: time wasted looking at the example to discard it
     - Change your thresholds
       - Trigger: fewer FPs
       - KWS: fewer FNs
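A small sketch of the trade-off, assuming each candidate detection carries a score and we know which ones are true occurrences (the numbers below are toy values); raising the threshold trades false positives for false negatives, which is why a trigger and a KWS system tune it in opposite directions.

```python
# (score, is_true_occurrence) pairs for candidate detections; toy values.
candidates = [(0.95, True), (0.80, True), (0.75, False), (0.60, True),
              (0.55, False), (0.40, False), (0.30, True)]
n_true = sum(1 for _, t in candidates if t)

for threshold in (0.3, 0.5, 0.7, 0.9):
    accepted = [(s, t) for s, t in candidates if s >= threshold]
    fp = sum(1 for _, t in accepted if not t)        # accepted but wrong
    fn = n_true - sum(1 for _, t in accepted if t)   # true hits we rejected
    print(f"threshold={threshold:.1f}  FP={fp}  FN={fn}")

# A trigger application picks a high threshold (fewer FPs);
# a KWS/search application picks a lower one (fewer FNs).
```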

  8. Hot Spots
     - Only look in good places
     - Speech vs non-speech
     - Target speaker vs other speakers
     - “Long” speech vs (very) short speech
     - Prosodically interesting parts
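One cheap way to "only look in good places" is a frame-energy speech/non-speech gate. The sketch below assumes 16 kHz mono audio already loaded into a NumPy array, and the margin is an arbitrary choice.

```python
import numpy as np

def speech_frames(signal, sr=16000, frame_ms=25, hop_ms=10, margin_db=15.0):
    """Mark frames whose energy is within margin_db of the loudest frame,
    a crude speech/non-speech gate for narrowing the search."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([np.sum(signal[i*hop:i*hop+frame]**2) for i in range(n)])
    energy_db = 10 * np.log10(energy + 1e-10)
    return energy_db > energy_db.max() - margin_db

# Toy usage: 1 s of low-level noise with a louder burst in the middle.
sr = 16000
x = 0.01 * np.random.randn(sr)
x[6000:10000] += 0.2 * np.sin(2 * np.pi * 200 * np.arange(4000) / sr)
mask = speech_frames(x, sr)
print(f"{mask.sum()} of {len(mask)} frames flagged as worth searching")
```

Target-speaker and prosodic filtering work the same way in spirit: score each region, and only run the expensive keyword search on regions that pass.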

  9. Noise Cancellation
     - Remove known (irrelevant) channels
       - Remove the TV feed from the ASR stream
       - Remove others from a conference call
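A minimal sketch of removing a known channel with a normalized LMS adaptive filter, assuming the interfering source (e.g., the TV feed) is available as a separate reference signal; the filter length and step size are arbitrary.

```python
import numpy as np

def lms_cancel(mic, ref, taps=32, mu=0.5):
    """Remove the component of `mic` that is predictable from `ref` (NLMS)."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]             # most recent reference samples
        e = mic[n] - w @ x                    # error = cleaned sample
        w += mu * e * x / (x @ x + 1e-8)      # normalized LMS weight update
        out[n] = e
    return out

# Toy usage: a tone (the "speech") plus a delayed copy of a known "TV" feed.
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)                                  # known TV feed
speech = 0.3 * np.sin(2 * np.pi * 300 * np.arange(8000) / 16000)
mic = speech + 0.8 * np.concatenate([np.zeros(5), ref[:-5]])
clean = lms_cancel(mic, ref)
print("residual interference power:",
      np.mean((clean[1000:] - speech[1000:])**2))
```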

  10. Boosting
     - (For keyword spotting)
     - Words are defined by the company they keep
     - Words will typically appear more than once
       - Near to each other
     - Recognition with lattices (i.e., choices)
       - If a document has one occurrence, boost the others
     - If related words appear in the document
       - Boost the others
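A sketch of the boosting idea, assuming keyword hypotheses have already been extracted from the recognition lattices as (document, time, score) candidates; the confidence cutoff and boost factor are made-up values.

```python
from collections import defaultdict

def boost_within_document(hits, confident=0.8, boost=1.3):
    """If a document already has one confident occurrence of the keyword,
    boost the scores of its other, weaker candidate occurrences."""
    by_doc = defaultdict(list)
    for doc, time, score in hits:
        by_doc[doc].append([doc, time, score])
    for doc, cands in by_doc.items():
        if any(score >= confident for _, _, score in cands):
            for c in cands:
                if c[2] < confident:
                    c[2] = min(1.0, c[2] * boost)
    return [tuple(c) for cands in by_doc.values() for c in cands]

# Toy usage: doc "a" has one strong hit, so its weak hit gets boosted;
# the lone weak hit in doc "b" is left alone.
hits = [("a", 12.3, 0.92), ("a", 48.7, 0.55), ("b", 5.0, 0.55)]
print(boost_within_document(hits))
```

Boosting from related words works the same way: replace the "confident occurrence of the same keyword" test with a test for confident occurrences of semantically related words.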

  11. Choose your Trigger Word
     - Something unlikely to appear elsewhere
     - Something easy to recognize
     - Something not confusable
     - Something easy to remember
     - Something relevant
     - Good examples
       - “Affirmative” and “negative” (vs “yes” and “no”)
       - “Okay Google”
       - “Nebuchadnezzar”
     - Bad examples
       - “huh”, “sass”
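One way to check "not confusable" is edit distance between pronunciations, ideally at the phone level. The sketch below uses letter strings as a stand-in for phone strings, and the candidate and vocabulary lists are made up for illustration.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Crude confusability check: distance from each candidate trigger word to a
# small everyday vocabulary (letters standing in for phones).
vocabulary = ["yes", "no", "huh", "so", "okay", "go", "stop", "what"]
for candidate in ["nebuchadnezzar", "okay google", "sass", "huh"]:
    d = min(edit_distance(candidate, w) for w in vocabulary)
    print(f"{candidate!r}: min edit distance to common words = {d}")
```

Long, distinctive words score well here; short words like "huh" collide with half the vocabulary, which is part of why they make poor triggers.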

  12. IARPA Babel Project
     - 4 teams
       - CMU/JHU
       - BBN
       - IBM
       - ICSI (and others)
     - 35 languages over 5 years
       - Low-resource languages
       - Pashto, Bengali, Vietnamese, Cantonese, ...
     - 100 hours, 10 hours, and 0 hours

  13. 0 Data Case
     - No labeled data in the “unknown” language
       - So can't build an initial ASR engine
       - Build an index in the audio domain
     - Keywords are spoken
       - “Look for 'apple computers'”
     - Issues
       - Cross-speaker mapping
       - (Use of synthesis, but that needs data)
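A minimal sketch of the query-by-example idea for the 0-data case: match a spoken keyword against the audio in the feature domain with DTW, with per-utterance mean/variance normalization as a crude mitigation for the cross-speaker problem. Feature extraction is assumed to have happened elsewhere; the arrays below are random stand-ins for real feature matrices.

```python
import numpy as np

def normalize(feats):
    """Per-utterance mean/variance normalization (crude speaker compensation)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def dtw_cost(query, segment):
    """Average DTW alignment cost between two feature sequences."""
    q, s = len(query), len(segment)
    dist = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=2)
    acc = np.full((q, s), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(q):
        for j in range(s):
            if i == 0 and j == 0:
                continue
            best_prev = min(acc[i-1, j] if i > 0 else np.inf,
                            acc[i, j-1] if j > 0 else np.inf,
                            acc[i-1, j-1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1] / (q + s)

# Toy usage: slide the spoken query over the utterance, keep the best spot.
rng = np.random.default_rng(1)
query = normalize(rng.standard_normal((30, 13)))       # ~0.3 s of features
utterance = normalize(rng.standard_normal((300, 13)))  # ~3 s of features
window, hop = 40, 10
costs = [(start, dtw_cost(query, utterance[start:start + window]))
         for start in range(0, len(utterance) - window, hop)]
print(min(costs, key=lambda c: c[1]))   # (frame offset, best matching cost)
```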

  14. Spoken Term Detection
     - “Old” goal but popular again
       - Not in fact much easier than full ASR
       - You can constrain the problem, though
         - Limited keywords, train people
     - Can be used for search
       - But need good ASR in the language
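When good ASR does exist in the language, spoken term detection for search largely reduces to indexing the recognizer output. A tiny sketch, assuming word-level hypotheses with timestamps; the recording names and transcripts are toy data.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each hypothesized word to (recording, start time) postings."""
    index = defaultdict(list)
    for recording, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((recording, start))
    return index

# Toy ASR output: {recording: [(word, start_time_seconds), ...]}
transcripts = {
    "news_001": [("microsoft", 12.4), ("announced", 12.9), ("today", 13.3)],
    "meeting_07": [("we", 0.2), ("met", 0.5), ("microsoft", 1.1)],
}
index = build_index(transcripts)
print(index["microsoft"])   # [('news_001', 12.4), ('meeting_07', 1.1)]
```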
