Visually grounded cross-lingual keyword spotting in speech


SLTU, August 2018. Herman Kamper¹ and Michael Roth². ¹E&E Engineering, Stellenbosch University, South Africa; ²Saarland University, Germany. http://www.kamperh.com/


  1. Visually grounded cross-lingual keyword spotting in speech
     SLTU, August 2018
     Herman Kamper¹ and Michael Roth²
     ¹ E&E Engineering, Stellenbosch University, South Africa
     ² Saarland University, Germany
     http://www.kamperh.com/

  5. Advances in speech recognition
     • Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]
     • Very different from the “supervision” infants use to learn language
     • Sometimes not possible, e.g., for unwritten languages

  9. Images as weak labels for speech
     Can we use images as weak labels in low-resource settings?
     • Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?
     • Goal: Use this type of data for cross-lingual keyword spotting

  10. Cross-lingual keyword spotting
      Written query: burning (English)
      [Figure: the written query is used to search a Swahili speech corpus; the spoken word "kuwaka" is marked in the speech.]

  15. Cross-lingual word prediction from images
      [Figure: an image is passed through a VGG-based visual tagger, which outputs soft word scores $y_{\text{vis}}$ (e.g., 0.85, 0.8, 0.9 for tags such as "hat", "man", "shirt"). The speech $X$ is passed through a speech network with conv, max, and feedfwd layers, which outputs $f(X)$. A loss $\ell$ connects $f(X)$ and $y_{\text{vis}}$.]

  20. Cross-lingual word prediction from images
      [Figure: the speech network alone (conv, max, and feedfwd layers) maps Swahili speech $X$ to $f(X)$, with example output words "man" and "hat".]
      • $f(X) \in \mathbb{R}^W$ is a vector of word probabilities
      • I.e., a cross-lingual spoken bag-of-words (BoW) classifier
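
To make the training signal in the diagram above concrete, here is a minimal sketch of a loss between the speech network's word probabilities $f(X)$ and the soft visual tags $y_{\text{vis}}$. The slides only show a loss $\ell$ without giving its form, so the summed binary cross-entropy used here, the function name `soft_bce_loss`, and the predicted values are all assumptions for illustration.

```python
import math


def soft_bce_loss(f_x, y_vis, eps=1e-12):
    """Summed binary cross-entropy between predicted word probabilities
    f_x and soft targets y_vis (both length-W sequences in [0, 1]).

    Assumed loss form; the slides only name a loss without defining it.
    """
    if len(f_x) != len(y_vis):
        raise ValueError("f_x and y_vis must have the same length W")
    loss = 0.0
    for p, y in zip(f_x, y_vis):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss


# Toy example with W = 3 words, using the tagger scores shown on the slide.
y_vis = [0.85, 0.8, 0.9]
good = soft_bce_loss([0.85, 0.8, 0.9], y_vis)  # predictions match the soft tags
bad = soft_bce_loss([0.1, 0.2, 0.1], y_vis)    # predictions far from the soft tags
print(good < bad)
```

Because the targets are soft scores rather than 0/1 labels, the loss is smallest when the predicted probabilities equal the tagger's scores, not when they saturate at 1.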

  24. Experimental details
      • Goal: Use visual grounding for cross-lingual keyword spotting
      • Proof-of-concept: Use English speech with German queries
      [Figure: cross-lingual keyword spotter. An image $I$ is passed through a VGG-16-based tagger that outputs German (text) tags $\hat{y}_{\text{de}}$ (e.g., "Feld", "Hunde", "springt"). The English speech $X$ is passed through a speech network with conv, max, and feedfwd layers, which outputs $f(X)$. A loss $\ell$ connects $f(X)$ and $\hat{y}_{\text{de}}$.]
      • Data: 8000 images with 5 English spoken captions each (∼37 hours)
      • Weak labels: German visual tagger trained on German Multi30k

  25. Predictions on test data
      Given German keyword: ‘Hunde’ (corresponds to dimension $w$)
      [Figure: English speech collection (want to search), with network outputs $f(X_1)$, $f(X_2)$, $f(X_3)$.]
      $f_w(X_i) = P_\theta(w \mid X_i)$: score for whether (English) speech $X_i$ contains translation of given (German) keyword $w$
      Evaluation: Does predicted keyword occur in reference translation?
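
The retrieval step described on this slide can be sketched as a ranking over utterances by the score $f_w(X_i)$ in the keyword's dimension. This is only an illustration: the function name `spot_keyword`, the three-word vocabulary, the utterance ids, and all score values are hypothetical and not taken from the talk.

```python
def spot_keyword(scores, vocab, keyword, top_k=10):
    """Rank utterances by f_w(X_i), the score in dimension w of the keyword.

    scores: dict mapping utterance id -> list of W word probabilities f(X_i)
    vocab:  list of the W keywords, giving the dimension order
    """
    w = vocab.index(keyword)  # raises ValueError for out-of-vocabulary keywords
    ranked = sorted(scores, key=lambda utt: scores[utt][w], reverse=True)
    return ranked[:top_k]


vocab = ["Hunde", "Fahrrad", "Straße"]  # hypothetical German keyword vocabulary
scores = {                               # made-up network outputs f(X_i)
    "utt1": [0.91, 0.05, 0.10],
    "utt2": [0.12, 0.88, 0.40],
    "utt3": [0.55, 0.02, 0.71],
}
print(spot_keyword(scores, vocab, "Hunde", top_k=2))  # ['utt1', 'utt3']
```

Note that the query is text in one language while the ranked items are speech in another; the only link between them is the shared output dimension $w$.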

  34. Example predictions (top retrievals)
      Task: Given written German keyword, find utterances in an unseen English speech collection containing that keyword
      Input: Fahrrad (bicycle)
      Output (in top 10):
      • man riding a bicycle on a foggy day
      • a biker does a trick on a ramp
      • a person is doing tricks on a bicycle in a city
      Input: Straße (street)
      Output (in top 10):
      • a woman in black and red listens to an ipod walks down the street
      • people on the city street walk past a puppet theater
      • an asian woman rides a bicycle in front of two cars

  35. Cross-lingual keyword spotting performance
      [Figure: bar chart of P@10 (%), axis from 0 to 100, for four models: DETextPrior, DEVisionCNN, XVisionSpeechCNN, XBoWCNN. Bar values are not recoverable from the extracted text.]
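
The chart reports P@10, which the slides do not define. Assuming the standard meaning of precision at k (the fraction of the top k retrieved utterances that are correct), it could be computed as in the sketch below. The function name `precision_at_k` and the example data are hypothetical, and how the talk aggregates the metric over keywords is not stated.

```python
def precision_at_k(ranked_utts, relevant_utts, k=10):
    """Fraction of the top-k retrieved utterances that are relevant.

    ranked_utts:   utterance ids ordered from highest to lowest score
    relevant_utts: set of utterance ids that truly contain the keyword
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_utts[:k]
    hits = sum(1 for utt in top if utt in relevant_utts)
    return hits / k


# Made-up example: 10 retrieved utterances, 5 of which are relevant.
ranked = ["u%d" % i for i in range(10)]
relevant = {"u0", "u1", "u2", "u4", "u7"}
print(100 * precision_at_k(ranked, relevant, k=10))  # 50.0
```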

  41. Example predictions marked as errors
      Input: Feld (field)
      Output:
      • a brown and black dog running through a grassy field ∗
      Input: grün(en) (green)
      Output:
      • a brown dog is chasing a red frisbee across a grassy field ∗
      Input: groß(en) (big)
      Output:
      • a small group of people sitting together outside ∗

  44. Error analysis by annotator
      [Figure: bar chart of P@10 (%), axis from 0 to 100, for four models: DETextPrior, DEVisionCNN, XVisionSpeechCNN, XBoWCNN. Bar values are not recoverable from the extracted text.]

  45. Cross-lingual keyword spotting
      Written query: burning (English)
      [Figure: the written query shown with the labels "kuwaka" and "moto".]

  46. Conclusions and future work
      • Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations
