Visually grounded cross-lingual keyword spotting in speech
SLTU, August 2018
Herman Kamper (1) and Michael Roth (2)
(1) E&E Engineering, Stellenbosch University, South Africa
(2) Saarland University, Germany
http://www.kamperh.com/

Advances in speech recognition
• Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]
• Very different from the “supervision” infants use to learn language
• Sometimes not possible, e.g., for unwritten languages

Images as weak labels for speech
Can we use images as weak labels in low-resource settings?
• Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?
• Goal: Use this type of data for cross-lingual keyword spotting

Cross-lingual keyword spotting
[Figure: the written query "burning" (English) is searched against a Swahili speech corpus containing the word "kuwaka".]

Cross-lingual word prediction from images
[Figure, training: an image is passed through a VGG visual tagger, giving soft word scores ŷ_vis (e.g., 0.85, 0.8, 0.9). The paired speech X is passed through a network with conv, max and feedfwd layers, giving f(X). A loss ℓ compares f(X) with ŷ_vis.]
[Figure, testing: Swahili speech X is passed through the same conv, max and feedfwd layers, giving f(X).]
• f(X) ∈ R^W is a vector of word probabilities
• I.e., a cross-lingual spoken bag-of-words (BoW) classifier
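
As a rough illustration of the output stage and the loss on this slide, here is a minimal pure-Python sketch. It is not the authors' implementation: the convolutional layers are omitted, the single sigmoid output layer stands in for the feedforward part, and the summed binary cross-entropy against soft visual targets is an assumption (the slide only names a loss ℓ). All function names, weights and numbers are hypothetical.

```python
import math


def max_pool_over_time(frames):
    """Collapse a variable-length sequence of feature vectors into one vector
    by taking the maximum of each feature dimension across all frames."""
    return [max(column) for column in zip(*frames)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_word_probs(frames, weights, biases):
    """Hypothetical output stage: max-pool over time, then one sigmoid unit
    per vocabulary word, giving f(X) with one probability per word."""
    pooled = max_pool_over_time(frames)
    return [
        sigmoid(sum(w_i * p_i for w_i, p_i in zip(row, pooled)) + b)
        for row, b in zip(weights, biases)
    ]


def soft_bce_loss(f_x, y_vis, eps=1e-12):
    """Summed binary cross-entropy between predictions f(X) and soft visual
    targets (an assumed choice of loss; the slide only calls it 'Loss')."""
    return -sum(
        y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
        for p, y in zip(f_x, y_vis)
    )


# Toy example: 3 frames of 2-dimensional features, vocabulary of W = 2 words.
frames = [[0.1, -1.0], [2.0, 0.0], [0.5, 0.3]]
weights = [[1.0, 0.0], [0.0, -1.0]]  # made-up weights, one row per word
biases = [0.0, 0.0]
f_x = predict_word_probs(frames, weights, biases)
loss = soft_bce_loss(f_x, y_vis=[0.9, 0.2])  # made-up visual tagger scores
```

Because each word gets its own sigmoid unit, several words can be predicted for one utterance, which is what makes f(X) a bag-of-words output rather than a single-label classifier.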

Experimental details
• Goal: Use visual grounding for cross-lingual keyword spotting
• Proof-of-concept: Use English speech with German queries
[Figure: cross-lingual keyword spotter. Image I is passed through VGG-16, giving German (text) tags ŷ_de (e.g., "Feld"). English speech X is passed through conv, max and feedfwd layers, giving f(X). A loss ℓ compares f(X) with ŷ_de.]
• Data: 8000 images with 5 English spoken captions (∼37 hours)
• Weak labels: German visual tagger trained on German Multi30k

Predictions on test data
Given German keyword: ‘Hunde’
English speech collection (want to search): utterances X_1, X_2, X_3 with outputs f(X_1), f(X_2), f(X_3); the keyword corresponds to dimension w of each output.
f_w(X_i) = P_θ(w | X_i): score for whether (English) speech X_i contains translation of given (German) keyword w
Evaluation: Does predicted keyword occur in reference translation?
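
The retrieval and evaluation step can be sketched as follows: rank the utterances by their score for the keyword, then count how many of the top k have the keyword in their reference translation. This is an illustrative sketch only; the helper names, scores and reference translations are made up, and the deck's P@10 corresponds to k = 10.

```python
def top_k(scores, k=10):
    """Return indices of the k highest-scoring utterances, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_at_k(scores, is_relevant, k=10):
    """Fraction of the top-k retrieved utterances that are relevant."""
    retrieved = top_k(scores, k)
    if not retrieved:
        return 0.0
    return sum(1 for i in retrieved if is_relevant[i]) / len(retrieved)


# Hypothetical scores f_w(X_i) for the keyword w = "hunde" on 4 utterances.
keyword = "hunde"
scores = [0.92, 0.15, 0.71, 0.40]

# Hypothetical reference German translations, one bag of words per utterance.
references = [
    {"zwei", "hunde", "rennen"},
    {"ein", "mann", "läuft"},
    {"ein", "kind", "spielt"},
    {"hunde", "im", "schnee"},
]

# An utterance counts as correct if the keyword occurs in its reference.
is_relevant = [keyword in ref for ref in references]

ranking = top_k(scores, k=2)                       # [0, 2]
p_at_2 = precision_at_k(scores, is_relevant, k=2)  # 0.5
```

Utterance 2 is retrieved in second place even though its reference does not contain the keyword, so it counts as an error under this exact-match criterion.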

Example predictions (top retrievals)
Task: Given written German keyword, find utterances in an unseen English speech collection containing that keyword
Input: Fahrrad
Output (in top 10):
• man riding a bicycle on a foggy day
• a biker does a trick on a ramp
• a person is doing tricks on a bicycle in a city
Input: Straße (street)
Output (in top 10):
• a woman in black and red listens to an ipod walks down the street
• people on the city street walk past a puppet theater
• an asian woman rides a bicycle in front of two cars

Cross-lingual keyword spotting performance
[Figure: bar chart of P@10 (%), axis from 0 to 100, for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]

Example predictions marked as errors
Input: Feld (field)
Output:
• a brown and black dog running through a grassy field *
Input: grün(en) (green)
Output:
• a brown dog is chasing a red frisbee across a grassy field *
Input: groß(en) (big)
Output:
• a small group of people sitting together outside *

Error analysis by annotator
[Figure: bar chart of P@10 (%), axis from 0 to 100, for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]

Cross-lingual keyword spotting
[Figure: written query "burning" (English), shown with the Swahili words "moto" and "kuwaka".]

Conclusions and future work
• Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations
