Visually grounded learning of keyword prediction from untranscribed speech
Interspeech, August 2017
Herman Kamper¹, Shane Settle², Gregory Shakhnarovich², Karen Livescu²
¹ Stellenbosch University, South Africa
² Toyota Technological Institute at Chicago, USA
http://www.kamperh.com/
Success in speech recognition
"i had to think of some example speech since speech recognition is really cool"
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
• Google Voice: English, Spanish, German, ..., Zulu (∼50 languages)
• Data: 2000 hours transcribed speech audio; ∼350M/560M words text
• Can we do this for all 7000 languages spoken in the world?
What can we learn from weak labels?
• Weak labels: Speech paired with another signal (e.g. images)
• Criticism: You always have some labelled data, but...
    • Get insight into human language acquisition [Räsänen and Rasilo, ’15]
    • Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
    • Analysis of audio for unwritten languages [Besacier et al., ’14]
    • New insights and models for speech processing [Jansen et al., ’13]
Using images to ground language
• Image captioning: Generate a written natural language description of a given image [Vinyals et al., CVPR’15]
• Grounding written language using images [Bernardi et al., JAIR’16]
• We consider images paired with unlabelled spoken captions
Word prediction from images and speech
[Figure: Training setup. An image is passed through a vision network (VGG) to give word scores y_vis, e.g. scores of 0.85, 0.8 and 0.9 for words such as hat, man and shirt. The paired spoken caption X is passed through a speech network (conv, max and feedfwd layers) to give f(X). A loss L compares f(X) with y_vis. The later builds show the speech network on its own, mapping X to f(X).]
$f(X) \in \mathbb{R}^W$ is a vector of word probabilities
I.e., a spoken bag-of-words (BoW) classifier
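The training signal in the diagram can be sketched in a few lines. This is a minimal sketch, assuming the loss L is a summed cross-entropy between the speech network output f(X) and the visual scores y_vis used as soft targets. The slides only label it "Loss L", so the exact form, the function name `soft_target_cross_entropy` and the example predictions are assumptions.

```python
import math


def soft_target_cross_entropy(f_x, y_vis, eps=1e-12):
    """Summed cross-entropy between predicted word probabilities f(X)
    and soft visual targets y_vis (both sequences of values in [0, 1]).

    Assumed form of the loss; the slides do not specify it.
    """
    if len(f_x) != len(y_vis):
        raise ValueError("f_x and y_vis must both have length W")
    loss = 0.0
    for p, y in zip(f_x, y_vis):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss


# Visual scores for three words (the values shown on the slide).
y_vis = [0.85, 0.8, 0.9]

# Hypothetical speech network outputs: one close to the targets, one far off.
close_loss = soft_target_cross_entropy([0.8, 0.8, 0.9], y_vis)
far_loss = soft_target_cross_entropy([0.1, 0.2, 0.1], y_vis)
print(close_loss < far_loss)
```

Under this assumption, the speech network is pushed towards the vision system's word scores without ever seeing a transcription.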
Images paired with untranscribed speech
We are still in this setting:
• We do not use any of the speech transcriptions during model training (only for evaluation)
• But our resulting model can make bag-of-words (BoW) predictions
• Note: The vision system could be seen as language-independent (future work)
Experimental details
• Data: 8000 images, each with 5 spoken captions, divided into train, development and test sets [Harwath and Glass, ASRU’15]
• Prediction: Output words $w$ where $f_w(X) > \alpha$
• Tasks: Spoken bag-of-words prediction; keyword spotting
• Evaluation: Compare to words in transcriptions of test data
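The prediction rule above is a simple threshold on the word probabilities. A minimal sketch, assuming f(X) is available as a mapping from word to probability; the function name `predict_bow` and the scores are hypothetical, while the two thresholds are the ones used in the Task 1 plot.

```python
def predict_bow(word_probs, alpha):
    """Return the words w whose probability f_w(X) exceeds the threshold alpha."""
    return sorted(w for w, p in word_probs.items() if p > alpha)


# Hypothetical network output f(X) for one utterance.
scores = {"bicycle": 0.91, "man": 0.75, "riding": 0.55, "dog": 0.05}

low_threshold = predict_bow(scores, alpha=0.4)
high_threshold = predict_bow(scores, alpha=0.7)
print(low_threshold)   # ['bicycle', 'man', 'riding']
print(high_threshold)  # ['bicycle', 'man']
```

Raising α yields fewer predicted words, each with a higher score.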
Task 1: Spoken bag-of-words prediction
Input utterance | Predicted BoW labels
man on bicycle is doing tricks in an old building | bicycle, bike, man, riding, wearing
a little girl is climbing a ladder | child, girl, little, young
a rock climber standing in a crevasse | climbing, man, rock
a dog running in the grass around sheep | dog, field, grass, running
a man in a miami basketball uniform looking to the right | ball, basketball, man, player, uniform, wearing
Task 1: Spoken bag-of-words prediction
[Figure: Precision (%), on an axis from 0 to 80, of the unigram baseline, VisionSpeechCNN and OracleSpeechCNN at thresholds α = 0.4 and α = 0.7]
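The precision in the plot comes from comparing predicted words with the words in the test transcriptions. A minimal sketch, assuming precision is pooled over all utterances (correct predicted words divided by all predicted words); the function name `bow_precision` is hypothetical, and the two examples reuse utterances and predictions from the Task 1 table.

```python
def bow_precision(predictions, transcriptions):
    """Percentage of predicted words that occur in the matching transcription.

    predictions: list of sets of predicted words, one set per utterance.
    transcriptions: list of transcription strings, in the same order.
    """
    n_correct = 0
    n_predicted = 0
    for predicted, transcript in zip(predictions, transcriptions):
        reference = set(transcript.split())
        n_predicted += len(predicted)
        n_correct += len(set(predicted) & reference)
    return 100.0 * n_correct / n_predicted if n_predicted else 0.0


transcriptions = [
    "man on bicycle is doing tricks in an old building",
    "a little girl is climbing a ladder",
]
predictions = [
    {"bicycle", "bike", "man", "riding", "wearing"},
    {"child", "girl", "little", "young"},
]

# 2 of 5 and 2 of 4 predicted words occur in the transcriptions: 4 / 9 overall.
precision = bow_precision(predictions, transcriptions)
print(round(precision, 1))
```

Exact word matching counts "bike" and "child" as errors even though they fit the scene.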
Task 1: Spoken bag-of-words prediction
[Figure: False alarm keywords and words in corresponding utterances]
Task 2: Keyword spotting
Keyword | Example of matched utterance (one of top 10) | Type
beach | a boy in a yellow shirt is walking on a beach ... | correct
behind | a surfer does a flip on a wave | mistake
bike | a dirt biker flies through the air |
Further keywords listed without examples: boys, large, play, sitting, yellow, young
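Keyword spotting can reuse the same word probabilities. A minimal sketch, assuming utterances are ranked by their score f_w(X) for the keyword w and the top 10 are returned as matches; the slides mention "one of top 10" but do not spell out the procedure, so the ranking rule, the function name `spot_keyword`, the utterance ids and the scores are all assumptions.

```python
def spot_keyword(keyword, utterance_scores, top_k=10):
    """Return the ids of the top_k utterances ranked by their score for keyword.

    utterance_scores: dict mapping utterance id -> dict of word -> probability.
    Words missing from an utterance's scores are treated as probability 0.
    """
    ranked = sorted(
        utterance_scores.items(),
        key=lambda item: item[1].get(keyword, 0.0),
        reverse=True,
    )
    return [utt_id for utt_id, _ in ranked[:top_k]]


# Hypothetical word probabilities for three utterances.
utterance_scores = {
    "utt1": {"beach": 0.9, "bike": 0.1},
    "utt2": {"beach": 0.2, "bike": 0.8},
    "utt3": {"beach": 0.6, "bike": 0.3},
}

beach_matches = spot_keyword("beach", utterance_scores, top_k=2)
bike_matches = spot_keyword("bike", utterance_scores, top_k=1)
print(beach_matches)  # ['utt1', 'utt3']
print(bike_matches)   # ['utt2']
```

A returned utterance counts as correct if its transcription contains the keyword, as with the "beach" example in the table.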