Multimodal learning from images and speech

Multimodal learning from images and speech. KU Leuven & UPF Barcelona, January 2019. Herman Kamper, E&E Engineering, Stellenbosch University, South Africa. http://www.kamperh.com/ Advances in speech recognition 3 / 35


  4. Task 2: Keyword spotting 16 / 35

     | Keyword | Example of matched utterance | Type |
     |---|---|---|
     | beach | a boy in a yellow shirt is walking on a beach . . . | correct |
     | behind | a surfer does a flip on a wave | mistake |
     | bike | a dirt biker flies through the air | variant |
     | boys | two children play soccer in the park | semantic |
     | large | . . . a rocky cliff overlooking a body of water | semantic |
     | play | children playing in a ball pit | variant |
     | sitting | two people are seated at a table with drinks | semantic |
     | yellow | a tan dog jumping over a red and blue toy | mistake |
     | young | a little girl on a kid swing | semantic |

  5. Task 3: Semantic speech retrieval. Written query: fire. [Figure: speech items labelled ‘burning’] [Kamper et al., TASLP’19] 17 / 35

  7. Human (MTurk) evaluation 18 / 35

     | Keyword | Top retrieved utterance | Human label |
     |---|---|---|
     | ocean | man falling off a blue surfboard in the ocean | 5 / 5 |
     | snowy | a skier catches air over the snow | 5 / 5 |
     | bike | a dirt biker rides through some trees | 4 / 5 |
     | children | a group of young boys playing soccer | 4 / 5 |
     | field | two white dogs running in the grass together | 3 / 5 |
     | swimming | a woman holding a young boy slide down a water slide into a pool | 3 / 5 |
     | carrying | small dog running in the grass with a toy in its mouth | 2 / 5 ∗ |
     | large | a group of people on a zig path through the mountains | 1 / 5 ∗ |
     | hair | two women and a man smile for the camera | 0 / 5 ∗ |

  9. Task 3: Semantic speech retrieval. [Bar chart: P@10, axis 0 to 100, for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP, TextParagram] 19 / 35
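The P@10 axis on this slide is conventionally read as precision at 10: the fraction of the ten top-ranked utterances that are relevant to the query. The following is a minimal sketch of that metric, not code from the talk. The function name `precision_at_k` and the utterance IDs are hypothetical, and relevance is assumed to be given as a set of IDs.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    # Dividing by k (not len(top_k)) penalises rankings shorter than k.
    return hits / k


# Toy ranking for one written query (IDs are made up for illustration).
ranked = ["utt7", "utt2", "utt9", "utt4", "utt1"]
relevant = {"utt2", "utt4", "utt5"}
score = precision_at_k(ranked, relevant, k=4)
```

Averaging this score over all queries would give a single number per model, which is presumably what each bar reports.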

  10. Task 3: Semantic speech retrieval. [Bar chart: Spearman’s ρ, axis 0 to 30, for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP, TextParagram] 20 / 35

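  11. But this model is trained for English? [Diagram: speech $X$ passes through conv, max and feedfwd layers to give $f(X)$; the paired image passes through VGG to give visual tag targets $\hat{y}_{\mathrm{vis}}$; a loss $\ell$ compares $f(X)$ with $\hat{y}_{\mathrm{vis}}$] [Kamper et al., Interspeech’17] 21 / 35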

  12. Task 4: Cross-lingual keyword spotting Arapaho speech collection (want to search) Given English keyword: ‘Disease’ [Kamper and Roth, SLTU’18] 22 / 35

  13. Task 4: Cross-lingual keyword spotting English speech collection (want to search) Given German keyword: ‘Hunde’ [Kamper and Roth, SLTU’18] 22 / 35

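  14. Task 4: Cross-lingual keyword spotting. [Diagram: cross-lingual keyword spotter. Image $I$ passes through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$ (e.g. ‘springt’, ‘Hunde’); English speech $X$ passes through conv, max and feedfwd layers to give $f(X)$; a loss $\ell$ compares $f(X)$ with $\hat{y}_{\mathrm{de}}$] [Kamper and Roth, SLTU’18] 23 / 35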
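The diagram on this slide trains a speech network f(X) against tags predicted from the paired image. The slide only shows a "Loss" box, so the sketch below assumes a summed cross-entropy between sigmoid outputs and soft tag targets. The function names, the two-word vocabulary and all numbers are illustrative, not taken from the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_tag_loss(speech_logits, visual_targets, eps=1e-12):
    """Summed cross-entropy between f(X) = sigmoid(logits) and soft targets.

    Assumption: the slide does not name the loss; cross-entropy is one
    common choice for multi-label targets in [0, 1].
    """
    loss = 0.0
    for z, t in zip(speech_logits, visual_targets):
        p = sigmoid(z)
        loss -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return loss


def spot_keywords(speech_logits, vocabulary, threshold=0.5):
    """Return the keywords whose predicted probability exceeds the threshold."""
    return [w for w, z in zip(vocabulary, speech_logits) if sigmoid(z) > threshold]


# Toy example: two German tags scored for one English utterance.
vocabulary = ["Hunde", "springt"]
logits = [2.0, -1.0]           # hypothetical speech network outputs
targets = [0.9, 0.2]           # hypothetical soft tags from the vision system
loss = soft_tag_loss(logits, targets)
detected = spot_keywords(logits, vocabulary)
```

At test time only the speech branch is needed: a keyword is reported as present when its output exceeds a chosen threshold, so no images or transcriptions are required for search.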

  16. 2. Multimodal One-Shot Learning from Images and Speech. Ryan Eloff, Herman Engelbrecht

  17. You are the robot 26 / 35

  24. You are the robot ? 26 / 35

  28. Unimodal one-shot learning and classification. One-shot speech learning: support set of spoken words (three, one, five, two, four). One-shot speech classification: query ( two ), $\hat{y}$ = ? [Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

  31. Unimodal one-shot learning and classification. One-shot speech learning: support set of spoken words (three, one, five, two, four). One-shot speech classification: query ( two ), $\hat{y}$ = two [Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

  34. Multimodal one-shot learning and matching. Multimodal one-shot learning: support set. Multimodal one-shot matching: query ( two ), matching set, ? [Eloff et al., arXiv’18] 28 / 35

  35. Our framework. Multimodal one-shot learning: support set. Multimodal one-shot matching: query ( two ), matching set [Eloff et al., arXiv’18] 29 / 35

  42. Our approach to multimodal one-shot learning 30 / 35
      • Requires within-modality distance metrics
      • Can be done directly over features: DTW over speech, cosine over image pixels
      • Or distance metrics can be learned from background data
      • Compare these on TIDigits (speech) paired with MNIST (images)
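The feature-level baseline named on this slide (DTW over speech, cosine over image pixels) can be sketched as a two-step nearest-neighbour lookup. This is an illustration under assumptions, not the authors' implementation: it assumes the spoken query is first matched to the closest spoken item in the support set, and that item's paired image is then matched to the closest image in the matching set. The Euclidean frame cost, the length normalisation and all function names are choices made here for the example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumed frame cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(x, y):
    """Dynamic time warping cost between two sequences of frames."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)  # normalise so long sequences are not penalised


def one_shot_match(speech_query, support_set, matching_set):
    """Return the index of the matching-set image predicted for a spoken query.

    support_set: list of (speech, image) pairs, one per class.
    matching_set: list of unseen images (flattened pixel vectors).
    """
    # Step 1: within-modality comparison in speech.
    _, support_image = min(support_set, key=lambda pair: dtw_distance(speech_query, pair[0]))
    # Step 2: within-modality comparison in images.
    return min(range(len(matching_set)),
               key=lambda k: cosine_distance(support_image, matching_set[k]))


# Toy data: 2-D speech frames and 4-pixel images (all values made up).
support = [
    ([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]], [1.0, 0.0, 0.0, 1.0]),
    ([[1.0, 0.0], [1.0, 1.0]], [0.0, 1.0, 1.0, 0.0]),
]
matching = [[0.1, 0.9, 0.8, 0.0], [0.9, 0.1, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
query = [[0.0, 0.9], [1.0, 0.1]]
predicted = one_shot_match(query, support, matching)
```

Swapping `dtw_distance` and `cosine_distance` for distances computed in a learned embedding space would give the learned-metric variant mentioned in the third bullet, with the lookup itself unchanged.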

  44. Background data: Omniglot (no digits); isolated labelled words (no digits). 31 / 35

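  46. Models for metric learning. Classifier network: input $X$ is classified into word classes (e.g. ‘cricket’). Siamese network: inputs $X_1$ and $X_2$ are embedded as $y_1 = f(X_1)$ and $y_2 = f(X_2)$, and the distance $d(y_1, y_2)$ is computed between the embeddings. 32 / 35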
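The Siamese idea on this slide is that one embedding function f is shared by both inputs and the distance d(y1, y2) is taken between the embeddings. The sketch below illustrates this with a single linear layer plus ReLU standing in for f. The slide does not state the training objective, so the triplet hinge loss used here is an assumption, and the weights, margin and inputs are made up.

```python
import math


def embed(x, weights):
    """Shared embedding f(X): one linear layer followed by ReLU (a stand-in)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def cosine_distance(y1, y2):
    """d(y1, y2) = 1 - cosine similarity; returns 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(y1, y2))
    norm1 = math.sqrt(sum(a * a for a in y1))
    norm2 = math.sqrt(sum(b * b for b in y2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 1.0
    return 1.0 - dot / (norm1 * norm2)


def triplet_loss(anchor, positive, negative, weights, margin=0.2):
    """Hinge loss pushing same-class pairs closer than different-class pairs.

    Assumption: the loss is not given on the slide; this is one common
    objective for training Siamese networks.
    """
    # The same weights embed all three inputs: this sharing is the Siamese part.
    ya, yp, yn = (embed(x, weights) for x in (anchor, positive, negative))
    return max(0.0, margin + cosine_distance(ya, yp) - cosine_distance(ya, yn))


# Toy example with identity weights, so embeddings equal the inputs.
weights = [[1.0, 0.0], [0.0, 1.0]]
anchor, positive, negative = [1.0, 0.0], [2.0, 0.0], [0.0, 1.0]
good = triplet_loss(anchor, positive, negative, weights)
bad = triplet_loss(anchor, negative, positive, weights)
```

A loss of zero means the positive is already closer to the anchor than the negative by at least the margin, so that triplet contributes no gradient.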

  47. Multimodal one-shot matching. [Bar chart: Accuracy (%), axis 0 to 100, for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline), Siamese CNN (online)] 33 / 35
