Workshop Programme Introduction: VoxCeleb, VoxConverse & VoxSRC
Arsha Nagrani, Joon Son Chung & Andrew Zisserman


  1. Workshop Programme
     19:00  Introduction: “VoxCeleb, VoxConverse & VoxSRC” – Arsha Nagrani, Joon Son Chung & Andrew Zisserman
     19:25  Keynote: “X-vectors: Neural Speech Embeddings for Speaker Recognition” – Daniel Garcia-Romero
     20:00  Speaker verification: leaderboards and winners for Tracks 1-3
     20:05  Participant talks from Tracks 1, 2 and 3, live Q&A
     20:50  Coffee break
     21:10  Keynote: “Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition” – Shinji Watanabe
     21:40  Diarization: leaderboards and winners for Track 4
     21:42  Participant talks from Track 4, live Q&A
     22:00  Wrap-up discussions and closing

  2. Organisers: Andrew Zisserman, Joon Son Chung, Arsha Nagrani, Jaesung Huh, Andrew Brown, Mitch McLaren, Doug Reynolds, Ernesto Coto, Weidi Xie

  3. Workshop Programme (the programme slide is shown again; see slide 1 for the full schedule, 19:00-22:00)

  4. Introduction: 1. Data: VoxCeleb and VoxConverse. 2. Challenge mechanics: new tracks, rules and metrics.

  5. VoxCeleb datasets: multi-speaker environments; varying audio quality and background/channel noise; freely available. (Figure: example settings - red-carpet interviews, studio interviews, outdoor and pitch-side interviews.)

  6. VoxCeleb - automatic pipeline: transferring labels from vision to speech. (Figure: input video -> face + landmark detection -> face verification against the target identity, e.g. Felicity Jones (match) -> active speaker detection -> VoxCeleb.)
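
A rough sketch of the label-transfer logic on this slide: keep only the time spans where a detected face track both matches the target identity (face verification) and is judged to be speaking (active speaker detection). The FaceTrack fields, thresholds and transfer_labels function below are hypothetical placeholders, not the authors' pipeline code.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class FaceTrack:
        start: float           # seconds
        end: float
        identity_score: float  # face-verification similarity to the target celebrity
        av_sync_score: float   # active-speaker (audio-visual sync) confidence

    def transfer_labels(tracks: List[FaceTrack],
                        verif_threshold: float = 0.7,
                        asd_threshold: float = 0.5) -> List[Tuple[float, float]]:
        """Keep spans where the target identity is both seen and speaking."""
        keep = []
        for t in tracks:
            if t.identity_score < verif_threshold:  # face verification: not the target person
                continue
            if t.av_sync_score < asd_threshold:     # active speaker detection: not speaking
                continue
            keep.append((t.start, t.end))
        return keep

    # Toy usage: one matching + speaking track, one matching but silent track.
    print(transfer_labels([FaceTrack(3.2, 8.9, 0.91, 0.83),
                           FaceTrack(12.0, 15.5, 0.88, 0.12)]))  # -> [(3.2, 8.9)]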

  7. VoxCeleb statistics. The VoxCeleb2 dev set is the primary data for speaker verification; a validation toolkit is provided for scoring. Train: 5,994 speakers, 1,092,009 utterances. Validation: 1,251 speakers, 153,516 utterances.
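
The slide mentions a validation toolkit for scoring. As a loose illustration (not the official toolkit), verification trials are commonly scored by comparing the speaker embeddings of the two utterances, for example with cosine similarity; the embeddings and trial list below are random stand-ins.

    import numpy as np

    def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
        """Cosine similarity: higher means the two utterances more likely share a speaker."""
        a = emb_a / np.linalg.norm(emb_a)
        b = emb_b / np.linalg.norm(emb_b)
        return float(np.dot(a, b))

    # Toy trial list of (utterance_a, utterance_b) pairs.
    rng = np.random.default_rng(0)
    embeddings = {f"utt{i}": rng.normal(size=256) for i in range(4)}
    trials = [("utt0", "utt1"), ("utt2", "utt3")]
    for a, b in trials:
        print(a, b, round(cosine_score(embeddings[a], embeddings[b]), 3))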

  8. A more challenging test set - VoxMovies: hard samples from VoxCeleb identities speaking in movies; playing characters, showing strong emotions, background noise. (Figure: VoxCeleb vs. VoxMovies examples - accent change, background music, emotion.)

  9. A more challenging test set - VoxMovies (continued): further VoxCeleb vs. VoxMovies examples of accent change, background music and emotion.

  10. A more challenging test set - VoxMovies: an audio dataset, but collected using visual methods (the VoxCeleb automatic pipeline). (Figure: the same identity, Steve Martin, in VoxCeleb and in VoxMovies.)

  11. Audio speaker diarization: solving “who spoke when” in multi-speaker video.

  12. Diarization - the VoxConverse dataset: 526 videos from YouTube, mostly debates, talk shows and news segments. http://www.robots.ox.ac.uk/~vgg/data/voxconverse/
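
For context, diarization references such as these are conventionally distributed as RTTM files, where each SPEAKER line carries a recording ID, onset, duration and speaker label. A minimal parser, assuming the standard RTTM field layout (illustrative, not the official VoxConverse tooling):

    def parse_rttm_line(line: str) -> dict:
        """Parse one RTTM SPEAKER line into a small dict."""
        fields = line.split()
        # SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker-id> <NA> <NA>
        return {"file": fields[1],
                "onset": float(fields[3]),
                "duration": float(fields[4]),
                "speaker": fields[7]}

    print(parse_rttm_line("SPEAKER abcde 1 13.07 4.36 <NA> <NA> spk01 <NA> <NA>"))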

  13. Automatic audio-visual diarization method. (Figure: input video -> face detection and face-track generation -> active speaker detection -> audio-visual source separation -> speaker verification -> face-track clustering -> VoxConverse.) Chung, Joon Son, et al. "Spot the conversation: speaker diarisation in the wild." INTERSPEECH (2020).
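
The final clustering step of this pipeline can be pictured as grouping per-segment speaker embeddings so that each cluster becomes one speaker label. The sketch below runs off-the-shelf agglomerative clustering on cosine distances over toy random embeddings; it illustrates the idea only and is not the method of the cited paper.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Toy per-segment embeddings; in a real system these would come from a
    # speaker-verification model applied to each speech segment / face track.
    rng = np.random.default_rng(1)
    spk_a, spk_b = rng.normal(size=128), rng.normal(size=128)
    segments = np.stack([spk_a + 0.1 * rng.normal(size=128) for _ in range(3)]
                        + [spk_b + 0.1 * rng.normal(size=128) for _ in range(2)])

    # Agglomerative clustering on cosine distance; each cluster = one speaker label.
    dists = pdist(segments, metric="cosine")
    labels = fcluster(linkage(dists, method="average"), t=0.5, criterion="distance")
    print(labels)  # e.g. [1 1 1 2 2]: segments grouped into two speakers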

  14. The VoxCeleb Speaker Recognition Challenge

  15. VoxSRC-2020 tracks • Track 1: Supervised speaker verification (closed) • Track 2: Supervised speaker verification (open) • Track 3: Self-supervised speaker verification (closed) • Track 4: Speaker diarization (open). TWO NEW tracks this year!

  16. New tracks. Track 3: Self-supervised speaker verification. No speaker labels allowed; participants can use future frames, visual frames, or any other objective derived from the video itself. Track 4: Speaker diarization. Solving “who spoke when” in multi-speaker video, with speaker overlap and challenging background conditions.
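
For Track 3, one common family of label-free objectives (shown purely as an illustration; the rules only require that no speaker labels are used) is a contrastive loss that treats two clips from the same video as a positive pair and clips from other videos as negatives:

    import numpy as np

    def contrastive_loss(anchor, positive, negatives, temperature=0.07):
        """InfoNCE-style loss: pull together embeddings of clips from the same
        video, push apart embeddings of clips from other videos."""
        def cos(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
        logits /= temperature
        log_probs = logits - np.log(np.exp(logits).sum())
        return -log_probs[0]  # low when the positive pair scores highest

    rng = np.random.default_rng(2)
    anchor = rng.normal(size=64)
    positive = anchor + 0.05 * rng.normal(size=64)        # another clip, same video
    negatives = [rng.normal(size=64) for _ in range(5)]   # clips from other videos
    print(float(contrastive_loss(anchor, positive, negatives)))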

  17. Mechanics. Metrics (Tracks 1-3): DCF and EER, following NIST-SRE 2018. Metrics (Track 4): DER and JER, with overlapping speech counted and a collar of 0.25 s. Only 1 submission per day, 5 in total; submissions via CodaLab.
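
As a reference point for the Track 1-3 metrics: EER is the operating point at which the false-acceptance and false-rejection rates are equal, while DCF additionally weights the two error types with NIST SRE-style costs and priors. A minimal EER computation over trial scores and target/non-target labels (illustrative only; the challenge uses its own scoring scripts):

    import numpy as np

    def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
        """EER from trial scores and binary labels (1 = target, 0 = non-target)."""
        order = np.argsort(-scores)          # sweep the threshold from high to low
        labels = labels[order]
        n_target = labels.sum()
        n_nontarget = len(labels) - n_target
        frr = 1.0 - np.cumsum(labels) / n_target      # targets rejected (misses)
        far = np.cumsum(1 - labels) / n_nontarget     # non-targets accepted (false alarms)
        idx = np.argmin(np.abs(frr - far))            # crossing point
        return float((frr[idx] + far[idx]) / 2)

    rng = np.random.default_rng(3)
    scores = np.concatenate([rng.normal(1.0, 1.0, 500),    # target trials
                             rng.normal(-1.0, 1.0, 500)])  # non-target trials
    labels = np.concatenate([np.ones(500), np.zeros(500)])
    print(round(compute_eer(scores, labels), 3))  # roughly 0.16 for these toy scores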

  18. Test Sets • New, more difficult test sets • Manual verification of all speech segments • In addition, annotators pay particular attention to examples whose speaker embeddings are far from cluster centres
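
One simple way to picture the embedding-based check in the last bullet: compute a centroid embedding per speaker and flag utterances that lie unusually far from it for extra annotator attention. A toy sketch of that idea (the organisers' actual procedure is not specified beyond this slide):

    import numpy as np

    def flag_outliers(embeddings: np.ndarray, num_std: float = 2.0) -> np.ndarray:
        """Flag rows whose distance to the speaker centroid exceeds mean + num_std * std."""
        centroid = embeddings.mean(axis=0)
        dists = np.linalg.norm(embeddings - centroid, axis=1)
        return dists > dists.mean() + num_std * dists.std()

    rng = np.random.default_rng(4)
    embs = rng.normal(size=(20, 128))   # toy embeddings for one speaker
    embs[3] += 5.0                      # one clearly atypical utterance
    print(np.where(flag_outliers(embs))[0])  # -> [3]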
