Workshop Programme
19:00 Introduction: “VoxCeleb, VoxConverse & VoxSRC” – Arsha Nagrani, Joon Son Chung & Andrew Zisserman
19:25 Keynote: “X-vectors: Neural Speech Embeddings for Speaker Recognition” – Daniel Garcia-Romero
20:00 Speaker verification: Leaderboards and winners for Tracks 1-3
20:05 Participant talks from Tracks 1, 2 and 3, live Q&A
20:50 Coffee break
21:10 Keynote: “Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition” – Shinji Watanabe
21:40 Diarization: Leaderboards and winners for Track 4
21:42 Participant talks from Track 4, live Q&A
22:00 Wrap-up discussions and closing
Organisers
Andrew Zisserman, Joon Son Chung, Arsha Nagrani, Jaesung Huh, Andrew Brown, Mitch McLaren, Doug Reynolds, Ernesto Coto, Weidi Xie
Introduction
1. Data: VoxCeleb and VoxConverse
2. Challenge Mechanics: new tracks, rules and metrics
VoxCeleb datasets
• Multi-speaker environments
• Varying audio quality and background channel noise
• Freely available
[Figure: example clips from red carpet interviews, studio interviews, and outdoor/pitch-side interviews]
VoxCeleb - automatic pipeline
Transferring labels from vision to speech: input video → face + landmark detection → face verification (matching detected face tracks against the target identity, e.g. Felicity Jones) → active speaker detection → VoxCeleb.
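In outline, the pipeline keeps a speech segment only if a verified face for the target identity is on screen and is the active speaker. A minimal sketch of that logic, where the data structures and the `matches_identity`/`is_active_speaker` callables are hypothetical stand-ins rather than the released implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Segment:
    start: float  # seconds
    end: float

@dataclass
class FaceTrack:
    embedding: Sequence[float]   # face embedding for this track
    segments: List[Segment]     # temporal extent of the track

def transfer_labels(
    tracks: List[FaceTrack],
    matches_identity: Callable[[Sequence[float]], bool],  # face verification
    is_active_speaker: Callable[[Segment], bool],         # lip-sync / ASD check
    name: str,
) -> List[Tuple[str, Segment]]:
    """Keep only the segments where the target identity is visibly speaking."""
    labelled = []
    for track in tracks:
        if not matches_identity(track.embedding):  # wrong person: discard track
            continue
        for seg in track.segments:
            if is_active_speaker(seg):             # visible face must be talking
                labelled.append((name, seg))
    return labelled
```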
VoxCeleb Statistics
VoxCeleb2 dev set → primary data for speaker verification
● Validation toolkit for scoring

               Train       Validation
# Speakers     5,994       1,251
# Utterances   1,092,009   153,516
A more challenging test set - VoxMovies
• Hard samples found from VoxCeleb identities speaking in movies
• Playing characters, showing strong emotions, background noise
[Figure: paired VoxCeleb and VoxMovies clips illustrating accent change, background music and emotion]
A more challenging test set - VoxMovies
• An audio dataset, but collected with visual methods (the VoxCeleb automatic pipeline)
[Figure: the same identity, e.g. Steve Martin, in VoxCeleb and in VoxMovies]
Audio speaker diarization
• Solving “who spoke when” in multi-speaker video
[Figure: multi-speaker clip featuring Steve Martin]
Diarization - The VoxConverse dataset
● 526 videos from YouTube
● Mostly debates, talk shows, news segments
http://www.robots.ox.ac.uk/~vgg/data/voxconverse/
Automatic audio-visual diarization method
Input video → face detection & face track clustering → active speaker detection → audio-visual source separation → speaker verification → VoxConverse.
Chung, Joon Son, et al. "Spot the conversation: speaker diarisation in the wild." INTERSPEECH (2020).
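For audio-only entries to the diarization track, a common baseline is to embed sliding windows of speech with a speaker model and cluster the embeddings. A minimal sketch of that generic approach (not the authors' audio-visual system; `embed` is a hypothetical speaker-embedding function, and the threshold is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(windows, embed, distance_threshold=1.0):
    """windows: list of (start, end, audio) tuples covering the speech regions.
    Returns (start, end, speaker_label) tuples: 'who spoke when'."""
    X = np.stack([embed(audio) for _, _, audio in windows])
    # Agglomerative clustering with a distance threshold, so the number of
    # speakers need not be known in advance. The `metric` argument requires
    # scikit-learn >= 1.2.
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit_predict(X)
    return [(start, end, f"spk{l}") for (start, end, _), l in zip(windows, labels)]
```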
The VoxCeleb Speaker Recognition Challenge
VoxSRC-2020 tracks
• Track 1: Supervised speaker verification (closed)
• Track 2: Supervised speaker verification (open)
• Track 3: Self-supervised speaker verification (closed)
• Track 4: Speaker diarization (open)
TWO NEW tracks this year!
New Tracks
Track 3: Self-Supervised
• No speaker labels allowed
• Can use future frames, visual frames, or any other objective from the video itself (an example objective is sketched below)
Track 4: Speaker Diarization
• Solving “who spoke when” in multi-speaker video
• Speaker overlap, challenging background conditions
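One objective that satisfies the Track 3 rules is a contrastive loss pulling together two augmented segments of the same utterance, which involves no speaker labels. A minimal sketch of such an objective (an illustration of one permissible choice, not a required method):

```python
import numpy as np

def contrastive_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """z1, z2: (N, D) L2-normalised embeddings of two augmented segments
    drawn from the same N utterances; row i of z1 should match row i of z2."""
    sim = z1 @ z2.T / temperature                       # (N, N) similarity matrix
    # Cross-entropy with the diagonal (same-utterance pairs) as the targets.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```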
Mechanics
Metrics (Tracks 1-3)
• DCF, EER
• Following NIST SRE 2018
Metrics (Track 4)
• DER, JER
• Overlapping speech counted, collar of 0.25s
• Only 1 submission per day, 5 in total
• Submissions via CodaLab
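For reference, EER is the operating point where the false-acceptance and false-rejection rates are equal, and the detection cost function weights misses and false alarms under a target-speaker prior. A sketch of how both can be computed from trial scores (the cost parameters below are illustrative defaults, not the official challenge values):

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """scores: higher = more likely the same speaker; labels: 1 = same speaker.
    p_target/c_miss/c_fa are assumed values for illustration."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER: the point where false-acceptance rate equals false-rejection rate.
    eer = fpr[np.argmin(np.abs(fnr - fpr))]
    # Detection cost function, normalised by the best trivial system.
    dcf = c_miss * fnr * p_target + c_fa * fpr * (1 - p_target)
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
    return eer, min_dcf
```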
Test Sets
• New, more difficult test sets
• Manual verification of all speech segments
• In addition, annotators pay particular attention to examples whose speaker embeddings are far from cluster centres (see the sketch below)
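The “far from cluster centre” check can be as simple as measuring each utterance's cosine distance to its speaker's centroid embedding and flagging the largest distances for manual review. A minimal sketch (the threshold value is an arbitrary illustration):

```python
import numpy as np

def flag_outliers(embeddings: np.ndarray, threshold: float = 0.4) -> np.ndarray:
    """embeddings: (N, D) L2-normalised embeddings of one speaker's utterances.
    Returns the indices of utterances to re-check manually."""
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)          # re-normalise the cluster centre
    cos_dist = 1 - embeddings @ centroid          # cosine distance to the centre
    return np.where(cos_dist > threshold)[0]      # far-from-centre utterances
```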