  1. Motivation | State of the art | Proposed approach | Experiment: Singing voice separation | Discussion and summary. Semi-Supervised Adversarial Audio Source Separation applied to Singing Voice Extraction. Daniel Stoller (1), Sebastian Ewert (2), Simon Dixon (1). (1) Centre for Digital Music, Queen Mary University of London; (2) Spotify. MLSP-L8: Deep Learning III, ICASSP, 19.04.2018

  2. Audio source separation. Task: recover the sources from mixtures. Example: music instrument separation.

  3. Current state of the art [5, 3, 1]. Training on multitrack datasets. Neural network. Discriminative, MSE loss.

  4. Current state of the art [5, 3, 1]. Training on multitrack datasets (small ⇒ overfitting!). Neural network. Discriminative, MSE loss.

  5. Our goal ⇒ How to also learn from unpaired mixtures and sources? Random mixing ignores source correlations [4, 2].
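The random-mixing baseline mentioned above can be sketched as follows. This is a minimal illustration, not the setup of the cited works: the function name `random_mix`, the array shapes, and the assumption that magnitude spectrograms add (only approximately true) are all hypothetical. Vocals and accompaniment are drawn independently, so any relationship between them (shared key, tempo, timing) is lost in the synthetic mixtures.

```python
import numpy as np

def random_mix(vocals, accompaniments, rng):
    """Pair each vocal excerpt with an independently drawn accompaniment.

    vocals, accompaniments: arrays of shape (n_excerpts, n_freq, n_frames).
    Returns (mixtures, vocals, chosen_accompaniments).
    """
    # Independent draw: the accompaniment is unrelated to the vocal excerpt.
    idx = rng.integers(0, len(accompaniments), size=len(vocals))
    chosen = accompaniments[idx]
    # Assumption: magnitudes are treated as additive (an approximation).
    return vocals + chosen, vocals, chosen

rng = np.random.default_rng(0)
vocals = rng.random((4, 8, 16))          # toy "spectrograms"
accompaniments = rng.random((6, 8, 16))
mix, voc, acc = random_mix(vocals, accompaniments, rng)
print(mix.shape)
```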

  6. Theoretical framework: Intuition. [Diagram: unlabeled mixtures from the mixture database (magnitude spectrograms) are fed to the separator network, which outputs accompaniment estimates and vocal estimates (magnitude spectrograms).]

  7. Theoretical framework: Intuition. [Diagram: unlabeled mixtures from the mixture database (magnitude spectrograms) are fed to the separator network, which outputs accompaniment estimates and vocal estimates (magnitude spectrograms). Alongside the estimates: unlabeled accompaniment from the accompaniment database and unlabeled vocals from the singing voice database (magnitude spectrograms).]

  8. Theoretical framework: Intuition.

  9. Theoretical framework: Derivation of unsupervised loss. For optimal separator: $q_\phi(s_k \mid m) = p(s_k \mid m)$

  10. Theoretical framework: Derivation of unsupervised loss. For optimal separator: $q_\phi(s_k \mid m) = p(s_k \mid m)$. Taking the expectation over mixtures: $\mathbb{E}_{m \sim p_{\mathrm{data}}}\, q_\phi(s_k \mid m) = \mathbb{E}_{m \sim p_{\mathrm{data}}}\, p(s_k \mid m)$, i.e. overall separator output = source distribution.

  11. Theoretical framework: Derivation of unsupervised loss. For optimal separator: $q_\phi(s_k \mid m) = p(s_k \mid m)$. Taking the expectation over mixtures: $\mathbb{E}_{m \sim p_{\mathrm{data}}}\, q_\phi(s_k \mid m) = \mathbb{E}_{m \sim p_{\mathrm{data}}}\, p(s_k \mid m)$, i.e. $q_\phi^{\mathrm{out},k} = p_s^k$.

  12. Theoretical framework: Derivation of unsupervised loss. For optimal separator: $q_\phi(s_k \mid m) = p(s_k \mid m)$. Taking the expectation over mixtures: $\mathbb{E}_{m \sim p_{\mathrm{data}}}\, q_\phi(s_k \mid m) = \mathbb{E}_{m \sim p_{\mathrm{data}}}\, p(s_k \mid m)$, i.e. $q_\phi^{\mathrm{out},k} = p_s^k$. This is a necessary condition for an optimal separator. Loss: minimise the divergence between each separator output distribution and the corresponding source distribution: $L_u = \sum_{k=1}^{K} D[q_\phi^{\mathrm{out},k} \,\|\, p_s^k]$
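The marginalisation step behind this condition can be written out explicitly. The block below is a sketch of the reasoning, assuming $q_\phi^{\mathrm{out},k}$ denotes the distribution of separator outputs for source $k$ over all mixtures and $p_s^k$ denotes the marginal distribution of source $k$; the integral notation is added here for illustration.

```latex
\begin{align}
\mathbb{E}_{m \sim p_{\mathrm{data}}}\, p(s_k \mid m)
  &= \int p_{\mathrm{data}}(m)\, p(s_k \mid m)\, \mathrm{d}m
   = \int p(s_k, m)\, \mathrm{d}m
   = p_s^k(s_k), \\
q_\phi^{\mathrm{out},k}(s_k)
  &:= \mathbb{E}_{m \sim p_{\mathrm{data}}}\, q_\phi(s_k \mid m)
   = \int p_{\mathrm{data}}(m)\, q_\phi(s_k \mid m)\, \mathrm{d}m .
\end{align}
```

If $q_\phi(s_k \mid m) = p(s_k \mid m)$ for every mixture $m$, the two right-hand sides coincide, so $q_\phi^{\mathrm{out},k} = p_s^k$. The converse does not hold in general: different conditionals can share the same marginal, which is why the condition is necessary rather than sufficient.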

  13. Theoretical framework: Overall approach. Supervised loss $L_s$: MSE between estimate and ground truth.

  14. Theoretical framework: Overall approach. Supervised loss $L_s$: MSE between estimate and ground truth. Unsupervised loss: $L_u = \sum_{k=1}^{K} D[q_\phi^{\mathrm{out},k} \,\|\, p_s^k]$. $L_{\mathrm{add}}$: MSE between sum of source estimates and mixture.

  15. Theoretical framework: Overall approach. Supervised loss $L_s$: MSE between estimate and ground truth. Unsupervised loss: $L_u = \sum_{k=1}^{K} D[q_\phi^{\mathrm{out},k} \,\|\, p_s^k]$. $L_{\mathrm{add}}$: MSE between sum of source estimates and mixture. Total loss: $L = L_s + \alpha L_u + \beta L_{\mathrm{add}}$
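How the three terms combine can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the authors' implementation: the function names, the array shapes, and the example weights are hypothetical, the unsupervised term is passed in as a precomputed number, and all terms are evaluated on the same example even though in practice the supervised term needs paired data while the other two can use unlabeled mixtures.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def total_loss(est, target, mixture, l_u, alpha, beta):
    """L = L_s + alpha * L_u + beta * L_add.

    est, target: arrays of shape (K, n_freq, n_frames), one slice per source.
    mixture:     array of shape (n_freq, n_frames).
    l_u:         unsupervised divergence term, supplied by the caller.
    """
    l_s = mse(est, target)                  # supervised: estimate vs. ground truth
    l_add = mse(est.sum(axis=0), mixture)   # additivity: sum of estimates vs. mixture
    return l_s + alpha * l_u + beta * l_add

rng = np.random.default_rng(0)
target = rng.random((2, 8, 16))             # K = 2: vocals and accompaniment
mixture = target.sum(axis=0)
# A perfect estimate leaves only the weighted unsupervised term.
loss = total_loss(target, target, mixture, l_u=0.5, alpha=0.1, beta=1.0)
print(loss)
```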

  16. Implementation using GANs: Divergence minimization with GANs. The discriminator estimates a divergence $D$ between the generator distribution and the real distribution. The generator minimises the divergence $D$.

  17. Implementation using GANs: Divergence minimization with GANs. The discriminator estimates a divergence $D$ between the generator distribution and the real distribution. The generator minimises the divergence $D$. Our separator is a conditional generator ⇒ we use one discriminator per source to estimate the Wasserstein distance $W[q_\phi^{\mathrm{out},k} \,\|\, p_s^k]$
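The per-source Wasserstein estimate can be sketched as below, assuming the standard critic formulation: the mean critic score on real source examples minus the mean score on separator outputs. The function names and the toy linear critics are hypothetical stand-ins. A real critic would be a trained network kept approximately 1-Lipschitz, and how that constraint is enforced is not covered here.

```python
import numpy as np

def wasserstein_estimate(scores_real, scores_fake):
    """Critic-based estimate: E[D(real)] - E[D(separator output)]."""
    return float(np.mean(scores_real) - np.mean(scores_fake))

def unsupervised_loss(critics, real_sources, estimates):
    """Sum the estimates over sources, using one critic per source."""
    return sum(
        wasserstein_estimate(critic(real), critic(est))
        for critic, real, est in zip(critics, real_sources, estimates)
    )

def toy_critic(batch):
    # Hypothetical critic: scores each spectrogram by its mean magnitude.
    return batch.mean(axis=(1, 2))

real_vocals = np.full((4, 8, 16), 2.0)   # unlabeled solo vocals
est_vocals = np.full((4, 8, 16), 1.5)    # separator vocal estimates
real_acc = np.ones((4, 8, 16))           # unlabeled accompaniment
est_acc = np.ones((4, 8, 16))            # separator accompaniment estimates

l_u = unsupervised_loss(
    [toy_critic, toy_critic],
    [real_vocals, real_acc],
    [est_vocals, est_acc],
)
print(l_u)
```

Each critic is trained to maximise its estimate, while the separator is trained to minimise the sum, which pushes its outputs for each source towards the distribution of real examples of that source.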

  18. Experimental setup. Avoids dataset bias. Supervised and semi-supervised training with early stopping. U-Net as separator, DCGAN as discriminator.

  19. Results: Performance. [Figure: mean accompaniment SDR and mean vocal SDR for Baseline vs. Ours on the test sets DSD100, MedleyDB, CCMixter and iKala.]
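For readers unfamiliar with the metric, the signal-to-distortion ratio (SDR) can be illustrated with its simplest form, the energy of the reference over the energy of the error, in decibels. This sketch is an assumption for illustration only: the evaluation behind the reported numbers may use a different definition (for example a BSS Eval variant), and the function name `simple_sdr` is hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-12):
    """10 * log10(||reference||^2 / ||reference - estimate||^2) in dB."""
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero for a perfect estimate or a silent reference.
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
reference = np.sin(2 * np.pi * 440.0 * t)   # toy "source" signal
estimate = 0.9 * reference                  # estimate with 10% amplitude error
sdr = simple_sdr(reference, estimate)
print(round(sdr, 2))
```

Here the residual has one tenth of the reference amplitude, so the energy ratio is 100 and the SDR is about 20 dB. Higher values mean less distortion.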

  20. Results: Qualitative. [Figure: two spectrogram panels, time t (s) against frequency f (Hz): (a) separator estimate $x$, (b) discriminator gradient $\nabla_x D(x)$.]

  21. Results: Qualitative. [Figure: two spectrogram panels, time t (s) against frequency f (Hz): (a) separator estimate $x$, (b) discriminator gradient $\nabla_x D(x)$.] ⇒ Discriminator appears to work. More perceptual loss function?

  22. Summary. Current SotA methods only use multi-track data. Our approach also uses solo source recordings. Performance improvement in singing voice separation experiment. More perceptual loss? (seeks posterior modes, not means)
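The remark about posterior modes versus means can be illustrated with a toy example. The sketch below assumes a hypothetical bimodal posterior over a single spectrogram bin, where the source is either silent (0) or active (1) with equal probability; it is not taken from the experiments. The constant that minimises the MSE is the posterior mean, which lies between the two plausible values and matches neither of them.

```python
import numpy as np

# Hypothetical posterior samples for one bin: silent (0.0) or active (1.0).
samples = np.array([0.0] * 500 + [1.0] * 500)

# Evaluate the MSE of every candidate constant prediction in [0, 1].
candidates = np.linspace(0.0, 1.0, 101)
mse = np.array([np.mean((samples - c) ** 2) for c in candidates])

best = float(candidates[np.argmin(mse)])
print(best, float(samples.mean()))
```

An MSE-trained separator is therefore pulled towards averaged outputs in ambiguous regions, whereas a loss that favours posterior modes would pick one of the plausible values.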

  23. End. Code available at [URL missing in source]. Thank you for your attention!

