  1. GCT634@KAIST Invited Lecture: Sound Source Separation, 7 June 2018. Keunwoo Choi (QMUL.uk, Spotify.us, groovo.io)

  2. Sound Source Separation • Let’s isolate the “target” audio signal! • The “cocktail party effect” ...as if we were simulating the human brain (as if we knew what’s going on in there)

  3. Sound Source Separation: problem = f(assumptions); assumptions = {environment: {dry, wet, ...}, signal: {channels: {mono, stereo, ...}, content: {speech, music}}, target: {...}}
 Input | Target | Noise
 Speech + ambience | Speech | Ambience
 Mixture of speeches | Speaker i | All speakers j != i
 Music (vocal, drum, guitar, bass, ...) | Instrument i | All instruments j != i

  4. SSS Applications • KA-RA-O-KE! • Transcription • DJing/mixing • Many other MIR tasks • We once called it a “chicken-and-egg” problem; solving SSS would make many other tasks much easier

  5. BIG ASSUMPTION FOR A LONG WHILE • We work on |STFT| (or CQT) • One time-frequency bin, one instrument -- aka “W-disjoint orthogonality” • Phase doesn’t matter much • This assumption used to underlie (almost) all research
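A minimal sketch of what this assumption buys you, assuming two hypothetical stems of equal length ("vocals.wav", "accomp.wav" are made-up filenames): an ideal binary mask assigns every |STFT| bin wholly to one source, and the mixture phase is reused as-is.

```python
# Ideal binary masking under the W-disjoint assumption (a sketch, not a
# full system). Assumes the two stems have the same length and sample rate.
import numpy as np
import librosa

y1, sr = librosa.load("vocals.wav", sr=None)
y2, _ = librosa.load("accomp.wav", sr=None)
S1, S2 = librosa.stft(y1), librosa.stft(y2)
mix = S1 + S2                                   # mix in the STFT domain

mask = (np.abs(S1) > np.abs(S2)).astype(float)  # each bin belongs to one source
estimate = librosa.istft(mix * mask)            # phase comes from the mixture
```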

  6. Problem config 1 - mixing matrix A • There is a mixing matrix A; we estimate its inverse: x = As, y = Wx • s: sources (instruments) • a_ij: amplitude mixing coefficients • x: stereo input signal • W: estimated unmixing coefficients (W ≈ A^-1) • y: estimated sources (instruments)

  7. ICA • Independent Component Analysis (ICASSP ’98) • Based on statistical assumptions: independence and non-Gaussianity of the sources • Not audio-specific but a general technique • Example: http://www.kecl.ntt.co.jp/icl/signal/sawada/demo/bss2to4/index.html • Further study: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
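A minimal sketch of the instantaneous-mixing setup from the previous slide, solved with scikit-learn's FastICA. The sources and the mixing matrix A are synthetic (values made up for illustration).

```python
# FastICA on a toy 2x2 instantaneous mixture: x = A s, recover y ~ s.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 440 * t)            # source 1: a 440 Hz tone
s2 = np.sign(np.sin(2 * np.pi * 3 * t))     # source 2: a slow square wave
S = np.c_[s1, s2]                           # (n_samples, n_sources)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                  # mixing matrix A
X = S @ A.T                                 # stereo observation x = A s

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                    # estimated sources, up to
                                            # scaling and permutation
```

Note that ICA recovers the sources only up to scale and ordering, which is why evaluation of blind separation always allows for a permutation.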

  8. ADRess • Azimuth Discrimination and Resynthesis (DAFx 2004) • 1-D clustering for stereo sound source separation

  9. Problem config 2 - mixing matrix and delay • Sources are at different angles and distances → the mixing matrix A also encodes time delays

  10. DUET • Location = {angle, distance} • Each location forms a 2-D cluster; one cluster per instrument • DOA (Direction of Arrival) estimation • Something similar runs in your phone (with 2+ microphones) to suppress non-speech sounds (but probably not in your earphones/headphones)
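A rough DUET-flavoured sketch under assumed conditions (an anechoic stereo mixture in a hypothetical "mix.wav"): each time-frequency bin gets a (level ratio, relative delay) feature, bins are clustered, and a binary mask per cluster resynthesises one source.

```python
# DUET-style features + clustering (a sketch; the original paper builds a
# weighted 2-D histogram rather than running k-means on raw features).
import numpy as np
import librosa
from sklearn.cluster import KMeans

x, sr = librosa.load("mix.wav", sr=None, mono=False)  # x: (2, n_samples)
n_fft = 1024
L = librosa.stft(x[0], n_fft=n_fft)
R = librosa.stft(x[1], n_fft=n_fft)

eps = 1e-10
ratio = (np.abs(R) + eps) / (np.abs(L) + eps)         # inter-channel level ratio
omega = 2 * np.pi * np.fft.rfftfreq(n_fft, d=1.0 / sr)
omega[0] = eps                                        # avoid divide-by-zero at DC
delay = -np.angle(R * np.conj(L)) / omega[:, None]    # relative delay per bin

feats = np.stack([ratio.ravel(), delay.ravel()], axis=1)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)

mask = (labels == 0).reshape(L.shape).astype(float)   # bins of cluster 0
source0 = librosa.istft(L * mask)                     # crude source-0 estimate
```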

  11. Problem config 3 - music - spectra of instruments http://www.physics.usyd.edu.au/teach_res/hsp/sp/mod31/m31_strings.htm

  12. NMF • Assumptions when using NMF for SSS: • The spectral shapes of the musical instruments are known • NMF separates each note (as a basis)! • Many applications in drum separation (it works) https://www.slideshare.net/DaichiKitamura/robust-music-signal-separation-based-on-supervised-nonnegative-matrix-factorization-with-prevention-of-basis-sharing
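A minimal NMF sketch with librosa ("mix.wav" is a hypothetical input; this version is unsupervised, whereas a supervised variant would fix the spectral templates W to known instrument shapes).

```python
# NMF on a magnitude spectrogram: V ~ W H, then a Wiener-style soft mask
# to resynthesise one component.
import numpy as np
import librosa

y, sr = librosa.load("mix.wav", sr=None)
S = librosa.stft(y)
V = np.abs(S)

W, H = librosa.decompose.decompose(V, n_components=8, sort=True)
# W: (freq, 8) spectral templates;  H: (8, frames) activations

k = 0                                             # resynthesise component k
V_hat = W @ H
mask = np.outer(W[:, k], H[k]) / (V_hat + eps if (eps := 1e-10) else 1)
component = librosa.istft(S * mask)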

  13. Problem config 4 - music - repeats • “Instrumental parts repeat!” (↔ vocal) • “Drums/beats repeat!” (↔ harmonic instruments) • A valid assumption for modern popular music • E.g., REPET (IEEE TASLP 2013), KAM (D. Fano Yela, ICASSP 2017), ...
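A heavily reduced REPET-style sketch: it assumes the repeating period p (in frames) is already known, whereas the real method estimates it from a beat spectrum. The median across repetitions models the repeating (instrumental) part; "mix.wav" is again a hypothetical input.

```python
# Simplified REPET: median over repeating segments -> repeating model -> mask.
import numpy as np
import librosa

y, sr = librosa.load("mix.wav", sr=None)
S = librosa.stft(y)
V = np.abs(S)

p = 43                                          # assumed period in frames
n = (V.shape[1] // p) * p                       # trim to whole repetitions
segs = V[:, :n].reshape(V.shape[0], -1, p)      # (freq, n_repeats, p)
U = np.tile(np.median(segs, axis=1), (1, n // p))   # repeating model

mask = np.minimum(U, V[:, :n]) / (V[:, :n] + 1e-10)
background = librosa.istft(S[:, :n] * mask)         # repeating (instrumental)
foreground = librosa.istft(S[:, :n] * (1 - mask))   # non-repeating (~vocal)
```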

  14. Problem config 5 - music - some musical cases • “Central” (~= vocal) source separation • Because main vocals are almost always panned to the centre (and we all love karaoke) • Harmonic/percussive source separation • Because they differ (almost) completely along the spectral/temporal axes • Median filtering for drum separation (D. FitzGerald, DAFx 2010) • “Gaussian mixture model for singing voice separation from stereophonic music”, M. Kim et al., 2011
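Two short sketches of the ideas above. librosa's HPSS implements FitzGerald's median-filtering method; the mid/side trick assumes a hypothetical stereo "mix.wav" whose vocal is panned dead centre.

```python
# (1) Median-filtering HPSS; (2) the classic centre-cancellation trick.
import librosa

y, sr = librosa.load("mix.wav", sr=None)
H, P = librosa.decompose.hpss(librosa.stft(y))  # median filtering along
drums = librosa.istft(P)                        # time (H) and frequency (P)
harmonic_rest = librosa.istft(H)

x, _ = librosa.load("mix.wav", sr=None, mono=False)
side = (x[0] - x[1]) / 2    # cancels the centre-panned vocal ("karaoke")
mid = (x[0] + x[1]) / 2     # emphasises the centre (vocal, bass, kick, ...)
```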

  15. Problem config 6 - music - ‘informed’ source separation • Exploiting the score as side information “Score-Informed Source Separation for Musical Audio Recordings”, S Ewert et al., 2013

  16. History so far... as time goes by: less generality, stronger assumptions

  17. DEEP! LEARNING!!

  18. DL and SSS • Fewer assumptions (keep this in mind for the papers below!) • Data-related: trained models do NOT extrapolate. 
 E.g., a model trained on speech probably wouldn’t work on music. • Model-related: e.g., frame-based? Context-free? Does it estimate the phase? Stereo input?

  19. Frame-based DL-SS • Because vocals are distinguishable within a frame (or a few frames) • “Deep Learning for Monaural Speech Separation”, Po-Sen Huang et al., 2014
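A toy frame-based masker in PyTorch (the architecture details below are assumptions for illustration, not Huang et al.'s exact network): one |STFT| magnitude frame in, one soft mask out, trained against paired target frames.

```python
# Frame-in, mask-out MLP with dummy data standing in for a paired dataset.
import torch
import torch.nn as nn

n_freq = 513                                  # e.g. n_fft = 1024
model = nn.Sequential(
    nn.Linear(n_freq, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_freq), nn.Sigmoid(),     # soft mask in [0, 1]
)

mix = torch.rand(8, n_freq)                   # a batch of mixture frames
target = torch.rand(8, n_freq)                # paired target-source frames
loss = nn.functional.mse_loss(model(mix) * mix, target)
loss.backward()
```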

  20. U-Net and SS • Because vocals are distinguishable in the |STFT| “image” • “U-Net: Convolutional Networks for Biomedical Image Segmentation”, O. Ronneberger et al., 2015 • “Singing Voice Separation with Deep U-Net Convolutional Networks”, A. Jansson et al., ISMIR 2017
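A heavily reduced U-Net-ish sketch (two levels, one skip connection; the Jansson et al. model is much deeper): it treats the magnitude spectrogram as an image and predicts a soft mask that is applied to the input.

```python
# Tiny encoder-decoder with a skip connection, masking its own input.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(1, 16, 3, stride=2, padding=1)   # downsample
        self.enc2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.dec2 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)

    def forward(self, x):                          # x: (batch, 1, freq, time)
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))  # skip connection
        return torch.sigmoid(d1) * x                # soft mask applied to input

out = TinyUNet()(torch.rand(2, 1, 512, 128))       # dummy |STFT| "images"
```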

  21. A practical limitation • Supervised learning requires a *paired* dataset for such a system: x = [mixture tracks (instrumental + vocal)], y = [instrumental mixtures; vocal tracks] • → not sustainable

  22. GANs and SS • An unpaired, weakly labelled dataset instead: {many instrumental tracks} (aka “real”) + {many vocal + instrumental mixture tracks} (the separator’s input; its output is aka “fake”) • We alternately show a GAN-based model {real instrumental / vocal-separated (fake) instrumental} tracks and let the model learn, simultaneously, 
 - i) to classify real vs. fake 
 - ii) to fake an instrumental track, i.e., to remove the vocal. “Adversarial Semi-Supervised Audio Source Separation Applied to Singing Voice Extraction”, D. Stoller et al., ICASSP 2018
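A toy sketch of the adversarial idea (shapes and layer sizes are assumptions, not Stoller et al.'s actual architecture): a separator removes vocals from mixture frames, while a discriminator tries to tell real instrumental frames from separated ("fake") ones; the two losses below would be minimised in alternation.

```python
# Adversarial semi-supervised separation, reduced to its two losses.
import torch
import torch.nn as nn

n_freq = 513
separator = nn.Sequential(nn.Linear(n_freq, 512), nn.ReLU(),
                          nn.Linear(512, n_freq), nn.Sigmoid())
discriminator = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                              nn.Linear(256, 1))

mix = torch.rand(8, n_freq)            # vocal + instrumental frames (unpaired)
real_inst = torch.rand(8, n_freq)      # instrumental-only frames ("real")
fake_inst = separator(mix) * mix       # separated instrumental ("fake")

bce = nn.functional.binary_cross_entropy_with_logits
d_loss = bce(discriminator(real_inst), torch.ones(8, 1)) + \
         bce(discriminator(fake_inst.detach()), torch.zeros(8, 1))
g_loss = bce(discriminator(fake_inst), torch.ones(8, 1))  # fool the critic
```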

  23. Further study • A great SS tutorial: http://ismir2010.ismir.net/proceedings/tutorial_1_Vincent-Ono.pdf

  24. Further me • keunwoochoi.wordpress.com • keunwoochoi.blogspot.com • groovo.io • spotify.com • http://c4dm.eecs.qmul.ac.uk
