GCT634@KAIST Invited lecture: Sound Source Separation 7 June 2018 Keunwoo Choi at QMUL.uk, Spotify.us, groovo.io
Sound Source Separation
• Let's isolate the "target" audio signal!
• The "cocktail party effect": as if we're simulating the human brain (as if we knew what's going on in there)
Sound Source Separation
problem = f(assumptions)
assumptions = {environments: {dry, wet, ..}, signal: {ch: {mono, stereo, ..}, content: {speech, music}}, target: {...}}

| Input | Target | Noise |
| Speech + ambience | Speech | Ambience (noise) |
| Mixture of speech | Speaker i | All speakers j != i |
| Music (vocal, drums, guitar, bass + ..) | Instrument i | All instruments j != i |
SSS Applications
• KA-RA-O-KE!
• Transcription
• DJing/mixing
• Many other MIR tasks
• We once called it a "chicken-and-egg" problem; solving SSS would make many other tasks much easier
BIG ASSUMPTION FOR A LONG WHILE
• Work with |STFT| (or CQT)
• 1 time-frequency bin, 1 instrument -- aka "W-disjoint orthogonality"
• Phase doesn't matter much
• It used to underlie (almost) all research
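To make the assumption concrete, here is a minimal sketch of |STFT|-domain separation with an ideal binary mask (the "1 bin, 1 instrument" idea), assuming librosa is available; `s1` and `s2` are random stand-ins, not real sources.

```python
# Ideal-binary-mask separation under the W-disjoint assumption (minimal sketch).
import numpy as np
import librosa

sr = 22050
s1 = np.random.randn(sr * 3)  # stand-ins for two source signals
s2 = np.random.randn(sr * 3)
mix = s1 + s2

S1, S2, MIX = (librosa.stft(x) for x in (s1, s2, mix))

# W-disjoint orthogonality: each time-frequency bin belongs to the dominant source.
mask1 = (np.abs(S1) > np.abs(S2)).astype(float)

# Apply the binary mask to the mixture and invert. The mixture phase is reused,
# i.e., "phase doesn't matter much".
est1 = librosa.istft(mask1 * MIX)
```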
Problem config 1 - mixing matrix A
• There is a mixing matrix A; we estimate its inverse: x = As (mixing), y = Wx (unmixing), with W ≈ A⁻¹
• s: sources (instruments); a_xx: amplitude mixing coefficients; x: stereo input signal; w: estimated unmixing coefficients; y: estimated sources (instruments)
ICA
• Independent Component Analysis (ICASSP '98)
• Based on statistics: independence and (non-)Gaussianity
• Not specific to audio; a general technique
• Example: http://www.kecl.ntt.co.jp/icl/signal/sawada/demo/bss2to4/index.html
Further study: https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf
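A hedged sketch of ICA-style unmixing with scikit-learn's FastICA (the toy sources and mixing matrix are illustrative): mix two non-Gaussian sources with a matrix A, then estimate the unmixing matrix W ≈ A⁻¹.

```python
# Instantaneous mixing + FastICA unmixing (minimal sketch).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 44100))          # two non-Gaussian stand-in sources
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # "unknown" mixing matrix
x = A @ s                                  # observed stereo mixture

ica = FastICA(n_components=2, random_state=0)
y = ica.fit_transform(x.T).T               # estimated sources (up to scale/permutation)
W = ica.components_                        # estimated unmixing matrix, W ≈ A^{-1}
```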
ADRESS
• Azimuth Discrimination and Resynthesis (DAFx 2004)
• 1-dimensional clustering; for stereo sound source separation
Problem config 2 - mixing matrix and delay
• Sources are at different angles and distances → the mixing matrix A also encodes time delays
DUET
• Location = {angle, distance}
• Each location gives one 2D cluster, i.e., one instrument
• DOA (Direction of Arrival) estimation
• Something similar runs in your phone (with 2+ microphones) to suppress non-speech sounds (but perhaps not in your earphones/headphones)
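A rough sketch of the DUET idea, assuming the stereo STFTs are already computed (random placeholders below): per-bin attenuation and delay are histogrammed, and each peak in the 2D histogram corresponds to one source location.

```python
# DUET-style attenuation/delay clustering (simplified sketch).
import numpy as np

n_fft, n_frames = 1024, 200
X1 = np.random.randn(n_fft // 2 + 1, n_frames) + 1j * np.random.randn(n_fft // 2 + 1, n_frames)
X2 = np.random.randn(n_fft // 2 + 1, n_frames) + 1j * np.random.randn(n_fft // 2 + 1, n_frames)

eps = 1e-8
R = (X2 + eps) / (X1 + eps)
alpha = np.abs(R)                                    # inter-channel attenuation ratio
freqs = np.arange(1, X1.shape[0])[:, None]           # skip DC to avoid divide-by-zero
delta = -np.angle(R[1:]) / (2 * np.pi * freqs / n_fft)  # inter-channel delay (samples)

# 2D histogram over (attenuation, delay); each peak ~ one source location.
hist, a_edges, d_edges = np.histogram2d(
    alpha[1:].ravel(), delta.ravel(), bins=50,
    range=[[0, 3], [-5, 5]],
)
peak = np.unravel_index(hist.argmax(), hist.shape)   # strongest source's cluster
```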
Problem config 3 - music - spectra of instruments http://www.physics.usyd.edu.au/teach_res/hsp/sp/mod31/m31_strings.htm
NMF
• Assumptions when using NMF for SSS:
• The spectral shapes of musical instruments are known
• NMF then separates each note (one note ≈ one basis)!
• Many applications to drum separation (it works)
https://www.slideshare.net/DaichiKitamura/robust-music-signal-separation-based-on-supervised-nonnegative-matrix-factorization-with-prevention-of-basis-sharing
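A minimal sketch of NMF-based separation with scikit-learn; the basis-to-instrument grouping (`drum_bases`) is exactly the assumed prior knowledge, hard-coded here for illustration.

```python
# NMF on a magnitude spectrogram, then reconstruct one instrument (minimal sketch).
import numpy as np
from sklearn.decomposition import NMF

V = np.abs(np.random.randn(513, 400))       # stand-in for |STFT| of the mixture

nmf = NMF(n_components=8, init="random", max_iter=300, random_state=0)
W = nmf.fit_transform(V)                     # (freq, bases): spectral shapes
H = nmf.components_                          # (bases, time): activations

drum_bases = [0, 1]                          # hypothetical: bases matched to drum templates
V_drums = W[:, drum_bases] @ H[drum_bases]   # reconstruct only the drum part
mask = V_drums / (W @ H + 1e-8)              # soft (Wiener-like) mask for resynthesis
```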
Problem config 4 - music - repeats
• "Instrumental parts repeat!" (↔ vocals)
• "Drums/beats repeat!" (↔ harmonic instruments)
• A valid assumption for modern popular music
• E.g., REPET (IEEE TASLP 2013), KAM (D. Fano Yela, ICASSP 2017), ...
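A compressed sketch of the REPET idea, assuming the repeating period p (in frames) has already been estimated (e.g., from a beat spectrum): the median across repetitions models the repeating accompaniment.

```python
# REPET-style repeating-pattern mask (compressed sketch; p is assumed known).
import numpy as np

V = np.abs(np.random.randn(513, 600))        # stand-in magnitude spectrogram
p = 100                                       # assumed repeating period in frames
n = V.shape[1] // p

segments = V[:, :n * p].reshape(513, n, p)    # stack the n repetitions
repeating = np.median(segments, axis=1)       # median over repetitions = repeating part
rep_full = np.tile(repeating, (1, n))         # back to full length

# The repeating model can't exceed the mixture; the mask keeps the repeating part.
mask = np.minimum(rep_full, V[:, :n * p]) / (V[:, :n * p] + 1e-8)
```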
Problem config 5 - music - some musical cases
• "Central" (~= vocal) source separation
• Because main vocals are almost always panned to the centre (and we all love karaoke)
• Harmonic/percussive source separation (HPSS)
• Because they behave (almost) completely differently along the spectral/temporal axes
• Median filtering for drum separation (D. Fitzgerald, DAFx 2010)
"Gaussian mixture model for singing voice separation from stereophonic music", M. Kim et al., 2011
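A minimal sketch of the median-filtering HPSS recipe: harmonic content is smooth along time, percussive content along frequency; the filter lengths here are illustrative.

```python
# Median-filtering HPSS (minimal sketch).
import numpy as np
from scipy.ndimage import median_filter

V = np.abs(np.random.randn(513, 400))         # stand-in magnitude spectrogram

H = median_filter(V, size=(1, 17))             # median across time -> harmonic estimate
P = median_filter(V, size=(17, 1))             # median across frequency -> percussive

mask_h = H / (H + P + 1e-8)                    # soft masks from the two estimates
mask_p = P / (H + P + 1e-8)
# (librosa.decompose.hpss wraps essentially this recipe.)
```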
Problem config 6 - music - ‘informed’ source separation • Exploiting the score as side information “Score-Informed Source Separation for Musical Audio Recordings”, S Ewert et al., 2013
History so far...
• As time goes by: less generality, stronger assumptions
DEEP! LEARNING!!
DL and SSS
• Fewer assumptions (keep this in mind for the following papers!)
• Data-related: trained models do NOT extrapolate; e.g., a model trained on speech probably wouldn't work on music
• Model-related: e.g., frame-based? context-free? Does it estimate the phase? Stereo input?
Frame-based DL-SS
• Because vocals are distinguishable within a frame (or a few frames)
"Deep Learning for Monaural Speech Separation", Po-Sen Huang et al., 2014
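A toy PyTorch sketch of a frame-based separator in the spirit of this approach (sizes and names are illustrative, not Huang et al.'s exact architecture): one mixture frame in, one soft mask out.

```python
# Frame-wise soft-mask prediction with an MLP (toy sketch).
import torch
import torch.nn as nn

n_freq = 513

model = nn.Sequential(
    nn.Linear(n_freq, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_freq), nn.Sigmoid(),    # per-bin soft mask in [0, 1]
)

mix = torch.rand(32, n_freq)                   # a batch of mixture magnitude frames
target = torch.rand(32, n_freq)                # paired clean-source frames

mask = model(mix)
loss = nn.functional.mse_loss(mask * mix, target)  # reconstruct the target via masking
loss.backward()
```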
U-Net and SS
• Because vocals are distinguishable in the |STFT| "image"
"U-Net: Convolutional Networks for Biomedical Image Segmentation", O. Ronneberger et al., 2015
"Singing Voice Separation with Deep U-Net Convolutional Networks", A. Jansson et al., ISMIR 2017
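A much-reduced PyTorch sketch of the U-Net idea for spectrogram masking: strided-conv encoder, transposed-conv decoder, and a skip connection. The depth and channel counts are illustrative, not the configuration of Jansson et al.

```python
# Tiny U-Net-style masker for |STFT| patches (illustrative sketch).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)  # 32 = 16 + 16 (skip)

    def forward(self, x):                       # x: (batch, 1, freq, time)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2)
        d1 = self.dec1(torch.cat([d2, e1], dim=1))  # skip connection
        return torch.sigmoid(d1) * x            # predicted mask applied to the input

net = TinyUNet()
spec = torch.rand(4, 1, 512, 128)               # |STFT| patches (power-of-two sizes help)
vocal_estimate = net(spec)
```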
A practical limitation
• Supervised learning requires a *paired* dataset for such a system
• x: [mixtures], y: [instrumental mixtures; vocal tracks] (i.e., paired Inst 1 + Vocal 1, Inst 2 + Vocal 2, Inst 3 + Vocal 3, ...)
• → not sustainable
GANs and SS
• With an *unpaired*, weakly labelled dataset: {many instrumental tracks} (aka Real) and {many vocal + instrumental mixtures} (the separator's input; its output is aka Fake)
• We alternately show a GAN-based model {real instrumental tracks / vocal-removed ("fake") instrumental estimates} and let the model learn simultaneously i) to classify real vs. fake and ii) to fake an instrumental track, i.e., to remove the vocal
"Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction", D. Stoller et al., ICASSP 2018
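A schematic PyTorch sketch of this adversarial setup (the tiny networks and all hyperparameters are placeholders, not Stoller et al.'s recipe): the separator produces "fake" instrumentals from mixtures, the discriminator sees them against real instrumental tracks, and the two are updated alternately.

```python
# Alternating GAN updates for vocal removal (schematic sketch).
import torch
import torch.nn as nn

separator = nn.Sequential(nn.Linear(513, 513), nn.Sigmoid())   # placeholder separator (mask)
disc = nn.Sequential(nn.Linear(513, 1))                        # placeholder discriminator
opt_s = torch.optim.Adam(separator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

mixtures = torch.rand(32, 513)          # unpaired: vocal + instrumental mixtures
real_inst = torch.rand(32, 513)         # unpaired: instrumental-only tracks

fake_inst = separator(mixtures) * mixtures   # separator's instrumental estimate

# i) discriminator learns to classify real vs. separator-made ("fake") instrumentals
d_loss = bce(disc(real_inst), torch.ones(32, 1)) + \
         bce(disc(fake_inst.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# ii) separator learns to fool the discriminator, i.e., to remove the vocal
g_loss = bce(disc(fake_inst), torch.ones(32, 1))
opt_s.zero_grad()
g_loss.backward()
opt_s.step()
```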
Further study
• A great SS tutorial: http://ismir2010.ismir.net/proceedings/tutorial_1_Vincent-Ono.pdf
Further me • keunwoochoi.wordpress.com • keunwoochoi.blogspot.com • groovo.io • spotify.com • http://c4dm.eecs.qmul.ac.uk