Nonnegative Tensor Factorization for Source Separation of Loops in Audio
Jordan B. L. Smith, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Introduction
Extracting loops from music
• In some musical styles, songs are built from loops. E.g.:
[Figure: the composition process. 1. Collection of loops (A: Drum, B: Melody, C: Bass, D: FX) → 2. Loops arranged to make a song (layout map over 0:00 to 1:00) → 3. Song mixed down to audio]
Audio examples (and test data) all borrowed from [López-Serrano et al. 2016]
• Goal: run the composition process in reverse (a decomposition procedure) and decompose the audio signal to recover:
  • the layout of the song
  • the source-separated loops
Extracting loops from music
• Two previous approaches that inspired us:
  • Fingerprint-based loop detection [López-Serrano et al. 2016]
    Inputs: original loops + mixed audio. Output: map of loop activations.
  • Iterative NMF [Seetharaman & Pardo 2016]
    Inputs: mixed audio + assumption that loops are introduced additively. Output: separated tracks, one per loop.
Extracting loops from music
• Our proposed system:
  Input: mixed audio. Outputs: separated tracks, one per loop + map of loop activations.
• We attempt to solve both problems in one step, without assuming an additive layout
• We do so by extending nonnegative matrix factorization (NMF) to handle periodicity
Source separation using NMF
• NMF can handle many types of repetition:
  • Steady-state notes: NMF with harmonic templates
  • Note sequences repeated in time: NMFD with time-evolving templates [Smaragdis 2004]
  • Transposed notes: NMF2D with transposed harmonic templates [e.g., FitzGerald, Cranitch & Coyle 2008]
  • Periodicity (especially at downbeats): ...no nonnegative approach!
    NB: REPET, a median-filtering approach [Rafii, Liutkus, & Pardo 2014]
Method
Nonnegative tensor factorization
• Step 1: estimate downbeats [madmom, Böck et al. 2016]
• Step 2: cut the spectrogram at the downbeats and stack the per-bar 2D spectrograms into a 3D volume (a “spectral cube”)
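Step 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `build_spectral_cube` is hypothetical, the downbeats are assumed to be already converted to spectrogram frame indices, and the tempo is assumed to be roughly constant so that every bar can be trimmed to the length of the shortest one.

```python
import numpy as np

def build_spectral_cube(spectrogram, downbeat_frames):
    """Cut a (frequency x time) magnitude spectrogram at the downbeats.

    Returns an array of shape (frequency, time_in_bar, n_bars).
    """
    spectrogram = np.asarray(spectrogram, dtype=float)
    downbeat_frames = np.asarray(downbeat_frames, dtype=int)
    bar_lengths = np.diff(downbeat_frames)
    if len(bar_lengths) == 0 or np.any(bar_lengths <= 0):
        raise ValueError("need at least two strictly increasing downbeat frames")
    # Assumption: roughly constant tempo, so trimming each bar to the
    # shortest bar discards at most a few frames per bar.
    frames_per_bar = int(bar_lengths.min())
    bars = [spectrogram[:, start:start + frames_per_bar]
            for start in downbeat_frames[:-1]]
    # Stack the bars along a new third axis: bar number (time in piece).
    return np.stack(bars, axis=2)
```

For a toy spectrogram with 5 frequency bins and 40 frames, downbeats at frames 0, 10, 20, 30, and 40 give a cube of shape (5, 10, 4), and slice `cube[:, :, 1]` is the second bar.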
Detour: understanding the spectral cube
[Figure: the spectral cube, with axes: time in bar, frequency, and bar number (time in piece)]
Visualizing a 3D volume: CT scan
[Figure: a CT scan viewed as slices along three directions: back to front, left to right, bottom to top]
[Figure: the spectral cube viewed the same way, as slices along three directions: beginning to end of piece (bar number), beginning to end of a bar (time in bar), low frequency to high (frequency)]
Nonnegative tensor factorization
• Step 1: estimate downbeats
• Step 2: stack the 2D spectrograms into a 3D volume (a “spectral cube”)
• Step 3: use nonnegative tensor factorization (NTF) to model the spectral cube
Nonnegative matrix factorization
• NMF: X ≈ W ◦ H, where X is M × N, W is M × r, and H is r × N
  • W = note templates
  • H = activation functions
• Needs post-processing to separate sources:
  • which templates in W belong to the same source?
  • different sources could use the same harmonic components!
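For readers who have not seen NMF in code, here is a minimal sketch. The slide does not say which cost function or update rule is used, so this assumes the common multiplicative updates for the Euclidean cost, and it reads the slide's “◦” as the ordinary matrix product (consistent with the M × r and r × N shapes). The function name `nmf` and its parameters are illustrative.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative (M x N) matrix X as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, rank)) + eps   # note templates, M x r
    H = rng.random((rank, N)) + eps   # activation functions, r x N
    for _ in range(n_iter):
        # Multiplicative updates (Euclidean cost, an assumption here):
        # ratios of nonnegative terms keep W and H nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Nothing in the result says which columns of W belong to the same source, which is the post-processing problem the slide points out.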
Nonnegative tensor factorization
• Tucker decomposition: X ≈ C ◦ (W ◦ H ◦ D)
  • W = note templates
  • H = activation functions (time-in-bar)
  • D = loop activation functions (time-in-piece)
  • C = core tensor = recipe for each loop type
[Figure: diagram of the Tucker decomposition]
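The Tucker model can be written out with `np.einsum`. The sketch below makes several assumptions that the slides do not state: W is (M × r_w), H is (r_h × N), D is (r_d × number of bars), C is (r_w × r_h × r_d), and the fit uses multiplicative updates for the Euclidean cost. The names `tucker_reconstruct` and `ntf_tucker` are hypothetical.

```python
import numpy as np

def tucker_reconstruct(C, W, H, D):
    """X_hat[m, n, b] = sum over i, j, k of C[i,j,k] * W[m,i] * H[j,n] * D[k,b]."""
    return np.einsum("ijk,mi,jn,kb->mnb", C, W, H, D)

def _ratio(spec, X, X_hat, *factors, eps):
    # Multiplicative-update ratio: the same contraction applied to the
    # data (numerator) and to the current model (denominator).
    return np.einsum(spec, X, *factors) / (np.einsum(spec, X_hat, *factors) + eps)

def ntf_tucker(X, ranks, n_iter=100, eps=1e-9, seed=0):
    """Fit a nonnegative Tucker model to a (freq x time_in_bar x bars) cube."""
    rng = np.random.default_rng(seed)
    M, N, B = X.shape
    r_w, r_h, r_d = ranks
    W = rng.random((M, r_w)) + eps          # note templates
    H = rng.random((r_h, N)) + eps          # time-in-bar activations
    D = rng.random((r_d, B)) + eps          # time-in-piece (loop) activations
    C = rng.random((r_w, r_h, r_d)) + eps   # core tensor: loop recipes
    for _ in range(n_iter):
        # Update one factor at a time, refreshing the model in between.
        W *= _ratio("mnb,ijk,jn,kb->mi", X, tucker_reconstruct(C, W, H, D), C, H, D, eps=eps)
        H *= _ratio("mnb,ijk,mi,kb->jn", X, tucker_reconstruct(C, W, H, D), C, W, D, eps=eps)
        D *= _ratio("mnb,ijk,mi,jn->kb", X, tucker_reconstruct(C, W, H, D), C, W, H, eps=eps)
        C *= _ratio("mnb,mi,jn,kb->ijk", X, tucker_reconstruct(C, W, H, D), W, H, D, eps=eps)
    return C, W, H, D
```

The return order matches the argument order of `tucker_reconstruct`, so `tucker_reconstruct(*ntf_tucker(X, ranks))` gives the modelled cube.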
Interpreting the NTF model
• W, H, and D are all musically intuitive
  • e.g., the loop template activations (D) directly estimate the layout of the song
• Core tensor C = recipe for each loop type
• Pixel C(i, j, k) tells us to play note w_i with activation function h_j whenever loop d_k appears.
[Figure: loop recipes (C); an example recipe is (w_4, h_7) + (w_11, h_10) + (w_24, h_16)]
• To recover entire spectrogram: C ◦ (W ◦ H ◦ D)
• To recover individual loop source: C[:,:,k] ◦ (W ◦ H ◦ D[k,:])
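The per-loop reconstruction can be sketched with the same `einsum` convention. As an assumption not stated on the slides, W is (M × r_w), H is (r_h × N), D is (r_d × bars), and C is (r_w × r_h × r_d); the function name `loop_source` is illustrative.

```python
import numpy as np

def loop_source(C, W, H, D, k):
    """Reconstruct only the part of the spectral cube explained by loop type k."""
    # C[:, :, k] is the recipe for loop k (which note/activation pairs to play);
    # D[k, :] says in which bars that loop is active.
    return np.einsum("ij,mi,jn,b->mnb", C[:, :, k], W, H, D[k, :])
```

Because the model is linear in the third mode, the per-loop cubes sum to the full reconstruction C ◦ (W ◦ H ◦ D).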
Evaluation
• We used synthetic data [López-Serrano et al. 2016]
  • 7 sets of loops × 3 different layouts (arrangements)
• Algorithm output 1: separated signals
  • Evaluate quality with SDR, SIR, and SAR (signal-to-distortion, signal-to-interference, and signal-to-artifacts ratios), comparing the estimated source tracks with the stem tracks
• Algorithm output 2: loop layout
  • Evaluate accuracy with correlation between the estimated map and the ground truth map
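The layout evaluation can be illustrated with a short sketch. The slides only say “correlation”, so this assumes the Pearson correlation over the flattened (loops × bars) maps. The row-alignment step, which tries every ordering of the estimated loops, is a further assumption added because the estimated loops come out in arbitrary order. Both function names are hypothetical.

```python
import itertools
import numpy as np

def layout_correlation(estimated_map, true_map):
    """Pearson correlation between two (loops x bars) activation maps."""
    return float(np.corrcoef(np.ravel(estimated_map), np.ravel(true_map))[0, 1])

def best_row_alignment(estimated_map, true_map):
    """Try every row ordering of the estimate; return (best correlation, order)."""
    estimated_map = np.asarray(estimated_map, dtype=float)
    n_loops = len(true_map)
    scored = ((layout_correlation(estimated_map[list(order)], true_map), order)
              for order in itertools.permutations(range(n_loops)))
    # Exhaustive search is fine for a handful of loops (4 loops = 24 orderings).
    return max(scored)
```

With four loops the exhaustive search is cheap; for many loops an assignment solver would be a better fit.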
Good separation example
[Audio example for genre “Acid”: the collection of loops (Drum, Melody, Bass, FX) next to the extracted loops (1 to 4)]
• When it works, it works
Flawed separation example
[Figure: layout map of the original tracks for genre “Brezo” (loops A to D) next to the layout map of the source-separated tracks. A second version of the figure relates the two maps through steps labelled “swap rows” and “substitute”.]
[Figure: results plots for SDR, SIR, SAR, and layout correlation; labels include [Seetharaman & Pardo 2016], “(proposed)”, and “(performance ceiling)”]
• SDR: our reconstruction quality is average. :-|
• SIR: we have less crosstalk than others! :-D
• SAR: we have more noisy artifacts. :-(
• Correlation: we get very clean layouts! :-D
Conclusion
• Proposed method of decomposing audio into loops that:
  • Models periodicity using the spectral cube
  • Models source signals and song composition jointly
• Tucker decomposition is musically intuitive
• Weaknesses include:
  • Very conservative reconstructions don’t model the whole signal
  • Like NMFD, we cannot distinguish between algebraically equivalent decompositions
• Future work: searching for repetitions at multiple hierarchical time scales
Future work: hierarchical analysis
• Different loops in the song have different lengths and periods
• Spectral cubes with different periods highlight different consistent repetitions
[Figure: spectral cubes with periods of 2 beats, 1 downbeat, 2 downbeats, and 4 downbeats]
Thank you! PS. Jordan is now at: +