Music/Voice Separation using the Similarity Matrix Zafar Rafii & Bryan Pardo
Introduction
• Musical pieces are often characterized by an underlying repeating structure over which varying elements are superimposed
[Figure: Propellerheads - History Repeating, signal plotted against time (s)]
10/12/12 Zafar Rafii & Bryan Pardo
Introduction
• The REpeating Pattern Extraction Technique (REPET) was proposed to extract the repeating structure from the non-repeating structure
[Figure: Mixture → REPET → Repeating Structure + Non-repeating Structure]
REPET
[Figure: REPET pipeline. Step 1: Mixture Signal x, Mixture Spectrogram V, Beat Spectrum b, period p. Step 2: V, Median over segments 1p, 2p, ..., Repeating Segment S. Step 3: S, V, min, Repeating Spectrogram W, Time-Frequency Mask M.]
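Steps 2 and 3 of the diagram above can be sketched as follows, assuming the period p (in frames) has already been found in Step 1. This is a minimal illustration, not the authors' implementation: the function name `repet_mask`, the choice to drop a trailing partial period, and the mask form M = W / V are assumptions made for this sketch.

```python
import numpy as np

def repet_mask(V, p, eps=1e-12):
    """Sketch of REPET Steps 2-3 for a magnitude spectrogram V (n_freq x n_frames)
    and a known repeating period p in frames (hypothetical helper)."""
    n_freq, n_frames = V.shape
    r = n_frames // p                      # number of complete periods
    V = V[:, :r * p]                       # simplification: drop a trailing partial period
    # Step 2: repeating segment S = element-wise median over the r segments of length p
    S = np.median(V.reshape(n_freq, r, p), axis=1)
    # Step 3: repeat S along time and cap it by the mixture with an element-wise min
    W = np.minimum(np.tile(S, (1, r)), V)
    # Assumed soft mask: ratio of repeating spectrogram to mixture, in [0, 1]
    M = W / np.maximum(V, eps)
    return S, W, M

# Toy single-band "spectrogram": period 2, with one non-repeating burst (9.0)
V = np.array([[1.0, 2.0, 1.0, 2.0, 9.0, 2.0]])
S, W, M = repet_mask(V, p=2)
print(S)  # repeating segment
print(M)  # mask is close to 1 where the mixture matches the repeating pattern
```

The median is what makes the burst disappear from S: a value that occurs in only one of the three segments cannot win the element-wise median.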
Adaptive REPET
[Figure: Adaptive REPET pipeline. Step 1: Mixture Signal x, Mixture Spectrogram V, Beat Spectrogram B, local period p_i at frame i. Step 2: V, Median over frames i-p_i, i, i+p_i, Repeating Spectrogram U. Step 3: U, V, min, Repeating Spectrogram W, Time-Frequency Mask M.]
Limitations
• Both the original and the adaptive REPET assume periodically repeating patterns
[Figure: Mixture with periodically repeating background → Beat spectrogram → period finder]
Limitations
• Repetitions can also happen intermittently or without a global (or local) period
[Figure: Mixture with non-periodically repeating background → Beat spectrogram → period finder]
Limitations
• Instead of looking for periodicities, we can look for similarities, using a similarity matrix
[Figure: Mixture with non-periodically repeating background → Similarity matrix; color scale from +dissimilar to +similar]
Similarity Matrix
• The similarity matrix is a matrix where each bin (i1, i2) measures the (dis)similarity between elements i1 and i2 of a sequence, given a metric
[Figure: Sequence → metric → Similarity matrix, indexed by i1 and i2; color scale from +dissimilar to +similar]
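As a minimal sketch of this definition, the snippet below builds a similarity matrix for an arbitrary sequence and an arbitrary metric. The function name `similarity_matrix`, the toy letter sequence, and the match/no-match metric are hypothetical choices made only for illustration.

```python
import numpy as np

def similarity_matrix(sequence, metric):
    """Return S with S[i1, i2] = metric(sequence[i1], sequence[i2])."""
    n = len(sequence)
    S = np.zeros((n, n))
    for i1 in range(n):
        for i2 in range(n):
            S[i1, i2] = metric(sequence[i1], sequence[i2])
    return S

# Toy sequence with a repeating pattern "ab"; metric is 1.0 on a match, else 0.0
sequence = list("abcab")
S = similarity_matrix(sequence, lambda a, b: float(a == b))
print(S)
```

Off-diagonal ones in S mark where an element recurs later in the sequence, which is the kind of repetition the following slides look for in audio.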
Similarity Matrix
• In audio, the similarity matrix (SM) can help to visualize the time structure and find repeating/similar patterns
[Figure: Spectrogram (frequency (kHz) vs. time (s)) → cosine → Similarity Matrix (time (s) vs. time (s)); color scale from 0 (+dissimilar) to 1 (+similar)]
Assumptions
• Given a mixture of music + voice:
– The repeating background is dense & low-rank
– The non-repeating foreground is sparse & varied
[Figure: Mixture Spectrogram, Background Spectrogram, Foreground Spectrogram; each frequency (kHz) vs. time (s)]
Assumptions
• The SM of a mixture is then likely to reveal the structure of the repeating background
[Figure: Mixture Spectrogram and Background Spectrogram (frequency (kHz) vs. time (s)); Similarity Matrix (time (s) vs. time (s))]
REPET-SIM
• REPET with Similarity Matrix!
1. Identify the repeating/similar elements
2. Derive a repeating model
3. Extract the repeating structure
[Figure: Mixture Signal → REPET-SIM → Repeating Structure + Non-repeating Structure]
REPET-SIM
• Advantages compared with REPET:
– Can handle intermittent repeating elements
– Can handle fast-varying repeating structures
– Can handle full-track songs
[Figure: Mixture Signal → REPET-SIM → Repeating Structure + Non-repeating Structure]
Interests
• Practical Interests
– Audio post processing
– Melody extraction
– Karaoke gaming
• Intellectual Interests
– Music perception
– Music understanding
– Simply based on self-similarity!
REPET-SIM
[Figure: REPET-SIM pipeline. Step 1: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, frames j1, j2, j3 similar to frame i. Step 2: V, Median over frames j1, j2 = i, j3, Repeating Spectrogram U. Step 3: U, V, min, Repeating Spectrogram W, Time-Frequency Mask M.]
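Steps 2 and 3 of this diagram can be sketched as below, assuming the indices of the similar frames for each frame i have already been found in Step 1. This is an illustration under stated assumptions rather than the authors' code: the function name `repet_sim_mask`, the index array `J`, and the mask form M = W / V are choices made for this sketch.

```python
import numpy as np

def repet_sim_mask(V, J, eps=1e-12):
    """Sketch of REPET-SIM Steps 2-3 (hypothetical helper).
    V: (n_freq, n_frames) magnitude spectrogram.
    J: (n_frames, k) integer array; J[i] holds the frames most similar to frame i."""
    # Step 2: frame i of the repeating model U is the element-wise median
    # of the k frames listed in J[i]; V[:, J] has shape (n_freq, n_frames, k)
    U = np.median(V[:, J], axis=2)
    # Step 3: the repeating spectrogram W is capped by the mixture (element-wise min)
    W = np.minimum(U, V)
    # Assumed soft mask: ratio of repeating spectrogram to mixture, in [0, 1]
    M = W / np.maximum(V, eps)
    return U, W, M

# Toy example: 2 frequency bins, 4 frames, one non-repeating burst (5.0) in frame 2
V = np.array([[1.0, 1.0, 5.0, 1.0],
              [2.0, 2.0, 2.0, 2.0]])
J = np.array([[0, 1, 3],
              [1, 0, 3],
              [2, 0, 1],
              [3, 0, 1]])
U, W, M = repet_sim_mask(V, J)
print(M)
```

Applying M to the mixture gives an estimate of the repeating background, and 1 - M gives the non-repeating foreground. Unlike the periodic variants, nothing here requires the similar frames to be evenly spaced in time.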
1. Repeating Elements
[Figure: REPET-SIM pipeline. Step 1: Mixture Signal x, Mixture Spectrogram V, Similarity Matrix S, frames j1, j2, j3 similar to frame i. Step 2: V, Median over frames j1, j2 = i, j3, Repeating Spectrogram U. Step 3: U, V, min, Repeating Spectrogram W, Time-Frequency Mask M.]
1. Repeating Elements
• We take the cosine similarity between every pair of columns (time frames) of the mixture spectrogram and get a similarity matrix
[Figure: Mixture Spectrogram (frequency (kHz) vs. time (s)), columns i1 and i2 → cosine → Similarity Matrix (time (s) vs. time (s)), bin (i1, i2)]
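A minimal sketch of this step: normalize each column of the spectrogram to unit length, then take all pairwise dot products. The function name `cosine_similarity_matrix`, the `eps` guard for silent frames, and the toy spectrogram are assumptions introduced for illustration.

```python
import numpy as np

def cosine_similarity_matrix(V, eps=1e-12):
    """Cosine similarity between every pair of columns (time frames) of V."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    Vn = V / np.maximum(norms, eps)   # unit-norm columns; eps avoids division by zero
    return Vn.T @ Vn                  # S[i1, i2] = cosine between frames i1 and i2

# Toy magnitude "spectrogram": 3 frequency bins x 4 frames
V = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [0.0, 0.0, 0.0, 0.0]])
S = cosine_similarity_matrix(V)
print(S)
```

Because the cosine ignores overall magnitude, frames 0 and 2 come out as fully similar even though one is twice as loud as the other. For a non-negative spectrogram, every entry of S lies between 0 and 1.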
1. Repeating Elements
• For every frame i, the SM reveals the frames j_k that are the most similar to frame i
[Figure: Mixture Spectrogram (frequency (kHz) vs. time (s)) → cosine → Similarity Matrix (time (s) vs. time (s)); row i of the SM marks frames j1, j2, j3, which are then located in the mixture spectrogram]
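One simple way to read the frames j_k off the similarity matrix is to keep, for each row i, the k columns with the highest similarity. The sketch below does only that; the function name `most_similar_frames` and the fixed k are assumptions, and any further selection rules the method may apply (for example similarity thresholds or spacing constraints between selected frames) are not modeled here.

```python
import numpy as np

def most_similar_frames(S, k):
    """For each frame i, return the indices of the k frames with the highest S[i, j],
    ordered from most to least similar. Frame i itself is normally among them,
    since a frame is maximally similar to itself."""
    order = np.argsort(S, axis=1)[:, ::-1]   # each row sorted by descending similarity
    return order[:, :k]

# Toy 3-frame similarity matrix (symmetric, ones on the diagonal)
S = np.array([[1.0, 0.2, 0.9],
              [0.2, 1.0, 0.1],
              [0.9, 0.1, 1.0]])
J = most_similar_frames(S, k=2)
print(J)
```

Each row J[i] can then serve as the set of frames over which a repeating model for frame i is computed.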