GCT634: Musical Applications of Machine Learning Polyphonic Music Transcription Non-negative Matrix Factorization Graduate School of Culture Technology, KAIST Juhan Nam
Outline • Introduction • Score-Audio Alignment • Multi-Pitch Estimation • Non-negative Matrix Factorization (NMF)
Polyphonic Music Transcription • Converting an acoustic musical signal into some form of music notation - MIDI piano roll, staff notation - Note information: pitch, onset, offset, loudness [Figure: Input → Model → Output]
Related Tasks • Multi-pitch estimation - Single source: piano, guitar - Multiple sources: quartet (woodwind, string) • Predominant F0 estimation - Melody extraction, singing melody • Drum transcription - Kick, snare, hi-hat • Let’s listen to a piece and try to transcribe (hum) the
Two Directions • Performance transcription - Detecting exact timing and dynamics of notes (micro-timing with 10ms resolution or so) - Frame-level: onset, offset, intensity - Piano-roll notation is usually used (performance score) • Score transcription - Transform performance into staff notation - Note-level: tempo, beat, downbeat - Rhythmic transcription (tempo, beat, downbeat) → Temporal quantization - Expression detection (pedal, articulation), often phrase-level - Instrument identification - Very challenging
Score and Performance MIDI (score) Valentina Lisitsa Vladimir Horowitz
Where Are The Differences? • Tempo - Note-level, (note onset/offset timings), phrase-level, song-level • Dynamics - Note-level, (note velocity), phrase-level, song-level • Different interpretation of musical expressions in score - Temporal: ritardando, rubato - Dynamics: piano, forte, crescendo, … - Play techniques or articulation: legato, staccato - Mood and emotion: dolce, grazioso
Score-to-Audio Alignment • Temporal alignment between score and audio from a piece of music - Audio-to-audio and MIDI-to-MIDI (either one is performance) are possible • Why do we synchronize them? - Automatic page turning - Performance analysis - Score following - Auto-accompaniment [Müller]
Algorithm Overview • Choose feature representations to compare - Often, MIDI is converted to audio so that both are aligned in the same feature space • Compute a similarity matrix between the two feature sequences - All possible combinations of local feature pairs • Find the path that gives the best alignment on the similarity matrix - Dynamic Time Warping (DTW) [Figure: Feature Seq. #1 and Feature Seq. #2 → Similarity Matrix (compute the local similarity) → Dynamic Programming (find the best path)]
Feature Representations • Audio feature representations - A frequent choice for piano music is chroma [Figure panels: MIDI, Lisitsa] CENS: Normalized Chroma Features (Müller, 2005)
Similarity Matrix • Similarity between every pair of frame-level features - Euclidean or cosine distance
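The pairwise comparison above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming each sequence is stored as a frames x dimensions array; the function name `cosine_distance_matrix` and the three-dimensional toy "chroma" vectors are hypothetical and not taken from the slides.

```python
import numpy as np

def cosine_distance_matrix(A, B, eps=1e-12):
    """Pairwise cosine distance between frames (rows) of A (N x d) and B (M x d)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Normalize every frame to unit length; eps guards against silent (all-zero) frames.
    A_unit = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B_unit = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    # Cosine distance = 1 - cosine similarity, for every (n, m) frame pair.
    return 1.0 - A_unit @ B_unit.T

# Hypothetical toy features: 3 "score" frames and 4 "performance" frames.
score_feats = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
perf_feats = np.array([[2, 0, 0], [2, 0, 0], [0, 3, 0], [0, 0, 1]], dtype=float)
C = cosine_distance_matrix(score_feats, perf_feats)  # shape (3, 4)
```

Because the frames are normalized, the result does not depend on the overall loudness of either version, which is one reason cosine distance is a common choice for chroma features.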
Finding the Optimal Path • There are many possible paths from one corner to the other [Figure: similarity matrix; horizontal axis: Schumann − Traumerei − Lisitsa, vertical axis: Schumann − Traumerei − MIDI]
3D Surface Plot of Similarity Matrix • Finding the optimal path is analogous to finding the hiking trail that takes the least effort.
Dynamic Time Warping • Finding an (N, M)-warping path of length L - P = (p_1, p_2, …, p_L) where p_i = (n_i, m_i) • Three conditions - Boundary condition: p_1 = (1,1), p_L = (N,M) - Monotonicity condition: n_1 <= n_2 <= … <= n_L and m_1 <= m_2 <= … <= m_L - Step size condition: move only upward, rightward, or diagonally (upper-right) [Müller]
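The three conditions can be checked mechanically. The sketch below is only an illustration: `is_warping_path` is a hypothetical helper, and it assumes the path is a list of 1-based (n, m) tuples as in the slide. Note that the step size condition already implies monotonicity, so a single check on consecutive steps covers both.

```python
def is_warping_path(path, N, M):
    """Return True if `path` satisfies the boundary, monotonicity, and step size conditions."""
    # Boundary condition: start at (1, 1) and end at (N, M).
    if not path or path[0] != (1, 1) or path[-1] != (N, M):
        return False
    # Step size condition: only up, right, or diagonal moves of one cell.
    # Any such step is non-decreasing in both indices, so monotonicity follows.
    allowed_steps = {(1, 0), (0, 1), (1, 1)}
    for (n0, m0), (n1, m1) in zip(path, path[1:]):
        if (n1 - n0, m1 - m0) not in allowed_steps:
            return False
    return True
```

For example, `[(1, 1), (2, 2), (2, 3), (3, 4)]` is a valid (3, 4)-warping path, while `[(1, 1), (3, 4)]` is rejected because it skips cells.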
Dynamic Time Warping : Bad Examples [Müller]
Dynamic Programming for DTW • Algorithm
- Initialization: D(n,1) = sum(C(1:n,1)) for n = 1…N; D(1,m) = sum(C(1,1:m)) for m = 1…M
- Recurrence relation: for each m = 2…M and each n = 2…N, D(n,m) = C(n,m) + min{ D(n-1,m), D(n,m-1), D(n-1,m-1) }
- Termination: D(N,M) is the total alignment cost (the DTW distance)
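The initialization, recurrence, and termination steps translate almost line by line into NumPy. The following is a minimal sketch rather than an optimized implementation: the function name `dtw`, the 0-based indexing, and the toy cost matrix are assumptions for illustration, and the backtracking step (not spelled out on the slide) simply follows the cheapest predecessor from the end back to the start.

```python
import numpy as np

def dtw(C):
    """Accumulated cost D and optimal warping path (0-based indices) for a cost matrix C."""
    C = np.asarray(C, dtype=float)
    N, M = C.shape
    D = np.zeros((N, M))
    # Initialization: cumulative sums along the first column and the first row.
    D[:, 0] = np.cumsum(C[:, 0])
    D[0, :] = np.cumsum(C[0, :])
    # Recurrence: local cost plus the cheapest of the three allowed predecessors.
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    # Backtracking from (N-1, M-1) to (0, 0).
    n, m = N - 1, M - 1
    path = [(n, m)]
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            candidates = [(D[n - 1, m - 1], n - 1, m - 1),
                          (D[n - 1, m], n - 1, m),
                          (D[n, m - 1], n, m - 1)]
            _, n, m = min(candidates, key=lambda c: c[0])
        path.append((n, m))
    path.reverse()
    return D, path

# Hypothetical toy cost matrix (3 x 4) with one zero-cost alignment.
C_toy = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 1],
                  [1, 1, 1, 0]])
D_toy, path_toy = dtw(C_toy)
```

`D_toy[-1, -1]` is the termination value D(N,M), and `path_toy` lists the aligned frame pairs. Both time and memory grow with N x M, which is the cost that motivates windowed or online variants for long recordings.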
Dynamic Programming for DTW • Toy Example Similarity Matrix ( C ) Accumulated cost ( D ) [Müller]
Score and Audio Alignment by DTW D(i,j) C(i,j)
Limitations • The optimal path is obtained only after we arrive at the destination (by backtracking) - In other words, DTW works offline - What if the sequences are very long? - Online version of DTW? • Every frame is treated as equally important - In general, humans are more sensitive to note onsets - Perceptually, not every frame is equally important
Online DTW • Set a moving search window and calculate the cost only within the window - Time and space cost: quadratic → linear • The movement is determined by the position that gives the minimum cost within the current window. If the position is ... - Corner: move both up and right (alternately) - Upper edge: move up - Right edge: move right [Figure 2 of Dixon, 2005: an example of the on-line time warping algorithm with search window c = 4, showing the order of evaluation for a particular sequence of row and column increments. The axes represent the variables t and j (see Figure 1 of that paper), respectively. All calculated cells are framed in bold, and the optimal path is coloured grey.] [Dixon, 2005]
Automatic Page Turner (JKU, Austria)
Onset-sensitive Alignment • We are sensitive to the time alignment on note onsets. - The similarity matrix gives no additional weight to onsets • DLNCO Features - Decaying Locally-adapted Normalized Chroma Onset - Capture only onset strength on chroma features - Normalize onset energy and note length (by an artificially created note tail) [Ewert, 2009]
Demo: PerformScore • https://jdasam.github.io/PerformScore/
Multi-pitch Estimation • Two types of polyphonic settings - Polyphonic instruments: piano, guitar - Ensemble of monophonic instruments: woodwind quintet, string quartet, chorale • Three levels of subtasks - First-level: frame-wise estimation of pitches and polyphony (number of notes) - Second-level: tracking pitch within a note based on temporal continuity - Third-level: tracking notes for each sound source, usually for ensembles of monophonic instruments
Challenges • Many sources are mixed and played simultaneously - They are likely to be harmonically related in music - Some sources can be masked by others - Content changes continuously by musical expressions (e.g. vibrato) • Compromises - Transcribe as many source sounds as possible - Only dominant sources: melody, bass, drum
Frame-wise Multi-pitch Estimation • Three categories of approaches - Iterative F0 search: repeatedly finds the predominant F0 and removes its related sources - Joint source estimation: examines possible combinations of multiple sources, e.g., NMF - Classification-based approach: uses no prior knowledge of musical acoustics and relies only on supervised learning
Iterative F0 estimation • Based on repeated cancellation of the harmonic overtones of detected F0s (Klapuri, 2003) • Procedure: set the residual to the original mixture, then
1. Detect the predominant F0, based on the harmonic sieve method
2. Smooth the spectrum of the harmonics of the detected F0
3. Cancel the smoothed harmonics from the residual: Y_R(k) ← max(Y_R(k) − d·Y_D(k), 0), where Y_R is the residual spectrum, Y_D the smoothed spectrum of the detected sound, and d a scaling factor for the subtraction
4. Repeat steps 1-3 until the residual is sufficiently flat
[Figure: loop between F0 detection on Y_R(k) and cancelling the detected sound from the mixture]
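The detect, smooth, and cancel loop can be sketched on a synthetic magnitude spectrum. This is a heavily simplified illustration under stated assumptions, not Klapuri's actual method: a plain harmonic sum stands in for the harmonic sieve, a 3-point moving average capped by the observed amplitude stands in for spectral smoothing, F0 candidates are integer bin indices, and the loop stops when the best salience drops below a fraction of the first one. The function name `iterative_f0` and all parameter values are hypothetical.

```python
import numpy as np

def iterative_f0(spectrum, f0_bins, n_harmonics=8, d=1.0, stop_ratio=0.2, max_sources=6):
    """Toy iterative F0 estimation on a magnitude spectrum indexed by frequency bin."""
    residual = np.asarray(spectrum, dtype=float).copy()  # residual starts as the mixture
    harmonics = np.arange(1, n_harmonics + 1)
    detected = []
    first_salience = None
    while len(detected) < max_sources:
        # Step 1: predominant F0 = candidate with the largest harmonic sum (assumption).
        salience = np.array([residual[f0 * harmonics].sum() for f0 in f0_bins])
        best = int(np.argmax(salience))
        if first_salience is None:
            first_salience = salience[best]
        # Stop when the residual no longer contains a clear harmonic series.
        if first_salience <= 0 or salience[best] < stop_ratio * first_salience:
            break
        f0 = int(f0_bins[best])
        bins = f0 * harmonics
        # Step 2: smooth the harmonic amplitudes, never exceeding the observed values,
        # so that energy shared with another source is only partly removed.
        amps = residual[bins]
        padded = np.pad(amps, 1, mode="edge")
        moving_avg = np.convolve(padded, np.ones(3) / 3.0, mode="valid")
        smoothed = np.minimum(amps, moving_avg)
        # Step 3: Y_R(k) <- max(Y_R(k) - d * Y_D(k), 0)
        residual[bins] = np.maximum(residual[bins] - d * smoothed, 0.0)
        detected.append(f0)
    return detected, residual

# Synthetic mixture: two harmonic sources (F0 bins 20 and 25) with 1/h amplitude decay.
mixture = np.zeros(512)
for f0_bin, gain in [(20, 1.0), (25, 0.8)]:
    for h in range(1, 9):
        mixture[f0_bin * h] += gain / h
detected, residual = iterative_f0(mixture, f0_bins=np.arange(15, 61))
```

The two sources share bin 100 (the 5th harmonic of one and the 4th of the other). Smoothing keeps the first cancellation from removing all of the energy there, so part of it remains for the second source.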
Iterative F0 estimation [Figures: spectral smoothness; iterative estimation] (ECE 477 - Computer Audition, Zhiyao Duan, 2014)
Iterative F0 estimation • Advantages - Deterministic: uses only signal processing, with no data-driven training - Can handle inharmonicity (e.g. piano) and vibrato • Limitations - F0 estimation becomes unreliable as the number of iterations increases - Spectral smoothing is not accurate enough
Joint Source Estimation • Based on a model of the sound mixture - All sources compete with each other to explain the mixture, and the subset that is most likely is selected - The number of sources is limited - Non-negative matrix factorization (NMF) has been the most widely explored
Joint Source Estimation • How many spectral templates can explain the source?
Joint Source Estimation • We can explain the spectrogram with three spectral basis vectors (𝑋) and corresponding activations (𝐼) • Can we decompose 𝑊 into 𝑋 and 𝐼 automatically? 𝑊 ≈ 𝑋𝐼 [Figure: spectrogram 𝑊 with basis 𝑋 and activations 𝐼]
Non-negative Matrix Factorization (NMF) • A matrix factorization algorithm in which all elements are non-negative - 𝑊 (𝑁 x 𝑂 matrix): original data (e.g. spectrogram) - 𝑋 (𝑁 x 𝐿 matrix): 𝐿 basis vectors (e.g. dictionary) - 𝐼 (𝐿 x 𝑂 matrix): activation matrix (e.g. weights or gains) • Note that this provides a compressed representation - A low-rank approximation: 𝑊 ≈ 𝑋𝐼
Algorithm for NMF • 𝑊 is known, and 𝑋 and 𝐼 are unknown. How do we estimate them? • Alternate the estimation (similar to the EM algorithm) - Start with a random 𝑋 - Estimate 𝐼 given 𝑋 - Estimate a new 𝑋 given 𝐼 - Repeat until convergence
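The alternating scheme can be sketched with multiplicative update rules for the Euclidean cost. The slide does not say which update rule is used, so this choice is an assumption; the function name `nmf`, the fixed iteration count in place of a convergence test, the random initialization of both factors, and the toy data are likewise hypothetical. Variable names follow the slide's symbols: `W_data` for 𝑊, `X_basis` for 𝑋, `I_act` for 𝐼.

```python
import numpy as np

def nmf(W_data, n_basis, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix W_data (N x O) into X_basis (N x L) @ I_act (L x O)."""
    rng = np.random.default_rng(seed)
    W_data = np.asarray(W_data, dtype=float)
    N, O = W_data.shape
    # Random non-negative initialization; multiplicative updates keep entries non-negative.
    X_basis = rng.random((N, n_basis)) + eps
    I_act = rng.random((n_basis, O)) + eps
    for _ in range(n_iter):
        # Estimate the activations given the current basis.
        I_act *= (X_basis.T @ W_data) / (X_basis.T @ X_basis @ I_act + eps)
        # Estimate a new basis given the updated activations.
        X_basis *= (W_data @ I_act.T) / (X_basis @ I_act @ I_act.T + eps)
    return X_basis, I_act

def relative_error(W_data, X_basis, I_act):
    """Frobenius norm of the reconstruction error, relative to the norm of the data."""
    return np.linalg.norm(W_data - X_basis @ I_act) / np.linalg.norm(W_data)

# Hypothetical toy "spectrogram": two spectral templates active at different times.
true_basis = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
true_act = np.array([[1, 1, 0, 0, 1], [0, 1, 1, 1, 0]], dtype=float)
W_toy = true_basis @ true_act
X_est, I_est = nmf(W_toy, n_basis=2, n_iter=500)
err = relative_error(W_toy, X_est, I_est)
```

The factorization is not unique: the recovered basis vectors may be permuted or rescaled relative to `true_basis`, with the activations compensating, so in practice the templates are usually normalized before they are interpreted as note spectra.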