GCT634: Musical Applications of Machine Learning Audio Representations Graduate School of Culture Technology, KAIST Juhan Nam
Outlines • Time-domain Representation - Sampling - Quantization • Time-Frequency Representations - Short-time Fourier Transform - Spectrogram - Mel-Spectrogram - Constant-Q transforms - Auditory Filterbank
Music Representations • Audio … 0 1 1 0 1 1 0 … … 0 0 1 1 0 1 1 … • Score
Time-Domain Representation of Audio • Musical sounds arrive at a microphone as vibrations of the air. • In a computer, they are converted into a sequence of binary values via sampling and quantization. … 0 1 1 0 1 1 0 …
Sampling • What is an appropriate sampling rate? - Too high: increases the data rate - Too low: the original signal becomes hard to reconstruct • Sampling Theorem - The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct it fully (i.e., to avoid aliasing): $f_s > 2 f_m$, where $f_s$ is the sampling rate and $f_m$ is the maximum frequency - Half the sampling rate ($f_s/2$) is called the Nyquist frequency
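A minimal sketch of what the sampling theorem implies in practice, assuming a real sinusoid and an 8 kHz sampling rate chosen only for illustration. The helper `aliased_frequency` is a hypothetical name, not part of any library: a 5 kHz tone sampled at 8 kHz produces exactly the same samples as a 3 kHz tone.

```python
import numpy as np

def aliased_frequency(f, fs):
    """Apparent frequency (Hz) of a real sinusoid of frequency f sampled at fs."""
    f_mod = f % fs
    # Frequencies fold around the Nyquist frequency fs / 2.
    return min(f_mod, fs - f_mod)

fs = 8000.0                       # sampling rate (illustrative choice)
n = np.arange(64)                 # sample indices
x_above = np.cos(2 * np.pi * 5000.0 * n / fs)   # above Nyquist (4 kHz)
x_below = np.cos(2 * np.pi * 3000.0 * n / fs)   # below Nyquist

# The two sampled signals are indistinguishable.
samples_match = np.allclose(x_above, x_below)
print(aliased_frequency(5000.0, fs), samples_match)
```

Once the samples are identical, no later processing can tell the two tones apart, which is why the signal must be band-limited before sampling.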
Sampling Rate • Determined by the bandwidth of signals or hearing limits - Audio CD: 44.1 kHz (44100 samples per second) - Speech communication: 8 kHz (8000 samples per second)
Sampling Rate Conversion (Resampling) • Up-sampling and down-sampling - 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or lower rates to reduce the data size for analysis tasks • Resampling is computed by interpolation using a low-pass filter (e.g. windowed sinc function)
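The following is a rough sketch of down-sampling by an integer factor with a Hann-windowed sinc low-pass filter. The function names, the 101-tap filter length, and the choice of the Hann window are assumptions made for illustration; production resamplers typically use more carefully designed filters.

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, n_taps=101):
    """FIR low-pass filter; cutoff is normalized so that 1.0 is the Nyquist frequency."""
    n = np.arange(n_taps) - (n_taps - 1) / 2      # tap positions centered at zero
    h = cutoff * np.sinc(cutoff * n)              # ideal low-pass impulse response
    h = h * np.hanning(n_taps)                    # taper to limit the filter length
    return h / np.sum(h)                          # unit gain at DC

def downsample(x, factor, n_taps=101):
    """Low-pass filter at the new Nyquist frequency, then keep every factor-th sample."""
    h = windowed_sinc_lowpass(1.0 / factor, n_taps)
    return np.convolve(x, h, mode="same")[::factor]

fs = 44100
t = np.arange(fs) / fs                            # one second of audio
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)
y = downsample(x, 2)                              # now sampled at 22.05 kHz
```

In this example the 1 kHz tone survives, while the 15 kHz tone, which lies above the new Nyquist frequency of 11.025 kHz, is attenuated by the filter instead of aliasing into the result.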
Quantization • Discretizing the amplitude of real-valued signals - Round the amplitude to the nearest discrete step • The discrete steps are determined by the number of bits - The range of $B$-bit quantization: $-2^{B-1}$ to $2^{B-1}-1$ - Audio CD (16 bits): $-2^{15}$ to $2^{15}-1$ → this is often scaled to -1.0 ~ 1.0
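A small sketch of uniform quantization, assuming the input is already scaled to roughly -1.0 ~ 1.0. The names `quantize` and `dequantize` are illustrative only.

```python
import numpy as np

def quantize(x, bits):
    """Map amplitudes in [-1, 1] to signed integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    scale = 2 ** (bits - 1)
    q = np.round(np.asarray(x, dtype=float) * scale)   # nearest discrete step
    q = np.clip(q, -scale, scale - 1)                  # +1.0 does not fit, so clip
    return q.astype(np.int64)

def dequantize(q, bits):
    """Scale the integer codes back to the -1.0 ~ 1.0 range."""
    return q / 2 ** (bits - 1)

codes = quantize([-1.0, 0.0, 1.0], bits=16)
print(codes)
```

For in-range values the rounding error is at most half a step, i.e. $2^{-B}$ after scaling back, so each additional bit halves the worst-case error.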
Time-Domain Representation of Audio • Waveforms are a natural representation of audio but are limited for analyzing the content - Zoom-in view (e.g. one frame): the wave shape is not very intuitive for explaining timbre, particularly when the music is polyphonic - Zoom-out view: limited to explaining temporal loudness changes
Time-Frequency Representations of Audio • We hear and observe musical sounds in terms of how the content changes over time and frequency
Short-Time Fourier Transform (STFT) • Definition: $X(l, k) = \sum_{n=0}^{N-1} w(n)\, x(n + l \cdot H)\, e^{-j 2\pi k n / N}$, where $l$ is the frame index, $k$ the frequency bin, $w(n)$ the window, $H$ the hop size, and $N$ the FFT size • Computation Steps - Take a window (one frame) - Compute the FFT - Shift by the hop size - Repeat the above • This returns a 2D matrix
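The computation steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `stft`, the default sizes, the symmetric Hann window from `np.hanning`, and the absence of padding at the signal edges are all assumptions.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Return a (n_fft // 2 + 1, n_frames) complex matrix; x must have at least n_fft samples."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    X = np.empty((n_fft // 2 + 1, n_frames), dtype=complex)
    for frame_idx in range(n_frames):
        start = frame_idx * hop                      # shift by the hop size
        frame = x[start:start + n_fft]               # take one frame
        X[:, frame_idx] = np.fft.rfft(window * frame)  # window, then FFT
    return X

fs = 8000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)    # 1 kHz tone, one second
X = stft(x)
spectrogram = np.abs(X)                              # magnitude of the 2D matrix
```

Only the non-negative frequency bins are kept (`np.fft.rfft`), which is sufficient for real-valued audio.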
Windowing • Types of window functions - Trade-off between the width of the main lobe and the level of the side lobes - Tapered windows suppress side-lobe levels at the expense of a wider main lobe. - The Hann window is the most widely used in music analysis. (Figure: amplitude and magnitude (dB) of the rectangular, triangular, Hann, and Blackman windows; spectra of windowed sines)
Time-Frequency Resolution by Window Size • Trade-off between time and frequency resolutions - Short window: low frequency-resolution and high time-resolution - Long window: high frequency-resolution and low time-resolution
Discrete Fourier Transform (DFT) • Definition - Inner product of the signal with complex sinusoids: $X(k) = x(n) \cdot s_k(n) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X_R(k) + j X_I(k) = A(k)\, e^{j\phi(k)}$ - By Euler's identity: $s_k(n) = e^{j 2\pi k n / N} = \cos(2\pi k n / N) + j \sin(2\pi k n / N)$ - Magnitude spectrum: $|X(k)| = A(k) = \sqrt{X_R^2(k) + X_I^2(k)}$ - Phase spectrum: $\angle X(k) = \phi(k) = \tan^{-1}\left( X_I(k) / X_R(k) \right)$
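As a sketch, the definition can be evaluated directly as a matrix of complex sinusoids applied to the signal. The function name `dft` is illustrative; `np.angle` is used for the phase because it resolves the quadrant that a plain arctangent of $X_I / X_R$ would lose.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) exp(-j 2 pi k n / N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    S = np.exp(-2j * np.pi * k * n / N)   # row k holds the conjugate of s_k(n)
    return S @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
X = dft(x)
magnitude = np.abs(X)                     # A(k) = sqrt(X_R^2 + X_I^2)
phase = np.angle(X)                       # phi(k), quadrant-aware
```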
Fast Fourier Transform (FFT) • Matrix multiplication view of DFT • The “Fast Fourier Transform (FFT)” is an efficient algorithm that computes this matrix multiplication quickly - Based on “divide and conquer” - Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$
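One way to see the divide-and-conquer idea is a recursive radix-2 decimation-in-time FFT. This is a teaching sketch that assumes a power-of-two input length; it is one member of the FFT family, and library routines such as `np.fft.fft` are far more optimized.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT for inputs whose length is a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return x
    even = fft_radix2(x[0::2])            # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])             # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Combine the two half-size DFTs into the full-size DFT.
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
X = fft_radix2(x)
```

Each level of recursion does $O(N)$ work and there are $\log_2 N$ levels, which gives the $O(N \log_2 N)$ cost.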
Frequency Scale in Spectrogram • Linear frequency scale - Great for seeing the harmonic structure of a single tone. - However, spectral distributions are skewed toward low frequencies (Figure: spectrograms of “Chopin Prelude E minor” and “How insensitive”)
Human Pitch Perception • The basilar membrane in the cochlea is a rough spectral analyzer - It resonates at a different position depending on the frequency of the incoming vibration - The resonance frequency increases on a log scale along the basilar membrane (Figure: the inner ear and the unrolled cochlea, showing the oval window, round window, and basilar membrane; P. Cook)
Human Pitch Perception • Pitch Resolution - The just noticeable difference (JND) increases as the frequency goes up • Critical bandwidth - Frequency bandwidth within which one tone interferes with the perception of another tone by auditory masking - Approximately constant at low frequencies, but increasing roughly linearly with frequency at high frequencies
Psychoacoustical Pitch Scales • Mel scale - Based on pitch ratio of tones: $m = 2595 \log_{10}(1 + f/700)$ • Bark scale - Critical band measurement by masking: $\mathrm{Bark} = 13 \arctan(0.00075 f) + 3.5 \arctan((f/7500)^2)$ • Equivalent rectangular bandwidth (ERB) rate - Critical band measurement using the notched-noise method: $\mathrm{ERBS} = 21.4 \log_{10}(1 + 0.00437 f)$ (Figure: comparison of pitch scales; normalized ERB, Mel, and Bark scales versus frequency in Hz, from 0 to 2.5 × 10^4. Using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/)
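The three formulas translate directly into code. The sketch below uses the constants exactly as given on the slide; the function names are illustrative, and `mel_to_hz` is simply the algebraic inverse of the mel formula.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel, obtained by solving the formula for f.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def hz_to_bark(f):
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00075 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def hz_to_erbs(f):
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f, dtype=float))

freqs = np.linspace(0.0, 22050.0, 512)
# Normalizing each curve by its maximum allows them to be compared on one axis.
normalized = {name: fn(freqs) / fn(freqs[-1])
              for name, fn in [("mel", hz_to_mel), ("bark", hz_to_bark), ("erbs", hz_to_erbs)]}
```

All three scales are close to linear at low frequencies and compress high frequencies, which is what makes them "log-like".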
Tuning System in Musical Instruments • Equal temperament - $1 : 2^{1/12}$ frequency ratio between two adjacent notes - Music note ($m$) and frequency ($f$) in Hz: $m = 12 \log_2(f/440) + 69$, $f = 440 \cdot 2^{(m-69)/12}$ https://newt.phys.unsw.edu.au/jw/notes.html
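A short sketch of the two conversions, assuming the note number $m$ follows the convention in which 69 corresponds to A4 = 440 Hz, as the formula implies. The function names are illustrative.

```python
import math

def hz_to_note(f):
    """Note number m for frequency f in Hz (69 corresponds to 440 Hz)."""
    return 12 * math.log2(f / 440.0) + 69

def note_to_hz(m):
    """Frequency in Hz for note number m."""
    return 440.0 * 2 ** ((m - 69) / 12)

print(note_to_hz(69))    # 440.0
print(hz_to_note(880.0)) # 81.0, one octave (12 notes) above 69
print(note_to_hz(60))    # middle C, roughly 261.63 Hz
```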
Log-Spectrogram Using Frequency Mapping • Mapping linear scale to a perceptual (log-like) scale - Locate center frequencies according to the frequency mapping - Linear interpolation around each center frequency with the corresponding bandwidth skirt (Figure: a band with its center frequency and bandwidth; linear-frequency spectrogram and the resulting log-frequency spectrogram)
Log-Spectrogram Using Frequency Mapping • The mapping can be formed as a matrix multiplication - Each column of the mapping matrix contains the interpolation coefficients: $Y = M \cdot X$ ($M$: mapping matrix, $X$: spectrogram, $Y$: scaled spectrogram) (Figure: the mapping matrix multiplied by the spectrogram gives the scaled spectrogram) • Limitation - Simple, but the time-frequency resolution is still constrained by that of the STFT
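A possible construction of such a mapping matrix is sketched below, with several assumptions: the mel formula $m = 2595 \log_{10}(1 + f/700)$ is used as the frequency mapping, the bandwidth skirts are triangles that reach zero at the neighboring center frequencies, and the spectrogram is stored as frequency × time so that each output band occupies one row of $M$. The function names and default sizes are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mapping_matrix(n_bands, n_fft, fs):
    """Triangular interpolation weights, shape (n_bands, n_fft // 2 + 1)."""
    bin_freqs = np.linspace(0.0, fs / 2, n_fft // 2 + 1)   # linear bin centers in Hz
    # n_bands centers plus two outer edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_bands + 2))
    M = np.zeros((n_bands, len(bin_freqs)))
    for b in range(n_bands):
        lo, center, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (bin_freqs - lo) / (center - lo)
        falling = (hi - bin_freqs) / (hi - center)
        M[b] = np.maximum(0.0, np.minimum(rising, falling))
    return M

M = mapping_matrix(n_bands=40, n_fft=1024, fs=22050)
X = np.ones((513, 7))          # placeholder magnitude spectrogram (frequency x time)
Y = M @ X                      # scaled spectrogram, shape (40, 7)
```

Because the mapping is applied after the STFT, the low bands still inherit the linear bin spacing of the underlying transform, which is the limitation noted on the slide.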
Mel-Frequency Spectrogram • The mel-scaled spectrogram is widely used for music classification - Usually mapped to a smaller number of mel-scaled bins than the FFT size (Figure: linear-frequency spectrogram and mel-frequency spectrogram)
Constant-Q Transform • Use a set of sinusoidal kernels with: - Logarithmically spaced frequencies - Constant Q = frequency/bandwidth
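To make the two properties concrete, the sketch below computes logarithmically spaced center frequencies and the resulting Q. It assumes one common choice in which the bandwidth of each kernel equals the spacing to the next center frequency, and a kernel length proportional to Q / frequency; the function names, the 55 Hz starting frequency, and the 22.05 kHz sampling rate are illustrative.

```python
import numpy as np

def cq_frequencies(f_min, n_bins, bins_per_octave=12):
    """Center frequencies spaced by a constant ratio 2^(1 / bins_per_octave)."""
    return f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)

def q_factor(bins_per_octave=12):
    """Q = frequency / bandwidth, with bandwidth taken as the gap to the next bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

fs = 22050
freqs = cq_frequencies(55.0, 25)            # two octaves starting from 55 Hz
Q = q_factor()
bandwidths = freqs / Q                      # grows with frequency, so Q stays constant
kernel_lengths = np.ceil(Q * fs / freqs)    # long kernels at low frequencies
```

Unlike the STFT, the kernel length varies per bin: low frequencies get long windows (fine frequency resolution) and high frequencies get short windows (fine time resolution).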
Constant-Q Filter Bank (Figure: log-frequency spectrogram obtained by mapping, compared with the log-frequency spectrogram obtained by the constant-Q transform)
Example: Constant-Q Filter Bank • Müller’s 88-note filter bank - The center frequency of each filter is set to one of the 88 piano notes - The bandwidth is set to ±25 cents around the center, which gives a constant Q (Müller, 2011)
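The band edges of such a filter bank can be sketched numerically. This only computes center frequencies and passband edges, not the filters themselves; it assumes the 88 piano notes correspond to note numbers 21 to 108 with 69 tuned to 440 Hz, and the variable names are illustrative.

```python
import numpy as np

notes = np.arange(21, 109)                          # 88 notes (assumed numbering)
centers = 440.0 * 2.0 ** ((notes - 69) / 12)        # equal-tempered center frequencies
lower = centers * 2.0 ** (-25 / 1200)               # 25 cents below the center
upper = centers * 2.0 ** (25 / 1200)                # 25 cents above the center
bandwidths = upper - lower
Q = centers / bandwidths                            # identical for every note
```

Because the edges are defined as a fixed ratio of the center frequency, the bandwidth scales with the center and Q comes out the same for all 88 filters.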
Comparison of Time-Frequency Representations (Figure: four time-frequency panels with time on the horizontal axis and frequency on the vertical axis: spectrogram (long window), spectrogram (short window), constant-Q transform, and mel spectrogram)
Auditory Filter Bank • A bank of filters that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea - Produces a 3-D representation (time-channel-lag), or “auditory images” (Figure: input → cochlear filter banks (from the oval window, high to low frequency) → hair cells (HC) → auto-correlation functions (ACF) → stabilize & combine → correlogram and summary ACF)
Types of Auditory Filter Banks • Gamma-tone filter banks (R. Patterson) - Gamma-tone: $g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t)$ • Pole-Zero Filter Cascade (D. Lyon)
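The gamma-tone impulse response can be sampled directly from the formula. In this sketch the order `n=4`, the bandwidth parameter `b=125.0`, the 1 kHz center frequency, and the 50 ms duration are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def gammatone(fs, f, b, n=4, a=1.0, phi=0.0, duration=0.05):
    """Sampled g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f t + phi) for t >= 0."""
    t = np.arange(int(round(duration * fs))) / fs   # t >= 0 plays the role of u(t)
    envelope = a * t ** (n - 1) * np.exp(-2 * np.pi * b * t)
    return envelope * np.cos(2 * np.pi * f * t + phi)

fs = 16000
g = gammatone(fs, f=1000.0, b=125.0)                # one channel of a filter bank
g_envelope = gammatone(fs, f=0.0, b=125.0)          # f = 0 leaves only the envelope
```

The envelope rises as $t^{n-1}$ and then decays exponentially, peaking at $t = (n-1) / (2\pi b)$; a filter bank repeats this for many center frequencies with bandwidths that grow with frequency.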
Hair-Cell • (Inner) hair-cell - Transforms mechanical movement into neural spikes • Modeled as a cascade of - Half-wave rectification - Compression - Low-pass filtering • This is a non-linear process - Generates new harmonic partials - Associated with missing fundamentals
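A very simplified sketch of the three-stage cascade follows. The square-root compression, the one-pole low-pass filter, and the 1 kHz cutoff are assumptions chosen for brevity; actual hair-cell models are considerably more elaborate.

```python
import numpy as np

def hair_cell(x, fs, cutoff=1000.0, exponent=0.5):
    """Half-wave rectification, power-law compression, then one-pole low-pass filtering."""
    rectified = np.maximum(np.asarray(x, dtype=float), 0.0)   # keep positive half only
    compressed = rectified ** exponent                        # compress large amplitudes
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff / fs)          # smoothing coefficient
    y = np.empty_like(compressed)
    state = 0.0
    for i, value in enumerate(compressed):
        state += alpha * (value - state)                      # y[i] = y[i-1] + alpha (v - y[i-1])
        y[i] = state
    return y

fs = 16000
n = np.arange(1600)
x = np.sin(2 * np.pi * 200 * n / fs)       # pure 200 Hz tone, no harmonics
y = hair_cell(x, fs)
X_mag = np.abs(np.fft.rfft(x))             # bin spacing is 10 Hz
Y_mag = np.abs(np.fft.rfft(y))
```

Comparing `X_mag` and `Y_mag` at 400 Hz (bin 40) shows energy that was not present in the input, which illustrates how the non-linearity generates new harmonic partials.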
Example of Correlogram: Piano
Example of Correlogram: Rock Music
Software • Tools - C++: http://soundlab.cs.princeton.edu/software/sndpeek/ - WebAudio: https://musiclab.chromeexperiments.com/Spectrogram - Audacity, Sonic Visualizer, Adobe Audition, Praat, … • Libraries - Librosa (python): http://librosa.github.io/librosa/ - Auditory Toolbox (Matlab): https://engineering.purdue.edu/~malcolm/interval/1998-010/