GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) Audio Data Representations Juhan Nam
Types of Music Data
● Audio
  ○ MP3, WAV
● Score (symbolic)
  ○ MIDI, typesetting script languages (e.g., MusicXML)
● Image
  ○ Score (scanned image), album/playlist cover, performance video
● Text
  ○ Metadata, tags, lyrics, reviews
● User Data
  ○ Listening history, ratings, favorites
Types of Audio Data Representations
● Waveform (digital audio samples): sampling and quantization
● Spectrogram: short-time Fourier transform
● Mel-spectrogram: human pitch perception
● Constant-Q transform: transform into musical (chromatic) scale
Digital Audio Chain
[Figure: microphone → lowpass filters → analog-to-digital conversion (sampling, quantization) → storage, processing (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filters → amplifier → loudspeaker]
Sampling and Quantization
[Figure: the same digital audio chain, in which sampling and quantization form the analog-to-digital conversion stage]
Sampling
● Convert continuous-time signals to discrete-time signals by periodically picking the instantaneous values
  ○ Represented as a sequence of numbers
  ○ Sampling period ($T_s$): the amount of time between samples
  ○ Sampling rate ($f_s = 1/T_s$)
  ○ Signal notation: $x(t) \rightarrow x(nT_s)$
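The notation $x(t) \rightarrow x(nT_s)$ can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the 440 Hz sine and the 8 kHz sampling rate are hypothetical values chosen for the example, not taken from the slides.

```python
import numpy as np

fs = 8000            # sampling rate f_s in Hz (illustrative value)
Ts = 1.0 / fs        # sampling period T_s in seconds
duration = 0.01      # seconds of signal to sample

def x(t):
    """Hypothetical continuous-time signal: a 440 Hz sine."""
    return np.sin(2 * np.pi * 440.0 * t)

n = np.arange(int(round(duration * fs)))   # sample indices 0, 1, 2, ...
samples = x(n * Ts)                        # x(n T_s): a sequence of numbers
```

Each entry of `samples` is the instantaneous value of the signal at time `n * Ts`.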
Sampling Theorem
● What is an appropriate sampling rate?
  ○ Too high: increases the data size in the digital domain
  ○ Too low: cannot reconstruct the original signal
● Sampling Theorem
  ○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal
    $f_s > 2 f_m$ ($f_s$: sampling rate, $f_m$: maximum frequency of the signal)
  ○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)
Sampling in the Frequency Domain
[Figure: spectrum of the original signal, band-limited between $-f_m$ and $f_m$]
● $f_s > 2 f_m$: aliases of the spectrum appear around $\pm f_s$ (from $f_s - f_m$ to $f_s + f_m$) and do not overlap the original band
● $f_s < 2 f_m$: the aliases overlap the original band; the high-frequency content above the Nyquist frequency is folded over
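The fold-over can be checked numerically. The sketch below assumes NumPy and illustrative frequencies (8 kHz sampling rate, 3 kHz and 5 kHz sines); it samples a sine below and a sine above the Nyquist frequency and reports where each one shows up in the spectrum.

```python
import numpy as np

fs = 8000                 # sampling rate in Hz (illustrative)
N = 8000                  # one second of samples, so FFT bins are 1 Hz apart
n = np.arange(N)

def peak_frequency(f_in):
    """Sample a sine of frequency f_in at fs and return the strongest FFT bin in Hz."""
    x = np.sin(2 * np.pi * f_in * n / fs)
    spectrum = np.abs(np.fft.rfft(x))
    return np.argmax(spectrum) * fs / N

below = peak_frequency(3000)   # below the Nyquist frequency (4000 Hz)
above = peak_frequency(5000)   # above it: folds over to fs - 5000
```

After sampling, the 5 kHz sine is indistinguishable from a 3 kHz sine, which is why content above the Nyquist frequency has to be removed before sampling.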
Sampling Rate
● Determined by the bandwidth of signals or hearing limits
  ○ Music (CD): 44.1 kHz (consumer) or 48/96/192 kHz (professional)
  ○ Speech communication: 8 kHz
[Figure panels: Music, Speech]
Sampling Rate Conversion (Resampling)
● We often increase or decrease the sampling rate
  ○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size
● Up-sampling and down-sampling
  ○ Computed by signal interpolation
    ■ In down-sampling, preceded by a low-pass filter to avoid the aliasing noise
    ■ Windowed sinc function
    ■ https://ccrma.stanford.edu/~jos/resample/
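A minimal sketch of down-sampling by a factor of 2 with a windowed-sinc low-pass filter, assuming NumPy. The function names, the 101-tap length, the Hann taper, and the cutoff margin are illustrative choices, not the method of any particular library; production resamplers handle arbitrary ratios and are more carefully designed.

```python
import numpy as np

def windowed_sinc_lowpass(cutoff, num_taps=101):
    """FIR low-pass filter; cutoff is a fraction of the sampling rate (0 to 0.5)."""
    m = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * m)   # ideal low-pass (sinc), truncated
    h *= np.hanning(num_taps)                  # taper with a Hann window
    return h / h.sum()                         # unity gain at 0 Hz

def downsample_by_2(x):
    """Low-pass below the new Nyquist frequency, then keep every other sample."""
    h = windowed_sinc_lowpass(cutoff=0.25 * 0.9)   # slightly below fs/4
    filtered = np.convolve(x, h, mode="same")
    return filtered[::2]

fs = 44100
t = np.arange(fs) / fs
# 1 kHz should survive; 15 kHz is above the new Nyquist frequency (11.025 kHz)
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)
y = downsample_by_2(x)        # new sampling rate: 22050 Hz
```

Without the low-pass filter, the 15 kHz component would alias to 22050 - 15000 = 7050 Hz in the down-sampled signal.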
Quantization
● Discretizing the amplitude of real-valued signals
  ○ Round the amplitude to the nearest discrete step
  ○ The number of discrete steps is determined by the number of bits (bit depth)
    ■ N bits can range from $-2^{N-1}$ to $2^{N-1}-1$: 8 bit (-128 to 127), 16 bit (-32768 to 32767)
[Figure: quantization steps between $-2^{N-1}$ and $2^{N-1}-1$]
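Rounding to the nearest step can be sketched as follows, assuming NumPy and input samples scaled to the range [-1.0, 1.0). The `quantize` helper is a hypothetical name introduced for this example.

```python
import numpy as np

def quantize(x, n_bits):
    """Round samples in [-1.0, 1.0) to signed n_bits integers."""
    levels = 2 ** (n_bits - 1)                 # e.g. 32768 for 16 bits
    q = np.round(np.asarray(x) * levels)       # nearest discrete step
    return np.clip(q, -levels, levels - 1).astype(np.int64)

x = np.array([-1.0, -0.5, 0.0, 0.5, 0.999999])
q8 = quantize(x, 8)     # values lie in [-128, 127]
q16 = quantize(x, 16)   # values lie in [-32768, 32767]
```

The clipping step keeps values that round up to $2^{N-1}$ inside the representable range.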
Quantization
● Determined by the dynamic range of signals
  ○ Each additional bit adds about 6 dB of dynamic range: N bits → 6N dB
  ○ Music (CD): 16 bits (consumer) → 96 dB
  ○ Speech communication: 8 bits → 48 dB
[Figure panels: Music, Speech]
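The 6 dB-per-bit rule follows from $20\log_{10}(2) \approx 6.02$. A small sketch, with `dynamic_range_db` as a hypothetical helper name:

```python
import math

def dynamic_range_db(n_bits):
    """Ratio between full scale (2**n_bits steps) and one quantization step, in dB."""
    return 20 * math.log10(2 ** n_bits)

per_bit = dynamic_range_db(1)     # about 6.02 dB per bit
cd = dynamic_range_db(16)         # about 96.3 dB
speech = dynamic_range_db(8)      # about 48.2 dB
```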
Loading Audio Files
● Check the sampling rate and bit depth
  ○ You can check them using audio software such as Audacity
● Resample (usually down-sample) if necessary
  ○ Librosa provides resampling when loading audio files
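The sampling rate and bit depth of an uncompressed WAV file can also be read programmatically. The sketch below uses only Python's standard `wave` module and writes its own tiny file (a hypothetical `example.wav`) so that it is self-contained. With Librosa, `librosa.load(path, sr=22050)` resamples to the requested rate while loading, and `sr=None` keeps the file's native rate.

```python
import os
import struct
import tempfile
import wave

# Write a short 16-bit, 44.1 kHz mono file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.wav")   # hypothetical file name
with wave.open(path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)          # 2 bytes per sample = 16 bits
    f.setframerate(44100)
    f.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))

# Read back the header fields.
with wave.open(path, "rb") as f:
    sampling_rate = f.getframerate()
    bit_depth = 8 * f.getsampwidth()
    num_samples = f.getnframes()
```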
Waveform
● The waveform is a natural representation of audio, but it is limited for analyzing the content
  ○ It mainly shows how the energy changes over time
Spectrogram
● 2D-image representation of audio using short-time Fourier transform
  ○ x-axis: time, y-axis: frequency, color: magnitude response
  ○ It is common to use dB scale (a log scale) for the magnitude
  ○ Easy to match what you hear to what you see
Computing Spectrogram
● For each short segment (frame)
  ○ Take a window (one frame)
  ○ Compute DFT (FFT)
  ○ Convert the result to polar coordinates
    ■ Magnitude and phase
  ○ Compress the magnitude
    ■ $20\log_{10} X_{mag}$: decibel
  ○ Shift by a hop size
● Spectrogram parameters
  ○ Window size (FFT size)
  ○ Hop size
  ○ Window type
[Figure: short-time Fourier transform (STFT). Frames $x(l-1)$ and $x(l)$ of length "window size", offset by "hop size", go through windowing, DFT, and magnitude compression to give $X_{mag}$]
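The steps above can be sketched with NumPy alone. This is a minimal illustration, assuming a Hann window, a window size of 1024 and a hop size of 256 (all illustrative choices); `spectrogram_db` is a hypothetical helper, not a library function.

```python
import numpy as np

def spectrogram_db(x, win_size=1024, hop_size=256):
    """Magnitude spectrogram in dB, shape (win_size // 2 + 1, num_frames)."""
    window = np.hanning(win_size)
    num_frames = 1 + (len(x) - win_size) // hop_size
    frames = np.stack([
        x[i * hop_size : i * hop_size + win_size] * window   # one windowed frame
        for i in range(num_frames)
    ])
    X = np.fft.rfft(frames, axis=1)          # DFT of each frame (via FFT)
    X_mag = np.abs(X)                        # polar form: magnitude (phase is np.angle(X))
    X_db = 20 * np.log10(X_mag + 1e-10)      # compress the magnitude to decibels
    return X_db.T                            # rows: frequency bins, columns: frames

fs = 22050
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)             # one second of a 1 kHz sine
S = spectrogram_db(x)
```

The small constant added before the logarithm only avoids taking the log of zero. For the 1 kHz sine, the strongest bin in every frame is the one closest to 1000 / (fs / win_size).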
Discrete Fourier Transform (DFT)
● Find the frequency (sinusoidal) components of $x(n)$
● Represent $x(n)$ with
  $x(n) = \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$
  ○ $A(k)$: amplitude (or magnitude) of the sinusoid
  ○ $\phi(k)$: phase of the sinusoid
  ○ $N$: size of DFT or the input segment
  ○ $k$: frequency bin index (0 to $N-1$); $\frac{k}{N} f_s$ is the frequency at each frequency bin
● DFT provides the way of finding $A(k)$ and $\phi(k)$
[Image: Pink Floyd, "The Dark Side of the Moon"]
Discrete Fourier Transform (DFT)
● Use the orthogonality of sinusoids
  ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \cos\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$ or $k = N - l$ (equivalent to $-l$), and $0$ otherwise
  ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = 0$
  ○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right) \sin\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$, $-N/2$ if $k = N - l$ (equivalent to $-l$), and $0$ otherwise
● The inner product (or correlation) between the two sinusoids:
  ○ If the frequencies are the same (including different signs), it is non-zero
  ○ Otherwise, it is zero (they are orthogonal to each other)
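These sums are easy to verify numerically. The sketch assumes NumPy and uses illustrative bins $k = 3$, $l = 5$ with $N = 16$; bins 0 and $N/2$ are special cases (the sine is identically zero there) and are deliberately avoided.

```python
import numpy as np

N = 16
n = np.arange(N)

def cos_k(k):
    return np.cos(2 * np.pi * k * n / N)

def sin_k(k):
    return np.sin(2 * np.pi * k * n / N)

k, l = 3, 5                                  # illustrative bins, not 0 or N/2
same_cos = np.dot(cos_k(k), cos_k(k))        # k = l      ->  N/2
mirror_cos = np.dot(cos_k(N - l), cos_k(l))  # k = N - l  ->  N/2
diff_cos = np.dot(cos_k(k), cos_k(l))        # otherwise  ->  0
cos_sin = np.dot(cos_k(k), sin_k(k))         # always     ->  0
same_sin = np.dot(sin_k(l), sin_k(l))        # k = l      ->  N/2
mirror_sin = np.dot(sin_k(N - l), sin_k(l))  # k = N - l  -> -N/2
```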
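Discrete Fourier Transform (DFT)
● Inner product with the input and sinusoids
  ○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \cos\left(\frac{2\pi l n}{N}\right) = A(l) \cos\phi(l)$
  ○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n) \sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \sin\left(\frac{2\pi l n}{N}\right) = A(l) \sin\phi(l)$
● The magnitude and phase (at each bin $k$)
  ○ $X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}$, $X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$
● The definition of DFT can be simplified using complex sinusoids
  ○ $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j\frac{2\pi k n}{N}} = X_{re}(k) + jX_{im}(k) = A(k) e^{j\phi(k)}$
  ○ Euler's identity: $e^{j\frac{2\pi k n}{N}} = \cos\left(\frac{2\pi k n}{N}\right) + j\sin\left(\frac{2\pi k n}{N}\right)$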
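As a sanity check, the real and imaginary parts can be computed as inner products with cosines and sines and compared against NumPy's FFT. The test signal (amplitude 0.8, bin 5, phase 0.3) is a hypothetical example; under the $1/N$ convention used in the synthesis formula, a real cosine of amplitude $a$ at bin $k_0$ gives $A(k_0) = aN/2$.

```python
import numpy as np

N = 64
n = np.arange(N)
x = 0.8 * np.cos(2 * np.pi * 5 * n / N + 0.3)     # hypothetical test signal

k = np.arange(N).reshape(-1, 1)                   # one row per frequency bin
X_re = np.sum(x * np.cos(2 * np.pi * k * n / N), axis=1)
X_im = -np.sum(x * np.sin(2 * np.pi * k * n / N), axis=1)

X_mag = np.hypot(X_re, X_im)                      # sqrt(X_re^2 + X_im^2)
X_phase = np.arctan2(X_im, X_re)                  # four-quadrant arctangent

X = X_re + 1j * X_im                              # same values as the complex DFT
matches_fft = np.allclose(X, np.fft.fft(x))
```

`np.arctan2` is used instead of a plain arctangent of the ratio so that the phase lands in the correct quadrant.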
Discrete Fourier Transform (DFT)
● Can be viewed as matrix multiplication
  ○ Row $k$ of the matrix is the complex sinusoid $s_k^*(n) = e^{-j\frac{2\pi k n}{N}}$
  [Figure: the input $x(n)$ multiplied by the DFT matrix gives $X_{re}$ and $X_{im}$, which are converted to polar form $X_{mag}$ and $X_{phase}$]
● In practice, we use an FFT algorithm instead of direct multiplication
  ○ Divide the matrix into small matrices recursively
  ○ Complexity reduction: $O(N^2) \rightarrow O(N \log_2 N)$
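A minimal sketch of the matrix view, assuming NumPy; `dft_matrix` is a hypothetical helper written for this example. The direct product and the FFT give the same result, and only the cost differs.

```python
import numpy as np

def dft_matrix(N):
    """N x N matrix W with W[k, n] = exp(-j 2 pi k n / N)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * k * n / N)

N = 256
rng = np.random.default_rng(0)
x = rng.standard_normal(N)        # arbitrary test input

X_direct = dft_matrix(N) @ x      # direct multiplication: O(N^2)
X_fft = np.fft.fft(x)             # FFT: O(N log N)
```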
Discrete Fourier Transform (DFT)
● When DFT is applied to musical sounds
  ○ A musical tone with pitch has periodic waveforms
  ○ DFT shows harmonic spectrum (harmonic overtones)
  ○ Pitch information can also be extracted
  ○ The magnitude is generally more sparse than the waveform
[Figure: waveform $x(n)$ and its magnitude spectrum $X_{mag}(k)$ with peaks at $F0$, $2F0$, $3F0$]
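The harmonic structure and the sparsity of the magnitude can be seen on a synthetic tone. This sketch assumes NumPy and a hypothetical tone with a 200 Hz fundamental and five harmonics; real instrument tones are less clean than this.

```python
import numpy as np

fs = 16000
N = 16000                                 # one second, so bins are 1 Hz apart
t = np.arange(N) / fs
f0 = 200                                  # hypothetical fundamental frequency in Hz

# Synthetic pitched tone: five harmonics with decaying amplitudes
x = sum((1.0 / h) * np.sin(2 * np.pi * h * f0 * t) for h in range(1, 6))

X_mag = np.abs(np.fft.rfft(x))
peak_bins = np.flatnonzero(X_mag > 0.01 * X_mag.max())   # bins holding noticeable energy
peak_freqs = peak_bins * fs / N
```

Only a handful of bins carry energy, at integer multiples of the fundamental, while nearly every sample of the waveform is non-zero.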
Effect of Window Type
● Types of window functions
  ○ Trade-off between the width of the main lobe and the level of the side lobes
  ○ The Hann window is the most widely used in music analysis
[Figure: spectra of windowed single sinusoids. Panels: Rectangular, Triangular, Hann, Blackman. Top row: amplitude of the window; bottom row: magnitude (dB)]
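The trade-off can be measured directly from the zero-padded spectrum of each window. The sketch assumes NumPy; `main_lobe_and_side_lobe` is a hypothetical helper, and the window length of 64 and padding factor of 64 are arbitrary illustrative choices.

```python
import numpy as np

def main_lobe_and_side_lobe(window, pad=64):
    """Return (first-null bin in the zero-padded spectrum, peak side-lobe level in dB)."""
    W = np.abs(np.fft.rfft(window, len(window) * pad))
    W_db = 20 * np.log10(W / W.max() + 1e-12)
    i = 0
    while W_db[i + 1] < W_db[i]:          # walk down the main lobe to its first null
        i += 1
    return i, W_db[i:].max()

N = 64
rect_null, rect_side = main_lobe_and_side_lobe(np.ones(N))
hann_null, hann_side = main_lobe_and_side_lobe(np.hanning(N))
```

The rectangular window has the narrower main lobe (smaller first-null bin) but the higher side lobes; the Hann window trades a wider main lobe for lower side lobes.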
Effect of Window Size
● Trade-off between time and frequency resolutions
  ○ Short window: low frequency-resolution and high time-resolution
  ○ Long window: high frequency-resolution and low time-resolution
[Figure: spectrograms with Hop=128, N=4096 and with Hop=128, N=256]
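The trade-off can be put in numbers for the two window sizes in the figure. The sampling rate of 22050 Hz is an assumption made for this example (the slide does not state it), and `resolutions` is a hypothetical helper.

```python
def resolutions(win_size, fs=22050):
    """Return (bin spacing in Hz, frame duration in ms); fs is an assumed sampling rate."""
    freq_resolution_hz = fs / win_size
    frame_duration_ms = 1000.0 * win_size / fs
    return freq_resolution_hz, frame_duration_ms

long_df, long_ms = resolutions(4096)    # fine frequency grid, long frame
short_df, short_ms = resolutions(256)   # coarse frequency grid, short frame
```

The product of bin spacing and frame duration is the same for both sizes, which is why one resolution can only be improved at the expense of the other.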
Human Ears
● Is the human ear a spectrum analyzer?
  ○ Our ear has a complicated pathway from the eardrum to the auditory nerve
  ○ The cochlea in the inner ear is a bandpass-filter bank
  ○ The membrane resonates at a different position depending on the frequency of the input. The resonance frequency increases in a log scale along the membrane
[Figure: cochlea and its membrane (unrolled)]
Human Pitch Perception
● Pitch Resolution
  ○ Just noticeable difference (JND) increases as the frequency goes up
● Mel scale
  ○ Approximate the human pitch resolution based on pitch ratio of tones
  ○ Most widely used for speech and music analysis
  ○ A log frequency scale
    $m = 2595 \log_{10}(1 + f/700)$
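The mel formula translates directly into code. In this sketch, `hz_to_mel` implements the formula from the slide and `mel_to_hz` is its algebraic inverse, added here for convenience; both names are chosen for this example.

```python
import math

def hz_to_mel(f):
    """Mel value for a frequency f in Hz: m = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

low_step = hz_to_mel(200) - hz_to_mel(100)      # 100 Hz step at low frequency
high_step = hz_to_mel(8100) - hz_to_mel(8000)   # 100 Hz step at high frequency
```

The same 100 Hz step covers far fewer mels at high frequencies than at low ones, which matches the coarser pitch resolution there.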