
Audio Data Representations (Juhan Nam)



  1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) Audio Data Representations Juhan Nam

  2. Types of Music Data ● Audio ○ MP3, WAV ● Score (symbolic) ○ MIDI, typesetting script languages (e.g., MusicXML) ● Image ○ Score (scanned image), album/playlist cover, performance video ● Text ○ Metadata, tags, lyrics, reviews ● User data ○ Listening history, ratings

  3. Types of Music Data ● Audio ○ MP3, WAV ● Score (symbolic) ○ MIDI, typesetting script languages (e.g., MusicXML) ● Image ○ Score (scanned image), album/playlist cover, performance video ● Text ○ Metadata, tags, lyrics, reviews ● User data ○ Listening history, favorites or scores

  4. Types of Audio Data Representations ● Waveform (digital audio samples): sampling and quantization ● Spectrogram: short-time Fourier transform ● Mel-spectrogram: human pitch perception ● Constant-Q transform: transform into musical (chromatic) scale

  5. Digital Audio Chain ● Analog-to-digital conversion: microphone → lowpass filter → sampling → quantization ● Storage and processing of the digital signal (… 0 0 1 0 1 0 …) ● Digital-to-analog conversion: lowpass filter → amplifier → loudspeaker

  6. Sampling and Quantization ● The two stages of analog-to-digital conversion in the digital audio chain: microphone → lowpass filter → sampling → quantization → storage and processing (… 0 0 1 0 1 0 …) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker

  7. Sampling ● Convert continuous-time signals to discrete-time signals by periodically picking the instantaneous values ○ Represented as a sequence of numbers ○ Sampling period ($T_s$): the amount of time between samples ○ Sampling rate ($f_s = 1/T_s$): the number of samples per second ○ Signal notation: $x(t) \rightarrow x(nT_s)$
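
A minimal sketch of the sampling step, assuming a 440 Hz sine as the continuous-time signal and an 8 kHz sampling rate. Both values and all names are illustrative and do not come from the slides.

```python
import numpy as np

fs = 8000            # sampling rate in Hz (assumed for illustration)
Ts = 1.0 / fs        # sampling period in seconds
f0 = 440.0           # frequency of the example sine in Hz (assumed)

def x(t):
    """Continuous-time signal x(t), here a pure sine."""
    return np.sin(2 * np.pi * f0 * t)

n = np.arange(16)    # sample indices 0..15
samples = x(n * Ts)  # discrete-time signal x(n * Ts)
```

The array `samples` is the "sequence of numbers" from the slide: the time axis is gone, and only the index `n` together with `fs` tells where each value sits in time.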

  8. Sampling Theorem ● What is an appropriate sampling rate? ○ Too high: increases the data size in the digital domain ○ Too low: the original signal cannot be reconstructed ● Sampling theorem ○ To reconstruct the original signal, the sampling rate must be greater than twice the maximum frequency in the signal: $f_s > 2 f_m$, where $f_s$ is the sampling rate and $f_m$ is the maximum frequency of the signal ○ Half the sampling rate ($f_s/2$) is called the Nyquist frequency

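  9. Sampling in the Frequency Domain ● Sampling repeats the original spectrum, which spans $-f_m$ to $f_m$, around every multiple of $f_s$; the repeated copies are aliases ○ $f_s > 2 f_m$: the aliases span $f_s - f_m$ to $f_s + f_m$ and $-f_s - f_m$ to $-f_s + f_m$, and do not overlap the original spectrum ○ $f_s < 2 f_m$: the aliases overlap the original spectrum, and the high-frequency content above the Nyquist frequency is folded over [Figure: spectra on the frequency axis for the original signal and for both sampling cases]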
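
The folding can be checked numerically. The sketch below assumes an 8 kHz sampling rate and a 7 kHz cosine, which lies above the Nyquist frequency of 4 kHz; these values are chosen only for illustration.

```python
import numpy as np

fs = 8000                     # sampling rate in Hz (assumed)
n = np.arange(64)             # sample indices
f_high = 7000                 # above the Nyquist frequency fs/2 = 4000 Hz
f_alias = fs - f_high         # folded frequency: 1000 Hz

x_high = np.cos(2 * np.pi * f_high * n / fs)
x_alias = np.cos(2 * np.pi * f_alias * n / fs)

# The two sampled sequences are identical, so after sampling the 7 kHz
# tone cannot be told apart from a 1 kHz tone.
same = np.allclose(x_high, x_alias)
```

This is why a lowpass filter precedes the sampler in the digital audio chain: content above $f_s/2$ has to be removed before sampling, because it cannot be separated afterwards.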

  10. Sampling Rate ● Determined by the bandwidth of signals or hearing limits ○ Music: 44.1 kHz (CD, consumer) or 48/96/192 kHz (professional) ○ Speech communication: 8 kHz [Figure: music and speech examples]

  11. Sampling Rate Conversion (Resampling) ● We often increase (up-sampling) or decrease (down-sampling) the sampling rate ○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size ○ Computed by signal interpolation ■ In down-sampling, preceded by a low-pass filter to avoid aliasing noise ■ Windowed sinc function ■ https://ccrma.stanford.edu/~jos/resample/
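
A rough sketch of down-sampling by a factor of 2 with a windowed-sinc low-pass filter. The filter length (101 taps), the Hann taper, and the function names are assumptions made for this example; production resamplers such as the one described at the link above are considerably more refined.

```python
import numpy as np

def lowpass_windowed_sinc(cutoff, num_taps=101):
    """FIR low-pass filter; cutoff is a fraction of the sampling rate (0 to 0.5)."""
    m = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * m)   # ideal low-pass impulse response
    h *= np.hanning(num_taps)                  # taper with a Hann window
    return h / h.sum()                         # unity gain at DC

def downsample_by_2(x):
    """Low-pass filter at the new Nyquist frequency, then keep every 2nd sample."""
    h = lowpass_windowed_sinc(cutoff=0.25)
    y = np.convolve(x, h, mode="same")
    return y[::2]

fs = 44100
t = np.arange(fs) / fs                         # one second of audio
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 15000 * t)
y = downsample_by_2(x)                         # now at 22050 Hz
```

Without the filter, the 15 kHz component would fold to 22050 - 15000 = 7050 Hz in the down-sampled signal; with it, that component is strongly attenuated while the 440 Hz tone passes through.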

  12. Quantization ● Discretizing the amplitude of real-valued signals ○ Round the amplitude to the nearest discrete step ○ The number of discrete steps is determined by the number of bits (bit depth) ■ $N$ bits cover the integers from $-2^{N-1}$ to $2^{N-1}-1$: 8 bits (-128 to 127), 16 bits (-32768 to 32767) [Figure: quantization steps between $-2^{N-1}$ and $2^{N-1}-1$]
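
A minimal sketch of uniform quantization, assuming the input samples are floats in the range -1.0 to 1.0. The function name and the clipping of the top value are choices made for this example.

```python
import numpy as np

def quantize(x, num_bits):
    """Quantize samples in [-1.0, 1.0] to signed integers with num_bits bits."""
    q_min = -(2 ** (num_bits - 1))
    q_max = 2 ** (num_bits - 1) - 1
    scaled = np.round(x * 2 ** (num_bits - 1))     # round to the nearest step
    return np.clip(scaled, q_min, q_max).astype(np.int64)

x = np.array([-1.0, -0.5, 0.0, 0.25, 0.999])
q8 = quantize(x, 8)     # values within -128..127
q16 = quantize(x, 16)   # values within -32768..32767
```

The clip is needed because the positive side has one step fewer than the negative side: 0.999 scaled to 8 bits rounds to 128, which is outside the representable range and is clamped to 127.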

  13. Quantization ● The bit depth is determined by the dynamic range of signals ○ Each additional bit increases the dynamic range by about 6 dB: $N$ bits → about $6N$ dB ○ Music (CD): 16 bits (consumer) → 96 dB ○ Speech communication: 8 bits → 48 dB [Figure: music and speech examples]
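
The 6 dB per bit rule follows from expressing the number of quantization levels, $2^N$, as an amplitude ratio in decibels. A short worked version:

```latex
\[
20\log_{10}\left(2^{N}\right) = N \cdot 20\log_{10} 2 \approx 6.02\,N \ \text{dB},
\qquad N = 16 \Rightarrow \approx 96.3\ \text{dB},
\qquad N = 8 \Rightarrow \approx 48.2\ \text{dB}
\]
```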

  14. Loading Audio Files ● Check the sampling rate and bit depth ○ You can check them using audio software such as Audacity ● Resample (usually down-sample) if necessary ○ Librosa provides resampling when loading audio files
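
The sampling rate and bit depth can also be read programmatically. The sketch below uses Python's standard `wave` module rather than Librosa so that it runs without extra dependencies; it first writes a small test file (the name `tone.wav` is hypothetical) and then reads its header back.

```python
import os
import tempfile
import wave

import numpy as np

# Write a short 16-bit mono test file so the example is self-contained.
fs = 22050
tone = (0.5 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs) * 32767).astype("<i2")
path = os.path.join(tempfile.mkdtemp(), "tone.wav")   # hypothetical file name
with wave.open(path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)          # 2 bytes per sample = 16 bits
    f.setframerate(fs)
    f.writeframes(tone.tobytes())

# Read the header back to check the sampling rate and bit depth.
with wave.open(path, "rb") as f:
    sampling_rate = f.getframerate()
    bit_depth = 8 * f.getsampwidth()
    num_samples = f.getnframes()
    audio = np.frombuffer(f.readframes(num_samples), dtype="<i2")
```

With Librosa, `librosa.load(path, sr=22050)` loads a file and resamples it to the requested rate in one call; passing `sr=None` keeps the file's native sampling rate.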

  15. Waveform ● The waveform is a natural representation of audio, but it is of limited use for analyzing the content ○ It mainly shows the temporal energy

  16. Spectrogram ● 2D image representation of audio using the short-time Fourier transform ○ x-axis: time, y-axis: frequency, color: magnitude ○ It is common to use a dB scale (a log scale) for the magnitude ○ Easy to match what you hear to what you see

  17. Computing Spectrogram ● For each short segment (frame) ○ Take a window (one frame) ○ Compute the DFT (FFT) ○ Convert the result to polar coordinates ■ Magnitude $X_{mag}$ and phase $X_{phase}$ ○ Compress the magnitude ■ $20\log_{10} X_{mag}$: decibel ○ Shift by a hop size to the next frame ● Spectrogram parameters ○ Window size (FFT size) ○ Hop size ○ Window type [Figure: frames $x(l-1)$ and $x(l)$, one hop size apart and one window size long, each passing through windowing → DFT → magnitude compression; the whole chain is the short-time Fourier transform (STFT)]
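
The steps listed on this slide can be sketched in a few lines of NumPy. The parameter defaults (window size 1024, hop size 256, Hann window), the small `eps` that avoids taking the log of zero, and the function name are assumptions made for this example; the input must be at least one window long.

```python
import numpy as np

def spectrogram_db(x, win_size=1024, hop_size=256, eps=1e-10):
    """Magnitude spectrogram in dB; returns an array of shape (bins, frames)."""
    window = np.hanning(win_size)
    num_frames = 1 + (len(x) - win_size) // hop_size
    frames = np.stack([
        x[i * hop_size : i * hop_size + win_size] * window   # one windowed frame
        for i in range(num_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)        # DFT of each frame via the FFT
    magnitude = np.abs(spectrum)                  # polar form: keep the magnitude
    return 20 * np.log10(magnitude + eps).T       # compress to decibels

fs = 22050
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)                  # 1 kHz test tone
S = spectrogram_db(x)
peak_bin = int(S[:, 0].argmax())
peak_hz = peak_bin * fs / 1024                    # frequency of the strongest bin
```

For the 1 kHz test tone, the strongest bin in every frame lies within one bin spacing (about 21.5 Hz here) of 1000 Hz.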

  18. Discrete Fourier Transform (DFT) ● Find the frequency (sinusoidal) components of $x(n)$ ● Represent $x(n)$ as $x(n) = \sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi kn}{N} + \phi(k)\right)$ ○ $A(k)$: amplitude (or magnitude) of the sinusoid ○ $\phi(k)$: phase of the sinusoid ○ $N$: size of the DFT or the input segment ○ $k$: frequency bin index ($0$ to $N-1$) ○ $\frac{k}{N} f_s$ is the frequency at each frequency bin ● The DFT provides a way of finding $A(k)$ and $\phi(k)$ [Image: Pink Floyd, "The Dark Side of the Moon"]

  19. Discrete Fourier Transform (DFT) ● Use the orthogonality of sinusoids (for bin indices other than $0$ and $N/2$) ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi kn}{N}\right)\cos\left(\frac{2\pi ln}{N}\right) = N/2$ if $k = l$ or $k = N - l$ (equivalent to $-l$), and $0$ otherwise ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi kn}{N}\right)\sin\left(\frac{2\pi ln}{N}\right) = 0$ ○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi kn}{N}\right)\sin\left(\frac{2\pi ln}{N}\right) = N/2$ if $k = l$, $-N/2$ if $k = N - l$ (equivalent to $-l$), and $0$ otherwise ● The inner product (or correlation) between two cosines or two sines ○ If the frequencies are the same (including opposite signs), it is non-zero ○ Otherwise, it is zero (they are orthogonal to each other)
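
The orthogonality relations are easy to confirm numerically. The sketch below assumes $N = 16$ and bin indices 3 and 5, chosen arbitrarily away from the special bins $0$ and $N/2$.

```python
import numpy as np

N = 16
n = np.arange(N)

def cos_k(k):
    """Cosine at frequency bin k, sampled at n = 0..N-1."""
    return np.cos(2 * np.pi * k * n / N)

def sin_k(k):
    """Sine at frequency bin k, sampled at n = 0..N-1."""
    return np.sin(2 * np.pi * k * n / N)

k = 3
same_freq = cos_k(k) @ cos_k(k)           # N/2
mirror_cos = cos_k(k) @ cos_k(N - k)      # N/2, since bin N-k is equivalent to -k
mirror_sin = sin_k(k) @ sin_k(N - k)      # -N/2, since sine is odd
cross = cos_k(k) @ sin_k(k)               # 0
different = cos_k(k) @ cos_k(5)           # 0
```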

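  20. Discrete Fourier Transform (DFT) ● Inner product of the input with sinusoids ○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n)\cos\left(\frac{2\pi ln}{N}\right) = \sum_{n=0}^{N-1}\left(\sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi kn}{N} + \phi(k)\right)\right)\cos\left(\frac{2\pi ln}{N}\right) = A(l)\cos\phi(l)$ ○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n)\sin\left(\frac{2\pi ln}{N}\right) = -\sum_{n=0}^{N-1}\left(\sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi kn}{N} + \phi(k)\right)\right)\sin\left(\frac{2\pi ln}{N}\right) = A(l)\sin\phi(l)$ ○ Both results follow from orthogonality and hold up to a constant scale factor ● The magnitude and phase ○ $X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}$, $X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$ ● The definition of the DFT can be simplified using complex sinusoids ○ $X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi kn}{N}} = X_{re}(k) + jX_{im}(k) = A(k)\, e^{j\phi(k)}$ ○ Euler's identity: $e^{j\frac{2\pi kn}{N}} = \cos\left(\frac{2\pi kn}{N}\right) + j\sin\left(\frac{2\pi kn}{N}\right)$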
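
A small numerical check of the inner-product view, assuming a single test sinusoid at bin 5 with amplitude 0.8 and phase $\pi/3$ (all three values are arbitrary). For such a signal the raw sums carry a factor of $N/2$, which the sketch divides out explicitly.

```python
import numpy as np

N = 64
n = np.arange(N)
bin_l = 5                              # bin of the test sinusoid (assumed)
A, phi = 0.8, np.pi / 3                # amplitude and phase (assumed)
x = A * np.cos(2 * np.pi * bin_l * n / N + phi)

# Real and imaginary parts as inner products with a cosine and a sine.
X_re = np.sum(x * np.cos(2 * np.pi * bin_l * n / N))
X_im = -np.sum(x * np.sin(2 * np.pi * bin_l * n / N))

# Divide out the factor N/2 to recover the amplitude A.
X_mag = np.hypot(X_re, X_im) / (N / 2)
X_phase = np.arctan2(X_im, X_re)

# The same bin from the complex-sinusoid definition, computed by the FFT.
X_fft = np.fft.fft(x)[bin_l]
```

`np.arctan2` is used instead of a plain arctangent of the ratio so that the phase lands in the correct quadrant when the real part is negative.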

  21. Discrete Fourier Transform (DFT) ● Can be viewed as matrix multiplication ○ Row $k$ of the DFT matrix is the conjugate complex sinusoid $s_k^*(n) = e^{-j\frac{2\pi kn}{N}}$ ○ Multiplying the matrix by $x(n)$ gives $X_{re}$ and $X_{im}$, which are converted to polar form ($X_{mag}$, $X_{phase}$) ● In practice, we use an FFT algorithm instead of direct multiplication ○ Divide the matrix into small matrices recursively ○ Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$
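
The matrix view and the FFT give the same result, which the following sketch checks for a random input of length 256. The function name `dft_matrix` is chosen for this example.

```python
import numpy as np

def dft_matrix(N):
    """N x N DFT matrix with entries exp(-j*2*pi*k*n/N)."""
    k = np.arange(N).reshape(-1, 1)    # frequency bin index (rows)
    n = np.arange(N).reshape(1, -1)    # time index (columns)
    return np.exp(-2j * np.pi * k * n / N)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

X_direct = dft_matrix(256) @ x         # O(N^2) matrix-vector product
X_fft = np.fft.fft(x)                  # O(N log N) FFT
max_error = np.max(np.abs(X_direct - X_fft))
```

The two results differ only by floating-point rounding; the difference in cost becomes noticeable as $N$ grows, since the direct product needs $N^2$ complex multiplications.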

  22. Discrete Fourier Transform (DFT) ● When the DFT is applied to musical sounds ○ A musical tone with pitch has a periodic waveform ○ The DFT shows a harmonic spectrum (harmonic overtones at $F_0$, $2F_0$, $3F_0$, ...) ○ Pitch information can also be extracted ○ The magnitude is generally sparser than the waveform [Figure: waveform $x(n)$ and magnitude spectrum $X_{mag}(k)$]

  23. Effect of Window Type ● Types of window functions ○ Trade-off between the width of the main lobe and the level of the side lobes ○ The Hann window is the most widely used in music analysis [Figure: rectangular, triangular, Hann, and Blackman windows; top row: amplitude, bottom row: magnitude (dB); spectra of windowed single sinusoids]

  24. Effect of Window Size ● Trade-off between time and frequency resolutions ○ Short window: low frequency resolution and high time resolution ○ Long window: high frequency resolution and low time resolution [Figure: spectrograms with hop = 128, N = 4096 and hop = 128, N = 256]
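
The trade-off can be put into numbers for the two window sizes in the figure. The sketch assumes a sampling rate of 22050 Hz, which the slide does not state.

```python
fs = 22050   # sampling rate in Hz (assumed for illustration)

def resolutions(win_size, fs):
    """Return (frequency bin spacing in Hz, window duration in ms)."""
    return fs / win_size, 1000.0 * win_size / fs

df_long, dt_long = resolutions(4096, fs)     # about 5.4 Hz, about 185.8 ms
df_short, dt_short = resolutions(256, fs)    # about 86.1 Hz, about 11.6 ms
```

The product of bin spacing and window duration is fixed at 1 (Hz times seconds), so any gain in one resolution is paid for in the other.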

  25. Human Ears ● Is the human ear a spectrum analyzer? ○ The ear has a complicated pathway from the eardrum to the auditory nerve ○ The cochlea in the inner ear is a bandpass filter bank ○ The membrane resonates at a different position depending on the frequency of the input; the resonance frequency increases on a log scale along the membrane [Figure: cochlea and unrolled membrane]

  26. Human Pitch Perception ● Pitch resolution ○ The just noticeable difference (JND) increases as the frequency goes up ● Mel scale ○ Approximates the human pitch resolution based on the pitch ratio of tones ○ Most widely used for speech and music analysis ○ A log frequency scale: $m = 2595\log_{10}(1 + f/700)$
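
The mel formula and its inverse can be written directly. The sketch below also splits the range 0 to 8000 Hz (an assumed range) into five bands of equal width in mel; the function names are chosen for this example.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to mel using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)   # close to 1000 mel
# Equal steps in mel become wider steps in Hz as the frequency goes up.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 6))
```

The widening band edges mirror the slide's point about the JND: at high frequencies a larger change in Hz is needed for the same perceived change in pitch.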
