Audio Representations Graduate School of Culture Technology, KAIST - PowerPoint PPT Presentation

GCT634: Musical Applications of Machine Learning Audio Representations Graduate School of Culture Technology, KAIST Juhan Nam

Outline • Time-domain Representation - Sampling - Quantization • Time-Frequency Representations - Short-time Fourier Transform - Spectrogram - Mel-Spectrogram - Constant-Q Transform - Auditory Filterbank

Music Representations • Audio: … 0 1 1 0 1 1 0 … • Score

Time-Domain Representation of Audio • Musical sounds arrive at a microphone as vibrations of the air. • In a computer, they are converted into a sequence of binary values via sampling and quantization. … 0 1 1 0 1 1 0 …

Sampling • What is an appropriate sampling rate? - Too high: the data rate increases - Too low: the original signal becomes hard to reconstruct • Sampling Theorem - The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct it fully (that is, to avoid aliasing): $f_s > 2 \cdot f_m$, where $f_s$ is the sampling rate and $f_m$ is the maximum frequency - Half the sampling rate ($f_s / 2$) is called the Nyquist frequency
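To make the sampling theorem concrete, here is a minimal sketch of where a sinusoid ends up after sampling. The helper name `apparent_frequency` and the 10 kHz sampling rate are made up for illustration; they are not from the slides.

```python
import numpy as np

def apparent_frequency(f, fs):
    """Frequency (Hz) at which a real sinusoid of frequency f appears after sampling at fs."""
    f_folded = f % fs                      # sampling cannot tell f from f + k * fs
    return min(f_folded, fs - f_folded)    # real signals fold around the Nyquist frequency

fs = 10_000.0
# 3 kHz is below the Nyquist frequency (5 kHz) and is preserved; 7 kHz is above it and aliases.
print(apparent_frequency(3_000.0, fs))
print(apparent_frequency(7_000.0, fs))

# Numerical check: the sampled 7 kHz and 3 kHz cosines are the same sequence.
n = np.arange(64)
same = np.allclose(np.cos(2 * np.pi * 7_000.0 * n / fs),
                   np.cos(2 * np.pi * 3_000.0 * n / fs))
```

Once the two sequences are identical, no processing after sampling can separate them, which is why the low-pass (anti-aliasing) filter has to come before the sampler.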

Sampling Rate • Determined by the bandwidth of signals or hearing limits - Audio CD: 44.1 kHz (44100 samples per second) - Speech communication: 8 kHz (8000 samples per second)

Sampling Rate Conversion (Resampling) • Up-sampling and down-sampling - 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or lower rates to reduce the data size for analysis tasks • Resampling is computed by interpolation using a low-pass filter (e.g. a windowed sinc function)
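As a rough illustration of down-sampling with a windowed sinc filter, the sketch below halves the sampling rate. The function name, the 63-tap length, and the choice of a Hann taper are assumptions for this example; a production resampler would normally use a polyphase design and also handle up-sampling.

```python
import numpy as np

def downsample_by_2(x, num_taps=63):
    """Halve the sampling rate: windowed-sinc low-pass filter, then keep every second sample."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    # Ideal low-pass with cutoff at a quarter of the original rate (the new Nyquist
    # frequency), made finite by tapering with a Hann window.
    h = 0.5 * np.sinc(0.5 * n) * np.hanning(num_taps)
    h /= h.sum()                          # unity gain at DC
    y = np.convolve(x, h, mode="same")    # remove content that would alias
    return y[::2]

fs = 44100
t = np.arange(4410) / fs
low_tone = downsample_by_2(np.sin(2 * np.pi * 1000.0 * t))     # survives at 22.05 kHz
high_tone = downsample_by_2(np.sin(2 * np.pi * 15000.0 * t))   # above the new Nyquist: removed
```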

Quantization • Discretizing the amplitude of real-valued signals - Round the amplitude to the nearest discrete step • The discrete steps are determined by the number of bits - The range of $B$-bit quantization: $-2^{B-1}$ ~ $2^{B-1} - 1$ - Audio CD (16 bits): $-2^{15}$ ~ $2^{15} - 1$ → this is often scaled to -1.0 ~ 1.0
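The rounding and scaling described above can be sketched as follows. The function names are illustrative, and clipping at the top of the range is one possible way to handle a full-scale input of exactly 1.0.

```python
import numpy as np

def quantize(x, bits=16):
    """Round samples in [-1.0, 1.0) to B-bit integer codes in [-2^(B-1), 2^(B-1) - 1]."""
    scale = 2 ** (bits - 1)
    codes = np.round(np.asarray(x, dtype=float) * scale)
    return np.clip(codes, -scale, scale - 1).astype(np.int64)

def dequantize(codes, bits=16):
    """Scale integer codes back to roughly [-1.0, 1.0)."""
    return np.asarray(codes) / 2 ** (bits - 1)

x = np.linspace(-1.0, 0.99, 1000)
error = np.max(np.abs(dequantize(quantize(x)) - x))   # at most half a quantization step
```

With 16 bits the step size is 1/32768, so the rounding error stays below about 1.5e-5 of full scale.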

Time-Domain Representation of Audio • Waveforms are a natural representation of audio but limited in analyzing the content Zoom-in view (e.g. one frame): wave shape is not very intuitive in explaining timbre, particularly when the music is polyphonic Zoom-out view: limited to explaining temporal loudness change

Time-Frequency Representations of Audio • We hear and observe musical sounds in terms of how the content changes over time and frequency

Short-Time Fourier Transform (STFT) • Definition: $X(l, k) = \sum_{n=0}^{N-1} w(n)\, x(n + l \cdot H)\, e^{-j 2\pi k n / N}$, where $w(n)$ is the window, $H$ the hop size, and $N$ the FFT size • Computation Steps - Take a window (one frame) - Compute the FFT - Shift by the hop size - Repeat the above • This returns a 2D matrix
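The computation steps can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions: a Hann window, no padding at the signal edges, and a matrix laid out with frequency bins in rows and frames in columns. The function name and default sizes are illustrative.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Return a complex matrix of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop          # only full frames, no padding
    frames = np.stack([x[l * hop : l * hop + n_fft] * window   # take a window, shift by hop
                       for l in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)              # FFT of every windowed frame

fs = 8000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000.0 * t))
spectrogram = np.abs(X)                             # magnitude of the 2D matrix
peak_bins = np.argmax(spectrogram, axis=0)          # strongest bin in each frame
```

For the 1 kHz test tone the strongest bin is k = 1000 · 1024 / 8000 = 128 in every frame.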

Windowing • Types of window functions - Trade-off between the width of the main lobe and the level of the side lobes - Tapering windows suppress side-lobe levels at the expense of a wider main lobe. - The Hann window is the most widely used in music analysis. [Figure: rectangular, triangular, Hann, and Blackman windows, showing amplitude (top) and magnitude in dB (bottom); spectra of windowed sines]

Time-Frequency Resolution by Window Size • Trade-off between time and frequency resolutions - Short window: low frequency-resolution and high time-resolution - Long window: high frequency-resolution and low time-resolution
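A quick calculation shows the trade-off in numbers. The sketch below uses the FFT bin spacing and the frame duration as simple proxies for frequency and time resolution (the effective resolution also depends on the window shape); the function name and the 22.05 kHz rate are chosen only for illustration.

```python
def stft_resolution(fs, n_fft):
    """Return (frequency-bin spacing in Hz, frame duration in seconds)."""
    return fs / n_fft, n_fft / fs

for n_fft in (512, 4096):
    df, dt = stft_resolution(22050, n_fft)
    print(f"N = {n_fft}: {df:.2f} Hz per bin, {dt * 1000:.1f} ms per frame")
```

The product of the two quantities is always 1, so making one finer necessarily makes the other coarser.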

Discrete Fourier Transform (DFT) • Definition - Inner product of the signal with complex sinusoids: $X(k) = x(n) \cdot s_k(n) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} = X_R(k) + j X_I(k) = A(k)\, e^{j\phi(k)}$, where $s_k(n) = e^{j 2\pi k n / N} = \cos(2\pi k n / N) + j \sin(2\pi k n / N)$ by Euler's identity (the inner product uses the complex conjugate of $s_k$) - Magnitude spectrum: $|X(k)| = A(k) = \sqrt{X_R^2(k) + X_I^2(k)}$ - Phase spectrum: $\angle X(k) = \phi(k) = \tan^{-1}\left(X_I(k) / X_R(k)\right)$
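The definition can be checked directly with a small, deliberately slow implementation. The function name and the 32-sample test signal are illustrative; `np.arctan2` is used as the quadrant-aware form of the inverse tangent of imaginary part over real part.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) DFT: inner product of x with each complex sinusoid."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

x = np.cos(2 * np.pi * 3 * np.arange(32) / 32 + 0.5)   # 3 cycles per frame, phase 0.5 rad
X = dft(x)
magnitude = np.abs(X)                  # sqrt(real^2 + imag^2)
phase = np.arctan2(X.imag, X.real)     # angle of each complex coefficient
```

For this signal, bin k = 3 has magnitude N/2 = 16 and phase 0.5 rad, matching the amplitude and phase of the cosine.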

Fast Fourier Transform (FFT) • Matrix multiplication view of the DFT • The “Fast Fourier Transform (FFT)” is an efficient algorithm that computes this matrix multiplication quickly - Based on “divide and conquer” - Complexity reduction: $O(N^2)$ → $O(N \log_2 N)$
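The matrix view can be written out explicitly and compared against an FFT routine. This is a small sketch; `dft_matrix` is a name chosen for the example, and `np.fft.fft` stands in for any FFT implementation.

```python
import numpy as np

def dft_matrix(N):
    """N x N matrix W with W[k, n] = exp(-j 2 pi k n / N)."""
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.exp(-2j * np.pi * k * n / N)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
X_matrix = dft_matrix(256) @ x     # about N^2 multiply-adds
X_fft = np.fft.fft(x)              # same result with about N log2 N operations
```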

Frequency Scale in Spectrogram • Linear frequency scale - Great to see the harmonic structure of a single tone. - However, spectral distributions are skewed toward low frequencies [Figure: spectrograms of “Chopin Prelude E minor” and “How insensitive”]

Human Pitch Perception • The basilar membrane in the cochlea is a rough spectral analyzer - It resonates at a different position depending on the frequency of the incoming vibration - The resonance frequency increases on a log scale along the basilar membrane [Figure: inner ear and unrolled cochlea, showing the oval window, round window, and basilar membrane (P. Cook)]

Human Pitch Perception • Pitch Resolution - The just noticeable difference (JND) increases as the frequency goes up • Critical bandwidth - The frequency bandwidth within which one tone interferes with the perception of another tone by auditory masking - Roughly constant at low frequencies, but grows roughly linearly with frequency at high frequencies

Psychoacoustical Pitch Scales • Mel scale - Based on the pitch ratio of tones: $m = 2595 \log_{10}(1 + f / 700)$ • Bark scale - Critical band measurement by masking: $\mathrm{Bark} = 13 \arctan(0.00075 f) + 3.5 \arctan((f / 7500)^2)$ • Equivalent rectangular bandwidth (ERB) rate - Critical band measurement using the notched-noise method: $\mathrm{ERBS} = 21.4 \cdot \log_{10}(1 + 0.00437 f)$ [Figure: comparison of pitch scales (ERB, Mel, Bark), normalized scale versus frequency in Hz from 0 to 2.5 × 10^4; using Matlab code from https://www.speech.kth.se/~giampi/auditoryscales/]
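The mel and ERB-rate formulas translate directly into code. The sketch below covers only those two scales; the function names are illustrative, and the inverse mel mapping is obtained by solving the mel formula for f.

```python
import numpy as np

def hz_to_mel(f):
    """Mel value for a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def hz_to_erb_rate(f):
    """ERB-rate value for a frequency in Hz."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f, dtype=float))

# Points equally spaced in mel become increasingly far apart in Hz.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10)
hz_points = mel_to_hz(mel_points)
```

With these constants 1000 Hz maps to approximately 1000 mel, which is the anchor point of this form of the scale.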

Tuning System in Musical Instruments • Equal temperament - $1 : 2^{1/12}$ ratio between two adjacent notes - Music note ($m$) and frequency ($f$) in Hz: $m = 12 \log_2(f / 440) + 69$, $f = 440 \cdot 2^{(m - 69)/12}$ https://newt.phys.unsw.edu.au/jw/notes.html
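The two conversion formulas, written as a pair of small functions. The names `hz_to_midi` and `midi_to_hz` are chosen for this example, with note number 69 corresponding to A4 = 440 Hz as in the formulas.

```python
import numpy as np

def hz_to_midi(f):
    """Note number m for a frequency f in Hz (fractional for out-of-tune pitches)."""
    return 12.0 * np.log2(np.asarray(f, dtype=float) / 440.0) + 69.0

def midi_to_hz(m):
    """Frequency in Hz for note number m."""
    return 440.0 * 2.0 ** ((np.asarray(m, dtype=float) - 69.0) / 12.0)

a4 = midi_to_hz(69)                              # reference pitch
a5 = midi_to_hz(81)                              # twelve semitones up doubles the frequency
semitone_ratio = midi_to_hz(70) / midi_to_hz(69)  # 2^(1/12)
```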

Log-Spectrogram Using Frequency Mapping • Mapping linear scale to a perceptual (log-like) scale - Locate center frequencies according to the frequency mapping - Linear interpolation on the center frequency with the corresponding bandwidth skirt [Figure: mapping from a linear-frequency spectrogram to a log-frequency spectrogram, with each band's center frequency and bandwidth marked]

Log-Spectrogram Using Frequency Mapping • The mapping can be formed as a matrix multiplication - Each row of the mapping matrix (one per output band) contains the interpolation coefficients: $Y = M \cdot X$ ($M$: mapping matrix, $X$: spectrogram, $Y$: scaled spectrogram) • Limitation - Simple, but the time-frequency resolution is still constrained by the underlying STFT
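One way to build such a mapping matrix is with triangular interpolation weights, sketched below. Several things here are assumptions for illustration: the function name, the semitone-spaced band centers starting at 110 Hz, a spectrogram laid out as (frequency bins × frames), and the random matrix that stands in for a real magnitude spectrogram.

```python
import numpy as np

def triangular_mapping_matrix(points_hz, fs, n_fft):
    """points_hz: n_bands + 2 increasing frequencies. Band i peaks at points_hz[i + 1]
    and its skirts reach zero at the two neighbouring points."""
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft     # center of each linear STFT bin
    M = np.zeros((len(points_hz) - 2, len(bin_hz)))
    for i in range(M.shape[0]):
        lo, center, hi = points_hz[i : i + 3]
        rising = (bin_hz - lo) / (center - lo)
        falling = (hi - bin_hz) / (hi - center)
        M[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return M

fs, n_fft = 22050, 2048
points_hz = 110.0 * 2.0 ** (np.arange(62) / 12.0)       # semitone spacing (assumed)
M = triangular_mapping_matrix(points_hz, fs, n_fft)
X = np.abs(np.random.default_rng(0).standard_normal((n_fft // 2 + 1, 10)))  # stand-in spectrogram
Y = M @ X                                               # (60 bands, 10 frames)
```

The limitation is visible in the matrix itself: the lowest bands are only about one STFT bin wide, so the mapping cannot add frequency resolution that the STFT did not have.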

Mel-Frequency Spectrogram • The mel-scaled spectrogram is widely used for music classification - Usually mapped to a smaller number of mel-scaled bins than the number of FFT bins [Figure: linear-frequency spectrogram and mel-frequency spectrogram]

Constant-Q Transform • Use a set of sinusoidal kernels with: - Logarithmically spaced frequencies - Constant Q = frequency/bandwidth
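A naive sketch of these kernels follows. It assumes the common convention that the bandwidth of bin k is the spacing to the next bin, which gives Q = 1 / (2^(1/B) − 1) for B bins per octave, and it uses Hann-windowed complex sinusoids whose length shrinks as the frequency rises. The function names and parameter values are illustrative, and the direct evaluation is slow compared with practical implementations.

```python
import numpy as np

def constant_q_parameters(f_min, n_bins, bins_per_octave, fs):
    k = np.arange(n_bins)
    freqs = f_min * 2.0 ** (k / bins_per_octave)       # logarithmically spaced centers
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # constant Q = frequency / bandwidth
    lengths = np.ceil(q * fs / freqs).astype(int)      # kernel length: Q periods of each center
    return freqs, q, lengths

def cqt_frame(x, freqs, lengths, fs):
    """Magnitudes of one constant-Q frame taken from the start of x."""
    out = np.empty(len(freqs))
    for i, (f_k, n_k) in enumerate(zip(freqs, lengths)):
        n = np.arange(n_k)
        kernel = np.hanning(n_k) * np.exp(-2j * np.pi * f_k * n / fs)
        out[i] = np.abs(np.sum(x[:n_k] * kernel)) / n_k
    return out

fs = 22050
freqs, q, lengths = constant_q_parameters(110.0, 36, 12, fs)
x = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)
frame = cqt_frame(x, freqs, lengths, fs)               # peaks at the 440 Hz bin
```

Low-frequency kernels are long (good frequency resolution) and high-frequency kernels are short (good time resolution), unlike the fixed window of the STFT.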

Constant-Q Filter Bank [Figure: log-frequency spectrogram (mapping) compared with log-frequency spectrogram (constant-Q transform)]

Example: Constant-Q Filter Bank • Müller’s 88-note filter bank - The center frequency is set to each of the 88 piano notes - The bandwidth is set to have constant Q, with ±25 cents around the center (Müller, 2011)
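The band layout can be computed from the note numbers alone. This sketch only derives center frequencies and band edges, assuming the 88 keys are MIDI notes 21 to 108 with A4 = 440 Hz; it does not reproduce the filter design from (Müller, 2011).

```python
import numpy as np

midi_notes = np.arange(21, 109)                         # A0 .. C8, the 88 piano keys
centers = 440.0 * 2.0 ** ((midi_notes - 69) / 12.0)     # equal-tempered center frequencies
lower = centers * 2.0 ** (-25.0 / 1200.0)               # 25 cents below each center
upper = centers * 2.0 ** (25.0 / 1200.0)                # 25 cents above each center
q = centers / (upper - lower)                           # identical for every note
```

Because the band edges are defined in cents, the bandwidth scales with the center frequency and Q comes out the same (about 34.6) for all 88 bands.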

Comparison of Time-Frequency Representations [Figure: four time-frequency plots (time on the horizontal axis, frequency on the vertical axis): spectrogram (long window), spectrogram (short window), constant-Q transform, and mel spectrogram]

Auditory Filter Bank • A filter bank that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea - Produces a 3-D representation (time-channel-lag), or “auditory images” [Figure: processing chain from the input through the cochlear filter bank (high to low frequency), hair cells (HC), and auto-correlation functions (ACF), then stabilize and combine, to the correlogram and summary ACF]

Types of Auditory Filter Banks • Gamma-tone filter banks (R. Patterson) - Gamma-tone: $g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t)$ • Pole-Zero Filter Cascade (D. Lyon)
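The gamma-tone impulse response can be sampled directly from its formula. In this sketch the order n = 4 and the bandwidth rule b = 1.019 · ERB(f), with ERB(f) = 24.7 (4.37 f / 1000 + 1), are commonly used values and are assumptions here, not taken from the slides; the function name and defaults are illustrative.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, order=4, b=None, a=1.0, phi=0.0):
    """Sampled gamma-tone impulse response centered at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs      # only t >= 0 is generated, which plays the role of u(t)
    if b is None:
        b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)   # assumed bandwidth rule
    return a * t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

fs = 16000
ir = gammatone_ir(1000.0, fs)
spectrum = np.abs(np.fft.rfft(ir))
peak_hz = np.argmax(spectrum) * fs / len(ir)    # the band-pass response peaks near fc
```

Filtering a signal with one channel is then a convolution with `ir`; a filter bank repeats this for a set of center frequencies.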

Hair-Cell • (Inner) Hair-cell - Transforms mechanical movement into neural spikes • Modeled as a cascade of - Half-wave rectification - Compression - Low-pass filtering • This is a non-linear process - Generates new harmonic partials - Associated with missing fundamentals
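The three-stage cascade can be sketched as below. The power-law compression with exponent 0.3, the one-pole low-pass filter, and the 1 kHz cutoff are illustrative choices, not values from the slides. The test signal contains only the 2nd, 3rd, and 4th harmonics of 200 Hz, so any energy at 200 Hz in the output is created by the non-linearity.

```python
import numpy as np

def hair_cell(x, fs, cutoff_hz=1000.0, exponent=0.3):
    rectified = np.maximum(x, 0.0)              # half-wave rectification
    compressed = rectified ** exponent          # compression (assumed power law)
    # One-pole low-pass filter: y[n] = (1 - a) * v[n] + a * y[n - 1]
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty_like(compressed)
    state = 0.0
    for i, v in enumerate(compressed):
        state = (1.0 - a) * v + a * state
        y[i] = state
    return y

fs = 16000
t = np.arange(fs) / fs                          # 1 second, so FFT bins are 1 Hz apart
x = sum(np.sin(2 * np.pi * f * t) for f in (400.0, 600.0, 800.0))
y = hair_cell(x, fs)
input_at_200 = np.abs(np.fft.rfft(x))[200]      # essentially zero: no 200 Hz partial in the input
output_at_200 = np.abs(np.fft.rfft(y))[200]     # the missing fundamental appears
```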

Example of Correlogram: Piano

Example of Correlogram: Rock Music

Software • Tools - C++: http://soundlab.cs.princeton.edu/software/sndpeek/ - WebAudio: https://musiclab.chromeexperiments.com/Spectrogram - Audacity, Sonic Visualiser, Adobe Audition, Praat, … • Libraries - Librosa (Python): http://librosa.github.io/librosa/ - Auditory Toolbox (Matlab): https://engineering.purdue.edu/~malcolm/interval/1998-010/
