GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Machine Learning for Music: Intro
Juhan Nam
Definition of Machine Learning
● Tom M. Mitchell provided a widely accepted definition:
○ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”
Definition of Machine Learning
● Tasks T
○ Classification, regression, transcription, machine translation, structured output, anomaly detection, synthesis and sampling, imputation of missing values, denoising, and density estimation (listed from the DL book)
● Experience E
○ Data and their correspondence: supervised learning / unsupervised learning / reinforcement learning
● Performance P
○ Loss function, accuracy metrics
In Musical Context
● Tasks T
○ Analysis tasks: music genre/mood classification, music auto-tagging, automatic music transcription, source separation
○ Synthesis tasks: sound synthesis, music generation (automatic music composition or arrangement), expressive performance rendering
● Experience E
○ Music data (audio, MIDI, text, images) and their correspondence
● Performance P
○ Objective measures: loss function, accuracy metrics (e.g., F-score)
○ Subjective measures: user test (i.e., human test)
Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in music analysis tasks
○ Train the model with audio data and its class labels, and then predict labels from new test audio
[Figure: audio → Classification Model → “C2” / “C#2” / “D2” …: Pitch Estimation (frame-level)]
Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in many music analysis tasks
○ Train the model with audio data and its class labels, and then predict labels from new test audio
[Figure: audio → Classification Model → “Piano” / “Drum” / “Guitar” …: Instrument Recognition (note-level)]
Classification Tasks in Music
● Classification is the most commonly used supervised learning approach in music analysis tasks
○ Train the model with audio data and its class labels, and then predict labels from new test audio
[Figure: audio → Classification Model → “Jazz” / “Metal” / “Classical” …: Genre Classification (segment-level)]
Classification Model for Music
● The classification models are formed with the following steps in common
○ Audio data representation: waveforms, spectrogram, mel-spectrogram
○ Feature extraction: highly depends on the task and the abstraction level
■ Higher-level tasks require longer input sizes and more complex features
○ Classifier: measures the distance between the feature vector and class templates for the final classification
[Figure: Audio Data Representation → Feature Extraction → Classifier → “Class #1” / “Class #2” / “Class #3” … (Classification Model)]
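The three-stage pipeline above can be sketched in a few lines of numpy. This is a hypothetical toy, not a real music classifier: the "feature extraction" here is just a mean/std summary of frames, and the classifier is the nearest class template by Euclidean distance, as described in the slide.

```python
import numpy as np

def extract_features(frames):
    """Summarize a (num_frames, dim) representation into one feature vector.

    Toy stand-in for real feature extraction (e.g., MFCC statistics).
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def nearest_template(feature, templates):
    """Return the class whose template vector is closest in Euclidean distance."""
    dists = {c: np.linalg.norm(feature - t) for c, t in templates.items()}
    return min(dists, key=dists.get)

# Illustrative class templates in the 8-dim feature space (4 means + 4 stds)
templates = {
    "Class #1": np.zeros(8),
    "Class #2": np.concatenate([np.full(4, 3.0), np.zeros(4)]),
}

# Fake "audio representation": 10 frames of 4-dim data centered near 3.0
rng = np.random.default_rng(0)
frames = rng.normal(loc=3.0, scale=0.1, size=(10, 4))

feature = extract_features(frames)
print(nearest_template(feature, templates))  # -> Class #2
```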
Classification Model for Music
● It is important to extract good audio features!
[Figure: two feature spaces with “Classical”, “Jazz”, and “Metal” examples: good features separate the classes, bad features leave them mixed]
Classification Model for Music
● Traditional machine learning
● Deep learning
Traditional Machine Learning
● Use hand-designed features for the task
○ Based on domain knowledge (e.g., acoustics, signal processing)
○ Mel-frequency cepstral coefficients (MFCC), chroma, spectral statistics
● Use standard classifiers
○ Logistic regression, support vector machine, multi-layer perceptron
[Figure: Audio Data Representation → Hand-designed Features → Classifier (learning algorithm) → “Class #1” / “Class #2” / “Class #3” … (Classification Model)]
Traditional Machine Learning
● Advantages
○ A small dataset is fine
○ The classifiers are fast to train
○ The hand-designed features are interpretable
● Disadvantages
○ Requires domain knowledge
○ The feature design is an art
○ The two-stage approach is sub-optimal
● Good as a baseline algorithm
Deep Learning
● Learn feature representations using neural network modules
○ Better to call it representation learning
○ Fully-connected, convolutional, recurrent, pooling, and non-linear layers
○ Stack more layers as the output has a higher abstraction level
○ The audio data representation can also be learned (end-to-end learning)
○ Gradient-based learning: all neural network modules are differentiable. We can also add a new custom layer as long as it is differentiable
[Figure: Audio Data Representation → Neural Network Modules (learned features via feature embedding) → Linear Classifier → “Class #1” / “Class #2” / “Class #3” … (Classification Model)]
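The idea of stacking modules can be made concrete with a minimal numpy forward pass: two linear layers with a ReLU between them, ending in a softmax classifier. This is a sketch only; the weights below are random placeholders and untrained, and the layer sizes (128-64-3) are arbitrary choices for illustration.

```python
import numpy as np

def linear(x, W, b):
    """A linear (fully-connected) module."""
    return x @ W + b

def relu(x):
    """A pointwise non-linear module."""
    return np.maximum(0.0, x)

def softmax(z):
    """Turn class scores into probabilities (numerically stabilized)."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(42)
x = rng.normal(size=128)                         # e.g., one spectrogram frame

# Random placeholder weights; in practice these are learned by gradient descent
W1, b1 = rng.normal(size=(128, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 3)) * 0.1, np.zeros(3)

h = relu(linear(x, W1, b1))          # learned feature embedding
probs = softmax(linear(h, W2, b2))   # linear classifier over 3 classes
print(probs.shape)                   # (3,)
```

Every module here is differentiable, which is what makes end-to-end gradient-based training of the whole stack possible.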
Deep Learning
● Advantages
○ Less domain knowledge is required; we can borrow many successful models from other domains (e.g., image or speech)
○ The trained model is reusable (transfer learning)
○ Superior performance in numerous machine learning tasks
Deep Learning
● Disadvantages (or challenges)
○ Requires a large-scale labeled dataset, and the models are slow to train
■ Semi-supervised, unsupervised, and self-supervised learning are actively developed
○ Requires regularization to avoid overfitting
■ Many regularization techniques have been studied
○ Designing neural nets and searching hyperparameters is an art
■ Model and hyperparameter optimization is another research topic: e.g., AutoML
○ Understanding learned features is hard
■ Feature visualization techniques
■ Disentangled learning models where one parameter controls one sub-dimension of learned features
Example: Mel-Frequency Cepstral Coefficients (MFCC)
● The most popularly used audio feature to capture “timbre”
○ Extracts the spectral envelope from an audio frame, removing pitch information
○ The standard audio feature in legacy speech recognition systems
● Computation steps
○ Mel-spectrum: apply a mel filterbank to the magnitude spectrum
○ Discrete cosine transform (DCT): a small set of cosine kernels with low frequencies. It captures the slowly varying trend of the mel-spectrum over frequency, which corresponds to the spectral envelope
[Figure: DFT → abs (magnitude spectrum) → Mel Filterbank (mel-spectrum) → log compression → DCT → MFCC]
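The computation steps above can be sketched with numpy alone: |DFT| → mel filterbank → log → DCT. The filterbank construction below follows the common triangular-filter recipe with the 2595·log10(1 + f/700) mel formula; the parameter choices (16 kHz, 512-point frame, 40 mel bands, 13 coefficients) are illustrative, not canonical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters mapping n_fft//2+1 DFT bins to n_filters bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def dct_matrix(n_out, n_in):
    """DCT-II basis: low-frequency cosine kernels over the mel axis."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

sr, n_fft = 16000, 512
frame = np.sin(2 * np.pi * 440 * np.arange(n_fft) / sr)  # a 440 Hz test tone

mag = np.abs(np.fft.rfft(frame))                 # abs(DFT): magnitude spectrum
mel = mel_filterbank(40, n_fft, sr) @ mag        # mel-spectrum (40 bands)
log_mel = np.log(mel + 1e-8)                     # log compression
mfcc = dct_matrix(13, 40) @ log_mel              # 13-dim MFCC
print(mfcc.shape)  # (13,)
```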
Example: Mel-Frequency Cepstral Coefficients (MFCC)
[Figure: forward path: magnitude spectrum (512 bins) → mel filterbank → mel-scaled spectrum (60 bins) → DCT → MFCC (13 dim); inverse path: inverse DCT → reconstructed mel-spectrum → inverse mel filterbank → reconstructed magnitude spectrum]
Example: Mel-Frequency Cepstral Coefficients (MFCC)
[Figure: spectrogram, mel-frequency spectrogram, MFCC, and the spectrogram reconstructed from MFCC]
Representation Learning Point of View: MFCC
● We can replace the hand-designed modules with trainable modules
○ The DFT, mel filterbank, and DCT are linear transforms
○ Abs and log compression are non-linear functions
○ In MFCC the linear transforms are designed by hand, but they can be optimized further as trainable modules
[Figure: MFCC pipeline (DFT → abs → Mel Filterbank → log compression → DCT) viewed as a deep neural network: Linear Transform → Non-linear function → Linear Transform → Non-linear function → Linear Transform]
Example: Chroma
● Musical notes are denoted with a pitch class and an octave number
○ Pitch class: C, C#, D, D#, E, F, F#, G, G#, A, A#, B
○ Octave number: 0, 1, 2, 3, 4, 5, …
○ Example: C4 (middle C), E3, G5
● The octave difference is the most consonant pitch interval
○ Therefore, notes an octave apart belong to the same pitch class
● This can be represented with a “pitch helix”
○ Chroma: the inherent circularity of pitch organization
○ Height: increases naturally, rising one octave per rotation
[Figure: Pitch Helix and Chroma (Shepard, 2001)]
Example: Chroma
● Compute the energy distribution of an audio frame over the 12 pitch classes
○ Convert a frequency f to a MIDI note number (p = 12·log2(f/440) + 69) and take the pitch class from the note (e.g., 69 → A4 → A)
○ Extracts harmonic characteristics while removing timbre information
○ Useful in music synchronization, chord recognition, music structure analysis, and music genre classification
● Computation steps
○ Project the DFT or constant-Q transform onto the 12 pitch classes
[Figure: DFT or Constant-Q Transform → abs (magnitude) → Chroma Mapping → Chroma]
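The frequency-to-pitch-class conversion above is short enough to write out directly: p = 12·log2(f/440) + 69 gives the MIDI note number, and p mod 12 gives the pitch class. This is only the per-frequency mapping; a full chroma feature would additionally weight each pitch class by the spectral energy projected onto it.

```python
import numpy as np

# Pitch class index 0..11 maps to C..B; MIDI note 60 is C4 (middle C)
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def freq_to_midi(f):
    """Convert a frequency in Hz to the nearest MIDI note number."""
    return int(round(12 * np.log2(f / 440.0) + 69))

def freq_to_pitch_class(f):
    """Convert a frequency in Hz to its pitch class name."""
    return PITCH_CLASSES[freq_to_midi(f) % 12]

print(freq_to_midi(440.0))          # 69 (A4)
print(freq_to_pitch_class(440.0))   # A
print(freq_to_pitch_class(880.0))   # A  (one octave up, same pitch class)
print(freq_to_pitch_class(261.63))  # C  (middle C, C4)
```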
Example: Chroma
[Figure: spectrogram → chroma mapping → chromagram (reconstructed chroma: Shepard tone)]
Representation Learning Point of View: Chroma
● We can replace the hand-designed modules with trainable modules
○ The DFT, constant-Q transform, and chroma mapping are linear transforms
○ Abs corresponds to a non-linear function
○ In chroma the linear transforms are designed by hand, but they can be optimized further as trainable modules
[Figure: chroma pipeline (DFT or Constant-Q Transform → abs → Chroma Mapping) viewed as a deep neural network: Linear Transform → Non-linear function → Linear Transform]
Summary
● Introduced machine learning from the perspective of representation learning (or feature learning)
● In the traditional machine learning approach, we design feature representations by hand. Once the features are extracted, we use standard machine learning algorithms.
● In the deep learning approach, we design the network architecture by hand. The feature representations are learned through the neural network modules and the optimization.