

  1. GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
     Traditional Machine Learning: Unsupervised Learning
     Juhan Nam

  2. Traditional Machine Learning Pipeline in Classification Tasks
     ● A set of hand-designed audio features is selected for a given task and the features are concatenated
       ○ The majority of them are extracted at the frame level: MFCC, chroma, spectral statistics
       ○ The concatenated features are complementary to each other
     [Figure: frame-level features (chroma, spectral statistics, MFCC, ...) are concatenated and fed to a classifier that outputs "Class #1", "Class #2", "Class #3", ...]
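As a rough illustration of this pipeline, the sketch below extracts and concatenates the frame-level features named above, assuming librosa and NumPy are available; the audio file path is only a placeholder.

```python
# A minimal sketch of the hand-designed feature pipeline (assumes librosa, NumPy).
import librosa
import numpy as np

y, sr = librosa.load("example.wav")  # hypothetical audio file path

# Frame-level features (each is a matrix of shape [n_features, n_frames])
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Concatenate along the feature axis to form one frame-level feature matrix
features = np.concatenate([mfcc, chroma, centroid], axis=0)
print(features.shape)  # (13 + 12 + 1, n_frames)
```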

  3. Issues: Redundancy and Dimensionality
     ● The information in the concatenated feature vectors can be repetitive
     ● Adding more features increases the dimensionality of the feature vectors, and the classification becomes more demanding
     [Figure: the same pipeline as before, with a question mark over the classifier as the concatenated feature vector grows]

  4. Issues: Temporal Summarization
     ● Taking the entire sequence of frames as a single vector is too much for classifiers
       ○ 10 to 100 frames per second is typical in frame-level processing
     ● Temporal order is important, so taking multiple features that capture local temporal order is acceptable
       ○ MFCC: concatenated with its frame-wise differences (delta and double-delta)
     ● However, extracting long-term temporal dependency is hard!
       ○ Averaging is OK but too simple
       ○ DCT over time for each feature dimension is an option
     [Figure: the same pipeline, with question marks indicating how to summarize the frame-level features over time]
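A minimal sketch of these summarization options with librosa and NumPy; the MFCC matrix is a placeholder, and the choice of mean and standard deviation statistics is only illustrative.

```python
import numpy as np
import librosa

mfcc = np.random.randn(13, 500)                 # placeholder for a real [13, n_frames] MFCC matrix

delta = librosa.feature.delta(mfcc)             # frame-wise differences (delta)
delta2 = librosa.feature.delta(mfcc, order=2)   # double-delta
frames = np.concatenate([mfcc, delta, delta2], axis=0)

# Summarize the whole sequence into one fixed-length, clip-level vector
summary = np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
print(summary.shape)                            # (2 * 3 * 13,) = (78,)
```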

  5. Unsupervised Learning
     ● Principal Component Analysis (PCA)
       ○ Learns a linear transform so that the transformed features are de-correlated
       ○ Dimensionality reduction: e.g., down to 2D for visualization
     ● K-means
       ○ Learns K cluster centers and determines the membership of each data point
       ○ Maps each data point to one of a fixed set of learned vectors (cluster centers): vector quantization and one-hot sparse feature representation
     ● Gaussian Mixture Models (GMM)
       ○ Learns K Gaussian distribution parameters and the soft membership
       ○ Density estimation (likelihood estimation): can be used for classification when estimated for each class
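As a quick orientation, a hedged sketch of the three methods with scikit-learn; `X` stands for a matrix of summarized feature vectors, and the numbers of components and clusters are arbitrary placeholders.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.random.randn(200, 30)   # placeholder for real summarized audio features

Z = PCA(n_components=2).fit_transform(X)                              # de-correlated 2D projection
labels = KMeans(n_clusters=8, n_init=10).fit_predict(X)               # hard cluster memberships
posteriors = GaussianMixture(n_components=8).fit(X).predict_proba(X)  # soft memberships
```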

  6. Principal Component Analysis
     ● Correlation and Redundancy
       ○ We can measure the redundancy between two elements of a feature vector by computing their correlation:
         Pearson correlation coefficient: $\rho_{ij} = \dfrac{\sum_n (x_i^{(n)} - \bar{x}_i)(x_j^{(n)} - \bar{x}_j)}{\sqrt{\sum_n (x_i^{(n)} - \bar{x}_i)^2}\,\sqrt{\sum_n (x_j^{(n)} - \bar{x}_j)^2}}$
       ○ If some of the elements have high correlations, we can remove the redundant elements
     [Figure: a feature vector $(x_1, x_2, \ldots, x_d)^T$ whose elements are compared pairwise]
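A small sketch of this redundancy check with NumPy; the feature matrix is a placeholder for the concatenated frame-level features, and the 0.95 threshold is an arbitrary choice.

```python
import numpy as np

features = np.random.randn(26, 500)     # placeholder for a real [n_features, n_frames] matrix

corr = np.corrcoef(features)            # Pearson correlation between feature dimensions (rows)
# Pairs of dimensions whose |correlation| is close to 1 carry largely redundant information
redundant_pairs = np.argwhere(np.triu(np.abs(corr) > 0.95, k=1))
print(redundant_pairs)
```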

  7. Principal Component Analysis
     ● Transform the input space ($X$) into a latent space ($Z$) such that the latent space is de-correlated (i.e., each dimension is orthogonal to every other)
       ○ The linear transform is designed to maximize the variance of the first principal component and minimize the variance of the last principal component
     [Figure: $Z = WX$, where the rows of $W$ are orthogonal vectors (the principal components)]

  8. Principal Component Analysis
     ● Transform the input space ($X$) into a latent space ($Z$) such that the latent space is de-correlated (i.e., each dimension is orthogonal to every other)
       ○ The linear transform is designed to maximize the variance of the first principal component and minimize the variance of the last principal component
     $ZZ^T = D = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix}$, where $Z = WX$
     The diagonal elements correspond to the variances of the transformed data points on each dimension

  9. Principal Component Analysis: Eigenvalue Decomposition
     ● To derive $W$:
       $ZZ^T = D \;\Rightarrow\; (WX)(WX)^T = D \;\Rightarrow\; WXX^TW^T = D \;\Rightarrow\; W\,\mathrm{Cov}(X)\,W^T = D$
     ● Eigenvalue decomposition of a matrix $A$ ($V$: eigenvectors, $\Lambda$: eigenvalue matrix):
       $A\mathbf{v}_i = \lambda_i\mathbf{v}_i \;\Rightarrow\; AV = V\Lambda \;\Rightarrow\; V^{-1}AV = \Lambda$, and $V^TAV = \Lambda$ (if $A$ is symmetric),
       where $V = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_d]$ and $\Lambda = \mathrm{diag}(\lambda_i)$
     ● $W$ is obtained from the eigenvectors of $\mathrm{Cov}(X)$: $W = V^T$
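A NumPy sketch of this derivation, assuming the data matrix has data points in its columns as in the slides; the random data is only a placeholder.

```python
import numpy as np

X = np.random.randn(20, 1000)                 # placeholder [d, n_samples] data matrix
Xc = X - X.mean(axis=1, keepdims=True)        # zero-mean (see slide 12)

cov = Xc @ Xc.T / Xc.shape[1]                 # Cov(X)
eigvals, V = np.linalg.eigh(cov)              # eigh: for symmetric matrices, ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
eigvals, V = eigvals[order], V[:, order]

W = V.T                                       # PCA transform
Z = W @ Xc                                    # de-correlated latent representation
# Sanity check: Cov(Z) is (approximately) diagonal, with the eigenvalues on the diagonal
print(np.round(Z @ Z.T / Z.shape[1], 2))
```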

  10. Principal Component Analysis: Eigenvalue Decomposition
     ● In addition, we can normalize the latent space:
       $\Lambda^{-1/2}\,V^T\,\mathrm{Cov}(X)\,V\,\Lambda^{-1/2} = I \;\Rightarrow\; \tilde{W}\,\mathrm{Cov}(X)\,\tilde{W}^T = I$
       where $\tilde{W} = \Lambda^{-1/2}V^T = \Lambda^{-1/2}W$, $\tilde{Z} = \Lambda^{-1/2}Z$, and $\Lambda^{-1/2} = \mathrm{diag}\!\left(\tfrac{1}{\sqrt{\lambda_1}}, \tfrac{1}{\sqrt{\lambda_2}}, \ldots, \tfrac{1}{\sqrt{\lambda_d}}\right)$
     ● The rows of $\tilde{W}$ remain orthogonal to each other
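A short self-contained sketch of this normalization step (PCA whitening) in NumPy; the data is again a placeholder.

```python
import numpy as np

X = np.random.randn(20, 1000)                      # placeholder [d, n_samples] data matrix
Xc = X - X.mean(axis=1, keepdims=True)
eigvals, V = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])

W_tilde = np.diag(1.0 / np.sqrt(eigvals)) @ V.T    # Λ^{-1/2} V^T
Z_tilde = W_tilde @ Xc                             # whitened latent representation
# Sanity check: the covariance of the whitened data is (approximately) the identity
print(np.round(Z_tilde @ Z_tilde.T / Z_tilde.shape[1], 2))
```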

  11. Principal Component Analysis In Practice
     ● In practice, $X$ is a huge matrix where each column is a data point
       ○ Computing the covariance matrix is a bottleneck, so we often sample the input data
     $\mathrm{Cov}(X) = \frac{1}{N} XX^T$ (for zero-mean data)

  12. Principal Component Analysis In Practice
     ● Shift the distribution to have zero mean: $X' = X - \mathrm{mean}(X)$
     ● The normalization (scaling) step is optional: with it, the method is called PCA whitening
     [Figure: the data cloud is shifted (centering, $X'$), rotated ($WX'$), and optionally scaled ($\Lambda^{-1/2}WX'$)]

  13. Dimensionality Reduction Using PCA
     ● We can remove principal components with small variances
       ○ Sort the variances in the latent space (the eigenvalues) in descending order and remove the tail
       ○ A common strategy is to accumulate the variances from the first principal component; when the sum reaches 90% or 95% of the total variance, remove the remaining dimensions. This significantly reduces the dimensionality.
     [Figure: sorted variances with the cumulative 95% cut-off]
     ● Note that you can reconstruct the original data with some loss
       ○ You can use PCA as a data compression method
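A sketch of this variance-based truncation with scikit-learn; note that sklearn expects samples in rows, and passing a float to `n_components` keeps just enough components to reach that fraction of the total variance. The data is a placeholder.

```python
from sklearn.decomposition import PCA
import numpy as np

X = np.random.randn(1000, 40)                  # placeholder [n_samples, n_features] data
pca = PCA(n_components=0.95).fit(X)            # keep components explaining 95% of the variance
Z = pca.transform(X)                           # reduced representation
X_reconstructed = pca.inverse_transform(Z)     # lossy reconstruction of the original data
print(X.shape, "->", Z.shape)
```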

  14. Visualization Using PCA
     ● Take only the first two or three principal components for 2D or 3D visualization
       ○ A popularly used feature visualization method, along with t-SNE, for analyzing the latent feature space of trained deep neural networks
     source: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

  15. K-Means Clustering
     ● Grouping the data points into K clusters
       ○ Each point has a membership to one of the clusters
       ○ Each cluster has a cluster center (not necessarily one of the data points)
       ○ The membership is determined by choosing the nearest cluster center
       ○ The cluster center is the mean of the data points that belong to the cluster
       ○ This is a dilemma: the memberships depend on the centers, and the centers depend on the memberships!

  16. K-Means: Definition
     ● The loss function to minimize is defined as:
       $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left\| \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right\|^2, \qquad r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \left\| \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right\|^2 \\ 0 & \text{otherwise} \end{cases}$
       ○ Regarded as a problem that learns cluster centers ($\boldsymbol{\mu}_k$) that minimize the loss
       ○ $r_{nk}$ is the binary indicator of the membership of each data point
     ● Taking the derivative of the loss $J$ w.r.t. the cluster center $\boldsymbol{\mu}_k$ and setting it to zero:
       $2\sum_{n=1}^{N} r_{nk}\left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k\right) = 0 \;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\, \mathbf{x}^{(n)}}{\sum_{n} r_{nk}}$
       ○ Again, we should know the cluster centers (to determine the membership) before computing the cluster centers

  17. Learning Algorithm
     ● Iterative learning
       ○ Initialize the cluster centers with random values (a)
       ○ Compute the membership of each data point given the cluster centers (b)
       ○ Update the cluster centers by averaging the data points that belong to them (c)
       ○ Repeat the two steps above until convergence (d, e, f)
     [Figure: panels (a)-(f) of the K-means iterations on a 2D toy dataset, and the loss J, which monotonically decreases every iteration (The PRML book)]
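A from-scratch NumPy sketch of this iterative procedure; the data and the value of K are placeholders, and the convergence test is a simple one chosen for illustration.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: [n_samples, d] data matrix. Returns (cluster centers, memberships)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]       # (a) initialize
    for _ in range(n_iters):
        # (b) assign each point to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # (c) update each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                     # (d-f) repeat until convergence
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans(np.random.randn(500, 2), K=3)
```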

  18. Data Compression Using K-means
     ● Vector Quantization
       ○ The set of cluster centers is called the "codebook"
       ○ Encoding maps a sample vector to a single scalar value, the "codebook index" (membership index)
       ○ The compressed data can be reconstructed using the codebook
       ○ Example: speech codec (CELP)
         ■ A component of the speech sound is vector-quantized and the codebook index is transmitted in the speech communication
     [Figure: an example codebook for a 2D Gaussian with 16 code vectors; a sequence of vectors $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ is encoded into codebook indices (e.g., 3, 5, ...) and decoded back from the codebook]
     source: https://wiki.aalto.fi/pages/viewpage.action?pageId=149883153
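A minimal vector-quantization sketch in NumPy; the codebook and the vectors being encoded are placeholders (in practice, the codebook would be the learned K-means cluster centers).

```python
import numpy as np

codebook = np.random.randn(16, 2)          # placeholder: in practice, K-means cluster centers
X = np.random.randn(10, 2)                 # placeholder feature vectors to compress

def vq_encode(X, codebook):
    """Map each vector in X [n, d] to the index of its nearest code vector."""
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

indices = vq_encode(X, codebook)           # compressed form: one integer per vector
X_hat = codebook[indices]                  # lossy reconstruction from the codebook
```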

  19. Codebook-based Feature Summarization
     ● Compute the histogram of codebook indices
       ○ Represent each codebook index with a one-hot vector
         ■ If K is a large number, this is regarded as a sparse representation of the features
       ○ Useful for summarizing a long sequence of frame-level features
         ■ Often called "a bag of features" (computer vision) or "a bag of words" (NLP)
     [Figure: a sequence of feature vectors $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ is encoded into one-hot vectors and summarized into a K-dimensional histogram (a bag of features)]
     source: https://towardsdatascience.com/bag-of-visual-words-in-a-nutshell-9ceea97ce0fb
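A sketch of the histogram summarization in NumPy; the per-frame codebook indices are placeholders (in practice they would come from the vector-quantization encoder above), and the normalization step is an added assumption so that clips of different lengths are comparable.

```python
import numpy as np

K = 16                                               # codebook size (placeholder)
indices = np.random.randint(0, K, size=200)          # placeholder per-frame codebook indices

bag_of_features = np.bincount(indices, minlength=K).astype(float)
bag_of_features /= bag_of_features.sum()             # normalize the histogram (assumed step)
print(bag_of_features)                               # one K-dimensional vector summarizing the sequence
```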

  20. Gaussian Mixture Model (GMM)
     ● Fit a set of multivariate Gaussian distributions to the data
       ○ Similar to K-means clustering, but it learns not only the cluster centers (means) but also the covariance of the clusters
       ○ The membership is a soft assignment given as a multinomial distribution
         ■ The multinomial distribution is regarded as a mapping onto a latent space
     [Figure: the same 2D data clustered by K-means (hard assignments) and by a GMM (soft assignments)]
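A hedged sketch of fitting a GMM with scikit-learn and reading out the soft memberships and likelihoods; the data and the number of components are placeholders.

```python
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.random.randn(500, 2)                      # placeholder [n_samples, d] data
gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)

soft_membership = gmm.predict_proba(X)           # [n_samples, K] soft assignments per point
log_likelihood = gmm.score_samples(X)            # density (log-likelihood) estimate per point
# For classification, one GMM can be fit per class and the class with the
# highest likelihood chosen at test time, as noted on slide 5.
```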
