EE613 Machine Learning for Engineers SUBSPACE CLUSTERING Sylvain Calinon Robot Learning & Interaction Group Idiap Research Institute Oct. 25, 2017 1
SUBSPACE CLUSTERING (Wed, Oct. 25) HIDDEN MARKOV MODELS (Wed, Nov. 1) LINEAR REGRESSION (Thu, Nov. 9) GAUSSIAN MIXTURE REGRESSION (Wed, Dec. 13) GAUSSIAN PROCESS REGRESSION (Wed, Dec. 20) Time series analysis and synthesis, Multivariate data processing 2
Outline • High-dimensional data clustering (HDDC) Matlab code: demo_HDDC01.m • Mixture of factor analyzers (MFA) Matlab code: demo_MFA01.m • Mixture of probabilistic principal component analyzers (MPPCA) Matlab code: demo_MPPCA01.m • GMM with semi-tied covariance matrices Matlab code: demo_semitiedGMM01.m 3
Introduction K clusters N datapoints D dimensions (original space) d dimensions (latent space) Subspace clustering aims at clustering data while reducing the dimension of each cluster (cluster-dependent subspace). Treating the two problems separately (clustering, then subspace projection) can be inefficient and can yield poor local optima, especially for high-dimensional datapoints. 4
Example of application: Whole-body motion Image: Dominici et al. (2010), J NEUROPHYSIOL About 90% of the variance in walking motion can be explained by 2 principal components. Each type of periodic motion (e.g., walking, running) can be characterized by a different subspace. This requires clustering of the complete motion into different locomotion phases, and extraction of coordination patterns for each cluster. 5
Curse of dimensionality in GMM encoding K clusters N datapoints D dimensions (original space) d dimensions (latent space) Image: datasciencecentral.com 6
Curse of dimensionality Some characteristics of high-dimensional spaces can ease the classification of data. Indeed, having different groups living in different subspaces may be a useful property for discriminating the groups. Subspace clustering exploits the phenomenon that high-dimensional spaces are mostly empty to ease the discrimination between groups of points. Curse of dimensionality or… blessing of dimensionality? 7
Curse of dimensionality N datapoints, D dimensions (original space), d dimensions (latent space) Bouveyron and Brunet (2014, COMPUT STAT DATA AN) reviewed various ways of handling high-dimensional data in clustering problems: 1. Since D is too large w.r.t. N, a global dimensionality reduction should be applied as a pre-processing step to reduce D (see the sketch below). 2. Since D is too large w.r.t. N, the solution space contains many poor local optima. The solution space should be smoothed by introducing ridge or lasso regularization in the estimation of the covariances (avoiding numerical problems and singular solutions when inverting the covariances). A simple form of regularization can be applied after the maximization step of each EM loop. 3. Since D is too large w.r.t. N, the model is probably over-parametrized, and a more parsimonious model should be used (thus estimating fewer parameters). 8
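As an illustration of strategy 1 above, a minimal Matlab sketch of global dimensionality reduction by PCA as a pre-processing step (a generic sketch with placeholder data and illustrative variable names, not part of the demo scripts):

```matlab
% Global PCA pre-processing: project D-dimensional data onto the d
% leading principal components before clustering.
D = 50; N = 200; d = 5;                % illustrative sizes
X = randn(D, N);                       % placeholder data, one datapoint per column
Xc = X - repmat(mean(X,2), 1, N);      % center the data
[V, E] = eig(cov(Xc'));                % eigendecomposition of the D x D covariance
[~, idx] = sort(diag(E), 'descend');   % order eigenvectors by decreasing eigenvalue
A = V(:, idx(1:d));                    % D x d projection matrix
Z = A' * Xc;                           % d x N data fed to the clustering algorithm
```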
Gaussian Mixture Model (GMM) K Gaussians N datapoints of dimension D Equidensity contour of one standard deviation 9
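As a reminder of the model underlying these slides, a minimal Matlab sketch that evaluates the GMM density p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at a set of datapoints (the parameter values below are placeholders, not the ones used in the figures):

```matlab
% Evaluate a K-component GMM density at N datapoints (columns of X).
D = 2; K = 2; N = 100;
X = randn(D, N);                              % placeholder datapoints
Prior = [0.6, 0.4];                           % mixing coefficients pi_k
Mu = [0 3; 0 -1];                             % D x K centers
Sigma = cat(3, eye(D), [2 0.5; 0.5 1]);       % D x D x K covariances
L = zeros(K, N);
for k = 1:K
    Xc = X - repmat(Mu(:,k), 1, N);           % centered datapoints
    L(k,:) = Prior(k) * exp(-0.5 * sum((Sigma(:,:,k) \ Xc) .* Xc, 1)) ...
             / sqrt((2*pi)^D * det(Sigma(:,:,k)));
end
p = sum(L, 1);                                % GMM density at each datapoint
```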
Covariance structures in GMM 10
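The classical covariance structures trade flexibility against the number of free parameters; a minimal Matlab sketch, assuming the usual full/diagonal/isotropic family and maximum-likelihood estimates for a single component on placeholder data:

```matlab
% Classical covariance structures for one Gaussian component, estimated
% from placeholder datapoints (columns of X): full, diagonal, isotropic.
D = 3; N = 200;
X = randn(D, N);
Sigma_full = cov(X');                           % D*(D+1)/2 free parameters
Sigma_diag = diag(diag(Sigma_full));            % D free parameters
Sigma_iso  = mean(diag(Sigma_full)) * eye(D);   % 1 free parameter
```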
Multivariate normal distribution - Stochastic sampling 11
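A minimal Matlab sketch of stochastic sampling from a multivariate normal distribution via the Cholesky factor of the covariance (Mu and Sigma below are illustrative values):

```matlab
% Draw N samples from N(Mu, Sigma) using the Cholesky factor of Sigma.
D = 3; N = 500;
Mu = zeros(D, 1);                        % illustrative mean
Sigma = eye(D) + 0.5 * ones(D);          % illustrative covariance (must be SPD)
A = chol(Sigma, 'lower');                % Sigma = A * A'
X = repmat(Mu, 1, N) + A * randn(D, N);  % each column is one sample
```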
Expectation-maximization (EM) 12
Expectation-maximization (EM) M-step Converge? Stop Initial guess E-step 13
EM for GMM 14
EM for GMM 15
EM for GMM 16
EM for GMM 17
EM for GMM: Resulting procedure K Gaussians N datapoints These results can be intuitively interpreted in terms of normalized counts. EM provides a systematic approach to derive such a procedure. Weighted averages take into account the responsibility of each datapoint in each cluster. 18
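A compact Matlab sketch of the resulting procedure (E-step responsibilities, then M-step normalized counts and weighted averages); it uses placeholder data and a naive random initialization, and is only an illustration of the update rules, not the demo code:

```matlab
% EM for a GMM: E-step responsibilities, M-step normalized counts and
% responsibility-weighted averages (placeholder data, naive initialization).
D = 2; K = 2; N = 200; nbIter = 50;
X = [randn(D, N/2), randn(D, N/2) + 3];          % placeholder data (D x N)
Prior = ones(1, K) / K;                          % initial mixing coefficients
idx = randperm(N);
Mu = X(:, idx(1:K));                             % K datapoints as initial centers
Sigma = repmat(cov(X'), [1 1 K]);                % shared initial covariances
for it = 1:nbIter
    % E-step: responsibility of component k for datapoint n
    L = zeros(K, N);
    for k = 1:K
        Xc = X - repmat(Mu(:,k), 1, N);
        L(k,:) = Prior(k) * exp(-0.5 * sum((Sigma(:,:,k) \ Xc) .* Xc, 1)) ...
                 / sqrt((2*pi)^D * det(Sigma(:,:,k)));
    end
    Gamma = L ./ repmat(sum(L,1), K, 1);
    % M-step: normalized counts and weighted averages
    for k = 1:K
        Nk = sum(Gamma(k,:));
        Prior(k) = Nk / N;
        Mu(:,k) = X * Gamma(k,:)' / Nk;
        Xc = X - repmat(Mu(:,k), 1, N);
        Sigma(:,:,k) = (Xc .* repmat(Gamma(k,:), D, 1)) * Xc' / Nk + 1e-6*eye(D);
    end
end
```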
EM for GMM 19
EM for GMM: Local optima issue 20
Local optima in EM EM improves the likelihood at each iteration, but it can get trapped in poor local optima of the solution space. Parameter initialization is important! Log-likelihood Unknown solution space Parameter space 21
Parameter estimation in GMM… in 1893 54 pages! Proposed solution: a moment-based approach requiring the solution of a polynomial of degree 9… …which does not mean that moment-based approaches are old-fashioned! They are popular again today, with new developments related to spectral decompositions.
High-dimensional data clustering (HDDC) Matlab code: demo_HDDC01.m [C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52 – 78, March 2014] 23
Curse of dimensionality Bouveyron and Brunet (2014, COMPUT STAT DATA AN) reviewed various ways of viewing and coping with the problem of high-dimensional data in clustering problems: 1. Since D is too large w.r.t. N, a global dimensionality reduction should be applied as a pre-processing step to reduce D. 2. Since D is too large w.r.t. N, the solution space contains many poor local optima; the solution space should be smoothed by introducing ridge or lasso regularization in the estimation of the covariances (avoiding numerical problems and singular solutions when inverting the covariances). A simple form of regularization can be applied after the maximization step of each EM loop. 3. Since D is too large w.r.t. N, the model is probably over-parametrized, and a more parsimonious model should be used (thus estimating fewer parameters). 24
Regularization of the GMM parameters The introduction of a regularization term can change the shape of the solution space Log-likelihood Unknown solution space Parameter space 25
Regularization of the GMM parameters Regularization with minimal admissible eigenvalue: Tikhonov regularization with diagonal isotropic covariance:
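A minimal Matlab sketch of these two regularization schemes, applied to one covariance matrix after the M-step; lambda_min and rho are user-chosen constants with illustrative values here:

```matlab
% Two simple regularizations of a covariance matrix after the M-step.
Sigma = cov(randn(100, 5));                 % placeholder covariance estimate
% 1) Regularization with minimal admissible eigenvalue: floor the
%    eigenvalues of Sigma at lambda_min and reconstruct the matrix.
lambda_min = 1e-2;                          % illustrative threshold
[V, E] = eig(Sigma);
Sigma_reg1 = V * diag(max(diag(E), lambda_min)) * V';
% 2) Tikhonov regularization with a diagonal isotropic term: add a
%    scaled identity matrix to the covariance.
rho = 1e-2;                                 % illustrative regularization weight
Sigma_reg2 = Sigma + rho * eye(size(Sigma,1));
```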
High-dimensional data clustering (HDDC) 27
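The core idea of the HDDC covariance model can be pictured for one cluster as follows: keep the d leading eigenvalues of the cluster covariance and replace the remaining D-d eigenvalues by their average, which acts as a common noise term outside the cluster-specific subspace. A minimal Matlab sketch with a placeholder covariance and an illustrative intrinsic dimension d:

```matlab
% HDDC-style reconstruction of one cluster covariance: keep the d leading
% eigenvalues a_1..a_d and replace the D-d remaining eigenvalues by their
% average b (common noise variance outside the cluster subspace).
Sigma = cov(randn(200, 6));                 % placeholder D x D cluster covariance
D = size(Sigma, 1); d = 2;                  % illustrative intrinsic dimension
[Q, E] = eig(Sigma);
[lambda, idx] = sort(diag(E), 'descend');
Q = Q(:, idx);
a = lambda(1:d);                            % variances inside the cluster subspace
b = mean(lambda(d+1:end));                  % common variance outside the subspace
Sigma_hddc = Q * diag([a; b * ones(D-d, 1)]) * Q';
```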
Mixture of factor analyzers (MFA) Matlab code: demo_MFA01.m [P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285 – 296, September 2008] 28
Mixture of factor analyzers (MFA) 29
Mixture of factor analyzers (MFA) 30
Mixture of factor analyzers (MFA): graphical model 31
Mixture of factor analyzers (MFA) 32
Mixture of factor analyzers (MFA) 33
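A minimal Matlab sketch of the generative view of one factor analyzer component, x = mu + Lambda*z + eps with z ~ N(0, I_d) and eps ~ N(0, Psi), Psi diagonal, so that the marginal covariance is Sigma = Lambda*Lambda' + Psi (all values below are illustrative):

```matlab
% Generative view of one factor analyzer component:
% x = mu + Lambda*z + eps, with z ~ N(0, I_d), eps ~ N(0, Psi), Psi diagonal,
% so that the marginal covariance of x is Sigma = Lambda*Lambda' + Psi.
D = 5; d = 2; N = 1000;                       % illustrative dimensions
mu = zeros(D, 1);
Lambda = randn(D, d);                         % factor loading matrix (D x d)
Psi = diag(0.1 * ones(D, 1));                 % diagonal noise covariance
Z = randn(d, N);                              % latent factors
Eps = repmat(sqrt(diag(Psi)), 1, N) .* randn(D, N);   % diagonal Gaussian noise
X = repmat(mu, 1, N) + Lambda * Z + Eps;      % samples in the original space
Sigma = Lambda * Lambda' + Psi;               % marginal covariance of x
```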
Estimation of parameters in MFA 34
Alternating Expectation Conditional Maximization (AECM) 35
AECM for MFA (UUU model in McNicholas and Murphy, 2008) covariance as in GMM 36
AECM for MFA (UUU model in McNicholas and Murphy, 2008) Same as standard GMM 37 covariance as in GMM
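For one component, the loading and noise updates can be sketched from the responsibility-weighted sample covariance S using the classical factor-analysis EM expressions; note that this is a generic sketch, and that the AECM cycles of McNicholas and Murphy (2008) organize these computations differently:

```matlab
% Loading and noise update for one MFA component, given the
% responsibility-weighted sample covariance S of that component (D x D),
% written with the classical factor-analysis EM expressions.
S = cov(randn(500, 4));                       % placeholder weighted covariance
D = size(S, 1); d = 2;
Lambda = randn(D, d); Psi = eye(D);           % current estimates
Beta = Lambda' / (Lambda * Lambda' + Psi);                  % d x D
Lambda_new = (S * Beta') / (eye(d) - Beta*Lambda + Beta*S*Beta');
Psi_new = diag(diag(S - Lambda_new * Beta * S));            % keep diagonal only
```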
Mixture of probabilistic PCA (MPPCA) Matlab code: demo_MPPCA01.m [M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443 – 482, 1999] 38
Mixture of probabilistic PCA (MPPCA) covariance as in GMM 39
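In MPPCA, each component covariance has the form Sigma_k = W_k*W_k' + sigma_k^2*I, and Tipping and Bishop (1999) give a closed-form solution in terms of the eigendecomposition of the (responsibility-weighted) sample covariance. A minimal Matlab sketch for one component on a placeholder covariance:

```matlab
% PPCA solution for one component, given its (responsibility-weighted)
% sample covariance S: Sigma = W*W' + sigma2*I, with W built from the d
% leading eigenvectors and sigma2 the average of the discarded eigenvalues.
S = cov(randn(300, 5));                        % placeholder D x D covariance
D = size(S, 1); d = 2;
[U, E] = eig(S);
[lambda, idx] = sort(diag(E), 'descend');
U = U(:, idx);
sigma2 = mean(lambda(d+1:end));                % noise variance estimate
W = U(:, 1:d) * diag(sqrt(lambda(1:d) - sigma2));
Sigma_mppca = W * W' + sigma2 * eye(D);
```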
A taxonomy of parsimonious GMMs (the data dimension is denoted D in the slides of this lecture) [C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, March 2014] 40
GMM with semi-tied covariance matrices Matlab code: demo_semitiedGMM01.m [M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. on Speech and Audio Processing, 7(3):272 – 281, 1999] 41
Sharing of parameters in mixture models 42
GMM with semi-tied covariance matrices (shared transform H) 43
GMM with semi-tied covariance matrices 44
GMM with semi-tied covariance matrices 45
GMM with semi-tied covariance matrices covariance as in GMM 46
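In the semi-tied model, all components share a common non-diagonal transform H and keep their own diagonal part, Sigma_k = H * SigmaDiag_k * H'. A minimal Matlab sketch of how such covariances are composed (the iterative row-by-row estimation of H from Gales (1999) is not shown):

```matlab
% Semi-tied covariance structure: all components share a common
% non-diagonal transform H and keep their own diagonal part, so that
% Sigma_k = H * SigmaDiag_k * H'. Only the composition is shown here;
% estimating H requires the iterative update of Gales (1999).
D = 3; K = 2;
H = orth(randn(D));                            % illustrative shared transform
Sigma = zeros(D, D, K);
for k = 1:K
    SigmaDiag = diag(rand(D, 1) + 0.1);        % component-specific diagonal part
    Sigma(:,:,k) = H * SigmaDiag * H';
end
```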
Summary of relevant covariance structures (with shared transform H) 47
Main references
Parsimonious GMM: C. Bouveyron and C. Brunet. Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 71:52–78, March 2014. P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, September 2008.
MFA: G. J. McLachlan, D. Peel, and R. W. Bean. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis, 41(3-4):379–388, 2003. G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. on Neural Networks, 8(1):65–74, 1997.
MPPCA: M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.
GMM with semi-tied covariances: M. J. F. Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Trans. on Speech and Audio Processing, 7(3):272–281, 1999. 48
General textbooks 49