
  1. Clustering: Models and Algorithms Shikui Tu 2019-03-07 1

  2. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 2

  3. From distance to probability. Map the squared distance ||x − µ||² to a likelihood via exp{−λ ||x − µ||²}: "the closer, the more likely." Normalizing so that the sum (or integral) equals one turns this into a probability. It is more powerful to consider everything in a probabilistic framework; the Gaussian distribution does exactly this, using the Mahalanobis distance (x − µ)^T Σ^{-1} (x − µ). 3
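
As a toy illustration of this mapping, a minimal NumPy sketch (the function name distance_to_probability and the example centers are my own, not from the slides): squared distances become weights exp(−λ·d²) that are normalized to sum to one.

```python
import numpy as np

def distance_to_probability(x, centers, lam=1.0):
    """Map squared distances to cluster centers into a probability over clusters:
    p_k is proportional to exp(-lam * ||x - mu_k||^2)."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - mu_k||^2 for each center
    weights = np.exp(-lam * sq_dist)               # "the closer, the more likely"
    return weights / weights.sum()                 # normalize so the sum is one

# Example: a point near the first of two centers gets most of the probability
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(distance_to_probability(np.array([0.5, 0.2]), centers))
```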

  4. Review the clustering problem again. Given the data shown in the scatter plot, we want to cluster the data into two clusters (red and blue). 4

  5. K-means vs. Gaussian Mixture Model (GMM). Instead of representing each cluster only by its mean in {µ_1, µ_2}, each cluster is represented as a Gaussian distribution. (Figure: the same two clusters, k = 1 and k = 2, as modelled by K-means and by a GMM.) 5

  6. Maximum likelihood (single Gaussian). Maximizing the log-likelihood function ln p(X | µ, Σ) = −(N/2) ln |Σ| − (1/2) Σ_n (x_n − µ)^T Σ^{-1} (x_n − µ) + const, setting ∂ ln p / ∂µ = 0 gives µ_ML = (1/N) Σ_n x_n, and similarly Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T. These are the maximum likelihood estimates of the mean and the covariance matrix. 6
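
These closed-form estimates are easy to check numerically; a minimal sketch (the function name gaussian_mle is my own):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates of mean and covariance for a single Gaussian
    fitted to the rows of X (shape N x D)."""
    mu = X.mean(axis=0)                    # mu_ML = (1/N) sum_n x_n
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]     # Sigma_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T
    return mu, sigma

# Example on synthetic 2-D data: the estimates recover the true parameters
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=2000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)
print(sigma_hat)
```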

  7. Matrix derivatives (used in the derivation above): see the Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf 7

  8. Gaussian Mixture Model (GMM). We use z_k = 1 to indicate that a point x belongs to cluster k, with z = (z_1, …, z_K). Assume the points in the same cluster follow a Gaussian distribution, p(x | z_k = 1) = N(x | µ_k, Σ_k), and give each cluster a mixing weight π_k = p(z_k = 1), the prior probability of a point belonging to that cluster. So we get a distribution for the data point x: p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k). 8

  9. Introduce a latent variable. We use z_k = 1 to indicate that a point x belongs to cluster k, with z = (z_1, …, z_K) a one-of-K indicator vector. A mixing weight for each cluster gives the prior probability of a point belonging to that cluster: p(z_k = 1) = π_k. Assume the points in the same cluster follow a Gaussian distribution: p(x | z_k = 1) = N(x | µ_k, Σ_k). 9

  10. Gaussian Mixture Model (GMM): generative process. • Randomly sample z from a categorical distribution with parameters [π_1, …, π_K]; • Generate x according to the Gaussian distribution of the selected cluster, N(x | µ_k, Σ_k). The graphical representation corresponds to the factorization p(x, z) = p(z) p(x | z). So we get a distribution for the data point x: p(x) = Σ_k π_k N(x | µ_k, Σ_k). 10
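
The generative process translates directly into a sampling routine; a hedged sketch (sample_gmm and the two-component example are my own illustration):

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng=None):
    """Generative process of a GMM: draw z ~ Categorical(pi),
    then draw x ~ N(mu_z, Sigma_z)."""
    rng = rng or np.random.default_rng()
    z = rng.choice(len(pis), size=n, p=pis)        # latent cluster indicators
    X = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return X, z

# Example: 500 points from a two-component mixture in 2-D
pis = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
X, z = sample_gmm(500, pis, mus, sigmas)
```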

  11. From minimizing the sum of squared distances to maximizing the likelihood. Given data X = {x_1, ..., x_N}, K-means minimizes the sum of squared distances over hard assignments r_nk ∈ {0, 1} (e.g., r_n1 = 1, r_n2 = 0 for a point in cluster 1) and means µ = {µ_1, …, µ_K}; the GMM instead maximizes the likelihood p(X | π, µ, Σ) over the mixing weights π = {π_1, ..., π_K}, the means µ = {µ_1, …, µ_K}, and the covariances Σ = {Σ_1, ..., Σ_K}. Remember: the closer the distance, the higher the probability. 11

  12. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 12

  13. Expectation-Maximization (EM) algorithm for maximum likelihood. Initialization: choose starting values for the parameters of the clusters k = 1 and k = 2 (shown in the figure as two initial Gaussians). 13

  14. E Step. When the parameters are given, the assignments of the points are calculated from the posterior probability, i.e., the probability of a data point belonging to a cluster once we have observed that point: γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j). This is a soft assignment: a point fractionally belongs to several clusters, for example 0.2 to cluster 1 and 0.8 to cluster 2. 14
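
A minimal sketch of this computation (the function name e_step is my own; SciPy's multivariate normal density stands in for N(x | µ_k, Σ_k)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    """Responsibilities gamma[n, k] = p(z_k = 1 | x_n) under the current parameters."""
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
        for k in range(len(pis))
    ])                                                       # pi_k * N(x_n | mu_k, Sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize over clusters k
```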

  15. M Step. When the assignments γ(z_nk) of the points to the clusters are known, the parameters can be calculated for each cluster (Gaussian) separately. With N_k = Σ_n γ(z_nk): the mixing weight π_k = N_k / N is the proportion of points in cluster k among all data points; the mean and the covariance matrix of each cluster are µ_k = (1/N_k) Σ_n γ(z_nk) x_n and Σ_k = (1/N_k) Σ_n γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T. (In the accompanying figure, L denotes the number of completed EM cycles.) 15
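
The corresponding update, as a sketch (m_step is my own name, matching the e_step sketch above):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pi_k, mu_k, Sigma_k) from the responsibilities gamma (shape N x K)."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                # effective number of points per cluster
    pis = Nk / N                          # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]     # mu_k = responsibility-weighted mean
    sigmas = []
    for k in range(K):
        diff = X - mus[k]
        sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])   # weighted covariance
    return pis, mus, sigmas
```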

  16. (Figure: the full EM run on the example data, from initialization through alternating E-steps and M-steps to convergence; L denotes the number of completed E-step/M-step cycles.) 16
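
Putting the pieces together, a minimal driver loop (a sketch reusing the hypothetical e_step and m_step functions above; the initialization and convergence test are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, sigmas):
    dens = sum(pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
               for k in range(len(pis)))
    return np.log(dens).sum()

def fit_gmm(X, K, n_iter=100, tol=1e-6, rng=None):
    """EM for a GMM: alternate E and M steps until the log-likelihood stops improving."""
    rng = rng or np.random.default_rng(0)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]            # initialize means at random data points
    sigmas = [np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)]
    old_ll = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, sigmas)             # E step: soft assignments
        pis, mus, sigmas = m_step(X, gamma)             # M step: re-estimate parameters
        ll = log_likelihood(X, pis, mus, sigmas)
        if ll - old_ll < tol:                           # EM never decreases the likelihood
            break
        old_ll = ll
    return pis, mus, sigmas, gamma
```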

  17. Details of the EM Algorithm 17

  18. K-means is a hard-cut EM. K-means corresponds to a GMM with fixed spherical covariances Σ_k = εI and equal mixing weights, so it only estimates the means {µ_k} and uses a one-in-K (hard) assignment; the full GMM also learns the covariances and the mixing weights and uses soft assignments. 18
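
A brief sketch of this limit (my own illustration): with Σ_k = εI and equal weights, the responsibilities are a softmax of −||x_n − µ_k||²/(2ε), and as ε shrinks they collapse onto the nearest mean, which is exactly the K-means assignment.

```python
import numpy as np

def hard_assign(X, mus, eps=1e-3):
    """K-means as the small-epsilon limit of EM with Sigma_k = eps * I and equal weights."""
    sq_dist = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # ||x_n - mu_k||^2
    logits = -sq_dist / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)     # stabilize before exponentiating
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)       # soft responsibilities, nearly one-hot
    return gamma.argmax(axis=1)                     # one-in-K (hard) assignment
```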

  19. The General EM Algorithm. Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X | θ) with respect to θ. 1. Choose an initial setting for the parameters θ_old. 2. E step: evaluate p(Z | X, θ_old). 3. M step: evaluate θ_new given by θ_new = arg max_θ Q(θ, θ_old), where Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ). 4. Check for convergence of either the log-likelihood or the parameter values; if the convergence criterion is not satisfied, set θ_old ← θ_new and return to step 2. 19

  20. Summary of the EM algorithm for GMM • Does it find the global optimum? No: like K-means, EM only finds the nearest local optimum, and the result depends on the initialization. • GMM is more general than K-means, since it models mixing weights, covariance matrices, and soft assignments. • Like K-means, it does not tell you the best K. 20

  21. EM never decreases the likelihood. The log-likelihood decomposes as ln p(X | θ) = F(q, θ) + KL(q || p(Z | X, θ)), where F(q, θ) is a lower bound on ln p(X | θ). At iteration t, the E-step sets q^(t+1) = p(Z | X, θ^(t)), so KL[q^(t+1) || p(Z | X, θ^(t))] = 0 and the new lower bound touches the old log-likelihood: F(q^(t+1), θ^(t)) = ln p(X | θ^(t)). The M-step then maximizes the lower bound over θ to get θ^(t+1), and since KL[q^(t+1) || p(Z | X, θ^(t+1))] ≥ 0, the new log-likelihood satisfies ln p(X | θ^(t+1)) ≥ F(q^(t+1), θ^(t+1)) ≥ F(q^(t+1), θ^(t)) = ln p(X | θ^(t)). 21

  22. 22

  23. The KL divergence KL[q(x) || p(x)] is non-negative, and zero iff p(x) = q(x) for all x. First consider discrete distributions; the Kullback-Leibler divergence is KL[q || p] = Σ_i q_i log(q_i / p_i). To find the distribution q that minimizes KL[q || p], add a Lagrange multiplier enforcing the normalization constraint: E := KL[q || p] + λ(1 − Σ_i q_i) = Σ_i q_i log(q_i / p_i) + λ(1 − Σ_i q_i). Taking partial derivatives and setting them to zero: ∂E/∂q_i = log q_i − log p_i + 1 − λ = 0 ⟹ q_i = p_i exp(λ − 1), and ∂E/∂λ = 1 − Σ_i q_i = 0 ⟹ Σ_i q_i = 1; together these give q_i = p_i for all i. The curvature (Hessian) is positive definite, with ∂²E/∂q_i² = 1/q_i > 0 and ∂²E/∂q_i∂q_j = 0 for i ≠ j, so q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p || p] = 0. 23
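
A quick numerical check of the non-negativity claim (kl_divergence is my own helper for discrete distributions):

```python
import numpy as np

def kl_divergence(q, p):
    """KL[q || p] = sum_i q_i * log(q_i / p_i) for discrete distributions (0 log 0 := 0)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.2, 0.5, 0.3])
print(kl_divergence([0.1, 0.6, 0.3], p))   # positive whenever q differs from p
print(kl_divergence(p, p))                 # exactly 0 when q equals p
```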

  24. Jensen's inequality (due to convexity). 24
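
The formula on this slide did not survive extraction; as a sketch, the standard statement and the way it yields the EM lower bound used on the neighbouring slides:

```latex
% Jensen's inequality for the concave function \ln and any distribution q(Z):
%   \ln \mathbb{E}_{q}[g(Z)] \;\ge\; \mathbb{E}_{q}[\ln g(Z)].
% Applied to the log-likelihood, it gives the lower bound \mathcal{F}(q, \theta):
\ln p(X \mid \theta)
  = \ln \sum_{Z} q(Z)\, \frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}
  = \mathcal{F}(q, \theta).
```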

  25. EM never decreases the likelihood (recap of slide 21): each E-step closes the KL gap so that the lower bound F(q, θ) touches ln p(X | θ), and each M-step increases the bound, so ln p(X | θ^(t+1)) ≥ ln p(X | θ^(t)). 25

  26. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 26

  27. How to determine the cluster number K? (Figure: the K-means objective J and the GMM negative log-likelihood, each plotted against K.) J keeps decreasing as K grows, so it does not tell which K is better; the negative log-likelihood likewise decreases as K increases. 27

  28. Model selection in general. Probabilistic model p(X_N | Θ_K), with candidate models Θ_1 ⊆ Θ_2 ⊆ … ⊆ Θ_K ⊆ …; the models become more and more complex as K increases. The criterion is a trade-off between fitting the data well and keeping the model simple: the log-likelihood (or fitting error) measures how well the model with K components fits the data, and a penalty discourages complexity. Akaike's Information Criterion (AIC): ln p(X_N | Θ̂_K) − d_K, where d_K is the number of free parameters. Bayesian Information Criterion (BIC): ln p(X_N | Θ̂_K) − (1/2) d_K ln N, where N is the sample size. 28
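
A sketch of how these criteria could be computed for the GMM (gmm_free_params and aic_bic are my own helpers; the commented usage assumes the fit_gmm and log_likelihood sketches above). Note that the slide writes both criteria in "larger is better" form.

```python
import numpy as np

def gmm_free_params(K, D):
    """d_K for a full-covariance GMM: (K-1) mixing weights, K*D means,
    and K*D*(D+1)/2 covariance entries."""
    return (K - 1) + K * D + K * D * (D + 1) // 2

def aic_bic(loglik, K, D, N):
    """The slide's forms: AIC = ln p(X|theta) - d_K, BIC = ln p(X|theta) - (1/2) d_K ln N."""
    d = gmm_free_params(K, D)
    return loglik - d, loglik - 0.5 * d * np.log(N)

# Hypothetical usage with the earlier sketches:
# for K in range(1, 10):
#     pis, mus, sigmas, _ = fit_gmm(X, K)
#     print(K, aic_bic(log_likelihood(X, pis, mus, sigmas), K, X.shape[1], X.shape[0]))
```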

  29. Bayesian learning • Maximum A Posteriori (MAP): max_Θ p(Θ | X), which is equivalent to maximizing log p(X, Θ) = log p(X | Θ) + log p(Θ). Consider a simple example: p(x | Θ) = N(x | µ, Σ) with a Gaussian prior on the mean, p(µ) = N(µ | µ_0, σ_0² I). 29
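
Carrying the example one step further (a sketch under the assumption that Σ is known and only the mean µ is learned): the MAP estimate is a precision-weighted compromise between the data and the prior mean µ_0.

```latex
% MAP estimate of the mean, given x_1, ..., x_N ~ N(mu, Sigma) and prior
% p(mu) = N(mu | mu_0, sigma_0^2 I), with Sigma known:
\hat{\mu}_{\mathrm{MAP}}
  = \arg\max_{\mu} \Big[ \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \Sigma)
      + \ln \mathcal{N}(\mu \mid \mu_0, \sigma_0^2 I) \Big]
  = \Big( N \Sigma^{-1} + \tfrac{1}{\sigma_0^2} I \Big)^{-1}
    \Big( \Sigma^{-1} \sum_{n=1}^{N} x_n + \tfrac{1}{\sigma_0^2} \mu_0 \Big).
```

As σ_0² grows the prior becomes uninformative and the estimate reduces to the maximum likelihood mean; a small σ_0² shrinks it toward µ_0.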

  30. Model selection. Probabilistic model p(X_N | Θ_K); candidate models Θ_1 ⊆ Θ_2 ⊆ … ⊆ Θ_K ⊆ …. 30

  31. Using Occam's Razor to Learn Model Structure. Compare model classes m using their posterior probability given the data: P(m | y) = P(y | m) P(m) / P(y), where the marginal likelihood is P(y | m) = ∫_{Θ_m} P(y | θ_m, m) P(θ_m | m) dθ_m. Interpretation of P(y | m): the probability that randomly selected parameter values from the model class would generate the data set y. Model classes that are too simple are unlikely to generate the data set; model classes that are too complex can generate many possible data sets, so again they are unlikely to generate that particular data set at random. (Figure: P(Y | M_i) over all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.) 31

  32. Bayesian model selection • A model class m is a set of models parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians. • The marginal likelihood of model class m, P(y | m) = ∫_{Θ_m} P(y | θ_m, m) P(θ_m | m) dθ_m, is also known as the Bayesian evidence for model m. • The ratio of two marginal likelihoods, P(y | m) / P(y | m′), is known as the Bayes factor. • The Occam's Razor principle is, roughly speaking, that one should prefer simpler explanations over more complex ones. • Bayesian inference formalises and automatically implements the Occam's Razor principle. 32
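
To make the marginal likelihood concrete, a toy Monte Carlo sketch (entirely my own setup, not from the slides): P(y | m) is estimated by averaging the likelihood over parameters drawn from the prior, for a one-dimensional Gaussian model with unknown mean.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood_mc(y, mu0=0.0, prior_sd=5.0, noise_sd=1.0,
                               n_samples=20000, rng=None):
    """Monte Carlo estimate of ln P(y | m) = ln E_{theta ~ prior}[ P(y | theta, m) ]
    for the toy model y_i ~ N(theta, noise_sd^2), theta ~ N(mu0, prior_sd^2)."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.normal(mu0, prior_sd, size=n_samples)   # parameters sampled from the prior
    log_liks = norm.logpdf(y[:, None], loc=thetas[None, :], scale=noise_sd).sum(axis=0)
    m = log_liks.max()                                   # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_liks - m).mean())

y = np.random.default_rng(1).normal(2.0, 1.0, size=30)
print(log_marginal_likelihood_mc(y))
```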
