  1. Unsupervised Learning
     Andrea Passerini, passerini@disi.unitn.it
     Machine Learning

  2. Unsupervised Learning
     Setting
     - Supervised learning requires the availability of labelled examples
     - Labelling examples can be an extremely expensive process
     - Sometimes we don’t even know how to label examples
     - Unsupervised techniques can be employed to group examples into clusters

  3. k-means clustering
     Setting
     - Assumes examples should be grouped into k clusters
     - Each cluster i is represented by its mean $\mu_i$
     Algorithm
     1. Initialize cluster means $\mu_1, \ldots, \mu_k$
     2. Iterate until no mean changes:
        1. Assign each example to the cluster with the nearest mean
        2. Update cluster means according to the assigned examples
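For concreteness, a minimal NumPy sketch of this loop (the random initialization from k examples, the fixed iteration cap, and the lack of empty-cluster handling are implementation choices, not part of the slide):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: assign each example to the nearest mean, then update the means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initialize mu_1, ..., mu_k
    for _ in range(max_iter):
        # assignment step: index of the nearest mean for each example
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: mean of the examples assigned to each cluster
        # (empty clusters are not handled in this sketch)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):
            break  # no mean changed
        means = new_means
    return means, labels
```

For instance, `means, labels = k_means(np.random.randn(100, 2), k=3)` groups 100 two-dimensional points into 3 clusters.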

  4. How can we define (dis)similarity between examples?
     (Dis)similarity measures
     - Standard Euclidean distance in $\mathbb{R}^d$:
       $d(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$
     - Generic Minkowski metric for $p \geq 1$:
       $d(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^p \right)^{1/p}$
     - Cosine similarity (cosine of the angle between vectors):
       $s(x, x') = \frac{x^T x'}{\lVert x \rVert \, \lVert x' \rVert}$
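Assuming the examples are NumPy vectors, these measures can be written directly as (a sketch, not taken from the slides):

```python
import numpy as np

def euclidean(x, x2):
    # d(x, x') = sqrt(sum_i (x_i - x'_i)^2)
    return np.sqrt(np.sum((x - x2) ** 2))

def minkowski(x, x2, p):
    # d(x, x') = (sum_i |x_i - x'_i|^p)^(1/p), for p >= 1
    return np.sum(np.abs(x - x2) ** p) ** (1.0 / p)

def cosine_similarity(x, x2):
    # s(x, x') = x^T x' / (||x|| ||x'||)
    return x @ x2 / (np.linalg.norm(x) * np.linalg.norm(x2))
```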

  5. How can we define the quality of the obtained clusters?
     Sum-of-squared error criterion
     - Let $n_i$ be the number of samples in cluster $\mathcal{D}_i$
     - Let $\mu_i$ be the cluster sample mean:
       $\mu_i = \frac{1}{n_i} \sum_{x \in \mathcal{D}_i} x$
     - The sum-of-squared errors is defined as:
       $E = \sum_{i=1}^{k} \sum_{x \in \mathcal{D}_i} \lVert x - \mu_i \rVert^2$
     - It measures the squared error incurred in representing each example with its cluster mean
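A sketch of the criterion, assuming X is an (n, d) array and labels assigns each example to one of k clusters:

```python
import numpy as np

def sum_of_squared_errors(X, labels, k):
    """Sum over clusters of squared distances to the cluster sample mean."""
    E = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue  # empty clusters contribute nothing in this sketch
        mu = cluster.mean(axis=0)
        E += np.sum(np.linalg.norm(cluster - mu, axis=1) ** 2)
    return E
```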

  6. Gaussian Mixture Model (GMM)
     Setting
     - Cluster examples using a mixture of Gaussian distributions
     - Assume the number of Gaussians is given
     - Estimate the mean and possibly the variance of each Gaussian
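Not on the slide, but for illustration: a Gaussian mixture can be fit for clustering with an off-the-shelf implementation such as scikit-learn's GaussianMixture (the toy data and k = 3 are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(300, 2)          # toy data; in practice, your examples
k = 3                                # number of Gaussians, assumed given
gmm = GaussianMixture(n_components=k).fit(X)
labels = gmm.predict(X)              # hard cluster assignment for each example
print(gmm.means_)                    # estimated mean of each Gaussian
```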

  7. Gaussian Mixture Model (GMM)
     Parameter estimation
     - Maximum likelihood estimation cannot be applied directly, as the cluster assignment of the examples is unknown
     - Expectation-Maximization approach:
       1. Compute the expected cluster assignment given the current parameter setting
       2. Estimate the parameters given the cluster assignment
       3. Iterate

  8. Example: estimating the means of k univariate Gaussians
     Setting
     - A dataset of examples $x_1, \ldots, x_n$ is observed
     - For each example $x_i$, the cluster assignment is modelled by binary latent (i.e. unknown) variables $z_{i1}, \ldots, z_{ik}$, with $z_{ij} = 1$ if Gaussian $j$ generated $x_i$ and 0 otherwise
     - The parameters to be estimated are the Gaussian means $\mu_1, \ldots, \mu_k$
     - All Gaussians are assumed to have the same (known) variance $\sigma^2$

  9. Example: estimating the means of k univariate Gaussians
     Algorithm
     1. Initialize $h = \langle \mu_1, \ldots, \mu_k \rangle$
     2. Iterate until the difference in maximum likelihood (ML) is below a certain threshold:
        E-step: calculate the expected value $E[z_{ij}]$ of each latent variable assuming the current hypothesis $h = \langle \mu_1, \ldots, \mu_k \rangle$ holds
        M-step: calculate a new ML hypothesis $h' = \langle \mu'_1, \ldots, \mu'_k \rangle$ assuming the values of the latent variables are the expected values just computed. Replace $h \leftarrow h'$

  10. Example: estimating the means of k univariate Gaussians
     Algorithm
     E-step: the expected value of $z_{ij}$ is the probability that $x_i$ was generated by Gaussian $j$, assuming hypothesis $h = \langle \mu_1, \ldots, \mu_k \rangle$ holds:
       $E[z_{ij}] = \frac{p(x_i \mid \mu_j)}{\sum_{l=1}^{k} p(x_i \mid \mu_l)} = \frac{\exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right)}{\sum_{l=1}^{k} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_l)^2\right)}$
     M-step: the maximum-likelihood mean $\mu'_j$ is the weighted sample mean, each instance being weighted by its probability of being generated by Gaussian $j$:
       $\mu'_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]}$
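Putting the two steps together, a sketch of the procedure for this example (the random initialization and the fixed number of iterations replace the likelihood-threshold stopping criterion of slide 9; both are implementation choices):

```python
import numpy as np

def em_gaussian_means(x, k, sigma2, n_iter=100, seed=0):
    """EM for the means of k univariate Gaussians with shared, known variance sigma2.
    x: 1-d array of observations. Returns the estimated means mu_1, ..., mu_k."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initialize h = <mu_1, ..., mu_k>
    for _ in range(n_iter):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        logits = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma2)
        w = np.exp(logits)
        E_z = w / w.sum(axis=1, keepdims=True)
        # M-step: weighted sample mean for each Gaussian
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu
```

For example, on data drawn from two unit-variance Gaussians with means -2 and 3, `em_gaussian_means(x, k=2, sigma2=1.0)` should recover means close to those values.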

  11. Expectation-Maximization (EM)
     Formal setting
     - We are given a dataset made of an observed part $X$ and an unobserved part $Z$
     - We wish to estimate the hypothesis maximizing the expected log-likelihood of the data, with the expectation taken over the unobserved data:
       $h^* = \operatorname{argmax}_h \, E_Z[\ln p(X, Z \mid h)]$
     Problem
     The unobserved data $Z$ should be treated as random variables governed by a distribution depending on $X$ and $h$

  12. Expectation-Maximization (EM)
     Generic algorithm
     1. Initialize hypothesis $h$
     2. Iterate until convergence:
        E-step: compute the expected likelihood of a hypothesis $h'$ for the full data, where the unobserved data distribution is modelled according to the current hypothesis $h$ and the observed data:
          $Q(h'; h) = E_Z[\ln p(X, Z \mid h') \mid h, X]$
        M-step: replace the current hypothesis with the one maximizing $Q(h'; h)$:
          $h \leftarrow \operatorname{argmax}_{h'} Q(h'; h)$
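The same structure as a generic skeleton, with the two steps passed in as functions (the names e_step/m_step and the simple parameter-change convergence test are illustrative assumptions, not from the slide):

```python
def expectation_maximization(h, X, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop; h is a tuple of numeric parameters in this sketch.
    e_step(h, X)   -> expected values of the latent variables Z under h
    m_step(E_Z, X) -> new hypothesis h' maximizing Q(h'; h)"""
    for _ in range(max_iter):
        E_Z = e_step(h, X)          # E-step
        h_new = m_step(E_Z, X)      # M-step
        if max(abs(a - b) for a, b in zip(h_new, h)) < tol:
            return h_new            # parameters have stopped changing
        h = h_new
    return h
```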

  13. Example: estimating the means of k univariate Gaussians
     Derivation
     - The likelihood of an example is:
       $p(x_i, z_{i1}, \ldots, z_{ik} \mid h') = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\sum_{j=1}^{k} z_{ij} \frac{(x_i - \mu'_j)^2}{2\sigma^2} \right)$
     - The dataset log-likelihood is:
       $\ln p(X, Z \mid h') = \sum_{i=1}^{n} \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{j=1}^{k} z_{ij} \frac{(x_i - \mu'_j)^2}{2\sigma^2} \right)$

  14. Example: estimating the means of k univariate Gaussians
     E-step
     The expected log-likelihood (remember the linearity of the expectation operator):
       $E_Z[\ln p(X, Z \mid h')] = E_Z\!\left[ \sum_{i=1}^{n} \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{j=1}^{k} z_{ij} \frac{(x_i - \mu'_j)^2}{2\sigma^2} \right) \right]$
       $= \sum_{i=1}^{n} \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{j=1}^{k} E[z_{ij}] \frac{(x_i - \mu'_j)^2}{2\sigma^2} \right)$
     The expectation given the current hypothesis $h$ and the observed data $X$ is computed as:
       $E[z_{ij}] = \frac{p(x_i \mid \mu_j)}{\sum_{l=1}^{k} p(x_i \mid \mu_l)} = \frac{\exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right)}{\sum_{l=1}^{k} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_l)^2\right)}$

  15. Example: estimating the means of k univariate Gaussians
     M-step
     The likelihood maximization gives:
       $\operatorname{argmax}_{h'} Q(h'; h) = \operatorname{argmax}_{h'} \sum_{i=1}^{n} \left( \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \sum_{j=1}^{k} E[z_{ij}] \frac{(x_i - \mu'_j)^2}{2\sigma^2} \right)$
       $= \operatorname{argmin}_{h'} \sum_{i=1}^{n} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu'_j)^2$
     Zeroing the derivative with respect to each mean $\mu'_j$ we get:
       $-2 \sum_{i=1}^{n} E[z_{ij}](x_i - \mu'_j) = 0$
     so that $\mu'_j \sum_{i=1}^{n} E[z_{ij}] = \sum_{i=1}^{n} E[z_{ij}]\, x_i$, i.e.:
       $\mu'_j = \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]}$

  16. How to choose the number of clusters?
     Elbow method: idea
     - Increasing the number of clusters allows for better modelling of the data
     - Need to trade off the quality of the clusters against their quantity
     - Stop increasing the number of clusters when the advantage is limited

  17. How to choose the number of clusters?
     Elbow method: approach
     1. Run the clustering algorithm for an increasing number of clusters
     2. Plot a clustering evaluation metric (e.g. sum of squared errors) for the different values of k
     3. Choose k where there is an angle (an "elbow", i.e. a drop in gain) in the plot
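A possible sketch of this procedure using scikit-learn's KMeans, whose inertia_ attribute is its sum-of-squared-errors value (the data and the range of k are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)                      # placeholder data
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared errors")
plt.show()                                       # look for the elbow in the curve
```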

  18. How to choose the number of clusters?
     Elbow method: problem
     The elbow method can be ambiguous, with multiple candidate points (e.g. k=2 and k=4 in the figure).

  19. How to choose the number of clusters?
     Average silhouette method: idea
     - Increasing the number of clusters makes each cluster more homogeneous
     - Increasing the number of clusters can make different clusters more similar
     - Use a quality metric that trades off intra-cluster similarity and inter-cluster dissimilarity

  20. How to choose the number of clusters?
     Silhouette coefficient for example i
     1. Compute the average dissimilarity between i and the examples of its cluster C:
        $a_i = d(i, C) = \frac{1}{|C|} \sum_{j \in C} d(i, j)$
     2. Compute the average dissimilarity between i and the examples of each cluster $C' \neq C$, and take the minimum:
        $b_i = \min_{C' \neq C} d(i, C')$
     3. The silhouette coefficient is:
        $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
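A direct transcription of the three steps with Euclidean dissimilarity (note that, following the slide's formula, the average a_i includes d(i, i) = 0):

```python
import numpy as np

def silhouette_coefficient(i, X, labels):
    """Silhouette coefficient s_i for example i, using Euclidean dissimilarity."""
    d = np.linalg.norm(X - X[i], axis=1)            # distances from i to all examples
    own = labels == labels[i]
    a_i = d[own].sum() / own.sum()                  # average dissimilarity within i's cluster
    b_i = min(d[labels == c].mean()                 # minimum over the other clusters
              for c in np.unique(labels) if c != labels[i])
    return (b_i - a_i) / max(a_i, b_i)
```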

  21. How to choose the number of clusters?
     Average silhouette method: approach
     1. Run the clustering algorithm for an increasing number of clusters
     2. Plot the average (over examples) silhouette coefficient for the different values of k
     3. Choose k where the average silhouette coefficient is maximal
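For illustration, the same selection can be done with scikit-learn's silhouette_score, which averages the per-example coefficients (the data and the range of k are placeholders; silhouette_score requires at least 2 clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.randn(200, 2)                       # placeholder data
scores = {}
for k in range(2, 11):                            # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)       # average silhouette coefficient

best_k = max(scores, key=scores.get)              # k with maximal average silhouette
```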

  22. Hierarchical clustering
     Setting
     - Clustering does not need to be flat
     - The natural grouping of data is often hierarchical (e.g. biological taxonomy, topic taxonomy, etc.)
     - A hierarchy of clusters can be built on the examples
     - Top-down approach: start from a single cluster with all examples; recursively split clusters into subclusters
     - Bottom-up approach: start with n clusters of individual examples (singletons); recursively aggregate pairs of clusters

  23. Dendrograms

  24. Agglomerative hierarchical clustering
     Algorithm
     1. Initialize:
        - final cluster number k (e.g. k=1)
        - initial cluster number $\hat{k} = n$
        - initial clusters $\mathcal{D}_i = \{x_i\},\; i \in 1, \ldots, n$
     2. While $\hat{k} > k$:
        1. find the pairwise nearest clusters $\mathcal{D}_i$, $\mathcal{D}_j$
        2. merge $\mathcal{D}_i$ and $\mathcal{D}_j$
        3. update $\hat{k} = \hat{k} - 1$
     Note
     The stopping criterion can also be a threshold on pairwise similarity
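A usage sketch with SciPy's agglomerative clustering routines; the linkage method names correspond to the cluster similarity measures on the next slide (the data and the cut at 3 clusters are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(50, 2)                 # placeholder data
# 'single' = nearest neighbour, 'complete' = farthest neighbour,
# 'average' = average distance, 'centroid' = distance between means
Z = linkage(X, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the hierarchy into 3 clusters
# dendrogram(Z) would plot the hierarchy (requires matplotlib)
```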

  25. Measuring cluster similarities
     Similarity measures
     - Nearest neighbour:
       $d_{min}(\mathcal{D}_i, \mathcal{D}_j) = \min_{x \in \mathcal{D}_i,\, x' \in \mathcal{D}_j} \lVert x - x' \rVert$
     - Farthest neighbour:
       $d_{max}(\mathcal{D}_i, \mathcal{D}_j) = \max_{x \in \mathcal{D}_i,\, x' \in \mathcal{D}_j} \lVert x - x' \rVert$
     - Average distance:
       $d_{avg}(\mathcal{D}_i, \mathcal{D}_j) = \frac{1}{n_i n_j} \sum_{x \in \mathcal{D}_i} \sum_{x' \in \mathcal{D}_j} \lVert x - x' \rVert$
     - Distance between means:
       $d_{mean}(\mathcal{D}_i, \mathcal{D}_j) = \lVert \mu_i - \mu_j \rVert$
     $d_{min}$ and $d_{max}$ are more sensitive to outliers
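The four measures, computed directly for two clusters given as NumPy arrays (a sketch, not from the slides):

```python
import numpy as np

def cluster_distances(Di, Dj):
    """Inter-cluster distances for clusters given as (n_i, d) and (n_j, d) arrays."""
    pairwise = np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)
    return {
        "d_min": pairwise.min(),                                      # nearest neighbour
        "d_max": pairwise.max(),                                      # farthest neighbour
        "d_avg": pairwise.mean(),                                     # average distance
        "d_mean": np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0)),  # distance between means
    }
```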
