Learning From Data, Lecture 19
A Peek At Unsupervised Learning: k-Means Clustering, Probability Density Estimation, Gaussian Mixture Models
M. Magdon-Ismail, CSCI 4100/6100
Recap: Radial Basis Functions

Nonparametric RBF (a bump on every data point x_n):
g(x) = \frac{\sum_{n=1}^{N} \alpha_n(x)\, y_n}{\sum_{m=1}^{N} \alpha_m(x)}, \qquad \alpha_n(x) = \phi\!\left(\frac{\|x - x_n\|}{r}\right)

Parametric k-RBF-Network (a bump on each center \mu_j):
h(x) = w_0 + \sum_{j=1}^{k} w_j\, \phi\!\left(\frac{\|x - \mu_j\|}{r}\right) = w^t \Phi(x)

The nonparametric form needs no training; the parametric form is a linear model once the \mu_j are given. Choose the \mu_j as the centers of k clusters of the data.

[Figures: nonparametric RBF fit with r = 0.05; k-RBF-network fits with k = 4, r = 1 and with k = 10, regularized.]
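A minimal NumPy sketch of the two recap formulas, assuming a Gaussian bump \phi(z) = e^{-z^2/2} (any normalization constant is absorbed into the weights); the function and variable names (rbf_nonparametric, rbf_network, centers) are illustrative, not from the lecture.

import numpy as np

def gaussian_bump(z):
    """phi(z) = exp(-z^2 / 2), the bump used in both RBF forms."""
    return np.exp(-0.5 * z ** 2)

def rbf_nonparametric(x, X, y, r):
    """Nonparametric RBF: g(x) = sum_n alpha_n(x) y_n / sum_m alpha_m(x),
    with alpha_n(x) = phi(||x - x_n|| / r), one bump per data point."""
    alpha = gaussian_bump(np.linalg.norm(X - x, axis=1) / r)
    return alpha @ y / alpha.sum()

def rbf_network(x, w0, w, centers, r):
    """Parametric k-RBF-network: h(x) = w0 + sum_j w_j phi(||x - mu_j|| / r),
    a linear model once the centers mu_j are fixed."""
    return w0 + w @ gaussian_bump(np.linalg.norm(centers - x, axis=1) / r)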
Unsupervised Learning

• Preprocessor to organize the data for supervised learning:
  - organize data for faster nearest-neighbor search;
  - determine centers for RBF bumps.
• Important to be able to organize the data to identify patterns: learn the patterns in the data (e.g. the patterns in a language) before getting into a supervised setting.

(amazon.com organizes books into categories.)
Clustering Digits

[Figures: the digits data plotted by average intensity and symmetry; left: the 21-NN rule with 10 classes; right: a 10-clustering of the data.]
Clustering

A cluster is a collection of points S. A k-clustering is a partition of the data into k clusters S_1, ..., S_k:
\cup_{j=1}^{k} S_j = D, \qquad S_i \cap S_j = \emptyset \ \text{for}\ i \neq j.
Each cluster has a center \mu_j.
How Good Is a Clustering?

Points in a cluster should be similar (close to each other, and to the center).

Error in cluster j:
E_j = \sum_{x_n \in S_j} \|x_n - \mu_j\|^2.

k-Means clustering error:
E_{in}(S_1, ..., S_k; \mu_1, ..., \mu_k) = \sum_{j=1}^{k} E_j = \sum_{n=1}^{N} \|x_n - \mu(x_n)\|^2,
where \mu(x_n) is the center of the cluster to which x_n belongs.
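A minimal sketch of how this error might be computed with NumPy, assuming the data points are the rows of X, the centers are the rows of centers, and labels[n] is the index of the cluster containing x_n (all names are illustrative):

import numpy as np

def kmeans_error(X, centers, labels):
    """E_in = sum_n ||x_n - mu(x_n)||^2, where mu(x_n) = centers[labels[n]]
    is the center of the cluster to which x_n belongs."""
    return float(np.sum((X - centers[labels]) ** 2))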
k-Means Clustering

You get to pick S_1, ..., S_k and \mu_1, ..., \mu_k to minimize E_{in}(S_1, ..., S_k; \mu_1, ..., \mu_k).

If the centers \mu_j are known, picking the sets is easy: add to S_j all points closest to \mu_j.

If the clusters S_j are known, picking the centers is easy: the center \mu_j is the centroid of cluster S_j,
\mu_j = \frac{1}{|S_j|} \sum_{x_n \in S_j} x_n.
Lloyd's Algorithm for k-Means Clustering

E_{in}(S_1, ..., S_k; \mu_1, ..., \mu_k) = \sum_{n=1}^{N} \|x_n - \mu(x_n)\|^2

1: Initialize: pick well-separated centers \mu_j.
2: Update S_j to be all points closest to \mu_j: S_j \leftarrow \{x_n : \|x_n - \mu_j\| \leq \|x_n - \mu_\ell\| \ \text{for}\ \ell = 1, ..., k\}.
3: Update \mu_j to the centroid of S_j: \mu_j \leftarrow \frac{1}{|S_j|} \sum_{x_n \in S_j} x_n.
4: Repeat steps 2 and 3 until E_{in} stops decreasing.
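A minimal NumPy sketch of Lloyd's algorithm as stated above, assuming the data points are the rows of X; the names lloyd_kmeans and labels are illustrative, and random initialization stands in for the 'well separated centers' heuristic of step 1.

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate the cluster update and the center update
    until the in-sample error E_in stops decreasing."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers (here: k distinct data points chosen at random).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_error = np.inf
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
        # Step 4: stop once E_in no longer decreases.
        error = np.sum((X - centers[labels]) ** 2)
        if error >= prev_error:
            break
        prev_error = error
    return centers, labels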
Application to the k-RBF-Network

[Figures: a 10-center RBF-network and a 300-center RBF-network on the digits data.]

Choosing k: knowledge of the problem (10 digits) or cross-validation.
Probability Density Estimation P(x)

P(x) measures how likely it is to generate inputs similar to x. Estimating P(x) results in a 'softer/finer' representation than clustering: clusters are regions of high probability.
Parzen Windows: RBF Density Estimation

Basic idea: put a bump of 'size' (volume) 1/N on each data point:
\hat{P}(x) = \frac{1}{N r^d} \sum_{i=1}^{N} \phi\!\left(\frac{\|x - x_i\|}{r}\right), \qquad \phi(z) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2} z^2}.

[Figure: the estimate \hat{P}(x) as a sum of bumps placed on the data points.]
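A minimal sketch of this Parzen-window estimate in NumPy, assuming the data points are the rows of X and x is a single query point (the function name parzen_density is illustrative):

import numpy as np

def parzen_density(x, X, r):
    """P_hat(x) = (1 / (N r^d)) * sum_i phi(||x - x_i|| / r), with the
    spherical Gaussian kernel phi(z) = (2*pi)^(-d/2) * exp(-z^2 / 2)."""
    N, d = X.shape
    z = np.linalg.norm(X - x, axis=1) / r                  # ||x - x_i|| / r for each x_i
    phi = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * z ** 2)
    return phi.sum() / (N * r ** d)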
Digits Data

[Figures: the RBF (Parzen window) density estimate for the digits data, and its density contours.]
The Gaussian Mixture Model (GMM)

Instead of N bumps, use k ≪ N bumps (similar to going from the nonparametric RBF to the parametric k-RBF-network).
Instead of uniform spherical bumps, each bump has its own shape.

Bump centers: \mu_1, ..., \mu_k. Bump shapes: \Sigma_1, ..., \Sigma_k.

Gaussian formula for the bump:
N(x; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} e^{-\frac{1}{2}(x - \mu_j)^t \Sigma_j^{-1}(x - \mu_j)}.
GMM Density Estimate

\hat{P}(x) = \sum_{j=1}^{k} w_j\, N(x; \mu_j, \Sigma_j) \quad \text{(a sum of } k \text{ weighted bumps)}, \qquad w_j > 0, \quad \sum_{j=1}^{k} w_j = 1.

You get to pick \{w_j, \mu_j, \Sigma_j\}_{j=1,...,k}.
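A minimal NumPy sketch that evaluates this mixture density at a single point x; the names gmm_density, weights, mus, and Sigmas are illustrative, not from the lecture:

import numpy as np

def gmm_density(x, weights, mus, Sigmas):
    """P_hat(x) = sum_j w_j * N(x; mu_j, Sigma_j): a sum of k weighted Gaussian bumps."""
    d = len(x)
    total = 0.0
    for w_j, mu_j, Sigma_j in zip(weights, mus, Sigmas):
        diff = x - mu_j
        quad = diff @ np.linalg.solve(Sigma_j, diff)       # (x - mu)^t Sigma^{-1} (x - mu)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma_j))
        total += w_j * np.exp(-0.5 * quad) / norm
    return total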
Maximum Likelihood Estimation

Pick \{w_j, \mu_j, \Sigma_j\}_{j=1,...,k} to best explain the data: maximize the likelihood of the data given \{w_j, \mu_j, \Sigma_j\}_{j=1,...,k}.

(We saw this idea when we derived the cross-entropy error for logistic regression.)
Expectation-Maximization: The E-M Algorithm

A simple algorithm to get to a local maximum of the likelihood.
Partition the variables into two sets; given one set, you can estimate the other.
'Bootstrap' your way to a decent solution.
Lloyd's algorithm for k-means is an example, for 'hard clustering'.
Bump Memberships

\gamma_{nj}: the fraction of x_n belonging to bump j (a 'hidden variable').

Given the memberships, the parameters are:
N_j = \sum_{n=1}^{N} \gamma_{nj} \quad \text{('number' of points in bump } j)
w_j = \frac{N_j}{N} \quad \text{(probability of bump } j)
\mu_j = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_{nj}\, x_n \quad \text{(centroid of bump } j)
\Sigma_j = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_{nj}\, x_n x_n^t - \mu_j \mu_j^t \quad \text{(covariance matrix of bump } j)
Re-Estimating Bump Memberships

\gamma_{nj} = \frac{w_j\, N(x_n; \mu_j, \Sigma_j)}{\sum_{\ell=1}^{k} w_\ell\, N(x_n; \mu_\ell, \Sigma_\ell)}

\gamma_{nj} is the probability that x_n came from bump j:
probability of bump j: w_j;
probability density of x_n given bump j: N(x_n; \mu_j, \Sigma_j).
E-M Algorithm

E-M Algorithm for GMMs:
1: Start with estimates for the bump memberships \gamma_{nj}.
2: Estimate w_j, \mu_j, \Sigma_j given the bump memberships.
3: Update the bump memberships given w_j, \mu_j, \Sigma_j.
4: Iterate to step 2 until convergence.
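A minimal sketch of this E-M loop for a GMM, assuming the data points are the rows of X; it uses scipy.stats.multivariate_normal for the Gaussian density, adds a tiny ridge to each covariance for numerical stability, and the names em_gmm and gamma are illustrative:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """E-M for a Gaussian mixture: start with memberships gamma_nj, then
    alternate estimating (w_j, mu_j, Sigma_j) from gamma and re-estimating
    gamma from (w_j, mu_j, Sigma_j)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 1: start with random membership estimates, each row summing to 1.
    gamma = rng.random((N, k))
    gamma /= gamma.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 2 (M-step): w_j, mu_j, Sigma_j from the memberships.
        Nj = gamma.sum(axis=0)                       # 'number' of points in each bump
        w = Nj / N                                   # bump probabilities
        mus = (gamma.T @ X) / Nj[:, None]            # bump centroids
        Sigmas = [((gamma[:, j, None] * X).T @ X) / Nj[j]
                  - np.outer(mus[j], mus[j])
                  + 1e-6 * np.eye(d)                 # tiny ridge for stability
                  for j in range(k)]
        # Step 3 (E-step): gamma_nj proportional to w_j * N(x_n; mu_j, Sigma_j).
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                                for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
    # Step 4: in practice, iterate until the likelihood converges.
    return w, mus, Sigmas, gamma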
GMM on Digits Data

[Figures: a 10-center GMM density estimate for the digits data, and its density contours.]