Clustering
Léon Bottou
NEC Labs America
COS 424 – 3/4/2010

Agenda
Goals
– Classification, clustering, regression, other.
Representation
– Parametric vs. kernels vs. nonparametric
– Probabilistic vs. nonprobabilistic
– Linear vs. nonlinear
– Deep vs. shallow
Capacity Control
– Explicit: architecture, feature selection
– Explicit: regularization, priors
– Implicit: approximate optimization
– Implicit: Bayesian averaging, ensembles
Operational Considerations
– Loss functions
– Budget constraints
– Online vs. offline
Computational Considerations
– Exact algorithms for small datasets.
– Stochastic algorithms for big datasets.
– Parallel algorithms.

Introduction
Clustering
– Assigning observations into subsets with similar characteristics.
Applications
– medicine, biology
– market research, data mining
– image segmentation
– search results
– topics, taxonomies
– communities
Why is clustering so attractive?
– An embodiment of Descartes’ philosophy, “Discourse on the Method of Rightly Conducting One’s Reason”: “... divide each of the difficulties under examination ... as might be necessary for its adequate solution.”

Summary
1. What is a cluster?
2. K-Means
3. Hierarchical clustering
4. Simple Gaussian mixtures

What is a cluster?
Two neatly separated classes leave a trace in $P\{X\}$.

Input space transformations
The choice of input space is often arbitrary.
For instance: camera pixels versus retina pixels.
What happens if we apply a reversible transformation to the inputs?

Input space transformations
– The Bayes optimal decision boundary moves with the transformation.
– The Bayes optimal error rate is unchanged.
– The neatly separated clusters are gone!
Clustering depends on the arbitrary definition of the input space!
This is very different from classification, regression, etc.

K-Means
The K-Means problem
– Given observations $x_1 \dots x_n$, determine $K$ centroids $w_1 \dots w_K$ that minimize the distortion
  $C(w) = \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2$.
Interpretation
– Minimize the discretization error.
Properties
– Non-convex objective.
– Finding the global minimum is NP-hard in general.
– Finding acceptable local minima is surprisingly easy.
– Initialization dependent.

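As a small worked example of the distortion defined above, the following sketch evaluates $C(w)$ in plain Python. The function name `distortion` and the toy data are illustrative assumptions, not part of the lecture.

```python
def distortion(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    total = 0.0
    for x in points:
        total += min(
            sum((xj - wj) ** 2 for xj, wj in zip(x, w))
            for w in centroids
        )
    return total

# Hypothetical toy data: four points, two centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1 from its nearest centroid.
print(distortion(points, centroids))  # 4.0
```

With a single centroid at (5, 1) the same data gives a distortion of 104, which shows how much the second centroid reduces the discretization error.
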
Offline K-Means
Lloyd’s algorithm
initialize centroids $w_k$
repeat
  - assign points to classes:
      $\forall i,\ s_i \leftarrow \arg\min_k \|x_i - w_k\|^2$, and $S_k \leftarrow \{ i : s_i = k \}$.
  - recompute centroids:
      $\forall k,\ w_k \leftarrow \arg\min_w \sum_{i \in S_k} \|x_i - w\|^2 = \frac{1}{\mathrm{card}(S_k)} \sum_{i \in S_k} x_i$.
until convergence.

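The two alternating steps of Lloyd's algorithm can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: centroids are initialized from randomly chosen data points, an empty cluster keeps its previous centroid, and the names `lloyd` and `sq_dist` are made up for this example.

```python
import random

def sq_dist(x, w):
    """Squared Euclidean distance between two points."""
    return sum((xj - wj) ** 2 for xj, wj in zip(x, w))

def lloyd(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initialize with k randomly chosen observations.
    centroids = rng.sample(points, k)
    assign = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j]))
            for x in points
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        # Step 2: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [x for x, s in zip(points, assign) if s == j]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign

# Hypothetical toy data: two well-separated groups of three points.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cents, labels = lloyd(pts, 2)
print(sorted(cents), labels)
```

On data this well separated the result does not depend on the initialization; on harder data, different seeds can end in different local minima.
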
Lloyd’s algorithm – Illustration
Initial state:
– Squares = data points.
– Circles = centroids.

Lloyd’s algorithm – Illustration
1. Assign data points to clusters.

Lloyd’s algorithm – Illustration
2. Recompute centroids.

Lloyd’s algorithm – Illustration
Assign data points to clusters...

Why does Lloyd’s algorithm work?
Consider an arbitrary cluster assignment $s_i$.
$$C(w) = \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2 = \underbrace{\sum_{i=1}^{n} \|x_i - w_{s_i}\|^2}_{L(s,w)} \;-\; \underbrace{\left( \sum_{i=1}^{n} \|x_i - w_{s_i}\|^2 - \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2 \right)}_{D(s,w)\,\ge\,0}$$
[Figure: diagram with quantities labeled L and D.]

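One way to read this decomposition is as a short monotonicity argument. The derivation below is a sketch; the primed symbols $s'$ and $w'$ (the assignment and centroids after one iteration) and the dummy variable $v$ are notation introduced here, not taken from the slide.

```latex
\begin{align*}
\text{Assignment step: } & s'_i = \arg\min_k \|x_i - w_k\|^2
  \;\Rightarrow\; D(s', w) = 0 \text{ and } C(w) = L(s', w). \\
\text{Centroid step: } & w' = \arg\min_{v} L(s', v)
  \;\Rightarrow\; L(s', w') \le L(s', w). \\
\text{Combined: } & C(w') = L(s', w') - D(s', w') \le L(s', w') \le L(s', w) = C(w).
\end{align*}
```

Under this reading, the distortion never increases from one iteration to the next.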
Online K-Means
MacQueen’s algorithm
initialize centroids $w_k$ and $n_k = 0$.
repeat
  - pick an observation $x_t$ and determine its cluster
      $s_t = \arg\min_k \|x_t - w_k\|^2$.
  - update centroid $s_t$:
      $n_{s_t} \leftarrow n_{s_t} + 1$, then $w_{s_t} \leftarrow w_{s_t} + \frac{1}{n_{s_t}} \left( x_t - w_{s_t} \right)$.
until satisfaction.
Comments
– MacQueen’s algorithm finds decent clusters much faster.
– Final convergence could be slow. Do we really care?
– Just perform one or two passes over the randomly shuffled observations.

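The online update can be sketched in plain Python as below. The function name `macqueen`, its parameters, and the toy data are assumptions made for this illustration; the update itself follows the count-and-average rule stated above.

```python
import random

def macqueen(points, init_centroids, passes=1, seed=0):
    rng = random.Random(seed)
    centroids = [list(w) for w in init_centroids]
    counts = [0] * len(centroids)          # n_k = 0 for every centroid
    order = list(points)
    for _ in range(passes):
        rng.shuffle(order)                 # visit observations in random order
        for x in order:
            # Determine the nearest centroid s_t.
            s = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
            )
            # n_s <- n_s + 1, then w_s <- w_s + (x - w_s) / n_s.
            counts[s] += 1
            centroids[s] = [w + (a - w) / counts[s] for a, w in zip(x, centroids[s])]
    return centroids, counts

# Hypothetical toy data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
online_cents, online_counts = macqueen(data, [(0.0, 0.0), (10.0, 10.0)])
print(online_cents, online_counts)
```

Because the first update of each centroid uses a step of 1, the initial centroid value is overwritten by the first observation assigned to it, and the centroid then tracks the running mean of its assigned observations.
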
Why does MacQueen’s algorithm work?
Explanation 1: Recursive averages.
– Let $u_n = \frac{1}{n} \sum_{i=1}^{n} x_i$. Then $u_n = u_{n-1} + \frac{1}{n} (x_n - u_{n-1})$.
Explanation 2: Stochastic gradient.
– Apply stochastic gradient to $C(w) = \frac{1}{2n} \sum_{i=1}^{n} \min_k \|x_i - w_k\|^2$:
  $w_{s_t} \leftarrow w_{s_t} + \gamma_t \left( x_t - w_{s_t} \right)$
Explanation 3: Stochastic gradient + Newton.
– The Hessian $H$ of $C(w)$ is diagonal and contains the fraction of observations assigned to each cluster.
  $w_{s_t} \leftarrow w_{s_t} + \frac{1}{t} H^{-1} \left( x_t - w_{s_t} \right) = w_{s_t} + \frac{1}{n_{s_t}} \left( x_t - w_{s_t} \right)$

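The algebra behind the first two explanations can be written out as follows. This is a sketch: the first line expands the recursive average, and the second gives the gradient of the single-observation term of $C(w)$ with respect to the selected centroid.

```latex
\begin{align*}
u_n &= \frac{1}{n}\sum_{i=1}^{n} x_i
     = \frac{1}{n}\Bigl( (n-1)\,u_{n-1} + x_n \Bigr)
     = u_{n-1} + \frac{1}{n}\bigl( x_n - u_{n-1} \bigr), \\
\frac{\partial}{\partial w_{s_t}} \, \frac{1}{2}\,\|x_t - w_{s_t}\|^2
    &= -\bigl( x_t - w_{s_t} \bigr).
\end{align*}
```

Stepping against this gradient with gain $\gamma_t = 1/n_{s_t}$ gives the same update as the recursive average applied to the observations assigned to cluster $s_t$.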
Example: Color quantization of images
Problem
– Convert a 24-bit RGB image into an indexed image with a palette of $K$ colors.
Solution
– The $(r, g, b)$ values of the pixels are the observations $x_i$.
– The $(r, g, b)$ values of the $K$ palette colors are the centroids $w_k$.
– Initialize the $w_k$ with an all-purpose palette.
– Alternatively, initialize the $w_k$ with the colors of random pixels.
– Perform one pass of MacQueen’s algorithm.
– Eliminate centroids with no observations.
– You are done.

How many clusters?
Rules of thumb?
– $K = 10$, $K = \sqrt{n}$, ...
The Elbow method?
– Measure the distortion on a validation set.
– The distortion decreases when $K$ increases.
– Sometimes there is no elbow, or there are several elbows.
– Local minima mess up the picture.
Rate-distortion
– Each additional cluster reduces the distortion.
– Cost of additional cluster vs. cost of distortion.
– Just another way to select $K$.
Conclusion
– Clustering is a very subjective matter.

Hierarchical clustering
Agglomerative clustering
– Initialization: each observation is its own cluster.
– Repeatedly merge the closest clusters, using one of:
  – single linkage: $D(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
  – complete linkage: $D(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
  – distortion estimates, etc.
Divisive clustering
– Initialization: one cluster contains all observations.
– Repeatedly divide the largest cluster, e.g. with 2-Means.
– Lots of variants.

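The two linkage criteria and the merge loop can be sketched as below. This is an illustrative, quadratic-time version that assumes Euclidean distance (`math.dist`) for $d(x, y)$; the function names and the stopping rule (a target number of clusters) are assumptions of this example.

```python
import math

def single_linkage(A, B, d=math.dist):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(x, y) for x in A for y in B)

def complete_linkage(A, B, d=math.dist):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(d(x, y) for x in A for y in B)

def agglomerate(points, n_clusters, linkage=single_linkage):
    # Initialization: each observation is its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters.pop(j)       # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

# Hypothetical toy data on a line.
line_pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(agglomerate(line_pts, 2))
print(single_linkage(line_pts[:2], line_pts[2:]),
      complete_linkage(line_pts[:2], line_pts[2:]))
```

Recording the sequence of merges and their distances, rather than stopping at a fixed number of clusters, is what produces a dendrogram.
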
K-Means plus Agglomerative Clustering
Algorithm
– Run K-Means with a large $K$.
– Count the number of observations in each cluster.
– Merge the closest clusters according to the following metric.
Let $A$ be a cluster with $n_A$ members and centroid $w_A$.
Let $B$ be a cluster with $n_B$ members and centroid $w_B$.
The putative center of $A \cup B$ is $w_{AB} = (n_A w_A + n_B w_B) / (n_A + n_B)$.
Quick estimate of the distortion increase:
$$d(A, B) = \sum_{x \in A \cup B} \|x - w_{AB}\|^2 - \sum_{x \in A} \|x - w_A\|^2 - \sum_{x \in B} \|x - w_B\|^2 = n_A \|w_A - w_{AB}\|^2 + n_B \|w_B - w_{AB}\|^2$$

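The identity above means the merge cost can be computed from counts and centroids alone, without revisiting the observations. The following sketch checks this numerically on a toy example; all function names and data are assumptions made for illustration.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sq_dist(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))

def sse(points, center):
    """Sum of squared distances from the points to a given center."""
    return sum(sq_dist(x, center) for x in points)

def merge_cost_direct(A, B):
    """Distortion increase computed from the raw observations."""
    w_ab = centroid(A + B)
    return sse(A + B, w_ab) - sse(A, centroid(A)) - sse(B, centroid(B))

def merge_cost_quick(n_a, w_a, n_b, w_b):
    """Same quantity computed from counts and centroids only."""
    w_ab = tuple((n_a * a + n_b * b) / (n_a + n_b) for a, b in zip(w_a, w_b))
    return n_a * sq_dist(w_a, w_ab) + n_b * sq_dist(w_b, w_ab)

# Hypothetical toy clusters.
A = [(0.0, 0.0), (2.0, 0.0)]
B = [(10.0, 0.0)]
direct = merge_cost_direct(A, B)
quick = merge_cost_quick(len(A), centroid(A), len(B), centroid(B))
print(direct, quick)  # both are 54.0 for this data
```
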
Dendrogram

Simple Gaussian mixture (1)
Clustering via density estimation.
– Pick a parametric model $P_\theta(X)$.
– Maximize likelihood.
Pick a parametric model
– There are $K$ components.
– To generate an observation:
  a.) pick a component $k$ with probabilities $\lambda_1 \dots \lambda_K$.
  b.) generate $x$ from component $k$ with probability $\mathcal{N}(\mu_k, \sigma)$.
Notes
– Same standard deviation $\sigma$ (for now).
– That’s why I write “Simple GMM”.

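The two-stage generative process can be sketched as below, using only the standard library. The function name `sample_simple_gmm` and the parameter values are assumptions for this illustration; each coordinate is drawn independently with the shared standard deviation, which matches an isotropic Gaussian.

```python
import random

def sample_simple_gmm(n, lambdas, mus, sigma, seed=0):
    """Draw n (component, observation) pairs from the simple mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # a.) pick a component k with probabilities lambda_1 ... lambda_K
        k = rng.choices(range(len(lambdas)), weights=lambdas)[0]
        # b.) generate x around mu_k with the shared standard deviation sigma
        x = tuple(rng.gauss(m, sigma) for m in mus[k])
        samples.append((k, x))
    return samples

# Hypothetical parameters: two components in two dimensions.
gmm_mus = [(0.0, 0.0), (5.0, 5.0)]
for k, x in sample_simple_gmm(5, [0.3, 0.7], gmm_mus, sigma=1.0):
    print(k, x)
```

In clustering the component index `k` is not observed; it is returned here only to make the generative story visible.
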
Simple Gaussian mixture (2)
Parameters: $\theta = (\lambda_1, \mu_1, \dots, \lambda_K, \mu_K)$
Model: $P_\theta(Y = y) = \lambda_y$, and $P_\theta(X = x \mid Y = y) = \frac{1}{\sigma^d (2\pi)^{d/2}} \, e^{-\frac{1}{2} \frac{\|x - \mu_y\|^2}{\sigma^2}}$.
Likelihood
$$\log L(\theta) = \sum_{i=1}^{n} \log P_\theta(X = x_i) = \sum_{i=1}^{n} \log \sum_{y=1}^{K} P_\theta(Y = y) \, P_\theta(X = x_i \mid Y = y) = \dots$$
Maximize!
– This is non-convex.
– There are $K!$ copies of each optimum (local or global).
– Conjugate gradients or Newton works.

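The log-likelihood above can be evaluated directly, as in the sketch below. The function name and toy values are assumptions for illustration, and the log-sum-exp rearrangement is a numerical-stability choice made here rather than something stated on the slide. It assumes every $\lambda_y > 0$.

```python
import math

def log_likelihood(points, lambdas, mus, sigma):
    """log L(theta) for the simple mixture with shared standard deviation."""
    d = len(points[0])
    # Log of the normalizing constant 1 / (sigma^d (2 pi)^(d/2)).
    log_norm = -d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)
    total = 0.0
    for x in points:
        # log( lambda_y * P(x | y) ) for each component y.
        terms = [
            math.log(lam) + log_norm
            - 0.5 * sum((a - m) ** 2 for a, m in zip(x, mu)) / sigma ** 2
            for lam, mu in zip(lambdas, mus)
        ]
        # Log-sum-exp over components, shifted by the largest term.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

# Hypothetical one-dimensional data and parameters.
obs = [(0.1,), (-0.2,), (4.8,), (5.3,)]
ll = log_likelihood(obs, [0.3, 0.7], [(0.0,), (5.0,)], 1.0)
ll_swapped = log_likelihood(obs, [0.7, 0.3], [(5.0,), (0.0,)], 1.0)
print(ll, ll_swapped)
```

Relabeling the components leaves the value unchanged, which is the source of the $K!$ equivalent copies of every optimum.
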