Clustering methods R.W. Oldford
Interactive data visualization An important advantage of data visualization is that much structure (e.g. density, groupings, regular patterns, relationships, outliers, connections across dimensions, etc.) can easily be seen visually, even though it might be more difficult to describe mathematically. Moreover, the structure observed need not have been anticipated. Interaction that allows fairly arbitrary changes to the plot, via direct manipulation (e.g. mouse gestures) and also via the command line (i.e. programmatically), further enables the analyst: it provides quick and easy data queries, marking of structure, and, when the visualizations are themselves data structures, quick setting and extraction of observed information. Direct interaction amplifies the advantage of data visualization and creates a powerful tool for uncovering structure. In contrast, we might choose to have some statistical algorithm search for structure in the data. This would of course require specifying in advance how that structure might be described mathematically.
Interactive data visualization The two approaches naturally complement one another. ◮ Structure searched for algorithmically must be precisely characterised mathematically, and so is necessarily determined prior to the analysis. ◮ Interactive data visualization depends on the human visual system, which has evolved over millions of years to be able to see patterns, both anticipated and not. In the hands of an experienced analyst, one complements and amplifies the other; the two are worked together to give much greater insight than either approach could alone. We have already seen the value of using both in conjunction with one another in, for example, hypothesis testing, density estimation, and smoothing.
Finding groups in data Oftentimes we observe that the data have grouped together in patterns. In a scatterplot, for example, we might notice that the observations concentrate more in some areas than they do in others. [Figure: scatterplot of the "Old Faithful" geyser data (from the MASS package), duration (x) versus waiting (y), with contours of a kernel density estimate overlaid.] A simple scatterplot with larger point sizes and alpha blending shows ◮ 3 regions of concentration, ◮ 3 vertical lines, ◮ and a few outliers. Contours of constant kernel density estimate show ◮ two modes at right, ◮ a higher mode at left, and ◮ a smooth continuous mathematical function. Perhaps the points could be automatically grouped by using the contours?
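Finding groups in data For reference, a rough sketch (my own plotting choices, not the code behind the original figure) of how such a display could be produced in R from the geyser data, using a kernel density estimate from MASS::kde2d():

library(MASS)    # provides the geyser data and kde2d()

# Scatterplot with larger, alpha-blended points
plot(geyser$duration, geyser$waiting,
     pch = 19, cex = 1.5,
     col = grDevices::adjustcolor("steelblue", alpha.f = 0.4),
     xlab = "duration", ylab = "waiting")

# Overlay contours of constant kernel density estimate
dens <- kde2d(geyser$duration, geyser$waiting, n = 100)
contour(dens, add = TRUE, col = "grey40")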
Finding groups in data - K-means A great many methods exist (and continue to be developed) to automatically find groups in data. These have historically been called clustering methods by data analysts and, more recently, are sometimes called unsupervised learning methods (in the sense that we do not know the "classes" of the observations, as in "supervised learning") by many artificial intelligence researchers. One of the earliest clustering methods is "K-means". In its simplest form, it begins with the knowledge that there are exactly K clusters to be determined. The idea is to identify K clusters, $C_1, \ldots, C_K$, where every multivariate observation $x_i$ for $i = 1, \ldots, n$ in the data set appears in one and only one cluster $C_k$. The clusters are to be chosen so that the total within-cluster spread is as small as possible. For every cluster, the total spread for that cluster is measured by the sum of squared Euclidean distances from the cluster "centroid" $c_k$, namely $$SSE_k = \sum_{i \in C_k} d^2(i, k) = \sum_{i \in C_k} \| x_i - c_k \|^2 .$$ Typically, the cluster average $\bar{x}_k = \sum_{i \in C_k} x_i / n_k$ (where $n_k$ denotes the cardinality of cluster $C_k$) is chosen as the cluster centroid (i.e. choose $c_k = \bar{x}_k$). The K clusters are chosen to minimize $\sum_{k=1}^{K} SSE_k$. Algorithms typically begin with "seed" centroids $c_1, \ldots, c_K$, possibly randomly chosen, then assign every observation to its nearest centroid. Each centroid is then recalculated from the values of $x_i \;\forall i \in C_k$ (e.g. $c_k = \bar{x}_k$), and the data are reassigned to the new centroids. Repeat until there is no change in the clustering.
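Finding groups in data - K-means To make the iteration just described concrete, here is a minimal sketch of that seed/assign/recompute loop (my own function and variable names; this is not the algorithm used internally by R's kmeans()):

# A bare-bones K-means: seed centroids, assign, recompute, repeat
simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  centroids <- x[sample(n, K), , drop = FALSE]    # "seed" centroids: K random observations
  cluster   <- integer(n)
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance of every observation to every centroid
    d2 <- sapply(seq_len(K),
                 function(k) colSums((t(x) - centroids[k, ])^2))
    new_cluster <- max.col(-d2)                   # nearest centroid for each observation
    if (identical(new_cluster, cluster)) break    # no change in the clustering => stop
    cluster <- new_cluster
    # Recompute each centroid as the average of its cluster
    # (a production implementation would also guard against empty clusters)
    for (k in seq_len(K))
      centroids[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}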
Finding groups in data - K-means There are several implementations of K-means in R. Several of these are available through the base R function kmeans().

library(MASS)   # for the geyser data
# First scale the data
data <- as.data.frame(scale(geyser[, c("duration", "waiting")]))
result <- kmeans(data, centers = 3)
str(result)

## List of 9
## $ cluster : Named int [1:299] 3 2 1 3 3 2 1 3 2 1 ...
## ..- attr(*, "names")= chr [1:299] "1" "2" "3" "4" ...
## $ centers : num [1:3, 1:2] 0.844 -1.273 0.56 -1.242 0.787 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:2] "duration" "waiting"
## $ totss : num 596
## $ withinss : num [1:3] 25.6 31.8 23.7
## $ tot.withinss: num 81.2
## $ betweenss : num 515
## $ size : int [1:3] 101 107 91
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"

The cluster component identifies which of the three clusters the corresponding observation has been assigned to.
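Finding groups in data - K-means Because the data were scaled first, result$centers is in standardized units. A small sketch (my own variable names) of how the cluster centres could be mapped back to the original units, using the attributes stored by scale():

library(MASS)                                    # for the geyser data
sc <- scale(geyser[, c("duration", "waiting")])  # keeps the scaling attributes
km <- kmeans(as.data.frame(sc), centers = 3)

centre <- attr(sc, "scaled:center")              # column means
spread <- attr(sc, "scaled:scale")               # column standard deviations

# Undo the scaling column-wise: multiply by the sd, then add back the mean
sweep(sweep(km$centers, 2, spread, "*"), 2, centre, "+")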
Finding groups in data - K-means Plotting this information in loon:

library(loon)
## Loading required package: tcltk
p <- l_plot(data, linkingGroup = "geyser", showGuides = TRUE)
# Add the density contours
l_layer_contourLines(p, kde2d(data$duration, data$waiting, n = 100), color = "grey")
## loon layer "lines" of type lines of plot .l0.plot
## [1] "layer0"
# Colour the clusters
p['color'] <- result$cluster
Finding groups in data - K-means Plotting this information in loon, the result looks pretty good.
Finding groups in data - K-means Had we selected only K = 2: a clustering we might not completely agree with.
Finding groups in data - K-means How about K = 4? Again, we might or might not agree with the result.
Finding groups in data - K-means How about K = 5? Again, we might or might not agree with the result.
Finding groups in data - K-means How about K = 6? Again, we might or might not agree with the result.
Finding groups in data - K-means Let's try K = 6 again: the result is different!
Finding groups in data - K-means Some comments and questions: ◮ K-means depends on the total squared Euclidean distance to the centroids ◮ K-means implicitly presumes that the clusters will be "globular" or "spherical" ◮ different clusters might arise on different calls (random starting positions for the centroids) ◮ how do we choose K? (one common heuristic is sketched below) ◮ should we enforce a hierarchy on the clusters? ◮ with an interactive visualization, we should be able to readjust the clusters by changing their colours
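Finding groups in data - K-means One common heuristic for choosing K (referred to in the list above) is to plot the total within-cluster sum of squares against K and look for an "elbow". A sketch, assuming the scaled data and the kmeans() call from the earlier example:

set.seed(314159)   # arbitrary seed, so repeated runs give the same clusterings
withinss <- sapply(1:10, function(K)
  kmeans(data, centers = K, nstart = 20)$tot.withinss)
plot(1:10, withinss, type = "b",
     xlab = "K", ylab = "total within-cluster sum of squares")

Using nstart > 1 also addresses the dependence on the random starting centroids: kmeans() is run from several random starts and the best clustering is kept.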
Finding groups in data - model-based clustering A related approach to K-means, but one that is much more general and which comes from a different reasoning base, is that of so-called "model-based clustering". Here, the main idea is that the data $x_i$ are a sample of independently and identically distributed (iid) multivariate observations from some multivariate mixture distribution. That is, $$X_1, \ldots, X_n \sim f_p(x; \Theta)$$ where $f_p(x; \Theta)$ is a $p$-variate continuous density, parameterized by some collection of parameters $\Theta$, that can be expressed as a finite mixture of individual $p$-variate densities $g_p(\cdot)$: $$f_p(x; \Theta) = \sum_{k=1}^{K} \alpha_k \, g_p(x; \theta_k).$$ Here $\alpha_k \geq 0$, $\sum_{k=1}^{K} \alpha_k = 1$, and the individual densities $g_p(x; \theta_k)$ are of known shape and are identical up to differences given by their individual parameter vectors $\theta_k$. Neither $\alpha_k$ nor $\theta_k$ are known for any $k = 1, \ldots, K$ and must be estimated from the observations (i.e. $\Theta = \{\alpha_1, \ldots, \alpha_K, \theta_1, \ldots, \theta_K\}$). Typically the $g_p(x; \theta_k)$ are taken to be multivariate Gaussian densities of the form $g_p(x; \theta_k) = \phi_p(x; \mu_k, \Sigma_k)$ with $$\phi_p(x; \mu_k, \Sigma_k) = (2\pi)^{-\frac{p}{2}} \, |\Sigma_k|^{-\frac{1}{2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$$ and $\theta_k = (\mu_k, \Sigma_k)$.
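Finding groups in data - model-based clustering As an illustration of the mixture model (and of the latent component labels used on the next slide), here is a small sketch simulating data from a two-component Gaussian mixture; the parameter values are my own arbitrary choices:

library(MASS)   # for mvrnorm()
set.seed(1)
n     <- 300
alpha <- c(0.6, 0.4)                                  # mixture weights
mu    <- list(c(0, 0), c(3, 3))                       # component means
Sigma <- list(diag(2), matrix(c(1, 0.5, 0.5, 1), 2))  # component covariances

z <- sample(1:2, n, replace = TRUE, prob = alpha)     # latent component labels z_i
x <- t(sapply(z, function(k) mvrnorm(1, mu[[k]], Sigma[[k]])))
plot(x, col = z, xlab = "x1", ylab = "x2")            # colour by the true component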
Finding groups in data - model-based clustering We can imagine fitting the mixture model to the data for fixed K and $\alpha_k > 0 \;\forall k$ via, say, maximum likelihood. This can be accomplished by introducing latent variates $z_{ik}$ which are 1 when $x_i$ came from the $k$th mixture component and zero otherwise. Suffice it to say that the $z_{ik}$s are treated as "missing" and that an "EM" or "Expectation-Maximization" algorithm is then used to perform maximum likelihood estimation on the finite mixture. The parameters $\mu_k$ and $\Sigma_k$ can also be constrained (the eigendecomposition $\Sigma_k = O D O^T$, with $O$ orthogonal and $D$ diagonal, is useful in this) to restrict the problem further. An information criterion like the "Bayesian Information Criterion", or "BIC", is used to compare values of K across models. This adds a negative penalty to the log-likelihood that penalizes larger models. Look for models that have high information (as measured by BIC). (Note that some writers (e.g. Wikipedia) use minus this and hence minimize their objective function.) Note that by using a Gaussian model the clusters are inherently taken to be elliptically shaped. An implementation of model-based clustering can be found in the R package mclust, as sketched below.
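Finding groups in data - model-based clustering A sketch of what a call to mclust might look like for the geyser data (again assuming the scaled data from the kmeans() example; the range of G is my own choice):

library(mclust)

fit <- Mclust(data, G = 1:6)        # fit Gaussian mixtures with 1 to 6 components;
                                    # the covariance constraints and G are chosen by BIC
summary(fit)
plot(fit, what = "BIC")             # compare BIC across models and numbers of components
plot(fit, what = "classification")  # display the chosen clustering

head(fit$classification)            # cluster memberships, comparable to result$cluster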