Clustering methods


  1. Clustering methods R.W. Oldford

  2. Interactive data visualization An important advantage of data visualization is that much structure (e.g. density, groupings, regular patterns, relationships, outliers, connections across dimensions, etc.) can be easily seen visually, even though it might be more difficult to describe mathematically. Moreover, the structure observed need not have been anticipated. Interaction, which allows fairly arbitrary changes to the plot both via direct manipulation (e.g. mouse gestures) and via the command line (i.e. programmatically), further enables the analyst, providing quick and easy data queries, marking of structure, and, when the visualizations are themselves data structures, quick setting and extraction of observed information. Direct interaction amplifies the advantage of data visualization and creates a powerful tool for uncovering structure. In contrast, we might choose to have some statistical algorithm search for structure in the data. This would of course require specifying in advance how that structure might be described mathematically.

  3. Interactive data visualization The two approaches naturally complement one another.
◮ Structure searched for algorithmically must be precisely characterised mathematically, and so is necessarily determined prior to the analysis.
◮ Interactive data visualization depends on the human visual system, which has evolved over millions of years to be able to see patterns, both anticipated and not.
In the hands of an experienced analyst, one complements and amplifies the other; the two are worked together to give much greater insight than either approach could alone. We have already seen the value of using both in conjunction with one another in, for example, hypothesis testing, density estimation, and smoothing.

  4. Finding groups in data Consider the "Old Faithful" geyser data (from the MASS package), centred and scaled as follows.

library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
##     select

xrange <- diff(range(geyser$duration))
yrange <- diff(range(geyser$waiting))
data <- as.data.frame(scale(geyser[, c("duration", "waiting")],
                            scale = c(xrange, yrange)))

data is now centred at the average in each direction and scaled so that the ranges of the two directions are identical. We do this so that, when we consider the clustering methods, they will work on data, and visual distances observed on any (square) scatterplot will correspond to Euclidean distances in the space of measurements (which any clustering method would use).
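As a quick check (not part of the original slides), the columns of data should now each have mean zero and range exactly one:

# Not from the slides: verify the centring and scaling of `data`
colMeans(data)                               # both (approximately) zero
sapply(data, function(x) diff(range(x)))     # both exactly 1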

  5. Finding groups in data Oftentimes we observe that the data have grouped together in patterns. In a scatterplot, for example, we might notice that the observations concentrate more in some areas than they do in others.
[Figure: scatterplot of waiting versus duration (centred and scaled) with enlarged, alpha-blended points and labelled contours of a kernel density estimate.]
A simple scatterplot with larger point sizes and alpha blending shows
◮ 3 regions of concentration,
◮ 3 vertical lines,
◮ and a few outliers.
Contours of constant kernel density estimate show
◮ two modes at right,
◮ a higher mode at left, and
◮ a smooth continuous mathematical function.
Perhaps the points could be automatically grouped by using the contours?
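A minimal base-R sketch (not from the slides) of the kind of display described above, using kde2d() from MASS for the density contours:

# Not from the slides: a rough version of the plot described above.
# kde2d() comes from MASS, which is already loaded.

# Scatterplot with enlarged, translucent (alpha-blended) points
plot(data$duration, data$waiting,
     pch = 19, cex = 1.5,
     col = rgb(0, 0, 0, alpha = 0.3),
     xlab = "duration", ylab = "waiting", asp = 1)

# Overlay contours of a kernel density estimate
dens <- kde2d(data$duration, data$waiting, n = 100)
contour(dens, add = TRUE, col = "grey")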

  6. Finding groups in data - K-means A great many methods exist (and continue to be developed) to automatically find groups in data. These have historically been called clustering methods by data analysts and, more recently, are sometimes called unsupervised learning methods (in the sense that we do not know the "classes" of the observations as in "supervised learning") by many artificial intelligence researchers.
One of the earliest clustering methods is "K-means". In its simplest form, it begins with the knowledge that there are exactly K clusters to be determined. The idea is to identify K clusters, C_1, ..., C_K, where every multivariate observation x_i for i = 1, ..., n in the data set appears in one and only one cluster C_k. The clusters are to be chosen so that the total within-cluster spread is as small as possible. For every cluster, the total spread for that cluster is measured by the sum of squared Euclidean distances from the cluster "centroid", namely
SSE_k = \sum_{i \in C_k} ||x_i - c_k||^2 = \sum_{i \in C_k} d^2(i, k)
where c_k is the cluster "centroid". Typically, the cluster average \bar{x}_k = \sum_{i \in C_k} x_i / n_k (where n_k denotes the cardinality of cluster C_k) is chosen as the cluster centroid (i.e. choose c_k = \bar{x}_k). The K clusters are chosen to minimize \sum_{k=1}^{K} SSE_k.
Algorithms typically begin with "seed" centroids c_1, ..., c_K, possibly randomly chosen, then assign every observation to its nearest centroid. Each centroid is then recalculated from the values x_i for all i \in C_k (e.g. c_k = \bar{x}_k), and the data are reassigned to the new centroids. Repeat until there is no change in the clustering.
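A bare-bones R sketch (not from the slides, and not how the built-in kmeans() is actually implemented) of the algorithm just described, assuming x is a numeric matrix or data frame:

# Not from the slides: a Lloyd-style K-means, for illustration only.
simple_kmeans <- function(x, K, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # "Seed" centroids: K observations chosen at random
  centroids <- x[sample(n, K), , drop = FALSE]
  cluster <- integer(n)
  for (iter in seq_len(max_iter)) {
    # Assign every observation to its nearest centroid (squared Euclidean distance)
    d2 <- sapply(seq_len(K), function(k)
      rowSums((x - matrix(centroids[k, ], n, ncol(x), byrow = TRUE))^2))
    new_cluster <- max.col(-d2)             # column index of the smallest distance
    if (all(new_cluster == cluster)) break  # no change in the clustering: stop
    cluster <- new_cluster
    # Recalculate each centroid as the average of its cluster
    # (no handling of empty clusters in this sketch)
    for (k in seq_len(K))
      centroids[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centroids)
}

On the scaled geyser data, simple_kmeans(data, 3)$cluster should give much the same grouping as kmeans(data, centers = 3), though possibly with the cluster labels permuted.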

  7. Finding groups in data - K-means There are several implementations of K-means in R. Several of these are available via the base R function kmeans().

result <- kmeans(data, centers = 3)
str(result)
## List of 9
##  $ cluster     : Named int [1:299] 2 3 1 2 2 3 1 2 3 1 ...
##   ..- attr(*, "names")= chr [1:299] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:2] 0.21 0.1392 -0.3166 -0.2655 0.0968 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:2] "duration" "waiting"
##  $ totss       : num 32
##  $ withinss    : num [1:3] 1.33 1.19 1.58
##  $ tot.withinss: num 4.09
##  $ betweenss   : num 27.9
##  $ size        : int [1:3] 101 91 107
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The cluster component identifies to which of the three clusters each observation has been assigned.

  8. Finding groups in data - K-means Plotting this information in loon:

library(loon)
p <- l_plot(data, linkingGroup = "geyser",
            showScales = FALSE, showLabels = FALSE, showGuides = FALSE)

# Add the density contours
l_layer_contourLines(p, kde2d(data$duration, data$waiting, n = 100), color = "grey")
## loon layer "lines" of type lines of plot .l0.plot
## [1] "layer0"

# Colour the clusters
p['color'] <- result$cluster

  9. Finding groups in data - K-means Plotting this information in loon:

plot(p)

which looks pretty good.

  10. Finding groups in data - K-means Had we selected only K = 2, the result is one with which we might not completely agree.

  11. Finding groups in data - K-means How about K = 4? Again, we might or might not agree with the result.

  12. Finding groups in data - K-means How about K = 5? Again, we might or might not agree with the result.

  13. Finding groups in data - K-means How about K = 6? Again, we might or might not agree with the result.

  14. Finding groups in data - K-means Let's try K = 6 again: the result is different!
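The difference arises because kmeans() starts from randomly chosen centroids. A small sketch (not from the slides) of two common ways to cope with this: fixing the random seed for reproducibility, or using several random starts and keeping the best solution:

# Not from the slides: coping with the random starting centroids.

# 1. Fix the seed so the same clustering is reproduced on every call
set.seed(314159)
result6a <- kmeans(data, centers = 6)

# 2. Use many random starts; kmeans() keeps the solution with the
#    smallest total within-cluster sum of squares
result6b <- kmeans(data, centers = 6, nstart = 25)
result6b$tot.withinss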

  15. Finding groups in data - K-means Some comments and questions:
◮ K-means depends on total squared Euclidean distance to the centroids
◮ K-means implicitly presumes that the clusters will be "globular" or "spherical"
◮ different clusters might arise on different calls (random starting positions for the centroids)
◮ how do we choose K? (one common heuristic is sketched below)
◮ should we enforce a hierarchy on the clusters?
◮ with an interactive visualization, we should be able to readjust the clusters by changing their colours
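One common, informal way to choose K (not from the slides) is the "elbow" heuristic: run kmeans() over a range of values of K, plot the total within-cluster sum of squares against K, and look for the value beyond which the decrease levels off.

# Not from the slides: the "elbow" heuristic for choosing K.
set.seed(123)
Ks <- 1:10
totwss <- sapply(Ks, function(K)
  kmeans(data, centers = K, nstart = 25)$tot.withinss)

plot(Ks, totwss, type = "b",
     xlab = "K (number of clusters)",
     ylab = "total within-cluster sum of squares")
# Look for the "elbow": the K beyond which adding clusters buys little.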

  16. Finding groups in data - model-based clustering A related approach to K-means, but one that is much more general and which comes from a different reasoning base, is that of so-called "model-based clustering". Here, the main idea is that the data x_i are a sample of independent and identically distributed (iid) multivariate observations from some multivariate mixture distribution. That is,
X_1, ..., X_n ~ f_p(x; \Theta)
where f_p(x; \Theta) is a p-variate continuous density, parameterized by some collection of parameters \Theta, that can be expressed as a finite mixture of individual p-variate densities g_p():
f_p(x; \Theta) = \sum_{k=1}^{K} \alpha_k \, g_p(x; \theta_k).
Here \alpha_k \ge 0, \sum_{k=1}^{K} \alpha_k = 1, and the individual densities g_p(x; \theta_k) are of known shape and are identical up to differences given by their individual parameter vectors \theta_k. Neither the \alpha_k nor the \theta_k are known for any k = 1, ..., K and must be estimated from the observations (i.e. \Theta = {\alpha_1, ..., \alpha_K, \theta_1, ..., \theta_K}).
Typically the g_p(x; \theta_k) are taken to be multivariate Gaussian densities of the form g_p(x; \theta_k) = \phi_p(x; \mu_k, \Sigma_k) with
\phi_p(x; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}
and \theta_k = (\mu_k, \Sigma_k).
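In R, one widely used implementation of Gaussian model-based clustering is the mclust package (not used in the slides above, so treat this as a sketch). Its Mclust() function fits the mixture by the EM algorithm and chooses both the number of components and the covariance structure by BIC:

# Not from the slides: Gaussian model-based clustering with the mclust package.
# install.packages("mclust")   # if not already installed
library(mclust)

fit <- Mclust(data)   # selects the number of components and covariance model by BIC
summary(fit)          # chosen model, number of components, mixing proportions

# The fitted classification can be used to colour any plot, e.g. the loon plot:
# p['color'] <- fit$classification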
