Dissimilarity Plots: A Visual Exploration Tool for Partitional Clustering
CSE Colloquium
Dr. Michael Hahsler
Department of Computer Science and Engineering, Lyle School of Engineering, Southern Methodist University
Dallas, April 3, 2009
Motivation

Clustering: the assignment of objects to groups (clusters) so that objects from the same cluster are more similar to each other than objects from different clusters.

[Figure: two scatter plots of the data points (axes x and y), each showing a different cluster assignment.]

Assess the quality of a cluster solution:
• Typically judged by intra- and inter-cluster similarities
• Visualization for judging the quality of a clustering and for exploring the cluster structure
Motivation (cont’d)

Dendrograms (Hartigan, 1967) for hierarchical clustering:

[Figure: two cluster dendrograms with height on the vertical axis.]

→ Unfortunately, dendrograms are only possible for hierarchical/nested clusterings.
Outline

1. Clustering Basics
2. Existing Visualization Techniques
3. Matrix Shading
4. Seriation
5. Creating Dissimilarity Plots
6. Examples
Clustering Basics

• Partition: each point is assigned to a (single) group:

    Γ : ℝ^m → {1, 2, ..., k}

• Typical partitional clustering algorithm: k-means
  Source: Wikipedia (http://en.wikipedia.org/wiki/K-means_algorithm)

• Dissimilarity (distance) matrix: d : O × O → ℝ

    D      O_1  O_2  O_3  O_4
    O_1      0    4    1    8
    O_2      4    0    2    2
    O_3      1    2    0    3
    O_4      8    2    3    0
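As a concrete illustration of a partition Γ and a dissimilarity matrix D, here is a minimal base-R sketch; the toy data set x is an assumption for illustration, not data from the talk.

    # Sketch (base R / stats): a k-means partition Gamma and the
    # dissimilarity matrix D for a made-up 2-d data set `x`.
    set.seed(42)
    x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # points around (0, 0)
               matrix(rnorm(40, mean = 5), ncol = 2))   # points around (5, 5)

    cl <- kmeans(x, centers = 2)  # Gamma: maps each row of x to {1, 2}
    head(cl$cluster)              # cluster membership of the first objects

    D <- dist(x)                  # symmetric Euclidean dissimilarities
    round(as.matrix(D)[1:4, 1:4], 2)  # d_ij with d_ii = 0 on the diagonal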
Visualization Techniques for Partitions

Project objects into 2-dimensional space (dimensionality reduction techniques, e.g., PCA, MDS; Pison et al., 1999).

[Figure: two projection plots, "Projection (PCA)" and "Projection (MDS)", showing the objects as labeled points over Component 1 and Component 2. Left: "These two components explain 100 % of the point variability." Right: "These two components explain 40.59 % of the point variability."]

→ Problems with dimensionality (figure on the right: 16-dimensional data)
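A sketch of these two projections in base R, reusing the x and cl objects from the k-means example above:

    # Sketch: project the objects into 2-d with PCA and classical MDS.
    pca <- prcomp(x)
    plot(pca$x[, 1:2], col = cl$cluster,
         xlab = "Component 1", ylab = "Component 2",
         main = "Projection (PCA)")

    mds <- cmdscale(dist(x), k = 2)   # classical MDS on the dissimilarities
    plot(mds, col = cl$cluster,
         xlab = "Component 1", ylab = "Component 2",
         main = "Projection (MDS)")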
Visualization Techniques for Partitions (cont’d)

• Visualize metrics calculated from inter- and intra-cluster similarities to judge cluster quality. For example, silhouette width (Rousseeuw, 1987; Kaufman and Rousseeuw, 1990).

[Figure: silhouette plot, 4 clusters, n = 75; cluster sizes and average silhouette widths: 1: 23 | 0.75, 2: 20 | 0.73, 3: 15 | 0.80, 4: 17 | 0.67; average silhouette width: 0.74.]

→ Only a diagnostic tool for cluster quality

• Several other visualization methods (e.g., based on self-organizing maps and neighborhood graphs) are reviewed in Leisch (2008).

→ These typically hide the structure within clusters or are limited by the number of clusters and the dimensionality of the data.
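A minimal sketch of a silhouette plot using the cluster R package (which ships Rousseeuw's implementation); x and cl again come from the earlier example.

    # Sketch: silhouette widths for the k-means partition.
    library(cluster)

    sil <- silhouette(cl$cluster, dist(x))
    summary(sil)$avg.width   # average silhouette width over all objects
    plot(sil)                # one horizontal bar per object, grouped by cluster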
Matrix Shading

Each cell of the matrix (typically a dissimilarity matrix) is represented by a gray value (see, e.g., Sneath and Sokal, 1973; Ling, 1973; Gale et al., 1984). Initially, matrix shading was used with hierarchical clustering → heatmaps.

For graph-based partitional clustering: CLUSION (Strehl and Ghosh, 2003). Uses coarse seriation such that "good" clusters form blocks around the main diagonal. CLUSION makes it possible to judge cluster quality but does not reveal the structure of the data.

→ Dissimilarity plots: improve matrix shading/CLUSION with (near) optimal placement of clusters and objects using seriation
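A minimal matrix-shading sketch in base R: ordering the objects by cluster membership is the coarse seriation that produces the block structure; x and cl are carried over from the earlier examples.

    # Sketch: matrix shading of the dissimilarity matrix. Ordering the
    # objects by cluster membership (coarse seriation) makes "good"
    # clusters appear as dark blocks around the main diagonal.
    D <- as.matrix(dist(x))
    ord <- order(cl$cluster)                  # group objects by cluster
    image(D[ord, ord], col = gray.colors(64)) # one gray cell per d_ij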
Seriation

Part of combinatorial data analysis (Arabie and Hubert, 1996).

• Aim: arrange objects in a linear order, given the available data and some loss function, in order to reveal structural information.
• Problem: requires solving a discrete optimization problem → the solution space grows as O(n!)

Techniques:
1. Partial enumeration methods (currently solve problems with n ≤ 40):
   • dynamic programming (Hubert et al., 1987)
   • branch-and-bound (Brusco and Stahl, 2005)
2. Heuristics for larger problems
Seriation (cont’d)

Set of n objects:

    O = {O_1, O_2, ..., O_n}                                        (1)

Symmetric dissimilarity matrix:

    D = (d_ij)                                                      (2)

where d_ij for 1 ≤ i, j ≤ n represents the dissimilarity between O_i and O_j, and d_ii = 0 for all i.

A permutation function Ψ reorders the objects in D by simultaneously permuting rows and columns. Define a loss function L to evaluate a given permutation. Seriation is then the optimization problem

    Ψ* = argmin_Ψ L(Ψ(D))                                           (3)
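To make Eq. (3) concrete, here is a hedged brute-force sketch in base R that enumerates all permutations and keeps the best one under a pluggable loss. It is only feasible for very small n, which is exactly why the partial enumeration methods and heuristics from the previous slide exist.

    # Brute-force sketch of Eq. (3): try every permutation Psi and keep
    # the one minimizing a user-supplied loss L. Search space grows as n!.
    perms <- function(v) {
      if (length(v) <= 1) return(list(v))
      out <- list()
      for (i in seq_along(v))
        for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
      out
    }

    seriate_brute <- function(D, loss) {
      best <- NULL; best_loss <- Inf
      for (p in perms(seq_len(nrow(D)))) {
        l <- loss(D[p, p])   # permute rows and columns simultaneously
        if (l < best_loss) { best_loss <- l; best <- p }
      }
      list(order = best, loss = best_loss)
    }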
Column/row gradient measures

Perfect anti-Robinson matrix (Robinson, 1951): a symmetric matrix where the values in all rows and columns only increase when moving away from the main diagonal.

Gradient conditions (Hubert et al., 1987):

    within rows:     d_ik ≤ d_ij   for 1 ≤ i < k < j ≤ n;           (4)
    within columns:  d_kj ≤ d_ij   for 1 ≤ i < k < j ≤ n.           (5)

    D      O_1  O_2  O_3  O_4        Ψ(D)   O_1  O_3  O_2  O_4
    O_1      0    4    1    8        O_1      0    1    4    8
    O_2      4    0    2    2        O_3      1    0    2    3
    O_3      1    2    0    3        O_2      4    2    0    2
    O_4      8    2    3    0        O_4      8    3    2    0

In an anti-Robinson matrix the smallest dissimilarity values appear close to the main diagonal; therefore, the closer two objects are in the order of the matrix, the higher their similarity.

Note: Most matrices can only be brought into a near anti-Robinson form.
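A small base-R sketch that checks the gradient conditions (4) and (5) directly; the permutation Ψ = (O_1, O_3, O_2, O_4) is the reordering shown above.

    # Sketch: test whether a symmetric matrix satisfies the gradient
    # conditions (4) and (5), i.e., whether it is anti-Robinson.
    is_anti_robinson <- function(D) {
      n <- nrow(D)
      for (i in 1:(n - 2)) for (k in (i + 1):(n - 1)) for (j in (k + 1):n)
        if (D[i, k] > D[i, j] || D[k, j] > D[i, j]) return(FALSE)
      TRUE
    }

    D <- matrix(c(0, 4, 1, 8,
                  4, 0, 2, 2,
                  1, 2, 0, 3,
                  8, 2, 3, 0), nrow = 4, byrow = TRUE)

    is_anti_robinson(D)            # FALSE: the original order violates (4)/(5)
    psi <- c(1, 3, 2, 4)           # the order O_1, O_3, O_2, O_4 from above
    is_anti_robinson(D[psi, psi])  # TRUE: Psi(D) is in anti-Robinson form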
Column/row gradient measures (cont’d)

Loss measure (quantifies the divergence from anti-Robinson form):

    L(D) = Σ_{i<k<j} f(d_ik, d_ij) + Σ_{i<k<j} f(d_kj, d_ij)        (6)

where f(·,·) is a function which defines how a violation or satisfaction of a gradient condition for an object triple (O_i, O_k, O_j) is counted.

Count each satisfied gradient condition as +1 and each violation as −1:

    f(z, y) = sign(y − z) =  −1 if z > y;  0 if z = y;  +1 if z < y.    (7)

(Note that with this f, (6) acts as a merit function: seriation maximizes it, or equivalently minimizes its negative.)

Weight each satisfaction or violation by its magnitude (the absolute difference between the values):

    f(z, y) = |y − z| sign(y − z) = y − z                            (8)
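A base-R sketch of the gradient measures (6)–(8), written as a direct loop over the object triples; D and psi are reused from the anti-Robinson example above.

    # Sketch of Eqs. (6)-(8): sum f over all triples i < k < j, with the
    # raw sign count (7) or the magnitude-weighted variant (8).
    gradient_measure <- function(D, weighted = FALSE) {
      f <- if (weighted) function(z, y) y - z        # Eq. (8)
           else          function(z, y) sign(y - z)  # Eq. (7)
      n <- nrow(D); total <- 0
      for (i in 1:(n - 2)) for (k in (i + 1):(n - 1)) for (j in (k + 1):n)
        total <- total + f(D[i, k], D[i, j]) + f(D[k, j], D[i, j])
      total
    }

    gradient_measure(D)             # 1 for the example matrix as given
    gradient_measure(D[psi, psi])   # 8 (the maximum): all conditions satisfied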
Anti-Robinson events

An even simpler loss function can be created in the same way as the gradient measures above by concentrating on violations only:

    L(D) = Σ_{i<k<j} f(d_ik, d_ij) + Σ_{i<k<j} f(d_kj, d_ij)        (9)

To count only the violations we use

    f(z, y) = I(z, y) = 1 if z > y, and 0 otherwise,                (10)

where I(·) is an indicator function returning 1 only for violations of the gradient conditions. Chen (2002) also introduced a weighted version of this loss function, using the absolute deviations as weights:

    f(z, y) = |y − z| I(z, y)                                       (11)
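The corresponding base-R sketch, reusing the triple loop from the gradient-measure example and the same D and psi:

    # Sketch of Eqs. (9)-(11): count anti-Robinson events (violations of
    # the gradient conditions), optionally weighted by their magnitude.
    ar_events <- function(D, weighted = FALSE) {
      f <- if (weighted) function(z, y) abs(y - z) * (z > y)  # Eq. (11)
           else          function(z, y) as.numeric(z > y)     # Eq. (10)
      n <- nrow(D); total <- 0
      for (i in 1:(n - 2)) for (k in (i + 1):(n - 1)) for (j in (k + 1):n)
        total <- total + f(D[i, k], D[i, j]) + f(D[k, j], D[i, j])
      total
    }

    ar_events(D)             # 3 violations in the original order
    ar_events(D[psi, psi])   # 0 violations: perfect anti-Robinson form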
Hamiltonian path length

The dissimilarity matrix D can be represented as a finite weighted graph G = (Ω, E), where the set of objects constitutes the vertices Ω = {O_1, O_2, ..., O_n} and each edge e_ij ∈ E between the objects O_i, O_j ∈ Ω has an associated weight w_ij representing the dissimilarity d_ij.

An order Ψ of the objects can be seen as a path through the graph where each node is visited exactly once, i.e., a Hamiltonian path. Minimizing the Hamiltonian path length results in a seriation that is optimal with respect to the dissimilarities between neighboring objects (see, e.g., Hubert, 1974; Caraux and Pinloche, 2005).

[Figure: the example matrix D from above and its representation as a weighted graph on the vertices O_1, ..., O_4.]

The loss function based on the Hamiltonian path length is

    L(D) = Σ_{i=1}^{n−1} d_{i,i+1}                                  (12)

This optimization problem is related to the traveling salesperson problem (Gutin and Punnen, 2002), for which good solvers and efficient heuristics exist.
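A sketch of Eq. (12) for the example matrix D from the previous slides; the commented lines at the end show, as an assumption based on its documentation, how the seriation package solves the same problem with a TSP heuristic.

    # Sketch of Eq. (12): Hamiltonian path length of D under an order `ord`.
    path_length <- function(D, ord = seq_len(nrow(D))) {
      sum(D[cbind(ord[-length(ord)], ord[-1])])  # sum of d_{i,i+1} along ord
    }

    path_length(D)                 # order O_1 O_2 O_3 O_4: 4 + 2 + 3 = 9
    path_length(D, c(1, 3, 2, 4))  # order O_1 O_3 O_2 O_4: 1 + 2 + 2 = 5

    # The same criterion via the seriation package (method name from its docs):
    # library(seriation)
    # o <- seriate(as.dist(D), method = "TSP")
    # get_order(o)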
Creating dissimilarity plots

We use matrix shading with two improvements:

1. Rearrange clusters: more similar clusters are placed closer together (macro-structure).
2. Rearrange objects within each cluster: reveal the micro-structure.

[Diagram: the partition Γ splits D into within-cluster dissimilarity matrices D_i; each D_i is seriated by a permutation Ψ_i, and the clusters themselves are arranged by an inter-cluster permutation Ψ_c, yielding Ψ_i(D_i) and Ψ_c(D_c).]

The assignment function Γ assigns a cluster membership to each object (provided by a partitional clustering algorithm).
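A hedged usage sketch: the seriation R package (introduced on the next slide) ships this procedure as dissplot(); x and cl are the data and k-means partition from the earlier examples.

    # Sketch: a dissimilarity plot for the k-means partition. dissplot()
    # rearranges the clusters (macro-structure) and the objects within
    # each cluster (micro-structure) before shading the matrix.
    library(seriation)
    dissplot(dist(x), labels = cl$cluster)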
Examples

We use the column/row gradient measure as the loss function for seriation.

• Placement (seriation) of clusters is done using branch-and-bound to find the optimal solution.
• Placement (seriation) of objects within their clusters uses a simulated annealing heuristic.

The seriation algorithms are provided by Brusco and Stahl (2005) and are available in the R extension package seriation (Hahsler et al., 2008).
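For reference, a hedged sketch of calling these algorithm families directly through the seriation package; the method names ("ARSA" for Brusco's simulated annealing, "BBURCG" for the branch-and-bound gradient seriation) are taken from the package documentation, not from the talk.

    # Sketch: seriate the dissimilarities and evaluate the column/row
    # gradient criterion before and after reordering.
    library(seriation)
    d <- dist(x)

    o <- seriate(d, method = "ARSA")                  # simulated annealing
    criterion(d, method = "Gradient_raw")             # identity order
    criterion(d, order = o, method = "Gradient_raw")  # after seriation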