Data Mining in Bioinformatics
Day 2: Clustering

Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen
What is clustering?

Clustering
- Class discovery: given a set of objects, group them into clusters (classes that are unknown beforehand)
- An instance of unsupervised learning (no training dataset)

In practice
- Cluster images to find categories
- Cluster patient data to find disease subtypes
- Cluster persons in social networks to detect communities
What is clustering?

Supervised versus unsupervised learning
- General inference problem: given $x_i$, predict $y_i$ by learning a function $f$
- Training set: a set of examples $(x_i, y_i)$ where $y_i = f(x_i)$ (but $f$ is still unknown!)
- Test set: a new set of data points $x_i$ where $y_i$ is unknown
- Supervised: use the training data to infer your model, then apply this model to the test data
- Unsupervised: no training data; learn the model and apply it directly to the test data
K-means

Objective
Partition the dataset into $k$ clusters such that the intra-cluster variance

$$V(D) = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2 \qquad (1)$$

is minimised, where $V$ is the variance, $S_i$ is a cluster, $\mu_i$ is its mean, and $D$ is the dataset of all points $x_j$.
K-means

Lloyd's algorithm
1. Partition the data into $k$ initial clusters
2. Compute the mean of each cluster
3. Assign each point to the cluster whose mean is closest to the point
4. If any point changed its cluster membership: repeat from step 2
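As a concrete illustration, here is a minimal numpy sketch of Lloyd's algorithm (an illustrative sketch, not the lecture's reference implementation; the function name, the random initialisation, and the fixed iteration cap are my own choices):

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    # Step 1: partition the data into k initial clusters (random assignment;
    # empty clusters are not handled in this sketch)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the mean of each cluster
        means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: assign each point to the cluster whose mean is closest
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changed its cluster membership
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels, means

For example, lloyd_kmeans(np.random.rand(100, 2), k=2) reproduces the two-cluster toy setting of the next two slides.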
K-means

Example: before clustering
[Figure: scatter plot of unlabelled 2D points on the unit square]
K-means

Example: after clustering (k = 2)
[Figure: the same 2D points, split into two clusters by k-means]
K-means

Things to note
- k-means is still the state-of-the-art method for most clustering tasks
- When proposing a new clustering method, one should always compare to k-means

Lloyd's algorithm has several drawbacks
- It is order-dependent
- Its result depends on the initialisation of the clusters
- Its result may be a local optimum, not the globally optimal solution
K-centroid

'Brother' of k-means (this variant is commonly known as k-medoids)
- Use the medoid of each cluster rather than its mean
- The medoid is the cluster point closest to the mean:

$$m_i = \operatorname*{argmin}_{x_j \in S_i} \|x_j - \mu_i\|^2 \qquad (2)$$

- One thereby restricts the cluster 'means' to points that are present in the dataset
- One only minimises the variance with respect to these points
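A minimal sketch of Equation (2), assuming the cluster $S_i$ is stored as a numpy array (the function name is my own):

import numpy as np

def medoid(S):
    # the medoid is the cluster point closest to the cluster mean, cf. Eq. (2)
    mu = S.mean(axis=0)
    return S[((S - mu) ** 2).sum(axis=1).argmin()]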
Kernel k-means

Kernelised k-means?
- It would be attractive to perform clustering using kernels:
  - one could move the clustering problem to different feature spaces
  - one could cluster string and graph data
- But we have to be able to perform all steps in k-means using kernels!
Kernel k-means

Kernelised k-means
The key step in k-means is to compute the distance between one data point $x_1$ and the mean of a cluster of points $x_2, \ldots, x_m$:

$$\Big\| \phi(x_1) - \frac{1}{m-1} \sum_{j=2}^{m} \phi(x_j) \Big\|^2 = k(x_1, x_1) - \frac{2}{m-1} \sum_{j=2}^{m} k(x_1, x_j) + \frac{1}{(m-1)^2} \sum_{i=2}^{m} \sum_{j=2}^{m} k(x_i, x_j) \qquad (3)$$

This rests on the fact that every kernel $k$ induces a distance $d$:

$$d(x_i, x_j)^2 = \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)$$
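A small sketch of Equation (3), assuming the full kernel matrix K has been precomputed; here the cluster is given as a list of indices of size m (the slide's cluster $x_2, \ldots, x_m$ has size $m - 1$, hence its $(m-1)$ denominators). Function and variable names are my own:

import numpy as np

def kernel_dist_to_cluster_mean(K, p, cluster):
    # squared feature-space distance between point p and the cluster mean,
    # computed from kernel evaluations alone (p must not be in `cluster`)
    m = len(cluster)
    self_term = K[p, p]
    cross_term = (2.0 / m) * K[p, cluster].sum()
    cluster_term = K[np.ix_(cluster, cluster)].sum() / m**2
    return self_term - cross_term + cluster_term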
Graph-based clustering I

Data representation
- The dataset $D$ is given in terms of a graph $G = (V, E)$
- Each data object $v_i$ is a node in $G$
- The edge $e(i, j)$ from node $v_i$ to node $v_j$ has weight $w(i, j)$

Graph-based clustering
- Define a threshold $\theta$
- Remove all edges $e(i, j)$ from $G$ with weight $w(i, j) > \theta$
- Each connected component of the resulting graph corresponds to one cluster
- Two nodes are in the same connected component iff there is a path between them
- The connected components can be found by depth-first search in $O(|V| + |E|)$ time
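A sketch of the whole procedure for a dense weight matrix (function and variable names are my own); the depth-first search itself runs in $O(|V| + |E|)$, although building adjacency lists from a dense matrix costs $O(n^2)$:

import numpy as np

def threshold_clusters(W, theta):
    # remove all edges with weight w(i, j) > theta, then label each
    # connected component of the remaining graph as one cluster
    n = len(W)
    adj = [[j for j in range(n) if j != i and W[i, j] <= theta] for i in range(n)]
    labels, cluster = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue                 # node already belongs to a component
        stack = [start]              # iterative depth-first search
        while stack:
            v = stack.pop()
            if labels[v] == -1:
                labels[v] = cluster
                stack.extend(adj[v])
        cluster += 1
    return labels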
Graph-based clustering II

[Figure: the original weighted graph]
Graph-based clustering III

Thresholded graph ($\theta = 0.5$)
[Figure: the graph after removing all edges with weight above $\theta$]
Graph-based clustering IV

But how to get the graph in the first place?
- Think of the edge weights as a (dis)similarity measure: the smaller $w(i, j)$, the more similar $v_i$ and $v_j$; two unconnected nodes can be read as having similarity 0
- Graph-based clustering thus creates clusters of similar objects
- For any object $v_i$ in a non-singleton cluster, there is a second object $v_j$ with $w(i, j) \leq \theta$, i.e., whose similarity to $v_i$ is high
DBScan I

Noise-robust graph-based clustering
- Graph-based clustering can suffer from the fact that a single noisy edge connects two clusters
- DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering
- DBScan is short for Density-Based Spatial Clustering of Applications with Noise

Core object
- Two objects $v_i$ and $v_j$ with distance $d(v_i, v_j) < \epsilon$ belong to the same cluster if $v_i$ or $v_j$ is a core object
- $v_i$ is a core object iff there are at least MinPoints points within distance $\epsilon$ of $v_i$
- A cluster is defined by iteratively checking this core-object property
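A minimal sketch of the core-object test (a hypothetical helper of my own; following Ester et al., the point itself counts towards MinPoints):

import numpy as np

def is_core(X, i, eps, min_points):
    # v_i is a core object iff at least min_points points of X lie
    # within distance eps of x_i (x_i itself included)
    dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    return (dists < eps).sum() >= min_points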
DBScan II

Code: Main

DBSCAN(SetOfPoints, Eps, MinPts)
   // SetOfPoints is UNCLASSIFIED
   ClusterId := nextId(NOISE);
   FOR i FROM 1 TO SetOfPoints.size DO
      Point := SetOfPoints.get(i);
      IF Point.ClId = UNCLASSIFIED THEN
         // try to grow a new cluster around this point
         IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
            ClusterId := nextId(ClusterId)
         END IF
      END IF
   END FOR
DBScan III

Code: ExpandCluster

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
   seeds := SetOfPoints.regionQuery(Point, Eps);
   IF seeds.size < MinPts THEN
      // Point is not a core object: mark it as noise (for now)
      SetOfPoints.changeClId(Point, NOISE);
      RETURN False;
   ELSE
      // Point is a core object: all points in seeds join the cluster
      SetOfPoints.changeClIds(seeds, ClId);
      seeds.delete(Point);
      WHILE seeds <> Empty DO
         currentP := seeds.first();
         result := SetOfPoints.regionQuery(currentP, Eps);
DBScan IV

         IF result.size >= MinPts THEN
            // currentP is a core object as well: expand through its neighbourhood
            FOR i FROM 1 TO result.size DO
               resultP := result.get(i);
               IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
                  IF resultP.ClId = UNCLASSIFIED THEN
                     seeds.append(resultP);
                  END IF
                  SetOfPoints.changeClId(resultP, ClId);
               END IF // UNCLASSIFIED or NOISE
            END FOR
         END IF // result.size >= MinPts
         seeds.delete(currentP);
      END WHILE // seeds <> Empty
      RETURN True;
   END IF
END // ExpandCluster
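For experiments one need not reimplement this: scikit-learn, for instance, ships a DBSCAN implementation in which MinPts is called min_samples. A usage sketch on toy data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                       # toy 2D data
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# points labelled -1 are noise; all other labels are cluster ids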
DBScan V

[Figure: the original graph]
DBScan VI

[Figure: DBScan-clustered graph (MinPts = 2, Eps = 0.5)]
DBScan VII

[Figure: the original graph]
DBScan VIII

[Figure: DBScan-clustered graph (MinPts = 3, Eps = 0.5)]
DBScan IX

Properties
- The cluster assignment of border points is order-dependent
- Unlike k-means, one does not have to specify the number of clusters a priori
- But one has to set MinPts and Eps
- Ester et al. report that MinPts = 4 suffices for good results on 2D examples
- They determine Eps by visual inspection of a $k$-distance plot

Transfer question: how would you kernelise DBScan?
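A sketch of the $k$-distance heuristic for choosing Eps (the function name and descending-sort presentation are my own; Ester et al. inspect the sorted 4-distance curve for their 2D data and place Eps at the first 'knee'):

import numpy as np

def k_distances(X, k=4):
    # distance of every point to its k-th nearest neighbour,
    # sorted in descending order for visual inspection
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    D.sort(axis=1)            # row i: distances from x_i, ascending; D[i, 0] = 0
    return np.sort(D[:, k])[::-1]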
Hierarchical Clustering

Extension of the original setting
- What if clusters contain clusters themselves?
- Then we need hierarchical clustering!
Hierarchical Clustering

Join the most similar clusters
- Iteratively join the two most similar clusters
- But how to measure similarity between clusters?

Similarity of clusters
- Single link: $S(C_i, C_j) = \min_{x \in C_i, x' \in C_j} d(x, x')$
- Average link: $S(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i, x' \in C_j} d(x, x')$
- Maximum link: $S(C_i, C_j) = \max_{x \in C_i, x' \in C_j} d(x, x')$
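These linkage criteria are available off the shelf, e.g. in scipy, where maximum link is called 'complete'. A usage sketch on toy data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)
Z = linkage(X, method='single')    # also: 'average', 'complete' (maximum link)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy into 2 clusters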
The end

See you tomorrow!
Next topic: Feature Selection