Data Mining in Bioinformatics
Day 2: Clustering

Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen
What is clustering?

Clustering
- Class discovery: given a set of objects, group them into clusters (classes that are unknown beforehand)
- An instance of unsupervised learning (no training dataset)

In practice
- Cluster images to find categories
- Cluster patient data to find disease subtypes
- Cluster persons in social networks to detect communities
What is clustering?

Supervised versus unsupervised learning
- General inference problem: given $x_i$, predict $y_i$ by learning a function $f$
- Training set: a set of examples $(x_i, y_i)$ where $y_i = f(x_i)$ (but $f$ is still unknown!)
- Test set: a new set of data points $x_i$ where $y_i$ is unknown
- Supervised: use the training data to infer your model, then apply this model to the test data
- Unsupervised: no training data; learn the model and apply it directly to the test data
K-means

Objective
Partition the dataset into $k$ clusters such that the intra-cluster variance

$$V(D) = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2 \qquad (1)$$

is minimised, where $V$ is the variance, $S_i$ is a cluster, $\mu_i$ is its mean, and $D$ is the dataset of all points $x_j$.
K-means

Lloyd's algorithm
1. Partition the data into $k$ initial clusters
2. Compute the mean of each cluster
3. Assign each point to the cluster whose mean is closest to the point
4. If any point changed its cluster membership: repeat from step 2
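As a concrete illustration, here is a minimal numpy sketch of Lloyd's algorithm (an illustrative sketch, not the lecture's reference implementation; the function name, the random initialisation, and the fixed iteration cap are my own choices):

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    # Step 1: partition the data into k initial clusters (random assignment;
    # empty clusters are not handled in this sketch)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the mean of each cluster
        means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: assign each point to the cluster whose mean is closest
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changed its cluster membership
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels, means

For example, lloyd_kmeans(np.random.rand(100, 2), k=2) reproduces the two-cluster toy setting of the next two slides.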
K-means

Example: before clustering
[Figure: scatter plot of unlabelled 2D points on the unit square]
K-means

Example: after clustering (k = 2)
[Figure: the same 2D points, split into two clusters by k-means]
K-means

Things to note
- k-means is still the state-of-the-art method for most clustering tasks
- When proposing a new clustering method, one should always compare to k-means

Lloyd's algorithm has several drawbacks
- It is order-dependent
- Its result depends on the initialisation of the clusters
- Its result may be a local optimum, not the globally optimal solution
K-centroid

'Brother' of k-means (this variant is commonly known as k-medoids)
- Use the medoid of each cluster rather than its mean
- The medoid is the cluster point closest to the mean:

$$m_i = \operatorname*{argmin}_{x_j \in S_i} \|x_j - \mu_i\|^2 \qquad (2)$$

- One thereby restricts the cluster 'means' to points that are present in the dataset
- One only minimises the variance with respect to these points
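A minimal sketch of Equation (2), assuming the cluster $S_i$ is stored as a numpy array (the function name is my own):

import numpy as np

def medoid(S):
    # the medoid is the cluster point closest to the cluster mean, cf. Eq. (2)
    mu = S.mean(axis=0)
    return S[((S - mu) ** 2).sum(axis=1).argmin()]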
Kernel k-means

Kernelised k-means?
- It would be attractive to perform clustering using kernels:
  - one could move the clustering problem to different feature spaces
  - one could cluster string and graph data
- But we have to be able to perform all steps in k-means using kernels!
Kernel k-means

Kernelised k-means
The key step in k-means is to compute the distance between one data point $x_1$ and the mean of a cluster of points $x_2, \ldots, x_m$:

$$\Big\| \phi(x_1) - \frac{1}{m-1} \sum_{j=2}^{m} \phi(x_j) \Big\|^2 = k(x_1, x_1) - \frac{2}{m-1} \sum_{j=2}^{m} k(x_1, x_j) + \frac{1}{(m-1)^2} \sum_{i=2}^{m} \sum_{j=2}^{m} k(x_i, x_j) \qquad (3)$$

This rests on the fact that every kernel $k$ induces a distance $d$:

$$d(x_i, x_j)^2 = \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)$$
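A small sketch of Equation (3), assuming the full kernel matrix K has been precomputed; here the cluster is given as a list of indices of size m (the slide's cluster $x_2, \ldots, x_m$ has size $m - 1$, hence its $(m-1)$ denominators). Function and variable names are my own:

import numpy as np

def kernel_dist_to_cluster_mean(K, p, cluster):
    # squared feature-space distance between point p and the cluster mean,
    # computed from kernel evaluations alone (p must not be in `cluster`)
    m = len(cluster)
    self_term = K[p, p]
    cross_term = (2.0 / m) * K[p, cluster].sum()
    cluster_term = K[np.ix_(cluster, cluster)].sum() / m**2
    return self_term - cross_term + cluster_term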
Graph-based clustering I

Data representation
- The dataset $D$ is given in terms of a graph $G = (V, E)$
- Each data object $v_i$ is a node in $G$
- The edge $e(i, j)$ from node $v_i$ to node $v_j$ has weight $w(i, j)$

Graph-based clustering
- Define a threshold $\theta$
- Remove all edges $e(i, j)$ from $G$ with weight $w(i, j) > \theta$
- Each connected component of the resulting graph corresponds to one cluster
- Two nodes are in the same connected component iff there is a path between them
- The connected components can be found by depth-first search in $O(|V| + |E|)$ time
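A sketch of the whole procedure for a dense weight matrix (function and variable names are my own); the depth-first search itself runs in $O(|V| + |E|)$, although building adjacency lists from a dense matrix costs $O(n^2)$:

import numpy as np

def threshold_clusters(W, theta):
    # remove all edges with weight w(i, j) > theta, then label each
    # connected component of the remaining graph as one cluster
    n = len(W)
    adj = [[j for j in range(n) if j != i and W[i, j] <= theta] for i in range(n)]
    labels, cluster = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue                 # node already belongs to a component
        stack = [start]              # iterative depth-first search
        while stack:
            v = stack.pop()
            if labels[v] == -1:
                labels[v] = cluster
                stack.extend(adj[v])
        cluster += 1
    return labels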
Graph-based clustering II

[Figure: the original weighted graph]
Graph-based clustering III

Thresholded graph ($\theta = 0.5$)
[Figure: the graph after removing all edges with weight above $\theta$]
Graph-based clustering IV

But how to get the graph in the first place?
- Think of the edge weights as a (dis)similarity measure: the smaller $w(i, j)$, the more similar $v_i$ and $v_j$; two unconnected nodes can be read as having similarity 0
- Graph-based clustering thus creates clusters of similar objects
- For any object $v_i$ in a non-singleton cluster, there is a second object $v_j$ with $w(i, j) \leq \theta$, i.e., whose similarity to $v_i$ is high
DBScan I

Noise-robust graph-based clustering
- Graph-based clustering can suffer from the fact that a single noisy edge connects two clusters
- DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering
- DBScan is short for Density-Based Spatial Clustering of Applications with Noise

Core object
- Two objects $v_i$ and $v_j$ with distance $d(v_i, v_j) < \epsilon$ belong to the same cluster if $v_i$ or $v_j$ is a core object
- $v_i$ is a core object iff there are at least MinPoints points within distance $\epsilon$ of $v_i$
- A cluster is defined by iteratively checking this core-object property
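A minimal sketch of the core-object test (a hypothetical helper of my own; following Ester et al., the point itself counts towards MinPoints):

import numpy as np

def is_core(X, i, eps, min_points):
    # v_i is a core object iff at least min_points points of X lie
    # within distance eps of x_i (x_i itself included)
    dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
    return (dists < eps).sum() >= min_points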
DBScan II

Code: Main

DBSCAN(SetOfPoints, Eps, MinPts)
   // SetOfPoints is UNCLASSIFIED
   ClusterId := nextId(NOISE);
   FOR i FROM 1 TO SetOfPoints.size DO
      Point := SetOfPoints.get(i);
      IF Point.ClId = UNCLASSIFIED THEN
         // try to grow a new cluster around this point
         IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
            ClusterId := nextId(ClusterId)
         END IF
      END IF
   END FOR
DBScan III

Code: ExpandCluster

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
   seeds := SetOfPoints.regionQuery(Point, Eps);
   IF seeds.size < MinPts THEN
      // Point is not a core object: mark it as noise (for now)
      SetOfPoints.changeClId(Point, NOISE);
      RETURN False;
   ELSE
      // Point is a core object: all points in seeds join the cluster
      SetOfPoints.changeClIds(seeds, ClId);
      seeds.delete(Point);
      WHILE seeds <> Empty DO
         currentP := seeds.first();
         result := SetOfPoints.regionQuery(currentP, Eps);
DBScan IV

         IF result.size >= MinPts THEN
            // currentP is a core object as well: expand through its neighbourhood
            FOR i FROM 1 TO result.size DO
               resultP := result.get(i);
               IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
                  IF resultP.ClId = UNCLASSIFIED THEN
                     seeds.append(resultP);
                  END IF
                  SetOfPoints.changeClId(resultP, ClId);
               END IF // UNCLASSIFIED or NOISE
            END FOR
         END IF // result.size >= MinPts
         seeds.delete(currentP);
      END WHILE // seeds <> Empty
      RETURN True;
   END IF
END // ExpandCluster
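For experiments one need not reimplement this: scikit-learn, for instance, ships a DBSCAN implementation in which MinPts is called min_samples. A usage sketch on toy data:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                       # toy 2D data
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
# points labelled -1 are noise; all other labels are cluster ids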
DBScan V

[Figure: the original graph]
DBScan VI

[Figure: DBScan-clustered graph (MinPts = 2, Eps = 0.5)]
DBScan VII

[Figure: the original graph]
DBScan VIII

[Figure: DBScan-clustered graph (MinPts = 3, Eps = 0.5)]
DBScan IX

Properties
- The cluster assignment of border points is order-dependent
- Unlike k-means, one does not have to specify the number of clusters a priori
- But one has to set MinPts and Eps
- Ester et al. report that MinPts = 4 suffices for good results on 2D examples
- They determine Eps by visual inspection of a $k$-distance plot

Transfer question: how would you kernelise DBScan?
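A sketch of the $k$-distance heuristic for choosing Eps (the function name and descending-sort presentation are my own; Ester et al. inspect the sorted 4-distance curve for their 2D data and place Eps at the first 'knee'):

import numpy as np

def k_distances(X, k=4):
    # distance of every point to its k-th nearest neighbour,
    # sorted in descending order for visual inspection
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    D.sort(axis=1)            # row i: distances from x_i, ascending; D[i, 0] = 0
    return np.sort(D[:, k])[::-1]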
Hierarchical Clustering

Extension of the original setting
- What if clusters contain clusters themselves?
- Then we need hierarchical clustering!
Hierarchical Clustering

Join the most similar clusters
- Iteratively join the two most similar clusters
- But how to measure similarity between clusters?

Similarity of clusters
- Single link: $S(C_i, C_j) = \min_{x \in C_i, x' \in C_j} d(x, x')$
- Average link: $S(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i, x' \in C_j} d(x, x')$
- Maximum link: $S(C_i, C_j) = \max_{x \in C_i, x' \in C_j} d(x, x')$
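These linkage criteria are available off the shelf, e.g. in scipy, where maximum link is called 'complete'. A usage sketch on toy data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)
Z = linkage(X, method='single')    # also: 'average', 'complete' (maximum link)
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy into 2 clusters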
The end

See you tomorrow!
Next topic: Feature Selection