CSCI1950-Z Computational Methods for Biology
Lecture 13, Ben Raphael, March 11, 2009
http://cs.brown.edu/courses/csci1950-z/
Topic 2: Functional Genomics
Biology 101: Central Dogma
What can we measure at each level?
• DNA: sequencing (expensive), hybridization (noisy)
• RNA: sequencing (expensive), hybridization (noisy)
• Protein: mass spectrometry (noisy), hybridization (very noisy!)
DNA Base Pairing / DNA/RNA Base Pairing
RNA is single-stranded and uses U (uracil) in place of T (thymine).
RNA Microarrays: Gene Expression Data
Rows: genes; columns: samples/conditions.
Each microarray experiment gives an expression vector u = (u_1, …, u_n), where u_i is the expression value of gene i.
[Figure: gene expression matrix, BMC Genomics 2006, 7:279]
Topics
• Methods for Clustering
  – Hierarchical, graph-based (clique-finding), matrix-based (PCA)
• Methods for Classification
  – Nearest neighbors, support vector machines
• Data Integration: Bayesian Networks

Gene Expression Data
Goal: group genes with similar expression patterns over multiple samples/conditions.
[Figure: gene expression matrix, BMC Genomics 2006, 7:279]
Clustering
Goal: partition data points into clusters.
• Input: n data points, given as an n × n distance matrix, e.g.

     0  11   7   5
    11   0   4   6
     7   4   0   9
     5   6   9   0

• Output: k clusters, such that points within a cluster are "closer" to each other than to points in other clusters.

Properties of a good clustering/partition:
• Separation: points in different clusters are far apart.
• Homogeneity: points in the same cluster are close.
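As an illustration (not from the slides), here is a minimal Python sketch of building such an n × n distance matrix from coordinate data; the function name and the choice of Euclidean distance are my own assumptions:

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances; entry [i][j] = d(point i, point j)."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance matrices are symmetric with zero diagonal.
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d
```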
Agglomerative Hierarchical Clustering
Iteratively merge the closest groups into larger groups.

  C ← { {1}, …, {n} }
  While |C| > 1 do
    [Find closest clusters] Choose C_i, C_j minimizing d(C_i, C_j).
    C_k ← C_i ∪ C_j
    [Replace C_i and C_j by C_k] C ← (C \ C_i \ C_j) ∪ C_k

How do we compute the distance d(C_i, C_j) between clusters?
Agglomerative Hierarchical Clustering
Distance between clusters defined as the average pairwise distance.

Average linkage clustering: given two disjoint clusters C_i, C_j,
$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d_{pq}$

Initialization: assign each x_i to its own cluster C_i.
Iteration: find the two clusters C_i and C_j such that d_ij is minimum; let C_k = C_i ∪ C_j; delete C_i and C_j.
Termination: when a single cluster remains.
The sequence of merges is recorded as a dendrogram.
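A minimal Python sketch (my own, not the course's code) of the merge loop above with average linkage; the merge-list return value, chosen here to stand in for a dendrogram, is an assumption:

```python
def average_linkage_clustering(d):
    """Agglomerative clustering with average linkage.

    d: symmetric n x n distance matrix (list of lists).
    Returns the list of merges [(cluster_a, cluster_b, distance), ...],
    i.e. the information needed to draw a dendrogram.
    """
    clusters = [frozenset([i]) for i in range(len(d))]
    merges = []

    def linkage(a, b):
        # Average pairwise distance between clusters a and b.
        return sum(d[p][q] for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```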
UPGMA Algorithm (Unweighted Pair Group Method with Averages)
Initialization: assign each x_i to its own cluster C_i; define one leaf per sequence, each at height 0.
Iteration: find the two clusters C_i and C_j such that d_ij is minimum; let C_k = C_i ∪ C_j; add a vertex connecting C_i and C_j and place it at height d_ij / 2; delete C_i and C_j.
Termination: when a single cluster remains.

Agglomerative Hierarchical Clustering: linkage functions for clusters C_i, C_j
• Single linkage: $d(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d_{pq}$
• Complete linkage: $d(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d_{pq}$
• Average linkage: $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d_{pq}$
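For comparison, one-line Python versions of the three linkage functions, assuming a distance matrix d and clusters given as collections of point indices (function names are mine):

```python
def single_linkage(d, ci, cj):
    # Distance between the closest cross-cluster pair.
    return min(d[p][q] for p in ci for q in cj)

def complete_linkage(d, ci, cj):
    # Distance between the farthest cross-cluster pair.
    return max(d[p][q] for p in ci for q in cj)

def average_linkage(d, ci, cj):
    # Mean distance over all cross-cluster pairs.
    return sum(d[p][q] for p in ci for q in cj) / (len(ci) * len(cj))
```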
Agglomerative Hierarchical Clustering
Where are the clusters? Cut the dendrogram at some height; the subtrees below the cut are the clusters. By choosing the cut, any number of clusters can be defined.

Cluster Centers
Each cluster is defined by a center/centroid. (Whiteboard)
Another Greedy k-means Move
Cost(P) = k-means "cost" of partition P (sum of squared distances from each point to its cluster centroid).
P_{i→C}: the clustering obtained from P by moving point i to cluster C.
Δ(i→C) = cost(P) − cost(P_{i→C})
A move with Δ(i→C) > 0 lowers the cost, so the greedy step picks the move with the largest Δ.

How many clusters?
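A sketch of the move cost, assuming Euclidean points and taking Cost(P) as the usual sum of squared distances to centroids; the function names and data layout are hypothetical:

```python
def kmeans_cost(points, clusters):
    """Sum of squared Euclidean distances to cluster centroids.

    points: list of coordinate tuples; clusters: list of lists of point indices.
    """
    total = 0.0
    dim = len(points[0])
    for cluster in clusters:
        if not cluster:
            continue
        centroid = [sum(points[i][k] for i in cluster) / len(cluster) for k in range(dim)]
        total += sum(
            sum((points[i][k] - centroid[k]) ** 2 for k in range(dim)) for i in cluster
        )
    return total

def move_delta(points, clusters, i, target):
    """Delta(i -> target) = cost(P) - cost(P with point i moved to cluster `target`)."""
    moved = [[p for p in c if p != i] for c in clusters]
    moved[target].append(i)
    return kmeans_cost(points, clusters) - kmeans_cost(points, moved)
```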
Distance Graph
Distance graph G(Θ) = (V, E) with
• V = data points
• E = {(i, j) : d(i, j) < Θ}
[Example figure with Θ = 7]

Cliques
• A graph is complete provided all possible edges are present.
• A subgraph that is a complete graph is called a clique.
• The separation and homogeneity properties of a good clustering imply: clusters = cliques.
[Figure: K_3, K_4, K_5]
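A small sketch of the distance-graph construction under the same assumptions as before (symmetric matrix d, threshold theta):

```python
def distance_graph(d, theta):
    """Edges between all pairs of points at distance < theta.

    d: symmetric n x n distance matrix. Returns a set of edges (i, j) with i < j.
    """
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] < theta}
```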
Cliques and Clustering
A good clustering gives a distance graph with:
1. One connected component per cluster (separation).
2. An edge between every pair of vertices within each connected component (homogeneity).
[Figure: K_3, K_4, K_5]

Clique Graphs
A clique graph is a graph whose connected components are all cliques.
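A sketch of the clique-graph test implied by the definition: every connected component must be complete. The names and representation (edge list over integer vertices) are my own choices:

```python
def is_clique_graph(n, edges):
    """True if every connected component of the graph is a clique.

    n: number of vertices; edges: iterable of (i, j) pairs.
    """
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    seen = set()
    for start in range(n):
        if start in seen:
            continue
        # Collect the connected component containing `start` (DFS).
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        # In a clique on c vertices, every vertex has degree c - 1 inside it.
        if any(len(adj[v] & component) != len(component) - 1 for v in component):
            return False
    return True
```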
Distance Graph vs. Clique Graph
Distance graphs from real data have missing edges and extra edges relative to a clique graph.

Corrupted Cliques Problem
Input: graph G.
Output: the smallest number of edges to add or remove to transform G into a clique graph.
NP-hard (Shamir, Sharan & Tsur 2004).
Extending a Subpartition
• Suppose we knew the optimal clustering of a subset V′ ⊆ V.
• Extend this clustering to all of V.

Cluster Affinity
N(v, C_j) = number of edges from v to C_j.
Define the affinity (relative density) of v to C_j as N(v, C_j) / |C_j|.

Maximum Affinity Extension
Assign v to argmax_j N(v, C_j) / |C_j|.
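A sketch of maximum affinity extension; I assume vertices are assigned one at a time and that ties are broken arbitrarily, neither of which the slides specify:

```python
def max_affinity_extension(adj, clusters, unassigned):
    """Assign each unassigned vertex to the cluster of highest affinity.

    adj: dict mapping each vertex to its set of neighbors in the distance graph.
    clusters: list of sets, the known clustering of V'.
    """
    for v in unassigned:
        # Affinity of v to cluster C: (# edges from v into C) / |C|.
        best = max(clusters, key=lambda c: len(adj[v] & c) / len(c))
        best.add(v)
    return clusters
```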
Parallel Clustering with Cores (PCC) (Ben-Dor et al. 1999)
Score(P) = minimum number of edges to add/remove so that partition P becomes a clique graph. Straightforward to compute since P is known.

PCC: Algorithmic Analysis
Very inefficient: the number of such partitions equals φ(|S′|, k), the Stirling number of the second kind:
$\varphi(r, k) = \frac{1}{k!} \sum_{i=0}^{k} (-1)^i \binom{k}{i} (k - i)^r$
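The Stirling-number formula translates directly to Python (math.comb requires Python 3.8+):

```python
from math import comb, factorial

def stirling2(r, k):
    """Stirling number of the second kind: the number of ways to partition
    r items into k nonempty, unlabeled blocks."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** r for i in range(k + 1)) // factorial(k)

assert stirling2(4, 2) == 7  # e.g. the 7 ways to split {1,2,3,4} into 2 blocks
```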
Corrupted Cliques Random Graph
1) Start with a clique graph H.
2) Randomly add/remove edges, each with probability p.
Obtain graph G_{H,p}.

PCC: Algorithmic Analysis
• PCC selects two random sets of vertices; the analysis relies on probability.
• Let PCC(G) denote the output graph (a clique graph).
• For graphs G = (V, E) and G′ = (V′, E′), define Δ(G, G′) = |E Δ E′| = |E \ E′| + |E′ \ E|.
• One can show (see the Shamir notes) that with high probability the output of PCC is at least as good as the clique graph H:
  Pr[Δ(PCC(G_{H,p}), G_{H,p}) ≤ Δ(H, G_{H,p})] > 1 − δ.
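A sketch of the corruption model and the edge-difference measure Δ; interpreting step 2 as flipping each vertex pair independently with probability p is my reading of the slide:

```python
import random

def corrupt_clique_graph(n, edges, p, seed=None):
    """G_{H,p}: flip each of the n*(n-1)/2 vertex pairs of the clique graph H
    independently with probability p (adding or removing that edge)."""
    rng = random.Random(seed)
    corrupted = set(edges)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                corrupted ^= {(i, j)}  # flip: remove if present, add if absent
    return corrupted

def edge_difference(e1, e2):
    """Delta(G, G') = |E symmetric-difference E'| = |E \\ E'| + |E' \\ E|."""
    return len(e1 ^ e2)
```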
Cluster Affinity Search Technique (CAST)

Clustering of Gene Expression Samples
Each microarray experiment: expression vector x = (x_1, …, x_n), where x_i is the expression value of gene i.
Goal: group similar vectors.
[Figure: gene expression matrix, BMC Genomics 2006, 7:279]
Distances Between Vectors
Pearson product-moment correlation coefficient:
$r_{ij} = \frac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{(m - 1)\, s_i s_j}$
where the sample mean and sample standard deviation are
$\bar{x}_i = \frac{1}{m} \sum_{k=1}^{m} x_{ik}, \qquad s_i = \sqrt{\frac{1}{m-1} \sum_{k=1}^{m} (x_{ik} - \bar{x}_i)^2}$
Measures the linear relationship between vectors x_i and x_j; −1 ≤ r_ij ≤ 1.
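A direct translation of the formula into Python (assumes m > 1 and non-constant vectors, so the standard deviations are nonzero):

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation r between equal-length vectors x and y."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    # Sample standard deviations (divide by m - 1).
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / (m - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (m - 1))
    # Sample covariance, also with the m - 1 convention.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1)
    return cov / (sx * sy)
```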