SIB course 4-8 Feb 2008
Statistical analysis applied to genome and proteome analyses

Sven Bergmann
Department of Medical Genetics, University of Lausanne
Rue de Bugnon 27 - DGM 328, CH-1005 Lausanne, Switzerland
work: ++41-21-692-5452, cell: ++41-78-663-4980
http://serverdgm.unil.ch/bergmann

Part 1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA

How to get large-scale expression data?
Pool genome-wide expression measurements from many experiments!
[Figure: expression matrices (genes × conditions) from sets of specific conditions, e.g. cell-cycle and stress, pooled into one large-scale expression data set over diverse conditions]

Motivations
Why study a large heterogeneous set of expression data?
• Large: Better signals from noisy data!
• Heterogeneous: Global view of the transcription program!
Supervised vs. unsupervised approaches
• Large genome-wide data may contain answers to questions we do not ask!
• Need for both hypothesis-driven and exploratory analyses!

How to make sense of millions of numbers?
Hundreds of samples, thousands of genes: new analysis and visualization tools are needed!

K-means Clustering
“guess” k=3 (# of clusters)
http://en.wikipedia.org/wiki/K-means_algorithm
1. Start with random positions of centroids
2. Assign each data point to closest centroid
3. Move centroids to center of assigned points
Iterate steps 2-3 until the cost $\sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2$ is minimal, with k clusters $S_i$, i = 1, 2, ..., k, and centroids $\mu_i$ (the mean point of all the points in $S_i$).
http://en.wikipedia.org/wiki/K-means_algorithm

Hierarchical Clustering
Plus:
• Shows (re-ordered) data
• Gives hierarchy
Minus:
• Does not work well for many genes (usually apply cut-off on fold-change)
• Similarity over all genes/conditions
• Clusters do not overlap

K-means Clustering
Plus:
• visual
• intuitive
• relatively fast
Minus:
• have to “guess” number of clusters
• can give different results for distinct “starting seeds”
• distances computed over all features
• one cluster only per element
• no cluster hierarchy
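The three k-means steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the function name `kmeans`, its optional `init` argument, and the toy data are assumptions made for the example, and the cost is taken to be the sum of squared distances between each point and its assigned centroid.

```python
import numpy as np

def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Plain k-means on an (n_points, n_features) array.

    `init` (hypothetical convenience parameter) fixes the starting
    centroids; otherwise k distinct data points are drawn at random.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        # Step 1: random starting positions (here: k randomly chosen data points).
        centroids = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centroids = np.array(init, dtype=float)

    def assign(c):
        # Step 2: index of the closest centroid for every point.
        d = np.linalg.norm(points[:, None, :] - c[None, :, :], axis=2)
        return d.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid with no points stays where it is).
        updated = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids no longer move: converged
        centroids = updated

    labels = assign(centroids)
    # Cost: sum of squared distances to the assigned centroid.
    cost = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, cost


# Toy data (assumption): three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Different starting seeds can end in different local minima,
# so keep the run with the lowest cost.
runs = [kmeans(data, k=3, seed=s) for s in range(5)]
labels, centroids, cost = min(runs, key=lambda r: r[2])
print(centroids, cost)
```

Running several seeds and keeping the cheapest result is one common way to soften the "starting seeds" drawback listed above; it does not remove the need to guess k.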
Principal Component Analysis
Principal components (PCs) are projections onto the subspace with the largest variation in the data.
http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Example: 2 PCs for 3d-data
[Figure sequence from http://ordination.okstate.edu/PCA.htm]
• Raw data points: {a, …, z}
• Normalized data points: zero mean (& unit std)!
• Identification of axes with the most variance
• Most variance is along PCA1
• The direction of most variance perpendicular to PCA1 defines PCA2
• Cluster?

Reminder: Matrix multiplications
[Formulas not recovered: Definition, Scheme, Vectorized, Example]
http://en.wikipedia.org/wiki/Matrix_multiplication
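As a reminder of what the matrix-multiplication slide refers to, here is the standard definition with a small worked example. The symbols A, B, m, n, p and the 2 × 2 numbers are chosen for illustration and are not taken from the slide.

```latex
% Product of an m x n matrix A with an n x p matrix B (result is m x p):
(A \cdot B)_{ij} = \sum_{k=1}^{n} A_{ik} \, B_{kj},
\qquad i = 1, \ldots, m, \quad j = 1, \ldots, p

% Worked 2 x 2 example:
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\cdot
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix}
1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\
3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8
\end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
```

Each entry of the product is the scalar product of a row of the first matrix with a column of the second, which is why the inner dimensions must match.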
How do we get the PCs?
• The PCs are the eigenvectors of the covariance matrix C computed from the (mean-centered) data matrix E:
  $C = E^T \cdot E / (n-1)$
  $C \cdot pc = \lambda \cdot pc$

PCA: Example deletion mutants
[Schematic: E is 6k × 300 (genes × arrays), so $C = E^T \cdot E / (n-1)$ is 300 × 300 and each eigenvector pc in $C \cdot pc = \lambda \cdot pc$ has 300 entries]

And how to project?
• The projected data is just the product of the original data with the PCs:
  $E' = E \cdot PC$
• Principal Component or Transformation Matrix:
  $PC = [pc_1, pc_2, \ldots, pc_n]$ (where n is the number of PCs used)

PCA: Example deletion mutants
[Schematic: E' (6k × n) = E (6k × 300) · PC (300 × n)]
• The original gene expression profiles are over 300 arrays.
• The transformed data contain projections on n “eigen-genes” (linear combinations of the 300 arrays, shown in red).

PCA: Example deletion mutants
[Scatter plot, PCA2 vs. PCA1] The first 2 “eigen-genes” separate data into 3 clusters.
[Scatter plot, PCA3 vs. PCA1] Third “eigen-gene” (PCA3) reveals little structure!
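The two PCA steps (eigenvectors of the covariance matrix, then projection) can be sketched with NumPy as follows. The function name `pca` and the random toy matrix are assumptions for illustration; the deletion-mutant data from the slides is not used. In this sketch the divisor n - 1 uses the number of rows of the data matrix, which is a separate quantity from the number of PCs kept.

```python
import numpy as np

def pca(E, n_components):
    """PCA via the covariance matrix (rows: observations, columns: variables)."""
    # Mean-center each column, as the covariance formula assumes.
    E_centered = E - E.mean(axis=0)
    n_rows = E_centered.shape[0]
    C = E_centered.T @ E_centered / (n_rows - 1)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    PC = eigvecs[:, order]        # columns are pc_1, ..., pc_n
    E_proj = E_centered @ PC      # projected data: E' = E . PC
    return eigvals[order], PC, E_proj


# Toy stand-in for an expression matrix (assumption): 200 rows, 6 columns,
# with most variation along two hidden directions plus small noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
E = hidden + 0.05 * rng.normal(size=(200, 6))

eigvals, PC, E_proj = pca(E, n_components=2)
print(eigvals)        # variance captured by each PC
print(E_proj.shape)   # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is what "axes with the most variance" means in practice.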
Singular Value Decomposition
“SVD = bi-PCA”
$E = U \cdot D \cdot V^T$
• V: PC matrix of “eigen-genes” (composed of eigenvectors of $C = E^T \cdot E$)
• U: PC matrix of “eigen-arrays” (composed of eigenvectors of $C' = E \cdot E^T$)
http://public.lanl.gov/mewall/kluwer2002.html

SVD: Matrix representation
$E = U \cdot D \cdot V^T$
[Schematic: E (6k × 300) = U (6k × n, columns $u_1, \ldots, u_n$) · D (n × n, diagonal entries $\lambda_1, \ldots, \lambda_n$, zeros elsewhere) · $V^T$ (n × 300, rows $v_1, \ldots, v_n$)]
• $u_i$: eigen-arrays
• $v_i$: eigen-genes
• $\lambda_i$: singular values ($\lambda_i^2$ are the eigenvalues of $E^T \cdot E$ and $E \cdot E^T$)
• D: diagonal matrix
• i = 1, …, n
• n: rank(E) = #(independent arrays)
Alter O., Brown P.O., Botstein D. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 2000; 97:10101-06.

SVD: Example deletion mutants
$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$
[Schematic: $E_1$ is 6k × 300 with entries $(E_1)_{gc} = \lambda_1 \cdot u_1^{(g)} \cdot v_1^{(c)}$, g = 1, …, 6k, c = 1, …, 300; the outer product of a high/low column pattern $u_1$ and a high/low row pattern $v_1$ gives a block pattern of high and low values]

SVD: What is optimized?
$E = U \cdot D \cdot V^T = \sum_i \lambda_i \cdot u_i \cdot v_i^T$ (full expansion)
$E_1 = \lambda_1 \cdot u_1 \cdot v_1^T$ (rank-1 expansion)
$\Delta = |E - E_1|^2$ (sum of squared residuals)
Minimize $\Delta$ for free $u_1$ and $v_1$:
$E \cdot v_1 = \lambda_1 \cdot u_1$ and $E^T \cdot u_1 = \lambda_1 \cdot v_1$
implying:
$E \cdot E^T \cdot u_1 = \lambda_1^2 \cdot u_1$ and $E^T \cdot E \cdot v_1 = \lambda_1^2 \cdot v_1$
Bergmann et al., Phys. Rev. E 67, 031902 (2003)

SVD: Example deletion mutants
[Figure, n=1: panels “original data” (genes × arrays), “U” (genes × eigen-arrays), “V^T” (eigen-genes × arrays), “SVD(data) = U D V^T” (genes × arrays)]

SVD: Example deletion mutants
[Figure, n=2: same four panels, using the first two eigen-arrays and eigen-genes]
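The SVD relations can be checked numerically with `numpy.linalg.svd`. The sketch below uses a random toy matrix as a stand-in for the expression data (an assumption; the variable names are also chosen for illustration) and verifies the full expansion, the two coupled equations for the first pair of singular vectors, and the eigenvector relation that follows from them.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(60, 8))  # toy stand-in for a genes x arrays matrix

# Thin SVD: U is 60 x 8, s holds the 8 singular values in descending
# order, and Vt is 8 x 8 (rows are the right singular vectors).
U, s, Vt = np.linalg.svd(E, full_matrices=False)
u1, v1, lam1 = U[:, 0], Vt[0], s[0]

# Rank-1 term of the expansion and its squared residual.
E1 = lam1 * np.outer(u1, v1)
delta = np.sum((E - E1) ** 2)

# Each check below should print True.
print(np.allclose(E, U @ np.diag(s) @ Vt))       # E = U . D . V^T
print(np.allclose(E @ v1, lam1 * u1))            # E . v1 = lam1 . u1
print(np.allclose(E.T @ u1, lam1 * v1))          # E^T . u1 = lam1 . v1
print(np.allclose(E.T @ E @ v1, lam1**2 * v1))   # v1 is an eigenvector of E^T . E
print(np.allclose(delta, np.sum(s[1:] ** 2)))    # residual = sum of remaining lam_i^2
```

The last check shows why keeping the largest singular value first is the best rank-1 choice: the residual is exactly the sum of the squared singular values that were left out.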
SVD: Example deletion mutants
[Figure, n=3: panels “original data” (genes × arrays), “U” (genes × eigen-arrays), “V^T” (eigen-genes × arrays), “SVD(data) = U D V^T” (genes × arrays)]

Part 1: Analysis tools for large datasets
• Standard tools: k-means, PCA, SVD
• Modular analysis tools: CTWC, ISA, PPA

How to extract biological information from large-scale expression data?
Hierarchical clustering and other correlation-based methods may be good for small data sets, but:
Problems with large data:
• Clusters cannot overlap!
• Clustering based on correlations over all conditions:
  - sensitive to noise
  - computation intensive

How to extract biological information from large-scale expression data?
Search for transcription modules:
Set of genes co-regulated under a certain set of conditions
• context specific
• allow for overlaps
[Figure: expression matrix, genes × conditions]

Overview of “modular” analysis tools
• Cheng Y and Church GM. Biclustering of expression data. (Proc Int Conf Intell Syst Mol Biol. 2000;8:93-103)
• Getz G, Levine E, Domany E. Coupled two-way clustering analysis of gene microarray data. (Proc Natl Acad Sci U S A. 2000 Oct 24;97(22):12079-84)
• Tanay A, Sharan R, Kupiec M, Shamir R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. (Proc Natl Acad Sci U S A. 2004 Mar 2;101(9):2981-6)
• Sheng Q, Moreau Y, De Moor B. Biclustering microarray data by Gibbs sampling. (Bioinformatics. 2003 Oct;19 Suppl 2:ii196-205)
• Gasch AP and Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. (Genome Biol. 2002 Oct 10;3(11):RESEARCH0059)
• Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. (Genome Biol. 2000;1(2):RESEARCH0003)
… and many more!
http://serverdgm.unil.ch/bergmann/Publications/review.pdf

Coupled two-way Clustering