Clustering In this example distance matrix: and have the most - PDF document

30 ‐ Mar ‐ 15 A distance matrix allows us to create clusters Clustering • In this example distance matrix: and have the most similar vectors 0 0.265 0.799 – and are the second most similar – 0.265 0 0.534 and are the third most similar – • These relationships are hierarchical 0 0.799 0.534 • We can write them as a bracket ‐ notation or as a cladogram: (( , ), ); (( , ), ); • If we have a lot of genes/microbes/etc, then we can define hierarchical clusters that might represent: – Genes that are involved in the same metabolic pathway – Micro ‐ organisms that respond to the same environmental signals • Thus, clustering allows us to make biological discoveries! Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis – (or at least: make specific predictions about potential biological Utrecht University, March 30 th 2015 discoveries, and these predictions can be tested) Representation in fewer dimensions Clustering algorithms • Single linkage • Below is a two ‐ dimensional representation of ten multi ‐ – Distance between clusters is the distance between closest vectors dimensional vectors • Complete linkage – Some vectors are similar in multi ‐ dimensional space – Distance between clusters is distance between most distant vectors – This is still visible in the two ‐ dimensional representation • Average linkage • Distance matrix depends on distance measure (previous – A.k.a. Unweighted Pair Group Method with Arithmetic mean (UPGMA) lecture) – Distance between clusters is the average distance between all vectors Distance matrix: 0 0 a b a b c c d d e e f f g g h h i i a a 0 0 j j k k l l m n m n o p o p q q b b j j 0 0 r r s s t t u u v w x v w x c c k k r r 0 0 y y z α z α β γ β γ δ δ d d d d d l l s s y y 0 0 ε ε ζ ζ η θ η θ ι ι e m t e m t z z ε ε 0 0 κ κ λ μ λ μ ν ν f f n u n u α α ζ ζ κ κ 0 0 ξ ξ π π ρ ρ g g o v o v β β η η λ λ ξ ξ 0 0 ς ς σ σ Identical vector Similar vectors h h p w γ p w γ θ θ μ π μ π ς ς 0 0 τ τ Different vectors i i q q x x δ δ ι ι ν ρ ν ρ σ σ τ τ 0 0 Chaining versus clumping behavior Example • Cluster using single linkage • Single linkage cladograms typically A B C D E A ‐ 1 2 5 7 show chaining behavior ((((A,B),C),D),E); A B B ‐ 8 7 9 – Often, a single close data point (vector) C C ‐ 3 5 D is added to an existing cluster D ‐ 5 E E ‐ • Cluster using complete linkage ((A,B),((C,D),E))); • Conversely, complete linkage y, p g A B cladograms are more clumped C D E • Cluster using average linkage • Average linkage (UPGMA) is ((A,B),((C,D),E))); A intermediate B C D E 1

30 ‐ Mar ‐ 15 Branch lengths Effect of distance measures on clustering • Cluster using average linkage (UPGMA) ((A,B),((C,D),E))); A B C D E 0. 25 A ‐ 1 2 5 7 Euclidian B ‐ 8 7 9 0. 20 C ‐ 3 5 Gene 1 /Expression D ‐ 5 Gene 1 Gene 2 E ‐ 0. 15 Gene 2 Gene 3 0 5 0.5 A Abundance/ A 3.17 0. 10 B 0.5 B Gene 3 Correlation C 1.5 C 5 0.0 1 Gene 1 D 1.5 D 0.67 Gene 3 0 E 2.5 E Gene 2 1 2 3 4 5 6 7 8 9 10 Time/environments/samples… ((A ,B ) ,((C ,D ) ,E ) )); :0.5 :0.5 :3.17 :1.5 :1.5 :1 :2.5 :0.67 Changing perspective Bi ‐ clustering 0. 25 Between the genes (etc) 0. 20 0 x y Abundance/Expression 0. 15 x 0 z 0. 10 y z 0 0.0 5 1 1 2 3 2 3 4 4 5 6 7 5 6 7 8 9 10 8 9 10 Genes 0 Between the samples (etc) 1 0 a b c d e f g h i 1 2 3 4 5 6 7 8 9 10 2 a 0 j k l m n o p q Time/Environments/Samples 3 b j 0 r s t u v w x 4 c k r 0 y z α β γ δ 1 0.20 0.15 0.12 2 0.17 0.15 0.09 5 d l s y 0 ε ζ η θ ι 3 0.16 0.16 0.08 6 e m t z ε 0 κ λ μ ν 4 0.20 0.15 0.11 7 f n u α ζ κ 0 ξ π ρ 5 0.20 0.16 0.12 6 0.17 0.16 0.10 8 g o v β η λ ξ 0 ς σ 7 0.16 0.15 0.08 8 9 h p w γ θ μ π ς 0 τ 0.20 0.15 0.12 9 0.18 0.16 0.11 10 i q x δ ι ν ρ σ τ 0 Samples 10 0.16 0.15 0.08 How good is my clustering? The cell cycle • Does the picture or cladogram fit your expectation? • The cell cycle is an important process in all cellular life forms • Do genes that “should” cluster together, cluster together? – Reproduction – Growth and development – Tissue renewal 2

30 ‐ Mar ‐ 15 Genes involved in the cell cycle oscillate Discovering new cell cycle related genes • All ~6,200 genes in the genome of baker’s yeast ( Saccharomyces cerevisiae ) were measured during several cell cycles Time  • ~800 genes oscillate indicating that they are ~800 genes involved in the cell cycle Time  Enterotypes • Bacterial abundances in fecal samples from 39 people were determined by using metagenomics • The abundance profiles were clustered into three distinct types of gut flora, called “enterotypes” – Bacteroides ‐ dominated – Prevotella ‐ dominated – Ruminococcus ‐ dominated Ruminococcus dominated • Including more data revealed that enterotypes are actually gradients 3

Clustering In this example distance matrix: and have the most - PDF document

30 Mar 15 A distance matrix allows us to create clusters Clustering In this example distance matrix: and have the most similar vectors 0 0.265 0.799 and are the second most similar 0.265 0 0.534 and are the third most

Graph Clustering Graph Clustering What is clustering? What is clustering? Finding patterns

Subspace Clustering Ensemble Clustering Subspace Clustering, Ensemble Clustering, Alternative

Evolutionary Clustering Presenter: Lei Tang Evolutionary Clustering Evolutionary Clustering

Clustering A Categorization of Major Clustering Methods Partitioning Methods

Trust based Clustering for Group Trust based Clustering for Group Trust based Clustering for

Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics

Cl Clustering t i A Categorization of Major Clustering Methods Partitioning Methods

Clustering Hierarchical clustering, k-mean clustering Genome 559: Introduction to Statistical and

CSCE 478/878 Lecture 8: Stephen Scott Clustering Introduction Outline Clustering Stephen

Clustering and Dimensionality Reduction Preview Clustering K -means clustering

Clustering kMeans, Expectation Maximization, Self-Organizing Maps Outline K-means

Lecture 23: Spectral clustering Hierarchical clustering What is a good clustering?

PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering Yevgeny Seldin

Introduction to Machine Learning, Clustering and EM Barnab s P czos Contents Clustering

Graph Clustering Why graph clustering is useful? Distance matrices are graphs as useful as

Prediction for Processes on Network Graphs Gonzalo Mateos Dept. of ECE and Goergen Institute for

Genome Sequencing: Introduc2on to Fragment Assembly Lecture 5:

Structural Studies of an AAA+ ATPase Structural Studies of an AAA+ ATPase N-ethylmaleimide

BMI-206 Structure-Structure comparisons Sequence-Structure comparisons Marc A. Marti-Renom

Composite Pattern Discovery for PCR Application Stanislav Angelov University of Pennsylvania,

Main parameters (invariants) 160 letters Omaha -Nebraska- -> Boston Diameter average

GDR ADN, 2-4 mai 2012 Replication in eukaryotic genomes Specific features of eukaryotic

Semester projects Semester projects Semester projects Semester projects Principles of Complex

Sambuz

Useful Links

Newsletter

Mail Us

Clustering In this example distance matrix: and have the most - PDF document

30 Mar 15 A distance matrix allows us to create clusters Clustering In this example distance matrix: and have the most similar vectors 0 0.265 0.799 and are the second most similar 0.265 0 0.534 and are the third most

Graph Clustering Graph Clustering What is clustering? What is clustering? Finding patterns

Subspace Clustering Ensemble Clustering Subspace Clustering, Ensemble Clustering, Alternative

Evolutionary Clustering Presenter: Lei Tang Evolutionary Clustering Evolutionary Clustering

Clustering A Categorization of Major Clustering Methods Partitioning Methods

Trust based Clustering for Group Trust based Clustering for Group Trust based Clustering for

Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics

Cl Clustering t i A Categorization of Major Clustering Methods Partitioning Methods

Clustering Hierarchical clustering, k-mean clustering Genome 559: Introduction to Statistical and

CSCE 478/878 Lecture 8: Stephen Scott Clustering Introduction Outline Clustering Stephen

Clustering and Dimensionality Reduction Preview Clustering K -means clustering

Clustering kMeans, Expectation Maximization, Self-Organizing Maps Outline K-means

Lecture 23: Spectral clustering Hierarchical clustering What is a good clustering?

PAC-Bayesian Analysis of Co-clustering, Graph Clustering and Pairwise Clustering Yevgeny Seldin

Introduction to Machine Learning, Clustering and EM Barnab s P czos Contents Clustering

Graph Clustering Why graph clustering is useful? Distance matrices are graphs as useful as

Prediction for Processes on Network Graphs Gonzalo Mateos Dept. of ECE and Goergen Institute for

Genome Sequencing: Introduc2on to Fragment Assembly Lecture 5:

Structural Studies of an AAA+ ATPase Structural Studies of an AAA+ ATPase N-ethylmaleimide

BMI-206 Structure-Structure comparisons Sequence-Structure comparisons Marc A. Marti-Renom

Composite Pattern Discovery for PCR Application Stanislav Angelov University of Pennsylvania,

Main parameters (invariants) 160 letters Omaha -Nebraska- -&gt; Boston Diameter average

GDR ADN, 2-4 mai 2012 Replication in eukaryotic genomes Specific features of eukaryotic

Semester projects Semester projects Semester projects Semester projects Principles of Complex

Sambuz

Useful Links

Newsletter

Mail Us

Main parameters (invariants) 160 letters Omaha -Nebraska- -> Boston Diameter average