Co-expression analysis of RNA-seq data
Etienne Delannoy & Marie-Laure Martin-Magniette & Andrea Rau
Plant Science Institut of Paris-Saclay (IPS2), Applied Mathematics and Informatics Unit (MIA-Paris), Genetique Animale et Biologie Integrative (GABI), INRA

Outline
1. Co-expression analysis introduction
2. Unsupervised clustering
   - Centroid-based clustering: K-means, HCA
   - Model-based clustering
   - Mixture models for RNA-seq data
3. Conclusion / discussion

Aims for this talk
- What is the biological/statistical meaning of co-expression for RNA-seq?
- What methods exist for performing co-expression analysis?
- How to choose the number of clusters present in data?
- Advantages / disadvantages of different approaches: speed, stability, robustness, interpretability, model selection, ...

Design of a transcriptomics project
Biological question
↓ ↑
Experimental design
↓
Data acquisition
↓
Data analysis: normalization, differential analysis, clustering, networks, ...
↓ ↑
Validation

Gene co-expression
[Figure: results of a Google image search for “Coexpression”]

Gene co-expression is...
- The simultaneous expression of two or more genes [2]
- Groups of co-transcribed genes [3]
- Similarity of expression [4] (correlation, topological overlap, mutual information, ...)
- Groups of genes that have similar expression patterns [5] over a range of different experiments
- Related to shared regulatory inputs, functional pathways, and biological process(es) [6]

[2] https://en.wiktionary.org/wiki/coexpression
[3] http://bioinfow.dep.usal.es/coexpression
[4] http://coxpresdb.jp/overview.shtml
[5] Yeung et al. (2001)
[6] Eisen et al. (1998)

From co-expression to gene function prediction
- Transcriptomic data: main source of ’omic information available for living organisms
  - Microarrays (∼1995 - )
  - High-throughput sequencing: RNA-seq (∼2008 - )
- Co-expression (clustering) analysis
  - Study patterns of relative gene expression (profiles) across several conditions
  - ⇒ Co-expression is a tool to study genes without known or predicted function (orphan genes)
  - Exploratory tool to identify expression trends from the data (≠ sample classification, identification of differential expression)

RNA-seq profiles for co-expression
Let $y_{ij}$ be the raw count for gene $i$ in sample $j$, with library size $s_j$.
- Profile for gene $i$: $p_{ij} = \dfrac{y_{ij}}{\sum_{\ell} y_{i\ell}}$
- Normalized profile for gene $i$: $p_{ij} = \dfrac{y_{ij}/s_j}{\sum_{\ell} y_{i\ell}/s_{\ell}}$
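The two profile definitions can be sketched in a few lines of plain Python. This is only an illustration: the function name `profiles`, the toy count matrix, and the library sizes are hypothetical, and the choice to return `None` for a gene with zero counts everywhere is an assumption, not something stated on the slides.

```python
def profiles(counts, lib_sizes=None):
    """Return per-gene profiles p_ij (rows are genes, columns are samples).

    If lib_sizes is given, each count y_ij is first divided by s_j,
    which gives the normalized profile.
    """
    result = []
    for row in counts:
        if lib_sizes is not None:
            row = [y / s for y, s in zip(row, lib_sizes)]
        total = sum(row)
        # Assumption: a gene with no counts at all has no defined profile.
        result.append([v / total for v in row] if total > 0 else None)
    return result


# Toy data (hypothetical): 2 genes measured in 3 samples.
counts = [[10, 30, 60], [5, 5, 10]]
raw = profiles(counts)
norm = profiles(counts, lib_sizes=[1.0, 2.0, 4.0])
```

Each profile sums to 1 across samples, so genes with very different overall expression levels become comparable in terms of their relative pattern.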

Unsupervised clustering
Objective: define homogeneous and well-separated groups of genes from transcriptomic data
- What does it mean for a pair of genes to be close?
- Given this, how do we define groups?
Two broad classes of methods are typically used:
1. Centroid-based clustering (K-means and hierarchical clustering)
2. Model-based clustering (mixture models)

Similarity measures
Similarity between genes is defined with a distance:
- Euclidean distance (L2 norm): $d^2(y_i, y_{i'}) = \sum_{\ell=1}^{p} (y_{i\ell} - y_{i'\ell})^2$
  ⇒ Note: sensitive to scaling and differences in average expression level
- Pearson correlation coefficient: $d_{pc}(y_i, y_{i'}) = 1 - \rho_{i,i'}$
- Spearman rank correlation coefficient: as above, but replace $y_{ij}$ with the rank of gene $i$ across all samples $j$
- Absolute or squared correlation: $d_{ac}(y_i, y_{i'}) = 1 - |\rho_{i,i'}|$ or $d_{sc}(y_i, y_{i'}) = 1 - \rho_{i,i'}^2$
- Manhattan distance: $d_{\text{Manhattan}}(y_i, y_{i'}) = \sum_{\ell=1}^{p} |y_{i\ell} - y_{i'\ell}|$
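As a rough illustration of how these distances behave, here is a minimal pure-Python sketch. All function names and the toy vectors are hypothetical; the Pearson helper assumes neither vector is constant (otherwise the denominator is zero), and ties are handled with average ranks, which is one common convention rather than something specified on the slide.

```python
import math


def euclidean_sq(a, b):
    """Squared Euclidean distance d^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def ranks(v):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def d_pearson(a, b):
    return 1 - pearson(a, b)


def d_spearman(a, b):
    return 1 - pearson(ranks(a), ranks(b))


def d_abs_corr(a, b):
    return 1 - abs(pearson(a, b))


# Toy expression vectors (hypothetical): b is a rescaled copy of a, c is a reversed.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [4.0, 3.0, 2.0, 1.0]
```

With these vectors, `euclidean_sq(a, b)` is large even though `a` and `b` have the same shape, while `d_pearson(a, b)` is essentially zero. The anti-correlated pair `(a, c)` is far apart under `d_pearson` but close under `d_abs_corr`.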

Inertia measures
Homogeneity of a group is defined with an inertia criterion. Let $y_D$ be the centroid of the dataset and $y_{C_k}$ the centroid of group $C_k$, which contains $n_k$ genes:

$$\text{Inertia} = \sum_{i=1}^{G} d^2(y_i, y_D) = \sum_{k=1}^{K} \sum_{i \in C_k} d^2(y_i, y_{C_k}) + \sum_{k=1}^{K} n_k \, d^2(y_{C_k}, y_D)$$

= within-group inertia + between-group inertia
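The decomposition can be checked numerically on a small example. The sketch below is illustrative only: the helper names, the four toy points, and the labels are hypothetical, and $d^2$ is taken to be the squared Euclidean distance.

```python
def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def inertia_decomposition(points, labels):
    """Return (total, within-group, between-group) inertia."""
    overall = centroid(points)
    total = sum(d2(p, overall) for p in points)
    within = 0.0
    between = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        ck = centroid(members)
        within += sum(d2(p, ck) for p in members)
        between += len(members) * d2(ck, overall)
    return total, within, between


# Toy data (hypothetical): 4 genes in 2 dimensions, split into 2 groups.
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
total, within, between = inertia_decomposition(points, labels)
```

Because the total inertia does not depend on the partition, maximizing the between-group inertia is the same as minimizing the within-group inertia.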

In practice...
Objective: cluster $G$ genes into $K$ groups, maximizing the between-group inertia
Exhaustive search is impossible
Two algorithms are often used:
1. K-means
2. Hierarchical clustering

K-means algorithm
Initialization: K centroids are chosen randomly or by the user
Iterative algorithm:
1. Assignment: each gene is assigned to a group according to its distance to the centroids
2. Calculation of the new centroids
Stopping criterion: when the maximal number of iterations is reached OR when groups are stable
Properties:
- Rapid and easy
- Results depend strongly on initialization
- Number of groups K is fixed a priori
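The steps above can be sketched as a short pure-Python routine. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, and the toy points are hypothetical, initialization picks K of the data points at random, and keeping the old centroid when a group becomes empty is an assumption made here for simplicity.

```python
import random


def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: K centroids chosen randomly among the data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment: each point goes to the group with the closest centroid.
        new_labels = [min(range(k), key=lambda c: d2(p, centroids[c]))
                      for p in points]
        # Stopping criterion: groups are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Calculation of the new centroids.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: keep the old centroid if a group is empty
                centroids[c] = centroid(members)
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, k=2)
```

Since the result depends on the initial centroids, a common practice is to rerun with several seeds and keep the partition with the smallest within-group inertia.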

K-means illustration
Animation: http://shabal.in/visuals/kmeans/1.html

K-means algorithm: Choice of K?
- Elbow plot of within-sum of squares: examine the percentage of variance explained as a function of the number of clusters
- Gap statistic: estimate change in within-cluster dispersion compared to that under expected reference null distribution
- Silhouette statistic: measure of how closely data within a cluster is matched and how loosely it is matched to neighboring clusters
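Of the three criteria, the silhouette statistic is the simplest to sketch from scratch. The code below uses the usual definition: for each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to any other cluster, and the score is $(b - a) / \max(a, b)$. The function names and toy data are hypothetical, and scoring singleton clusters (or a single-cluster partition) as 0 is a convention assumed here.

```python
import math


def dist(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette score of a partition."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: score 0 when the silhouette is undefined.
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in idxs) / len(idxs)
                for other, idxs in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
score_k2 = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
score_k3 = mean_silhouette(points, [0, 0, 1, 2, 2, 2])
```

Here the two-group partition scores higher than the three-group one, which splits a compact group. In practice the score would be computed for the partition returned by the clustering algorithm at each candidate K.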

Hierarchical clustering analysis (HCA)
Objective: construct embedded partitions of $(G, G-1, \ldots, 1)$ groups, forming a tree-shaped data structure (dendrogram)
Algorithm:
- Initialization: G groups for G genes
- At each step:
  - Closest genes are clustered
  - Calculate distance between this new group and the remaining genes

Distances between groups for HCA
- Single-linkage clustering: $D(C_k, C_{k'}) = \min_{y \in C_k} \min_{y' \in C_{k'}} d^2(y, y')$
- Complete-linkage clustering: $D(C_k, C_{k'}) = \max_{y \in C_k} \max_{y' \in C_{k'}} d^2(y, y')$
- Ward distance: $D(C_k, C_{k'}) = d^2(y_{C_k}, y_{C_{k'}}) \times \dfrac{n_k n_{k'}}{n_k + n_{k'}}$, where $n_k$ is the number of genes in group $C_k$
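A naive agglomerative procedure using these three group distances might look like the following. It is a sketch for small inputs only (it recomputes all pairwise group distances at every step), and the names `hca`, `group_distance`, and the toy points are hypothetical; $d^2$ is taken to be the squared Euclidean distance.

```python
def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def group_distance(ca, cb, points, method):
    """Distance between two groups given as lists of point indices."""
    if method == "ward":
        ya = centroid([points[i] for i in ca])
        yb = centroid([points[j] for j in cb])
        return d2(ya, yb) * len(ca) * len(cb) / (len(ca) + len(cb))
    pairwise = [d2(points[i], points[j]) for i in ca for j in cb]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    raise ValueError("unknown method: " + method)


def hca(points, n_groups, method="single"):
    """Merge the two closest groups until n_groups remain."""
    clusters = [[i] for i in range(len(points))]  # one group per gene
    merges = []  # history of merges, i.e. the dendrogram from the leaves up
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = group_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Toy data (hypothetical): two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
groups_single, _ = hca(points, 2, method="single")
groups_complete, _ = hca(points, 2, method="complete")
groups_ward, merges_ward = hca(points, 2, method="ward")
```

Cutting the tree at a given number of groups corresponds to stopping the loop early; running it down to one group and reading `merges` gives the full sequence of embedded partitions.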

Distances between groups for HCA
[Figure: illustration of distances between groups]
Source: http://compbio.pbworks.com/w/page/16252903/Microarray%20Clustering%20Methods%20and%20Gene%20Ontology

HCA: additional details
Properties:
- HCA is stable since there is no initialization step
- K is chosen according to the tree
- Results strongly depend on the chosen distances
- Branch lengths are proportional to the percentage of inertia loss ⇒ a long branch indicates that the 2 groups are not homogeneous

Model-based clustering
- Probabilistic clustering models: data are assumed to come from distinct subpopulations, each modeled separately
- Rigorous framework for parameter estimation and model selection
- Output: each gene is assigned a probability of cluster membership
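To make the "probability of cluster membership" concrete, the sketch below computes posterior membership probabilities by Bayes' rule for a mixture whose parameters are already known. Everything specific here is an assumption for illustration: the univariate Gaussian components, the parameter values, and the function names. The slides do not specify the component distributions at this point, and in practice the parameters would be estimated from the data (typically with an EM algorithm).

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def membership_probabilities(x, weights, means, sds):
    """Posterior probability that x comes from each mixture component.

    Assumes the mixture density at x is not numerically zero.
    """
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal weights.
weights, means, sds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
probs_mid = membership_probabilities(0.0, weights, means, sds)
probs_right = membership_probabilities(2.0, weights, means, sds)
```

A point halfway between the two component means is assigned to each with probability 0.5, while a point at 2.0 is assigned mostly to the second component. Unlike K-means, this gives a soft assignment that reflects uncertainty.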
