Lessons from Gene Expression, Kasper Daniel Hansen

Lessons from Gene Expression Kasper Daniel Hansen < khansen@jhsph.edu | www.hansenlab.org > McKusick-Nathans Institute of Genetic Medicine Department of Biostatistics Johns Hopkins University 1

Genomic Data Science Specialization @Coursera by JHU. Instructors: Liliana Florea, Kasper D. Hansen, Jeff Leek, Mihaela Pertea, Steven Salzberg, James Taylor. 6 classes, 4 weeks per class, continuous rollout 2

RNA-seq scRNA-seq Microarrays 4

Replication / Reproducibility: Replicate samples; Replicate the experiment; Replicate the conclusion; Computation replication / reproduction. “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!” - Upton Sinclair 5

Science “Proof” “Crap” 6

Science Most of biology 8

Different sub-fields have different standards: “without knowing anything - if this was your plot, what do you think about that little guy top right” http://drbecca.scientopia.org/2015/08/18/whose-problem-is-the-reproducibility-crisis-anyway/ 9

Technical variation Focus 10

Controls 11

Sequencing technology does not remove biological variability. [Figure: panels a and b plot sequencing s.d. against array s.d. (cor: 0.592, n: 5,003; cor: 0.492, n: 2,463); panel c plots centered expression against sample index for COX4NB and RASGRP1, sequencing vs. array.] Hansen (2011) Nat. Biotech 12
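The comparison on this slide (per-gene standard deviation on one platform against the same quantity on the other) can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and values below are made up, the helper names `per_gene_sd` and `sd_agreement` are hypothetical, and the exact variance estimator used in Hansen (2011) may differ from the plain sample s.d. used here.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def per_gene_sd(expr):
    """Map each gene to the sample s.d. of its expression across samples."""
    return {gene: stdev(values) for gene, values in expr.items()}

def sd_agreement(array_expr, seq_expr):
    """Correlate per-gene s.d. between platforms over the shared genes.

    Returns (correlation, number of shared genes).
    """
    genes = sorted(array_expr.keys() & seq_expr.keys())
    array_sd = per_gene_sd({g: array_expr[g] for g in genes})
    seq_sd = per_gene_sd({g: seq_expr[g] for g in genes})
    r = pearson([array_sd[g] for g in genes], [seq_sd[g] for g in genes])
    return r, len(genes)

# Toy data (hypothetical): the same four samples measured on both platforms.
array_expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 2, 3, 3],
    "g3": [5, 5, 5, 6],
    "g4": [0, 4, 8, 12],
}
seq_expr = {
    "g1": [1.1, 2.2, 2.9, 4.1],
    "g2": [2.0, 2.5, 3.0, 3.1],
    "g3": [5.0, 5.2, 5.1, 5.9],
    "g4": [1, 4, 9, 12],
    "g5": [7, 7, 8, 9],  # measured by sequencing only, so it is ignored
}

r, n = sd_agreement(array_expr, seq_expr)
print(f"cor: {r:.3f}  n: {n}")
```

If the variability were purely technical and platform-specific, the per-gene s.d. values would not be expected to track each other across platforms; a clearly positive correlation suggests a shared biological component.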

Number of replicates GWAS Cell Biology 13

Number of replicates: “We applied MixupMapper to five publicly available human genetical genomics datasets. On average, 3% of all analyzed samples had been assigned incorrect expression phenotypes: in one of the datasets 23% of the samples had incorrect expression phenotypes.” Westra (2011) Bioinformatics. Studies with huge numbers of samples have challenges as well 15

Batch effects. Tackling the widespread and critical impact of batch effects in high-throughput methods, Leek (2010) Nat Rev Genet. [Figure: fraction of sign-reversed correlations by batch pair (2002/2003, 2002/2004, 2002/2005, 2003/2004, 2003/2005, 2004/2005), labels not shuffled vs. labels shuffled.] Figure 3 | Batch effects also change the correlations between genes. We normalized every gene in the second gene expression data set to mean 0, variance 1 within each batch. (The 2006 batch was omitted owing to small sample size.) We identified all significant correlations (p < 0.05) between pairs of genes within each batch using a linear model. We looked at genes that showed a significant correlation in two batches and counted the fraction of times that the correlation changed between the two batches. A large percentage of significant correlations reversed signs across batches, suggesting that the correlation structure between genes changes substantially across batches. To confirm this phenomenon is due to batch, we repeated the process (looking for significant correlations that changed sign across batches) but with the batch labels randomly permuted. With random batches, a much smaller fraction of significant correlations change signs. This suggests that correlation patterns differ by batch, which would affect rank-based prediction methods as well as system biology approaches that rely on between-gene correlation to estimate pathways. 16
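The procedure in the figure caption (standardize within batch, find significant gene-gene correlations in each batch, count sign reversals, then repeat with shuffled batch labels) can be sketched roughly as follows. This is a minimal illustration, not the code behind the figure: the toy data and all function names are hypothetical, and significance is judged with a Fisher z approximation, swapped in here for the linear-model test the caption mentions.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Scale one gene's values within one batch to mean 0, variance 1."""
    m, s = mean(values), pstdev(values)
    if s == 0:  # constant gene: no usable signal in this batch
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]

def significant_correlations(expr, samples, alpha=0.05):
    """Return {(i, j): r} for gene pairs significantly correlated in one batch.

    expr: list of genes, each a list of expression values indexed by sample.
    samples: sample indices belonging to the batch.
    Assumption: p-values come from a Fisher z approximation.
    """
    n = len(samples)
    z_genes = [standardize([gene[s] for s in samples]) for gene in expr]
    found = {}
    for i in range(len(z_genes)):
        for j in range(i + 1, len(z_genes)):
            r = sum(a * b for a, b in zip(z_genes[i], z_genes[j])) / n
            r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
            z = math.atanh(r) * math.sqrt(n - 3)
            p = 2 * (1 - NormalDist().cdf(abs(z)))
            if p < alpha:
                found[(i, j)] = r
    return found

def sign_reversal_fraction(expr, batch_a, batch_b, alpha=0.05):
    """Among pairs significant in both batches, the fraction whose sign flips."""
    sig_a = significant_correlations(expr, batch_a, alpha)
    sig_b = significant_correlations(expr, batch_b, alpha)
    shared = sig_a.keys() & sig_b.keys()
    if not shared:
        return 0.0
    flipped = sum(1 for pair in shared if sig_a[pair] * sig_b[pair] < 0)
    return flipped / len(shared)

def shuffled_reversal_fraction(expr, batch_a, batch_b, rng, alpha=0.05):
    """Same statistic after randomly permuting the batch labels."""
    pooled = list(batch_a) + list(batch_b)
    rng.shuffle(pooled)
    k = len(batch_a)
    return sign_reversal_fraction(expr, pooled[:k], pooled[k:], alpha)

# Toy data (hypothetical): 3 genes, 10 samples in each of two batches.
ramp = list(range(10))
gene0 = ramp + ramp
gene1 = ramp + ramp[::-1]            # relationship to gene0 flips in batch B
gene2 = [2 * v + 1 for v in gene0]   # tracks gene0 in both batches
expr = [gene0, gene1, gene2]
batch_a, batch_b = list(range(10)), list(range(10, 20))

observed = sign_reversal_fraction(expr, batch_a, batch_b)
shuffled = shuffled_reversal_fraction(expr, batch_a, batch_b, random.Random(0))
print("labels not shuffled:", observed)
print("labels shuffled:", shuffled)
```

In practice the shuffled statistic would be averaged over many permutations rather than a single one, and the comparison repeated for every batch pair.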

Combining experiments - Gene Expression Barcode Biggest barrier is metadata Latest: McCall (2014) NAR 17

Speed of light measured by different groups WJ Youden (1972) Technometrics 18

Analysis One-of-a-kind As-a-utility 19

How do we know whether something works? Fake data (simulations). Real data: well designed, well executed reference experiments 20

Lessons from gene expression: Huge advantage in common software platform / common formats; Designed reference experiments; Technological standardization; Physical models do not help (not clear this is general); All data is publicly available data 21
