Lessons from Gene Expression Kasper Daniel Hansen < khansen@jhsph.edu | www.hansenlab.org > McKusick-Nathans Institute of Genetic Medicine Department of Biostatistics Johns Hopkins University 1
Genomic Data Science Specialization @Coursera by JHU: Liliana Florea, Kasper D. Hansen, Jeff Leek, Mihaela Pertea, Steven Salzberg, James Taylor. 6 classes; 4 weeks per class; continuous rollout 2
RNA-seq scRNA-seq Microarrays 4
Replication / Reproducibility Replicate samples Replicate the experiment Replicate the conclusion Computational replication / reproduction "It is difficult to get a man to understand something, when his salary depends upon his not understanding it!" - Upton Sinclair 5
Science “Proof” “Crap” 6
Science Most of biology 8
Different sub-fields have different standards “without knowing anything - if this was your plot, what do you think about that little guy top right” http://drbecca.scientopia.org/2015/08/18/whose-problem-is-the-reproducibility-crisis-anyway/ 9
Technical variation Focus 10
Controls 11
Seq. tech. does not remove biol. variability [Figure: panels a and b, sequencing s.d. vs. array s.d. (cor: 0.592, n: 5,003 and cor: 0.492, n: 2,463); panel c, centered expression by sample index for COX4NB and RASGRP1, sequencing vs. array] Hansen (2011) Nat. Biotech 12
Number of replicates GWAS Cell Biology 13
Number of replicates “We applied MixupMapper to five publicly available human genetical genomics datasets. On average, 3% of all analyzed samples had been assigned incorrect expression phenotypes: in one of the datasets 23% of the samples had incorrect expression phenotypes.” Westra (2011) Bioinformatics Studies with a huge number of samples have challenges as well 15
Batch effects
[Figure: fraction of sign-reversed correlations for each batch pair (2002/2003, 2002/2004, 2002/2005, 2003/2004, 2003/2005, 2004/2005), labels not shuffled vs. labels shuffled]
Figure 3 | Batch effects also change the correlations between genes. We normalized every gene in the second gene expression data set² to mean 0, variance 1 within each batch. (The 2006 batch was omitted owing to small sample size.) We identified all significant correlations (p < 0.05) between pairs of genes within each batch using a linear model. We looked at genes that showed a significant correlation in two batches and counted the fraction of times that the correlation changed between the two batches. A large percentage of significant correlations reversed signs across batches, suggesting that the correlation structure between genes changes substantially across batches. To confirm this phenomenon is due to batch, we repeated the process (looking for significant correlations that changed sign across batches) but with the batch labels randomly permuted. With random batches, a much smaller fraction of significant correlations change signs. This suggests that correlation patterns differ by batch, which would affect rank-based prediction methods as well as system biology approaches that rely on between-gene correlation to estimate pathways.
From "Tackling the widespread and critical impact of batch effects in high-throughput methods", Leek (2010) Nat Rev Genet 16
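The procedure in the Figure 3 caption (standardize each gene within batch, keep gene pairs with a significant correlation in both batches, count sign reversals, then repeat with permuted batch labels) can be sketched roughly as follows. This is a minimal illustration on simulated toy data, not the analysis from Leek (2010): the data, the function names, and the two-batch setup are all hypothetical, and the significance test is an approximate Fisher z-test on the Pearson correlation, used here in place of the linear model mentioned in the caption.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Scale one gene's values within one batch to mean 0, variance 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def is_significant(r, n, alpha=0.05):
    """Approximate two-sided test of r != 0 via the Fisher z-transform
    (an assumption; the caption mentions a linear model instead)."""
    r = max(-0.999999, min(0.999999, r))  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def significant_correlations(expr, samples):
    """Map each gene pair to its correlation if significant among `samples`."""
    z = [standardize([gene[i] for i in samples]) for gene in expr]
    n = len(samples)
    result = {}
    for g, h in combinations(range(len(expr)), 2):
        r = sum(a * b for a, b in zip(z[g], z[h])) / n  # Pearson correlation
        if is_significant(r, n):
            result[(g, h)] = r
    return result


def sign_reversal_fraction(expr, labels, b1, b2):
    """Among pairs significant in both batches, the fraction whose sign flips."""
    c1 = significant_correlations(expr, [i for i, b in enumerate(labels) if b == b1])
    c2 = significant_correlations(expr, [i for i, b in enumerate(labels) if b == b2])
    shared = [pair for pair in c1 if pair in c2]
    if not shared:
        return 0.0
    return sum(c1[p] * c2[p] < 0 for p in shared) / len(shared)


# Hypothetical toy data: 8 genes driven by one latent factor; in batch "B"
# the last 4 genes respond with the opposite sign (a simulated batch effect).
rng = random.Random(0)
n_per_batch = 40
labels = ["A"] * n_per_batch + ["B"] * n_per_batch
loadings = {"A": [1] * 8, "B": [1] * 4 + [-1] * 4}
factor = [rng.gauss(0, 1) for _ in labels]
expr = [
    [loadings[b][g] * f + rng.gauss(0, 0.5) for b, f in zip(labels, factor)]
    for g in range(8)
]

observed = sign_reversal_fraction(expr, labels, "A", "B")

# Control: repeat with batch labels randomly permuted.
shuffled = []
for _ in range(20):
    permuted = labels[:]
    rng.shuffle(permuted)
    shuffled.append(sign_reversal_fraction(expr, permuted, "A", "B"))
shuffled_mean = mean(shuffled)

print(f"labels not shuffled: {observed:.2f}")
print(f"labels shuffled (mean of 20 permutations): {shuffled_mean:.2f}")
```

On real data the expression matrix and batch labels would come from sample metadata, and the comparison would be run for every pair of batches rather than a single pair.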
Combining experiments - Gene Expression Barcode Biggest barrier is metadata Latest: McCall (2014) NAR 17
Speed of light measured by different groups WJ Youden (1972) Technometrics 18
Analysis One-of-a-kind As-a-utility 19
How do we know whether something works? Fake data (simulations) Real data Well designed, well executed reference experiments 20
Lessons from gene expression Huge advantage in common software platform / common formats Designed reference experiments Technological standardization Physical models do not help (not clear this is general) All data is publicly available data 21