CSEP 527 Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 527   Computational Biology Gene Expression Analysis 1

Assaying Gene Expression 3

Microarrays 4

⬇ RNAseq ⬇ ⬇ Millions of reads, DNA Sequencer say, 100 bp each map to genome, analyze 5

Goals of RNAseq #1: Which genes are being expressed? How? assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome #2: How highly expressed are they? How? count how many fragments come from each gene–expect more highly expressed genes to yield more reads, after correcting for biases like mRNA length #3: What’s same/diff between 2 samples E.g., tumor/normal #4: ... 7

Recall: splicing exon 5 ’ intron exon 2

RNAseq Data Analysis De novo Assembly mostly deBruijn-based, but likely to change with longer reads more complex than genome assembly due to alt splicing, wide diffs in expression levels; e.g. often multiple “k’s” used pro: no ref needed (non-model orgs), novel discoveries possible, e.g. very short exons con: less sensitive to weakly-expressed genes Reference-based (more later) pro/con: basically the reverse Both: subsequent bias correction, quantitation, differential expression calls, fusion detection, etc. 8

“TopHat” (Ref based example) n map reads to ref transcriptome (optional) BWA n map reads to ref genome n unmapped reads remapped as 25mers n novel splices = 25 mers anchored 2 sides n stitch original reads across these n remap reads with minimal overlaps n Roughly: 10m reads/hr, 4Gbytes   (typical data set 100m–1b reads) 9

Kim,et al. 2013. “TopHat2: Accurate Alignment of Transcriptomes in the Presence of Insertions, Deletions and Gene Fusions.” Genome Biology 14 (4) (April 25): R36. doi:10.1186/gb-2013-14-4-r36. Figure 6

RNAseq Example 1yr-3 hg19 5 kb Scale chr19: 50,020,000 50,025,000 Day 20 1 Year FCGRT FCGRT 20

RNAseq protocol (approx) Extract RNA (either polyA ↔ polyT or tot – rRNA) Reverse-transcribe into DNA (“cDNA”) Make double-stranded, maybe amplify Cut into, say, ~300bp fragments Add adaptors to each end Sequence ~100-175bp from one or both ends CAUTIONS: non-uniform sampling, sequence (e.g. G+C), 5’-3’, and length biases 6

Two Stories: RNAseq Bias Correction & Isoform Quantification Let-7 & Cardiomyocyte Maturation Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA Copenhagen, 2015-Aug-18 ruzzo@uw.edu

Story 1 RNAseq:   Bias Correction & Alt Splicing

“All High-Throughput Technologies are Crap – Initially” Q. Morris   7-20-2015

RNA seq cDNA, fragment, QC filter, RNA → → S equence → → C ount end repair, A-tail, trim, map, ligate, PCR, … … It’s so easy, what could possibly go wrong?

What we expect: Uniform Sampling 100 Count reads starting at each position, not those covering each position 75 50 25 0 0 50 100 150 200 Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈ 33   I.e., as expected, we see ≈ μ ± 3 σ in 200 samples

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Count reads starting at Uniform each position, not those 50 25 covering each position 0 Actual ––––––––––– 3’ exon ––––––––– 200 nucleotides Mortazavi data

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Count reads starting at Uniform each position, not those 50 25 covering each position 0 Actual How to make it more uniform? A: Math tricks like averaging/smoothing (e.g. “coverage”) or transformations (“log”), …, or WE DO   B: Try to model (aspects of) causation ––––––––––– 3’ exon ––––––––– THIS 200 nucleotides Mortazavi data

What we get: highly non-uniform coverage The Good News : we can (partially) correct the bias The bad news : random fragments are not so uniform. Uniform 50 25 0 Actual not perfect, but better: 38% reduction in LLR of uniform model; hugely more likely 200 nucleotides

(in part) Bias is ^ sequence-dependent Reads and platform/sample-dependent Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

d sample foreground sequences o ( a ) h e t n e i M l t sample (local) background sequences ( b ) u O train Bayesian network ( c ) I.e., learn sequence patterns associated w/ high / low read counts. ( d ) ( e )

Modeling Sequence Bias Want a probability distribution over k-mers, k ≈ 40? Some obvious choices: Full joint distribution: 4 k -1 parameters PWM (0-th order Markov): (4-1)•k parameters Something intermediate: Directed Bayes network 12

Form of the models: Directed Bayes nets One “node” per nucleotide,   ±20 bp of read start • Filled node means that position is biased • Arrow i → j means letter at position i modifies bias at j • For both, numeric parameters say how much How–optimize: n n Pr [ s i | x i ] Pr [ x i ] � � ℓ = logPr [ x i | s i ]= log � x ∈ { 0 , 1 } Pr [ s i | x ] Pr [ x ] i = 1 i = 1

NB: •Not just initial hexamer •Span ≥ 19 •All include Illumina ABI negative positions •All different, even on same platform

Result – Increased Uniformity Kullback-Leibler Divergence Jones Li et al Hansen et al Trapnell Data

Result – Increased Uniformity R 2 Fractional improvement in log-likelihood under hypothesis test:   uniform model across * = p-value < 10 -23 “Is BN better than X?” 1000 exons (R 2 =1-L’/L) (1-sided Wilcoxon signed-rank test)

“First, do no harm” Theorem: The probability of “false bias discovery,” i.e., of learning a non-empty model from n reads sampled from unbiased data, declines exponentially with n. Prob(non-empty model | unbiased data) If > 10,000 reads are used, the probability of a non-empty model < 0.0004 10 4 17 Number of training reads

CSEP 527 Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq Millions of reads, DNA Sequencer say, 100 bp each map to genome, analyze 5 Goals of RNAseq #1: Which genes are

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/16sp Larry Ruzzo

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/18wi Larry Ruzzo

CSEP 527 Computational Biology RNA: Function, Secondary Structure Prediction, Search, Discovery

CSEP 527 Computational Biology Genes and Gene Prediction 1 Gene Finding: Motivation We

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Course Wrap Up Please complete online course evaluation

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

CSEP 527 Computational Biology Spring 2016 3: BLAST, Alignment score significance; PCR and DNA

CSEP 527 Computational Biology Spring 2016 Lecture 2 Sequence Alignment 1 HW 0

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio

Rigid Geometric Transformations COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision

Camera Calibration COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision Camera

Training Neural Nets COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision Training

Tracking Feature Windows COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision

CSEP 527 Spring 2016 Phylogenies: Parsimony Plus a Tantalizing Taste of Likelihood 1

The$mechanism$of$wound$shielding$&$more$using$WSI & Niels&Grabe&

Outline Across the Lifespan with PCOS and Reproductive Care: Diagnostic Criteria and PCOS

Section N Medications Objectives State the intent of Section N Medications. Describe the

for diabetics in a fab lab Who Bespoke design Prototypes Conclusion Ollivier Piqueray Katrien

Pathway Commons A public library of biological pathways on the semantic web June.13 2007 -

Electronic Systems Design at the Department of Computer Science at the Department of Computer

Specification: The Biggest Bottleneck in Aerospace V&V and Autonomy Kristin Yvonne Rozier

Intelligence, Personal Adjustment, and Normality Testing for Intelligence Intelligence

Sambuz

Useful Links

Newsletter

Mail Us

CSEP 527 Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq Millions of reads, DNA Sequencer say, 100 bp each map to genome, analyze 5 Goals of RNAseq #1: Which genes are

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/16sp Larry Ruzzo

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/18wi Larry Ruzzo

CSEP 527 Computational Biology RNA: Function, Secondary Structure Prediction, Search, Discovery

CSEP 527 Computational Biology Genes and Gene Prediction 1 Gene Finding: Motivation We

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Course Wrap Up Please complete online course evaluation

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

CSEP 527 Computational Biology Spring 2016 3: BLAST, Alignment score significance; PCR and DNA

CSEP 527 Computational Biology Spring 2016 Lecture 2 Sequence Alignment 1 HW 0

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview &amp; Bio

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview &amp; Bio

Rigid Geometric Transformations COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision

Camera Calibration COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision Camera

Training Neural Nets COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision Training

Tracking Feature Windows COMPSCI 527 Computer Vision COMPSCI 527 Computer Vision

CSEP 527 Spring 2016 Phylogenies: Parsimony Plus a Tantalizing Taste of Likelihood 1

The$mechanism$of$wound$shielding$&amp;$more$using$WSI &amp; Niels&amp;Grabe&amp;

Outline Across the Lifespan with PCOS and Reproductive Care: Diagnostic Criteria and PCOS

Section N Medications Objectives State the intent of Section N Medications. Describe the

for diabetics in a fab lab Who Bespoke design Prototypes Conclusion Ollivier Piqueray Katrien

Pathway Commons A public library of biological pathways on the semantic web June.13 2007 -

Electronic Systems Design at the Department of Computer Science at the Department of Computer

Specification: The Biggest Bottleneck in Aerospace V&amp;V and Autonomy Kristin Yvonne Rozier

Intelligence, Personal Adjustment, and Normality Testing for Intelligence Intelligence

Sambuz

Useful Links

Newsletter

Mail Us

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio

CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio

The$mechanism$of$wound$shielding$&$more$using$WSI & Niels&Grabe&

Specification: The Biggest Bottleneck in Aerospace V&V and Autonomy Kristin Yvonne Rozier