CSEP 590 B Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 590 B   Computational Biology Gene Expression Analysis 1

Assaying Gene Expression 3

Microarrays 4

⬇ RNAseq ⬇ ⬇ DNA Sequencer map to genome, analyze 5

Goals of RNAseq #1: Which genes are being expressed? How? assemble reads (fragments of mRNAs) into (nearly) full-length mRNAs and/or map them to a reference genome #2: How highly expressed are they? How? count how many fragments come from each gene–expect more highly expressed genes to yield more reads, after correcting for biases like mRNA length #3: What’s same/diff between 2 samples E.g., tumor/normal #4: ... 7

Recall: splicing exon 5 ’ intron exon 2

RNAseq Data Analysis De novo Assembly mostly deBruijn-based, but likely to change with longer reads more complex than genome assembly due to alt splicing, wide diffs in expression levels; e.g. often multiple “k’s” used pro: no ref needed (non-model orgs), novel discoveries possible, e.g. very short exons con: less sensitive to weakly-expressed genes Reference-based (more later) pro/con: basically the reverse Both: subsequent bias correction, quantitation, differential expression calls, fusion detection, etc. 8

“TopHat” (Ref based example) ! map reads to ref transcriptome (optional) BWA ! map reads to ref genome ! unmapped reads remapped as 25mers ! novel splices = 25 mers anchored 2 sides ! stitch original reads across these ! remap reads with minimal overlaps ! Roughly: 10m reads/hr, 4Gbytes   (typical data set 100m–1b reads) 9

RNAseq Example 1yr-3 hg19 5 kb Scale chr19: 50,020,000 50,025,000 Day 20 1 Year FCGRT FCGRT 20

RNAseq protocol (approx) Extract RNA (maybe by polyA ↔ polyT) Reverse-transcribe into DNA (“cDNA”) Make double-stranded, maybe amplify Cut into, say, ~300bp fragments Add adaptors to each end Sequence ~100-175bp from one or both ends CAUTIONS: non-uniform sampling, sequence (e.g. G+C), 5’-3’, and length biases 6

Bias Correction in RNAseq Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

Associate Editor: Alex Bateman These biases may adversely effect transcript discovery, as ABSTRACT low level noise may be overreported in some regions, and in Motivation: Quantification of sequence abundance in RNA-Seq others, active transcription may be underreported. They render experiments is often conflated by protocol-specific sequence bias. untrustworthy comparisons of relative abundance between genes The exact sources of the bias are unknown, but may be influenced by 2

RNA seq cDNA, fragment, QC filter, RNA → → S equence → → C ount end repair, A-tail, trim, map, ligate, PCR, … …

What we expect: Uniform Sampling 100 75 50 25 0 0 50 100 150 200 Uniform sampling of 4000 “reads” across a 200 bp “exon.” Average 20 ± 4.7 per position, min ≈ 9, max ≈ 33   I.e., as expected, we see ≈ μ ± 3 σ in 200 samples

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. E.g., assuming uniform, the 8 peaks above 100 are > +10 σ above mean ~ Uniform 50 25 0 Actual ––––––––––– 3’ exon ––––––––– 200 nucleotides Mortazavi data

What we get: highly non-uniform coverage The bad news : random fragments are not so uniform. Uniform 50 25 0 Actual not perfect, but better: 38% reduction in LLR of uniform model; hugely more likely 200 nucleotides The Good News : we can (partially) correct the bias

(in part) Bias is ^ sequence-dependent Reads and platform/sample-dependent Fitting a model of the sequence surrounding read starts lets us predict which positions have more reads.

d o ( a ) h e t n e i M l t ( b ) u O ( c ) ( d ) ( e )

Modeling Sequence Bias Want a probability distribution over k-mers, k ≈ 40 Some obvious choices Full joint distribution: 4 k -1 parameters PWM (0-th order Markov): (4-1)•k parameters Something intermediate Directed Bayes network 9

Form of the models: Directed Bayes nets One “node” per nucleotide,   ±20 bp of read start • Filled node means that position is biased • Arrow i → j means letter at position i modifies bias at j • For both, numeric parameters say how much How–optimize: n n Pr [ s i | x i ] Pr [ x i ] � � ℓ = logPr [ x i | s i ]= log � x ∈ { 0 , 1 } Pr [ s i | x ] Pr [ x ] i = 1 i = 1

NB: •Not just initial hexamer •Span ≥ 19 •All include Illumina ABI negative positions •All different, even on same platform

Log Log

Result – Increased Uniformity Kullback-Leibler Divergence Jones Li et al Hansen et al Trapnell Data

Result – Increased Uniformity R 2 Fractional improvement in log-likelihood under hypothesis test:   uniform model across * = p-value < 10 -23 “Is BN better than X?” 1000 exons (R 2 =1-L’/L) (1-sided Wilcoxon signed-rank test)

“First, do no harm” Theorem: The probability of “false bias discovery,” i.e.,   of learning a non-empty model from n reads sampled from un biased data is less than 1 - (Pr( X < 3 log n )) 2 h where h = number of nucleotides in the model and X is a random variable that (asymptotically in n ) is ! 2 with 3 degrees of freedom. (E[ X ] = 3)

“First, do no harm” Theorem: The probability of “false bias discovery,” i.e., of learning a non-empty model from n reads sampled from unbiased data, declines exponentially with n. Prob(non-empty model | unbiased data) If > 10,000 reads are used, the probability of a non-empty model < 0.0004 10 4 19 Number of training reads

CSEP 590 B Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 590 B Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq DNA Sequencer map to genome, analyze 5 Goals of RNAseq #1: Which genes are being expressed? How? assemble reads

CSEP 590 A Computational Biology " " Genes and Gene Prediction " " A

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/16sp Larry Ruzzo

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/18wi Larry Ruzzo

DNA Methylation CpG - 2 adjacent nts, same strand (not CH 3 CSEP 590 A Watson-Crick pair;

Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02

De-anonymizing Data CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12

Sampling from Databases CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 2 : 590.02

Post-processing outputs for better utility CompSci 590.03 Instructor: Ashwin Machanavajjhala

Wavelet and Matrix Mechanism CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 11 :

Deep Computing in Biology Challenges and Progress Ajay K. Royyuru Computational Biology Center

CSEP 527 Computational Biology RNA: Function, Secondary Structure Prediction, Search, Discovery

CSEP 527 Computational Biology Genes and Gene Prediction 1 Gene Finding: Motivation We

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Course Wrap Up Please complete online course evaluation

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

Chapt hapter er 6 6 Parallel Processors from Client to Cloud 6.1 Introduction Introduction

Lecture 3 SIMD and Vectorization GPU Architecture Todays lecture Vectorization and SSE

Special Topics in AI: Intelligent Agents and Multi-Agent Systems No guarantees DSA-1,

Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality

Topology-Aware Cooperative Data Protection in Blockchain-Based Decentralized Storage Networks

High-dimensional integration without Markov chains Alexander Gray Carnegie Mellon University

Targeted end-to-end knowledge graph decomposition Bla krlj, Jan Kralj and Nada Lavra c

Collaboration wont just happen Supervising Co-Teaching Teams: Whose Line is it Deliberate

CSEP 590 B Computational Biology Gene Expression Analysis 1 - PowerPoint PPT Presentation

CSEP 590 B Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3 Microarrays 4 RNAseq DNA Sequencer map to genome, analyze 5 Goals of RNAseq #1: Which genes are being expressed? How? assemble reads

CSEP 590 A Computational Biology &quot; &quot; Genes and Gene Prediction &quot; &quot; A

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/16sp Larry Ruzzo

CSEP 527 Computational Biology http://courses.cs.washington.edu/courses/csep527/18wi Larry Ruzzo

DNA Methylation CpG - 2 adjacent nts, same strand (not CH 3 CSEP 590 A Watson-Crick pair;

Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02

De-anonymizing Data CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12

Sampling from Databases CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 2 : 590.02

Post-processing outputs for better utility CompSci 590.03 Instructor: Ashwin Machanavajjhala

Wavelet and Matrix Mechanism CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 11 :

Deep Computing in Biology Challenges and Progress Ajay K. Royyuru Computational Biology Center

CSEP 527 Computational Biology RNA: Function, Secondary Structure Prediction, Search, Discovery

CSEP 527 Computational Biology Genes and Gene Prediction 1 Gene Finding: Motivation We

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Gene Expression Analysis 1 Assaying Gene Expression 3

CSEP 527 Computational Biology Course Wrap Up Please complete online course evaluation

RNA Search and Motif Discovery CSEP 527 Computational Biology Previous Lecture

Chapt hapter er 6 6 Parallel Processors from Client to Cloud 6.1 Introduction Introduction

Lecture 3 SIMD and Vectorization GPU Architecture Todays lecture Vectorization and SSE

Special Topics in AI: Intelligent Agents and Multi-Agent Systems No guarantees DSA-1,

Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-Traffic Optimality

Topology-Aware Cooperative Data Protection in Blockchain-Based Decentralized Storage Networks

High-dimensional integration without Markov chains Alexander Gray Carnegie Mellon University

Targeted end-to-end knowledge graph decomposition Bla krlj, Jan Kralj and Nada Lavra c

Collaboration wont just happen Supervising Co-Teaching Teams: Whose Line is it Deliberate

CSEP 590 A Computational Biology " " Genes and Gene Prediction " " A