Genome 559, Lecture 13a, 2/16/10
Larry Ruzzo

A little more about motif models
Motifs III – Outline
• Statistical justification for frequency counts
• Relative entropy
• Another example
Frequencies
Frequency ⇒ Scores: score = log2(freq/background)

Frequencies (%):
base \ pos    1    2    3    4    5    6
A             2   94   26   59   50    1
C             9    2   14   13   20    3
G            10    1   16   15   13    0
T            79    3   44   13   17   96

Scores (for convenience, scores multiplied by 10, then rounded):
base \ pos    1    2    3    4    5    6
A           -36   19    1   12   10  -46
C           -15  -36   -8   -9   -3  -31
G           -13  -46   -6   -7   -9  -46
T            17  -31    8   -9   -6   19
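As a rough sketch of how the score table can be produced from the frequency table (assuming a uniform 25% background; the displayed frequencies are rounded percentages, so not every entry reproduces exactly; function and variable names are illustrative, not from the lecture):

```python
import math

# Position-specific frequencies (percent), as in the table above.
freqs = {
    'A': [ 2, 94, 26, 59, 50,  1],
    'C': [ 9,  2, 14, 13, 20,  3],
    'G': [10,  1, 16, 15, 13,  0],
    'T': [79,  3, 44, 13, 17, 96],
}
background = 25.0  # uniform background: 25% per base

def score(freq_percent, bg_percent=background):
    """log2(freq/background), times 10, rounded; a zero frequency gives -infinity."""
    if freq_percent == 0:
        return float('-inf')
    return round(10 * math.log2(freq_percent / bg_percent))

scores = {base: [score(f) for f in row] for base, row in freqs.items()}
print(scores['A'])   # [-36, 19, 1, 12, 10, -46], matching the A row above
```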
What’s best WMM?
Given, say, k = 168 sequences s_1, s_2, ..., s_k, each of length 6, assumed to be generated at random according to a WMM defined by 6 × (4−1) parameters θ, what’s the best θ?
Answer: count frequencies per position.
Analogously, if you saw 900 heads in 1000 coin flips, you’d perhaps estimate P(Heads) = 900/1000.
Why is this sensible?
Parameter Estimation
Assuming sample x_1, x_2, ..., x_n is from a parametric distribution f(x|θ), estimate θ.
E.g.: x_1, x_2, ..., x_5 is HHHTH; estimate θ = prob(H).
Likelihood
P(x | θ): probability of event x given model θ.
Viewed as a function of x (fixed θ), it’s a probability, e.g., Σ_x P(x | θ) = 1.
Viewed as a function of θ (fixed x), it’s a likelihood, e.g., Σ_θ P(x | θ) can be anything; only relative values are of interest.
E.g., if θ = prob of heads in a sequence of coin flips, then P(HHHTH | 0.6) > P(HHHTH | 0.5), i.e., the event HHHTH is more likely when θ = 0.6 than when θ = 0.5.
And what θ makes HHHTH most likely?
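A quick numerical check of that comparison (a minimal sketch; the function name is illustrative):

```python
def likelihood(flips, theta):
    """P(flips | theta) for independent coin flips; theta = prob of heads."""
    p = 1.0
    for flip in flips:
        p *= theta if flip == 'H' else (1 - theta)
    return p

print(likelihood('HHHTH', 0.5))  # 0.03125
print(likelihood('HHHTH', 0.6))  # ~0.05184 -- larger, so theta = 0.6 fits HHHTH better
```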
Maximum Likelihood Parameter Estimation
One (of many) approaches to parameter estimation.
Likelihood of (independent) observations x_1, x_2, ..., x_n:
  L(x_1, ..., x_n | θ) = ∏_{i=1..n} f(x_i | θ)
As a function of θ, what θ maximizes the likelihood of the data actually observed?
Typical approaches:
  Numerical
  Analytical: solve ∂/∂θ L(x | θ) = 0, or ∂/∂θ log L(x | θ) = 0
  MCMC, EM, etc.
[Plot: likelihood L(x|θ) as a function of θ]
Example 1
[Plot: likelihood L(x|θ) vs θ]
n coin flips, x_1, x_2, ..., x_n; n_0 tails, n_1 heads, n_0 + n_1 = n; θ = probability of heads.
Observed fraction of successes in the sample is the MLE of the success probability in the population.
(Also verify it’s a max, not a min, and not better on the boundary.)
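For completeness, here is the standard calculation behind that claim (a sketch, not reproduced from the slide), applying the ∂/∂θ log L = 0 condition from the previous slide:

```latex
L(x \mid \theta) = \theta^{\,n_1}(1-\theta)^{\,n_0}, \qquad
\log L(x \mid \theta) = n_1 \log\theta + n_0 \log(1-\theta)

\frac{\partial}{\partial\theta}\log L(x \mid \theta)
  = \frac{n_1}{\theta} - \frac{n_0}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta} = \frac{n_1}{n_0 + n_1} = \frac{n_1}{n}
```

The second derivative is negative and, when both n_0 and n_1 are positive, the likelihood vanishes at θ = 0 and θ = 1, so this stationary point is indeed the maximum.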
Example II
[Plot: likelihood surface over θ]
n letters, x_1, x_2, ..., x_n, drawn at random from a (perhaps biased) pool of A, C, G, T; n_A + n_C + n_G + n_T = n; θ = (θ_A, θ_C, θ_G, θ_T) = proportion of each nucleotide.
Math is a bit messier, but the result is similar to coins: the observed fraction of nucleotides in the sample, θ̂ = (n_A/n, n_C/n, n_G/n, n_T/n), is the MLE of the nucleotide probabilities in the population.
What’s best WMM?
Given, say, k = 168 sequences s_1, s_2, ..., s_k, each of length 6, assumed to be generated at random according to a WMM defined by 6 × (4−1) parameters θ, what’s the best θ?
Answer: MLE = position-specific frequencies.
Pseudocounts (Reminder)
Freq/count of 0 ⇒ -∞ score; a problem?
Certain that a given residue never occurs in a given position? Then -∞ is just right.
Else, it may be a small-sample artifact.
Typical fix: add a pseudocount to each observed count: a small constant (e.g., 0.5 or 1).
Sounds ad hoc; there is a Bayesian justification.
Influence fades with more data.
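A minimal sketch of the position-specific frequency (MLE) computation with an optional pseudocount, using the 8 three-letter sequences that appear later in this lecture (function and variable names are illustrative):

```python
def wmm_frequencies(seqs, pseudocount=0.5, alphabet='ACGT'):
    """Position-specific frequencies (MLE), smoothed with a pseudocount per base."""
    length = len(seqs[0])
    freqs = []
    for i in range(length):
        counts = {b: pseudocount for b in alphabet}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in alphabet})
    return freqs

seqs = ['ATG'] * 5 + ['GTG'] * 2 + ['TTG']
print(wmm_frequencies(seqs, pseudocount=0))    # raw MLE: col 1 = A:0.625, C:0, G:0.25, T:0.125
print(wmm_frequencies(seqs, pseudocount=0.5))  # zero counts become small but nonzero frequencies
```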
“Similarity” of Distributions: Relative Entropy
AKA Kullback-Leibler distance/divergence, AKA information content.
Given distributions P, Q:
  H(P||Q) = Σ_{x∈Ω} P(x) log(P(x)/Q(x)) ≥ 0
Notes:
  Let P(x) log(P(x)/Q(x)) = 0 if P(x) = 0 [since lim_{y→0} y log y = 0]
  Undefined if 0 = Q(x) < P(x)
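A small sketch of this definition in code, using base-2 logs so the result is in bits (names are illustrative):

```python
import math

def relative_entropy(P, Q):
    """H(P||Q) = sum_x P(x) * log2(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    h = 0.0
    for x, p in P.items():
        if p == 0:
            continue                 # lim y->0 of y*log(y) is 0
        if Q[x] == 0:
            return float('inf')      # undefined per the note above; inf as a sentinel
        h += p * math.log2(p / Q[x])
    return h

P = {'A': 0.625, 'C': 0.0, 'G': 0.25, 'T': 0.125}
Q = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(relative_entropy(P, Q))  # ~0.70 bits (column 1 of the ATG example later in the lecture)
```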
WMM: How “Informative”? Mean score of site vs background?
For any fixed-length sequence x, let
  P(x) = prob. of x according to the WMM
  Q(x) = prob. of x according to the background
Relative entropy:
  H(P||Q) = Σ_{x∈Ω} P(x) log2(P(x)/Q(x))
H(P||Q) is the expected log-likelihood score of a sequence randomly chosen from the WMM; -H(Q||P) is the expected score of a sequence drawn from the background.
WMM Scores vs Relative Entropy
[Histogram of scores under the WMM and under the background, with H(P||Q) = 5.0 and -H(Q||P) = -6.8 marked]
On average, the foreground model scores exceed the background by 11.8 bits (a score difference of 118 on the 10× scale used in the examples above).
Calculating H & H per Column
For a WMM, based on the assumption of independence between columns:
  H(P||Q) = Σ_i H(P_i||Q_i)
where P_i and Q_i are the WMM/background distributions for column i.
Questions
Which columns of my motif are most informative/uninformative?
How wide is my motif, really?
Per-column relative entropy gives a quantitative way to look at such questions.
Another WMM example

8 sequences:  ATG, ATG, ATG, ATG, ATG, GTG, GTG, TTG

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Log-likelihood ratio: log2( f_{x_i, i} / f_{x_i} ), with background f_{x_i} = 1/4

LLR     Col 1   Col 2   Col 3
A        1.32   -∞      -∞
C        -∞     -∞      -∞
G        0      -∞      2.00
T       -1.00   2.00    -∞
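A brief sketch reproducing the LLR table above from the 8 sequences under a uniform 1/4 background (names are illustrative):

```python
import math

seqs = ['ATG', 'ATG', 'ATG', 'ATG', 'ATG', 'GTG', 'GTG', 'TTG']
alphabet = 'ACGT'
background = {b: 0.25 for b in alphabet}     # uniform f_x = 1/4

def llr_matrix(seqs, background):
    """Per-column log2(freq / background); zero frequencies give -infinity."""
    cols = []
    for i in range(len(seqs[0])):
        freq = {b: sum(s[i] == b for s in seqs) / len(seqs) for b in alphabet}
        cols.append({b: (math.log2(freq[b] / background[b]) if freq[b] > 0
                         else float('-inf')) for b in alphabet})
    return cols

for i, col in enumerate(llr_matrix(seqs, background), start=1):
    print(i, {b: round(v, 2) for b, v in col.items()})
# 1 {'A': 1.32, 'C': -inf, 'G': 0.0, 'T': -1.0}
# 2 {'A': -inf, 'C': -inf, 'G': -inf, 'T': 2.0}
# 3 {'A': -inf, 'C': -inf, 'G': 2.0, 'T': -inf}
```

Swapping in background = {'A': 3/8, 'C': 1/8, 'G': 1/8, 'T': 3/8} reproduces the non-uniform table on the next slide.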
Non-uniform Background
• E. coli - DNA approximately 25% each A, C, G, T
• M. jannaschii - 68% A+T, 32% G+C

LLR from the previous example, assuming f_A = f_T = 3/8 and f_C = f_G = 1/8:

LLR     Col 1   Col 2   Col 3
A        0.74   -∞      -∞
C        -∞     -∞      -∞
G        1.00   -∞      3.00
T       -1.58   1.42    -∞

E.g., G in col 3 is 8× more likely via the WMM than via the background, so the (log2) score = 3 (bits).
WMM Example, cont.

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Uniform background
LLR     Col 1   Col 2   Col 3
A        1.32   -∞      -∞
C        -∞     -∞      -∞
G        0      -∞      2.00
T       -1.00   2.00    -∞
RelEnt   0.70   2.00    2.00    (total 4.70)

Non-uniform background
LLR     Col 1   Col 2   Col 3
A        0.74   -∞      -∞
C        -∞     -∞      -∞
G        1.00   -∞      3.00
T       -1.58   1.42    -∞
RelEnt   0.51   1.42    3.00    (total 4.93)
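A short sketch of the per-column relative entropy computation behind the RelEnt rows, again using the 8 example sequences (names are illustrative):

```python
import math

seqs = ['ATG', 'ATG', 'ATG', 'ATG', 'ATG', 'GTG', 'GTG', 'TTG']
alphabet = 'ACGT'
uniform = {b: 0.25 for b in alphabet}
nonuniform = {'A': 3/8, 'C': 1/8, 'G': 1/8, 'T': 3/8}

def column_relative_entropy(seqs, background):
    """H(P_i||Q_i) for each column i, in bits; terms with P(x) = 0 contribute 0."""
    per_col = []
    for i in range(len(seqs[0])):
        freq = {b: sum(s[i] == b for s in seqs) / len(seqs) for b in alphabet}
        h = sum(p * math.log2(p / background[b])
                for b, p in freq.items() if p > 0)
        per_col.append(round(h, 2))
    return per_col

print(column_relative_entropy(seqs, uniform))     # [0.7, 2.0, 2.0]   -> total 4.70
print(column_relative_entropy(seqs, nonuniform))  # [0.51, 1.42, 3.0] -> total 4.93
```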
Today’s Summary
It’s important to account for background.
Log-likelihood scoring naturally does: log(freq/background freq).
Relative entropy measures the “dissimilarity” of two distributions; “information content”; the average score difference between foreground & background, for the full motif & per column.
Motif Summary
Motif description/recognition fits a simple statistical framework:
  Frequency counts give MLE parameters
  Scoring is log-likelihood-ratio hypothesis testing
  Scores are interpretable
Log-likelihood scoring naturally accounts for background (which is important): log(foreground freq/background freq)
Broadly useful approaches - not just for motifs