CSE 427 Computational Biology Gene Prediction

A statistical interlude: Fair or biased?
  H H H H T H H T T H

More likely fair or biased?
  H H H H T H H T T H

More likely H0 or H1?
  H H H H T H H T T H
  H0 (fair): P(H) = .5, P(T) = .5
  H1 (biased): P(H) = .9, P(T) = .1

Quantify likelihood: H0 vs H1
  H H H H T H H T T H   (7 heads, 3 tails)
  H0: P(H) = .5, P(T) = .5   likelihood = .5^10 ≈ .00098
  H1: P(H) = .9, P(T) = .1   likelihood = .9^7 * .1^3 ≈ .00048
  Likelihood ratio: (.9^7 * .1^3)/(.5^10) = .4898
  (I.e., odds favor “fair” by about 2:1)
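The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `likelihood` and the string encoding of the flips are illustrative choices, not part of the lecture.

```python
def likelihood(flips: str, p_heads: float) -> float:
    """Probability of one exact sequence of independent coin flips."""
    heads = flips.count("H")
    tails = flips.count("T")
    return p_heads ** heads * (1.0 - p_heads) ** tails

flips = "HHHHTHHTTH"               # the ten flips from the slide: 7 heads, 3 tails
l_fair = likelihood(flips, 0.5)    # H0: P(H) = .5
l_biased = likelihood(flips, 0.9)  # H1: P(H) = .9
ratio = l_biased / l_fair          # L(H1) / L(H0)

print(f"L(H0) = {l_fair:.6f}")
print(f"L(H1) = {l_biased:.6f}")
print(f"L(H1)/L(H0) = {ratio:.4f}")
```

A ratio below 1 means the data are more probable under H0 (fair) than under H1 (biased); its reciprocal, about 2.04, is the "2:1" figure.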

Gene Finding: Motivation
  Sequence data flooding into GenBank
  What does it mean?
    protein genes, RNA genes, mitochondria, chloroplast, regulation, replication, structure, repeats, transposons, unknown stuff, …

Protein Coding Nuclear DNA
  Focus of this lecture
  Goal: Automated annotation of new sequence data
  State of the Art:
    In Eukaryotes: predictions ~60% similar to real proteins
      ~80% if database similarity used
    Prokaryotes better, but still imperfect
    Lab verification still needed, still expensive

Biological Basics
  Central Dogma: DNA → (transcription) → RNA → (translation) → Protein
  Codons: 3 bases code one amino acid
    Start codon
    Stop codons
  3', 5' Untranslated Regions (UTRs)

[Figure] (This gene is heavily transcribed, but many are not.)
  Source: Alberts, et al.

Codons & The Genetic Code

                        Second Base
  First Base   U           C      A      G      Third Base
      U        Phe         Ser    Tyr    Cys        U
               Phe         Ser    Tyr    Cys        C
               Leu         Ser    Stop   Stop       A
               Leu         Ser    Stop   Trp        G
      C        Leu         Pro    His    Arg        U
               Leu         Pro    His    Arg        C
               Leu         Pro    Gln    Arg        A
               Leu         Pro    Gln    Arg        G
      A        Ile         Thr    Asn    Ser        U
               Ile         Thr    Asn    Ser        C
               Ile         Thr    Lys    Arg        A
               Met/Start   Thr    Lys    Arg        G
      G        Val         Ala    Asp    Gly        U
               Val         Ala    Asp    Gly        C
               Val         Ala    Glu    Gly        A
               Val         Ala    Glu    Gly        G

  Ala: Alanine         Arg: Arginine        Asn: Asparagine      Asp: Aspartic acid
  Cys: Cysteine        Gln: Glutamine       Glu: Glutamic acid   Gly: Glycine
  His: Histidine       Ile: Isoleucine      Leu: Leucine         Lys: Lysine
  Met: Methionine      Phe: Phenylalanine   Pro: Proline         Ser: Serine
  Thr: Threonine       Trp: Tryptophan      Tyr: Tyrosine        Val: Valine

Translation: mRNA → Protein
  [Figure: Watson, Gilman, Witkowski, & Zoller, 1992]

Ribosomes
  [Figure: Watson, Gilman, Witkowski, & Zoller, 1992]

Idea #1: Find Long ORFs
  Reading frame: which of the 3 possible sequences of triples does the ribosome read?
  Open Reading Frame (ORF): a run of triplets with no stop codons
  In random DNA:
    average ORF = 64/3 ≈ 21 triplets
    300bp ORF once per 36kbp per strand
  But average protein ~ 1000bp
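One plausible way to arrive at figures like these is sketched below. The slide does not state its derivation, so the model here is an assumption: "random DNA" is taken to mean that each of the 64 triplets is equally likely and independent of its neighbours. Under that model 3 of 64 triplets are stops, and a 300bp window is 100 consecutive triplets.

```python
# Assumption: every triplet is drawn uniformly and independently from the 64 codons.
p_stop = 3 / 64                        # UAA, UAG, UGA

# Mean number of triplets up to and including the first stop (geometric distribution).
mean_triplets_to_stop = 1 / p_stop     # 64/3, about 21.3

# Chance that 100 consecutive triplets (300 bp) in one frame contain no stop.
p_open_300bp = (1 - p_stop) ** 100

# If the sequence is tiled into 300 bp windows, one window in 1/p_open_300bp is open,
# i.e. roughly one open 300 bp stretch per this many bases:
bp_per_open_window = 300 / p_open_300bp

print(f"mean triplets to first stop: {mean_triplets_to_stop:.1f}")
print(f"P(no stop in 100 triplets):  {p_open_300bp:.5f}")
print(f"bp per open 300 bp window:   {bp_per_open_window:.0f}")
```

The last figure comes out near 36,500 bp, which is consistent with the slide's "once per 36kbp"; a different counting convention would shift it somewhat.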

A Simple ORF finder
  start at left end
  scan triplet-by-non-overlapping triplet for AUG
  then continue scan for STOP
  repeat until right end
  repeat all starting at offset 1
  repeat all starting at offset 2

Scanning for ORFs
  U U A A U G U G U C A U U G A U U A A G A A U U A C A C A G U A A C U A A U A C
  [Figure: reading frames labeled 1-6 and a footnote mark * drawn on this sequence]
  * In bacteria, GUG is sometimes a start codon…
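The simple ORF finder described above can be sketched in Python and run on this example sequence. The function name `find_orfs`, its return format, and the optional `start_codons` parameter are illustrative choices, not from the lecture, and the sketch covers only the three forward offsets (0, 1, 2) on one strand.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(rna, start_codons=("AUG",)):
    """Scan offsets 0, 1, 2 of an RNA string, triplet by non-overlapping triplet.

    Returns (offset, start, end) tuples such that rna[start:end] runs from a
    start codon through the first in-frame stop codon that follows it.
    """
    orfs = []
    for offset in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(offset, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None:
                if codon in start_codons:
                    start = i
            elif codon in STOP_CODONS:
                orfs.append((offset, start, i + 3))
                start = None  # go back to looking for the next start codon
    return orfs

seq = "UUAAUGUGUCAUUGAUUAAGAAUUACACAGUAACUAAUAC"  # sequence from the slide
for offset, start, end in find_orfs(seq):
    print(offset, start, end, seq[start:end])   # prints: 0 3 15 AUGUGUCAUUGA
```

Passing `start_codons=("AUG", "GUG")` allows the bacterial alternative start. On this fragment it reports the same single ORF, because the GUG at offset 2 opens a frame that reaches the end of the fragment without an in-frame stop, and this sketch reports only frames that close.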

Idea #2: Codon Frequency
  In random DNA, Leucine : Alanine : Tryptophan = 6 : 4 : 1
  But in real protein, ratios ~ 6.9 : 6.5 : 1
  So, coding DNA is not random
  Even more: synonym usage is biased (in a species-dependent way)
    examples known with 90% AT 3rd base
  Why? E.g. efficiency, histone, enhancer, splice interactions
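The 6 : 4 : 1 figure for random DNA is the number of codons that encode each amino acid in the standard genetic code, assuming all 64 codons are equally likely. A short sketch that builds the code table and counts them follows; the one-letter amino-acid string is the usual compact encoding of the standard code with bases ordered U, C, A, G, and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# ('*' marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
codons_per_aa = Counter(CODE.values())

# L = Leucine, A = Alanine, W = Tryptophan
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])  # prints: 6 4 1
```

The observed ratios in real proteins quoted on the slide (~6.9 : 6.5 : 1) differ from these counts, which is the point of the slide: codon usage in coding DNA is not uniform.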

Recognizing Codon Bias
  Assume codon usage i.i.d.; codon abc occurs with freq. f(abc)
  a_1 a_2 a_3 a_4 … a_{3n+2} is coding, unknown frame
  Calculate
    p_1 = f(a_1 a_2 a_3) f(a_4 a_5 a_6) … f(a_{3n-2} a_{3n-1} a_{3n})
    p_2 = f(a_2 a_3 a_4) f(a_5 a_6 a_7) … f(a_{3n-1} a_{3n} a_{3n+1})
    p_3 = f(a_3 a_4 a_5) f(a_6 a_7 a_8) … f(a_{3n} a_{3n+1} a_{3n+2})
    P_i = p_i / (p_1 + p_2 + p_3)
  More generally: k-th order Markov model
    k=5 or 6 is typical (next lecture)
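The frame calculation above can be sketched as follows. Everything concrete here is an assumption for illustration: the function name `frame_posteriors`, the small floor used for codons missing from the table, and above all the toy frequency table, whose numbers are made up and not taken from any organism. Sums of logs replace the raw products so that long sequences do not underflow.

```python
import math

def frame_posteriors(seq, freq, floor=1e-6):
    """Return [P_1, P_2, P_3] for a sequence of length 3n+2 (or longer).

    freq maps codon -> frequency f(abc); codons absent from freq get `floor`
    (an illustrative choice, so that log() never sees zero).
    """
    n = (len(seq) - 2) // 3            # number of codons scored in every frame
    log_p = []
    for offset in range(3):            # offsets 0, 1, 2 correspond to p_1, p_2, p_3
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(freq.get(codon, floor))
        log_p.append(total)
    m = max(log_p)                     # subtract the max before exponentiating
    weights = [math.exp(lp - m) for lp in log_p]
    z = sum(weights)
    return [w / z for w in weights]

# Made-up codon frequencies, for illustration only.
toy_freq = {"GCU": 0.4, "CUG": 0.1, "UGC": 0.1}
posteriors = frame_posteriors("GCUGCUGCUGC", toy_freq)   # length 11 = 3*3 + 2
print([round(p, 4) for p in posteriors])
```

With these toy numbers p_1 = 0.4^3 = 0.064 and p_2 = p_3 = 0.1^3 = 0.001, so P_1 = 0.064/0.066, about 0.97: the first frame is strongly preferred.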

Codon Usage in ΦX174
  [Figure: Staden & McLachlan, NAR 10, 1 (1982), 141-156]
