
Ab initio gene prediction Genome 559, Winter 2012

Ab initio gene prediction method
• Define parameters of real genes (based on experimental evidence):
  1) Splice donor sequence model
  2) Splice acceptor sequence model
  3) Intron and exon length distribution
  4) Open reading frame requirement in coding exons
  5) Requirement that introns maintain reading frame
  6) Transcription start and stop models (difficult to predict, often omitted)
• Use those parameters to obtain a best interpretation of genes from any region, from genome sequence alone.
ab initio = "from the beginning" (i.e. without experimental evidence)

Sites we might want to predict: translation start, translation stop, splice donor site, splice acceptor site. (Some predictors only deal with coding exons; the 5' and 3' ends are harder to predict.)

Open reading frames (random sequence) • 61 of 64 codons are not stop codons (61/64 ≈ 0.953 per codon, assuming equal nucleotide frequencies). • The probability of having no stop codon in a particular reading frame along a length L of DNA follows a geometric distribution, so it decays rapidly with L. • There are 3 reading frames on each DNA strand (6 in total for double-stranded DNA).

Geometric distribution in random sequence of distance to first stop codon (p = 3/64). For k amino acid codons followed by one stop codon:

$P(X = k) = (1 - p)^{k}\, p$

Long open reading frames are rare in random sequence. (Figure: probability plotted against distance in codons.)
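As a rough illustration of how quickly open reading frames become unlikely in random sequence, the geometric formula above can be evaluated directly. This is a minimal sketch assuming equal nucleotide frequencies (p = 3/64); the function names are made up for this example.

```python
# Probability that a given codon in random sequence is a stop codon,
# assuming equal nucleotide frequencies (3 of the 64 codons are stops).
P_STOP = 3 / 64


def prob_first_stop_after(k, p=P_STOP):
    """P(X = k): exactly k non-stop codons followed by one stop codon."""
    return (1 - p) ** k * p


def prob_open_at_least(k, p=P_STOP):
    """P(X >= k): the first k codons of a reading frame contain no stop."""
    return (1 - p) ** k


if __name__ == "__main__":
    for k in (10, 50, 100, 200):
        print(k, prob_first_stop_after(k), prob_open_at_least(k))
```

Under these assumptions, a stretch of 100 codons with no stop codon in a given frame occurs with probability below 1%, which is why long ORFs are informative.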

Splice donor and acceptor information. (Figures: sequence logos for the C. elegans splice donor, which sums to ~8 bits, and the C. elegans splice acceptor, which sums to ~9 bits, with exon and intron regions labeled.) Note: these show a log-odds measure of information content compared to background nucleotide frequencies, similar to BLOSUM matrix log-odds.
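One common way to express "information content compared to background" in bits is the relative entropy of each position's nucleotide frequencies against the background frequencies, summed over positions. The sketch below assumes that definition and an equal-frequency background; the column frequencies in `TOY_SITE` are invented for illustration and are not the C. elegans values from the slide.

```python
import math

# Background nucleotide frequencies; equal frequencies are an assumption.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical 4-position site (NOT real splice site data).
TOY_SITE = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative position
    {"G": 1.0},                                    # invariant G
    {"T": 1.0},                                    # invariant T
    {"A": 0.5, "G": 0.5},                          # purine only
]


def column_information(freqs, background=BACKGROUND):
    """Relative entropy (bits) of one position versus the background."""
    return sum(
        f * math.log2(f / background[base])
        for base, f in freqs.items()
        if f > 0
    )


def site_information(columns, background=BACKGROUND):
    """Total information (bits) summed over all positions of a site."""
    return sum(column_information(col, background) for col in columns)


if __name__ == "__main__":
    print(site_information(TOY_SITE))
```

With an equal-frequency background, an invariant position contributes 2 bits and a position matching the background contributes 0 bits.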

Position Specific Score Matrix (PSSM): splice donor

     1  2  3  4  5  6  7  8  9 10 11 12 13 14
A    1  1  2  1  0  0  2  2  0  1  1  1  1  1
C    0  0  0  0  0  0  0  0  0  0  0  0  0  0
G    0  0  0  2  8  0  1  0  3  0  0  0  0  0
T    0  0  0  0  0  8  1  1  1  2  1  1  1  1

Slide the PSSM along the DNA, computing a score at every position. (This is a conceptual example; the real thing would be computed as log-odds values, similar to BLOSUM matrices.)
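Sliding the matrix along a sequence can be sketched as follows. The scores are the conceptual values from the slide, not real log-odds values; the function name `scan` and the choice to score unknown bases (such as N) as 0 are assumptions made for this example.

```python
# Conceptual splice donor PSSM from the slide (14 positions).
PSSM = {
    "A": [1, 1, 2, 1, 0, 0, 2, 2, 0, 1, 1, 1, 1, 1],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "G": [0, 0, 0, 2, 8, 0, 1, 0, 3, 0, 0, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 8, 1, 1, 1, 2, 1, 1, 1, 1],
}
WIDTH = 14


def scan(dna, pssm=PSSM, width=WIDTH):
    """Return (start, score) for every window of `width` bases in `dna`."""
    dna = dna.upper()
    zero_row = [0] * width  # used for bases not in the matrix, e.g. N
    results = []
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        score = sum(pssm.get(base, zero_row)[i] for i, base in enumerate(window))
        results.append((start, score))
    return results


if __name__ == "__main__":
    hits = scan("CCAAAGGTAAGTTTTTCC")
    best = max(hits, key=lambda hit: hit[1])
    print(best)
```

High-scoring windows are candidate donor sites; a real predictor would keep all sites above some threshold rather than only the best one.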

Intron length distribution (C. elegans). Note: intron length distributions in Drosophila melanogaster and Homo sapiens (and most other animals) are longer and broader.

Other information that can be used • Splice donor and acceptor must be paired and donor must be upstream of acceptor (duh). • Introns in coding regions must maintain reading frame of the flanking exons. • Nucleotide content analysis (e.g. introns tend to be AT rich).

Simple conceptual example (plus strand only) • Sites scored on basis of PSSM matches to known splice donor sequence model. • Arrow length reflects quality of match (worse matches not shown).

(example cont.) Add splice acceptor information. Where would you infer introns? (Figure: one reasonable interpretation.)

(example cont.) Problem: there is a stop codon before the highest scoring splice donor! (Figure: reinterpreted gene model, which avoids the stop codon by using a lower scoring splice donor.)

Real example (end result). (Figure: predicted gene with features labeled 1 to 4.) Note that this gene has no mRNA sequences (EST and ORFeome tracks empty). This is a pure ab initio prediction (made by Phil Green's genefinder).

Hidden Markov Model (HMM)
Markov chain - a linear series of states in which each state is dependent only on the previous state.
HMM - a model that uses a Markov chain to infer the most likely states in data with unknown states ("hidden" states).
A Markov chain has states and transition probabilities. (Figure: two states, A and B, with transition probability p_AB from A to B and p_BA from B to A.) Since there are only two states, the probability of staying in state A is 1 - p_AB and the probability of staying in state B is 1 - p_BA.

Transition probabilities (row = current state, column = next state):

      A     B
A   0.98  0.02
B   0.4   0.6

(A -> B is 0.02; B -> A is 0.4.)

What will the series of states look like (qualitatively) for this Markov chain? It will have long stretches of A states, interspersed with short stretches of B states.
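The qualitative answer can be checked by simulating the chain. This is a small sketch using the transition probabilities from the slide; the starting state, the seed, and the function names are arbitrary choices for illustration.

```python
import random

# Transition probabilities from the slide: TRANSITIONS[current][next].
TRANSITIONS = {
    "A": {"A": 0.98, "B": 0.02},
    "B": {"A": 0.4, "B": 0.6},
}


def simulate(n_steps, start="A", transitions=TRANSITIONS, seed=0):
    """Generate a string of n_steps states from the Markov chain."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(n_steps - 1):
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return "".join(path)


def mean_run_length(p_leave):
    """Expected length of a run in a state that is left with probability p_leave."""
    return 1 / p_leave


if __name__ == "__main__":
    print(simulate(200))
    print(mean_run_length(0.02), mean_run_length(0.4))
```

Because run lengths in each state are geometrically distributed, the expected run is 1/0.02 = 50 steps in A but only 1/0.4 = 2.5 steps in B.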

Hidden Markov Model We have a Markov chain with appropriate states and known transition probabilities (e.g. best fit to experimentally known genes). We have a DNA sequence with unknown states. Find the series of Markov chain states with the maximum likelihood for the DNA sequence. Solved with the Viterbi algorithm (we won't cover this, but it is another dynamic programming algorithm). See http://en.wikipedia.org/wiki/Viterbi_algorithm
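For readers who want to see the dynamic programming idea, here is a minimal Viterbi sketch in log space. Everything specific in it is hypothetical: the two states ("E" for a GC-rich exon-like state, "I" for an AT-rich intron-like state), the start, transition, and emission probabilities, and the function name. It is a toy, not the gene prediction model from the lecture, and it assumes a non-empty sequence and nonzero probabilities.

```python
import math

# Hypothetical two-state model (illustration only).
STATES = ("E", "I")
START_P = {"E": 0.5, "I": 0.5}
TRANS_P = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT_P = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most likely state path for the observed sequence."""
    # best[t][s]: log probability of the best path ending in state s at t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("GCGCGCGCATATATAT"))
```

Working with log probabilities avoids numerical underflow on long sequences, which matters for genome-scale input.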

Gene Prediction HMM States. (Figure, taken from Stormo lab paper: coding exon states (three frames); intron states (three frames of the codon they insert into); splice acceptor and splice donor; special first (init) and last (term) coding exon states.)

A way to connect the HMM formalism to specifics. (Figure: probability of being in an intron "state", based solely on donor sites.) Note: these probabilities are qualitative and are intended only to portray the local trends.

Long open reading frames favor exon state. (Figure: probability of being in an exon "state", based only on the frame 1 ORF; frames labeled 1, 2, 3.)

Ab initio gene prediction • Requires a Markov chain with transition probabilities (typically "trained" on known genes or genes in a closely related species). • Produces a parse of a DNA sequence into each of the Markov states (intron, exon, intergenic, etc.). • Accuracy varies and is always imperfect.

Intron positions and reading frame

     exon             intron             exon
ATGATCCTGGAGTCGgttggtgaacttgaaatttagGACGCTGTTATTTCC

ATGATCCTGGAGTCGGACGCTGTTATTTCC
 M  I  L  E  S  D  A  V  I  S

• The intron can be any length and still produce the same exons.
• This particular splice is between two codons (0-shifting).
• The splice position can move and maintain coding frame as long as both positions move coordinately.
• If one splice endpoint moves, it may change the reading frame.
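The example above can be reproduced in a few lines: remove the lowercase intron, then translate the joined exons with the standard genetic code. This is a sketch that relies on the slide's convention of uppercase exons and lowercase introns; the helper names are invented for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

GENOMIC = "ATGATCCTGGAGTCGgttggtgaacttgaaatttagGACGCTGTTATTTCC"


def splice(genomic):
    """Join exons by dropping lowercase (intron) bases."""
    return "".join(base for base in genomic if base.isupper())


def translate(cds):
    """Translate complete codons in frame 1; trailing partial codons are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    mrna = splice(GENOMIC)
    print(mrna)
    print(translate(mrna))
```

Moving only one splice endpoint, for example ending the first exon one base earlier at `ATGATCCTGGAGTC`, shifts the frame of everything downstream and changes the translated protein after the junction.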

Using evolutionary conservation as a guide to improve ab initio predictions - an example using dot plots is shown in the next two slides. The principle is that introns usually diverge much faster than coding exons. (though not always true)

DNA dot matrix comparison of two ab initio gene predictions in related genomes. (Figure: dot plot of Gene A (ab initio model) against Gene B (ab initio model), annotated with: good exon; dubious exon 2; exon 1 too long; other possible corrections: add intron? remove intron?)

After correction of exons 1 and 2 in the red gene
