Eukaryotic Gene Prediction
Eukaryotic gene structure
Translation
Gene Finding: The 1st generation
• Given genomic DNA, does it contain a gene (or not)?
• Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
• Therefore, a statistical test can be used to discriminate between coding and non-coding regions.
Coding versus non-coding
• Fickett and Tung (1992) compared various measures.
• Measures that preserve the triplet frame are the most successful.
• Genscan: 5th order Markov Model
• Assignment 2 (Conservation implies a protein coding measure)
Coding vs. non-coding regions
Given: three 5th order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.
$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{((h+i) \bmod 3 + 1)}[X_{a+i}]$$
Coding ratio: $r = \dfrac{P_h(X_{a,b})}{P_D(X_{a,b})}$
Coding score: $s = \log_2(r)$
Compute the average coding score (per base) of exons and introns, and take the difference. If the measure is good, the difference must be biased away from 0.
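A minimal sketch of the frame-aware coding score described above, under several assumptions that are not on the slide: the function names are hypothetical, each training exon is taken to start in frame 0, the denominator model is a uniform background (0.25 per base), and unseen contexts fall back to that background. The Markov order is a parameter so the toy example can use order 2 instead of 5.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_frame_models(exons, order=5, pseudocount=1.0):
    """Estimate three conditional models, one per codon position.

    models[f][context][base] approximates P(base | previous `order` bases)
    for bases at codon position f. Assumes every exon starts in frame 0.
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in exons:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1.0
    models = []
    for f in range(3):
        model = {}
        for ctx, row in counts[f].items():
            total = sum(row.values()) + pseudocount * len(ALPHABET)
            model[ctx] = {b: (row[b] + pseudocount) / total for b in ALPHABET}
        models.append(model)
    return models

def coding_score(seq, models, frame=0, order=5, background=0.25):
    """Average per-base log2 ratio of the frame model to a uniform background."""
    log_ratio = 0.0
    scored = 0
    for i in range(order, len(seq)):
        row = models[(frame + i) % 3].get(seq[i - order:i])
        p = row[seq[i]] if row else background  # unseen context: neutral score
        log_ratio += math.log2(p / background)
        scored += 1
    return log_ratio / scored if scored else 0.0

# Toy usage with order 2 (far too little data for order 5).
toy_exons = ["ATGGCC" * 10]
toy_models = train_frame_models(toy_exons, order=2)
in_frame = coding_score("ATGGCCATGGCC", toy_models, frame=0, order=2)
out_of_frame = coding_score("ATGGCCATGGCC", toy_models, frame=1, order=2)
```

Scoring the same sequence in each of the three frames and keeping the best is one simple way to respect the triplet frame when the true frame is unknown.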
Coding differential for 380 genes
Other Signals
[Figure labels: ATG, AG, GT, Coding]
Coding region can be detected
• Plot the coding score using a sliding window of fixed length.
• The (large) exons will show up reliably.
• Not enough to predict gene boundaries reliably.
[Figure label: Coding]
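The sliding-window idea can be sketched as follows. This is an illustration only: `gc_fraction` is a toy stand-in for a real coding score, and the window, step, and threshold values are arbitrary assumptions.

```python
def sliding_window_scores(seq, score_fn, window=120, step=10):
    """Return (start, score) pairs for fixed-length windows along seq."""
    return [(start, score_fn(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

def gc_fraction(segment):
    """Toy stand-in for a coding score (an assumption, not a real measure)."""
    return sum(base in "GC" for base in segment) / len(segment)

def call_regions(scores, window, threshold):
    """Merge overlapping windows that score above threshold into (start, end)."""
    regions = []
    for start, score in scores:
        if score <= threshold:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

# Toy usage: a GC-rich block embedded in AT-rich sequence.
toy = "AT" * 50 + "GC" * 50 + "AT" * 50
scores = sliding_window_scores(toy, gc_fraction, window=20, step=10)
regions = call_regions(scores, window=20, threshold=0.9)
```

Region edges can only fall on window boundaries, so their resolution is limited by the window and step sizes. This is one reason a window score alone does not place gene boundaries reliably.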
Other Signals
• Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise.
• When combined they can be effective.
[Figure labels: ATG, AG, GT, Coding]
The second generation of Gene finding
• Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure.
• It was not easy to train on many parameters. The Guigó & Burset test revealed that accuracy was still very low.
• Problem with multiple genes in a genomic region
HMMs and gene finding
• HMMs allow for a systematic approach to merging many signals.
• They can model multiple genes and partial genes in a genomic region, as well as genes on both strands.
The Viterbi Algorithm
Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ and emits symbols $x_1 \ldots x_i$. Then,
$$v_k(i+1) = e_k(x_{i+1}) \max_l \big( v_l(i)\, a_{lk} \big)$$
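A minimal sketch of the Viterbi recurrence with backtracking, computed in log space to avoid underflow on long sequences. The two-state model in the usage example (N for non-coding, C for coding) and all its probabilities are invented for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its log probability) for obs."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # v[i][k]: log probability of the best path emitting obs[:i+1], ending in k
    v = [{k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}]
    back = []
    for x in obs[1:]:
        scores, pointers = {}, {}
        for k in states:
            best_prev = max(states, key=lambda l: v[-1][l] + log(trans_p[l][k]))
            scores[k] = (v[-1][best_prev] + log(trans_p[best_prev][k])
                         + log(emit_p[k][x]))
            pointers[k] = best_prev
        v.append(scores)
        back.append(pointers)

    # Backtrack from the best final state.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[-1][last]

# Toy usage: AT-rich non-coding state N, GC-rich coding state C.
states = ["N", "C"]
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
path, log_prob = viterbi("AAAAGGGGGGAAAA", states, start_p, trans_p, emit_p)
print("".join(path))
```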
HMMs and gene finding
• The Viterbi algorithm (and backtracking) allows us to parse a string through the states of an HMM.
• Can we describe Eukaryotic gene structure by the states of an HMM?
• This could be a solution to the gene finding (GF) problem.
An HMM for Gene structure
Generalized HMMs, and other refinements
• A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described.
• In standard HMMs, there is an exponential (geometric) distribution on the duration of time spent in a state.
• This is violated by many states of the gene structure HMM. The solution is to model these using generalized HMMs.
Length distributions of Introns & Exons
Generalized HMM for gene finding
• Each state also emits a 'duration' for which it will cycle in the same state. The time is generated according to a random process that depends on the state.
Forward algorithm for gene finding
[Figure: state $q_k$ spanning positions $j$ to $i$]
$$F_k(i) = \sum_{j<i} \sum_{l \in Q} P_{q_k}(X_{j,i})\, f_{q_k}(j-i+1)\, a_{lk}\, F_l(j)$$
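A minimal sketch of a forward pass for a generalized HMM, summing over the previous segment end j and the previous state l. The indexing is this sketch's own convention, not the slide's: a segment is the Python slice `x[j:i]` with length `i - j`. The callables `duration_p` and `segment_p`, the `max_len` cap, and the toy model are all hypothetical.

```python
def ghmm_forward(x, states, start_p, trans_p, duration_p, segment_p, max_len):
    """F[i][k]: probability of emitting x[:i] with a segment of state k
    ending exactly at position i.

    duration_p(k, d): probability that state k lasts d symbols.
    segment_p(k, s):  probability that state k emits string s, given its length.
    """
    n = len(x)
    F = [{k: 0.0 for k in states} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in states:
            total = 0.0
            for j in range(max(0, i - max_len), i):
                seg = segment_p(k, x[j:i]) * duration_p(k, i - j)
                if j == 0:
                    # First segment: weight by the start distribution.
                    total += start_p[k] * seg
                else:
                    # Later segments: sum over the state of the previous segment.
                    total += seg * sum(F[j][l] * trans_p[l][k] for l in states)
            F[i][k] = total
    return F

# Toy usage: two alternating states, durations uniform on {1, 2},
# every base emitted with probability 0.25.
states = ["A", "B"]
start_p = {"A": 1.0, "B": 0.0}
trans_p = {"A": {"A": 0.0, "B": 1.0}, "B": {"A": 1.0, "B": 0.0}}
duration_p = lambda k, d: 0.5 if d in (1, 2) else 0.0
segment_p = lambda k, s: 0.25 ** len(s)
F = ghmm_forward("ACG", states, start_p, trans_p, duration_p, segment_p, max_len=2)
total = sum(F[3].values())
```

The inner loop over j makes the cost grow with the maximum segment length, which is why practical gene finders bound or restrict the durations they consider.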
HMMs and Gene finding
• Generalized HMMs are an attractive model for computational gene finding.
• Allow incorporation of various signals.
• Quality of gene finding depends upon quality of signals.
Signals
• Coding versus non-coding
• Splice Signals
• Translation start
Splice signals
• GT is a Donor signal, and AG is the acceptor signal.
[Figure labels: GT, AG]
PWMs
• Fixed length for the splice signal.
• Each position is generated independently according to a distribution.
• Figure shows data from > 1200 donor sites.
Example donor sites (first row gives the position labels):
321123456
AAGGTGAGT
CCGGTAAGT
GAGGTGAGG
TAGGTAAGG
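A minimal sketch of building and applying a position weight matrix, using the four example donor sites from the slide as training data. The pseudocount, the uniform background, and the function names are assumptions made for illustration. Four sites are far too few for a real model.

```python
import math
from collections import Counter

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position base probabilities, each column estimated independently."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return pwm

def log_odds(pwm, candidate, background=0.25):
    """Sum of per-position log2 ratios against a uniform background."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, candidate))

donor_sites = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]
pwm = build_pwm(donor_sites)
good = log_odds(pwm, "AAGGTGAGT")   # contains the GT dinucleotide
bad = log_odds(pwm, "AAGCAGAGT")    # GT replaced by CA
```

Because the columns are multiplied independently, this score cannot express a dependence between two positions.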
MDD
• PWMs do not capture correlations between positions.
• Many position pairs in the Donor signal are correlated.
• Choose the position which has the highest correlation score.
• Split sequences into two: those which have the consensus at position $i$, and the remaining.
• Recurse until <Terminating conditions>
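One splitting step of the procedure above can be sketched as follows. The correlation score used here is an assumption: for each position i, it sums chi-square statistics between the indicator "position i matches its consensus" and the base at every other position j. The recursion and its terminating conditions are left out because the slide does not specify them.

```python
from collections import Counter

def consensus(sites):
    """Most common base at each position."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*sites)]

def chi_square(sites, i, j, cons_i):
    """Chi-square statistic for independence between the indicator
    'position i matches its consensus' and the base at position j."""
    n = len(sites)
    table = Counter((site[i] == cons_i, site[j]) for site in sites)
    row_tot, col_tot = Counter(), Counter()
    for (match, base), c in table.items():
        row_tot[match] += c
        col_tot[base] += c
    stat = 0.0
    for match in row_tot:
        for base in col_tot:
            expected = row_tot[match] * col_tot[base] / n
            stat += (table[(match, base)] - expected) ** 2 / expected
    return stat

def mdd_split(sites):
    """Pick the position with the highest summed score and split on it."""
    cons = consensus(sites)
    width = len(cons)
    scores = [sum(chi_square(sites, i, j, cons[i])
                  for j in range(width) if j != i)
              for i in range(width)]
    best = max(range(width), key=scores.__getitem__)
    matching = [s for s in sites if s[best] == cons[best]]
    rest = [s for s in sites if s[best] != cons[best]]
    return best, matching, rest

# Toy usage: positions 0 and 2 are perfectly correlated, position 1 is constant.
best, matching, rest = mdd_split(["AGA"] * 6 + ["CGC"] * 4)
```

A full implementation would build a separate model for each subset and recurse on it, stopping under whatever terminating conditions are chosen.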
MDD for Donor sites
De novo Gene prediction: Summary
• Various signals distinguish coding regions from non-coding.
• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.
• Further improvement may come from improved signal detection.
How many genes do we have?
[Figure labels: Nature, Science]
Alternative splicing
Comparative methods
• Gene prediction is harder with alternative splicing.
• One approach might be to use comparative methods to detect genes.
• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
• Yes, with a variant on alignment algorithms that penalize separately for introns, versus other gaps.
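A minimal sketch of such an alignment variant, aligning an mRNA to genomic DNA. All scoring values are arbitrary assumptions: ordinary gaps cost a fixed amount per base, while an intron (a run of skipped genomic bases) pays one fixed penalty regardless of its length. This simplified version does not check for GT/AG splice signals, treats flanking genomic sequence as free, and returns only the score.

```python
def spliced_align_score(genomic, mrna, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best score aligning all of mrna to some region of genomic."""
    NEG = float("-inf")
    n, m = len(genomic), len(mrna)
    # M[i][j]: best score for genomic[:i] vs mrna[:j], last column not in an intron
    # I[i][j]: best score where genomic[i-1] lies inside an intron
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        M[i][0] = 0  # leading genomic sequence is free
    for j in range(1, m + 1):
        M[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open an intron (one-time penalty) or extend one for free.
            I[i][j] = max(M[i - 1][j] + intron, I[i - 1][j])
            s = match if genomic[i - 1] == mrna[j - 1] else mismatch
            M[i][j] = max(max(M[i - 1][j - 1], I[i - 1][j - 1]) + s,
                          max(M[i - 1][j], I[i - 1][j]) + gap,
                          max(M[i][j - 1], I[i][j - 1]) + gap)
    # Trailing genomic sequence is free: take the best end position.
    return max(M[i][m] for i in range(n + 1))

# Toy usage: two exons separated by a 10-base intron.
genomic = "ACGT" + "GTAAAAAAAG" + "TTCA"
score = spliced_align_score(genomic, "ACGTTTCA")
```

With an ordinary per-base gap cost, skipping the 10-base intron would be far more expensive than a few mismatches. The separate intron penalty is what lets the two exons align cleanly.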