CSE182-L12 Gene Finding
Quiz • Who are these people, and what is the occasion?
De novo Gene prediction: Summary • Various signals distinguish coding regions from non-coding • HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals. • Further improvement may come from improved signal detection
How many genes do we have? Nature Science
Alternative splicing
Comparative methods • Gene prediction is harder with alternative splicing. • One approach might be to use comparative methods to detect genes. • Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence? • Yes, with a variant of the alignment algorithm that penalizes introns separately from other gaps.
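The idea of penalizing introns separately from other gaps can be sketched as a small dynamic program. This is a minimal illustration, not the algorithm used by any of the tools named in this lecture: the function name `spliced_alignment_score`, the scoring values, and the flat (length-independent) intron cost are all assumptions chosen for clarity, and splice signals are ignored.

```python
NEG_INF = float("-inf")

def spliced_alignment_score(mrna, genome, match=1, mismatch=-1, gap=-2, intron=-3):
    """Best score for aligning all of `mrna` to part of `genome`.

    Ordinary gaps cost `gap` per base; an intron (a run of skipped genomic
    bases of any length) costs a flat `intron`. Unaligned genomic sequence
    before and after the match is free. All scores are illustrative.
    """
    n, m = len(mrna), len(genome)
    # exon[i][j]: best score for mrna[:i] vs genome[:j], not inside an intron.
    # intr[i][j]: same prefixes, but genome[j-1] lies inside an intron.
    exon = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    intr = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        exon[0][j] = 0  # free leading genomic sequence
    for i in range(1, n + 1):
        exon[i][0] = i * gap
        for j in range(1, m + 1):
            sub = match if mrna[i - 1] == genome[j - 1] else mismatch
            exon[i][j] = max(
                max(exon[i - 1][j - 1], intr[i - 1][j - 1]) + sub,
                exon[i - 1][j] + gap,  # mRNA base with no genomic partner
                exon[i][j - 1] + gap,  # single unmatched genomic base
            )
            # Open an intron after an exon column, or extend one at no cost.
            intr[i][j] = max(exon[i][j - 1] + intron, intr[i][j - 1])
    return max(exon[n])  # free trailing genomic sequence

# Toy example: two 4-base exons separated by a 12-base intron.
mrna = "ACGTTGCA"
genome = "ACGT" + "GT" + "A" * 8 + "AG" + "TGCA"
print(spliced_alignment_score(mrna, genome))
```

Because extending an intron is free, one long skipped region costs far less than the same number of ordinary gap columns, which is what lets the alignment jump from one exon to the next.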
Comparative gene finding tools • Procrustes/Sim4: mRNA vs. genomic • Genewise: proteins versus genomic • CEM: genomic versus genomic • Twinscan: Combines comparative and de novo approach.
Course • Sequence Comparison (BLAST & other tools) • Protein Motifs: – Profiles/Regular Expression/HMMs • Protein Sequence Identification via Mass Spec. • Discovering protein coding genes – Gene finding HMMs – DNA signals (splice signals)
Genome Assembly
DNA Sequencing • DNA is double-stranded. • The strands are separated, and a polymerase is used to copy the second strand. • Special bases terminate this process early.
• A break at T is shown here. • Measuring the lengths using electrophoresis allows us to get the position of each T • The same can be done with every nucleotide. Color coding can help separate different nucleotides
• Automated detectors ‘read’ the terminating bases. • The signal decays after 1000 bases.
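A toy sketch of how fragment lengths translate into a read: if we know, for each nucleotide, the lengths of the fragments that terminated at that nucleotide, sorting by length spells out the copied strand. The function name `read_sequence` and the input format are hypothetical, and real base calling works on noisy fluorescence traces rather than clean length lists.

```python
def read_sequence(stops_by_base):
    """Rebuild the copied strand from chain-termination fragment lengths.

    stops_by_base maps each nucleotide to the lengths of the fragments that
    ended with that (terminating) nucleotide.
    """
    base_at = {}
    for base, lengths in stops_by_base.items():
        for length in lengths:
            base_at[length] = base  # a fragment of this length ends in `base`
    # Reading fragments from shortest to longest gives the sequence in order.
    return "".join(base_at[k] for k in sorted(base_at))

# Idealized electrophoresis result for a 7-base template copy.
fragments = {"G": [1], "A": [2, 5, 7], "T": [3, 4], "C": [6]}
print(read_sequence(fragments))  # GATTACA
```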
Sequencing Genomes: Clone by Clone • Clones are constructed to span the entire length of the genome. • These clones are ordered and oriented correctly (Mapping) • Each clone is sequenced individually
Shotgun Sequencing • Shotgun sequencing of clones was considered viable • However, researchers in 1999 proposed shotgunning the entire genome.
Library • Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.
Sequencing
Questions • Algorithmic: How do you put the genome back together from the pieces? This will be discussed in the next lecture. • Statistical: How many pieces do you need to sequence, etc.? – The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.
Lander-Waterman Statistics • G = Genome Length • L = Clone Length • N = Number of Clones • T = Required Overlap • c = Coverage = LN/G • a = N/G • q = T/L • s = 1 - q
LW statistics: questions • As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. • Q1: What is the expected number of islands? • Ans: $N e^{-cs}$ • As more clones are added, the number of islands increases at first, then gradually decreases.
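To see the rise and fall numerically, the formula can be evaluated for a growing number of clones. The genome size, clone length, and required overlap below are made-up values for illustration, and `expected_islands` is a hypothetical helper name.

```python
import math

def expected_islands(G, L, N, T):
    """Lander-Waterman expected number of islands, N * exp(-c * s)."""
    c = L * N / G  # coverage
    s = 1 - T / L  # s = 1 - q, with q = T / L
    return N * math.exp(-c * s)

# Hypothetical project: 1 Mb genome, 1 kb clones, 50 bp required overlap.
G, L, T = 1_000_000, 1_000, 50
for N in (500, 1_000, 2_000, 5_000, 10_000):
    c = L * N / G
    print(f"N={N:6d}  c={c:5.1f}  islands={expected_islands(G, L, N, T):8.1f}")
```

With few clones almost every clone is its own island, so the count grows with N; once cs exceeds 1 the exponential term dominates and islands merge faster than new ones appear.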
Analysis: Expected Number of Islands • Computing the expected # islands. • Let $X_i = 1$ if an island ends at position i, $X_i = 0$ otherwise. • Number of islands $= \sum_i X_i$ • Expected # islands $= E(\sum_i X_i) = \sum_i E(X_i)$
Prob. of an island ending at i • $E(X_i)$ = Prob(island ends at pos. i) • = Prob(clone began at position i-L+1 AND no clone began in the next L-T positions) $= a (1 - a)^{L-T} \approx a e^{-cs}$ • Expected # islands $= \sum_i E(X_i) = G a e^{-cs} = N e^{-cs}$
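The prediction can be checked against a small simulation. The sketch below places equal-length clones uniformly at random on a genome and counts islands directly; the parameter values, the seed, and the helper names `count_islands` and `simulate_mean_islands` are assumptions for illustration. Edge effects at the genome ends make the simulated mean differ slightly from the formula.

```python
import math
import random

def count_islands(starts, L, T):
    """Count islands among equal-length clones given their start positions."""
    starts = sorted(starts)
    islands = 1
    for prev, cur in zip(starts, starts[1:]):
        # Consecutive clones overlap by L - (cur - prev) bases;
        # an overlap shorter than T starts a new island.
        if cur - prev > L - T:
            islands += 1
    return islands

def simulate_mean_islands(G, L, N, T, trials=200, seed=0):
    """Average island count over repeated random clone placements."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.randrange(G - L + 1) for _ in range(N)]
        total += count_islands(starts, L, T)
    return total / trials

G, L, N, T = 100_000, 1_000, 300, 0
c, s = L * N / G, 1 - T / L
predicted = N * math.exp(-c * s)
simulated = simulate_mean_islands(G, L, N, T)
print(f"predicted {predicted:.1f}, simulated {simulated:.1f}")
```

Because all clones have the same length, comparing each clone only with the next one in sorted order is enough to detect every break between islands.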
LW statistics • Pr[Island contains exactly j clones]? • Consider an island that has already begun. With probability $e^{-cs}$, it will never be continued. Therefore • Pr[Island contains exactly j clones] $= (1 - e^{-cs})^{j-1} e^{-cs}$ • Expected # j-clone islands $= N e^{-cs} (1 - e^{-cs})^{j-1} e^{-cs}$
Expected # of clones in an island • $e^{cs}$ • Why?
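One way to answer the question, sketched under the assumption that the number of clones in an island follows the geometric distribution given for j-clone islands, with stopping probability $p = e^{-cs}$:

```latex
\begin{align*}
E[\text{clones per island}]
  &= \sum_{j \ge 1} j \,(1 - p)^{j-1}\, p
   = \frac{1}{p}
   = e^{cs},
  \qquad p = e^{-cs}. \\
\text{Check: }\quad
\frac{\text{number of clones}}{\text{expected number of islands}}
  &= \frac{N}{N e^{-cs}} = e^{cs}.
\end{align*}
```

Both routes agree: the mean of a geometric distribution is the reciprocal of its stopping probability, and dividing N clones among $N e^{-cs}$ islands gives the same ratio.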
Expected length of an island • $L \left[ \frac{e^{cs} - 1}{c} + (1 - s) \right]$
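The three island statistics can be collected in one helper for planning purposes. This is a sketch: `island_stats` is a hypothetical name, the expected island length is taken to be L[(e^{cs} - 1)/c + (1 - s)], and the project parameters are invented.

```python
import math

def island_stats(G, L, N, T):
    """Return (expected # islands, clones per island, island length in bases)."""
    c = L * N / G  # coverage
    s = 1 - T / L  # s = 1 - q, with q = T / L
    n_islands = N * math.exp(-c * s)
    clones_per_island = math.exp(c * s)
    island_length = L * ((math.exp(c * s) - 1) / c + (1 - s))
    return n_islands, clones_per_island, island_length

# Hypothetical project: 1 Mb genome, 1 kb clones, 50 bp required overlap.
for N in (1_000, 5_000, 10_000):
    islands, clones, length = island_stats(1_000_000, 1_000, N, 50)
    print(f"N={N:6d}  islands={islands:8.1f}  clones/island={clones:10.1f}  length={length:12.0f}")
```

As a sanity check, at very low coverage with T = 0 the expected island length approaches L, the length of a single clone. At high coverage the formula can exceed G, because the model ignores the finite size of the genome.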