CSE 427 Computational Biology: Genes and Gene Prediction


Gene Finding: Motivation
Sequence data flooding in
What does it mean?
  protein genes, RNA genes, mitochondria, chloroplast, regulation, replication, structure, repeats, transposons, unknown stuff, …
More generally, how do you:
  learn from complex data in an unknown language,
  leverage what’s known to help discover what’s not

Protein Coding Nuclear DNA
Focus of these slides
Goal: automated annotation of new sequence data
State of the art:
  In eukaryotes: predictions ~60% similar to real proteins; ~80% if database similarity is used
  Prokaryotes: better, but still imperfect
Lab verification still needed, still expensive
  Largely done for human; unlikely for most others

Biological Basics
Central Dogma: DNA → (transcription) → RNA → (translation) → Protein
Codons: 3 bases code for one amino acid
  Start codon
  Stop codons
3′ and 5′ untranslated regions (UTRs)

RNA Transcription (This gene is heavily transcribed, but many are not.)

Translation: mRNA → Protein (Watson, Gilman, Witkowski, & Zoller, 1992)

DNA (thin lines), RNA Pol (arrow), mRNA with attached ribosomes (dark circles) (Darnell, p. 120)

Ribosomes (Watson, Gilman, Witkowski, & Zoller, 1992)

Codons & The Genetic Code

First Base   Second Base                        Third Base
             U          C     A      G
U            Phe        Ser   Tyr    Cys        U
             Phe        Ser   Tyr    Cys        C
             Leu        Ser   Stop   Stop       A
             Leu        Ser   Stop   Trp        G
C            Leu        Pro   His    Arg        U
             Leu        Pro   His    Arg        C
             Leu        Pro   Gln    Arg        A
             Leu        Pro   Gln    Arg        G
A            Ile        Thr   Asn    Ser        U
             Ile        Thr   Asn    Ser        C
             Ile        Thr   Lys    Arg        A
             Met/Start  Thr   Lys    Arg        G
G            Val        Ala   Asp    Gly        U
             Val        Ala   Asp    Gly        C
             Val        Ala   Glu    Gly        A
             Val        Ala   Glu    Gly        G

Ala : Alanine         Arg : Arginine        Asn : Asparagine      Asp : Aspartic acid
Cys : Cysteine        Gln : Glutamine       Glu : Glutamic acid   Gly : Glycine
His : Histidine       Ile : Isoleucine      Leu : Leucine         Lys : Lysine
Met : Methionine      Phe : Phenylalanine   Pro : Proline         Ser : Serine
Thr : Threonine       Trp : Tryptophan      Tyr : Tyrosine        Val : Valine
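The genetic code above is a 64-entry lookup table, so translation can be sketched as a dictionary lookup over consecutive triplets. This is a minimal illustration and not part of the original slides. The names `CODON_TABLE`, `translate`, and `DEGENERACY` are hypothetical, and the sketch uses the standard one-letter amino acid codes (with `*` for Stop) in place of the three-letter abbreviations used on the slide.

```python
from collections import Counter
from itertools import product

# Bases in the same U, C, A, G order as the table; the third base varies fastest.
BASES = "UCAG"
# Standard genetic code as one-letter amino acids, '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Number of codons per amino acid (the degeneracy of the code).
DEGENERACY = Counter(CODON_TABLE.values())
```

For example, `translate("AUGUGUCAUUGA")` reads Met, Cys, His and then stops at UGA. `DEGENERACY` shows how unevenly the 64 codons are shared out: six for leucine, four for alanine, one for tryptophan, and three stops.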

Idea #1: Find Long ORFs
Reading frame: which of the 3 possible sequences of triplets does the ribosome read?
Open Reading Frame (ORF): a stretch with no internal stop codons
In random DNA:
  average ORF ~ 64/3 ≈ 21 triplets
  a 300 bp ORF occurs about once per 36 kbp per strand
But the average protein-coding sequence is ~1000 bp
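The two random-DNA figures above can be reproduced with a few lines of arithmetic. The slide does not show its derivation, so this is one reading that happens to match its numbers, assuming all 64 codons are equally likely, 3 of them are stops, and a strand is cut into consecutive 300 bp windows in a fixed frame. All variable names are illustrative.

```python
STOP_CODONS = 3
TOTAL_CODONS = 64

# Probability that a random codon is a stop.
p_stop = STOP_CODONS / TOTAL_CODONS

# Mean number of codons until a stop appears (geometric distribution): 64/3.
expected_codons_to_stop = 1 / p_stop

# Probability that 100 consecutive codons (300 bp) contain no stop.
p_open_100 = (1 - p_stop) ** 100

# Expected amount of sequence scanned, in bp, per stop-free 300 bp window.
bp_per_300bp_orf = 300 / p_open_100

print(f"average ORF: {expected_codons_to_stop:.1f} triplets")
print(f"one 300 bp ORF per {bp_per_300bp_orf / 1000:.1f} kbp")
```

Under these assumptions a stop-free 100-codon stretch is rare in random sequence, which is why long ORFs are evidence for coding regions.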

A Simple ORF Finder
start at left end
scan triplet-by-non-overlapping-triplet for AUG
then continue scan for STOP
repeat until right end
repeat all starting at offset 1
repeat all starting at offset 2
then do it again on the other strand
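The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's implementation. It assumes an RNA alphabet (A, C, G, U), treats only AUG as a start codon, and takes "the other strand" to mean the reverse complement. The function names are hypothetical.

```python
STOPS = {"UAA", "UAG", "UGA"}
START = "AUG"
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return rna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(rna, offset):
    """Yield (start, end) indices of ORFs in one frame; end is exclusive
    and includes the stop codon."""
    start = None
    for i in range(offset, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if start is None:
            if codon == START:
                start = i          # found AUG, now look for a stop
        elif codon in STOPS:
            yield start, i + 3
            start = None           # resume scanning for the next AUG


def find_orfs(rna):
    """Scan all three offsets on both strands.

    Returns a list of (strand, offset, orf_sequence) tuples.
    """
    results = []
    for strand, seq in (("+", rna), ("-", reverse_complement(rna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset):
                results.append((strand, offset, seq[start:end]))
    return results
```

Calling `find_orfs("UUAAUGUGUCAUUGAUUAAG")` on a short 20-base strand finds one ORF at offset 0 on the given strand and one at offset 2 on the reverse complement. An ORF that starts but never reaches a stop before the end of the sequence is silently dropped in this sketch; a real tool would need to decide how to report those.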

Scanning for ORFs
Reading frames 1, 2, 3 (top strand):    U U A A U G U G U C A U U G A U U A A G
Reading frames 4, 5, 6 (bottom strand): A A U U A C A C A G U A A C U A A U A C
* In bacteria, GUG is sometimes a start codon…

Idea #2: Codon Frequency
In random DNA, Leucine : Alanine : Tryptophan = 6 : 4 : 1
But in real proteins, the ratios are ~ 6.9 : 6.5 : 1
So, coding DNA is not random
Even more: synonym usage is biased (in a species-dependent way)
  examples known with 90% AT in the 3rd base
Why? E.g. efficiency, histone, enhancer, splice interactions
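Codon usage bias of the kind described above can be measured by simply counting codons in a known coding sequence. The sketch below is illustrative only: `codon_counts` and `third_base_at_fraction` are hypothetical helper names, the example sequence is made up, and the input is assumed to be already in frame from position 0.

```python
from collections import Counter


def codon_counts(cds):
    """Count non-overlapping codons in a coding sequence read from position 0.
    Trailing bases that do not fill a whole codon are ignored."""
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def third_base_at_fraction(cds):
    """Fraction of codons whose third base is A or T (U in RNA)."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    at = sum(n for codon, n in counts.items() if codon[2] in "ATU")
    return at / total if total else 0.0


# Made-up example: AUG GCU GCA GCU UUA UAA
example = "AUGGCUGCAGCUUUAUAA"
usage = codon_counts(example)
at3 = third_base_at_fraction(example)
```

Comparing such counts between candidate ORFs and known genes of the same species is one simple way to turn the bias into a scoring signal; the species dependence means the reference counts have to come from the organism being annotated.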

Idea #3: Non-Independence
Not only is codon usage biased, but residues (aa or nt) in one position are not independent of their neighbors
How to model this? Markov models
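As a rough illustration of the Markov-model idea, the sketch below estimates a first-order model over nucleotides, where each base depends only on the one before it, and scores a sequence by its log-likelihood. The slides do not specify the model order or training procedure, so the first-order choice, the add-one pseudocount, and the names `train_markov` and `log_likelihood` are all assumptions made for this example.

```python
import math


def train_markov(seqs, alphabet="ACGU", pseudocount=1.0):
    """Estimate P(current base | previous base) from training sequences.
    A pseudocount keeps unseen transitions from getting probability zero."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }


def log_likelihood(seq, trans):
    """Sum of log transition probabilities along seq.
    The probability of the first base is ignored in this sketch."""
    return sum(math.log(trans[p][c]) for p, c in zip(seq, seq[1:]))


# Toy training data with a strong A->C, C->A pattern.
model = train_markov(["ACACACAC"])
```

With this toy model, `log_likelihood("ACAC", model)` is higher than `log_likelihood("AAAA", model)`, because the training data made A→C and C→A transitions more probable than A→A. In a gene finder, one would typically train one model on coding sequence and another on non-coding sequence and compare the two scores.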
