CSE427 Computational Biology
http://www.cs.washington.edu/427
Larry Ruzzo
Winter 2008
UW CSE Computational Biology Group

"He who asks is a fool for five minutes, but he who does not ask remains a fool forever."
  -- Chinese Proverb

This week
  Admin Stuff
  Why Comp Bio?
  The world’s shortest Intro. to Mol. Bio.

Admin
Course Mechanics & Grading
  Reading
  In class discussion
  Homeworks
    reading
    paper exercises
    programming
  Small Project?
  No exams

Digression: Evolution & scientific literacy
  “human beings, as we know them, developed from earlier species of animals”
  (avoiding the now politically charged word “evolution”)
  From 1985 to 2005, the % of Americans
    rejecting: declined from 48% to 39%
    accepting: also declined, from 45% to 40%
    uncertain: increased from 7% to 21%
  In a 2005 survey of the proportion of adults who accept evolution, covering 34 countries (European countries, Japan, and the United States), the United States ranked 33rd, just above Turkey.
  http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pbio.0040167

Background & Motivation
  Source: http://www.intel.com/research/silicon/mooreslaw.htm
The Human Genome Project

     1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca
    61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg
   121 gcggctcggc atcatgcgtc gagggcgtct gctggagatc gccctgggat ttaccgtgct
   181 tttagcgtcc tacacgagcc atggggcgga cgccaatttg gaggctggga acgtgaagga
   241 aaccagagcc agtcgggcca agagaagagg cggtggagga cacgacgcgc ttaaaggacc
   301 caatgtctgt ggatcacgtt ataatgctta ctgttgccct ggatggaaaa ccttacctgg
   361 cggaaatcag tgtattgtcc ccatttgccg gcattcctgt ggggatggat tttgttcgag
   421 gccaaatatg tgcacttgcc catctggtca gatagctcct tcctgtggct ccagatccat
   481 acaacactgc aatattcgct gtatgaatgg aggtagctgc agtgacgatc actgtctatg
   541 ccagaaagga tacataggga ctcactgtgg acaacctgtt tgtgaaagtg gctgtctcaa
   601 tggaggaagg tgtgtggccc caaatcgatg tgcatgcact tacggattta ctggacccca
   661 gtgtgaaaga gattacagga caggcccatg ttttactgtg atcagcaacc agatgtgcca
   721 gggacaactc agcgggattg tctgcacaaa acagctctgc tgtgccacag tcggccgagc
   781 ctggggccac ccctgtgaga tgtgtcctgc ccagcctcac ccctgccgcc gtggcttcat
   841 tccaaatatc cgcacgggag cttgtcaaga tgtggatgaa tgccaggcca tccccgggct
   901 ctgtcaggga ggaaattgca ttaatactgt tgggtctttt gagtgcaaat gccctgctgg
   961 acacaaactt aatgaagtgt cacaaaaatg tgaagatatt gatgaatgca gcaccattcc
  1021 ...

Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

The sea urchin Strongylocentrotus purpuratus
“High-Throughput BioTech”
  Sensors
    DNA sequencing
    Microarrays/Gene expression
    Mass Spectrometry/Proteomics
    Protein/protein & DNA/protein interaction
  Controls
    Cloning
    Gene knock out/knock in
    RNAi
  Floods of data
  “Grand Challenge” problems

Goals
  Basic biology
  Disease diagnosis/prognosis/treatment
  Drug discovery, validation & development
  Individualized medicine
  …

What’s all the fuss?
  The human genome is “finished”…
  Even if it were, that’s only the beginning
  Explosive growth in biological data is revolutionizing biology & medicine
  “All pre-genomic lab techniques are obsolete”
  (and computation and mathematics are crucial to post-genomic analysis)

CS Points of Contact & Opportunities
  Scientific visualization
    Gene expression patterns
  Databases
    Integration of disparate, overlapping data sources
    Distributed genome annotation in face of shifting underlying genomic coordinates
  AI/NLP/Text Mining
    Information extraction from journal texts with inconsistent nomenclature, indirect interactions, incomplete/inaccurate models,…
  Machine learning
    System level synthesis of cell behavior from low-level heterogeneous data (DNA sequence, gene expression, protein interaction, mass spec,…)
  ...
  Algorithms
Computers in biology: Then & now

An RNA Structure
An RNA Sensor & On/Off Switch
  Figure labels: mRNA leader; mRNA leader switch?
  L19 absent: Gene On
  L19 present: Gene Off

An RNA Grammar
  S → LS | L
  L → s | “dFd”
  F → LS | “dFd”
  “dFd” means Watson-Crick base pair: aFu | uFa | gFc | cFg
  paren-like nesting

Actually, a Stochastic CFG
  Associate probabilities with rules:
    S → LS (0.87) | L (0.13)
    L → s (0.89*p(s)) | dFd (0.11*p(dd))
    F → LS (0.21) | dFd (0.79*p(dd))
  Where p(s) & p(dd) are the probabilities of the specific single/paired nucleotides, perhaps from empirical data or a model of sequence evolution
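One way to get a feel for the stochastic grammar above is to sample from it. The sketch below is illustrative only: it uses the rule probabilities from the slide, but it assumes p(s) and p(dd) are uniform over the four bases and the four Watson-Crick pairs, which the slide does not specify. The function name `sample_rna` is hypothetical.

```python
import random

# Watson-Crick pairs allowed by the "dFd" rule: aFu | uFa | gFc | cFg
PAIRS = [("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")]
BASES = "acgu"


def sample_rna(rng):
    """Sample (sequence, dot-bracket structure) from the stochastic grammar.

    Assumption (not from the slides): p(s) and p(dd) are uniform.
    """
    seq, struct = [], []
    stack = ["S"]  # explicit stack instead of recursion; top is expanded next
    while stack:
        sym = stack.pop()
        if sym == "S":
            if rng.random() < 0.87:        # S -> L S
                stack += ["S", "L"]        # L is on top, so it expands first
            else:                          # S -> L
                stack.append("L")
        elif sym in ("L", "F"):
            # L -> s with prob 0.89, F -> L S with prob 0.21; otherwise dFd
            unpaired_prob = 0.89 if sym == "L" else 0.21
            if rng.random() < unpaired_prob:
                if sym == "L":             # L -> s: emit one unpaired base
                    seq.append(rng.choice(BASES))
                    struct.append(".")
                else:                      # F -> L S
                    stack += ["S", "L"]
            else:                          # dFd: emit left base now, right later
                left, right = rng.choice(PAIRS)
                seq.append(left)
                struct.append("(")
                stack += [("close", right), "F"]
        else:                              # ("close", base): finish a pair
            seq.append(sym[1])
            struct.append(")")
    return "".join(seq), "".join(struct)


if __name__ == "__main__":
    sequence, structure = sample_rna(random.Random(1))
    print(sequence)
    print(structure)
```

The dot-bracket string makes the "paren-like nesting" visible: every "(" produced by a dFd rule is closed by the matching ")" only after everything derived from the inner F has been emitted.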
Experimental Validation
  boxed = confirmed riboswitch (+2 more)

Bottom Line
  CFG technology is a key tool for RNA description, discovery and search
  A very active research area. (Some call RNA the “dark matter” of the genome.)
  Huge compute hog: results above represent hundreds of CPU-years, and smart algorithms can have a big impact

An Algorithm Example: ncRNAs
  The “Central Dogma”: DNA -> messenger RNA -> Protein
  Last ~5 years: 100s – 1000s of examples of functionally important ncRNAs
  Much harder to find than protein-coding genes
  Main method - Covariance Models (based on stochastic context free grammars)
  Main problem - Sloooow … O(nm^4)
“Rigorous Filtering” - Z. Weinberg
  Convert CM to HMM (AKA: stochastic CFG to stochastic regular grammar)
  Details: CENSORED (but stay tuned)
    Do it so HMM score always ≥ CM score
    Optimize for most aggressive filtering subject to constraint that score bound maintained
    A large convex optimization problem
  Filter genome sequence with (fast) HMM, run (slow) CM only on sequences above desired CM threshold; guaranteed not to miss anything
  Newer, more elaborate techniques pulling in key secondary structure features for better searching (uses automata theory, dynamic programming, Dijkstra, more optimization stuff,…)

Results
  Typically 200-fold speedup or more
  Finding dozens to hundreds of new ncRNA genes in many families
  Has enabled discovery of many new families
  …
  Plenty of CS here

Course Focus & Goals
  Mainly sequence analysis
  Algorithms for alignment, search, & discovery
    Specific sequences, general types (“genes”, etc.)
    Single sequence and comparative analysis
  Techniques: HMMs, EM, MLE, Gibbs, Viterbi…
  Enough bio to motivate these problems, including very light intro to modern biotech supporting them
  Math/stats/cs underpinnings thereof
  Applied to real data

More Admin
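The filtering idea on the "Rigorous Filtering" slide depends only on one property: the fast score is never below the slow score, so anything the fast score rejects can be skipped safely. The sketch below illustrates that control flow with toy stand-ins. It is not the CM-to-HMM construction from the slides; `slow_score`, `fast_bound`, and `rigorous_filter` are hypothetical names, and both scoring functions are invented for illustration.

```python
from collections import Counter

PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}


def slow_score(window):
    """Toy stand-in for the slow (CM) score: count Watson-Crick pairs
    between mirrored positions i and n-1-i."""
    n = len(window)
    return sum((window[i], window[n - 1 - i]) in PAIRS for i in range(n // 2))


def fast_bound(window):
    """Toy stand-in for the fast (HMM) score. It ignores positions and only
    looks at base counts, so it can never be below slow_score: each a-u pair
    uses up one a and one u, each g-c pair one g and one c."""
    counts = Counter(window)
    return min(counts["a"], counts["u"]) + min(counts["g"], counts["c"])


def rigorous_filter(windows, threshold):
    """Return (hits, number of slow evaluations actually performed)."""
    hits, slow_calls = [], 0
    for w in windows:
        if fast_bound(w) < threshold:
            continue  # bound below threshold implies slow score is too: skip
        slow_calls += 1
        if slow_score(w) >= threshold:
            hits.append(w)
    return hits, slow_calls


if __name__ == "__main__":
    windows = ["gggaaaccc", "aaaaaaaaa", "gcgcgcgcg", "auauauau"]
    print(rigorous_filter(windows, threshold=3))
```

In this toy run the window "gcgcgcgcg" passes the filter but fails the slow score, which is the price of a loose bound; the optimization described on the slide is about making the bound as tight as possible while keeping the guarantee.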
A VERY Quick Intro To Molecular Biology

The Genome
  The hereditary info present in every cell
  DNA molecule -- a long sequence of nucleotides (A, C, T, G)
  Human genome -- about 3 x 10^9 nucleotides
  The genome project -- extract & interpret genomic information, apply to genetics of disease, better understand evolution, …

DNA
  Discovered 1869
  Role as carrier of genetic information - much later
  The Double Helix - Watson & Crick 1953
  Complementarity
    A ←→ T
    C ←→ G
  Visualizations: http://www.rcsb.org/pdb/explore.do?structureId=123D

The Double Helix
  Los Alamos Science
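The complementarity rule (A ←→ T, C ←→ G) means either strand determines the other. Because the two strands of the double helix run in opposite directions, the partner strand is conventionally written as the reverse complement. A minimal sketch; the function name `reverse_complement` is just an illustrative choice.

```python
# Complementarity from the slide: A <-> T, C <-> G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))    # GCAT
    print(reverse_complement("AAGCTT"))  # AAGCTT (equals its own reverse complement)
```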
Genetics - the study of heredity
  A gene -- classically, an abstract heritable attribute existing in variant forms (alleles)
  Genotype vs phenotype
  Mendel
    Each individual two copies of each gene
    Each parent contributes one (randomly)
    Independent assortment

Cells
  Chemicals inside a sac - a fatty layer called the plasma membrane
  Prokaryotes (bacteria, archaea) - little recognizable substructure
  Eukaryotes (all multicellular organisms, and many single celled ones, like yeast) - genetic material in nucleus, other organelles for other specialized functions

Chromosomes
  1 pair of (complementary) DNA molecules (+ protein wrapper)
  Most prokaryotes have just 1 chromosome
  Eukaryotes - all (most) cells have same number of chromosomes, e.g. fruit flies 8, humans & bats 46, rhinoceros 84, …

Mitosis/Meiosis
  Most “higher” eukaryotes are diploid - have homologous pairs of chromosomes, one maternal, other paternal (exception: sex chromosomes)
  Mitosis - cell division, duplicate each chromosome, 1 copy to each daughter cell
  Meiosis - 2 divisions form 4 haploid gametes (egg/sperm)
  Recombination/crossover -- exchange maternal/paternal segments
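The Mendel bullets above (two copies of each gene per individual, one contributed at random by each parent) translate directly into a small enumeration. The sketch below is an illustration for a single gene with two alleles; the function name `cross` and the "Aa" genotype notation are assumptions, not something the slides define.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Offspring genotype probabilities for one gene.

    Each parent is a 2-character genotype such as "Aa"; each contributes
    one of its two alleles with equal probability.
    """
    counts = Counter(
        "".join(sorted(a + b)) for a, b in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


if __name__ == "__main__":
    for genotype, prob in cross("Aa", "Aa").items():
        print(genotype, prob)
```

For two heterozygous parents this gives the classic 1:2:1 ratio of AA, Aa, and aa.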
Proteins
  Chain of amino acids, of 20 kinds
  Proteins: the major functional elements in cells
    Structural/mechanical
    Enzymes (catalyze chemical reactions)
    Receptors (for hormones, other signaling molecules, odorants,…)
    Transcription factors
    …
  3-D Structure is crucial: the protein folding problem

The “Central Dogma”
  Genes encode proteins
  DNA transcribed into messenger RNA
  mRNA translated into proteins
  Triplet code (codons)

Transcription: DNA → RNA
  Figure labels: RNA; sense strand (5’ 3’); antisense strand (3’ 5’); DNA; RNA polymerase

Codons & The Genetic Code

  First    Second Base                      Third
  Base     U          C     A      G        Base
  U        Phe        Ser   Tyr    Cys      U
           Phe        Ser   Tyr    Cys      C
           Leu        Ser   Stop   Stop     A
           Leu        Ser   Stop   Trp      G
  C        Leu        Pro   His    Arg      U
           Leu        Pro   His    Arg      C
           Leu        Pro   Gln    Arg      A
           Leu        Pro   Gln    Arg      G
  A        Ile        Thr   Asn    Ser      U
           Ile        Thr   Asn    Ser      C
           Ile        Thr   Lys    Arg      A
           Met/Start  Thr   Lys    Arg      G
  G        Val        Ala   Asp    Gly      U
           Val        Ala   Asp    Gly      C
           Val        Ala   Glu    Gly      A
           Val        Ala   Glu    Gly      G

  Ala : Alanine          Leu : Leucine
  Arg : Arginine         Lys : Lysine
  Asn : Asparagine       Met : Methionine
  Asp : Aspartic acid    Phe : Phenylalanine
  Cys : Cysteine         Pro : Proline
  Gln : Glutamine        Ser : Serine
  Glu : Glutamic acid    Thr : Threonine
  Gly : Glycine          Trp : Tryptophan
  His : Histidine        Tyr : Tyrosine
  Ile : Isoleucine       Val : Valine
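The two steps of the Central Dogma can be sketched in a few lines using the genetic code table above. This is a simplified illustration: it uses standard one-letter amino acid codes rather than the three-letter codes on the slide ("*" marks Stop), it assumes the input is the sense strand already in the correct reading frame, and the names `transcribe` and `translate` are illustrative.

```python
BASES = "UCAG"
# Amino acids in table order: first base, then second base, then third base,
# each running over U, C, A, G. "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(sense_dna):
    """mRNA has the same sequence as the DNA sense strand, with U for T."""
    return sense_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna)             # AUGGCCUAA
    print(translate(mrna))  # MA  (Met, Ala, then Stop)
```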