CS 5263 Bioinformatics
Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

Outline
• Administrivia
• What is bioinformatics
• Why bioinformatics
• Course overview
• Short introduction to molecular biology

Course Info
• Instructor: Jianhua Ruan
  Office: S.B. 4.01.48
  Phone: 458-6819
  Email: jruan@cs.utsa.edu
  Office hours: MW 2-3pm
• Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/

Survey form
• Your name
• Email
• Academic preparation
• Interests
• Helps me better design lectures and assignments
Course description
• A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.
• Prerequisites:
  – Programming experience
  – Some knowledge of algorithms and data structures
  – Basic understanding of statistics and probability
  – Appetite to learn some biology

Textbooks
• An Introduction to Bioinformatics Algorithms, by Jones and Pevzner
• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by Durbin, Eddy, Krogh and Mitchison
• Additional resources
  – Papers
  – Handouts
  – See course website

Grading
• Attendance: 10%
  – At most 2 classes missed without affecting grade
• Homeworks: 50%
  – About 5 assignments
  – Combination of theoretical and programming exercises
  – No exams
  – No late submission accepted
  – Read the collaboration policy!
• Final project and presentation: 40%

Why bioinformatics
• The advance of experimental technology has generated a huge amount of data
  – The human genome is “finished”
  – Even if it were, that’s only the beginning…
• The bottleneck is how to integrate and analyze the data
  – Noisy
  – Diverse
Growth of GenBank vs Moore’s law
[Figure]

Genome annotations
[Figure]
Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

What is bioinformatics
• National Institutes of Health (NIH):
  – Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is bioinformatics
• National Center for Biotechnology Information (NCBI):
  – the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
What is bioinformatics
• Wikipedia
  – Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.

[Diagram: Bioinformatics surrounded by Biology, Molecular Biology, Chemistry, Medicine, Mathematics, Physics, Statistics, Computer Science, Informatics]

Course objectives
• Learn the basics of sequence analysis and other computational biology algorithms
• Become familiar with the research topics in bioinformatics
• Be able to
  – Read / critique bioinformatics research articles
  – Identify subareas that best suit your background
  – Communicate and exchange ideas with (computational) biologists

What you will learn?
• Basic concepts in molecular biology and genetics
• Algorithms to address selected problems in bioinformatics
  – Dynamic programming, string algorithms, graph algorithms
  – Statistical learning algorithms: HMM, EM, Gibbs sampling
  – Data mining: clustering / classification
• Applications to real data
What you will not learn?
• Designing / performing biological experiments (duh!)
• Programming (in perl, etc).
• Building bioinformatics software tools (GUI, database, Web, …)
• Using existing tools / databases (well, not exactly true)

Covered topics
• Biology (1 week)
• Sequence analysis (8 weeks)
  – Sequence alignment
    • Pairwise, multiple, global, local, optimal, heuristic
  – String matching
  – Motif finding
• Gene prediction
• RNA structure prediction
• Phylogenetic tree
• Functional Genomics (5 weeks)
  – Microarray data analysis
  – Biological networks

Computer Scientists vs Biologists
(courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists
• (almost) Everything is true or false in computer science
• (almost) Nothing is ever true or false in Biology
Biologists vs computer scientists
• Biologists seek to understand the complicated, messy natural world
• Computer scientists strive to build their own clean and organized virtual world

Biologists vs computer scientists
• Computer scientists are obsessed with being the first to invent or prove something
• Biologists are obsessed with being the first to discover something

Some examples of central role of CS in bioinformatics

1. Genome sequencing
AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
• Genome: 3x10^9 nucleotides
• Each sequenced fragment: ~500 nucleotides
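The two numbers on the genome sequencing slide invite a quick sanity check. The sketch below is only back-of-the-envelope arithmetic: it takes the slide's round figures (3x10^9 nucleotides, ~500 nucleotides per fragment) and the "~60 million pieces" quoted on the next slide, and assumes fragments are spread evenly over the genome. The function names are illustrative, not from the course.

```python
# Round numbers taken from the slides; the even-spread assumption is ours.
GENOME_LENGTH = 3 * 10**9    # nucleotides in the human genome
READ_LENGTH = 500            # approximate nucleotides per sequenced fragment
NUM_FRAGMENTS = 60 * 10**6   # "~60 million pieces" on the following slide


def coverage(genome_length: int, read_length: int, num_fragments: int) -> float:
    """Average number of fragments that cover each genome position."""
    return num_fragments * read_length / genome_length


def min_fragments(genome_length: int, read_length: int) -> int:
    """Fewest fragments that could tile the genome with no overlap at all."""
    return -(-genome_length // read_length)  # ceiling division


if __name__ == "__main__":
    print(coverage(GENOME_LENGTH, READ_LENGTH, NUM_FRAGMENTS))
    print(min_fragments(GENOME_LENGTH, READ_LENGTH))
```

Under these assumptions each position is read about ten times over. That redundancy is what gives neighboring fragments the overlaps an assembler needs to piece them together.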
1. Genome sequencing
AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
• 3x10^9 nucleotides
• A big puzzle: ~60 million pieces
Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble up to 1,000,000 long DNA pieces
• 2000: assemble whole human genome

2. Gene Finding
Where are the genes?
In humans:
• ~22,000 genes
• ~1.5% of human DNA

2. Gene Finding
[Diagram: 5’, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, 3’]
• Start codon: ATG
• Splice sites
• Stop codon: TAG / TGA / TAA
Hidden Markov Models (well studied for many years in speech recognition)

3. Protein Folding
• The amino-acid sequence of a protein determines the 3D fold
• The 3D fold of a protein determines its function
• Can we predict the 3D fold of a protein given its amino-acid sequence?
  – Holy grail of compbio: a 40-year-old problem
  – Molecular dynamics, computational geometry, machine learning
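The gene-finding slide names the start codon (ATG) and the three stop codons (TAG, TGA, TAA). As a rough illustration of why these signals matter, here is a minimal sketch that scans one strand of a DNA string for stretches that begin with ATG and end at the first in-frame stop codon. This is not the Hidden Markov Model approach the slide refers to: it ignores introns, splice sites, and the reverse strand. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive.

    Scans the three reading frames of the given strand only. min_codons is
    the minimum number of codons (counting ATG) before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATGATT"  # toy sequence, not real data
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

A scan like this finds far too many candidates in real eukaryotic DNA, which is one reason statistical models are used instead.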
4. Sequence Comparison: Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment
• Introduced ~1970
• BLAST: 1990, most cited paper in history
• Still very active area of research

Lipman & Pearson, 1985
“…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).”
• Database size today: 10^12 (increased by 2 million fold)
• BLAST search: 1.5 minutes

BLAST
[Diagram: query searched against DB]
• Efficient string matching algorithms
• Fast database index techniques

5. Microarray analysis
Clinical prediction of Leukemia type
• 2 types
  – Acute lymphoid (ALL)
  – Acute myeloid (AML)
• Different treatments & outcomes
• Predict type before treatment?
Bone marrow samples: ALL vs AML
Measure amount of each gene

Some goals of biology for the next 50 years
• List all molecular parts that build an organism
  – Genes, proteins, other functional parts
• Understand the function of each part
• Understand how parts interact physically and functionally
• Study how function has evolved across all species
• Find genetic defects that cause diseases
• Design drugs rationally
• Sequence the genome of every human, use it for personalized medicine
• Bioinformatics is an essential component for all the goals above
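The sequence comparison slide (item 4) shows two DNA strings lined up with gap characters so that matching letters fall in the same column. One classic way to compute such an alignment is global alignment by dynamic programming (Needleman-Wunsch). The sketch below is a minimal version with an assumed scoring scheme of +1 for a match, -1 for a mismatch, and -1 per gap character. These scores are illustrative choices, not values taken from the slides, and BLAST itself uses faster heuristics rather than this exhaustive table.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = global_align(
        "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
        "TAGCTATCACGACCGCGGTCGATTTGCCCGAC",
    )
    print(s)
    print(top)
    print(bottom)
```

The table has (n+1) x (m+1) cells, so the cost grows with the product of the two lengths. The alignment printed for the slide's sequences depends on the assumed scores and need not match the one drawn on the slide.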
A short introduction to molecular biology

Life
• Two categories:
  – Prokaryotes (e.g. bacteria)
    • Unicellular
    • No nucleus
  – Eukaryotes (e.g. fungi, plants, animals)
    • Unicellular or multicellular
    • Have a nucleus

Organism, Organ, Cell
[Figure: organism, organ]

Prokaryote vs Eukaryote
• Eukaryotes have many membrane-bounded compartments inside the cell
  – Different biological processes occur at different cellular locations
Chemical contents of cell
• Water
• Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)
  – Protein
  – DNA
  – RNA
  – …
• Small molecules
  – Sugar
  – Ions (Na+, K+, Ca2+, Cl-, …)
  – Hormone
  – …

DNA
• DNA: forms the genetic material of all living organisms
  – Can be replicated and passed to descendants
  – Contains information to produce proteins
• To computer scientists, DNA is a string made from alphabet {A, C, G, T}
  – e.g. ACAGAACGTAGTGCCGTGAGCG
• Each letter is a nucleotide
• Length varies from hundreds to billions

RNA
• Historically thought to be information carrier only
  – DNA => RNA => Protein
  – New roles have been found for them
• To computer scientists, RNA is a string made from alphabet {A, C, G, U}
  – e.g. ACAGAACGUAGUGCCGUGAGCG
• Each letter is a nucleotide
• Length varies from tens to thousands

Protein
• Protein: the actual “worker” for almost all processes in the cell
  – Enzymes: speed up reactions
  – Signaling: information transduction
  – Structural support
  – Production of other macromolecules
  – Transport
• To computer scientists, protein is a string made from 20 kinds of characters
  – e.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
• Each letter is called an amino acid
• Length varies from tens to thousands
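The "string over an alphabet" view on the DNA, RNA and Protein slides maps directly onto ordinary string code. The sketch below checks a sequence against its alphabet and rewrites the slide's DNA example in the RNA alphabet by replacing T with U. That is a simplification of transcription which ignores strands and direction. The function names are invented for this illustration, and the 20-letter set is the standard one-letter amino acid code.

```python
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
# Standard one-letter codes for the 20 amino acids.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(seq: str, alphabet: set[str]) -> bool:
    """True if every character of seq belongs to the given alphabet."""
    return set(seq) <= alphabet


def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA string in the RNA alphabet (T becomes U)."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("not a DNA string over {A, C, G, T}")
    return dna.replace("T", "U")


if __name__ == "__main__":
    dna = "ACAGAACGTAGTGCCGTGAGCG"                # DNA example from the slide
    protein = "MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP"  # protein example from the slide
    print(dna_to_rna(dna))
    print(is_valid(protein, PROTEIN_ALPHABET))
```

Applied to the DNA example, this reproduces the RNA example string shown on the RNA slide.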