Analysis & Visualization of large-scale genomic data


  1. Analysis & Visualization of large-scale genomic data
     Olga Troyanskaya, Ph.D.

     About the course

     Instructor information
     • Olga Troyanskaya
     • Best way to contact is by e-mail: ogt@cs.princeton.edu, please put course number (597F) in subject line
     • Office: 204 in 35 Olden Street

     The course
     • In bioinformatics, a field that brings together computer science and biology to study the flow of information in biological systems and in biological research
     • This course will focus on analysis of large-scale functional data: gene expression, proteomics, data integration, data visualization

     What this course is and is not
     • A course on analysis of gene expression, proteomic, and other high-throughput functional biological data
     • A course in applied computer science (with some statistics in the mix)
     • Not an overview of bioinformatics: this is a depth-first course, although a brief intro to bioinformatics and biology will be provided (very soon)

     Who should take this course
     • Graduate or advanced undergraduate students from any department
     • Interested in genomics, bioinformatics, or applied computer science
     • Have some computational background
     • Are interested in learning about genomics

  2. Prerequisites
     • SEAS students: ability to program a computer at CS 217 (intro to programming) level in a language of your choice
     • Biology students: GENERAL understanding of computation and mathematical concepts on the level of SVD
     • If in doubt, talk to me or e-mail me; most likely there isn't a problem

     Course format
     • Lectures to introduce topics
     • Student presentations of literature papers
     • Discussion of presented papers in seminar format following the presentation
     • Students will complete a team project during the course and write a paper on it

     Grading
     • Project: ~45%
     • Presentations: ~35%
     • Discussion of assigned reading (& attendance): ~20%

     Presentations
     • Two 30-min presentations per class, plus 20 minutes discussion
     • Each presentation is of 1 paper
       – Describe major points of the paper, including methods details and evaluation
       – Outline what you think are strong/weak points of the paper
       – Suggest what would improve the paper and what the future steps could be

     Presentation (cont.)
     • Do:
       – Make your presentation accessible to everyone in the class by explaining methods (both computational and relevant experimental techniques)
       – Skip minor points, but do not just gloss over important method details or evaluation
     • Do Not:
       – Go over time: 25 mins is good, 31 mins is bad
       – Be afraid to point out important points you are confused about even after you looked into them
     • Presentations judged mainly on content, but delivery does matter

     The project
     • A team or individual project (up to 3 people/team)
     • Involves designing, implementing and evaluating a novel bioinformatics method
       – Can be a known computational or statistical technique not yet applied to bioinformatics
       – Can be a novel visualization tool
       – I would be happy to provide ideas
     • Project can be applicable to your research
     • Biology students who cannot program can instead do a longer in-depth review paper of methods in one area of informatics we covered (e.g. microarray image analysis), including ideas for novel methods and their necessary characteristics
     • At the end of fall: submission of project/review writeups or project papers

  3. Molecular biology 101 or "why bother?"

     Cells are fundamental working units of all organisms
     • Yeast are unicellular organisms
     • Humans are multi-cellular organisms
     • Understanding how a cell works is critical to understanding how the organism functions

     Prokaryotes vs. Eukaryotes
     • Yeast is a eukaryote just like humans. Fundamental biological processes are very similar.

     Key biological macromolecules
     • Lipids:
       – Mostly structural function
       – Construct compartments that separate inside from outside
     • DNA
       – Encodes hereditary information
     • Proteins
       – Do most of the work in the cell
       – Form 3D structure and complexes critical for function

     Lipids
     • Each lipid consists of a hydrophilic (water loving) and hydrophobic fragment
     • Spontaneously form lipid bilayers => membranes

  4. DNA
     • Uses alphabet of 4 letters {ATCG}, called bases
     • Encodes genetic information in triplet code
     • Structure: a double helix

     Proteins
     • A sequence of amino acids (alphabet of 20)
     • Each amino acid encoded by 3 DNA bases
     • Perform most of the actual work in the cell
     • Fold into complex 3D structure
     Courtesy of the Zhou Laboratory, The State University of New York at Buffalo

     How does a cell function?

     The Central Dogma of biology
     • DNA is a sequence of bases {A, T, C, G}: TAT-CGT-AGT
     • Each 3 bases of DNA encode 1 amino acid
     • Proteins consist of amino acids, whose sequence is encoded in DNA: Tyr-Arg-Ser
     Courtesy U.S. Department of Energy Genomes to Life program

     The "omes"
     • Genome: organism's complete set of DNA
       – Relatively stable through an organism's lifetime
       – Size: from 600,000 to several billion bases
       – Gene is a basic unit of heredity (only 2% of the human genome)
     • Proteome: organism's complete set of proteins
       – Dynamic: changes minute to minute
       – Proteins actually perform most cellular functions; they are encoded by genes (not a 1-to-1 relationship)
       – Protein function and structure form molecular basis for disease

     Beyond the "omes": systems biology
     • Understanding the function and regulation of cellular machinery, as well as cell-to-cell communication on the molecular level
     • Why? Because most important biological problems are fundamentally systems-level problems
       – Systems-level understanding of disease (e.g. cancer)
       – Molecular medicine
       – Gene therapy
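The triplet code from the Central Dogma slide (TAT-CGT-AGT encoding Tyr-Arg-Ser) can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `translate` and the three-entry codon table are assumptions made for the example, and a real table would cover all 64 codons.

```python
# Partial codon table: ONLY the three codons shown on the slide.
# (Illustrative assumption; the full genetic code has 64 codons.)
CODON_TO_AMINO_ACID = {"TAT": "Tyr", "CGT": "Arg", "AGT": "Ser"}


def translate(dna: str) -> str:
    """Translate a DNA string into amino acids, reading 3 bases at a time.

    Hyphens are treated as visual separators and ignored.
    """
    bases = dna.replace("-", "").upper()
    if len(bases) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Split the sequence into consecutive, non-overlapping triplets (codons).
    codons = [bases[i:i + 3] for i in range(0, len(bases), 3)]
    # Look up each codon; an unknown codon raises KeyError in this sketch.
    return "-".join(CODON_TO_AMINO_ACID[codon] for codon in codons)


print(translate("TAT-CGT-AGT"))  # Tyr-Arg-Ser
```

The dictionary lookup mirrors the "each 3 bases of DNA encode 1 amino acid" rule; because several codons map to the same amino acid in the full code, the mapping cannot be inverted uniquely.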

  5. Systems-level challenges
     • Gene function annotation: what does a gene do
       – ~30,000 genes in the human genome => systems-level approaches necessary
       – A modern human microarray experiment produces ~500,000 data points => computational analysis & visualization necessary
       – Many high-throughput functional technologies => computational methods necessary to integrate the data
     • Biological networks: how do proteins interact
       – Large amounts of high-throughput data => computation necessary to store and analyze it
       – Data has variable specificity => computational approaches necessary to separate reliable conclusions from random coincidences
     • Comparative genomics: comparing data between organisms
       – Need to map concepts across organisms on a large scale => practically impossible to do by hand
       – High amount of variable quality data => computational methods needed for integration, visualization, and analysis
       – Data often distributed in databases across the globe, with variable schemas etc. => data storage and consolidation methods needed

     Function
     • To study WHAT proteins DO, HOW they INTERACT, and HOW they are REGULATED, need data beyond genomic sequence
     • Genomics/Bioinformatics is fundamentally a COLLABORATIVE and MULTIDISCIPLINARY effort

     Gene expression: one type of high-throughput functional data

     Why microarray analysis: the questions
     • Large-scale study of biological processes
     • What is going on in the cell at a certain point in time?
     • On the large-scale genetic level, what accounts for differences between phenotypes?
     • Sequence important, but genes have effect through expression

     Why study gene expression
     [Diagram: analogy between biology and automobiles. Labels: DNA, Gene Expression, Proteins, People; Blueprints of automobile parts, Car parts, Automobiles]

     Microarray technology

  6. Microarray technologies
     • Spotted cDNA arrays
       – Developed by Pat Brown (Stanford U)
       – Robotic microspotting
       – PCR products of full-length genes (>100 nts)
     • Affymetrix GeneChips
       – Photolithography (from computer industry)
       – Each gene represented by many n-mers
     • Bubble jet / Ink jet arrays
       – Oligos (25-60 nts) built directly on arrays (in situ synthesis)
       – Highly uniform spots, very expensive

     Early cDNA microarray (18,000 clones)

     cDNA microarrays
     [Diagram labels: Cells of Interest, Isolate mRNA, Reference sample, Known DNA sequences, Glass slide, Resulting data]

     Extracting Data

       Cy3     Cy5     Cy5/Cy3   log2(Cy5/Cy3)
       200     10000   50.00      5.64
       4800    4800     1.00      0.00
       9000    300      0.03     -4.91

     Data matrix (axes labeled Genes and Experiments):

       0.25  0.01  0.30  0.70  0.25
       0.73  0.73  0.89  0.92  0.67
       0.14  0.15  0.60  0.23  0.14
       0.12  0.12  0.12  0.07  0.02
       0.01  0.05  0.14  0.12  0.01

     Microarray Data Flow
     [Flowchart boxes: Microarray experiment; Image Analysis; Database; Data Selection & Missing value estimation; Normalization & Centering; Data Matrix; Unsupervised Analysis (clustering); Supervised Analysis; Decomposition techniques; Networks & Data Integration]

     Experimental design of microarrays
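The Extracting Data slide turns each spot's two channel intensities into a log ratio, log2(Cy5/Cy3). A minimal Python sketch of that arithmetic follows, using the three intensity pairs from the slide. The function name `log_ratio` is an assumption for illustration, and the sketch does no background correction or normalization, which a real pipeline would need.

```python
import math


def log_ratio(cy3: float, cy5: float) -> float:
    """Return log2(Cy5 / Cy3) for one spot (hypothetical helper name)."""
    # Equal intensities give 0; a 2-fold increase in Cy5 gives +1,
    # a 2-fold decrease gives -1, so up- and down-regulation are symmetric.
    return math.log2(cy5 / cy3)


# (Cy3, Cy5) intensity pairs taken from the slide.
spots = [(200, 10000), (4800, 4800), (9000, 300)]

for cy3, cy5 in spots:
    print(f"{cy3:>5} {cy5:>6} {cy5 / cy3:6.2f} {log_ratio(cy3, cy5):6.2f}")
```

The printed ratios and log ratios match the slide's table (50.00 and 5.64, 1.00 and 0.00, 0.03 and -4.91). A zero intensity in either channel would raise an exception here, so such spots would have to be filtered or handled before this step.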
