
CSI5180. Machine Learning for Bioinformatics Applications: Essential Bioinformatics Skills



  1. CSI5180. Machine Learning for Bioinformatics Applications. Essential Bioinformatics Skills, by Marcel Turcotte. Version November 6, 2019.

  2. Preamble

  3. Essential Bioinformatics Skills. The lecture gives an overview of the available resources that are essential for bioinformatics projects. This includes the main databases, software applications, programming languages and computing environments. We also emphasize the skills that are essential to produce robust and reproducible results. General objective: Summarize the essential resources for conducting a bioinformatics project.

  4. Learning objectives: describe the best practices for handling large bioinformatics projects; introduce essential tools; present the major repositories and file formats, along with command-line and REST API access. Reading: see below.

  5. Plan: (1) Preamble, (2) Literature, (3) Guidelines, (4) Computing Environment, (5) Data, (6) REST, (7) Prologue.

  6. Literature

  7. Bioinformatics Data Skills. Vince Buffalo. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. O’Reilly Media, 2015.

  8. A Practical Introduction to... Röbbe Wünschiers. Computational Biology - A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R. Springer, 2013. (https://link.springer.com/book/10.1007/978-3-642-34749-8)

  9. The Biostar Handbook. The Biostar Handbook: Bioinformatics data analysis guide, 2019 (https://biostar.myshopify.com).

  10. Ten (10) simple rules for...
      Sandve, G. K., Nekrutenko, A., Taylor, J. & Hovig, E. Ten Simple Rules for Reproducible Computational Research. PLoS Comput Biol 9 (2013).
      Boulesteix, A.-L. Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11, e1004191 (2015).
      Prlic, A. & Procter, J. B. Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol 8, e1002802 (2012).
      Perez-Riverol, Y. et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol 12, e1004947 (2016).
      Sholler, D. et al. Ten simple rules for helping newcomers become contributors to open projects. PLoS Comput Biol 15, e1007296 (2019).
      Rule, A. et al. Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLoS Comput Biol 15, e1007007 (2019).

  11. Ten (10) simple rules for... (continued)
      Osborne, J. M. et al. Ten simple rules for effective computational research. PLoS Comput Biol 10, e1003506 (2014).
      Elofsson, A. et al. Ten simple rules on how to create open access and reproducible molecular simulations of biological systems. PLoS Comput Biol 15, e1006649 (2019).
      Lee, B. D. Ten simple rules for documenting scientific software. PLoS Comput Biol 14, e1006561 (2018).
      Carey, M. A. & Papin, J. A. Ten simple rules for biologists learning to program. PLoS Comput Biol 14, e1005871 (2018).
      Zook, M. et al. Ten simple rules for responsible big data research. PLoS Comput Biol 13, e1005399 (2017).

  12. (One more) Definition. “Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying ‘informatics’ techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.” Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A proposed definition and overview of the field. Methods of Information in Medicine 40, 346-358 (2001).

  13. Guidelines

  14. Robust research (Vince Buffalo): pay attention to your experimental design; write code for humans, write data for computers; let the computer do the work; write down your assumptions and test them (unit testing); use existing libraries; treat data as read-only.
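Two of these guidelines, treating data as read-only and testing written-down assumptions, can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the file raw_reads.fasta, its contents, and the helper check_dna_alphabet are hypothetical and do not come from the lecture.

```shell
#!/bin/sh
# Minimal sketch: treat data as read-only, and test an assumption.
# raw_reads.fasta and check_dna_alphabet are hypothetical names.
set -eu
workdir=$(mktemp -d)
raw="$workdir/raw_reads.fasta"
printf '>read1\nACGTN\n>read2\nGGCCA\n' > "$raw"

# Treat data as read-only: remove write permission from the raw file.
chmod a-w "$raw"
perms=$(ls -l "$raw" | cut -c1-10)

# Written-down assumption: sequence lines contain only A, C, G, T or N.
# Returns 0 when the assumption holds, 1 otherwise.
check_dna_alphabet() {
    if grep -v '^>' "$1" | grep -q '[^ACGTN]'; then
        return 1
    fi
    return 0
}

if check_dna_alphabet "$raw"; then
    alphabet_ok=yes
else
    alphabet_ok=no
fi
echo "permissions: $perms, alphabet ok: $alphabet_ok"
```

Running the check before every analysis step turns a silent assumption into an explicit, repeatable test.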

  15. Reproducible research (Vince Buffalo): share your source code and your data; record meta-data (versions of the software and databases you are using); write down the parameters or, better yet, make it a script; one README file per directory; make figures, statistics, and tables from scripts. Not only is this more scientific, it is almost certain that you will need to redo your analyses!
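As one possible reading of "make it a script", the sketch below keeps a parameter inside the script and writes a small metadata file next to the results. The names min_length, input.txt, and run_metadata.txt are hypothetical, and the toy filtering step stands in for a real analysis; a real project would also log the version of each tool it calls.

```shell
#!/bin/sh
# Minimal sketch: parameters live in the script, metadata is logged with the results.
# min_length, input.txt and run_metadata.txt are hypothetical names.
set -eu
workdir=$(mktemp -d)
input="$workdir/input.txt"
printf 'ACGT\nAC\nACGTACGT\n' > "$input"

# The parameter is written down here rather than typed at the prompt.
min_length=4

# Record when, where, with which parameter and on which input the run happened.
metadata="$workdir/run_metadata.txt"
{
    echo "date: $(date)"
    echo "system: $(uname -sr)"
    echo "min_length: $min_length"
    echo "input_checksum: $(cksum < "$input")"
} > "$metadata"

# Toy analysis: keep lines with at least min_length characters.
results="$workdir/results.txt"
awk -v n="$min_length" 'length($0) >= n' "$input" > "$results"
n_kept=$(wc -l < "$results" | tr -d ' ')
echo "kept $n_kept lines; metadata written to $metadata"
```

Because the checksum of the input is stored with the parameters, a later rerun can confirm that it started from the same data.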

  16. Computing Environment

  17. UNIX. Both bioinformatics and machine learning favour UNIX. Quoting François Chollet (Deep Learning with Python): “You’ll need access to a UNIX machine; it’s possible to use Windows, too, but I don’t recommend it.” Compute Canada (https://docs.computecanada.ca): Cedar, 58,416 CPU cores and 584 GPU devices; Graham, 36,160 cores and 320 GPU devices; Béluga, 34,880 cores and 688 GPU devices; Niagara, 61,920 cores.

  18. Access to UNIX. Your laptop or workstation: as primary or secondary OS (dual boot, USB key, etc.); in a virtual machine (VMware is free for EECS students, VirtualBox is also free); Windows Subsystem for Linux, see the Installation Guide for Windows 10 (https://docs.microsoft.com/en-us/windows/wsl/install-win10). Cloud: I have vouchers for Google Cloud Platform and Amazon (just ask me). Ubuntu is a popular distribution, but there are many others.

  19. UNIX key concepts (https://www.ks.uiuc.edu/Training/Tutorials/Reference/unixprimer.html). Modularity: “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” (Doug McIlroy). The file system plays a central role: /dev/null, /dev/random, /dev/zero. The command line: shell (anatomy of a script, the magic line, and more), redirection, pipe. Examples:
      $ head -c 10 /dev/zero > test10bytes.dat
      $ grep -c '^>' input.fasta
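Redirection and pipes can be tried on a toy file. The sketch below is self-contained; the file demo.fasta and its three records are made up for illustration, and only the grep -c '^>' idiom comes from the slide.

```shell
#!/bin/sh
# Minimal sketch: redirection and pipes on a toy FASTA file.
# demo.fasta and its contents are hypothetical.
set -eu
workdir=$(mktemp -d)
fasta="$workdir/demo.fasta"

# Redirection: send the here-document into a file.
cat > "$fasta" <<'EOF'
>seq1
ACGTACGT
>seq2
GGCC
>seq3
TTAA
EOF

# Every FASTA record starts with '>' at the beginning of a line.
n_records=$(grep -c '^>' "$fasta")

# Pipe: drop header lines, remove newlines, count the remaining characters.
n_residues=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "records: $n_records, residues: $n_residues"
```

Each program in the pipeline does one small job on a text stream, which is the modularity the quote describes.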

  20. Conda/Anaconda/Bioconda. Conda (https://conda.io) is a package, dependency and environment manager for any programming language (Python, R, Ruby, Lua, Scala, Java, and more). Anaconda (https://anaconda.org) is a package management service, primarily for Python and R, with hundreds of packages such as numpy, scipy, scikit-learn, keras, tensorflow. Bioconda (https://bioconda.github.io) is a channel for the conda package manager specializing in bioinformatics software.

  21. Using conda/anaconda/bioconda:
      $ conda create -n csi5180
      $ conda install -n csi5180 keras
      $ conda activate csi5180
      $ conda install -c bioconda bwa    # bwa is distributed through the bioconda channel
      $ conda deactivate
      $ conda update --all

  22. Other considerations (https://git-scm.com/doc). Consider using a (distributed) version control system; Git/GitHub has become the de facto standard. Features: manage changes in your documents; in a distributed version control system, each developer has their own copy of the source code; multiple contributors; creating/merging multiple branches.
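The branch and merge features can be tried in a throwaway repository. This is a minimal sketch, assuming git is installed; the file names, the branch name analysis, and the author identity are hypothetical.

```shell
#!/bin/sh
# Minimal sketch: one commit, one branch, one merge in a throwaway repository.
# File names, branch name and author identity are hypothetical.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Local identity so commits work without any global git configuration.
git config user.name "Example Student"
git config user.email "student@example.org"
git config commit.gpgsign false

echo "Project notes" > README
git add README
git commit -q -m "Add README"

# The default branch name depends on the git configuration, so ask git for it.
main_branch=$(git symbolic-ref --short HEAD)

# Work on a branch, then merge it back.
git checkout -q -b analysis
echo "min_length=4" > params.txt
git add params.txt
git commit -q -m "Add analysis parameters"
git checkout -q "$main_branch"
git merge -q analysis

n_commits=$(git rev-list --count HEAD)
echo "commits on $main_branch: $n_commits"
```

The merge here is a fast-forward because the default branch did not change while the analysis branch was being developed.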

  23. Data

  24. Major repositories. Annotated/assembled nucleotide sequences: National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov); European Bioinformatics Institute (EBI, https://www.ebi.ac.uk); DNA Data Bank of Japan (DDBJ, https://www.ddbj.nig.ac.jp/). See also: International Nucleotide Sequence Database Collaboration (http://www.insdc.org).

  25. Major repositories (continued). GenBank: annotated and identified DNA sequence information. SRA (Sequence Read Archive, formerly Short Read Archive): measurements from high-throughput sequencing experiments. UniProt (Universal Protein Resource): protein sequence data. PDB (Protein Data Bank): 3D structural information of macromolecules.

  26. Other data sources? UCSC Genome Browser; FlyBase (Drosophila [fruit fly]), WormBase (nematode), SGD (Saccharomyces Genome Database), TAIR (Arabidopsis), EcoCyc (Encyclopedia of E. coli Genes and Metabolic Pathways), etc.; RNAcentral (meta-database).

  27. Nucleic Acids Research (NAR). Each year, NAR, a high-impact journal, publishes its “database issue” (https://academic.oup.com/nar/issue/47/D1).

  28. Major file formats (biostar): (1) data that captures prior knowledge (aka reference: FASTA, GFF, BED); (2) experimentally obtained data (aka sequencing reads: FASTQ); (3) data generated by the analysis (aka results: BAM, VCF, formats from point 1 above, and many nonstandard formats).
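Because these formats are mostly plain text, simple command-line tools already answer basic questions about them. The sketch below counts the reads and bases in a toy FASTQ file; demo.fastq is hypothetical, and the code assumes the common layout of exactly four lines per read.

```shell
#!/bin/sh
# Minimal sketch: count reads and bases in a toy FASTQ file.
# demo.fastq is hypothetical; assumes four lines per read.
set -eu
workdir=$(mktemp -d)
fastq="$workdir/demo.fastq"
cat > "$fastq" <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCCA
+
IIIII
EOF

# Each read spans four lines: header, sequence, separator, qualities.
n_lines=$(wc -l < "$fastq" | tr -d ' ')
n_reads=$((n_lines / 4))

# The sequence is the second line of each four-line record.
total_bases=$(awk 'NR % 4 == 2 { total += length($0) } END { print total }' "$fastq")
echo "reads: $n_reads, bases: $total_bases"
```

Counting lines is used instead of grep -c '^@' because a quality line may itself begin with the @ character.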

  29. Entrez Direct: $ conda install -c bioconda entrez-direct

  30. GenBank:
      $ efetch -db nuccore -id NM_000020 -format gb | less
      LOCUS       NM_000020               4177 bp    mRNA    linear   PRI 16-SEP-2019
      DEFINITION  Homo sapiens activin A receptor like type 1 (ACVRL1), transcript
                  variant 1, mRNA.
      ACCESSION   NM_000020
      VERSION     NM_000020.3
      KEYWORDS    RefSeq; RefSeq Select.
      SOURCE      Homo sapiens (human)
        ORGANISM  Homo sapiens
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                  Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                  Catarrhini; Hominidae; Homo.
      REFERENCE   1  (bases 1 to 4177)
        AUTHORS   Leng H, Zhang Q and Shi L.
        TITLE     [Gene diagnosis and treatment of hereditary hemorrhagic
      (...)
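Once a record has been saved to a file, its header fields can be pulled out with awk. The sketch below works on a local copy of the first header lines shown on the slide, so it does not need network access; the file name record.gb is hypothetical, and the field positions are assumed from the layout of this one record.

```shell
#!/bin/sh
# Minimal sketch: extract fields from a saved GenBank flat-file header.
# record.gb is a hypothetical file name; header lines are copied from the slide.
set -eu
workdir=$(mktemp -d)
record="$workdir/record.gb"
cat > "$record" <<'EOF'
LOCUS       NM_000020               4177 bp    mRNA    linear   PRI 16-SEP-2019
DEFINITION  Homo sapiens activin A receptor like type 1 (ACVRL1), transcript
            variant 1, mRNA.
ACCESSION   NM_000020
VERSION     NM_000020.3
EOF

# The first whitespace-separated field is the keyword; the value follows it.
accession=$(awk '$1 == "ACCESSION" { print $2 }' "$record")
version=$(awk '$1 == "VERSION" { print $2 }' "$record")

# On the LOCUS line the third field is the sequence length in base pairs.
length_bp=$(awk '$1 == "LOCUS" { print $3 }' "$record")
echo "$accession ($version): $length_bp bp"
```

In practice the file would come from redirecting the efetch output, for example by replacing `| less` with `> record.gb`.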
