From cell line to command line: my journey to bioinformatics


  1. From cell line to command line: my journey to bioinformatics Ming (Tommy) Tang Research scientist Twitter @tangming2005 MD Anderson Cancer Center, Houston, TX UFGI Genetics & Genomics program seminar

  2. Self-introduction: 2008 UF Genetics and Genomics graduate student

  3. Hallmarks of cancer • Glycolysis vs. mitochondria • Kamarajugadda S et al. 2012 MCB • Cai Q et al. 2012 Oncogene • Epithelial-mesenchymal transition (EMT) • Tang M et al. 2011 PNAS • Vascular endothelial growth factor (VEGF) • Tang M et al. 2013 JBC • Douglas Hanahan and Robert A. Weinberg. 2011. Cell

  4. Challenges that I was facing • How do I open this 2G ChIP-seq file? • Excel fails me. • How do I download the files from GEO and process the raw data? • That’s how I started to teach myself Unix, R and Python.

  5. 2015.03: joined MD Anderson with Dr. Roel Verhaak for a computational biology postdoc

  6. Teaching: University of Miami 2015 Software Carpentry workshop

  7. A book chapter published in April 2017

  8. More about me

  9. Starting a new job in October: moving to Harvard FAS Informatics as a bioinformatics scientist

  10. Challenges and opportunities

  11. Data deluge Sean Davis

  12. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

  13. Credit: Titus Brown

  14. Superman/Wonder woman Credit: Torsten Seemann

  15. What is bioinformatics? https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby063/5066445

  16. A typical day of my life as a bioinformatics scientist • Googling (error messages etc.) • Converting file formats. • Tidying the data. • Installing software. • Real analysis (plotting etc.) 20%

  17. Google is what we do

  18. Ask for help • Google • SEQanswers • Biostars • Stack Overflow

  19. bioinFORMATics • A real variant calling example: • FASTQ • SAM • BAM • VCF • BED
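
Much of the format conversion mentioned on the slides comes down to small bookkeeping details. As one illustration, here is a minimal sketch of turning VCF records into BED intervals, where the key step is the coordinate shift: VCF positions are 1-based, while BED uses 0-based starts and exclusive ends. The function names and the toy records (including the ID `rs123`) are made up for this example, and real pipelines would normally rely on established tools rather than hand-rolled parsing.

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line into a BED line: chrom, start, end, name."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    start = int(pos) - 1      # VCF POS is 1-based; BED start is 0-based
    end = start + len(ref)    # BED end is exclusive; cover the REF allele
    # Use the variant ID when present, otherwise a REF>ALT label.
    name = var_id if var_id != "." else f"{ref}>{alt}"
    return "\t".join([chrom, str(start), str(end), name])


def vcf_to_bed(lines):
    """Skip header lines (starting with '#') and blanks; convert the rest."""
    return [
        vcf_line_to_bed(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Toy input, invented for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr2\t200\trs123\tACT\tA\t99\tPASS\t.\n",
]

for bed_line in vcf_to_bed(example_vcf):
    print(bed_line)
```

The SNV at position 100 becomes the interval 99 to 100, and the three-base deletion at position 200 becomes 199 to 202.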

  20. conda and bioconda

  21. Learn command line • Why command line? • More efficient/powerful • HPC, cloud computing

  22. Terminal http://rik.smith-unna.com/command_line_bootcamp/ Use a Mac/Ubuntu, or Windows 10 has a built-in

  23. Learn some python

  24. Automation saves you time in the long run • Computers are good at repetitive work

  25. Side effect of automation • The best documentation is automation • Write scripts for everything unless it is not possible. (manual editing, document!) Credit to someone in the Twitter-verse :)

  26. DNA-seq Snakemake pipeline

  27. A real run

  28. Output files from the pipeline are organized by folders and uniformly named

  29. Learn some R • RStudio (IDE) • Bioconductor • Tidyverse and ggplot2

  30. Do not reinvent the wheel

  31. Tidying data • R for Data Science by Hadley Wickham & Garrett Grolemund http://r4ds.had.co.nz/

  32. Data visualization

  33. Wait, do you really need to learn C and C++? Titus Brown http://ivory.idyll.org/blog/2015-bioinformatics-middle-class.html

  34. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/

  35. Reproducibility crisis http://biobungalow.weebly.com/bio-bungalow-blog/everybody-knows-the-scientific-method

  36. Titus Brown

  37. Method matters Credit: Nicolas Robine

  38. How to ensure reproducibility • Git version control • Jupyter/R Notebook, documentation • Containers (Docker, Singularity, BioContainers: https://biocontainers.pro/)

  39. Version control • Git • GitHub • GitLab

  40. Notebooks

  41. Docker • Why Docker? • Imagine you are working on an analysis in R and you send your code to a friend. Your friend runs exactly this code on exactly the same data set but gets a slightly different result. This can have various reasons such as a different operating system, a different version of an R package, etc. Docker is trying to solve problems like that. https://cyverse-cybercarpentry-container-workshop-2018.readthedocs-hosted.com/en/latest/docke https://ropenscilabs.github.io/r-docker-tutorial/01-what-and-why.html

  42. Other important untaught skills • Naming files • Project organization • Data organization, backup plans

  43. Naming files • Three principles for (file) names: • 1. Machine readable (do not put special characters or spaces in the name) • 2. Human readable (easy to figure out what the heck something is, based on its name; add a slug) • 3. Plays well with default ordering: • * Put something numeric first • * Use the ISO 8601 standard for dates (YYYY-MM-DD) • * Left pad other numbers with zeros Jenny Bryan
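
The three naming principles above can be sketched in a few lines of Python. This is only one possible convention, not a prescribed one: the helper names `slugify` and `make_filename`, the choice of `_` between fields and `-` inside the slug, and the two-digit padding width are all assumptions made for this example.

```python
import re
from datetime import date


def slugify(text):
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_filename(day, step, description, ext, width=2):
    """Build 'YYYY-MM-DD_NN_slug.ext'.

    - ISO 8601 date first, so default sorting is chronological
    - step number left-padded with zeros, so 10 sorts after 2
    - slug without spaces or special characters, so it is machine readable
    """
    return f"{day.isoformat()}_{step:0{width}d}_{slugify(description)}.{ext}"


# Hypothetical analysis steps, created out of order on purpose.
names = [
    make_filename(date(2018, 9, 5), n, "Call Variants (tumor)", "txt")
    for n in (10, 2, 1)
]

for name in sorted(names):
    print(name)
```

Because the numeric parts are zero-padded, a plain alphabetical sort lists step 01, then 02, then 10, which is the "plays well with default ordering" property.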

  44. Jenny Bryan

  45. Jenny Bryan: https://rawgit.com/Reproducible-Science-Curriculum/rr-organization1/master/organization-01-slid

  46. TCGA barcode

  47. Organization of each project: downstream analysis

  48. Rstudio R project

  49. Staying current in bioinformatics • Bioinformatics evolves so fast! • E.g. sequencing technology: long-read (PacBio, Nanopore), single cell; all of these require new tools to analyze the associated data. • I started bioinformatics after reading: http://www.gettinggeneticsdone.com/2012/05/how-to-stay-current-in.html

  50. Social media network • Twitter • Follow papers/tools, jobs, outreach • Go to conferences/talk to other people • Blog posts

  51. https://divingintogeneticsandgenomics.rbind.io/

  52. Titus Brown

  53. Where to start • Never too late to start to learn! • ANGUS workshop • http://angus.readthedocs.io/en/2018/ • I would love to come back and give workshops! • Software/Data Carpentry 2-day workshops (Ethan White is heavily involved at UF) • Many other resources • https://github.com/crazyhottommy/getting-started-with-genomics-tools-and-resources

  54. Massive open online courses (MOOCs) and others • 1. Coursera • 2. edX • 3. Udacity • 4. DataCamp (not free, interactive lessons)

  55. Learn by doing

  56. What questions do you have?

  57. Acknowledgements • Thanks Titus Brown, Torsten Seemann and Sean Davis for letting me borrow their slides. • Thanks Stephen Turner for writing his blog posts. • Thanks Roel Verhaak lab for giving me the opportunity to learn computational biology. • Thanks Samir Amin for teaching me so much!

  58. Use Excel wisely

  59. Caution with Excel https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

  60. Converted to dates http://blogs.nature.com/naturejobs/2017/02/27/escape-gene-name-mangling-with-escape-excel/

  61. https://www.sciencedirect.com/science/article/pii/S0018506X18302599?via%3Dihub

  62. • https://www.youtube.com/watch?v=s3JldKoA0zw&feature=youtu.be
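
The gene-name problem on this slide can be checked for before a table ever reaches a spreadsheet. Below is a rough sketch that uses a regular expression to flag symbols that look like a month name followed by a number, such as SEPT2 or MARCH1. The pattern and the function names are assumptions for illustration; this is a simple heuristic, not a model of Excel's actual conversion rules, and it will miss some symbols that Excel still mangles.

```python
import re

# Month name or abbreviation, optional hyphen, then one or two digits.
MONTH_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)


def is_date_like(symbol):
    """Return True if the symbol resembles something a spreadsheet may read as a date."""
    return MONTH_LIKE.match(symbol.strip()) is not None


def flag_risky_symbols(symbols):
    """Keep only the symbols worth double-checking, in their original order."""
    return [s for s in symbols if is_date_like(s)]


# A small hand-picked list for demonstration.
genes = ["TP53", "SEPT2", "BRCA1", "MARCH1", "DEC1", "EGFR"]
print(flag_risky_symbols(genes))
```

Running a check like this on a gene column before and after a round trip through a spreadsheet is one cheap way to notice that names have been altered.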

  63. Regular expression
