
TRAPPIST: A toolkit for comparative analysis and visualization of genomic regions

Geraldine A. Van der Auwera, PhD
https://github.com/gglobster/trappist


  1. TRAPPIST: A toolkit for comparative analysis and visualization of genomic regions. Geraldine A. Van der Auwera, PhD. https://github.com/gglobster/trappist

  2. TRAPPIST: A toolkit for comparative analysis and visualization of genomic regions. Geraldine A. Van der Auwera, PhD; Shankar Ambady. https://github.com/gglobster/trappist [Stamped across the title in rotated text, partly cut off: "on full-featured applicatio"]

  3. The source code of Life: Evolution = 4 bn years of forking without version tracking … and you thought legacy Fortran code was a pain

  4. Two distinct issues
     - Getting the code from the repository (living beings): extraction, sequencing, assembly
     - Reverse-engineering the code (zero documentation!): experimentation, mutagenesis + (comparative) sequence analysis

  5. Top issue for “getting”: NGS is outscaling Moore’s Law (Lincoln Stein, via C. Titus Brown @ PyCon 2011)

  6. Top issue for “getting”: NGS is outscaling Moore’s Law (Lincoln Stein, via C. Titus Brown @ PyCon 2011) [Stamped across the slide in rotated text: “WHATEVER”]

  7. Evolving process of rev-eng
     - No genomes → entirely experimental: make random mutants, trace back effect to gene of interest
     - One genome → some predictive filtering: design mutants, long iterative process
     - Many related genomes → much better predictive filtering: Nature’s mutants, drastically reduced iterative process

  8. Nature’s mutants (example) [Figure: comparison across the regions labelled Anthrax PAI, rep2, repX, tra1, tra2 and tra3. Rows include pXO1, pBCXO1, p03BB102_179, pAH820_272, pAH187_270, pBc10987, pBc239, IS075, VD022, Schrouff, TIAC129 and a series of NZ_* accessions (NZ_ACMR0 through NZ_ACNA0)]

  9. Typical analysis process (e.g. BLAST): all done through separate GUIs → poor batching, no automation, no chaining

  10. Programmatic access: the servers can be accessed with scripts*, and there are awesome libraries that provide wrappers, data structures etc. But here’s the rub… * (there are a few GUI pipeline apps, but usability is an issue </diplomatic>)

  11. “What’s a command line?” Exhibit A: Experimental Biologist

  12. TRAPPIST: Totally Rad Analysis Pipelines Python Super Tool

  13. TRAPPIST: Totally Rad Analysis Pipelines Python Super Tool

  14. TRAPPIST: Totally Rad Analysis Pipelines Python Super Tool

  15. Example pipeline / workflow (my research) [Diagram; labels recovered from the figure, box pairings uncertain. Data Input: genome DBs, lists of genomes, reference plasmids + constructs OR lists of target genes. Core Analysis: CONTIG FISHER, HOST HK_SET EXTRACTOR, CONSERVATION SORTER, STRUCTURE COMPARATOR, with H_K sets and segment sets as intermediate outputs. Optional Analyses: PHYLOGENY CONSTRUCT, SEQUENCE VARIATION, CONTENT FUNCTIONS, PHYLOGENY CONGRUENCE, with host trees and plasmid trees as outputs]

  16. Do it manually? Exhibit B: Lazy Postdoc (me)

  17. Long story short: collection of one-time scripts → toolkit → full-featured application → DNA-based OS?

  18. Fundamental requirements
     - Design, assembly and modification of pipelines / workflows
     - Automated execution, parameter / output versioning, provenance data bundling, interactive visualization

  19. Staging / Execution

  20. Staging system
     - Basic requirements: intuitive, flexible, extensible
     - Plus pre-assembled workflows / pipelines
     - Plus hooks for external / roll-your-own functions

  21. Staging area - workflows

  22. Workflow components
     - TRAPPIST provides discrete task components for every step of analysis: initial inputs selection, data processing steps (existing algorithms), graphical output
     - Component I/O relies on matching ports with data object classes → forces validation of data type/format (not up to user)
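
The port-matching idea on this slide can be sketched as follows. This is a minimal illustration, not TRAPPIST's actual code: the classes `SequenceSet`, `AlignedSequenceSet`, `Tree` and `Port`, the `connect` function and the port names are all hypothetical. The assumption is that each port declares one data object class and that a connection is accepted only when the output class is the input class or a subclass of it.

```python
class SequenceSet:
    """Hypothetical data object class: a collection of sequence records."""

    def __init__(self, records):
        self.records = list(records)


class AlignedSequenceSet(SequenceSet):
    """Hypothetical subclass: sequences that have already been aligned."""


class Tree:
    """Hypothetical data object class: a tree in Newick notation."""

    def __init__(self, newick):
        self.newick = newick


class Port:
    """A named component port that accepts or emits one data object class."""

    def __init__(self, name, data_class):
        self.name = name
        self.data_class = data_class


def connect(output_port, input_port):
    """Return the connection if the classes match, else raise TypeError."""
    if not issubclass(output_port.data_class, input_port.data_class):
        raise TypeError(
            f"cannot connect {output_port.name} "
            f"({output_port.data_class.__name__}) to {input_port.name} "
            f"({input_port.data_class.__name__})"
        )
    return (output_port.name, input_port.name)


# An aligned set is still a sequence set, so this connection is accepted.
aligned_out = Port("aligner.out", AlignedSequenceSet)
sequences_in = Port("tree_builder.in", SequenceSet)
link = connect(aligned_out, sequences_in)

# A tree cannot feed a port that expects sequences, so this one is rejected.
tree_out = Port("tree_builder.out", Tree)
try:
    connect(tree_out, sequences_in)
    rejected = False
except TypeError:
    rejected = True
```

Checking at connection time rather than at run time means a mismatched workflow cannot be assembled in the first place, which is one way to keep validation "not up to user".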

  23. Component representation

  24. Interacting with components

  25. Connecting components

  26. Connecting components

  27. Connecting components

  28. Execution system
     - Progressive / modular / dependency-aware
     - Parameter set versioning linked to output versioning → users more likely to try various parameters to test assumptions
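
One way to link parameter set versions to output versions is to derive the version identifier from the parameters themselves. The sketch below is an assumption for illustration only: `parameter_version`, `OutputStore`, the component name and the parameter names are hypothetical, and hashing a canonical JSON form is a swapped-in technique rather than anything the slides describe.

```python
import hashlib
import json


def parameter_version(params):
    """Derive a short, order-independent version id from a parameter dict."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class OutputStore:
    """Keep every output under the version of the parameters that made it."""

    def __init__(self):
        self._outputs = {}

    def save(self, component, params, output):
        version = parameter_version(params)
        self._outputs[(component, version)] = output
        return version

    def load(self, component, params):
        return self._outputs.get((component, parameter_version(params)))


# Two runs of the same (hypothetical) component with different parameters
# are stored side by side instead of overwriting each other.
store = OutputStore()
v1 = store.save("comparator", {"min_identity": 0.9, "word_size": 11}, "result A")
v2 = store.save("comparator", {"min_identity": 0.8, "word_size": 11}, "result B")
```

Because earlier outputs are never overwritten, trying a different parameter set costs nothing, which fits the slide's point about users testing their assumptions.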

  29. Flow control rule
     - Component ports have “fill” status
     - If all its inputs are filled, a component is OK to execute → add component to execution queue (“I’m OK to go!”)
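
The flow control rule can be sketched in a few lines. The names `Component` and `fill_port` and the port names are hypothetical, and the sketch assumes a simple first-in, first-out execution queue; the slides do not say how TRAPPIST orders its queue.

```python
from collections import deque


class Component:
    """A workflow step whose input ports each carry a fill status."""

    def __init__(self, name, input_names):
        self.name = name
        self.inputs = {port: None for port in input_names}
        self.filled = set()  # names of the ports that have received data

    def ready(self):
        """True once every input port has been filled."""
        return self.filled == set(self.inputs)


def fill_port(component, port, value, queue):
    """Fill one input port; queue the component once all ports are filled."""
    if port not in component.inputs:
        raise KeyError(f"{component.name} has no input port named {port!r}")
    component.inputs[port] = value
    component.filled.add(port)
    if component.ready() and component not in queue:
        queue.append(component)


# A component with two inputs is queued only after the second one arrives.
queue = deque()
comparator = Component("comparator", ["segments", "reference"])
fill_port(comparator, "segments", ["seg1", "seg2"], queue)
waiting = len(queue)
fill_port(comparator, "reference", "ref", queue)
```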

  30. Database architecture: a Central DB, plus a Staging DB and an Execution DB for each workflow (Workflow 1, Workflow 2)

  31. Staging DB: DB schema + data dump sufficient to fully describe a workflow
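
To illustrate how a schema plus a data dump can fully describe a workflow, here is a minimal sketch using Python's built-in `sqlite3` module. The choice of SQLite, the two-table schema and the component names are assumptions made for the example; the slides do not specify the database engine or the real schema.

```python
import sqlite3

# Hypothetical staging schema: components and the connections between them.
SCHEMA = """
CREATE TABLE components (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE connections (
    source_id INTEGER NOT NULL REFERENCES components(id),
    target_id INTEGER NOT NULL REFERENCES components(id)
);
"""


def build_staging_db():
    """Create an in-memory staging DB holding a two-step workflow."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO components (id, name) VALUES (?, ?)",
        [(1, "contig_fisher"), (2, "segment_extractor")],
    )
    db.execute("INSERT INTO connections (source_id, target_id) VALUES (1, 2)")
    db.commit()
    return db


def dump_workflow(db):
    """Serialize schema and data as one SQL text."""
    return "\n".join(db.iterdump())


def restore_workflow(sql_text):
    """Rebuild an equivalent staging DB from the SQL text alone."""
    db = sqlite3.connect(":memory:")
    db.executescript(sql_text)
    return db


original = build_staging_db()
sql_text = dump_workflow(original)
copy = restore_workflow(sql_text)
```

The dump is plain text, so it can be versioned or shipped alongside results and used to recreate the workflow elsewhere.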

  32. Execution DB

  33. Enforcing good practices
     - Provenance bundle including: workflow schema, parameter sets, code version info
     - Executable papers! Reproducibility! Science!
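
A provenance bundle with the three parts listed on this slide could be as simple as one JSON document. This is a sketch under stated assumptions: the function name, the key names and the example values are hypothetical, and the code version is passed in as a plain string (for instance a commit hash obtained elsewhere).

```python
import json


def make_provenance_bundle(workflow_schema, parameter_sets, code_version):
    """Bundle the three provenance parts into one JSON text."""
    bundle = {
        "workflow_schema": workflow_schema,
        "parameter_sets": parameter_sets,
        "code_version": code_version,
    }
    # Sorted keys keep the text stable, so two bundles can be diffed.
    return json.dumps(bundle, indent=2, sort_keys=True)


# All values below are placeholders for illustration.
bundle_text = make_provenance_bundle(
    workflow_schema={
        "components": ["contig_fisher", "segment_extractor"],
        "connections": [["contig_fisher", "segment_extractor"]],
    },
    parameter_sets={"contig_fisher": {"min_identity": 0.9}},
    code_version="<commit hash or release tag>",
)
restored = json.loads(bundle_text)
```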

  34. Interactive visualization

  35. U haz questions?
