Bioinformatics pipeline for revealing tumour heterogeneity
Mustafa Anıl Tuncel
Department of Biosystems Science and Engineering | 10.07.19
Mustafa Anıl Tuncel
Software Engineer @ ETH Zürich
Research interests
§ Data analysis workflows
§ Bioinformatics
§ Machine learning
§ Recommender systems
anilbey /in/aniltuncel anilbey
Outline
§ Background
  § Biology prior
  § Single cell sequencing technologies
  § Mutations on DNA
§ DNA mutation trees
  § Tree model
  § MCMC moves
§ Pipeline
  § Snakemake
  § HDF5
What is a cell?
Figure 1. Representation of cell, tissue, organ, system and organism. Retrieved from https://www.colscol.com/body-system/
DNA from single cells
Figure 2. DNA structure. Retrieved from https://www.interleucina.org/
Structural mutations on DNA
§ Copy number variations
  § Deletion
  § Duplication
§ Mutations from DNA of single cells
  § Heterogeneous
  § Have ancestors, children, siblings
Trees to represent structural mutations
DNA: Region 1 | Region 2 | Region 3 | Region 4 | Region 5
[Figure: a tree rooted at "root" whose nodes are labelled with copy number events. Labels in extraction order: R1: +1; R2: -1, R5: +2; R4: -1; R3: +1; R1: +1; R3: -1. The topology is not recoverable from the extracted text.]
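One way to read such a tree is that every node stores the copy number changes it introduces, and the genotype of a node is the sum of all events on the path from the root down to it. The sketch below illustrates that reading in Python. The `Node` class, the node names, the topology, and the neutral copy number of 2 are all illustrative assumptions; only the event labels come from the slide.

```python
from typing import Dict, List, Optional

NEUTRAL_COPY_NUMBER = 2  # assumption: diploid baseline, not stated on the slide


class Node:
    """A tree node holding the copy number events it introduces (hypothetical class)."""

    def __init__(self, name: str, events: Optional[Dict[str, int]] = None) -> None:
        self.name = name
        self.events: Dict[str, int] = dict(events or {})  # region -> change
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def add_child(self, child: "Node") -> "Node":
        """Attach `child` below this node and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child


def genotype(node: Node, regions: List[str]) -> Dict[str, int]:
    """Sum the events on the path from the root to `node`, region by region."""
    totals = {region: 0 for region in regions}
    current: Optional[Node] = node
    while current is not None:
        for region, change in current.events.items():
            totals[region] += change
        current = current.parent
    return totals


def copy_numbers(node: Node, regions: List[str]) -> Dict[str, int]:
    """Absolute copy numbers, assuming every region starts at the neutral state."""
    return {r: NEUTRAL_COPY_NUMBER + delta for r, delta in genotype(node, regions).items()}


REGIONS = ["R1", "R2", "R3", "R4", "R5"]

# Assumed topology; the node labels are taken from the slide, the shape is not.
root = Node("root")
a = root.add_child(Node("a", {"R1": +1}))
b = a.add_child(Node("b", {"R2": -1, "R5": +2}))
c = a.add_child(Node("c", {"R4": -1}))
d = b.add_child(Node("d", {"R3": +1}))
```

With this assumed shape, `genotype(d, REGIONS)` accumulates the events of `a`, `b` and `d`, while region R4 stays unchanged because node `c` is not on the path to `d`.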
Learning the tree
• Dirichlet-multinomial model with overdispersion
• We aim to maximise the tree posterior with an MCMC scheme that uses the following moves:
  • Prune-reattach
  • Label swap
  • Add/remove events
  • Add/remove node
  • Condense/split node
  • Genotype preserving prune-reattach
[Figure: example tree with node labels root; R1: +1; R2: -1, R5: +2; R4: -1; R3: +1; R1: +1; R3: -1.]
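The two ingredients named above can be sketched generically. The first function is the standard Dirichlet-multinomial log-probability; the second is a plain Metropolis-Hastings acceptance test. How the tree determines the `alpha` parameters, and which priors and proposal probabilities the method uses, are not given on the slide, so both functions and their parameter names are assumptions for illustration only.

```python
import math
import random
from typing import Sequence


def dirichlet_multinomial_logpmf(counts: Sequence[int], alpha: Sequence[float]) -> float:
    """Log-probability of `counts` under a Dirichlet-multinomial with parameters `alpha`.

    A smaller sum(alpha) gives more overdispersion relative to a multinomial.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    alpha_sum = sum(alpha)
    # log( n! * Gamma(A) / Gamma(n + A) )
    result = math.lgamma(n + 1) + math.lgamma(alpha_sum) - math.lgamma(n + alpha_sum)
    # plus sum_k log( Gamma(x_k + a_k) / (x_k! * Gamma(a_k)) )
    for x, a in zip(counts, alpha):
        result += math.lgamma(x + a) - math.lgamma(x + 1) - math.lgamma(a)
    return result


def accept_move(
    log_post_current: float,
    log_post_proposed: float,
    log_hastings_ratio: float = 0.0,
    rng: random.Random = random.Random(),
) -> bool:
    """Metropolis-Hastings acceptance test carried out in log space."""
    log_ratio = log_post_proposed - log_post_current + log_hastings_ratio
    if log_ratio >= 0:
        return True  # proposals that are at least as good are always accepted
    return rng.random() < math.exp(log_ratio)
```

A sampler would call `accept_move` once per proposed tree, passing the log-posterior of the current and proposed trees and, for asymmetric moves, the log ratio of the reverse and forward proposal probabilities.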
Prune-reattach
[Figure: the example tree before and after a prune-reattach move. The node labels of the two panels are interleaved in the extracted text, so the two topologies are not recoverable.]
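In general terms, a prune-reattach move cuts a subtree from its parent and re-attaches it under another node that lies outside that subtree. The sketch below shows the mechanics on a parent-pointer dictionary. The representation, the function names, and the example tree are hypothetical, and the sketch does not cover the genotype preserving variant.

```python
from typing import Dict, Optional, Set

Parents = Dict[str, Optional[str]]  # node name -> parent name (None for the root)


def subtree(parents: Parents, node: str) -> Set[str]:
    """Return `node` together with all of its descendants."""
    members = {node}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in members and child not in members:
                members.add(child)
                changed = True
    return members


def prune_reattach(parents: Parents, node: str, new_parent: str) -> Parents:
    """Return a copy of the tree with `node` (and its subtree) moved under `new_parent`."""
    if parents[node] is None:
        raise ValueError("the root cannot be pruned")
    if new_parent not in parents:
        raise KeyError(new_parent)
    if new_parent in subtree(parents, node):
        raise ValueError("re-attaching inside the pruned subtree would create a cycle")
    proposal = dict(parents)  # leave the current tree untouched
    proposal[node] = new_parent
    return proposal


# Hypothetical tree: root -> A -> {B -> D, C}
tree = {"root": None, "A": "root", "B": "A", "C": "A", "D": "B"}
proposal = prune_reattach(tree, "B", "C")  # B, with its child D, now hangs below C
```

Returning a new dictionary rather than mutating the input keeps the current tree available, which is convenient when a proposed move is rejected.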
Add / remove node
[Figure: the example tree before and after an add/remove node move. The node labels of the two panels are interleaved in the extracted text, so the two topologies are not recoverable.]
Condense / split node
[Figure: the example tree before and after a condense/split node move. The node labels of the two panels are interleaved in the extracted text, so the two topologies are not recoverable.]
Tree learned from mouse data
What else is required?
§ Reproducibility in research
§ Scalability
§ Support for multiple programming languages
§ Multiprocessing
§ Cluster execution
§ Resource management
§ Statistics about resource usage
Workflow management system
Snakemake
• A Pythonic workflow management system
• Extends the Python syntax
• Follows the GNU Make paradigm
• Workflows are defined in terms of rules that define how to create output files from input files
• Dependencies between the rules are determined automatically
• Benefits from Python libraries
• Automated logging of the status
• Suspend/resume workflow
• A general-purpose workflow management system for any discipline
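The make paradigm described above can be illustrated in plain Python: each rule names an output pattern and its input patterns, and the planner works backwards from a requested target to find the jobs that must run. This is a toy sketch of the idea, not Snakemake's implementation or syntax. The `Rule` type, the `plan` function, and all file names are hypothetical, and unlike make the sketch does not compare file timestamps.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Set


class Rule(NamedTuple):
    name: str
    inputs: List[str]  # file patterns, may contain {wildcards}
    output: str        # file pattern, may contain {wildcards}


def match_output(pattern: str, target: str) -> Optional[Dict[str, str]]:
    """Return the wildcard values if `target` matches `pattern`, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even indices are literal text, odd indices are wildcard names.
    # Assumes each wildcard name occurs at most once in a pattern.
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def plan(target: str, rules: List[Rule], existing: Set[str],
         jobs: Optional[List[str]] = None) -> List[str]:
    """List the jobs needed to create `target`, dependencies first."""
    if jobs is None:
        jobs = []
    if target in existing:
        return jobs  # the file is already there, nothing to do
    for rule in rules:
        wildcards = match_output(rule.output, target)
        if wildcards is None:
            continue
        for pattern in rule.inputs:
            plan(pattern.format(**wildcards), rules, existing, jobs)
        job = f"{rule.name}: {target}"
        if job not in jobs:
            jobs.append(job)
        return jobs
    raise ValueError(f"no rule produces {target!r}")


# Hypothetical rules and files for a read mapping workflow.
RULES = [
    Rule("map_reads", ["data/genome.fa", "data/samples/{sample}.fastq"], "mapped/{sample}.bam"),
    Rule("sort_reads", ["mapped/{sample}.bam"], "sorted/{sample}.bam"),
]
EXISTING = {"data/genome.fa", "data/samples/A.fastq"}

jobs = plan("sorted/A.bam", RULES, EXISTING)
```

Requesting `sorted/A.bam` makes the planner discover that `mapped/A.bam` is missing and schedule `map_reads` before `sort_reads`, which is the sense in which dependencies follow from the input and output file names alone.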
Example: read mapping
Example: read mapping (generalised)
DAG of jobs
Snakefile
Config file
Cluster execution
• Configurable for the LSF/bsub scheduler
• Allows scaling without changing the workflow
HDF5
HDF = Hierarchical Data Format
§ Hierarchical data format v5
§ Binary files
  • Easy to manage multiple datasets
  • Keeps metadata with data
  • Fast I/O operations & storage space optimization (compressed binary files)
  • Platform/language independent
  • Self describing
  • No need to load whole data
[Figure: an HDF5 dataset split into Metadata and Data. Metadata: Dataspace (Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7); Datatype (Integer); Storage info (Chunked, Compressed); Attributes (Time = 32.4, Pressure = 987, Temp = 56).]
HDF5 wrappers in Python
h5py is a thin, pythonic wrapper around HDF5.
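A short h5py sketch of the properties listed for HDF5: a chunked, compressed integer dataset that carries its metadata as attributes and can be read slice by slice. The dimensions and attribute values mirror the HDF5 figure in these slides; the file name, the dataset path `measurements/counts`, and the array contents are made up for illustration. It assumes `h5py` and `numpy` are installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative data with the dimensions shown in the figure (4 x 5 x 7 integers).
counts = np.arange(4 * 5 * 7, dtype=np.int32).reshape(4, 5, 7)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.h5")  # hypothetical file name

    with h5py.File(path, "w") as h5_file:
        # Intermediate groups ("measurements") are created automatically.
        dataset = h5_file.create_dataset(
            "measurements/counts",
            data=counts,
            chunks=(1, 5, 7),    # chunked storage
            compression="gzip",  # compressed on disk
        )
        # Metadata is stored next to the data as attributes.
        dataset.attrs["Time"] = 32.4
        dataset.attrs["Pressure"] = 987
        dataset.attrs["Temp"] = 56

    with h5py.File(path, "r") as h5_file:
        dataset = h5_file["measurements/counts"]
        shape = dataset.shape
        dtype_name = str(dataset.dtype)
        attributes = dict(dataset.attrs)
        # Slicing reads only the requested part of the dataset from disk.
        first_slab = dataset[0, :, :]
```

Reading `dataset[0, :, :]` touches a single chunk here because the chunk shape was chosen to match one slab along the first dimension.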
Outline
§ Background
  § Biology prior
  § Single cell sequencing technologies
  § Mutations on DNA
§ DNA mutation trees
  § Tree model
  § MCMC moves
§ Pipeline
  § Snakemake
  § HDF5
Future work
• Publish the method
• Compare to clustering methods
• Evaluate on simulated data
• Show results on real data
• Wrap up the workflow as a Python package
• Do the C++ bindings
• Open source it
Thank you!
Mustafa Anıl Tuncel
Software Engineer, ETH Zurich
anilbey /in/aniltuncel anilbey
mtuncel@ethz.ch