computing optimal flow decompositions for assembly

  1. COMPUTING OPTIMAL FLOW DECOMPOSITIONS FOR ASSEMBLY. Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel. 2018/03/27. North Carolina State University, RWTH Aachen University.

  2. MOTIVATION

  3. The Problem: Shared segments between DNA/RNA strands create ambiguity in the assembly problem.

  4. The Problem: Connecting overlapping segments and counting their frequencies yields a splice graph.

  5. The Problem (figure only)

  6. The Problem: The task is to split the flow into s-t paths in order to recover the original DNA/RNA strands.

  7. The Problem (figure only)

  8. The Problem: k-Flow Decomposition (k-FD). Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k. Problem: Find an integral flow decomposition of (G, f) using at most k paths.
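To make the k-FD definition concrete, here is a minimal sketch of a checker that tests whether a list of weighted s-t paths is an integral flow decomposition of (G, f) with at most k paths. The function name `is_flow_decomposition`, the dictionary encoding of the flow, and the toy instance are assumptions made for illustration; they are not taken from the talk or from Toboggan.

```python
from collections import defaultdict


def is_flow_decomposition(flow, paths, s, t, k):
    """Return True if `paths` decomposes `flow` using at most k s-t paths.

    flow  : dict mapping each edge (u, v) to its positive integer flow
    paths : list of (weight, [s, ..., t]) pairs
    (Hypothetical encoding chosen for this sketch.)
    """
    if len(paths) > k:
        return False
    covered = defaultdict(int)
    for weight, nodes in paths:
        # Integral decomposition: every path carries a positive integer weight.
        if not isinstance(weight, int) or weight <= 0:
            return False
        if nodes[0] != s or nodes[-1] != t:
            return False
        for edge in zip(nodes, nodes[1:]):
            if edge not in flow:
                return False
            covered[edge] += weight
    # The path weights crossing each edge must add up to that edge's flow.
    return all(covered[edge] == value for edge, value in flow.items())


# Hypothetical toy instance: 7 units of flow from s to t.
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "t"): 3, ("a", "b"): 2, ("b", "t"): 4}
paths = [(3, ["s", "a", "t"]), (2, ["s", "a", "b", "t"]), (2, ["s", "b", "t"])]

print(is_flow_decomposition(flow, paths, "s", "t", k=3))
print(is_flow_decomposition(flow, paths, "s", "t", k=2))
```

The checker only verifies a candidate solution; finding one with the fewest paths is the hard part of the problem.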

  12. Related Work: k-Flow Decomposition (k-FD). Input: (G, f, k) with an s-t DAG G, a flow f on G, and a positive integer k. Problem: Find an integral flow decomposition of (G, f) using at most k paths. How do we choose k? → Minimization: "A novel min-cost flow method for estimating transcript expression with RNA-Seq", A.I. Tomescu et al.; "Efficient Heuristic for Decomposing a Flow with Minimum Number of Paths", M. Shao & C. Kingsford. The problem is NP-hard even for weights {1, 2, 4}: "How to split a flow?", T. Hartman et al.
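For contrast with exact minimization, the sketch below shows a generic greedy peeling heuristic: repeatedly follow the heaviest remaining out-edge from s to t and subtract the bottleneck value. This is not the method of either cited paper and carries no guarantee on the number of paths; the function name `greedy_decompose` and the toy instance are assumptions for illustration only.

```python
def greedy_decompose(flow, s, t):
    """Peel off s-t paths until no flow remains.

    Assumes `flow` is a valid positive integer flow on an s-t DAG, i.e.
    flow conservation holds at every vertex other than s and t.
    Returns a list of (weight, path) pairs; the count is NOT guaranteed minimal.
    """
    residual = dict(flow)
    paths = []
    while any(u == s for (u, _v) in residual):
        node, path = s, [s]
        while node != t:
            # Follow the out-edge with the most remaining flow.
            successors = [v for (u, v) in residual if u == node]
            node = max(successors, key=lambda v: residual[(path[-1], v)])
            path.append(node)
        edges = list(zip(path, path[1:]))
        weight = min(residual[e] for e in edges)  # bottleneck of this path
        for e in edges:
            residual[e] -= weight
            if residual[e] == 0:
                del residual[e]  # saturated edges disappear
        paths.append((weight, path))
    return paths


# Hypothetical toy instance: 7 units of flow from s to t.
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "t"): 3, ("a", "b"): 2, ("b", "t"): 4}
for weight, path in greedy_decompose(flow, "s", "t"):
    print(weight, "-".join(path))
```

Each round removes at least one edge, so the loop produces at most |E| paths. An exact method would instead certify that no decomposition with fewer paths exists.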

  13. Computer Scientists... "About ten years ago, some computer scientists came by and said they heard we have some really cool problems. They showed that the problems are NP-complete and went away!" - Joseph Felsenstein (Biologist)

  15. Linear FPT: $d^k \cdot n$ vs. $c^n$. Linear FPT: exponential only in the parameter and linear in n!
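A small numeric illustration of this contrast, assuming the slide compares a running time of the form d^k · n (exponential only in the parameter k) with one of the form c^n (exponential in the input size). The constants d = c = 2 and the fixed parameter k = 8 are arbitrary choices for this sketch, not values from the talk.

```python
def linear_fpt_cost(k, n, d=2):
    """Step count of the form d^k * n: exponential in k, linear in n."""
    return d**k * n


def exponential_cost(n, c=2):
    """Step count of the form c^n: exponential in the input size n."""
    return c**n


k = 8  # small fixed parameter (hypothetical choice)
for n in (10, 20, 40, 80):
    print(n, linear_fpt_cost(k, n), exponential_cost(n))
```

Doubling n doubles the first quantity but squares the second, which is why a small parameter makes the FPT bound practical on large inputs.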

  16. Observations. Data used by Shao and Kingsford:
      1. 99% of instances decompose into ≤ 8 paths. → Exploit the small natural parameter.
      2. ∼4 million mostly small instances. → Handle large throughput.
      3. Output decompositions. → Reliably recover the domain-specific solution.

  22. Toboggan. Theorem: Toboggan solves k-FD in time $2^{O(k^2)}(n + \lambda)$, where λ is the logarithm of the largest flow value.
      - Worst-case run-time is linear in n
      - Guarantees optimal solution
        - Gives opportunity to validate the model
      - Run-time competitive with the current state-of-the-art heuristic
      - Usable in practice

  23. IMPLEMENTATION & EXPERIMENTS

  24. Repository: https://github.com/theoryinpractice/toboggan

  27. Setup. Dataset: Available from Shao and Kingsford. Simulated sequencing data for human, mouse, and zebrafish, containing ground truth. Deviation from original setup: Trivial instances omitted, which removes around 64% of the 4M graphs. Hardware: Dedicated system with Intel i7-3770 (3.40 GHz, 8 MB cache) and 32 GB RAM.

  28. Execution Time. Median: Toboggan 1.24 ms, Catfish 3.47 ms.

  29. Ground Truth Validation

      | dataset   | instances | minimal | non-minimal |
      |-----------|-----------|---------|-------------|
      | zebrafish | 445,880   | 99.907% | 0.053%      |
      | mouse     | 473,185   | 99.401% | 0.074%      |
      | human     | 529,523   | 99.490% | 0.043%      |
      | all       | 1,448,588 | 99.589% | 0.056%      |

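  30. Exact Recovery

      | k   | instances | Catfish | Toboggan |
      |-----|-----------|---------|----------|
      | 2   | 63.2791%  | 0.992   | 0.995    |
      | 3   | 22.0775%  | 0.967   | 0.969    |
      | 4   | 8.5237%   | 0.931   | 0.930    |
      | 5   | 3.4920%   | 0.886   | 0.886    |
      | 6   | 1.5375%   | 0.830   | 0.828    |
      | 7   | 0.6698%   | 0.788   | 0.780    |
      | 8   | 0.2889%   | 0.767   | 0.766    |
      | 9   | 0.1241%   | 0.740   | 0.743    |
      | 10  | 0.0070%   | 0.752   | 0.802    |
      | 11  | 0.0004%   | 0.500   | 0.500    |
      | all | 100%      | 0.973   | 0.975    |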
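As a consistency check on the exact-recovery figures, the sketch below recomputes the "all" row as the instance-share-weighted mean of the per-k rates. Reading "all" as a weighted mean is an assumption of this sketch, as is reading each row as (k, share of instances, Catfish rate, Toboggan rate); the numbers themselves are copied from the slide.

```python
# (k, share of instances in %, Catfish rate, Toboggan rate), copied from the slide.
rows = [
    (2, 63.2791, 0.992, 0.995),
    (3, 22.0775, 0.967, 0.969),
    (4, 8.5237, 0.931, 0.930),
    (5, 3.4920, 0.886, 0.886),
    (6, 1.5375, 0.830, 0.828),
    (7, 0.6698, 0.788, 0.780),
    (8, 0.2889, 0.767, 0.766),
    (9, 0.1241, 0.740, 0.743),
    (10, 0.0070, 0.752, 0.802),
    (11, 0.0004, 0.500, 0.500),
]


def weighted_rate(rows, column):
    """Share-weighted mean of one rate column (2 = Catfish, 3 = Toboggan)."""
    total_share = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_share


catfish_all = weighted_rate(rows, 2)
toboggan_all = weighted_rate(rows, 3)
print(round(catfish_all, 3), round(toboggan_all, 3))
```

Under that reading, the recomputed values agree with the reported 0.973 and 0.975 to three decimals, and the instance shares sum to 100%.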

  31. Solutions vs. Ground Truth (figure only)

  32. ALGORITHM IDEA

  33. The Idea [Figure, built up over six slides: an s-t flow network with integer edge flows between 1 and 5. Each edge contributes one linear constraint on the weights w of the paths that use it, of the form w_i + w_j = 3 or w_i = 1; together these constraints form the linear system Aw = f.]
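The constraint system on this slide, read here as A w = f, can be sketched as follows: once a routing of paths is fixed, row e of A marks which paths use edge e, and the weights w follow from solving the system. This is a minimal sketch under that reading, using exact rational Gauss-Jordan elimination; the function name `solve_path_weights` and the toy instance are assumptions, and the code is not Toboggan's implementation.

```python
from fractions import Fraction


def solve_path_weights(flow, paths):
    """Solve A w = f for a fixed list of paths.

    A[e][j] = 1 exactly when path j uses edge e; f[e] is the flow on e.
    Returns the unique solution as a list of Fractions, or None if the
    system is inconsistent or does not determine every weight.
    """
    path_edges = [set(zip(p, p[1:])) for p in paths]
    if any(e not in flow for edges in path_edges for e in edges):
        return None  # a path uses an edge that carries no flow
    k = len(paths)
    # Augmented matrix [A | f], one row per edge.
    rows = [[Fraction(int(e in edges)) for edges in path_edges] + [Fraction(f)]
            for e, f in flow.items()]
    pivot_row = 0
    for col in range(k):
        pivot = next((r for r in range(pivot_row, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            return None  # this path's weight is not determined
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        lead = rows[pivot_row][col]
        rows[pivot_row] = [x / lead for x in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    # Leftover rows are zero on the left; a nonzero right side means no solution.
    if any(row[-1] != 0 for row in rows[pivot_row:]):
        return None
    return [rows[j][-1] for j in range(k)]


# Hypothetical toy instance: 7 units of flow from s to t.
flow = {("s", "a"): 5, ("s", "b"): 2, ("a", "t"): 3, ("a", "b"): 2, ("b", "t"): 4}
routing = [["s", "a", "t"], ["s", "a", "b", "t"], ["s", "b", "t"]]
print(solve_path_weights(flow, routing))
```

For k-FD the solution must also consist of positive integers, so a caller would still need to reject fractional or non-positive weights.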

  39. Dynamic Programming [Figure, built up over two slides: tables of states (g_j, L_j) attached to consecutive sets S_{i-1}, S_i, S_{i+1}, with arrows leading from each table to the next.]

  41. CONCLUSION

  46. Conclusion
      - Theoretical worst-case runtime linear in n.
      - Competitive runtime with heuristics.
      - Guarantees optimal k.
      - Python already fast, C++ even faster?
      paper: https://arxiv.org/abs/1706.07851
      github: https://github.com/theoryinpractice/toboggan

  47. Kyle Kloster, Philipp Kuinke, Michael P. O’Brien, Felix Reidl, Fernando Sánchez Villaamil, Blair D. Sullivan, Andrew van der Poel. Thank you! Supported in part by the Gordon & Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4560 to Blair D. Sullivan.
