Asymptotically optimal minimizers schemes

  1. Asymptotically optimal minimizers schemes. Guillaume Marçais, Dan DeBlasio, Carl Kingsford. Carnegie Mellon University

  2. Computing read overlaps. Roberts et al. (2004). Reducing storage requirements for biological sequence comparison.

  4. Computing read overlaps. [Figure labels: cluster by similarity; overlaps.] Roberts et al. (2004). Reducing storage requirements for biological sequence comparison.

  5. Computing minimizers

  14. Minimizers definition and properties
      Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
      1. No large gap: the distance between selected k-mers is ≤ w
      2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
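
The definition above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than the authors' implementation: the function name `minimizers`, the `order` parameter (a key function standing in for the order o, defaulting to lexicographic order), and the leftmost tie-breaking rule are all assumptions made for the example.

```python
def minimizers(seq, k, w, order=None):
    """Return the sorted positions of the k-mers selected by a (k, w, o) scheme.

    `order` is a hypothetical key function standing in for the order o;
    when omitted, k-mers are compared lexicographically.
    """
    if order is None:
        order = lambda kmer: kmer
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        # Smallest k-mer under the order; ties go to the leftmost position
        # (an assumption: the slides do not specify tie-breaking).
        best = min(range(start, start + w), key=lambda i: (order(kmers[i]), i))
        selected.add(best)
    return sorted(selected)


positions = minimizers("ACGTACGTAC", k=3, w=3)
print(positions)  # [0, 1, 4, 5]
```

In this small example the gaps between consecutive selected positions are 1, 3 and 1, so none exceeds w = 3, which matches the "no large gap" property.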

  15. Computing read overlaps: cluster by minimizer, then compute overlaps
      1. No large gap: no sequence ignored
      2. Deterministic: reads with overlap in same bin
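
As a rough illustration of binning reads by minimizer, the sketch below groups reads that share at least one minimizer and reports each such pair as a candidate overlap. The names `minimizer_kmers` and `candidate_overlaps` are hypothetical, lexicographic order is assumed for o, and a real overlapper would still verify each candidate pair by alignment.

```python
from collections import defaultdict
from itertools import combinations


def minimizer_kmers(seq, k, w):
    """Set of k-mers selected as minimizers (lexicographic order assumed)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[s:s + w]) for s in range(len(kmers) - w + 1)}


def candidate_overlaps(reads, k, w):
    """Pairs of read indices that land in at least one common minimizer bin."""
    bins = defaultdict(set)
    for read_id, read in enumerate(reads):
        for m in minimizer_kmers(read, k, w):
            bins[m].add(read_id)
    pairs = set()
    for ids in bins.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTACGTTGCA", "CGTTGCAGGATC", "TTTTTTTTTTTT"]
pairs = candidate_overlaps(reads, k=3, w=3)
print(pairs)  # {(0, 1)}
```

The first two reads share the 7-base string `CGTTGCA`, which spans more than w consecutive k-mers, so the deterministic property places them in a common bin; the third read shares no minimizer with either.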

  16. Many applications of minimizers
      • UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
      • MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
      • SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
      • SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
      • MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
      • Kraken (Wood, 2014): taxonomic sequence classifier
      • Schleimer et al. (2003): winnowing

  18. Improving minimizers by lowering density
      Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
      $d = \frac{\text{\# of selected } k\text{-mers}}{\text{length of sequence}}$
      Lower density ⇒ smaller bins when clustering by minimizer ⇒ less computation
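
The density of a given order can be estimated empirically on a random sequence. The sketch below is only an illustration under stated assumptions: `estimate_density` is a hypothetical helper, the "random order" is simulated by giving each distinct k-mer a random rank, and the denominator is the number of k-mers rather than the sequence length (the two differ by k − 1, which is negligible for long sequences).

```python
import random


def estimate_density(k, w, length=20000, alphabet="ACGT", seed=0):
    """Estimate the density of a random k-mer order on a random sequence."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(alphabet) for _ in range(length))
    rank = {}  # random rank assigned lazily to each distinct k-mer

    def order(kmer):
        if kmer not in rank:
            rank[kmer] = rng.random()
        return rank[kmer]

    kmers = [seq[i:i + k] for i in range(length - k + 1)]
    keys = [order(kmer) for kmer in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # Leftmost smallest k-mer in the window (tie-breaking is an assumption).
        selected.add(min(range(start, start + w), key=lambda i: (keys[i], i)))
    # Proportion of selected k-mers among all k-mers of the sequence.
    return len(selected) / len(kmers)


d = estimate_density(k=5, w=10)
print(f"density = {d:.3f}, density factor = {(10 + 1) * d:.2f}")
```

Swapping in a different `order` function and re-running the estimate is a simple way to compare candidate orders for fixed k and w.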

  19. Minimizers density minimizing problem
      For fixed k and w:
      • Properties “No large gap” and “Deterministic” are unaffected by the order
      • Density changes with the ordering o
      • Lower density ⇒ sparser data structures and/or less computation
      • Benefits existing and new applications
      Density minimization problem: for fixed w and k, find the k-mer order o giving the lowest expected density

  21. Density and density factor trivial bounds
      $\frac{1}{w} \le d \le 1$ (lower end: pick every w-th k-mer; upper end: pick every k-mer)
      Random order, usual expected density: $d = \frac{2}{w+1}$
      $1 + \frac{1}{w} \le df = (w+1) \cdot d \le w + 1$
      Random order, usual expected density factor: $df = 2$

  26. Schleimer’s bound does not apply in general
      $d \ge \frac{1.5 + \frac{1}{2w}}{w+1}$ (Schleimer et al.)
      Applies only if w ≫ k, or for random orders
      $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w+k} \xrightarrow[k \to \infty]{} \frac{1}{w}$
      Valid for any k, w and any order
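
Both bounds are easy to evaluate numerically, which makes the limit in k visible. The function names `schleimer_bound` and `general_bound` below are hypothetical; the formulas are transcribed directly from the slide.

```python
def schleimer_bound(w):
    """Bound of Schleimer et al.: (1.5 + 1/(2w)) / (w + 1)."""
    return (1.5 + 1 / (2 * w)) / (w + 1)


def general_bound(k, w):
    """Bound valid for any k, w and order:
    (1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k).
    """
    # Integer floor division implements the floor in the formula.
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


w = 10
for k in (5, 100, 10_000, 1_000_000):
    print(k, round(general_bound(k, w), 6))
print("1/w =", 1 / w, " Schleimer:", round(schleimer_bound(w), 6))
```

As k grows with w fixed at 10, the printed values approach 1/w = 0.1, in line with the limit stated on the slide.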

  27. Asymptotic behavior in k and w
      What is the best ordering possible when:
      • w is fixed and k → ∞
      • k is fixed and w → ∞

  29. A universal set defines an ordering
      Universal set: a set M of k-mers that intersects every path of w nodes in the de Bruijn graph of order k.
      • w = 2 ⇒ M is a vertex cover
      • From M, get an order with density $d \le \frac{|M|}{\sigma^k}$
      Universal set of size $\frac{\sigma^k}{w}$ ⇒ order with density $\frac{1}{w}$
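
For tiny parameters the universal-set property can be checked by brute force, using the fact that a path of w nodes in the de Bruijn graph of order k spells a string of length w + k − 1. The sketch below is an illustration only: `is_universal` and `order_from_set` are hypothetical names, the check is exponential in w + k, and ranking the k-mers of M ahead of all others is one simple way (assumed here) to turn M into an order.

```python
from itertools import product


def is_universal(M, k, w, alphabet="ACGT"):
    """True if every path of w nodes in the order-k de Bruijn graph hits M."""
    for chars in product(alphabet, repeat=w + k - 1):
        s = "".join(chars)
        # The w nodes of the path are the w consecutive k-mers of s.
        if not any(s[i:i + k] in M for i in range(w)):
            return False
    return True


def order_from_set(M):
    """Key function that ranks every k-mer of M before all other k-mers."""
    return lambda kmer: (kmer not in M, kmer)


# Binary alphabet, k = 3, w = 2: a universal set is a vertex cover.
M = {"000", "111", "001", "010", "101", "110"}
print(is_universal(M, k=3, w=2, alphabet="01"))            # True
print(is_universal(M - {"000"}, k=3, w=2, alphabet="01"))  # False

key = order_from_set(M)
print(min(["011", "100", "101"], key=key))  # 101
```

Dropping `000` breaks universality because the self-loop path 000 → 000 is no longer covered. With the derived order, every window contains a k-mer of M, so only k-mers of M are ever selected.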

  30. Creating a universal set, k = 3, w = 3, algorithm overview. Start with a de Bruijn graph. [Figure: binary de Bruijn graph of order 3, nodes 000 to 111.]

  31. Creating a universal set, k = 3, w = 3, algorithm overview. Embed into a w-dimensional space using ψ. [Figure: the same graph and its embedding.]

  32. Creating a universal set, k = 3, w = 3, algorithm overview. An edge corresponds (almost) to a rotation by 2π/w.

  33. Creating a universal set, k = 3, w = 3, algorithm overview. After w edges, return to the same sub-volume.

  34. Creating a universal set, k = 3, w = 3, algorithm overview. Pick k-mers in the highlighted “wedge”.

  35. Asymptotic behavior in w
      [Figure: density factor (w + 1)d versus window length w, for k = 3. Curves: lower bound (w + 1)/σ^k, optimal solutions, trivial bound 1 + 1/w.]
      $d \ge \frac{1}{\sigma^k}$, $df \ge \frac{w+1}{\sigma^k}$
      Density factor is Θ(w), not constant

  37. Summary
      Asymptotic behavior of minimizers is fully characterized:
      • Minimizers scheme is optimal for large k: $d \xrightarrow[k \to \infty]{} \frac{1}{w}$
      • Minimizers scheme is not optimal for large w: $df = \Theta(w)$
      • Tighter lower bound: $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w+k}$
      • Comparison between k-mers takes O(k)

  38. Future work
      • Local scheme: $f : \Sigma^{w+k-1} \to [1, w]$
      • Local schemes might be optimal for large w

  39. Carl Kingsford group: Dan DeBlasio, Heewook Lee, Mingfu Shao, Brad Solomon, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung
      Postdoc position open
      GBMF4554, R01HG007104, CCF-1256087, R01GM122935, CCF-1319998
