generalization of the minimizers schemes

Generalization of the minimizers schemes. Guillaume Marçais, Dan DeBlasio, Carl Kingsford. Carnegie Mellon University.


  1. Generalization of the minimizers schemes. Guillaume Marçais, Dan DeBlasio, Carl Kingsford. Carnegie Mellon University

  2. Computing read overlaps. Roberts et al. (2004). Reducing storage requirements for biological sequence comparison.

  4. Computing read overlaps. [Figure labels: Cluster by similarity; Overlaps] Roberts et al. (2004). Reducing storage requirements for biological sequence comparison.

  5. Computing minimizers

  14. Minimizers definition and properties. Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
      1. Uniform: the distance between selected k-mers is ≤ w
      2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
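The definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `minimizers`, the default lexicographic order, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def minimizers(seq: str, k: int, w: int,
               order: Callable[[str], object] = lambda kmer: kmer
               ) -> List[Tuple[int, str]]:
    """Return the (position, k-mer) pairs selected by a (k, w, order) scheme.

    Assumptions for this sketch: `order` maps a k-mer to a comparable key
    (lexicographic by default) and ties go to the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order(kmers[i]), i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


picked = minimizers("ACGTACGT", k=3, w=3)
print(picked)
```

On this toy input, consecutive selected positions are never more than w apart, which is the "Uniform" property; "Deterministic" follows because the choice in a window depends only on the k-mers inside it.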

  15. Computing read overlaps
      1. Uniform: no sequence ignored
      2. Deterministic: reads with overlap in same bin
      [Figure labels: Cluster by minimizer; Overlaps]

  16. Many applications of minimizers
      • UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
      • MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
      • SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
      • SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
      • MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
      • Kraken (Wood, 2014): taxonomic sequence classifier
      • Schleimer et al. (2003): winnowing

  18. Improving minimizers by lowering density. Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence, $d = \frac{\#\text{ of selected } k\text{-mers}}{\text{length of sequence}}$. Lower density ⟹ smaller bins ⟹ less computation. [Figure label: Cluster by minimizer]
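The density of a given order can be estimated empirically by running the scheme on a random sequence and counting selected positions. The sketch below assumes a hash-based order as a stand-in for a random order; `hash_order` and `estimate_density` are hypothetical names introduced here, and the sequence length, seed, and DNA alphabet are arbitrary choices.

```python
import hashlib
import random


def hash_order(kmer: str) -> bytes:
    # Assumption: a hash of the k-mer stands in for a "random" order.
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()


def estimate_density(k: int, w: int, length: int = 20000,
                     seed: int = 0, alphabet: str = "ACGT") -> float:
    """Estimate d = (# of selected k-mers) / (length of sequence)."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(alphabet) for _ in range(length))
    ranks = [hash_order(seq[i:i + k]) for i in range(length - k + 1)]
    selected = set()
    for start in range(len(ranks) - w + 1):
        # Smallest rank in the window, leftmost on ties.
        selected.add(min(range(start, start + w),
                         key=lambda i: (ranks[i], i)))
    return len(selected) / length


d = estimate_density(k=8, w=10)
print(f"density = {d:.3f}, density factor = {(10 + 1) * d:.2f}")
```

Swapping `hash_order` for another key function gives a quick way to compare orders for the same k and w.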

  19. Minimizers density minimizing problem. For fixed k and w:
      • Properties "Uniform" & "Deterministic" unaffected by order
      • Density changes with ordering o
      • Lower density ⟹ sparser data structures and/or less computation
      • Benefit existing and new applications
      Density minimization problem: for fixed w, k, find k-mer order o giving the lowest expected density

  22. Density and density factor trivial bounds
      • Density: $\frac{1}{w} \le d \le 1$, where d = # of minimizers per base
      • Density factor: $1 + \frac{1}{w} \le df = (w + 1) \cdot d \le w + 1$, where df ≈ # of minimizers per window
      The upper bounds correspond to picking every k-mer, the lower bounds to picking every w-th k-mer.

  24. Expected and bound on density
      • For an idealized random order o: $d = \frac{2}{w+1}$, $df = 2$. Expect ≈ 2 minimizers per window. Not valid for w ≫ k.
      • For any order o: $d \ge \frac{1.5 + \frac{1}{2w}}{w+1}$, $df \ge 1.5 + \frac{1}{2w}$. Requires ≥ 1.5 minimizers per window. Valid only for w ≫ k.
      Schleimer 2003, Roberts 2004

  25. Asymptotic behavior in k and w. What is the best ordering possible when:
      • w is fixed and k → ∞
      • k is fixed and w → ∞

  26. Asymptotic behavior in w. [Figure: density factor (w + 1)d against window length w for k = 3, showing the optimal solutions, the lower bound (w + 1)/σ^k, and the trivial bound 1 + 1/w.] $d \ge \frac{1}{\sigma^k}$ and $df \ge \frac{w+1}{\sigma^k}$: the density factor is Ω(w), not constant.
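To see why the density factor of a minimizers scheme cannot stay constant as w grows, the two lower bounds named on this slide can be evaluated side by side. The helper `minimizers_df_lower_bound` is a hypothetical name, and the alphabet size σ = 2 used in the loop is an assumption (the slide does not state it).

```python
def minimizers_df_lower_bound(w: int, k: int, sigma: int) -> float:
    """Larger of the two lower bounds on the density factor of a minimizers scheme."""
    trivial = 1 + 1 / w                    # trivial bound 1 + 1/w
    alphabet_bound = (w + 1) / sigma ** k  # bound (w + 1) / sigma^k
    return max(trivial, alphabet_bound)


# Assumed parameters: k = 3 as in the figure, binary alphabet (sigma = 2).
for w in (3, 7, 15, 31):
    print(w, round(minimizers_df_lower_bound(w, k=3, sigma=2), 3))
```

For small w the trivial bound dominates; once w + 1 exceeds σ^k the second bound takes over and grows linearly in w.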

  28. Asymptotic behavior in k. Asymptotically optimal minimizers schemes: there exists a sequence of orders $(o_k)_{k \in \mathbb{N}}$ which are asymptotically optimal: $d_{o_k} \xrightarrow[k \to \infty]{} \frac{1}{w}$ and $df_{o_k} \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$.

  29. Depathing the de Bruijn graph
      • Optimal vertex cover of the de Bruijn graph (Lichiardopol 2006): there exists a sequence of vertex covers $V_k$ of $DB_k$ which is asymptotically optimal in size: $|V_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{2}$
      • Optimal depathing of the de Bruijn graph: for a fixed w, there exists a sequence $(U_k)_{k \in \mathbb{N}}$ of sets of k-mers that covers every path of length w in $DB_k$ such that $|U_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{w}$

  31. Bound on density. For all k, w and order o: $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w + k}$
      • For large k: $df \ge 1 + \frac{1}{w}$
      • For large w: $df \ge 1.5 + \frac{1}{2w}$
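The bound on this slide is easy to evaluate numerically. The sketch below assumes the formula reads d ≥ (1.5 + 1/(2w) + max(0, ⌊(k − w)/w⌋)) / (w + k); the function names are hypothetical.

```python
def density_lower_bound(k: int, w: int) -> float:
    """Lower bound on d, as read from the slide (an assumption of this sketch)."""
    # Python's // is floor division, so negative values floor correctly
    # before being clamped to 0.
    extra = max(0, (k - w) // w)
    return (1.5 + 1 / (2 * w) + extra) / (w + k)


def density_factor_lower_bound(k: int, w: int) -> float:
    """Same bound expressed as a density factor, df = (w + 1) * d."""
    return (w + 1) * density_lower_bound(k, w)


w = 10
for k in (5, 50, 1000):
    print(k, round(density_factor_lower_bound(k, w), 4), "vs 1 + 1/w =", 1 + 1 / w)
```

With w fixed, the printed values approach 1 + 1/w as k grows, matching the large-k statement on the slide.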

  33. Density factor of minimizers. Asymptotic behavior of minimizers is fully characterized:
      • Minimizers scheme is optimal for large k: $df \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$
      • Minimizers scheme is not optimal for large w: df = Ω(w)
      • Better lower bound on d
      Good:
      • First example of optimal minimizers scheme
      • Constructive proof
      Not good:
      • Large k less interesting in practice
      • Minimizers don't have constant density factor

  36. Generalizing minimizers: local and forward schemes
      • Local scheme: given $f : \Sigma^{w+k-1} \to [0, w-1]$, for each window ω, select the k-mer at position f(ω).
      • A minimizers scheme with order o is a local scheme where $f(\omega) = \arg\min_{i \in [0, w-1]} o(\omega[i : k])$.
      • Forward scheme: a local scheme such that f(ω′) ≥ f(ω) − 1 whenever ω′ is the window following ω, that is, the prefix of ω′ equals the suffix of ω. The selected position never moves backward along the sequence.
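A local scheme is just a function from windows to offsets, so it can be sketched directly. The code below is illustrative only: `select_positions`, `minimizer_f`, and `is_forward` are hypothetical names, ties are broken to the left by assumption, and the forward check assumes ω′ is the window one position after ω and brute-forces all strings over a small alphabet.

```python
from itertools import product
from typing import Callable, List


def select_positions(seq: str, k: int, w: int,
                     f: Callable[[str], int]) -> List[int]:
    """Apply a local scheme f to every window of w + k - 1 characters."""
    width = w + k - 1
    picked = set()
    for start in range(len(seq) - width + 1):
        offset = f(seq[start:start + width])
        assert 0 <= offset < w, "f must return an offset in [0, w - 1]"
        picked.add(start + offset)
    return sorted(picked)


def minimizer_f(k: int, w: int, order=lambda kmer: kmer) -> Callable[[str], int]:
    """The minimizers scheme written as a local scheme (leftmost tie-break)."""
    def f(window: str) -> int:
        return min(range(w), key=lambda i: (order(window[i:i + k]), i))
    return f


def is_forward(f: Callable[[str], int], k: int, w: int,
               alphabet: str = "AC") -> bool:
    """Brute-force check of f(next window) >= f(window) - 1."""
    width = w + k - 1
    for chars in product(alphabet, repeat=width + 1):
        s = "".join(chars)
        if f(s[1:]) < f(s[:-1]) - 1:
            return False
    return True


print(select_positions("ACGTACGT", 3, 3, minimizer_f(3, 3)))
print(is_forward(minimizer_f(2, 3), k=2, w=3))
# A local scheme that can jump backward: offset 0 if the window starts
# with "A", otherwise the last offset.
print(is_forward(lambda win: 0 if win[0] == "A" else 2, k=2, w=3))
```

The brute-force check is exponential in w + k, so it is only practical for toy parameters, but it makes the difference between forward and general local schemes concrete.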

  37. Local & forward as better minimizers schemes. Minimizers ⊂ Forward ⊂ Local
      • Properties "Uniform" & "Deterministic" also satisfied
      • Drop-in replacement for minimizers
      • Potential for lower density

  41. Density factor overview. Density factor df by scheme:

      Scheme       k → ∞      w → ∞ (best)   w → ∞ (bound)
      Minimizers   1 + 1/w    O(w)           Ω(w)
      Forward      1 + 1/w    O(√w)          ∼ 1.5 + 1/(2w)
      Local        1 + 1/w    O(√w)          1 + 1/w

  42. Conclusion: the quest for constant density factor
      • Minimizers schemes can't achieve constant density factor
      • Local and forward schemes may achieve constant density factor
      • Design of optimal orders or functions f still open

  43. Carl Kingsford group: Dan DeBlasio, Heewook Lee, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung. Postdoc position open. Grants: GBMF4554, R01HG007104, CCF-1256087, R01GM122935, CCF-1319998
