Generalization of the minimizers schemes
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Carnegie Mellon University
Computing read overlaps
[Figure: reads are clustered by similarity, then overlaps are computed]
Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.
Computing minimizers
Minimizers definition and properties
Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
1. Uniform: distance between selected k-mers is ≤ w
2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
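The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizers`, the default lexicographic order, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
def minimizers(seq, k, w, order=None):
    """Return the sorted start positions of the k-mers selected by a (k, w, o) scheme.

    `order` maps a k-mer to a sortable key; it defaults to plain
    lexicographic order (an assumption made for this sketch).
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        # Smallest k-mer in the window; ties are broken by leftmost position.
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


positions = minimizers("ACGTACGTTGCA", k=3, w=4)
print(positions)  # [0, 4, 5, 9]
```

In this run the gaps between selected positions are 4, 1 and 4, all at most w = 4, which is the "Uniform" property on a small example.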
Computing read overlaps
1. Uniform: no sequence ignored
2. Deterministic: reads with overlap in same bin
[Figure: reads are clustered by minimizer, then overlaps are computed]
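The binning idea can be sketched as follows: each read is placed in one bin per minimizer it contains, and only reads sharing a bin are considered as overlap candidates. The helper names (`minimizer_kmers`, `bin_reads`, `candidate_pairs`), the lexicographic order, and the toy reads are hypothetical and serve only to illustrate the principle.

```python
from collections import defaultdict
from itertools import combinations


def minimizer_kmers(read, k, w):
    """Set of minimizer k-mers of a read (lexicographic order, leftmost ties)."""
    n_kmers = len(read) - k + 1
    kmers = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (read[i:i + k], i))
        kmers.add(read[best:best + k])
    return kmers


def bin_reads(reads, k, w):
    """Map each minimizer k-mer to the indices of the reads that contain it."""
    bins = defaultdict(set)
    for idx, read in enumerate(reads):
        for kmer in minimizer_kmers(read, k, w):
            bins[kmer].add(idx)
    return bins


def candidate_pairs(bins):
    """Pairs of reads sharing at least one bin; only these need a full comparison."""
    pairs = set()
    for members in bins.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Reads 0 and 1 overlap on "ACGTTGCA" (8 bases >= w + k - 1 = 6); read 2 is unrelated.
reads = ["ACGTACGTTGCA", "ACGTTGCAGGAT", "TTTTTTTTTTTT"]
print(candidate_pairs(bin_reads(reads, k=3, w=4)))  # {(0, 1)}
```

Because the two overlapping reads match on at least w consecutive k-mers, the "Deterministic" property guarantees that they share a minimizer and therefore a bin.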
Many applications of minimizers
• UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
• MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
• SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
• SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
• MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
• Kraken (Wood, 2014): taxonomic sequence classifier
• Schleimer et al. (2003): winnowing
Improving minimizers by lowering density
Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
$d = \frac{\text{\# of selected } k\text{-mers}}{\text{length of sequence}}$
Lower density ⟹ smaller bins ⟹ less computation
[Figure: reads clustered by minimizer]
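The density of a scheme can be estimated empirically by running it on a random sequence and counting the selected positions. The sketch below assumes a DNA alphabet, a pseudo-random k-mer order built lazily from a seeded generator, and a finite sequence of 20,000 bases; all function names and parameter values are illustrative choices, not taken from the talk.

```python
import random


def minimizer_positions(seq, k, w, order):
    """Positions selected by the (k, w, order) minimizers scheme."""
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return selected


def make_random_order(seed):
    """Pseudo-random order: each k-mer gets a random rank the first time it is seen."""
    rng = random.Random(seed)
    rank = {}

    def order(kmer):
        if kmer not in rank:
            rank[kmer] = rng.random()
        return rank[kmer]

    return order


def estimate_density(k, w, length=20000, seed=0):
    """Return (d, df): selected k-mers per base and the density factor (w + 1) * d."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    selected = minimizer_positions(seq, k, w, make_random_order(seed + 1))
    d = len(selected) / len(seq)
    return d, (w + 1) * d


d, df = estimate_density(k=8, w=10)
print(f"density = {d:.4f}, density factor = {df:.3f}")
```

Swapping `make_random_order` for another order function gives a quick way to compare orderings for fixed k and w, since only the order changes the measured density.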
Minimizers density minimizing problem
For fixed k and w:
• Properties "Uniform" & "Deterministic" are unaffected by the order
• Density changes with ordering o
• Lower density ⟹ sparser data structures and/or less computation
• Benefits existing and new applications
Density minimization problem: for fixed w, k, find the k-mer order o giving the lowest expected density
Density and density factor trivial bounds
Density: $\frac{1}{w} \le d \le 1$, where d = # of minimizers per base
Density factor: $1 + \frac{1}{w} \le df = (w + 1) \cdot d \le w + 1$, where df ≈ # of minimizers per window
• Upper bound: pick every k-mer
• Lower bound: pick every w-th k-mer
Expected and bound on density
For an idealized random order o:
$d = \frac{2}{w + 1}$, $df = 2$
Expect ≈ 2 minimizers per window
Not valid for w ≫ k
For any order o:
$d \ge \frac{1.5 + \frac{1}{2w}}{w + 1}$, $df \ge 1.5 + \frac{1}{2w}$
Requires ≥ 1.5 minimizers per window
Valid only for w ≫ k
Schleimer 2003, Roberts 2004
Asymptotic behavior in k and w
What is the best ordering possible when:
• w is fixed and k → ∞
• k is fixed and w → ∞
Asymptotic behavior in w
[Figure: density factor (w + 1)d versus window length w for k = 3, showing the optimal solutions, the lower bound (w + 1)/σ^k, and the trivial bound 1 + 1/w]
$d \ge \frac{1}{\sigma^k}$, $df \ge \frac{w + 1}{\sigma^k}$
Density factor is Ω(w), not constant
Asymptotic behavior in k
Asymptotically optimal minimizers schemes: there exists a sequence of orders $(o_k)_{k \in \mathbb{N}}$ which are asymptotically optimal:
$d_{o_k} \xrightarrow[k \to \infty]{} \frac{1}{w}$, $df_{o_k} \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$
Depathing the de Bruijn graph
Optimal vertex cover of the de Bruijn graph (Lichiardopol 2006): there exists a sequence of vertex covers $V_k$ of $DB_k$ which is asymptotically optimal in size:
$|V_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{2}$
Optimal depathing of the de Bruijn graph: for a fixed w, there exists a sequence $(U_k)_{k \in \mathbb{N}}$ of sets of k-mers that covers every path of length w in $DB_k$ such that
$|U_k| \xrightarrow[k \to \infty]{} \frac{\sigma^k}{w}$
Bound on density
For all k, w and order o:
$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k - w}{w} \right\rfloor\right)}{w + k}$
• $df \ge 1 + \frac{1}{w}$ for large k
• $df \ge 1.5 + \frac{1}{2w}$ for large w
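To get a feel for the bound, it can be evaluated numerically for a few values of k and w. The sketch below simply transcribes the formula as reconstructed from the slide; the function names are illustrative.

```python
def density_lower_bound(k, w):
    """Lower bound on d: (1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k)."""
    # Integer floor division matches the floor in the formula.
    return (1.5 + 1 / (2 * w) + max(0, (k - w) // w)) / (w + k)


def density_factor_lower_bound(k, w):
    """Corresponding bound on the density factor df = (w + 1) * d."""
    return (w + 1) * density_lower_bound(k, w)


for k, w in [(3, 10), (1000, 10), (3, 1000)]:
    print(k, w, round(density_factor_lower_bound(k, w), 4))
```

For k = 1000 and w = 10 the value is close to 1 + 1/w = 1.1, and for k = 3 and w = 1000 it is close to 1.5, in line with the two asymptotic regimes listed above.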
Density factor of minimizers
Asymptotic behavior of minimizers is fully characterized:
• Minimizers scheme is optimal for large k: $df \xrightarrow[k \to \infty]{} 1 + \frac{1}{w}$
• Minimizers scheme is not optimal for large w: df = Ω(w)
• Better lower bound on d
Good:
• First example of optimal minimizers scheme
• Constructive proof
Not good:
• Large k less interesting in practice
• Minimizers don't have constant density factor
Generalizing minimizers: local and forward schemes
Local scheme: given $f : \Sigma^{w + k - 1} \to [0, w - 1]$, for each window ω, select the k-mer at position f(ω).
Minimizers scheme with order o is a local scheme where $f(\omega) = \arg\min_{i \in [0, w - 1]} o(\omega[i : k])$
Forward scheme: local scheme such that $f(\omega') \ge f(\omega) - 1$ whenever ω′ immediately follows ω, i.e. the suffix of ω equals the prefix of ω′, so the selected position never moves backward in the sequence.
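A local scheme is fully described by its function f, so it can be prototyped by passing f as a callable. The sketch below assumes that ω′ denotes the window immediately following ω, and checks the forward condition by brute force over all strings of length w + k on a small alphabet. The names `apply_local_scheme`, `minimizer_f`, `is_forward` and the deliberately non-forward example `jumpy_f` are hypothetical.

```python
from itertools import product


def apply_local_scheme(seq, k, w, f):
    """Select position i + f(window) for each window of w + k - 1 bases starting at i."""
    win_len = w + k - 1
    return sorted({i + f(seq[i:i + win_len])
                   for i in range(len(seq) - win_len + 1)})


def minimizer_f(k, w, order=lambda kmer: kmer):
    """The function f of a minimizers scheme (leftmost tie-breaking assumed)."""
    def f(window):
        return min(range(w), key=lambda i: (order(window[i:i + k]), i))
    return f


def is_forward(f, k, w, alphabet="ACGT"):
    """Brute-force check of f(next window) >= f(window) - 1 over all consecutive windows."""
    win_len = w + k - 1
    for chars in product(alphabet, repeat=win_len + 1):
        s = "".join(chars)
        window, next_window = s[:-1], s[1:]
        if f(next_window) < f(window) - 1:
            return False
    return True


def jumpy_f(window, w=3):
    """A local scheme that is not forward: pick the last k-mer if the window starts with A."""
    return w - 1 if window[0] == "A" else 0


print(apply_local_scheme("ACGTACGTTGCA", 3, 4, minimizer_f(3, 4)))  # [0, 4, 5, 9]
print(is_forward(minimizer_f(2, 3), k=2, w=3))  # True
print(is_forward(jumpy_f, k=2, w=3))            # False
```

The brute-force check enumerates σ^(w+k) strings, so it is only practical for very small k and w, but it makes the distinction between local and forward schemes concrete.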
Local & forward as better minimizers schemes
Minimizers ⊂ Forward ⊂ Local
• Properties "Uniform" & "Deterministic" also satisfied
• Drop-in replacement for minimizers
• Potential for lower density
Density factor overview
Density factor df:
Scheme       k → ∞      w → ∞ (best)   w → ∞ (bound)
Minimizers   1 + 1/w    O(w)           Ω(w)
Forward      1 + 1/w    O(√w)          ∼ 1.5 + 1/(2w)
Local        1 + 1/w    O(√w)          1 + 1/w
Conclusion: the quest for constant density factor
• Minimizers schemes can't achieve constant density factor
• Local and forward schemes may achieve constant density factor
• Design of optimal orders or functions f still open
Carl Kingsford group: Dan DeBlasio, Heewook Lee, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung
Postdoc position open
GBMF4554, R01HG007104, CCF-1256087, R01GM122935, CCF-1319998