Asymptotically optimal minimizers schemes
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Carnegie Mellon University
Computing read overlaps
Cluster by similarity ⇒ Overlaps
Roberts, et al. (2004). Reducing storage requirements for biological sequence comparison.
Computing minimizers
[Figure not preserved]
Minimizers definition and properties
Minimizers (k, w, o): in each window of w consecutive k-mers, select the smallest k-mer according to order o.
1. No large gap: the distance between consecutive selected k-mers is ≤ w
2. Deterministic: two strings matching on w consecutive k-mers select the same minimizer
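The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizers`, the `key` parameter standing in for the order o, the default lexicographic order, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def minimizers(seq: str, k: int, w: int,
               key: Callable[[str], object] = lambda kmer: kmer
               ) -> List[Tuple[int, str]]:
    """Return the (position, k-mer) pairs selected by a (k, w, o) scheme.

    `key` plays the role of the order o (assumed lexicographic by default).
    Ties inside a window are broken by leftmost position (an assumption).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (key(kmers[i]), i))
        selected.add(best)
    return [(i, kmers[i]) for i in sorted(selected)]


picked = minimizers("ACGTACGT", k=3, w=3)
print(picked)  # [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

In this small example consecutive selected positions are 0, 1 and 4, so no gap exceeds w = 3, which matches the "no large gap" property.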
Computing read overlaps
Cluster by minimizer ⇒ Overlaps
1. No large gap: no sequence ignored
2. Deterministic: reads with overlap in same bin
Many applications of minimizers
• UMDOverlapper (Roberts, 2004): bin sequencing reads by shared minimizers to compute overlaps
• MSPKmerCounter (Li, 2015), KMC2 (Deorowicz, 2015), Gerbil (Erber, 2017): bin input sequences based on minimizer to count k-mers in parallel
• SparseAssembler (Ye, 2012), MSP (Li, 2013), DBGFM (Chikhi, 2014): reduce memory footprint of de Bruijn assembly graph with minimizers
• SamSAMi (Grabowski, 2015): sparse suffix array with minimizers
• MiniMap (Li, 2016), MashMap (Jain, 2017): sparse data structure for sequence alignment
• Kraken (Wood, 2014): taxonomic sequence classifier
• Schleimer et al. (2003): winnowing
Improving minimizers by lowering density
Density: the density of a scheme is the expected proportion of selected k-mers in a random sequence:
$d = \frac{\text{\# of selected } k\text{-mers}}{\text{length of sequence}}$
Lower density ⇒ smaller bins when clustering by minimizer ⇒ less computation
Minimizers density minimizing problem
For fixed k and w:
• Properties “No large gap” & “Deterministic” unaffected by order
• Density changes with ordering o
• Lower density ⇒ sparser data structures and/or less computation
• Benefit existing and new applications
Density minimization problem: for fixed w, k, find the k-mer order o giving the lowest expected density
Density and density factor trivial bounds
$\frac{1}{w} \le d \le 1$
• Lower bound 1/w: pick every w-th k-mer
• Upper bound 1: pick every k-mer
Random order usual expected density: $d = \frac{2}{w+1}$
Density factor: $1 + \frac{1}{w} \le df = (w+1) \cdot d \le w+1$
Random order usual expected density factor: $df = 2$
Schleimer’s bound does not apply in general
$d \ge \frac{1.5 + \frac{1}{2w}}{w+1}$ (Schleimer et al.)
Applies only if w ≫ k, or for random orders
$d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w+k} \xrightarrow[k \to \infty]{} \frac{1}{w}$
Valid for any k, w and any order
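To make the two lower bounds concrete, they can be evaluated with exact rational arithmetic. This is only a numerical sketch of the formulas on the slide; the function names `schleimer_bound` and `general_bound` are hypothetical and introduced for illustration.

```python
from fractions import Fraction


def schleimer_bound(w: int) -> Fraction:
    """(1.5 + 1/(2w)) / (w + 1): applies only if w >> k, or for random orders."""
    return (Fraction(3, 2) + Fraction(1, 2 * w)) / (w + 1)


def general_bound(k: int, w: int) -> Fraction:
    """(1.5 + 1/(2w) + max(0, floor((k - w) / w))) / (w + k): any k, w, order."""
    extra = max(0, (k - w) // w)  # integer floor division gives the floor term
    return (Fraction(3, 2) + Fraction(1, 2 * w) + extra) / (w + k)


w = 10
print(schleimer_bound(w))           # 31/220
print(general_bound(5, w))          # 31/300
for k in (10, 100, 1000, 10**6):
    # As k grows with w fixed, the general bound approaches 1/w = 0.1.
    print(k, float(general_bound(k, w)))
```

For fixed w = 10, the printed values of the general bound approach 1/w as k grows, which is the limit stated on the slide.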
Asymptotic behavior in k and w
What is the best ordering possible when:
• w is fixed and k → ∞
• k is fixed and w → ∞
A universal set defines an ordering
Universal set: a set M of k-mers that intersects every path of w nodes in the de Bruijn graph of order k.
• w = 2 ⇒ M is a vertex cover
• From M, get an order with density $d \le \frac{|M|}{\sigma^k}$
Universal set of size $\frac{\sigma^k}{w}$ ⇒ order with density $\frac{1}{w}$
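The link between a universal set and an order can be checked by brute force on a tiny example. The sketch below assumes a binary alphabet and uses a hand-picked set M for k = 3; the helper names `is_universal`, `order_from_set` and `selected_kmers` are hypothetical, and the enumeration over all strings is only practical for very small k and w. It does not implement the construction presented in the talk.

```python
from itertools import product


def is_universal(M, k, w, alphabet="01"):
    """True if every run of w consecutive k-mers contains a k-mer of M.

    A path of w nodes in the de Bruijn graph of order k spells a string
    of length w + k - 1, so all such strings are enumerated.
    """
    for chars in product(alphabet, repeat=w + k - 1):
        s = "".join(chars)
        if not any(s[i:i + k] in M for i in range(w)):
            return False
    return True


def order_from_set(M):
    """Order ranking k-mers of M first; lexicographic within each group (assumed)."""
    return lambda kmer: (kmer not in M, kmer)


def selected_kmers(seq, k, w, key):
    """Set of k-mers chosen as minimizers of `seq` under the order `key`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picks = {min(range(s, s + w), key=lambda i: (key(kmers[i]), i))
             for s in range(len(kmers) - w + 1)}
    return {kmers[i] for i in picks}


M = {"000", "111", "010", "011"}   # hand-picked example set, |M| = 4
print(is_universal(M, k=3, w=4))   # True
print(is_universal(M, k=3, w=3))   # False: 11001 spells 110, 100, 001
print(selected_kmers("0011001110", 3, 4, order_from_set(M)) <= M)  # True
```

Because M hits every window of w = 4 consecutive 3-mers and the order ranks M first, every selected minimizer belongs to M, which is why the density is at most |M|/σ^k.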
Creating a universal set, k = 3, w = 3, algorithm overview
1. Start with a de Bruijn graph
2. Embed into a w-dimensional space using ψ
3. An edge corresponds (almost) to a rotation by 2π/w
4. After w edges, return to the same sub-volume
5. Pick k-mers in the highlighted “wedge”
[Figure: de Bruijn graph on the 3-mers 000 to 111 and its embedding; image not preserved]
Asymptotic behavior in w
[Figure: density factor (w + 1)d versus window length w (0 to 30) for k = 3, showing the lower bound (w + 1)/σ^k, the optimal solutions, and the trivial bound 1 + 1/w]
$d \ge \frac{1}{\sigma^k}$, $df \ge \frac{w+1}{\sigma^k}$
Density factor is Θ(w), not constant
Summary
Asymptotic behavior of minimizers is fully characterized:
• Minimizers scheme is optimal for large k: $d \xrightarrow[k \to \infty]{} \frac{1}{w}$
• Minimizers scheme is not optimal for large w: df = Θ(w)
• Tighter lower bound: $d \ge \frac{1.5 + \frac{1}{2w} + \max\left(0, \left\lfloor \frac{k-w}{w} \right\rfloor\right)}{w+k}$
• Comparison between k-mers takes O(k)
Future work
• Local scheme: $f : \Sigma^{w+k-1} \to [1, w]$
• Local schemes might be optimal for large w
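The signature of a local scheme can be illustrated by writing a minimizers scheme in that form: the function sees a window of w + k − 1 characters and returns an offset in [1, w]. This is a sketch for illustration only; the names `LocalScheme`, `minimizer_local_scheme` and `apply_scheme`, the lexicographic default order, and the leftmost tie-breaking are assumptions, and nothing here represents a scheme proposed by the authors.

```python
from typing import Callable, List

# A local scheme maps a window of w + k - 1 characters to an offset in [1, w].
LocalScheme = Callable[[str], int]


def minimizer_local_scheme(k: int, w: int, key=lambda kmer: kmer) -> LocalScheme:
    """Wrap a minimizers scheme (order `key`) as a local scheme f."""
    def f(window: str) -> int:
        assert len(window) == w + k - 1
        best = min(range(w), key=lambda i: (key(window[i:i + k]), i))
        return best + 1  # 1-based offset, matching the codomain [1, w]
    return f


def apply_scheme(seq: str, f: LocalScheme, k: int, w: int) -> List[int]:
    """Positions (0-based) of the k-mers that f selects along `seq`."""
    span = w + k - 1
    picked = {start + f(seq[start:start + span]) - 1
              for start in range(len(seq) - span + 1)}
    return sorted(picked)


f = minimizer_local_scheme(k=3, w=3)
print(apply_scheme("ACGTACGT", f, k=3, w=3))  # [0, 1, 4]
```

Any other function with the same signature could be passed to `apply_scheme`; a local scheme is free to use the whole window rather than a fixed k-mer order.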
Carl Kingsford group: Dan DeBlasio, Heewook Lee, Mingfu Shao, Brad Solomon, Natalie Sauerwald, Cong Ma, Hongyu Zheng, Laura Tung
Postdoc position open
Grants: GBMF4554, R01HG007104, CCF-1256087, R01GM122935, CCF-1319998