Locality sensitive hashing for the edit distance
Guillaume Marçais, Dan DeBlasio, Prashant Pandey, Carl Kingsford
Carnegie Mellon University
Overlap computation
• Compute overlaps between reads (HMAP)
• Instance of the “Nearest Neighbor Problem” for edit distance
• Use multiple hash tables
• Need meaningful hash collisions
Locality Sensitive Hashing
Locality sensitive hash family: a family H of hash functions where similar elements are more likely to have the same value than distant elements. Pick h at random from H.
The family $H$ is sensitive for distance $D$ if there exist $d_1 < d_2$ and $p_1 > p_2$ such that for all $x, y \in U$:
$$D(x, y) \le d_1 \implies \Pr_{h \in H}[h(x) = h(y)] \ge p_1$$
$$D(x, y) \ge d_2 \implies \Pr_{h \in H}[h(x) = h(y)] \le p_2$$
• Low distance ⟺ High collisions
• High distance ⟺ Low collisions
LSH for the edit distance
How to design an LSH for edit distance?
• LSH for Jaccard distance (minHash) used as proxy
• Jaccard distance is significantly different from edit distance
Jaccard distance
Jaccard distance between sets A, B:
$$J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$
Jaccard between sequences x, y: the Jaccard distance of their k-mer sets,
$$J(x, y) = J(K(x), K(y))$$
• Low D(x, y) ⟹ Low J(x, y)
• High D(x, y) ⟹ High J(x, y)
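The definition of J(x, y) through k-mer sets can be sketched in a few lines of Python. The function names `kmers`, `jaccard_distance` and `jaccard_seq` are illustrative choices, not names from the slides, and the convention that two empty sets are at distance 0 is an assumption.

```python
def kmers(seq, k):
    """Return K(seq): the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """J(A, B) = 1 - |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def jaccard_seq(x, y, k):
    """J(x, y) = J(K(x), K(y)) for two sequences."""
    return jaccard_distance(kmers(x, k), kmers(y, k))


if __name__ == "__main__":
    print(jaccard_seq("ACGTACGT", "ACGTTCGT", 3))
```

Because `kmers` builds a set, a k-mer that occurs several times in a sequence is counted only once.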
Jaccard ignores k-mer repetition
x = AAAAAAAAAAAAAAACCCCC (n − k A's followed by k C's)
y = AAAAACCCCCCCCCCCCCCC (k A's followed by n − k C's)
Both sequences have the k-mer set {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}.
• Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
• Jaccard distance: $J(x, y) = 0$
Identical k-mer content and high edit distance
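The example pair can be checked directly. The sketch below assumes that D denotes the edit distance divided by the sequence length n (consistent with the values in [0, 1] used on the slides) and uses a textbook dynamic-programming Levenshtein distance; the helper names are illustrative.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edit_distance(x, y):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cx != cy))) # match or substitution
        prev = cur
    return prev[-1]


n, k = 20, 5
x = "A" * (n - k) + "C" * k   # AAAAAAAAAAAAAAACCCCC
y = "A" * k + "C" * (n - k)   # AAAAACCCCCCCCCCCCCCC

same_kmer_set = kmers(x, k) == kmers(y, k)
normalized_d = edit_distance(x, y) / n   # assumed normalization by n

if __name__ == "__main__":
    print(same_kmer_set, normalized_d, 1 - 2 * k / n)
```

For n = 20 and k = 5 the two k-mer sets coincide while ten substitutions are needed to turn x into y, so the normalized distance meets the bound 1 − 2k/n.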
Weighted Jaccard handles repetitions
x = AAAAAAAAAAAAAAACCCCC → {(AAAAA, 1), (AAAAA, 2), ..., (AAAAA, 11), (AAAAC, 1), (AAACC, 1), (AACCC, 1), (ACCCC, 1), (CCCCC, 1)}
y = AAAAACCCCCCCCCCCCCCC → {(AAAAA, 1), (AAAAC, 1), (AAACC, 1), (AACCC, 1), (ACCCC, 1), (CCCCC, 1), (CCCCC, 2), ..., (CCCCC, 11)}
• Weighted Jaccard: $J_w(x, y) = 1 - \frac{k+2}{n}$
• Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
Weighted Jaccard = Jaccard for multi-sets
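A minimal sketch of "Jaccard for multi-sets", assuming the multiset intersection takes the minimum count of each k-mer and the union takes the maximum. This is equivalent to treating each k-mer as a set of (k-mer, occurrence number) pairs, as in the listing above. `collections.Counter` provides both operations; the function names are illustrative.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of k-mers: maps each k-mer to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def weighted_jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on multisets (min / max of the counts)."""
    inter = sum((a & b).values())   # Counter & keeps the minimum count
    union = sum((a | b).values())   # Counter | keeps the maximum count
    if union == 0:
        # Assumed convention: two empty multisets are identical.
        return 0.0
    return 1.0 - inter / union


x = "A" * 15 + "C" * 5
y = "A" * 5 + "C" * 15
jw = weighted_jaccard_distance(kmer_counts(x, 5), kmer_counts(y, 5))

if __name__ == "__main__":
    print(jw)
```

Unlike the plain Jaccard distance, this value is strictly positive for the pair x, y because the eleven copies of AAAAA in x are matched by only one copy in y.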
Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC
y = AAAACACAACCCCACCAAA
Both sequences have the 4-mer set {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC}.
x, y: de Bruijn sequences, contain all 16 possible 4-mers once
• $J(x, y) = J_w(x, y) = 0$
• $D(x, y) = 0.63$
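The claim about k-mer content can be verified mechanically. The short check below is a sketch with illustrative names: it counts the 4-mers of both sequences and confirms that each of the 16 possible 4-mers over {A, C} appears exactly once in x and in y, so neither Jaccard nor weighted Jaccard can tell the two sequences apart.

```python
from collections import Counter
from itertools import product

x = "CCCCACCAACACAAAACCC"
y = "AAAACACAACCCCACCAAA"
k = 4


def kmer_counts(seq, k):
    """Multiset of k-mers of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# All 2^4 = 16 possible 4-mers over the alphabet {A, C}.
all_kmers = {"".join(p) for p in product("AC", repeat=k)}

cx, cy = kmer_counts(x, k), kmer_counts(y, k)
x_is_de_bruijn = set(cx) == all_kmers and set(cx.values()) == {1}
y_is_de_bruijn = set(cy) == all_kmers and set(cy.values()) == {1}

if __name__ == "__main__":
    print(x_is_de_bruijn, y_is_de_bruijn, cx == cy)
```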
Jaccard is different from edit distance
Unlike edit distance, Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
OMH: Order Min Hash
• minHash is an LSH for Jaccard
• OMH is a refinement of minHash
• OMH is sensitive to
  • repeated k-mers
  • relative order of k-mers
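To illustrate why minHash is an LSH for Jaccard, the sketch below estimates the collision probability empirically. It assumes an idealized minHash in which each hash function is a uniformly random ordering of the k-mers, simulated here by shuffling; under that assumption two sets get the same minimum with probability |A ∩ B| / |A ∪ B|. The function names, the seed and the number of trials are illustrative.

```python
import random


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_collision_rate(a, b, trials=2000, seed=0):
    """Fraction of random orderings under which a and b share their minimum."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(universe)   # one random "hash function"
        rank = {kmer: r for r, kmer in enumerate(universe)}
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    a = kmers("ACGTACGT", 3)
    b = kmers("ACGTTCGT", 3)
    print(minhash_collision_rate(a, b), len(a & b) / len(a | b))
```

The estimated rate should approach the Jaccard similarity as the number of trials grows, which is the "low distance, high collisions" behaviour required of an LSH family.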
minHash & OMH sketches
x = AGTTGAGCGGAAGGTG, k = 2, m = 6, ℓ = 2
[Figure: m = 6 orderings of the k-mers of x, first as plain k-mers (minHash), then as (k-mer, position) pairs (OMH): (AG, 0), (GT, 1), (TT, 2), (TG, 3), (GA, 4), (AG, 5), (GC, 6), (CG, 7), (GG, 8), (GA, 9), (AA, 10), (AG, 11), (GG, 12), (GT, 13), (TG, 14).]
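A minimal sketch of how an OMH-style sketch with parameters k, m and ℓ might be computed. Several details are assumptions rather than the authors' implementation: repeated k-mers are made distinct by pairing each with its occurrence number, each of the m hash functions is simulated with a salted BLAKE2b digest, and the ℓ selected k-mers are reported in the order in which they appear in the sequence. The name `omh_sketch` and the `seed` parameter are hypothetical.

```python
import hashlib


def omh_sketch(seq, k, m, ell, seed=0):
    """Return m tuples, each holding ell k-mers listed in sequence order."""
    # Pair every k-mer with its occurrence number so repeats stay distinct,
    # and remember its position to restore the order later.
    seen = {}
    items = []  # (kmer, occurrence number, position)
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occ = seen.get(kmer, 0)
        seen[kmer] = occ + 1
        items.append((kmer, occ, pos))

    sketch = []
    for i in range(m):
        def h(item, i=i):
            # Assumed stand-in for the i-th random hash function.
            kmer, occ, _ = item
            data = f"{seed}:{i}:{kmer}:{occ}".encode()
            return hashlib.blake2b(data, digest_size=8).digest()

        smallest = sorted(items, key=h)[:ell]    # ell smallest hash values
        smallest.sort(key=lambda it: it[2])      # back to sequence order
        sketch.append(tuple(kmer for kmer, _, _ in smallest))
    return sketch


if __name__ == "__main__":
    for entry in omh_sketch("AGTTGAGCGGAAGGTG", k=2, m=6, ell=2):
        print(entry)
```

Under these assumptions, two sequences can be compared by counting how many of the m entries of their sketches are equal. An entry matches only if the same ℓ k-mers are selected and they occur in the same relative order, which is what makes the sketch sensitive to both repetition and order.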