locality sensitive hashing for the edit distance

Guillaume Marçais, Dan DeBlasio, Prashant Pandey, Carl Kingsford (Carnegie Mellon University)


  1. Locality sensitive hashing for the edit distance. Guillaume Marçais, Dan DeBlasio, Prashant Pandey, Carl Kingsford, Carnegie Mellon University

  2. Overlap computation
     • Compute overlaps between reads (HMAP)
     • Instance of the “Nearest Neighbor Problem” for edit distance
     • Use multiple hash tables
     • Need meaningful hash collisions
     [Figure labels: Reads, Overlap?, Hash Tables]

  6. Locality Sensitive Hashing
     Pick h at random from H.
     Locality sensitive hash family: a family H of hash functions where similar elements are more likely to have the same value than distant elements.

  8. Locality Sensitive Hashing
     The family H is sensitive for distance D if there exist $d_1 < d_2$ and $p_1 > p_2$ such that for all $x, y \in U$:
     $D(x, y) \le d_1 \implies \Pr_{h \in H}[h(x) = h(y)] \ge p_1$
     $D(x, y) \ge d_2 \implies \Pr_{h \in H}[h(x) = h(y)] \le p_2$
     • Low distance $\iff$ high collisions
     • High distance $\iff$ low collisions

  9. LSH for the edit distance
     How to design an LSH for edit distance?
     • LSH for Jaccard distance (minHash) used as proxy
     • Jaccard distance is significantly different from edit distance
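Why minHash counts as an LSH for Jaccard can be checked empirically: the fraction of hash functions on which two k-mer sets share the same minimum approaches their Jaccard similarity, $1 - J$. The sketch below is only an illustration. `seeded_hash` is a hypothetical stand-in for a random hash family (built from SHA-256), and the second sequence, the choice k = 3, and the number of trials are assumptions made for the demo.

```python
import hashlib

def kmer_set(seq, k):
    """Set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_hash(item, seed):
    """Hypothetical stand-in for the seed-th random hash function."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(kmers, seed):
    """The k-mer with the smallest hash value under the seed-th function."""
    return min(kmers, key=lambda kmer: seeded_hash(kmer, seed))

def collision_rate(x, y, k, trials=400):
    """Fraction of hash functions on which x and y share the same minHash."""
    a, b = kmer_set(x, k), kmer_set(y, k)
    hits = sum(minhash(a, s) == minhash(b, s) for s in range(trials))
    return hits / trials

x = "AGTTGAGCGGAAGGTG"
y = "AGTTGAGCGCAAGGTG"  # x with one substitution (assumed example)
a, b = kmer_set(x, 3), kmer_set(y, 3)
similarity = len(a & b) / len(a | b)  # Jaccard similarity, 1 - J(x, y)
print(similarity, collision_rate(x, y, 3))
```

The two printed values should be close; the gap shrinks as `trials` grows.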

  12. Jaccard distance
      Jaccard distance between sets A, B: $J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
      Jaccard between sequences x, y: Jaccard distance of their k-mer sets, $J(x, y) = J(K(x), K(y))$
      • Low $D(x, y) \implies$ low $J(x, y)$
      • High $D(x, y) \implies$ high $J(x, y)$
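A minimal sketch of the two definitions on this slide, with `kmer_set` playing the role of $K(\cdot)$. The function names and the example strings are illustrative assumptions, not taken from the talk.

```python
def kmer_set(seq, k):
    """K(seq): the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """J(A, B) = 1 - |A ∩ B| / |A ∪ B|; two empty sets are at distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def jaccard_sequences(x, y, k):
    """J(x, y) = J(K(x), K(y))."""
    return jaccard_distance(kmer_set(x, k), kmer_set(y, k))

print(jaccard_sequences("ACGTACGT", "ACGTTCGT", 3))
```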

  14. Jaccard ignores k-mer repetition
      $x = \underbrace{\texttt{AAAAAAAAAAAAAAA}}_{n-k}\,\underbrace{\texttt{CCCCC}}_{k}$ → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
      $y = \underbrace{\texttt{AAAAA}}_{k}\,\underbrace{\texttt{CCCCCCCCCCCCCCC}}_{n-k}$ → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
      Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
      Jaccard distance: $J(x, y) = 0$
      Identical k-mer content and high edit distance
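The example can be checked directly for n = 20 and k = 5. The sketch below assumes D is the Levenshtein distance divided by the sequence length n, which is consistent with the bound $1 - 2k/n$ on the slide; the helper names are illustrative.

```python
def kmer_set(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def edit_distance(x, y):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # delete cx
                            curr[j - 1] + 1,            # insert cy
                            prev[j - 1] + (cx != cy)))  # match or substitute
        prev = curr
    return prev[-1]

n, k = 20, 5
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)

same_kmers = kmer_set(x, k) == kmer_set(y, k)
normalized = edit_distance(x, y) / n
print(same_kmers, normalized)  # prints: True 0.5
```

Here the k-mer sets coincide, so $J(x, y) = 0$, while half of the positions have to be edited.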

  17. Weighted Jaccard handles repetitions
      $x = \underbrace{\texttt{AAAAAAAAAAAAAAA}}_{n-k}\,\underbrace{\texttt{CCCCC}}_{k}$ → {(AAAAA, 1), (AAAAA, 2), ..., (AAAAA, 11), (AAAAC, 1), (AAACC, 1), (AACCC, 1), (ACCCC, 1), (CCCCC, 1)}
      $y = \underbrace{\texttt{AAAAA}}_{k}\,\underbrace{\texttt{CCCCCCCCCCCCCCC}}_{n-k}$ → {(AAAAA, 1), (AAAAC, 1), (AAACC, 1), (AACCC, 1), (ACCCC, 1), (CCCCC, 1), (CCCCC, 2), ..., (CCCCC, 11)}
      Weighted Jaccard: $J_w(x, y) = 1 - \frac{k+2}{n}$
      Edit distance: $D(x, y) \ge 1 - \frac{2k}{n}$
      Weighted Jaccard = Jaccard for multi-sets
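"Jaccard for multi-sets" can be sketched with `collections.Counter`, whose `&` and `|` operators take the minimum and maximum of the counts. The definition used here (sum of minimum counts over sum of maximum counts) is the common multiset form and is an assumption about what the slide intends; the value it prints comes from that definition and is not checked against the closed form quoted on the slide.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of k-mers: each k-mer mapped to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def weighted_jaccard_distance(a, b):
    """1 - (sum of min counts) / (sum of max counts) over all k-mers."""
    intersection = sum((a & b).values())  # multiset intersection
    union = sum((a | b).values())         # multiset union
    return 1.0 - intersection / union if union else 0.0

n, k = 20, 5
x = "A" * (n - k) + "C" * k
y = "A" * k + "C" * (n - k)
cx, cy = kmer_counts(x, k), kmer_counts(y, k)

print(cx["AAAAA"], cy["AAAAA"])  # prints: 11 1
print(weighted_jaccard_distance(cx, cy))
```

Unlike the plain Jaccard distance, which is 0 for this pair, the weighted distance is positive because the repeat counts of AAAAA and CCCCC differ.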

  18. Jaccard and weighted Jaccard ignore relative order
      x = CCCCACCAACACAAAACCC → {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC}
      y = AAAACACAACCCCACCAAA → {AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC, CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC}
      x, y: de Bruijn sequences, contain all 16 possible 4-mers once
      $J(x, y) = J_w(x, y) = 0$, $D(x, y) = 0.63$

  21. Jaccard is different from edit distance
      Unlike edit distance, Jaccard is insensitive to:
      1. k-mer repetitions
      2. relative positions of k-mers

  22. OMH: Order Min Hash
      • minHash is an LSH for Jaccard
      • OMH is a refinement of minHash
      • OMH is sensitive to:
        • repeated k-mers
        • relative order of k-mers

  23. minHash & OMH sketches
      x = AGTTGAGCGGAAGGTG, k = 2, m = 6, ℓ = 2
      [Figure, built up over several slides: columns numbered 1 to 6, each listing the 2-mers of x in a different permuted order, shown first as plain 2-mers and then as (2-mer, position) pairs from (AG, 0) to (TG, 14). Final row shown: GC GA AG GG AG TG.]
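One way to read the OMH construction is sketched below. It is not the authors' implementation and rests on several assumptions: repeated k-mers are kept distinct by tagging each with its occurrence number, each of the m hash functions selects the ℓ tagged k-mers with the smallest hash values, and the selected k-mers are reported in the order they appear in the sequence. `seeded_hash` is a hypothetical stand-in for a random hash family.

```python
import hashlib

def seeded_hash(kmer, occurrence, seed):
    """Hypothetical stand-in for the seed-th random hash function."""
    digest = hashlib.sha256(f"{seed}:{kmer}:{occurrence}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def omh_sketch(seq, k, m, ell):
    """Return m entries, each a tuple of ell k-mers in sequence order."""
    # Tag every k-mer with its occurrence number so repeats stay distinct.
    seen = {}
    tagged = []  # (position, k-mer, occurrence number)
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occurrence = seen.get(kmer, 0)
        seen[kmer] = occurrence + 1
        tagged.append((pos, kmer, occurrence))

    sketch = []
    for seed in range(m):
        # The ell tagged k-mers with the smallest hash values ...
        smallest = sorted(tagged, key=lambda t: seeded_hash(t[1], t[2], seed))[:ell]
        # ... reported in the order in which they occur in the sequence.
        smallest.sort()
        sketch.append(tuple(kmer for _, kmer, _ in smallest))
    return sketch

print(omh_sketch("AGTTGAGCGGAAGGTG", k=2, m=6, ell=2))
```

Under these assumptions, two sequences with the same k-mer multiset, such as the de Bruijn pair from the earlier slide, select the same k-mers for every hash function, so their sketch entries can differ only in the order of the k-mers. With ℓ = 1 the order carries no information and the sketch behaves like a minHash over tagged k-mers.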
