Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search


  1. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search
     Qin (Christine) Lv, Stony Brook University
     Joint work with Zhe Wang, William Josephson, Moses Charikar, Kai Li (Princeton University)

  2. Motivations
      Massive amounts of feature-rich data: audio, video, digital photos, sensor data, … all fuzzy and high-dimensional
      Similarity search in high dimensions: KNN or ANN in feature-vector space
      Important in various areas: databases, data mining, search engines, …

  3. Ideal Indexing for Similarity Search
      Accurate: returns results close to those of brute-force search
      Time efficient: O(1) or O(log N) query time
      Space efficient: small index that may fit into main memory even for large datasets
      High-dimensional: works well for datasets with high dimensionality

  4. Previous Indexing Methods
      K-D tree, R-tree, X-tree, SR-tree, …: suffer from the “curse of dimensionality”; linear scan outperforms them when d > 10 [WSB98]
      Navigating nets [KL04], cover tree [BKL06]: based on “intrinsic dimensionality”; do not perform well when intrinsic dimensionality is high
      Locality sensitive hashing (LSH)

  5. Outline
      Motivations
      Locality sensitive hashing (LSH): basic LSH, entropy-based LSH
      Multi-probe LSH indexing: step-wise probing, query-directed probing
      Evaluations
      Conclusions & future work

  6. LSH: Locality Sensitive Hashing
      A hash family is (r, cr, p_1, p_2)-sensitive [IM98] if:
        if D(q,p) < r, then Pr[h(q) = h(p)] >= p_1
        if D(q,p) > cr, then Pr[h(q) = h(p)] <= p_2
        i.e., closer objects have a higher collision probability
      LSH based on p-stable distributions [DIIM04]: h_{a,b}(v) = ⌊(a · v + b) / w⌋, where a has entries drawn from a p-stable distribution, b is uniform in [0, w), and w is the slot width
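
A minimal sketch of this p-stable hash in Python (illustrative only; names such as PStableHash, dim, and w are mine, not the authors'). It uses the Gaussian distribution, which is 2-stable and therefore suited to Euclidean distance:

```python
import numpy as np

class PStableHash:
    """One LSH function h_{a,b}(v) = floor((a . v + b) / w), as in [DIIM04]."""
    def __init__(self, dim, w, rng=None):
        rng = rng or np.random.default_rng()
        self.a = rng.standard_normal(dim)  # 2-stable (Gaussian) projection vector
        self.b = rng.uniform(0.0, w)       # random offset, uniform in [0, w)
        self.w = w                         # slot width

    def __call__(self, v):
        # Project v onto the random line, shift by b, quantize into slots of width w.
        return int(np.floor((np.dot(self.a, v) + self.b) / self.w))
```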

  7. LSH for Similarity Search
      False positives: reduced by intersecting multiple hashes (concatenating hash functions within one table)
      False negatives: reduced by taking the union over multiple hash tables

  8. Basic LSH Indexing [IM98, GIM99, DIIM04]
      M hash functions per table: g_i(v) = (h_{i,1}(v), …, h_{i,M}(v))
      L hash tables: G = {g_1, …, g_L}; a query q probes the single bucket g_i(q) in each table
      Issues: a large number of tables is required (L > 100 in [GIM99], L > 500 in [Buhler01]), which is impractical for large datasets
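
To make the M-and-L structure concrete, here is a hedged sketch of a basic LSH index built from the PStableHash class above; all names are illustrative, not the authors' code:

```python
from collections import defaultdict

class BasicLSHIndex:
    def __init__(self, dim, w, M, L):
        # L tables, each keyed by the concatenation of M p-stable hashes.
        self.hashes = [[PStableHash(dim, w) for _ in range(M)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, i, v):
        return tuple(h(v) for h in self.hashes[i])

    def insert(self, v, obj_id):
        for i in range(len(self.tables)):
            self.tables[i][self._key(i, v)].append(obj_id)

    def query(self, q):
        # Union over L tables fights false negatives; the M-way concatenation
        # within each table fights false positives.
        candidates = set()
        for i in range(len(self.tables)):
            candidates.update(self.tables[i].get(self._key(i, q), []))
        return candidates  # re-rank candidates by true distance to q afterwards
```

Each extra table multiplies memory use, which is why L in the hundreds quickly becomes impractical.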

  9. Entropy-Based LSH Indexing [Panigrahy, SODA’06]
      Randomly perturb q at distance R and check the hash buckets of the perturbed points (e.g., g_i(p_1) in addition to g_i(q))
      Issues: difficult to choose R; perturbed points often land in duplicate buckets; probing is inefficient
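
A rough sketch of the entropy-based scheme, assuming the BasicLSHIndex above and treating R and n_perturb as user-chosen knobs (the talk gives no values); q is assumed to be a NumPy vector:

```python
import numpy as np

def entropy_probe(index, q, R, n_perturb, rng=None):
    # Look up q's own buckets, then the buckets of random points at distance R.
    rng = rng or np.random.default_rng()
    candidates = set(index.query(q))
    for _ in range(n_perturb):
        d = rng.standard_normal(len(q))
        p = q + R * d / np.linalg.norm(d)  # random point at distance R from q
        candidates |= index.query(p)       # perturbed points may revisit the same bucket
    return candidates
```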

  10. Outline
      Motivations
      Locality sensitive hashing (LSH): basic LSH, entropy-based LSH
      Multi-probe LSH indexing: step-wise probing, query-directed probing
      Evaluations
      Conclusions & future work

  11. Multi-Probe LSH Indexing
      Probes multiple hash buckets per table
      Perturbs directly on hash values: checks the slots to the left and right of h(q)
      Perturbation vector ∆: e.g., g(q) = (2, 5, 3) and ∆ = (-1, 1, 0) give g(q) + ∆ = (1, 6, 3)
      Systematic probing follows a sequence (∆_1, ∆_2, ∆_3, ∆_4, …), as sketched below
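
The query loop this implies, sketched against the BasicLSHIndex fields from earlier (a fixed probing sequence is assumed here; the query-directed variant on slide 15 derives the sequence per query):

```python
def multi_probe_query(index, q, probing_sequence):
    # Probe g_i(q) plus each perturbed bucket g_i(q) + delta in every table.
    candidates = set()
    for i in range(len(index.tables)):
        gq = index._key(i, q)
        zero = (0,) * len(gq)
        for delta in [zero, *probing_sequence]:
            bucket = tuple(g + d for g, d in zip(gq, delta))
            candidates.update(index.tables[i].get(bucket, []))
    return candidates
```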

  12. Multi-Probe LSH Indexing
      A carefully derived probing sequence (∆_1, ∆_2, ∆_3, ∆_4, …): each table is probed at g_i(q) and at perturbed buckets such as g_i(q) + ∆_1, g_i(q) + ∆_2, …
      Advantages: fast probing-sequence generation; no duplicate buckets; more effective at finding similar objects

  13. Step-Wise Probing
      Given q’s hash values g(q) = (3, 2, 5):
        1-step buckets (one coordinate perturbed by ±1, e.g., ∆ = (0, 0, 1)): (2, 2, 5), (4, 2, 5), (3, 2, 6), …
        2-step buckets (two coordinates perturbed, e.g., ∆ = (-1, -1, 0)): (2, 1, 5), (2, 2, 6), (3, 3, 6), …
      Intuitions: 1-step buckets are better than 2-step buckets, and all 1-step buckets are equally good. Both are WRONG, as the next two slides show (see the enumeration sketch below)
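
Enumerating n-step buckets is mechanical: choose which n coordinates to perturb and a sign for each. A small illustrative generator:

```python
from itertools import combinations, product

def step_wise_probes(gq, max_steps):
    # Yield (delta, bucket) for every n-step perturbation, n = 1..max_steps.
    M = len(gq)
    for n in range(1, max_steps + 1):
        for coords in combinations(range(M), n):
            for signs in product((-1, 1), repeat=n):
                delta = [0] * M
                for c, s in zip(coords, signs):
                    delta[c] = s
                yield tuple(delta), tuple(g + d for g, d in zip(gq, delta))

# list(step_wise_probes((3, 2, 5), 1)) gives the six 1-step buckets,
# including (2, 2, 5), (4, 2, 5), and (3, 2, 6) from the slide.
```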

  14. Success Probability Estimation
      The hashed position within the slot matters!
      Success probability is estimated from x_i(-1) and x_i(1), the distances from q’s projection under h_i to the left and right boundaries of its slot: the closer the projection is to a boundary, the more likely a near neighbor falls in the adjacent slot
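
In code, these boundary distances are a by-product of the projection before quantization. A sketch (normalized by w, so distances lie in [0, 1); the slide's example values, which pair up to sum to 1, suggest this scale):

```python
import numpy as np

def boundary_distances(h, q):
    # h: a PStableHash from the earlier sketch; q: a NumPy vector.
    f = (np.dot(h.a, q) + h.b) / h.w  # unquantized slot position; floor(f) = h(q)
    x_left = f - np.floor(f)          # x_i(-1): distance to the left slot boundary
    return x_left, 1.0 - x_left       # x_i(+1): distance to the right slot boundary
```

A perturbation of -1 on coordinate i succeeds more often when x_i(-1) is small, which is exactly why not all 1-step buckets are equally good.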

  15. Query-Directed Probing
      Example: g(q) = (h_1(q), h_2(q), h_3(q)) = (2, 5, 1), with boundary distances x_1(-1) = 0.7, x_1(1) = 0.3; x_2(-1) = 0.4, x_2(1) = 0.6; x_3(-1) = 0.2, x_3(1) = 0.8
      Sorted in increasing order: { 0.2, 0.3, 0.4, 0.6, 0.7, 0.8 } = { x_3(-1), x_1(1), x_2(-1), x_2(1), x_1(-1), x_3(1) }
      Probes are generated in increasing score order:
        ∆_1 = (0, 0, -1) → bucket (2, 5, 0), built from { 0.2 }
        ∆_2 = (1, 0, 0) → (3, 5, 1), from { 0.3 }
        ∆_3 = (1, 0, -1) → (3, 5, 0), from { 0.2, 0.3 }
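
One way to generate this sequence is a min-heap over sets of indices into the sorted distance list, expanding each popped set with "shift" and "expand" successors. The sketch below follows that idea; scoring a set by its sum of squared distances and the (distance, coordinate, sign) layout are my assumptions about the method, not verbatim from the talk:

```python
import heapq

def query_directed_probes(sorted_x, n_probes):
    # sorted_x: (x, coord, sign) triples in increasing x, two per coordinate.
    def score(s):
        return sum(sorted_x[j][0] ** 2 for j in s)
    heap = [(score((0,)), (0,))]  # start from the single cheapest entry
    emitted = 0
    while heap and emitted < n_probes:
        _, s = heapq.heappop(heap)
        m = s[-1]
        if m + 1 < len(sorted_x):
            shift = s[:-1] + (m + 1,)  # replace the largest index with the next one
            expand = s + (m + 1,)      # add the next index
            heapq.heappush(heap, (score(shift), shift))
            heapq.heappush(heap, (score(expand), expand))
        coords = [sorted_x[j][1] for j in s]
        if len(coords) == len(set(coords)):  # a coordinate may be perturbed only once
            emitted += 1
            yield {sorted_x[j][1]: sorted_x[j][2] for j in s}

# The slide's example, with coordinates numbered from 0:
sorted_x = [(0.2, 2, -1), (0.3, 0, +1), (0.4, 1, -1),
            (0.6, 1, +1), (0.7, 0, -1), (0.8, 2, +1)]
# query_directed_probes(sorted_x, 3) yields {2: -1}, {0: +1}, {2: -1, 0: +1},
# i.e. buckets (2, 5, 0), (3, 5, 1), (3, 5, 0) from g(q) = (2, 5, 1).
```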

  16. Outline
      Motivations
      Locality sensitive hashing (LSH): basic LSH, entropy-based LSH
      Multi-probe LSH indexing: step-wise probing, query-directed probing
      Evaluations
      Conclusions & future work

  17. Evaluations
      Multi-probe vs. basic vs. entropy-based LSH: tradeoff among space, speed, and quality; space reduction
      Query-directed vs. step-wise probing: tradeoff between search quality and number of probes

  18. Evaluation Methodology
      Datasets:
        Dataset            #objects     #dimensions
        Web images         1.3 million  64
        Switchboard audio  2.6 million  192
      Benchmarks: 100 random queries, top-K results
      Evaluation metrics:
        Search quality: recall = |I ∩ R| / |I| (I = ideal answer set from brute-force search, R = returned set) and error ratio
        Search speed: query latency
        Space usage: number of hash tables
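
For completeness, the recall metric as a one-liner (I and R as defined above):

```python
def recall(ideal_ids, returned_ids):
    I, R = set(ideal_ids), set(returned_ids)
    return len(I & R) / len(I)
```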

  19. Multi-Probe vs. Basic vs. Entropy
      Multi-probe LSH achieves higher recall with fewer hash tables

  20. Space Savings of Multi-Probe LSH
      14x-18x fewer tables than basic LSH, 5x-8x fewer than entropy-based LSH (e.g., 30 and 11 tables reduced to 2)

  21. Multi-Probe vs. Entropy-Based
      Multi-probe LSH uses far fewer probes

  22. Query-Directed vs. Step-Wise Probing
      Query-directed probing uses 10x fewer probes
