
COMP9313: Big Data Management - High Dimensional Similarity Search


  1. COMP9313: Big Data Management High Dimensional Similarity Search

  2. Similarity Search
  • Problem definition: given a query q and a dataset D, find o ∈ D such that o is similar to q
  • Two types of similarity search:
  • Range search: return every o with dist(o, q) ≤ τ
  • Nearest neighbor search: return o* with dist(o*, q) ≤ dist(o, q) for all o ∈ D
  • Top-k version: return the k closest objects
  • The distance/similarity function varies: Euclidean, Jaccard, inner product, …
  • A classic problem, with mature solutions
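
A minimal brute-force sketch of both query types, assuming Euclidean distance and a small in-memory dataset; the function names are illustrative, not from the lecture:

```python
import math

def dist(o, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((oi - qi) ** 2 for oi, qi in zip(o, q)))

def range_search(D, q, tau):
    """Range search: every o in D with dist(o, q) <= tau."""
    return [o for o in D if dist(o, q) <= tau]

def nearest_neighbor(D, q):
    """Nearest neighbor search: the o* minimizing dist(o*, q), by linear scan."""
    return min(D, key=lambda o: dist(o, q))
```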

  3. High Dimensional Similarity Search
  • Applications and relationship to Big Data:
  • Almost every object can be, and has been, represented by a high dimensional vector: words, documents, images, audio, video, …
  • Similarity search is a fundamental process in information retrieval, e.g., the Google search engine, face recognition systems, …
  • High dimensionality makes a huge difference: traditional solutions are no longer feasible
  • This lecture is about why, and how
  • We focus on high dimensional vectors in Euclidean space

  4. Similarity Search in Low Dimensional Space

  5. Similarity Search in One Dimensional Space
  • Just numbers: use binary search, a binary search tree, a B+-tree, … (see the sketch below)
  • The essential idea behind them: objects can be sorted
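
A sketch of the 1-D case using Python's bisect module: once the data is sorted, the nearest neighbor must be one of the two values around the query's insertion point.

```python
import bisect

def nn_1d(sorted_data, q):
    """Nearest neighbor in 1-D via binary search: locate the insertion
    point, then compare the values on either side of it."""
    i = bisect.bisect_left(sorted_data, q)
    candidates = sorted_data[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda x: abs(x - q))
```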

  6. Similarity Search in Two Dimensional Space
  • Why does binary search no longer work? There is no order!
  • Voronoi diagram
  [Figure: Voronoi diagrams under Euclidean distance and Manhattan distance]

  7. Similarity Search in Two Dimensional Space
  • Partition-based algorithms: partition the data into "cells"
  • Nearest neighbors are in the same cell as the query, or in adjacent cells
  • How many "cells" must be probed in 3-dimensional space? (see the sketch below)
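
A rough sketch of the partition idea in 2-D, assuming square cells of a fixed side length (the names and the grid layout are illustrative). Probing the query's cell plus all adjacent cells touches 3^2 = 9 cells in 2-D, and 3^3 = 27 in 3-D, which answers the slide's question and hints at why this scales badly with dimension:

```python
from collections import defaultdict
from itertools import product

def build_grid(D, cell):
    """Bucket 2-D points into square cells of side length `cell`."""
    grid = defaultdict(list)
    for (x, y) in D:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def probe(grid, q, cell):
    """Collect candidates from the query's cell and its 8 neighbours
    (3^d cells in d dimensions)."""
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    out = []
    for dx, dy in product((-1, 0, 1), repeat=2):
        out.extend(grid.get((cx + dx, cy + dy), []))
    return out
```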

  8. Similarity Search in Metric Space
  • Triangle inequality: dist(x, q) ≤ dist(x, y) + dist(y, q)
  • Orchard's Algorithm:
  • For each x ∈ D, create a list of all other points in increasing order of distance to x
  • Given a query q, randomly pick a point as the initial candidate (i.e., the pivot p), and compute dist(p, q)
  • Walk along the list of p and compute each point's distance to q; if some y is found closer to q than p, make y the new pivot (i.e., p ← y)
  • Repeat the procedure, and stop when dist(p, y) > 2 · dist(p, q)
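
A compact Python sketch of Orchard's Algorithm as described above, assuming Euclidean distance; note the O(n²) space and preprocessing for the per-point sorted lists, which is part of why the method does not scale to large or high-dimensional data:

```python
import math
import random

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_lists(D):
    """For each point index i, all other indices sorted by distance to D[i]."""
    return {i: sorted((j for j in range(len(D)) if j != i),
                      key=lambda j: euclid(D[i], D[j]))
            for i in range(len(D))}

def orchard_nn(D, lists, q):
    """Walk pivot lists, re-pivoting to any closer point; stop when the
    triangle-inequality bound guarantees no closer point exists."""
    p = random.randrange(len(D))          # initial pivot, chosen at random
    d_p = euclid(D[p], q)
    improved = True
    while improved:
        improved = False
        for y in lists[p]:                # p's list, nearest to p first
            if euclid(D[p], D[y]) > 2 * d_p:
                break                     # stopping rule from the slide
            d_y = euclid(D[y], q)
            if d_y < d_p:                 # found a closer point: re-pivot
                p, d_p = y, d_y
                improved = True
                break
    return D[p]
```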

  9. Similarity Search in Metric Space
  • Why can Orchard's Algorithm stop when dist(p, y) > 2 · dist(p, q)?
  • Suppose 2 · dist(p, q) < dist(p, y). By the triangle inequality, dist(p, y) ≤ dist(p, q) + dist(y, q), so
  2 · dist(p, q) < dist(p, q) + dist(y, q) ⟺ dist(p, q) < dist(y, q)
  • That is, such a y cannot be closer to q than p. Since the list of p is in increasing order of distance to p, dist(p, y) > 2 · dist(p, q) holds for all the remaining y's as well.

  10. None of the Above Works in High Dimensional Space!

  11. Curse of Dimensionality
  • Refers to various phenomena that arise in high dimensional spaces but do not occur in low dimensional settings
  • Triangle inequality: its pruning power degrades heavily
  • What is the volume of a high dimensional "ring" (i.e., hyperspherical shell), relative to the ball it encloses?
  • V_ring(w=1, d=2) / V_ball(r=10, d=2) = 1 − (9/10)^2 = 19%
  • V_ring(w=1, d=100) / V_ball(r=10, d=100) = 1 − (9/10)^100 ≈ 99.997%
  • In high dimensions, almost all of a ball's volume lies in its thin outer shell
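
The shell-to-ball ratio is easy to verify numerically: since the volume of a d-ball scales as r^d, the fraction in the outer shell of width w is 1 − ((r − w)/r)^d. A one-function sketch:

```python
def ring_fraction(r, w, d):
    """Fraction of a d-dimensional ball of radius r occupied by
    the outer shell of width w: 1 - ((r - w) / r) ** d."""
    return 1 - ((r - w) / r) ** d

print(ring_fraction(10, 1, 2))    # ~0.19    -> 19% in 2-D
print(ring_fraction(10, 1, 100))  # ~0.99997 -> 99.997% in 100-D
```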

  12. Approximate Nearest Neighbor Search in High Dimensional Space
  • There is no sub-linear solution that finds the exact result of a nearest neighbor query
  • So we relax the condition: approximate nearest neighbor search (ANNS)
  • The returned point is allowed to be a non-NN of the query
  • Success: the true NN is returned
  • Use the success rate (e.g., the percentage of successful queries) to evaluate a method
  • The success rate is hard to bound

  13. c-approximate NN Search
  • Success: return o such that dist(o, q) ≤ c · dist(o*, q), where o* is the true NN
  • Then we can bound the success probability, usually written as 1 − δ
  • Solution: Locality Sensitive Hashing (LSH)
  [Figure: query q, true NN o*, and the radii r and c·r]

  14. Locality Sensitive Hashing
  • Hash functions:
  • Index: map data/objects to values (e.g., hash keys)
  • Same data ⇒ same hash key (with 100% probability)
  • Different data ⇒ different hash keys (with high probability)
  • Retrieval: it is easy to retrieve identical objects, as they have the same hash key
  • Applications: hash maps, hash joins
  • Low cost: O(n) space, O(1) time
  • Why can it not be used for nearest neighbor search? Even a minor difference leads to totally different hash keys (see below)
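
A tiny illustration of the problem, using Python's built-in hash as the exact hash function; the specific vectors are made up:

```python
# Two nearly identical vectors get unrelated hash keys, so an exact
# hash table can only retrieve identical items, never "close" ones.
x = (1.00, 2.00, 3.00)
y = (1.00, 2.00, 3.01)   # a tiny perturbation of x
print(hash(x), hash(y))  # two unrelated integers
```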

  15. Locality Sensitive Hashing
  • Index: make the hash functions error tolerant
  • Similar data ⇒ same hash key (with high probability)
  • Dissimilar data ⇒ different hash keys (with high probability)
  • Retrieval:
  • Compute the hash key of the query
  • Collect all data with the same key as the query (i.e., the candidates)
  • Find the candidate nearest to the query
  • Cost: O(n) space, O(1) + O(|cand|) time
  • This is not yet the real Locality Sensitive Hashing; several issues remain unsolved…
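
A sketch of this single-table scheme, assuming some hash function h is supplied from outside; the class and method names are illustrative:

```python
from collections import defaultdict

class LSHTable:
    """One hash table: `h` maps a vector to a hash key, and similar
    vectors are assumed to collide with high probability."""
    def __init__(self, h):
        self.h = h
        self.buckets = defaultdict(list)

    def insert(self, o):
        self.buckets[self.h(o)].append(o)

    def query(self, q, dist):
        candidates = self.buckets.get(self.h(q), [])      # O(1) lookup
        if not candidates:
            return None
        return min(candidates, key=lambda o: dist(o, q))  # O(|cand|) scan
```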

  16. LSH Functions
  • Formal definition: given points o1, o2, distances r1, r2, and probabilities p1, p2, an LSH function h(·) should satisfy:
  • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1
  • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2
  • What is h(·) for a given distance/similarity function?
  • Jaccard similarity
  • Angular distance
  • Euclidean distance

  17. MinHash - LSH Function for Jaccard Similarity
  • Each data object is a set
  • Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
  • Randomly generate a global order for all the elements in C = S1 ∪ S2 ∪ … ∪ Sn
  • Let h(S) be the minimal member of S with respect to that global order
  • Example: S = {b, c, e, g, i}; using inverse alphabetical order, the re-ordered S = {i, g, e, c, b}, hence h(S) = i
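
A minimal MinHash sketch: the random global order is realized as one random permutation of the universe. The helper name and seed are illustrative:

```python
import random

def make_minhash(universe, seed=0):
    """Fix one random global order over the universe and return
    h(S) = the member of S that comes first in that order."""
    rng = random.Random(seed)
    order = rng.sample(sorted(universe), len(universe))
    rank = {e: i for i, e in enumerate(order)}
    return lambda S: min(S, key=lambda e: rank[e])

h = make_minhash({'b', 'c', 'e', 'g', 'i'})
print(h({'b', 'c', 'e'}))  # whichever of b, c, e is ranked first globally
```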

  18. MinHash
  • Now we compute Pr[h(S1) = h(S2)]
  • Every element e ∈ S1 ∪ S2 has an equal chance to be the first element of S1 ∪ S2 after re-ordering
  • h(S1) = h(S2) if and only if that first element e ∈ S1 ∩ S2
  • h(S1) ≠ h(S2) if and only if e ∉ S1 ∩ S2
  • Hence Pr[h(S1) = h(S2)] = |S1 ∩ S2| / |S1 ∪ S2| = Jaccard(S1, S2)
  • With multiple independent hash functions h_i, the collision fraction |{i : h_i(S1) = h_i(S2)}| / |{i}| estimates Jaccard(S1, S2)
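
Reusing make_minhash from the sketch above, the last bullet can be checked empirically: the collision fraction over k independent hash functions concentrates around the true Jaccard similarity. The example sets and k are made up:

```python
def jaccard(S1, S2):
    return len(S1 & S2) / len(S1 | S2)

def estimate_jaccard(S1, S2, universe, k=200):
    """Fraction of k independent MinHash functions on which S1, S2 collide."""
    hs = [make_minhash(universe, seed=s) for s in range(k)]
    return sum(h(S1) == h(S2) for h in hs) / k

S1, S2 = {'b', 'c', 'e', 'g'}, {'c', 'e', 'g', 'i'}
print(jaccard(S1, S2), estimate_jaccard(S1, S2, S1 | S2))
# exact value 0.6; the estimate approaches 0.6 as k grows
```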

  19. SimHash - LSH Function for Angular Distance
  • Each data object is a d-dimensional vector
  • θ(x, y) is the angle between x and y
  • Randomly generate a vector a, where each a_i ~ N(0, 1)
  • Let h(x; a) = sgn(a^T x), where sgn(v) = 1 if v ≥ 0, and −1 if v < 0
  • h(x; a) records on which side of the hyperplane with normal vector a the point x lies
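
A minimal SimHash sketch following the definition above; the helper name and example vectors are illustrative:

```python
import random

def make_simhash(d, seed=0):
    """Draw one random Gaussian vector a and return h(x) = sgn(a . x)."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]
    return lambda x: 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

h = make_simhash(3)
print(h((1.0, 0.2, 0.5)), h((1.1, 0.3, 0.4)))  # small angle, so likely equal
```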

  20. SimHash
  • Now we compute Pr[h(o1) = h(o2)]
  • h(o1) ≠ h(o2) iff o1 and o2 lie on different sides of the hyperplane with a as its normal vector, which happens with probability θ/π
  • Hence Pr[h(o1) = h(o2)] = 1 − θ/π
  [Figure: vectors o1 and o2 separated by angle θ, with a random hyperplane between them]

  21. p-stable LSH - LSH Function for Euclidean Distance
  • Each data object is a d-dimensional vector
  • dist(x, y) = sqrt(Σ_{i=1..d} (x_i − y_i)^2)
  • Randomly generate a vector a, where each a_i ~ N(0, 1)
  • The normal distribution is 2-stable, i.e., if a_i ~ N(0, 1), then Σ_{i=1..d} a_i · x_i ~ N(0, ‖x‖₂²)
  • Let h(x; a, b) = ⌊(a^T x + b) / w⌋, where b ~ U(0, w) and w is a user-specified parameter
  • Pr[h(o1; a, b) = h(o2; a, b)] = ∫₀^w (1/s) · f_p(t/s) · (1 − t/w) dt, where s = dist(o1, o2)
  • f_p(·) is the pdf of the absolute value of a normal variable
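
A sketch of a single p-stable hash function under the definitions above; b ~ U(0, w) is reconstructed from the standard E2LSH construction, and the width w = 4.0 in the usage example is an arbitrary illustrative choice:

```python
import math
import random

def make_pstable_hash(d, w, seed=0):
    """h(x) = floor((a . x + b) / w), with a_i ~ N(0,1) and b ~ U(0, w).
    w controls the bucket width, i.e., how tolerant the hash is."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]
    b = rng.uniform(0, w)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / w)

h = make_pstable_hash(d=3, w=4.0)
print(h((1.0, 2.0, 3.0)), h((1.1, 2.0, 2.9)))  # close points often share a bucket
```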

  22. p-stable LSH
  • Intuition of p-stable LSH: nearby points have a higher chance of being hashed into the same bucket

  23. Pr[h(x) = h(y)] for Different Hash Functions
  [Figure: collision probability vs. distance curves for MinHash, SimHash, and p-stable LSH]

  24. Problem of a Single Hash Function
  • It is hard to distinguish two pairs whose distances are close to each other
  • Pr[h(o1) = h(o2)] ≥ p1, if dist(o1, o2) ≤ r1
  • Pr[h(o1) = h(o2)] ≤ p2, if dist(o1, o2) > r2
  • We also want to control where the drastic change in collision probability happens:
  • Close to dist(o*, q), or
  • At a given range

  25. AND-OR Composition
  • Recall that for a single hash function, Pr[h(o1) = h(o2)] = p(dist(o1, o2)), denoted p_{o1,o2}
  • Now we consider two scenarios:
  • Combine k hashes using AND: a match requires all k hashes to match
  • Pr[H_AND(o1) = H_AND(o2)] = (p_{o1,o2})^k
  • Combine l hashes using OR: a match requires at least one of the hashes to match
  • Pr[H_OR(o1) = H_OR(o2)] = 1 − (1 − p_{o1,o2})^l
  • There is no match only when none of the l hashes match
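
The two compositions are typically nested (k AND-ed hashes inside each of l OR-ed groups, the usual banding scheme), giving the s-shaped curve 1 − (1 − p^k)^l. A small numeric sketch; the probabilities and the k, l values are illustrative, not from the slides:

```python
def and_or_prob(p, k, l):
    """Collision probability after composing k hashes with AND inside
    each of l OR-ed groups: 1 - (1 - p^k)^l."""
    return 1 - (1 - p ** k) ** l

# The s-curve sharpens the gap between near and far pairs:
for p in (0.8, 0.5):   # per-hash collision probability of a near/far pair
    print(p, and_or_prob(p, k=5, l=20))
# p=0.8 -> ~0.9996 ; p=0.5 -> ~0.47
```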
