COMP9313: Big Data Management
High Dimensional Similarity Search
Similarity Search
• Problem Definition:
  • Given a query q and a dataset D, find o ∈ D such that o is similar to q
• Two types of similarity search
  • Range search: find all o with dist(q, o) ≤ τ
  • Nearest neighbor search: find o* with dist(o*, q) ≤ dist(o, q), ∀o ∈ D
    • Top-k version: find the k closest objects to q
• Distance/similarity function varies
  • Euclidean, Jaccard, inner product, …
• Classic problem, with mature solutions
High Dimensional Similarity Search
• Applications and relationship to Big Data
  • Almost every object can be (and has been) represented by a high dimensional vector
    • Words, documents
    • Images, audio, video
    • …
  • Similarity search is a fundamental process in information retrieval
    • E.g., Google search engine, face recognition systems, …
• High dimensionality makes a huge difference!
  • Traditional solutions are no longer feasible
  • This lecture is about why and how
• We focus on high dimensional vectors in Euclidean space
Similarity Search in Low Dimensional Space
Similarity Search in One Dimensional Space
• Just numbers: use binary search, binary search trees, B+ Trees, …
• The essential idea behind them: objects can be sorted (see the sketch below)
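To make the idea concrete, here is a minimal 1-D nearest neighbor search via binary search; the function name nn_1d and the sample data are illustrative, not from the slides.

```python
import bisect

def nn_1d(q, sorted_data):
    # Binary search for the insertion point, then compare the two
    # surrounding values: O(log n) per query after an O(n log n) sort.
    i = bisect.bisect_left(sorted_data, q)
    candidates = sorted_data[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda x: abs(x - q))

data = sorted([4.2, -1.0, 7.5, 3.3, 9.9])
print(nn_1d(5.0, data))  # 4.2 is the closest value to 5.0
```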
Similarity Search in Two Dimensional Space
• Why does binary search no longer work?
  • No total order!
• Voronoi diagram
  (Figures: Voronoi diagrams under Euclidean distance and under Manhattan distance)
Similarity Search in Two Dimensional Space
• Partition based algorithms (a sketch follows below)
  • Partition the data into "cells"
  • Nearest neighbors are in the same cell as the query or in adjacent cells
  • How many "cells" to probe in 3-dimensional space? 3³ = 27, and 3^d in general: exponential in the dimensionality
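A minimal sketch of the cell idea in 2-D, assuming the cell side is chosen large enough that the true nearest neighbor always lies in the query's cell or one of its 8 neighbors; helper names like build_grid are hypothetical.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_grid(points, cell):
    # Bucket 2-D points (numpy arrays) into square cells of side `cell`
    grid = defaultdict(list)
    for p in points:
        grid[tuple(np.floor(p / cell).astype(int))].append(p)
    return grid

def grid_nn(q, grid, cell):
    # Probe the query's cell and its 8 neighbors: 3^2 = 9 cells in 2-D
    c = tuple(np.floor(q / cell).astype(int))
    cand = [p for dx, dy in product((-1, 0, 1), repeat=2)
            for p in grid[(c[0] + dx, c[1] + dy)]]
    return min(cand, key=lambda p: np.linalg.norm(q - p), default=None)
```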
Similarity Search in Metric Space
• Triangle inequality
  • dist(x, q) ≤ dist(x, y) + dist(y, q)
• Orchard's Algorithm
  • for each x ∈ D, create a list of points in increasing order of distance to x
  • given query q, randomly pick a point x as the initial candidate (i.e., pivot p), and compute dist(q, p)
  • walk along the list of p and compute the distances to q; if some y closer to q than p is found, use y as the new pivot (i.e., p ← y)
  • repeat the procedure, and stop when
    • dist(p, y) > 2 · dist(q, p)
Similarity Search in Metric Space
• Orchard's Algorithm: why stop when dist(p, y) > 2 · dist(q, p)?
  • 2 · dist(q, p) < dist(p, y) and dist(p, y) ≤ dist(p, q) + dist(y, q)
  • ⇒ 2 · dist(q, p) < dist(p, q) + dist(y, q)
  • ⇒ dist(q, p) < dist(y, q), so y cannot be closer to q than the pivot p
• Since the list of p is in increasing order of distance to p, dist(p, y) > 2 · dist(q, p) holds for all the remaining y's (a code sketch of the full procedure follows below)
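Putting the walk, the re-pivoting, and the stopping rule together, a sketch of Orchard's algorithm might look as follows; order_lists is the precomputed per-point index described above (its O(n²) space is the algorithm's main drawback).

```python
import random

def orchard_nn(q, data, dist, order_lists):
    # order_lists[i]: indices of data sorted by increasing distance to data[i]
    p = random.randrange(len(data))      # random initial pivot
    d_p = dist(q, data[p])
    improved = True
    while improved:
        improved = False
        for y in order_lists[p]:
            # stopping rule: no later y in p's list can be closer to q
            if dist(data[p], data[y]) > 2 * d_p:
                break
            d_y = dist(q, data[y])
            if d_y < d_p:                # found a closer point: re-pivot
                p, d_p = y, d_y
                improved = True
                break
    return data[p], d_p
```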
None of the Above Works in High Dimensional Space!
Curse of Dimensionality
• Refers to various phenomena that arise in high dimensional spaces but do not occur in low dimensional settings
• Triangle inequality
  • Its pruning power degrades heavily
• What is the volume of a high dimensional "ring" (i.e., hyperspherical shell)?
  • Vol_ring(w=1, d=2) / Vol_ball(r=10, d=2) = 19%
  • Vol_ring(w=1, d=100) / Vol_ball(r=10, d=100) = 99.997%
  • Almost all of the volume concentrates near the surface, so distances lose contrast (verified in the snippet below)
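These percentages follow from the fact that the volume of a d-dimensional ball of radius r scales as r^d, so a shell of width w occupies a fraction 1 − ((r − w)/r)^d of the ball; a quick check:

```python
def shell_fraction(r, w, d):
    # Fraction of a d-ball of radius r occupied by the outer shell of width w
    return 1 - ((r - w) / r) ** d

print(shell_fraction(10, 1, 2))    # 0.19      -> 19% in 2 dimensions
print(shell_fraction(10, 1, 100))  # 0.99997.. -> 99.997% in 100 dimensions
```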
Approximate Nearest Neighbor Search in High Dimensional Space
• There is no sub-linear solution that finds the exact result of a nearest neighbor query
• So we relax the condition
  • approximate nearest neighbor search (ANNS)
  • returned points are allowed to be not the NN of the query
• Success: the method returns the true NN
  • use the success rate (i.e., the percentage of successful queries) to evaluate the method
  • Hard to bound the success rate
c-approximate NN Search
• Success: returns o such that
  • dist(q, o) ≤ c · dist(o*, q)
• Then we can bound the success probability
  • Usually denoted as 1 − δ
• Solution: Locality Sensitive Hashing (LSH)
Locality Sensitive Hashing
• Hash function
  • Index: map data/objects to values (e.g., hash keys)
    • Same data → same hash key (with 100% probability)
    • Different data → different hash keys (with high probability)
  • Retrieval: easy to retrieve identical objects (as they have the same hash key)
  • Applications: hash map, hash join
  • Low cost
    • Space: O(n)
    • Time: O(1)
• Why can't it be used for nearest neighbor search?
  • Even a minor difference leads to totally different hash keys (see the snippet below)
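A tiny illustration of that last point, using Python's built-in (exact) hash on two nearly identical vectors:

```python
# An exact hash gives unrelated keys to near-identical inputs
v1 = (1.000, 2.000, 3.000)
v2 = (1.001, 2.000, 3.000)
print(hash(v1), hash(v2))  # two completely unrelated integers
```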
Locality Sensitive Hashing
• Index: make the hash functions error tolerant
  • Similar data → same hash key (with high probability)
  • Dissimilar data → different hash keys (with high probability)
• Retrieval (a generic skeleton follows below):
  • Compute the hash key for the query
  • Obtain all the data that has the same key as the query (i.e., the candidates)
  • Find the nearest candidate to the query
• Cost:
  • Space: O(n)
  • Time: O(1) + O(|cand|)
• This is not yet the real Locality Sensitive Hashing!
  • We still have several unsolved issues…
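The index/retrieval procedure above can be sketched generically; h stands for whichever LSH function is plugged in (MinHash, SimHash, or p-stable LSH, all defined later), and the helper names are illustrative.

```python
from collections import defaultdict

def build_lsh_index(data, h):
    # Index: bucket every object by its (locality sensitive) hash key
    table = defaultdict(list)
    for o in data:
        table[h(o)].append(o)
    return table

def lsh_query(q, table, h, dist):
    # Retrieve the candidates sharing q's key, then scan for the nearest
    candidates = table[h(q)]
    return min(candidates, key=lambda o: dist(q, o), default=None)
```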
LSH Functions
• Formal definition:
  • Given points o₁, o₂, distance thresholds r₁ < r₂, and probabilities p₁ > p₂
  • An LSH function h(·) should satisfy
    • Pr[h(o₁) = h(o₂)] ≥ p₁, if dist(o₁, o₂) ≤ r₁
    • Pr[h(o₁) = h(o₂)] ≤ p₂, if dist(o₁, o₂) > r₂
• What is h(·) for a given distance/similarity function?
  • Jaccard similarity
  • Angular distance
  • Euclidean distance
MinHash - LSH Function for Jaccard Similarity
• Each data object is a set
  • Jaccard(o₁, o₂) = |o₁ ∩ o₂| / |o₁ ∪ o₂|
• Randomly generate a global order for all the elements in C = ∪ᵢ oᵢ
• Let h(o) be the minimal member of o with respect to the global order
  • For example, o = {a, b, d, h, z}; using inverse alphabetical order, the re-ordered o = {z, h, d, b, a}, hence h(o) = z
MinHash
• Now we compute Pr[h(o₁) = h(o₂)]
  • Every element e ∈ o₁ ∪ o₂ has an equal chance to be the first element of o₁ ∪ o₂ after re-ordering
  • If that first element e ∈ o₁ ∩ o₂, then h(o₁) = h(o₂)
  • If e ∉ o₁ ∩ o₂, then h(o₁) ≠ h(o₂)
  • Pr[h(o₁) = h(o₂)] = |o₁ ∩ o₂| / |o₁ ∪ o₂| = Jaccard(o₁, o₂) (estimated numerically below)
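A minimal MinHash sketch: the random global order is simulated by hashing each element together with a random salt, a shortcut rather than the classic permutation construction; averaging collisions over many such functions estimates the Jaccard similarity.

```python
import random

def make_minhash():
    salt = random.random()
    # h(o): the minimal member of set o under a salted pseudo-random order
    return lambda o: min(o, key=lambda e: hash((salt, e)))

o1, o2 = {"a", "b", "d", "h"}, {"b", "d", "h", "z"}
hs = [make_minhash() for _ in range(1000)]
est = sum(h(o1) == h(o2) for h in hs) / len(hs)
print(est)  # ≈ |o1 ∩ o2| / |o1 ∪ o2| = 3/5 = 0.6
```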
SimHash - LSH Function for Angular Distance
• Each data object is a d-dimensional vector
  • θ(x, y) is the angle between x and y
• Randomly generate a normal vector a, where each aᵢ ~ N(0, 1)
• Let h(x; a) = sgn(aᵀx)
  • sgn(v) = 1 if v ≥ 0, and −1 if v < 0
  • i.e., which side of a's corresponding hyperplane x lies on
SimHash
• Now we compute Pr[h(o₁) = h(o₂)]
  • h(o₁) ≠ h(o₂) iff o₁ and o₂ lie on different sides of the hyperplane with a as its normal vector
  • A random hyperplane separates o₁ and o₂ with probability θ(o₁, o₂)/π
  • Pr[h(o₁) = h(o₂)] = 1 − θ(o₁, o₂)/π (checked numerically below)
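A small numerical check of this formula, assuming collisions are averaged over many independently drawn normal vectors:

```python
import numpy as np

def make_simhash(d, rng):
    a = rng.standard_normal(d)           # random hyperplane normal
    return lambda x: 1 if a @ x >= 0 else -1

rng = np.random.default_rng(0)
d = 64
x, y = rng.standard_normal(d), rng.standard_normal(d)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
hs = [make_simhash(d, rng) for _ in range(2000)]
est = sum(h(x) == h(y) for h in hs) / len(hs)
print(est, 1 - theta / np.pi)            # the two values should be close
```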
p-stable LSH - LSH Function for Euclidean Distance
• Each data object is a d-dimensional vector
  • dist(x, y) = √(Σᵢ₌₁ᵈ (xᵢ − yᵢ)²)
• Randomly generate a normal vector a, where each aᵢ ~ N(0, 1)
  • The normal distribution is 2-stable, i.e., if aᵢ ~ N(0, 1), then Σᵢ₌₁ᵈ aᵢ·xᵢ ~ N(0, ‖x‖₂²)
• Let h(x; a, b) = ⌊(aᵀx + b)/w⌋, where b ~ U(0, w) and w is a user-specified parameter
• Pr[h(o₁; a, b) = h(o₂; a, b)] = ∫₀ʷ (1/s) · f(t/s) · (1 − t/w) dt, where s = dist(o₁, o₂)
  • f(·) is the pdf of the absolute value of a standard normal variable
p-stable LSH
• Intuition of p-stable LSH
  • Similar points have a higher chance of being hashed together (illustrated below)
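A sketch of one p-stable hash function and of the intuition that nearby points land in the same bucket more often; the bucket width w below is an arbitrary choice for illustration.

```python
import numpy as np

def make_pstable(d, w, rng):
    a = rng.standard_normal(d)   # 2-stable random projection vector
    b = rng.uniform(0, w)        # random offset in [0, w)
    return lambda x: int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(1)
d, w = 32, 4.0
x = rng.standard_normal(d)
near = x + 0.1 * rng.standard_normal(d)   # a close neighbor of x
far = x + 2.0 * rng.standard_normal(d)    # a distant point
hs = [make_pstable(d, w, rng) for _ in range(2000)]
print(sum(h(x) == h(near) for h in hs) / len(hs))  # high collision rate
print(sum(h(x) == h(far) for h in hs) / len(hs))   # much lower rate
```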
Pr[h(x) = h(y)] for Different Hash Functions
(Figures: collision probability curves against distance/similarity for MinHash, SimHash, and p-stable LSH)
Problem of a Single Hash Function
• Hard to distinguish two pairs whose distances are close to each other
  • Pr[h(o₁) = h(o₂)] ≥ p₁, if dist(o₁, o₂) ≤ r₁
  • Pr[h(o₁) = h(o₂)] ≤ p₂, if dist(o₁, o₂) > r₂
• We also want to control where the drastic change in collision probability happens…
  • Close to dist(o*, q)
  • Given a range
AND-OR Composition
• Recall that for a single hash function, we have
  • Pr[h(o₁) = h(o₂)] = f(dist(o₁, o₂)), denoted as p for short
• Now we consider two scenarios:
  • Combine r hashes together, using the AND operation
    • One must match on all the hashes
    • Pr[H_AND(o₁) = H_AND(o₂)] = p^r
  • Combine b hashes together, using the OR operation
    • One needs to match on at least one of the hashes
    • Pr[H_OR(o₁) = H_OR(o₂)] = 1 − (1 − p)^b
    • They fail to match only when none of the hashes match (computed below)
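The two compositions are typically nested: b OR-groups, each an AND of r hashes (the classic banding scheme). The composed collision probability then has an S-shape with a sharp jump near p ≈ (1/b)^(1/r), which is ≈ 0.55 for the r = 5, b = 20 used below.

```python
def collision_prob(p, r, b):
    # Match at least one of b bands, each band ANDing r hash functions
    return 1 - (1 - p**r) ** b

for p in (0.2, 0.4, 0.6, 0.8):
    print(p, round(collision_prob(p, r=5, b=20), 4))
# prints 0.0064, 0.186, 0.8019, 0.9996 -> a sharp S-curve
```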