Diverse Near Neighbor Problem
Sofiane Abbar (QCRI), Sihem Amer-Yahia (CNRS), Piotr Indyk (MIT), Sepideh Mahabadi (MIT), Kasturi R. Varadarajan (UIowa)
Near Neighbor Problem
• Definition
  - Set of $n$ points $P$ in $d$-dimensional space
  - Query point $q$
  - Report one neighbor of $q$ if there is any
    • Neighbor: a point within distance $r$ of the query
• Application
  - Major importance in databases (document, image, video), information retrieval, pattern recognition
  - Object of interest as point
  - Similarity is measured as distance
Motivation
Search: How many answers?
• Small output size, e.g. 10
  - Reporting the $k$ nearest neighbors may not be informative (could be identical texts)
• Large output size
  - Time to retrieve them is high
• Small output size which is relevant and diverse
  - Good to have a result from each cluster, i.e. should be diverse
Diverse Near Neighbor Problem
• Definition
  - Set of $n$ points $P$ in $d$-dimensional space
  - Query point $q$
  - Report the $k$ most diverse neighbors of $q$
• Neighbor:
  - Points within distance $r$ of the query
  - We use Hamming distance
• Diversity:
  - $\mathrm{div}(S) = \min_{p, p' \in S,\ p \neq p'} |p - p'|$
• Goal: report $Q$ (green points), s.t.
  - $Q \subseteq P \cap B(q, r)$
  - $|Q| = k$
  - $\mathrm{div}(Q)$ is maximized
Approximation
• Want sublinear query time, so need to approximate
• Approximate NN:
  - $B(q, r) \to B(q, cr)$ for some value of $c > 1$
  - Result: query time of $O(d\, n^{1/c})$
• Approximate Diverse NN:
  - Bi-criterion approximation: distance and diversity
  - $(c, \alpha)$-Approximate $k$-diverse Near Neighbor
  - Let $Q^*$ (green points) be the optimum solution for $B(q, r)$
  - Report approximate neighbors $Q$ (purple points)
    • $Q \subseteq B(q, cr)$
    • Diversity approximates the optimum diversity: $\mathrm{div}(Q) \ge \frac{1}{\alpha}\, \mathrm{div}(Q^*)$, $\alpha \ge 1$
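Results

|                         | Algorithm A | Algorithm B |
|-------------------------|-------------|-------------|
| Distance apx. factor    | $c > 2$ | $c > 1$ |
| Diversity apx. factor α | 6 | 6 |
| Space                   | $(n \log k)^{1 + 1/(c-1)} + nd$ | $\log k \cdot n^{1 + 1/c} + nd$ |
| Query time              | $\left(k^2 + \frac{\log n}{r}\right) d\, (\log k)^{c/(c-1)}\, n^{1/(c-1)}$ | $\left(k^2 + \frac{\log n}{r}\right) d \cdot \log k \cdot n^{1/c}$ |

• Algorithm A was earlier introduced in [Abbar, Amer-Yahia, Indyk, Mahabadi, WWW'13]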
Techniques
Compute k-diversity: GMM
• Have $n$ points, compute the size-$k$ subset with maximum diversity.
• Exact: NP-hard to approximate better than 2 [Ravi et al.]
• GMM Algorithm [Ravi et al.] [Gonzalez]
  - Choose an arbitrary point
  - Repeat $k-1$ times
    • Add the point whose minimum distance to the currently chosen points is maximized
• Achieves approximation factor 2
• Running time of the algorithm is $O(kn)$
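The greedy procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `hamming`, `gmm` and `div` are chosen here, points are assumed to be distinct tuples of bits, and the "arbitrary" starting point is simply the first one in the list.

```python
from typing import List, Tuple

Point = Tuple[int, ...]

def hamming(a: Point, b: Point) -> int:
    """Number of coordinates in which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def gmm(points: List[Point], k: int) -> List[Point]:
    """Greedy farthest-point selection; returns min(k, len(points)) points."""
    if not points or k <= 0:
        return []
    chosen = [points[0]]  # arbitrary starting point (assumption: the first one)
    # dist[i] = distance from points[i] to its nearest chosen point
    dist = [hamming(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        chosen.append(points[far])
        # one O(n) update per added point gives O(kn) distance evaluations
        dist = [min(dist[i], hamming(points[i], points[far]))
                for i in range(len(points))]
    return chosen

def div(points: List[Point]) -> int:
    """Diversity: minimum pairwise distance (needs at least two points)."""
    return min(hamming(p, q)
               for i, p in enumerate(points) for q in points[i + 1:])

pts = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
print(gmm(pts, 2), div(gmm(pts, 3)))
```

Keeping the `dist` array up to date, rather than recomputing all distances to the chosen set, is what gives the $O(kn)$ bound quoted on the slide.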
Locality Sensitive Hashing (LSH)
• LSH
  - Close points have higher probability of collision than far points
  - Hash functions: $g_1, \dots, g_L$
    • $g_i = \langle h_{i,1}, \dots, h_{i,t} \rangle$
    • $h_{i,j} \in \mathcal{H}$ is chosen randomly
    • $\mathcal{H}$ is a family of hash functions which is $(P_1, P_2, r, cr)$-sensitive:
      - If $\|p - p'\| \le r$ then $\Pr[h(p) = h(p')] \ge P_1$
      - If $\|p - p'\| \ge cr$ then $\Pr[h(p) = h(p')] \le P_2$
• Example: Hamming distance
  - $h(p) = p_i$, i.e., the $i$th bit of $p$
  - Is $(1 - \frac{r}{d},\ 1 - \frac{cr}{d},\ r,\ cr)$-sensitive
  - $L$ and $t$ are parameters of LSH
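For the Hamming example, the hash family is bit sampling, and the tables can be sketched as follows. This is an illustrative sketch only: `make_g`, `build_tables` and `collision_probability` are hypothetical names, the values of `L` and `t` in the usage lines are arbitrary rather than the tuned settings the slides refer to, and points are assumed to be tuples of bits.

```python
import random
from collections import defaultdict

def make_g(d, t, rng):
    """g = <h_1, ..., h_t>, where each h_j reads one random coordinate."""
    coords = [rng.randrange(d) for _ in range(t)]
    def g(p):
        return tuple(p[i] for i in coords)
    return g

def build_tables(points, d, L, t, seed=0):
    """Build L hash tables; table i maps g_i(p) to the points in that bucket."""
    rng = random.Random(seed)
    gs = [make_g(d, t, rng) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in points:
        for g, table in zip(gs, tables):
            table[g(p)].append(p)
    return gs, tables

def collision_probability(p, p2):
    """Exact Pr[h(p) = h(p2)] for one uniformly sampled coordinate:
    the fraction of coordinates where p and p2 agree, i.e. 1 - dist/d."""
    return sum(a == b for a, b in zip(p, p2)) / len(p)

pts = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
gs, tables = build_tables(pts, d=4, L=3, t=2)
print(collision_probability(pts[0], pts[1]))  # distance 2, d = 4
```

Concatenating `t` sampled bits lowers the collision probability of far points faster than that of near points, and using `L` independent tables restores the chance that a true neighbor shares at least one bucket with the query.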
LSH-based Naïve Algorithm [Indyk, Motwani]
• Parameters $L$ and $t$ can be set s.t. with constant probability
  - Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
  - Total number of outliers is at most $3L$
  - Outlier: point farther than $cr$ from the query point
• Algorithm
  - Arrays for each hash function: $A_1, \dots, A_L$
  - For a query $q$ compute
    • Retrieve the possible neighbors: $S = \bigcup_{i=1}^{L} A_i[g_i(q)]$
    • Remove the outliers: $S = S \cap B(q, cr)$
    • Report the approximate $k$ most diverse points of $S$, i.e. GMM($S$)
• Achieves $(c, 2)$-approximation
• Running time may be linear in $n$
  - Should prune the buckets before collecting them
Core-sets
• Core-sets [Agarwal, Har-Peled, Varadarajan]: a subset of a point set $S$ that represents it.
• Approximately determines the solution to an optimization problem
• Composes: a union of coresets is a coreset of the union
• $\beta$-coreset: approximates the cost up to a factor of $\beta$
• Our optimization problem:
  - Finding the $k$-diversity of $S$.
  - Instead we consider finding the K-Center cost of $S$
    • $KC(S, S') = \max_{p \in S} \min_{p' \in S'} |p - p'|$
    • $KC_k(S) = \min_{S' \subseteq S,\ |S'| = k} KC(S, S')$
  - KC cost 2-approximates diversity
    • $KC_{k-1}(S) \le \mathrm{div}_k(S) \le 2 \cdot KC_{k-1}(S)$
• GMM computes a 1/3-coreset for KC-cost
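The relation between K-Center cost and diversity can be checked by brute force on a tiny instance. The sketch below is only a sanity check under assumptions made here: exhaustive search over subsets (exponential, so toy inputs only), Hamming distance, and hypothetical function names `kc`, `kc_k` and `div_k`.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def kc(S, centers):
    """KC(S, S'): largest distance from a point of S to its nearest center."""
    return max(min(hamming(p, c) for c in centers) for p in S)

def kc_k(S, k):
    """KC_k(S): best K-Center cost over all size-k center sets (brute force)."""
    return min(kc(S, C) for C in combinations(S, k))

def div_k(S, k):
    """div_k(S): best minimum pairwise distance over all size-k subsets."""
    return max(min(hamming(a, b) for a, b in combinations(Q, 2))
               for Q in combinations(S, k))

S = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
for k in (2, 3):
    lo, mid = kc_k(S, k - 1), div_k(S, k)
    # the sandwich stated on the slide: KC_{k-1}(S) <= div_k(S) <= 2 KC_{k-1}(S)
    print(k, lo, mid, lo <= mid <= 2 * lo)
```

On this instance the values are $KC_1 = 3$, $\mathrm{div}_2 = 4$ and $KC_2 = 2$, $\mathrm{div}_3 = 2$, so both bounds hold, the second one with equality on the left.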
Algorithms
Algorithm A
• Parameters $L$ and $t$ can be set s.t. with constant probability
  - Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
  - There is no outlier
• No need to keep all the points in each bucket, just keep a coreset!
  - $A'_i[j] = \mathrm{GMM}(A_i[j])$
  - Keep a 1/3-coreset of $A_i[j]$
• Given query $q$
  - Retrieve the coresets from buckets: $S = \bigcup_{i=1}^{L} A'_i[g_i(q)]$
  - Run GMM($S$)
  - Report the result
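Putting the pieces together, Algorithm A might look roughly like the sketch below. It is an illustration under assumptions that are not in the slides: bit-sampling hashes over tuples of bits, arbitrary values of `L` and `t` instead of the tuned ones (so the no-outlier guarantee is not enforced here), and hypothetical names `build_index` and `query`.

```python
import random
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def gmm(points, k):
    """Greedy farthest-point selection of min(k, len(points)) points."""
    if not points or k <= 0:
        return []
    chosen = [points[0]]
    dist = [hamming(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        chosen.append(points[far])
        dist = [min(dist[i], hamming(points[i], points[far]))
                for i in range(len(points))]
    return chosen

def build_index(points, d, L, t, k, seed=0):
    """For each of L tables, store only GMM(bucket, k) instead of the bucket."""
    rng = random.Random(seed)
    coord_sets = [[rng.randrange(d) for _ in range(t)] for _ in range(L)]
    tables = []
    for coords in coord_sets:
        buckets = defaultdict(list)
        for p in points:
            buckets[tuple(p[i] for i in coords)].append(p)
        tables.append({key: gmm(b, k) for key, b in buckets.items()})
    return coord_sets, tables

def query(q, coord_sets, tables, k):
    """Union the stored coresets of q's buckets, then run GMM on the union."""
    S = []
    for coords, table in zip(coord_sets, tables):
        S.extend(table.get(tuple(q[i] for i in coords), []))
    S = list(dict.fromkeys(S))  # drop duplicates, keep order
    return gmm(S, k)

pts = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
coord_sets, tables = build_index(pts, d=4, L=3, t=1, k=2)
print(query((0, 0, 0, 0), coord_sets, tables, k=2))
```

Because every stored bucket holds at most `k` points, the query touches at most `L * k` points regardless of how crowded the original buckets were.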
Analysis
• Achieves $(c, 6)$-approximation
  - Union of 1/3-coresets is a 1/3-coreset for the union
  - The last GMM call adds a 2 approximation factor
• Only works if we set $L$ and $t$ s.t. there is no outlier in $S$ with constant probability
  - Space: $O(nL) = O\left((n \log k)^{1 + 1/(c-1)} + nd\right)$
  - Time: $O(Lk^2) = O\left(\left(k^2 + \frac{\log n}{r}\right) d\, (\log k)^{c/(c-1)}\, n^{1/(c-1)}\right)$
  - Only makes sense for $c > 2$
• Not optimal:
  - ANN query time is $O(d\, n^{1/c})$
  - So if we want to improve over these we should be able to deal with outliers.
Robust Core-sets
• $S'$ is an $\ell$-robust $\beta$-coreset for $S$ if
  - for any set $O$ of outliers of size at most $\ell$,
  - $(S' \setminus O)$ is a $\beta$-coreset for $S \setminus O$
• Peeling Algorithm [Agarwal, Har-Peled, Yu, '06] [Varadarajan, Xiao, '12]: repeat $(\ell + 1)$ times
  - Compute a $\beta$-coreset for $S$
  - Add them to the coreset $S'$
  - Remove them from the set $S$
• Note: if we order the points in $S'$ as we find them, then the first $(\ell' + 1)k$ points also form an $\ell'$-robust $\beta$-coreset.
• Example (figure): a 2-robust coreset $S' = \{3, 5;\ 2, 9;\ 1, 6\}$, with a 1-robust coreset also marked
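The peeling loop can be sketched as below, using GMM as the per-round coreset routine as the earlier slides suggest. The name `peel` and the early exit when the set runs out of points are assumptions of this sketch; points are assumed to be distinct tuples of bits.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def gmm(points, k):
    """Greedy farthest-point selection of min(k, len(points)) points."""
    if not points or k <= 0:
        return []
    chosen = [points[0]]
    dist = [hamming(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        chosen.append(points[far])
        dist = [min(dist[i], hamming(points[i], points[far]))
                for i in range(len(points))]
    return chosen

def peel(points, k, ell):
    """Repeat (ell + 1) times: take a coreset of what is left, append it to
    the output in order, and remove it from the remaining set."""
    remaining = list(points)
    ordered = []
    for _ in range(ell + 1):
        if not remaining:  # assumption: stop early once nothing is left
            break
        layer = gmm(remaining, k)
        ordered.extend(layer)
        taken = set(layer)
        remaining = [p for p in remaining if p not in taken]
    return ordered

pts = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1),
       (1, 1, 0, 0), (0, 1, 0, 1), (1, 0, 1, 0)]
two_robust = peel(pts, k=2, ell=2)
print(two_robust)
```

Since each round only depends on what earlier rounds removed, the output of `peel(pts, k, ell_prime)` is a prefix of `peel(pts, k, ell)` for any `ell_prime <= ell`, which is the ordering property stated in the note on the slide.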
Algorithm B
• Parameters $L$ and $t$ can be set s.t. with constant probability
  - Any neighbor of $q$ falls into the same bucket as $q$ in at least one hash function
  - Total number of outliers is at most $3L$
• For each bucket $A_i[j]$ keep a $3L$-robust 1/3-coreset in $A'_i[j]$, which has size $(3L + 1)k$
• For query $q$
  - For each bucket $A'_i[g_i(q)]$
    • Find the smallest $\ell$ s.t. the first $\ell k$ points contain fewer than $\ell$ outliers
    • Add those $\ell k$ points to $S$
  - Remove outliers from $S$
  - Return GMM($S$)
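The query step of Algorithm B can be sketched as follows, taking as input the ordered robust coresets already retrieved from the query's buckets. The names `inlier_prefix` and `query_b` are hypothetical, and the fallback of using the whole stored list when no prefix passes the test is an assumption of this sketch, not something the slides specify.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def gmm(points, k):
    """Greedy farthest-point selection of min(k, len(points)) points."""
    if not points or k <= 0:
        return []
    chosen = [points[0]]
    dist = [hamming(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        chosen.append(points[far])
        dist = [min(dist[i], hamming(points[i], points[far]))
                for i in range(len(points))]
    return chosen

def inlier_prefix(ordered, q, k, cr):
    """Smallest ell such that the first ell*k points hold fewer than ell
    outliers (points farther than cr from q); returns that prefix's inliers."""
    ell = 1
    while True:
        prefix = ordered[: ell * k]
        outliers = sum(hamming(p, q) > cr for p in prefix)
        # assumption: if the list is exhausted, fall back to all of it
        if outliers < ell or ell * k >= len(ordered):
            return [p for p in prefix if hamming(p, q) <= cr]
        ell += 1

def query_b(buckets, q, k, cr):
    """buckets: one ordered robust coreset per table, already looked up."""
    S = []
    for ordered in buckets:
        S.extend(inlier_prefix(ordered, q, k, cr))
    S = list(dict.fromkeys(S))  # drop duplicates, keep order
    return gmm(S, k)

q = (0, 0, 0, 0)
b1 = [(1, 1, 1, 1), (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)]
b2 = [(0, 0, 0, 0), (1, 0, 0, 0)]
print(query_b([b1, b2], q, k=2, cr=1))
```

The sketch recounts outliers for every prefix for clarity; a single left-to-right scan with a running outlier count would visit each stored point at most once.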
Example and Analysis
• Total # outliers $\le 3L$, so $|S| = O(kL)$
• Time: $O(Lk^2) = O\left(\left(k^2 + \frac{\log n}{r}\right) d \cdot \log k \cdot n^{1/c}\right)$
• Space: $O(nL) = O\left(\log k \cdot n^{1 + 1/c} + nd\right)$
• Achieves $(c, 6)$-approximation for the same reason
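Conclusion

|                         | Algorithm A | Algorithm B | ANN |
|-------------------------|-------------|-------------|-----|
| Distance apx. factor    | $c > 2$ | $c > 1$ | $c > 1$ |
| Diversity apx. factor α | 6 | 6 | - |
| Space                   | $\sim n^{1 + \frac{1}{c-1}}$ | $\sim n^{1 + \frac{1}{c}}$ | $n^{1 + \frac{1}{c}}$ |
| Query time              | $\sim d\, n^{\frac{1}{c-1}}$ | $\sim d\, n^{\frac{1}{c}}$ | $d\, n^{\frac{1}{c}}$ |

Further Work
• Improve diversity factor α
• Consider other definitions of diversity, e.g., sum of distances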
Thank You!