

  1. Diverse Near Neighbor Problem
     Sofiane Abbar (QCRI), Sihem Amer-Yahia (CNRS), Piotr Indyk (MIT), Sepideh Mahabadi (MIT), Kasturi R. Varadarajan (UIowa)

  2. Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report one neighbor of q, if there is any
       – Neighbor: a point within distance r of the query
     Applications
     • Of major importance in databases (document, image, video), information retrieval, pattern recognition
     • The object of interest is represented as a point; similarity is measured as distance

  3. Motivation
     Search: how many answers?
     • Small output size, e.g. 10
       – Reporting the k nearest neighbors may not be informative (they could be near-identical texts)
     • Large output size
       – Time to retrieve them is high
     • Want a small output size that is relevant and diverse
       – Good to have a result from each cluster, i.e., the result should be diverse

  4. Diverse Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report the k most diverse neighbors of q
       – Neighbor: a point within distance r of the query; we use the Hamming distance
       – Diversity: div(S) = min_{p, p' ∈ S} ||p − p'|| (see the sketch below)
     • Goal: report a set S such that
       – S ⊆ P ∩ B(q, r)
       – |S| = k
       – div(S) is maximized
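
     The two quantities the later slides rely on, the Hamming distance and div(S), are easy to state in code. A minimal Python sketch follows; representing points as 0/1 tuples and the helper names are assumptions for illustration, not the authors' code.

     ```python
     from itertools import combinations

     def hamming(p, pp):
         """Hamming distance between two equal-length 0/1 vectors."""
         return sum(a != b for a, b in zip(p, pp))

     def diversity(S, dist=hamming):
         """div(S): the minimum pairwise distance within S (0 if |S| < 2)."""
         return min((dist(p, pp) for p, pp in combinations(S, 2)), default=0)

     # Example: three 4-bit points; div = 1 because the first two differ in one bit.
     print(diversity([(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1)]))  # -> 1
     ```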

  5. Approximation
     • Want sublinear query time, so we need to approximate
     • Approximate NN:
       – B(q, r) → B(q, cr) for some value of c > 1
       – Result: query time of O(d · n^{1/c})
     • Approximate Diverse NN:
       – Bi-criterion approximation: distance and diversity
       – (c, β)-approximate k-diverse near neighbor
       – Let S* be the optimum solution within B(q, r)
       – Report approximate neighbors S:
         • S ⊆ B(q, cr)
         • The diversity approximates the optimum diversity: div(S) ≥ (1/β) · div(S*), for some β ≥ 1

  6. Results

                                Algorithm A                                              Algorithm B
     Distance apx. factor       c > 2                                                    c > 1
     Diversity apx. factor α    6                                                        6
     Space                      O((n log k)^{1+1/(c−1)} + nd)                            O(log k · n^{1+1/c} + nd)
     Query time                 O((k² + (d/r) log n) · (log k)^{c/(c−1)} · n^{1/(c−1)})  O((k² + (d/r) log n) · log k · n^{1/c})

     • Algorithm A was earlier introduced in [Abbar, Amer-Yahia, Indyk, Mahabadi, WWW'13]

  7. Techniques

  8. Computing k-diversity: GMM
     • Given n points, compute the size-k subset with maximum diversity
     • Exact: NP-hard to approximate better than a factor of 2 [Ravi et al.]
     • GMM algorithm [Ravi et al.] [Gonzalez] (sketched below):
       – Choose an arbitrary point
       – Repeat k−1 times:
         • Add the point whose minimum distance to the currently chosen points is maximized
     • Achieves approximation factor 2
     • Running time of the algorithm is O(kn)
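
     A minimal Python sketch of the GMM greedy step just described; `points`, `k`, and `dist` are placeholder names (e.g. the `hamming` helper above), not the paper's code.

     ```python
     import random

     def gmm(points, k, dist):
         """Greedy max-min (Gonzalez) selection: a 2-approximation for k-diversity.

         Picks an arbitrary seed, then k-1 times adds the point whose minimum
         distance to the already chosen points is largest.  O(k*n) distance calls.
         """
         if not points or k <= 0:
             return []
         chosen = [random.choice(points)]
         # current minimum distance from every candidate to the chosen set
         mindist = [dist(p, chosen[0]) for p in points]
         while len(chosen) < min(k, len(points)):
             far = max(range(len(points)), key=lambda i: mindist[i])
             chosen.append(points[far])
             mindist = [min(mindist[i], dist(points[i], points[far]))
                        for i in range(len(points))]
         return chosen
     ```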

  9. Locality Sensitive Hashing (LSH)
     • LSH: close points have a higher probability of collision than far points
     • Hash functions g_1, …, g_L, where g_i = <h_{i,1}, …, h_{i,t}>
       – Each h_{i,j} ∈ H is chosen randomly
       – H is a family of hash functions that is (P1, P2, r, cr)-sensitive:
         • If ||p − p'|| ≤ r then Pr[h(p) = h(p')] ≥ P1
         • If ||p − p'|| ≥ cr then Pr[h(p) = h(p')] ≤ P2
     • Example, Hamming distance (sketched below):
       – h(p) = p_i, i.e., the i-th bit of p
       – This family is (1 − r/d, 1 − cr/d, r, cr)-sensitive
     • L and t are the parameters of LSH
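
     A hedged sketch of the Hamming-distance LSH family above: each of the L functions g_i concatenates t randomly sampled bit positions (sampled with replacement here for simplicity; the function names are illustrative).

     ```python
     import random

     def make_lsh_functions(d, t, L, seed=0):
         """Sample L functions g_1..g_L; g_i is a list of t coordinates of a d-bit point."""
         rng = random.Random(seed)
         return [[rng.randrange(d) for _ in range(t)] for _ in range(L)]

     def bucket_key(p, g):
         """g(p): read off the sampled bits; equal keys mean a collision in this table."""
         return tuple(p[j] for j in g)
     ```

     A single sampled bit collides for two points at Hamming distance x with probability 1 − x/d, which is exactly the (1 − r/d, 1 − cr/d, r, cr)-sensitivity stated on the slide.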

  10. LSH-based Naïve Algorithm [Indyk, Motwani]
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – The total number of outliers is at most 3L
         • Outlier: a point farther than cr from the query point
     • Algorithm (a query-time sketch follows the list):
       – Build one array of buckets A_1, …, A_L, one per hash function
       – For a query q, compute g_1(q), …, g_L(q)
       – Retrieve the possible neighbors: S = ⋃_{i=1}^{L} A_i[g_i(q)]
       – Remove the outliers: S = S ∩ B(q, cr)
       – Report the approximate k most diverse points of S, i.e., GMM(S)
     • Achieves a (c, 2)-approximation
     • But the running time may be linear in n
       – Should prune the buckets before collecting them
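
     A sketch of this naïve query flow, reusing `make_lsh_functions`, `bucket_key`, `hamming`, and `gmm` from the earlier sketches (all illustrative names, not the paper's code).

     ```python
     def build_tables(points, hash_funcs):
         """Index: one dictionary of buckets per hash function g_i."""
         tables = [{} for _ in hash_funcs]
         for p in points:
             for table, g in zip(tables, hash_funcs):
                 table.setdefault(bucket_key(p, g), []).append(p)
         return tables

     def naive_diverse_nn(q, k, c, r, tables, hash_funcs, dist=hamming):
         """Collect all L buckets of q, drop outliers (> c*r), return GMM of the rest."""
         S = []
         for table, g in zip(tables, hash_funcs):
             S.extend(table.get(bucket_key(q, g), []))
         S = [p for p in S if dist(p, q) <= c * r]  # remove the outliers
         return gmm(S, k, dist)
     ```

     Because the buckets are collected whole, |S| can be on the order of n, which is exactly the linear-time issue the slide points out.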

  11. Core-sets
     • Core-sets [Agarwal, Har-Peled, Varadarajan]: a subset of a point set S that represents it
       – Approximately determines the solution to an optimization problem
       – Composes: a union of coresets is a coreset of the union
       – β-coreset: approximates the cost up to a factor of β
     • Our optimization problem: finding the k-diversity of S
       – Instead we consider the k-center (KC) cost of S:
         • KC(S, S') = max_{p ∈ S} min_{p' ∈ S'} ||p − p'||
         • KC_k(S) = min_{S' ⊆ S, |S'| = k} KC(S, S')
       – The KC cost 2-approximates diversity:
         • KC_{k−1}(S) ≤ div_k(S) ≤ 2 · KC_{k−1}(S)
     • GMM computes a 1/3-coreset for the KC cost
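
     A small sketch of the KC cost defined above, with a brute-force KC_k that enumerates every size-k center set, so it is only meant for checking the 2-approximation relation on tiny examples; `hamming` is the helper from the earlier sketch.

     ```python
     from itertools import combinations

     def kc_cost(S, centers, dist=hamming):
         """KC(S, S'): largest distance from a point of S to its nearest center in S'."""
         return max(min(dist(p, c) for c in centers) for p in S)

     def kc_k(S, k, dist=hamming):
         """KC_k(S): best k-center cost over all size-k subsets of S (exponential!)."""
         return min(kc_cost(S, list(C), dist) for C in combinations(S, k))
     ```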

  12. Algorithms

  13. Algorithm A
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – There is no outlier
     • No need to keep all the points in each bucket, just keep a coreset!
       – Keep a 1/3-coreset of each bucket A_i[j]: store A'_i[j] = GMM(A_i[j])
     • Given query q (see the sketch after this list):
       – Retrieve the coresets from the buckets: S = ⋃_{i=1}^{L} A'_i[g_i(q)]
       – Run GMM(S)
       – Report the result
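
     A sketch of Algorithm A on top of the earlier helpers, storing GMM of each bucket as its coreset; the coreset size k below is an illustrative choice, the exact size is dictated by the paper's analysis.

     ```python
     def build_index_a(points, hash_funcs, k, dist=hamming):
         """Preprocessing: hash all points, then keep only a GMM coreset per bucket."""
         tables = build_tables(points, hash_funcs)
         return [{key: gmm(bucket, k, dist) for key, bucket in table.items()}
                 for table in tables]

     def query_a(q, k, coresets, hash_funcs, dist=hamming):
         """Query: union the per-bucket coresets of q, then one final GMM call."""
         S = []
         for table, g in zip(coresets, hash_funcs):
             S.extend(table.get(bucket_key(q, g), []))
         return gmm(S, k, dist)
     ```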

  14. Analysis
     • Achieves a (c, 6)-approximation
       – A union of 1/3-coresets is a 1/3-coreset for the union
       – The last GMM call adds another factor-2 approximation
     • Only works if we set L and t s.t. there is no outlier in S with constant probability
       – Space: O(nL) = O((n log k)^{1+1/(c−1)} + nd)
       – Time: O(Lk²) = O((k² + (d/r) log n) · (log k)^{c/(c−1)} · n^{1/(c−1)})
       – Only makes sense for c > 2
     • Not optimal:
       – The ANN query time is O(d · n^{1/c})
       – So if we want to improve over these bounds, we should be able to deal with outliers

  15. Robust Core-sets
     • S' is an m-robust β-coreset for S if
       – for any set O of outliers of size at most m,
       – (S' \ O) is a β-coreset for (S \ O)
     • Peeling algorithm [Agarwal, Har-Peled, Yu '06] [Varadarajan, Xiao '12] (sketched below):
       – Repeat (m + 1) times:
         • Compute a β-coreset of S
         • Add its points to the coreset S'
         • Remove them from the set S
     • Note: if we keep the points of S' in the order in which we find them, then the first (m' + 1)·k points also form an m'-robust β-coreset
       – Example (k = 2): S' = {3, 5; 2, 9; 1, 6} is a 2-robust coreset; its first two groups form a 1-robust coreset
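
     A sketch of the peeling construction, with GMM as the β-coreset subroutine (so each peeled layer has k points); the resulting robust coreset keeps the layers in peeling order, as the note above requires. Names reuse the earlier illustrative helpers.

     ```python
     def robust_coreset(S, k, m, dist=hamming):
         """Peel m+1 GMM layers of k points each; their concatenation, kept in
         peeling order, is the robust coreset stored for a bucket."""
         remaining = list(S)
         coreset = []
         for _ in range(m + 1):
             if not remaining:
                 break
             layer = gmm(remaining, k, dist)
             coreset.extend(layer)
             remaining = [p for p in remaining if p not in layer]
         return coreset
     ```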

  16. Algorithm B
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – The total number of outliers is at most 3L
     • For each bucket A_i[j], keep a 3L-robust 1/3-coreset A'_i[j], which has size (3L + 1)·k
     • For query q (see the sketch after this list):
       – For each bucket A'_i[g_i(q)]:
         • Find the smallest ℓ s.t. the first ℓ·k points contain fewer than ℓ outliers
         • Add those ℓ·k points to S
       – Remove the outliers from S
       – Return GMM(S)
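
     A sketch of the query step of Algorithm B, assuming each bucket stores the output of `robust_coreset` above: prefixes of ℓ·k stored points are scanned until fewer than ℓ outliers are seen, then the surviving points are pooled and GMM is run once.

     ```python
     def query_b(q, k, c, r, robust_tables, hash_funcs, dist=hamming):
         """For each bucket take the shortest prefix of ell*k points that contains
         fewer than ell outliers; pool the prefixes, strip the outliers, run GMM."""
         S = []
         for table, g in zip(robust_tables, hash_funcs):
             ordered = table.get(bucket_key(q, g), [])
             ell = 1
             while True:
                 prefix = ordered[: ell * k]
                 outliers = sum(1 for p in prefix if dist(p, q) > c * r)
                 if outliers < ell or len(prefix) == len(ordered):
                     S.extend(prefix)
                     break
                 ell += 1
         S = [p for p in S if dist(p, q) <= c * r]  # remove the outliers
         return gmm(S, k, dist)
     ```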

  17. Example and Analysis
     • Total number of outliers ≤ 3L, so |S| ≤ O(Lk)
     • Time: O(Lk²) = O((k² + (d/r) log n) · log k · n^{1/c})
     • Space: O(nL) = O(log k · n^{1+1/c} + nd)
     • Achieves a (c, 6)-approximation for the same reason as Algorithm A

  18. Conclusion

                                Algorithm A          Algorithm B       ANN
     Distance apx. factor       c > 2                c > 1             c > 1
     Diversity apx. factor α    6                    6                 -
     Space                      ~ n^{1+1/(c−1)}      ~ n^{1+1/c}       n^{1+1/c}
     Query time                 ~ d · n^{1/(c−1)}    ~ d · n^{1/c}     d · n^{1/c}

     Further work
     • Improve the diversity factor α
     • Consider other definitions of diversity, e.g., sum of distances

  19. Thank You!
