

  1. Diverse Near Neighbor Problem
     Sofiane Abbar (QCRI), Sihem Amer-Yahia (CNRS), Piotr Indyk (MIT), Sepideh Mahabadi (MIT), Kasturi R. Varadarajan (UIowa)

  2. Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report one neighbor of q, if there is any
       – Neighbor: a point within distance r of the query
     Applications
     • Of major importance in databases (document, image, video), information retrieval, pattern recognition
     • The object of interest is represented as a point; similarity is measured as distance

  3. Motivation
     Search: how many answers?
     • Small output size, e.g. 10
       – Reporting the k nearest neighbors may not be informative (they could be near-identical texts)
     • Large output size
       – Time to retrieve them is high
     • Want a small output size that is relevant and diverse
       – Good to have a result from each cluster, i.e., the result should be diverse

  4. Diverse Near Neighbor Problem
     Definition
     • Set of n points P in d-dimensional space
     • Query point q
     • Report the k most diverse neighbors of q
       – Neighbor: a point within distance r of the query; we use the Hamming distance
       – Diversity: div(S) = min_{p, p' ∈ S} ||p − p'|| (see the sketch below)
     • Goal: report a set S such that
       – S ⊆ P ∩ B(q, r)
       – |S| = k
       – div(S) is maximized
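
     The two quantities the later slides rely on, the Hamming distance and div(S), are easy to state in code. A minimal Python sketch follows; representing points as 0/1 tuples and the helper names are assumptions for illustration, not the authors' code.

     ```python
     from itertools import combinations

     def hamming(p, pp):
         """Hamming distance between two equal-length 0/1 vectors."""
         return sum(a != b for a, b in zip(p, pp))

     def diversity(S, dist=hamming):
         """div(S): the minimum pairwise distance within S (0 if |S| < 2)."""
         return min((dist(p, pp) for p, pp in combinations(S, 2)), default=0)

     # Example: three 4-bit points; div = 1 because the first two differ in one bit.
     print(diversity([(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1)]))  # -> 1
     ```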

  5. Approximation
     • Want sublinear query time, so we need to approximate
     • Approximate NN:
       – B(q, r) → B(q, cr) for some value of c > 1
       – Result: query time of O(d · n^{1/c})
     • Approximate Diverse NN:
       – Bi-criterion approximation: distance and diversity
       – (c, β)-approximate k-diverse near neighbor
       – Let S* be the optimum solution within B(q, r)
       – Report approximate neighbors S:
         • S ⊆ B(q, cr)
         • The diversity approximates the optimum diversity: div(S) ≥ (1/β) · div(S*), for some β ≥ 1

  6. Results

                                Algorithm A                                              Algorithm B
     Distance apx. factor       c > 2                                                    c > 1
     Diversity apx. factor α    6                                                        6
     Space                      O((n log k)^{1+1/(c−1)} + nd)                            O(log k · n^{1+1/c} + nd)
     Query time                 O((k² + (d/r) log n) · (log k)^{c/(c−1)} · n^{1/(c−1)})  O((k² + (d/r) log n) · log k · n^{1/c})

     • Algorithm A was earlier introduced in [Abbar, Amer-Yahia, Indyk, Mahabadi, WWW'13]

  7. Techniques

  8. Computing k-diversity: GMM
     • Given n points, compute the size-k subset with maximum diversity
     • Exact: NP-hard to approximate better than a factor of 2 [Ravi et al.]
     • GMM algorithm [Ravi et al.] [Gonzalez] (sketched below):
       – Choose an arbitrary point
       – Repeat k−1 times:
         • Add the point whose minimum distance to the currently chosen points is maximized
     • Achieves approximation factor 2
     • Running time of the algorithm is O(kn)
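
     A minimal Python sketch of the GMM greedy step just described; `points`, `k`, and `dist` are placeholder names (e.g. the `hamming` helper above), not the paper's code.

     ```python
     import random

     def gmm(points, k, dist):
         """Greedy max-min (Gonzalez) selection: a 2-approximation for k-diversity.

         Picks an arbitrary seed, then k-1 times adds the point whose minimum
         distance to the already chosen points is largest.  O(k*n) distance calls.
         """
         if not points or k <= 0:
             return []
         chosen = [random.choice(points)]
         # current minimum distance from every candidate to the chosen set
         mindist = [dist(p, chosen[0]) for p in points]
         while len(chosen) < min(k, len(points)):
             far = max(range(len(points)), key=lambda i: mindist[i])
             chosen.append(points[far])
             mindist = [min(mindist[i], dist(points[i], points[far]))
                        for i in range(len(points))]
         return chosen
     ```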

  9. Locality Sensitive Hashing (LSH)
     • LSH: close points have a higher probability of collision than far points
     • Hash functions g_1, …, g_L, where g_i = <h_{i,1}, …, h_{i,t}>
       – Each h_{i,j} ∈ H is chosen randomly
       – H is a family of hash functions that is (P1, P2, r, cr)-sensitive:
         • If ||p − p'|| ≤ r then Pr[h(p) = h(p')] ≥ P1
         • If ||p − p'|| ≥ cr then Pr[h(p) = h(p')] ≤ P2
     • Example, Hamming distance (sketched below):
       – h(p) = p_i, i.e., the i-th bit of p
       – This family is (1 − r/d, 1 − cr/d, r, cr)-sensitive
     • L and t are the parameters of LSH
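
     A hedged sketch of the Hamming-distance LSH family above: each of the L functions g_i concatenates t randomly sampled bit positions (sampled with replacement here for simplicity; the function names are illustrative).

     ```python
     import random

     def make_lsh_functions(d, t, L, seed=0):
         """Sample L functions g_1..g_L; g_i is a list of t coordinates of a d-bit point."""
         rng = random.Random(seed)
         return [[rng.randrange(d) for _ in range(t)] for _ in range(L)]

     def bucket_key(p, g):
         """g(p): read off the sampled bits; equal keys mean a collision in this table."""
         return tuple(p[j] for j in g)
     ```

     A single sampled bit collides for two points at Hamming distance x with probability 1 − x/d, which is exactly the (1 − r/d, 1 − cr/d, r, cr)-sensitivity stated on the slide.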

  10. LSH-based Naïve Algorithm [Indyk, Motwani]
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – The total number of outliers is at most 3L
         • Outlier: a point farther than cr from the query point
     • Algorithm (a query-time sketch follows the list):
       – Build one array of buckets A_1, …, A_L, one per hash function
       – For a query q, compute g_1(q), …, g_L(q)
       – Retrieve the possible neighbors: S = ⋃_{i=1}^{L} A_i[g_i(q)]
       – Remove the outliers: S = S ∩ B(q, cr)
       – Report the approximate k most diverse points of S, i.e., GMM(S)
     • Achieves a (c, 2)-approximation
     • But the running time may be linear in n
       – Should prune the buckets before collecting them
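
     A sketch of this naïve query flow, reusing `make_lsh_functions`, `bucket_key`, `hamming`, and `gmm` from the earlier sketches (all illustrative names, not the paper's code).

     ```python
     def build_tables(points, hash_funcs):
         """Index: one dictionary of buckets per hash function g_i."""
         tables = [{} for _ in hash_funcs]
         for p in points:
             for table, g in zip(tables, hash_funcs):
                 table.setdefault(bucket_key(p, g), []).append(p)
         return tables

     def naive_diverse_nn(q, k, c, r, tables, hash_funcs, dist=hamming):
         """Collect all L buckets of q, drop outliers (> c*r), return GMM of the rest."""
         S = []
         for table, g in zip(tables, hash_funcs):
             S.extend(table.get(bucket_key(q, g), []))
         S = [p for p in S if dist(p, q) <= c * r]  # remove the outliers
         return gmm(S, k, dist)
     ```

     Because the buckets are collected whole, |S| can be on the order of n, which is exactly the linear-time issue the slide points out.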

  11. Core-sets
     • Core-sets [Agarwal, Har-Peled, Varadarajan]: a subset of a point set S that represents it
       – Approximately determines the solution to an optimization problem
       – Composes: a union of coresets is a coreset of the union
       – β-coreset: approximates the cost up to a factor of β
     • Our optimization problem: finding the k-diversity of S
       – Instead we consider the k-center (KC) cost of S:
         • KC(S, S') = max_{p ∈ S} min_{p' ∈ S'} ||p − p'||
         • KC_k(S) = min_{S' ⊆ S, |S'| = k} KC(S, S')
       – The KC cost 2-approximates diversity:
         • KC_{k−1}(S) ≤ div_k(S) ≤ 2 · KC_{k−1}(S)
     • GMM computes a 1/3-coreset for the KC cost
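
     A small sketch of the KC cost defined above, with a brute-force KC_k that enumerates every size-k center set, so it is only meant for checking the 2-approximation relation on tiny examples; `hamming` is the helper from the earlier sketch.

     ```python
     from itertools import combinations

     def kc_cost(S, centers, dist=hamming):
         """KC(S, S'): largest distance from a point of S to its nearest center in S'."""
         return max(min(dist(p, c) for c in centers) for p in S)

     def kc_k(S, k, dist=hamming):
         """KC_k(S): best k-center cost over all size-k subsets of S (exponential!)."""
         return min(kc_cost(S, list(C), dist) for C in combinations(S, k))
     ```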

  12. Algorithms

  13. Algorithm A
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – There is no outlier
     • No need to keep all the points in each bucket, just keep a coreset!
       – Keep a 1/3-coreset of each bucket A_i[j]: store A'_i[j] = GMM(A_i[j])
     • Given query q (see the sketch after this list):
       – Retrieve the coresets from the buckets: S = ⋃_{i=1}^{L} A'_i[g_i(q)]
       – Run GMM(S)
       – Report the result
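
     A sketch of Algorithm A on top of the earlier helpers, storing GMM of each bucket as its coreset; the coreset size k below is an illustrative choice, the exact size is dictated by the paper's analysis.

     ```python
     def build_index_a(points, hash_funcs, k, dist=hamming):
         """Preprocessing: hash all points, then keep only a GMM coreset per bucket."""
         tables = build_tables(points, hash_funcs)
         return [{key: gmm(bucket, k, dist) for key, bucket in table.items()}
                 for table in tables]

     def query_a(q, k, coresets, hash_funcs, dist=hamming):
         """Query: union the per-bucket coresets of q, then one final GMM call."""
         S = []
         for table, g in zip(coresets, hash_funcs):
             S.extend(table.get(bucket_key(q, g), []))
         return gmm(S, k, dist)
     ```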

  14. Analysis
     • Achieves a (c, 6)-approximation
       – A union of 1/3-coresets is a 1/3-coreset for the union
       – The last GMM call adds another factor-2 approximation
     • Only works if we set L and t s.t. there is no outlier in S with constant probability
       – Space: O(nL) = O((n log k)^{1+1/(c−1)} + nd)
       – Time: O(Lk²) = O((k² + (d/r) log n) · (log k)^{c/(c−1)} · n^{1/(c−1)})
       – Only makes sense for c > 2
     • Not optimal:
       – The ANN query time is O(d · n^{1/c})
       – So if we want to improve over these bounds, we should be able to deal with outliers

  15. Robust Core-sets
     • S' is an m-robust β-coreset for S if
       – for any set O of outliers of size at most m,
       – (S' \ O) is a β-coreset for (S \ O)
     • Peeling algorithm [Agarwal, Har-Peled, Yu '06] [Varadarajan, Xiao '12] (sketched below):
       – Repeat (m + 1) times:
         • Compute a β-coreset of S
         • Add its points to the coreset S'
         • Remove them from the set S
     • Note: if we keep the points of S' in the order in which we find them, then the first (m' + 1)·k points also form an m'-robust β-coreset
       – Example (k = 2): S' = {3, 5; 2, 9; 1, 6} is a 2-robust coreset; its first two groups form a 1-robust coreset
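
     A sketch of the peeling construction, with GMM as the β-coreset subroutine (so each peeled layer has k points); the resulting robust coreset keeps the layers in peeling order, as the note above requires. Names reuse the earlier illustrative helpers.

     ```python
     def robust_coreset(S, k, m, dist=hamming):
         """Peel m+1 GMM layers of k points each; their concatenation, kept in
         peeling order, is the robust coreset stored for a bucket."""
         remaining = list(S)
         coreset = []
         for _ in range(m + 1):
             if not remaining:
                 break
             layer = gmm(remaining, k, dist)
             coreset.extend(layer)
             remaining = [p for p in remaining if p not in layer]
         return coreset
     ```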

  16. Algorithm B
     • Parameters L and t can be set s.t., with constant probability:
       – Any neighbor of q falls into the same bucket as q in at least one hash function
       – The total number of outliers is at most 3L
     • For each bucket A_i[j], keep a 3L-robust 1/3-coreset A'_i[j], which has size (3L + 1)·k
     • For query q (see the sketch after this list):
       – For each bucket A'_i[g_i(q)]:
         • Find the smallest ℓ s.t. the first ℓ·k points contain fewer than ℓ outliers
         • Add those ℓ·k points to S
       – Remove the outliers from S
       – Return GMM(S)
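
     A sketch of the query step of Algorithm B, assuming each bucket stores the output of `robust_coreset` above: prefixes of ℓ·k stored points are scanned until fewer than ℓ outliers are seen, then the surviving points are pooled and GMM is run once.

     ```python
     def query_b(q, k, c, r, robust_tables, hash_funcs, dist=hamming):
         """For each bucket take the shortest prefix of ell*k points that contains
         fewer than ell outliers; pool the prefixes, strip the outliers, run GMM."""
         S = []
         for table, g in zip(robust_tables, hash_funcs):
             ordered = table.get(bucket_key(q, g), [])
             ell = 1
             while True:
                 prefix = ordered[: ell * k]
                 outliers = sum(1 for p in prefix if dist(p, q) > c * r)
                 if outliers < ell or len(prefix) == len(ordered):
                     S.extend(prefix)
                     break
                 ell += 1
         S = [p for p in S if dist(p, q) <= c * r]  # remove the outliers
         return gmm(S, k, dist)
     ```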

  17. Example and Analysis
     • Total number of outliers ≤ 3L, so |S| ≤ O(Lk)
     • Time: O(Lk²) = O((k² + (d/r) log n) · log k · n^{1/c})
     • Space: O(nL) = O(log k · n^{1+1/c} + nd)
     • Achieves a (c, 6)-approximation for the same reason as Algorithm A

  18. Conclusion

                                Algorithm A          Algorithm B       ANN
     Distance apx. factor       c > 2                c > 1             c > 1
     Diversity apx. factor α    6                    6                 -
     Space                      ~ n^{1+1/(c−1)}      ~ n^{1+1/c}       n^{1+1/c}
     Query time                 ~ d · n^{1/(c−1)}    ~ d · n^{1/c}     d · n^{1/c}

     Further work
     • Improve the diversity factor α
     • Consider other definitions of diversity, e.g., sum of distances

  19. Thank You!
