Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search Sariel Har-Peled Sepideh Mahabadi UIUC MIT
Nearest Neighbor Problem
Nearest Neighbor
• Dataset of n points P in a metric space (X, d_X), e.g. ℝ^d
• A query point q comes online
• Goal:
─ Find the nearest data point p*
─ Do it in sub-linear time and small space
Approximate Nearest Neighbor
• Dataset of n points P in a metric space (X, d_X), e.g. ℝ^d
• A query point q comes online
• Goal: find the nearest data point p*, in sub-linear time and small space
• Approximate Nearest Neighbor (c-ANN):
─ If the optimal distance is r, report a point within distance cr, for c = 1 + ε
─ For Hamming (and ℓ1) the query time is O(n^{1/c}) [IM98], and for Euclidean (ℓ2) it is O(n^{1/c²}) [AI08]
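The hashing schemes behind these query times are not spelled out in the slides; as a toy illustration of the [IM98]-style approach for Hamming space, the sketch below uses bit-sampling LSH (all function names and parameter choices here are hypothetical, picked for readability, not taken from the talk):

```python
import random

def build_tables(points, num_tables=20, bits_per_hash=8, seed=0):
    """Index binary points into hash tables keyed by randomly sampled bit positions."""
    rng = random.Random(seed)
    d = len(points[0])
    tables = []
    for _ in range(num_tables):
        positions = [rng.randrange(d) for _ in range(bits_per_hash)]
        buckets = {}
        for idx, p in enumerate(points):
            key = tuple(p[i] for i in positions)
            buckets.setdefault(key, []).append(idx)
        tables.append((positions, buckets))
    return tables

def query(tables, points, q, r, c=2):
    """Return the index of any point within Hamming distance c*r of q, or None."""
    for positions, buckets in tables:
        key = tuple(q[i] for i in positions)
        for idx in buckets.get(key, []):
            dist = sum(a != b for a, b in zip(points[idx], q))
            if dist <= c * r:  # the filter guarantees any reported point is a c-ANN
                return idx
    return None
```

Points that agree with the query on the sampled bits collide; the final distance check ensures only a point within distance cr is ever reported.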
Applications of NN
• Searching for the closest object
Robust NN Problem
Robustness
The data points are:
• Corrupted, noisy
─ e.g. image denoising
• Incomplete
─ e.g. recommendation systems: a sparse users × movies rating matrix with mostly missing entries
• Irrelevant
─ e.g. an occluded image
The Robust NN problem
• Dataset of n points P in ℝ^d
• A parameter k
• A query point q comes online
• Find the closest point after removing k coordinates (chosen to minimize the distance)

Example (n = 3, k = 2), query q = (1, 2, 1, 5):
─ p1 = (3, 4, 0, 5), dist = 1
─ p2 = (3, 2, 1, 2), dist = 0
─ p3 = (2, 3, 3, 1), dist = 2

• A different set of coordinates may be removed for different points
• Applying ANN naively over all choices of removed coordinates would require (d choose k) ≈ d^k data structures
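The example above can be checked directly: under ℓ1, the robust distance simply drops the k largest coordinate differences. A minimal brute-force sketch (a linear scan for illustration only, not the talk's sub-linear data structure):

```python
def robust_l1(p, q, k):
    """l1 distance after ignoring the k coordinates that differ most."""
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return sum(diffs[:len(diffs) - k])  # drop the k largest differences

def robust_nn(points, q, k):
    """Brute-force robust nearest neighbor: scan all points."""
    return min(points, key=lambda p: robust_l1(p, q, k))
```

On the slide's example with k = 2 and q = (1, 2, 1, 5), the three distances come out to 1, 0, and 2, and the scan returns p2.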
Budgeted Version
• Dataset of n points P in ℝ^d
• d weights w = (w1, w2, …, wd) ∈ [0,1]^d
• A query point q comes online
• Find the closest point after removing a set of coordinates B of weight at most 1

Example (n = 3), w = (0.5, 0.5, 0.8, 0.3), query q = (1, 2, 5, 5):
─ p1 = (1, 4, 0, 3), dist = 4
─ p2 = (3, 2, 4, 2), dist = 1
─ p3 = (4, 6, 3, 4), dist = 3
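The budgeted distances above can be verified by exhausting over coordinate subsets. This reference implementation is exponential in d and is only meant to pin down the definition, not to reflect the talk's data structure:

```python
from itertools import combinations

def budgeted_dist(p, q, w, budget, eps=1e-9):
    """l1 distance after removing a coordinate set of weight <= budget.

    Exhaustive over all subsets of coordinates -- exponential in d,
    for illustration only. eps guards against floating-point weight sums.
    """
    d = len(q)
    best = float("inf")
    for m in range(d + 1):
        for removed in combinations(range(d), m):
            if sum(w[i] for i in removed) <= budget + eps:
                rest = sum(abs(p[i] - q[i]) for i in range(d) if i not in removed)
                best = min(best, rest)
    return best
```

With w = (0.5, 0.5, 0.8, 0.3) and budget 1, the best removals are {coord 3} for p1, {coords 1, 4} for p2, and {coords 1, 2} for p3, matching the distances 4, 1, 3 on the slide.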
Results
Bicriterion approximation, for the ℓ1 norm
• Suppose that for some p* ∈ P we have dist(q, p*) = r after ignoring k coordinates
• For ε ∈ (0,1):
─ Report a point p s.t. dist(q, p) = O(r/ε) after ignoring O(k/ε) coordinates
─ Query time equals n^ε queries in a 2-ANN data structure

Why not a single criterion?
• It would be equivalent to exact near neighbor in Hamming space: there is a point within distance r of the query iff there is a point within distance 0 after ignoring k = r coordinates
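The equivalence behind the hardness remark is easy to sanity-check on binary vectors: agreeing everywhere except on at most r coordinates is exactly the same as having Hamming distance at most r. A small exhaustive check (helper names are mine):

```python
from itertools import product

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def dist_after_ignoring(p, q, k):
    """l1 distance once the k largest coordinate differences are dropped."""
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return sum(diffs[:max(0, len(diffs) - k)])

def check_equivalence(r, d=4):
    """Over all pairs of d-bit vectors: Hamming distance <= r
    iff distance 0 after ignoring k = r coordinates."""
    for p in product([0, 1], repeat=d):
        for q in product([0, 1], repeat=d):
            assert (hamming(p, q) <= r) == (dist_after_ignoring(p, q, r) == 0)
    return True
```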
Results

| Query type              | distance            | #ignored coordinates | #Queries          | ANN used    |
| Opt                     | r                   | k                    | —                 | —           |
| ℓ1                      | O(r/ε)              | O(k/ε)               | n^ε               | 2-ANN       |
| ℓp                      | O(r(c + 1/ε)^{1/p}) | O(k(c + 1/ε))        | n^ε               | c^{1/p}-ANN |
| ℓ1, (1+δ)-approximation | r(1 + δ)            | O(k/(δε))            | O(n^ε/δ)          | (1+δ)-ANN   |
| Budgeted version        | O(r)                | weight O(1)          | n^ε + O(n^ε d^4)  | 2-ANN       |
Algorithm
High Level Algorithm
Theorem. If for a point p* ∈ P, the ℓ1 distance of q and p* is at most r after removing k coordinates, there exists an algorithm which reports a point p whose distance to q is O(r/ε) after removing O(k/ε) coordinates.
• Cannot apply randomized dimensionality reduction, e.g. Johnson-Lindenstrauss: it mixes the few corrupted coordinates into every projected coordinate
• Instead, use a set of randomized maps f1, f2, …, fm : ℝ^d → ℝ^{d'} such that:
─ All of them map points far from the query to far points
─ At least one of them maps a close point to a close point
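To get a feel for why a family of random maps can work, here is a back-of-the-envelope calculation. Suppose, purely as a hypothetical simplification of the construction (the actual maps in the paper are more refined), that each map drops each coordinate independently with probability 1/2. Then a map "succeeds" on the close point p* if it drops all k corrupted coordinates, and a moderate number of maps makes some success very likely:

```python
def prob_some_map_drops_all(k, m, drop_p=0.5):
    """Probability that at least one of m independent random maps drops
    all k corrupted coordinates, when each map drops each coordinate
    independently with probability drop_p (illustrative choice only)."""
    single = drop_p ** k             # one fixed map drops all k bad coordinates
    return 1 - (1 - single) ** m     # at least one of the m maps succeeds
```

For k = 3 corrupted coordinates and m = 100 maps, the success probability already exceeds 0.99; the far points remain far under every map, since dropping coordinates can only be analyzed against the whole far set.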