Optimal Multi-Step k-Nearest Neighbor Search
Thomas Seidl and Hans-Peter Kriegel
University of Munich, Germany
ACM SIGMOD ‘98, Seattle

Survey
• Similarity search for complex similarity models
• Analysis of the previous solution for k-nn search
• An optimality criterion for k-nn search
• Optimal algorithm for k-nn search
• Performance analysis
(c) 1998 Thomas Seidl SIGMOD ‘98 - 2

Distance-based Similarity Search
Principle: Small Distances ↔ Strong Similarity
• RangeQuery(q, ε) = $\{\, o \in DB : d(o, q) \le \varepsilon \,\}$
  – depending on ε: no answer or too many answers
• k-NearestNeighborQuery(q, k): $d_q^{-1} : \{1, \dots, k\} \to DB$, monotonous
  – returns the k nearest neighbors, ranked 1st, 2nd, 3rd, 4th, …

Complex Similarity Models
• Quadratic Form Distance Functions: $d_A^2(p, q) = (p - q) \cdot A \cdot (p - q)^T$
  – Color Histograms for Image Databases (QBIC): 256-D histograms (Niblack et al. 93) (Hafner et al. 95)
  – Shape Similarity for 2D and 3D: up to 4,096-D vectors (Thesis Seidl 97)
  – …
• Max-Morphological Distance
  – 2D images: tumor shapes (Korn et al. 96)
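The quadratic form distance above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `quadratic_form_distance` is hypothetical, vectors are plain Python sequences, and the similarity matrix A is assumed to be positive semi-definite so that the square root is defined.

```python
import math

def quadratic_form_distance(p, q, A):
    """Return d_A(p, q) = sqrt((p - q) . A . (p - q)^T).

    Assumption: A is a square, positive semi-definite matrix given as a
    list of rows, and p, q have the same length as A.
    """
    diff = [pi - qi for pi, qi in zip(p, q)]
    n = len(diff)
    # Row vector times matrix: (p - q) . A
    row = [sum(diff[i] * A[i][j] for i in range(n)) for j in range(n)]
    # Multiplying by the column vector (p - q)^T yields the squared distance.
    squared = sum(row[j] * diff[j] for j in range(n))
    return math.sqrt(squared)
```

With the identity matrix this reduces to the Euclidean distance; off-diagonal entries of A express similarity between different dimensions (for example, between similar colors in a histogram). Each evaluation touches every entry of A, so the cost grows quadratically with the dimension.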

Cost of Single Evaluations
• Quadratic Form Distance Functions (chart values):
  dimension               21     64     112    256    1,024   4,096
  evaluation time [msec]  0.23   0.4    1.1    6.2    102     1,656
• Max-Morphological Distance (Korn et al. 96): 12.69 seconds (avg) per distance evaluation

Multi-Step Query Processing
• Multi-Step Similarity Search
  – Range Queries (Faloutsos et al. 94)
  – k-Nearest Neighbor Queries (Korn et al. 96)
• Pipeline: Filter Step (index-based, filter distance $d_f$) → candidates → Refinement Step (exact evaluation, object distance $d_o$) → results
• No False Drops? Lower-Bounding Property: $d_f(p, q) \le d_o(p, q)$
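The filter and refinement steps can be sketched for a range query as follows. This is an illustrative sketch under stated assumptions: a linear scan stands in for the index-based filter step, and the names `filter_distance`, `object_distance`, and `multi_step_range_query` are hypothetical. The example filter drops coordinates, which can only shrink a Euclidean distance, so it satisfies the lower-bounding property.

```python
import math

def filter_distance(o, q):
    # Hypothetical filter d_f: Euclidean distance on the first two coordinates.
    # Dropping coordinates never increases the distance, so d_f <= d_o holds.
    return math.dist(o[:2], q[:2])

def object_distance(o, q):
    # Hypothetical exact distance d_o: Euclidean distance on all coordinates.
    return math.dist(o, q)

def multi_step_range_query(db, q, eps, d_f, d_o):
    # Filter step (a scan here; an index in practice): keep objects with d_f <= eps.
    # Because d_f <= d_o, every true answer survives, so there are no false drops.
    candidates = [o for o in db if d_f(o, q) <= eps]
    # Refinement step: evaluate the expensive exact distance on candidates only.
    results = [o for o in candidates if d_o(o, q) <= eps]
    return results, len(candidates)
```

The second return value counts the exact evaluations, which is the cost the multi-step approach tries to keep small.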

Previous k-nn Algorithm (Korn et al. 96)
Input: query (q, k)
• First Phase: k-nn query on the index ($d_f$); the k primary candidates yield the primary $d_{max}$ ($d_o$)
• Second Phase: range query on the index ($d_f$) with $d_{max}$; the objects returned (>> k) are refined to the final k-nn ($d_o$)
• $d_{max}$ is fixed in the 2nd phase! More candidates are generated than necessary
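The two phases can be sketched as below. This is a hedged reading of the slide, not the code of Korn et al.: sorting and scanning a list stand in for the index queries, and all function names are hypothetical. The example distances are Euclidean, with a filter that uses only the first two coordinates.

```python
import math

def filter_distance(o, q):
    # Hypothetical lower-bounding filter d_f: first two coordinates only.
    return math.dist(o[:2], q[:2])

def object_distance(o, q):
    # Hypothetical exact distance d_o: all coordinates.
    return math.dist(o, q)

def two_phase_knn(db, q, k, d_f, d_o):
    # First phase: k-nn query with respect to the filter distance.
    primary = sorted(db, key=lambda o: d_f(o, q))[:k]
    # Primary d_max: largest exact distance among the k primary candidates.
    d_max = max(d_o(o, q) for o in primary)
    # Second phase: range query on the filter distance with the fixed d_max.
    candidates = [o for o in db if d_f(o, q) <= d_max]
    # Refinement: rank all candidates by exact distance and keep the best k.
    result = sorted(candidates, key=lambda o: d_o(o, q))[:k]
    return result, len(candidates)
```

Since `d_max` never shrinks after the first phase, every object whose filter distance falls below it must be refined, even when closer neighbors have already been found.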

Number of Candidates
[Figure: object distance and filter distance plotted against rank according to filter distance (ranks 1 to 49); horizontal levels mark $d_{max}$ and the k-th object distance.]

Optimality of k-NN Algorithms
Lemma
– Let $d_f$ be a lower-bounding filter of $d_o$: $d_f(p, q) \le d_o(p, q)$
– For a multi-step k-nn algorithm based on $d_o$ and $d_f$, the optimal set of candidates is: $\{\, o \in DB : d_f(o, q) \le \varepsilon_k \,\}$
– where $\varepsilon_k$ is the k-th object similarity distance: $\varepsilon_k = \max \{\, d_o(o, q) : o \in NN_q(k) \,\}$

Optimal k-nn Algorithm (new)
Input: query (q, k)
1 Init ranking on the index ($d_f$)
2 While $d_f(o, q) \le d_{max}$: get the next o from the index and adjust $d_{max}$ ($d_o$)
Result: final k-nn with $d_o(o, q) \le d_{max}$
• $d_{max}$ is adjusted step by step!
• THEOREM: No false drops; no unnecessary candidates
• Required: Incremental Ranking on the index (Hjaltason & Samet 95)
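The loop above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' code: a min-heap over precomputed filter distances stands in for the incremental ranking index (a real index would produce the ranking lazily), k is assumed to be at least 1, and all names are hypothetical.

```python
import heapq
import math

def filter_distance(o, q):
    # Hypothetical lower-bounding filter d_f: first two coordinates only.
    return math.dist(o[:2], q[:2])

def object_distance(o, q):
    # Hypothetical exact distance d_o: all coordinates.
    return math.dist(o, q)

def optimal_multi_step_knn(db, q, k, d_f, d_o):
    # Stand-in for incremental ranking: pop objects in ascending d_f order.
    ranking = [(d_f(o, q), i) for i, o in enumerate(db)]
    heapq.heapify(ranking)
    best = []          # max-heap (negated exact distances) of at most k entries
    d_max = math.inf   # stays infinite until k exact distances are known
    evaluations = 0
    while ranking:
        f_dist, i = heapq.heappop(ranking)
        if f_dist > d_max:
            break  # d_o >= d_f > d_max for this and all remaining objects
        exact = d_o(db[i], q)
        evaluations += 1
        if len(best) < k:
            heapq.heappush(best, (-exact, i))
        elif exact < -best[0][0]:
            heapq.heapreplace(best, (-exact, i))
        if len(best) == k:
            d_max = -best[0][0]  # adjust d_max: current k-th exact distance
    neighbors = [db[i] for _, i in sorted((-neg, i) for neg, i in best)]
    return neighbors, evaluations
```

The early exit relies on the lower-bounding property: once the next filter distance exceeds the current `d_max`, no remaining object can have a smaller exact distance, so no further exact evaluations are needed.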

Minimal Set of Candidates
[Figure: object distance and filter distance plotted against rank according to filter distance (ranks 1 to 49); horizontal levels mark the primary $d_{max}$ and the optimal $d_{max}$.]
The higher the filter distance, the better the filter selectivity

Uniformly Distributed Data (20-D)
Experimental Setting
• 100,000 Objects, 20-D
• Matrices: sim-id, 1-0, 2-2
• Queries: k = 10 (0.01%)
• Index: 15-D
Results (chart values, previous algorithm / optimal algorithm)
  similarity matrix   number of candidates   overall runtime [sec]
  sim-id              26,546 / 370           419 / 16
  sim-1-0             42,891 / 358           664 / 14
  sim-2-2             71,610 / 1,118         1,117 / 48
Avg. Improvement Factors
• Candidates: 72, 120, 64
• Overall Time: 26, 48, 23

2-D Shape Similarity (1,024-D)
Experimental Setting
• 10,000 Images, 32x32 Pixel
• ‘Neighborhood Area’: 9-1
• Queries: k = 5 (0.05%)
• Index (KLT): 16-D, …, 64-D
[Figure: number of candidates and overall runtime [sec] against dimension of index (16-D, 32-D, 48-D, 64-D), previous algorithm vs. optimal algorithm.]
Avg. Improvement Factors
• Candidates: 2.3
• Overall Time: 1.6 to 2.3

Color Histograms (112-D)
Experimental Setting
• 112,700 Histograms (112-D)
• Quadratic Form Distance
• Queries: k = 2,…,12 (0.01%)
• Index (KLT): 12-D
[Figure: number of candidates and overall runtime [sec] against query parameter k (2 to 12), previous algorithm vs. optimal algorithm.]
Avg. Improvement Factors
• Candidates: 17
• Overall Time: 8.5

Conclusions
• Complex Similarity Search: Expensive similarity evaluations
• Multi-Step Approach: Lower-bounding filter distance function
• Optimal Algorithm: Minimum number of exact evaluations
• Average Improvement Factors:
  – up to 120 (number of candidates)
  – up to 48 (overall runtime)
• Future Work: New applications; Integration with Data Mining