DBRank 2011, August 29, 2011, Seattle, WA
Ludwig-Maximilians-Universität München, Department Institute for Informatics, Database Systems Group

BeyOND – Unleashing BOND
Thomas Bernecker, Franz Graf, Hans-Peter Kriegel, Christian Moennig and Arthur Zimek
Ludwig-Maximilians-Universität München (LMU), Munich, Germany
http://www.dbs.ifi.lmu.de
{bernecker, graf, kriegel, zimek}@dbs.ifi.lmu.de
moennig@cip.ifi.lmu.de

Outline
1. Background
   – Motivation: k-nearest neighbor search in high-dimensional databases
   – BOND revisited
2. Introducing BeyOND
   – Filtering objects via distance approximations
   – Sub Cubes, MBRs
3. Experimental Evaluation
4. Conclusions

Motivation
• Similarity search in high-dimensional space is
  – important for images, e-commerce, etc.
  – slow
• The suitability of index-based solutions depends on the data distribution
• Open question: relevant vs. irrelevant attributes
• Similarity search in subspaces:
  – Fix query attributes beforehand
  – Use multiple pivot points to derive upper and lower bounds
  – Process data vertically to reduce the high-dimensional space

BOND Revisited (1)
• BOND [1]: k-nearest neighbor search on high-dimensional data
  – Resolves feature vectors (FVs) column-wise
  – Ranking of columns w.r.t. relevance
  – Pruning of columns using a branch-and-bound approach
  – Resolved part is known exactly
  – Unresolved part has to be approximated
  – Resolving stops when the approximation is "good enough"
  – Support of subspace queries
  – Distance metrics:
    • Histogram intersection (uncorrelated dimensions)
    • Euclidean distance
[1] de Vries, Mamoulis, Nes, Kersten: Efficient k-NN Search On Vertically Decomposed Data (SIGMOD'02)

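The column-wise resolve-and-prune idea can be illustrated with a minimal Python sketch. It is not the implementation from [1]: it assumes squared Euclidean distance, values normalized to [0, 1], and a caller-supplied column order, and the names `bond_knn`, `columns` and `order` are hypothetical. After each resolved column, a candidate is dropped when its lower bound (the exactly known partial distance) exceeds the k-th smallest upper bound.

```python
def bond_knn(columns, q, k, order=None):
    """Column-wise k-NN sketch; columns[d][i] is the value of object i in dimension d.

    Assumes all values (data and query) lie in [0, 1].
    """
    dims = list(order) if order is not None else list(range(len(q)))
    # Worst-case squared contribution of each dimension while it is unresolved.
    worst = {d: max(q[d], 1.0 - q[d]) ** 2 for d in dims}
    n = len(columns[0])
    partial = {i: 0.0 for i in range(n)}  # exact distance on the resolved part
    for pos, d in enumerate(dims):
        for i in partial:
            partial[i] += (q[d] - columns[d][i]) ** 2
        # Upper-bound slack that every candidate still carries.
        rest = sum(worst[d2] for d2 in dims[pos + 1:])
        if len(partial) > k:
            # k-th smallest upper bound = k-th smallest partial sum + rest.
            threshold = sorted(partial.values())[k - 1] + rest
            # Keep only candidates whose lower bound does not exceed it.
            partial = {i: s for i, s in partial.items() if s <= threshold}
    return sorted(partial, key=partial.get)[:k]


# Four objects in two dimensions, stored column by column.
columns = [[0.1, 0.9, 0.15, 0.5],
           [0.2, 0.9, 0.10, 0.5]]
query = [0.1, 0.1]
result = bond_knn(columns, query, k=2)
```

Because all surviving candidates share the same set of unresolved dimensions, the slack `rest` is identical for each of them, which is why a single threshold per column suffices in this sketch.
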
BOND Revisited (2)
• Restrictions of BOND:
  1. The approach works only on Zipfian-distributed data.
  2. The feature values are normalized to [0,1] in each dimension.
  3. The proposed bounds are loose. The validity of stricter bounds (BondAdvanced) depends on a certain resolve order of the columns.

BOND Revisited (3)
• Notation:
  – Query vector $q$, database vector $v$
  – Splitting of $v$: resolved part $v^-$, unresolved part $v^+$ ⇒ $v = v^- \cup v^+$
• Approximated distance: $S_{approx}(q, v) = S_1(q^-, v^-) + S_2(q^+, v^+)$
  – Resolved part: $S_1(q^-, v^-) = \sum_i (q_i^- - v_i^-)^2$
  – Unresolved part: $S_2(q^+, v^+) = \sum_i \max\{q_i^+, 1 - q_i^+\}^2 \ge S_1(q^+, v^+)$
• Distance bounds:
  – $S_{upper}(q, v) = S_1(q^-, v^-) + S_2(q^+, v^+) \ge S_1(q, v)$
  – $S_{lower}(q, v) = S_1(q^-, v^-) + 0 \le S_1(q, v)$

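As a worked illustration of these bounds, the following Python sketch evaluates $S_{lower}$ and $S_{upper}$ for a partially resolved vector. The function name `bond_bounds` and the index-set argument `resolved` are assumptions made for this example. It relies on the normalization of all values to [0, 1].

```python
def bond_bounds(q, v, resolved):
    """Return (S_lower, S_upper) for the squared Euclidean distance.

    q, v: sequences with values in [0, 1]; resolved: indices already read from v.
    """
    resolved = set(resolved)
    # S_1 on the resolved part: exact squared differences.
    s1 = sum((q[i] - v[i]) ** 2 for i in resolved)
    # S_2 on the unresolved part: the farthest a value in [0, 1] can be from q_i.
    s2 = sum(max(q[i], 1.0 - q[i]) ** 2
             for i in range(len(q)) if i not in resolved)
    return s1, s1 + s2


q = [0.2, 0.9, 0.5]
v = [0.1, 0.4, 0.7]
lower, upper = bond_bounds(q, v, resolved=[0])
exact = sum((a - b) ** 2 for a, b in zip(q, v))
```

With only the first dimension resolved, the lower bound is 0.01 and the upper bound is 0.01 + 0.81 + 0.25 = 1.07, while the exact squared distance is 0.30. The gap between the two bounds shows why the slides call them loose.
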
Beyond BOND
• Benefits of BeyOND:
  1. Independence of the data distribution.
  2. No restriction to a normalized data space.
  3. No specific resolve order of the dimensions is needed.
  ⇒ Price: The distance approximations are no longer suitable!
• Solution: Combining the idea of BOND with well-known techniques:
  – VA-file (data space partitioning)
  – MBR (Minimum Bounding Rectangle) approximation (data organizing)
  ⇒ Remaining restriction: minimum/maximum values for each dimension need to be known

Sub Cubes (1)
• First extension: VA-file [2] with one split
  ⇒ $2^d$ sub cubes
  ⇒ Addressing via Z-IDs
  ⇒ Improved bounds based on the close / far sub cube borders $c^{lower}_{v_i}$ and $c^{upper}_{v_i}$
• Memory-efficient representation (8 bytes → 1 bit)
  – Sub cube need not be kept in main memory
• Split positions stored in one separate array per dimension
• Dependence on split level:
  – FV: 8 bytes per dimension
  – s splits: s / 8 bytes (s bits) per dimension
[2] Weber, Schek, Blott: A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces (VLDB'98)

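A small Python sketch may help to picture the one-split case: each dimension contributes one bit that records on which side of the split position the value lies, and the borders of the sub cube are recovered from that bit. The bit layout (first dimension in the most significant bit) and the names `z_id` and `cell_borders` are assumptions for this illustration, not details taken from the slides.

```python
def z_id(v, splits):
    """Encode a feature vector as one bit per dimension (one split level)."""
    bits = 0
    for value, split in zip(v, splits):
        # 0: value lies below the split position, 1: at or above it.
        bits = (bits << 1) | (1 if value >= split else 0)
    return bits


def cell_borders(zid, dim, mins, splits, maxs):
    """Return (c_lower, c_upper) of the sub cube in dimension dim."""
    d = len(splits)
    bit = (zid >> (d - 1 - dim)) & 1
    if bit:
        return splits[dim], maxs[dim]
    return mins[dim], splits[dim]


mins = [0.0, 0.0, 0.0]
maxs = [1.0, 1.0, 1.0]
splits = [0.5, 0.5, 0.5]  # one split position per dimension
zid = z_id([0.2, 0.8, 0.5], splits)
borders = [cell_borders(zid, dim, mins, splits, maxs) for dim in range(3)]
```

Here the three-dimensional vector is reduced to the three bits `011`, which matches the slide's point that an 8-byte coordinate shrinks to one bit per dimension and split.
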
Sub Cubes (2)
• Old distance bounds:
  – $S_{upper}(q, v) = S_1(q^-, v^-) + \sum_i \max\{q_i^+, 1 - q_i^+\}^2$
  – $S_{lower}(q, v) = S_1(q^-, v^-) + 0$
• Approximations of unresolved dimensions:
  – $S_2'(q^+, v^+) = \sum_i \max\{(q_i^+ - c^{lower}_{v_i})^2, (q_i^+ - c^{upper}_{v_i})^2\}$
  – $S_2''(q^+, v^+) = \sum_i \begin{cases} 0 & \text{if } q_i^+ \in [c^{lower}_{v_i}, c^{upper}_{v_i}] \\ \min\{(q_i^+ - c^{lower}_{v_i})^2, (q_i^+ - c^{upper}_{v_i})^2\} & \text{else} \end{cases}$
• New distance bounds:
  – $S_{upper}(q, v) = S_1(q^-, v^-) + S_2'(q^+, v^+) \ge S_1(q, v)$
  – $S_{lower}(q, v) = S_1(q^-, v^-) + S_2''(q^+, v^+) \le S_1(q, v)$

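The new bounds can be evaluated with a short Python sketch. It assumes that the sub cube borders of each unresolved dimension are already available as two lists; the names `subcube_bounds`, `c_lower` and `c_upper` are hypothetical.

```python
def subcube_bounds(q, v, resolved, c_lower, c_upper):
    """Return (S_lower, S_upper) using sub cube borders for unresolved dimensions."""
    resolved = set(resolved)
    s1 = s2_far = s2_near = 0.0
    for i in range(len(q)):
        if i in resolved:
            s1 += (q[i] - v[i]) ** 2  # exact contribution
            continue
        to_low = (q[i] - c_lower[i]) ** 2
        to_high = (q[i] - c_upper[i]) ** 2
        s2_far += max(to_low, to_high)  # farthest border bounds from above
        if not c_lower[i] <= q[i] <= c_upper[i]:
            s2_near += min(to_low, to_high)  # closest border bounds from below
    return s1 + s2_near, s1 + s2_far


q = [0.2, 0.9, 0.5]
v = [0.1, 0.4, 0.7]
# Borders of the sub cube that contains v (one split at 0.5 per dimension).
c_lower = [0.0, 0.0, 0.5]
c_upper = [0.5, 0.5, 1.0]
lower, upper = subcube_bounds(q, v, [0], c_lower, c_upper)
exact = sum((a - b) ** 2 for a, b in zip(q, v))
```

In this example the lower bound rises from 0.01 (resolved part only) to 0.17, because the query value 0.9 lies outside the sub cube interval [0, 0.5] in the second dimension. No normalization to [0, 1] is needed, only the borders.
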
MBR Caching (1)
• Most sub cubes are (very) sparse, i.e. occupied by at most one FV
• Dense sub cubes allow a tighter approximation via MBRs
  – Restrict the number of MBRs in order to avoid a memory overhead
  – Ranking function for MBRs: $f(MBR) = card(MBR) \cdot \frac{V_{sub\,cube}}{V_{MBR}}$
  – 8 byte coordinates: memory increase is limited to $\frac{d \cdot 16}{card(MBR)}$ bytes per feature vector (+ pointer to Z-ID)

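The ranking function and the memory estimate translate directly into a few lines of Python. This is a sketch under assumptions: the candidate tuples and the names `mbr_rank` and `extra_bytes_per_fv` are invented for the example, and degenerate MBRs with zero volume are rejected rather than ranked.

```python
def mbr_rank(card, subcube_volume, mbr_volume):
    """f(MBR) = card(MBR) * V_subcube / V_MBR; larger values rank first."""
    if mbr_volume <= 0:
        raise ValueError("MBR volume must be positive in this sketch")
    return card * subcube_volume / mbr_volume


def extra_bytes_per_fv(d, card):
    """Two 8-byte corner coordinates per dimension, shared by card vectors."""
    return d * 16 / card


# Hypothetical candidates: (name, card(MBR), V_subcube, V_MBR).
candidates = [("A", 10, 1.0, 0.5), ("B", 4, 1.0, 0.1), ("C", 50, 1.0, 1.0)]
ranked = sorted(candidates, key=lambda c: mbr_rank(c[1], c[2], c[3]), reverse=True)
order = [name for name, _, _, _ in ranked]
```

The function favors MBRs that cover many feature vectors and that are small relative to their sub cube, so a tight MBR with few vectors ("B", score 40) can outrank a looser one with more vectors ("A", score 20).
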
MBR Caching (2)
• Limit the number of MBRs to 1% of the database size
• Threshold as a trade-off between pruning power and additional memory consumption
• Requirements:
  – Either all MBRs can be kept in memory,
  – or the time for loading the MBRs is less than the time for resolving the respective FVs.
• Adaptation of the equations for lower and upper bounds

Experimental Evaluation (1)
• Evaluated approaches:
  1. BondAdvanced (stricter bounds, but resolve order dependent)
  2. Bond (original bounds)*
  3. Sequential*
  4. Beyond-1 (1 split)
  5. BeyondMBR-1 (1 split + MBRs)
  6. Beyond-2
  7. BeyondMBR-2
  8. Beyond-3*
  9. BeyondMBR-3*

Experimental Evaluation (2)
• Data set descriptions:

  Data Set    Dims   Size      Type
  ALOI        27     110,250   Color Histograms, Zipfian
  CLUSTERED   20     500,000   Synthetic, 50 Clusters, Gaussian
  PHOG [3]    110    10,715    CT Histograms, PCA'ed
  SIFT [4]    133    335,583   Image Features

[3] Graf, Kriegel, Schubert, Poelsterl, Cavallaro: 2D Image Registration in CT Images Using Radial Image Descriptors (MICCAI'11)
[4] Lowe: Distinctive Image Features from Scale-Invariant Keypoints (Int. Journal of Computer Vision, 2004)

Experimental Evaluation (3)
• Experimental settings:
  – 50 k-nearest neighbor queries
  – k = 10
  – Averaged cumulative number of pruned FVs after resolving a column
  – AUC (area under the curve): data not resolved
  – AOC (area over the curve): data resolved for refinement

Experimental Evaluation (4)
[Plot: ALOI (27 dimensions, 110,250 FVs, Color Histograms, Zipfian). Legend: BondAdvanced, Bond, Beyond-2, Beyond-1, BeyondMBR-1]

Experimental Evaluation (5)
[Plot: CLUSTERED (20 dimensions, 500,000 FVs, Synthetic, 50 Clusters, Gaussian). Legend: BondAdvanced, Bond, Beyond-2, Beyond-1, BeyondMBR-1]

Experimental Evaluation (6)
[Plot: PHOG (110 dimensions, 10,715 FVs, CT Histograms, PCA'ed). Legend: BondAdvanced, Bond, Beyond-2, BeyondMBR-1, Beyond-1]