
Similarity Search in Multimedia Databases - PDF document

  1. Advanced Technology Seminar: Similarity Search in Multimedia Databases
     Daniel A. Keim and Benjamin Bustos
     Databases, Data Mining, and Visualization, University of Konstanz, Germany
     E-mail: {keim|bustos}@informatik.uni-konstanz.de
     http://dbvis.inf.uni-konstanz.de/

     Overview
     1. Introduction
     2. Efficiency
     3. Effectiveness
     4. Applications
     5. Future research

     Introduction: Many application domains
     – Molecular biology
     – Medicine
     – Manufacturing industry
     – Geography
     – And many others…

     Introduction: Multimedia data is heterogeneous!
     – Text
     – Image
     – Audio & video

     Introduction: Content-based retrieval in multimedia databases
     – Two approaches for retrieval in multimedia databases:
       • Object annotation (meta information): describes the content of the multimedia object
       • The object itself: the representation is the multimedia object itself
     – Exact search is not meaningful: Similarity Search!

     Introduction: Example of content-based retrieval [YI99]

  2. Introduction: Content-based retrieval in multimedia databases is a difficult problem!
     – Orientation
     – Level-of-detail

     Introduction: Multimedia databases
     – Involves different areas in Computer Science

     Introduction: Basic Approach to Similarity Search

     Introduction: Modeling multimedia data
     – Metric space [CNB+01]
     – Vector space [BBK01]
     Nomenclature

     Introduction: Modeling multimedia data: Metric space
     – Measure of distance between objects
     – Properties of a metric

     Introduction: Distance functions
     – Minkowski
     – Weighted Minkowski, Mahalanobis, etc.
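The Minkowski family named on the distance-function slide can be sketched as follows. The slide's own formulas are not preserved in this text, so the code uses the textbook form, d_p(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p), with p = 1 (Manhattan), p = 2 (Euclidean), and p = infinity (maximum). The function name `minkowski`, its parameters, and the way weights are applied for p = infinity are illustrative assumptions.

```python
import math
from typing import Optional, Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float = 2.0,
              weights: Optional[Sequence[float]] = None) -> float:
    """(Weighted) Minkowski distance between two equal-length vectors.

    Assumption: weights are non-negative and multiply each |x_i - y_i|^p term;
    for p = inf they multiply |x_i - y_i| directly (one common convention).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    if p < 1:
        raise ValueError("p must be >= 1 for the triangle inequality to hold")
    if weights is None:
        weights = [1.0] * len(x)
    if math.isinf(p):
        # Maximum (Chebyshev) distance: the largest coordinate difference.
        return max((w * abs(a - b) for a, b, w in zip(x, y, weights)), default=0.0)
    total = sum(w * abs(a - b) ** p for a, b, w in zip(x, y, weights))
    return total ** (1.0 / p)


if __name__ == "__main__":
    a, b = (0.0, 0.0), (3.0, 4.0)
    print(minkowski(a, b, p=1))            # Manhattan
    print(minkowski(a, b, p=2))            # Euclidean
    print(minkowski(a, b, p=math.inf))     # Maximum
```

With unit weights and p >= 1 this function is a metric on the vector space, which is what lets the metric indices later in the deck prune with the triangle inequality.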

  3. Introduction: Similarity queries: Range query

     Introduction: Similarity queries: k-Nearest Neighbor Query
     – Returns an answer set C such that

     Introduction: Multimedia Content Descriptor Interface (MPEG-7)
     – MPEG-7 is a standard that describes multimedia content data
     URL: http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm

     Introduction: Main elements of MPEG-7 standard
     – Description tools
       • Descriptors
       • Description schemes
     – Description definition language (DDL)
     – System tools
       • Text format (searching and editing)
       • Binary format (efficient storage and transmission)

     Overview
     1. Introduction
     2. Efficiency
        i. Efficiency considerations
        ii. Spatial access methods
        iii. Metric indices
        iv. Approximate and probabilistic approaches
     3. Effectiveness
     4. Applications
     5. Future research
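The two query types can be illustrated with a linear scan. Because the slide formulas are missing from this text, the sketch assumes the usual definitions: a range query (q, r) returns every object within distance r of q, and a k-nearest-neighbor query returns a set C of k objects such that no object outside C is closer to q than an object inside C. The function names and the choice of Euclidean distance are illustrative.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Sequence[float], Sequence[float]], float]


def range_query(db: List[Point], q: Point, r: float,
                dist: Distance = math.dist) -> List[Point]:
    """Return all objects whose distance to q is at most r (linear scan)."""
    return [o for o in db if dist(q, o) <= r]


def knn_query(db: List[Point], q: Point, k: int,
              dist: Distance = math.dist) -> List[Point]:
    """Return the k objects closest to q, nearest first (ties broken arbitrarily)."""
    return heapq.nsmallest(k, db, key=lambda o: dist(q, o))


if __name__ == "__main__":
    db = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
    print(range_query(db, (0.0, 0.0), 1.5))
    print(knn_query(db, (2.6, 0.0), 2))
```

A linear scan costs one distance computation per object; the index structures in the Efficiency part of the deck aim to answer the same queries with fewer.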

  4. Efficiency Considerations: Effects in high-dimensional spaces
     – Exponential dependency of measures on the dimension
     – Boundary effects
     – No geometric imagination: intuition fails
     "Curse of dimensionality"

     Efficiency Considerations: Notations and assumptions [BBK01]
     – D dimensions
     – Size of the database = N
     – Data space normalized to [0,1]^D
     – Uniformly distributed data

     Efficiency Considerations: Exponential growth of volume
     – Hypercube
     – Hypersphere

     Efficiency Considerations: The surface is everything!
     – Probability that a point is closer than 0.1 to a (D-1)-dimensional surface

     Efficiency Considerations: Number of surfaces
     – How many k-dimensional surfaces does a D-dimensional hypercube [0..1]^D have?

     Efficiency Considerations: "Each circle touching all boundaries includes the center point" is false!
     – D-dimensional cube [0,1]^D
     – cp = (0.5, 0.5, ..., 0.5), p = (0.3, 0.3, ..., 0.3)
     – 16-D: circle (p, 0.7), distance (p, cp) = 0.8!!!
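The quantities these slides ask about can be computed directly. The closed forms below are standard results rather than text recovered from the slides, and they assume uniformly distributed points in [0,1]^D as stated in the notation slide: the inner cube that stays at least 0.1 away from every face has volume 0.8^D, a D-cube has C(D,k) * 2^(D-k) faces of dimension k, and a D-ball of radius r has volume pi^(D/2) r^D / Gamma(D/2 + 1). All function names are illustrative.

```python
import math


def hypersphere_volume(radius: float, d: int) -> float:
    """Volume of a d-dimensional ball of the given radius."""
    return math.pi ** (d / 2) * radius ** d / math.gamma(d / 2 + 1)


def prob_near_surface(d: int, margin: float = 0.1) -> float:
    """Probability that a uniform point in [0,1]^d lies within `margin`
    of some (d-1)-dimensional face: 1 minus the inner cube's volume."""
    return 1.0 - (1.0 - 2.0 * margin) ** d


def count_k_faces(d: int, k: int) -> int:
    """Number of k-dimensional faces of a d-dimensional hypercube."""
    return math.comb(d, k) * 2 ** (d - k)


def distance_to_center(d: int, coord: float) -> float:
    """Euclidean distance from (coord, ..., coord) to the cube center (0.5, ..., 0.5)."""
    return math.sqrt(d * (0.5 - coord) ** 2)


if __name__ == "__main__":
    for d in (2, 16, 100):
        print(d, prob_near_surface(d), hypersphere_volume(0.5, d))
    print(count_k_faces(16, 15))
    # Slide example: the sphere around p = (0.3, ..., 0.3) with radius 0.7
    # reaches every face of the 16-D unit cube, yet the center is farther away.
    print(distance_to_center(16, 0.3) > 0.7)
```

The loop prints how quickly almost all of the cube's volume moves next to its surface while the inscribed ball's volume collapses, which is the effect the "surface is everything" slide points at.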

  5. Efficiency Considerations: Database specific effects
     – Selectivity of range queries: depends on the volume of the query

     Efficiency Considerations: Database specific effects
     – Data pages have large extensions
     – Most data pages touch the surface of the data space on most sides

     Efficiency Considerations: How to express useful queries in high-dimensional spaces?
     • Medium to very high dimensionality (20-1000)
     • Meaningful queries are difficult to express
     – Observations
       • Not all dimensions are equally relevant for a given query
       • Multiple meaningful NNs exist for different search metrics

     Efficiency Considerations: What do meaningful distance distributions look like?
     – Histograms describing some statistical properties
     [Figure panels: all 10 dimensions are relevant; 9 of 10 dimensions are relevant; 8 of 10 dimensions are relevant]

     Efficiency Considerations: Effects in metric spaces [CNB+01]

     Efficiency: Spatial access methods
     – High-dimensional indexing methods [BBK01]
     – Hierarchical index structures

  6. Efficiency: Spatial access methods: Minimum bounding rectangles
     – kd-tree directory
       • kd-B-tree [Rob81]
       • LSD^h-tree [Hen98]
     – R-tree variations
       • R-tree [Gut84]
       • R+-tree [SRF87]
       • R*-tree [BKS+90]
       • X-tree [BKK96]

     Efficiency: Spatial access methods: kd-B-tree [Rob81]
     • Hyperrectangle-shaped page regions
     • An adaptive kd-tree is used for space partitioning
     • Complete and disjoint partitioning

     Efficiency: Spatial access methods: R-tree [Gut84]
     • Solid minimum bounding rectangles (MBR)
     • Space partitioning is neither complete nor disjoint
     • Overlapping regions are allowed

     Efficiency: Spatial access methods: X-tree [BKK96]
     • Avoids overlap in the directory by using:
       - Overlap-free split
       - Supernodes

     Efficiency: Spatial access methods: Bounding spheres and combined regions
     – SS-tree [WJ96b]
     – SR-tree [KS97]

     Efficiency: Spatial access methods: Other structures
     – TV-tree [LJF94]
     – Space filling curves [Sag94]
     – Pyramid technique [BBK98]
     Example: Pyramid technique

  7. Efficiency: Spatial access methods: GEMINI: Generic Multimedia object INdexING [Fal96]
     1. Determine distance function D between two objects
     2. Find numerical feature-extraction functions
     3. Prove that distance in feature space is a lower-bound of D
     4. Use an index to store and retrieve feature vectors

     Efficiency: Metric indices: Indexing metric spaces [CNB+01]
     Querying:
     • Traverse index and discard classes (internal complexity)
     • Search in candidate classes (external complexity)

     Efficiency: Metric indices: Complexity of the search
     – Usually measured as the number of distance computations
     – Other costs (I/O, CPU) are neglected
     Two main indexing approaches
     – Pivot-based indexing
     – Indexing based on compact partitions

     Efficiency: Metric indices: Pivot-based indexing
     • Set of k pivots
     • Distance lower bound
     • Exclusion condition for
     Example using 1 pivot

     Efficiency: Metric indices: Metric trees based on pivots
     – Burkhard-Keller Tree [BK73]
     – Vantage Point Tree [Yia93]
     – Fixed Queries Tree [BCM+94]
     – Fixed-Height Queries Tree [BCM+94]
     – Multi Vantage Point Tree [BO97]
     Array representations of trees
     – Spaghettis [CMB99]
     – Fixed Queries Array [CMN01]

     Efficiency: Metric indices: Other structures
     – Approximating and Eliminating Search Algorithm (AESA) [Vid86]
     – Linear AESA [MOV94]
     Pivot selection techniques [BNC03]
     – Random selection
     – Maximize mean distribution of
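Pivot-based exclusion for a range query can be sketched with a flat table of precomputed object-to-pivot distances. The slide's formulas are not preserved here, so the sketch relies on the standard consequence of the triangle inequality: for any pivot p, |d(q, p) - d(o, p)| is a lower bound on d(q, o), so an object o can be discarded for a query (q, r) whenever that bound exceeds r for some pivot. The class `PivotTable` and its methods are hypothetical names, and the flat table is a simplification, not one of the specific tree or array structures listed on the slides.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Sequence[float], Sequence[float]], float]


class PivotTable:
    """Stores d(o, p) for every object o and pivot p (illustrative sketch)."""

    def __init__(self, objects: List[Point], pivots: List[Point],
                 dist: Distance = math.dist) -> None:
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precomputed at build time, so these do not count against a query.
        self.table = [[dist(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q: Point, r: float) -> Tuple[List[Point], int]:
        """Return (objects within distance r of q, distance computations used)."""
        q_to_pivots = [self.dist(q, p) for p in self.pivots]
        computations = len(self.pivots)
        result = []
        for obj, row in zip(self.objects, self.table):
            # Best lower bound on d(q, obj) over all pivots.
            lower = max((abs(dq - do) for dq, do in zip(q_to_pivots, row)),
                        default=0.0)
            if lower > r:
                continue  # excluded without computing d(q, obj)
            computations += 1
            if self.dist(q, obj) <= r:
                result.append(obj)
        return result, computations


if __name__ == "__main__":
    objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
    index = PivotTable(objects, pivots=[(0.0, 0.0)])
    print(index.range_query((1.0, 0.0), 1.5))
```

The returned count follows the slides' cost model, where search complexity is measured in distance computations: distances from q to the pivots plus one per object that survives the exclusion test.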

  8. Efficiency: Metric indices: Indexing based on compact partitions
     – Divides the space in compact zones
     Criteria for partitioning the space
     – Hyperplane partition
     – Covering radius

     Efficiency: Metric indices: Hyperplane criterion
     Search algorithm for (q,r):
     • Compute distances between centers and q
     • Let c be the closest center to q
     • Exclusion condition:
     • For q1, the algorithm discards the zone of c4
     • For q2, the algorithm discards the zones of c1 and c2

     Efficiency: Metric indices: Covering radius criterion
     – Covering radius: maximum distance from a center to an object from its zone
     – Exclusion criterion:
     Example: for q1, the zone of c cannot be discarded, but for q2 it is discarded

     Efficiency: Metric indices: M-tree [CPZ97]
     – Based on the covering radius criterion
     – Good I/O performance and few distance computations

     Efficiency: Metric indices
     Hyperplane criterion
     – Generalized-Hyperplane Tree [Uhl91]
     Covering radius criterion
     – Bisector Tree (BST) [KM83]
     – Voronoi Tree [DN87]
     – Monotonous BST [NVZ92]
     – M-Tree [CPZ97]
     – List of Clusters [CN00]
     Mixed criteria
     – Geometric Near-neighbor Access Tree [Bri95]
     – Spatial Approximation Tree [Nav02]

     Efficiency: Approximate and probabilistic approaches
     Approximately correct NN
     – Trade-off between performance efficiency and quality of the approximation
     – (1+ε)-approximate NN: distance is within a factor (1+ε) of the distance to the true NN
     – Time-bounded search: retrieve similar objects in a fixed amount of time
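Both exclusion tests for compact partitions can be sketched over a single-level partition in which every object belongs to the zone of its nearest center. The slide formulas are missing from this text, so the conditions below are the standard ones and should be read as assumptions: with c the center closest to q, the hyperplane criterion discards the zone of c_i when d(q, c_i) - r > d(q, c) + r, and the covering radius criterion discards it when d(q, c_i) - r > cr(c_i). The functions `build_zones` and `range_query` are hypothetical and do not reproduce any specific structure from the slides, such as the M-tree.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Distance = Callable[[Sequence[float], Sequence[float]], float]


def build_zones(objects: List[Point], centers: List[Point],
                dist: Distance = math.dist):
    """Assign each object to its nearest center and record covering radii."""
    zones: Dict[Point, List[Point]] = {c: [] for c in centers}
    for obj in objects:
        nearest = min(centers, key=lambda c: dist(obj, c))
        zones[nearest].append(obj)
    radii = {c: max((dist(o, c) for o in members), default=0.0)
             for c, members in zones.items()}
    return zones, radii


def range_query(zones, radii, q: Point, r: float, dist: Distance = math.dist):
    """Return (objects within r of q, centers whose zones had to be scanned)."""
    d_q = {c: dist(q, c) for c in zones}
    closest = min(d_q.values())
    result, visited = [], []
    for c, members in zones.items():
        if d_q[c] - r > closest + r:
            continue  # hyperplane criterion: query ball stays on the other side
        if d_q[c] - r > radii[c]:
            continue  # covering radius criterion: query ball misses the zone
        visited.append(c)
        result.extend(o for o in members if dist(q, o) <= r)
    return result, visited


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    objects = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0),
               (10.0, 0.0), (9.0, 1.0), (11.0, 0.0)]
    zones, radii = build_zones(objects, centers)
    print(range_query(zones, radii, (1.0, 1.0), 1.5))
```

Either test alone is safe; applying both discards a zone as soon as one of them succeeds, which roughly corresponds to what the slides call mixed criteria.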
