
Content-based Similarity Queries on Complex Data: Challenges and Real Applications

31st Brazilian Symposium on Databases, 04-07 October 2016, Salvador, Bahia. Content-based Similarity Queries on Complex Data: Challenges and Real Applications. Agma J. Machado Traina (agma@icmc.usp.br). Outline: Introduction and


  1. The BoVW drawback: The Bag-of-Visual-Words (BoVW) model ignores the spatial information of the visual words in the final representation. [Figure: image, local feature extraction, dictionary of words (A to F), Bag-of-Visual-Words representation.]

  2. Bags-of-Visual-Phrases: The goal is to encode the spatial information of visual words. A more powerful description can be obtained by grouping words. [Figure: image, local feature extraction, dictionary of phrases (e.g., AB, CA, CDF, ABEF), Bag-of-Visual-Phrases representation, compared with the dictionary of words (A to F) and the Bag-of-Visual-Words representation.]

  3. The proposed approach: The 2-grams can be generated by placing a region over each keypoint. Every pair of words formed with the center point is considered a 2-gram. [Figure: a region centered on a keypoint with word B yields 2-grams such as BD, BA, BC.]

  4. The proposed approach: We divided the area into two zones to extract orientation. [Figure: 2-grams extracted per zone: BC, BA, BA and BD, BA, BA.]

  5. The proposed approach. [Figure: 2-gram extraction produces two bags (Bag 1 and Bag 2) of 2-grams such as AB, BC, AC, BB, CC, CA, BA, AA; each bag is counted against the dictionary of 2-grams (AB, AC, AD, BC, BD, CD) to build a Bag-of-2-grams histogram.]
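
The 2-gram extraction described in slides 3 to 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the circular region, the `radius` parameter, the center-word-first ordering, and the function names `extract_2grams` and `bag_of_2grams` are all assumptions made for the example.

```python
from collections import Counter
from math import hypot


def extract_2grams(keypoints, radius):
    """Pair the word of each keypoint with the word of every other keypoint
    that falls inside a circle of `radius` centered on it.

    keypoints: list of (x, y, word) tuples. The circular region and the
    center-word-first ordering are assumptions of this sketch.
    """
    grams = []
    for i, (xc, yc, center_word) in enumerate(keypoints):
        for j, (x, y, word) in enumerate(keypoints):
            if i != j and hypot(x - xc, y - yc) <= radius:
                grams.append(center_word + word)
    return grams


def bag_of_2grams(grams, dictionary):
    """Histogram of 2-gram occurrences over a fixed dictionary of 2-grams."""
    counts = Counter(grams)
    return [counts[g] for g in dictionary]


# Hypothetical keypoints labelled with visual words A to D.
keypoints = [(0, 0, "B"), (1, 0, "A"), (0, 1, "C"), (5, 5, "D")]
grams = extract_2grams(keypoints, radius=1.5)
dictionary = [a + b for a in "ABCD" for b in "ABCD"]
histogram = bag_of_2grams(grams, dictionary)
```

Because the dictionary is fixed, every image yields a histogram of the same length regardless of how many keypoints it has.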

  6. Experimental Results: Dataset evaluated: ImageCLEF 2012 Medical Task, composed of 5,042 bio-medical images classified in 32 categories and 3 levels. Comparative results using an 80-20 classification test. CLD = Color Layout Descriptor; EHD = Edge Histogram Descriptor; CEDD = Color and Edge Directivity Descriptor; BoVW = Bag-of-Visual-Words.

  7. Going further on BoVW: Bag-of-Salience-Points. It is a Bag-of-Words model to retrieve shapes by similarity using salient points.

  8. Image Description: An image is usually represented by color, texture and/or shape. Shape is usually more effective in characterizing the object within an image. It represents the silhouette of the object in the image.

  9. Shape Description: However, developing a shape descriptor is a challenging task in computer vision. The main reason is that the same object may present a rich variability of shapes, while different objects may present shapes with high visual similarity. [Figures: shapes of different objects with high visual similarity; shapes of the same object with different visual appearance.]

  10. Shape Description: Different shape descriptors may target different aspects of the shape. There are many shape descriptors in the literature: Fourier Descriptors, Curvature Scale Space, Fractal Dimension, Moment Invariant, Zernike Moments, and Shape Salience Descriptor. The slide groups them into contour-based approaches, region-based approaches, statistical-based approaches, and approaches that analyze only the salient points.

  11. Motivation: Saliences are the points of highest curvature along the contour. Salient (corner) points encode the most important parts of a shape in a compact way that is invariant to geometric transformations. [Figure: salient (corner) points of some shapes.]

  12. Problems of using salient points as features: How to represent each salient point? How to measure the dissimilarity? [Figure: one shape with 4 saliences and one with 10 saliences.] 1. Dealing with a variable number of features. 2. The need to build a distance function. 3. Dealing with a large number of features.

  13. Bag-of-Salience-Points (BoSP): Method. The idea is to model the representation with a Bag-of-Words approach: a Bag-of-Salience-Points (BoSP) visual dictionary. Advantage: the final feature vector has a fixed dimensionality.

  14. Bag-of-Salience-Points (BoSP): Method

  15. Bag-of-Salience-Points (BoSP): Method. We assume the shape was previously segmented.

  16. Bag-of-Salience-Points (BoSP): Method. We assume the salient points were detected.

  17. Bag-of-Salience-Points (BoSP): Method

  18. Bag-of-Salience-Points (BoSP): Method. Curvature equation (not preserved in this transcript), where s is the arc length of the curve portion. Basically, the curvature equation describes how much the curve bends at a point over a portion of the curve. By varying the value of s we can obtain a multi-scale representation.

  19. BoSP Method: We count how many times each word appears in the shape. Problem: two different shapes can have the same global histogram. Solution: encode the spatial relationship between the visual words in the image. [Figure: two shapes labelled with the words A to D.]

  20. Bag-of-Salience-Points (BoSP): Method

  21. Bag-of-Salience-Points (BoSP): Method

  22. Bag-of-Salience-Points (BoSP): Method

  23. BoSP Method: Encoding the spatial arrangement of visual words. 1) We divide the shape space into equally spaced zones according to the distance from the shape centroid; this example uses 3 zones. 2) We compute a histogram in each zone. Two different shapes can have the same global histogram, but usually not the same distribution of visual words across zones.
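
The zone-based pooling above can be sketched as follows. This is only an illustration under stated assumptions: the centroid and the maximum radius are passed in explicitly, zones are equal-width rings, and the name `zone_histograms` and the sample points are hypothetical.

```python
from math import hypot


def zone_histograms(points, centroid, max_radius, n_words, n_zones):
    """Concatenate one word histogram per ring-shaped zone around the centroid.

    points: list of (x, y, word_id) salient points already mapped to words.
    Zones are equal-width rings between 0 and max_radius (an assumption of
    this sketch); n_zones=1 reduces to the global histogram.
    """
    cx, cy = centroid
    hist = [0] * (n_zones * n_words)
    for x, y, word_id in points:
        d = hypot(x - cx, y - cy)
        zone = min(int(d / max_radius * n_zones), n_zones - 1)
        hist[zone * n_words + word_id] += 1
    return hist


# Two hypothetical shapes with the same words at different distances.
shape_a = [(0.5, 0, 0), (1.5, 0, 1), (0, 2.5, 0), (3, 0, 1)]
shape_b = [(0.5, 0, 1), (1.5, 0, 0), (0, 2.5, 0), (3, 0, 1)]

global_a = zone_histograms(shape_a, (0, 0), 3.0, n_words=2, n_zones=1)
global_b = zone_histograms(shape_b, (0, 0), 3.0, n_words=2, n_zones=1)
zoned_a = zone_histograms(shape_a, (0, 0), 3.0, n_words=2, n_zones=3)
zoned_b = zone_histograms(shape_b, (0, 0), 3.0, n_words=2, n_zones=3)
```

Here the two global histograms are identical while the zoned histograms differ, which is the situation the slide describes.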

  24. BoSP Method: Final representation.

  25. BoSP Method: Parameters: How many words? How many zones?

  26. Experimental Evaluation: To investigate the best values of these two parameters, we explored different values on two databases: Kimia-216 (18 different classes) and MPEG-7 CE-Shape-1 (70 different classes). [Figure: sample shapes of each dataset.]

  27. Experimental Evaluation: These graphs show the mAP values obtained by varying the dictionary size and the number of zones, for the MPEG-7 and Kimia-216 datasets. For Z = 1 we consider only the global histogram. A dictionary size above 20 did not achieve an overall improvement. More than 4 zones did not improve the results.
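
For reference, a minimal sketch of how a mean average precision (mAP) value can be computed from ranked retrieval results. The slides do not state the exact variant used; this version averages precision at each relevant hit within the returned ranking, and the names `average_precision` and `mean_average_precision` are illustrative.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank taken at each relevant result.

    ranked_relevance: list of booleans, True where the result at that rank
    is relevant to the query. Normalizing by the number of relevant results
    retrieved is an assumption of this sketch.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two hypothetical queries with their ranked relevance judgements.
map_value = mean_average_precision([[True, False, True], [False, True]])
```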

  28. Experimental Evaluation: Performance comparison of BoSP with 4 descriptors: MI (Moment Invariants), Fourier (Fourier Descriptor), MS Fractal (Multi-scale Fractal Dimension), and SSD (Shape Salience Descriptor). [Figure: feature vector dimensionality comparison, annotated with "smallest vector" and "Problem: it needs a specific distance function to compute the dissimilarity".]

  29. Experimental Evaluation: Average computational time to compute the dissimilarity. The proposed descriptor (BoSP) is the second fastest descriptor.

  30. Experimental Evaluation: Retrieval performance (precision x recall curves) on the Kimia-216 and MPEG-7 datasets. The proposed descriptor achieved performance similar to the SSD descriptor while being 53% faster when computing the dissimilarity.

  31. Considerations on BoVW approaches: Bag-of-Visual-Phrases -> Bag-of-Salience-Points (BoSP): new feature extraction methods for dealing with shape-based images using salience point features. Problems with salience point features: 1. Dealing with a variable number of features. 2. The need to build a distance function. 3. Dealing with a large number of features. Three interesting points: a multi-scale method to efficiently represent the salience points of a shape; a Dictionary of Curvatures to encode the final shape representation into a single feature vector; a spatial pooling approach to encode the distance distribution of the visual words in the shape space. Experimental results show that the proposed descriptor achieved the best retrieval performance while requiring a low computational cost to measure the dissimilarity.

  32. From Features to Data Structures. [Diagram labels: Data (Content), Information (Structured), Features, Access Methods.]

  33. Complex Data: Advanced database applications must deal with: a large number of data elements (i.e., cardinality); high dimensionality (i.e., number of attributes); complexity of the features that describe the attributes; non-dimensional data (e.g., DNA sequences). Non-dimensional and high-dimensional datasets may consist of thousands of attributes and may be subject to missing values.

  34. Motivation: Missing data can occur due to: preventable errors or mistakes (e.g., failing to appear for a medical exam); problems outside of control (e.g., equipment failure, low battery); privacy or security reasons; legitimate reasons (e.g., a survey question that does not apply to the respondent).

  35. Dealing with missing data: Missing data impacts similarity search for two main reasons. Distance function: how to measure the distance among elements when part of the information is missing? Access methods: they collapse.

  36. Taxonomy: Mechanisms of missingness (Rubin, 1976). Notation: $Y$ is a set of variables; $y_{obs}$ are the fully observed variables; $y_{miss}$ are the variables with missing values; $I$ is the indicator variable. Missing Completely At Random (MCAR): the probability that data are missing is independent of both observed and missing data, $\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I)$. Missing At Random (MAR, ignorable missingness): the probability that data are missing is independent of the missing data, but may be a function of the observed data, $\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{obs})$. Missing Not At Random (MNAR, non-ignorable missingness): occurs when data are missing as a function of the missing values, $\Pr(I \mid y_{obs}, y_{miss}) = \Pr(I \mid y_{miss})$.
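
The three mechanisms can be illustrated with a small simulation. This is a sketch under assumed settings: the probabilities 0.2 and 0.5, the threshold at zero, and the names `missing_indicator`, `mcar`, `mar`, and `mnar` are all hypothetical choices made for the example.

```python
import random


def missing_indicator(y_obs, y_miss, prob, rng):
    """Draw the indicator I for each element: True means y_miss is dropped.

    prob(o, m) stands in for Pr(I | y_obs, y_miss).
    """
    return [rng.random() < prob(o, m) for o, m in zip(y_obs, y_miss)]


def mcar(o, m):
    # Depends on neither value: Pr(I | y_obs, y_miss) = Pr(I).
    return 0.2


def mar(o, m):
    # Depends only on the observed variable: Pr(I | y_obs).
    return 0.5 if o > 0 else 0.0


def mnar(o, m):
    # Depends on the value that goes missing: Pr(I | y_miss).
    return 0.5 if m > 0 else 0.0


rng = random.Random(0)
y_obs = [rng.gauss(0, 1) for _ in range(1000)]
y_miss = [rng.gauss(0, 1) for _ in range(1000)]
i_mcar = missing_indicator(y_obs, y_miss, mcar, rng)
i_mar = missing_indicator(y_obs, y_miss, mar, rng)
i_mnar = missing_indicator(y_obs, y_miss, mnar, rng)
```

Under this MNAR rule only positive values of `y_miss` can be lost, so the surviving values are skewed toward the negative side, which is the kind of bias that makes MNAR non-ignorable.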

  37. Missing data can occur in the raw data. [Figure: original signal, signal with 10% missing data, and signal reconstructed with DWT; axes: normalized NDVI (0 to 1) versus time.]

  38. How to compare the data elements? The distances between objects with missing values are undefined because the differences between the attributes with missing values are unknown. [Figure: Obj A = (A_1, A_2, Null, ..., A_{n-2}, Null, A_n) compared with Obj B = (B_1, B_2, B_3, ..., B_{n-2}, B_{n-1}, B_n).] Note: given any two feature vectors X and Y, the Lp family of distance functions is defined as $L_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$.

  39. How to compare the data elements? The distances between objects with missing values are undefined because the differences between the attributes with missing values are unknown. [Figure: Obj A = (A_1, A_2, Null, ..., A_{n-2}, Null, A_n) compared with Obj B = (B_1, B_2, B_3, ..., B_{n-2}, B_{n-1}, B_n).]
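
A short sketch of why the distance becomes undefined. It uses the standard Lp (Minkowski) distance and represents a Null attribute as NaN, which is an assumption of this example; the name `lp_distance` is illustrative.

```python
import math


def lp_distance(x, y, p=2.0):
    """Standard Lp (Minkowski) distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


complete = lp_distance([0.0, 0.0], [3.0, 4.0])

# One Null attribute, encoded here as NaN, propagates through the sum.
obj_a = [1.0, 2.0, float("nan"), 4.0]
obj_b = [1.0, 0.0, 3.0, 4.0]
undefined = lp_distance(obj_a, obj_b)
is_undefined = math.isnan(undefined)
```

A single missing attribute is enough to leave the whole distance without a usable value, so a similarity query cannot rank the object.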

  40. Fractal Concepts

  41. Fractal Concepts: Fractal: self-similarity property (an object that presents roughly the same characteristics over a large range of scales). [Figure: a line at resolution 1:1, 1:100, and 1:1000000000000.]

  42. Fractals and Intrinsic Dimension - Intuition

  43. Fractals and Intrinsic Dimension - Intuition: Data distribution/behavior and attribute correlation. Box-counting approach (equation not preserved in this transcript). Fast method (linear cost) [Traina_SBBD2000], [Traina_JIDM2011]. Well-suited to complex data; scalable.

  44. Fractals – Examples: Sierpinski triangle. Box-counting approach (equation not preserved in this transcript).

  45. Fractals – Examples. Source: Bureau of the Census, Tiger/Line Precensus Files: 1990 technical documentation.

  46. Fractals – Examples. Source: Bureau of the Census, Tiger/Line Precensus Files: 1990 technical documentation.

  47. Fractal Dimension for Similarity Search: Given a set of n objects in a dataset with a distance function d: $PC(r) = K_p \times r^{D}$. [Figure: fractal dimension of the Sierpinski dataset; log(# pairs within distance r) versus log(r), with slope D.]

  48. Fractal Dimension for Similarity Search: Fractal dimension. Given a set of n objects in a dataset with a distance function d: $PC(r) = K_p \times r^{D}$. The distance exponent is invariant to random sampling, i.e., the power law holds for subsets of the dataset. [Figure: fractal dimension of the Sierpinski dataset; log(# pairs within distance r) versus log(r), with slope D.]
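
The power law PC(r) = K_p × r^D suggests estimating D as the slope of log PC(r) against log r. The sketch below does this with a brute-force pair count, which is quadratic in the number of objects; it is not the linear-cost box-counting method cited in the slides. The names `pair_count`, `fit_slope`, and `distance_exponent`, and the test data, are illustrative.

```python
from math import dist, log


def pair_count(points, r):
    """Number of unordered pairs of points within distance r (brute force)."""
    n = len(points)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if dist(points[i], points[j]) <= r
    )


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def distance_exponent(points, radii):
    """Estimate D in PC(r) = K_p * r^D from the log-log plot.

    Every radius must capture at least one pair, otherwise log(0) fails.
    """
    log_r = [log(r) for r in radii]
    log_pc = [log(pair_count(points, r)) for r in radii]
    return fit_slope(log_r, log_pc)


# Points on a straight line embedded in 2-D: the estimate should be near 1,
# even though the embedding space has 2 attributes.
line = [(float(i), 0.0) for i in range(200)]
d_line = distance_exponent(line, [2.0, 4.0, 8.0, 16.0])
```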

  49. Fractal Dimension for Similarity Search. [Diagram: query objects are submitted to a metric tree. A range query uses a limiting radius r_q around s_q and returns (Oid_1, d(S_1, S_rep)), ..., (Oid_n, d(S_n, S_rep)). A k-NN_q query uses a dynamic radius around s_q (labelled "Diameter of space") and updates the query response (Oid_1, d(S_1, S_rep)), ..., (Oid_k, d(S_k, S_rep)).]

  50. Fractal Dimension for Similarity Search. [Diagram: same setup. The range query uses the limiting radius r_q; the k-NN_q query reduces the dynamic radius until it reaches the final query response (Oid_1, d(S_1, S_rep)), ..., (Oid_k, d(S_k, S_rep)).]

  51. Fractal Concepts

  52. Missing Data Treatment at the Data Level: Pairwise/listwise deletion. Imputation methods (e.g., mean substitution, multiple imputation). Drawbacks: biased results when predicting MNAR data; high cost for more sophisticated techniques. Special treatment is necessary to allow the applications to operate properly on the available data.
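
Two of the data-level treatments named above, listwise deletion and mean substitution, can be sketched as follows. The sketch assumes missing values are encoded as `None` and that every attribute has at least one observed value; the function names are illustrative.

```python
def listwise_delete(rows):
    """Drop every object that has at least one missing attribute."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_substitute(rows):
    """Replace each missing value with the mean of the observed values
    of the same attribute (assumes each attribute is observed at least once)."""
    n_attrs = len(rows[0])
    means = []
    for j in range(n_attrs):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
deleted = listwise_delete(rows)
imputed = mean_substitute(rows)
```

In this small example deletion keeps one object out of three, and substitution pulls the filled-in values toward the attribute mean, which is where the bias under MNAR comes from.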

  53. Missing Data Treatment at the Data Level: Up to now, there is no metric access method that supports similarity search over incomplete datasets.

  54. Metric Access Methods: Metric access methods employ an index structure to organize the objects in a hierarchical tree structure, called a Metric Tree, based on a distance function. The space is divided into regions using a set of chosen objects, called representatives, and their distances to the rest of the objects in the space. [Figure: data objects organized into a Metric Tree, with a representative (Rep) in each region.]

  55. Metric Space – Metric distance: $S$ is the data domain and $d$ is the metric distance. (1) Reflexivity: $d(x, x) = 0$. (2) Non-negativity: $d(x, y) > 0$ for $x \neq y$. (3) Symmetry: $d(y, z) = d(z, y)$. (4) Triangle inequality: $d(x, y) + d(y, z) \geq d(x, z)$.
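
The four properties can be checked empirically on a sample of objects. Passing such a check does not prove a function is a metric, but a failure shows it is not. The function name `is_metric_on`, the tolerance, and the sample points are assumptions of this sketch.

```python
from itertools import product
from math import dist


def is_metric_on(sample, d, tol=1e-12):
    """Check the four metric properties for every triple drawn from sample."""
    for x, y, z in product(sample, repeat=3):
        if abs(d(x, x)) > tol:                      # (1) reflexivity
            return False
        if x != y and d(x, y) <= 0:                 # (2) non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # (3) symmetry
            return False
        if d(x, y) + d(y, z) < d(x, z) - tol:       # (4) triangle inequality
            return False
    return True


def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return dist(a, b) ** 2


sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
euclidean_ok = is_metric_on(sample, dist)
squared_ok = is_metric_on(sample, squared_euclidean)
```

The squared Euclidean distance fails on the three collinear points because 1 + 1 < 4, so the triangle inequality that metric trees rely on for pruning does not hold for it.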

  56. Slim-tree Metric Access Method

  57. Problem definition: Missing data can underestimate or overestimate the distances. When data are MAR: distortion of the index structure. When data are MNAR: skew (distance concentration) of the index structure. [Figure: covering radius around a representative (Rep), showing objects with Null values, complete objects, and distance concentration.]

  58. Missing data considerations: Missing data can underestimate or overestimate the distances. [Figure: three regions of radius r around a representative (Rep): complete data; Missing At Random (sparser data, distortion); Missing Not At Random (distance concentration, skew). Legend: representative, object with Null values, complete object.]

  59. Problem definition: If we ignore the missing attribute values and index the data with missing values: when data are MAR, the index structure is distorted; when data are MNAR, the index structure is skewed. This can cause inconsistency in the data structure, leading to inaccurate query responses. [Figure: covering radius around a representative (Rep), showing objects with Null values, complete objects, and distance concentration.]

  60. Hollow-tree: Goals: investigate the key issues involved when indexing and searching datasets with missing attribute values in metric spaces; identify the effects of each mechanism of missingness on the metric access methods when applied to incomplete datasets; formalize the problem of missing data in metric spaces and propose a "Model of Missingness"; develop new techniques to support similarity search over large and complex datasets with missing values. Result: a Metric Access Method to support similarity search over large and complex datasets with missing attribute values, which is able to index data with missing attribute values, performs the similarity queries on the available data, and searches for complete data as well as data with missing values.

  61. The Hollow-tree Metric Access Method: Built over the Slim-tree platform. A technique that allows indexing objects with missing values. Similarity queries are based on the Fractal Dimension and the local density around the query objects to achieve an accurate query response when missingness is ignorable. It overcomes the limitations of the metric access methods when applied to incomplete datasets.

  62. Building the Hollow-tree. [Diagram: the data objects contain complete objects and objects with Null values; the complete objects are loaded into the Slim-tree, whose leaf nodes are organized around the representatives rep_1, rep_2, rep_3 with radii r_1, r_2, r_3.]

  63. Similarity Queries: There are two types: the range query R_q(s_q, r) and the k-Nearest Neighbor query k-NN_q(s_q, k). [Figure: R_q(s_q, r) with radius r around the query center s_q; k-NN_q(s_q, k) around s_q.]
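
The two query types can be sketched with a linear scan over the dataset. A metric access method answers the same queries while pruning most distance calculations; the scan below only defines what the answer is. The function names and the sample data are illustrative.

```python
import heapq
from math import dist


def range_query(dataset, s_q, r, d=dist):
    """R_q(s_q, r): every object within distance r of the query center."""
    return [s for s in dataset if d(s, s_q) <= r]


def knn_query(dataset, s_q, k, d=dist):
    """k-NN_q(s_q, k): the k objects closest to the query center."""
    return heapq.nsmallest(k, dataset, key=lambda s: d(s, s_q))


dataset = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 5.0)]
in_range = range_query(dataset, (0.0, 0.0), r=1.0)
nearest = knn_query(dataset, (0.0, 0.0), k=3)
```

The range query fixes the radius and lets the result size vary, while the k-NN query fixes the result size and lets the radius vary.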

  64. Building the Hollow-tree. [Diagram: the complete objects are loaded into the Slim-tree first; the objects with Null values are loaded afterwards into the leaf nodes around the representatives rep_1, rep_2, rep_3.]

  65. Building the Hollow-tree: This strategy prevents data with missing values from being promoted as representatives and thus avoids introducing substantial distortion in the internal structure of the index. [Diagram: a missing indicator (TRUE/FALSE) separates the objects; the complete objects are loaded into the Slim-tree first, then the objects with Null values.]

  66. Similarity queries on the Hollow-tree: The queries return two separate lists: a list of complete objects and a list of objects with Null values, each with entries of the form (Oid_1, d(S_1, S_rep)), ..., (Oid_n, d(S_n, S_rep)). [Figure: query center s_q with radius r.]
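
A minimal sketch of a range query that returns two separate lists. It is a linear scan, not the Hollow-tree itself, and it assumes that the distance to an object with Null values is computed over its observed attributes only; the slides say only that queries run "on the available data", so `partial_distance` is a hypothetical choice, as are the other names.

```python
def partial_distance(obj, s_q):
    """Euclidean distance over the attributes observed in obj (None = Null).

    Skipping Null attributes is an assumption of this sketch.
    """
    return sum((a - b) ** 2 for a, b in zip(obj, s_q) if a is not None) ** 0.5


def range_query_two_lists(dataset, s_q, r):
    """Return (complete, with_nulls): (oid, distance) pairs within radius r."""
    complete, with_nulls = [], []
    for oid, obj in dataset:
        d = partial_distance(obj, s_q)
        if d <= r:
            target = with_nulls if None in obj else complete
            target.append((oid, d))
    by_distance = lambda entry: entry[1]
    return sorted(complete, key=by_distance), sorted(with_nulls, key=by_distance)


dataset = [
    (1, (0.0, 0.0)),
    (2, (3.0, 4.0)),
    (3, (None, 1.0)),
    (4, (None, 9.0)),
]
complete, with_nulls = range_query_two_lists(dataset, (0.0, 0.0), r=5.0)
```

Keeping the lists separate lets the caller see which answers rest on partially observed objects, whose distances may be underestimated.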

  67. k-NN_q query for data with missing values (k-NNFM_q): The k-NN_q(s_q, k) query is sensitive to distance concentration around the query center s_q. [Figure: query center s_q with radius r; the query returns a list of complete objects and a list of objects with Null values, with entries of the form (Oid_1, d(S_1, S_rep)), ..., (Oid_{k-1}, d(S_{k-1}, S_rep)).]

  68. Experimental Evaluation. [Pipeline diagram: complete time series datasets (original data) are turned into incomplete time series datasets (MAR/MNAR); 500 query objects are used; features are extracted from the datasets and the query objects with the Discrete Wavelet Transform (20 coefficients); the Euclidean distance is used; indexing uses the Slim-tree; querying is by similarity (k-NN_q, range).]

  69. Experimental Evaluation: Complete datasets: NDVI (Normalized Difference Vegetation Index): real, 108 attributes, 500000 objects; WeathFor (Weather Forecast): synthetic, 128 attributes, 10000 objects. Incomplete datasets [flattened table; columns: MAR data, Nº attributes, % missing data; MNAR data, % missing data]: 2 7 NDVI 20 17 5 2 10 33 NDVI 5 15 49 10 65 20 WeathFor 12 25 82 16 2 8 18 20 5 39 10 WeathFor 15 58 97 97 20 78 97 25

  70. Experimental Results: Precision and recall for RMq queries, Weather dataset. [Figure: two panels, RMq on the Hollow-tree and RMq on the Slim-tree; precision and recall (0 to 1) versus % missing data (0, 2, 5, 10, 15, 20, 25).]

  71. Experimental Results: Efficiency parameters, NDVI (MAR and MNAR). [Figure: average disk accesses, average distance calculations, and total time (sec) versus % missing data, for k-NNq and Rq queries, with MAR and MNAR curves.]

  72. Experimental Results: Efficiency parameters, WeathFor (MAR). [Figure: average disk accesses, average distance calculations, and total time (sec) versus % missing data (0, 2, 5, 10, 15, 20, 25), for k-NNq and Rq queries.]
