
Detecting Duplicates in Geoinformatics: from Intervals and Fuzzy Numbers to General Multi-D Uncertainty



  1. Detecting Duplicates in Geoinformatics: from Intervals and Fuzzy Numbers to General Multi-D Uncertainty
     Scott A. Starks, Luc Longpré, Roberto Araiza, Vladik Kreinovich
     University of Texas at El Paso, El Paso, Texas 79968, USA
     sstarks@utep.edu, vladik@utep.edu
     Hung T. Nguyen
     New Mexico State University, Las Cruces, New Mexico 88003, USA

  2. 1. Outline
     • Fact: geospatial databases often contain duplicate records.
     • What are duplicates: two or more close records representing the same measurement result.
     • Problem: how to detect and delete duplicates.
     • Test case: measurements of anomalies in the Earth’s gravity field that we have compiled.
     • Previously analyzed case: closeness of two points $(x_1, y_1)$ and $(x_2, y_2)$ is described as closeness of both coordinates.
     • What was known: $O(n \cdot \log(n))$ duplication deletion algorithm for this case.
     • New result: we extend this algorithm to the case when closeness is described by an arbitrary metric.

  3. 2. Geospatial Databases: General Description
     • Fact: researchers and practitioners have collected a large amount of geospatial data.
     • Examples: at different geographical points $(x, y)$, geophysicists measure values $d$ of:
       – the gravity fields,
       – the magnetic fields,
       – elevation,
       – reflectivity of electromagnetic energy for a broad range of wavelengths (visible, infrared, and radar).
     • How this data is stored: corresponding records $(x_i, y_i, d_i)$ are stored in a large geospatial database.
     • How this data is used: based on these measurements, geophysicists generate maps and images and derive geophysical models that fit these measurements.

  4. 3. Gravity Measurements: Case Study
     • Typical geophysical data (e.g., remote sensing images):
       – mainly reflect the conditions of the Earth’s surface;
       – cover a reasonably local area.
     • Gravity measurements:
       – gravitation comes from the whole Earth, including deep zones;
       – gravity measurements cover broad areas.
     • Conclusion: gravity measurements are one of the most important sources of information about subsurface structure and physical conditions.

  5. 4. Duplicates: Where They Come From
     • Fact: the existing geospatial databases contain many duplicate points.
     • Reason:
       – databases are rarely formed completely “from scratch”;
       – they are usually built by combining measurements from previous databases;
       – some measurements are represented in several of the combined databases.
     • Conclusion: after combining databases, we get duplicate records.

  6. 5. Why Duplicates Are a Problem
     • Main reason: duplicate values can corrupt the results of statistical data processing and analysis.
     • Example:
       – when we see several measurement results confirming each other,
       – we may get an erroneous impression that this measurement result is more reliable than it actually is.
     • Conclusion: detecting and eliminating duplicates is an important part of assuring and improving the quality of geospatial data.
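The effect described in this example can be illustrated with a minimal sketch. It assumes the usual naive estimate of the standard error of the mean (sample standard deviation divided by $\sqrt{n}$); the measurement values are hypothetical and are not taken from the gravity database.

```python
import math
import statistics

def standard_error(values):
    """Naive standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical measurement values, for illustration only.
original = [10.0, 12.0, 14.0]
# The same three measurements after merging three databases that all contain them.
with_duplicates = original * 3

se_original = standard_error(original)
se_duplicates = standard_error(with_duplicates)

# The duplicated data set reports a smaller standard error,
# although it contains no new information.
print(se_original, se_duplicates)
```

In this sketch the mean is unchanged, but the reported uncertainty shrinks only because the same records are counted several times.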

  7. 6. Duplicates and Related Uncertainty
     • Ideal case: measurement results are simply stored in their original form.
     • In this case: duplicates are identical records, easy to detect and to delete.
     • In reality: databases use different formats and units.
     • Example: the latitude can be stored in degrees (as 32.1345) or in degrees, minutes, and seconds.
     • As a result: when a record $(x_i, y_i, d_i)$ is placed in a database, it is transformed into this database’s format.
     • Fact: transformations are approximate.
     • Result: records representing the same measurement in different formats get transformed into values which correspond to close but not identical points: $(x_i, y_i) \neq (x_j, y_j)$.
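A minimal sketch of why such transformations are approximate, using the latitude 32.1345 from the example above. The helper names `to_dms` and `from_dms` are hypothetical, and the sketch assumes a database that stores whole seconds and non-negative angles.

```python
def to_dms(decimal_degrees):
    """Convert non-negative decimal degrees to (degrees, minutes, seconds),
    rounding to whole seconds (an assumed storage format)."""
    total_seconds = round(decimal_degrees * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return degrees, minutes, seconds

def from_dms(degrees, minutes, seconds):
    """Convert (degrees, minutes, seconds) back to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600

latitude = 32.1345
dms = to_dms(latitude)
round_trip = from_dms(*dms)

# The round trip returns a close but not identical value.
print(dms, round_trip, abs(round_trip - latitude))
```

Two records of the same measurement that passed through different formats would therefore differ slightly in their coordinates.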

  8. 7. Duplicates Corresponding to Interval Uncertainty
     Geophysicists produce a threshold $\varepsilon > 0$ such that $\varepsilon$-close points $(x_i, y_i)$ and $(x_j, y_j)$ are duplicates.
     [Figure: a point with the distance $\varepsilon$ marked in each coordinate direction]
     In other words, if a new point $(x_j, y_j)$ is within a 2D interval $[x_i - \varepsilon, x_i + \varepsilon] \times [y_i - \varepsilon, y_i + \varepsilon]$ centered at one of the existing points $(x_i, y_i)$, then this new point is a duplicate.
     [Figure: the 2D interval around an existing point, with $\varepsilon$ marked on each side]
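The interval criterion above can be sketched as a coordinate-wise check. The function name `is_duplicate`, the sample coordinates, and the threshold value are illustrative assumptions.

```python
def is_duplicate(p, q, eps):
    """Return True if q lies in the 2D interval
    [x_i - eps, x_i + eps] x [y_i - eps, y_i + eps] centered at p = (x_i, y_i)."""
    return abs(p[0] - q[0]) <= eps and abs(p[1] - q[1]) <= eps

existing = (32.1345, -106.4850)   # hypothetical existing point (x_i, y_i)
close_point = (32.1344, -106.4851)  # differs by about 0.0001 in each coordinate
far_point = (32.1345, -106.4800)    # differs by about 0.005 in the second coordinate
eps = 0.001                         # hypothetical threshold

print(is_duplicate(existing, close_point, eps))
print(is_duplicate(existing, far_point, eps))
```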

  9. 8. Duplicates Are Not Easy to Detect and Delete
     • Problem: detect and delete duplicates.
     • How this is done now: “by hand”, by a professional geophysicist looking at the raw measurement results (and at the preliminary results of processing these raw data).
     • Limitations: time-consuming.
     • Natural idea: use a computer to compare every record with every other record.
     • Analysis: this idea requires $\frac{n(n-1)}{2} \sim \frac{n^2}{2}$ comparisons.
     • Limitation: this is impossible for large databases, with $n \approx 10^6$ records.
     • Conclusion: faster algorithms are needed.
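A minimal sketch of the "compare every record with every other record" idea, together with the comparison count from the analysis. The function names and sample points are hypothetical; closeness is assumed to be the coordinate-wise $\varepsilon$ criterion.

```python
from itertools import combinations

def comparison_count(n):
    """Number of pairwise comparisons for n records: n(n-1)/2."""
    return n * (n - 1) // 2

def brute_force_duplicates(points, eps):
    """Return index pairs (i, j), i < j, whose points are eps-close
    in both coordinates. Examines every pair, so it is quadratic in len(points)."""
    pairs = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if abs(p[0] - q[0]) <= eps and abs(p[1] - q[1]) <= eps:
            pairs.append((i, j))
    return pairs

# Hypothetical points, for illustration only.
points = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0)]
print(brute_force_duplicates(points, eps=0.1))

# For a database with about 10^6 records:
print(comparison_count(10**6))
```

For $n = 10^6$ the count is about $5 \cdot 10^{11}$ comparisons, which is what motivates the search for faster algorithms.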

  10. 9. From Interval to Fuzzy Uncertainty
     • Typical situation: geophysicists provide several possible threshold values $\varepsilon_1 < \varepsilon_2 < \ldots < \varepsilon_m$ that correspond to decreasing levels of their certainty:
       – if two measurements are $\varepsilon_1$-close, we are 100% certain that they are duplicates;
       – if two measurements are $\varepsilon_2$-close, then with some degree of certainty, we can claim them to be duplicates, etc.
     • Objectives:
       – eliminate certain duplicates, and
       – mark possible duplicates (about which we are not 100% certain) with the corresponding degree of certainty.
     • Reduction to interval case: we need to solve the interval problem for several different values of $\varepsilon_i$.
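The reduction to the interval case can be sketched by checking a pair of points against each threshold in increasing order. The threshold values and the degrees of certainty below are hypothetical (the slides only fix 100% for $\varepsilon_1$), and the function name `duplicate_certainty` is an assumption.

```python
def duplicate_certainty(p, q, levels):
    """Return the certainty attached to the smallest threshold eps_k for which
    p and q are eps_k-close in both coordinates, or 0.0 if no threshold applies.

    levels: list of (eps, certainty) pairs sorted by increasing eps.
    """
    distance = max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    for eps, certainty in levels:
        if distance <= eps:
            return certainty
    return 0.0

# Hypothetical thresholds eps_1 < eps_2 < eps_3 with decreasing certainty.
levels = [(0.001, 1.0), (0.01, 0.7), (0.1, 0.3)]

print(duplicate_certainty((0.0, 0.0), (0.0005, 0.0), levels))   # certain duplicate
print(duplicate_certainty((0.0, 0.0), (0.005, 0.002), levels))  # possible duplicate
print(duplicate_certainty((0.0, 0.0), (0.5, 0.0), levels))      # not a duplicate
```

Pairs with certainty 1.0 would be eliminated, while pairs with a lower positive certainty would only be marked with that degree.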
