A Machine Learning Perspective on Managing Noisy Data


  1. A Machine Learning Perspective on Managing Noisy Data Theodoros Rekatsinas | UW-Madison @thodrek

  2. Data-hungry applications are taking over

  3. Data errors are everywhere • Noisy measurements • Sensor failures

  4. Data errors are everywhere • Uncertain extractions • Semantic ambiguity

  5. Data errors are everywhere • Adversarial examples

  6. Data errors are everywhere • Human errors • Machine failures • Code bugs

  7. The Achilles’ Heel of Modern Analytics is low quality, erroneous data

  8. The Achilles’ Heel of Modern Analytics is low quality, erroneous data Cleaning and organizing the data comprises 60% of the time spent on an analytics or AI project.

  9. The Achilles’ Heel of Modern Analytics is low quality, erroneous data

     Many modern data management systems are being developed to address aspects of this issue:
     • Stanford’s Snorkel: A System for Fast Training Data Creation
     • Google’s TFX: TensorFlow Data Validation
     • Amazon’s SageMaker
     • Amazon’s Deequ: Data Quality Validation for ML Pipelines
     • HoloClean: Weakly-supervised data cleaning

  10. Question: What is an appropriate (formal) framework for managing noisy data? Things to consider: Simplicity and generality

  11. Talk outline • Managing Noisy Data (Background) • The Probabilistic Unclean Databases (PUDs) Framework • From Theory to Systems

  12. Managing Noisy Data

  13. A simple example of noisy data

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      Annotations on the slide: “Conflicts”, “Conflict”, “Does not obey data distribution”

  14. A simple example of noisy data

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      Annotations on the slide: “Conflicts”, “Conflict”, “Does not obey data distribution”

      Computational problems: Detect errors, repair errors, compute “consistent” query answers.

  15. The case for inconsistent data

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      An example unclean database J

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      • Errors correspond to tuples/cells that introduce inconsistencies (violations of integrity constraints).
      • Inconsistencies are typical in data integration, extract-load-transform workloads, etc.
      • Data repairs: A theoretical framework for coping with inconsistent databases [Arenas et al. 1999]
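
To make "violations of integrity constraints" concrete, here is a minimal sketch that checks the three functional dependencies c1 to c3 against the four tuples of the example database J. The names `ROWS`, `FDS`, and `violations` are illustrative choices, not part of the talk, and the pairwise check is one straightforward way to detect functional-dependency violations rather than the method of any particular system.

```python
from itertools import combinations

COLUMNS = ("DBAName", "AKAName", "Address", "City", "State", "Zip")

# Tuples t1..t4 of the example unclean database J, copied from the slide.
RAW = {
    "t1": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60608"),
    "t2": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t3": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t4": ("Johnnyo's", "Johnnyo's", "3465 S Morgan ST", "Cicago", "IL", "60608"),
}
ROWS = {tid: dict(zip(COLUMNS, values)) for tid, values in RAW.items()}

# Functional dependencies as (left-hand side, right-hand side).
FDS = {
    "c1": (("DBAName",), ("Zip",)),
    "c2": (("Zip",), ("City", "State")),
    "c3": (("City", "State", "Address"), ("Zip",)),
}


def violations(rows, fds):
    """Return (constraint, tuple_a, tuple_b) for every pair that violates an FD.

    A pair violates lhs -> rhs when the two tuples agree on every lhs
    attribute but disagree on at least one rhs attribute.
    """
    found = []
    for name, (lhs, rhs) in fds.items():
        for a, b in combinations(sorted(rows), 2):
            same_lhs = all(rows[a][k] == rows[b][k] for k in lhs)
            same_rhs = all(rows[a][k] == rows[b][k] for k in rhs)
            if same_lhs and not same_rhs:
                found.append((name, a, b))
    return found


if __name__ == "__main__":
    for name, a, b in violations(ROWS, FDS):
        print(f"{name}: {a} conflicts with {b}")
```

Every violating pair in this example involves t1. Note that the misspelled city in t4 surfaces only indirectly, through its conflict with t1 under c2.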

  16. Minimal data repairs Slide by Phokion Kolaitis [SAT 2016]

  17. Minimal data repairs Plethora of fundamental results on tractability of repair-checking and consistent query answering. Slide by Phokion Kolaitis [SAT 2016]

  18. Minimal data repairs Plethora of fundamental results on tractability of repair-checking and consistent query answering. Limited adoption in practice. Slide by Phokion Kolaitis [SAT 2016]

  19. Minimal data repairs

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

  20. Minimal data repairs

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Minimal subset repair: We remove t1

      An example repaired database I

  21. Minimal data repairs

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Minimal subset repair: We remove t1

      Errors remain:
      (1) Cicago should clearly be Chicago
      (2) Non-obvious errors: 60609 is the wrong Zip

  22. Minimal data repairs

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Minimal subset repair: We remove t1

      Several variations of minimal repairs. E.g., update the minimum number of cells.

      Errors remain:
      (1) Cicago should clearly be Chicago
      (2) Non-obvious errors: 60609 is the wrong Zip

  23. Minimal data repairs

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   |
      |----|-------------------|-----------|------------------|---------|-------|-------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 |

      Minimal subset repair: We remove t1

      Several variations of minimal repairs. E.g., update the minimum number of cells.

      Errors remain:
      (1) Cicago should clearly be Chicago
      (2) Non-obvious errors: 60609 is the wrong Zip

      Minimality can be used as an operational principle to prioritize repairs but these repairs are not necessarily correct with respect to the ground truth.
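
The subset-repair idea can be sketched by brute force on the four-tuple example: enumerate every subset of tuples, keep the ones that satisfy c1 to c3, and retain those that are maximal under set inclusion. This is an illustrative enumeration under the assumption that "minimal" means no deleted tuple can be restored without breaking a constraint; the helper names are hypothetical and the approach is exponential, so it only suits toy inputs.

```python
from itertools import combinations

COLUMNS = ("DBAName", "AKAName", "Address", "City", "State", "Zip")

# Tuples t1..t4 copied from the slide.
ROWS = {
    "t1": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60608"),
    "t2": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t3": ("John Veliotis Sr.", "Johnnyo's", "3465 S Morgan ST", "Chicago", "IL", "60609"),
    "t4": ("Johnnyo's", "Johnnyo's", "3465 S Morgan ST", "Cicago", "IL", "60608"),
}

# c1, c2, c3 as (left-hand side, right-hand side).
FDS = [
    (("DBAName",), ("Zip",)),
    (("Zip",), ("City", "State")),
    (("City", "State", "Address"), ("Zip",)),
]


def project(row, attrs):
    """Values of the given attributes in a row."""
    return tuple(row[COLUMNS.index(a)] for a in attrs)


def is_consistent(ids):
    """True if no pair of the given tuples violates any functional dependency."""
    for lhs, rhs in FDS:
        for a, b in combinations(ids, 2):
            ra, rb = ROWS[a], ROWS[b]
            if project(ra, lhs) == project(rb, lhs) and project(ra, rhs) != project(rb, rhs):
                return False
    return True


def subset_repairs():
    """All consistent subsets that are maximal under set inclusion."""
    ids = sorted(ROWS)
    consistent = [
        frozenset(c)
        for r in range(len(ids) + 1)
        for c in combinations(ids, r)
        if is_consistent(c)
    ]
    return [s for s in consistent if not any(s < other for other in consistent)]


repairs = subset_repairs()
fewest_deletions = max(repairs, key=len)  # the repair that removes the fewest tuples

if __name__ == "__main__":
    print([sorted(r) for r in repairs])
    print(sorted(fewest_deletions))
```

On this input the enumeration finds two inclusion-maximal repairs, {t1} and {t2, t3, t4}. The second deletes only t1, matching the slide, yet it keeps the Zip 60609 that the slide flags as wrong, which illustrates why minimality alone does not guarantee a correct repair.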

  24. The case for most probable data [Gribkoff et al., 14]

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   | p   |
      |----|-------------------|-----------|------------------|---------|-------|-------|-----|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 | 0.9 |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 | 0.8 |

      Most probable world, conditioned on integrity constraint satisfaction

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

  25. The case for most probable data [Gribkoff et al., 14]

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   | p   | Factor (f) |
      |----|-------------------|-----------|------------------|---------|-------|-------|-----|------------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 | 0.9 | 1 - 0.9    |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 0.4        |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 0.4        |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 | 0.8 | 0.8        |

      Optimization Objective

      $$\max_{I} \prod_{t \in I} p(t) \prod_{t \notin I} (1 - p(t))$$

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

  26. The case for most probable data [Gribkoff et al., 14]

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   | p   | Factor (f) |
      |----|-------------------|-----------|------------------|---------|-------|-------|-----|------------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 | 0.9 | 0.9        |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 1 - 0.4    |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 1 - 0.4    |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 | 0.8 | 1 - 0.8    |

      Optimization Objective

      $$\max_{I} \prod_{t \in I} p(t) \prod_{t \notin I} (1 - p(t))$$

      Constraints:
      • c1: DBAName → Zip
      • c2: Zip → City, State
      • c3: City, State, Address → Zip

  27. Most probable repairs

      |    | DBAName           | AKAName   | Address          | City    | State | Zip   | p   | Factor (f) |
      |----|-------------------|-----------|------------------|---------|-------|-------|-----|------------|
      | t1 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60608 | 0.9 | 1 - 0.9    |
      | t2 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 0.4        |
      | t3 | John Veliotis Sr. | Johnnyo’s | 3465 S Morgan ST | Chicago | IL    | 60609 | 0.4 | 0.4        |
      | t4 | Johnnyo’s         | Johnnyo’s | 3465 S Morgan ST | Cicago  | IL    | 60608 | 0.8 | 0.8        |

      Optimization Objective

      $$\max_{I} \prod_{t \in I} p(t) \prod_{t \notin I} (1 - p(t))$$

      Probabilities offer clearer semantics than minimality.

      Fundamental question: How do we know p?
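
The optimization objective can be worked through numerically on the four tuples. The sketch below enumerates every subset I, discards those that violate c1 to c3, and scores the rest with the product of p(t) for kept tuples and 1 - p(t) for dropped ones. The tuple probabilities come from the slide; the `CONFLICTS` set is hard-coded here as an assumption, obtained by checking c1 to c3 pairwise against the table, and the function names are illustrative.

```python
from itertools import combinations
from math import prod

# Tuple probabilities p(t) from the slide.
P = {"t1": 0.9, "t2": 0.4, "t3": 0.4, "t4": 0.8}

# Pairs of tuples that cannot coexist under c1..c3 (assumed, derived by
# checking the functional dependencies pairwise on the example table).
CONFLICTS = {("t1", "t2"), ("t1", "t3"), ("t1", "t4")}


def world_probability(world):
    """Product of p(t) for tuples in the world and 1 - p(t) for the rest."""
    return prod(p if t in world else 1 - p for t, p in P.items())


def is_consistent(world):
    """A world is consistent if it contains no conflicting pair."""
    return not any(a in world and b in world for a, b in CONFLICTS)


def most_probable_world():
    """Brute-force argmax of the objective over all consistent worlds."""
    ids = sorted(P)
    worlds = [frozenset(c) for r in range(len(ids) + 1) for c in combinations(ids, r)]
    return max((w for w in worlds if is_consistent(w)), key=world_probability)


best = most_probable_world()

if __name__ == "__main__":
    print(sorted(best), world_probability(best))
    print(world_probability(frozenset({"t2", "t3", "t4"})))
```

With these probabilities the world {t2, t3, t4} scores 0.1 × 0.4 × 0.4 × 0.8 = 0.0128, while the world {t1} scores 0.9 × 0.6 × 0.6 × 0.2 = 0.0648, so the most probable consistent world keeps t1 alone. That is the opposite of the fewest-deletions repair, and it depends entirely on the values of p.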

  28. Probabilistic Unclean Databases Christopher De Sa, Ihab Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas, ICDT 2019

  29. The case of a noisy channel for data

      Clean Source Data → Noisy Channel → Observed Data with Errors

      Noisy Channel Model
      (1) We see an observation x in the noisy world
      (2) Find the correct world w

      $$\hat{w} = \arg\max_{w \in W} P(w \mid x)$$

      Applications: Speech, OCR, spelling correction, part-of-speech tagging, machine translation, etc.
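
Spelling correction, one of the listed applications, gives a compact illustration of the noisy channel decoder. By Bayes' rule, P(w | x) is proportional to P(x | w) P(w), so the decoder can score each candidate clean value w by a channel term times a prior. Everything numeric below is a made-up assumption for illustration: the candidate set, the `PRIOR` values, and the per-edit `ERROR_RATE` are hypothetical, and modeling the channel as an exponential in edit distance is a common simplification, not the model from the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical prior P(w) over clean city names (illustrative numbers only).
PRIOR = {"Chicago": 0.90, "Cicero": 0.09, "Cicago": 0.01}

# Hypothetical probability of a single character-level error in the channel.
ERROR_RATE = 0.1


def channel(x, w):
    """Unnormalized P(x | w): each edit costs a factor of ERROR_RATE."""
    return ERROR_RATE ** edit_distance(x, w)


def decode(x):
    """Return argmax over w of P(x | w) * P(w), which ranks like P(w | x)."""
    return max(PRIOR, key=lambda w: channel(x, w) * PRIOR[w])


if __name__ == "__main__":
    print(decode("Cicago"))
```

Under these assumed numbers, the observed value "Cicago" decodes to "Chicago": one edit away with a high prior beats an exact match with a very low prior.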
