Functional Dependency Generation and Applications in Pay As You Go - PowerPoint PPT Presentation

Functional Dependency Generation and Applications in Pay ‐ As ‐ You ‐ Go Data Integration Systems Daisy Zhe Wang , Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy UC Berkeley, AT&T Research, Stanford University, and Google Inc. y, g 1

Web scale Structured Data Web ‐ scale Structured Data Database Views in the HTML Tables extracted from the Web Deep Web accessed through HTML Forms on the Web Relations generated by For years, Microsoft Corporation CEO Bill information extraction Gates was against open source. But today he appears to have changed his from web pages from web pages mind. "We can be open source. We love mind. We can be open source. We love the concept of shared source," said Bill Name Title Organization Veghte, a Microsoft VP. "That's a super- Bill Gates CEO Microsoft important shift for us in terms of code Bill Veghte VP Microsoft access.“ Richard Stallman Founder Free Soft.. Richard Stallman, founder of the Free Software Foundation, countered saying… 2

A Typical Data Integration System A Typical Data Integration System Mediated Schema (G) Semantic Mappings (M) Different Structured Data Sources (S) Data Sources (S): a set of data sources of a specific domain • M di t d S h Mediated Schema (G): a set of relations and attributes that we wish to (G) t f l ti d tt ib t th t i h t • expose to users Schema Mapping (M): a set of mappings from the attributes in S to the • attributes in G attributes in G Query processing • – A user query over G is reformulated into multiple queries over S using M – Results are retrieved from multiple data sources and combined – Results are retrieved from multiple data sources and combined 3

Data Integration at Web scale Data Integration at Web ‐ scale • A typical data integration solution is impractical for web ‐ A t i l d t i t ti l ti i i ti l f b scale data – Too many domains of interest (Web Data is about everything) y y g – Huge number of sources for each domain – Designing mediated schema is infeasible – Data sources are dirty, incomplete and lack of meta ‐ data Data sources are dirty incomplete and lack of meta data • A web ‐ scale data integration system – can only afford pay ‐ as ‐ you ‐ go [Franklin et. al 2005] – Support automated schema design and mapping Support automated schema design and mapping – Provide best ‐ Effort services 4

Functional Dependency (FD) Functional Dependency (FD) • Can we use FD theory in some way to automate the massive C FD h i h i data integration problem? FDs are specified top down in the database design process as FDs are specified top ‐ down in the database design process as • • statements of truth on how attributes relates to each other FD X � Y holds if and only if each X value is associated with FD X � Y h ld if d l if h X l i i d i h • precisely one Y value One of Armstrong’s Axioms for Normalization • Transitivity: if X � Y, Y � Z, then X � Z 5

Probabilistic Functional Dependencies (pFDs) • Why probabilistic FDs? Wh b bili ti FD ? • Definition of a probabilistic FD (pFD) Definition of a probabilistic FD (pFD) X � p A, p is the likelihood of FD holds in general • Related work – TANE [Huhtala et al 1999] – TANE [Huhtala et al. 1999] – CORDS [Ilyas et al. 2004] • The new challenge: from single large table to many, potentially incomplete and dirty tables 6

Generating pFDs Generating pFDs Probability of pFD over single data source R P b bilit f FD i l d t R • • – Per ‐ Tuple counting: – Per ‐ Value counting: g Probability of pFD over multiple data sources • – Merge pFDs: – Merge Data 7

Results for pFDs Generation Algorithms Number of data sources: 50 ‐‐ 600 8

App1: Normalize Mediated Schema Example (I) Attributes in the mediated schema of the Bibliography Domain Attributes in the mediated schema of the Bibliography Domain abstract issn author author eissn i authors paper author(s) pages subject title subjects journal Title journal key words year y editor conference meeting school colloquium company location association website venue place place date date position dates 9

App1: Normalize Mediate Schema Example (II) Paper Journal eissn abstract issn author authors authors paper author(s) pages subject title subjects journal title journal j l key words year editor conference conference meeting school colloquium company association website address date position dates dates country Editors city Conference 10

Normalizing Mediated Schema Normalizing Mediated Schema Prune pFD set Prune pFD set • – Prune low ‐ probability pFDs – Prune pFDs that can be generated by transitivity 0.95 paper issn 0.95 author title 0.950.92 authors 0.9 subject b author(s) journal title 0.97 subjects journal Avoid over ‐ splitting • conference conference zip 0.95 meeting city colloquium 0.9 1.0 address address 11

Results for Schema Normalization Results for Schema Normalization 12

App2: Identify Dirty Data Sources App2: Identify Dirty Data Sources Structured data sources from the Web can be dirty Structured data sources from the Web can be dirty • Nested Columns Dummy Values Entity Ambiguity name city country name company email name country city Ali Alice Boston B t 02101 USA 02101,USA Alice IBM email Alice USA Boston Bob Seattle 98101,USA Bob Google email Bob US Boston Cathy Cathy Chicago Chicago 6060 ,USA 60601,USA C th Cathy Y h Yahoo email il Cathy C th u.s.a Boston B t David New 12201,USA David MSR email David United Boston York States We report data sources that violate pFDs with high probabilities: W d h i l FD i h hi h b bili i • Results People: 3 out of 3 reported are dirty Course: 31 out of 80 reported are dirty (estimate total 66) p y ( ) Bib: 3 out of 7 reported are dirty 13

Conclusion: FD in Pay as you go Conclusion: FD in Pay ‐ as ‐ you ‐ go Web ‐ scale data integration can only afford to pay ‐ as ‐ you ‐ go • Automation is the key • – Automatically setting up mediated schema, mapping [Das Sarma et. al. 2008] – Automatically measuring and improve the quality of data integration Automatically measuring and improve the quality of data integration • Measuring quality of data sources • Measuring and Improving quality of mediated schema, schema mapping, etc. FD ‐ based Quality Measuring and Improvement • – Identify dirty data sources Id tif di t d t – Improving mediated schemas 14

Future Work Future Work • Automatic mediated schema design for millions of HTML tables HTML tables – Domain cluster (clustering over source schema) – Entity/Relationship cluster (clustering + pFD normalization) – Attribute cluster (synonyms + string similarity) • Related Issues – Scalability Scalability – Visualization 15

Functional Dependency Generation and Applications in Pay As You Go - PowerPoint PPT Presentation

Functional Dependency Generation and Applications in Pay As You Go Data Integration Systems Daisy Zhe Wang , Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy UC Berkeley, AT&T Research, Stanford University, and Google

Econom ical Aspects Econom ical Aspects Pay per Risk Pay per Use Pay per Use Pay per

Dependency Dependency- -Based Automatic Evaluation Based Automatic Evaluation Dependency

Gender Pay Gap Reporting What is Gender Pay Gap? Gender Pay Gap is the difference between the

MMFCU Bill Pay Personal Accounts What is Bill Pay? Bill pay is a service that allows you to pay

IDS &TUC Pay forum 2015 Making up Lost Ground on Pay Reuters \ Luke MacGregor Thursday 12 th

Lecture 19: Dependency Grammars and Dependency Parsing Julia Hockenmaier juliahmr@illinois.edu

Graph Based Dependency Parsing Wei Qiu December 15, 2011 . . . . . . Graph Based

Dependency Grammars Topological Dependency Trees: A Constraint-based Account of Linear

Equal Value: How do We Get There? The Presentation Equal Pay and UN Global Compact

Gender Pay Disparity In the Legal Profession: Trends and Developments Pay Equity - Overview

Thoughts on Learner Data and Motivation Learner Language Dependency Parsing and Dependency

Dependency Grammars and Parsing CMSC 473/673 UMBC Outline Review: PCFGs and CKY Dependency

Natural Language Processing Other Syntactic Models Parsing IV Dan Klein UC Berkeley Dependency

Dependency Parsing CMSC 470 Marine Carpuat Dependency Grammars Syntactic structure = lexical

Marina Valeeva Outline 2 1. Introduction What is Dependency Parsing? What is a

Dependency Parsing II CMSC 470 Marine Carpuat Graph-based Dependency Parsing Slides credit:

Calculation of the Effect of Temperature and Xenon Gas on the Defect Formation in Irradiated UO 2

Scientific domain Human-Computer Interaction Interaction Computer science Supported by

Voronoi Tessellations for Ocean Modeling: Methods, Modes, and Conservation Todd Ringler

for Estimation of Distribution Algorithms (EDA) to Solve Multi-Structural Problems By Mohammad

Health Current Summit G. Cameron Deemer October 2017 President, DrFirst What CMIOs had to

How ICIJ used Technology to Unravel the Panama Papers Ccile Schilis-Gallego Data Journalist

Data Cleansing and Data Understanding Best Practices and Lessons from the Field Casey Stella

WELCOME SAVING MILLIONS WITH WORKFORCE OPTIMIZATION APPLIED ARTIFICIAL INTELLIGENCE WITH: BELLE

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

Functional Dependency Generation and Applications in Pay As You Go - PowerPoint PPT Presentation

Functional Dependency Generation and Applications in Pay As You Go Data Integration Systems Daisy Zhe Wang , Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy UC Berkeley, AT&T Research, Stanford University, and Google

Econom ical Aspects Econom ical Aspects Pay per Risk Pay per Use Pay per Use Pay per

Dependency Dependency- -Based Automatic Evaluation Based Automatic Evaluation Dependency

Gender Pay Gap Reporting What is Gender Pay Gap? Gender Pay Gap is the difference between the

MMFCU Bill Pay Personal Accounts What is Bill Pay? Bill pay is a service that allows you to pay

IDS &amp;TUC Pay forum 2015 Making up Lost Ground on Pay Reuters \ Luke MacGregor Thursday 12 th

Lecture 19: Dependency Grammars and Dependency Parsing Julia Hockenmaier juliahmr@illinois.edu

Graph Based Dependency Parsing Wei Qiu December 15, 2011 . . . . . . Graph Based

Dependency Grammars Topological Dependency Trees: A Constraint-based Account of Linear

Equal Value: How do We Get There? The Presentation Equal Pay and UN Global Compact

Gender Pay Disparity In the Legal Profession: Trends and Developments Pay Equity - Overview

Thoughts on Learner Data and Motivation Learner Language Dependency Parsing and Dependency

Dependency Grammars and Parsing CMSC 473/673 UMBC Outline Review: PCFGs and CKY Dependency

Natural Language Processing Other Syntactic Models Parsing IV Dan Klein UC Berkeley Dependency

Dependency Parsing CMSC 470 Marine Carpuat Dependency Grammars Syntactic structure = lexical

Marina Valeeva Outline 2 1. Introduction What is Dependency Parsing? What is a

Dependency Parsing II CMSC 470 Marine Carpuat Graph-based Dependency Parsing Slides credit:

Calculation of the Effect of Temperature and Xenon Gas on the Defect Formation in Irradiated UO 2

Scientific domain Human-Computer Interaction Interaction Computer science Supported by

Voronoi Tessellations for Ocean Modeling: Methods, Modes, and Conservation Todd Ringler

for Estimation of Distribution Algorithms (EDA) to Solve Multi-Structural Problems By Mohammad

Health Current Summit G. Cameron Deemer October 2017 President, DrFirst What CMIOs had to

How ICIJ used Technology to Unravel the Panama Papers Ccile Schilis-Gallego Data Journalist

Data Cleansing and Data Understanding Best Practices and Lessons from the Field Casey Stella

WELCOME SAVING MILLIONS WITH WORKFORCE OPTIMIZATION APPLIED ARTIFICIAL INTELLIGENCE WITH: BELLE

Explore More Topics

Sambuz

Useful Links

Newsletter

Mail Us

IDS &TUC Pay forum 2015 Making up Lost Ground on Pay Reuters \ Luke MacGregor Thursday 12 th