Data Integration Sam Birch & Alex Leblang
Two faces of data integration ● Businesses ○ Have relatively structured databases that they need to organize ● Research on integrating less structured data ○ Databases coming from different organizations without a common architecture
Businesses have a lot of data …in many different databases
One safe place: the data warehouse Businesses want to control and access their operational data in a single place: 1. Backups and versioning 2. Single query interface 3. Quality & consistency 4. Separation of analytic and operational workloads
Data warehouse design According to Bill Inmon, a data warehouse: 1. Links all data about a single real-world thing; 2. Never over-writes or deletes data; 3. Comprises all (or nearly all) of an organization's data in a consistent way; 4. Comprises all versions of the data in the operational system; 5. Is not virtualized
Update methods 1. ETL 2. Virtual / federated databases 3. Change data capture
ETL ● Extract, Transform, Load ● Take data from one data source, transform it, and load it into another location, usually a data warehouse ● Generally periodic (hourly, daily … )
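As a rough illustration, here is a minimal periodic ETL job, assuming a hypothetical `orders` table in an operational SQLite database and a `daily_revenue` summary table in the warehouse (the table names and schema are made up for the example):

```python
import sqlite3

def run_etl(source_path: str, warehouse_path: str) -> None:
    """Extract order rows, aggregate them, and load the summary into the warehouse."""
    source = sqlite3.connect(source_path)
    warehouse = sqlite3.connect(warehouse_path)

    # Extract: pull raw rows from the operational system.
    rows = source.execute("SELECT order_date, amount FROM orders").fetchall()

    # Transform: aggregate revenue per day.
    revenue_by_day: dict[str, float] = {}
    for order_date, amount in rows:
        revenue_by_day[order_date] = revenue_by_day.get(order_date, 0.0) + amount

    # Load: write the summarized rows into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue (order_date TEXT PRIMARY KEY, revenue REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)", revenue_by_day.items()
    )
    warehouse.commit()
    source.close()
    warehouse.close()
```

In practice a scheduler would run a job like this hourly or daily, matching the periodic nature described above.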
Virtual (federated) DBs ● A method for integrating data virtually ● No actual (physical) data integration ● Virtual database systems give the feel of data integration without the need to maintain a single data warehouse
How they work ● Federated databases map autonomous database systems into a single database ● This is done over a computer network and has the advantage of possible geographic distribution ● Federated databases can be loosely or tightly coupled
Loosely / Tightly Coupled ● Loosely coupled databases require each component to construct its own schema ○ ...forces the user to have knowledge of the schema when using the database ● Tightly coupled databases use independent processes to create a schema used across the federated database ○ … shifts much of the work from the user or DBA to the software itself
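A toy sketch of the federated idea, assuming two hypothetical component databases (`us_sales.db`, `eu_sales.db`) that each expose a `sales` table with a compatible schema; the wrapper presents them as one logical relation without physically copying data:

```python
import sqlite3

def federated_query(sql: str, component_paths: list[str]) -> list[tuple]:
    """Run the same query against every component database and union the results."""
    results: list[tuple] = []
    for path in component_paths:
        conn = sqlite3.connect(path)
        results.extend(conn.execute(sql).fetchall())
        conn.close()
    return results

# The caller sees one logical 'sales' relation spanning both (possibly remote) sources.
rows = federated_query("SELECT region, amount FROM sales", ["us_sales.db", "eu_sales.db"])
```

A real federated system would also rewrite queries against a global schema and push work down to each component, which is where the loose/tight coupling distinction matters.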
Change data capture ● Keep track of diffs, like version control for your database ● Helpful for large data with smaller changes ● Different Implementations: ○ Timestamps on rows ○ Version number on rows ○ Triggers on tables
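A sketch of the timestamp-based variant, assuming a hypothetical `customers` table that carries an `updated_at` column; only rows modified since the last synchronization are shipped:

```python
import sqlite3

def capture_changes(conn: sqlite3.Connection, last_sync: str) -> list[tuple]:
    """Return only the rows that changed after the previous sync point."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()

# Each run remembers the newest timestamp it saw and passes it to the next run,
# so a large table is kept in sync by shipping only its (small) diff.
```

Version numbers and triggers work the same way conceptually: they mark which rows changed so the target only has to apply the delta.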
Reforming unstructured data
Large and unstructured The 4 Vs (according to Dong) ○ large Volume of sources ○ changing at a high Velocity ○ a huge Variety of sources ○ many questions regarding data Veracity Dong et al.
Goals ● Schema alignment ● Record linkage ● Data fusion
Schema Alignment Dong et al
Schema Alignment ● Mediated Schema ○ Define a unified, domain-specific view of the data ● Attribute Matching ○ Identify similarities between schema attributes ● Schema Mapping ○ Specify how to map records between the different schemas Dong et al
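A rough sketch of attribute matching based on name similarity; the trigram measure and the 0.3 threshold are illustrative choices, not the technique from Dong et al.:

```python
def trigrams(name: str) -> set[str]:
    """Break an attribute name into character trigrams for fuzzy comparison."""
    padded = f"  {name.lower()}  "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def attribute_similarity(a: str, b: str) -> float:
    """Jaccard similarity of trigram sets; 1.0 means identical names."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def match_attributes(schema_a: list[str], schema_b: list[str], threshold: float = 0.3):
    """Propose (attribute_a, attribute_b) pairs whose names look alike."""
    return [(a, b) for a in schema_a for b in schema_b
            if attribute_similarity(a, b) >= threshold]

print(match_attributes(["phone_number", "addr"], ["phone", "address"]))
# -> [('phone_number', 'phone'), ('addr', 'address')]
```

The proposed pairs would then feed into schema mapping, which decides how records are actually translated.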
Record Linkage Dong et al
Data Fusion Dong et al
Data Fusion ● Reconciliation of conflicting non-identifying content ○ Voting ○ Source Quality ○ Copy Detection Dong et al
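A minimal sketch of resolving a conflict by weighted voting, where the per-source weights stand in for estimated source quality (the weights here are made up):

```python
from collections import defaultdict

def fuse_by_voting(claims: list[tuple[str, str]], source_weights: dict[str, float]) -> str:
    """Pick the value whose supporting sources carry the most total weight.

    `claims` is a list of (source, value) pairs for one attribute of one entity.
    """
    tally: dict[str, float] = defaultdict(float)
    for source, value in claims:
        tally[value] += source_weights.get(source, 1.0)  # unknown sources count once
    return max(tally, key=tally.get)

# Three sources disagree about an affiliation; the two higher-quality sources win.
claims = [("src_a", "Brown"), ("src_b", "Brown"), ("src_c", "MIT")]
print(fuse_by_voting(claims, {"src_a": 0.9, "src_b": 0.8, "src_c": 0.6}))  # -> Brown
```

Copy detection would go further and discount sources that merely copy their values from one another, so copied claims are not counted as independent votes.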
Dealing With Different Data Sources ● Semantic Heterogeneity ● Access Pattern Limitations ● Integrity Constraints ● Data-level Heterogeneity http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source
Semantic Heterogeneity ● Data integration can suffer from many issues ● Differences in: ○ organization of tables ○ naming of schemas ○ data-level representation http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source
Data-level Heterogeneity ● “115 Waterman St. 02912” / “Brown University CIT” ● “Tim Kraska” / “Timothy Kraska” / “tim_kraska@brown.edu” ● IRA: Individual Retirement Account or Irish Republican Army? ● Arbitrarily hard: different descriptions / photos of the same place
Entity resolution (ER) “[The] problem of identifying and linking/grouping different manifestations of the same real world object.” Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match... Getoor, 2012.
Motivating examples ● Mining unstructured data (e.g. webpages) ● Governance (census, intelligence) ● Generally, when data comes from different organizations Getoor, 2012.
ER Challenges ● Fundamental ambiguity ● Diversity in representations (format, truncation, ambiguity) ● Errors ● Missing data ● Records from different times ● Relationships in addition to equality Getoor, 2012.
Normalization ● Transform data into a format which is more likely to match other similar data ○ Splitting / combining rows ● Canonicalization (e.g. phone numbers, URLs, case of text, expanding truncations) ○ Maximally informative, but standard format ● Logic is specific to data Getoor, 2012.
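A small sketch of canonicalization rules for phone numbers and names; the rules (and the expansion table) are illustrative, since the right logic is specific to the data at hand:

```python
import re

def canonical_phone(raw: str) -> str:
    """Keep digits only, so '(401) 555-1234' and '401.555.1234' compare equal."""
    return re.sub(r"\D", "", raw)

def canonical_name(raw: str) -> str:
    """Lowercase, collapse whitespace, and expand a few common truncations."""
    expansions = {"st.": "street", "univ.": "university"}  # made-up, data-specific table
    tokens = raw.lower().split()
    return " ".join(expansions.get(t, t) for t in tokens)

print(canonical_phone("(401) 555-1234") == canonical_phone("401.555.1234"))  # True
print(canonical_name("Brown  Univ."))  # brown university
```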
Pairwise matching Raw data → Normalized data → Matching features Getoor, 2012.
Matching features ● Edit distance (e.g. Levenshtein) for typos ● Set/vector similarity (Jaccard index, TF/IDF, dot-product) for token-level comparisons ● Alignment (e.g. Monge-Elkan) ● Phonetic (e.g. Soundex) Getoor, 2012.
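Hedged sketches of two of the measures above, Levenshtein edit distance and Jaccard set similarity:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b (catches typos)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def jaccard(a: set, b: set) -> float:
    """Overlap of two token sets, useful for word-level comparison."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

print(levenshtein("Levenshtein", "Levenstein"))            # 1 (one typo away)
print(jaccard({"timothy", "kraska"}, {"tim", "kraska"}))   # 0.333...
```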
Record linkage ● Matching record to record rather than datum to datum ○ May also require schema alignment ● Average of component similarities ● Specific rules about each column ● Probabilistic models with ML ○ Training data not trivial: most pairs are obviously not matches Getoor, 2012.
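A sketch of record-level matching that combines per-column similarities with hand-chosen weights; a trained model would replace both the weights and the threshold (the example columns and weights are made up):

```python
def record_similarity(rec_a: dict, rec_b: dict, column_weights: dict[str, float]) -> float:
    """Weighted average of per-column similarities (exact match here, for brevity)."""
    total = sum(column_weights.values())
    score = sum(weight * (1.0 if rec_a.get(col) == rec_b.get(col) else 0.0)
                for col, weight in column_weights.items())
    return score / total

a = {"name": "tim kraska", "email": "tim_kraska@brown.edu", "city": "providence"}
b = {"name": "timothy kraska", "email": "tim_kraska@brown.edu", "city": "providence"}
# Email is weighted heavily because it is nearly unique per person.
print(record_similarity(a, b, {"name": 1.0, "email": 3.0, "city": 0.5}))  # ~0.78
```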
Collective matching and constraints ● Some data matching operations aren’t independent of the other data in the record ○ e.g. two research papers in the same venue are more likely to be by the same authors ● Expressed in constraints over the matching relationships of columns in a record ○ Transitivity (if A = B, and B = C then A = C) ○ Exclusivity (if A = B then B != C) Getoor, 2012.
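A sketch of enforcing transitivity by clustering pairwise matches with union-find; exclusivity and other constraints would need a richer (e.g. probabilistic) model:

```python
def transitive_clusters(pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Union-find over matched pairs, so A = B and B = C collapse into one cluster."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

print(transitive_clusters([("A", "B"), ("B", "C"), ("D", "E")]))
# A, B, C end up with one cluster representative; D and E with another.
```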
ACSDb ● Developed at Google ● The authors looked at 14.1 billion HTML tables and found 154 million that they considered to contain high-quality relational data ● Work was done in 2008 Cafarella et al
ACSDb ● The authors created the attribute correlation statistics database (ACSDb) ● ACSDb is “a set of statistics about the schemas in the corpus” Cafarella et al
ACSDb ● ACSDb makes possible: ○ schema auto-complete ○ attribute synonym finding ○ join-graph traversal Cafarella et al
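A toy sketch of the schema auto-complete idea: given the attributes a user has typed so far, suggest the attribute that most often co-occurs with them in a corpus of extracted schemas. The tiny corpus below is fabricated; the real ACSDb statistics come from the 154 million recovered web tables:

```python
from collections import Counter

# A tiny, made-up stand-in for the corpus of extracted schemas.
corpus = [
    {"make", "model", "year", "price"},
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"name", "email", "phone"},
]

def suggest_next(partial_schema: set[str]) -> str | None:
    """Suggest the attribute that co-occurs most often with the partial schema."""
    counts: Counter[str] = Counter()
    for schema in corpus:
        if partial_schema <= schema:                # schema contains what the user typed
            counts.update(schema - partial_schema)  # count the remaining attributes
    return counts.most_common(1)[0][0] if counts else None

print(suggest_next({"make", "model"}))  # "year" or "price" (tied in this toy corpus)
```

Attribute synonym finding and join-graph traversal are likewise driven by co-occurrence statistics over the same corpus.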
ACSDb Cafarella et al
2013 Follow Up Schema Extraction for Tabular Data on the Web ● This VLDB 2013 paper introduces row classes, a more flexible method for determining a table's schema Adelfio et al
Conclusion ● For businesses there are tradeoffs between specialized systems and integration ● Lots of research is being done involving combining very large amounts of disparate data
References ● Luna Dong and Divesh Srivastava, Big Data Integration, Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013 ● Lise Getoor and Ashwin Machanavajjhala, Entity Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012) ● Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang: WebTables: Exploring the Power of Tables on the Web. PVLDB 1(1): 538-549 (2008) ● Marco D. Adelfio and Hanan Samet, Schema Extraction for Tabular Data on the Web, In International Conference on Very Large Data Bases (VLDB), 2013 ● http://research.cs.wisc.edu/dibook/