Overview • Motivation • Problem Definition Data Integration • Data Integration Approaches –Virtual integration Hanna Zhong –Data warehouse hzhong@illinois.edu • Issues Department of Computer Science University of Illinois, Urbana-Champaign • Discussion 11/12/2009 2 Why Data Integration? Example: Apartment Search Find an apartment near Siebel Center JSM Bankier Ramshaw 3 4 1
Find an apartment near Siebel Center Find an apartment near Siebel Center mediated schema mediated schema Apartment Search source schema 1 source schema 2 source schema 3 source schema 1 source schema 2 source schema 3 wrapper wrapper wrapper wrapper wrapper wrapper Bankier Ramshaw Bankier Ramshaw JSM JSM 5 6 More Examples Example Systems • People Search • Apartment Search – Build a yellowpage application on db people • DB People Search • Many people doing database stuff in the US • Etc… • How can we find information about a database person, such as classes taught, publications, collaborators, etc? – Homepages – http://dblife.cs.wisc.edu/ 7 8 2
Data Integration Overview • Arises in numerous contexts • Motivation – on the Web, at enterprises, military, scientific • Problem Definition cooperation, bio-informatics domains, e- • Data Integration Approaches commerce, etc. • Currently very hot –Virtual integration – in both database research and industry –Data warehouse • Current state of affairs • Issues – Mostly ad-hoc solution • Discussion • create a special solution for every case; pay consultants a lot of money. 9 10 What is Data Integration? What is Data Integration? (2) The process of • The process of 1. Combining data from different data sources 1. Combining data from different data sources • Data sources: 2. Presenting a unified view of these data – Databases, websites, documents, blogs, discussion forums, emails, etc User Query Search 2. Presenting a unified view of these data as if the data are from the SAME source Unified View Source Results: Source … … Source Source 11 12 3
Data Integration is Hard � Problem Definition • Data sources are heterogeneous, distributed, and autonomous How can we access a set of heterogeneous , – Sources Type • Relational database, text , xml, etc distributed , autonomous databases as if – Query-Language accessing a single database? • SQL queries, keyword queries, XQuery – Schema • Databases have different schemas – Data type & value • The same data are represented differently in different sources – Type (e.g. time represented as varchar or timestamp) – Value (e.g. 8pm represented as 8:00pm or 20:00:00) – Semantic • Words have different meanings at different sources (e.g. title ) – Communication • Some sources are accessed via HTTP, others FTP 13 14 Overview Event Search • Motivation • Provide a comprehensive search on Champaign-Urbana events in one place • Problem Definition – Search events by its title, description, location • Data Integration Approaches proximity, dates, venues, and/or data sources –Virtual integration –Data warehouse • Issues • Discussion 15 16 4
Virtual Integration Approach Virtual Integration Architecture Query • Leave the data in the sources Break down the query into sub-queries for the sources Mediated schema Mediator: Reformulation engine meta-information about the sources • When a query comes in: (ie. source contents, what queries are supported, etc) – Determine the relevant sources to the query Optimizer – Break down the query into sub-queries for the Data source Execution engine catalog sources – Get the answers from the sources, and combine them appropriately wrapper wrapper wrapper Determine relevant sources • Data is fresh Source Source Source 17 18 Virtual Integration Architecture Virtual Integration Example Query Find upcoming events in Champaign-Urbana Mediated schema Mediator: Mediator: Reformulation engine Events(title, location, description, cost, start, end) Optimizer Data source Execution engine catalog Communicate with the data source and do format translations Get the answers and format appropriately wrapper wrapper wrapper Events(title, where, description, start, end) Events(title, location, cost, start, end) schema 1 schema 2 schema 3 Source Source Source 19 eventful.com zvents.com UIUC calendar 20 5
Virtual Integration Example Virtual Integration Example Find upcoming events in Champaign-Urbana Find upcoming events in Champaign-Urbana Mediator: Mediator: Events(title, location, description, cost, start, end) Events(title, location, description, cost, start, end) title where description start end title location cost start end Music Canopy Music 9pm 11pm Music Canopy Club $5 8pm 11pm wrapper wrapper wrapper wrapper Events(title, where, description, start, end) Events(title, location, cost, start, end) schema 1 schema 2 schema 3 schema 1 schema 2 schema 3 eventful.com zvents.com UIUC calendar 21 eventful.com zvents.com UIUC calendar 22 Challenges Challenges Data Integration : the process of combining data from different Data Integration : the process of combining data from different data sources and presenting a unified view of these data data sources and presenting a unified view of these data title location description cost start end title where description cost start end Events(title, location, description, cost, start, end) Events(title, location, description, cost, start, end) Schema Matching title where description start end title where description start end Music Canopy Music 9pm 11pm Music Canopy Music 9pm 11pm title location Cost start end title location Cost start end Music Canopy Club $5 8pm 11pm Music Canopy Club $5 8pm 11pm 23 24 6
Virtual Integration Architecture Virtual Integration Query Query: Find upcoming events in Champaign-Urbana Break down the query into sub-queries for the sources Mediated schema Mediator: Mediator: Reformulation engine Events(title, location, description, cost, start, end) Optimizer subQuery 1 Data source subQuery 2 subQuery 3 Execution engine catalog wrapper wrapper wrapper wrapper wrapper wrapper schema 1 schema 2 schema 3 Source Source Source 25 eventful.com zvents.com UIUC calendar 26 Mediators Global-as-View GAV Query • Express the mediated schema relations as a set of views over the data source relations subQuery 1 wrapper – The mediated schema is modeled as a set of views over the source schemas schema 1 • Design the mediated schema around the source schemas subQuery 2 • Mediated schema: Mediated schema wrapper schema 2 Events(title, location, description, cost, start, end) • Source schema: wrapper schema 3 subQuery 3 – S1: Events(title, where, description, start, end) – S2: Events(title, location, description, cost, start, end, performer) • Global-as-view – S3: Events(title, location, cost, start, end) • GAV: – express the mediated schema relations as a Create View Events AS set of views over the data source relations select title, where AS location, description, NULL, start, end from S1 UNION select title, location, description, cost, start, end from S2 • Local-as-view UNION select title, location, NULL, cost, start, end from S3 – express the source relations as views over the mediated schema 27 28 7
Global-as-View GAV (2) Global-as-View GAV (3) • Mediated schema: • Adding sources is hard Events(title, location, description, cost, start, end) Venues(location, city, state) – The core work is on how to retrieve elements • Source schema: – S4: Events(title, description, city, state) from the source databases – Need to consider all other sources that are GAV: Create View Events AS available select title, NULL, description, NULL, NULL, NULL from S4 Create View Venues AS select NULL, city, state from S4 What if we want to find events that are in Champaign? 29 30 Local-as-View LAV Local-as-View LAV (2) • Express the source relations as views over the mediated schema • Mediated schema: Events(title, location, description, cost, start, end) • The mediated schema is already designed Venues(location, city, state) – Create views on the source schemas • Mediated schema: • Source schema: Events(title, location, description, cost, start, end) – S4: Events(title, description, city, state) • Source schema: – S1: Events(title, where, description, start, end) What if we want to find events that are in Champaign? – S2: Events(title, location, description, cost, start, end, performer) LAV: – S3: Events(title, location, cost, start, end) Create View S4 • LAV: select title, description, city, state Create View S1 select title, location AS where, description, start, end from Events from Events e, Venues v where e.location=v.location AND city=“champaign” Create View S2 select title, location, description, start, end, NULL from Events Create View S3 select title, location, cost, start, end from Events 31 32 8
Recommend
More recommend