2 16 16
play

2/16/16 Outline 0) Course Info CS520 1) Introduction Data - PDF document

2/16/16 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2) Data Preparation and Cleaning Provenance 3) Schema matching and mapping 4) Virtual Data Integration 4. Virtual Data Integration 5) Data Exchange 6) Data


  1. 2/16/16 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2) Data Preparation and Cleaning Provenance 3) Schema matching and mapping 4) Virtual Data Integration 4. Virtual Data Integration 5) Data Exchange 6) Data Warehousing IIT DBGroup 7) Big Data Analytics Boris Glavic http://www.cs.iit.edu/~glavic/ 8) Data Provenance http://www.cs.iit.edu/~cs520/ http://www.cs.iit.edu/~dbgroup/ 1 CS520 - 3) Matching and Mapping 4. Virtual Data Integration 4. Virtual Data Integration • Virtual Data Integration Problems: • How to create mappings? – Discussed in previous part of the course • How to compute query Q Query – This is the main focus of this part Global Schema Mappings Local Local Local Schema Schema Schema 1 2 n 2 3 CS520 - 3) Matching and Mapping CS520 - 3) Matching and Mapping 4. Query Answering with Views 4.1 Query Answering with Views • How to compute query Q over global Example: Solu-ons schema based on source schemas only? Local Schema Global Schema Person Person – What language is used to express mappings? Name Name Address Address – What language due we allow for Q? Office-phone Address Office-address – What language(s) can we use to query local Id Home-phone City sources? Office-contact – What language can we use to compute Q from ∀ x, y, z, a : Person ( x, y ) ∧ Address ( y, z, a ) → ∃ b, c : Person ( x, z, a, b, c ) query results returned by local sources? Query: Q(Name) :- Person(Name, A, OP, OA, HP). – How to deal with incompleteness? Name Address Id City Office-contact Peter 1 1 Chicago (312) 123 4343 Alice 2 2 Chicago (312) 555 7777 Bob 3 3 New York (465) 123 1234 4 5 CS520 - 5) Data Exchange CS520 - 5) Data Exchange 1

  2. 2/16/16 4.1 Query Answering with Views 4.1 Query Answering with Views Example: Solu-ons Example: Solu-ons Local Schema Local Schema Global Schema Global Schema Person Person Person Person Name Address Id City Office-contact Name Name Name Name Peter 1 Address 1 Chicago Address (312) 123 4343 Address Address Office-phone Office-phone Alice 2 2 Chicago (312) 555 7777 Address Office-address Address Office-address Bob 3 3 New York (465) 123 1234 Id Home-phone Id Home-phone Values of home-phone are not City City Office-contact available in the source Office-contact ∀ x, y, z, a : Person ( x, y ) ∧ Address ( y, z, a ) → ∃ b, c : Person ( x, z, a, b, c ) ∀ x, y, z, a : Person ( x, y ) ∧ Address ( y, z, a ) → ∃ b, c : Person ( x, z, a, b, c ) Query: Q(Name) :- Person(Name, A, OP, OA, HP). Query: Q(Home-ph) :- Person(N, A, OP, OA, Home-ph). Name RewriKen query over the source: Name Address Id City Office-contact Peter Q(Name) :- Person(Name, AI), Peter 1 1 Chicago (312) 123 4343 Alice Address(AI,A,OP). Alice 2 2 Chicago (312) 555 7777 Bob Bob 3 3 New York (465) 123 1234 6 7 CS520 - 5) Data Exchange CS520 - 5) Data Exchange 4. Query Answering with Views Motivating Example (Part 1) • Problems Movie(ID,Ttle,year,genre) – How to determine whether query can be answered Director(ID,director) Actor(ID, actor) at all? – Given a rewriting of the query using views, how Q ( T , Y , D ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " do we know it is correct? Director ( I , D ), Actor ( I , D ) – What to do if views can only return some of the query results? V 1 ( T , Y , D ) : − Movie ( I , T , Y , G ), Y ≥ 1940, G = " comedy " Director ( I , D ), Actor ( I , D ) V 1 ⊇ Q Q '( T , Y , D ) : − V 1 ( T , Y , D ), Y ≥ 1950 ⇒ Containment is enough to show that V 1 can be used to answer Q. 8 CS520 - 5) Data Exchange Motivating Example (Part 2) Problem Definition Q ( T , Y , D ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " Input: Query Q Director ( I , D ), Actor ( I , D ) View definiTons: V 1 ,… ,V n V 2 ( I , T , Y ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " A rewriTng: a query Q ’ that refers only V 3 ( I , D ) : − Director ( I , D ), Actor ( ID , D ) to the views and interpreted predicates Containment does not hold, but intuiTvely, V 2 and V 3 are (comparisons) useful for answering Q . Q ''( T , Y , D ) : − V 2 ( I , T , Y ), V 3 ( I , D ) An equivalent rewriTng of Q using V 1 ,… ,V n : How do we express that intuiTon? a rewriTng Q ’ , such that Q ’ ⇔ Q Answering queries using views! 2

  3. 2/16/16 Naïve approach Motivating Example (Part 3) • Given Q and views Movie(ID,Ttle,year,genre) – Randomly combine views into a query Q’ Director(ID,director) Actor(ID, actor) – Check equivalence of Q’ and Q – If Q’ is equivalent we are done Q ( T , Y , D ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " – Else repeat Director ( I , D ), Actor ( I , D ) • Why is this not good? V 4 ( I , T , Y ) : − Movie ( I , T , Y , G ), Y ≥ 1960, G = " comedy " – There are infinitely many ways of combining V 3 ( I , D ) : − Director ( I , D ), Actor ( ID , D ) views • E.g., V, V x V, V x V x V, … Q '''( T , Y , D ) : − V 4 ( I , T , Y ), V 3 ( I , D ) – We are not using any information in the query maximally-contained rewri-ng Maximally-Contained Rewritings Why again? Input: Query Q Rewriting query language L View definitions: V 1 , … ,V n Q ’ is a maximally-contained rewriting of Query Q given V 1 , … ,V n and L if: Global LAV/GLAV! 1. Q ’ ∈ L, Schema 2. Q ’ ⊆ Q, and Mappings 3. there is no Q ’’ in L such Local Local Local Schema Schema Schema that 1 2 n Q ’’ ⊆ Q and Q ’ ⊂ Q ’’ Exercise: which of these views Other use-cases can be used to answer Q ? • Query opTmizaTon with materialized views Q ( T , Y , D ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " – Need equivalent rewriTngs Director ( I , D ), Actor ( I , D ) – Implemented in many commercial DBMS – Here interest is cost: how to speed-up query V 2 ( I , T , Y ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " processing by using materialized views V 3 ( I , D ) : − Director ( I , D ), Actor ( I , D ) V 6 ( T , Y ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy " V 7 ( I , T , Y ) : − Movie ( I , T , Y , G ), Y ≥ 1950, G = " comedy ", Award ( I , W ) V 8 ( I , T ) : − Movie ( I , T , Y , G ), Y ≥ 1940, G = " comedy " 3

  4. 2/16/16 Algorithms for answering queries Bounding the Rewriting Length using views • Step 1 : we ’ ll bound the space of possible Theorem : if there is an equivalent rewriTng, query rewriTngs we need to consider (no there is one with at most n subgoals. comparisons) Query: Q ( X ) : − p 1 ( X 1 ),..., p n ( X n ) • Step 2 : we ’ ll find efficient methods for searching the space of rewriTngs Rewriting: Q '( X ) : − V 1 ( X 1 ),..., V m ( X m ) – Bucket Algorithm, MiniCon Algorithm ϕ • Step 2b : we consider “ logical approaches ” to 1 ,... g k 1 m ,..., g j m Expansion: Q ''( X ) : − g 1 ,..., g 1 the problem:         – The Inverse-Rules Algorithm Proof: Only n subgoals in Q can contribute to the image of the containment mapping ϕ Complexity Result The Bucket Algorithm [LMSS, 1995] • Applies to queries with no interpreted Key idea : predicates. – Create a bucket for each subgoal g in the query. • Finding an equivalent rewriTng of a query – The bucket contains views that contribute to g . using views is NP-complete – Create rewriTngs from the Cartesian product of the buckets (select one view for each goal) – Need only consider rewriTngs of query length or less. • Maximally-contained rewriTng: • Step 1 : assign views with renamed vars to buckets – Union of all conjuncTve rewriTngs of length n or less. • Step 2 : create rewriTngs, refine them, unTl equivalent/all contained rewriTng(s) are found The Bucket Algorithm The Bucket Algorithm Step 1 : Step 1 Intui-on – We want to construct buckets with views that – A view can only be used to provide informaTon have parTally mapped variables about a goal R(X) if it has a goal R(Y) – For each goal g = R in query • Q(X) :- R(X,Y) • V(X) :- S(X,Y) – For each view V – If the query goal contains variables that are in the – For each goal v = R in V head of the query, then the view is only useful if it • If the goal has head variables in the same places as g gives access to these values (they are in the head) then • Q(X) :- R(X,Y) – rename the view head variables to match the query goal vars – choose a new unique name for each other var • V(X) :- S(X,Y), R(Y,Z) – add the resulTng view atom to the bucket 4

Recommend


More recommend