Inconsistency and Incompleteness in Data Integration: a Logic-based Approach
Andrea Calì, Università di Roma “La Sapienza”
CoLogNET Workshop
Logic-based methods for information integration
Vienna, Austria, 23 August 2003

Joint work with:
• Maurizio Lenzerini
• Domenico Lembo
• Riccardo Rosati

What is a data integration system?
• It offers uniform access to a set of heterogeneous sources
• The representation provided to the user is called the global schema
• The user does not need to know about the data sources
• When the user issues a query over the global schema, the system:
  1. determines which sources to query and how
  2. issues suitable queries to the sources
  3. assembles the results and provides the answer to the user

Logical architecture for a data integration system
[Figure: layers, top to bottom: Application; Global Schema; Source structure (×3); Source (×3)]

Software architecture for a data integration system
[Figure: labels, top to bottom: Query; Application; Mediator; Global Schema; Wrapper (×3); Source structure (×3); Source (×3)]

Data integration system: formalisation
A data integration system I is a triple ⟨G, S, M⟩:
• G: global schema
• S: source schema
• M: mapping

Our framework
• global schema G: relational, with integrity constraints (ICs)
• source schema S: relational
• mapping M: global-as-view (GAV), expressed in the language of unions of conjunctive queries (UCQs)

Example
Global schema:
  player(Pname, Pteam)
  team(Tname, Tcity)
Source schema: {s1/3, s2/2, s3/2}
The GAV mapping associates with each relation in the global schema G a view over the source schema:
  player:  player(X, Y) ← s1(X, Y, Z)
           player(X, Y) ← s3(X, Y)
  team:    team(X, Y) ← s2(X, Y)

The role of integrity constraints (ICs)
ICs on the global schema:
• enhance the expressiveness of the global schema
• in general, are not satisfied by the data at the sources
ICs on the source schema:
• represent local properties of data sources
• we assume that the data at the sources satisfy the ICs expressed over the sources ⇒ source ICs are not considered

Constraints on the global schema
1. key dependencies (KDs)
     key(r) = {A_1, ..., A_k}
2. inclusion dependencies (IDs), a generalisation of foreign key dependencies
     r_1[A_1, ..., A_m] ⊆ r_2[B_1, ..., B_m]

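The two kinds of constraints can be checked on a finite instance along the following lines. This is a minimal sketch, not part of the talk: the function names, the positional encoding of attributes, the sample rows and the choice of Pname as the key of player are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple

Row = Tuple[str, ...]


def satisfies_kd(rows: Iterable[Row], key: Sequence[int]) -> bool:
    """True if no two distinct rows agree on all key positions."""
    seen: Set[Tuple[str, ...]] = set()
    for row in set(rows):  # distinct rows only
        k = tuple(row[i] for i in key)
        if k in seen:
            return False
        seen.add(k)
    return True


def satisfies_id(r1: Iterable[Row], attrs1: Sequence[int],
                 r2: Iterable[Row], attrs2: Sequence[int]) -> bool:
    """True if the projection r1[attrs1] is contained in r2[attrs2]."""
    proj2 = {tuple(row[i] for i in attrs2) for row in r2}
    return all(tuple(row[i] for i in attrs1) in proj2 for row in r1)


# Illustrative rows (assumed), with columns addressed by position.
player = {("figo", "realMadrid"), ("totti", "roma")}
team = {("realMadrid", "madrid")}

# Hypothetical KD key(player) = {Pname}, and the ID player[Pteam] ⊆ team[Tname].
kd_ok = satisfies_kd(player, key=[0])
id_ok = satisfies_id(player, [1], team, [0])
print(kd_ok, id_ok)  # True False
```

The ID check fails on these rows because roma appears as a player's team but not as a team name.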
Outline
♦ Introduction (done)
♦ Framework (done)
• Reasoning on integrity constraints
• Query rewriting for IDs alone
• Query rewriting for KDs and IDs
• Semantics for inconsistent data (loosely-sound)
• Query rewriting under loosely-sound semantics
• Complexity results

Reasoning about constraints
Given a source database D for a system I, a global database B is said to be legal if:
1. it satisfies the ICs on the global schema
2. it satisfies the mapping, i.e. B is a superset of the retrieved global database ret(I, D)
• ret(I, D) is obtained by evaluating, for each relation in G, the mapping queries over the source database
• this reflects the assumption of a sound mapping
• in general, there are several global databases that are legal for the system

Answers to queries under constraints
• We are interested in certain answers.
• A tuple t is a certain answer for a query Q if t is in the answer to Q for all (possibly infinite) legal databases.
• The certain answers to Q are denoted by ans(Q, I, D).

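The definition of certain answers can be written compactly as follows. The notation is an assumption introduced here, not taken from the slides: sem(I, D) stands for the set of legal global databases for I with respect to D, and Q^B for the result of evaluating Q over B.

```latex
\[
  \mathit{ans}(Q, \mathcal{I}, \mathcal{D})
  \;=\;
  \{\, t \mid t \in Q^{\mathcal{B}}
       \text{ for every } \mathcal{B} \in \mathit{sem}(\mathcal{I}, \mathcal{D}) \,\}
\]
```

Read this way, if no legal database exists, every tuple of the right arity is vacuously a certain answer.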
Example
Global schema:
  player(Pname, Pteam)
  team(Tname, Tcity)
Constraints: player[Pteam] ⊆ team[Tname]
Mapping:
  player:  player(X, Y) ← s1(X, Y, Z)
           player(X, Y) ← s3(X, Y)
  team:    team(X, Y) ← s2(X, Y)

Example (cont’d)
Source database D
  s1: (figo, realMadrid, 31)
  s2: (realMadrid, madrid)
  s3: (totti, roma)

Example (cont’d)
Retrieved global database ret(I, D)
  player: (figo, realMadrid), (totti, roma)
  team:   (realMadrid, madrid)

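The retrieved global database of the example can be computed by evaluating the mapping queries over the source database. The sketch below hard-codes the three mapping rules of the example; the function name `retrieve`, the dictionary encoding of D and the reading of the s1 tuple as (figo, realMadrid, 31) are assumptions.

```python
from typing import Dict, Set, Tuple

Row = Tuple[str, ...]

# Source database D of the example (s1 has arity 3, s2 and s3 have arity 2).
D: Dict[str, Set[Row]] = {
    "s1": {("figo", "realMadrid", "31")},
    "s2": {("realMadrid", "madrid")},
    "s3": {("totti", "roma")},
}


def retrieve(db: Dict[str, Set[Row]]) -> Dict[str, Set[Row]]:
    """Evaluate the GAV mapping: one union of conjunctive queries per global relation."""
    # player(X, Y) ← s1(X, Y, Z)   and   player(X, Y) ← s3(X, Y)
    player = {(x, y) for (x, y, _z) in db["s1"]} | set(db["s3"])
    # team(X, Y) ← s2(X, Y)
    team = set(db["s2"])
    return {"player": player, "team": team}


ret = retrieve(D)
print(sorted(ret["player"]))  # [('figo', 'realMadrid'), ('totti', 'roma')]
print(sorted(ret["team"]))    # [('realMadrid', 'madrid')]
```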
Example (cont’d)
Retrieved global database ret(I, D)
  player: (figo, realMadrid), (totti, roma)
  team:   (realMadrid, madrid), (roma, α)
The ID on the global schema tells us that roma is the name of some team.
All legal global databases for I contain at least the tuples shown above, where α is some value of the domain of the database.
Warning 1: there may be an infinite number of legal databases for I.
Warning 2: in case of cyclic IDs, legal databases for I may be of infinite size.
Consider the query q(X) ← team(X, Y):
  ans(q, I, D) = {realMadrid, roma}

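The reasoning in this example can be mimicked with a small chase-style computation: add one team tuple with a placeholder value for every player team that has no team tuple, then evaluate the query and keep only answers free of placeholders. This is a sketch under assumptions: the `Null` class and the function name are hypothetical, and the claim that null-free answers over the completed database coincide with the certain answers is the standard result for conjunctive queries when the chase terminates, not something stated on this slide.

```python
from itertools import count
from typing import Set, Tuple


class Null:
    """Labelled null: a placeholder for an unknown domain value (α on the slide)."""

    def __init__(self, index: int) -> None:
        self.index = index

    def __repr__(self) -> str:
        return f"α{self.index}"


def chase_player_team(player: Set[Tuple[str, str]], team: Set[tuple]) -> Set[tuple]:
    """Enforce player[Pteam] ⊆ team[Tname] by adding tuples with fresh nulls."""
    fresh = count(1)
    chased = set(team)
    known = {name for name, _city in chased}
    for _pname, pteam in sorted(player):
        if pteam not in known:
            chased.add((pteam, Null(next(fresh))))
            known.add(pteam)
    return chased


# Retrieved global database of the example.
player = {("figo", "realMadrid"), ("totti", "roma")}
team = {("realMadrid", "madrid")}
chased_team = chase_player_team(player, team)

# q(X) ← team(X, Y): project on the first column, keeping null-free answers only.
certain = {x for x, _y in chased_team if not isinstance(x, Null)}
print(sorted(certain))  # ['realMadrid', 'roma']
```

With a single acyclic ID the loop stops after one pass; with cyclic IDs the same idea may not terminate, which is the situation Warning 2 refers to.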
Query rewriting
Given a user query Q over G:
• we look for a rewriting R of Q expressed over S
• a rewriting R is perfect if R^D = ans(Q, I, D) for every source database D, where R^D is the result of evaluating R over D
With a perfect rewriting, we can do query answering by rewriting.
Note that we avoid the construction of the retrieved global database ret(I, D).

Query rewriting for IDs alone
Intuition: use the IDs as basic rewriting rules.
Query: q(X) ← team(X, Y)
ID: player[Pteam] ⊆ team[Tname]
The ID as a logic rule: team(W2, W3) ← player(W1, W2)

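A single rewriting step of this kind, followed by unfolding through the GAV mapping and evaluation over the sources, might look as follows. This is a hedged sketch rather than the algorithm of the talk: it applies each ID once instead of iterating to a fixpoint, handles only single-attribute IDs and variable-only atoms, and all names (`rewrite_with_id`, `unfold`, `evaluate`, `MAPPING`, the generated variables `W…` and `Z…`) are assumptions.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]       # (predicate, variables)
CQ = Tuple[Tuple[str, ...], List[Atom]]  # (head variables, body atoms)

ARITY = {"player": 2, "team": 2}
# GAV mapping: for each source column, the global argument position (None = existential).
MAPPING = {"player": [("s1", (0, 1, None)), ("s3", (0, 1))], "team": [("s2", (0, 1))]}
D: Dict[str, Set[Tuple[str, ...]]] = {
    "s1": {("figo", "realMadrid", "31")},
    "s2": {("realMadrid", "madrid")},
    "s3": {("totti", "roma")},
}


def rewrite_with_id(q: CQ, lhs: str, lhs_pos: int, rhs: str, rhs_pos: int) -> List[CQ]:
    """One rewriting step with the ID lhs[lhs_pos] ⊆ rhs[rhs_pos]."""
    head, body = q
    out = []
    for i, (pred, args) in enumerate(body):
        if pred != rhs:
            continue
        # The other arguments must be unbound: distinct and used nowhere else.
        others = [a for j, a in enumerate(args) if j != rhs_pos]
        elsewhere = set(head) | {args[rhs_pos]}
        elsewhere |= {a for k, (_, xs) in enumerate(body) if k != i for a in xs}
        if len(set(others)) != len(others) or any(a in elsewhere for a in others):
            continue
        new_args = tuple(args[rhs_pos] if j == lhs_pos else f"W{i}_{j}"
                         for j in range(ARITY[lhs]))
        out.append((head, body[:i] + [(lhs, new_args)] + body[i + 1:]))
    return out


def unfold(q: CQ) -> List[CQ]:
    """Replace every global atom by one of its mapping views, in all combinations."""
    head, body = q
    out = []
    for combo in product(*(MAPPING[pred] for pred, _ in body)):
        new_body = [(src, tuple(args[c] if c is not None else f"Z{i}_{k}"
                                for k, c in enumerate(cols)))
                    for i, ((_, args), (src, cols)) in enumerate(zip(body, combo))]
        out.append((head, new_body))
    return out


def evaluate(q: CQ, db: Dict[str, Set[Tuple[str, ...]]]) -> Set[Tuple[str, ...]]:
    """Naive evaluation of a conjunctive query by backtracking over the body atoms."""
    head, body = q
    results: Set[Tuple[str, ...]] = set()

    def extend(i: int, binding: Dict[str, str]) -> None:
        if i == len(body):
            results.add(tuple(binding[v] for v in head))
            return
        pred, args = body[i]
        for row in db[pred]:
            b = dict(binding)
            if all(b.setdefault(a, v) == v for a, v in zip(args, row)):
                extend(i + 1, b)

    extend(0, {})
    return results


q: CQ = (("X",), [("team", ("X", "Y"))])
ucq = [q] + rewrite_with_id(q, "player", 1, "team", 0)  # adds q(X) ← player(W, X)
source_queries = [u for cq in ucq for u in unfold(cq)]
answers = set().union(*(evaluate(s, D) for s in source_queries))
print(sorted(answers))  # [('realMadrid',), ('roma',)]
```

On the example this yields the same two answers as ans(q, I, D), without materialising the retrieved global database.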