iTrails: Pay-as-you-go Information Integration in Dataspaces
Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007, September 26, 2007
Outline
• Motivation
• iTrails
• Experiments
• Conclusions and Future Work
September 26, 2007 Marcos Vaz Salles / ETH Zurich / marcos.vazsalles@inf.ethz.ch
Problem: Querying Several Sources
Query: What is the impact of global warming in Zurich?
[Figure: the query has to be answered by unknown systems ("?") over several data sources: Laptop, Email Server, Web Server, DB Server]
Solution 1: Use a Search Engine
Query: global warming zurich
System: Graph IR search engine, e.g. TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
Drawback: Query semantics are not precise!
[Figure: the search engine receives text and links from each data source: Laptop, Email Server, Web Server, DB Server]
Solution 2: Use an Information Integration System
Query: //Temperatures/*[city = “zurich”]
System: Information integration system, e.g. GAV (e.g. [ICDE95]), LAV (e.g. [VLDB96]), GLAV [AAAI99], P2P (e.g. [SIGMOD04])
Drawback: Too much effort to provide schema mappings!
[Figure: the system exposes Temps, Cities, CO2, Sunspots, ... and needs one schema mapping per data source (Laptop, Email Server, Web Server, DB Server); two of the four mappings are labeled missing]
Research Challenge: Is There an Integration Solution in-between These Two Extremes?
[Figure: three systems side by side over the same data sources (Laptop, Email Server, Web Server, DB Server)]
• Graph IR Search Engine: query global warming zurich; uses text and links only
• Dataspace System (?): query global warming zurich; pay-as-you-go information integration
• Information Integration System: query //Temperatures/*[city = “zurich”]; full-blown schema mappings
Dataspace Vision by Franklin, Halevy, and Maier [SIGMOD Record 05]
Outline
• Motivation
• iTrails
• Experiments
• Conclusions and Future Work
iTrails Core Idea: Add Integration Hints Incrementally
• Step 1: Provide a search service over all the data
  • Use a general graph data model (see VLDB 2006)
  • Works for unstructured documents, XML, and relations
• Step 2: Add integration semantics via hints (trails) on top of the graph
  • Works across data sources, not only between sources
• Step 3: If more semantics needed, go back to step 2
• Impact:
  • Smooth transition between search and data integration
  • Semantics added incrementally improve precision / recall
iTrails: Defining Trails
• Basic Form of a Trail:
    Q_L[.C_L] → Q_R[.C_R]
  • Q_L, Q_R: queries (NEXI-like keyword and path expressions)
  • C_L, C_R: attribute projections
• Intuition: When I query for Q_L[.C_L], you should also query for Q_R[.C_R]
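The basic form above can be held in a small record type. The following is a minimal sketch, not the iTrails implementation: the class name `Trail`, its field names, and the ASCII arrow `->` used when printing are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    """Hypothetical container for a trail Q_L[.C_L] -> Q_R[.C_R]."""
    left_query: str                   # Q_L: keyword or path expression
    right_query: str                  # Q_R: keyword or path expression
    left_attr: Optional[str] = None   # C_L: optional attribute projection
    right_attr: Optional[str] = None  # C_R: optional attribute projection

    @staticmethod
    def _side(query: str, attr: Optional[str]) -> str:
        # Append the projection only when one is given.
        return query if attr is None else f"{query}.{attr}"

    def __str__(self) -> str:
        left = self._side(self.left_query, self.left_attr)
        right = self._side(self.right_query, self.right_attr)
        return f"{left} -> {right}"


# A keyword trail and a trail with attribute projections.
thesaurus = Trail("car", "auto")
schema_match = Trail("//Employee//*", "//Person//*", "tuple.empName", "tuple.name")
```

Printing `schema_match` gives the same textual form the slides use for schema equivalences, with `->` in place of the arrow.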
Trail Examples: Global Warming Zurich
Query: global warming zurich
• Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
    global warming → //Temperatures/*[celsius > 10]
• Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
    zurich → //*[region = “ZH”]
Temperatures (DB Server):
    date     city     region   celsius
    24-Sep   Bern     BE       20
    24-Sep   Uster    ZH       15
    25-Sep   Zurich   ZH       14
    26-Sep   Zurich   ZH       9
Trail Example: Deep Web Bookmarks
Query: train home (data source: Web Server)
• Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”
    train home → //trainCompany.com//*[origin=“ETH Uni” and dest=“Seilbahn Rigiblick”]
Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Data sources: Laptop, Email Server
• Trail for Thesauri: “When I query for car, you should also query for auto”
    car → auto
• Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
    car → carro
    carro → car
Trail Examples: Schema Equivalences
Tables (DB Server):
    Employee(empId, empName, salary)
    Person(SSN, name, age, income)
• Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
    //Employee//*.tuple.empName → //Person//*.tuple.name
• Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
    //Employee//*.tuple.salary → //Person//*.tuple.income
Outline
• Motivation
• iTrails
  • Core Idea
  • Trail Examples
  • How are Trails Created?
  • Uncertainty and Trails
  • Rewriting Queries with Trails
  • Recursive Matches
• Experiments
• Conclusions and Future Work
How are Trails Created?
• Given by the user
  • Explicitly
  • Via Relevance Feedback
• (Semi-)Automatically
  • Information extraction techniques
  • Automatic schema matching
  • Ontologies and thesauri (e.g., WordNet)
  • User communities (e.g., trails on gene data, bookmarks)
Uncertainty and Trails
• Probabilistic Trails:
  • model uncertain trails
  • probabilities used to rank trails
    Q_L[.C_L] → Q_R[.C_R] with probability p, 0 ≤ p ≤ 1
• Example: car → auto, p = 0.8
Certainty and Trails
• Scored Trails:
  • give higher value to certain trails
  • scoring factors used to boost scores of query results obtained by the trail
    Q_L[.C_L] → Q_R[.C_R] with scoring factor sf, sf > 1
• Examples:
  • T1: weather → //Temperatures/*, p = 0.9, sf = 2
  • T2: yesterday → //*[date = today() – 1], p = 1, sf = 3
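The two annotations, p for ranking trails and sf for boosting result scores, can be illustrated together. This is a minimal sketch under stated assumptions: the slides do not say how sf is combined with a result score, so plain multiplication is assumed here, and the names `ScoredTrail`, `rank_trails`, and `boost` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredTrail:
    """Hypothetical trail annotated with probability p and scoring factor sf."""
    name: str
    left: str
    right: str
    p: float = 1.0   # probability that the trail is correct, 0 <= p <= 1
    sf: float = 1.0  # scoring factor; 1.0 means "no boost" (assumption)

    def __post_init__(self) -> None:
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must lie in [0, 1]")


def rank_trails(trails: List[ScoredTrail]) -> List[ScoredTrail]:
    # Probabilities are used to rank trails: most probable first.
    return sorted(trails, key=lambda t: t.p, reverse=True)


def boost(base_score: float, trail: ScoredTrail) -> float:
    # Assumption: a result obtained through a trail has its score multiplied by sf.
    return base_score * trail.sf


# The two example trails from the slide.
t1 = ScoredTrail("T1", "weather", "//Temperatures/*", p=0.9, sf=2)
t2 = ScoredTrail("T2", "yesterday", "//*[date = today() - 1]", p=1.0, sf=3)
ranked = rank_trails([t1, t2])
```

With these values T2 ranks ahead of T1, and a result with base score 0.5 reached through T2 would be boosted to 1.5 under the multiplicative assumption.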
Rewriting Queries with Trails
Query: weather yesterday
Trail T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the query node yesterday
(2) Transformation: the right side of T2 yields the new node //*[date = today() – 1]
(3) Merging: the matched node is replaced by the union (U) of yesterday and //*[date = today() – 1]; the weather node is unchanged
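The three rewriting steps can be traced on a flat list of query terms. This is a minimal sketch, assuming that a trail matches a term only by exact string equality and that a union is written as a tuple `("union", original, added)`; the function name `rewrite_with_trail` is hypothetical, and the real system works on query trees rather than term lists.

```python
from typing import List, Tuple, Union

# A query node is either a plain term or a union of two expressions.
QueryNode = Union[str, Tuple[str, str, str]]


def rewrite_with_trail(terms: List[str], left: str, right: str) -> List[QueryNode]:
    """Apply one trail `left -> right` to a query with union semantics."""
    rewritten: List[QueryNode] = []
    for term in terms:
        if term == left:                             # (1) matching
            added = right                            # (2) transformation
            rewritten.append(("union", term, added)) # (3) merging
        else:
            rewritten.append(term)                   # untouched query node
    return rewritten


query = ["weather", "yesterday"]
result = rewrite_with_trail(query, "yesterday", "//*[date = today() - 1]")
```

Only the matched term is wrapped in a union; the original term is kept, so results that matched the keyword before the rewrite are still returned.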
Replacing Trails
• Trails that use replace instead of union semantics
Query: weather yesterday
Trail T2 (replace semantics): yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the query node yesterday
(2) Transformation: the right side of T2 yields the new node //*[date = today() – 1]
(3) Merging: the matched node yesterday is replaced by //*[date = today() – 1] instead of being unioned with it; the weather node is unchanged
Problem: Recursive Matches (1/2)
T2: yesterday → //*[date = today() – 1]
• T2 matches the query weather yesterday, and the rewrite adds //*[date = today() – 1] by union (U)
• The new query still contains yesterday, so it still matches T2 and T2 could be applied again
• Every further application adds another union with //*[date = today() – 1]: infinite recursion
Problem: Recursive Matches (2/2)
Trails may be mutually recursive:
    T3: //*.tuple.date → //*.tuple.modified
    T10: //*.tuple.modified → //*.tuple.date
• T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1] by union (U)
• T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1] by union (U)
• We again match T3 and enter an infinite loop
Solution: Multiple Match Coloring Algorithm
Query: weather yesterday
    T1: weather → //Temperatures/*
    T2: yesterday → //*[date = today() – 1]
    T3: //*.tuple.date → //*.tuple.modified
    T4: //*.tuple.date → //*.tuple.received
• First Level: T1 matches weather and T2 matches yesterday, adding //Temperatures/* and //*[date = today() – 1] by union (U)
• Second Level: T3 and T4 match //*[date = today() – 1], adding //*[modified = today() – 1] and //*[received = today() – 1] by union (U)
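The level-by-level expansion on this slide can be sketched in a few lines. The slide shows only the example, so the coloring rule used here is an assumption: every node produced by a trail inherits the colors of the node it came from plus the trail's own color, and a trail is never matched against a node that already carries its color. The string-based matching in `keyword_trail` and `attribute_trail`, and all names in the block, are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Trail:
    name: str
    matches: Callable[[str], bool]  # does the trail's left side match this node?
    rewrite: Callable[[str], str]   # node produced by the trail's right side


@dataclass(frozen=True)
class Node:
    expr: str
    colors: FrozenSet[str] = frozenset()  # trails used to derive this node


def keyword_trail(name: str, keyword: str, target: str) -> Trail:
    # Matches a node only if it is exactly the keyword.
    return Trail(name, lambda e: e == keyword, lambda e: target)


def attribute_trail(name: str, old: str, new: str) -> Trail:
    # Matches predicates on attribute `old` and renames the attribute to `new`.
    return Trail(name, lambda e: f"[{old} " in e,
                 lambda e: e.replace(f"[{old} ", f"[{new} "))


def expand(terms: List[str], trails: List[Trail]) -> List[List[str]]:
    """Return the expressions added at each level; level 0 is the original query."""
    frontier = [Node(t) for t in terms]
    levels = [[n.expr for n in frontier]]
    while frontier:
        next_frontier = []
        for node in frontier:
            for trail in trails:
                if trail.name in node.colors:
                    continue  # coloring rule: never reapply a trail along one derivation
                if trail.matches(node.expr):
                    next_frontier.append(
                        Node(trail.rewrite(node.expr), node.colors | {trail.name}))
        if next_frontier:
            levels.append([n.expr for n in next_frontier])
        frontier = next_frontier
    return levels


T1 = keyword_trail("T1", "weather", "//Temperatures/*")
T2 = keyword_trail("T2", "yesterday", "//*[date = today() - 1]")
T3 = attribute_trail("T3", "date", "modified")
T4 = attribute_trail("T4", "date", "received")
T10 = attribute_trail("T10", "modified", "date")

levels = expand(["weather", "yesterday"], [T1, T2, T3, T4])
mutual = expand(["yesterday"], [T2, T3, T10])
```

Each new node carries strictly more colors than its parent, so under this rule a derivation can be at most as long as the number of trails. That is why the mutually recursive pair T3 and T10 stops after one round instead of looping.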
Multiple Match Coloring Algorithm Analysis
• Problem: MMCA is exponential in number of levels
• Solution: Trail Pruning
  • Prune by number of levels
  • Prune by top-K trails matched in each level
  • Prune by both top-K trails and number of levels
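The three pruning strategies can be expressed as one function with two optional limits. This is a minimal sketch: it assumes the trails matched at each level are available as (name, probability) pairs and that top-K is chosen by probability, in line with probabilities being used to rank trails. The function name `prune` and the probabilities given to T3 and T4 are hypothetical.

```python
from typing import List, Optional, Tuple

Match = Tuple[str, float]  # (trail name, probability)


def prune(levels: List[List[Match]],
          top_k: Optional[int] = None,
          max_levels: Optional[int] = None) -> List[List[Match]]:
    """Keep at most `max_levels` levels and at most `top_k` matches per level."""
    kept = levels if max_levels is None else levels[:max_levels]
    if top_k is None:
        return [list(level) for level in kept]
    # Within each level, keep the K most probable matched trails.
    return [sorted(level, key=lambda m: m[1], reverse=True)[:top_k]
            for level in kept]


# Matches per level for the running example; p for T3 and T4 is made up.
matched = [
    [("T1", 0.9), ("T2", 1.0)],
    [("T3", 0.5), ("T4", 0.7)],
]
by_levels = prune(matched, max_levels=1)
by_top_k = prune(matched, top_k=1)
by_both = prune(matched, top_k=1, max_levels=1)
```

Bounding both the number of levels and the matches per level caps the number of applied trails at `top_k * max_levels`, at the cost of dropping lower-ranked rewrites.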