Exporting and Interactively Querying Web Service-Accessed Sources: The CLIDE System
Michalis Petropoulos
Database Seminar, February 2010

Large-Scale Data Integration Systems
• Application Domain (Developer; Applications): computer portals: CNET, PCWorld
• Integration Domain (Integration Engineer; Mediator with Integrated Schema): compatible combinations of computers, routers and printers
• Source Domain (Source Owner; Web Services): Dell computers by CPU, Cisco routers by rate, HP printers by speed
• Sources (Data and Schema): Dell computers, Cisco routers, HP printers
• The developer's question, which CLIDE addresses: "What queries can the mediator answer for me?"

Running Example: Parameterized Views
Schemas
  Computers (cid, cpu, ram, price)
  Routers (rate, standard, price, type)
  NetCards (cid, rate, standard, interface)
Views
  V1  ComByCpu(cpu) → (Computer)*
      Computers for a given cpu
        SELECT DISTINCT Com1.*
        FROM Computers Com1
        WHERE Com1.cpu = cpu
  V2  ComNetByCpuRate(cpu, rate) → (Computer, NetCard)*
      Computers & NetCards for a given cpu & rate
        SELECT DISTINCT Com1.*, Net1.*
        FROM Computers Com1, NetCards Net1
        WHERE Com1.cid = Net1.cid
          AND Com1.cpu = cpu AND Net1.rate = rate
  V3  RouWired() → (Router)*
      Wired Routers
        SELECT DISTINCT Rou1.*
        FROM Routers Rou1
        WHERE Rou1.type = 'Wired'
  V4  RouWireless() → (Router)*
      Wireless Routers
        SELECT DISTINCT Rou1.*
        FROM Routers Rou1
        WHERE Rou1.type = 'Wireless'
Conjunctive Queries CQ
• Equality & comparison conditions
• Parameters
Sophisticated Mediators Make Feasible Queries Hard to Predict
Feasible Queries FQ
• Equivalent CQ query rewritings using the views
• Might involve more than one view
• Order might matter
Infeasible query: Get all Computers
Feasible query: Get all 'P4' Computers, together with their NetCards and their compatible 'Wireless' Routers
Mediator plan for the feasible query:
  A  The mediator calls RouWireless() (V4)
  B  Routers.* returned:
       10  .11b  50   Wireless
       54  .11g  120  Wireless
  C  The mediator calls ComNetByCpuRate('P4', '10') and ComNetByCpuRate('P4', '54') (V2)
  D  Computers.*, NetCards.* returned:
       A123  P4  512   400  |  A123  10  .11b  USB
       B123  P4  1024  550  |  B123  54  .11g  USB
  E  Result (Computers.*, NetCards.*, Routers.*):
       A123  P4  512   400  |  A123  10  .11b  USB  |  10  .11b  50   Wireless
       B123  P4  1024  550  |  B123  54  .11g  USB  |  54  .11g  120  Wireless

Running Example: Integrated Schema
• The integrated schema puts together the Dell and Cisco schemas
• The mediator accesses the Dell and Cisco sources through views V1, V2, V3 and V4
Attribute Associations
• (Computers.cid, NetCards.cid)
• (NetCards.rate, Routers.rate)
• (NetCards.standard, Routers.standard)

Problem
1. Large number of sources
2. Large number of views (web services)
3. Mediator capabilities
The developer formulates an application query:
• Is an application query feasible?
• If not, how do I know which ones are feasible?
Previous options:
– The developer had to browse the view definitions and somehow formulate a feasible query
– Or formulate queries until a feasible one is found (trial and error)
No system-provided guidance

The CLIDE Solution
CLIDE: a query formulation interface that interactively guides the developer toward feasible queries by employing a coloring scheme
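The feasible query above can be traced in code. The following is a minimal sketch, not the CLIDE mediator: the two views are simulated as Python functions over in-memory rows copied from the example tables, the wired router row is made up, and the function names are hypothetical. It shows why order matters: V4 has to be called first, because its router rates are what bind the rate parameter of V2.

```python
# Toy rows copied from the example tables on the slide.
COMPUTERS = [  # (cid, cpu, ram, price)
    ("A123", "P4", 512, 400),
    ("B123", "P4", 1024, 550),
]
NETCARDS = [  # (cid, rate, standard, interface)
    ("A123", "10", ".11b", "USB"),
    ("B123", "54", ".11g", "USB"),
]
ROUTERS = [  # (rate, standard, price, type)
    ("10", ".11b", 50, "Wireless"),
    ("54", ".11g", 120, "Wireless"),
    ("100", "ethernet", 30, "Wired"),  # hypothetical row, not on the slide
]


def rou_wireless():
    """Simulates V4: RouWireless() -> (Router)*."""
    return [r for r in ROUTERS if r[3] == "Wireless"]


def com_net_by_cpu_rate(cpu, rate):
    """Simulates V2: ComNetByCpuRate(cpu, rate) -> (Computer, NetCard)*."""
    return [
        (com, net)
        for com in COMPUTERS
        for net in NETCARDS
        if com[0] == net[0] and com[1] == cpu and net[1] == rate
    ]


def p4_with_wireless_routers():
    """Answers the feasible query using only the two view calls."""
    routers = rou_wireless()  # steps A and B
    result = []
    # Steps C and D: one V2 call per distinct router rate.
    for rate in sorted({r[0] for r in routers}):
        for com, net in com_net_by_cpu_rate("P4", rate):
            # Step E: join on the (rate, standard) attribute associations.
            for rou in routers:
                if net[1] == rou[0] and net[2] == rou[1]:
                    result.append(com + net + rou)
    return result


if __name__ == "__main__":
    for row in p4_with_wireless_routers():
        print(row)
```

"Get all Computers" has no such plan in this sketch: neither function can be called without a cpu value, which is the sense in which that query is infeasible.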
QBE-Like Interfaces
[Screenshot: Microsoft SQL-Server]

CLIDE Interface
[Screenshot labels: Last/Next Step, Table Alias, Selection Boxes, Table Boxes, Feasibility Flag, Projection Box]
• Table, selection, projection and join actions
• Feasibility Flag
• Color-based suggestions

Example Interaction: Snapshot 1
• Yellow: required action
  – All feasible queries require this action
• White: optional action
  – Feasible queries can be formulated with or without these actions

Example Interaction: Snapshot 2
• Blue: required choice of action
  – At least one feasible query cannot be formulated unless this action is performed
Mediator plan:
  A  The mediator calls ComByCpu('P4') (V1)
  B  Returned (cid, cpu, ram, price):
       A123  P4  512   400
       B123  P4  1024  550
  C  Result (ram, price):
       512   400
       1024  550
Example Interaction: Snapshot 3
• * stands for any other constant
• Red: prohibited action
  – Does not appear in any feasible query
  – Leads to a "Dead End" state

Example Interaction: Snapshot 4
Join Lines:
• Only yellow and blue are displayed
• Must appear in Attribute Associations

Example Interaction: Snapshot 5
[Diagram: the mediator calls RouWireless() (V4) and ComNetByCpuRate('P4', rate) (V2); the intermediate results are Routers.* and Computers.*, NetCards.* tuples.]
  F  Result (ram, price, rate, interface, price):
       512   400  10  USB  50
       1024  550  54  USB  120

Demo
CLIDE Properties
• Completeness of Suggestions
  – Every feasible query can be formulated by performing yellow and blue actions at every step
• Summarization of Suggestions
  – At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness
• Rapid Convergence By Following Suggestions
  – The shortest sequence of actions from a query to any feasible query consists of suggested actions

Interaction Graph
• Nodes are queries: one for each q ∈ CQ
• Edges are actions: table, selection, projection and join actions
• Green nodes are feasible queries
• Infinitely big structure
  – All CQ queries
  – All possible combinations of actions formulating them
[Diagram: edges labeled by action type (table action, selection action, join action); example labels: Com1, Com1.ram, Com1.price, Com1.cpu=‘P4’, Net1, Com1.cid=Net1.cid, Rou1]

Interaction Graph: Colorable Actions
• Colorable actions A_C label the outgoing edges of the current node
[Diagram: current node with outgoing edges such as Com1.cid, Com1.cpu, Com1.cid=*, Com1.cpu=*, Com1.ram=*, Com1.price=*, Net1, Rou1, Com2]

Interaction Graph: Colors
• Yellow action α
  – Every path from the current node n to a feasible node contains α
• Blue action α
  – At least one feasible query cannot be formulated unless this action is performed (summarization)
• Red action α
  – No path to a feasible node contains α
[Diagram: paths from the current node through actions such as Com1.cpu=*, Net1, Com1.cid=Net1.cid, Net1.rate=’54Mbps’, Rou1, Net1.rate=Rou1.rate, Com2, Com2.cid=Net1.cid, Com2.cpu=‘P4’]
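The color definitions can be tried out on a toy model. The sketch below is not the CLIDE color algorithm. It assumes that a query is just the set of actions performed so far, that paths only ever add actions, and that a finite, hand-picked list of feasible queries is given. Under those assumptions, "every path to a feasible node contains α" reduces to "α is missing from the current query and present in every reachable feasible query". White actions and the summarization step are left out, and the name `color_actions` is hypothetical.

```python
def color_actions(current, feasible, candidates):
    """Color candidate actions relative to a finite list of feasible queries.

    current    -- frozenset of actions already performed (the current node)
    feasible   -- list of frozensets, each one a feasible query (assumed given)
    candidates -- actions labeling the outgoing edges of the current node
    """
    # Feasible nodes reachable from the current node when paths only add actions.
    reachable = [f for f in feasible if current <= f]
    # Actions still missing on the way to each reachable feasible node.
    remaining = [f - current for f in reachable]

    colors = {}
    for action in candidates:
        hits = sum(action in r for r in remaining)
        if remaining and hits == len(remaining):
            colors[action] = "yellow"  # on every path to a feasible node
        elif hits == 0:
            colors[action] = "red"     # on no path to a listed feasible node
        else:
            colors[action] = "blue"    # needed by some feasible nodes only
    return colors


if __name__ == "__main__":
    # Hand-picked toy set, loosely modeled on views V1 and V2.
    feasible = [
        frozenset({"Com1", "Com1.cpu=*"}),
        frozenset({"Com1", "Com1.cpu=*", "Net1",
                   "Com1.cid=Net1.cid", "Net1.rate=*"}),
    ]
    current = frozenset({"Com1"})
    print(color_actions(current, feasible, ["Com1.cpu=*", "Net1", "Rou1"]))
```

Here Rou1 comes out red only because the toy list contains no query over routers; with a fuller feasible set it would be colored differently.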
Color Determined By a Finite Set of Feasible Queries
Challenge: Infinitely many feasible queries
Solution: Closest Feasible Queries FQ_C
• FQ_C is sufficient to color the actions in A_C
• Theorem: The set of Closest Feasible Queries is finite
Challenge: How far can the Closest Feasible Queries FQ_C be?
Solution: Based on Maximally Contained Queries FQ_MC
[Diagram: current node n with the Closest Feasible Queries FQ_C lying within a radius around it]

CLIDE Architecture
• Front-End: sends the user's actions and the current query; receives colored actions and the feasibility flag
• Back-End: Color Algorithm and Closest Feasible Queries Algorithm
  – Diagram labels: Seed Queries SQ, Parameters, Aliases, Collapse Rule, Minimal Feasible Extension Queries, Maximally-Contained Rewriter
  – Inputs: Schemas, Views, Column Associations
• The Back-End is invoked every time the user performs an action
  – i.e., the user arrives at a new node in the interaction graph

Maximally Contained Queries FQ_MC
• Query Q1: Get all Computers
• Query Q2: Get all Computers with a given cpu (maximally contained query)
• Query Q4: Get all Computers with a given ram (maximally contained query)
• Query Q3: Get all Computers with a given cpu & ram (not maximally contained)
• Assuming a fixed SELECT clause (projection list)
• Covered extensively in the literature
  – MiniCon, Bucket, InverseRules algorithms
• FQ_MC is finite

Closest Feasible Queries FQ_C Algorithm
Challenge: How far can the Closest Feasible Queries FQ_C be?
Solution: Maximally Contained Queries FQ_MC
• Compute the maximally contained queries FQ_MC
• Theorem: All FQ_C queries are reachable via a path of length p ≤ p_L
• The radius p_L is the longest path to a maximally contained query
[Diagram: current node n, the Closest Feasible Queries FQ_C and the Maximally Contained Queries FQ_MC within radius p_L]
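The Q1 to Q4 example can be checked mechanically under a strong simplification. The sketch below is not MiniCon, Bucket or InverseRules. It assumes every query has the same fixed projection over Computers and is described only by its set of parameterized equality conditions, so a query with more conditions is contained in a query with fewer. It also assumes Q2, Q3 and Q4 are feasible, as the labels on the slide imply. The function names are hypothetical.

```python
def is_contained(q, other):
    """Sufficient syntactic test: q is contained in other if q has every
    condition of other (extra conditions only make q more restrictive)."""
    return other <= q


def maximally_contained(target, feasible):
    """Names of the feasible queries contained in target that are not
    contained in any other such feasible query."""
    inside = {name: conds for name, conds in feasible.items()
              if is_contained(conds, target)}
    return sorted(
        name
        for name, conds in inside.items()
        if not any(conds != other and is_contained(conds, other)
                   for other in inside.values())
    )


if __name__ == "__main__":
    q1 = frozenset()  # Get all Computers (no conditions)
    feasible = {
        "Q2": frozenset({"cpu = $cpu"}),
        "Q3": frozenset({"cpu = $cpu", "ram = $ram"}),
        "Q4": frozenset({"ram = $ram"}),
    }
    print(maximally_contained(q1, feasible))
```

Q3 drops out because it is contained in Q2 (and in Q4), which matches its "not maximally contained" label on the slide.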