PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP
WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente § Research interests: Data integration, data quality, information extraction, natural language processing, data cleaning I am here on a research visit § Period Nov’17 – Feb’18 several times 3 or 2 days § Goals: § Compare Probabilistic Programming (PP) with Probabilistic Databases (PDBs) § Does PP lead to new insights in PDBs and Probabilistic Data Integration (PDI)? Research visit RWTH - Probabilistic Databases 23 Nov 2017 2
CONTENTS § Motivational example § Intuition – What is a probabilistic database § Theoretical foundation – possible worlds theory § Querying a probabilistic database § Real-world query performance § Probabilistic databases for other data models (XML, DataLog) § Conclusions § Outlook on my next presentation on “Probabilistic Data Integration” Research visit RWTH - Probabilistic Databases 23 Nov 2017 3
MOTIVATIONAL EXAMPLE AND WHAT IS A PROBABILISTIC DATABASE?
MOTIVATIONAL EXAMPLE: COMBINING DATA … Car brand Sales Car brand Sales Car brand Sales B.M.W. 25 BMW 72 Bayerische Motoren Werke 8 Mercedes 32 Mercedes-Benz 39 Mercedes 35 Renault 10 Renault 20 Renault 15 Car brand Sales B.M.W. 25 Bayerische Motoren Werke 8 BMW 72 Mercedes 67 Keulen, M. (2012) Mercedes-Benz 39 Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). Renault 45 pp. 138-146. ISSN 1611-2776 Research visit RWTH - Probabilistic Databases 23 Nov 2017 5
… AND THE PROBLEM OF SEMANTIC DUPLICATES Car brand Sales Preferred customers … B.M.W. 25 SELECT SUM(Sales) Bayerische Motoren Werke 8 FROM CarSales BMW 72 WHERE Sales>100 Mercedes 67 Mercedes-Benz 39 Renault 45 0 ‘No preferred customers’ Research visit RWTH - Probabilistic Databases 23 Nov 2017 6
SEMANTIC DUPLICATES Real world Database (of car brands) ω o 1 Car brand Sales d 1 o 2 B.M.W. 25 Bayerische 8 d 2 Motoren Werke d 3 o 3 Mercedes 67 d 4 Renault 45 d 5 o 4 BMW 72 d 6 Mercedes-Benz 39 Research visit RWTH - Probabilistic Databases 23 Nov 2017 7
MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA Run some duplicate detection tool Car brand Sales B.M.W. 25 1 X=0 4 and 5 different 0.2 Bayerische Motoren Werke 8 2 X=1 4 and 5 the same 0.8 BMW 72 3 4 Mercedes 67 X=0 Y=0 “Mercedes” 0.5 Mercedes-Benz 39 5 X=0 correct name Renault 45 6 Y=1 “Mercedes-Benz” 0.5 correct name Mercedes 106 X=1 Y=0 Mercedes-Benz 106 X=1 Y=1 B.M.W. / BMW / Bayerische Motoren Werke analogously Research visit RWTH - Probabilistic Databases 23 Nov 2017 8
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE § Looks like ordinary database § Several “possible” answers or approximate answers to queries § Important: Scalability (big data!) SUM(sales) P Sales of “preferred customers” 0 14% § SELECT SUM(sales) 105 6% FROM carsales 106 56% WHERE sales ≥ 100 211 24% Research visit RWTH - Probabilistic Databases 23 Nov 2017 9
QUERYING AND RELIABILITY ASSESSMENT Sales of “preferred customers” SUM(sales) P § SELECT SUM(sales) 0 14% FROM carsales 105 6% WHERE sales ≥ 100 106 56% § Answer: 106 Risk of substantially 211 24% wrong answer § Risk = Probability * Impact § Analyst only bothered with Second most likely problems that matter answer at 24% with impact factor 2 in sales (211 vs 106) Research visit RWTH - Probabilistic Databases 23 Nov 2017 10
THEORETICAL FOUNDATION QUERYING A PROBABILISTIC DATABASE
X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 POSSIBLE WORLDS THEORY Y=0 “Mercedes” correct 0.5 Y=1 “Mercedes-Benz” correct 0.5 Car brand Sales Mercedes 67 4 X=0 X=1 0.4 Y=0 Mercedes-Benz 39 5 X=0 Car brand Sales Mercedes 106 X=1 Y=0 Mercedes 106 106 Mercedes-Benz 106 X=1 Y=1 Renault 45 X=0 Y=1 Renault 45 6 Car brand Sales 0.1 Mercedes 67 Mercedes-Benz 39 0.2 * 0.5 X=1 Renault 45 0 X=0 = 0.1 Y=1 Y=0 Car brand Sales 0.4 Car brand Sales Mercedes-Benz 106 ‘0’ (0.2) Mercedes 67 Renault 45 106 ‘106’ (0.8) Mercedes-Benz 39 Renault 45 0 Research visit RWTH - Probabilistic Databases 23 Nov 2017 12
POSSIBLE WORLDS THEORY Associate each tuple t i with a sentence 𝜒 i i § (t i , 𝜒 i ) i ) 𝜒 is a propositional formula § atoms of the form r = v Example § ( <carbrand=“Mercedes”, sales=106>, X=1 ⋀ Y=0 ) This tuple exists in all worlds for which 𝜒 is true Research visit RWTH - Probabilistic Databases 23 Nov 2017 13
QUERYING IN POSSIBLE WORLDS THEORY Given a database D as a probabilistic relation § D ={ (t 1 , 𝜒 1 ), … , (t n , 𝜒 n ) } A world w is induced by a random variable assignment for each random variable § Example § 𝜄 w ={ X ⟼ 1, Y ⟼ 0 } P( w ) = P( 𝜄 w ) = 0.8 x 0.5 = 0.4 § w ={ <“Mercedes”, 106>, <“Renault”, 45> } § PWS is the set of all possible worlds Semantics § Tuple t i exists in world w iff is 𝜒 i true for 𝜄 w i true for 𝜄 w § Query semantics: Q( D )= ∪ w ∈ PWS Q( w ) Research visit RWTH - Probabilistic Databases 23 Nov 2017 14
POSSIBLE WORLDS THEORY possible worlds possible answers Q Query semantics Theory Implementation Q’ Query implementation compact representation representation of possible answers Research visit RWTH - Probabilistic Databases 23 Nov 2017 15
PROBABILISTIC RELATIONAL ALGEBRA SQL queries in relational databases are executed by § Translating them to relational algebra (query plan) § Optimizing the query plan § Choosing implementations for certain operators (e.g., mergejoin for a join / index lookup for a select) § Execution of the query plan Approach: § Extend relational algebra with sentence propagation Research visit RWTH - Probabilistic Databases 23 Nov 2017 16
PROBABILISTIC RELATIONAL ALGEBRA A selection of relational algebra operations § Selection: σ p (R) (t, 𝜒 ) ∈ R p(t)=true ⇔ (t, 𝜒 ) ∈ σ p (R) § Cartesian product: R x S (t 1 , 𝜒 1 ) ∈ R (t 2 , 𝜒 2 ) ∈ S ⇔ (t 1 t 2 , 𝜒 1 ⋀ 𝜒 2 ) ∈ R x S § Join: R ⋈ p S = σ p (RxS) § Duplicate removal: 𝛆 (R) (t i , 𝜒 i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) Research visit RWTH - Probabilistic Databases 23 Nov 2017 17
ALTERNATIVE IMPLEMENTATION § In this way one could implement a probabilistic relational database Alternative on-top-of normal RDBMS approach (e.g., MayBMS, Trio) § Represent sentences with additional columns / tables in a normal database § Map SQL query Q to Q’ which additionally performs sentence propagation § Execute Q’ § Logically the same thing, but leverages existing RDBMS functionality to the fullest Research visit RWTH - Probabilistic Databases 23 Nov 2017 18
REAL-WORLD QUERY PERFORMANCE
REAL-WORLD EXPERIMENT ON QUERY PERFORMANCE Research question § Are current probabilistic database prototypes mature enough for real-world use? Approach § PDB prototype: MayBMS § Real world scenario § Probabilistic integration of 3 biological databases § Query load with typical queries for the scenario § Measure query execution times Research visit RWTH - Probabilistic Databases 23 Nov 2017 20
REAL-WORLD SCENARIO ORTHOLOGOUS GROUPS IN BIO-INFORMATICS § Proteins in orthologous group expected to have same function § Example “disease-causing bacteria” “Ancient Paperbird” § Identified protein if silenced will A 0 A 1 A 2 kill bacteria. Side-effects in humans? Ø Find orthologous proteins § Combined insight of multiple sources “Long-beaked “Running § Sources: Homologene, PIRSF, and Paperbird” Paperbird” eggNOG L 0 L 1 L 2 R 0 R 1 R 2 § Challenges § Integrating conflicting groups B. Wanders, M. van Keulen, P. van der Vet, Uncertain § No truth & partial untrustworthiness: Groupings: Probabilistic data established with different combination of grouping data. DEXA 2015. methods by different research groups Research visit RWTH - Probabilistic Databases 23 Nov 2017 21
UNCERTAIN GROUPINGS § Uncertain groupings § Iterative data understanding with Integration views Source Groups How to combine the insights of S 1 ABC 1 DE 1 FG 1 orthologous grouping of S 2 AB 2 CD 2 FH 2 proteins from 3 sources? S 3 ABE 3 FGH 3 Research visit RWTH - Probabilistic Databases 23 Nov 2017 22
INTEGRATION VIEW 1: COMP COMBINATION OF INDEPENDENT COMPONENTS X= Y= ABC 1 DE 1 FG 1 Group 1 1 ABC 1 X=1 DE 1 X=1 AB 2 CD 2 FH 2 2 2 FG 1 Y=1 AB 2 X=2 3 FGH 3 3 ABE 3 CD 2 X=2 FH 2 Y=2 Two choices: ABE 3 X=3 • Source S i is correct for FGH 3 Y=3 component ABCDE • Source S i is correct for component FGH Research visit RWTH - Probabilistic Databases 23 Nov 2017 23
Recommend
More recommend