probabilistic databases
play

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - PowerPoint PPT Presentation

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente Research


  1. PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP

  2. WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente § Research interests: Data integration, data quality, information extraction, natural language processing, data cleaning I am here on a research visit § Period Nov’17 – Feb’18 several times 3 or 2 days § Goals: § Compare Probabilistic Programming (PP) with Probabilistic Databases (PDBs) § Does PP lead to new insights in PDBs and Probabilistic Data Integration (PDI)? Research visit RWTH - Probabilistic Databases 23 Nov 2017 2

  3. CONTENTS § Motivational example § Intuition – What is a probabilistic database § Theoretical foundation – possible worlds theory § Querying a probabilistic database § Real-world query performance § Probabilistic databases for other data models (XML, DataLog) § Conclusions § Outlook on my next presentation on “Probabilistic Data Integration” Research visit RWTH - Probabilistic Databases 23 Nov 2017 3

  4. MOTIVATIONAL EXAMPLE AND WHAT IS A PROBABILISTIC DATABASE?

  5. MOTIVATIONAL EXAMPLE: COMBINING DATA … Car brand Sales Car brand Sales Car brand Sales B.M.W. 25 BMW 72 Bayerische Motoren Werke 8 Mercedes 32 Mercedes-Benz 39 Mercedes 35 Renault 10 Renault 20 Renault 15 Car brand Sales B.M.W. 25 Bayerische Motoren Werke 8 BMW 72 Mercedes 67 Keulen, M. (2012) Mercedes-Benz 39 Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). Renault 45 pp. 138-146. ISSN 1611-2776 Research visit RWTH - Probabilistic Databases 23 Nov 2017 5

  6. … AND THE PROBLEM OF SEMANTIC DUPLICATES Car brand Sales Preferred customers … B.M.W. 25 SELECT SUM(Sales) Bayerische Motoren Werke 8 FROM CarSales BMW 72 WHERE Sales>100 Mercedes 67 Mercedes-Benz 39 Renault 45 0 ‘No preferred customers’ Research visit RWTH - Probabilistic Databases 23 Nov 2017 6

  7. SEMANTIC DUPLICATES Real world Database (of car brands) ω o 1 Car brand Sales d 1 o 2 B.M.W. 25 Bayerische 8 d 2 Motoren Werke d 3 o 3 Mercedes 67 d 4 Renault 45 d 5 o 4 BMW 72 d 6 Mercedes-Benz 39 Research visit RWTH - Probabilistic Databases 23 Nov 2017 7

  8. MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA Run some duplicate detection tool Car brand Sales B.M.W. 25 1 X=0 4 and 5 different 0.2 Bayerische Motoren Werke 8 2 X=1 4 and 5 the same 0.8 BMW 72 3 4 Mercedes 67 X=0 Y=0 “Mercedes” 0.5 Mercedes-Benz 39 5 X=0 correct name Renault 45 6 Y=1 “Mercedes-Benz” 0.5 correct name Mercedes 106 X=1 Y=0 Mercedes-Benz 106 X=1 Y=1 B.M.W. / BMW / Bayerische Motoren Werke analogously Research visit RWTH - Probabilistic Databases 23 Nov 2017 8

  9. WHAT I HAVE NOW IS A PROBABILISTIC DATABASE § Looks like ordinary database § Several “possible” answers or approximate answers to queries § Important: Scalability (big data!) SUM(sales) P Sales of “preferred customers” 0 14% § SELECT SUM(sales) 105 6% FROM carsales 106 56% WHERE sales ≥ 100 211 24% Research visit RWTH - Probabilistic Databases 23 Nov 2017 9

  10. QUERYING AND RELIABILITY ASSESSMENT Sales of “preferred customers” SUM(sales) P § SELECT SUM(sales) 0 14% FROM carsales 105 6% WHERE sales ≥ 100 106 56% § Answer: 106 Risk of substantially 211 24% wrong answer § Risk = Probability * Impact § Analyst only bothered with Second most likely problems that matter answer at 24% with impact factor 2 in sales (211 vs 106) Research visit RWTH - Probabilistic Databases 23 Nov 2017 10

  11. THEORETICAL FOUNDATION QUERYING A PROBABILISTIC DATABASE

  12. X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 POSSIBLE WORLDS THEORY Y=0 “Mercedes” correct 0.5 Y=1 “Mercedes-Benz” correct 0.5 Car brand Sales Mercedes 67 4 X=0 X=1 0.4 Y=0 Mercedes-Benz 39 5 X=0 Car brand Sales Mercedes 106 X=1 Y=0 Mercedes 106 106 Mercedes-Benz 106 X=1 Y=1 Renault 45 X=0 Y=1 Renault 45 6 Car brand Sales 0.1 Mercedes 67 Mercedes-Benz 39 0.2 * 0.5 X=1 Renault 45 0 X=0 = 0.1 Y=1 Y=0 Car brand Sales 0.4 Car brand Sales Mercedes-Benz 106 ‘0’ (0.2) Mercedes 67 Renault 45 106 ‘106’ (0.8) Mercedes-Benz 39 Renault 45 0 Research visit RWTH - Probabilistic Databases 23 Nov 2017 12

  13. POSSIBLE WORLDS THEORY Associate each tuple t i with a sentence 𝜒 i i § (t i , 𝜒 i ) i ) 𝜒 is a propositional formula § atoms of the form r = v Example § ( <carbrand=“Mercedes”, sales=106>, X=1 ⋀ Y=0 ) This tuple exists in all worlds for which 𝜒 is true Research visit RWTH - Probabilistic Databases 23 Nov 2017 13

  14. QUERYING IN POSSIBLE WORLDS THEORY Given a database D as a probabilistic relation § D ={ (t 1 , 𝜒 1 ), … , (t n , 𝜒 n ) } A world w is induced by a random variable assignment for each random variable § Example § 𝜄 w ={ X ⟼ 1, Y ⟼ 0 } P( w ) = P( 𝜄 w ) = 0.8 x 0.5 = 0.4 § w ={ <“Mercedes”, 106>, <“Renault”, 45> } § PWS is the set of all possible worlds Semantics § Tuple t i exists in world w iff is 𝜒 i true for 𝜄 w i true for 𝜄 w § Query semantics: Q( D )= ∪ w ∈ PWS Q( w ) Research visit RWTH - Probabilistic Databases 23 Nov 2017 14

  15. POSSIBLE WORLDS THEORY possible worlds possible answers Q Query semantics Theory Implementation Q’ Query implementation compact representation representation of possible answers Research visit RWTH - Probabilistic Databases 23 Nov 2017 15

  16. PROBABILISTIC RELATIONAL ALGEBRA SQL queries in relational databases are executed by § Translating them to relational algebra (query plan) § Optimizing the query plan § Choosing implementations for certain operators (e.g., mergejoin for a join / index lookup for a select) § Execution of the query plan Approach: § Extend relational algebra with sentence propagation Research visit RWTH - Probabilistic Databases 23 Nov 2017 16

  17. PROBABILISTIC RELATIONAL ALGEBRA A selection of relational algebra operations § Selection: σ p (R) (t, 𝜒 ) ∈ R p(t)=true ⇔ (t, 𝜒 ) ∈ σ p (R) § Cartesian product: R x S (t 1 , 𝜒 1 ) ∈ R (t 2 , 𝜒 2 ) ∈ S ⇔ (t 1 t 2 , 𝜒 1 ⋀ 𝜒 2 ) ∈ R x S § Join: R ⋈ p S = σ p (RxS) § Duplicate removal: 𝛆 (R) (t i , 𝜒 i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) Research visit RWTH - Probabilistic Databases 23 Nov 2017 17

  18. ALTERNATIVE IMPLEMENTATION § In this way one could implement a probabilistic relational database Alternative on-top-of normal RDBMS approach (e.g., MayBMS, Trio) § Represent sentences with additional columns / tables in a normal database § Map SQL query Q to Q’ which additionally performs sentence propagation § Execute Q’ § Logically the same thing, but leverages existing RDBMS functionality to the fullest Research visit RWTH - Probabilistic Databases 23 Nov 2017 18

  19. REAL-WORLD QUERY PERFORMANCE

  20. REAL-WORLD EXPERIMENT ON QUERY PERFORMANCE Research question § Are current probabilistic database prototypes mature enough for real-world use? Approach § PDB prototype: MayBMS § Real world scenario § Probabilistic integration of 3 biological databases § Query load with typical queries for the scenario § Measure query execution times Research visit RWTH - Probabilistic Databases 23 Nov 2017 20

  21. REAL-WORLD SCENARIO ORTHOLOGOUS GROUPS IN BIO-INFORMATICS § Proteins in orthologous group expected to have same function § Example “disease-causing bacteria” “Ancient Paperbird” § Identified protein if silenced will A 0 A 1 A 2 kill bacteria. Side-effects in humans? Ø Find orthologous proteins § Combined insight of multiple sources “Long-beaked “Running § Sources: Homologene, PIRSF, and Paperbird” Paperbird” eggNOG L 0 L 1 L 2 R 0 R 1 R 2 § Challenges § Integrating conflicting groups B. Wanders, M. van Keulen, P. van der Vet, Uncertain § No truth & partial untrustworthiness: Groupings: Probabilistic data established with different combination of grouping data. DEXA 2015. methods by different research groups Research visit RWTH - Probabilistic Databases 23 Nov 2017 21

  22. UNCERTAIN GROUPINGS § Uncertain groupings § Iterative data understanding with Integration views Source Groups How to combine the insights of S 1 ABC 1 DE 1 FG 1 orthologous grouping of S 2 AB 2 CD 2 FH 2 proteins from 3 sources? S 3 ABE 3 FGH 3 Research visit RWTH - Probabilistic Databases 23 Nov 2017 22

  23. INTEGRATION VIEW 1: COMP COMBINATION OF INDEPENDENT COMPONENTS X= Y= ABC 1 DE 1 FG 1 Group 1 1 ABC 1 X=1 DE 1 X=1 AB 2 CD 2 FH 2 2 2 FG 1 Y=1 AB 2 X=2 3 FGH 3 3 ABE 3 CD 2 X=2 FH 2 Y=2 Two choices: ABE 3 X=3 • Source S i is correct for FGH 3 Y=3 component ABCDE • Source S i is correct for component FGH Research visit RWTH - Probabilistic Databases 23 Nov 2017 23

Recommend


More recommend