
Dissociation-based Optimization in Probabilistic Databases



  1. Dissociation-based Optimization in Probabilistic Databases
     Maarten Van den Heuvel¹, Floris Geerts¹, Martin Theobald²
     ¹ Universiteit Antwerpen, Belgium
     ² Ulm University, Germany

  2. Contents
     • Introduction
     • Issues with safety
     • Dissociation: make (probabilistically) unsafe queries safe
     • Top-k: using summaries to speed up inference in safe queries

  3. Introduction
     Which director is most likely to have directed a movie starring an award-winning actor? (Top-1 query)

     DirectedBy
       Director           Movie      P
       George Lucas       Star Wars  0.9
       J.J. Abrams        Star Trek  0.8
       George Lucas       Star Trek  0.1

     PlayedIn
       Movie      Actor              P
       Star Wars  Ewan McGregor      0.9
       Star Wars  Samuel L. Jackson  0.7
       Star Trek  Samuel L. Jackson  0.2

     WonBy
       Actor              Prize   P
       Ewan McGregor      Oscar   0.9
       Samuel L. Jackson  Grammy  0.8

  4. Introduction
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)
     Top-1 query over the DirectedBy, PlayedIn, and WonBy relations of slide 3.

  5. Introduction
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)
     Evaluated over the DirectedBy, PlayedIn, and WonBy relations of slide 3.

     Answers
       Director      P
       George Lucas  0.827
       J.J. Abrams   0.128
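
The answer probabilities on slide 5 can be reproduced by brute force. The sketch below is not taken from the talk: it assumes the three relations are tuple-independent and simply enumerates all 2^8 possible worlds, which is only feasible for a toy database of this size. The function and variable names are illustrative.

```python
from itertools import product

# Tuple-independent relations from slide 3: tuple -> probability.
DIRECTED_BY = {("George Lucas", "Star Wars"): 0.9,
               ("J.J. Abrams", "Star Trek"): 0.8,
               ("George Lucas", "Star Trek"): 0.1}
PLAYED_IN = {("Star Wars", "Ewan McGregor"): 0.9,
             ("Star Wars", "Samuel L. Jackson"): 0.7,
             ("Star Trek", "Samuel L. Jackson"): 0.2}
WON_BY = {("Ewan McGregor", "Oscar"): 0.9,
          ("Samuel L. Jackson", "Grammy"): 0.8}


def exact_answer_probabilities():
    """Sum the weight of every possible world in which Q(X) holds."""
    facts = ([("D", t, p) for t, p in DIRECTED_BY.items()]
             + [("P", t, p) for t, p in PLAYED_IN.items()]
             + [("W", t, p) for t, p in WON_BY.items()])
    result = {}
    for world in product([False, True], repeat=len(facts)):
        weight = 1.0
        present = {"D": set(), "P": set(), "W": set()}
        for (rel, tup, p), keep in zip(facts, world):
            weight *= p if keep else 1.0 - p
            if keep:
                present[rel].add(tup)
        # Directors X with a join path D(X,Y), P(Y,Z), W(Z,U) in this world.
        winners = {actor for actor, _prize in present["W"]}
        movies = {movie for movie, actor in present["P"] if actor in winners}
        directors = {d for d, movie in present["D"] if movie in movies}
        for d in directors:
            result[d] = result.get(d, 0.0) + weight
    return result


probs = exact_answer_probabilities()
for director, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{director}: {p:.3f}")
# George Lucas: 0.827
# J.J. Abrams: 0.128
```

The enumeration grows exponentially with the number of tuples, which is why the rest of the talk looks for bounds instead.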

  6. Introduction
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)
     Evaluated over the DirectedBy, PlayedIn, and WonBy relations of slide 3.
     Top-1 query: not interested in exact P

     Answers
       Director      P
       George Lucas  0.827
       J.J. Abrams   0.128

  7. Introduction
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

     Answers
       Director      P
       George Lucas  0.827
       J.J. Abrams   0.128

     • Not interested in exact probability
     • Interested in ranking

  8. Introduction
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)

     Answers
       Director      P
       George Lucas  0.827
       J.J. Abrams   0.128

     • Not interested in exact probability
     • Interested in ranking
     • Upper and lower bounds are enough

  9. Complexity
     Some queries always have a query plan using probabilistic operators = safe
     • Prob-Join (⋈): P(s) = P(t_1) · … · P(t_n)
     • Prob-Project (π): P(s) = 1 - (1 - P(t_1)) · … · (1 - P(t_n))
       (project is always with duplicate elimination)

     Q(X) :- PlayedIn(X, Y), WonBy(Y, Z)
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )
     PTIME in data size to calculate P(X)
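
The two operators on slide 9 can be sketched directly from their formulas. This is an illustrative implementation, not the authors' code: relations are modeled as lists of (tuple, probability) pairs, and the helper names `prob_project` and `prob_join` are made up for this example. The data is the PlayedIn and WonBy relations from slide 3.

```python
from collections import defaultdict
from math import prod


def prob_project(rows, key):
    """Prob-Project with duplicate elimination: P(s) = 1 - prod(1 - P(t_i))."""
    groups = defaultdict(list)
    for tup, p in rows:
        groups[key(tup)].append(p)
    return [(k, 1.0 - prod(1.0 - p for p in ps)) for k, ps in groups.items()]


def prob_join(left, right, left_key, right_key):
    """Prob-Join: multiply the probabilities of matching tuples."""
    return [(l + r, pl * pr)
            for l, pl in left
            for r, pr in right
            if left_key(l) == right_key(r)]


played_in = [(("Star Wars", "Ewan McGregor"), 0.9),
             (("Star Wars", "Samuel L. Jackson"), 0.7),
             (("Star Trek", "Samuel L. Jackson"), 0.2)]
won_by = [(("Ewan McGregor", "Oscar"), 0.9),
          (("Samuel L. Jackson", "Grammy"), 0.8)]

# Plan: pi_x( PlayedIn join_y pi_y(WonBy) )
winners = prob_project(won_by, key=lambda t: (t[0],))
joined = prob_join(played_in, winners,
                   left_key=lambda l: l[1], right_key=lambda r: r[0])
answers = dict(prob_project(joined, key=lambda t: t[0]))

for movie, p in answers.items():
    print(f"{movie}: {p:.4f}")
# Star Wars: 0.9164
# Star Trek: 0.1600
```

Each operator touches every input tuple a bounded number of times per match, so the plan runs in time polynomial in the data size, in line with the PTIME claim on the slide.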

  10. Complexity
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)
     Evaluated over the DirectedBy, PlayedIn, and WonBy relations of slide 3.
     Has no query plan using probabilistic operators, since they assume independence = unsafe
     #P-hard in data size to calculate P(X)

  11. Idea 1: Approximation using safe queries
     P_low(X) ≤ P(X) ≤ P_up(X)

     Answers
       Director      P
       George Lucas  0.827
       J.J. Abrams   0.128

     • Use Q_low for the lower bound
     • Use Q_up for the upper bound

  12. Approximation using safe queries
     What if we pretend independence?
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Z, U)
     Evaluated over the DirectedBy, PlayedIn, and WonBy relations of slide 3.

  13. Approximation using safe queries
     What if we pretend independence?
     Q'(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Y*, Z, U)

     WonBy (dissociated on Movie*)
       Movie*     Actor   Prize   P
       Star Wars  Ewan    Oscar   0.9
       Star Trek  Ewan    Oscar   0.9
       Star Wars  Samuel  Grammy  0.8
       Star Trek  Samuel  Grammy  0.8

     PlayedIn
       Movie      Actor   P
       Star Wars  Ewan    0.9
       Star Wars  Samuel  0.7
       Star Trek  Samuel  0.2

     DirectedBy
       Director      Movie      P
       George Lucas  Star Wars  0.9
       J.J. Abrams   Star Trek  0.8
       George Lucas  Star Trek  0.1

     • Dissociation (1) gives upper and lower bounds
     • Use the query plan of the safe dissociated query on the original data
     • Works for self-join-free conjunctive queries
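
To make the second bullet concrete, the sketch below runs the plan of the dissociated query Q' on the original relations. It assumes the plan π_x( DirectedBy ⋈_y π_y( PlayedIn ⋈_z π_z(WonBy) ) ) and, following reference (1), assumes the resulting score is an upper bound on the true probability. The function names are illustrative and the code is not from the talk.

```python
from math import prod

DIRECTED_BY = {("George Lucas", "Star Wars"): 0.9,
               ("J.J. Abrams", "Star Trek"): 0.8,
               ("George Lucas", "Star Trek"): 0.1}
PLAYED_IN = {("Star Wars", "Ewan McGregor"): 0.9,
             ("Star Wars", "Samuel L. Jackson"): 0.7,
             ("Star Trek", "Samuel L. Jackson"): 0.2}
WON_BY = {("Ewan McGregor", "Oscar"): 0.9,
          ("Samuel L. Jackson", "Grammy"): 0.8}


def independent_or(probabilities):
    """Prob-Project formula: 1 - prod(1 - p)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def dissociated_upper_bounds():
    # pi_z(WonBy): probability that an actor won some prize.
    won = {}
    for (actor, _prize), p in WON_BY.items():
        won.setdefault(actor, []).append(p)
    won = {actor: independent_or(ps) for actor, ps in won.items()}

    # pi_y(PlayedIn join_z won): probability that a movie stars a winner.
    per_movie = {}
    for (movie, actor), p in PLAYED_IN.items():
        if actor in won:
            per_movie.setdefault(movie, []).append(p * won[actor])
    per_movie = {movie: independent_or(ps) for movie, ps in per_movie.items()}

    # pi_x(DirectedBy join_y per_movie): movies are combined as if they were
    # independent, although they share WonBy tuples. That is the dissociation.
    per_director = {}
    for (director, movie), p in DIRECTED_BY.items():
        if movie in per_movie:
            per_director.setdefault(director, []).append(p * per_movie[movie])
    return {d: independent_or(ps) for d, ps in per_director.items()}


bounds = dissociated_upper_bounds()
for director, p in bounds.items():
    print(f"{director}: {p:.4f}")
# George Lucas: 0.8276
# J.J. Abrams: 0.1280
```

For George Lucas the plan yields about 0.8276, slightly above the exact value 0.827 from slide 5, because the WonBy tuple of Samuel L. Jackson is counted once per movie. J.J. Abrams has a single derivation, so the bound is exact there.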

  14. Dissociation
     Two possible dissociations of the same query:
     Q(X) :- DirectedBy(X, Y), PlayedIn(Y, Z), WonBy(Y*, Z, U)
     Q(X) :- DirectedBy(X, Y, Z*), PlayedIn(Y, Z), WonBy(Z, U)

     Downsides to dissociation:
     • Exponential number of dissociations in query size
     • Different dissociations => different accuracy
     • Possibly insufficient differentiation
     • Need to execute Q to know P

  15. Idea 2: Approximation using summaries
     P_low(X) ≤ P(X) ≤ P_up(X)
     Safe queries alone are not efficient enough: why not approximate these bounds with more bounds?

  16. Approximation using summaries
     Q(X) :- PlayedIn(X, Y), WonBy(Y, Z)
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )

     WonBy(Y, Z)
       Actor              Prize   P
       Ewan McGregor      Oscar   0.9
       Samuel L. Jackson  Grammy  0.8
       …                  …       …

     PlayedIn(X, Y)
       Movie      Actor              P
       Star Wars  Ewan McGregor      0.9
       Star Wars  Samuel L. Jackson  0.7
       Star Trek  Samuel L. Jackson  0.2

  17. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )

     WonBy(Y, Z)
       Actor              Prize   P
       Ewan McGregor      Oscar   0.9
       Samuel L. Jackson  Grammy  0.8
       …                  …       …

     Answers
       Actor              P_up
       Ewan McGregor      ?
       Samuel L. Jackson  ?

  18. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )

     WonBy(Y, Z)
       Actor              Prize   P
       Ewan McGregor      Oscar   0.9
       Samuel L. Jackson  Grammy  0.8
       …                  …       P_max

     The answer for Ewan depends on all n tuples WonBy(Ewan, …):
     • P(s) = 1 - (1 - P(t_1)) · … · (1 - P(t_n))
     • P_up(s) = 1 - (1 - 0.9) · (1 - P_max)^(n-1)

     Answers(π_y)
       Actor              P_up
       Ewan McGregor      0.965
       Samuel L. Jackson  0.853

  19. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )

     WonBy(Y, Z)
       Actor              Prize   P
       Ewan McGregor      Oscar   0.9
       Samuel L. Jackson  Grammy  0.8
       …                  …       P_max

     The answer for Ewan depends on all n tuples WonBy(Ewan, …):
     • P(s) = 1 - (1 - P(t_1)) · … · (1 - P(t_n))
     • P_up(s) = 1 - (1 - 0.9) · (1 - P_max)^(n-1)

     Answers(π_y)
       Actor              P_up
       Ewan McGregor      0.965
       Samuel L. Jackson  0.853

     Summary:
     • Hold P_max
     • Upper bound on n
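
The bound on slide 19 generalizes to any number of tuples already read. The sketch below is an assumption-laden illustration, not the authors' implementation: `seen` holds the probabilities read so far, `p_max` is the summary's cap on any unread probability, and `n_max` is the summary's upper bound on the group size n. The lower bound (unread tuples contribute nothing) is an assumption that matches the P_low values 0.9 and 0.8 shown on slide 20. The example values for `p_max` and `n_max` are hypothetical, since the slides do not state them.

```python
def projection_bounds(seen, p_max, n_max):
    """Bounds on 1 - prod(1 - p_i) when only part of a group has been read.

    seen  : probabilities of the tuples read so far
    p_max : summary value, no unread tuple has a higher probability
    n_max : summary value, the group has at most this many tuples
    """
    miss_seen = 1.0
    for p in seen:
        miss_seen *= 1.0 - p
    unseen = max(n_max - len(seen), 0)
    # Lower bound: pretend the unread tuples do not exist.
    p_low = 1.0 - miss_seen
    # Upper bound: pretend every unread tuple has probability p_max.
    p_up = 1.0 - miss_seen * (1.0 - p_max) ** unseen
    return p_low, p_up


# Hypothetical summary: one tuple with P = 0.9 read, P_max = 0.5, n <= 3.
low, up = projection_bounds([0.9], p_max=0.5, n_max=3)
print(f"{low:.3f} <= P <= {up:.3f}")
# 0.900 <= P <= 0.975
```

With `seen = [0.9]` the upper bound reduces to 1 - (1 - 0.9) · (1 - P_max)^(n-1), the formula on the slide. Lowering `p_max` or `n_max` as more data is read can only tighten the interval.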

  20. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )
     • Recursively propagate the bounds to the query answers

     Answers(π_y)
       Actor              P_up   P_low
       Ewan McGregor      0.965  0.9
       Samuel L. Jackson  0.853  0.8
       …

     Answers(π_x)
       Movie      P_up  P_low
       Star Wars  0.82  0.30
       Star Trek  0.64  0.12

  21. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )
     • Recursively propagate the bounds to the query answers
     • Read more data and update bounds:
       • Lower P_max
       • Better estimate for n

     Answers(π_y)
       Actor              P_up   P_low
       Ewan McGregor      0.92   0.9
       Samuel L. Jackson  0.832  0.8
       …

     Answers(π_x)
       Movie      P_up  P_low
       Star Wars  0.82  0.62
       Star Trek  0.58  0.37

  22. Approximation using summaries
     Plan: π_x( PlayedIn ⋈_y π_y(WonBy) )

     Answers(π_y)
       Actor              P_up   P_low
       Ewan McGregor      0.92   0.9
       Samuel L. Jackson  0.832  0.8
       …

     Answers(π_x)
       Movie      P_up  P_low
       Star Wars  0.82  0.62
       Star Trek  0.58  0.37

     Stop if enough differentiation:
     • No possible candidates
     • No overlapping bounds
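
The "no overlapping bounds" test can be sketched as a small check over the current intervals. This is one plausible reading of the stopping rule, not the authors' code: it only covers answers that are already known and assumes the "no possible candidates" condition (no unseen answer can still enter the ranking) is checked separately. The function name is illustrative.

```python
def top_k_is_decided(bounds, k):
    """bounds maps answer -> (p_low, p_up).

    Returns True once the k answers with the highest lower bounds cannot be
    overtaken by any other known answer.
    """
    ranked = sorted(bounds.items(), key=lambda kv: kv[1][0], reverse=True)
    if len(ranked) <= k:
        return True
    kth_lower = ranked[k - 1][1][0]
    best_other_upper = max(p_up for _answer, (_p_low, p_up) in ranked[k:])
    return kth_lower >= best_other_upper


# Bounds from slide 20: the intervals still overlap, keep reading data.
print(top_k_is_decided({"Star Wars": (0.30, 0.82),
                        "Star Trek": (0.12, 0.64)}, k=1))   # False

# Bounds from slide 22: 0.62 >= 0.58, so the top-1 answer is settled.
print(top_k_is_decided({"Star Wars": (0.62, 0.82),
                        "Star Trek": (0.37, 0.58)}, k=1))   # True
```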

  23. Dissociation++
     Over the DirectedBy, PlayedIn, and WonBy relations of slide 3.
     Choosing a good dissociation is costly, but:
     • Accuracy depends on the number of faulty independence assumptions
     • Estimate with n statistics in summaries!

  24. Questions/Challenges
     • Implementation: ongoing
     • Accuracy:
       • Bounds good enough to differentiate?
       • Statistics good enough to approximate faulty independence assumptions?
     • Summary: what are good summaries regarding size, detail, …?
     Thank you for your attention!

  25. References
     (1) Gatterbauer, W., & Suciu, D. (2014). Oblivious bounds on the probability of Boolean functions. ACM Transactions on Database Systems (TODS), 39(1), 5.
     (2) Gatterbauer, W., & Suciu, D. (2015). Approximate lifted inference with probabilistic databases. Proceedings of the VLDB Endowment, 8(5), 629-640.
     (3) Dylla, M., Miliaraki, I., & Theobald, M. (2013). Top-k query processing in probabilistic databases with non-materialized views. In 2013 IEEE 29th International Conference on Data Engineering (ICDE) (pp. 122-133). IEEE.
