Techniques for managing probabilistic data Dan Suciu University of Washington 1 Databases Are Deterministic Applications since 1970s required precise semantics Accounting, inventory Database tools are deterministic A


  1. Semantics 1: Possible Tuples

     Review^p                    Movie^p
     mid   rating   P            id    year   P
     m42   7        0.5          m42   1995   0.6
     m42   4        0.3          m99   2002   0.8
     m42   9        0.9          m76   2002   0.3
     m99   7        0.6
     m99   5        0.2
     m76   6        0.3

     q(y) :- Movie^p(x, y), Review^p(x, z), z>3

     Answer
     year   P
     1995   p1+p4+p5+p8+p9
     2002   p3+p4+p7

     [Figure: the possible worlds of Movie^p and Review^p, with probabilities p1, p2, ...; the probability of each answer is the sum of the probabilities of the worlds that return it.]

  2. Formal Definition
     Probability space: (Ω, P)
     Query: q; tuple: a; q(a) is a Boolean query
     Probabilistic event: E = { ω | ω |= q(a) }
     Definition: P(q(a)) = P(E) = ∑_{ω |= q(a)} P(ω)
     Example:
       q(y) :- Movie^p(x, y), Review^p(x, z), z>3
       q(1995) :- Movie^p(x, 1995), Review^p(x, z), z>3
       P(q(1995)) = marginal probability of q(1995)
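The definition P(q(a)) = ∑ P(ω) can be checked by brute force on the Movie/Review example. The sketch below is only an illustration: it assumes all tuples are independent (the evaluation formulas later in the deck suggest this), reuses the probabilities from the example tables, and the function name `answer_probabilities` is made up for this example.

```python
from itertools import product

# Tuple-independent tables from the example (independence is an assumption of this sketch).
movie = [("m42", 1995, 0.6), ("m99", 2002, 0.8), ("m76", 2002, 0.3)]
review = [("m42", 7, 0.5), ("m42", 4, 0.3), ("m42", 9, 0.9),
          ("m99", 7, 0.6), ("m99", 5, 0.2), ("m76", 6, 0.3)]

def answer_probabilities():
    """Return {year: P(q(year))} by summing P(world) over the worlds where q(year) holds."""
    tuples = [("M", t) for t in movie] + [("R", t) for t in review]
    result = {}
    # Each world keeps or drops every tuple independently: 2^9 worlds in total.
    for present in product([False, True], repeat=len(tuples)):
        weight = 1.0
        movies, reviews = [], []
        for keep, (rel, (a, b, p)) in zip(present, tuples):
            weight *= p if keep else 1.0 - p
            if keep:
                (movies if rel == "M" else reviews).append((a, b))
        # q(y) :- Movie(x, y), Review(x, z), z > 3
        years = {y for (x, y) in movies for (x2, z) in reviews if x == x2 and z > 3}
        for y in years:
            result[y] = result.get(y, 0.0) + weight
    return result

probs = answer_probabilities()
```

Enumeration is exponential in the number of tuples, so it is useful only as a reference point for the extensional operators discussed later.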

  3. Semantics 2: Possible Answers

     q(y) :- Movie^p(x, y), Review^p(x, z), z>3

     [Figure: each possible world of Movie^p and Review^p is mapped by q to a possible answer (a set of years); the possible answers carry probabilities p1, p2, p3, ...]

  4. Formal Definition
     Probability space: (Ω, P)
     View: v
     New probability space: (Ω’, P’)
     Definition:
       Ω’ = { ω’ | ∃ ω ∈ Ω, v(ω) = ω’ }
       P’(ω’) = ∑_{ω : v(ω) = ω’} P(ω)
     “Image probability space” [Green&Tannen’06]

  5. Query Semantics
     • Possible tuples: best for expressing user queries
       – Simple, intuitive user interface
       – Query evaluation is probabilistic inference
       – But is not compositional
     • Possible answers: best for defining views
       – Is compositional
       – Open research problems: user interface, query evaluation

  6. Complex Models = Simple + Views
     Example adapted from [Gupta&Sarawagi’2006]

     Address^p
     ID   House-No   Street, City            P
     1    52         Goregaon West Mumbai    0.06
     1    52-A       Goregaon West Mumbai    0.15
     1    52         Goregaon West Mumbai    0.12
     1    52-A       Goregaon West Mumbai    0.3
     2    ...        ...                     ...

     Suppose House-No is extracted independently from Street and City.

  7. Address^p
     ID   House-No   Street, City            P
     1    52         Goregaon West Mumbai    0.06
     1    52-A       Goregaon West Mumbai    0.15
     1    52         Goregaon West Mumbai    0.12
     1    52-A       Goregaon West Mumbai    0.3
     2    ...        ...                     ...

     AddrH^p                     AddrSC^p
     ID   House-No   P           ID   Street, City            P
     1    52         0.2         1    Goregaon West Mumbai    0.3
     1    52-A       0.5         1    Goregaon West Mumbai    0.6
     2    ...        ...         2    ...                     ...

     View: Address(x,y,z,u) :- AddrH(x,y), AddrSC(x,z,u)

  8. Complex Models = Simple + Views
     Standard query rewriting:
     View:            Address(x,y,z,u) :- AddrH(x,y), AddrSC(x,z,u)
     User query:      q(x) :- Address(x,y,z,’West Mumbai’)
     Rewritten query: q(x) :- AddrH(x,y), AddrSC(x,z,’West Mumbai’)

  9. Complex Models = Simple + Views
     • In this simple example the view is already representable as a tuple disjoint/independent table
     • In general, views can define more complex probability spaces over possible worlds, which are not disjoint/independent
     Theorem [Dalvi&S’2007]: Independent/disjoint tables + conjunctive views = a complete representation system

  10. Discussion of Data Model
      Tuple-disjoint/independent tables:
      • Simple model, can store in any DBMS
      More advanced models:
      • Symbolic boolean expressions [Fuhr and Roellke]
      • Trio: add lineage [Widom05, Das Sarma’06, Benjelloun 06]
      • Probabilistic Relational Models [Getoor’2006]
      • Graphical models [Sen&Deshpande’07]

  11. Outline
      Part 1:
      • Motivation
      • Data model
      • Basic query evaluation
      Part 2:
      • The dichotomy of query evaluation
      • Implementation and optimization
      • Six Challenges

  12. Extensional Operators

      HasObject^p
      Object     Person   Location   P
      Laptop77   John     L45        p1
      Laptop77   Jim      L45        p2
      Laptop77   Jim      L66        p3
      Book302    Mary     L66        p4
      Book302    Mary     L45        p5
      Book302    Jim      L66        p6
      Book302    John     L45        p7
      Book302    Fred     L45        p8

      q(z) :- HasObject^p(Book302, y, z)

      Location   P
      L66        p4+p6
      L45        p5+p7+p8

  13. Disjoint Project
      Π^d: input probabilities p1, p2, p3 → output probability p1+p2+p3

  14. Extensional Operators

      HasObject^p
      Object     Person   Location   P
      Laptop77   John     L45        p1
      Laptop77   Jim      L45        p2
      Laptop77   Jim      L66        p3
      Book302    Mary     L66        p4
      Book302    Mary     L45        p5
      Book302    Jim      L66        p6
      Book302    John     L45        p7
      Book302    Fred     L45        p8

      q(y,z) :- HasObject^p(x, y, z)

      Person   Location   P
      Jim      L66        1-(1-p3)(1-p6)
      John     L45        1-(1-p1)(1-p7)
      . . .

  15. Independent Project
      Π^i: input probabilities p1, p2, p3 → output probability 1-(1-p1)(1-p2)(1-p3)
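The two projection operators differ only in how they combine the probabilities of the tuples that collapse into one output tuple. A minimal sketch, with hypothetical function names `disjoint_project` and `independent_project` that are not taken from any system mentioned in the deck:

```python
from math import prod

def disjoint_project(probs):
    """Pi^d: the grouped tuples are mutually exclusive, so their probabilities add."""
    total = sum(probs)
    if total > 1.0 + 1e-12:
        raise ValueError("probabilities of disjoint tuples must sum to at most 1")
    return total

def independent_project(probs):
    """Pi^i: the grouped tuples are independent, so P = 1 - prod(1 - p)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Same inputs, different semantics.
d = disjoint_project([0.1, 0.2, 0.3])
i = independent_project([0.1, 0.2, 0.3])
```

Applying the wrong operator silently yields a wrong probability, which is why the safety conditions later in the deck matter.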

  16. A Taste of Query Evaluation
      q(y) :- Movie^p(x, y), Review^p(x, z), z>3

      Review                      Movie
      mid   rating   P            id    year   P
      m42   7        q1           m42   1995   p1
      m42   4        q2           m99   2002   p2
      m42   9        q3           m76   2002   p3
      m99   7        q4
      m99   5        q5
      m76   6        q6

      Answer
      year   P
      1995   p1 × (1 - (1 - q1) × (1 - q2) × (1 - q3))
      2002   1 - (1 - p2 × (1 - (1 - q4) × (1 - q5))) × (1 - p3 × q6)

  17. Answer depends on query plan!
      q(y) :- Movie^p(x, y), Review^p(x, z), evaluated for q(1995)

      Plan 1 (INCORRECT): Π^i_y( Movie(x,y) ⋈_x Review(x,z) )
        join output:  p1q1, p1q2, p1q3
        result:       1-(1-p1q1)(1-p1q2)(1-p1q3)

      Plan 2 (CORRECT, “safe plan”): Π^i_y( Movie(x,y) ⋈_x Π^i_x(Review(x,z)) )
        Π^i_x output: 1-(1-q1)(1-q2)(1-q3)
        join output:  p1(1-(1-q1)(1-q2)(1-q3))
        result:       1-(1-p1(1-(1-q1)(1-q2)(1-q3)))(1-…)…
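The gap between the two plans can be made concrete with numbers. The sketch below restricts q(1995) to a single movie with probability p1 and three reviews q1, q2, q3; the numeric values are illustrative assumptions, and `brute_force` enumerates all 16 worlds as ground truth.

```python
from itertools import product
from math import prod

p1 = 0.6                 # one Movie tuple (illustrative value)
q = [0.5, 0.3, 0.9]      # three Review tuples for the same movie (illustrative values)

def indep_or(ps):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in ps)

# Join first, then project: wrongly treats p1*q1, p1*q2, p1*q3 as independent events.
unsafe = indep_or([p1 * qi for qi in q])

# Project Review on x first, then join: the remaining events really are independent.
safe = p1 * indep_or(q)

def brute_force():
    """Sum the weights of all worlds that contain the movie and at least one review."""
    probs = [p1] + q
    total = 0.0
    for world in product([0, 1], repeat=len(probs)):
        weight = prod(p if bit else 1.0 - p for bit, p in zip(world, probs))
        if world[0] and any(world[1:]):
            total += weight
    return total

truth = brute_force()
```

With these values the safe plan agrees with the enumeration, while the join-first plan overestimates, because all three join results share the same Movie tuple.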

  18. Safe Plans are Efficient
      • Very efficient: run almost as fast as regular queries
      • Require only simple modifications of the relational operators
      • Or can be translated back into SQL and sent to any RDBMS
      Can we always generate a safe plan?

  19. A Hard Query

      R^p                 S              T^p
      A   B    P          B    C         C    D   P
      a   x1   p1         x1   y1        y1   c   q1
      a   x2   p2         x1   y2        y2   c   q2
                          x2   y1

      h(u,v) :- R^p(u, x), S(x, y), T^p(y, v)

      [Figure: the plan Π^i( Π^i(R ⋈ S) ⋈ T ) evaluated for h(a,c), with intermediate probabilities p1, p2, p1 after R ⋈ S and (1-(1-p1)(1-p2))q1, p2q2 after the join with T; the final Π^i is marked “Unsafe!”]

      There is no safe plan!

  20. Independent Queries
      Let q1, q2 be two boolean queries
      Definition: q1, q2 are “independent” if P(q1, q2) = P(q1) P(q2)
      Also: P(q1 ∨ q2) = 1 - (1 - P(q1))(1 - P(q2))

  21. Quiz: which are independent?

      q1                                        q2                                        Indep.?
      Movie^p(m41, y)                           Review^p(m41, z)
      Movie^p(m42, y), Review^p(m42, z)         Movie^p(m77, y), Review^p(m77, z)
      Movie^p(m42, y), Review^p(m42, z)         Movie^p(m42, 1995)
      Movie^p(m42, y), Review^p(m42, 7)         Movie^p(m42, y), Review^p(m42, 4)
      R^p(x, y, z, z, u), R^p(x, x, x, y, y)    R^p(a, a, b, b, c)

  22. Answers

      q1                                        q2                                        Indep.?
      Movie^p(m41, y)                           Review^p(m41, z)                          YES
      Movie^p(m42, y), Review^p(m42, z)         Movie^p(m77, y), Review^p(m77, z)         YES
      Movie^p(m42, y), Review^p(m42, z)         Movie^p(m42, 1995)                        NO
      Movie^p(m42, y), Review^p(m42, 7)         Movie^p(m42, y), Review^p(m42, 4)         NO
      R^p(x, y, z, z, u), R^p(x, x, x, y, y)    R^p(a, a, b, b, c)                        YES

      Prop: If no two subgoals unify then q1, q2 are independent
      Note: this condition is sufficient but not necessary (in the last row the subgoals unify, yet the queries are independent)
      Theorem: Independence is Π^p_2-complete [Miklau&S’04]
      Reducible to query containment [Machanavajjhala&Gehrke’06]

  23. Disjoint Queries
      Let q1, q2 be two boolean queries
      Definition: q1, q2 are “disjoint” if P(q1, q2) = 0
      Iff q1, q2 depend on two disjoint tuples t1, t2

  24. Quiz: which are disjoint?

      q1                                     q2                                    ?
      HasObject^p(‘book’, ‘9’, ‘Mary’, x)    HasObject^p(‘book’, ‘9’, ‘Jim’, x)
      HasObject^p(‘book’, t, ‘Mary’, x)      HasObject^p(‘book’, t, ‘Jim’, x)
      HasObject^p(‘book’, ‘9’, u, x)         HasObject^p(‘book’, ‘9’, v, x)

  25. Answers

      q1                                     q2                                    ?
      HasObject^p(‘book’, ‘9’, ‘Mary’, x)    HasObject^p(‘book’, ‘9’, ‘Jim’, x)    Y
      HasObject^p(‘book’, t, ‘Mary’, x)      HasObject^p(‘book’, t, ‘Jim’, x)      N
      HasObject^p(‘book’, ‘9’, u, x)         HasObject^p(‘book’, ‘9’, v, x)        N

      Proposition: q1, q2 are “disjoint” if they contain subgoals g1, g2 that:
      • Have the same values for the key attributes
      • These values are constants
      • Have at least one different constant in the non-key attributes

  26. Definition of Safe Operators
      • σ_{x=a} applied to q(x): always “safe”
      • q1(x) ⋈ q2(x): “safe” if ∀ a, q1(a), q2(a) are independent
      • Π^i applied to q(x): “safe” if ∀ a, b, q(a), q(b) are independent
      • Π^d applied to q(x): “safe” if ∀ a, b, q(a), q(b) are disjoint

  27. Example 1
      q(y_c) :- Movie^p(x, y_c), Review^p(x, z)        (y_c “is a constant”)

      Plan: Π^i_y( Movie(x,y) ⋈_x Review(x,z) )
        after the join:       q1(x,z) :- Movie(x,y_c), Review(x,z)
        after the projection: q1 :- Movie(x,y_c), Review(x,z)

      The projection Π^i_y is unsafe, because these are dependent:
        q1(m42,7) = Movie(m42,y_c), Review(m42,7)
        q1(m42,4) = Movie(m42,y_c), Review(m42,4)

  28. Example 2
      q(y_c) :- Movie^p(x, y_c), Review^p(x, z)        (y_c “is a constant”)

      Plan: Π^i_y( Movie(x,y) ⋈_x Π^i_x(Review(x,z)) )
        after the join:       q1(x) :- Movie(x,y_c), Review(x,z)
        after the projection: q1 :- Movie(x,y_c), Review(x,z)

      Now these are independent, so the projection Π^i_y is safe:
        q1(m42) = Movie(m42,y_c), Review(m42,z)
        q1(m77) = Movie(m77,y_c), Review(m77,z)

  29. Complexity Class #P [Valiant’79]
      Definition: #P is the class of functions f(x) for which there exists a PTIME non-deterministic Turing machine M s.t. f(x) = number of accepting computations of M on input x
      Examples:
      • SAT = “given formula Φ, is Φ satisfiable?” = NP-complete
      • #SAT = “given formula Φ, count # of satisfying assignments” = #P-complete

  30. All You Need to Know About #P [Valiant’79] [Provan&Ball’83]

      Class                         Example                                              SAT     #SAT
      3CNF                          (X ∨ Y ∨ Z) ∧ (¬X ∨ U ∨ W) …                         NP      #P
      2CNF                          (X ∨ Y) ∧ (¬X ∨ U) …                                 PTIME   #P
      Positive, partitioned 2CNF    (X1 ∨ Y1) ∧ (X1 ∨ Y4) ∧ (X2 ∨ Y1) ∧ (X3 ∨ Y1) …     PTIME   #P
      Positive, partitioned 2DNF    (X1 ∧ Y1) ∨ (X1 ∧ Y4) ∨ (X2 ∧ Y1) ∨ (X3 ∧ Y1) …     PTIME   #P

      Here NP, #P means “NP-complete, #P-complete”

  31. #P-Hard Queries
      hd1 :- R^p(x), S(x, y), T^p(y)
      Theorem: The query hd1 is #P-hard (see also [Graedel et al. 98])
      Proof: Reduction from partitioned, positive 2DNF
      E.g. Φ = x1 y1 ∨ x2 y1 ∨ x1 y2 ∨ x3 y2 reduces to

      R^p              S              T^p
      A    P           A    B         B    P
      x1   0.5         x1   y1        y1   0.5
      x2   0.5         x2   y1        y2   0.5
      x3   0.5         x1   y2
                       x3   y2

      #Φ = P(hd1) * 2^n
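The identity #Φ = P(hd1) * 2^n can be verified on the example instance, where n = 5 is the number of Boolean variables (three x's and two y's). This is a brute-force sketch for illustration only; the helper names `count_models` and `prob_hd1` are hypothetical.

```python
from itertools import product

# The reduction's instance: every R and T tuple has probability 0.5, S is deterministic.
R = {"x1": 0.5, "x2": 0.5, "x3": 0.5}
T = {"y1": 0.5, "y2": 0.5}
S = [("x1", "y1"), ("x2", "y1"), ("x1", "y2"), ("x3", "y2")]
names = list(R) + list(T)

def count_models():
    """#Phi: number of truth assignments satisfying the 2DNF formula encoded by S."""
    count = 0
    for bits in product([False, True], repeat=len(names)):
        val = dict(zip(names, bits))
        if any(val[x] and val[y] for x, y in S):
            count += 1
    return count

def prob_hd1():
    """P(hd1): total weight of the worlds in which some R(x), S(x,y), T(y) all hold."""
    probs = {**R, **T}
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        present = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= probs[n] if present[n] else 1.0 - probs[n]
        if any(present[x] and present[y] for x, y in S):
            total += weight
    return total

models = count_models()
scaled = prob_hd1() * 2 ** len(names)
```

Because every world has weight 2^-n, computing P(hd1) exactly is as hard as counting the models of Φ.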

  32. #P-Hard Queries
      • #P-hard queries do not have safe plans
      • Do not have any PTIME algorithm
        – Unless P = NP
      • Can be evaluated using probabilistic inference
        – Exponential time exact algorithms, or
        – PTIME approximations, e.g. Luby&Karp
      • In our experience with MystiQ, unsafe queries are 2 orders of magnitude slower than safe queries, and that only after optimizations

  33. Lessons
      What do users want?
      • Arbitrary queries, not just safe queries
        – Safe query → very fast
        – Unsafe query → begs for optimizations
      What should the system do?
      • Aggressively check if a query is safe
      • If not, aggressively search for safe subqueries
      Key problem: identifying the safe queries

  34. Dichotomy Property
      LANG = a query language
      REP = a representation formalism (independent or independent/disjoint)
      REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG:
      (1) The complexity of q is PTIME, or
      (2) The complexity of q is #P-hard
      LANG:
      • CQ = conjunctive queries
      • CQ^1 = conjunctive queries without self-joins
      Theorems: The dichotomy property holds for:
      1. CQ^1 and independent dbs.
      2. CQ^1 and disjoint/independent dbs.
      3. CQ and independent dbs.

  35. Summary So Far
      • Lots of applications need probabilistic data
      • Tuple disjoint/independent data model
        – Sufficient for many applications
        – Can be made complete through views
        – Ideal for studying query evaluation
      • Query evaluation
        – Some (many?) queries are inherently hard
        – Main optimization tool: safe queries

  36. Outline
      Part 1:
      • Motivation
      • Data model
      • Basic query evaluation
      Part 2:
      • The dichotomy of query evaluation
      • Implementation and optimization
      • Six Challenges

  37. Dichotomy Property
      LANG = a query language
      REP = a representation formalism (independent or independent/disjoint)
      REP, LANG have the DICHOTOMY PROPERTY if ∀ q ∈ LANG:
      (1) The complexity of q is PTIME, or
      (2) The complexity of q is #P-hard
      LANG:
      • CQ = conjunctive queries
      • CQ^1 = conjunctive queries without self-joins
      Theorems: The dichotomy property holds for:
      1. CQ^1 and independent dbs.
      2. CQ^1 and disjoint/independent dbs.
      3. CQ and independent dbs.

  38. PTIME Queries vs. #P-Hard Queries

      PTIME Queries                                        #P-Hard Queries
      R( x, y ), S( x, z )                                 hd1 = R( x ), S( x, y ), T( y )
      R( x , y), S( y ), T( ‘a’ , y)                       hd2 = R( x ,y), S( y )
      R( x ), S( x, y ), T( y ), U( u , y), W( ‘a’ , u)    hd3 = R( x ,y), S(x, y )
      . . .                                                . . .

      Will discuss next how to decide their complexity and how to evaluate PTIME queries

  39. Hierarchical Queries
      sg(x) = set of subgoals containing the variable x in a key position
      Definition: A query q is hierarchical if for all x, y: sg(x) ⊇ sg(y) or sg(x) ⊆ sg(y) or sg(x) ∩ sg(y) = ∅
      Non-hierarchical: h1 = R(x), S(x, y), T(y)
      Hierarchical: q = R(x, y), S(x, z)
      [Figure: diagrams of the sets sg(x), sg(y) over R, S, T for h1, and sg(x), sg(y), sg(z) over R, S for q]
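The hierarchy test is a purely syntactic check on the sets sg(x). The sketch below makes simplifying assumptions that should be read as such: queries are Boolean, constants are left out of the subgoals, and every variable occurrence counts as a key position (as in tuple-independent tables). The representation of a query as a list of `(relation, variables)` pairs is hypothetical.

```python
def is_hierarchical(query):
    """query: list of (relation_name, tuple_of_variables), constants omitted."""
    # sg[v] = indices of the subgoals in which variable v occurs
    sg = {}
    for idx, (_, variables) in enumerate(query):
        for v in variables:
            sg.setdefault(v, set()).add(idx)
    names = list(sg)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            a, b = sg[x], sg[y]
            # Each pair must be nested one way or the other, or disjoint.
            if not (a <= b or b <= a or a.isdisjoint(b)):
                return False
    return True

h1 = [("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]
q = [("R", ("x", "y")), ("S", ("x", "z"))]
```

For h1, sg(x) = {R, S} and sg(y) = {S, T} overlap without being nested, so the check fails; for q, sg(y) and sg(z) are both contained in sg(x), so it succeeds.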

  40. Case 1: CQ^1 + Independent
      • Dichotomy established in [Dalvi&S’2004]
      • CQ^1 (conjunctive queries, no self-joins):
        – R( x , y ), S( y , z )   OK
        – R( x , y ), R( y , z )   Not OK
      • Independent tuples only:
        – R( x , y )   OK
        – S( y ,z)   Not OK

  41. CQ^1 + Independent [Dalvi&S’2004]
      Theorem: For all q ∈ CQ^1:
      • q is hierarchical, has a safe plan, and is in PTIME, OR
      • q is not hierarchical and is #P-hard

  42. The PTIME Queries
      Algorithm: convert a Hierarchy to a Safe Plan (independent tables)
      1. Root variable u → Π^i_{-u} (independent project)
      2. Connected components → Join
      3. Single subgoal → Leaf node

      Example: q = R(x, y), S(x, z)
        → Π^i_{-x}( Π^i_{-y}(R^p(x, y)) ⋈_x Π^i_{-z}(S^p(x, z)) )
      [Figure: hierarchy diagram with x above y (in R) and z (in S)]

  43. Example: evaluating the safe plan for q = R(x, y), S(x, z)

      R^p(x, y)             S^p(x, z)
      A    B    P           A    C    P
      a1   b1   p1          a1   c1   q1
      a2   b2   p2          a1   c2   q2
                            a2   c3   q3
                            a2   c4   q4
                            a2   c5   q5

      Π_{-z}(S^p):
      A    P
      a1   1-(1-q1)(1-q2)
      a2   1-(1-q3)(1-q4)(1-q5)

      Π_{-y}(R^p) ⋈_x Π_{-z}(S^p):
      A    P
      a1   p1(1-(1-q1)(1-q2))
      a2   p2(1-(1-q3)(1-q4)(1-q5))

      Π_{-x}:
      P(q) = 1 - (1-p1(1-(1-q1)(1-q2))) * (1-p2(1-(1-q3)(1-q4)(1-q5)))

  44. The #P-Hard Queries [D&S’2004]
      Are precisely the non-hierarchical queries.
      Example: hd1 :- R(x), S(x, y), T(y)
      More general: q :- …, R(x, …), S(x, y, …), T(y, …), …
      Theorem: Testing if q is PTIME or #P-hard is in AC^0

  45. Quiz: What is their complexity?

      q                                           PTIME or #P?
      R(x, y), S(y, a, u), T(y, y, v)
      R(x, y), S(x, y, z), T(x, z)
      R(x, a), S(y, u, x), T(u, y), U(x, y)
      R(x, y, z), S(z, u, y), T(y, v, z, x), U(y)

  46. Hint…

      q                                           PTIME or #P?
      R(x, y), S(y, a, u), T(y, y, v)
      R(x, y), S(x, y, z), T(x, z)
      R(x, a), S(y, u, x), T(u, y), U(x, y)
      R(x, y, z), S(z, u, y), T(y, v, z, x), U(y)

      [Figure: next to each query, a diagram of the sets sg(·) of its variables over the subgoals R, S, T, U]

  47. …Answer

      q                                           PTIME or #P?
      R(x, y), S(y, a, u), T(y, y, v)             PTIME
      R(x, y), S(x, y, z), T(x, z)                #P
      R(x, a), S(y, u, x), T(u, y), U(x, y)       #P
      R(x, y, z), S(z, u, y), T(y, v, z, x), U(y) PTIME

      [Figure: next to each query, a diagram of the sets sg(·) of its variables over the subgoals R, S, T, U]

  48. Case 2: CQ^1 + Disjoint/independent
      • Dichotomy: in [Dalvi et al.’06, Dalvi&S’07]
      • Some safe plans also in [Andritsos’2006]
      • CQ^1 (conjunctive queries, no self-joins)
      • Disjoint/independent tables are OK
      Theorem: For all q ∈ CQ^1:
      • q has a safe plan and is in PTIME, OR
      • q is #P-hard

  49. The PTIME Queries
      Algorithm: find a Safe Plan
      1. Root variable u → Π^i_{-u}
      2. Variable u occurs in a subgoal with constant keys → Π^D_{-u}
      3. Connected components → Join
      4. Single subgoal → Leaf node

      Example: q(y) :- R( x ,y,z), evaluated as Π^i_{-x}( Π^D_{-z}( R( x ,y,z) ) )

      R( x ,y,z):
      x    y   z    P
      a1   b   c1   p1
      a1   b   c2   p2
      a2   b   c1   p3
      a2   b   c2   p4

      Π^D_{-z}, giving q1(x_c, y_c) :- R( x_c ,y_c,z):
      x    y   P
      a1   b   p1+p2
      a2   b   p3+p4

      Π^i_{-x}, giving q(y):
      y   P
      b   1-(1-p1-p2)(1-p3-p4)

  50. Safe plan for R( x ), S( x, y ), T( y ), U( u , y), W( ‘a’ , u)

      Π^D_{-u}                        (disjoint project)
        ⋈_u
          W^p(‘a’, u)
          Π^D_{-y}                    (disjoint project)
            ⋈_y
              Π^I_{-x}                (independent project)
                ⋈_x
                  R^p(x)
                  S^p(x, y)
              T^p(y)
              U^p(u, y)

      [Figure: hierarchy diagram of the variables u, y, x over the subgoals W, U, T, S, R]

  51. The #P-Hard Queries [Dalvi&S’2007]
      hd1 = R( x ), S( x, y ), T( y )
      hd2 = R( x ,y), S( y )
      hd3 = R( x ,y), S(x, y )
      There are variations on hd2, hd3 (see paper)
      In general, a query is #P-hard if it can be “rewritten” to hd1, hd2, hd3 or one of their “variations”.
      Theorem: Testing if q is PTIME or #P-hard is PTIME-complete

  52. Case 3: Any conjunctive query, independent tables [Dalvi&S’2007b]
      Let q be hierarchical
      • x ⊇ y denotes: x is above y in the hierarchy
      • x ≡ y denotes: x ⊇ y and x ⊆ y
      Definition: An inversion is a chain of unifications:
        x ⊃ y with u1 ≡ v1 with … with un ≡ vn with x’ ⊂ y’
      Theorem: For all q ∈ CQ:
      • If q is non-hierarchical, or has an inversion*, then it is #P-hard
      • Otherwise it is in PTIME
      *without “eraser”: see paper.

  53. The #P-hard Queries [Dalvi&S’2007b]
      Hierarchical queries with “inversions”:
      hi1 = R(x), S(x, y), S(x’, y’), T(y’)
        x ⊃ y unifies with x’ ⊂ y’
      hi2 = R(x), S(x, y), S(u, v), S’(u, v), S’(x’, y’), T(y’)
        x ⊃ y unifies with u ≡ v, which unifies with x’ ⊂ y’
      [Figure: hierarchy diagrams for hi1 and hi2 showing the unifying subgoals]

  54. The #P-hard Queries
      A query with a long inversion:
      hi_k = R(x), S0(x, y),
             S0(u1, v1), S1(u1, v1),
             S1(u2, v2), S2(u2, v2),
             . . .
             Sk(x’, y’), T(y’)

  55. The #P-hard Queries
      Sometimes inversions are exposed only after making a copy of the query:
      q = R(x, y), R(y, z)
      Two copies: R(x,y), R(y,z) and R(x’,y’), R(y’,z’)

  56. The PTIME Queries
      Find movies with high reviews from Joe and Jim:
      q(x) :- Movie(x,y),
              Match(x,r), Review(r,Joe,s), s > 4,
              Match(x,r’), Review(r’,Jim,s’), s’ > 4
      • Match(x,r) and Match(x,r’): unify, but no inversion
      • Review(r,Joe,s) and Review(r’,Jim,s’): don’t unify
      Note: the query is hierarchical because x is a “constant”

  57. The PTIME Queries [Dalvi&S’2007b]
      Note: no “safe plans” are known! The PTIME algorithm for an inversion-free query is given in terms of expressions, not plans.
      Example: q :- R(a, x), R(y, b)
        p(q) = p(R(a,b)) + (1 - p(R(a,b))) × (1 - ∏_{y ∈ Dom, y ≠ a} (1 - p(R(y,b)))) × (1 - ∏_{x ∈ Dom, x ≠ b} (1 - p(R(a,x))))
      Open Problem: what are the natural operators that allow us to compute inversion-free queries in a database engine?
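The expression for q :- R(a, x), R(y, b) follows from a case split: if R(a,b) is present the query holds; otherwise it needs some R(a,x) with x ≠ b and some R(y,b) with y ≠ a, and these two tuple sets do not overlap, so the two events are independent. The sketch below checks this reading against brute-force enumeration on a small, made-up tuple-independent relation; the relation contents and function names are assumptions for illustration.

```python
from itertools import product
from math import prod

# Hypothetical tuple-independent relation R over Dom = {a, b, c}: (first, second) -> probability
R = {("a", "b"): 0.3, ("a", "c"): 0.5, ("c", "b"): 0.4, ("c", "c"): 0.7, ("b", "b"): 0.2}

def closed_form(rel):
    """p(R(a,b)) + (1 - p(R(a,b))) * P(some R(y,b), y != a) * P(some R(a,x), x != b)."""
    pab = rel.get(("a", "b"), 0.0)
    some_yb = 1.0 - prod(1.0 - p for (y, x), p in rel.items() if x == "b" and y != "a")
    some_ax = 1.0 - prod(1.0 - p for (y, x), p in rel.items() if y == "a" and x != "b")
    return pab + (1.0 - pab) * some_yb * some_ax

def brute_force(rel):
    """Sum the weights of all worlds containing some R(a, _) and some R(_, b)."""
    items = list(rel.items())
    total = 0.0
    for bits in product([False, True], repeat=len(items)):
        weight = prod(p if bit else 1.0 - p for bit, (_, p) in zip(bits, items))
        present = [t for bit, (t, _) in zip(bits, items) if bit]
        has_ax = any(first == "a" for first, _ in present)
        has_yb = any(second == "b" for _, second in present)
        if has_ax and has_yb:
            total += weight
    return total

exact = closed_form(R)
check = brute_force(R)
```

The closed form touches each tuple once, whereas the enumeration is exponential in the size of R.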

  58. Examples

      Query                     Complexity   Why
      R(a,x), R(y,b)            PTIME        [diagram]
      R(a,x), R(x,b)            PTIME        [diagram]
      R(x,y), R(y,z)            #P           Inversion
      R(x,y), R(y,z), R(z,u)    #P           Non-hierarchical
      R(x,y), R(y,z), R(z,x)    #P           Non-hierarchical
      R(x,y), R(y,z), R(x,z)    #P           Non-hierarchical

  59. History
      • [Graedel, Gurevich, Hirsch’98]
        – L(x,y), R(x,z), S(y), S(z) is #P-hard
          This is non-hierarchical, with a self-join
      • [Dalvi&S’2004]
        – R(x), S(x,y), T(y) is #P-hard
          This is non-hierarchical, w/o self-joins
        – Without self-joins: non-hierarchical = #P-hard, and hierarchical = PTIME
      • [Dalvi&S’2007]
        – All non-hierarchical queries are #P-hard

  60. Summary on the Dichotomy
      WHY WE CARE: Safe queries = most powerful optimization we have
      What we know:
      • Three dichotomies, of increasing complexity
      • Dichotomy for aggregates in HAVING [Re&S.2007]
      What is open:
      • CQ + independent/disjoint
      • Extensions to ≤, ≥, ≠
      • Extensions to unions of conjunctive queries

  61. Outline
      Part 1:
      • Motivation
      • Data model
      • Basic query evaluation
      Part 2:
      • The dichotomy of query evaluation
      • Implementation and optimization
      • Six Challenges

  62. Implementation and Optimization
      Topics:
      • General probabilistic inference
      • Optimization 1: Safe-subplans
      • Optimization 2: Top K
      • Performance of MystiQ

  63. General Query Evaluation
      • Query q + database DB → boolean expression Φ_q^DB
      • Run any probabilistic inference algorithm on Φ_q^DB
      This approach is taken in Trio

  64. Background: Probability of Boolean Expressions
      Given: P(X1) = p1, P(X2) = p2, P(X3) = p3
      Φ = X1X2 ∨ X1X3 ∨ X2X3
      Compute P(Φ)

      Ω =
      X1   X2   X3   Φ   P
      0    0    0    0
      0    0    1    0
      0    1    0    0
      0    1    1    1   (1-p1)p2p3
      1    0    0    0
      1    0    1    1   p1(1-p2)p3
      1    1    0    1   p1p2(1-p3)
      1    1    1    1   p1p2p3

      Pr(Φ) = (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3
      #P-complete [Valiant:1979]
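The truth-table computation generalizes to any positive DNF over independent variables. A minimal exact evaluator, exponential in the number of variables, might look as follows; the representation of Φ as a list of variable tuples and the name `prob_dnf` are assumptions of this sketch.

```python
from itertools import product

def prob_dnf(clauses, probs):
    """Exact Pr(Phi) for a positive DNF; clauses are tuples of variable names."""
    names = sorted(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        val = dict(zip(names, bits))
        # Phi is true if every variable of at least one clause is true.
        if any(all(val[v] for v in clause) for clause in clauses):
            weight = 1.0
            for n in names:
                weight *= probs[n] if val[n] else 1.0 - probs[n]
            total += weight
    return total

phi = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
uniform = prob_dnf(phi, {"X1": 0.5, "X2": 0.5, "X3": 0.5})
skewed = prob_dnf(phi, {"X1": 0.2, "X2": 0.3, "X3": 0.4})
```

With all probabilities equal to 0.5, four of the eight assignments satisfy Φ, so the result is 0.5.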

  65. Query q + Database PDB → Φ
      q = R(x, y), S(x, z)

      PDB =
      R^p                           S^p
      A    B    P    variable       A    C    P    variable
      a1   b1   p1   X1             a1   c1   q1   Y1
      a2   b2   p2   X2             a1   c2   q2   Y2
                                    a2   c3   q3   Y3
                                    a2   c4   q4   Y4
                                    a2   c5   q5   Y5

      Φ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4 ∨ X2Y5

  66. Probabilistic Networks
      Nodes = random variables
      Edges = dependence
      Studied intensively in KR
      Typical networks:
      • Bayesian networks
      • Markov networks
      • Boolean expressions
      Example: R(x, y), S(x, z), with Φ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4 ∨ X2Y5
      [Figure: network for Φ with leaves X1, X2, Y1, ..., Y5 (probabilities p1, p2, q1, ..., q5), five ∧ nodes, and ∨ nodes above them]

  67. Inference Algorithms for Boolean Expressions
      • Randomized:
        – Naïve Monte Carlo
        – Luby and Karp
      • Deterministic:
        – Algorithmic guarantees: [Trevisan’04], [Luby&Velickovic’91]
        – Inference algorithms in AI: variable elimination, junction trees, …
        – Tractable cases: bounded-width trees [Zabiyaka&Darwiche’06]

  68. Naive Monte Carlo Simulation
      E = X1X2 ∨ X1X3 ∨ X2X3

      Cnt ← 0
      repeat N times
        randomly choose X1, X2, X3 ∈ {0,1}
        if E(X1, X2, X3) = 1
          then Cnt = Cnt+1
      P = Cnt/N
      return P   /* ≈ Pr(E) */

      Theorem (0-1 estimator): If N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(E) - 1| > ε ] < δ
      Note: 1/Pr(E) may be big (in theory)
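A direct transcription of the naive simulation, generalized slightly so that each variable is sampled with its own probability rather than uniformly. The function name, the seed parameter, and the clause representation are assumptions of this sketch.

```python
import random

def naive_monte_carlo(clauses, probs, n_samples, seed=0):
    """Estimate Pr(E) for a positive DNF E by sampling worlds and counting hits."""
    rng = random.Random(seed)          # fixed seed only to make the sketch reproducible
    names = list(probs)
    cnt = 0
    for _ in range(n_samples):
        # Draw one world: each variable is true independently with its own probability.
        val = {n: rng.random() < probs[n] for n in names}
        if any(all(val[v] for v in clause) for clause in clauses):
            cnt += 1
    return cnt / n_samples

E = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
estimate = naive_monte_carlo(E, {"X1": 0.5, "X2": 0.5, "X3": 0.5}, n_samples=20000)
```

The estimator is unbiased, but when Pr(E) is tiny almost no sample is a hit, which is what the 1/Pr(E) factor in the sample bound reflects.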

  69. Improved Monte Carlo Simulation [Graedel,Gurevich,Hirsch:1998] [Karp&Luby:1983]
      E = C1 ∨ C2 ∨ . . . ∨ Cm

      Cnt ← 0; S ← Pr(C1) + … + Pr(Cm)
      repeat N times
        randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci) / S
        randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
        if C1=0 and C2=0 and … and Ci-1 = 0
          then Cnt = Cnt+1
      P = Cnt/N * S / 2^n
      return P   /* ≈ Pr(E) */

      Theorem: If N ≥ (1/m) × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(E) - 1| > ε ] < δ
      Now it’s in PTIME
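A sketch of the improved estimator for a positive DNF over independent variables. It samples a clause in proportion to its probability, samples a world conditioned on that clause being true, and counts the sample only when no earlier clause is also true, so each satisfying world is counted for exactly one clause. In this version S is a sum of clause probabilities, so the hit rate is scaled by S alone; that normalization, the function name, and the seed parameter are choices of this sketch, not taken from the slides.

```python
import random
from math import prod

def karp_luby(clauses, probs, n_samples, seed=0):
    """Estimate Pr(C1 or ... or Cm) for a positive DNF over independent variables."""
    rng = random.Random(seed)          # fixed seed only to make the sketch reproducible
    names = list(probs)
    clause_probs = [prod(probs[v] for v in clause) for clause in clauses]
    s = sum(clause_probs)
    cnt = 0
    for _ in range(n_samples):
        # Pick clause i with probability Pr(Ci) / S.
        i = rng.choices(range(len(clauses)), weights=clause_probs)[0]
        # Sample a world conditioned on Ci = 1: force its variables, draw the rest.
        val = {n: (n in clauses[i]) or (rng.random() < probs[n]) for n in names}
        # Count the world only if Ci is the first clause it satisfies.
        if not any(all(val[v] for v in clauses[j]) for j in range(i)):
            cnt += 1
    return cnt / n_samples * s

E = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
estimate = karp_luby(E, {"X1": 0.5, "X2": 0.5, "X3": 0.5}, n_samples=20000)
```

The hit rate here is Pr(E)/S, which is at least 1/m, so the number of samples needed no longer depends on how small Pr(E) is.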

  70. An Example [Re,Dalvi&S’2007]
      q(x,u) :- R^p(x, y), S^p(y, z), T^p(z, u)

      R^p                  S^p                  T^p
      A    B    P          B    C    P          C    D    P
      a1   b1   p1         b1   c1   q1         c1   d1   r1
      a1   b2   p2         b2   c1   q2         c1   d2   r2
      a2   b1   p3         b2   c2   q3         c2   d1   r3
                           b2   c3   q4         c2   d2   r4
                                                c2   d3   r5

      Step 1: evaluate this query on the representation to get the data
      qTemp(x,y,p,y,z,q,z,u,r) :- R(x,y,p), S(y,z,q), T(z,u,r)

  71. R^p                  S^p                  T^p
      A    B    P          B    C    P          C    D    P
      a1   b1   p1         b1   c1   q1         c1   d1   r1
      a1   b2   p2         b2   c1   q2         c1   d2   r2
      a2   b1   p3         b2   c2   q3         c2   d1   r3
                           b2   c3   q4         c2   d2   r4
                                                c2   d3   r5

      qTemp(x,y,p,y,z,q,z,u,r) :- R(x,y,p), S(y,z,q), T(z,u,r)

      Temp →
      A    B    P     B    C    P     C    D    P
      a1   b1   p1    b1   c1   q1    c1   d1   r1
      a1   b2   p2    b2   c2   q3    c2   d1   r3
      a2   b1   . . .
