A hard query

Theorem. The query evaluation problem of the CQ query H_0 given by
  H_0 ← R(x) ∧ S(x, y) ∧ T(y)
on tuple-independent databases is hard for #P.

Proof. Given a PP2DNF formula Φ = ⋁_{(i,j) ∈ E} X_i Y_j, where E = {(X_{e1}, Y_{e1}), (X_{e2}, Y_{e2}), ...}, construct the tuple-independent DB:

  R            S                      T
  X     P      X       Y       P      Y     P
  X_1   1/2    X_{e1}  Y_{e1}  1      Y_1   1/2
  X_2   1/2    X_{e2}  Y_{e2}  1      Y_2   1/2
  ...          ...                    ...

Then #Φ = 2^n · P(H_0), where n is the total number of variables.
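The identity #Φ = 2^n · P(H_0) can be checked by brute force on a small instance. The sketch below is only an illustration: the formula (three clauses over x1, x2, y1, y2) and all names in the code are made up for this example and do not come from the slides.

```python
from fractions import Fraction
from itertools import product

# Hypothetical PP2DNF instance: Phi = x1*y1 v x1*y2 v x2*y2.
edges = [("x1", "y1"), ("x1", "y2"), ("x2", "y2")]
xs = sorted({x for x, _ in edges})
ys = sorted({y for _, y in edges})
variables = xs + ys
n = len(variables)

# Left-hand side: number of satisfying assignments of Phi.
num_models = 0
for bits in product([False, True], repeat=n):
    assign = dict(zip(variables, bits))
    if any(assign[x] and assign[y] for x, y in edges):
        num_models += 1

# Right-hand side: P(H_0) over the possible worlds of the constructed database.
# R holds the x-variables and T the y-variables (probability 1/2 each);
# S holds the edges with probability 1, so S is the same in every world.
half = Fraction(1, 2)
p_h0 = Fraction(0)
for bits in product([False, True], repeat=n):
    present = {v for v, b in zip(variables, bits) if b}
    r_world = present & set(xs)
    t_world = present & set(ys)
    if any(x in r_world and y in t_world for x, y in edges):
        p_h0 += half ** n  # every world has the same weight 2^-n

print(num_models, p_h0)
```

Exact fractions are used so that the comparison between the model count and 2^n · P(H_0) is not affected by rounding.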
More hard queries

Theorem. All of the following RC queries on tuple-independent databases are #P-hard:
  H_0 ← R(x) ∧ S(x, y) ∧ T(y)
  H_1 ← [R(x_0) ∧ S(x_0, y_0)] ∨ [S(x_1, y_1) ∧ T(y_1)]
  H_2 ← [R(x_0) ∧ S_1(x_0, y_0)] ∨ [S_1(x_1, y_1) ∧ S_2(x_1, y_1)] ∨ [S_2(x_2, y_2) ∧ T(y_2)]
  ...

Queries can be tractable even if they have intractable subqueries!
  q(x, y) ← R(x) ∧ S(x, y) ∧ T(y) is tractable
  q ← H_0 ∨ T(y) is tractable
Extensional and intensional query evaluation

We'll say more about data complexity as we go.

Extensional query evaluation
  ◮ Evaluation process guided by query expression q
  ◮ Not always possible
  ◮ When possible, data complexity is in polynomial time

Extensional plans
  ◮ Extensional query evaluation in the database
  ◮ Only minor modifications to RDBMS necessary
  ◮ Scalability, parallelizability retained

Intensional query evaluation
  ◮ Evaluation process guided by query lineage
  ◮ Reduces query evaluation to the problem of computing the probability of a propositional formula
  ◮ Works for every query
Outline

1 Primer: Relational Calculus
2 The Query Evaluation Problem
3 Extensional Query Evaluation
    Syntactic Independence
    Six Simple Rules
    Tractability and Completeness
    Extensional Plans
4 Intensional Query Evaluation
    Syntactic independence
    5 Simple Rules
    Query Compilation
    Approximation Techniques
5 Summary
Problem statement

Tuple-independent database
  ◮ Each tuple t annotated with a unique boolean variable X_t
  ◮ We write P(t) = P(X_t)

Boolean query Q
  ◮ With lineage Φ_Q
  ◮ We write P(Q) = P(Φ_Q)

Goal: compute P(Q) when Q is tractable
  ◮ Evaluation process guided by query expression q
  ◮ I.e., without first computing lineage!

Example.

  Birds
  Species        P
  Finch          0.80   X_1
  Toucan         0.71   X_2
  Nightingale    0.65   X_3
  Humming bird   0.55   X_4

P(Finch) = P(X_1) = 0.8

Is there a finch? Q ← Birds(Finch)
  ◮ Φ_Q = X_1
  ◮ P(Q) = 0.8

Is there some bird? Q ← Birds(s)
  ◮ Φ_Q = X_1 ∨ X_2 ∨ X_3 ∨ X_4
  ◮ P(Q) ≈ 99.1%
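The two probabilities in the Birds example follow directly from tuple independence. A minimal sketch, assuming the table is held in a Python dictionary (the variable names are illustrative only):

```python
from math import prod

# Birds table from the example: species -> marginal probability.
birds = {"Finch": 0.80, "Toucan": 0.71, "Nightingale": 0.65, "Humming bird": 0.55}

# "Is there a finch?" has lineage X_1, so its probability is a single lookup.
p_finch = birds["Finch"]

# "Is there some bird?" has lineage X_1 v X_2 v X_3 v X_4 over independent
# variables, so the query fails only if every tuple is absent.
p_some_bird = 1 - prod(1 - p for p in birds.values())

print(p_finch, round(p_some_bird, 6))
```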
Overview of extensional query evaluation

Break the query into "simpler" subqueries by applying one of the rules:
  1 Independent-join
  2 Independent-union
  3 Independent-project
  4 Negation
  5 Inclusion-exclusion (or Möbius inversion formula)
  6 Attribute ranking

Each rule application is polynomial in the size of the database.

Main results for UCQ queries
  ◮ Completeness: Rules succeed iff query is tractable
  ◮ Dichotomy: Query is #P-hard if rules don't succeed
Unifiable atoms

Definition. Two relational atoms L_1 and L_2 are said to be unifiable (or to unify) if they have a common image, i.e., there exist substitutions such that L_1[a_1/x_1] = L_2[a_2/x_2], where x_1 are the variables in L_1 and x_2 are the variables in L_2.

Example.

  Unifiable:
    R(a), R(a)         via [], []
    R(x), R(y)         via [a/x], [a/y]
    R(a, y), R(x, y)   via [b/y], [(a, b)/(x, y)]
    R(a, b), R(x, y)   via [], [(a, b)/(x, y)]
    R(a, y), R(x, b)   via [b/y], [a/x]

  Not unifiable:
    R(a), R(b)
    R(a, y), R(b, y)
    R(x), S(x)

Unifiable atoms must use the same relation symbol.
Syntactic independence

Definition. Two queries Q_1 and Q_2 are called syntactically independent if no two atoms from Q_1 and Q_2 unify.

Example.

  Syntactically independent:
    R(a), R(b)
    R(a, y), R(b, y)
    R(x), S(x)
    R(a, x) ∨ S(x), R(b, x) ∧ T(x)

  Not syntactically independent:
    R(a), R(x)
    R(x), R(y)
    R(x), S(x) ∧ ¬R(x)

Checking for syntactic independence can be done in polynomial time in the size of the queries.
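The polynomial-time check can be sketched as a pairwise unification test over the atoms of the two queries. The representation below (atoms as `(relation, args)` pairs, a `Var` wrapper for variables, constants as plain strings) is an assumption made for this illustration, not notation from the slides.

```python
from typing import NamedTuple

class Var(NamedTuple):
    name: str

def unifiable(atom1, atom2):
    """Atoms are (relation, args); args mix Var objects and constant strings."""
    (rel1, args1), (rel2, args2) = atom1, atom2
    if rel1 != rel2 or len(args1) != len(args2):
        return False
    binding = {}  # maps a tagged variable to the term it was bound to

    def resolve(term):
        while term in binding:
            term = binding[term]
        return term

    for s, t in zip(args1, args2):
        # Tag variables with their side: each atom gets its own substitution,
        # so x in the first atom and x in the second atom are unrelated.
        s = resolve((0, s) if isinstance(s, Var) else s)
        t = resolve((1, t) if isinstance(t, Var) else t)
        if s == t:
            continue
        if not isinstance(s, str):      # s is an unbound variable
            binding[s] = t
        elif not isinstance(t, str):    # t is an unbound variable
            binding[t] = s
        else:                           # two different constants clash
            return False
    return True

def syntactically_independent(atoms1, atoms2):
    """Queries are given as lists of their atoms; connectives do not matter."""
    return not any(unifiable(a, b) for a in atoms1 for b in atoms2)

x, y = Var("x"), Var("y")
print(unifiable(("R", ("a", y)), ("R", (x, "b"))))   # atoms from the example
print(syntactically_independent([("R", ("a", x)), ("S", (x,))],
                                [("R", ("b", x)), ("T", (x,))]))
```

The number of unification tests is the product of the atom counts of the two queries, and each test is linear in the arity, which matches the polynomial bound stated above.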
Syntactic independence and probabilistic independence

Proposition. Let Q_1, Q_2, ..., Q_k be pairwise syntactically independent. Then Q_1, ..., Q_k are independent probabilistic events.

Proof. The sets Var(Φ_{Q_1}), ..., Var(Φ_{Q_k}) are pairwise disjoint, i.e., the lineage formulas do not share any variables. Since all variables are independent (because we have a tuple-independent database), the proposition follows.

Example.

  Syntactically independent:
    R(a), R(b)
    R(a, y), R(b, y)
    R(x), S(x)
    R(a, x) ∨ S(x), R(b, x) ∧ T(x)

  Not syntactically independent:
    R(a), R(x)
    R(x), R(y)
    R(x), S(x) ∧ ¬R(x)
Probabilistic independence and syntactic independence

Proposition. Probabilistic independence does not necessarily imply syntactic independence.

Example. Consider
  Q_1 ← R(x, y) ∧ R(x, x)
  Q_2 ← R(a, b)

  ◮ If Φ_{Q_1} does not contain X_{R(a,b)}, Q_1 and Q_2 are independent.
  ◮ Otherwise, Φ_{Q_1} contains X_{R(a,b)} and therefore X_{R(a,b)} ∧ X_{R(a,a)}.
  ◮ Then, Φ_{Q_1} also contains X_{R(a,a)} ∧ X_{R(a,a)} = X_{R(a,a)}.
  ◮ Thus, by the absorption law, (X_{R(a,b)} ∧ X_{R(a,a)}) ∨ X_{R(a,a)} = X_{R(a,a)}.
  ◮ X_{R(a,b)} can be eliminated from Φ_{Q_1}, so that Q_1 and Q_2 are independent.
Base case: Atoms

Definition. If Q is an atom, i.e., of form Q = R(a), simply look up its probability in the database.

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Did Mary see a toucan?
  Q = Sightings(Mary, Toucan)
  P(Q) = 0.3
Rule 1: Independent-join

Definition. If Q_1 and Q_2 are syntactically independent, then
  P(Q_1 ∧ Q_2) = P(Q_1) · P(Q_2).   (independent-join)

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Did both Mary and Susan see a toucan?
  Q = S(Mary, Toucan) ∧ S(Susan, Toucan)
  Q_1 = S(Mary, Toucan)     P(Q_1) = 0.3
  Q_2 = S(Susan, Toucan)    P(Q_2) = 0.5
  P(Q) = P(Q_1) · P(Q_2) = 0.15
Rule 2: Independent-union

Definition. If Q_1 and Q_2 are syntactically independent, then
  P(Q_1 ∨ Q_2) = 1 − (1 − P(Q_1))(1 − P(Q_2)).   (independent-union)

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Did Mary or Susan see a toucan?
  Q = S(Mary, Toucan) ∨ S(Susan, Toucan)
  Q_1 = S(Mary, Toucan)     P(Q_1) = 0.3
  Q_2 = S(Susan, Toucan)    P(Q_2) = 0.5
  P(Q) = 1 − (1 − P(Q_1))(1 − P(Q_2)) = 0.65
Root variables and separator variables

Definition. Consider atom L and query Q. Denote by Pos(L, x) the set of positions where x occurs in L (maybe empty). If Q is of form Q = ∃x.Q′:
  ◮ Variable x is a root variable if it occurs in all atoms, i.e., Pos(L, x) ≠ ∅ for every atom L that occurs in Q′.
  ◮ A root variable x is a separator variable if for any two atoms that unify, x occurs on a common position, i.e., Pos(L_1, x) ∩ Pos(L_2, x) ≠ ∅.

Example.

  Q_1 ← ∃x.Likes(a, x) ∧ Likes(x, a)
    Pos(Likes(a, x), x) = {2}
    Pos(Likes(x, a), x) = {1}
    x is root variable
    x is no separator variable

  Q_2 ← ∃x.Likes(a, x) ∧ Likes(x, x)
    x is root variable
    x is a separator variable

  Q_3 ← ∃x.Likes(a, x) ∧ Popular(a)
    x is no root variable
    x is no separator variable
Separator variables and syntactic independence

Lemma. Let x be a separator variable in Q = ∃x.Q′. Then for any two distinct constants a, b, the queries Q′[a/x], Q′[b/x] are syntactically independent.

Proof. Consider any two atoms L_1, L_2 that unify in Q′; we show that their instances L_1[a/x] in Q′[a/x] and L_2[b/x] in Q′[b/x] do not unify. Since x is a separator variable, there is a position at which both L_1 and L_2 have x; at this position, L_1[a/x] has a and L_2[b/x] has b.

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Has anybody seen a toucan?
  Q = ∃x.Sightings(x, Toucan)
  Q′(x) = Sightings(x, Toucan)
  Q′[Mary/x] = Sightings(Mary, Toucan)
  Q′[Susan/x] = Sightings(Susan, Toucan)
Rule 3: Independent-project

Definition. If Q is of form Q = ∃x.Q′ and x is a separator variable, then
  P(Q) = 1 − ∏_{a ∈ ADom} (1 − P(Q′[a/x])),   (independent-project)
where ADom is the active domain of the database.

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Has anybody seen a toucan?
  Q = ∃x.S(x, Toucan)
  Q′ = S(x, Toucan)
  P(Q) = 1 − ∏_{x ∈ {M,S,F,...}} (1 − P(S(x, T)))
       = 1 − (1 − 0.3)(1 − 0.5) · 1 ··· 1
       = 0.65
Rule 4: Negation

Definition. If the query is ¬Q, then
  P(¬Q) = 1 − P(Q).   (negation)

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8   X_1
  Mary    Toucan        0.3   X_2
  Susan   Finch         0.2   X_3
  Susan   Toucan        0.5   X_4
  Susan   Nightingale   0.6   X_5

Did nobody see a toucan?
  Q = ¬[∃x.S(x, Toucan)]
  P(Q) = 1 − P(∃x.S(x, Toucan)) = 0.35
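The base case and rules 1 to 4 can be replayed on the Sightings table in a few lines. This is a minimal sketch: the dictionary layout and the helper `p_atom` are assumptions made for the illustration, and the syntactic-independence preconditions of the rules are taken for granted rather than checked.

```python
from math import prod

# Sightings table from the examples: (name, species) -> probability.
sightings = {
    ("Mary", "Finch"): 0.8, ("Mary", "Toucan"): 0.3,
    ("Susan", "Finch"): 0.2, ("Susan", "Toucan"): 0.5,
    ("Susan", "Nightingale"): 0.6,
}

def p_atom(name, species):
    """Base case: look up a ground atom; absent tuples have probability 0."""
    return sightings.get((name, species), 0.0)

p1 = p_atom("Mary", "Toucan")
p2 = p_atom("Susan", "Toucan")

# Rule 1 (independent-join) and rule 2 (independent-union).
p_both = p1 * p2
p_either = 1 - (1 - p1) * (1 - p2)

# Rule 3 (independent-project): x ranges over the whole active domain.
adom = {value for key in sightings for value in key}
p_anybody = 1 - prod(1 - p_atom(a, "Toucan") for a in adom)

# Rule 4 (negation).
p_nobody = 1 - p_anybody

print(round(p_both, 2), round(p_either, 2), round(p_anybody, 2), round(p_nobody, 2))
```

Constants of the active domain that are not names (such as the species) contribute a factor of 1 to the product, which is the "1 ··· 1" tail in the slide's calculation.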
Rule 5: Inclusion-exclusion

Definition. Suppose Q = Q_1 ∧ Q_2 ∧ ... ∧ Q_k. Then,
  P(Q) = − ∑_{∅ ≠ S ⊆ {1,...,k}} (−1)^{|S|} P(⋁_{i ∈ S} Q_i).   (inclusion-exclusion)

Example.
  P(Q_1 ∧ Q_2 ∧ Q_3) =
      + P(Q_1) + P(Q_2) + P(Q_3)
      − P(Q_1 ∨ Q_2) − P(Q_1 ∨ Q_3) − P(Q_2 ∨ Q_3)
      + P(Q_1 ∨ Q_2 ∨ Q_3)

[Figure: Venn diagram of Q_1, Q_2, Q_3 with regions 1, 2, 3, 12, 13, 23, 123.]

How often each region has been counted after adding each term:

  Term                    1    2    3    12   13   23   123
  + P(Q_1)                1    0    0    1    1    0    1
  + P(Q_2)                1    1    0    2    1    1    2
  + P(Q_3)                1    1    1    2    2    2    3
  − P(Q_1 ∨ Q_2)          0    0    1    1    1    1    2
  − P(Q_1 ∨ Q_3)         -1    0    0    0    0    0    1
  − P(Q_2 ∨ Q_3)         -1   -1   -1   -1   -1   -1    0
  + P(Q_1 ∨ Q_2 ∨ Q_3)    0    0    0    0    0    0    1
Inclusion-exclusion for independent-project

The goal of inclusion-exclusion is to apply the rewrite
  (∃x_1.Q_1) ∨ (∃x_2.Q_2) ≡ ∃x.(Q_1[x/x_1] ∨ Q_2[x/x_2]).

Example.

  Sightings
  Name    Species       P
  Mary    Finch         0.8
  Mary    Toucan        0.3
  Susan   Finch         0.2
  Susan   Toucan        0.5
  Susan   Nightingale   0.6

Has Mary seen some bird, and has someone seen a finch?

  P((∃x.S(M, x)) ∧ (∃y.S(y, F)))
    = P(∃x.S(M, x)) + P(∃y.S(y, F)) − P((∃x.S(M, x)) ∨ (∃y.S(y, F)))   (ie)
    = 0.86 + 0.84 − P(∃x.S(M, x) ∨ S(x, F))                             (ip / ip / rewrite)
    = 1.7 − P(∃x.S(M, x) ∨ S(x, F))

Now we are stuck → Need another rule (attribute-constant ranking)!
Rule 6: Attribute ranking

Definition.
Attribute-constant ranking. If Q is a query that contains a relation name R with attribute A, and there exist two unifiable atoms such that the first has constant a at position A and the second has a variable, substitute each occurrence of form R(...) by R_1(...) ∨ R_2(...), where
  R_1 = σ_{A=a}(R),  R_2 = σ_{A≠a}(R).

Attribute-attribute ranking. If Q is a query that contains a relation name R with attributes A and B, substitute each occurrence of form R(...) by R_1(...) ∨ R_2(...) ∨ R_3(...), where
  R_1 = σ_{A<B}(R),  R_2 = σ_{A=B}(R),  R_3 = σ_{A>B}(R).

Syntactic rewrites. For selections of form σ_{A=·}, decrease the arity of the resulting relation by 1 and add an equality predicate.
Attribute-constant ranking (continues prev. example)

Example. Has Mary seen some bird, and has someone seen a finch?

  P((∃x.S(M, x)) ∧ (∃y.S(y, F)))
    = 1.7 − P(∃x.S(M, x) ∨ S(x, F))
    = 1.7 − P(∃x.S_M(x) ∨ S_¬M(M, x) ∨ [S_M(F) ∧ x = M] ∨ S_¬M(x, F))          (rank (Name=Mary))
    = 1.7 − P(∃x.S_M(x) ∨ S_M(F) ∨ S_¬M(x, F))                                  (simplify)
    = 1.7 − P(∃x.[S_MF() ∧ x = F] ∨ S_M¬F(x) ∨ S_MF() ∨ S_¬M(x, F))             (rank (Species=Finch))
    = 1.7 − P(S_MF() ∨ ∃x.S_M¬F(x) ∨ S_¬M(x, F))                                (push ∃x)
    = 1.7 − 1 + (1 − P(S_MF()))(1 − P(∃x.S_M¬F(x) ∨ S_¬M(x, F)))                (iu)
    = 0.7 + (1 − 0.8) ∏_{x ∈ {M,S,F,T,N}} (1 − P(S_M¬F(x) ∨ S_¬M(x, F)))        (base / ip)
    = 0.7 + 0.2 ∏_{x ∈ {M,S,F,T,N}} (1 − P(S_M¬F(x)))(1 − P(S_¬M(x, F)))        (iu)
    = 0.7 + 0.2 [1·1 · 1·(1 − 0.2) · 1·1 · (1 − 0.3)·1 · 1·1]                    (product)
    = 0.812

Relations used (N = Name, S = Species):

  S                S_M         S_¬M             S_MF     S_M¬F
  N   S   P        S   P       N   S   P        P        S   P
  M   F   0.8      F   0.8     S   F   0.2      0.8      T   0.3
  M   T   0.3      T   0.3     S   T   0.5
  S   F   0.2                  S   N   0.6
  S   T   0.5
  S   N   0.6
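The result 0.812 can be cross-checked independently of the rules by summing over all 2^5 possible worlds of the Sightings table. The sketch below is such a brute-force check (exponential in the number of tuples, so only usable on toy data); the variable names are illustrative.

```python
from itertools import product

# Sightings table: (name, species) -> probability.
sightings = {
    ("Mary", "Finch"): 0.8, ("Mary", "Toucan"): 0.3,
    ("Susan", "Finch"): 0.2, ("Susan", "Toucan"): 0.5,
    ("Susan", "Nightingale"): 0.6,
}
tuples = list(sightings)

p_query = 0.0
for bits in product([False, True], repeat=len(tuples)):
    # Weight of this world: product over present and absent tuples.
    weight = 1.0
    for t, present in zip(tuples, bits):
        weight *= sightings[t] if present else 1 - sightings[t]
    world = {t for t, present in zip(tuples, bits) if present}
    mary_saw_some_bird = any(name == "Mary" for name, _ in world)
    someone_saw_finch = any(species == "Finch" for _, species in world)
    if mary_saw_some_bird and someone_saw_finch:
        p_query += weight

print(round(p_query, 3))
```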
Attribute-attribute ranking (example)

The goal of attribute ranking is to establish syntactic independence and new separators by exploiting disjointness.

  L                  L_<                L_=          L_>
  P_1  P_2  P        P_1  P_2  P        P_12  P      P_1  P_2  P
  A    B    0.8      A    B    0.8      C     0.9    B    A    0.7
  B    A    0.7                                      C    A    0.2
  C    A    0.2
  C    C    0.9

Example. Are there two people who like each other?

  P(∃x.∃y.Likes(x, y) ∧ Likes(y, x))
    = P(∃x.∃y.(Likes_<(x, y) ∨ (Likes_=(x) ∧ x = y) ∨ Likes_>(x, y))
             ∧ (Likes_<(y, x) ∨ (Likes_=(x) ∧ x = y) ∨ Likes_>(y, x)))                   (rank)
    = P(∃x.∃y.L_<(x, y) L_>(y, x) ∨ (L_=(x) ∧ x = y) ∨ L_>(x, y) L_<(y, x))              (expand, disjoint)
    = P((∃x.∃y.L_<(x, y) L_>(y, x)) ∨ (∃x.L_=(x)) ∨ (∃x.∃y.L_>(x, y) L_<(y, x)))         (push ∃)
    = P((∃x.∃y.L_<(x, y) L_>(y, x)) ∨ (∃x.L_=(x)))                                       (1st ≡ 3rd)

Now we can apply independent-union, then independent-project, then independent-join.
Inclusion-exclusion and cancellation

Consider the query
  Q ← (Q_1 ∨ Q_3) ∧ (Q_1 ∨ Q_4) ∧ (Q_2 ∨ Q_4)

Apply inclusion-exclusion to get
  P(Q) = P(Q_1 ∨ Q_3) + P(Q_1 ∨ Q_4) + P(Q_2 ∨ Q_4)
         − P(Q_1 ∨ Q_3 ∨ Q_4) − P(Q_1 ∨ Q_2 ∨ Q_3 ∨ Q_4) − P(Q_1 ∨ Q_2 ∨ Q_4)
         + P(Q_1 ∨ Q_2 ∨ Q_3 ∨ Q_4)
       = P(Q_1 ∨ Q_3) + P(Q_1 ∨ Q_4) + P(Q_2 ∨ Q_4)
         − P(Q_1 ∨ Q_3 ∨ Q_4) − P(Q_1 ∨ Q_2 ∨ Q_4)

One can construct cases in which Q_1 ∨ Q_2 ∨ Q_3 ∨ Q_4 is hard, but any subset is not (e.g., consider H_3 on slide 20).

The inclusion-exclusion formula needs to be replaced by the Möbius inversion formula.
Möbius inversion formula (example)

Given a query expression of form Q_1 ∧ ... ∧ Q_k:
  1 Put the formulas Q_S = ⋁_{i ∈ S} Q_i, ∅ ≠ S ⊆ {1, ..., k}, in a lattice (plus special element 1̂)
  2 Eliminate duplicates (equivalent formulas)
  3 Use the partial order Q_{S_1} ≥ Q_{S_2} iff Q_{S_1} ⇒ Q_{S_2}
  4 Label each node by its Möbius value
      µ(1̂) = 1
      µ(u) = − ∑_{u < w ≤ 1̂} µ(w)
  5 Use the inversion formula
      P(Q_1 ∧ ... ∧ Q_k) = − ∑_{u < 1̂ : µ(u) ≠ 0} µ(u) P(Q_u)

Example. Q ← (Q_1 ∨ Q_3) ∧ (Q_1 ∨ Q_4) ∧ (Q_2 ∨ Q_4)

  Lattice node (top to bottom)      µ
  1̂                                  1
  Q_1 ∨ Q_3                         -1
  Q_1 ∨ Q_4                         -1
  Q_2 ∨ Q_4                         -1
  Q_1 ∨ Q_3 ∨ Q_4                    1
  Q_1 ∨ Q_2 ∨ Q_4                    1
  Q_1 ∨ Q_2 ∨ Q_3 ∨ Q_4              0

  P(Q) = P(Q_1 ∨ Q_3) + P(Q_1 ∨ Q_4) + P(Q_2 ∨ Q_4) − P(Q_1 ∨ Q_3 ∨ Q_4) − P(Q_1 ∨ Q_2 ∨ Q_4)
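The Möbius values of the example lattice can be computed mechanically. The sketch below makes a simplifying assumption that is not part of the slides: Q_1, ..., Q_4 are treated as unrelated symbols, so two disjunctions are equivalent exactly when they contain the same disjuncts, and each lattice node can be represented by its set of indices (the empty set plays the role of 1̂).

```python
from itertools import combinations

# Clauses of Q = (Q1 v Q3) ^ (Q1 v Q4) ^ (Q2 v Q4), as sets of disjunct indices.
clauses = [frozenset({1, 3}), frozenset({1, 4}), frozenset({2, 4})]

# Lattice nodes: the top element plus every union of a non-empty set of clauses.
# Using a set eliminates duplicates (equivalent formulas).
top = frozenset()
nodes = {top}
for r in range(1, len(clauses) + 1):
    for subset in combinations(clauses, r):
        nodes.add(frozenset().union(*subset))

# A node w lies strictly above u iff w's disjuncts are a proper subset of u's
# (fewer disjuncts means a logically stronger formula).
mu = {}
for u in sorted(nodes, key=len):          # top-down: parents before children
    if u == top:
        mu[u] = 1
    else:
        mu[u] = -sum(mu[w] for w in nodes if w < u)

# Coefficient of P(Q_u) in the inversion formula; zero-valued nodes drop out.
coefficients = {u: -mu[u] for u in nodes if u != top and mu[u] != 0}

for u in sorted(coefficients, key=lambda s: (len(s), sorted(s))):
    print(sorted(u), coefficients[u])
```

The node for Q_1 ∨ Q_2 ∨ Q_3 ∨ Q_4 receives Möbius value 0 and therefore never has to be evaluated, which is the cancellation that plain inclusion-exclusion does not make explicit.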
A nondeterministic algorithm

Consider the algorithm:
  1 As long as possible, apply one of the rules R1–R6
  2 If all formulas are atoms, SUCCESS
  3 If there is a formula that is not an atom, FAILURE

Definition. A query is R6-safe if the above algorithm succeeds.

Order of rule application does not affect SUCCESS.

Algorithm is polynomial in size of database
  ◮ Easy to see for independent-join, independent-union, negation, Möbius inversion formula, attribute ranking → do not depend on database
  ◮ Independent-project increases number of queries by a factor of |ADom| → applied at most k times, where k is the maximum arity of a relation
How the rules fail

Example. Consider the hard query
  H_0 ← ∃x.∃y.R(x) ∧ S(x, y) ∧ T(y)

Independent-join, independent-union, independent-project, negation, and the Möbius inversion formula all do not apply.

But we could rank S:
  H_0 ← H_01 ∨ H_02 ∨ H_03
  H_01 ← ∃x.∃y.R(x) ∧ S_<(x, y) ∧ T(y)
  H_02 ← ∃x.R(x) ∧ S_=(x) ∧ T(x)
  H_03 ← ∃x.∃y.R(x) ∧ S_>(y, x) ∧ T(y)

Now we are stuck at H_01 and H_03.
Dichotomy theorem for UCQ

Safety is a syntactic property. Tractability is a semantic property. What is their relationship?

Theorem (Dalvi and Suciu, 2010). For any UCQ query Q, one of the following holds:
  ◮ Q is R6-safe, or
  ◮ the data complexity of Q is hard for #P.

  ◮ No queries of "intermediate" difficulty
  ◮ Can check for tractability in time polynomial in database size (can be done by assuming an active domain of size 1)
  ◮ Query complexity is unknown (Möbius inversion formula)
  ◮ For RC, completeness/dichotomy unknown

We can handle all safe UCQ queries!
Overview of extensional plans

Can we evaluate safe queries directly in an RDBMS?

Extensional query evaluation
  ◮ Based on the query expression
  ◮ Uses rules to break query into simpler pieces
  ◮ For UCQ, detects whether queries are tractable or intractable

Extensional operators
  ◮ Extend relational operators by probability computation
  ◮ Standard database algorithms can be used

Extensional plans
  ◮ Can be safe (correct) or unsafe (incorrect)
  ◮ For tractable UCQ queries, we can always produce a safe plan
  ◮ Plan construction based on R6 rules
  ◮ Can be written in SQL (though not "best" approach)
  ◮ Enables scalable query processing on probabilistic databases
Basic operators

Definition. Annotate each tuple by its probability. The operators
  ◮ Independent join (⋈^i)
  ◮ Independent project (π^i)
  ◮ Independent union (∪^i)
  ◮ Construction / selection / renaming
correspond to the positive K-relational algebra over ([0, 1], 0, 1, ⊕, ·), where p_1 ⊕ p_2 = 1 − (1 − p_1)(1 − p_2).

(Union needs to be replaced by outer join for non-matching schemas; see Suciu, Olteanu, Ré, Koch, 2011.)

([0, 1], 0, 1, ⊕, ·) is not a semiring → unsafe plans!
Example plans

  Incriminates                 Alibi
  Witness  Suspect  P          Suspect  Claim    P
  Mary     Paul     p_1        Paul     Cinema   q_1
  Mary     John     p_2        Paul     Friend   q_2
  Susan    John     p_3        John     Bar      q_3

Who incriminates someone who has an alibi?
  Q_1(w) ← ∃s.∃x.Incriminates(w, s) ∧ Alibi(s, x)
  Q_2(w) ← ∃s.Incriminates(w, s) ∧ ∃x.Alibi(s, x)

Plan 1: π^i_w(Incriminates(w, s) ⋈^i_s Alibi(s, x))
  After ⋈^i_s:   M P C   p_1 q_1
                 M P F   p_1 q_2
                 M J B   p_2 q_3
                 S J B   p_3 q_3
  After π^i_w:   M   1 − (1 − p_1 q_1)(1 − p_1 q_2)(1 − p_2 q_3)
                 S   p_3 q_3
  Incorrect (unsafe)

Plan 2: π^i_w(Incriminates(w, s) ⋈^i_s π^i_s(Alibi(s, x)))
  After π^i_s:   P   1 − (1 − q_1)(1 − q_2)
                 J   q_3
  After ⋈^i_s:   M P   p_1 (1 − (1 − q_1)(1 − q_2))
                 M J   p_2 q_3
                 S J   p_3 q_3
  After π^i_w:   M   1 − [1 − p_1 (1 − (1 − q_1)(1 − q_2))][1 − p_2 q_3]
                 S   p_3 q_3
  Correct (safe)

Not all plans are safe!
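To see that the two plans really disagree, one can plug in numbers and compare both formulas for Mary against a brute-force sum over possible worlds. The probabilities below are hypothetical values chosen for this sketch; the slides keep p_1, ..., q_3 symbolic.

```python
from itertools import product

# Hypothetical tuple probabilities (not from the slides).
p1, p2, p3 = 0.5, 0.4, 0.9      # Incriminates: (Mary,Paul), (Mary,John), (Susan,John)
q1, q2, q3 = 0.6, 0.7, 0.2      # Alibi: (Paul,Cinema), (Paul,Friend), (John,Bar)

# Plan 1 (join first, then project): treats p1*q1 and p1*q2 as independent.
plan1_mary = 1 - (1 - p1 * q1) * (1 - p1 * q2) * (1 - p2 * q3)

# Plan 2 (project Alibi on Suspect first, then join, then project).
plan2_mary = 1 - (1 - p1 * (1 - (1 - q1) * (1 - q2))) * (1 - p2 * q3)

# Ground truth for Mary: enumerate the worlds of the five relevant tuples.
# Lineage: X1 (Y1 v Y2) v X2 Y3.
probs = [p1, p2, q1, q2, q3]
exact_mary = 0.0
for x1, x2, y1, y2, y3 in product([False, True], repeat=5):
    weight = 1.0
    for present, p in zip((x1, x2, y1, y2, y3), probs):
        weight *= p if present else 1 - p
    if (x1 and (y1 or y2)) or (x2 and y3):
        exact_mary += weight

print(round(plan1_mary, 4), round(plan2_mary, 4), round(exact_mary, 4))
```

Plan 1 overestimates because the two joined tuples for (Mary, Paul) both depend on the same Incriminates tuple, yet the independent project multiplies them as if they were independent.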
Weighted sum

How to deal with the Möbius inversion formula?

Definition. The weighted sum of relations R_1, ..., R_k with parameters µ_1, ..., µ_k is given by:
  (∑^{µ_1,...,µ_k}_U (R_1, ..., R_k))[] = R_1 ⋈ ··· ⋈ R_k
  (∑^{µ_1,...,µ_k}_U (R_1, ..., R_k))(t) = µ_1 (R_1(t)) + ··· + µ_k (R_k(t))

Intuitively,
  ◮ Computes the natural join
  ◮ Sums up the weighted probabilities of joining tuples
Weighted sum (example)

Example. Consider relations/subqueries V_1(A, B) and V_2(A, C) and the query:
  Q(x, y, z) ← V_1(x, y) ∧ V_2(x, z)

Suppose we apply the Möbius inversion formula to get:
  Q_1(x, y) = V_1(x, y)                    with µ_1 = 1
  Q_2(x, z) = V_2(x, z)                    with µ_2 = 1
  Q_3(x, y, z) = V_1(x, y) ∨ V_2(x, z)     with µ_3 = −1

We obtain:
  (∑^{1,1,−1}_{{A,B,C}} (Q_1, Q_2, Q_3))[] = Q_1 ⋈ Q_2 ⋈ Q_3 = V_1 ⋈ V_2
  ∑^{1,1,−1}_{{A,B,C}} (Q_1, Q_2, Q_3) = {(t, p_{t_1} + p_{t_2} − p_{t_3}) : t[AB] = t_1 ∈ Q_1, t[AC] = t_2 ∈ Q_2, t[ABC] = t_3 ∈ Q_3}
Complement

How to deal with negation?

Definition. The complement of a deterministic relation R of arity k is given by
  C(R) = {(t, 1 − P(t ∈ R)) : t ∈ ADom^k}.

In practice, every complement operation can be replaced by difference (since queries are domain-independent).

Example.
  Query: Q ← R(x) ∧ ¬S(x)
  Result: R −^i S = {(t, P(t ∈ R)(1 − P(t ∈ S))) : t ∈ R}
Computation of safe plans (1)

Definition. A query plan for Q is safe if it computes the correct probabilities for all input databases.

Theorem. There is an algorithm A that takes in a query Q and outputs either FAIL or a safe plan for Q. If Q is a UCQ query, A fails only if Q is intractable.

  ◮ Key idea: Apply rules R1–R6, but produce a query plan instead of computing probabilities
  ◮ Extension to non-Boolean queries: treat head variables as "constants"
  ◮ Ranking step produces "views" that are treated as base tables
Computation of safe plans (2)

  if Q = Q_1 ∧ Q_2 and Q_1, Q_2 are syntactically independent then
      return plan(Q_1) ⋈^i plan(Q_2)
  end if
  if Q = Q_1 ∨ Q_2 and Q_1, Q_2 are syntactically independent then
      return plan(Q_1) ∪^i plan(Q_2)
  end if
  if Q(x) = ∃z.Q_1(x, z) and z is a separator variable then
      return π^i_x(plan(Q_1(x, z)))
  end if
  if Q = Q_1 ∧ ... ∧ Q_k, k ≥ 2 then
      Construct CNF lattice Q′_1, ..., Q′_m
      Compute Möbius coefficients µ_1, ..., µ_m
      return ∑^{µ_1,...,µ_m}(plan(Q′_1), ..., plan(Q′_m))
  end if
  if Q = ¬Q_1 then
      return C(plan(Q_1))
  end if
  if Q(x) = R(x) where R is a base table (possibly ranked) then
      return R(x)
  end if
  otherwise FAIL
Computation of safe plans (example)

  Q(w) ← ∃s.∃x.Incriminates(w, s) ∧ Alibi(s, x)

  1 Apply independent-project to Q on s
      ◮ Q_1(w, s) ← ∃x.Incriminates(w, s) ∧ Alibi(s, x)
  2 x is not a root variable in Q_1 → push ∃x:
      Q_2(w, s) ← Incriminates(w, s) ∧ ∃x.Alibi(s, x)
  3 Apply independent-join to Q_2
      ◮ Q_3(w, s) ← Incriminates(w, s)
      ◮ Q_4(s) ← ∃x.Alibi(s, x)
  4 Q_3 is an atom
  5 Apply independent-project to Q_4 on x
      ◮ Q_5(s, x) = Alibi(s, x)
  6 Q_5 is an atom

Resulting plan: π^i_Witness(Incriminates ⋈^i_Suspect π^i_Suspect(Alibi))
Safe plans with PostgreSQL (example)

  Q(w) ← ∃s.∃x.Incriminates(w, s) ∧ Alibi(s, x)

Plan: π^i_Witness(Incriminates ⋈^i_Suspect π^i_Suspect(Alibi))
  Q_4 ← π^i_Suspect(Alibi)
  Q_2 ← Incriminates ⋈^i_Suspect Q_4
  Q ← π^i_Witness(Q_2)

  SELECT Witness, 1-PRODUCT(1-P) AS P
  FROM (
    SELECT Witness, Incriminates.Suspect, Incriminates.P * Q4.P AS P
    FROM Incriminates, (
      SELECT Suspect, 1-PRODUCT(1-P) AS P
      FROM Alibi
      GROUP BY Suspect
    ) AS Q4
    WHERE Incriminates.Suspect = Q4.Suspect
  ) AS Q2
  GROUP BY Witness
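The same plan can be tried out without a PostgreSQL installation. The sketch below runs an equivalent query on an in-memory SQLite database through Python's `sqlite3` module; SQLite has no built-in PRODUCT aggregate, so one is registered as a user-defined aggregate. The table contents use hypothetical probabilities (p_1 = 0.5, p_2 = 0.4, p_3 = 0.9, q_1 = 0.6, q_2 = 0.7, q_3 = 0.2) that are not taken from the slides.

```python
import sqlite3

class Product:
    """User-defined aggregate: the product of all values in a group."""
    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value

con = sqlite3.connect(":memory:")
con.create_aggregate("PRODUCT", 1, Product)
con.executescript("""
    CREATE TABLE Incriminates (Witness TEXT, Suspect TEXT, P REAL);
    CREATE TABLE Alibi (Suspect TEXT, Claim TEXT, P REAL);
    INSERT INTO Incriminates VALUES
        ('Mary', 'Paul', 0.5), ('Mary', 'John', 0.4), ('Susan', 'John', 0.9);
    INSERT INTO Alibi VALUES
        ('Paul', 'Cinema', 0.6), ('Paul', 'Friend', 0.7), ('John', 'Bar', 0.2);
""")

# Inner subquery Q4: independent project of Alibi on Suspect.
# Middle subquery Q2: independent join on Suspect.
# Outer query: independent project on Witness.
rows = con.execute("""
    SELECT Witness, 1 - PRODUCT(1 - P) AS P
    FROM (SELECT I.Witness, I.Suspect, I.P * Q4.P AS P
          FROM Incriminates AS I,
               (SELECT Suspect, 1 - PRODUCT(1 - P) AS P
                FROM Alibi
                GROUP BY Suspect) AS Q4
          WHERE I.Suspect = Q4.Suspect) AS Q2
    GROUP BY Witness
""").fetchall()
result = dict(rows)
print(result)
```

With these numbers the value for Mary agrees with the closed form 1 − [1 − p_1(1 − (1 − q_1)(1 − q_2))][1 − p_2 q_3] of the safe plan.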
Deterministic tables

Often: Mix of probabilistic and deterministic tables.
Naive approach: Assign probability 1 to tuples in a deterministic table.
  → Suboptimal: Some tractable queries are missed!

Example. If T is known to be deterministic, the query
  Q ← R(x), S(x, y), T(y)
becomes tractable! Why? S ⋈ T now is a tuple-independent table! We can use the safe plan
  π^i_∅(R(x) ⋈^i_x (S(x, y) ⋈_y T(y)))

Additional information about the nature of the tables (e.g., deterministic, tuple-independent with keys, BID tables) can help extensional query processing.
Overview

Given a query Q(x), a TI database D; for each output tuple t:

  1 Compute the lineage Φ = Φ^D_Q(t)
      ◮ |Φ| = O(|ADom|^m), where m is the number of variables in Q
      ◮ Data complexity is polynomial time
      ◮ Difference to extensional query evaluation: |Φ| depends on the input → rules that are exponential in |Φ| are also exponential in the size of the input!
  2 Compute the probability P(Φ)
      ◮ Intensional query evaluation ≈ probability computation on propositional formulas
      ◮ Studied in verification and AI communities
      ◮ Different approaches: rule-based evaluation, formula compilation, approximation

Can deal with hard queries.
Example (tractable query)

Example.
  q(h) ← ∃n.∃c.Hotel(h, n, c) ∧ ∃r.∃t.∃p.Room(r, h, t, p) ∧ (p > 500 ∨ t = 'suite')

  Room (R)                                    Hotel (H)
  RoomNo  Type    HotelNo  Price              HotelNo  Name    City
  R1      Suite   H1       $50     X_1        H1       Hilton  SB     X_4
  R2      Single  H1       $600    X_2
  R3      Double  H1       $80     X_3

  ExpensiveHotels (with lineage)
  HotelNo
  H1        X_4 ∧ (X_1 ∨ X_2)

  Φ = X_4 ∧ (X_1 ∨ X_2)
  P(Φ) = P(X_4) [1 − (1 − P(X_1))(1 − P(X_2))]
  E.g., P(X_i) = 1/2 for all i → P(Φ) = 0.375

  ExpensiveHotels
  HotelNo   P
  H1        0.375
Example (intractable query)

Example.

  R              S                  T
  X_1   0.5      X_2   Y_1   1      Y_1   0.5
  X_2   0.5      X_3   Y_2   1      Y_2   0.5
  X_3   0.5
  X_4   0.5

[Figure: bipartite graph on X_1, ..., X_4 and Y_1, Y_2 with edges X_2–Y_1 and X_3–Y_2.]

  H_0 ← ∃x.∃y.R(x), S(x, y), T(y)
  Φ = X_2 Y_1 ∨ X_3 Y_2
  P(Φ) = 1 − (1 − P(X_2) P(Y_1))(1 − P(X_3) P(Y_2)) = 0.4375

  Model counting: #Φ = 2^6 · P(Φ) = 28
  Bipartite vertex cover: #Ψ = 2^6 − #Φ = 36 = 2 · 3 · 3 · 2
Overview of rule-based intensional query evaluation

Break the lineage formula into "simpler" formulas by applying one of the rules:
  1 Independent-and
  2 Independent-or
  3 Disjoint-or
  4 Negation
  5 Shannon expansion

  ◮ Rules work on lineage, not on query → data dependent
  ◮ Rules always succeed
  ◮ Rule 5 may lead to exponential blowup

Can be used on any query, but data complexity can be exponential. However, depending on the database, even a hard query might be "easy" to evaluate.
Support

Definition. For a propositional formula Φ, denote by V(Φ) the set of variables that occur in Φ. Denote by Var(Φ) the set of variables on which Φ depends; Var(Φ) is called the support of Φ.

X ∈ Var(Φ) iff there exists an assignment θ to all variables but X and constants a ≠ b such that Φ[θ ∪ {X ↦ a}] ≠ Φ[θ ∪ {X ↦ b}].

Example.
  Φ = X ∨ (Y ∧ Z)            Φ = Y ∨ (X ∧ Y) ≡ Y
  V(Φ) = {X, Y, Z}           V(Φ) = {X, Y}
  Var(Φ) = {X, Y, Z}         Var(Φ) = {Y}
Syntactic independence

Definition. Φ_1 and Φ_2 are syntactically independent if they have disjoint support, i.e., Var(Φ_1) ∩ Var(Φ_2) = ∅.

Example.
  Φ_1 = X    Φ_2 = Y    Φ_3 = ¬X¬Y ∨ XY
  ◮ Φ_1 and Φ_2 are syntactically independent
  ◮ All other combinations are not

Checking for syntactic independence is co-NP-complete in general. Practical approach:

Proposition. A sufficient condition for syntactic independence is V(Φ_1) ∩ V(Φ_2) = ∅.
Probabilistic independence

Proposition. If Φ_1, Φ_2, ..., Φ_k are pairwise syntactically independent, then the probabilistic events Φ_1, Φ_2, ..., Φ_k are independent.

Note that pairwise probabilistic independence does not imply probabilistic independence!

Example.
  Φ_1 = X    Φ_2 = Y    Φ_3 = ¬X¬Y ∨ XY
  ◮ Φ_1 and Φ_2 are probabilistically independent
  ◮ Φ_1, Φ_2, Φ_3 are not pairwise syntactically independent
  ◮ Assume P(X) = P(Y) = 1/2
  ◮ Φ_1, Φ_2, Φ_3 are pairwise independent
  ◮ Φ_1, Φ_2, Φ_3 are not independent!
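The claim in the example can be verified by enumerating the four assignments of X and Y. A minimal sketch using exact fractions (the helper `prob` and the event names are illustrative):

```python
from fractions import Fraction
from itertools import combinations, product

# The three events from the example, as functions of (X, Y).
events = {
    "phi1": lambda x, y: x,
    "phi2": lambda x, y: y,
    "phi3": lambda x, y: (not x and not y) or (x and y),
}

def prob(*names):
    """Probability that all named events hold, with P(X) = P(Y) = 1/2."""
    worlds = list(product([False, True], repeat=2))
    hits = sum(all(events[n](x, y) for n in names) for x, y in worlds)
    return Fraction(hits, len(worlds))

# Pairwise independence: P(A and B) = P(A) P(B) for every pair.
pairwise = all(prob(a, b) == prob(a) * prob(b)
               for a, b in combinations(events, 2))

# Mutual independence would also need the triple to factorize.
p_all = prob("phi1", "phi2", "phi3")
p_factorized = prob("phi1") * prob("phi2") * prob("phi3")

print(pairwise, p_all, p_factorized)
```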
Rules 1 and 2: independent-and, independent-or

Definition. Let Φ_1 and Φ_2 be two syntactically independent propositional formulas:
  P(Φ_1 ∧ Φ_2) = P(Φ_1) · P(Φ_2)                    (independent-and)
  P(Φ_1 ∨ Φ_2) = 1 − (1 − P(Φ_1))(1 − P(Φ_2))       (independent-or)
Independent-and, independent-or (example)

  Incriminates                      Alibi
  Witness  Suspect                  Suspect  Claim
  Mary     Paul     X_1 (p_1)       Paul     Cinema   Y_1 (q_1)
  Mary     John     X_2 (p_2)       Paul     Friend   Y_2 (q_2)
  Susan    John     X_3 (p_3)       John     Bar      Y_3 (q_3)

  Q(w) ← ∃s.∃x.Incriminates(w, s) ∧ Alibi(s, x)

Plan with lineage: π_Witness(Incriminates ⋈_Suspect π_Suspect(Alibi))
  After π_Suspect:    P   Y_1 ∨ Y_2
                      J   Y_3
  After ⋈_Suspect:    M P   X_1 (Y_1 ∨ Y_2)
                      M J   X_2 Y_3
                      S J   X_3 Y_3
  After π_Witness:    M   X_1 (Y_1 ∨ Y_2) ∨ X_2 Y_3
                      S   X_3 Y_3

Φ_S = X_3 Y_3
  1 Independent-and: P(Φ_S) = p_3 q_3

Φ_M = X_1 (Y_1 ∨ Y_2) ∨ X_2 Y_3
  1 Independent-or: P(Φ_M) = 1 − (1 − P(X_1 (Y_1 ∨ Y_2)))(1 − P(X_2 Y_3))
  2 Independent-and: P(X_2 Y_3) = p_2 q_3
  3 Independent-and: P(X_1 (Y_1 ∨ Y_2)) = p_1 P(Y_1 ∨ Y_2)
  4 Independent-or: P(Y_1 ∨ Y_2) = 1 − (1 − q_1)(1 − q_2)
  5 P(Φ_M) = 1 − [1 − p_1 (1 − (1 − q_1)(1 − q_2))](1 − p_2 q_3)
Rule 3: Disjoint-or

Definition. Two propositional formulas Φ_1 and Φ_2 are disjoint if Φ_1 ∧ Φ_2 is not satisfiable.

Definition. If Φ_1 and Φ_2 are disjoint:
  P(Φ_1 ∨ Φ_2) = P(Φ_1) + P(Φ_2)   (disjoint-or)

Example.
  P(X) = 0.2; P(Y) = 0.7
  Φ_1 = XY; P(XY) = P(X) P(Y) = 0.14
  Φ_2 = ¬X; P(¬X) = 0.8
  P(Φ_1 ∨ Φ_2) = P(Φ_1) + P(Φ_2) = 0.94

Checking for disjointness is NP-complete in general. But disjoint-or will play a major role for Shannon expansion.
Rule 4: Negation

Definition.
  P(¬Φ) = 1 − P(Φ)   (negation)

Example.
  P(X) = 0.2; P(Y) = 0.7
  P(XY) = P(X) P(Y) = 0.14
  P(¬(XY)) = 1 − 0.14 = 0.86
Shannon expansion

Definition. The Shannon expansion of a propositional formula Φ w.r.t. a variable X with domain {a_1, ..., a_m} is given by:
  Φ ≡ (Φ[X ↦ a_1] ∧ (X = a_1)) ∨ ... ∨ (Φ[X ↦ a_m] ∧ (X = a_m))

Example.
  Φ = XY ∨ XZ ∨ YZ
  Φ ≡ (Φ[X ↦ TRUE] ∧ X) ∨ (Φ[X ↦ FALSE] ∧ ¬X)
    = (Y ∨ Z) X ∨ YZ ¬X

In the Shannon expansion rule, every ∧ is an independent-and; every ∨ is a disjoint-or.
Rule 5: Shannon expansion

Definition. Let Φ be a propositional formula and X be a variable:
  P(Φ) = ∑_{a ∈ dom(X)} P(Φ[X ↦ a]) P(X = a)   (Shannon expansion)

Example.
  Φ = XY ∨ XZ ∨ YZ
  P(Φ) = P(Y ∨ Z) P(X) + P(YZ) P(¬X)

  ◮ Can always be applied
  ◮ Effectively eliminates X from the formula
  ◮ But may lead to exponential blowup!
Shannon expansion (example)

  Incriminates                      Alibi
  Witness  Suspect                  Suspect  Claim
  Mary     Paul     X_1 (p_1)       Paul     Cinema   Y_1 (q_1)
  Mary     John     X_2 (p_2)       Paul     Friend   Y_2 (q_2)
  Susan    John     X_3 (p_3)       John     Bar      Y_3 (q_3)

  Q(w) ← ∃s.∃x.Incriminates(w, s) ∧ Alibi(s, x)

Plan with lineage: π_Witness(Incriminates ⋈_Suspect Alibi)
  After ⋈_Suspect:    M P C   X_1 Y_1
                      M P F   X_1 Y_2
                      M J B   X_2 Y_3
                      S J B   X_3 Y_3
  After π_Witness:    M   X_1 Y_1 ∨ X_1 Y_2 ∨ X_2 Y_3
                      S   X_3 Y_3

Φ_M = X_1 Y_1 ∨ X_1 Y_2 ∨ X_2 Y_3
  1 Independent-or: P(Φ_M) = 1 − (1 − P(X_1 Y_1 ∨ X_1 Y_2))(1 − P(X_2 Y_3))
  2 Independent-and: P(X_2 Y_3) = p_2 q_3
  3 Shannon expansion: P(X_1 Y_1 ∨ X_1 Y_2) = P(Y_1 ∨ Y_2) P(X_1) + P(FALSE) P(¬X_1)
  4 Independent-or: P(Y_1 ∨ Y_2) = 1 − (1 − q_1)(1 − q_2)
  5 P(Φ_M) = 1 − [1 − p_1 (1 − (1 − q_1)(1 − q_2))](1 − p_2 q_3)

The intensional rules work on all plans!
A non-deterministic algorithm

  if Φ = Φ_1 ∧ Φ_2 and Φ_1, Φ_2 are syntactically independent then
      return P(Φ_1) · P(Φ_2)
  end if
  if Φ = Φ_1 ∨ Φ_2 and Φ_1, Φ_2 are syntactically independent then
      return 1 − (1 − P(Φ_1))(1 − P(Φ_2))
  end if
  if Φ = Φ_1 ∨ Φ_2 and Φ_1, Φ_2 are disjoint then
      return P(Φ_1) + P(Φ_2)
  end if
  if Φ = ¬Φ_1 then
      return 1 − P(Φ_1)
  end if
  Choose X ∈ Var(Φ)
  return ∑_{a ∈ dom(X)} P(Φ[X ↦ a]) P(X = a)

Should be implemented with dynamic programming to avoid evaluating the same subformula multiple times.
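The fallback step of the algorithm (Shannon expansion plus memoization) is enough on its own to evaluate small lineage formulas. The following is a deliberately reduced sketch, not the full algorithm: it only handles monotone DNF lineage with Boolean variables, skips the independence and disjointness rules, and always expands on the lexicographically smallest variable. The function name `probability` and the input format are assumptions of this illustration.

```python
from functools import lru_cache

def probability(dnf, probs):
    """P(Phi) for a monotone DNF given as an iterable of clauses, each a
    collection of variable names; probs maps variable name -> P(X = TRUE)."""

    @lru_cache(maxsize=None)          # memoize subformulas (dynamic programming)
    def p(formula):
        if not formula:               # empty disjunction is FALSE
            return 0.0
        if frozenset() in formula:    # an empty clause makes the formula TRUE
            return 1.0
        x = min(min(clause) for clause in formula)   # variable to expand on
        # Phi[x -> TRUE]: drop x from every clause that mentions it.
        pos = frozenset(clause - {x} for clause in formula)
        # Phi[x -> FALSE]: drop every clause that mentions x.
        neg = frozenset(clause for clause in formula if x not in clause)
        return probs[x] * p(pos) + (1 - probs[x]) * p(neg)

    return p(frozenset(frozenset(clause) for clause in dnf))

# Phi = XY v XZ v YZ with all probabilities 1/2.
p_majority = probability([{"X", "Y"}, {"X", "Z"}, {"Y", "Z"}],
                         {"X": 0.5, "Y": 0.5, "Z": 0.5})

# Phi_M = X1 Y1 v X1 Y2 v X2 Y3 with hypothetical probabilities.
p_mary = probability(
    [{"X1", "Y1"}, {"X1", "Y2"}, {"X2", "Y3"}],
    {"X1": 0.5, "X2": 0.4, "Y1": 0.6, "Y2": 0.7, "Y3": 0.2},
)
print(p_majority, round(p_mary, 4))
```

Without the independence rules this can visit exponentially many subformulas in the worst case; the cache only helps when different expansion paths reach the same residual formula.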
Materialized views in TID databases (1)
TID databases complete only with views
How to deal with views in a PDBMS?
1. Store just the view definition
2. Store the view result and probabilities
3. Store the view result and lineage
4. Store the view results and “compiled lineage”
Trade-off between precomputation and query cost (just as in DBMS)
Example (ExpensiveHotel view)
q(h) ← ∃n.∃c. Hotel(h, n, c) ∧ ∃r.∃t.∃p. Room(r, h, t, p) ∧ (p > 500 ∨ t = ’suite’)
Room (R)
RoomNo  Type    HotelNo  Price
R1      Suite   H1       $50    X1
R2      Single  H1       $600   X2
R3      Double  H1       $80    X3
Hotel (H)
HotelNo  Name    City
H1       Hilton  SB     X4
ExpensiveHotels, option (2): H1  0.375
ExpensiveHotels, option (3): H1  X4 ∧ (X1 ∨ X2)
ExpensiveHotels, option (4): H1  X4 ∧i (X1 ∨i X2)
80 / 119
Materialized views in TID databases (2)
Example (Continued)
Hotel (H)
HotelNo  Name    City
H1       Hilton  SB     X4
Consider the query q(h) ← ∃c. ExpensiveHotel(h), Hotel(h, ’Hilton’, c), which asks for expensive Hilton hotels using a view. Can we answer this query when ExpensiveHotel is a precomputed materialized view?
◮ View stored with probabilities (H1: 0.375): No, dependency between ExpensiveHotels and Hotels lost
◮ View stored with lineage (H1: X4 ∧ (X1 ∨ X2)): Yes, combine lineages. ExpensiveHiltons: H1  [X4 ∧ (X1 ∨ X2)] ∧ X4
◮ View stored with compiled lineage (H1: X4 ∧i (X1 ∨i X2)): Yes, combine “compiled lineages”. ExpensiveHiltons: H1  X4 ∧i (X1 ∨ X2)
→ Need to be able to combine compiled lineages efficiently!
81 / 119
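The lost dependency can be illustrated with a small calculation. This is a sketch under an assumption: the slides do not state the tuple probabilities, so all four marginals are set to 0.5 here, which is one choice that reproduces the stored value 0.375. The helper `probability` enumerates all possible worlds and is purely illustrative.

```python
from itertools import product

P = {"X1": 0.5, "X2": 0.5, "X3": 0.5, "X4": 0.5}  # assumed marginals

def probability(formula, p=P):
    """Exact P(formula) by enumerating all worlds over the variables in p."""
    names = sorted(p)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for n in names:
                weight *= p[n] if world[n] else 1 - p[n]
            total += weight
    return total

def expensive_hotel(w):       # lineage of the view tuple H1
    return w["X4"] and (w["X1"] or w["X2"])

def hilton(w):                # lineage of the Hotel tuple H1
    return w["X4"]

def expensive_hilton(w):      # combined lineage: [X4 ∧ (X1 ∨ X2)] ∧ X4
    return expensive_hotel(w) and hilton(w)

p_view = probability(expensive_hotel)       # what option (2) stores
p_correct = probability(expensive_hilton)   # via combined lineage
p_naive = p_view * probability(hilton)      # wrongly assumes independence
```

Under these assumed marginals the combined lineage simplifies back to X4 ∧ (X1 ∨ X2), so `p_correct` equals `p_view`, while `p_naive` is half that value because X4 is counted twice.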
Query compilation
“Compile” Φ into a Boolean circuit with certain desirable properties
◮ P(Φ) can be computed in linear time in the size of the circuit
◮ Many other tasks can be solved in polynomial time
◮ E.g., combining formulas Φ1 ∧ Φ2 (even when not independent!)
Key application in PDBMS: Compile materialized views
Tractable compilation = circuit of size polynomial in database
→ Implies tractable computation of P(Φ) (converse may not be true)
Compilation targets
1. RO (read-once formula)
2. OBDD (ordered binary decision diagram)
3. FBDD (free binary decision diagram)
4. d-DNF (deterministic-decomposable normal form)
Goals: (1) Reusability. (2) Understand complexity of intensional QE.
82 / 119
Restricted Boolean circuit (RBC)
Rooted, labeled DAG
All variables are Boolean
Each node (called gate) represents a propositional formula Ψ
Let Ψ be represented by a gate with children representing Ψ1, ..., Ψn; we consider the following gates & restrictions:
◮ Independent-and (∧i): Ψ1, ..., Ψn are syntactically independent
◮ Independent-or (∨i): Ψ1, ..., Ψn are syntactically independent
◮ Disjoint-or (∨d): Ψ1, ..., Ψn are disjoint
◮ Not (¬): single child, represents ¬Ψ
◮ Conditional gate (X): two children representing X ∧ Ψ1 and ¬X ∧ Ψ2, where X ∉ Var(Ψ1) and X ∉ Var(Ψ2)
◮ Leaf node (0, 1, X): represents FALSE, TRUE, X
The different compilation targets restrict which gates may be used and where.
83 / 119
Restricted Boolean circuit (example)
Example: Who incriminates someone who has an alibi?
Lineage of unsafe plan: Φ_M = X1Y1 ∨ X1Y2 ∨ X2Y3
[Figure: RBC for Φ_M. The root is an ∨i gate with two children: a conditional gate on X1 (0-branch: leaf 0; 1-branch: ∨i over Y1 and Y2) and an ∧i gate over X2 and Y3.]
“Documents” the non-deterministic algorithm for intensional query evaluation.
84 / 119
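A circuit of this kind can be evaluated bottom-up, one arithmetic step per gate. The sketch below uses a hypothetical nested-tuple encoding of gates (the tags `"and_i"`, `"or_i"`, `"or_d"`, `"cond"` and the function `evaluate` are illustrative, not from the slides), and the marginals are assumed values.

```python
def evaluate(gate, p):
    """Return P(gate) for a restricted Boolean circuit given as nested tuples."""
    kind = gate[0]
    if kind == "const":                 # ("const", 0) or ("const", 1)
        return float(gate[1])
    if kind == "var":                   # ("var", "X1")
        return p[gate[1]]
    if kind == "not":                   # ("not", child)
        return 1.0 - evaluate(gate[1], p)
    if kind == "and_i":                 # independent-and over children
        r = 1.0
        for c in gate[1:]:
            r *= evaluate(c, p)
        return r
    if kind == "or_i":                  # independent-or over children
        r = 1.0
        for c in gate[1:]:
            r *= 1.0 - evaluate(c, p)
        return 1.0 - r
    if kind == "or_d":                  # disjoint-or over children
        return sum(evaluate(c, p) for c in gate[1:])
    if kind == "cond":                  # ("cond", X, child_if_true, child_if_false)
        _, x, hi, lo = gate
        return p[x] * evaluate(hi, p) + (1.0 - p[x]) * evaluate(lo, p)
    raise ValueError(f"unknown gate: {kind}")

# Circuit for Φ_M = X1Y1 ∨ X1Y2 ∨ X2Y3, mirroring the rule sequence of the worked example.
circuit = ("or_i",
           ("cond", "X1",
            ("or_i", ("var", "Y1"), ("var", "Y2")),
            ("const", 0)),
           ("and_i", ("var", "X2"), ("var", "Y3")))

# Assumed marginals for illustration.
p = {"X1": 0.9, "X2": 0.8, "Y1": 0.85, "Y2": 0.75, "Y3": 0.65}
p_phi_m = evaluate(circuit, p)
```

This recursive version visits a shared sub-circuit once per parent; for a true DAG with sharing, caching results per gate would be needed to keep the evaluation linear in the circuit size.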
Deterministic-decomposable normal form (d-DNF)
Restricted to gates: ∧i, ∨d, ¬
◮ ∧i-gates are called decomposable (D)
◮ ∨d-gates are called deterministic (d)
Example: Φ = XYU ∨ XYZ¬U
[Figure: d-DNF circuit for Φ with a ∨d root over ∧i gates, a ¬ gate, and leaves X, Y, Z, U.]
85 / 119
RBC and d-DNF
Theorem: Every RBC with n gates can be transformed into an equivalent d-DNF with at most 5n gates, a polynomial increase in size.
Proof. We are not allowed to use ∨i and conditional nodes. Apply the transformations:
◮ ∨i(Ψ1, Ψ2) → ¬ ∧i(¬Ψ1, ¬Ψ2)
◮ Conditional gate on X with children Ψ1 (branch X = 1) and Ψ2 (branch X = 0) → ∨d(∧i(X, Ψ1), ∧i(¬X, Ψ2))
A ∨i-node is replaced by 4 new nodes. A conditional node is replaced by (at most) 5 new nodes.
86 / 119
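The two rewrites in the proof rest on simple Boolean facts that can be checked by truth table. The following sketch (function names are illustrative) verifies that the ∨i rewrite is De Morgan's law and that the two branches produced for a conditional gate can never be true together, which is what permits a disjoint-or.

```python
from itertools import product

def or_i_rewrite_holds():
    """Ψ1 ∨ Ψ2 ≡ ¬(¬Ψ1 ∧ ¬Ψ2) on every truth assignment (De Morgan)."""
    return all((a or b) == (not ((not a) and (not b)))
               for a, b in product([False, True], repeat=2))

def conditional_rewrite_holds():
    """The branches X ∧ Ψ1 and ¬X ∧ Ψ2 are mutually exclusive,
    so their disjunction may be written with a disjoint-or gate."""
    return all(not ((x and a) and ((not x) and b))
               for x, a, b in product([False, True], repeat=3))
```

Negation does not change which variables a subformula mentions, so ¬Ψ1 and ¬Ψ2 stay syntactically independent whenever Ψ1 and Ψ2 are; likewise X is independent of Ψ1 and Ψ2 because the conditional gate requires X to occur in neither.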
Application: knowledge compilation
Tries to deal with intractability of propositional reasoning
Key idea
1. Slow offline phase: Compilation into a target language
2. Fast online phase: Answers in polynomial time
→ Offline cost amortizes over many online queries
Key aspects
◮ Succinctness of target language (d-DNF, FBDD, OBDD, ...)
◮ Class of queries that can be answered efficiently once compiled (consistency, validity, entailment, implicants, equivalence, model counting, probability computation, ...)
◮ Class of transformations that can be performed efficiently once compiled (∧, ∨, ¬, conditioning, forgetting, ...)
How to pick a target language?
1. Identify which queries/transformations are needed
2. Pick the most succinct language
Which queries admit polynomial representation in which target language?
87 / 119 Darwiche and Marquis, 2002
Free binary decision diagram (FBDD)
Restricted to conditional gates
Binary decision diagram: Each node decides on the value of a variable
Free: Each variable occurs at most once on every root-leaf path
Example: Who incriminates someone who has an alibi?
Lineage of safe plan: Φ_M = X1(Y1 ∨ Y2) ∨ X2Y3
[Figure: FBDD for Φ_M with decision nodes X1, Y1, Y2, X2, Y3, outgoing edges labeled 0 and 1, and leaves 0 and 1.]
88 / 119
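Since the diagram itself is not reproduced here, the sketch below uses one possible FBDD for Φ_M; the node layout is an assumption, chosen to test the variables in the order X1, Y1, Y2, X2, Y3. The code follows a root-leaf path for a given world, checks the "free" property, and can be compared against Φ_M directly.

```python
from itertools import product

# node name -> (variable, child if 0, child if 1); leaves are the ints 0 and 1.
# Assumed layout, not taken from the slide's figure.
FBDD = {
    "n1": ("X1", "n4", "n2"),
    "n2": ("Y1", "n3", 1),
    "n3": ("Y2", "n4", 1),
    "n4": ("X2", 0, "n5"),
    "n5": ("Y3", 0, 1),
}

def decide(node, world):
    """Follow one root-leaf path for the given truth assignment."""
    while node not in (0, 1):
        var, lo, hi = FBDD[node]
        node = hi if world[var] else lo
    return node

def is_free(node, seen=frozenset()):
    """True if no variable is tested twice on any path starting at node."""
    if node in (0, 1):
        return True
    var, lo, hi = FBDD[node]
    if var in seen:
        return False
    return is_free(lo, seen | {var}) and is_free(hi, seen | {var})

def phi_m(w):
    # Φ_M = X1(Y1 ∨ Y2) ∨ X2Y3
    return (w["X1"] and (w["Y1"] or w["Y2"])) or (w["X2"] and w["Y3"])

NAMES = ["X1", "X2", "Y1", "Y2", "Y3"]
agrees = all(
    decide("n1", dict(zip(NAMES, vals))) == int(phi_m(dict(zip(NAMES, vals))))
    for vals in product([False, True], repeat=len(NAMES))
)
```

Note that node `n4` is shared by two parents, which is why the structure is a DAG rather than a tree.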
Ordered binary decision diagram (OBDD)
An ordered FBDD, i.e.,
◮ Same ordering of variables on each root-leaf path
◮ Omissions are allowed
Example: The FBDD on slide 88 is an OBDD with ordering X1, Y1, Y2, X2, Y3.
Theorem: Given two OBDDs Ψ1 and Ψ2 with a common variable order, we can compute an OBDD for Ψ1 ∧ Ψ2, Ψ1 ∨ Ψ2, or ¬Ψ1 in polynomial time.
Note that Ψ1 and Ψ2 do not need to be independent or disjoint.
(Many other results of this kind exist. Many BDD software packages exist, e.g., BuDDy, JDD, CUDD, CAL.)
89 / 119
Read-once formulas (RO) Definition A propositional formula Φ is read-once (or repetition-free ) if there exists a formula Φ ′ such that Φ ≡ Φ ′ and every variable occurs at most once in Φ ′ . Example Φ = X 1 ∨ X 2 ∨ X 3 → read-once Φ = X 1 Y 1 ∨ X 1 Y 2 ∨ X 2 Y 3 ∨ X 2 Y 4 ∨ X 2 Y 5 ◮ Φ ′ = X 1 ( Y 1 ∨ Y 2 ) ∨ X 2 ( Y 3 ∨ Y 4 ∨ Y 5 ) → read-once Φ = XY ∨ XU ∨ YU → not read-once Theorem If Φ is given as a read-once formula, we can compute P ( Φ ) in linear time. Proof. All ∧ ’s and ∨ ’s are independent, and negation is easily handled. 90 / 119
When is a formula read-once? (1)
Definition: Let Φ be given in DNF such that no conjunct is a strict subset of some other conjunct.
◮ Φ is unate if every propositional variable X occurs either only positively or only negatively.
◮ The primal graph of Φ is the graph G(V, E) where V is the set of propositional variables in Φ and there is an edge (X, Y) ∈ E if X and Y occur together in some conjunct.
Example
Unate: XY ∨ ¬ZX
Not unate: XY ∨ Z¬X
Primal graphs:
◮ XU ∨ XV ∨ YU ∨ YV: edges X-U, X-V, Y-U, Y-V
◮ XY ∨ YU ∨ UV: edges X-Y, Y-U, U-V
◮ XY ∨ XU ∨ YU: edges X-Y, X-U, Y-U
91 / 119
When is a formula read-once? (2)
Definition: A primal graph G for Φ is P4-free if no induced subgraph is isomorphic to P4 (the path on four vertices). G is normal if for every clique in G, there is a conjunct in Φ that contains all of the clique’s variables.
Example
Formula               P4-free?      Normal?     Read-once?
XU ∨ XV ∨ YU ∨ YV     P4-free       Normal      Read-once
XY ∨ YU ∨ UV          Not P4-free   Normal      Not read-once
XY ∨ XU ∨ YU          P4-free       Not normal  Not read-once
Theorem: A unate formula is read-once iff it is P4-free and normal.
92 / 119
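The characterization in the theorem can be tested directly on small formulas. The sketch below assumes a unate DNF given as a list of conjuncts, each a set of variable names with polarity already fixed; the function names are illustrative. It enumerates vertex subsets by brute force, so it is meant for slide-sized examples only.

```python
from itertools import combinations

def primal_graph(dnf):
    """Return (vertices, edges) of the primal graph of a DNF."""
    edges = set()
    for conj in dnf:
        edges |= {frozenset(e) for e in combinations(conj, 2)}
    vertices = set().union(*dnf)
    return vertices, edges

def is_p4_free(vertices, edges):
    def adj(a, b):
        return frozenset((a, b)) in edges
    for quad in combinations(sorted(vertices), 4):
        induced = [e for e in combinations(quad, 2) if adj(*e)]
        if len(induced) != 3:
            continue
        degree = {v: sum(v in e for e in induced) for v in quad}
        # Among 3-edge graphs on 4 vertices, only the path has degrees 1, 1, 2, 2.
        if sorted(degree.values()) == [1, 1, 2, 2]:
            return False
    return True

def is_normal(vertices, edges, dnf):
    def adj(a, b):
        return frozenset((a, b)) in edges
    for k in range(2, len(vertices) + 1):
        for cand in combinations(sorted(vertices), k):
            is_clique = all(adj(a, b) for a, b in combinations(cand, 2))
            if is_clique and not any(set(cand) <= conj for conj in dnf):
                return False
    return True

def is_read_once(dnf):
    v, e = primal_graph(dnf)
    return is_p4_free(v, e) and is_normal(v, e, dnf)
```

Applied to the three example formulas, the functions should reproduce the table: the 4-cycle passes both tests, the path fails P4-freeness, and the triangle fails normality because no conjunct contains X, Y, and U together.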
Query compilation hierarchy Denote by L ( T ) the class of queries from L that can be compiled efficiently to target T . The following relationships hold for UCQ -queries: 93 / 119
Why approximation? Exact inference may require exponential time → expensive Often absolute probability values of little interest; ranking desired → Good approximations of P ( Φ ) suffice Desiderata ◮ (Provably) low approximation error ◮ Efficient ◮ Polynomial in database size ◮ Anytime algorithm (gradual improvement) Approaches ◮ Probability intervals ◮ Monte-Carlo approximation We will show: Approximation is tractable for all RA -queries w.r.t. absolute error and for all UCQ -queries w.r.t. relative error! 95 / 119
Probability bounds
Theorem: Let Φ1 and Φ2 be propositional formulas. Then,
max(P(Φ1), P(Φ2)) ≤ P(Φ1 ∨ Φ2) ≤ min(P(Φ1) + P(Φ2), 1)
max(0, P(Φ1) + P(Φ2) − 1) ≤ P(Φ1 ∧ Φ2) ≤ min(P(Φ1), P(Φ2)).
The upper bound on P(Φ1 ∨ Φ2) is Boole’s inequality / union bound; the lower bound on P(Φ1 ∧ Φ2) follows via inclusion-exclusion.
Example: Border cases (Venn diagrams omitted)
Case                   P(Φ1 ∨ Φ2)       P(Φ1 ∧ Φ2)
Φ1, Φ2 disjoint        P(Φ1) + P(Φ2)    0
Φ1 implies Φ2          P(Φ2)            P(Φ1)
Φ1 ∨ Φ2 always true    1                P(Φ1) + P(Φ2) − 1
96 / 119
Computation of probability intervals
Theorem: Let Φ1 and Φ2 be propositional formulas with bounds [L1, U1] and [L2, U2], respectively. Then,
Φ1 ∨ Φ2: [L, U] = [max(L1, L2), min(U1 + U2, 1)]
Φ1 ∧ Φ2: [L, U] = [max(0, L1 + L2 − 1), min(U1, U2)]
¬Φ1: [L, U] = [1 − U1, 1 − L1]
Example (Does Mary incriminate someone who has an alibi?)
Incriminates
Witness  Suspect  P
Mary     Paul     0.9   X1
Mary     John     0.8   X2
Alibi
Suspect  Claim    P
Paul     Cinema   0.85  Y1
Paul     Friend   0.75  Y2
John     Bar      0.65  Y3
Φ = X1Y1 ∨ X1Y2 ∨ X2Y3
X1Y1: [0.75, 0.85]
X1Y2: [0.65, 0.75]
X2Y3: [0.45, 0.65]
X1Y1 ∨ X1Y2 ∨ X2Y3: [0.75, 1]
Bounds can be computed in linear time in size of Φ.
97 / 119
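The interval rules translate directly into code. In this sketch, the n-ary forms of the ∨ and ∧ rules are my generalization of the binary rules in the theorem (the Fréchet-style bound max(0, ΣL_i − (n − 1)) for conjunctions), and the function names are illustrative. A tuple with known probability p is represented by the degenerate interval (p, p).

```python
def or_bounds(*intervals):
    """Bounds for a disjunction with unknown correlation between operands."""
    lows, highs = zip(*intervals)
    return max(lows), min(sum(highs), 1.0)

def and_bounds(*intervals):
    """Bounds for a conjunction with unknown correlation between operands."""
    lows, highs = zip(*intervals)
    return max(0.0, sum(lows) - (len(intervals) - 1)), min(highs)

def not_bounds(interval):
    low, high = interval
    return 1.0 - high, 1.0 - low

# Tuple probabilities from the example, as degenerate intervals.
x1, x2 = (0.9, 0.9), (0.8, 0.8)
y1, y2, y3 = (0.85, 0.85), (0.75, 0.75), (0.65, 0.65)

x1y1 = and_bounds(x1, y1)
x1y2 = and_bounds(x1, y2)
x2y3 = and_bounds(x2, y3)
phi = or_bounds(x1y1, x1y2, x2y3)
```

Up to floating-point rounding, `x1y1`, `x1y2`, `x2y3`, and `phi` should match the intervals listed in the example. Each formula node is visited once, which is where the linear running time comes from.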
Probability intervals and intensional query evaluation
if Φ = Φ1 ∧ Φ2 and Φ1, Φ2 are syntactically independent then
    return [L, U] = [L1 · L2, U1 · U2]
end if
if Φ = Φ1 ∨ Φ2 and Φ1, Φ2 are syntactically independent then
    return [L, U] = [L1 ⊕ L2, U1 ⊕ U2]
end if
if Φ = Φ1 ∨ Φ2 and Φ1, Φ2 are disjoint then
    return [L, U] = [L1 + L2, min(U1 + U2, 1)]
end if
if Φ = ¬Φ1 then
    return [L, U] = [1 − U1, 1 − L1]
end if
Choose X ∈ Var(Φ)
Shannon expansion to Φ = ∨_i Φi ∧ (X = a_i)
return [L, U] = [Σ_i L_i P(X = a_i), min(Σ_i U_i P(X = a_i), 1)]
Here a ⊕ b = 1 − (1 − a)(1 − b), as in the independent-or rule.
Independence and disjointness allow for tighter bounds.
98 / 119
Probability intervals and intensional query evaluation (2)
Example
Incriminates
Witness  Suspect  P
Mary     Paul     0.9   X1
Mary     John     0.8   X2
Alibi
Suspect  Claim    P
Paul     Cinema   0.85  Y1
Paul     Friend   0.75  Y2
John     Bar      0.65  Y3
Φ = X1Y1 ∨ X1Y2 ∨ X2Y3
Partially expanded circuit with bounds:
∨i [0.88, 1]
    X1Y1 ∨ X1Y2 [0.75, 1]
    ∧i [0.52, 0.52]
        X2 [0.8, 0.8]
        Y3 [0.65, 0.65]
Bounds from the general interval rules, for comparison:
X1Y1: [0.75, 0.85]
X1Y2: [0.65, 0.75]
X2Y3: [0.45, 0.65]
Φ: [0.75, 1]
99 / 119
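The bounds in this example can be reproduced, and then tightened by one more expansion step. The following is a sketch with illustrative helper names; the second half, which expands the remaining non-atomic leaf by Shannon expansion on X1, is my continuation of the example rather than something shown on the slide.

```python
def oplus(a, b):
    # Independent-or on probabilities: a ⊕ b = 1 − (1 − a)(1 − b)
    return 1.0 - (1.0 - a) * (1.0 - b)

def and_i(i1, i2):
    # Independent-and on intervals.
    return i1[0] * i2[0], i1[1] * i2[1]

def or_i(i1, i2):
    # Independent-or on intervals.
    return oplus(i1[0], i2[0]), oplus(i1[1], i2[1])

x2, y3 = (0.8, 0.8), (0.65, 0.65)
unexpanded = (0.75, 1.0)            # bounds for the leaf X1Y1 ∨ X1Y2
right = and_i(x2, y3)               # exact, since X2 and Y3 are atomic
phi_bounds = or_i(unexpanded, right)

# One more step: Shannon expansion on X1 makes the left leaf exact,
# P(X1Y1 ∨ X1Y2) = P(Y1 ∨ Y2) P(X1) + P(FALSE) P(¬X1).
left_exact = 0.9 * oplus(0.85, 0.75)
phi_exact = or_i((left_exact, left_exact), right)
```

Before the extra step the interval for Φ has width about 0.12; afterwards both endpoints coincide, which illustrates the anytime behaviour: every expansion can only narrow the interval.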
Discussion
Incremental construction of RBC circuit
If all leaf nodes are atomic, computes exact probability
If some leaf nodes are not atomic, computes probability bounds
Anytime algorithm (makes incremental progress)
Can be stopped as soon as bounds become accurate enough
◮ Absolute ε-approximation: U − L ≤ 2ε → choose p̂ ∈ [U − ε, L + ε]
◮ Relative ε-approximation: (1 − ε)U ≤ (1 + ε)L → choose p̂ ∈ [(1 − ε)U, (1 + ε)L]
But: no a priori runtime bounds!
Definition: A value p̂ is an absolute ε-approximation of p = P(Φ) if p − ε ≤ p̂ ≤ p + ε; it is a relative ε-approximation of p if (1 − ε)p ≤ p̂ ≤ (1 + ε)p.
100 / 119
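The two stopping criteria can be written as small helper functions. This is a sketch with illustrative names; returning the midpoint of the admissible range is one arbitrary choice, since any value in the stated range qualifies.

```python
def absolute_estimate(low, high, eps):
    """Return an absolute eps-approximation of any p in [low, high],
    or None if the bounds are not yet tight enough."""
    if high - low <= 2 * eps:
        return (low + high) / 2          # lies in [high − eps, low + eps]
    return None

def relative_estimate(low, high, eps):
    """Return a relative eps-approximation of any p in [low, high],
    or None if the bounds are not yet tight enough."""
    lo_ok, hi_ok = (1 - eps) * high, (1 + eps) * low
    if lo_ok <= hi_ok:
        return (lo_ok + hi_ok) / 2       # lies in [(1 − eps)·high, (1 + eps)·low]
    return None
```

In an anytime loop, one would call the appropriate function after each expansion step and stop as soon as it returns a value instead of `None`.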